Taxonomic bias in biodiversity data and societal preferences (Troudet et al. Scientific reports, published 22-08-2017)

Big data in practice

This section focuses on the occurrence.txt file extracted from the DwC-A of the complete GBIF-mediated dataset. It is the core tabular file containing all the occurrences downloaded from the GBIF. Once extracted from the archive, the file exceeds 500 GB of disk space. For comparison, this is about half the capacity of a decent external drive, or more than 100 DVDs.
As explained in the introduction, the words “big data” qualify datasets that cannot be manipulated with common tools. Many researchers now work on datasets including thousands of species and occurrences (e.g. Andam et al. 2016, Boucher-Lalonde and Currie 2016, Nicholson et al. 2016), and progress in computing facilitates their analysis. One of the most popular software environments, R (R Development Core Team 2008), can run complex operations on thousands to millions of records. Unfortunately, the data used here were still too big to be handled with R alone.

Workflow architecture

In the current era of “big data”, multiple solutions have been created to manipulate very large datasets, including Hadoop (EMC Education Services 2015) and NoSQL databases (Andlinger 2013). Considering the volume of the GBIF-mediated data, the analyses planned, and my computing skills, I chose a workflow approach. I created a Java application that reads the occurrences, processes them and then inserts them into a database. This database can be queried and is updated after each operation. From this database, a CSV file can be exported. It contains the data later used by R scripts to compute statistical analyses (Fig 3).
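The read, insert and export steps of such a workflow can be illustrated with a minimal sketch. It is written in Python with an in-memory SQLite database purely for illustration; it is not the Java application described above. The table layout, the lowercase column names (gbifid, species, decimallatitude, decimallongitude) and the batch size are assumptions. The point it shows is that rows are streamed and inserted in batches, so the whole file never has to fit in memory.

```python
import csv
import io
import sqlite3


def to_float(text):
    """Convert a possibly empty text field to a float, or None if empty."""
    return float(text) if text else None


def load_occurrences(lines, conn, batch_size=1000):
    """Stream tab-separated occurrence rows into SQLite in fixed-size batches."""
    # Column names below are assumed for this sketch.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS occurrences ("
        "gbifid INTEGER PRIMARY KEY, species TEXT, lat REAL, lon REAL)"
    )
    insert = "INSERT INTO occurrences VALUES (?, ?, ?, ?)"
    batch, total = [], 0
    for row in reader:
        batch.append((
            int(row["gbifid"]),
            row["species"] or None,
            to_float(row["decimallatitude"]),
            to_float(row["decimallongitude"]),
        ))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            total += len(batch)
            batch.clear()
    if batch:  # flush the last, partially filled batch
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total


def export_csv(conn, out):
    """Write the occurrences table to a CSV stream for later analysis."""
    writer = csv.writer(out)
    writer.writerow(["gbifid", "species", "lat", "lon"])
    writer.writerows(conn.execute(
        "SELECT gbifid, species, lat, lon FROM occurrences ORDER BY gbifid"
    ))


# Tiny stand-in for occurrence.txt (hypothetical rows).
sample = io.StringIO(
    "gbifid\tspecies\tdecimallatitude\tdecimallongitude\n"
    "1\tParus major\t48.85\t2.35\n"
    "2\tParus major\t\t\n"
    "3\tApis mellifera\t43.6\t1.44\n"
)
conn = sqlite3.connect(":memory:")
n_loaded = load_occurrences(sample, conn, batch_size=2)
exported = io.StringIO()
export_csv(conn, exported)
```

With a real file, the StringIO object would be replaced by an open file handle; the reader still consumes one line at a time.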
To preserve the integrity of the data I downloaded, I chose to “tag” the occurrences after each operation rather than delete or update them. The downloaded datasets were thus enriched: new columns with additional information were appended to the database.
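The tagging idea can be sketched as follows, again in Python with SQLite as a stand-in for the actual database. The column name has_valid_coords and the coordinate check are hypothetical examples of a tag; they are not taken from the thesis. Rows failing the check stay in the table and are simply marked, so later queries can include or exclude them at will.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrences (gbifid INTEGER PRIMARY KEY, lat REAL, lon REAL)"
)
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?, ?)",
    [(1, 48.85, 2.35), (2, None, None), (3, 95.0, 10.0)],  # hypothetical rows
)

# Append a tag column instead of deleting or modifying the original rows.
conn.execute("ALTER TABLE occurrences ADD COLUMN has_valid_coords INTEGER")
conn.execute(
    "UPDATE occurrences SET has_valid_coords = "
    "CASE WHEN lat BETWEEN -90 AND 90 AND lon BETWEEN -180 AND 180 "
    "THEN 1 ELSE 0 END"
)

# Every downloaded row is still present; the tag only drives later filters.
total = conn.execute("SELECT COUNT(*) FROM occurrences").fetchone()[0]
usable = [
    row[0]
    for row in conn.execute(
        "SELECT gbifid FROM occurrences WHERE has_valid_coords = 1"
    )
]
```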

Characterizing a biodiversity dataset: biases and trends

Once the GBIF-mediated data were filtered, indexed and included in the main database, it was possible to study the dataset itself. To facilitate this investigation, I created many additional tables (see Appendix 3), allowing for faster searches and simpler queries. Some additional information was also collected to further study the effect of some variables on biodiversity data collection. Based on those new tables and additional data, I could use scripts and statistical tools to get a better understanding of the biases and trends affecting the GBIF dataset. Those biases are very important: if they affect the GBIF data, which is the biggest primary biodiversity data repository, they are likely to affect all biodiversity domains (Powney and Isaac 2015). The same reasoning applies to the trends in GBIF-mediated data. The following section describes how I quantified those biases and trends in the GBIF data, and how I tried to explain them with external data and statistical tools. The work detailed in chapters 2 and 3 relies on this procedure.

Species names from the GBIF Backbone Taxonomy and multimedia files

Two metrics were not available in the main occurrence.txt file extracted from the DwC-A.
The first metric was the number of multimedia files attached to an occurrence. A SpOcc consists of one observation or collection of a specimen at a specific time and place. However, such a specimen could be photographed, or its vocalization recorded, so a single occurrence can be linked to multiple media files. This 1-N relationship cannot be stored in the occurrence.txt file because an occurrence corresponds to only one row. Instead, a multimedia.txt file containing the list of multimedia files is provided. This file has one row for each multimedia file linked to a SpOcc in the GBIF dataset (through the gbifid column), and a SpOcc can be linked to several rows (i.e. multimedia files). The multimedia.txt file was imported into the database using a simple “copy” query and then queried to get the needed statistics (see chapter 1).
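A minimal sketch of how such a 1-N relationship can be queried, assuming two tables joined on gbifid. It uses Python with SQLite rather than the database and “copy” import mentioned above, and the identifier column and file names are hypothetical. The LEFT JOIN keeps occurrences that have no media file, which then get a count of zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences (gbifid INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE multimedia (gbifid INTEGER, identifier TEXT)")
conn.executemany("INSERT INTO occurrences VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO multimedia VALUES (?, ?)",
    [(1, "photo_a.jpg"), (1, "call_a.wav"), (3, "photo_b.jpg")],
)

# One output row per occurrence, with the number of linked media files.
# COUNT(m.gbifid) ignores the NULLs produced by the LEFT JOIN.
media_counts = conn.execute(
    "SELECT o.gbifid, COUNT(m.gbifid) AS n_media "
    "FROM occurrences AS o "
    "LEFT JOIN multimedia AS m ON m.gbifid = o.gbifid "
    "GROUP BY o.gbifid ORDER BY o.gbifid"
).fetchall()
```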
The second metric was the number of described species per taxonomic class, a metric used while investigating the taxonomic bias in biodiversity data (Chapter 4). At first, I imported these figures from the Catalogue of Life (www.catalogueoflife.org), but many species referenced in the GBIF were not in this catalogue. Indeed, the GBIF created its own classification system, called the GBIF Backbone Taxonomy, using diverse taxonomic databases, including, but not restricted to, the Catalogue of Life (Text box 1).

Multiple correspondence analysis

Multiple correspondence analysis (MCA) was used to find relations between categorical variables inside the GBIF dataset. For example, I tested for relations between the age of an occurrence (number of years since the observation event), its data origin (categories: specimen, observation and unknown) and the data completeness (categories: no problem, missing temporal information, missing spatial information and missing both). The class of the occurrence can also be projected on the resulting plot. These analyses were done using the FactoMineR package for R (Husson et al. 2016). The analysis could not be run on all the GBIF occurrences because R could not load them all in memory. I therefore ran the analyses on multiple random samples of 5 million occurrences each, and also tried to ventilate (i.e. randomly reassign to the other categories) the categories representing less than 0.5 % of the dataset, as they could have altered the results.
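The ventilation of rare categories can be sketched as below. This is an illustrative Python version, not the FactoMineR routine used for the analyses: the function name ventilate, the fixed seed, and the choice to reassign rare values in proportion to the frequencies of the remaining categories are all assumptions of this sketch.

```python
import random
from collections import Counter


def ventilate(values, threshold=0.005, seed=0):
    """Randomly reassign values of rare categories to the remaining ones.

    A category is rare when its share of `values` is below `threshold`.
    Replacements are drawn in proportion to the remaining categories'
    frequencies (an assumption of this sketch).
    """
    counts = Counter(values)
    n = len(values)
    kept = {cat: k for cat, k in counts.items() if k / n >= threshold}
    if not kept:
        raise ValueError("every category is below the threshold")
    rng = random.Random(seed)
    cats = list(kept)
    weights = [kept[cat] for cat in cats]
    return [
        v if v in kept else rng.choices(cats, weights=weights)[0]
        for v in values
    ]


# Hypothetical data-origin column: "unknown" is 0.2 % of the sample.
origin = ["observation"] * 600 + ["specimen"] * 398 + ["unknown"] * 2
ventilated = ventilate(origin)
after = Counter(ventilated)
```

Only the two rare values change; all other values keep their original category and position.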

Working on biodiversity patterns: delving into ecoinformatics

Cleaning data

Systematics and Ecology are now data-intensive sciences. But “Big Data” does not necessarily mean better data (Boyd and Crawford 2012). Data quality must be ensured before further analyses. Some data are faulty, while others can be insufficient (i.e. under-sampled species) to produce meaningful estimates. These two issues were addressed for terrestrial species before computing species richness across the globe to investigate the Latitudinal Diversity Gradient.
I produced a new table merging occurrences of the same species in the same geographic cell. I kept only the species name, the cell occupied and the number of merged occurrences in the cell. This gave a new dataset of spatially distinct occurrences on which to perform computations. Then, under-sampled species, i.e. species with fewer than 20 spatially distinct occurrences, were filtered out (see Selecting well-sampled species below).
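The merging and filtering steps can be illustrated with a short Python sketch. The cell definition is an assumption: the text does not state the grid used, so a regular 1-degree latitude/longitude grid is taken here as a placeholder. The function names are hypothetical; only the threshold of 20 spatially distinct occurrences comes from the text.

```python
import math
from collections import Counter


def cell_of(lat, lon, size_deg=1.0):
    """Index of the grid cell containing a point (assumed regular grid)."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def spatially_distinct(occurrences, size_deg=1.0):
    """Map (species, cell) to the number of occurrences merged into it."""
    return Counter(
        (species, cell_of(lat, lon, size_deg))
        for species, lat, lon in occurrences
    )


def well_sampled(merged, min_cells=20):
    """Species occupying at least `min_cells` distinct cells."""
    cells_per_species = Counter(species for species, _cell in merged)
    return {sp for sp, n in cells_per_species.items() if n >= min_cells}


# Hypothetical data: species A occupies 20 cells, species B only one.
records = [("A", i + 0.5, 0.5) for i in range(20)]
records.append(("A", 0.7, 0.2))          # falls in the same cell as (0.5, 0.5)
records += [("B", 10.1, 10.1)] * 25      # 25 occurrences, a single cell
merged = spatially_distinct(records)
selected = well_sampled(merged)
```

Species B has more raw occurrences than species A but is dropped, because the filter counts occupied cells rather than records.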
Data can be faulty in several ways: they can be biased, inaccurate or imprecise. The latter two issues were tackled here at the species level. I analyzed each species separately and identified odd occurrences, called outliers. Since the majority of GBIF data are correct, whether collected by scientists or citizens (Yesson et al. 2007, Kosmala et al. 2016), an algorithm should be able to identify occurrences inconsistent with the others. The selected algorithm used the orthodromic distance between occurrences and the climatic data associated with the occurrences to find potential outliers. Misidentified occurrences mixing two species with different habitats, as well as input or typing errors, would lead to obvious inconsistencies that should be easily detected. Conversely, erroneous data coherent with the rest of the species dataset would not be found.
The Java code for spatial and environmental outlier detection was later cleaned up and packaged into new software with a dedicated interface, designed to be easily usable by the scientific community. The resulting software is detailed in chapter 5.
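The orthodromic (great-circle) distance can be computed with the haversine formula, as in the Python sketch below. The isolation rule that follows it is only a simplified stand-in and not the algorithm used in the thesis: it ignores climatic data, and the function name isolated_points, the spherical Earth radius and the 1000 km threshold are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def orthodromic_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def isolated_points(points, max_nn_km=1000.0):
    """Indices of points whose nearest neighbour is farther than max_nn_km.

    Pairwise comparison, so the cost grows quadratically with the
    number of points of the species.
    """
    if len(points) < 2:
        return []
    flagged = []
    for i, (lat, lon) in enumerate(points):
        nearest = min(
            orthodromic_km(lat, lon, other_lat, other_lon)
            for j, (other_lat, other_lon) in enumerate(points)
            if j != i
        )
        if nearest > max_nn_km:
            flagged.append(i)
    return flagged


# Hypothetical species: three points in France, one near Sydney.
points = [(48.85, 2.35), (45.76, 4.84), (43.30, 5.37), (-33.87, 151.21)]
suspects = isolated_points(points)
```

A rule of this kind only catches records that are spatially inconsistent with the others; as noted above, an erroneous record lying within the species' usual range would pass unnoticed.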

Table of contents:

Thanks
Index
Introduction
What is Biodiversity? Why it matters? How to study it?
Biodiversity: a multi-faceted concept
Species richness, the golden standard of biodiversity measures
Why studying biodiversity?
Systematics and ecology: two complementary approaches to study biodiversity
Species occurrences, biologists’ raw material
Primary biodiversity data
Datasets: From cabinet of curiosities to databases
“Big data”: a change in scale and practices
Ecoinformatics, the “big data” of biodiversity
Global Biodiversity Information Facility
Data quality and bias
A global pattern of biodiversity: the latitudinal diversity gradient
Species richness varies with latitude
The multiple hypotheses behind the LDG
Questions addressed in this thesis dissertation
Can biological diversity be investigated in its entirety?
How is the practice of biodiversity data gathering evolving?
Latitudinal Diversity Gradient at large taxonomic scale: which factors shape it?
Chapter 1: Material and Methods
Using the GBIF mediated data
Datasets from the GBIF portal
The Darwin Core format
Big data in practice
Workflow architecture
Reading and filtering occurrences
Indexing the table to get a functioning database
Characterizing a biodiversity dataset: biases and trends
Species names from the GBIF Backbone Taxonomy and multimedia files
Public interest and taxonomic research quantity
Putting the GBIF database into numbers
Statistical tools
Working on biodiversity patterns: delving into ecoinformatics
Cleaning data
Estimating species richness
Using our results to understand the LDG
The species richness covariates
Statistical analysis of species richness and its covariates
Chapter 2: The increasing disconnection of primary biodiversity data from specimens: How does it happen and how to handle it? (Troudet et al. Systematic Biology, submitted as a Point of View)
Abstract
Keywords
Introduction
Material and Methods
Data Set
Data Quantity
Data Origin
Supporting Files
Evolution of Data Completeness
Evolution of Taxonomic and Spatial Precision
Results and Discussion
A Shift in the Recording of Primary Biodiversity Data
Primary Biodiversity Data for systematics and evolutionary studies in the 21st Century: Are We There Yet?
Acknowledgments
Chapter 3: Taxonomic bias in biodiversity data and societal preferences (Troudet et al. Scientific reports, published 22-08-2017)
Abstract
Introduction
Results
Discussion
Methods
Acknowledgements
Author Contributions
Chapter 4: Latitudinal Diversity Gradient: Geometric hypotheses revisited using massive biodiversity occurrences in plants and animals of the New World (article in preparation)
Introduction
Material and Methods
Species richness estimates
Explanatory Variables
Statistical analyses
Results
Basic statistics
Latitudinal diversity gradient
Environmental hypotheses
Discussion
Acknowledgement
Chapter 5: DwCSP a fast biodiversity occurrence curator (article in preparation)
Introduction
Software
Data Enrichment
Searching for outliers
Results
Discussion
Discussion
Big-data and biodiversity
The big-data paradigm
The genesis of Biodiversity big-data
A new way of doing science
Primary Biodiversity data, a proxy to assess the state of the study of biodiversity
Biodiversity data are disconnected from specimens
Taxonomic bias while aiming at investigating the whole biodiversity
Using biodiversity data to decipher the origin of global biodiversity patterns
Estimating global species richness from a large and geographically widespread taxonomic sample
Further into the Latitudinal Diversity Gradient
The GBIF-mediated data: a fascinating tool for biodiversity analyses
References
Appendixes
Appendix 1: List of used Java Libraries and Dependencies
Appendix 2: List of indexes of the OCCURRENCES table
Appendix 3: List of additional database tables
Appendix 4: VBA script for Web Search Results
Appendix 5: R script for spatial outlier detection
Appendix 6: Worldwide species richness maps and plots