Working on biodiversity patterns: delving into ecoinformatics
Cleaning data
Systematics and Ecology are now data-intensive sciences. But “Big Data” does not necessarily mean better data (Boyd and Crawford 2012). Data quality must be ensured before further analyses. Some data are faulty, while others can be insufficient (i.e. under-sampled species) to produce meaningful estimates. These two issues have been tackled on terrestrial species before computing species richness across the globe to investigate the Latitudinal Diversity Gradient.
I produced a new table merging occurrences of the same species in the same geographic cell, keeping only the species name, the cell occupied and the number of merged occurrences in that cell. This gave a new dataset of spatially distinct occurrences on which to perform computations. Under-sampled species, i.e. species with fewer than 20 spatially distinct occurrences, were then filtered out (see below, Selecting well-sampled species).
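The merging and filtering described above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the degree-based grid, the function names and the input layout (species, latitude, longitude) are assumptions made for the example; only the threshold of 20 spatially distinct occurrences comes from the text.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 1.0  # hypothetical cell size in degrees, for illustration only
MIN_CELLS = 20       # threshold stated in the text

def cell_of(lat, lon, size=CELL_SIZE_DEG):
    """Return the (row, column) index of the grid cell containing a point."""
    return (math.floor(lat / size), math.floor(lon / size))

def merge_occurrences(occurrences, size=CELL_SIZE_DEG):
    """Merge occurrences of the same species falling in the same cell.

    occurrences: iterable of (species, lat, lon).
    Returns {(species, cell): number of merged occurrences}.
    """
    merged = Counter()
    for species, lat, lon in occurrences:
        merged[(species, cell_of(lat, lon, size))] += 1
    return merged

def well_sampled(merged, min_cells=MIN_CELLS):
    """Keep only species occupying at least min_cells distinct cells."""
    cells_per_species = Counter(species for species, _cell in merged)
    keep = {s for s, n in cells_per_species.items() if n >= min_cells}
    return {key: n for key, n in merged.items() if key[0] in keep}
```

Each key of the merged table is one spatially distinct occurrence, so the per-species count of keys is the quantity compared with the threshold.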
Data can be faulty in several ways: they can be biased, inaccurate or imprecise. The latter two issues were tackled here at the species level. I analyzed each species separately and identified odd occurrences, called outliers. Since the majority of GBIF data are correct, whether collected by scientists or citizens (Yesson et al. 2007, Kosmala et al. 2016), an algorithm should be able to identify occurrences inconsistent with the others. The selected algorithm used the orthodromic (great-circle) distance between occurrences and the climatic data associated with the occurrences to find potential outliers. Misidentified occurrences mixing two species with different habitats, as well as input or typing errors, would lead to obvious inconsistencies that should be easily detected. Conversely, erroneous data coherent with the rest of the species dataset would not be found.
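As an illustration of the spatial part of this idea, the sketch below computes the orthodromic distance with the haversine formula and flags occurrences that lie far from every other occurrence of the species. The nearest-neighbour rule and the `max_gap_km` parameter are assumptions chosen for the example; the actual algorithm, which also uses climatic data, is described in chapter 5 and is not reproduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def orthodromic_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding slightly above 1 for antipodal points
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def isolated_points(points, max_gap_km):
    """Return indices of points whose nearest neighbour is farther than
    max_gap_km (hypothetical rule; needs at least two points)."""
    flagged = []
    for i, (lat_i, lon_i) in enumerate(points):
        nearest = min(
            orthodromic_km(lat_i, lon_i, lat_j, lon_j)
            for j, (lat_j, lon_j) in enumerate(points)
            if j != i
        )
        if nearest > max_gap_km:
            flagged.append(i)
    return flagged
```

A rule of this kind only catches occurrences that are inconsistent with the rest of the species dataset, which is the limitation noted above.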
The Java code allowing spatial and environmental outlier detection was later cleaned and packaged as new software with a dedicated interface, designed to be easily usable by the scientific community. The resulting software is detailed in chapter 5.
The following methods were tried on 100 × 100 km cells.
A limitation of the GBIF-mediated data can be easily understood from this accumulation curve: sampling effort varies with both location and taxon. Given the intrinsic heterogeneity of the GBIF dataset, there is no way to reach the asymptote (even for a small taxon), or even a common “minimal” level of sampling, in every part of the world. Some areas, such as some regions of central Asia, are even devoid of species occurrences (Meyer et al. 2015).
But after all, my aim was not to get the most precise estimation of species richness but to obtain comparable species richness results between different areas. If I could weight species richness by the sampling effort, I would have comparable species richness values. Unfortunately, no attempt to standardize the sampling effort succeeded, as the GBIF-mediated data were far too heterogeneous.
Another plan I considered was to sub-sample the GBIF data to bring the oversampled areas down to the same level as the under-sampled ones. Here again, the heterogeneity of the GBIF-mediated data remained a problem, because sub-sampling would have eliminated the majority of the data.
Using our results to understand the LDG
Once I obtained a dataset consisting of all the potential cell/species pairs, I could easily determine the species richness of each cell, and even choose to compute the species richness of particular taxa. With these values, it was again relatively straightforward to compute a latitudinal richness value by averaging the species richness of the cells inside a series of latitudinal ranges (for 10 × 10 km cells this gives a species richness value per 100 km²).
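The two computations described above, richness per cell and mean richness per latitudinal range, can be sketched as follows. The data layout (a list of cell/species pairs and a mapping from cell to latitude) and the 10-degree band width are assumptions made for the example.

```python
import math
from collections import defaultdict

def cell_richness(pairs):
    """pairs: iterable of (cell_id, species).
    Returns {cell_id: number of distinct species}."""
    species_by_cell = defaultdict(set)
    for cell, species in pairs:
        species_by_cell[cell].add(species)
    return {cell: len(spp) for cell, spp in species_by_cell.items()}

def latitudinal_richness(richness, cell_latitude, band_width=10.0):
    """Average cell richness inside each latitudinal band.

    richness: {cell_id: species richness}
    cell_latitude: {cell_id: latitude of the cell centre, in degrees}
    Returns {lower bound of band: mean richness of its cells}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, value in richness.items():
        band = math.floor(cell_latitude[cell] / band_width) * band_width
        sums[band] += value
        counts[band] += 1
    return {band: sums[band] / counts[band] for band in sorted(sums)}
```

Restricting `pairs` to the species of one taxon before calling `cell_richness` gives the richness of that taxon alone.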
Green box 6: Using the known range of Lacerta bilineata downloaded from the IUCN website, we can see which of our potential cells are consistent with it. The results show that 1229 cells are located in the known range and 65 outside of it, which corresponds to about 5 % (65 of 1294 cells) of potential errors.
Comparison of the IUCN (Red List) polygon to the GBIF data and the computed niche. Outliers are represented as empty red squares, correct GBIF occurrences as empty black squares, potential niche cells as full blue squares and the 65 niche cells outside the IUCN polygon as full red squares. The light blue polygon was downloaded from the IUCN website (www.iucnredlist.org) and represents the potential distribution of Lacerta bilineata.
However, while visualizing the LDG is interesting in its own right, what I really wanted to do was to test hypotheses about the formation of the LDG. Considering the broad taxonomic coverage of my data, it would have been complicated to include historical hypotheses, such as those based on speciation and extinction rates, as well as other phylogenetic hypotheses. Environmental data, however, are easier to obtain and process. Environmental variables are known to play a role in the latitudinal diversity gradient (Willig et al. 2003) by influencing species richness. I therefore chose to test the influence of those variables on the species richness I computed earlier.
The species richness covariates
The statistical tools I had at my disposal allowed me to test for the correlation between species richness and a set of covariates (explanatory variables). As I had a species richness value for each of my cells, I needed to compute the covariate values for each of those cells. Only after doing this could I use statistical tools on the dataset.
The Ambient Energy (Currie 1991), Productivity (Hutchinson 1959), and Water availability (Hawkins et al. 2003a, Hawkins et al. 2003b) hypotheses suggest that species richness is influenced by environmental variables. The Ambient Energy hypothesis rests mainly on the assumption that sunshine and temperature are physical requirements of organisms (mainly for thermoregulatory purposes), while the Productivity hypothesis links the productivity (plant biomass) of an area to the number of individuals, and therefore the number of species, it can support. The Water availability hypothesis is based on the potential limiting effect of water availability on plant biomass. These hypotheses are all related to the energy-richness hypothesis (Currie et al. 2004). They suggest that the number of individuals in an area is influenced by environmental factors (productivity in particular). As species richness varies as a function of the number of individuals (Fisher et al. 1943), productivity should consequently influence species richness.
Those hypotheses were tested using Potential Evapotranspiration, Annual Mean Temperature, Actual Evapotranspiration and Annual Precipitation values taken from WorldClim (www.worldclim.org) and Mu et al. (2011). All these variables were available as raster files that can be read by the QGIS software. I used this software to convert those files into a tabular file format that could be imported into my database. I then computed, for each cell, the mean values of the environmental variables (the raster files use a finer grid than mine).
After this operation, the species richness and the environmental variable values were available for each cell.
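The averaging step can be illustrated with the sketch below, which assumes the raster has already been exported as a table of (x, y, value) rows in the same coordinate units as the analysis grid. The function name, the `nodata` convention and the square grid are assumptions for the example, not a description of the database procedure actually used.

```python
import math
from collections import defaultdict

def mean_per_cell(pixels, cell_size, nodata=None):
    """Average fine-grid raster values inside each coarser analysis cell.

    pixels: iterable of (x, y, value) taken from the exported raster table.
    cell_size: edge length of the analysis cells, in the units of x and y.
    nodata: raster value marking missing data (skipped when averaging).
    Returns {(column, row): mean value}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, value in pixels:
        if value is None or value == nodata:
            continue  # ignore missing pixels rather than averaging them in
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Cells containing only missing pixels are simply absent from the result, so they can be excluded from the later regressions.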
With the data at my disposal I could also test additional hypotheses formulated on the LDG: the geometrical hypotheses and Rapoport’s effect. The geometrical hypothesis was first formulated by Colwell and Hurtt (1994), who suggested that a latitudinal gradient could arise from the random placement of species ranges across the globe without any influence of environmental variables. This hypothesis, also called the mid-domain effect, predicts a peak or plateau in species richness at the center of a bounded domain when a set of different species ranges is randomly placed within that domain. This hypothesis has, however, been contested by Currie and Kerr (2007, 2008). Later, Gross and Snyder-Beattie (2016) added the concept of environmental limits to this hypothesis to propose a new model. This new model adds a level of complexity to Colwell and Hurtt’s model, and has never been tested on empirical data. Those geometrical hypotheses can also be called null or abiotic models, as they imply the LDG could arise as a mathematical artifact, independently of environmental or historical variables.
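The mid-domain effect can be reproduced with a very small simulation. The sketch below is a simplified one-dimensional null model written for illustration (uniform range sizes, uniform placement inside the domain); it is an assumption of this example and not one of the specific models of Colwell and Hurtt (1994) or Gross and Snyder-Beattie (2016).

```python
import random

def mid_domain_richness(n_species, n_bins, seed=0):
    """Randomly place species ranges inside a bounded 1-D domain.

    The domain is split into n_bins latitudinal bins. Each species gets a
    range size drawn uniformly from 1..n_bins and a position drawn uniformly
    among the positions that keep the whole range inside the domain.
    Returns the number of overlapping ranges (richness) in each bin.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    richness = [0] * n_bins
    for _ in range(n_species):
        size = rng.randint(1, n_bins)
        start = rng.randint(0, n_bins - size)
        for b in range(start, start + size):
            richness[b] += 1
    return richness
```

With enough species, richness is highest around the central bins and lowest at the two domain boundaries, although no environmental variable enters the model.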
Statistical analysis of species richness and its covariates
Many studies have tested the effect of environmental variables on species richness (e.g. Ferrer-Castán et al. 2016, Rodrigues et al. 2017). Many methods are proposed in the literature to study this kind of spatial relationship, and in most papers they can be summarized in three steps:
• The first step is to use a non-spatial analysis. This analysis builds a model assuming all points (in our case, all cells) are independent from one another. It also assumes that the relationship between species richness and its covariates is stationary across space (the model does not change depending on the location). In my case I used R and an Ordinary Least Squares (OLS) analysis. I followed a manual iterative stepwise method, first selecting the best null hypothesis for each class and then adding other explanatory variables one at a time. At each step the added variable was evaluated using the coefficient of determination (r²), and the variable was not included when it did not improve the adjusted r² of the model by at least 1 %.
• The second step is to test the model residuals for spatial autocorrelation using Moran’s I, a measure of spatial autocorrelation. If the test finds that the residuals are spatially correlated, it means that the data are affected by spatial autocorrelation (Tobler, 1970).
• The third step is usually to use a spatial lag model or a spatial error model (Anselin et al. 1996) to test the model produced with the OLS analysis. This produces a regression model that takes spatial autocorrelation into account and ensures that an explanatory variable is not included only because of it.
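The first two steps can be sketched in a few lines. The example below is a simplified illustration in Python rather than the R analysis used in the thesis: it fits a single-covariate OLS model, computes the adjusted r², applies the 1 % rule (interpreted here, as an assumption, as an absolute gain of 0.01 in adjusted r²) and computes Moran’s I from a user-supplied spatial weight matrix. All function names are hypothetical.

```python
def ols_simple(x, y):
    """Least-squares fit of y = intercept + slope * x.
    Returns (intercept, slope, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, residuals

def adjusted_r2(y, residuals, n_predictors):
    """Adjusted coefficient of determination of a fitted model."""
    n = len(y)
    my = sum(y) / n
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum(r * r for r in residuals)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def keep_variable(adj_r2_before, adj_r2_after, min_gain=0.01):
    """Stepwise rule: keep a variable only if adjusted r2 gains >= min_gain
    (absolute gain; this reading of the 1 % rule is an assumption)."""
    return adj_r2_after - adj_r2_before >= min_gain

def morans_i(values, weights):
    """Moran's I of values given a square spatial weight matrix
    (weights[i][j] > 0 when cells i and j are neighbours)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den
```

Applied to the OLS residuals, a Moran’s I well above zero indicates that neighbouring cells have similar residuals, which is the situation the spatial lag and spatial error models of the third step are meant to handle. Those models are not sketched here.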
Those three steps are often the ones used in papers working on the relation between species richness and environmental variables (e.g. Hawkins et al. 2003a, Mora and Robertson 2005). However, they assume that the model is spatially stationary. Spatial stationarity is rarely tested in such studies (Foody 2004, Mellin et al. 2014), mostly because the tools for testing it are recent and need a lot of computing power. As I had both the data and the computing power at my disposal, I decided to test the spatial stationarity of my final model with Geographically Weighted Regression (GWR). GWR is a local regression method that can be used for diagnosing spatial heterogeneity in the relationship between dependent and explanatory variables over space (Brunsdon et al. 1996). GWR is performed within local windows centered on each observation of the dataset. Each observation within the local window is weighted based on its proximity to the center of that window, and a regression model is then fitted to this subset of observations. This analysis allowed me to test whether the relation between species richness and the explanatory variables is constant across space.
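The principle of GWR can be illustrated with a single explanatory variable. In the sketch below, a weighted least-squares line is fitted around each observation, with weights decreasing with distance from that observation. The Gaussian kernel, the fixed bandwidth and the planar coordinates are assumptions made for the example; dedicated GWR software offers other kernels and bandwidth selection methods.

```python
import math

def gwr_local_fits(coords, x, y, bandwidth):
    """Fit one weighted regression y = a + b * x per observation.

    coords: list of planar (px, py) positions, one per observation.
    bandwidth: scale of the Gaussian distance-decay kernel (assumed kernel).
    Returns a list of (intercept, slope), one per observation.
    Note: the local variance of x must be non-zero within each window.
    """
    fits = []
    for cx, cy in coords:
        # Weight of every observation relative to the current window centre
        w = [math.exp(-0.5 * (math.hypot(px - cx, py - cy) / bandwidth) ** 2)
             for px, py in coords]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        fits.append((my - slope * mx, slope))
    return fits
```

If the relationship is stationary, the local slopes are nearly identical everywhere; slopes that change markedly from one region to another indicate spatial non-stationarity.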
A Shift in the Recording of Primary Biodiversity Data
In the current context of biodiversity crisis, numerous pleas have urged the scientific community to collect as much biodiversity data as possible, out of fear that biodiversity might disappear before we even knew of its existence (May 2004; Butchart et al. 2010). These calls have been heard and, indisputably, biodiversity data accumulate faster than ever (Fig. 2 and Supporting Information), a trend that most classes of organisms exhibit, even though it is weaker for a few of them (Troudet et al. 2017). The >57 million occurrences submitted to the GBIF in 2014, more than five times the amount of data submitted ten years earlier (i.e. 11 million occurrences in 2004), illustrate this trend (Supporting Information). With this spectacular acceleration, the amount of data available to scientists is so huge that the study of biodiversity has entered the “Big Data” era (Hampton et al. 2013; Joppa et al. 2016; Kelling et al. 2009). Multiple benefits have followed, such as increased power in statistical analyses because of larger datasets, or the possibility to tackle issues at large taxonomic, temporal or spatial scales (Rosenheim and Gratton 2017). However, the large volume of data is also a curation challenge that must be handled to avoid passing on a dubious source of knowledge to future generations because of a fall in data quality (Howe et al. 2008), a criticism regularly raised about GBIF-mediated data (e.g. Yesson et al. 2007).
Primary Biodiversity Data for systematics and evolutionary studies in the 21st Century: Are We There Yet?
The importance of collecting specimens in taxonomy, evolution and ecology cannot be overemphasized (Huber 1998; Schilthuizen et al. 2015) and two main points, previously discussed in the literature, must be reiterated. First, specimens are needed for species description and for the study of biodiversity in general (Dubois 2017 contra Pape et al. 2017). A crucial argument is the utility of specimens for checking species identification. Goodwin et al. (2015) assessed that up to half of tropical plant identifications in museum collections were false. Correcting identification errors can be done after examining specimens, but is impossible for mere observations. If Goodwin et al.’s estimation is correct and generalizable to all primary data, the need for specimens, or at least ancillary data to observation occurrences, is critical. Second, the revived focus on morphology advocated lately in systematics requires specimens (Jenner 2004, Wiens 2004, Smith and Turner 2005, Yassin 2013, Pyron 2015, Wanninger 2015, Wipfler et al. 2016). Authors recommending this revival underlined that comparative morphology not only brings phylogenetic characters but also allows including fossil taxa in phylogenetic analyses (e.g. Pyron 2011; Wood et al. 2013), enabling us to better estimate the structure and branch length of the reconstructed trees (Wiens et al. 2010; Pyron 2015). Given that phylogenetic thinking has become of paramount importance in biology, improvements in phylogenetic estimation offer large potentialities in evolutionary studies and in the study of biodiversity in general (Losos et al. 2013; Buerki et al. 2015).
However, a specimen is not mandatory for a primary biodiversity record to be useful. Instead of specimens, and as a complement to mere observations, digital data or molecular data can be collected. New technologies offer a wide range of tools and methods to collect concrete specimen evidence in nature, and it is now relatively easy and affordable to obtain DNA sequences, images and sound recordings. Using molecular and digital data should therefore now be a common practice in the study of biodiversity, as the exponential growth of molecular data and phylogenies, and the development of morphological databases and ontologies, would suggest (Lathe et al. 2008; Parr et al. 2012; Deans et al. 2012, 2015). We show here that digital and DNA data are increasingly used but remain patently underemployed (Fig. 4). Only 2.5 % of all the GBIF-mediated occurrences for the 24 focal classes were linked to digital data and 1.5 % to DNA sequences. Worse, proportionally, they are becoming more and more negligible, drowned in the large quantity of observations without supporting data.
The situation may have been improving lately, but the trend observed after 2008 needs to be confirmed in future years (Fig. 4). Moreover, and quite inconsistently, digital and DNA data were less used for OB than for SB occurrences (Fig. 5). Yet they would be more useful for OB biodiversity data, given that they would constitute the only way to independently check or update observation occurrences, whereas for SB occurrences one can refer to specimens, as long as those are kept and the traceability chain is not broken (Page 2015; Nualart et al. 2017). The high proportion of sequences associated with primary biodiversity data of unknown origin could suggest that when a sample is taken, occurrences are often classified in the catchall class ‘unknown origin’.
Table of contents:
What is Biodiversity? Why it matters? How to study it?
Biodiversity: a multi-faceted concept
Species richness, the golden standard of biodiversity measures
Why studying biodiversity?
Systematics and ecology: two complementary approaches to study biodiversity
Species occurrences, biologists’ raw material
Primary biodiversity data
Datasets: From cabinet of curiosities to databases
“Big data”: a change in scale and practices
Ecoinformatics, the “big data” of biodiversity
Global Biodiversity Information Facility
Data quality and bias
A global pattern of biodiversity: the latitudinal diversity gradient
Species richness varies with latitude
The multiple hypotheses behind the LDG
Questions addressed in this thesis dissertation
Can biological diversity be investigated in its entirety?
How is the practice of biodiversity data gathering evolving?
Latitudinal Diversity Gradient at large taxonomic scale: which factors shape it?
Chapter 1: Material and Methods
Using the GBIF mediated data
Datasets from the GBIF portal
The Darwin Core format
Big data in practice
Reading and filtering occurrences
Indexing the table to get a functioning database
Characterizing a biodiversity dataset: biases and trends
Species names from the GBIF Backbone Taxonomy and multimedia files
Public interest and taxonomic research quantity
Putting the GBIF database into numbers
Working on biodiversity patterns: delving into ecoinformatics
Estimating species richness
Using our results to understand the LDG
The species richness covariates
Statistical analysis of species richness and its covariates
Chapter 2: The increasing disconnection of primary biodiversity data from specimens: How does it happen and how to handle it? (Troudet et al. Systematic Biology, submitted as a Point of View)
Material and Methods
Evolution of Data Completeness
Evolution of Taxonomic and Spatial Precision
Results and Discussion
A Shift in the Recording of Primary Biodiversity Data
Primary Biodiversity Data for systematics and evolutionary studies in the 21st Century: Are We There Yet?
Chapter 3: Taxonomic bias in biodiversity data and societal preferences (Troudet et al. Scientific reports, published 22-08-2017)
Chapter 4: Latitudinal Diversity Gradient: Geometric hypotheses revisited using massive biodiversity occurrences in plants and animals of the New World (article in preparation)
Material and Methods
Species richness estimates
Latitudinal diversity gradient
Chapter 5: DwCSP a fast biodiversity occurrence curator (article in preparation)
Searching for outliers
Big-data and biodiversity
The big-data paradigm
The genesis of Biodiversity big-data
A new way of doing science
Primary Biodiversity data, a proxy to assess the state of the study of biodiversity
Biodiversity data are disconnected from specimens
Taxonomic bias while aiming at investigating the whole biodiversity
Using biodiversity data to decipher the origin of global biodiversity patterns
Estimating global species richness from a large and geographically widespread
Further into the Latitudinal Diversity Gradient
The GBIF-mediated data: a fascinating tool for biodiversity analyses
Appendix 1: List of used Java Libraries and Dependencies
Appendix 2: List of indexes of the OCCURRENCES table
Appendix 3: List of additional database tables
Appendix 4: VBA script for Web Search Results
Appendix 5: R script for spatial outlier detection
Appendix 6: Worldwide species richness maps and plots
List of figures:
List of tables: