In 1964, Box & Cox proposed a parametric power transformation technique in order to reduce anomalies such as non-additivity, non-normality and heteroscedasticity. The one-parameter Box–Cox transformations (defined for $y_i > 0$) are:
$$y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \ln y_i, & \lambda = 0, \end{cases}$$
where λ is the transformation parameter.
To determine the optimal value of the transformation parameter λ, we use the Box–Cox normality plot. It is formed by:
Vertical axis: Correlation coefficient from the normal probability plot after applying Box-Cox transformation.
Horizontal axis: Value for λ.
The value of λ corresponding to the maximum correlation on the plot is then the optimal choice for λ.
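The search over λ can be sketched with SciPy (synthetic, right-skewed data as an illustrative assumption; `boxcox_normplot` returns the probability-plot correlation coefficient over a grid of λ values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.8, size=200)  # positive, right-skewed data

# Probability-plot correlation coefficient (vertical axis) over a grid of
# candidate lambda values (horizontal axis) between -2 and 2.
lmbdas, ppcc = stats.boxcox_normplot(y, -2, 2, N=81)

# The lambda with the maximum correlation is the optimal choice.
lmbda_opt = lmbdas[np.argmax(ppcc)]
y_transformed = stats.boxcox(y, lmbda=lmbda_opt)
```

For log-normally distributed data the selected λ lies close to 0, i.e. close to the log transformation.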
The data set was transformed by log2, and then normalized as follows:
Subtract the mean of all log transformed values.
Divide the difference by the standard deviation of all log transformed values for the given sample.
The objective of this procedure is that every sample ends up with mean 0 and standard deviation 1.
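A minimal sketch of this normalization with NumPy (synthetic data as an illustrative assumption; rows are samples, columns are variables):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=2.0, sigma=1.0, size=(5, 100))  # 5 samples x 100 variables

log_data = np.log2(data)  # log2 transformation

# For each sample (row): subtract the mean of its log values and divide
# by the standard deviation of its log values.
mu = log_data.mean(axis=1, keepdims=True)
sigma = log_data.std(axis=1, keepdims=True)
normalized = (log_data - mu) / sigma
```

Each row of `normalized` then has mean 0 and standard deviation 1.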
Wisconsin double standardization
This transformation has two steps:
Standardize the matrix of abundance by species maximum standardization.
Standardize the transformed matrix by sample total standardization.
We often multiply the standardized matrix by 100 for convenience.
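The two steps can be sketched with NumPy (a small hypothetical abundance matrix; rows are samples, columns are species):

```python
import numpy as np

abundance = np.array([[10., 0., 4.],
                      [ 2., 6., 0.],
                      [ 0., 3., 8.]])

# Step 1: species maximum standardization (divide each column by its maximum).
by_species_max = abundance / abundance.max(axis=0, keepdims=True)

# Step 2: sample total standardization (divide each row by its total).
wisconsin = by_species_max / by_species_max.sum(axis=1, keepdims=True)

wisconsin_pct = 100 * wisconsin  # optional scaling by 100 for convenience
```

This is the same double standardization performed by the `wisconsin()` function of the R package vegan.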
Principal component analysis (PCA)
We use PCA to understand the largest sources of variation. The objective is to reduce the dimensionality of the data while retaining as much information as possible. The principal components are obtained by maximizing the variance of linear combinations of the data, which is achieved by computing the eigenvectors/eigenvalues of the variance-covariance matrix. The first PC is the linear combination of the original variables that explains the greatest amount of variation. The number of dimensions retained should explain most of the variance in the data (Cao et al., 2014). The objective function to solve is
$$\arg\max_{\|a_h\| = 1} \operatorname{var}(X a_h),$$
where $X$ is an $n \times p$ data matrix and $a_h$ is the $p$-dimensional loading vector associated with principal component $h$, $h = 1, \dots, r$, under the constraints that $a_h$ has norm 1 and is orthogonal to the previous loading vectors $a_m$, $m < h$. The principal component vector is defined as $t_h = X a_h$.
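A minimal sketch of this computation with NumPy (synthetic data as an illustrative assumption): center $X$, form the variance-covariance matrix, and take its leading eigenvector as the first loading vector.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # 50 observations, 4 correlated variables

Xc = X - X.mean(axis=0)                  # center the columns
cov = (Xc.T @ Xc) / (Xc.shape[0] - 1)    # sample variance-covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]   # first loading vector, of norm 1
t1 = Xc @ a1         # first principal component t_1 = X a_1

explained = eigvals / eigvals.sum()  # proportion of variance per component
```

By construction, the variance of $t_1$ equals the largest eigenvalue, i.e. the maximum of $\operatorname{var}(Xa)$ over unit-norm $a$.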
Nonmetric Multidimensional Scaling (nMDS)
Unlike methods such as PCA that rely on Euclidean distances, nMDS uses rank orders, and is therefore a more robust method well suited to a wide variety of data. In addition, nMDS is an iterative algorithm: it stops computation once a stable solution has been reached, or after a fixed number of attempts.
The nMDS procedure has the following steps:
Define the original positions of samples in multidimensional space.
Construct the distance matrix (matrix of dissimilarities).
The distance matrix is the matrix of all pairwise distances among samples, calculated with an appropriate distance measure such as the Euclidean or Bray–Curtis distance. The Bray–Curtis distance is calculated as
$$D(x_1, x_2) = \frac{A + B - 2W}{A + B},$$
where $A$ and $B$ are the total abundances of species at the two sites, and $W$ is the sum of the lesser values for only those species in common between both sites.
Choose a desired number $m$ of reduced dimensions.
Construct an initial configuration of the samples in the m-dimensions.
Regress distances in this initial configuration against the observed distances.
Measure goodness of fit by calculating the stress value. The most common choice is Kruskal's stress:
$$\text{Stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}},$$
where $d_{ij}$ are the distances in the configuration and $\hat{d}_{ij}$ the fitted values from the monotone regression.
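The Bray–Curtis distance used in step 2 can be sketched with NumPy and cross-checked against SciPy (two hypothetical sites; SciPy's `braycurtis` computes $\sum|x_1 - x_2| / \sum(x_1 + x_2)$, which equals $(A + B - 2W)/(A + B)$):

```python
import numpy as np
from scipy.spatial.distance import braycurtis

x1 = np.array([6., 0., 3., 5.])  # species abundances at site 1
x2 = np.array([2., 4., 3., 0.])  # species abundances at site 2

A = x1.sum()                  # sum of abundances at site 1
B = x2.sum()                  # sum of abundances at site 2
W = np.minimum(x1, x2).sum()  # sum of the lesser values of shared species

D = (A + B - 2 * W) / (A + B)  # Bray-Curtis dissimilarity

assert np.isclose(D, braycurtis(x1, x2))  # matches SciPy's implementation
```

Here A = 14, B = 9 and W = 5, so D = 13/23 ≈ 0.565.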
Tukey Honestly Significant Difference test (Tukey HSD)
While an analysis of variance (ANOVA) test can tell us whether the groups in a sample have different means, it cannot tell us which groups differ. We therefore apply the Tukey HSD test to determine which groups in the sample differ; consequently, Tukey HSD can only be performed after an ANOVA.
The Tukey method compares the mean of every group with the mean of every other group; in other words, it applies simultaneously to all pairwise comparisons among means. If we ran a t-test for every comparison, the type I error rate would increase with the number of means. To control the type I error rate, we use Tukey HSD instead of running several t-tests. The assumptions of the Tukey test are the same as for a t-test: normality, homogeneity of variance, and independent observations.
The Tukey HSD procedure has the following steps:
Calculate mean and variance for each group.
Compute the MSE (mean squared error) from the ANOVA.
Construct the Honest Significant Difference:
$$\text{HSD} = q_{\alpha, k, N-k} \sqrt{\frac{\text{MSE}}{n}},$$
where $q$ is the critical value of the studentized range distribution, $k$ is the number of groups, and $n$ is the number of observations in each group.
Compute a p-value for each comparison; this value can be obtained from a table of the studentized range distribution.
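The steps above can be sketched with SciPy's studentized range distribution (three hypothetical groups; assumes a balanced design with equal group sizes):

```python
import numpy as np
from scipy import stats

g1 = np.array([24., 26., 25., 27., 28.])
g2 = np.array([30., 31., 29., 32., 33.])
g3 = np.array([25., 27., 26., 24., 28.])
groups = [g1, g2, g3]

k = len(groups)    # number of groups
n = len(g1)        # observations per group (balanced design)
df = k * (n - 1)   # error degrees of freedom

# MSE: pooled within-group variance (the ANOVA mean squared error).
mse = np.mean([g.var(ddof=1) for g in groups])

# Honest Significant Difference at alpha = 0.05: any pair of means further
# apart than this is declared significantly different.
q_crit = stats.studentized_range.ppf(0.95, k, df)
hsd = q_crit * np.sqrt(mse / n)

# p-value for the comparison of groups 1 and 2, from the studentized range
# distribution (the table look-up, done numerically).
q_12 = abs(g1.mean() - g2.mean()) / np.sqrt(mse / n)
p_12 = stats.studentized_range.sf(q_12, k, df)
```

Running several independent t-tests here would inflate the familywise type I error; the studentized range distribution accounts for all three pairwise comparisons at once.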
The quantile approach to dimension reduction combines the idea of dimension reduction with the concept of sufficiency; it aims to generate a low-dimensional summary plot without appreciable loss of information. Compared with existing methods, (1) it requires minimal assumptions and is capable of revealing all dimension reduction directions; (2) it is robust against outliers; and (3) it is structure-adaptive, and thus more efficient (Kong and Xia, 2014).
Sparse PCA uses the low-rank approximation property of the SVD and the close link between low-rank approximation of matrices and least squares regression. The pseudo-code below explains how to obtain a sparse loading vector associated with the first principal component.
Extract the first left and right singular vectors (of norm 1) of the SVD of $X_h$ to initialize $t_h = \delta_1 u_1$ and $a_h = v_1$.
Until convergence of $a_h$:
(a) Soft-threshold the loadings, $a_h = g_\lambda(X_h^\top t_h)$, then norm $a_h$ to 1.
(b) Update the component, $t_h = X_h a_h$.
(c) Norm $t_h$ to 1.
The sparse loading vectors are computed component-wise, which results in a list of selected variables per component (Cao et al., 2014). We use this function to reduce the number of variables: we keep the variables whose loading values are different from 0.
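A minimal sketch of the first sparse loading vector via soft-thresholded power iterations, in the spirit of the SVD-based scheme above (the penalty value and the synthetic data are illustrative assumptions):

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding: shrink v toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pc1(X, lam, max_iter=200, tol=1e-8):
    """Sparse loading vector a (norm 1) and component t for the first PC."""
    # Initialize from the SVD: t = delta_1 * u_1, a = v_1.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    t = S[0] * U[:, 0]
    a = Vt[0]
    for _ in range(max_iter):
        a_old = a
        a = soft_threshold(X.T @ t, lam)  # (a) sparsify the loadings
        norm = np.linalg.norm(a)
        if norm == 0:                     # lam too large: all loadings shrunk out
            break
        a = a / norm                      #     norm a to 1
        t = X @ a                         # (b) update the component
        t = t / np.linalg.norm(t)         # (c) norm t to 1
        if np.linalg.norm(a - a_old) < tol:
            break
    return a, t

rng = np.random.default_rng(3)
z = rng.normal(size=(100, 1))
# Only the first two variables carry signal; the other eight are noise.
X = np.hstack([z + 0.1 * rng.normal(size=(100, 1)),
               z + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 8))])
X = X - X.mean(axis=0)

a, t = sparse_pc1(X, lam=5.0)
selected = np.flatnonzero(a != 0)  # variables kept: nonzero loadings
```

With this penalty, the nonzero loadings should recover the two signal variables while the noise variables are shrunk to exactly 0, which is how the variable selection works.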
Table of contents:
I. Introduction of INRA
A. General introduction
B. The Toulouse Centre
A. Context and objectives of research
B. Objective of internship
A. Data transformation
1. Box-Cox transformation
2. Geometric normalization
3. Wisconsin double standardization
B. Ordination methods
1. Principal component analysis (PCA)
2. Nonmetric Multidimensional Scaling (nMDS)
3. Analysis of Similarities (ANOSIM)
4. Canonical correlation analysis (CCA)
5. Regularized generalized canonical correlation analysis (rGCCA)
C. Multilevel modeling
1. Analysis of variance (ANOVA)
2. Mixed effects model
3. Tukey Honestly Significant Difference test (Tukey HSD)
D. Dimension reduction
1. Quantile approach
2. Sparse PCA
IV. Results and discussion
1. Nonmetric Multidimensional Scaling (nMDS)
2. Analysis of Similarities (ANOSIM)
3. Mixed-effects Model
4. Analysis of variance (ANOVA) and Tukey HSD test
5. Dimension reductions
6. Regularized canonical correlation analysis (rCCA)
V. Tables and Figures