Observed prediction reliability and robustness of the optimization to variation of heritability

Statistical models for association mapping and power evaluation

Mixed models are now routinely used to control type I error in GWAS (YU et al. 2006). Relatedness among individuals is taken into account by considering that the random polygenic effects are not independent, with a covariance matrix determined by kinship (K, an N × N matrix, where N is the number of individuals). As K includes information on both population structure and relatedness, it is in general not useful to include admixture information as fixed-effect covariates (ASTLE and BALDING 2009). We therefore considered the following statistical model (denoted by MK):

$$Y = \mathbf{1}\mu + X_l\beta_l + U + E = X\beta + U + E,$$

with $X = [\mathbf{1}\;\, X_l]$ and $\beta^T = (\mu, \beta_l)$, where $Y$ is the vector of N phenotypes, $\mu$ is the intercept, $\mathbf{1}$ is a vector of N ones, $X_l$ is the vector of N genotypes at the tested locus (0 and 1 corresponding to homozygotes and 0.5 to heterozygotes), $\beta_l$ is the additive effect of locus l to be estimated, $U \sim N(0, K\sigma_g^2)$ is the vector of random polygenic effects, $\sigma_g^2$ being the residual polygenic variance, $E \sim N(0, I\sigma_e^2)$ is the vector of remaining residual effects with variance $\sigma_e^2$, and $I$ is the N × N identity matrix. U and E are independent.

Locus effects in this mixed model can be tested using Wald statistics (WALD 1943). In the general case, a given linear combination of fixed effects $L^T\beta = 0$ (the null hypothesis H0) can be tested against $L^T\beta \neq 0$ (the alternative hypothesis H1) using:

$$W = (L^T\hat{\beta})^T \left[ L^T \left( X^T (K\hat{\sigma}_g^2 + I\hat{\sigma}_e^2)^{-1} X \right)^{-1} L \right]^{-1} (L^T\hat{\beta}).$$
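As an illustration of the test described above, the following minimal sketch computes the Wald statistic for a single locus with NumPy. It is not the authors' implementation. The function names `wald_statistic` and `wald_pvalue` are hypothetical, the variance components are passed in as known values (in practice they would be estimated, for example by REML, which is omitted here), and the Pvalue relies on the assumption that W asymptotically follows a chi-square distribution with one degree of freedom under H0. The toy kinship matrix and phenotypes are simulated for the example only.

```python
import math
import numpy as np


def wald_statistic(y, x_l, K, sigma2_g, sigma2_e):
    """Wald statistic for the additive effect of one locus under model MK.

    sigma2_g and sigma2_e are treated as given (assumption of this sketch).
    Returns (W, beta_hat) with beta_hat = (mu, beta_l).
    """
    n = y.shape[0]
    X = np.column_stack([np.ones(n), x_l])        # X = [1  X_l]
    V = K * sigma2_g + np.eye(n) * sigma2_e       # Var(Y) = K*sigma2_g + I*sigma2_e
    Vinv_X = np.linalg.solve(V, X)                # V^{-1} X
    Vinv_y = np.linalg.solve(V, y)                # V^{-1} Y
    XtVinvX = X.T @ Vinv_X                        # X^T V^{-1} X
    beta_hat = np.linalg.solve(XtVinvX, X.T @ Vinv_y)  # generalized least squares
    L = np.array([0.0, 1.0])                      # L^T beta = beta_l
    lb = L @ beta_hat                             # L^T beta_hat
    var_lb = L @ np.linalg.inv(XtVinvX) @ L       # L^T (X^T V^{-1} X)^{-1} L
    return float(lb * lb / var_lb), beta_hat


def wald_pvalue(w):
    """Upper tail of a chi-square with 1 df (asymptotic approximation)."""
    return math.erfc(math.sqrt(w / 2.0))


# Toy data, simulated for illustration only.
rng = np.random.default_rng(0)
n = 50
x_l = rng.choice([0.0, 0.5, 1.0], size=n)         # genotypes coded 0 / 0.5 / 1
A = rng.normal(size=(n, 200))
K = A @ A.T / 200.0                               # arbitrary positive-definite "kinship"
y = 1.0 + 2.0 * x_l + rng.normal(size=n)

W, beta_hat = wald_statistic(y, x_l, K, sigma2_g=0.5, sigma2_e=1.0)
p_value = wald_pvalue(W)
```

With K set to the identity matrix and sigma2_g set to 0, the estimate reduces to ordinary least squares, which gives a quick sanity check of the code.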

Simulation based evaluation of the impact of the estimation of K on false positive control and power

The closed-form expression of the non-centrality parameter already revealed that kinship affects power. Comparing the impact of different kinship estimators on power requires evaluating their ability to guarantee the expected nominal control of false positives under different hypotheses on trait genetic determinism. To this end, we simulated traits influenced by L biallelic QTLs (SNPs). In a first step, QTLs were sampled randomly among the SNPs located on all the chromosomes except one. The chromosome without QTL (hereafter referred to as the "H0-chromosome") was used to estimate the false positive rate. All the H0-markers (the markers on the H0-chromosome) were tested with the above-mentioned statistical models in each simulation run. The efficiency of the different estimations of K in controlling false positives was evaluated by comparing expected and observed quantiles of H0-Pvalues and by inspecting histograms of H0-Pvalues. In a second step, we applied the same procedure, but now sampling the QTLs among the M SNPs (on all chromosomes). A QTL was declared detected when the Pvalue of the corresponding SNP in the genetic model was below the significance threshold. Power of a given model was computed as the number of QTLs that were detected. We also applied a less restrictive definition of QTL detection, considering that a QTL could be detected by SNPs located near it. To do so, another analysis was conducted in which markers within a given genetic distance of a QTL were considered H1-markers and the others H0-markers. The realized false discovery rate (FDR) was defined as the proportion of H0-markers among the markers declared significant. Power of QTL detection was estimated by considering that a QTL was detected when at least one of the corresponding H1-markers had a significant Pvalue. This general method will be exemplified with parameters specific to three maize panels described below.
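The less restrictive detection rule can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name `fdr_and_power`, the representation of markers and QTLs as (chromosome, position) pairs, and the choice of an inclusive distance comparison are all assumptions, and the example values are invented.

```python
def fdr_and_power(markers, pvalues, qtls, threshold, window):
    """Realized FDR and QTL detection power for one simulation run.

    markers, qtls: lists of (chromosome, genetic_position) tuples.
    A marker is an H1-marker if it lies within `window` of a QTL on the
    same chromosome; all other markers are H0-markers.
    Returns (fdr, number_of_detected_qtls, proportion_of_detected_qtls).
    """
    def near(marker, qtl):
        return marker[0] == qtl[0] and abs(marker[1] - qtl[1]) <= window

    significant = [p < threshold for p in pvalues]
    is_h1 = [any(near(m, q) for q in qtls) for m in markers]

    n_significant = sum(significant)
    n_false = sum(s and not h for s, h in zip(significant, is_h1))
    # FDR: proportion of H0-markers among the markers declared significant.
    fdr = n_false / n_significant if n_significant else 0.0

    # A QTL is detected if at least one of its H1-markers is significant.
    detected = [
        any(s and near(m, q) for m, s in zip(markers, significant))
        for q in qtls
    ]
    n_detected = sum(detected)
    return fdr, n_detected, n_detected / len(qtls)


# Invented example: five markers, two QTLs.
markers = [(1, 10.0), (1, 12.0), (1, 50.0), (2, 11.0), (2, 80.0)]
pvalues = [1e-6, 0.2, 1e-5, 0.5, 0.3]
qtls = [(1, 11.0), (2, 80.0)]

fdr, n_detected, power = fdr_and_power(
    markers, pvalues, qtls, threshold=1e-4, window=2.0
)
```

In this example two markers are significant, one of which is far from any QTL, so the realized FDR is 0.5, and one of the two QTLs is detected.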

Diversity and Linkage Disequilibrium in maize panels

Diversity and Linkage Disequilibrium (LD) were investigated within the different panels to assess their ability to detect QTL (i.e. their power) along the genome. On average, the Minor Allele Frequency (MAF) was lower in the CF-Flint panel than in the other panels. Differentiation among genetic groups (Fst) was higher for CF-Dent (0.15) than for C-K (0.11) and CF-Flint (0.08) (Table 1). The raw LD (r²) and its correction by kinship (r²K) varied between and within panels (Figure 1). LD was on average higher in the dent panel. Within each panel, it was higher for centromeric than for telomeric regions. High r² values were observed between physically linked markers but also between unlinked markers. This last situation occurred mainly between centromeric regions (Figure 1A, chromosomes 5, 7, and 8 and Figure 1B, chromosome 7). Inter-chromosomal LD was reduced to a large extent when considering r²K rather than r². Taking into account covariance between individuals (r²K) also reduced intra-chromosomal LD, in particular between distant blocks with high LD (Figure 1B, chromosome 10). Overall, considering r²K instead of r² had the strongest impact in the CF-Dent panel.
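For readers unfamiliar with these statistics, here is a minimal sketch of how MAF and the raw r² can be computed from genotypes coded 0, 0.5 and 1 as in model MK. The function names are hypothetical and the genotype vectors are invented. The kinship-corrected r²K is deliberately not shown, because its definition is not given in this section.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """MAF for genotypes coded 0 / 0.5 / 1 (mean = frequency of the allele coded 1)."""
    p = float(np.mean(genotypes))
    return min(p, 1.0 - p)


def r_squared(g1, g2):
    """Raw LD between two loci: squared Pearson correlation of genotype vectors."""
    r = np.corrcoef(g1, g2)[0, 1]
    return float(r * r)


# Invented genotypes for six individuals at two loci.
locus_a = np.array([0.0, 0.0, 1.0, 1.0, 0.5, 0.0])
locus_b = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

maf_a = minor_allele_frequency(locus_a)
ld_ab = r_squared(locus_a, locus_b)
```

Note that r² is unchanged when the reference allele of one locus is swapped, since only the squared correlation is used.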

Relationship between MAF, Fst, CorK and power

The parametrization of QTL effects described above was used to investigate the influence of MAF, Fst, and the correlation between local and global covariance matrices (estimated as CorK_Freq) on power in the three maize panels. Level plots (Figure 2) showed that the MAF, the Fst, and CorK_Freq had important effects on power, with very similar graphs in all the panels. The highest power was achieved when MAF was high and Fst or CorK_Freq was low. When the MAF was below 0.1, power was close to 0 even if the marker had a low Fst or a low CorK_Freq. Some regions of the level plots were not covered by the available markers (regions in white on Figure 2); in particular, there was no marker with a CorK_Freq below 0.03. Note that the graphs obtained using K_Chr (or the IBS) were similar to those obtained with K_Freq and led to the same general conclusions (results not shown).
The parameters related to power (MAF, Fst, CorK_Freq) varied between panels (Table 1, see above). As a consequence of the relationships described above, the mean analytical power of statistical model MK_Freq varied between the three panels (Table 1), and was higher in the C-K panel (11.3%) than in the CF-Dent and CF-Flint panels (below 9.0%).

Simulation based assessment of kinship estimation on false positive control and power

Simulating different genetic models using the genotypes of the three panels allowed the comparison of the efficiency of the three statistical models to control false positives and to detect QTLs. The efficiency to control false positives depended on the genetic model (number of QTLs), the panel, and the estimation procedure for K (Table 2).
[...] was in accordance with trends of CorK_Freq along the genome. The correlation between the covariance matrix at the marker and the global covariance matrix (K_Freq and K_Chr) was significantly lower for K_Chr than for K_Freq, particularly in the pericentromeric regions (Figure 4). We observed that peaks of Fst generally corresponded to peaks of both correlations (CorK_Freq and CorK_Chr) (Figure 4B, chromosome 7, and Figures 4A and 4C, chromosome 8). Conversely, pericentromeric regions with low Fst corresponded to a peak of CorK_Freq and a drop of CorK_Chr (Figure 4B, chromosomes 8 and 10, and Figure 4C, chromosome 7). CorK_Freq, CorK_Chr and the difference between these two parameters were higher in the CF-Dent panel than in the two others.
The distribution of the Pvalues under H0 revealed that MK_Freq was conservative (Figure 5A), whereas the alternative models MK_Chr and MK_LD gave distributions closer to the expected one (Figures 5B and 5C). The observed Pvalue quantiles were closer to the expected Pvalue quantiles with MK_Chr and MK_LD than with MK_Freq (Table 2). MK_Freq resulted in fewer small Pvalues than expected under H0: for example, in the CF-Dent panel we observed only half of the Pvalues that were expected to be below 0.001. Observed Pvalue quantiles with MK_Chr and MK_LD were very close to the expected Pvalue quantiles, although most of the time still below them.
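The comparison of observed and expected H0-Pvalues can be sketched as follows. Under H0, a well-calibrated test yields uniformly distributed Pvalues, so a proportion α of them is expected to fall below any threshold α. The function name `calibration_ratios` and the thresholds are assumptions chosen for illustration, and the two sets of Pvalues are synthetic: a ratio of observed to expected counts below 1 indicates a conservative test, as reported for MK_Freq.

```python
def calibration_ratios(h0_pvalues, alphas=(0.05, 0.01, 0.001)):
    """Ratio of observed to expected proportion of H0-Pvalues below each alpha.

    A ratio close to 1 indicates nominal control of false positives,
    a ratio below 1 a conservative test, a ratio above 1 a liberal test.
    """
    n = len(h0_pvalues)
    ratios = {}
    for alpha in alphas:
        observed = sum(p < alpha for p in h0_pvalues) / n
        ratios[alpha] = observed / alpha
    return ratios


# Synthetic Pvalues: an evenly spaced grid mimics a perfectly uniform sample,
# and its square root mimics a conservative test (too few small Pvalues).
uniform_like = [(i + 0.5) / 10000 for i in range(10000)]
conservative_like = [p ** 0.5 for p in uniform_like]

calibrated = calibration_ratios(uniform_like)
conservative = calibration_ratios(conservative_like)
```

With real simulation output, `h0_pvalues` would hold the Pvalues of all markers on the H0-chromosome pooled over runs.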

Table of contents:

Chapter 1 Recovering power in association mapping panels with variable levels of linkage disequilibrium
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Statistical models for association mapping and power evaluation
Analytical evaluation of the impact of panel characteristics on power
Kinship estimation
Simulation based evaluation of the impact of the estimation of K on false positive control and power
Genetic material and genotyping data
Specific parameterization
RESULTS
Diversity and Linkage Disequilibrium in maize panels
Relationship between MAF, Fst, CorK and power
Variation of analytical power and CorK along chromosomes
Simulation based assessment of kinship estimation on false positive control and power
DISCUSSION AND CONCLUSIONS
Analytical investigation of potential power along the genome with usual model (MK_Freq)
Simulation based comparison of type I risk and power of statistical models associated with different estimations of K
Acknowledgments
LITERATURE
Chapter 2 Dent and Flint maize diversity panels reveal important genetic potential for increasing biomass production
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Genetic material and genotyping data
Diversity analysis
Linkage Disequilibrium (LD)
Phenotypic data
Phenotypic characterization of the genetic groups within each panel
Statistical model for association mapping
RESULTS
Diversity and structure analysis
Linkage disequilibrium
Phenotypic variation
Phenotypic characterization of the genetic groups within each panel
Association mapping results
DISCUSSION AND CONCLUSION
Genetic Diversity organization
Trait variation within and among genetic groups
Association mapping results
Conclusions
Acknowledgments
LITERATURE
Chapter 3
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Genetic material
Field data
Genotyping, diversity and relationship matrix
Statistical model
Optimization criteria and CD
Optimization algorithm
Observed prediction reliability and robustness of the optimization to variation of heritability
Link between the PEV and the observed prediction error
Genetic properties of optimized calibration sets
RESULTS
Trait variation
Description of the diversity and of the genomic relationship matrix
Observed prediction reliability and robustness of the optimization to variation of heritability
Link between the PEV and the observed prediction error
Genetic properties of optimized calibration sets
DISCUSSION
Acknowledgments
LITERATURE
General discussion
Increasing power in association mapping
Using molecular information to maximize GS efficiency: optimizing the sampling of the calibration set
Diversity analysis and association mapping in the Dent and Flint Cornfed panels
Towards an integrated approach in plant breeding
LITERATURE
APPENDICES
Appendix I: supplemental chapter 1
Appendix II: supplemental chapter 2
Appendix III: supplemental chapter 3