
## Statistical models for association mapping and power evaluation

Mixed models are now routinely used to control type I error in GWAS (YU et al. 2006). Relatedness among individuals is taken into account by considering that the random polygenic effects are not independent, with a covariance matrix determined by kinship (K, with as many rows and columns as individuals: N). As K includes information on both population structure and relatedness, it is in general not useful to include admixture information as fixed-effect covariates (ASTLE and BALDING 2009). We therefore considered the following statistical model (denoted by MK):

$$Y = \mathbf{1}\mu + X_l\beta_l + U + E = X\beta + U + E,$$

with $X = [\mathbf{1}\ X_l]$ and $\beta^T = (\mu, \beta_l)$, where $Y$ is the vector of N phenotypes, $\mu$ is the intercept, $\mathbf{1}$ is a vector of N ones, $X_l$ is the vector of N genotypes at the tested locus (0 and 1 corresponding to homozygotes and 0.5 to heterozygotes), $\beta_l$ is the additive effect of locus $l$ to be estimated, $U \sim N(0, K\sigma_{g_u}^2)$ is the vector of random polygenic effects, $\sigma_{g_u}^2$ being the residual polygenic variance, $E \sim N(0, I\sigma_e^2)$ is the vector of remaining residual effects with variance $\sigma_e^2$, and $I$ is an identity matrix of size equal to the number of individuals (N). $U$ and $E$ are independent.

Locus effects in this mixed model can be tested using Wald statistics (WALD 1943). In the general case, a given linear combination of fixed effects $L^T\beta = 0$ (H0 hypothesis) can be tested against $L^T\beta \neq 0$ (the alternative hypothesis H1) using:

$$W = (L^T\hat{\beta})^T \left[ L^T \left( X^T \left( K\hat{\sigma}_{g_u}^2 + I\hat{\sigma}_e^2 \right)^{-1} X \right)^{-1} L \right]^{-1} (L^T\hat{\beta}).$$
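As an illustration of the test described above, the following is a minimal sketch of the Wald statistic for a single locus under model MK. It rests on several assumptions that are not in the text: the variance components are treated as known inputs (in practice they would be estimated, for instance by REML), the Pvalue is taken from a chi-square distribution with one degree of freedom, and the function and argument names (`wald_test`, `sigma2_g`, `sigma2_e`) are hypothetical.

```python
import math
import numpy as np


def wald_test(y, x_l, K, sigma2_g, sigma2_e):
    """Wald test of H0: beta_l = 0 in Y = 1*mu + X_l*beta_l + U + E.

    Assumption: sigma2_g and sigma2_e are supplied by the caller
    (they stand in for the estimated variance components).
    """
    y = np.asarray(y, dtype=float)
    x_l = np.asarray(x_l, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), x_l])         # X = [1  X_l]
    V = K * sigma2_g + np.eye(n) * sigma2_e        # covariance of Y
    XtVinvX = X.T @ np.linalg.solve(V, X)          # X' V^-1 X
    XtVinvy = X.T @ np.linalg.solve(V, y)          # X' V^-1 y
    beta_hat = np.linalg.solve(XtVinvX, XtVinvy)   # GLS estimate of (mu, beta_l)
    L = np.array([0.0, 1.0])                       # selects beta_l
    var_Lb = L @ np.linalg.solve(XtVinvX, L)       # L' (X' V^-1 X)^-1 L
    W = (L @ beta_hat) ** 2 / var_Lb
    # Survival function of a chi-square with 1 df (assumed reference law).
    p_value = math.erfc(math.sqrt(W / 2.0))
    return beta_hat, W, p_value
```

With K set to the identity matrix and `sigma2_g = 0`, the estimate reduces to ordinary least squares, which gives a quick sanity check of the implementation.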

**Simulation based evaluation of the impact of the estimation of K on false positive control and power**

The closed form expression of the non-centrality parameter already revealed that kinship affects power. Comparing the impact of different kinship estimators on power requires evaluating their ability to guarantee the expected nominal control of false positives under different hypotheses on trait genetic determinism. To this end, we simulated traits influenced by L biallelic QTLs (SNPs). In a first step, QTLs were sampled randomly among the SNPs located on all the chromosomes except one. The chromosome without QTL (further referred to as the "H0-chromosome") was used to estimate the false positive rate. All the H0-markers (the markers on the H0-chromosome) were tested with the above-mentioned statistical models for each simulation run. The efficiency of the different estimations of K to control false positives was evaluated by comparing expected and observed quantiles of H0-Pvalues and histograms of H0-Pvalues. In a second step, we applied the same procedure, but now sampling the QTLs among the M SNPs (on all chromosomes). A QTL was declared detected when the Pvalue of the corresponding SNP in the genetic model was below the significance threshold. Power of a given model was computed as the number of QTLs that were detected. We also applied a less restrictive definition of QTL detection, considering that a QTL could be detected by SNPs located near it. To do so, another analysis was conducted in which markers within a given genetic distance of a QTL were considered H1-markers and the others H0-markers. The realized false discovery rate (FDR) was defined as the proportion of H0-markers among the markers declared significant. Power of QTL detection was estimated by considering that a QTL was detected when at least one of the corresponding H1-markers had a significant Pvalue. This general method will be exemplified with parameters specific to three maize panels described below.
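The window-based definitions of realized FDR and power given above can be sketched as follows. This is only an illustration under stated assumptions: all markers are taken to lie on a single chromosome, positions and the window share the same (genetic distance) unit, power is returned as a proportion of detected QTLs rather than a count, and the FDR is set to 0 when no marker is significant. The function name `fdr_and_power` and its arguments are hypothetical.

```python
def fdr_and_power(positions, pvalues, qtl_positions, window, threshold):
    """Realized FDR and QTL detection power for one simulation run.

    A marker is an H1-marker when it lies within `window` of any QTL;
    all other markers are H0-markers.
    """
    significant = [i for i, p in enumerate(pvalues) if p < threshold]

    def is_h1(pos):
        return any(abs(pos - q) <= window for q in qtl_positions)

    # FDR: proportion of H0-markers among the significant markers.
    false_hits = sum(1 for i in significant if not is_h1(positions[i]))
    fdr = false_hits / len(significant) if significant else 0.0

    # A QTL is detected when at least one of its H1-markers is significant.
    detected = sum(
        1
        for q in qtl_positions
        if any(abs(positions[i] - q) <= window for i in significant)
    )
    power = detected / len(qtl_positions) if qtl_positions else 0.0
    return fdr, power
```

Setting `window` to 0 recovers the stricter definition in which a QTL is detected only by its own SNP.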

**Diversity and Linkage Disequilibrium in maize panels**

Diversity and Linkage Disequilibrium (LD) were investigated within the different panels to provide elements on their ability to detect QTL (i.e., their power) along the genome. On average, the Minor Allele Frequency (MAF) was lower in the CF-Flint than in the other panels. Differentiation among genetic groups (Fst) was higher for CF-Dent (0.15) than for C-K (0.11) and CF-Flint (0.08) (Table 1). The raw LD (r²) and its correction by kinship (r²K) were variable between and within panels (Figure 1). LD was on average higher in the dent panel. Within each panel, it was higher for centromeric than for telomeric regions. High r² values were observed between physically linked markers but also between unlinked markers. This last situation occurred mainly between centromeric regions (Figure 1A, chromosomes 5, 7, and 8, and Figure 1B, chromosome 7). Inter-chromosomal LD was reduced to a large extent when considering r²K rather than r². Taking into account covariance between individuals (r²K) also reduced intra-chromosomal LD, in particular between distant blocks with high LD (Figure 1B, chromosome 10). Overall, considering r²K instead of r² had the strongest impact in the CF-Dent panel.
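The quantities compared in this paragraph can be illustrated with a short sketch. The raw r² below is the usual squared correlation between two genotype vectors. The kinship-corrected version is an assumption: the excerpt does not give the formula for r²K, so the code shows one plausible form in which means, variances, and covariances are weighted by the inverse of K. The names `minor_allele_frequency` and `ld_r2` are hypothetical.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """MAF for genotypes coded 0, 0.5, 1 (the mean is an allele frequency)."""
    freq = float(np.mean(genotypes))
    return min(freq, 1.0 - freq)


def ld_r2(x, y, K=None):
    """Squared correlation between two loci.

    With K=None this is the raw r2. With a kinship matrix K, all sums are
    weighted by K^-1 (an assumed form of a kinship-corrected r2).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    Kinv = np.eye(n) if K is None else np.linalg.inv(K)
    ones = np.ones(n)
    denom = ones @ Kinv @ ones
    xc = x - (ones @ Kinv @ x) / denom   # remove the weighted mean
    yc = y - (ones @ Kinv @ y) / denom
    cov_xy = xc @ Kinv @ yc
    return cov_xy ** 2 / ((xc @ Kinv @ xc) * (yc @ Kinv @ yc))
```

When K is proportional to the identity matrix, the weighted version coincides with the raw r², so any difference between the two values reflects covariance between individuals.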

### Relationship between MAF, Fst, CorK and power

The parametrization of QTL effects described above was used to investigate the influence of MAF, Fst, and the correlation between local and global covariance matrices (estimated as CorK_Freq) on power in the three maize panels. Level plots (Figure 2) showed that MAF, Fst, and CorK_Freq had important effects on power, with very similar graphs in all the panels. The highest power was achieved when MAF was high and Fst or CorK_Freq was low. When MAF was below 0.1, power was close to 0 even if the marker had a low Fst or a low CorK_Freq. Some regions of the level plots were not covered by the available markers (regions in white in Figure 2); in particular, there was no marker with a CorK_Freq below 0.03. Note that the graphs obtained using K_Chr (or the IBS) were similar to those obtained with K_Freq and led to the same general conclusions (results not shown).

The parameters related to power (MAF, Fst, CorK_Freq) varied between panels (Table 1, see above). As a consequence of the relationships described above, the mean analytical power of statistical model MK_Freq varied between the three panels (Table 1), and was higher in the C-K panel (11.3%) than in the CF-Dent and CF-Flint panels (below 9.0%).
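To make the link between MAF and power concrete, here is a minimal sketch of how analytical power follows from a non-centrality parameter for a one-degree-of-freedom Wald test. It is a deliberately simplified illustration, not the closed form used in the study: `ncp_unrelated` assumes unrelated inbred lines coded 0/1 and ignores kinship entirely, and both function names are hypothetical.

```python
import math
from statistics import NormalDist


def wald_power(ncp, alpha):
    """Power of a 1-df Wald test whose statistic is noncentral chi-square."""
    std = NormalDist()
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)   # sqrt of the chi-square cutoff
    delta = math.sqrt(ncp)
    # W = (Z + delta)^2 exceeds z_crit^2 when Z + delta falls in either tail.
    return (1.0 - std.cdf(z_crit - delta)) + std.cdf(-z_crit - delta)


def ncp_unrelated(n, beta, maf, sigma2_e):
    """Non-centrality parameter under a simplifying assumption:
    n unrelated inbred lines coded 0/1, no polygenic covariance."""
    return n * beta ** 2 * maf * (1.0 - maf) / sigma2_e
```

Under these assumptions the non-centrality parameter is proportional to MAF × (1 − MAF), so for a fixed effect size it shrinks quickly as MAF decreases, which agrees in direction with the low power observed for markers with MAF below 0.1.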

### Variation of analytical power and CorK along chromosomes

[…] was in accordance with trends of CorK_Freq along the genome. Correlation between the covariance matrix at the marker and the global covariance matrix (K_Freq and K_Chr) was significantly lower for K_Chr than for K_Freq, particularly in the pericentromeric regions (Figure 4). We observed that peaks of Fst generally corresponded to peaks of both correlations (CorK_Freq and CorK_Chr) (Figure 4B, chromosome 7, and Figures 4A and 4C, chromosome 8). Conversely, pericentromeric regions with low Fst corresponded to a peak of CorK_Freq and a drop of CorK_Chr (Figure 4B, chromosomes 8 and 10, and Figure 4C, chromosome 7). CorK_Freq, CorK_Chr, and the difference between these two parameters were higher in the CF-Dent panel than in the two others.

### Simulation based assessment of kinship estimation on false positive control and power

Simulating different genetic models using the genotypes of the three panels allowed the comparison of the efficiency of the three statistical models to control false positives and to detect QTLs. The efficiency to control false positives depended on the genetic model (number of QTLs), the panel, and the estimation procedure for K (Table 2). The distribution of the Pvalues under H0 revealed that MK_Freq was conservative (Figure 5A), whereas the alternative models MK_Chr and MK_LD gave distributions closer to the expected one (Figures 5B and 5C). The observed Pvalue quantiles were closer to the expected Pvalue quantiles with MK_Chr and MK_LD than with MK_Freq (Table 2). MK_Freq resulted in fewer small Pvalues than expected under H0; for example, in the CF-Dent panel we observed only half of the Pvalues that were expected to be below 0.001. Observed Pvalue quantiles with MK_Chr and MK_LD were very close to the expected Pvalue quantiles, although most of the time they were also below them.
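The comparison between observed and expected Pvalues under H0 can be sketched as follows. The sketch assumes that H0 Pvalues should be uniform on [0, 1], so that the expected number below a threshold t among n tests is t × n; a ratio clearly below 1 then indicates a conservative model. The function names are hypothetical and the quantile definition used here is one simple convention among several.

```python
import math


def observed_over_expected(h0_pvalues, threshold):
    """Ratio of observed to expected count of H0 Pvalues below threshold.

    Assumes uniform Pvalues under H0: expected count = threshold * n.
    """
    observed = sum(1 for p in h0_pvalues if p < threshold)
    expected = threshold * len(h0_pvalues)
    return observed / expected


def empirical_quantile(values, q):
    """Smallest observed value with at least a fraction q of values <= it."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]
```

For instance, a ratio of 0.5 at a threshold of 0.001 corresponds to the situation reported for MK_Freq in the CF-Dent panel, where only half of the expected small Pvalues were observed.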

**Table of contents:**

**Chapter 1 Recovering power in association mapping panels with variable levels of linkage disequilibrium**

ABSTRACT

INTRODUCTION

MATERIALS AND METHODS

Statistical models for association mapping and power evaluation

Analytical evaluation of the impact of panel characteristics on power

Kinship estimation

Simulation based evaluation of the impact of the estimation of K on false positive control and power

Genetic material and genotyping data

Specific parameterization

RESULTS

Diversity and Linkage Disequilibrium in maize panels

Relationship between MAF, Fst, CorK and power

Variation of analytical power and CorK along chromosomes

Simulation based assessment of kinship estimation on false positive control and power

DISCUSSION AND CONCLUSIONS

Analytical investigation of potential power along the genome with usual model (MK_Freq)

Simulation based comparison of type I risk and power of statistical models associated with different estimations of K

Acknowledgments

LITERATURE

**Chapter 2 Dent and Flint maize diversity panels reveal important genetic potential for increasing biomass production**

ABSTRACT

INTRODUCTION

MATERIALS AND METHODS

Genetic material and genotyping data

Diversity analysis

Linkage Disequilibrium (LD)

Phenotypic data

Phenotypic characterization of the genetic groups within each panel

Statistical model for association mapping

RESULTS

Diversity and structure analysis

Linkage disequilibrium

Phenotypic variation

Phenotypic characterization of the genetic groups within each panel

Association mapping results

DISCUSSION AND CONCLUSION

Genetic Diversity organization

Trait variation within and among genetic groups

Association mapping results

Conclusions

Acknowledgments

LITERATURE

**Chapter 3**

ABSTRACT

INTRODUCTION

MATERIALS AND METHODS

Genetic material

Field data

Genotyping, diversity and relationship matrix

Statistical model

Optimization criteria and CD

Optimization algorithm

Observed prediction reliability and robustness of the optimization to variation of heritability

Link between the PEV and the observed prediction error

Genetic properties of optimized calibration sets

RESULTS

Trait variation

Description of the diversity and of the genomic relationship matrix

Observed prediction reliability and robustness of the optimization to variation of heritability

Link between the PEV and the observed prediction error

Genetic properties of optimized calibration sets

DISCUSSION

Acknowledgments

LITERATURE

General discussion

Increasing power in association mapping

Using molecular information to maximize GS efficiency: optimizing the sampling of the calibration set

Diversity analysis and association mapping in the Dent and Flint Cornfed panels

Towards an integrated approach in plant breeding

LITERATURE

**APPENDICES**

Appendix I: supplemental chapter 1

Appendix II: supplemental chapter 2

Appendix III: supplemental chapter 3