Disentangling group specific QTL allele effects from genetic background epistasis using admixed individuals in GWAS


Genetic structure: theory, inference and a maize perspective

Genetic structure refers to the existence of sub-groups of individuals within a population. It is tightly linked to the history of the species in which it is observed. It generally arises when groups of individuals are spatially separated and preferentially mate with individuals from their own group. Following their separation, differential evolutionary forces, such as drift or selection, cause a divergence of the allele frequencies between groups. This divergence is classically measured by the fixation index FST, first introduced by Wright (1943, 1949). FST is defined as the correlation of randomly chosen alleles within a given group relative to the entire population, or equivalently as the proportion of genetic diversity due to group differences in allele frequencies (see Holsinger and Weir (2009) for a review). The locus-specific estimates of FST can be used to identify regions that have been subjected to selection. Such regions may show a particularly high degree of differentiation compared to the genome-wide distribution of FST, and test procedures are implemented in software like BayeScan (Foll and Gaggiotti, 2008).
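As a minimal illustration of the second definition given above, the sketch below computes FST at a single bi-allelic locus as the share of expected heterozygosity that is explained by group differences in allele frequencies. The function name, the equal default group weights and the choice of expected heterozygosity as the diversity measure are assumptions made for this example; it is a textbook-style calculation, not one of the estimators or test procedures cited above.

```python
def fst_biallelic(freqs, weights=None):
    """FST at one bi-allelic locus from per-group frequencies of allele 1.

    Assumption: diversity is measured by expected heterozygosity 2p(1-p),
    and groups are weighted equally unless weights are supplied.
    """
    if weights is None:
        weights = [1.0 / len(freqs)] * len(freqs)
    # Allele frequency in the whole population
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # locus monomorphic in the whole population
    # Average diversity remaining within groups
    h_within = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, freqs))
    return (h_total - h_within) / h_total


print(fst_biallelic([0.5, 0.5]))  # identical frequencies: no differentiation
print(fst_biallelic([0.2, 0.8]))  # moderate differentiation
print(fst_biallelic([0.0, 1.0]))  # alleles differentially fixed
```

With identical frequencies the index is 0, and with differentially fixed alleles it reaches 1, which matches the interpretation of FST as a proportion of diversity.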
From a genome perspective, the genetic structure not only aﬀects the frequency of alleles at given loci, but also the LD between these loci. Group diﬀerences in LD extent and linkage phase between physically linked loci can be observed due to specific demographic histories. The extent of LD is tightly linked to the eﬀective population size and to its dynamics, with a strong impact of demographic events such as bottlenecks or expansions. It tends to decline when the eﬀect of recombination is large relative to drift, as observed in populations with a large eﬀective size. Conversely, LD extent tends to increase in small populations, for which the eﬀect of drift is strong compared to that of recombination (Pritchard and Przeworski, 2001; Rogers, 2014). Such group diﬀerences in LD have been identified using markers in numerous species including human (Sawyer et al., 2005; Evans and Cardon, 2005), dairy and beef cattle (de Roos et al., 2008; Porto-Neto et al., 2014), pig (Badke et al., 2012), wheat (Hao et al., 2011) or maize (Van Inghelandt et al., 2011; Technow et al., 2012; Bouchet et al., 2013; Rincent et al., 2014b). In addition to group diﬀerences in LD, the stratification of a population into genetic groups generates LD between loci that are not genetically linked. These loci are those showing a high diﬀerentiation between groups. An extreme example would consist of a set of bi-allelic loci diﬀerentially fixed between genetic groups, and for which a global LD of 1 would be estimated between all pairs of loci.
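The LD generated by stratification between unlinked loci can be made concrete with a small calculation. The sketch below assumes inbred individuals and linkage equilibrium within each group, so that the only source of LD in the pooled population is the covariance of group allele frequencies. The function name and the example frequencies are hypothetical.

```python
def pooled_r2(freqs_a, freqs_b, weights=None):
    """r^2 between two loci in the pooled population.

    Assumption: the two loci are in linkage equilibrium within each group,
    so the pooled disequilibrium D only comes from group differences
    in allele frequencies.
    """
    n_groups = len(freqs_a)
    if weights is None:
        weights = [1.0 / n_groups] * n_groups
    p_a = sum(w * p for w, p in zip(weights, freqs_a))
    p_b = sum(w * p for w, p in zip(weights, freqs_b))
    # Between-group covariance of allele frequencies
    d = sum(w * (a - p_a) * (b - p_b)
            for w, a, b in zip(weights, freqs_a, freqs_b))
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return float("nan")  # at least one locus is monomorphic overall
    return d * d / denom


print(pooled_r2([0.0, 1.0], [0.0, 1.0]))  # differentially fixed loci
print(pooled_r2([0.9, 0.1], [0.9, 0.1]))  # highly differentiated loci
print(pooled_r2([0.5, 0.5], [0.2, 0.8]))  # one locus not differentiated
```

The first call reproduces the extreme case described above, with a global LD of 1 between differentially fixed loci, while a locus with identical frequencies in both groups shows no structure-induced LD.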
Genetic groups are not always perfectly separated, as gene flow may occur between groups. Genetic admixture refers to the existence of DNA fragments of different ancestries within an individual, forming a mosaic of ancestry blocks. Such individuals result from the mating of individuals from different genetic groups, and admixture is thought to be an important factor that allowed the adaptation of species to new environments (Rius and Darling, 2014).
When a population of individuals has been genotyped using molecular markers, one can investigate the existence of population structure within the dataset. This procedure may involve different but complementary objectives, including the detection of population structure, the determination of the number of genetic groups and the assignment of individuals to these groups. Principal coordinates analysis can simply be applied to a distance matrix computed from marker data, in order to identify clusters on principal components. Model-based methods were also developed, the best known being the STRUCTURE software developed by Pritchard et al. (2000), which models the probability of observed genotypes using ancestry proportions and population allele frequencies. The algorithm was implemented in a Bayesian framework and was further extended to allow for linkage between markers and admixture (Falush et al., 2003). Another software, ADMIXTURE, was developed by Alexander et al. (2009) using the same model as STRUCTURE. It is based on maximum likelihood estimation, which considerably reduces computing time. Other methods specialize in the inference of local ancestry, aiming at locally assigning chromosome blocks to different groups, such as LAMP (Sankararaman et al., 2008), RFmix (Maples et al., 2013) and many others (see Liu et al. (2013) and Padhukasahasram (2014) for reviews).
From a maize perspective, genetic structure is a major component of the existing diversity. Genetic groups have been shaped by various dissemination pathways since maize was domesticated in the Balsas valley region of Mexico around 9,000 years ago from a wild ancestor of the Zea genus (Matsuoka et al., 2002). From Mexico, maize would have spread in two main directions: north of Mexico to the United States, and south of Mexico to the Caribbean and South America. These expansions to contrasting environments led to main clusters of diversity including the Mexican highlands, the tropical lowlands, the Andean group or the northern USA group. Each of these clusters can be further divided into distinct genetic groups. For instance, the northern USA group includes the American Northern Flints, which were first introduced to the USA, the Southern Dents, later introduced from the Caribbean, and the Corn-Belt Dents originating from their hybridization. The introduction of maize in Europe probably results from two independent events: tropical material was introduced to Spain and American Northern Flints were introduced to northern Europe, creating the European Northern Flints (Fig. 2). Evidence of admixture between these genetic materials was found in Europe, and these events led to the creation of new groups such as the European flints or the Italian group (Brandenburg et al., 2017). All these groups were further structured into heterotic groups that are currently used in hybrid breeding. For instance, hybrids between Corn-Belt Dents and European flints enhanced the productivity of maize in northern Europe and contributed to its propagation in agricultural systems (see Tenaillon and Charcosset (2011) for a European perspective on maize history).
Figure 2: Maize genetic groups and diffusion pathways inferred from 66 maize landraces, adapted from Brandenburg et al. (2017)
As illustrated by the example of maize, a population stratified into genetic groups seldom involves the existence of clearly separated groups with similar degrees of diﬀerentiation. It is rather the result of a complex phylogeny with hierarchical levels, shaped by demography, migrations or admixture events.

How does genetic structure aﬀect quantitative traits?

The standard model of quantitative genetics does not explicitly account for the existence of genetic structure, but rather assumes a single population. However, the stratification of a population into genetic groups may impact the genetic components in diﬀerent manners.
Let us consider a structured population including P genetic groups studied for a given trait. Within each group, the genetic value of each individual is computed as a sum of allele effects at M bi-allelic QTLs. For the sake of simplicity, each individual is assumed to be inbred (no heterozygosity) and to show no admixture. QTLs are assumed to be in linkage equilibrium within each group and not to interact with other loci (no epistatic interactions). We can model the genetic value of a given individual as:
$$G_i = \sum_{p=1}^{P} Z_{ip} \sum_{m=1}^{M} \left[ \beta_{mp}^{0} + W_{im} \left( \beta_{mp}^{1} - \beta_{mp}^{0} \right) \right]$$
where Gi is the genetic value of individual i, Zip is a variable taking the value « 1 » if i belongs to group p or « 0 » otherwise, (Wim|Zip = 1) ∼ B(fmp) is the genotype (coded 0/1) of individual i at locus m, drawn conditionally on the group ancestry of i from a Bernoulli distribution of parameter fmp, with fmp being the frequency of allele 1 at locus m in group p, βmp0 and βmp1 are the QTL allele effects at locus m in group p for alleles 0 and 1, respectively, and all random variables are assumed to be independent of each other.
Using this generative model, it is possible to study how genetic structure affects a quantitative trait in terms of expected value and genetic variance, which are important parameters to characterize a breeding population. The expected value of a given individual from genetic group p will be:
$$E(G_i \mid Z_{ip} = 1) = \sum_{m=1}^{M} \left[ \beta_{mp}^{0} + f_{mp} \left( \beta_{mp}^{1} - \beta_{mp}^{0} \right) \right] = \mu_p$$
where group differences in expected value may result from group-specific QTL allele frequencies but also from differences in terms of QTL allele effects. The same observation can be made concerning the genetic variance of a given individual from genetic group p:
$$V(G_i \mid Z_{ip} = 1) = \sum_{m=1}^{M} f_{mp} (1 - f_{mp}) \left( \beta_{mp}^{1} - \beta_{mp}^{0} \right)^2 = \sigma_G^2$$
where group diﬀerences in genetic variance may result from group-specific QTL allele frequencies and/or allele eﬀects.
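The expected value and genetic variance of a group can be checked numerically against the generative model. The sketch below simulates inbred, non-admixed individuals of a single group by drawing each QTL genotype from a Bernoulli distribution, then compares the empirical mean and variance of their genetic values with the sums over loci stated above. The number of QTLs, the allele frequencies, the allele effects and the seed are hypothetical values chosen only for illustration.

```python
import random


def expected_mean(freqs, beta0, beta1):
    """Sum over QTLs of beta0 + f * (beta1 - beta0)."""
    return sum(b0 + f * (b1 - b0) for f, b0, b1 in zip(freqs, beta0, beta1))


def expected_variance(freqs, beta0, beta1):
    """Sum over QTLs of f * (1 - f) * (beta1 - beta0)^2."""
    return sum(f * (1.0 - f) * (b1 - b0) ** 2
               for f, b0, b1 in zip(freqs, beta0, beta1))


def simulate_group(freqs, beta0, beta1, n_individuals, rng):
    """Draw genetic values for inbred individuals of one genetic group."""
    values = []
    for _ in range(n_individuals):
        g = 0.0
        for f, b0, b1 in zip(freqs, beta0, beta1):
            w = 1 if rng.random() < f else 0  # Bernoulli genotype, coded 0/1
            g += b0 + w * (b1 - b0)
        values.append(g)
    return values


rng = random.Random(1)  # fixed seed so the run is reproducible
n_qtl = 20  # hypothetical number of QTLs
freqs = [rng.uniform(0.05, 0.95) for _ in range(n_qtl)]
beta0 = [0.0] * n_qtl
beta1 = [rng.gauss(0.0, 1.0) for _ in range(n_qtl)]

values = simulate_group(freqs, beta0, beta1, 20000, rng)
emp_mean = sum(values) / len(values)
emp_var = sum((v - emp_mean) ** 2 for v in values) / (len(values) - 1)
theo_mean = expected_mean(freqs, beta0, beta1)
theo_var = expected_variance(freqs, beta0, beta1)

print("mean: simulated", round(emp_mean, 3), "expected", round(theo_mean, 3))
print("variance: simulated", round(emp_var, 3), "expected", round(theo_var, 3))
```

Running the same functions with another set of frequencies or effects for a second group shows how either factor alone is enough to shift the group mean and the genetic variance.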
On the one hand, group diﬀerences in allele frequencies at QTLs are very likely by definition, as genetic groups are characterized by specific allele frequencies at loci. These diﬀerences may result from diﬀerential selection pressures in contrasting environments that shift QTL allele frequencies, or may simply be due to an independent drift within each group.
On the other hand, group-specific allele effects at causal QTLs may not be as likely. A possible explanation for their existence lies in epistatic interactions between the QTLs and the genetic background. In this case, the genetic background is represented by one or several loci that are differentially fixed between groups. For a given QTL A interacting with a single locus B, the QTL allele effect at locus A, defined as βAp1 − βAp0, will depend on the allele observed at locus B. If two genetic groups are highly differentiated at locus B, then the mean QTL effect will be different between the two genetic groups, as proposed by Tang (2006) and illustrated in Fig. 3. Another explanation is the appearance of a new genetic mutation very close to a QTL in a common founder of a given group, resulting in a different effect compared to a group in which the mutation is absent. Evidence of such mutations was found in humans, as several Mendelian symptoms of obesity were shown to result from mutations within specific ethnicities (see Stryjecki et al. (2018) for a review).
Figure 3: Schematic illustration of two interacting loci, adapted from Tang (2006). Filled bars represent common allele combinations, while open bars represent combinations that are not observed in the group. (a) In the first group, most individuals have genotype « aa » at locus A, and no QTL effect is observed at locus B, for which all allele combinations are common. (b) In the second group, most individuals have genotype « AA » at locus A, and the resulting QTL effect at locus B is higher.
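The epistatic explanation can be illustrated with a short calculation that follows the naming used in the text (QTL A, background locus B). The sketch assumes that the effect of QTL A takes one value when the background locus B carries allele 0 and another when it carries allele 1, so that the mean effect of A in a group is the average of these two values weighted by the frequency of allele 1 at B. The function name and the numerical values are hypothetical.

```python
def mean_effect_of_a(effect_if_b0, effect_if_b1, freq_b1):
    """Mean allele effect at QTL A within a group.

    Assumption: the effect of A only depends on the allele carried at
    background locus B, whose allele 1 has frequency freq_b1 in the group.
    """
    return (1.0 - freq_b1) * effect_if_b0 + freq_b1 * effect_if_b1


# Same biological interaction in both groups, different background frequencies
group_1 = mean_effect_of_a(0.0, 2.0, freq_b1=0.05)
group_2 = mean_effect_of_a(0.0, 2.0, freq_b1=0.95)
print(group_1, group_2)
```

Although the interaction itself is identical in both groups, the apparent QTL effect is close to zero in the first group and close to its maximum in the second, simply because the groups are differentiated at the background locus.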
In analogy to the FST indicator, which quantifies the proportion of genetic diversity due to group differences in allele frequencies, the QST indicator was proposed by Spitze (1993) to quantify the proportion of genetic variance that is due to among-group differences when studying quantitative traits. This indicator highlights the existence of a proportion of genetic variance that is not directly accessible to a breeder, unless admixture and segregation are generated by crossing individuals of different groups.
In conclusion, the stratification of a population into genetic groups impacts quantitative traits through diﬀerences in QTL allele frequencies and possibly through group-specific QTL allele eﬀects. One should notice that other factors may impact the mean and the genetic variance: group diﬀerences in LD between QTLs, in inbreeding, in dominance eﬀects, or even in interactions between QTL allele eﬀects.

Impact of genetic structure on association mapping and genomic selection

The stratification of a population into genetic groups may impact the methods to study quantitative traits, particularly GWAS and GS that involve molecular markers.
Applying GWAS to a structured population raises the issue of spurious associations. They result from the long range LD generated by genetic structure for SNPs and QTLs that are highly diﬀerentiated between groups, as previously discussed. If a given trait is characterized by group-specific means, all the SNPs diﬀerentiated between groups will correlate to it. An eﬃcient control of these spurious associations can be done by taking structure and kinship into account in the GWAS model (Yu et al., 2006; Price et al., 2006). For each bi-allelic marker m among M loci, a GWAS model can be written in a simplified version of that proposed by Yu et al. (2006) as:
Yijk = µ + βjm + αk + Gijk + Eijk
where Yijk is the phenotype of the individual, µ is the intercept, βjm is the effect of allele j, with j ∈ {0, 1}, at marker m, αk is the effect of genetic group k, Gijk is the random polygenic effect, g is the vector of random polygenic effects with g ∼ N(0, KσG2), K is the kinship matrix, σG2 is the genetic variance, Eijk is the error, e is the vector of errors with e ∼ N(0, IσE2), I is the identity matrix and σE2 is the error variance. This model can account for different levels of structure, using αk for the effect of the main stratification into genetic groups and modeling the genetic covariance between individuals using the kinship K for groups of related individuals. If genetic structure is not sufficiently accounted for by the model, false positives may be detected when testing for the existence of a differential effect between alleles (H0 : β1m − β0m = 0). As an example, the Dwarf8 locus was found to be associated with maize flowering time in early association studies (Thornsberry et al., 2001), and it was later shown that its effect had been greatly overestimated due to insufficient control of the genetic structure (Larsson et al., 2013). Once structure is accounted for by the GWAS model, a low power of detection is generally observed for the highly differentiated SNPs (Rincent et al., 2014a). QTLs located in differentiated regions are thus difficult to detect, especially in the case of rare alleles. This is why innovative genetic materials were developed, such as nested association mapping (NAM) (McMullen et al., 2009) or multi-parent advanced generation inter-cross (MAGIC) (Cavanagh et al., 2008). These genetic materials consist of progenies generated from a limited number of founders, in order to ensure a high statistical power along with a large diversity studied and a population structure that is either considered negligible using MAGIC or that can easily be controlled by the family structure using NAM.
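The mechanism behind spurious associations can be reproduced with a deliberately simplified simulation. The sketch below keeps only the fixed group effect of the model above and omits the random polygenic term and the kinship matrix, so it is not an implementation of the mixed model of Yu et al. (2006). A SNP with no effect on the trait is simulated with contrasting frequencies in two groups that differ in their trait mean, and the SNP effect is estimated with and without the group effect. Fitting a fixed group effect is obtained here by centering both the genotype and the phenotype within groups. All parameter values and function names are hypothetical.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on x (one regressor plus intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def center_by_group(values, groups):
    """Subtract each group's mean, equivalent to fitting a fixed group effect."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


rng = random.Random(42)  # fixed seed so the run is reproducible
group_mean = {0: 0.0, 1: 3.0}  # group-specific trait means
snp_freq = {0: 0.1, 1: 0.9}    # SNP highly differentiated between groups

groups, snp, pheno = [], [], []
for k in (0, 1):
    for _ in range(2000):
        groups.append(k)
        snp.append(1 if rng.random() < snp_freq[k] else 0)
        # The SNP has no effect: the phenotype only depends on the group
        pheno.append(group_mean[k] + rng.gauss(0.0, 1.0))

naive = slope(snp, pheno)
adjusted = slope(center_by_group(snp, groups), center_by_group(pheno, groups))
print("SNP effect ignoring structure:", round(naive, 3))
print("SNP effect with group effect:", round(adjusted, 3))
```

The estimate that ignores structure is large even though the SNP is not causal, while the estimate adjusted for the group effect stays close to zero. The same adjustment also removes most of the usable variation at this SNP, which is consistent with the low detection power reported for highly differentiated SNPs.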
From a GS perspective, the stratification of a breeding population into genetic groups may impact genomic prediction accuracy in different manners. When the training set (TS) and the predicted set (PS) are consistent in terms of genetic groups, the group mean differences are well accounted for by the model through the kinship and contribute to the accuracy (Guo et al., 2014). Conversely, when targeting a group-specific PS, training a model on a different group can dramatically decrease the accuracy, as shown in several species including dairy and beef cattle (Olson et al., 2012; Chen et al., 2013) and maize (Technow et al., 2013; Lehermeier et al., 2014). The use of multi-group TSs was proposed by de Roos et al. (2009) for several applications, including the possibility to apply predictions to a broad range of genetic diversity, the improvement of genomic selection efficiency in genetic groups of limited size or the optimization of resources for traits that are expensive to evaluate. Such multi-group TSs showed a good predictive ability in a wide range of species such as dairy cattle (Brøndum et al., 2011; Pryce et al., 2011; Zhou et al., 2013), maize (Technow et al., 2013) or soybean (Duhnen et al., 2017). However, the gain in precision is often limited compared to what could be obtained by applying predictions separately within groups (Carillier et al., 2014; Hayes et al., 2018).
Structure does not only aﬀect genomic prediction accuracy, but also the ability to forecast this accuracy using a priori indicators such as the coeﬃcient of determination (CD) (VanRaden, 2008; Rincent et al., 2012). Forecasting genomic prediction accuracy would allow breeders to evaluate the interest of multi-group TSs and a priori indicators could be used as criteria to optimize their constitution. However, when the population features a strong genetic structure, standard a priori indicators showed a lack of eﬃciency to forecast genomic prediction accuracy in multi-breed dairy cattle populations (Hayes et al., 2009) and to optimize TSs in rice populations (Isidro et al., 2015).
Differences in the genetic information captured by SNPs may explain the difficulty of borrowing genetic information from one group to another. When a set of molecular markers is available for a trait, the genomic information at QTLs is partially captured by SNPs through LD. Group differences in LD may lead SNPs to capture different genetic information between groups, especially at low to medium genotyping densities. This issue led Wientjes et al. (2015b) to propose a method to estimate the consistency of LD between SNPs and QTLs across genetic groups, which uses selection index theory and simulated QTL allele effects. The existence of group differences in LD and linkage phases, as well as the possibility of contrasted QTL allele effects between groups, makes the observation of group-specific SNP allele effects in structured populations likely. Such group-specific allele effects may cancel each other out in their overall effect when applying GWAS to a structured panel, making them difficult to detect using standard methods. In GS, accounting for this heterogeneity in QTL allele effects is likely to improve genomic prediction accuracy in a multi-group breeding context. Modeling group-specific SNP allele effects in genomic prediction models was proposed by Karoui et al. (2012) and Lehermeier et al. (2015) by adapting multi-trait models to multi-group predictions. In such models, the SNP allele effects are assumed to be different but correlated between groups. This same formalism was also used to derive new a priori indicators of accuracy (Wientjes et al., 2015a) or to propose relevant estimators of relatedness to estimate genetic correlations between groups accurately (Wientjes et al., 2017). Other modeling approaches were proposed, such as the decomposition of SNP effects into a main SNP effect and group-specific deviations (Schulz-Streeck et al., 2012; de los Campos et al., 2015; Technow and Totir, 2015).
However, all these models were restricted to pure individuals and did not accommodate the presence of admixed individuals. The interest of such individuals in multi-group TSs was shown by Toosi et al. (2013) using simulations, as they may create connections between groups and allow for more genetic information to be borrowed. The « animal model » was adapted to admixed populations before the advent of high density genotyping, by considering pedigree relationships between individuals and global admixture proportions. The genetic variance was split into group-specific and segregation components (Lo et al., 1993; García-Cortés and Toro, 2006). The aim was to account for the additional variance observed in an admixed population compared to parental populations, due to the segregation of QTLs with differentiated allele frequencies in admixed individuals. Such methodology was later adapted to genomic prediction by Strandén and Mäntysaari (2013) and Makgahlela et al. (2013) by replacing the pedigree matrix with a standard kinship matrix estimated with SNPs. Alternative methods were also developed to account for various types of heterogeneity between genetic groups, such as computing an alternative covariance matrix based on specific kernel functions (Heslot et al., 2015).
In conclusion, the stratification of a population into genetic groups may affect quantitative genetics studies in several ways. The observation of group-specific allele effects at SNPs, possibly resulting from group-specific allele effects at QTLs, is a major factor affecting both GWAS and GS. While extensive literature exists on their modeling in a GS context, little attention has been given to their identification using GWAS. In this same perspective, the integration of admixed individuals in GWAS and GS studies has received little consideration so far. They may however be useful to connect genetic groups in multi-group TSs or to get some insight concerning the stability of SNP allele effects across genetic backgrounds. As their production requires significant human and material resources, it is important to evaluate their interest according to these objectives.


Objectives of the thesis

From both GS and GWAS perspectives, we studied the impact of genetic structure in quantitative genetics studies using maize structured datasets genotyped at high density. The main objectives were (i) to study the impact of genetic structure on both genomic prediction accuracy and on its a priori estimation based on the coeﬃcient of determination (CD), (ii) to identify and unravel group-specific allele eﬀects at SNPs using GWAS and admixed individuals in addition to pure individuals, (iii) to develop genomic prediction models adapted to admixed individuals that account for group-specific SNP allele eﬀects and (iv) to evaluate the interest of using admixed individuals in multi-group TSs.
To achieve these goals, we used two maize inbred diversity panels, involving different levels of genetic structure. The first panel, called « Amaizing Dent », will be presented in Chapter 1. It includes 389 dent lines genotyped for 1M SNPs and can be subdivided into three genetic groups. This panel was evaluated for hybrid performances, using a common flint tester, for flowering and productivity traits. The second panel will be presented in Chapter 2 and is called the « Flint-Dent » panel. It includes 304 flint lines, 300 dent lines included in the « Amaizing Dent » panel and 366 admixed lines. The admixed lines were generated from hybrids, mated according to a factorial design between the pure dent and flint lines of the panel. All lines were evaluated per se for traits related to flowering time and plant heights.
In Chapter 1, we studied the impact of genetic structure on genomic prediction accuracy using the « Amaizing Dent » panel. For a given size of TS, structure-based scenarios were defined including within-, across- or multi-group predictions. We also evaluated the benefits of adding extra-group individuals to the TS, in order to predict group-specific PSs. All these scenarios were considered to study whether or not genetic information can be borrowed between genetic groups. The genomic prediction accuracy of alternative prediction models that account for genetic structure explicitly was also compared to that of standard GBLUP. To study the efficiency of a priori indicators of accuracy in structured populations, we compared a standard indicator based on CD to new indicators recently proposed by Wientjes et al. (2015a). The a priori estimation of accuracy was compared to the empirical accuracy obtained in the structure-based scenarios. The objective was to evaluate whether a priori indicators would be efficient to forecast accuracy within a multi-group breeding population, and could later be used to optimize the composition of multi-group TSs. This study was recently published in Theoretical and Applied Genetics (Rio et al., 2019).
In Chapter 2, we developed a GWAS methodology to test for the existence of a heterogeneity of SNP allele effects between genetic groups, and applied it to the « Flint-Dent » panel evaluated for flowering traits. We showed how including admixed individuals in the analysis can help to disentangle the factors causing the heterogeneity of allele effects across groups: local genomic differences (group differences in LD or group-specific mutations) or epistatic interactions between QTLs and the genetic background. A test for directional epistasis was also proposed to support the existence of epistatic interactions in this dataset. The objective was to study whether our method can be used to get insight concerning traits in structured populations as well as to analyze the stability of marker effects at main QTLs across genetic groups. This study will soon be submitted to PLOS Genetics.
In Chapter 3, we developed two genomic prediction models that account for the existence of group-specific allele effects in admixed populations. Both models, called Multi-group Admixed GBLUP (MAGBLUP) 1 and 2, take advantage of both genomic data and local admixtures, defined as the group ancestry of SNP alleles. The first model was derived according to the « animal model », for which the genotypes are random, while the second was derived by assuming a random distribution for allele effects. Both models were evaluated for their precision in variance component estimation and genomic prediction accuracy using the « Flint-Dent » panel evaluated for simulated and real traits. In this chapter, we also evaluated the benefits of adding admixed individuals to multi-group TSs. The structure-based scenarios defined in Chapter 1 were adapted by replacing pure lines with admixed lines in TSs. The objective was to evaluate whether admixed individuals would allow for a better genetic connection between genetic groups within structured breeding populations. This study will soon be submitted to a journal yet to be determined.

Table of contents :

General introduction
Quantitative genetics in the genomic era
Genetic structure: theory, inference and a maize perspective
How does genetic structure affect quantitative traits?
Impact of genetic structure on association mapping and genomic selection
Objectives of the thesis
1 Genomic selection efficiency and a priori estimation of accuracy in a structured dent maize panel
Abstract
Introduction
Materials and methods
Genetic material and genotypic data
Structure analysis
Phenotypic data
Genomic prediction models
Evaluation of the precision of genomic predictions
A priori estimation of accuracy
Results
Global, within and across group precision of genomic predictions
Accounting for structure in genomic prediction models
A priori estimation of precision
Discussion
The impact of genetic structure on genomic prediction accuracy
Modeling genetic structure to improve predictions
Is it possible to forecast accuracy using CD?
Conclusion
2 Disentangling group specific QTL allele effects from genetic background epistasis using admixed individuals in GWAS: an application to maize flowering
Abstract
Introduction
Materials and methods
Genetic material and genotypic data
Phenotypic data
Global assessment of directional epistasis
GWAS models
Results
Phenotypic analysis and directional epistasis
Associations detected and GWAS strategies
Highlighted QTLs
Discussion
Accounting for genetic groups in GWAS
Benefits from admixed individuals
Heterogeneity of maize flowering QTL allele effects
Conclusion
Appendix A
3 Accounting for group-specific allele effects and admixture in genomic predictions: theory and experimental evaluation in maize
Abstract
Introduction
Statistical context
Statistical context
MAGBLUP
Material and Methods
Flint-Dent dataset
Statistical inference and genomic predictions
Simulated traits
Assessment of the precision of variances estimates
Assessment of the accuracy of genomic predictions
Results
Variance estimates for simulated traits
Genomic prediction accuracy for simulated traits
Application to real traits
Discussion
Modeling group-specific allele in admixed populations
Variance components and genomic predictions
Benefits from admixed individuals in multi-group training sets
Conclusion
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
General discussion
Perspectives
Bibliography