## Detecting positive selection with allele frequency information

Many methods exist to determine whether a given genomic region is under positive selection. Here, I discuss the overall concept behind these methods and the type of information they use, as well as additional methods that are evaluated later in the Results chapter.

The first class of methods uses allele frequency information, either by directly analysing deviations in the frequency spectrum of a locus (Fay and Wu, 2000; Tajima, 1989a) or by comparing allele frequencies between populations. For the latter, a good example is FST. Introduced by Sewall Wright in 1950 (Wright, 1950) as part of the F-statistics, which measure the expected average heterozygosity for different degrees of population structure, FST measures the reduction in average heterozygosity at the subpopulation level relative to the total population. It can also be seen as the amount of genetic variance explained by population differences, relative to the total variance (eq. 5):

$$F_{ST} = \frac{\sigma_S^2}{\sigma_T^2} = \frac{\sigma_S^2}{\bar{p}\,(1 - \bar{p})} \qquad \text{(eq. 5)}$$

where $\sigma_S^2$ is the variance in allele frequency among subpopulations, $\sigma_T^2$ the variance of the allelic state in the total population, and $\bar{p}$ the mean allele frequency in the total population.
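
To make the variance-ratio definition of eq. 5 concrete, here is a minimal Python sketch for a single biallelic locus. It assumes equally sized subpopulations and known (not estimated) allele frequencies; the function name `fst_from_frequencies` is hypothetical and only serves the illustration.

```python
def fst_from_frequencies(freqs):
    """FST at one biallelic locus from subpopulation allele frequencies.

    Assumes equally weighted subpopulations and a polymorphic locus
    (the ratio is undefined when the mean frequency is 0 or 1).
    """
    k = len(freqs)
    p_bar = sum(freqs) / k                              # mean allele frequency
    var_s = sum((p - p_bar) ** 2 for p in freqs) / k    # variance among subpopulations
    var_t = p_bar * (1 - p_bar)                         # total variance
    return var_s / var_t


# Two subpopulations fixed for different alleles are fully differentiated,
# whereas identical frequencies give no differentiation.
fully_differentiated = fst_from_frequencies([0.0, 1.0])
undifferentiated = fst_from_frequencies([0.5, 0.5])
```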

Because these quantities cannot easily be measured, multiple estimators of FST have been developed (Hudson et al., 1992; Nei, 1973; Weir and Cockerham, 1984). Although the FST evaluated in the Results chapter is the one estimated by analysis of molecular variance (AMOVA) (Excoffier et al., 1992), here it is illustrated with the Hudson estimator defined by eq. 6, where Hw is the mean number of differences between sequences sampled within populations and Hb the mean number of differences between sequences sampled from different populations:

$$F_{ST} = 1 - \frac{H_w}{H_b} \qquad \text{(eq. 6)}$$
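
The Hudson estimator can be sketched in a few lines of Python. This is an illustrative implementation under simplifying assumptions: two populations, haplotypes given as equal-length strings, and Hw taken as the unweighted average of the two within-population means. The function names are hypothetical.

```python
from itertools import combinations, product


def n_diff(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(pop):
    """Mean number of pairwise differences within one population."""
    pairs = list(combinations(pop, 2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop1, pop2):
    """Mean number of pairwise differences between two populations."""
    pairs = list(product(pop1, pop2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """FST = 1 - Hw / Hb, with Hw averaged over the two populations."""
    h_w = (mean_within(pop1) + mean_within(pop2)) / 2
    h_b = mean_between(pop1, pop2)
    return 1 - h_w / h_b


# Toy data: two strongly differentiated populations (hypothetical haplotypes).
pop_a = ["0000", "0001"]
pop_b = ["1110", "1111"]
fst = hudson_fst(pop_a, pop_b)  # Hw = 1, Hb = 3.5
```

With very small samples the estimate can be negative, which is usually read as no detectable differentiation.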

## Detecting positive selection with haplotype information

A second class of methods uses information from multiple linked sites, or haplotype information. Because of genetic hitchhiking, an extended region of homozygosity is created around the beneficial mutation. This type of signature can be detected in the genome, as shown by Sabeti and colleagues (Sabeti et al., 2002), who defined a measure called extended haplotype homozygosity (EHH). Briefly, EHH is the probability that, at a given distance, x, from a core position, two randomly chosen chromosomes carrying the same core allele are identical (homozygous) at all SNPs situated in the interval defined by x and the core position. Under neutrality and with time, recombination breaks down the haplotype in the interval, resulting in low EHH values for old, relatively high frequency haplotypes. Under positive selection, however, recombination does not have enough time to break down the haplotype centred around the beneficial mutation, resulting in high EHH values for high frequency haplotypes. This statistic was later improved by Voight and colleagues (Voight et al., 2006), who evaluated the decay of EHH to the left and right of the core position. They introduced the integrated haplotype homozygosity (iHH) as the area under the curve defined by plotting EHH against distance from the core position. They then calculated the iHH value for the derived and the ancestral state of the core allele (see previous paragraph; they used a chimpanzee sequence alignment to determine the ancestral state), and defined a new statistic, the integrated haplotype score (iHS), as seen in eq. 7:

$$iHS = \ln\left(\frac{iHH_A}{iHH_D}\right) \qquad \text{(eq. 7)}$$

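where iHHA and iHHD denote the iHH for the ancestral and derived alleles, respectively. When the rates of EHH decay are similar for both allele states, iHS is approximately equal to 0, while large negative or positive values indicate low rates of EHH decay (and therefore positive selection) for the derived or the ancestral allele, respectively. To adjust for differences in EHH decay due to allele age (under neutrality, low frequency alleles tend to be younger and associated with long haplotypes), they standardized this score to obtain a final statistic with mean zero and variance equal to 1 (eq. 8):

$$iHS_{std} = \frac{\ln\left(\frac{iHH_A}{iHH_D}\right) - E_p\left[\ln\left(\frac{iHH_A}{iHH_D}\right)\right]}{SD_p\left[\ln\left(\frac{iHH_A}{iHH_D}\right)\right]} \qquad \text{(eq. 8)}$$

where the expectation and standard deviation are estimated from the genome-wide empirical distribution at SNPs whose derived allele frequency p matches that of the core SNP.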
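
The chain EHH, iHH, unstandardized iHS can be sketched as follows. This is a simplified illustration, not the published implementation: haplotypes are tuples of 0 (ancestral) and 1 (derived), iHH is integrated with the trapezoid rule over all available SNPs (no truncation at low EHH values), and each allele class must contain at least two chromosomes. All function names are hypothetical.

```python
from collections import Counter
from math import comb, log


def ehh(haps, core, end):
    """Probability that two chromosomes are identical from core to end (inclusive)."""
    lo, hi = sorted((core, end))
    counts = Counter(tuple(h[lo:hi + 1]) for h in haps)
    return sum(comb(c, 2) for c in counts.values()) / comb(len(haps), 2)


def ihh(haps, positions, core):
    """Area under the EHH curve on both sides of the core SNP."""
    total = 0.0
    for step in (-1, 1):            # walk left, then right, from the core
        i, prev = core, 1.0         # EHH equals 1 at the core itself
        while 0 <= i + step < len(positions):
            j = i + step
            cur = ehh(haps, core, j)
            total += 0.5 * (prev + cur) * abs(positions[j] - positions[i])
            prev, i = cur, j
    return total


def unstandardized_ihs(haps, positions, core):
    """ln(iHH_A / iHH_D) for the SNP at index `core`."""
    ancestral = [h for h in haps if h[core] == 0]
    derived = [h for h in haps if h[core] == 1]
    return log(ihh(ancestral, positions, core) / ihh(derived, positions, core))


# Toy data: the derived allele (core index 2) sits on one long, unbroken haplotype.
positions = [0, 1, 2, 3, 4]
haps = [
    (0, 0, 1, 0, 0), (0, 0, 1, 0, 0), (0, 0, 1, 0, 0),   # derived carriers
    (0, 0, 0, 0, 0), (1, 0, 0, 0, 1), (0, 1, 0, 1, 0),   # ancestral carriers
]
score = unstandardized_ihs(haps, positions, 2)  # negative: slow decay on the derived allele
```

Standardization would then subtract the mean and divide by the standard deviation of these scores among SNPs in the same derived allele frequency bin.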

## Detecting positive selection: caveats and confounders

Changes in population size, specifically reductions in the case of bottlenecks or founder events, can affect levels of genetic diversity in a way that may be misinterpreted as a signal of positive selection. This is the case for early tests such as Tajima's D and Fay and Wu's H. Statistics based on genetic differentiation between populations, such as FST, may also take high values under strong, non-shared genetic drift, for instance if one of the analysed populations has had a small effective size. A simple way to mitigate this problem is an outlier approach: the value of a statistic at each locus is compared with its genome-wide distribution, and a threshold (usually the top 1% of the distribution) defines which loci are considered candidates for positive selection.

However, this approach has several limitations. First, it assumes that selection acts in a locus-specific way while other forces act at a genome-wide scale. Although this has a solid theoretical and empirical background (Kimura, 1968; Kimura and Ohta, 1971; Lewontin and Hubby, 1966; Ohta, 1973; Zuckerkandl and Pauling, 1965), recent work suggests that a much higher proportion of the genome might be functionally important, and thus potentially targeted by selection (Begun et al., 2007; Hahn, 2008; Kern and Hahn, 2018; Schrider and Kern, 2017; Sella et al., 2009). The genome-wide distribution of these statistics would therefore reflect not only demographic events or strong drift but also selection. Second, an outlier approach cannot statistically differentiate the effects of positive selection from those of pure genetic drift or demography. All distributions have a "top 1%", and whether this reflects positive selection or a bottleneck cannot be known for certain, especially in cases of weak selective events and/or sharp reductions in effective population size. Third, because this approach is based on the position of a value relative to the whole distribution, evolutionary forces that increase the overall variance of the distribution can reduce detection power (by increasing false negatives).
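
The second limitation can be illustrated with a small Python sketch of the outlier approach. The scores below are simulated from a single normal distribution, standing in for a purely neutral genome; the function name `outlier_candidates` and the simulated data are assumptions made for the illustration.

```python
import random


def outlier_candidates(scores, top_fraction=0.01):
    """Indices of loci whose statistic lies in the top fraction of the empirical distribution."""
    n_top = max(1, round(len(scores) * top_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_top])


# Hypothetical neutral "genome": 1000 loci, no locus under selection.
rng = random.Random(42)
neutral_scores = [rng.gauss(0, 1) for _ in range(1000)]

# Ten loci are still flagged as candidates, by construction of the threshold.
candidates = outlier_candidates(neutral_scores)
```

The threshold always returns candidates, so the list must be interpreted against an explicit neutral expectation rather than taken as evidence of selection on its own.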

Another evolutionary force that might complicate the inference of positive selection is, quite ironically given the context of the work presented here, migration or, more specifically, admixture. Although an admixture event does not increase false positives per se, it can obscure the typical signatures of a selective sweep. Indeed, genetic variation newly arriving from the source population can restore the levels of genetic diversity that were decreased by selection in the recipient population, and recombination with newly arriving haplotypes can break down haplotype homozygosity (Gravel, 2012; Lohmueller et al., 2011).

Finally, negative selection can further obscure the detection of positive selection. Deleterious mutations by themselves can interfere with the action of positive selection at nearby sites (Hill and Robertson, 1966) and reduce the fixation probability of a beneficial mutation, especially when recombination levels are low (Birky and Walsh, 1988). More importantly, the effect of negative selection on linked sites (known as background selection) can reproduce, in regions of low recombination, signatures of reduced diversity at linked neutral sites, like those produced by selective sweeps (Charlesworth et al., 1993).

**The single pulse admixture model**

Although it has not been the most studied migration model in population genetics, the single pulse admixture model is often assumed when trying to estimate admixture proportions or to describe the history of admixture through time (Verdu and Rosenberg, 2011). The model assumes that a given admixed population derives its genetic ancestry from at least two different source populations, resulting from a single instance of unidirectional gene flow (Fig. 5). Under this model, the allele frequencies in the admixed population, ph, can be approximated as a linear combination of the allele frequencies in the source populations, p1 and p2, weighted by their contributions to the admixed population, α and (1 − α), referred to as admixture proportions (Bernstein, 1931) (eq. 9):

$$p_h = \alpha\, p_1 + (1 - \alpha)\, p_2 \qquad \text{(eq. 9)}$$

Importantly, even if allele frequencies at individual loci in the admixed population drift away from this expectation, the approximation holds on average across loci.
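
As a worked illustration of eq. 9, the sketch below computes the expected allele frequencies in the admixed population and recovers α across loci by least squares. The least-squares estimator is an assumption chosen for simplicity, not the method used in this work, and the function names and frequencies are hypothetical.

```python
def admixed_frequencies(p1, p2, alpha):
    """Expected allele frequencies in the admixed population (eq. 9), per locus."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]


def estimate_alpha(ph, p1, p2):
    """Least-squares estimate of alpha from ph - p2 = alpha * (p1 - p2) across loci."""
    numerator = sum((h - b) * (a - b) for h, a, b in zip(ph, p1, p2))
    denominator = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return numerator / denominator


# Hypothetical source frequencies at three loci and a 30% contribution from source 1.
p1 = [0.1, 0.8, 0.5]
p2 = [0.6, 0.2, 0.9]
ph = admixed_frequencies(p1, p2, alpha=0.3)
alpha_hat = estimate_alpha(ph, p1, p2)  # recovers 0.3 in the absence of drift
```

With drift since admixture, individual loci deviate from eq. 9, and pooling many loci is what keeps the estimate close to the true proportion.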

**Assessing admixture with allele frequency information (I)**

Even though the analytical framework to estimate admixture proportions has been well established for over 90 years (Bernstein, 1931), a limiting factor for the first estimations (in human populations at least) was the availability of multi-locus genetic data. Moreover, the framework assumed that the exact contributing sources were known, which is rarely the case. Only with the availability of high-density genomic data could an alternative framework of analyses, which does not make such assumptions, be used. These are known as unsupervised analyses, because they do not require any prior information about the population affiliation of the studied samples.

There are two main types of unsupervised analyses. The first one is principal component analysis (PCA), which reduces the complexity of a large multidimensional dataset (for instance, a matrix of hundreds of thousands of genetic markers for hundreds of individuals) by extracting the principal components that explain most of the observed variability, thus reducing dimensionality while retaining a maximum amount of information. PCA has shown that geography has had a very large influence on the genetic structure of closely related populations (Novembre et al., 2008). Moreover, it has been shown that distances in a PCA plot relate to genetic distances, and that the proportion of variance explained by the PCs is related to FST (McVean, 2009). In that sense, individuals positioned between two clusters (corresponding to two populations) along a principal component carry alleles at frequencies that are intermediate between those of the two clusters, which may be interpreted as resulting from an admixture event between these two populations.
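
The intermediate position of admixed individuals along a principal component can be reproduced on a toy example. The sketch below computes the first principal component by power iteration on the individual-by-individual Gram matrix, using only the standard library. The genotype matrix (allele counts of 0, 1 or 2) and the function name `first_pc` are hypothetical, and real analyses typically also scale each SNP and rely on dedicated software.

```python
import random
from math import sqrt


def first_pc(genotypes, n_iter=200, seed=0):
    """Coordinates of each individual on PC1 (unit length, arbitrary sign)."""
    n, m = len(genotypes), len(genotypes[0])
    # Centre each SNP (column) on its mean allele count.
    means = [sum(ind[j] for ind in genotypes) / n for j in range(m)]
    x = [[ind[j] - means[j] for j in range(m)] for ind in genotypes]
    # Gram matrix between individuals; its leading eigenvector gives PC1 coordinates.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)] for i in range(n)]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(n_iter):  # power iteration
        w = [sum(g * vk for g, vk in zip(row, v)) for row in gram]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Toy data: two individuals from each source population and one admixed individual.
genotypes = [
    [0, 0, 0, 0], [0, 0, 0, 0],   # population A
    [2, 2, 2, 2], [2, 2, 2, 2],   # population B
    [1, 1, 1, 1],                 # admixed individual
]
pc1 = first_pc(genotypes)  # the admixed individual falls between the two clusters
```

As the surrounding text stresses, such an intermediate position is compatible with admixture but does not demonstrate it.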

The second type of unsupervised analysis is based on a clustering method developed by Pritchard and colleagues and implemented in the software STRUCTURE (Pritchard et al., 2000). This analysis assumes that sampled individuals derive their genetic ancestry from a given number of unknown source populations, and then simultaneously estimates the allele frequencies of these source populations and the ancestry proportions of each sampled individual. How K, the number of source populations (also referred to as ancestry components), is determined depends on the method, but usually the algorithm is run multiple times to find the value of K that best fits the data.

Both types of methods are useful for visualizing and describing the genetic variability of a group of samples. However, the interpretation of their results has several pitfalls (Lawson et al., 2018). Neither PCA nor STRUCTURE-like clustering gives any information on the causes of the observed patterns. Even if, in the case of PCA, these patterns can be interpreted as genetic distances, whether the distances are due to pure genetic drift, a population bottleneck or an admixture event cannot be determined. Sampling strategy is of utmost importance, especially for STRUCTURE-like analyses, since they can produce easily misinterpreted patterns when an unsampled population has a strong contribution to, or a high level of shared ancestry with, the rest of the samples.

**Assessing admixture with allele frequency information (II)**

The previous methods can help formulate hypotheses about admixture in a given group. However, to formally test and validate these hypotheses, another class of methods, also relying on allele frequency information, was specifically developed: the f-statistics (not to be confused with Wright's fixation indices, denoted F-statistics). Developed by Nick Patterson and introduced by David Reich and colleagues (Patterson et al., 2012; Reich et al., 2009), these statistics are based on the concept of shared genetic drift between populations, which implies a shared evolutionary history. In particular, the f3 statistic can be used as a formal test for admixture. It is defined by eq. 10:

$$f_3(P_X; P_1, P_2) = E\left[(p_X - p_1)(p_X - p_2)\right] \qquad \text{(eq. 10)}$$

that is, the product of the allele frequency differences between PX and P1, and between PX and P2, averaged across all sampled loci. In terms of drift, this corresponds to the amount of drift shared between the (PX, P1) and (PX, P2) pairs. In the case of no admixture, the expected value of f3 reflects the amount of genetic drift specific to the lineage leading to population X since its divergence from the lineage(s) leading to populations P1 and P2 (Fig. 4A and B). In the case of admixture, for instance between P1 and P2 resulting in PX, the shared drift is affected by the inheritance by PX of ancestry from P1 and P2, such that the amount of drift shared between P1 and P2 (but not with PX) negatively affects the f3 statistic, resulting in negative values (Fig. 4C). More intuitively, the f3 statistic takes negative values when pX is intermediate between p1 and p2, which is expected under admixture (Bernstein, 1931). However, pX may become lower or larger than both p1 and p2 because of drift since admixture. Consequently, this statistic is particularly powerful when the divergence between the two source populations is old (higher amounts of drift shared between P1 and P2 only), but its power decreases if PX has been subject to strong genetic drift (which could lead to positive f3 values).
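
The sign behaviour of f3 can be checked with a minimal Python sketch that applies eq. 10 directly to population allele frequencies. It is a naive plug-in average: published estimators additionally correct for finite sample size and assess significance with a block jackknife, which are omitted here. The function name `f3` and the frequencies are hypothetical.

```python
def f3(px, p1, p2):
    """Naive f3(PX; P1, P2): mean over loci of (pX - p1) * (pX - p2)."""
    return sum((x - a) * (x - b) for x, a, b in zip(px, p1, p2)) / len(px)


# Hypothetical source frequencies at four loci.
p1 = [0.1, 0.9, 0.2, 0.7]
p2 = [0.8, 0.1, 0.9, 0.2]
# PX formed by equal contributions of P1 and P2, with no drift since admixture.
px = [0.5 * a + 0.5 * b for a, b in zip(p1, p2)]

f3_admixed_target = f3(px, p1, p2)   # negative: pX is intermediate at every locus
f3_source_target = f3(p1, px, p2)    # positive: P1 is not a mixture of PX and P2
```

Adding drift to `px` after the admixture step pushes the statistic towards positive values, which is the loss of power described above.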

**Table of contents**

Acknowledgements

Summary

French-language summary of the thesis

Introduction

Objectives of the thesis

Summary of the results

Introduction

**Chapter 1: Principles of population genetics**

What are genes and how are they inherited?

Genes and populations: the concept of effective population size

The Hardy-Weinberg equilibrium model

Evolutionary forces

Modelling genes within populations

**Chapter 2: Modelling and detecting positive selection**

The classic sweep model

The soft sweep model

Detecting positive selection with allele frequency information

Detecting positive selection with haplotype information

Detecting positive selection: caveats and confounders

**Chapter 3: Modelling and detecting admixture**

The single pulse admixture model

Assessing admixture with allele frequency information (I)

Assessing admixture with allele frequency information (II)

Assessing admixture with haplotype information

Assessing admixture: towards more complex models

**Chapter 4: Gene flow according to evolutionary biology**

A barrier to speciation

A potential source of deleterious variation

A potential source of beneficial variation

**Chapter 5: Admixture according to molecular anthropology**

Initial studies in admixed populations

The ancient DNA revolution: admixture with archaic hominins

The ancient DNA revolution: admixture is everywhere

Adaptive admixture in modern humans

**Chapter 6: Objectives of the thesis**

**Chapter 7: Results**

**Chapter 8: Discussion**

Implications of the presented work: simulation analyses

Implications of the presented work: empirical data

Limitations of the presented work

Perspectives and future directions

**References**