
## English Version of the Introduction

In this thesis, we study the length distribution of maximal repeats in eukaryotic genomes, and more generally the length distribution of maximal exact matches between genomes of different species. These distributions deviate strongly from what one would expect from simple probabilistic models and exhibit power-law behavior. We will show that simple evolutionary models are able to account for these deviations. In the Introduction, we first review some simple statistical properties of DNA sequences, and show that deciphering these properties has initiated significant progress in the fields of genetics and evolution. The scientific approach we developed is inherited from these seminal historical studies. We will then introduce some biological processes and mathematical concepts that will be studied more specifically later on.

**Statistical Properties of Genomes**

The genome of an organism is defined as the entire genetic material it carries. It contains all the information that an individual needs to develop from a single cell, to grow and to reproduce. This information is transmitted from an organism to its progeny during reproduction. The genetic information is encoded in a long molecule, Deoxyribonucleic Acid (DNA), which is a polymer of four different monomers called nucleotides. Each of these monomers is composed of a sugar, a phosphate group and a nitrogenous base. The sugar and the phosphate group are the same in all four nucleotides, but there are four possible nitrogenous bases. These four bases (Adenine (A), Guanine (G), Cytosine (C), Thymine (T)) are classified into two groups of molecules, the purines (A and G) and the pyrimidines (C and T). Bases belonging to the same group share a high chemical similarity. The purine bases are composed of two aromatic rings, while the pyrimidine bases are composed of only one aromatic ring and are thus smaller (see Fig. 1.2 for the full chemical structure of the four nucleotides).

Cells of all living organisms possess the ability to interpret the complex information encoded in the DNA to produce RNA molecules (via a process called transcription) that are then translated into proteins, which in turn perform the essential functions of living cells. For this reason, DNA is often described as the cookbook of an organism, written in an alphabet of four letters.

**Modelling DNA Sequences**

Different individuals belonging to the same species share very similar genomes. For this reason, one can build a reference genome for each species, representing the typical genome of one individual. The size of genomes is highly variable from one species to another, ranging from several hundreds of kilobase pairs (kbps) to hundreds of gigabase pairs (Gbps), while the size of typical mammalian genomes is of the order of several Gbps (3.2 Gbps for the human genome). As a comparison, the French Wikipedia was composed of roughly 4.2 billion letters in total in June 2015, according to Wikipedia itself. Unlike Wikipedia, however, the entire human genome (like other eukaryotic genomes) cannot be directly interpreted. Indeed, only a small percentage of eukaryotic genomes codes for proteins (this proportion is of the order of 1% in the human genome), while most of the genome (the non-coding DNA) is thought to be mostly non-functional (although the potential function of the non-coding DNA is the subject of a lively and passionate debate in the scientific community [6-8]). This explains why the size of genomes varies so much from one species to another, and why the complexity of an organism does not correlate well with the size of its genome.

Objects of such a large size, composed of repeated units (here nucleotides), lend themselves to statistical analysis. One of the first statistical analyses performed on genomes aimed at identifying short sequences of nucleotides (or words) that were either exceptionally frequent or exceptionally rare, with the idea that these exceptional events would be the signature of a biological significance of these words [9-11]. Words of particular biological importance, such as motifs recognized as transcription start sites by transcription factors, are expected to be overrepresented in the sequence, while certain types of motif might be deleterious, and thus avoided. But in order to identify exceptional words, one first needs to develop a random model to differentiate events that occur "by chance" from biologically meaningful events.

**Random sequences.** One can model a DNA sequence $S = (s_1, \dots, s_L)$ as a chain of $L$ letters, where each letter belongs to the four-letter alphabet $\mathcal{A} = \{A, C, G, T\}$. The simplest possible way to model a DNA sequence is the random independent and identically distributed (iid) model. In this framework, all nucleotides $s_i$ are independent of each other, and the probability of observing each nucleotide is the same at all positions of the sequence. The parameters of the model are the length of the sequence $L$ and the frequencies of the four nucleotides, $f_A$, $f_T$, $f_C$ and $f_G$. In that case we have:

$$P(s_i = a) = f_a, \quad \text{with } a \in \{A, C, G, T\}, \ \forall i \in \{1, \dots, L\}. \tag{1.22}$$
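The iid model can be illustrated with a minimal Python sketch. The function names `random_iid_sequence` and `empirical_frequencies`, as well as the example frequencies, are hypothetical choices made for illustration and do not come from the thesis.

```python
import random
from collections import Counter

ALPHABET = "ACGT"


def random_iid_sequence(length, freqs, seed=None):
    """Draw `length` letters independently, each with probability freqs[a]."""
    rng = random.Random(seed)
    weights = [freqs[a] for a in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=length))


def empirical_frequencies(seq):
    """Estimate the frequency f_a of each letter from a sequence."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in ALPHABET}


if __name__ == "__main__":
    # Illustrative frequencies only (assumption, not taken from a real genome).
    freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    seq = random_iid_sequence(100_000, freqs, seed=1)
    print(empirical_frequencies(seq))
```

For long sequences the estimated frequencies should approach the model parameters $f_a$, which is what makes the iid model a convenient null hypothesis.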

**Chargaff’s Second Parity Rule**

Later on, Chargaff and his coworkers separated the two strands of the DNA of the model bacterium Bacillus subtilis, and calculated the proportion of each base independently in the two strands. They found that even in a single strand, the proportions of A and T on the one hand, and of G and C on the other, were approximately equal [16]:

$$\begin{cases} n_A \approx n_T \\ n_C \approx n_G \end{cases} \tag{1.37}$$

where $n_a$ represents the number of nucleotides $a$ on one strand. Although this statistical property of DNA, today known as Chargaff’s second parity rule (or PR2), suffers from some exceptions, notably in mitochondria [18], it is fulfilled in a wide range of species [17, 18]. The formal explanation for this rule was found 20 years ago, when Lobry and Lobry [19] showed analytically that this property could be simply explained under the no-strand bias condition [20], which states that mutations affect both strands of DNA similarly. As we have seen before, whenever a mutation occurs, nucleotides on both strands are changed. For instance, if an A is replaced by a C on one strand, then a T will be replaced by a G at the same position on the complementary strand. It follows that under the no-strand bias condition, the mutation rates associated with these two mutations have to be the same. Similarly, each of the twelve possible mutations has one equivalent mutation, and thus 6 mutation rates are enough to model the evolution of DNA, such that the instantaneous rate matrix is given by:

$$Q = \begin{pmatrix} \cdot & \mu_{GT} & \mu_{CT} & \mu_{AT} \\ \mu_{AC} & \cdot & \mu_{CG} & \mu_{AG} \\ \mu_{AG} & \mu_{CG} & \cdot & \mu_{AC} \\ \mu_{AT} & \mu_{CT} & \mu_{GT} & \cdot \end{pmatrix} \tag{1.38}$$

where the diagonal elements are once again defined such that the sum over each column is equal to zero. One can show analytically that the evolution of a sequence according to a Markov model with such an instantaneous rate matrix reaches a stationary state where Chargaff’s second parity rule always holds [19]. The fact that this rule is fulfilled in the genomes of the majority of species indicates that, most of the time, genomes evolve according to the no-strand bias condition.
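The stationary-state claim can be checked numerically with a short Python sketch. It assumes the nucleotide order A, C, G, T, that entry `q[i][j]` is the rate of mutation from nucleotide `j` to nucleotide `i` (so that columns sum to zero), and that the six free rates are shared between complementary mutations as required by the no-strand bias condition. The parameter names (`r_gt`, `r_ct`, ...) and the explicit Euler integration are illustrative choices, not the analytical derivation of [19].

```python
ORDER = "ACGT"


def no_strand_bias_matrix(r_gt, r_ct, r_at, r_ac, r_cg, r_ag):
    """Build a rate matrix with q[i][j] = rate of j -> i (order A, C, G, T).

    Each rate is named after one mutation and reused for the equivalent
    mutation on the complementary strand (e.g. r_ac for A->C and T->G).
    """
    q = [
        [0.0, r_gt, r_ct, r_at],
        [r_ac, 0.0, r_cg, r_ag],
        [r_ag, r_cg, 0.0, r_ac],
        [r_at, r_ct, r_gt, 0.0],
    ]
    # Diagonal entries make every column sum to zero.
    for j in range(4):
        q[j][j] = -sum(q[i][j] for i in range(4) if i != j)
    return q


def stationary_frequencies(q, dt=0.05, steps=20_000):
    """Integrate dp/dt = Q p with explicit Euler steps from a skewed start.

    Assumes dt times the largest rate is well below 1 so the scheme is stable.
    """
    p = [0.7, 0.1, 0.1, 0.1]
    for _ in range(steps):
        dp = [sum(q[i][k] * p[k] for k in range(4)) for i in range(4)]
        p = [p[i] + dt * dp[i] for i in range(4)]
    return dict(zip(ORDER, p))


if __name__ == "__main__":
    # Arbitrary illustrative rates (assumption, not fitted to any genome).
    q = no_strand_bias_matrix(0.3, 1.0, 0.2, 0.2, 0.1, 0.5)
    print(stationary_frequencies(q))
```

Whatever the starting composition, the frequencies of A and T, and of C and G, should become equal in this sketch, while the A+T content itself depends on the chosen rates.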

Unlike the first parity rule, however, the second parity rule is not exact. Although this rule is fulfilled most of the time on the global scale (when a large portion of the genome is considered), deviations have been found at the local scale, indicating that in some regions of the genome, mutations do not affect both strands symmetrically [21-24]. Studying deviations from PR2 has proved to be a powerful tool to highlight a wide range of features of specific regions [25], such as the position of replication origins [26].
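One common way to quantify local deviations from PR2 is to compute strand-composition skews in windows along the sequence. The sketch below is only an illustration under that assumption: the skew definitions $(n_A - n_T)/(n_A + n_T)$ and $(n_G - n_C)/(n_G + n_C)$, the function names and the window parameters are not taken from the thesis or from the cited studies.

```python
def skews(window):
    """Return (AT skew, GC skew) of a window; both are 0 when PR2 holds exactly."""
    a, c, g, t = (window.count(x) for x in "ACGT")
    at_skew = (a - t) / (a + t) if a + t else 0.0
    gc_skew = (g - c) / (g + c) if g + c else 0.0
    return at_skew, gc_skew


def windowed_skews(seq, size, step):
    """Compute skews in windows of length `size`, moved by `step` positions."""
    return [
        (start, *skews(seq[start:start + size]))
        for start in range(0, len(seq) - size + 1, step)
    ]


if __name__ == "__main__":
    toy = "AAAATTTT" + "GGGGGGCC"  # toy sequence, for illustration only
    for start, at_skew, gc_skew in windowed_skews(toy, size=8, step=8):
        print(start, at_skew, gc_skew)
```

Values close to zero in a window indicate that PR2 holds locally, while sustained positive or negative values point to a strand asymmetry in that region.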

Chargaff’s second parity rule is a good example of the two major benefits of simple models in biology. First, by understanding simple statistical features, one can gain insight into global and basic properties of biological processes. Second, simple models offer a global framework from which one can identify local and peculiar deviations. Analyzing and explaining these deviations can help to identify new phenomena and to develop a refined view of biological processes.

#### Match Length Distributions

The goal of this thesis is to study biological processes that generate long repeated segments in DNA sequences, and that are not taken into account in the models we have presented so far. To study these biological processes, we focus on the distribution $M(\cdot)$ of the length of exact matches (segments with an identical sequence) which are maximal (i.e. they cannot be extended on either side). Such distributions can be obtained either for a self-alignment (by aligning a genome against itself), or for a comparative alignment (by aligning two different genomes without allowing for gaps and mismatches).

For sequences generated with the iid model, this distribution is given by a geometric distribution:

$$M_{\mathrm{iid}}(r) = L_1 L_2 (1 - p)^2 p^r, \tag{1.39}$$
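A match length distribution can be computed naively with the Python sketch below, which scans every diagonal of the comparison between two sequences and records each maximal run of identical letters. It is a brute-force illustration (quadratic time, forward strand only), not the procedure used in the thesis. The helper `iid_expectation` assumes the standard form $L_1 L_2 (1 - p)^2 p^r$, where $L_1$ and $L_2$ are the sequence lengths and $p$ is the probability that two random letters are identical ($p = \sum_a f_a^2$), and ignores edge effects; all names are hypothetical.

```python
import random
from collections import Counter


def match_length_distribution(s1, s2):
    """Count maximal exact matches between s1 and s2, keyed by match length."""
    counts = Counter()
    n1, n2 = len(s1), len(s2)
    for offset in range(-(n2 - 1), n1):
        # Walk along one diagonal of the s1 x s2 comparison.
        i, j = max(offset, 0), max(-offset, 0)
        run = 0
        while i < n1 and j < n2:
            if s1[i] == s2[j]:
                run += 1
            else:
                if run:
                    counts[run] += 1  # a mismatch ends the current match
                run = 0
            i += 1
            j += 1
        if run:
            counts[run] += 1  # match ended by the sequence boundary
    return counts


def iid_expectation(r, n1, n2, p):
    """Expected number of maximal matches of length r (assumed iid formula)."""
    return n1 * n2 * (1 - p) ** 2 * p ** r


if __name__ == "__main__":
    rng = random.Random(0)
    s1 = "".join(rng.choices("ACGT", k=400))
    s2 = "".join(rng.choices("ACGT", k=400))
    observed = match_length_distribution(s1, s2)
    for r in sorted(observed):
        print(r, observed[r], round(iid_expectation(r, 400, 400, 0.25), 2))
```

For iid sequences the observed counts should decay roughly geometrically with $r$, in contrast with the power-law tails that motivate this thesis.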

**Table of contents:**

**1 Introduction**

1.1 Version française de l’introduction

1.1.1 Les propriétés statistiques des génomes

1.1.1.1 La première règle de parité de Chargaff

1.1.1.2 Modéliser les séquences d’ADN

Séquences aléatoires

Modèles de Markov

Chaînes de Markov hétérogènes

Modéliser l’évolution des séquences d’ADN

1.1.1.3 La seconde règle de parité de Chargaff

1.1.1.4 Distribution des Longueurs d’Appariement

1.1.2 Des processus d’évolution plus complexes

1.1.2.1 Les éléments transposables

1.1.2.2 Les répétitions possédant un petit nombre de copies

Les duplications de gène

Les duplications segmentaires

Rétroduplications

Les duplications de génome entier

1.1.3 Plan de la Thèse

1.2 English Version of the Introduction

1.2.1 Statistical Properties of Genomes

1.2.1.1 Chargaff’s First Parity Rule

1.2.1.2 Modelling DNA Sequences

Random sequences

Markov Models

Heterogeneous Markov Chains

Modeling the Evolution of Sequences

1.2.1.3 Chargaff’s Second Parity Rule

1.2.1.4 Match Length Distributions

1.2.2 Complex Processes of Genome Evolution

1.2.2.1 Transposable Elements

1.2.2.2 Low Copy Number Repeats

Gene Duplication

Segmental Duplications

Retroduplications

Whole Genome Duplication

1.2.3 Thesis Outline

**2 Materials and Methods**

2.1 Computing MLDs

2.2 Power-Law Distributions

General Properties

Representing Power-Laws

Logarithmic Binning

2.3 Yule Trees

2.4 Simulating the Evolution of DNA Sequences

2.5 Bioinformatic Procedures

Genomes

Phylogenetic Tree of Pseudogenes

RepeatMasker

**3 Self-alignment**

3.1 Preliminary Considerations

3.1.1 The Match Length Distribution of the Self-Alignment of a Genome

3.1.2 The Stick Breaking Process on Evolutionary Time Scale

3.1.3 A Mathematical Framework to Calculate Match Length Distributions

3.2 The Simplest Case: Random Duplications

3.2.1 Theoretical Calculations

The Stationary State with Continuous Duplications

3.2.2 Simulations

Mutations

Duplications

Results

3.2.3 Discussion

3.2.4 Limitations of the Simple Model

3.3 Yule Trees

3.3.1 Theoretical Calculation

3.3.2 Simulations

3.3.3 Discussion

3.4 The Case of Retroduplication

3.4.1 Theoretical Calculations

3.4.2 Simulations

3.4.3 Discussion

3.5 Biological Insights from our Models

3.5.1 Using the MLD to Infer Information on Different Duplication Mechanisms

3.5.2 Assessing the Quality of the Assembly: Orangutan Example

3.5.3 Assessing the Quality of the RepeatMasking: Example from the Macaque Genome

3.5.4 Conclusion

**4 Comparative Alignment**

4.1 MLDs of Comparative Alignments

4.2 Pseudogene Hypothesis

4.3 Ladder of Trees

4.4 The Evolution of Conserved Regions

4.4.1 Theoretical Calculations

Comparing Species Shortly after the Split

The Comparison of Distantly Related Species

Calculating N( ), the Number of Regions at a Distance from Each Other

4.4.2 Simulations

4.4.3 Discussion

The Distribution of Mutation Rates

The Case of Paralogs

Power-laws in MLDs of other Comparisons

**5 At the Crossing Between Self and Comparative Alignments: The Case of Whole Genome Duplication**

5.1 The Fate of a Genome after a Whole Genome Duplication

5.2 The Transition Between the Two Regimes

5.2.1 Simulations

5.3 Discussion and Limitations

**6 Comparison of Coding Sequences**

6.1 Comparing the Exome of Different Species

6.2 Theoretical Application of the Divergence Model

6.3 Theoretical Calculation of the Value of N( )

Symmetrical Mutation Rate Distributions

Asymmetrical Mutation Rates Distribution

6.4 Investigating Different Exon Subclasses

6.5 Conclusions

Hypotheses Regarding N( )

Hypotheses Regarding m(r; )

Hypotheses Regarding the Exon Length Distribution

**7 Conclusion**

7.1 Summary

7.2 Perspectives

**A Extension to the Discrete Case**

**B Non RepeatMasked MLD**

**C Article: Statistical Properties of Pairwise Distances between Leaves on a Random Yule Tree**

**Bibliography**