
## Chargaff’s First Parity Rule

Early on in the history of genetics, while it had not yet been established that genetic information is encoded in the DNA, scientists started to study statistical features of DNA, in order to understand its properties.

A basic yet historically very important property was highlighted by Erwin Chargaff: in the DNA of a given organism,

$$n_A = n_T \quad \text{and} \quad n_C = n_G,$$

where $n_a$ stands for the number of nucleotides $a$. Examples in different species from Chargaff’s experiment are shown in Table 1.2. This property, together with the X-ray pictures of DNA that we owe to Franklin and Gosling [2], led to the discovery of the structure of DNA by Watson and Crick [3]. Watson and Crick’s model states that DNA molecules (or strands) are linked in pairs, and that each nucleotide on one strand is associated with a nucleotide on the complementary strand. This link is made possible by chemical interactions between nucleotides. Due to steric constraints, purine-purine and pyrimidine-pyrimidine associations are much less stabilizing than purine-pyrimidine pairing. The four possible associations left are thus A–T, A–C, C–G, and G–T. But stabilizing chemical interactions (π-stacking and hydrogen bonds [4]) preferentially occur between A and T on the one hand and C and G on the other. Thus, in double-stranded DNA, A is always paired with T and C is always paired with G. These preferential associations imply that in each cell the proportions of A and T, as well as the proportions of C and G, are always equal, which is Chargaff’s first parity rule.

Many fundamental processes involving DNA rely on the association of DNA strands in pairs. Its discovery thus paved the way to a better understanding of how life works. Due to the pairing rule, the two associated strands are complementary and contain the same information, so that knowing one strand is enough to fully reconstruct the other. Thanks to this property, DNA can also easily replicate. The cell machinery first separates the two complementary strands, and each of them then serves as a template to produce a new DNA molecule. At the end of the process, the cell contains two identical double-stranded DNA molecules, and it can thus divide into two daughter cells containing the same DNA content.

Table 1.2: A sample of Chargaff’s 1952 data, listing the nucleotide composition of the DNA of several organisms. Table reproduced from Bansal [5].

Errors can nevertheless occur during the replication process. For instance, one base can be mistakenly inserted in place of another. As a result, the newly produced double-stranded molecule is identical to the original molecule at all positions (or loci) but one. This event is called a point mutation. As a point mutation can change any of the four possible nucleotides into any of the three others, there are twelve possible mutations. Although they are rare and a source of deleterious effects, these errors also allow species to evolve and to adapt to their changing environment. Such changes in DNA do not only happen during replication: the alteration of a nucleotide can also occur at another step of the cell's life. And since nucleotides belonging to the same chemical group are more closely related to one another than to the two others, the four mutations that change a purine into another purine or a pyrimidine into another pyrimidine (called transitions, see Fig. 1.2) occur much more often than the eight others (called transversions).
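The counting argument above (twelve point mutations, of which four are transitions and eight are transversions) can be checked with a short enumeration. This is only an illustrative sketch; the names `PURINES`, `PYRIMIDINES` and `classify` are hypothetical helpers introduced here, not taken from the source.

```python
from itertools import permutations

# Chemical groups of the four nucleotides.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(src, dst):
    """Label the point mutation src -> dst as a transition or a transversion."""
    # A transition keeps the base within its chemical group.
    same_group = {src, dst} <= PURINES or {src, dst} <= PYRIMIDINES
    return "transition" if same_group else "transversion"


# Every ordered pair of distinct bases is one possible point mutation.
mutations = list(permutations("ACGT", 2))
transitions = [m for m in mutations if classify(*m) == "transition"]
transversions = [m for m in mutations if classify(*m) == "transversion"]

print(len(mutations), len(transitions), len(transversions))
```

The enumeration only counts the mutation types; it says nothing about their relative rates, which must be measured from data.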

When a mutation occurs on one strand, the newly inserted nucleotide and its homologous base on the other strand are no longer paired according to the pairing rule of Watson and Crick’s model. This results in an alteration of the structure of the DNA that can be identified by the repair machinery present in the cell, whose function is to repair the error. However, most of the time, the cell has no way to identify which base is in the unaltered state and which base has been changed. Thus, one of the two bases is randomly replaced so that the two bases are properly associated. Half of the time the mutation is then repaired, and half of the time it becomes fully integrated.

### Modelling DNA Sequences

Different individuals belonging to the same species share very similar genomes. For this reason, one can build a reference genome for each species, representing the typical genome of one individual. The size of genomes is highly variable from one species to another, ranging from several hundred kilobase pairs (kbp) to hundreds of gigabase pairs (Gbp), while typical mammalian genomes are of the order of several Gbp (3.2 Gbp for the human genome). As a comparison, the French Wikipedia was composed of roughly 4.2 billion letters in total in June 2015, according to Wikipedia itself. Unlike Wikipedia, however, the entire human genome (like other eukaryotic genomes) cannot be directly interpreted. Indeed, only a small percentage of a eukaryotic genome codes for proteins (of the order of 1% in the human genome), while most of the genome (the non-coding DNA) is thought to be mostly non-functional (although the potential function of non-coding DNA is the subject of a lively and passionate debate in the scientific community [6–8]). This explains why the size of genomes varies so much from one species to another, and why the complexity of an organism does not correlate well with the size of its genome.

Objects of such a large size, composed of repeated units (here nucleotides), lend themselves to statistical analysis. One of the first statistical analyses performed on genomes aimed at identifying short sequences of nucleotides (or words) that were either exceptionally frequent or exceptionally rare, with the idea that these exceptional events would be the signature of the biological significance of these words [9–11]. Words of particular biological importance, such as motifs recognized as transcription start sites by transcription factors, are expected to be overrepresented in the sequence, while certain types of motif might be deleterious, and thus avoided. But in order to identify exceptional words, one first needs to develop a random model, to differentiate events that occur “by chance” from biologically meaningful events.

**Random sequences.** One can model a DNA sequence $S = (s_1, \dots, s_L)$ as a chain of $L$ letters, where each letter belongs to the four-letter alphabet $\mathcal{A} = \{A, C, G, T\}$. The simplest way to model a DNA sequence is the independent and identically distributed (iid) random model. In this framework, all letters $s_i$ are independent of each other, and the probability of observing each nucleotide is the same at all positions of the sequence. The parameters of the model are the length of the sequence $L$ and the frequencies of the four nucleotides $f_A$, $f_T$, $f_C$ and $f_G$. In that case we have:

$$P(s_i = a) = f_a, \quad \text{with } a \in \{A, C, G, T\}, \ \forall i \in \{1, \dots, L\}. \tag{1.22}$$

$k$-letter words $W$ are defined as subsequences of $k$ consecutive letters (and are for this reason also termed $k$-mers). In a sequence of size $L$, there are $L - k + 1$ words of length $k$, and the word starting at position $i$ is defined as $W_i = (s_i, \dots, s_{i+k-1})$. Since the letters are independent, the probability that a given word $W = (w_1, \dots, w_k)$ of length $k$ appears at position $i$ of a sequence of size $L$ is then the product of the frequencies of its letters:

$$P(W_i = W) = \prod_{j=1}^{k} f_{w_j}.$$
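As a rough illustration of these definitions, the sketch below counts the $L - k + 1$ words of length $k$ in a toy sequence and evaluates the probability of a word under the iid model as the product of its letter frequencies. The function names and the toy sequence are assumptions made for the example, and the frequencies are simply estimated from the sequence itself.

```python
from collections import Counter
from math import prod


def base_frequencies(seq):
    """Estimate the iid parameters f_a as the observed frequency of each letter."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in "ACGT"}


def kmer_counts(seq, k):
    """Count all overlapping words of length k (there are L - k + 1 of them)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def iid_word_probability(word, freqs):
    """Probability of observing `word` at a given position under the iid model."""
    return prod(freqs[w] for w in word)


seq = "ACGTACGTAC"  # toy sequence, not real data
freqs = base_frequencies(seq)
counts = kmer_counts(seq, 2)

print(counts["AC"], iid_word_probability("AC", freqs))
```

Comparing the observed count of a word with what such a model predicts is the starting point for deciding whether the word is exceptional.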

**Markov models.** This section borrows heavily from Robin, Rodolphe, and Schbath [12], where one can find broader developments of the topics discussed below.

More sophisticated models have been developed and have proved to be powerful tools to understand the biological processes that shape genomes. One widely used class of such models is that of Markov chain models, which have been applied to a broad range of fields (for general discussions about Markov chains, see for instance Karlin and Taylor [13]). The simplest Markov chains (also called first-order Markov chains) are processes where the event at step $i$ depends only on the state of the chain at step $i - 1$, and is independent of all earlier states.

In the case of DNA, the value of each letter $s_i$ depends only on the value of its left neighbor $s_{i-1}$. There are 16 possible couples of letters $(s_{i-1}, s_i)$, and thus 16 transition probabilities. The transition probability $p_{a \to b}$ from a letter $a \in \mathcal{A}$ to a letter $b \in \mathcal{A}$ is defined as

$$p_{a \to b} = P(s_i = b \mid s_{i-1} = a). \tag{1.25}$$

These probabilities can be estimated from an observed DNA sequence using the maximum likelihood method. The likelihood of a model is defined as the probability of obtaining a set of data given the model. The parameter values that maximize the likelihood are then used as estimators of the parameters.

In the present case, the maximum likelihood estimators (MLE) of the transition probabilities are:

$$\hat{p}_{a \to b} = \frac{n_{ab}}{\sum_{c \in \mathcal{A}} n_{ac}},$$

where $n_{ab}$ is defined as the count of the 2-letter word $W = ab$. In that sense, this model is an extension of the iid model, where the parameters were the single nucleotide frequencies only. Following a similar procedure, one can build Markov chains of order $m$, where the letter $s_i$ depends on the $m$ previous letters. In this case, one has to define $4^{m+1}$ transition probabilities, whose MLE depend on the frequencies of words of length $m + 1$. Thus, the number of parameters increases with the order of the Markov chain. Hence, the higher the order of the Markov chain, the more closely it describes the DNA sequence.
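A minimal sketch of this estimation for a first-order chain is given below, assuming the usual estimator in which the count $n_{ab}$ of the 2-letter word $ab$ is divided by the total count of 2-letter words starting with $a$. The function name `transition_mle` and the toy sequence are hypothetical, and rows for letters that are never followed by another letter are set to zero here purely by convention.

```python
from collections import Counter

ALPHABET = "ACGT"


def transition_mle(seq):
    """Estimate first-order transition probabilities from 2-letter word counts."""
    # n_ab: number of occurrences of the 2-letter word ab (overlapping).
    pair_counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for a in ALPHABET:
        # Total number of 2-letter words starting with a.
        row_total = sum(pair_counts[(a, c)] for c in ALPHABET)
        for b in ALPHABET:
            # Assumption of this sketch: an unobserved context gets probability 0.
            probs[(a, b)] = pair_counts[(a, b)] / row_total if row_total else 0.0
    return probs


seq = "AACGTTACGA"  # toy sequence, not real data
p = transition_mle(seq)

print(p[("A", "C")], p[("C", "G")])
```

For a chain of order $m$, the same counting would be done on words of length $m + 1$, which is why the amount of data needed grows quickly with the order.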

However, when it comes to modeling, more accurate does not always mean better. Recall that one of the motivations for modeling DNA was to produce a random model in order to differentiate exceptional words from those occurring just by chance. For instance, one cannot expect to find exceptional words of length $k$ in a sequence using a Markov chain model of order $k$. An extreme case of an evidently pointless model consists in representing a sequence of length $L$ using a Markov chain of order $L - 1$.

**Heterogeneous Markov chains.** So far, we have only described homogeneous Markov chain models. In these models, transition probabilities are the same all along the sequence. But we have already seen that only a small fraction of eukaryotic genomes codes for proteins. Because the coding potential of a region constrains its statistical properties, these properties vary along the genome, which homogeneous Markov chains do not take into account. To deal with these irregularities, heterogeneous Markov chains have been introduced. In these models, the transition probabilities differ from one region to another. One subclass of heterogeneous Markov chains of particular interest for DNA modeling is that of hidden Markov models. These models have been widely used, notably to detect coding regions in genomes [14].

**Modeling the evolution of sequences.** One can also use Markov chains to model the evolution of DNA in time. In these models, it is assumed that each site evolves independently. In this case, it is realistic to use a first-order Markov model, that is, to consider that the mutation rate of a nucleotide depends on the present letter only, and not on the letters that were found at the same position in the past. Indeed, real genomes do not store information about the nucleotides that occupied a given position earlier in the history of a species.

#### Chargaff’s Second Parity Rule

Later on, Chargaff and his coworkers separated the two strands of the DNA of the model bacterium Bacillus subtilis, and calculated the proportions of each base independently in the two strands.

**Table of contents:**

**1 Introduction**

1.1 French Version of the Introduction

1.1.1 Statistical Properties of Genomes

1.1.1.1 Chargaff’s First Parity Rule

1.1.1.2 Modelling DNA Sequences

Random Sequences

Markov Models

Heterogeneous Markov Chains

Modelling the Evolution of DNA Sequences

1.1.1.3 Chargaff’s Second Parity Rule

1.1.1.4 Match Length Distributions

1.1.2 More Complex Evolutionary Processes

1.1.2.1 Transposable Elements

1.1.2.2 Low Copy Number Repeats

Gene Duplications

Segmental Duplications

Retroduplications

Whole Genome Duplications

1.1.3 Thesis Outline

1.2 English Version of the Introduction

1.2.1 Statistical Properties of Genomes

1.2.1.1 Chargaff’s First Parity Rule

1.2.1.2 Modelling DNA Sequences

Random sequences

Markov Models

Heterogeneous Markov Chains

Modeling the Evolution of Sequences

1.2.1.3 Chargaff’s Second Parity Rule

1.2.1.4 Match Length Distributions

1.2.2 Complex Processes of Genome Evolution

1.2.2.1 Transposable Elements

1.2.2.2 Low Copy Number Repeats

Gene Duplication

Segmental Duplications

Retroduplications

Whole Genome Duplication

1.2.3 Thesis Outline

**2 Materials and Methods**

2.1 Computing MLDs

2.2 Power-Law Distributions

General Properties

Representing Power-Laws

Logarithmic Binning

2.3 Yule Trees

2.4 Simulating the Evolution of DNA Sequences

2.5 Bioinformatic Procedures

Genomes

Phylogenetic Tree of Pseudogenes

RepeatMasker

**3 Self-alignment**

3.1 Preliminary Considerations

3.1.1 The Match Length Distribution of the Self-Alignment of a Genome

3.1.2 The Stick Breaking Process on Evolutionary Time Scale

3.1.3 A Mathematical Framework to Calculate Match Length Distributions

3.2 The Simplest Case: Random Duplications

3.2.1 Theoretical Calculations

The Stationary State with Continuous Duplications

3.2.2 Simulations

Mutations

Duplications

Results

3.2.3 Discussion

3.2.4 Limitations of the Simple Model

3.3 Yule Trees

3.3.1 Theoretical Calculation

3.3.2 Simulations

3.3.3 Discussion

3.4 The Case of Retroduplication

3.4.1 Theoretical Calculations

3.4.2 Simulations

3.4.3 Discussion

3.5 Biological Insights from our Models

3.5.1 Using the MLD to Infer Information on Different Duplication Mechanisms

3.5.2 Assessing the Quality of the Assembly: Orangutan Example

3.5.3 Assessing the Quality of the RepeatMasking: Example from the Macaque Genome

3.5.4 Conclusion

**4 Comparative Alignment**

4.1 MLDs of Comparative Alignments

4.2 Pseudogene Hypothesis

4.3 Ladder of Trees

4.4 The Evolution of Conserved Regions

4.4.1 Theoretical Calculations

Comparing Species Shortly after the Split

The comparison of Distantly Related Species

Calculating N( ), the Number of Regions at a Distance from Each Other

4.4.2 Simulations

4.4.3 Discussion

The Distribution of Mutation Rates

The case of Paralogs

Power-laws in MLDs of other Comparisons

**5 At the Crossing Between Self and Comparative Alignments: The Case of Whole Genome Duplication**

5.1 The Fate of a Genome after a Whole Genome Duplication

5.2 The Transition Between the Two Regimes

5.2.1 Simulations

5.3 Discussion and Limitations

**6 Comparison of Coding Sequences**

6.1 Comparing the Exome of Different Species

6.2 Theoretical Application of the Divergence Model

6.3 Theoretical Calculation of the Value of N( )

Symmetrical Mutation Rate Distributions

Asymmetrical Mutation Rates Distribution

6.4 Investigating Different Exon Subclasses

6.5 Conclusions

Hypotheses Regarding N( )

Hypotheses Regarding m(r; )

Hypotheses Regarding the Exon Length Distribution

**7 Conclusion**

7.1 Summary

7.2 Perspectives