Basics of molecular biology
Biologists can skip this section, which explains the basics of molecular biology required to understand the following chapters. These fundamental facts can be found in molecular biology books such as Lodish et al. (2000).
Proteins. Proteins are the molecules that constitute the building blocks of living organisms. They play a role both in forming the structure of organisms and in performing chemical reactions (the latter proteins are called enzymes). A protein is formed by one or several chains of amino-acids. Amino-acids are small molecules that are present in all living organisms. There are 20 different “main” amino-acids (a few rare others exist in some organisms). These 20 amino-acids are conventionally denoted by letters (A for alanine, V for valine, etc.), so that an amino-acid sequence can be displayed simply as a succession of letters. Amino-acids that form a chain are also called residues (short for amino-acid residues) because each one loses a few atoms, released as a water molecule, when it binds to its neighbor.
DNA. The DNA molecule is a double helix formed of two different strands facing each other. Each strand is a succession of (deoxyribo-)nucleotides. There are 4 types of nucleotides in DNA, which differ by the nucleobase they contain: guanine (G), adenine (A), thymine (T), or cytosine (C). The information in the genome is (mostly) stored as the succession of these four types of nucleotides. Sequencing a genome refers to reading the succession of A, G, T and C along the DNA. The other strand of DNA is a copy of the same information, with each nucleotide replaced by the complementary base (G with C, A with T). For example, if one strand contains the sequence ACCT, the other contains TGGA. Since the information of one strand can be easily deduced from the other, only one strand (chosen arbitrarily) is stored in databases and is considered as the reference genome.

RNA. The RNA molecule is similar to DNA, except for a few details. First, it often occurs as a single-stranded molecule. Second, its chemical composition is slightly different: it is composed of ribonucleotides instead of deoxyribonucleotides, the difference being an extra oxygen atom in ribonucleotides. Finally, the thymine base (T) is replaced by the uracil base (U). The “four letters” of RNA are therefore A, C, G, and U.
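The complement rule and the T-to-U replacement described above can be illustrated with a minimal Python sketch. The function names `complement` and `dna_to_rna_letters` are our own illustrative choices and do not come from any library.

```python
# Base pairing as described in the text: G pairs with C, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the facing strand, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)


def dna_to_rna_letters(strand: str) -> str:
    """Rewrite a DNA sequence with the RNA alphabet (T becomes U)."""
    return strand.replace("T", "U")


print(complement("ACCT"))          # TGGA
print(dna_to_rna_letters("ACCT"))  # ACCU
```

Because the two strands run in opposite directions, sequence tools usually report the reverse complement (here AGGT) rather than the position-by-position complement; the sketch keeps the position-by-position view used in the text.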
The combination of the discovery of natural selection with genomics allows us to understand many things about proteins. When studying a protein-coding gene, we can often find a similar gene in the same species and in other species. We say the two genes are homologous. There are different types of homology, which we explain now (Koonin, 2005). Since all living organisms have a common ancestor, it is no surprise that they have similar genes inherited from that ancestor. These genes have been preserved for many generations; because of evolution, they are not exactly identical to the ancestral gene, but they remain very similar. This similarity often corresponds to a similarity of function, but this is not necessarily the case. This leads us to define the concept of orthology: we say that genes are orthologous when their similarity is explained by the presence of the ancestral gene in the common ancestor.
We can also observe two similar genes in the same species. In this case, the explanation is gene duplication. Chromosomal rearrangements during meiosis can lead to some parts of the genome being duplicated. There can also be a whole genome duplication, in which case the complete genome is duplicated. For each duplicated gene, evolution may either retain the two copies or delete one of them (or make it non-functional). If the two copies are retained, they may evolve to acquire different functions in the organism. The result is that, many generations later, we can observe two similar genes in the same species. We say that these two genes are paralogous. In the special case where the duplication event was a whole genome duplication, we say the two genes are ohnologous.
Predicting the effect of mutations
In protein synthesis, some steps can be easily predicted. The transcription from DNA to RNA is straightforward and can be done automatically by a computer. Then, some alternative splicing may occur, making the process a little more difficult to predict. Finally, the translation from an RNA to an amino-acid sequence is simple and can be done by a computer in which the genetic code has been recorded. A very difficult question, on the other hand, is to predict what the structure of the protein will be, knowing only the sequence of amino-acids. Even more difficult is to predict the function of the protein, or its interactions with other molecules.
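To illustrate how a computer “in which the genetic code has been recorded” can perform translation, here is a minimal Python sketch. It assumes the standard genetic code, a coding sequence written with the DNA alphabet (T instead of U), and a reading frame starting at the first base; splicing is ignored. The name `translate` is an illustrative choice.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate a coding sequence codon by codon until the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCGTTTAA"))  # MAV
```

The example sequence encodes methionine, alanine and valine followed by a stop codon, hence the one-letter output MAV.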
But this is not the question we are interested in here. We try to solve a simpler problem. Suppose that the function of the protein is known and that we observe a change in the nucleotide sequence of the gene. Thanks to the genetic code, we can predict the resulting change in the amino-acid sequence. The amino-acid sequence may remain identical, because of the redundancy of the genetic code. However, suppose that the change results in a single amino-acid substitution. Will the structure or the function of the protein be disrupted? It is well known that some changes in the DNA sequence have no effect at all, while others can lead to lethal diseases (for example muscular dystrophy, cystic fibrosis and phenylketonuria). The effect depends on many factors, which are not trivial. Figure 2.1 shows a diagram summarizing the question.
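The first step of this reasoning, going from a nucleotide change to its consequence on the amino-acid sequence, can be sketched as follows. This is only an illustration under the assumption of the standard genetic code and a reading frame starting at position 0; the function name `classify_substitution` and the labels it returns are our own. It says nothing about whether the substitution is harmful, which is the hard part of the question.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(coding_dna: str, position: int, new_base: str) -> str:
    """Classify a single-nucleotide change at a 0-based position of a coding sequence."""
    start = position - position % 3  # first base of the affected codon
    offset = position - start
    old_codon = coding_dna[start:start + 3]
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    if new_aa == old_aa:
        return "synonymous"  # redundancy of the genetic code: same amino-acid
    if new_aa == "*":
        return "nonsense"    # the change introduces a premature stop codon
    return f"missense {old_aa}->{new_aa}"


print(classify_substitution("ATGGCCGTTTAA", 5, "T"))  # synonymous
print(classify_substitution("ATGGCCGTTTAA", 4, "T"))  # missense A->V
```

In the first call, GCC becomes GCT and both codons encode alanine; in the second, GCC becomes GTC, which replaces alanine with valine.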
SCA and ELSC
The SCA method, standing for Statistical Coupling Analysis (Lockless and Ranganathan, 1999), is the first method specifically designed to detect residue coevolution. When detecting coevolution between two positions, SCA estimates the conditional probabilities of finding each residue at one position, given each residue at the other position. To make this estimation reliable, many sequences are required: typically at least 100. Moreover, these sequences have to be quite different from each other, as measured by the number of identical residues. Nearly identical sequences do not bring new residues into the distribution, so they provide no extra information and only bias the distribution; hence they should be removed. This is why SCA requires around 100 or more sufficiently divergent sequences.
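The two ingredients mentioned above, measuring how similar two aligned sequences are and estimating conditional residue frequencies between two positions, can be sketched in Python. This is not the SCA score itself, only the counting step it relies on; the function names, the 0-based column indices and the toy alignment are assumptions made for illustration, and gaps are treated like any other character.

```python
from collections import Counter, defaultdict


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned columns where both sequences carry the same residue."""
    same = sum(a == b for a, b in zip(seq_a, seq_b))
    return same / len(seq_a)


def conditional_frequencies(alignment: list[str], i: int, j: int) -> dict[str, dict[str, float]]:
    """Estimate P(residue at column j | residue at column i) by counting sequences."""
    counts: defaultdict[str, Counter] = defaultdict(Counter)
    for seq in alignment:
        counts[seq[i]][seq[j]] += 1
    return {
        res_i: {res_j: n / sum(c.values()) for res_j, n in c.items()}
        for res_i, c in counts.items()
    }


# Toy alignment of five sequences and three columns (hypothetical data).
alignment = ["AVK", "AVR", "GLK", "GLK", "ALK"]
print(percent_identity(alignment[2], alignment[3]))
print(conditional_frequencies(alignment, 0, 1))
```

In the toy alignment, the third and fourth sequences are identical, so keeping both would count the G/L pair twice without adding information. This is the kind of bias that removing near-identical sequences is meant to avoid.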
The output of SCA is a square matrix, with as many rows and columns as residues in the protein. Each cell M_{i,j} in the matrix is a score which measures the coevolution between positions i and j. ELSC, standing for Explicit Likelihood of Subset Co-variation (Dekker et al., 2004), is an alternative to SCA. It is very similar because it aims at discovering the same signal and also produces a matrix of coevolution coefficients. According to Dekker et al. (2004), it is more sensitive than SCA.
Presentation of P53
The P53 protein is composed of 5 domains, totaling 393 amino-acids: a transactivation domain which interacts with the mdm2 protein, a proline-rich domain, a DNA-binding domain (Figure 3.1), an oligomerization domain and a carboxy-terminal domain. When running our analyses, we take into account the complete sequence, but the most interesting domain is the third one, ranging from position 101 to position 300. It is the longest, the most conserved and the most mutated in tumors (90% of tumor mutations are in this domain). When a cell is healthy, the P53 gene is downregulated by several other genes, so its level is very low in normal cells. However, in case of stress, the gene is upregulated.
Then, the P53 protein itself regulates other genes (Figure 3.1 shows its binding to DNA), with the final result being either cell cycle arrest or cell death. Its role is therefore very important in preventing cancer, since this gene will “decide” to kill the abnormal cell, or at least stop it from spreading and forming a tumor. Experiments on mice have shown that the absence of the P53 gene leads to premature death from cancer (Donehower et al., 1992; Jacks et al., 1994).
However, errors occur in the process of DNA replication. Moreover, exposure to carcinogenic factors (radiation, some chemicals and pathogens, etc.) can increase the mutation rate. This can lead to the P53 gene being mutated and losing its function. If the same cell has other mutations that could lead it to form a tumor, P53 will not be able to initiate cell death or cell cycle arrest, and the cell will create a tumor. This is why tumor cells very often have mutations in the P53 gene, since this gene has to be disabled before the cell can form a tumor. So by looking at mutations in tumors, we can, a posteriori, discover which genes, and more precisely which positions in these genes, are critical to prevent cancer.
It has been discovered that P53 has two paralogous genes: P63 and P73. The duplications that gave birth to them date back to the common ancestor of vertebrates. The three proteins have different functions, as shown by knockout experiments in mice. Without P63, mice have strong developmental defects (Yang et al., 1999; Mills et al., 1999), while mice without P73 have other defects, but do not show tumors (Yang et al., 2000).
Cancer-specific mutation frequency
Another idea could then be to apply a threshold for each cancer (for example μ + 2σ), and then take the union of all the selected positions. This idea seems more rational because it gives the same weight to every cancer, while the naive method above gives a higher weight to cancers for which more information is available.
However, there are some issues with this method too. First, some cancers have very few recorded mutations, so it is not clear whether it is meaningful to consider the per-position mutation frequency for these cancers. The hypothesis behind the calculation of this distribution is that non-observed mutations are neutral while observed mutations are pathogenic. But if there is little data for a cancer, missing positions might just be missing information, and conversely a few false observations might pass the threshold since μ + 2σ is then a smaller number.
To mitigate this issue, we can restrict the analysis to cancers which have at least 100 observations (they cover 85% of the database). In the worst case, i.e. for the cancer that passes this criterion with the smallest number of observations (acute myeloid leukemia, with 106 observations), the threshold μ + 2σ corresponds to having at least 3 observations of a mutation.
To avoid losing the rest of the database (the remaining 15% of mutations), we pool all the remaining cancers together and consider the union of all the associated mutations. We treat them as a single cancer, labeled “others”, and apply the threshold again. With this method, the number of mutations is high enough for a correct estimation of the distribution.
At the end of this process we get a union of 46 positions. If we do all this with μ + 3σ instead, we get 33 positions. They are not a subset of the 35 positions given by the naive method, as 7 of them are not in the set of 35.
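The procedure described in this section can be sketched in Python under stated assumptions: the threshold is read as the mean plus k standard deviations of the per-position mutation counts (computed over all positions, zeros included, with the population standard deviation), a position is selected when its count strictly exceeds the threshold, and cancers with fewer than 100 observations are pooled into a single “others” group. The function names and the toy counts below are hypothetical and are not taken from the mutation database.

```python
from statistics import mean, pstdev


def positions_above_threshold(counts: list[int], k: float = 2.0) -> set[int]:
    """Return 1-based positions whose mutation count exceeds mean + k * std."""
    threshold = mean(counts) + k * pstdev(counts)
    return {pos for pos, n in enumerate(counts, start=1) if n > threshold}


def cancer_specific_positions(
    counts_by_cancer: dict[str, list[int]], k: float = 2.0, min_observations: int = 100
) -> set[int]:
    """Threshold each well-sampled cancer separately, pool the rest, take the union."""
    length = len(next(iter(counts_by_cancer.values())))
    others = [0] * length
    selected: set[int] = set()
    for counts in counts_by_cancer.values():
        if sum(counts) >= min_observations:
            selected |= positions_above_threshold(counts, k)
        else:
            # Too few observations: merge into the pooled "others" cancer.
            others = [a + b for a, b in zip(others, counts)]
    if sum(others) > 0:
        selected |= positions_above_threshold(others, k)
    return selected


# Toy protein of 20 positions with two well-sampled and two rare cancers.
toy = {
    "cancer_a": [62] + [2] * 19,
    "cancer_b": [1] * 4 + [90] + [1] * 6 + [0] * 9,
    "rare_c": [0] * 6 + [5] + [0] * 13,
    "rare_d": [0] * 6 + [4] + [0] * 4 + [1] + [0] * 8,
}
print(sorted(cancer_specific_positions(toy)))  # [1, 5, 7]
```

In the toy data, position 7 is only selected because the two rare cancers are pooled; neither of them has enough observations to be thresholded separately.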
To decide which of these data sets to consider as the reference for critical positions, we decided to look at how well we can predict them with conservation methods. If we use simple conservation for instance, we get an AUPR (area under the precision-recall curve) of 0.671 against the naive reference set. When using the cancer-specific threshold, we get AUPR=0.621 with μ + 2σ, and AUPR=0.490 with μ + 3σ. A similar change is observed for all prediction methods.
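As an illustration of how such a figure can be computed, the following Python sketch estimates the area under the precision-recall curve from a list of per-position scores and a reference set of critical positions. We assume here that AUPR denotes this area and we use average precision, one common estimator of it; the exact integration scheme used for the values reported above may differ, so this sketch is not expected to reproduce them. Ties between scores are not handled specially, and the function name is our own.

```python
def average_precision(scores: list[float], is_critical: list[bool]) -> float:
    """Area under the precision-recall curve, estimated as average precision."""
    ranked = sorted(zip(scores, is_critical), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(is_critical)
    true_positives = 0
    area = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            true_positives += 1
            # Each recovered positive raises recall by 1 / total_positives;
            # weight that step by the precision reached at this rank.
            area += (true_positives / rank) / total_positives
    return area


# Hypothetical scores for four positions, two of which are critical.
print(average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
```

A predictor that ranks all critical positions first obtains a value of 1, while a predictor that ranks them last obtains a value close to the fraction of critical positions when that fraction is small.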
Note that the reasoning here is the reverse of the benchmark, because we choose the reference experimental data so as to maximize the quality of the predictions when compared to it. It thus seems that the naive method for choosing the critical residues is in fact the best one. We cannot completely rule out the hypothesis that all methods are bad, and that by making this choice we are just making the task simpler for prediction methods. However, since this effect holds for all prediction methods, it seems more likely that using a cancer-specific threshold as a reference introduces noise. This is why we finally decided to use the 35 residues of the naive method, shown in Figure 3.6.
Table of contents:
1 Summary in French
1.1 Conservation and coevolution
1.1.1 Biological context
1.1.2 Predicting critical positions of P53
1.1.3 Predicting pathogenic mutations
1.1.4 Predicting protein-protein interactions in HCV
1.1.5 Sequence filtering with PruneTree
1.2 Fourier transforms and genome analysis
1.2.1 Fourier analysis on genomes
1.2.2 SPoRE: a model for recombination proteins
1.3 Other work
1.3.1 Simulating the evolution of genomes
1.3.2 R-CLAG: an R package for clustering
I Conservation and coevolution
2 Coevolution methodology
2.1 Biological background
2.1.1 The genomics era
2.1.2 Basics of molecular biology
2.1.3 Natural selection
2.1.4 Homologous genes
2.2 The biological question
2.2.1 Predicting the effect of mutations
2.2.2 What can evolution tell us?
2.2.3 What is coevolution?
2.3 Coevolution detection methods
2.3.1 Mutual information
2.3.2 SCA and ELSC
3 Predicting critical positions of protein P53
3.1 Presentation of P53
3.2 Experimental data
3.3.1 General methodology
3.3.2 ROC curve
3.3.3 PR curve
3.4 Defining critical positions
3.4.1 Global mutation frequency
3.4.2 Cancer-specific mutation frequency
3.5 Comparison of methods
3.5.1 Methods based on conservation
3.5.2 Methods based on conservation and phylogenetic trees
3.5.3 Methods based on coevolution
3.5.4 Methods based on the prediction of mutation effect
3.5.5 Conclusion on prediction methods
3.6 Sequence alignment preparation for the benchmark
3.6.1 Query coverage, alignment and tree inference
3.6.2 Sequence database
3.6.3 Sequence identity
3.6.4 Homology or orthology?
4 Predicting pathogenic mutations
4.1.2 Gathering homologous sequences
4.1.3 Alignment and tree reconstruction
4.1.4 Testing by position and residue
4.2 Individual benchmark
4.2.2 PolyPhen 2
4.2.5 Functional Impact Score
4.3 Global benchmark
4.3.1 Why is it better?
4.3.4 Z-cons FreqDiff
4.3.5 Results for conservation-based methods
4.3.6 Results for coevolution-based methods
4.4 Double threshold
4.5 Comparison with another benchmark
5 Predicting HCV protein-protein interactions
5.1 Biological background
5.3 Analysis with BIS
5.4 Filtering clusters
5.6.1 Interaction matrix
5.6.2 Interaction circle
5.6.3 An example on structures
5.7 Intra-protein interactions
6 PruneTree: a sequence filtering algorithm
6.2 Benchmark with Guidance
6.3 Influence of the parameters
6.4 Correlation analysis
6.6 Application to coevolution detection
6.7 Conclusion and perspectives
II Fourier transforms and genome patterns analysis
7 Biological background
7.1.1 Homologous recombination
7.1.2 Experimental data
7.2 Small RNAs in Phaeodactylum tricornutum
8 Fourier transforms for genome analysis
8.1 Detecting periodicity on a set of genomic positions
8.1.3 Positions to distances translation
8.1.4 Application to recombination hotspots in S. cerevisiae
8.2 Statistical analysis
8.2.1 Simulation algorithm
8.2.2 Random models
8.3 Further analysis of periods
8.3.1 Histogram visualization
8.3.2 Modulo projections
8.3.3 Statistical analysis of modulo projections
8.3.4 Comparison with Solenoid Coordinate Method
8.3.5 Going back to the raw data
8.4 Local periodicity analysis in high-resolution data
8.4.1 General methodology
8.4.3 Application to recombination proteins in S. cerevisiae
8.4.4 Scoring periodicity signals
8.4.6 Results for S. cerevisiae recombination data
8.5 Application to Phaeodactylum tricornutum RNAs
8.6 Technical details
9 A model for yeast recombination proteins
9.1 The SPoRE model and the algorithm
9.1.1 Analysis of convergent and divergent regions
9.1.2 Axis proteins model
9.1.3 DSB model
9.2 Comparison with experimental data
9.2.1 SPoRE model and axis proteins in S. cerevisiae
9.2.2 SPoRE model and DSBs in S. cerevisiae
9.2.3 Coherence of SPoRE predictions with two large-scale experimental datasets
9.2.4 SPoRE predictions on several yeast species
9.2.5 Comparison between SPoRE and other predictive tools
9.3.1 Orientation of genes and chromosomal axis formation
9.3.2 Modeling organisms other than yeast
9.4 Technical details
III Side works
10 Simulating the evolution of genomes
10.1 Phylogenetic tree inference with PhyChro
10.2 Motivation for simulations
10.3 The model
10.3.1 Making the tree
10.3.2 Genome model
10.3.3 Simulating the number of rearrangements
10.3.4 Simulating each type of event
10.4 Benchmarking PhyChro
10.4.1 Preparing synteny blocks
10.4.3 Further analysis
11.1 What is CLAG?
11.2 Description of the CLAG algorithm
11.3 Normalization methods in R-CLAG
11.3.2 Application to a toy example
11.4 Clustering comparison in R-CLAG