The central dogma of molecular biology
Discovering the role of DNA: When one thinks about the laws of genetic information transmission, the first thing that comes to mind is experiments with peas: characteristics of a strain (color of the flowers, shape of the peas, etc.) are observed in the offspring in proportions resulting from the transmission of half of each parent’s information. Yet these experiments, led by Johann Gregor Mendel during his years in his monastery (?), did not meet at the time with the enthusiasm one could expect. Even though he is now recognized as the pioneer of genetics and the father of the study of heredity, it was not until the early 20th century that credit was given to his work. Even then, when the biological specificity of Deoxyribonucleic Acid (DNA) was discovered in 1944 (?), the work was not well accepted or widely disseminated in the scientific community. It was only when its structure was discovered by ? that DNA became fundamental to the understanding of living organisms.
DNA and genetic information: DNA is a molecule made of two long sequences of nucleotides called strands, and is present in the form of chromosomes in each cell of an organism: in the cytoplasm in prokaryotes, and in the nucleus in eukaryotes. Nucleotides are made of a five-carbon sugar, one or more phosphate groups and one of 4 possible nucleobases: Adenine (A), Guanine (G), Thymine (T) and Cytosine (C). The structure of nucleotides (most commonly referred to as bases) gives an orientation, called 5′ to 3′, to the strand of DNA: the numbers refer to the 5th and 3rd carbon atoms of the sugar molecule. The pairs of nucleotides A-T and C-G are complementary, meaning that they can hybridize, i.e. bind together through hydrogen bonds, in such a way that their orientations are opposite. The two DNA strands, the Watson (or positive) strand and the Crick (or negative) strand, are themselves said to be complementary, as the sequences of nucleotides they are made of are hybridized. Figure 1.1 illustrates a possible fragment of DNA.
This hybridization property is the basis of the natural perpetuation and propagation of genetic information. It is also the foundation of all sequencing technologies (see Section 1.1.2). When two complementary sequences of nucleotides are present in a medium, they naturally tend to hybridize to form a double-stranded molecule. Moreover, specific enzymes, the DNA polymerases, are responsible for DNA replication: reading a template strand from its 3′ end to its 5′ end, they synthesize the complementary strand in the 5′ to 3′ direction by pairing a T with an A, an A with a T, a C with a G and a G with a C. These sequences of nucleotides are now known to encode the phenotype (observable characteristics of organisms) through a process known as the Central Dogma, illustrated in Figure 1.2.
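The pairing rules above can be illustrated with a minimal Python sketch. The function names and the example fragment are hypothetical and chosen only for illustration; the sketch assumes a strand is given as a string over the letters A, C, G and T, written from its 5′ end to its 3′ end.

```python
# Watson-Crick pairing rules stated in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in its own 5'->3' direction.

    Because hybridized strands have opposite orientations, the complement
    must be reversed to be read from its 5' end.
    """
    return complement(strand)[::-1]


# Hypothetical fragment of the Watson strand, written 5'->3'.
watson = "ATGGCCTA"
# The matching Crick strand, also written 5'->3'.
crick = reverse_complement(watson)
print(watson, crick)
```

Applying `reverse_complement` twice gives back the original strand, which reflects the symmetry of the two strands.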
Central dogma: In its simplest form, the central dogma can be described as follows. Some regions of the DNA, called genes, contain the inherited genetic information; they are separated by regions said to be ‘non-coding’. Genes are transcribed into Ribonucleic Acid (RNA). RNA is a molecule very similar to DNA, with two main differences: it is usually single-stranded, and Thymine (T) is replaced by the very similar nucleobase Uracil (U). This RNA molecule is then itself translated into sequences of amino acids called proteins, essential components responsible for most regulating activities of the organism.
Depending on the complexity of the organism, non-coding regions represent from 2 % (as in bacteria) to 98 % (as in humans) of the DNA. They not only separate genes from one another, but can also be present inside a gene: the gene is then made of a succession of coding sequences, called exons, and non-coding sequences, called introns. For instance, in the human genome, a gene is made on average of 9 exons, thus separated by 8 introns.
In eukaryotic organisms, on which we will focus from now on, the central dogma can be detailed as follows (see Figure 1.3). In the nucleus, genes (both exons and introns) present on DNA are transcribed into pre-RNA, which undergoes a series of processing steps before reaching maturity. These steps include the splicing out of transcribed introns, the migration of RNA from the nucleus to the cytoplasm, the addition of a 5′ cap (a sequence of a few nucleotides) to the 5′ end, and the addition to the 3′ end of a poly-A tail, a sequence of As whose average length varies between species (50-70 nucleotides in yeast, about 250 nucleotides in mammalian cells). Most importantly, this tail is the last of the transformations undergone by the RNA, and its presence thus characterizes a mature RNA (also called ‘messenger’ RNA, and denoted mRNA). Part of this mRNA will then be translated into proteins through the ‘Genetic Code’: to each triplet of nucleotides corresponds one amino acid.
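The chain transcription, splicing, translation described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a model of the cellular machinery: the gene is assumed to be given as its coding strand (so that transcription amounts to replacing T by U), exon coordinates are assumed known, cap and poly-A tail are ignored, and the toy gene and all function names are hypothetical. The codon table is the standard genetic code, written in the conventional U, C, A, G ordering with `*` marking stop codons.

```python
from itertools import product

# Standard genetic code: codons enumerated in U, C, A, G order for each of
# the three positions; '*' denotes a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Transcribe a coding-strand DNA sequence into RNA (T becomes U)."""
    return coding_dna.replace("T", "U")


def splice(pre_rna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exons, given as 0-based half-open (start, end) intervals."""
    return "".join(pre_rna[start:end] for start, end in exons)


def translate(coding_rna: str) -> str:
    """Translate triplets into one-letter amino acids until a stop codon."""
    protein = []
    for i in range(0, len(coding_rna) - 2, 3):
        amino_acid = CODON_TABLE[coding_rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: exon 1, an 11-nucleotide intron, exon 2.
gene = "ATGGCT" + "GTAAGTTTCAG" + "TGGTAA"
exons = [(0, 6), (17, 23)]

pre_rna = transcribe(gene)          # exons and introns are both transcribed
mature = splice(pre_rna, exons)     # introns are removed
protein = translate(mature)         # one amino acid per triplet
print(pre_rna, mature, protein)
```

In this toy example the spliced RNA is `AUGGCUUGGUAA`, which reads as the codons AUG, GCU, UGG followed by the stop codon UAA.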
This last translation process, directed from 5′ to 3′, only concerns part of the mRNA: at both ends, sequences of nucleotides are not translated. These include the cap and tail, as well as sequences that were present on the DNA and have therefore been transcribed. These sections are called UnTranslated Regions (UTRs), on which we will focus.
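To make the layout of a mature mRNA concrete, the following Python sketch splits a sequence into its 5′ UTR, translated region and 3′ UTR. It relies on a deliberate simplification that is an assumption of this example, not a statement of the text: translation is taken to start at the first AUG and to end at the first in-frame stop codon (UAA, UAG or UGA). Real initiation depends on sequence context, and the function name and example sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna: str):
    """Split an mRNA (written 5'->3') into (5' UTR, coding region, 3' UTR).

    Simplifying assumption: the coding region starts at the first AUG and
    ends with the first in-frame stop codon. Returns None if no such
    region is found.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3  # the stop codon is kept in the coding region
            return mrna[:start], mrna[start:end], mrna[end:]
    return None


# Hypothetical mRNA: untranslated leader, coding region, untranslated
# trailer followed by a short poly-A tail (the 5' cap is omitted).
mrna = "GGCACC" + "AUGGCUUGGUAA" + "CUUAGC" + "AAAAAAAA"
five_prime_utr, coding, three_prime_utr = split_mrna(mrna)
print(len(five_prime_utr), len(coding), len(three_prime_utr))
```

In this sketch the poly-A tail is counted with the 3′ untranslated part, in line with the description above, where the tail belongs to the sequences that are not translated.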
UnTranslated Regions: The central dogma, even in the detailed version described above, is often read in one direction: DNA → RNA → proteins. Because the sequence of UTRs is not directly responsible for protein composition, little attention was initially paid to them. In the last decades, however, it was shown that proteins and RNA in turn regulate DNA, and micro-RNAs and their role were discovered. It was then realized that UTRs do influence the functioning of organisms. Recent studies have shown that they play a number of important roles: for instance, they are binding sites for proteins responsible for translation (??), they promote the initiation of translation (?), they are involved in translational regulation (?) and in the localization of the translated protein in the cell (?). Moreover, mutations (changes of a nucleotide in the DNA sequence) occurring in UTRs may be responsible for genetic diseases (??), for instance by preventing the expression of the gene.
UTRs of a given gene may vary in size depending on environmental conditions. In almost all organisms, a large proportion of genes, 40 to 50 % in mouse and humans (?) and about 72 % in yeast (?), have more than one polyadenylation site (the position in the genome where the poly-A tail will be added), and thus several possible 3′ UTR lengths. Even though 5′ UTRs have been less studied, genes may also allow different 5′ UTR lengths; for instance, ? show that they are longer when genes are up-regulated (i.e. more expressed than in a normal environment).
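The link between polyadenylation sites and 3′ UTR length is simple arithmetic, sketched below in Python. The coordinates are hypothetical, and the convention is an assumption made for this example: the gene lies on the positive strand, `cds_end` is the 0-based position just after the stop codon, and each polyadenylation site is the position just after the last transcribed nucleotide, so that the tail itself is not counted.

```python
def three_prime_utr_lengths(cds_end: int, polya_sites: list[int]) -> list[int]:
    """Return the 3' UTR length implied by each polyadenylation site.

    Sites located before the end of the coding region are ignored.
    Lengths are returned in increasing order.
    """
    return [site - cds_end for site in sorted(polya_sites) if site >= cds_end]


# Hypothetical gene whose coding region ends at position 1200, with three
# alternative polyadenylation sites.
lengths = three_prime_utr_lengths(1200, [1350, 1275, 1500])
print(lengths)
```

With these made-up coordinates the same gene can yield transcripts whose 3′ UTRs are 75, 150 or 300 nucleotides long.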
While each cell of an organism has the exact same genetic information, its specificity is determined by which genes it expresses. For instance, a gene coding for eye color might be expressed in eye cells but not in heart cells, or a gene coding for cell proliferation might be more expressed in an individual affected by cancer than in another individual.
This cell specificity and this gene regulatory role call for methods to assess both genotype and gene expression, with the goal of better understanding organism functionality, a key element in the study of pathologies such as cancer. To this end, it is necessary to have the annotation of the genome of the species studied, i.e. the knowledge of the boundaries between coding and non-coding regions, in order to determine the variations between different individuals of the same species. The studies cited above and many others agree on the importance of UTRs and the need to annotate them, study their mutations, or compare their lengths in different environments.
Recalling that UTRs are present on both DNA and mature RNA molecules (they are transcribed but untranslated sequences), sequencing the latter (i.e. reconstructing its sequence of nucleotides) is an appropriate approach to their study. Section 1.1.2 will briefly recall the history of genome sequencing (be it DNA or RNA) and present a recent technology called ‘Next-Generation Sequencing’ (NGS).
Yeast genome: Yeasts are a group of unicellular eukaryotes comprising about 1500 known species, among which Saccharomyces cerevisiae is the most famous, mostly due to its use as baker's yeast. As is commonly the case, when there is no ambiguity we will use the term ‘yeast’ to refer to this particular species.
The yeast genome is composed of about 12 million nucleotides divided into 16 chromosomes. Approximately 6300 genes, with an average length of 1450 base-pairs (bp), have been annotated, and an official annotation is available on the Saccharomyces Genome Database (SGD) website: www.yeastgenome.org. These genes have a rate of 0.007 introns per gene and account for 72 % of the genome. Figure 1.4 presents a portion of the yeast genome: even though genes represent a large percentage of the total DNA, they are usually well separated from one another, rarely located on both strands at the same position, and very few have introns.
Table of contents:
1.1 Biological framework
1.2 Negative binomial distribution and change-point analysis
2 Segmentation methods for whole genome analysis using RNA-Seq data
2.1 An efficient algorithm for the segmentation of RNA-Seq data
2.2 Model selection
2.3 Constrained HMM approach
2.4 Results on the yeast data-set
3 Segmentation methods for gene annotation using RNA-Seq data
3.1 Method comparison
3.2 Profile comparison
3.3 EBS: R package for Exact Bayesian Segmentation
3.4 Results on the yeast dataset