An Algorithm for Non-Redundant Sampling of RNA Secondary Structures 

mRNA and ncRNA, and the discovery of the secondary structure

The discovery of the DNA structure made it very unlikely that DNA itself serves directly as the template for protein synthesis. The experiments of Jean Brachet in the 1940s suggested that RNA is related to protein synthesis [11]. This led Crick to postulate, in 1957, his Central Dogma of Molecular Biology [19], which proposed that RNA is the intermediate carrying the information from the DNA in the nucleus to the site of protein synthesis in the cytoplasm. This was later confirmed by the discovery of the lac operon of E. coli and the experiments performed on it by Jacques Monod and Francois Jacob [70]. The unstable RNA they found has since become known as messenger RNA (mRNA). This finding, however, opened a series of new questions, such as what intermediate molecule associates the genetic code with the correct amino acids, and of course what constitutes the protein synthesis machinery itself. These questions led to the discovery of the first classes of non-coding RNAs (ncRNA).
The ribosomes, the centers of protein synthesis, are themselves built from a special subclass of ncRNAs, the ribosomal RNA (rRNA). They were first observed in 1938 by Albert Claude [75] as cytoplasmic masses he called microsomes, eventually identified as ribosomes by George Palade [74]. The transfer RNA (tRNA), the link ensuring the correspondence between the genetic code and the amino acids, was theoretically postulated by Crick. It was confirmed by Robert Holley et al. in 1964, when they reported the 77-nucleotide sequence of yeast alanine tRNA along with proposed secondary structures for it [47]. The last point is, within the context of this thesis, particularly important because it demonstrates that ncRNAs, to which tRNA and rRNA belong, need a specific structure to perform their function. It was then no surprise that a number of methods to identify these structures emerged, first empirical ones such as X-ray crystallography and Nuclear Magnetic Resonance (NMR), later computational predictions. This thesis concentrates mainly on the efficient computational prediction of RNA secondary structures and their kinetics.

Representation of RNA secondary structures

A secondary structure can be depicted in various ways, each with its advantages and weak points. The representations used throughout this work are listed here, along with an example in Figure 2.2.

Radial representation (2.2A): The paired segments are drawn as rectangular, ladder-like structures, while the unpaired segments are organized in circles from which the paired regions branch. In the case of a linear RNA, its unpaired bases can be drawn linearly or in a circle. This visualization clearly depicts the different local substructures (see Section 2.4.2), which helps in understanding the function of the molecule. On the other hand, it handles crossing base pairs poorly. It is best used when one needs to explain the function of a secondary structure.

Circular representation (2.2B): The (linear) RNA is drawn along a circle, and each base pair is represented by a line or an arc. This representation has no problem depicting crossing base pairs, but it is much harder to interpret. It will not be employed in this work.

Linear representation (2.2C): The RNA is drawn as a line, and base pairs are depicted by arcs connecting the concerned nucleotides. This representation can be used to compare two different structures, each drawn on one side of the line. It is also an easy way to depict and explain the recursions of secondary structure prediction algorithms, which is its main use in this work. However, it can be harder to interpret, especially for long-range interactions.

Dot-bracket notation (2.2D): This is a textual representation of the secondary structure using specific symbols. A dot '.' indicates an unpaired base, and a base pair is represented by the brackets '(' and ')'. Opening and closing brackets match in the same manner as in mathematics: six opened brackets must be closed by six closing brackets, and the last bracket opened is the first one closed. In the case of crossing base pairs, one of the crossing pairs can be denoted by square brackets, possibly followed by a letter, e.g. '['-']' or '[A'-']A'. This format is hard for a human to interpret, though visualization software such as VARNA [25] can draw the corresponding secondary structure. Its main purpose is to be machine-readable, which is why it became a standard input and output format for RNA secondary structure prediction software.
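To make the dot-bracket convention concrete, the following Python sketch converts a dot-bracket string into a list of base-pair positions using a stack. The function name and the restriction to a single extra bracket level '['/']' (covering only the simplest crossing case) are illustrative choices for this example, not the behaviour of any particular tool.

```python
def parse_dot_bracket(db):
    """Convert a dot-bracket string into a sorted list of (i, j) base pairs.

    '.' marks an unpaired base; '(' and ')' match like parentheses in
    mathematics. One extra level '[' / ']' is handled in the same way,
    which covers the simplest crossing (pseudoknotted) case.
    Positions are 0-based.
    """
    stacks = {'(': [], '[': []}            # one stack per bracket type
    closing = {')': '(', ']': '['}
    pairs = []
    for j, symbol in enumerate(db):
        if symbol in stacks:               # opening bracket: remember its position
            stacks[symbol].append(j)
        elif symbol in closing:            # closing bracket: pop its partner
            i = stacks[closing[symbol]].pop()
            pairs.append((i, j))
        # '.' (unpaired) needs no action
    return sorted(pairs)

# Example: a small hairpin plus one crossing pair written with square brackets.
print(parse_dot_bracket("((..[.))..]"))    # [(0, 7), (1, 6), (4, 10)]
```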

Combinatorics of RNA secondary structures

The first methods used to determine RNA secondary structure, and molecular structures in general, were empirical. Even the discovery of the DNA double helix resulted from an X-ray crystallography picture [12]. However, crystallography is impossible if the analyzed substance does not crystallize. NMR does not suffer from this problem, and unlike for proteins, the length of the molecule is not a limiting factor either, since RNAs are shorter molecules. However, NMR and the other experimental methods for determining RNA structure, such as X-ray crystallography and cryogenic electron microscopy, have complicated protocols that take time to prepare and remain expensive even today [10]. Therefore, with the advent of computational technology, attention turned to the computational prediction of molecular structure, where the potential to make the process cheaper and more efficient is much greater.

About Dynamic Programming

Enumerating all possible secondary structures with a simple recursive function is not the most efficient way to handle this problem. Using the algorithm of Smith and Waterman, Michael Zuker and David Sankoff showed that, for a sequence with uniformly distributed nucleotides, the number of secondary structures of an RNA of length n grows asymptotically as roughly $1.867^n$ [111]; it rises exponentially. Therefore, trying all possible combinations would result in an exponential complexity. This is why approaches using Dynamic Programming (DP) are extremely useful. A DP scheme decomposes a central problem, here the secondary structure of w, into smaller problems, here the structures of the subsequences w[i,j], solves those, and uses their results to solve the main problem. Each computation is done only once and the result is stored in DP matrices whose size depends on the number of states that have to be saved, as done by the counting function of Equation 2.1. This makes the computation much faster than a brute-force enumeration of all possible solutions, by avoiding the repeated computation of the solution to a given subproblem. DP algorithms have to respect one main principle, called Bellman's Principle of Optimality [7]. To quote Bellman: « An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. »
Translated, every intermediate solution that leads to the main optimal solution must also be optimal; these are the optimal solutions to the corresponding subproblems. Note that DP schemes may also be applied in cases where the algorithm would build an optimal solution from non-optimal solutions to subproblems or, conversely, where an optimal solution to a subproblem does not lead to an optimal global solution, but this reduces them to mere heuristics. All of the approaches used for structure prediction and mentioned here respect Bellman's Principle of Optimality unless noted otherwise. The core of most of them is to decompose w into the subsequences w[i,j], 1 ≤ i < j ≤ n, and to solve the subproblems for sequences of increasing length l = j - i + 1. The result for each subproblem is computed and memorized. Consequently, the results for substrings of length 1, 2, ..., l-1 are used to compute that for sequences of length l. The main problem is then solved for the full sequence w. The same principle can be employed to compute suboptimal structures.
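As a minimal illustration of this bottom-up principle, the Python sketch below counts the secondary structures of a sequence w by filling a DP matrix over subsequences of increasing length, so that every subproblem is solved exactly once. The recursion used (the last base is either unpaired or paired to some position k), the set of allowed base pairs and the minimum hairpin size theta = 3 are assumptions made for this example; the counting function of Equation 2.1 may differ in its exact form.

```python
def count_structures(w, theta=3):
    """Count the secondary structures of w by dynamic programming.

    The table is filled bottom-up over subsequences of increasing length,
    as described above. C[i][j] stores the number of structures of w[i..j];
    intervals too short to hold a base pair contain only the empty structure.
    The allowed pairs and the minimum hairpin size are illustrative choices.
    """
    allowed = {('A', 'U'), ('U', 'A'), ('G', 'C'),
               ('C', 'G'), ('G', 'U'), ('U', 'G')}
    n = len(w)
    if n == 0:
        return 1
    C = [[1] * n for _ in range(n)]
    for length in range(theta + 2, n + 1):        # subproblems of growing length
        for i in range(0, n - length + 1):
            j = i + length - 1
            total = C[i][j - 1]                   # case: position j left unpaired
            for k in range(i, j - theta):         # case: j paired with k
                if (w[k], w[j]) in allowed:
                    outside = C[i][k - 1] if k > i else 1
                    total += outside * C[k + 1][j - 1]
            C[i][j] = total
    return C[0][n - 1]

print(count_structures("GGGAAAUCCC"))
```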

A decomposition of RNA secondary structure

There are multiple ways to decompose a secondary structure. The simplest was introduced by the Nussinov algorithm, where the only criterion is whether a base is paired or not. For the Turner energy model to apply, the structures are instead decomposed into elements called loops. These can be sorted into groups depending on their properties, within which the energy is computed in the same manner. This way, the DP schemes can be broken down into a number of cases depending on the element being treated, similarly to distinguishing paired and unpaired nucleotides in the scheme used by the Nussinov algorithm. Due to the independence condition, the free energy E of a secondary structure S is given by:
$$E = \sum_{l \in S} E_l \qquad (2.5)$$
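As a small illustration of this additivity, the snippet below sums the contributions of the loops of an already decomposed structure. The loop labels and energy values are hypothetical placeholders and do not come from the Turner parameter tables.

```python
# Minimal illustration of Equation 2.5: once a structure S has been decomposed
# into loops, its free energy is the sum of the individual loop contributions.
loop_energies = {                     # kcal/mol, illustrative values only
    "stack CG/GC": -3.3,
    "stack GC/AU": -2.1,
    "hairpin loop of size 4": 5.6,
}
decomposition = ["stack CG/GC", "stack GC/AU", "hairpin loop of size 4"]
E = sum(loop_energies[loop] for loop in decomposition)   # E = sum over loops of S
print(f"E(S) = {E:.1f} kcal/mol")
```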

Table of contents:

1 Introduction 
1.1 Preamble
1.2 The Plan of This Work
1.3 Nucleic Acids: A Look to the Past
1.3.1 The Discovery of Nucleic Acids
1.3.2 mRNA and ncRNA, and the discovery of the secondary structure
1.4 A Structural Intermezzo – Levels of RNA structures
2 RNA secondary structure prediction – What has been done 
2.1 Notions related to the RNA secondary structures
2.2 Representation of RNA secondary structures
2.3 Combinatorics of RNA secondary structures
2.3.1 The number of secondary structures
2.3.2 About Dynamic Programming
2.3.3 Nussinov Algorithm
2.4 Enter the Thermodynamics
2.4.1 About energy models
2.4.2 A decomposition of RNA secondary structure
2.4.3 Zuker Algorithm
2.5 From RNA thermodynamics to kinetics
2.5.1 RNA folding landscape
2.5.2 Selecting suboptimal secondary structures
2.5.3 Boltzmann Distribution
2.5.4 Completeness, Unambiguity and Acyclicity
2.5.5 McCaskill Algorithm
2.5.6 Sampling and Stochastic backtrack
2.6 Formalization of Dynamic Programming Schemes
2.6.1 Definitions
2.6.2 Applications
2.6.3 Formalizing VZu Algorithm
2.6.4 Formalizing McCaskill Algorithm
2.6.5 Stochastic backtrack
2.7 The relation between RNA thermodynamics and RNA kinetics
2.8 Kinetics of RNA folding landscape
2.8.1 Simplification of a landscape
2.8.2 RNA Kinetics Study
2.9 Coarse-grained model of a folding landscape
2.9.1 Kinetic Simulation
2.10 Sampling Caveats
3 RNANR: An Algorithm for Non-Redundant Sampling of RNA Secondary Structures 
3.1 Saffarian Algorithm
3.1.1 State of the Art and the general principle
3.1.2 Parametrization of the Saffarian Algorithm
3.1.3 Computing the Partition Function
3.1.4 Formalization
3.1.5 Comparison of Saffarian and Turner Local Minima
3.2 The Non-redundant Sampling
3.2.1 Parse tree and Incomplete structures
3.2.2 Computing the probability with forbidden structures
3.2.3 Adjacent data-structure for non-redundant sampling
3.2.4 Non-redundant sampling algorithm
3.3 RNANR: Implementation and results
3.3.1 Numerical instability issues
3.3.2 Sorting flat structures
3.3.3 Non-redundant sampling and the speed gain
3.3.4 Landscape quality
3.4 Non-redundancy for other DP schemes
3.4.1 RNAsubopt
3.5 Conclusion – Non-redundant sampling
4 Estimating from non-redundant sampling sets 
4.1 The non-redundant estimator
4.1.1 Empirical mean
4.1.2 Removing the dependency from the estimator
4.2 Applications of non-redundant estimator
4.2.1 Estimating base pair probabilities
4.2.2 Graph distance distribution
4.2.3 Shape probabilities
4.3 Resolving the convergence problems
4.4 Conclusion – Non-redundant estimator
5 Extending towards pseudoknots 
5.1 State of the Art – pknotsRG algorithm
5.2 Generalizing the pknotsRG algorithm
5.2.1 Imperfect helices
5.2.2 Assembling the pseudoknots
5.2.3 Pseudoknots and partially ordered sets
5.2.4 DP scheme for secondary structures with pseudoknots
5.2.5 Formalizing the extension
5.2.6 Implementation
5.3 Results
5.3.1 Toy Examples
5.3.2 Comparison with pKiss
5.4 Complexity Reduction of the PKnotFind algorithm
5.5 Pseudoknots and non-redundant sampling
5.6 Conclusion – Extension on Pseudoknots
6 Conclusion and Perspectives
