The tree-width of pseudoknotted RNAs is typically small

Get Complete Project Material File(s) Now! »

Types of RNA

RNA molecules are single stranded nucleic acids consisting of nucleotides adenine (A), guanine (G), cytosine (C) and uracil (U). Each nucleotide contains three components: a nitrogenous base, a five-carbon sugar (ribose) and a phosphate group.
RNAs are basically classified into two classes: coding RNA and non-coding RNA. Coding RNA refers to the RNA that encodes protein. mRNA is the only known example. Non-coding RNAs (ncRNAs) are functional RNA molecules that do not encode proteins. Similar to proteins, ncRNAs usually form a specific tertiary structure to perform functions. Proteins with similar structures usually have similar sequences. However, ncRNAs often have a conserved secondary structure which may be better preserved than the RNA sequence.
Non-coding RNAs include two classes of notable interest: small ncRNAs and long ncRNAs (lncRNAs) [84]. Small ncRNAs which are the subjects of intensive trans-lational research contain microRNAs (miRNAs), small interfering RNAs (siRNAs), piwi nucleolar RNAs (piRNAs), transcription initiation RNAs (tiRNAs) and so on. lncRNAs (> 200 nucleotides) are mainly involved in cellular processes including alter-native splicing and nuclear import [74]. lncRNAs contain long intergenic non-coding RNAs (lincRNAs), Telomere-associated ncRNAs (TERRAs), Transcribed- ultracon-served regions (T-UCRs) and so on.

RNA structure

Understanding the functions of ncRNAs is clearly not possible without knowing the structures. This section describes the basic ideas about RNA structure. RNA struc-ture is usually divided into three levels: primary, secondary and tertiary structure which is analogous to the classification of protein structure. RNA primary structure refers to the sequence bases in the nucleic acid chain formed by the phosphodiester bonds. To completely determine the primary structure, one has to purify RNA from the source and characterize it by using sequencing methods [54] at the same time to get rid of the e↵ect of post-transcriptional modifications.
RNA secondary structure is a two-dimensional representation of Watson-Crick base pairs (C-G and A-U), the slightly weaker wobble base pairs (G-U) and the ”unpaired” regions. Figure 1.1 gives an example of standard elements found in RNA secondary structure, called bulge loop, multiple loop, internal loop, stacking region and hairpin loop [92]. There are two main computational ways to get the secondary structure of ncRNA from primary structure: (1) using secondary structure prediction programs to fold primary sequence directly; (2) comparative sequence analysis which is a relatively more reliable mean [116]. To verify the correctness of the computation result, one needs to use the biochemical techniques to probe the solution structure of a particular ncRNA [4].
Based on secondary structure, pseudoknotted structure is formed by pairing be-tween a loop and a region located outside (upstream or downstream) of the stem flanking the loop [96]. Since the first RNA pseudoknot was discovered [73], more and more pseudoknots were found. Now more researches have shown that pseudo-knots are essential elements of topology of many structural RNAs, like rRNAs [15], tmRNAs [126]. Some pseudoknots are catalytically active, for example, group I self-splicing intron in which the pseudoknot established the catalytic core [1]. Besides the catalytic activities, pseudoknots also play a key role in altering gene expression by inducing ribosomal frameshifting [29]. Nowadays, there is no accepted nomenclature all kinds of pseudoknots. Therefore, I only introduce several common pseudoknots:
• The H-type pseudoknot. An H-type pseudoknot is formed by the interaction between nucleotides in a hairpin loop in the secondary structure and the single-stranded region as shown in Figure 1.2 A)
• The kissing hairpin pseudoknot. A Kissing hairpin forms when unpaired nu-cleotides in a hairpin region interact with unpaired nucleotides in another hair-pin region as shown in Figure 1.2 B).
• The recursive pseudoknot. A pseudoknot can further have secondary structure elements (including pseudoknots) in the loops as shown in Figure 1.2 C).
• The complex pseudoknot. The complex pseudoknot contains more complex than the former three types. One example is shown in Figure 1.2 D).
The tertiary structure is the three-dimensional arrangement of the secondary struc-ture elements which are associated through numerous van der Waals contacts, spe-cific hydrogen bonds via the formation of a small number of additional Watson-Crick pairs and/or unusual pairs involving hairpin loops or internal bulges [111]. Usually, the tertiary structure is determined by the experiment using X-ray crystallography and NMR spectroscopy. In fact, RNA nucleotides can interact with other nucleotides through many di↵erent ways: (1) base with base, (2) base with ribose sugar, (3) base with phosphate, (4) ribose with ribose, (5) ribose with phosphate, and (6) phosphate with phosphate. Among them, Base-base interaction is the most sequence specific and is the most useful when predicting RNA structure [86]. Actually, as shown in Figure 1.3, each RNA base (purines and pyrimidines) interacts edge-to-edge using one of three edges with other bases. Considering the orientation of glycosidic bonds relative to the hydrogen bonds (cis or trans), there are 12 families of edge-to-edge base pairs [55] as shown in Table 1.1. The often mentioned Watson-Crick base pair in the secondary structure is a cis Watson-Crick/Watson-Crick base pair.
At the tertiary level, many studies have illustrated that the RNA tertiary structure is modular and is composed of recurrent conserved motifs that are embedded in the hairpin loops (HL), internal loops (IL) and multi-helix junction loops (MHJ) in RNA secondary structures [4, 111, 94, 97]. There are several properties for RNA 3D motifs as summarized by Petrov [72]: (1) Modular. They appear as discrete units (2) Autonomous. It means that they can fold into characteristic geometrical structure without being a↵ected by the molecular contexts (3) Recurrent. It means that the 3D motifs can occur over and over in one molecule or in di↵erent RNA tertiary molecules (4) Mutipurpose. One motif can bind with a protein in one structural context and can be involved in the interactions with other RNAs in another context.
One last thing to mention is that RNA folds in a hierarchical and sequential man-ner, which means that the secondary structure forms first, followed by the tertiary structure [99]. In other words, RNA secondary structure often determines tertiary structure. Therefore, an RNA sequence may form multiple secondary structures and consequently multiple tertiary structures. This implies that correct prediction of RNA secondary structure is a key step for predicting RNA tertiary structure from sequence.

READ Computational Efficiency of Geometric Local Search

RNA secondary structure prediction

Secondary Structure

An RNA structure is represented as an arc-annotated sequence (S, P ) where S is a string of characters over⌃= {A, C, G, U} and P is a set of arcs which is a set of pairs of positions in S connecting two distinct characters. There are four levels of arc-annotated sequences as mentioned by Blin et al. [8] and originally proposed by Evans [30]:
• UNLIMITED – no restriction.
• CROSSING – any position has at most one incident arc.
• NESTED – any position has at most one incident arc and no arcs are crossing.
• PLAIN – no arc.
The common secondary structures involving A-U, C-G and G-U belong to the NESTED structure. Pseudoknotted structure which belongs to the CROSSING structure fol-lows the same base pairing rules, but allows crossing arcs [24]. If we ignore the spatial coordinates of the tertiary structure and just draw it planarly, it will normally belong to UNLIMITED structures involving non-Watson-Crick base pairs. In this thesis, we focus on RNA secondary structures including pseudoknots.
Finding the structure of a ncRNA is often the first step to explain its function. There are two general computational methods to predict the secondary structure of an RNA: De novo folding approach when we only have one single sequence and comparative approach when we have additional homologous sequences at hand.
I will first describe De novo folding approach. Historically, the Nussinov algo-rithm [70] is the first attempt to solve the RNA folding problem (predicting RNA secondary structure from single-stranded RNA). The idea is to maximize the number of base pairs in the structure. It is a simple and eﬃcient dynamic programming algo-rithm which starts by calculating the maximum number of base pairs for the smallest subsequences and extends to larger and larger subsequences. However, this algorithm does not give accurate results. There are several reasons. For example, a↵ected by the stacking interaction, the stability of a base pair varies according to the neighboring base pairs. Besides, the algorithm does not distinguish loop size. The energy mini-mization algorithm proposed by Zuker and Stiegler [125] gave an improvement of the Nussinov algorithm. The free energy of an RNA structure is approximated as the sum of the free energies of all the loops and base pair stacks. Zuker dynamic programming algorithm tries to find the minimum free energy (MFE) among all the foldings of a single-stranded sequence. However, there are still limitations for the MFE method. First is the topological limitations which means that this algorithm does not allow pseudoknots. Second, a single-stranded RNA sequence might have multiple confor-mations. To solve the problem, Wuchty et al. designed an algorithm to generate all suboptimal secondary structures within a rangeΔfrom the optimal structure [ 118] and Ding et al. [26] proposed a statistically sampling algorithm to generate the RNA secondary structures according to the Boltzmann equilibrium probability distribution.

Table of contents :

1 Introduction
1.1 Types of RNA
1.2 RNA structure
1.3 RNA secondary structure prediction
1.3.1 Secondary Structure
1.3.2 Pseudoknotted Structures
2 Structure-sequence alignment
2.1 Sequence-sequence alignment
2.2 NESTED structure-sequence alignment
2.3 CROSSING structure-sequence alignment
2.3.1 Han’s algorithm
2.3.2 Matsui’s algorithm
2.3.3 Song’s algorithm
2.3.4 Rinaudo’s algorithm
2.3.5 Other formula
3 Methods
3.1 Model and definitions
3.1.1 Tree decomposition and its practical computation
3.1.1.1 Definitions
3.1.1.2 Practical computation of the tree decomposition
3.1.2 Scoring scheme
3.1.2.1 RIBOSUM
3.1.2.2 Cost function for structure-sequence alignment
3.1.2.3 Scoring a tree decomposition
3.1.2.4 LCost function
3.1.3 Banded dynamic programming
3.2 Probabilistic structure-sequence alignment
3.2.1 Derivation and derivation tree
3.2.2 Completeness and unambiguity of Rinaudo’s DP scheme
3.2.3 Computing the partition function
3.2.4 Stochastic backtrack algorithm
3.2.5 Inside-outside algorithm
3.2.6 Maximum expected accuracy alignment
3.3 Enumerating suboptimal alignments
3.3.1 ! near-optimal alignment
3.3.2 K-best suboptimal alignment
3.3.2.1 Recurrence equation
3.3.2.2 Algorithm
4 Results
4.1 Using LiCoRNA
4.2 The tree-width of pseudoknotted RNAs is typically small
4.2.1 Datasets
4.2.2 Results
4.3 Predictive accuracy of LiCoRNA
4.3.1 Dataset
4.3.2 Competitors
4.3.3 Evaluation metrics
4.3.4 Results
4.4 Analyzing near optimal solutions
4.4.1 Dataset
4.4.2 Reference alignments have quasi optimal scores
4.4.3 Reference alignment are not far down the list of suboptimal alignments
4.5 Stochastic sampling enables the detection of ambiguously-aligned regions
4.5.1 Evaluation metrics
4.5.2 An example
4.6 Structure-based realignment of RFAM families improves support for pseudoknotted base-pairs
4.6.1 Dataset
4.6.2 Evaluation metrics
4.6.3 Results
4.7 Advantages and disadvantages of LiCoRNA
5 Conclusion and Perspectives
5.1 Conclusion
5.2 Perspectives
5.2.1 Score scheme of LiCoRNA
5.2.2 Searching conserved structures in genomes
5.2.3 Identification of RNA 3D motifs
Bibliography