Basics of Genetics: DNA and RNA
DNA is a nucleic acid that acts as the blueprint for the development and func-tioning of living organisms, and serves as the foundation from which other cellular components are constructed. DNA consists of two long polymers made up of smaller units called nucleotides, and are stored in bundles of varying lengths, known as chro-mosomes, which vary in number between diﬀerent organisms. Each DNA molecule has a double-stranded structure in the form of a double helix (Watson and Crick, 1953), where the polymers run anti-parallel to one another. Specifically, the “head” of one strand, known as the 3’ end, binds to the “tail”, or 5’ end, of the other. The nucleotides forming a DNA strand are made up of a backbone of sugar and phos- phate groups joined by ester bonds (Figure 1.1). One of four bases are attached to each sugar: the purines adenine (A) and cytosine (C), and the pyrimidines guanine (G) and thymine (T). Adenine pairs with thymine, and cytosine pairs with guanine. Thus, the two types of possible base pairings are A with T (two hydrogen bonds) and G with C (three hydrogen bonds).
One of the most important functions of DNA is to encode genes that in turn produce proteins, which are compounds of chains of amino acids. Proteins are essen-tial elements in the majority of cellular functions, including biochemical reactions, metabolism, cell signalling, immune responses, and the cell cycle. There are twenty naturally occurring amino acids (Griﬃths et al., 2008) that are encoded by triplets of nucleotides, or codons, in a DNA strand. The transfer of biological information from DNA to proteins is described by the central dogma of molecular biology (Crick, 1970). First, the information encoded in DNA is transcribed by a polymerase into ribonu-cleic acid (RNA), which in turn is translated into proteins by ribosomes (Figure 1.2). The molecular structure of RNA is similar to that of DNA, with three exceptions: (1) RNA strands are generally single-stranded, and much shorter than DNA, (2) RNA nucleotides contain ribose rather than deoxyribose, and (3) the thymine base present in DNA is replaced by the base uracil (U) in RNA. In total, RNA is comprised of a cap, a start codon to initiate translation, a coding sequence made up of nucleotides arranged into codons, a stop codon to terminate translation, and a trail sequence.
Several types of RNA molecules are active during the process of transcription. After the double strand of DNA is split by the RNA polymerase in the transcription step, a complementary strand of RNA called messenger RNA (mRNA) is formed to copy information from the DNA and carry it out of the nucleus. The mRNA chain is read by ribosomes made up of proteins and ribosomal RNA (rRNA), which form the machine that can read the information carried in the mRNA and translate it into amino acids. As codons are read oﬀ, transfer RNA (tRNA) transfers the corresponding amino acids to a growing polypeptide chain. Once the stop codon is read and a protein has been fully assembled, the mRNA detaches from the ribosome and remains in the cell until it degrades.
Serial Analysis of Gene Expression
At roughly the same time that microarrays were introduced, a sequencing-based method for measuring gene expression, known as serial analysis of gene expression (SAGE), was developed (Velculescu et al., 1995, 1997). Although it also measures mRNA abundance, the SAGE protocols and resulting data are quite diﬀerent from that used to produce microarrays (Serial Analysis of Gene Expression, 2010). The premise of SAGE is that a short transcript fragment (on the order of 10 to 14 base pairs), known as a sequence tag, contains suﬃcient information to uniquely identify the transcript, given that the tag is from a unique position in the genome. By linking several short sequence tags together to form long molecules, known as concatamers, sequencing machines that can read the nucleotides in DNA may be used to sequence the tags.
Dynamic Bayesian Networks
Although BN are a powerful and flexible tool for inferring network structure, there are three major issues that limit their applicability to inferring networks from timecourse data. First, continuous gene expression data must typically be discretized, which incurs a substantial loss of information. In addition, defining threshold levels to use for discretizing data is not straightforward and may penalize genes with naturally small ranges of variation (Friedman, 2000). Second, because a BN must be a directed acyclic graph (DAG), the graphical structure cannot contain any directed cycles (Husmeier et al., 2005). In other words, the network cannot include cycles where all edges point in the same direction (as in the loop on node 3 in Figure 2.3). This is problematic, as sophisticated regulatory circuits, such as feedback loops, are common motifs in biological networks (Brandman and Meyer, 2008). Third, it is possible that expanding the joint probability of two different BN can yield the same factorization if they show alternative ways of describing the same set of independence relationships (Husmeier et al., 2005). This is referred to as equivalence classes (Husmeier et al., 2005), and occurs only if two graphs share the same skeleton (i.e., two different networks differ only in the the direction of an edge) as shown in Figure 2.2. To deal with these issues, we focus on an extension of BN known as Dynamic Bayesian Networks (DBN) that have been widely applied in the context of gene regulatory networks, e.g., Beal et al. (2005), Husmeier (2003), Ong et al. (2002), Perrin et al. (2003), Rangel et al. (2004), and Zou and Conzen (2005). A DBN uses timeseries measurements on a set of random variables to unfold a BN over time, as shown in Figure 2.3, which implies that interactions among random variables occur with a time delay. To avoid an explosion in model complexity in this setting, the parameters are typically set such that the transition probabilities between time points t−1 and t are the same for all t (i.e., a homogeneous Markov model) (Husmeier et al., 2005). As such, gene-to-gene interactions are assumed to be constant across time. In addition, this assumption either implies that biological samples are taken at equidistant time points, or that the dynamics of a biological system are equal across differing time intervals (e.g., the intensity of a biological reaction may decrease over time and be measured over increasing intervals of time). Because edges are directed with respect to the flow of time, the acyclicity graph constraint can be met without eliminating feedback loops, and any ambiguity in the direction of the arrows (i.e., equivalence classes) is resolved (Figure 2.3). In addition, continuous observations may be used in a DBN without the need for discretization. In this work, we focus on the application of two subclasses of DBN for the inference of gene regulatory networks: state space models and autoregressive models.
Linear autoregressive (AR) models are another special case of DBN that have proved to be a useful approximation to the complicated network dynamics underlying time-series expression data (Beal et al., 2005; Opgen-Rhein and Strimmer, 2007; Wilkinson, 2009). AR models are finite-order parametric models commonly used in time series analysis. Let yt denote a set of observations measured on a single variable at time t, t = 1, . . . , T. For notational simplicity we consider only the case with R = 1 replicate, but the extension to multiple replicates is straightforward. The notation AR(p) refers to an autoregressive model of order p, which is defined by yt = 1yt−1 + 2yt−2 + . . . + pyt−p + zt (2.4).
where zt is white noise such that E(zt) = 0, E(z2 t ) = 2, and E(ztzs) = 0 for t 6= s. The model parameters are thus 1, . . . , p, and 2. See Shumway and Stoffer (2000) for a more detailed description of AR models.
Approximate Bayesian Methods for Reverse Engineering Gene Regulatory Networks
Due to the high dimensionality of gene expression data, the limited number of biological replicates and time points typically measured, and the complexity of biological systems themselves, the problem of reverse-engineering networks from transcriptome data demands a specialized suite of appropriate statistical tools and methodologies.
We focus on two models that are able to handle continuous, noisy time-course expression data: state space models (Equations 2.2 and 2.3) and vector autoregressive models (Equation 2.6). Both are simple, yet powerful models that make use of linear relationships and first-order Markovian dynamics to describe the patterns of gene expression measurements over time.
In this context, the Bayesian paradigm is particularly well-suited to the inference of gene regulatory networks. First, the number of possible network structures G increases super-exponentially as the number of genes increases (Husmeier et al., 2005). Because a large number of network structures may yield similarly high likelihoods (due in part to the sparsity of expression data), attempting to infer a single globally optimal structure may be meaningless. Instead, examining the posterior distribution of network structures may be more informative about whether edges in the network can be inferred to have significantly positive or negative values. In addition, a Bayesian framework allows a priori knowledge to be encoded in the prior distribution structure. This knowledge can refer to certain features of the network topology (e.g., sparsity in biological networks or the maximum number of regulators per gene) and to prior biological information about well-characterized pathways gleaned from bioinformatics databases.
The two proposed approaches infer the edges of gene regulatory networks from time-course gene expression data, based on approximate Bayesian methodology (Carlin and Louis, 2000; Beaumont et al., 2002). Both methods incorporate the joint behavior of a set of genes over time, rather than examining each time point independently.
This ensures that the correlation structure of gene expression between adjacent time points is maintained and elucidates important information about the network. In the first approach, we develop an empirical Bayes estimation procedure to perform inference (Rau et al., 2010). This method was motivated by that of Beal et al. (2005), based on variational Bayesian learning of state space models. Due to its restrictive distributional assumptions, it is best suited to exploratory analyses of gene regulatory networks where little a priori biological information is known. In the second approach, we apply a simulation-based Bayesian method to conduct a detailed analysis of small, well-characterized pathways under fewer model assumptions. By exploiting the capabilities of modern computing, this method makes possible inference on the posterior distribution of gene networks, even in cases where the likelihood is intractable or difficult to calculate. The two approaches, while not comparable, are complementary, and help illustrate the need for a variety of network inference methods adapted for different contexts. Both approaches are discussed in the context of Bayesian inference and rely on approximate Bayesian methodologies, known as empirical Bayes and approximate Bayesian computing.
Table of contents :
1.1 Basics of Genetics: DNA and RNA
1.2 Measuring Gene Expression
1.2.2 Serial Analysis of Gene Expression
1.2.3 Next-Generation Sequencing
1.2.4 Time-Course Gene Expression
1.3 Gene Regulatory Networks
2 Methods for Inferring Gene Regulatory Networks
2.1 Current State of Network Inference Methods
2.1.1 Regression Techniques
2.1.2 Integrative Bioinformatics Approaches
2.1.3 Statistical Methods
2.1.4 Optimization Methods
2.2 Dynamic Bayesian Networks
2.2.1 State Space Models
2.2.2 Auto-Regressive Models
2.3 Approximate Bayesian Methods for Reverse Engineering Gene Regulatory Networks
3 Approximate Bayesian Methodology
3.1 Empirical Bayes Methods
3.2 Approximate Bayesian Computation
3.2.1 Approximate Sampling from the Posterior
3.2.2 Post-Adjustment Techniques
3.2.3 Monte Carlo Techniques
4 The Empirical Bayes Dynamic Bayesian Network Algorithm
4.1 Model Selection
4.2 Estimation of Hidden States
4.3 Calculation of Posterior Distributions
4.3.1 Implementation of the EM Algorithm
4.3.2 Initial Values
4.4 EBDBN Algorithm Iterations
4.5 R package: ebdbNet
4.6 Unequally Spaced Observations and Missing Data
4.7.1 Methods for Comparison
4.7.2 Model-Based Simulations
4.7.3 Data-Based Simulations
5 ABC-MCMC Network Algorithm
5.1 Simulating Data for Gene Regulatory Networks
5.2 Distance Function
5.3 Network Proposal Distributions
5.4 Network Prior Distributions
5.5 ABC-Net Implementation
5.5.1 Burn-In Period
5.5.2 Chain Convergence
5.6 Extension to RNA Sequencing Data
5.7.1 Choice of and
5.7.2 Sensitivity to prior distribution bounds
5.7.3 Suitability of VAR(1) simulator
5.7.4 Effect of noise in observed data
5.7.5 Including Prior Biological Information
6 Application to Real Data
6.1 T-Cell Activation Network in Humans
6.2 S.O.S. DNA Repair System in Escherichia coli
6.3 Neurotrophin Signaling Pathway in Mouse
7 Summary and Future Work
7.1 Future Work
LIST OF REFERENCES