Bayesian inference of sampled ancestor trees
Phylogenetic analysis uses molecular sequence data to infer evolutionary relationships between organisms and to infer evolutionary parameters. Since the introduction of Bayesian inference in phylogenetics (Yang and Rannala, 1997; Mau et al., 1999; Huelsenbeck and Ronquist, 2001), it has become the standard approach for fully probabilistic inference of evolutionary history with many popular implementations (Drummond et al., 2012; Ronquist et al., 2012b; Bouckaert et al., 2014; Lartillot et al., 2009) of Markov chain Monte Carlo (MCMC) (Metropolis et al., 1953; Hastings, 1970) sampling over the space of phylogenetic trees. Initial descriptions of Bayesian phylogenetic analysis were restricted to considering bifurcating trees (Yang and Rannala, 1997; Mau et al., 1999), but have been extended to include explicit polytomies (Lewis et al., 2005). Here we tackle phylogenetic inference with trees that may contain sampled ancestors (Gavryushkina et al., 2013).
Standard phylogenetic models developed for inferring the evolutionary past of present day species assume that all samples are terminal (leaf) nodes in the estimated phylogenetic tree. However, serially sampled data generated by different evolutionary processes can be analysed using phylogenetic methods (Drummond et al., 2003) and, in some cases, the assumption that all sampled taxa are leaf nodes is not appropriate.
One case in point is when inferring epidemiological parameters from viral sequence data obtained from infected hosts (Pybus et al., 2001; Grenfell et al., 2004; Stadler et al., 2011; Ypma et al., 2013; Kühnert et al., 2014). Viral sequences are obtained from distinct hosts and treated as samples from the transmission process. Using standard phylogenetic models (such as coalescent or birth-death models) to describe the infectious disease transmission process entails the assumption that a host becomes uninfectious at sampling (where sampling is obtaining a sequence or sequences from the pathogen population residing in a single infected host). However, in many cases, hosts remain infectious after sampling and, when sampling is sufficiently dense, the probability of sampling an individual that later infects another individual which is also sampled is not negligible (Volz and Frost, 2013; Teunis et al., 2013; Vrancken et al., 2014).
A recent analysis of a well-characterised HIV transmission chain (Vrancken et al., 2014) employed a hierarchical model of a gene tree inside a transmission tree to infer the differences in evolutionary rates (substitution rates) within and among hosts. Hierarchical modelling of gene trees inside transmission trees has also been used to infer transmission events for small epidemic outbreaks where epidemiological data is available in the form of known infection and recovery times for each host (Ypma et al., 2013). In both cases the inference of transmission trees assumes complete sampling of the hosts involved, and the host sampling process is not explicitly modelled.
Incomplete sampling is explicitly accounted for by birth-death-sampling models (Stadler, 2010; Stadler et al., 2011; Didier et al., 2012; Stadler et al., 2013), and the probability density functions of the trees are available in closed form, which makes these models tractable for use in Bayesian inference. The birth-death-sampling models do not assume that individuals are removed from the tree process after sampling. However, using models that allow for infection after sampling has not been possible due to a lack of software, meaning that many analyses simply ignore the possibility of sampled ancestors (Stadler et al., 2011, 2013).
Another problem that may require sampled ancestor models is inferring species divergence times using fossil data. Without the means to calibrate the times of divergences, the lengths of branches in the estimated molecular phylogeny of contemporaneous sequences are typically described in units of expected substitutions per site. Geologically dated fossil data can be employed to calibrate a phylogenetic tree, thus providing absolute branch lengths in calendar units. The most common approach here is to specify age limits or a probability density function on specific divergence times in the phylogeny, where the constraints are defined using the fossil data (Sanderson, 1997; Thorne et al., 1998; Drummond et al., 2006; Rannala and Yang, 2007; Ho and Phillips, 2009). There are several drawbacks connected to this approach (Ronquist et al., 2012a; Heath et al., 2014). First, there is potential for inconsistency when applying two priors on the phylogeny (Heled and Drummond, 2012): a calibration prior on one or more divergence times and a tree process prior on the entire tree. Second, it is not obvious how to specify a calibration density so that it accurately reflects prior knowledge about divergence times (Ronquist et al., 2012a; Heath et al., 2014). Finally, such densities usually only use the oldest fossil within a particular clade, thus discarding much of the information available in the fossil record (Heath et al., 2014).
Other methods for dating with fossils have been developed recently (Laurin, 2012). One approach that addresses the problems of the node calibration method requires modelling fossilisation events as a part of the tree process prior. This allows for the joint analysis of fossil and recent taxa together in a unified framework (Pyron, 2011; Wood et al., 2013; Ronquist et al., 2012a; Schrago et al., 2013; Heath et al., 2014; Silvestro et al., 2014). Models that jointly describe the processes of macroevolution and fossilisation should account for possible ancestor-descendant relationships between fossil and living species (Foote, 1996), and thus include sampled ancestors.
Wilkinson and Tavaré (2009) used the inhomogeneous birth-death process with sampled ancestors and approximate Bayesian computation methods to estimate divergence times from fossil records and known features of the extant phylogeny. A birth-death model with sampled ancestors was used by Didier et al. (2012) to estimate speciation and extinction rates from phylogenies. Heath et al. (2014) have used this model (they call it the fossilised birth-death process) to explicitly model fossilisation events and estimate divergence times from molecular data and fossil records in a Bayesian framework. In their approach, the tree topology relating the extant species has to be known for the inference (Heath et al., 2014). A method that simultaneously estimates the divergence times and the tree topology while incorporating sampled fossil taxa is therefore an obvious next step.
Full Bayesian MCMC inference using models with sampled ancestors is complicated by the fact that such models produce trees, which we call sampled ancestor trees (Gavryushkina et al., 2013), that are not strictly binary. They may have sampled nodes that lie on branches, forming an internal node with one direct ancestor and one direct descendant. Thus, modelling sampled ancestors induces a tree space in which the number of dimensions varies from tree to tree (as a function of the number of sampled ancestors), which necessitates extensions to the standard MCMC tree algorithms.
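To make the variable dimension concrete, the following minimal Python sketch represents a sampled ancestor tree and counts its node types. The `Node` class and the `walk` and `summarize` helpers are hypothetical names introduced for illustration only; they are not the representation used by BEAST2 or by any published implementation. The sketch assumes that every internal node either bifurcates or is a sampled node with exactly one child.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `label` is set only for sampled nodes."""
    height: float                      # time before the present
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def summarize(root: Node) -> Dict[str, int]:
    """Count leaves, sampled ancestors (degree-two nodes) and bifurcations."""
    counts = {"leaves": 0, "sampled_ancestors": 0, "bifurcations": 0}
    for node in walk(root):
        k = len(node.children)
        if k == 0:
            counts["leaves"] += 1
        elif k == 1:
            # A one-child internal node must be a sampled node on a branch.
            if node.label is None:
                raise ValueError("unsampled node with a single child")
            counts["sampled_ancestors"] += 1
        elif k == 2:
            counts["bifurcations"] += 1
        else:
            raise ValueError("polytomies are not part of this sketch")
    return counts


# Three samples A, B, C, where B is a direct ancestor of C.
example = Node(3.0, children=[
    Node(1.0, "A"),
    Node(2.0, "B", [Node(0.0, "C")]),
])
print(summarize(example))
# {'leaves': 2, 'sampled_ancestors': 1, 'bifurcations': 1}
```

With n sampled nodes of which k are sampled ancestors, such a tree has n - k leaves and n - k - 1 bifurcation nodes, so the number of bifurcation times to be estimated shrinks as k grows. In the example n = 3 and k = 1, leaving a single bifurcation.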
Here we describe a reversible-jump MCMC proposal kernel (Green, 1995) to effectively traverse the space of sampled ancestor trees and implement it within the BEAST2 software platform (Bouckaert et al., 2014). We study the limitations of birth-death models with sampled ancestors and extend the birth-death skyline model (Stadler et al., 2013) to sampled ancestor trees. We apply the new posterior sampler to two types of data: a serially sampled viral dataset (from HIV), and a molecular phylogeny of bear sequences with fossil samples.
Tree models with sampled ancestors
In this section, we consider birth-death sampling models (Stadler, 2010; Stadler et al., 2011; Didier et al., 2012; Stadler et al., 2013) under the assumption that sampled individuals are not necessarily removed from the process at sampling. This results in a type of phylogenetic tree that may contain degree-two nodes called sampled ancestors.
An important characteristic of the models we consider here is incomplete sampling, i.e., we only observe a part of the tree produced by the process. Consider a birth-death process that starts at some point in time (the time of origin) with one lineage, after which each existing lineage may bifurcate or go extinct. Further, the lineages are randomly sampled through time. An example of a full tree produced by such a process is shown in Figure 3.1 on the left. We have information only about the portion of the process that produces the samples, shown as labelled nodes, and do not observe the full tree. Thus we only consider the subtree relating the samples, which is called the reconstructed tree (or the sampled tree) and is shown on the right of Figure 3.1.
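The step from the full tree to the reconstructed tree can be sketched as a recursive pruning. The Python below is an illustrative sketch, not the authors' code: the dictionary layout and the `make_node` and `reconstruct` helpers are assumptions introduced here, node times are omitted, and the stem leading back to the time of origin is ignored.

```python
def make_node(name, sampled=False, children=()):
    """Hypothetical helper: a tree node stored as a plain dictionary."""
    return {"name": name, "sampled": sampled, "children": list(children)}


def reconstruct(node):
    """Return the reconstructed (sampled) tree below `node`, or None.

    Lineages without sampled descendants are dropped, and unsampled nodes
    left with a single child are collapsed. Sampled nodes are always kept,
    so a sampled node with a surviving child becomes a sampled ancestor.
    """
    kept = []
    for child in node["children"]:
        pruned = reconstruct(child)
        if pruned is not None:
            kept.append(pruned)
    if node["sampled"]:
        return make_node(node["name"], True, kept)
    if not kept:
        return None          # unobserved lineage: no sample at or below it
    if len(kept) == 1:
        return kept[0]       # unsampled degree-two node: collapse it
    return make_node(node["name"], False, kept)


# Full tree: samples A, B and C plus three unobserved extinct lineages.
full = make_node("root", children=[
    make_node("u1", children=[
        make_node("A", True, [
            make_node("u3", children=[
                make_node("B", True),
                make_node("extinct2"),
            ]),
        ]),
        make_node("u2", children=[
            make_node("C", True),
            make_node("extinct3"),
        ]),
    ]),
    make_node("extinct1"),
])

sampled_tree = reconstruct(full)
print([child["name"] for child in sampled_tree["children"]])  # ['A', 'C']
```

In this example A keeps B as its only child after pruning, so A appears in the reconstructed tree as a sampled ancestor of B, while the three extinct lineages disappear entirely.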
The sampled ancestor birth-death model
Here we describe a serially-sampled birth-death model with sampled ancestors (Stadler, 2010; Stadler et al., 2011). First we describe a variant of the model suited to modelling transmission processes and then we extend the model to describe speciation and fossilisation processes.
The process begins at the time of origin t_or > 0, measured in time units before the present. Moving towards the present, each existing lineage bifurcates or goes extinct according to two independent Poisson processes with constant rates λ and μ, respectively. Concurrently, each lineage is sampled with Poisson rate ψ and is removed from the process at sampling with probability r.
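The process just described can be simulated with a standard event-driven (Gillespie-style) scheme. The following Python sketch is offered under stated assumptions rather than as the authors' implementation: the function name `simulate_sabd` and its arguments are hypothetical, with `lam`, `mu` and `psi` standing for the bifurcation, extinction and sampling rates, `r` for the removal probability and `t_or` for the time of origin. It tracks only the number of lineages and the sampling events, not the tree itself, and it assumes that at least one of the three rates is positive.

```python
import random


def simulate_sabd(lam, mu, psi, r, t_or, rng):
    """Simulate lineage counts under a birth-death process with sampling.

    Time runs from t_or (the origin) down to 0 (the present). Returns the
    number of lineages alive at the present and a list of (time, removed)
    pairs, one per sampling event. Assumes lam + mu + psi > 0.
    """
    t = t_or
    n = 1                      # the process starts with a single lineage
    samples = []
    total = lam + mu + psi     # total event rate per lineage
    while n > 0:
        # Waiting time to the next event across all n lineages.
        t -= rng.expovariate(n * total)
        if t <= 0:
            break              # the present was reached before the event
        u = rng.random() * total
        if u < lam:
            n += 1             # bifurcation
        elif u < lam + mu:
            n -= 1             # extinction
        else:
            removed = rng.random() < r
            samples.append((t, removed))
            if removed:
                n -= 1         # lineage leaves the process at sampling
    return n, samples


rng = random.Random(42)
alive, samples = simulate_sabd(lam=1.0, mu=0.4, psi=0.3, r=0.5, t_or=3.0, rng=rng)
not_removed = sum(1 for _, removed in samples if not removed)
print(alive, len(samples), not_removed)
```

Samples that are not removed are only candidates for sampled ancestors: such a sample appears as a sampled ancestor in the reconstructed tree only if its lineage goes on to leave at least one further sample. Setting `r=1.0` recovers the case in which every sampled individual is removed and every sample is a leaf.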