Splits selection for phylogenetic networks
Background and motivation
Visualising heterogeneity is the subject of this chapter, and of Chapters 4 and 5. All three chapters focus on networks. In this chapter we introduce a statistically rigorous way of inferring split networks.
Phylogenetic trees are awkward subjects for statistical analysis. The trees are mixtures of discrete (topology) and continuous (branch length) parameters, and the continuous parameters for one tree generally do not correspond to continuous parameters for another. The combinatorics of the space of trees is itself arbitrarily complex (Semple and Steel, 2003).
Inference of phylogenetic trees is carried out in two steps.The rst step is to infer the topology of the tree; that is to determine which branches represent the evolutionary history. The second step is to infer the branch lengths.
In a similar manner, phylogenetic split network inference is a two step procedure: choosing a set of splits, and estimating the lengths of the splits. The lengths of the splits are called split weights.
Estimating the split weights is of central importance to network methods such as neighbornet (Bryant and Moulton, 2004) and Q-Net (Grunewald et al., 2007) where poor estima-tion can lead to an overly complex representation of the alignment. The neighbor-net networks have been criticised as having a high false positive rate; that is, they have too many branches in their networks (Nakleh et al., 2005). Usually, a proportion of the splits are associated with small split weights and including them only clutters the phylogenetic signal.
In this chapter, we use a modelling framework based on linear regression to estimate the split weights, sub-setting some of them to zero. We apply linear regression and the positive LASSO algorithm to the problem of picking the neighbor-net splits and split weights. We carry out several experiments into components of the regression framework and into the e ectiveness of the framework in reducing the clutter seen in neighbor-net networks.
In this section we introduce linear models in phylogenetics, estimating the covariance matrix and our method for simulating alignments.
Regression estimation of split weights
In this chapter we focus on distance based phylogenetic methods. The distances are measured on pairs of taxa and therefore they do not contain information on higher order relationships. However, Felsenstein (2004) notes that little information is lost and that the majority of the information on evolutionary relationships in contained in pairwise distances. Therefore, distance based methods assume the estimated pairwise distances represent the true evolutionary time plus noise.
Given a method of clustering the taxa and de ning the tree structure, the distances are used to calculate the branch lengths. Each distance is the sum of the lengths of the branches that separate the two taxa. The branch lengths are typically estimated using least squares (Cavalli-Sforza and Edwards, 1967; Vach, 1989; Rzhetsky and Nei, 1992; Gascuel, 1997).
We also assume that the observed distances are equal to the true distances (or additive distances) plus some error. The true distance between two taxa is the sum of the branch lengths on the path connecting them. Hence, the vector of observed distances y can be written
where each row of y is indexed by a pair of taxa and each row of X is also indexed by pairs of taxa. The entries of X are ones and zeros picking out exactly those edges on the corresponding path between the pairs of taxa. The vector is the vector of estimated split weights. Therefore, the expected distances are modelled using ij = (X )ij. It is the inferred i values that contain information on evolutionary relationships.
An example of a split matrix for a ve taxa tree can be seen in Figure 2.1.
The neighbor-net method of Bryant and Moulton (2004) used non-negative least squares to estimate the split weights given the distance vector and a set of splits known as the circular splits. We use linear regression. The model of Equation (2.1) extends directly to a split network, so long as X contains the splits for the network.
The rst step of either neighbor-net or Q-Net is to select a set of candidate splits. For neighbor-net these splits come from a circular ordering of the taxa; that is, there is an ordering of the taxa t1; t2; : : : ; tM such that every split is of the form fti; ti+1; : : : ; tjgjT fti; : : : ; tjg for some i and j satisfying 1 i j < M. Our methods however are not limited to this set, and can be applied to any set of splits.
Linear regression makes four assumptions, namely a linear model is appropriate, the errors are independently and identically distributed according to the normal distribution; the columns of X are linearly independent; and each element of the predictive vector is an independent observation. While the rst three pose no problems in the implementation of the framework the fourth does not hold as the taxa have evolutionary history in common and therefore are not independent. One potential way to deal with this assumption is to estimate a covariance matrix for the error in the distances and apply a transformation to the regression problem. Therefore we considered two ways of estimating this covariance matrix, both presented in Bulmer (1991).
Estimating the covariance
The rst covariance estimator makes no assumptions about the evolutionary structure of the data. It only assumes that the distances have been corrected using the Jukes-Cantor distance correction (Jukes and Cantor, 1969). The formula for the variance was also presented in Kimura and Ohta (1972). It applies the delta method (Stuart and Ord, 1987) to give approximations for the variance and covariances.
where L is the number of sites, pij;kl is the proportion of site where i di ers from j and k di ers from l, and pij is the proportion of site where i di ers from j. The quantity pij has been assumed to be a proportion from a binomial distribution. The parameter b is the expected proportion of sites di erences when the two sequences are independent of each other. These formulae can be extended to handle missing data.
The second estimator, also from Bulmer (1991), is based on a measure of the shared history between two taxa. Therefore it assumes there is an evolutionary relationship between the taxa and that this is represented by the common path. The formula was also presented in Nei and Jin (1989).
Conveniently, this formula makes no assumption that the splits are from a tree; therefore, the formula can be used for both trees and split networks. This is valid as distance based tree methods can be extended by split networks as shown in Bryant (2005).
When applying covariance matrix, that is when we transform the data, we use the inverse of the upper triangular Cholesky decomposition matrix.
There are standard ways of simulating sequence alignments on a tree. It is possible to construct a topology manually; or you can construct one according to a distribution. Once the tree has been chosen the alignment is created by evolving sites down the tree.
Simulating networks is more complex. One widely-used approach is to use recombination within a single population to simulate alignments which are non-treelike. One algorithm for simulating recombination is that of Hudson (1983).
The procedure has two steps. The rst step generates an ancestral recombination graph; that is, a directed acyclic graph with branch lengths, in which most nodes have a single parent, but some have two parents and these are the recombinants. The second step simulates the characters in the alignment. At each of the nodes with two parents, part of the sequence is inherited from one parent and the remainder from the other parent. The number of recombinant nodes is random and controlled by a recombination rate parameter. As the parameter increases the number of expected recombinations increases. Because of the framework, it is possible that the alignment may not have detectable signs of non-treelike behaviour, even when a non-zero recombination rate is used.
The simulator uses four parameters. The recombination rate determines whether the data is simulated on a tree (a recombination of zero) or on a network (a non-zero recombination rate). The other three parameters are the mutation rate or sequence divergence rate, the number of taxa, and the sequence length.
We used our own implementation of Hudson’s algorithm.
The particular regression approach we took is the least absolute shrinkage and selec-tion operator algorithm, or LASSO (Tibshirani, 1996). It applies a tuning parameter restriction to the full least squares solution, allowing regression to be carried out in a way that subsets the data by including variables a few at a time. Like ridge regression (Hoerl and Kennard, 1970), it reduces the variance of the parameter estimates by adding small amounts of bias.
1.1 Phylogenetics and its assumptions
1.5 Summary of thesis contributions
2 Splits selection for phylogenetic networks
2.1 Background and motivation
2.5 Data Analysis
3 A failed test for `tree-likeness’
3.1 An information criterion approach to testing for tree-likeness
3.2 Investigating the AIC criterion
3.3 Investigating the model of recombination and the error structure
3.4 Investigating whether the power is higher for longer sequences and a higher sequence divergence rate
4 Partial LASSO
4.1 Background and motivation
4.2 The partial LASSO
5 Visualising heterogeneity in a set of trees
5.2 Case studies
6 Condence sets on trees
6.1 Background and motivation
6.2 Constructing condence sets on multiple genes
6.3 Simulation study
6.4 Case study: Tiger Moths
7 Recombination breakpoint detection
7.1 Background and motivation
7.2 Existing recombination breakpoint analysis methods
7.3 Input series for recombination breakpoint detection
8 Investigating recombination using incompatibility
8.1 Background and motivation
8.2 Using incompatibility to detect recombination breakpoints
GET THE COMPLETE PROJECT