
## Latent Variable Models of Character Evolution

When studying a discrete trait, parsimony provides us with a first and simple criterion.

It is based on the "Occam's Razor" principle, which states that between two equivalent solutions, one should always choose the simplest one. However, evolution is a dynamic process, and it is not clear at all that it actually follows a parsimony principle. Instead of looking at the trait a posteriori and trying to explain it using parsimony, we can try to model directly the dynamic evolution of the character through time. Defining such a model has two main advantages. First, it makes all the assumptions used in the model explicit, which makes its limitations more obvious, and can help us choose between various models, depending on the organism or biological process studied. Second, with a model comes a natural way to evaluate it, in light of the data: the likelihood. This continuous score can advantageously replace the discrete parsimony score, which has some identifiability issues, and whose statistical properties are hard to study.

In this section, we describe a popular model for trait evolution on a tree. Under some broad assumptions, we show that this model can be cast into the framework of latent graphical models. We recall some well-known properties of these models, which will be useful when studying particular instances of trait evolution models (in Sections 1.3 and 1.4).

**A Generic Model for Trait Evolution on a Tree**

We describe here a very generic model of trait evolution on a tree. We start by making the assumption that the tree is calibrated in time, i.e. that the branch lengths represent units of time elapsed between a node and its child. The idea is then to assume that one or several traits, which can be discrete or continuous, evolve in time according to a given process, left unspecified here, but assumed to model the dynamics of trait evolution through time, on a given branch and for a given species. When there is a speciation event, at a node of the tree, this process splits up into two independent instances of this same process, starting at the same value. We describe in the following definition such a branching process, where branching times are fixed by an underlying known phylogenetic tree.

**Definition 1.2.1** (Generic Trait Evolution). Let $T = (V, E)$ be a rooted tree, oriented from the root to the tips. Assume that each edge $e \in E$ of the tree has an associated branch length $\ell_e$. Given a preorder numbering of the vertices, denote by $(X_i)_{1 \le i \le |V|}$ the sequence of random variables, taking its values in an arbitrary character space $\mathcal{C}$ (which can be discrete or continuous, and possibly multidimensional), describing the trait of each node. The law of $(X_i)_{1 \le i \le |V|}$ is defined by:

- $X_1 \sim \mathcal{D}(\theta_1)$: the root follows a given law $\mathcal{D}$, with parameters $\theta_1$.
- Let $e \in E$ be a branch, with child node $i$ and parent node $pa(i)$. On this branch, the trait evolves according to a stochastic process $(W^e_t;\ 0 \le t \le \ell_e)$ with law $\mathbb{P}(\theta_e)$, independently from other species, conditionally on $W^e_0 = X_{pa(i)}$. At node $i$, define $X_i = W^e_{\ell_e}$.
- Iterate down the tree.
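As a minimal illustration of this recursive construction, the sketch below simulates traits down a tree numbered in preorder. The helper names (`simulate_traits`, `sample_root`, `sample_branch`) and the parent-array encoding of the tree are hypothetical, introduced only for this example; the branch process is left abstract, and filled in here with a Brownian motion.

```python
import random

def simulate_traits(parent, lengths, sample_root, sample_branch):
    """Simulate traits at all nodes of a rooted tree (Definition 1.2.1).

    parent[i] is the parent of node i (parent[0] is None for the root),
    nodes being numbered in preorder; lengths[i] is the length of the
    branch above node i.  sample_root() draws X_1 from D(theta_1);
    sample_branch(x, t) runs the branch process for a duration t from x.
    """
    traits = [sample_root()]          # X_1 at the root
    for i in range(1, len(parent)):   # preorder: parents are visited first
        traits.append(sample_branch(traits[parent[i]], lengths[i]))
    return traits

# Toy example: Brownian motion as the branch process (variance rate 1).
random.seed(0)
parent  = [None, 0, 0, 1, 1]          # a 3-tip tree in preorder numbering
lengths = [0.0, 1.0, 2.0, 1.0, 1.0]
x = simulate_traits(parent, lengths,
                    sample_root=lambda: 0.0,
                    sample_branch=lambda v, t: random.gauss(v, t ** 0.5))
```

The speciation rule of the definition is implicit here: two children of the same node both start from their parent's simulated value, with independent draws on each branch.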

One strong assumption made here is that, conditionally on their ancestors, species evolve independently. This is a key assumption, which we cannot easily alleviate. It will be essential in our approach to the problem, as shown below. It is however not very realistic, as it excludes any kind of interaction, such as competition, predation or mutualism, between species living at the same period of time. Some attempts have recently been made to introduce interactions for some classes of Gaussian processes (see e.g. Drury et al., 2016; Manceau et al., 2016; Bartoszek et al., 2016). In the rest of this thesis, we do not question this assumption further.

Some more assumptions need to be made on the evolution process $\mathbb{P}$ itself. These assumptions depend on the trait studied, and on the species considered. In Sections 1.3 and 1.4, we describe some of them in detail for discrete or continuous traits. One generic assumption that is almost always made however, for mathematical convenience, is that the process has the Markov property: if $(W_t;\ t \ge 0)$ follows $\mathbb{P}$, then, for any $0 \le s < t$, the law of $W_t$ given $(W_u;\ 0 \le u \le s)$ is the same as the law of $W_t$ given $W_s$. In other words, the state of the trait depends on its previous states only through the last known one.

This is another essential assumption, which we will not try to alleviate in the rest of this manuscript.

Under those two assumptions (conditional independence, and Markov property), we show in the next section how this generic model can be cast into a useful statistical framework.

Observe that in the definition, the process can have different parameters $\theta_e$ on each branch $e$. Going further, each branch could even have its own process $\mathbb{P}_e$. However, in most applications, the process, as well as the parameters, are taken constant over all the branches (i.e. $\theta_e = \theta$, $\forall e \in E$). If this is the case, then from the definition above it follows that, for any node $i$, $X_i = W_{t_i}$, where $t_i$ is the time elapsed between the root and node $i$, and $(W_t;\ 0 \le t \le t_i)$ follows the law $\mathbb{P}(\theta)$, conditionally on $W_0 = X_1$.

One of the major goals of this thesis is, for some particular processes $\mathbb{P}$, to relax this assumption of uniformity, and to assume that the parameters are only piecewise constant, shifting only a few times on the tree.

**Directed Graphical Models and Latent Variables**

We introduce here directed graphical models and latent variable models, and show how the generic model presented in the previous section can be cast into this statistical framework. We then recall some of their useful properties, as well as inference procedures, such as the Expectation Maximization algorithm, a popular algorithm for inferring the parameters of these models through likelihood maximization.

**Definition 1.2.2** (Directed Graphical Model). A set $\mathbf{X} = (X_1, \dots, X_{|V|})$ of random variables on an arbitrary state space $\mathcal{C}$ follows a Directed Graphical Model if it lies at the vertices of a directed acyclic graph $G = (V, E)$, and is such that its joint distribution can be factorized as:

$$p_\theta(\mathbf{X}) = \prod_{i \in V} p_\theta\left(X_i \mid X_{pa(i)}\right)$$

where $pa(i)$ is the set of all direct parents of $i$ in the graph $G$, and $\theta$ is the vector of parameters of the distribution. Note that one may have $pa(i) = \emptyset$ (e.g. at the root of a tree), in which case $p_\theta(X_i \mid X_{pa(i)}) = p_\theta(X_i)$ by convention. Equivalently, $\mathbf{X}$ follows a Directed Graphical Model if, for any two nodes $i$ and $j$ such that $j$ is not a descendant of $i$ in the graph, $X_i$ is independent of $X_j$ conditionally on $X_{pa(i)}$.
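On a tree, the factorization above turns directly into a sum of one log-density term per edge, plus a root term. The sketch below makes this concrete for a toy binary trait; the helper names (`joint_log_density`, `log_root`, `log_trans`) and the example numbers are illustrative only.

```python
import math

def joint_log_density(values, parent, log_root, log_trans):
    """Log of the factorized joint density of a directed tree model:
    log p(X) = log p(X_1) + sum_{i>1} log p(X_i | X_pa(i)),
    with nodes numbered in preorder and parent[0] = None at the root."""
    total = log_root(values[0])               # root term: pa(1) is empty
    for i in range(1, len(values)):           # one term per branch
        total += log_trans(values[i], values[parent[i]])
    return total

# Toy binary trait: a child keeps its parent's state with probability 0.9.
parent = [None, 0, 0]                         # root with two tips
lp = joint_log_density([0, 0, 1], parent,
                       log_root=lambda x: math.log(0.5),
                       log_trans=lambda x, p: math.log(0.9 if x == p else 0.1))
# lp == log(0.5 * 0.9 * 0.1)
```

Working in log scale is the usual way to avoid numerical underflow when the product runs over many branches.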

When the underlying graph is a tree, each node has only one parent, and the formula above means that we can express the joint distribution as the product over all branches of the law of each node given its parent. Hence, in such a model, we just need a transmission rule for the trait from a parent node to its children to know the entire distribution on the tree.

This formalism is quite fruitful, and can encompass the generic model of trait evolution described in the previous section:

**Proposition 1.2.1.** Let $(X_i)_{1 \le i \le |V|}$ be random variables on a rooted and directed tree $T = (V, E)$, generated according to the process of trait evolution described in Definition 1.2.1, with a process $\mathbb{P}$ that has the Markov property. Then $(X_i)_{1 \le i \le |V|}$ follows a graphical model on the tree $T$.

Proof. The proof relies essentially on the two assumptions that we stressed above: conditional independence and the Markov property. Using the second characterization of Definition 1.2.2, take two nodes $i$ and $j$ of $T$ such that $j$ is not a descendant of $i$, and show that $X_i$ is independent of $X_j$ conditionally on $X_{pa(i)}$. If $j$ is an ancestor of $i$, then this is true thanks to the Markov property. Otherwise, let $k$ be the most recent common ancestor of $i$ and $j$. As $j$ is not a descendant of $i$, $k$ is distinct from $i$ and $j$. Then, as $k$ is an ancestor of $i$, by the Markov property, we get $p(X_i \mid X_{pa(i)}, X_j) = p(X_i \mid X_{pa(i)}, X_k, X_j)$. Thanks to the conditional independence of $X_i$ and $X_j$ given $X_k$, this is equal to $p(X_i \mid X_{pa(i)}, X_k)$, which reduces to $p(X_i \mid X_{pa(i)})$ using the Markov property one more time.

Once the model is defined, to perform statistical inference, we need to know which nodes are observed, and which ones are hidden. For a phylogenetic tree, representing the evolutionary history of species, we typically only have access to traits that can be observed today, at the tips of the tree. Apart from fossils, which can provide us with some insight into ancestral traits, internal nodes remain mostly unobserved.

**Definition 1.2.3** (Latent Variable Model on a Tree). A set of variables $\mathbf{Z}$ is said to be latent, or hidden, if it is unobserved, but has a direct effect on a set of observed variables $\mathbf{Y}$. If the set $\mathbf{X} = (\mathbf{Y}, \mathbf{Z})$ is such that:

- $\mathbf{X}$ follows a directed graphical model on a tree $T$,
- $\mathbf{Y}$ are observed variables at the leaves of the tree,
- $\mathbf{Z}$ are latent variables at the internal nodes of the tree,

then $\mathbf{Y}$ is said to follow a latent variable model on the tree $T$.

Intuitively, this class of models corresponds to a case where the law of a node given its parent is known, and defined by a model of trait evolution, which can be discrete or continuous, and where only extant species (at the tips of the tree) are observed. See Figure 1.2.1 for a simple example of such a setting.

Figure 1.2.1 – Latent Variable Tree Model. The observed variables are shown in white, and the latent variables in gray. The variables are numbered in preorder. The likelihood of the completed dataset can be factorized over the edges, with terms of the form $p(X_i \mid X_{pa(i)})$, such as e.g. $p(Y_5 \mid Z_3)$.

These models appear naturally in phylogeny, and, thanks to the underlying tree, have some nice hierarchical properties that can help us speed up many computations, including that of the likelihood.

**Pruning Algorithm**

When we have a latent variable model on a tree, the likelihood of the data can be computed thanks to a pruning algorithm, which goes up the tree from the tips to the root, similarly to the Sankoff algorithm. The idea relies on the conditional independence of daughter nodes given a parent node in the tree graphical model. It has been exposed and studied for some particular examples of evolution models by Felsenstein (Felsenstein, 1973b,a, 1981, 2004).

**Proposition 1.2.2** (Pruning-like Propagation). Denote by $\mathbf{Y}_i$ the set of all the traits for tips that are below a given node $i$ (i.e. all tips $j \in \text{des}(i)$). Then the conditional likelihood of $\mathbf{Y}_i$ given $X_i$ can be written as:

$$p(\mathbf{Y}_i \mid X_i = x) = \prod_{j \in \text{child}(i)} \int_{\mathcal{C}} p(\mathbf{Y}_j \mid X_j = y)\, p(X_j = y \mid X_i = x)\, \mathrm{d}y$$

If the nodes of the tree are numbered in postorder, then this equation makes it possible to propagate the information from the tips of the tree to the root. When some assumptions are made on the space $\mathcal{C}$ or on the law $p$, some explicit formulas can be derived.

For instance, if $\mathcal{C}$ is discrete, then the integrals are just sums, and the formula can be handled more easily (see Section 1.3). Similarly, if the laws studied are Gaussian, then all the integrals can be solved analytically, and we get update formulas similar to the ones in a Kalman filter (see Section 1.4, and Chapters 2 and 3).
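In the discrete case, the recursion of Proposition 1.2.2 reduces to sums over states, computed in one postorder sweep. A minimal sketch (hypothetical function and variable names; for brevity the toy transition matrix here ignores branch lengths, whereas a real model would compute it from them):

```python
def pruning_likelihood(children, lengths, tip_state, trans, root_prior):
    """Pruning for a discrete trait on a tree in preorder numbering.

    children[i]: list of children of node i (empty at tips);
    trans(t): transition matrix on a branch of length t;
    tip_state[i]: observed state at tip i.
    Returns p(Y) = sum_x root_prior[x] * p(Y | X_root = x)."""
    n_states = len(root_prior)
    cond = [None] * len(children)
    for i in reversed(range(len(children))):   # postorder sweep
        if not children[i]:                    # tip: indicator of its state
            cond[i] = [1.0 if x == tip_state[i] else 0.0
                       for x in range(n_states)]
        else:                                  # internal: Proposition 1.2.2
            cond[i] = [1.0] * n_states
            for j in children[i]:
                P = trans(lengths[j])
                for x in range(n_states):
                    cond[i][x] *= sum(P[x][y] * cond[j][y]
                                      for y in range(n_states))
    return sum(p * l for p, l in zip(root_prior, cond[0]))

# Toy example: root 0 with two tips 1 and 2, two states, uniform root.
children = [[1, 2], [], []]
lengths = [0.0, 1.0, 1.0]
toy_P = [[0.9, 0.1], [0.2, 0.8]]
lik = pruning_likelihood(children, lengths, {1: 0, 2: 1},
                         trans=lambda t: toy_P, root_prior=[0.5, 0.5])
# lik == 0.5 * (0.9 * 0.1) + 0.5 * (0.2 * 0.8) == 0.125
```

Each node thus stores one conditional likelihood per state, so the cost is linear in the number of branches (and quadratic in the number of states).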

**Expectation Maximization**

Instead of trying to work with the marginal distribution of the observed trait values $\mathbf{Y}$ directly, it can be better to work with the likelihood of the completed dataset $\mathbf{X} = (\mathbf{Z}, \mathbf{Y})$, which is in many cases easier to compute. The well-known Expectation Maximization (EM) algorithm is designed to exploit this feature. It was introduced by Dempster et al. (1977), and relies on the following decomposition of the likelihood of the observed data:

$$\log p_\theta(\mathbf{Y}) = \mathbb{E}\left[\log p_\theta(\mathbf{Y}, \mathbf{Z}) \mid \mathbf{Y}\right] - \mathbb{E}\left[\log p_\theta(\mathbf{Z} \mid \mathbf{Y}) \mid \mathbf{Y}\right]$$

See e.g. Robin (2014) for a proof of this decomposition and of the following proposition. The EM algorithm is then an iterative algorithm that maximizes the marginal log-likelihood $\log p_\theta(\mathbf{Y})$, deemed intractable, through the conditional expectation of the completed log-likelihood $\mathbb{E}\left[\log p_\theta(\mathbf{Y}, \mathbf{Z}) \mid \mathbf{Y}\right]$, which is supposed to be easier to handle. Informally, the algorithm can be stated as follows:

**Algorithm 1.2.1** (Expectation Maximization). Repeat until convergence:

- **E step** Given a current estimate $\theta_h$ of the vector of parameters $\theta$, compute the moments of $\log p_{\theta_h}(\mathbf{Z} \mid \mathbf{Y})$ needed to compute $\mathbb{E}_{\theta_h}\left[\log p_\theta(\mathbf{Y}, \mathbf{Z}) \mid \mathbf{Y}\right]$ (as a function of $\theta$).
- **M step** Update the estimate of $\theta$ as: $\theta_{h+1} = \operatorname{arg\,max}_\theta\ \mathbb{E}_{\theta_h}\left[\log p_\theta(\mathbf{Y}, \mathbf{Z}) \mid \mathbf{Y}\right]$.
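To make the two steps concrete, here is a deliberately simple instance of the algorithm on a toy latent variable model unrelated to trees: a balanced mixture of two unit-variance Gaussians with unknown means, where the hidden variable $Z_i$ is the component of observation $Y_i$. Names and data are illustrative only.

```python
import math

def em_two_means(y, mu, n_iter=20, sigma=1.0):
    """Minimal EM for a balanced two-component Gaussian mixture with
    known common variance: estimate the two means (mu[0], mu[1])."""
    mu = list(mu)
    for _ in range(n_iter):
        # E step: posterior probability that each point comes from
        # component 0, under the current estimate of the parameters.
        resp = []
        for v in y:
            d0 = math.exp(-0.5 * ((v - mu[0]) / sigma) ** 2)
            d1 = math.exp(-0.5 * ((v - mu[1]) / sigma) ** 2)
            resp.append(d0 / (d0 + d1))
        # M step: means maximizing the expected complete log-likelihood
        # (weighted averages of the observations).
        w0 = sum(resp)
        mu[0] = sum(r * v for r, v in zip(resp, y)) / w0
        mu[1] = sum((1 - r) * v for r, v in zip(resp, y)) / (len(y) - w0)
    return mu

y = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
mu = em_two_means(y, mu=[-1.0, 1.0])
# mu[0] close to -2, mu[1] close to 2
```

In the tree models of this chapter, the same scheme applies, but the E step requires the conditional moments of the ancestral states given the tip data, which is where the pruning recursions come into play.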

In general, there is no guarantee that this algorithm converges to the global maximum of the marginal likelihood. However, we are guaranteed to converge to a local maximum, thanks to the following property, which states that the likelihood increases at each step of the algorithm:

**Proposition 1.2.4** (Increase of the Likelihood). If $\theta_h$ and $\theta_{h+1}$ are estimates obtained by the EM algorithm described above, then: $\log p_{\theta_{h+1}}(\mathbf{Y}) \ge \log p_{\theta_h}(\mathbf{Y})$.

This algorithm is quite general, and can be used both for discrete and continuous models of evolution. It is at the core of our inference strategy in Chapters 2 and 3 of this thesis. For the situations we are looking at, one of the key points is to efficiently compute the E step. As our latent variables are linked by a directed tree, this can be done thanks to pruning-like algorithms, using the formula shown above (Proposition 1.2.2).

These algorithms, which can also be called "forward-backward" algorithms, are designed to compute all the needed quantities in just two traversals of the tree, from the tips to the root, and back.

## Discrete Models of Evolution and Tree Reconstruction

The first application of the above formalism is to discrete models of trait evolution. As they encompass models of DNA or protein evolution, for which a great amount of data has become available over the last few decades, these models have received a lot of attention. They are often at the core of phylogenetic inference strategies to reconstruct the tree between a set of species. It is hence important to understand how they work, and on which modeling assumptions they are based. We refer the interested reader to O'Meara (2012) and Felsenstein (2004) for a more comprehensive review of these models and associated methods.

**Continuous Time Markov Chains**

If the studied trait is discrete, it is natural to model its evolution as a Continuous Time Markov Chain (CTMC). The Markov property of this process $\mathbb{P}$ ensures that the resulting random variables follow a latent graphical model on the phylogenetic tree, as defined in Definition 1.2.3 (see Proposition 1.2.1). We start by defining a general CTMC and exploring a few of its properties. We then show how it can be used to model trait evolution on a tree.

Finally, we give some classical models of trait evolution based on this formalism, along with their assumptions.
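To fix ideas, here is a sketch of the transition matrix of one of the simplest such chains, a symmetric Jukes-Cantor-like CTMC on four states (e.g. nucleotides), for which the matrix exponential $P(t) = e^{tQ}$ has a closed form. The function name and the rate symbol `alpha` are illustrative only.

```python
import math

def jc_transition(t, alpha=1.0):
    """Transition matrix P(t) = exp(tQ) for the symmetric 4-state CTMC
    with instantaneous rate alpha towards each of the 3 other states
    (so q_ii = -3*alpha).  Closed form:
      P_ii(t) = 1/4 + 3/4 * exp(-4*alpha*t)
      P_ij(t) = 1/4 - 1/4 * exp(-4*alpha*t)   for i != j."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

P = jc_transition(0.1)
# Each row sums to 1; as t grows, P(t) tends to the uniform law (1/4).
```

Such a matrix is exactly the transmission rule $p(X_i \mid X_{pa(i)})$ needed on each branch of the latent tree model, with $t$ the branch length, and it is what the `trans` argument of a pruning computation would return.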

**Table of contents:**

**Introduction**

**1 Background**

1.1 Space of Convex Characters on Phylogenies

1.2 Latent Variable Models of Character Evolution

1.3 Discrete Models of Evolution and Tree Reconstruction

1.4 Continuous Models of Evolution

1.5 Model Selection

Appendices

1.A The Ornstein-Uhlenbeck Process

1.B Multivariate Analysis Tools

**2 Shift Detection for Univariate Processes**

2.1 Introduction

2.2 Statistical Modeling

2.3 Identifiability and Complexity of a Model

2.4 Statistical Inference

2.5 Simulations Studies

2.6 Case Study: Chelonian Carapace Length Evolution

Appendices

2.A Enumeration of Equivalence Classes

2.B A Vandermonde Like Identity

2.C Technical Details of the EM

2.D Optimal Shift Location with Fixed Root

2.E Proof of Proposition 2.4.1 for Model Selection

2.F Supplementary Figures

2.G Practical Implementation

**3 Shift Detection for Multivariate Processes**

3.1 Introduction

3.2 Model

3.3 pPCA and Shifts

3.4 Simulations Studies

3.5 Examples

3.6 Discussion

Appendices

3.A PCA: Mathematical Derivations

3.B PhylogeneticEM case study: New World Monkeys

3.C EM Inference

3.D Simulations Appendices

**4 Trait Evolution on Phylogenetic Networks**

4.1 Introduction: Phylogenetic Networks

4.2 Continuous Trait Evolution on a Network

4.3 Tests of Phylogenetic Signal

4.4 The Julia package PhyloNetworks

4.5 Perspectives

Appendices

4.A Documentation: Continuous Trait Evolution

4.B Decomposition of the Covariance Matrix

**5 Extensions and Perspectives**

5.1 Dealing with Tree and Trait Uncertainty

5.2 Convergence and Sparsity

5.3 Non-Ultrametric Trees

5.4 Sampling Scheme and Missing Data

**6 Résumé substantiel (extended summary in French)**