Drawing tree-indexed data1
The aim of drawing tree-indexed data is to produce informative geometric representa-tions automatically for visualization purposes. Since tree-indexed data can be viewed as directed graphs, they could be draw using conventions and algorithms previously pre-sented (see sub-section 1.1.2 page 10). But it is common practice to modify the directed graph drawing standards for trees to take account of their topological particularities.
Aesthetics of tree drawings Trees and forests are by definition sparse graphs – same number of edges as vertices minus the number of roots – therefore node and link diagrams are preferred to adjacency plots. Usually, node and link diagrams for drawing trees are compared by considering qualitative and quantitative aesthetic criteria:
area, defined as the surface area of the enclosing rectangle of the vertex drawing.
ratio, defined as the ratio of length of shortest side to length of longest side of the enclosing rectangle of the vertex drawing.
subtree separation property. A drawing of T satisfies the subtree separation prop-erty defined by Chan et al. (1997) if, for any two distinct vertices u and v of T , the enclosing rectangles of the drawing of Tu and Tv do not overlap one with the other.
Moreover, considering the simplicity of tree topology compared to general directed graphs, two non-exclusive types of drawing are of considerable interest:
Directional drawing. Considering a directed axis in the coordinate system used to draw the trees, no child is placed before its parent. With such a convention, and for clarity, a switch may be made from arrows to lines in order to represent a directed edge since there is no confusion about edge direction.
Planar drawing. A planar drawing is a drawing in which edges do not intersect. Pla-nar drawings are normally easier to understand than non-planar drawings (i.e. with edge-crossings). Since any tree admits a planar drawing, it is desirable to obtain planar drawings for trees.
Both Eades (1991) and Fruchterman and Reingold (1991) reported that diﬃculties are encountered when drawing trees without edge-crossing by force-directed algorithms. Even when using the magnetic extension of Sugiyama and Misue (1995), if the drawing tends to be directional, there is no guarantee that the result will be be planar. Here-inafter we present only two classes of tree layout algorithms among many others (see Tamassia, 2007, chapter 5 and reference therein). We focus on these two classes since they are relatively easy to understand and implement, while producing high quality layouts.
Level-based layouts The level-based approach in tree drawing is characterized by the fact that vertices at the same depth are aligned on the same straight line (i.e. the level) and for two given depths these straight lines are parallel (Bloesch, 1993; Reingold and Tilford, 1981; Buchheim et al., 2002; Walker, 1990). Algorithms based on this approach produce intuitive drawings that clearly display symmetries and comply with both planarity and directionality conventions (see algorithm 2 and its results in figure 2.2).
Radial layouts While drawing using the level-based layouts in most cases respects the usual tree drawing conventions and the subtree separation property (this is not the case for the algorithms proposed by Buchheim et al. (2002) and Walker (1990)), quantitative aesthetics criteria are not satisfactory. For instance, let T be a perfect binary tree of depth d. In a perfect binary tree every non-leaf vertex has two children. There are 2d leaves. The drawing of T produced using algorithm 2 has:
• an area of d • 2d units,
• a ratio of 2−d.
As presented in Tamassia (2007, Chapter 5), by considering a geometric transformation from Cartesian to polar coordinates, a level-based layout yields a radial layout (see algorithm 3 and its results in figure 2.2).
Compared to the level drawing of T , the radial drawing produced by algorithm 3 has:
• an area of d2 units,
• a ratio of 1.
Such values are far more satisfactory for these criteria. Given a level-based layout re-specting planarity and directionality conventions such as those produced by algorithm 2, the geometric transformation from Cartesian to polar coordinates preserves these con-ventions. Moreover, although algorithm 3 drawings do not respect sensu stricto the subtree separation property, if the enclosing rectangle is changed to an enclosing trian-gle, the subtree separation property is respected.
Tree-indexed data and plants
Tree-indexed data are commonly used in signal processing (Crouse et al., 1998; Dasgupta et al., 2001) and in 2D and 3D images (Choi and Baraniuk, 2001) as multi-scale repre-sentations of path or grid-indexed data. This thesis does not include a broad spectrum of tree-indexed data but focuses on those which can be collected in studies on plant de-velopment. In particular, we present the usefulness of tree-indexed data representation on two scales:
A microscopic scale where tree-indexed data represent cell lineages in tissues through-out their development (Olariu et al., 2009),
A macroscopic scale where tree-indexed data represent whole plant architecture (Du-rand et al., 2005).
Tree-indexed data on a cellular scale
The study of morphogenesis A major challenge in developmental biology is to un-derstand how multi-cellular tissues can give rise to complex shapes in animals or plants. It is therefore crucial to be able to quantify and explain the cellular and tissular patterns that arise during morphogenesis2. Although several studies have provided profound in-sight into the molecular regulatory networks that play a role during development, the eﬀects of such networks on shape transformations are often only described qualitatively. Describing size and shape changes as a geometrical output of gene activity requires the quantification of growth patterns at a cellular resolution. Obtaining accurate geometric information about cell position and shape is key to developing quantitative models of morphogenesis. The identification of groups of cells is also essential, not only based on their diﬀerentiation state, but also on the outcome of the mechanical, genetic and hormonal events that drive morphogenesis.
Meristems and tree representation of tissues3 In plants, meristems drive mor-phogenesis. A meristem is a set of embryonic cells that organize the construction of the plant. It creates new tissues by successive divisions of its stem cells. This division process is coupled with:
• a phenomenon that maintains certain cells obtained by division in a totipotency4 state,
• a phenomenon that enrolls certain cells obtained by division into a diﬀerentiation genetic program.
Divisions occur in such a way that cells entering the diﬀerentiating process will become part of new tissues and organs while the meristem does not disappear since the totipo-tent cells of which it is composed are constantly regenerating. Given the tissues and organs produced, and location within the plant, three main types of meristems can be considered:
The Shoot Apical Meristem (SAM) located at the apex of the stem (see figure 2.3). It is responsible for genesis of the aerial part of the plant, i.e. leaves, stems and inflorescences5. Inflorescence set up is the result of the transformation of a Shoot Apical Meristem (SAM) into a floral meristem.
The Root Apical Meristem (RAM) located at the tip of the root and responsible for genesis of the below ground part of the plant, i.e. the roots.
The secondary meristems which are responsible – when located inside stems – for the thickening of stems or roots when located inside roots.
The idea of representing meristem development in a tree-structure, is to follow meristem cells over time represented by vertices, and use edges to represent lineage relationships (see figure 2.3).
Tree-indexed data collection by 3D + t meristem imaging Fernandez et al. (2010) presented a method to generate 3D digitized tissues at cell resolution with au-tomatic tracking of cell lineage during growth. To create a digitized tissue that can be used to quantitatively analyze growth in four dimensions, they developed an experimen-tal pipeline comprising two key steps:
Multi-angle Acquisition, 3 dimensional Reconstruction and Segmentation (MARS). The multi-angle acquisition produces stacks of 2D images that are transformed into 3D images. Each voxel in the images stores the intensity with the signal associ-ated to cell walls or other sources, for instance genetic markers. At the end of the MARS step, the segmentation associates each voxel with a given cell in the meristem, or with background (see figure 2.4).
Automated Lineage Tracking (ALT). Once the cells have been identified in the MARS step, the goal of the ALT step is to perform cell tracking throughout the experiment. At the end of the ALT step lineage trees of cells are obtained.
Available data In this thesis we focus on joint work concerning flower morphogenesis in Arabidopsis thaliana conducted with J. Legrand, another Ph.D. student in the team (Legrand, 2014). We were interested in SAMs of Arabidopsis thaliana transformed into floral meristems. In contrast to the original SAM, a floral meristem follows a determinate growth process6. This transformation is controlled by the expression of particular genes called identity genes, specifying floral organs and causing determinate growth. This work focused on the L1 cell layer7 (see figure 2.4) and the lineage tree was produced using the Fernandez et al. (2010) MARS-ALT method (see figure 2.3).
Under the assumption that the cell diﬀerentiation process in floral meristems can be assimilated to a succession of finite unobservable cell identities, we aimed tot recover these identities on the basis of genetic and geometric cell characteristics (Legrand, 2014) such as:
• surface areas (internal L1/L2 and external L1),
• inertia values (on three axes),
• principal and secondary curvatures,
• AHP68 concentration.
Moreover, in order to understand the early mechanisms involved during flower morpho-genesis, we aimed to identify and characterize cell identity motifs.
Tree-indexed data on the whole plant scale9
Plant architecture analysis The importance of topological structure in understand-ing and analyzing the development of plants was underlined by Hall´e et al. (1978) and Gatsuk et al. (1980) who were the first to analyze plant architecture. Architectural analysis was at first essentially developed as a qualitative method for describing plants (Barth´el´emy et al., 1989). Subsequently, a major research eﬀort was devoted to validat-ing and refining architectural concepts and to studying their application in agronomic contexts. This led researchers to study progressively how to quantify plant architec-ture and to develop corresponding concepts and tools (see Godin and Caraglio, 1998, and references therein). The quantitative approach rapidly ran into the problem of obtaining computational representations of plants that are consistent with field obser-vations. This problem raises the question of measuring and formally representing plant topological structures.
SAM activity, plant modularity and tree representation of plants The notion of plant topological structure is based on the idea of decomposing a plant into elementary constituents and describing their connections. To obtain natural decompositions, it is possible to take advantage of the fact that the outcome of the plant growth process is modular: a stem is a succession of metamers constituted by an internode, the upper node, leaves and axillary buds attached to the node (see figure 2.5)
The topological structure stemming from a modular organism such as plants consists of a description of the connections between its elementary constituents. Considering only one SAM activity leads to consider stems that can be viewed as sequences: a metamer is connected to an anterior metamer – called the predecessor metamer – and possibly to a posterior metamer – called the successor metamer. But as a SAM produces buds containing other SAMs, as soon as an axillary meristem of a stem develops into a lateral axis, a metamer may have more than one child, counting the successor and the lateral(s) metamer(s) but has only one predecessor. The whole plant can thus be viewed as a tree-like structure (see figure 2.6).
Figure 2.5 – (A) Shoot apical meristem and (B) stem organization (Barth´el´emy and Caraglio, 2007). Each leafy axis (B) ends in an apical meristem frequently protected by an apical bud (A). Each stem comprises a succession of metamers (in gray on (B) constituted by an internode, the upper node, leaves and axil lary buds attached to the node).
Tree-indexed data collection by retrospective measurements Plant growth is often a cyclic phenomenon: metamer set-up may be interrupted by resting phases corresponding for instance to winter in temperate species. It is thus interesting when collecting data on plant plant topological structure to consider meristematic activity at diﬀerent scales depending in the growth strategy of the plant. Indeed, if the metamer is the basic unit of plant architecture, according to the plant growth cycles the tree-indexation of data can be considered at diﬀerent scales (see figure 2.7):
On the Growth Unit (GU) scale. The GU is composed of the metamers established in a ininterrupted phase of growth.
On the annual shoot scale. The annual shoot corresponds to the GUs established over a year.
On the axis scale. The axis corresponds to the succession of annual shoots or GUs produced by the same meristem.
When studying the architecture of a plant, appropriate selection of the botanical entity
– the elementary constituent on the considered scale – is therefore primordial in order to describe the plant’s growth strategy (see figure 2.6):
• The common walnut (Sabatier et al., 1998), possesses two types of annual shoots. Monocyclic annual shoots are preformed in the winter bud. The annual shoot and the GU are thus the same. Conversely, bicyclic annual shoots are made up of two GUs. Depending on the objectives of the analysis, the botanical entity chosen could be the GU or the annual shoot.
Figure 2.6 – Tree-indexed data representation of plants (Durand et al., 2005). (A) The plant is represented on the Growth Unit (GU) scale where each GU is denoted by ev with v ∈ [|0, 14|[. (B) Formal tree graph representation of the same plant: each GU ev is represented by a vertex v. Part of the topological information is not encoded in the graph but can be stored as a property (the three shoots borne by e1). A few other vertex properties can be defined such the length of GUs, their top and bottom diameters. . . depending on the study.
• For some tropical plants, growth may be almost continuous. As a consequence, GUs are no longer relevant. A reasonable choice could therefore be to consider the metamers or the axis as the botanical entity.
Moreover, it is noteworthy that although axis scales are defined for all plants, the GU and annual shoot scales are mainly defined for temperate species.
Morphological markers, which reflect past meristem activity, enable the botanist to reconstruct the life of a plant by identifying a posteriori growth periods. The tree graph
T is therefore constructed with respect to the plant growth strategy (see figure 2.6). Over the same period the univariate set x¯ or more generally the multivariate set x¯ is collected considering characteristics – depending on the experiment – of botanical entities such as length, diameter, number (or presence) of flowers, number (or presence) of fruits.
Therein we only consider the collection of tree-indexed data. But since there are more than one pertinent scale on the same plant, data is actually collected using the Multiscale Tree Graph (MTG) data structure defined by Godin and Caraglio (1998). This MTG data structure can be seen as tree-indexed data where scales are represented by a recursive quotienting of the tree on a finer scale (see Godin and Caraglio (1998) for more details). Choice of scale is therefore made a posteriori in order to produce tree-indexed data and can depend on the growth aspect of the plant studied.
Figure 2.7 – Plant modularity (Barth´el´emy and Caraglio, 2007). This diagram represents main scales of organization (botanical entity) and repetition phenomena (terms in italics or in boxes) in seed plants.
Available data In this thesis we focus on joint work on mango tree phenology con-ducted with Anna¨elle Dambreville, Pierre-Eric Lauri and Fr´ed´eric Normand. We used mango MTGs containing 15 trees belonging to 5 cultivars collected during the thesis work by Dambreville (2012) to highlight and characterize mango tree patchiness phe-nomenon. Like other tropical trees, mango is characterized by marked phenological asynchronisms between and within trees, entailing patchiness (Chacko, 1986, see fig-ure 2.8). Patchiness is characterized by clumps of either vegetative or reproductive GUs within the canopy: while some parts of the tree canopy develop vegetative GUs, other parts may remain in rest or produce inflorescences at the same time. These asynchro-nisms concern more or less large branching systems (Ram´ırez and Davenport, 2010). They entail various agronomic problems, such as the repeated use of pesticides to pro-tect recurrent susceptible phenological stages from pests, or an excessively extended period of fruit maturity which may lead to diﬃculties organizing fruit harvesting.
Previous studies by Dambreville et al. (2013) showed that the fate and burst date of a daughter GU are strongly aﬀected by those of ancestor GUs, indicating that patch-iness pattern formation could be studied using spatio-temporal analysis. Our twofold objective here was to:
• Characterize tree patchiness. As stated above, patchiness corresponds to more or less large branching systems sharing similar GU fates. We therefore aimed to recover a quotient tree of tree-indexed data on the GU scale in which quotients were roughly homogeneous in terms of GU fates.
• Identify the mechanisms responsible for the set-up of tree patchiness. An inquiry of fate alternations along paths within the tree or successions of homogeneous zones in mango trees could reveal the mechanisms involved in this set-up. To this end, we therefore aimed to highlight particular fate motifs in mango trees on the GU scale.
Figure 2.8 – Il lustration of mango tree patchiness (Dambrevil le, 2012). This mango tree is separated into two parts. The left one in dark green is a clump of old GUs wherein fruits can be found. In contrary the right one in light green is a clump of new vegetative GUs.
Markov models for tree indexed-data
Here we assume that the indexed set, x¯ = (xt)t T , or more generally x¯ = (xt)t T , is the outcome of a random process.
here we consider that τ is sensu stricto a tree. The only root of the tree is noted r.
In a forest, trees are considered as independent and identically distributed.
Let us first consider the simple case where x¯ is the realization of a X -valued stochastic ¯ such that X ⊂ N is called the state space. We are interested here process X = (Xt)t T in modeling the distribution of the random process ¯ (2.1) P X = x¯ .
When considering the case of tree-indexed data, the simplest dependent model that can be constructed is that which directly consider the tree-graph of the data as a graphical model combined with a usual homogeneity assumption. Combining both hypotheses leads to the following factorization of (2.1):
¯ P Xt = xt Xpa(t) = xpa(t) . (2.2)
P X = x¯ = P (Xr = xr )
Given factorization (2.2), classical Markovian models for path-indexed data were easily adapted to tree-indexed data. These models are called Independent Markov Out-Tree (IMOT) where ”independent” means that for such models siblings are assumed to be independent given their parent. Considering the mango tree application we aimed to highlight GU fate motifs assuming that at some point a switch occurred from a homogeneous tree to a heterogeneous patchy tree. In order to detect such patterns, we assumed that for a given parent fate:
• and a given growth period, only a few diﬀerent state combinations could be ob-served for children,
• and for a generation, all children states could be observed.
Under these assumptions we sought to model dependencies among children fates in order to obtain such inclusion/exclusion patterns. Since in (2.2) children fates are assumed to be independent given their parent fate, we had to consider other models (see Durand et al., 2005, for a discussion of available models):
• Markov In-Tree (MIT) models. Instead of modeling siblings given their parent as in IMOT, the parent is modeled given its children, introducing the following factorization of (2.1), P ¯ P (Xl = xl ) P Xt = xt X ch(t) = xch(t) , (2.3) X = x¯ = l∈L t∈T \L where siblings are marginally independent but conditionally dependent on their parent.
• Multi-Type Branching Process (MTBP). Under a permutation invariance property (see Haccou et al., 2005; Kimmel and Axelrod, 2002, for more details), an extension of Markov Out-Tree (MOT) models that considers dependencies between children and where tree topology is partially represented thought the parametrization of vertex out-degree combinatorics. The following factorization of (2.1) is therefore introduced P ¯ ∝ P (Xr = xr ) P (N t = nt | Xt = xt) , (2.4) X = x¯ t∈T where N t is the discrete random vector of the number of children of vertex t in each state.
In the context of our mango tree analysis, the assumption of unordered children and the combinatorics induced by the variable and large number of child vertices in each state inflates the number of model parameters. We therefore focused on parametric versions of these models. Since parametric MIT models are not suitable for left-right cases (see chapter 5.4), we thus focused on MTBP models. The issue of specifying parametric MTBPs reduces to the problem of defining parametric models for discrete multivariate counts. The classical discrete multivariate distributions catalog (Johnson et al., 1997) only proposes rigid dependence and covariance patterns. The next step toward modeling mango tree patchiness was thus to derive flexible discrete multivariate distributions with complex dependency patterns. This was dealt by the introduction of mixed graphical models for multivariate discrete random vectors.
Hidden Markov Tree (HMT) models
When confronted by tree-indexed data that do not contain a few discrete outcomes, as in the mango tree case, but multidimensional heterogeneous outcomes as in the floral meristem case, MIT and MTBP models cannot be considered as they stand. A widespread extension of Markov Tree (MT) models in such cases is to consider Hidden Markov Tree (HMT) models. HMT models introduced by Crouse et al. (1998) are for MT models what Hidden Markov Chain (HMC) models are to Markov Chain (MC) models. As for HMC models (see Ephraim and Merhav, 2002, for more details), HMT models are no longer restricted to categorical variables but deal with any type of random variable or vector at a low cost in terms of parameters.
Table of contents :
1 Graphs and graphical models frameworks
1.1 Introduction to graph theory
1.1.3 Graph properties
1.2 Graphical model framework
1.2.1 Random vectors and independencies
1.2.2 From graphs to distributions
1.2.3 From distributions to graphs
1.3 Gaussian graphical models
2 Tree-indexed data and Markov Tree (MT) models
2.1 Introduction to tree-indexed data
2.1.2 Drawing tree-indexed data
2.2 Tree-indexed data and plants
2.2.1 Tree-indexed data on a cellular scale
2.2.2 Tree-indexed data on the whole plant scale
2.3 Markov models for tree indexed-data
2.3.1 Markov models
2.3.2 Hidden Markov Tree (HMT) models
3 Semi-parametric Hidden Markov Out-Tree (HMOT) models for cell lineage analysis
3.2.1 Markov Out-Tree (MOT) models
3.2.2 Hidden Markov Tree (HMT) models
3.3 Computational methods for Hidden Markov Out-Tree (HMOT) models
3.3.1 Upward-downward smoothing algorithm
3.3.2 Application of the EM algorithm
3.3.3 Dynamic programming restoration algorithm
3.4 Application to cell lineage trees
4 Inference ofMixed Acyclic GraphicalModels (MAGMs) inMulti-Type Branching Processes (MTBPs)
4.2.1 Multi-Type Branching Processes (MTBPs)
4.2.2 Poisson Mixed Acyclic Graphical Models (PMAGMs)
4.2.3 Discrete Parametric Mixed Acyclic Graphical Models (DPMAGMs)
4.3 Discrete Parametric Mixed Acyclic Graphical Models (DPMAGMs) inference
4.3.1 Parameter inference
4.3.2 Structure inference
4.4 Application to Multi-Type Branching Processes (MTBPs): the case of mango tree asynchronisms
4.5 Concluding remarks
5 Quantification of plant patchiness via tree-structured statistical mod- els: a tree-segmentation/clustering approach
5.2 Material and methods
5.2.1 Tree-structured representation of plants
5.2.2 Modeling plant patchiness with tree segmentation/clustering models
5.2.3 Plant material
5.3.1 Tree segmentation
5.3.2 Subtree clustering
5.3.3 Cultivar comparisons
Work in progress and perspectives
StatisKit: graphical model inference in C++ and Python
Hidden Markov In-Tree (HMIT) models
Multivariate mixture models in Multi-Type Branching Processes (MTBPs)
Integrative models for deciphering mango tree asynchronisms
Index of references