Get Complete Project Material File(s) Now! »


An evolutionary biologist is interested in how processes affecting evolution have produced the diversity of genes, genomes, organisms, species and communities that are observed today.
A classical approach to study these processes is the reconstruction of phylogenetic trees of genes, genomes, organisms and species (Figure 6); an outcome from the crucial Darwin works on theory of evolution by natural selection, published in his book « On the Origin of Species » (1859) (Darwin 1859). The theory of evolution is based on the idea that all living organisms evolve from earlier forms by modification and divergence according to a tree process as a result of natural selection (Lewontin 1970). It is commonly assumed that such evolving units present a few necessary conditions for evolution by natural selection, namely (i) phenotypic variation among members of an evolutionary unit, (ii) a link between phenotype, survival, and reproduction (i.e., differential fitness), and (iii) heritability of fitness differences (individuals resemble their relatives more than unrelated individuals) (Bapteste et al. 2012).
In this example, the gene has undergone two mutations in the ancestral species, the first mutation giving rise to the ‘blue’ allele and the second to the ‘green’ allele. Random genetic drift in association with the two subsequent speciations results in the red allele lineage appearing in species A, the green allele lineage in species B and the blue allele lineage in species C. Molecular phylogenetics based on the gene sequences will reveal that the red-blue split occurred before the blue-green split, giving the gene tree shown on the right. However, the actual species tree is different, as shown on the left. Based on Li W-H (1997) Molecular Evolution. Sinauer, Sunderland, MA. (Brown 2002).
Like Darwin, scientists believed that evolution was a slow and gradual process. However, the tree model is not enough to explain the evolutionary history of life on Earth (Nutman et al. 2016). Besides the tree-like process, other processes called nongradual, involving combinatorial (e.g. recombination, fusion, fission) and introgressive (e.g. integration of foreign genetic element in to a genome by HGT) exist and cannot be represented accurately by a tree (Dagan et al. 2008; Halary et al. 2010; Corel et al. 2016).
Saltational processes, such as recombination events, fusion, fission or lateral gene transfer (or horizontal gene transfer), are found at different levels of biological organization (Figure 7) Network-based methods have been described to be a well suited approach to detect, analyze and visualize the vertical and horizontal relationships at the genomic level and in several genomes at the same time (Corel et al. 2016). Network approaches are increasingly used to complement phylogenetic analysis in molecular evolution, comparative genomics, classification and ecological studies (Halary et al. 2013). For example, their suitability for investigating introgressive events have enhanced our understanding of the chimeric origin of genes in the eukaryotic proteome (Thiergart et al. 2012; Alvarez-Ponce et al. 2013), the flow of genes between prokaryotes and their mobile genetic elements (Halary et al. 2010; Dagan 2011; Kloesges et al. 2011; Popa et al. 2011; Jaffe et al. 2016) and gene sharing across mobile elements to study the transfer of resistance factors (Fondi and Fani 2010; Tamminen et al. 2012). Networks have also been used to describe complex biological systems, including inferring the “social networks” of biological life forms (Halary et al. 2010), producing maps of genetic diversity (Cheng et al. 2014), detecting distant homologues (Park et al. 1997; Bolten et al. 2001; Bapteste et al. 2012) and exploring gene and genome rearrangements (Jachiet et al. 2013; Meheust et al. 2016).
Composite genes result from the fusion of different gene domains. (B) Composite genomes can result from the introgression of a gene into a genome, or (C) from the introgression of a genome into a genome. (D) Composite organisms can arise from the introgression of a mobile genetic element. Holobionts result from the introgression of a genome (E) or of another cell (F) into a cell. (Corel et al. 2016).
The revolution in DNA sequencing has been a major advance for evolutionists, giving them new opportunities to investigate these diverse kinds of questions with molecular data; however they also present challenges in terms of the scale of the analyses. Consequently, development of new methods for the construction and analysis of networks has been necessary. In the book chapter « The Methodology Behind Network-Thinking: Graphs to Analyze Microbial Complexity and Evolution« , we present the different kinds of networks based on sequence similarity that have been introduced to tackle a wide range of biological questions, including sequence similarity networks, genome networks and bipartite graphs, and a guide for their construction and analyses. This book chapter has been submitted to the editor Anisimova Maria for the book « Evolutionary Genomics: statistical and computational methods » (Humana Press, Springer).
In the post genomic era, large and complex molecular datasets from genome and metagenome sequencing projects expand the limits of what is possible for bioinformatic analyses. Network based methods are increasingly used to complement phylogenetic analysis in studies in molecular evolution, including comparative genomics, classification and ecological studies. Using network methods, the vertical and horizontal relationships between all genes or genomes, whether they are from cellular chromosomes or mobile genetic elements, can be explored in a single expandable graph. In recent years, development of new methods for the construction and analysis of networks has helped to broaden the availability of these approaches from programmers to a diversity of users. This chapter introduces the different kinds of networks based on sequence similarity that are already available to tackle a wide range of biological questions, including sequence similarity networks, genome networks and bipartite graphs, and a guide for their construction and analyses.
An evolutionary biologist is interested in how processes governing evolution have produced the diversity of genes, genomes, organisms, species and communities that are observed today. For example, a biologist interested in the eukaryotes may wonder what symbiotic partners have contributed to their origins and evolution. Eukaryotic nuclear genomes are chimeric in nature, encoding many genes acquired from their alpha-proteobacterial endosymbiont (13). However in recent years it has been proposed that the ongoing gain of genes by both microbial (46) and multicellular Eukaryotes (7, 8) via lateral gene transfer (LGT) has continued to contribute to eukaryotic evolution, though to a lesser extent than prokaryotes (9). A biologist interested in prokaryotes may wish to investigate lateral gene transfer to explore the extent and kinds of genes transferred between bacteria, archaea and their mobile genetic elements (1014). These transfers are important for understanding the accessory genomes of prokaryotes (1517). Further, studying gene transfers in real bacterial communities from different environments can help to test the effect of LGT on ecology and evolution of communities (18). Given the prevalence of introgression (911, 19), one interesting question is whether gene transfer has led to the formation of novel fusion genes that combine parts of genes originating from separate domains of life (20). An ecologist may wish to analyse the distribution of genes and species in the environment (21). A metagenome analyst may need to overcome an additional challenge is exploring the nature of the large proportion of sequences in metagenome sequence projects that have with little or no detectable similarity to characterised sequences in order to study the “microbial dark matter” (22).
High-throughput sequencing technology presents new opportunities to investigate these diverse kinds of questions with molecular data; however they also present challenges in terms of the scale of the analyses. Consequently, a number of network based methods have recently been developed to expand the toolkit available to molecular biologists (23), and these have already made major contributions to our understanding molecular evolution. Networks have been used to shed light on the nature of the “microbial dark matter” (24) and used in ecological studies to explore the geographical distribution of organisms or genes (25, 26) or the evolution of different lifestyles (27). Their suitability for investigating introgressive events have been used to enhance our understanding of the chimeric origin of genes in the eukaryotic proteome (28, 29), the flow of genes between prokaryotes and their mobile genetic elements (3035) and gene sharing across mobile elements to study the transfer of resistance.
While the generation and analysis of networks was previously limited to biologists with programming experience, tools have recently been developed to simplify the process and broaden the availability of network analyses of molecular sequence data. This chapter introduces the different kinds of networks that are already available to biologists and a guide to how these networks can be constructed and analysed for a large range of applications in molecular evolution. More precisely, this chapter will focus on three kinds of network and the types of analyses that are possible using these networks: sequence similarity networks, genome networks, and multipartite graphs (23).

Sequence similarity networks (SSNs)

Sequence similarity networks are the bread and butter of network based molecular sequence analyses, with a huge range of applications in molecular biology. The use of SSNs for molecular sequence analysis first came to the fore in the late 1990s and early 2000s, when SSNs were suggested as a way to analyse the rapid influx of new molecular sequence data due to advances in sequencing technology and cost, as well as to predict gene functions and protein-protein interactions (39, 4143). One of the earliest formal and heuristic uses of SSNs was to define the COG groups of homologous families and facilitate prediction of the functions of large numbers of genes based on homology (39, 40). The need for efficient computation and analyses for large biological databases still pervades; however more recently SSNs have been increasingly appreciated as useful approaches to describe complex biological systems, including inferring the “social networks” of biological life forms (30), producing maps of genetic diversity (27), detecting distant homologues (4446) and exploring gene and genome rearrangements (47, 48).
A SSN is a graph in which each node is a sequence and edges connect any two nodes that are similar at the sequence level above a certain threshold (e.g. coverage, percent identity and E-value) as determined by their pair-wise alignment (Box 1) (Figure 1). While the principle behind SSN construction is simple, the expression of similarity data in this structure can enable the use of powerful algorithms for graph analyses to study complex biological phenomena. Construction of a SSN is also frequently the starting point in a diversity of further graph analyses. A SSN can be constructed directly from fasta formatted sequence files using pipelines, such as EGN (49), the updated and faster performing EGN2 (forthcoming), or PANADA (50). Visualisation of networks can be performed with programs such as Cytoscape (51) or Gephi (52), both of which also have a range of internal tools and external plugins for network analysis. While these programs are useful for the visualisation and analysis of relatively small networks, it can be difficult to load large and complex networks with a lot of edges (e.g. ≥50,000 edges). In these cases the iGraph library offers an extremely powerful and well supported implementation of a broad range of commonly used methods for both complex graph generation and analysis in R, Python and C++ (53). However, using iGraph requires knowledge of programming in at least one of these languages. An additional package for network analysis in Python is NetworkX (54). It is our goal here to further generalise network approaches by explaining how evolutionary biologists with less programming knowledge could analyse their data. A list including many of the tools and programs available for SSN generation is available at
A set of sequences (protein or DNA) in fasta format (A) are aligned in pairs using alignment tools (such as BLAST). These alignments (B) are scored with metrics such as the percentage identity between two sequences (the number of identical nucleotides / amino acids displayed above) or the E-value of the alignment. In the resulting network, sequences are represented as nodes. Two sequence nodes are joined with an edge if they can be aligned above a define threshold, with the weight of the edge often based on percentage identity or E-value.
The first and most important step of SSN construction is the assembly of a dataset of sequences relevant to your biological question, usually in fasta format. This can be used as the initial input for wizards such as EGN or EGN2 (49), which can fully automate the process. The nature of the dataset is highly dependent on the question, so here we focus on the practicalities of database assembly. To construct the similarity network all sequences in the dataset are aligned against one another in a similarity search. This similarity search is often the time limiting step in an analysis, and the total number of searches required is quadratic to the number of sequences in the dataset. For large datasets it is useful to benchmark the alignment using a subset of the data to estimate the timescale for the alignment. Large datasets can generate huge outputs, not only due to the number of sequences but also the length of their identifier. One way to reduce the output size is to replace each sequence name in the fasta file with a unique integer. The use of integers will reduce disk space use and the memory consumption for any software used to analyse the sequence data.
To generate a sequence similarity network all sequences must be aligned against one another in an all versus all search, in which the dataset of sequences is searched against a database including the same sequences. For gene networks, the alignment is usually done with fast pairwise aligner such as BLAST (55, 56) as implemented in EGN (49). Filters are often used to remove low-complexity sequences from the search, as these can cause artefactual hits (BLAST options –seg yes, -soft-masking true). The BLAST method of alignment will be the focus of future discussion in this chapter, however alternatives are available including BLAT (57) (also implemented in EGN), SWORD (58), USEARCH (59) and DIAMOND (60). These alternatives generally include an option to produce a “BLAST” style tabulated output, making them compatible with programs commonly used in network analyses.
Within alignment tools like BLAST it is possible to assign set thresholds, such as the maximum E-value of the alignment to retain only significant hits or to output only the best alignment for a pair of sequences (BLAST option –max_hsps 1), drastically reducing the size of the output. It is not recommended to set minimal thresholds for some parameters (such as % sequence identity) unless required due to memory constraints so that you can generate networks from a single sequence alignment with different thresholds for comparison (e.g. comparison of a 30% similarity threshold to a 90% threshold, where edges will only be drawn between highly similar genes).
Note: It may be intuitive to use additional CPUs to speed up the alignment process, however in BLAST it can be more efficient to split the query file and launch multiple searches on separate cores instead of using the BLAST multithreading option. The pairwise alignment step is generally the most time limiting part of generating a SSN, so benchmarking should be used to establish the optimal settings for the pairwise and/or determine the feasibility of a project given the size of the dataset and the available computational resources.
which must be removed prior to network construction (Erreur ! Source du renvoi introuvable.). When query sequence A in a similarity search is aligned with sequence B in the database, often the reciprocal result is also identified (an alignment between query sequence B and sequence A in the database). These are called reciprocal hits; while the sequences involved are identical, the alignments and scores are not. Retaining both hits would generate two different edges between the same two nodes in a SSN, so generally only the best results from reciprocal hits are retained, based on a score such as the E-value (Erreur ! Source du renvoi introuvable.). Finally, a single query sequence may be significantly aligned multiple times in different positions of the same sequence in the database, however for SSN construction only the best BLAST hit is generally retained (Erreur ! Source du renvoi introuvable.). The selection of the best BLAST hit is again generally often based on the E-value (corresponding to the BLAST -max_hsps 1 option). Removing multiple hits against the same sequence allows the generation of an undirected network where a single edge connects two nodes, representing the best possible alignment between these nodes.
Constructing a SSN from a BLAST output is conceptually simple; an edge is created between two sequences (nodes) that have been aligned in the sequence similarity search. It is common to apply thresholding criteria such as minimal % ID and/or coverage and/or maximal E-value to determine whether an edge is drawn between two sequences in the network (Figure 1). There are different ways to calculate the % coverage of an alignment. This could be based on the coverage of a single sequence in the alignment, selecting either the query or the database sequence in each alignment, or the longest or shortest sequence in each alignment. Alternatively both (mutual coverage) can be used, retaining an alignment when both values are above a given threshold. Edges above the thresholding criteria can be assigned a weight based on these criteria, producing a weighted sequence similarity network that retains information of the properties of the alignment between two sequences (Figure 1). It is often useful to construct and compare several SSNs with variable stringencies defining the edges between sequences, for example, to optimise gene family detection within the SSN (discussed below).
In the output of an all against all sequence similarity search there are a number of features that are often filtered out prior to network construction. Self hits (1/ and 2/), where like sequences are paired in a sequence alignment, are not informative to network construction and are removed (highlighted by the red box surrounding the alignments). In cases where there are reciprocal hits (3/ and 4/) between two sequence then only the alignment with the highest E-value is retained (highlighted with a green box around the retained alignment) to ensure only one edge representing the best possible alignment connects any two nodes in the network, The same is true for cases where a sequence has multiple hits against another sequence, such as when it aligns to another sequence in multiple positions (5/ and 6/).

READ  Knowledge Extraction for Biology and A.Thaliana

Exploiting sequence similarity networks for identification of gene families

A gene family is usually defined as a group of sequences that are similar at the sequence level, indicative of homology and potentially of shared functions, however there is no uniform way to define this similarity (61, 62). One of the early contributions of SSNs in molecular sequence analysis was in the construction of the COG database of homologous protein sequences (39, 40). This study attempted to define gene families based on similarity at the sequence level using the results of sequence similarity searches. Within the results of an all versus all BLAST search, groups of at least three proteins encoded by different genomes that were more similar to each other than they were to other proteins found in the same genomes were defined as a likely orthologous gene family. Orthologous gene families are group of genes in different genomes that show sequence similarity, likely as a result of their shared evolutionary history.
The idea of using graphs to identify gene families is now a core part of many graph-based analyses. Members of a gene family aggregate in a sub-network in a SSN. These sub-networks are called connected components (CCs) at these defined thresholds, i.e. clusters of nodes connected by edges either directly or indirectly (via intermediate nodes) (Figure 3). The size (number of nodes and edges in a CC) and density (the proportion of potential connections between all nodes in a CC that are actually connected by edges in the graph) of CCs will depend on the thresholds used for constructing the SSN as well as the relationships between sequences in the network. For example, for a given dataset at a given mutual coverage threshold, a threshold of 90% sequence identity will identify a large number of small connected components that only include highly similar genes, while at threshold of 30% sequence identity there will be fewer but larger connected components including genes with more variation in sequence similarity. Commonly used thresholds for detecting homologous gene families are an E-value ≤ e-5, mutual coverage ≥ 80% and a percentage of identity ≥ 30% (23).
CCs are often detected in a SSN using the Depth-First Search (DFS) algorithm; however there are also other approaches for the detection of gene families based on the idea of detecting “communities” (63). In some cases, a CC can be further separated into communities of sequences that share more similarity to one another than to other sequences in the CC, thus are more highly linked in the SSN (Figure 3). Communities are commonly identified by using graph clustering algorithms such as Louvain (64), MCL (65) or OMA (66), however different clustering algorithms will result in different outputs. The Louvain weighted method is widely used because it is simple to implement and scales very well to large graphs (Figure 3, Figure 4) (64). MCL is a strong deterministic algorithm that has been implemented, for example, in tribeMCL (65) and orthoMCL (67). A potential drawback of MCL is that it requires user specification of the “inflation index”, a parameter which controls cluster granularity (or “tightness”). A high inflation index increases the tightness of clustering; producing a larger number of clusters that are smaller on average than those that would be obtained clustering the same dataset using a low inflation index. Selecting an appropriate inflation index is not trivial and requires optimisation (65).
The network is assembled from the results of an all versus all alignment, as previously described. Edges can be weighted by E-value, percentage of identity or bitscore. For the purpose of simplification we consider strong or weak weights rather than actual value. A) A giant connected component at relaxed threshold. B) Three connected components at a more stringent threshold. C) Three communites with Louvain clustering algorithm, taking into account edge weights.
A) A single giant connected component from a sequence similarity network. B) The same giant connected component after application of a community detection algorithm. Node colours correspond to the newly assigned communities.
A number of the above approaches have been used to compile additional databases of orthology that can act as useful reference datasets. OMA is a program that uses graph based algorithms and exact Smith-Waterman alignments to identify orthology between genes (68 71). OMA is also available as a web browser (72) including a database of orthologues that, in 2015, included more than 2000 genomes and more than 7 millions proteins (66). SILIX is a software package (73) that aims at building families of homologous sequences by using a transitive linkage algorithm and HOGENOM (74) is a database that contains families inferred by SILIX for 7 millions proteins.
In addition to clustering genes into families, valuable information can be extracted from the connected components using network metrics. Highly conserved sequences tend to form CCs where most of the nodes are connected to each other by edges, while sequences from more divergent families will tend to form more sparsely interconnected CCs. This information can be easily assessed for each component using the clustering coefficient. Conserved families will have a clustering coefficient close to 1, even for stringent thresholds. Identifying such conserved families can be useful to produce multiple sequence alignments (MSA) needed for phylogenetic reconstruction, but SSNs have also been demonstrated to unravel relationships between distant homologues by linking distantly related sequences together (24, 29, 45). In a SSN, two distant sequences A and C which do not share similarity according to BLAST can be linked together due to a sequence B which shows similarity to both A and C.
The idea of distant homology has been particularly illuminating regarding chimeric organisms such as eukaryotes who carry genes inherited from a bacterial ancestor and from an archeal ancestor (29). A common way to analyse sequence similarity networks is to identify certain “paths” of interest, for example, the shortest possible paths between two nodes. This notion describes the path between two nodes in a connected component that minimises the sum of the edge weights. Alvarez-Ponce et al. used this approach to explore the topology of connected components in a SSN including the complete proteomes of 14 eukaryotes, 104 prokaryotes (including archaea and bacteria), 2,389 viruses and 1,044 plasmids. 899 CCs contained sequences from all three domains and of these 208 contained eukaryotic sequences that were not directly similar to one another, but only linked to one-another via a “eukaryote-archaea-bacteria-eukaryote” shortest path. These are putatively distant homologues in Eukaryotes that were present in both the Archaeal-host for the mitochondrial endosymbiont and the alphaproteobacterial endosymbiont, with both copies subsequently retained in Eukaryotes, and as such strong evidence for the chimeric origin of eukaryotes (29). This adds to previous studies to demonstrate the utility of networks in the study of ancient evolutionary relationships including the origin of eukaryotes (28) or rooting the tree of life (75). Simple path analysis for a network is possible using existing plugins within visualisation tools such as Cytoscape (51) and Gephi (52).

Exploiting SSNs to identify signatures of “tinkering” and gene fusion

When discussing identification of gene families we have focused on networks where edges are drawn between protein sequences that show a high enough similarity across their entire length, defined by a high mutual coverage threshold (e.g. 80%). Sequence similarity can also be partial, for example following gene remodelling or “tinkering” (76) producing new combinations of gene domains via gene fusion and fission events, or through the de novo sequence synthesis of gene extensions, adding to existing sequences. The term ”Rosetta Stone sequence” was coined to define the formation of a new fusion protein in a species as the result of the fusion of two proteins that are found separate in another species, with authors originally predicting that these fusions could occur between proteins that physically interact in a common structural complex (77). One of the earliest applications of sequence similarity searches to identify fusion proteins was an attempt to predict pairs of proteins that may physically interact in an organism based on whether they could be identified as a single “composite” fusion protein in another organism (41). Beyond predicting protein-protein interactions, this kind of gene remodelling and recycling of existing gene parts has the potential to contribute to the expansion of functional diversity in genomes, creating new and unique combinations of domains and functions (48, 76, 7882). Similarity search based screens have been implemented to identify composite genes and genome rearrangements in a range of prokaryotes (8385), eukaryotes (78, 8688) and viruses (89).
Early attempts to identify composite genes were based on the output of sequence similarity searches, but without formalising the results of search methods into a graph structure. The first attempt to formalise the problem of identifying “composite” genes in networks was the “Neighbourhood Correlation” approach, aiming to distinguish genuine multi-domain proteins sharing common ancestry (homologues) from novel multi-domain proteins that share domains due to insertions (90). The later development of the FusedTriplets and MosaicFinder tools attempted to unify existing graph based methods for detection of “composite” gene detection (47). FusedTriplets is a graph based implementation of the traditional gene centred method for composite gene identification, originally introduced by Enright et al. 1999, with additional cross-checks on the absence of similarity between the two component genes contributing to a composite gene based on varying thresholds (47, 91). MosaicFinder is a gene family centred approach which will only identify highly conserved composite gene families that form “minimal clique separators” (Figure 5) (47). This graph topology implies that MosaicFinder may fail to detect divergent (e.g. ancient or fast evolving) composite gene families which will tend to form “quasi-cliques” without perfect separation. CompositeSearch (forthcoming: available at is a new program designed to overcome this limitation by identifying both conserved and divergent composite gene families (Box 2).
A) A multiple sequence alignment of composite genes (yellow) with two components (blue and magneta). B) The sequence similarity network corresponding to the multiple sequence alignment. The composite genes (yellow) are a minimal clique separator for the network. Their removal (shown in C) decomposes the network to the two separate component families.
All versus all BLAST search filtered as described in “How to build your own sequence similarity network”.
Composite search takes a filtered BLAST output and a list of genes within as the initial input. Two search algorithms are implemented: “fastcomposites” detects a list of potential composite genes. “Composites” additionally detects potential composite and component gene families. Additional options are included to filter the network based on a number of standard metrics (e.g. E-value, sequence similarity, mutual coverage) and set the maximum overlap allowed between different components aligned on the same potential composite gene. The definition of a maximum overlap allows adjustment for the tendency of BLAST to produce overhanging alignments (91). The output includes a node, edge and information file including information on number of nodes, edges and family connectivity from family detection. Two outputs are included for composite gene detection, a “composites” file with detailed information on each predicted composite gene in fasta format, and “compositesinfo”, summarising the data. Similarly two files provide detailed information on composite gene families and a summary of composite gene families.

Table of contents :

I.2.1 Data avalanche
I.2.2 Sequence Similarity Networks
I.2.3 Computational bottleneck of SSN
III.1.1 CompositeSearch: A new tool for studying modularity in molecular evolution
III.1.2 Impact of genomic structure and mobility on gene remodeling in plasmids, allowing the evolution of new dependency systems.
III.1.3 Evolution of genes and rules of gene remodeling during the transition and stabilization of animal multicellularity


Related Posts