Signature of gene-family scaling laws in microbial ecosystems

High-level functional categories of genes follow quantitative laws

As demonstrated by van Nimwegen [82] and confirmed by a series of follow-up studies [57, 58, 13, 30, 11], striking quantitative laws exist for high-level functional categories of genes. Specifically, the number of genes within an individual functional category exhibits a clear power law when plotted as a function of genome size, measured as the number of protein-coding genes or, at a finer level of resolution, as the number of their constituent domains (see Fig. 2.1).
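A power law of this kind is usually estimated as a straight line in log-log space. The following is a minimal sketch of such a fit, not the procedure used in the cited studies: the function name `fit_power_law`, the synthetic genome sizes, and the noise model are all assumptions made for illustration.

```python
import numpy as np

def fit_power_law(genome_size, counts):
    """Least-squares fit of log(counts) = log(A) + beta * log(genome_size).

    Genomes with a zero count are dropped, because the logarithm is
    undefined there. Returns the pair (beta, A).
    """
    genome_size = np.asarray(genome_size, dtype=float)
    counts = np.asarray(counts, dtype=float)
    keep = (genome_size > 0) & (counts > 0)
    beta, log_a = np.polyfit(np.log(genome_size[keep]), np.log(counts[keep]), 1)
    return beta, np.exp(log_a)

# Synthetic illustration (assumed data): a category whose size grows
# quadratically with genome size, with multiplicative log-normal noise.
rng = np.random.default_rng(0)
sizes = rng.integers(500, 8000, size=200)
true_beta, true_a = 2.0, 1e-5
noise = rng.lognormal(mean=0.0, sigma=0.1, size=sizes.size)
category_counts = true_a * sizes**true_beta * noise

beta_hat, a_hat = fit_power_law(sizes, category_counts)
print(f"fitted exponent: {beta_hat:.2f}")
```

Fitting in log space weights relative rather than absolute deviations, which suits counts spanning several orders of magnitude. Dropping zero counts is a simplification that can bias the fit for families absent from many genomes.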
In prokaryotes, such scaling laws appear well conserved across clades and lifestyles [58], supporting the simple hypothesis that these scaling laws are universally shared by this group. From the evolutionary genomics viewpoint [42], these laws have been explained as a byproduct of specific “evolutionary potentials”, i.e., per-category-member rates of additions/deletions fixed in the population over evolution. As predicted by quantitative arguments, estimates of such rates correlate well with the category scaling exponents [82, 57].
A complementary point of view [55, 64, 30] focuses on the existence of universal “recipes” that determine the ratios of proteins between different functions necessary for genome functionality. Such recipes should mirror the “dependency structure” or network operating within genomes as well as other complex systems [65]. According to this point of view, the usefulness, and thus the occurrence, of a given functional component depends on the presence of a set of other components, which are necessary for it to be operational.

The analysis of quantitative laws at the domain-family level may explain how the scaling of functional categories emerges from the evolutionary dynamics.

Beyond functional categories, protein-coding genes can be classified into “evolutionary families” defined by the homology of their sequences. Functional categories usually contain genes from tens or more distinct evolutionary families.
The statistics of gene families also exhibit quantitative laws and regularities, starting from a universal distribution of their per-genome abundance [36], explained by evolutionary models accounting for birth, death, and expansion of individual families [69, 38, 16]. While some earlier work connects per-genome abundance statistics of families with functional scaling laws [30], the link between functional category scaling and the evolutionary expansion of the gene families that build them remains relatively unexplored. Clearly, selective pressure is driven by functional constraints, and thus selection cannot in principle distinguish between families with identical functional roles. On the other hand, slight differences in the functional spectrum of different protein domains, and the interdependency of different functions, can make the scenario more complex. Thus, one central question is how the abundance of genes performing a specific function emerges from the evolutionary dynamics at the family level. Two alternative extreme scenarios can be put forward:
(i) The high-level scaling laws could emerge only at the level of functions, and be “combinatorially neutral” at the level of the evolutionary families building up a particular function. In this case all or most of the families performing a particular function would be mutually interchangeable.
(ii) Functional category scaling could be the result of the sum of family-specific scaling laws. In this case the evolutionary potentials would be family-specific and coincide with family evolutionary expansion rates, possibly emerging from the complex dependency structure cited above, and from the fine-tuned functional specificity of distinct families.
An intermediate possibility is that an interplay of constraints acts on both functional and evolutionary families. The following sections address the question of which is the most likely scenario by providing a systematic analysis of scaling laws at the family level and their interplay with functional category scaling. We will focus only on bacteria.

Comparison with a null model shows that the existence of scaling laws at the family level is not simply due to sampling effects.

The families that pass the quality filtering procedure all show a clearly identifiable individual scaling when plotted as a function of genome size. As an example, Fig. 2.3 shows the scaling of a set of chosen families in four selected functional categories. It is worth noting that some low-abundance families that occur in all genomes with a very consistent number of copies show definite scaling with exponents close to zero [31], being clearly constant with size, with little or no fluctuations. Additionally, Fig. 2.3 shows that the presence of “outlier families” is common among functional categories. In most categories, we found families whose deviation from the category exponent is clear, beyond the uncertainty due to the errors from the fits. Fig. 2.3 shows some examples where, in each of the shown categories, the family exponent $\beta_i$ is higher than, lower than, or comparable to the category exponent $\beta_c$. A table containing all the family and category exponents is available in Appendix A.
Given that functional categories follow specific scaling laws, likely related to function-specific evolutionary trends [82, 57], there remain different open possibilities for the behavior of the evolutionary families composing the functional categories. One simple scenario is that family scalings are family-specific, thus validating the existence of family evolutionary expansion rates that are quantitatively different from that of their functional category. In the opposite extreme scenario the scaling is only function-specific, and individual families performing similar functions are interchangeable. If this were the case, the observed family diversity in scaling exponent would be only due to sampling effects. To assess the influence of sampling effects, we defined a null model, in which we randomized the families within a category while conserving their occurrence patterns and the category average abundance. In more detail, the null model is based on the following ingredients:
(i) The number of domains belonging to a category $c$ in genome $g$, $n_c^g$, is conserved.
(ii) For each genome, domains are not assigned to families that are not present in that genome.
(iii) The average frequency $f_c(i)$ of each family $i$ with respect to the category $c$ is conserved.
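One way to realize these three ingredients is a constrained multinomial resampling, sketched below. This is an illustrative assumption, not necessarily the randomization used here: the function name `null_model_counts` and the choice of multinomial sampling are hypothetical, and because the frequencies are renormalized over the families present in each genome, ingredient (iii) is only approximately preserved.

```python
import numpy as np

def null_model_counts(counts, rng):
    """Randomize the family counts of one functional category.

    counts: integer array of shape (n_genomes, n_families) holding the number
    of domains of each family (of a single category) in each genome.
    Returns a randomized array of the same shape.
    """
    counts = np.asarray(counts)
    category_totals = counts.sum(axis=1)        # per-genome category size, ingredient (i)
    present = counts > 0                        # occurrence pattern, ingredient (ii)
    freq = counts.sum(axis=0) / counts.sum()    # average family frequency, ingredient (iii)

    randomized = np.zeros_like(counts)
    for g in range(counts.shape[0]):
        if category_totals[g] == 0:
            continue  # category absent from this genome, nothing to assign
        # Restrict the weights to families present in genome g and renormalize.
        weights = freq * present[g]
        probs = weights / weights.sum()
        # Redistribute the category's domains among the allowed families.
        randomized[g] = rng.multinomial(category_totals[g], probs)
    return randomized

# Toy input (assumed data): 4 genomes, 3 families of the same category.
example = np.array([[5, 0, 3],
                    [2, 2, 0],
                    [0, 0, 0],
                    [10, 4, 6]])
shuffled = null_model_counts(example, np.random.default_rng(1))
```

Under this sketch every family inherits, on average, the dependence of the category on genome size, so fitting exponents on many such realizations gives the spread expected from sampling alone. Note that a family present in a genome may receive zero domains in a given realization.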

Family exponents correlate with diversity of biochemical functions but not with contact order or evolutionary rate of domains.

Finally, we considered the correlation of family scaling exponents with relevant biological and evolutionary parameters. We tested the diversity of EC numbers associated with families, which quantifies the functional plasticity of a given family. The Enzyme Commission (EC) number is a classification scheme for enzyme-catalyzed chemical reactions. It is built as a four-level tree where the top nodes are six main groups of reactions, namely Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases and Ligases [6]. We used the mapping between Superfamilies and EC terms [27] to investigate the correlation between the Superfamily scaling and the number of different reactions in which the family is involved. This quantity is the count of distinct EC numbers corresponding to the finest level of the EC classification. Table 2.2 also shows the correlations with other parameters such as foldability (quantified by size-corrected contact order, SMCO [17]), selective pressure (quantified by the ratio Ka/Ks of nonsynonymous to synonymous substitution rates [61]) and overall family abundance.
The results are summarized in Table 2.2. Foldability and Ka/Ks appear to have little correlation with scaling exponents. Instead, we found a significant positive correlation of exponents with family abundance, and both quantities are correlated with the diversity of EC numbers in metabolic families. This suggests that, at least for metabolism, functional properties of a fold play a role in family scaling, and that beyond metabolism, abundance and scaling are, on average, related.

The heterogeneity in scaling exponents is function-specific.

The analyses presented in the previous section support the hypothesis that functional categories contain families with specific scaling exponents. Indeed, the scaling exponents $\beta_i$ of the families can be significantly different from the category exponent $\beta_c$, with deviations that are much larger than predicted by randomizing the categories according to the null model (see Fig. 2.4).
In order to quantify this “scaling heterogeneity” of functional categories, we computed for each family $i$ the distance between its scaling exponent $\beta_i$ and the category exponent $\beta_c$: $h_i = |\beta_c - \beta_i|$.
Finally, we defined an index $H_c$ quantifying the heterogeneity of the scaling of the families within a category by averaging this distance over the families associated with a given category $c$: $H_c = \frac{1}{F_c} \sum_i h_i$, where $F_c$ is the number of families in category $c$.
Interestingly, the heterogeneity $H_c$ and the category exponent $\beta_c$ are correlated: categories with larger values of $\beta_c$ are more heterogeneous. Intuitively, categories with small exponents are incompatible with extremely large fluctuations of family exponents, while categories with larger exponents can contain families with small $\beta_i$. This trend of heterogeneity with exponent is also observed in the null model, although there the heterogeneity of the null categories is much smaller than that of the empirical ones, since all families tend to take the category exponent (Fig. 2.4).
Figure 2.5B allows a direct comparison of the heterogeneity of different categories by subtracting the mean trend. It is noteworthy that the Signal Transduction functional category, which also has clear superlinear scaling, has much lower heterogeneity than DNA-binding/transcription factors. Among the categories with linear scaling, Transferases is one of the least heterogeneous, while the categories Protein Modification and Ion Metabolism and Transport show a large variability in the exponents of the associated families. For Protein Modification, this signal is essentially due to the Gro-ES superfamily and to the HSP90 ATPase domain, which have a clear superlinear scaling, while other chaperone families, such as FKBP, HSP20-like and J-domain, are clearly sublinear with exponents close to zero. Interestingly, the Gro-EL domains, functionally associated with the Gro-EL, are part of this second class (exponent close to 0.2), showing very different abundance scaling from the Gro-EL partner domains. Conversely, the category Ion Metabolism and Transport is divided equally into linearly scaling families (e.g., Ferritin-like iron homeostasis domains) and markedly sublinear families, such as SUF (sulphur assimilation) / NIF (nitrogen fixation) domains. On the other hand, categories with small values of heterogeneity are made of families with exponents close to that of the category, as shown in Table 2.3 in the case of, e.g., Transferases.
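The heterogeneity index defined earlier in this section is a mean absolute deviation of the family exponents from the category exponent. A minimal sketch of the computation follows; the function name `heterogeneity_index` and the exponent values are made up for illustration and are not taken from the data.

```python
import numpy as np

def heterogeneity_index(family_exponents, category_exponent):
    """Return (H_c, h), where h[i] = |beta_c - beta_i| and H_c is the mean of h."""
    beta_i = np.asarray(family_exponents, dtype=float)
    h = np.abs(category_exponent - beta_i)
    return h.mean(), h

# Hypothetical exponents for two categories that share the same category exponent.
H_mixed, h_mixed = heterogeneity_index([0.0, 0.2, 1.8, 2.0], 1.0)
H_tight, h_tight = heterogeneity_index([0.9, 1.0, 1.1], 1.0)
print(H_mixed, H_tight)
```

In this toy case both categories have exponent 1, but the first one obtains it from a mix of strongly sublinear and strongly superlinear families, which the index picks up as a larger value.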

Determinants of the scaling exponent of a functional category

We have shown that the scaling exponents of individual families match the exponent of the corresponding functional category to a variable extent. However, since categories are groups of families, the scaling of the former cannot be independent of the scaling of the latter. This section explores systematically the connection between the two. As detailed below, we find that in some cases the scaling exponent of a functional category is determined by a few outlier families, while in other cases most of the families within a category contribute to the category scaling exponent. While many families have a clear power-law scaling, functional categories may contain many low-abundance families with unclear scaling properties. When considered individually, these families do not contribute much to the total number of domains of a category, but their joint effect on the scaling of the category could be important. Fig. 2.7 shows that the sum of these low-abundance families does not suffer from sampling problems and shows a clear scaling. Interestingly, the scaling exponents for these sums once again do not necessarily coincide with the category exponents.
Figure 2.6A illustrates the systematic procedure that we used in order to understand how the scaling of categories emerges from the scaling of the associated families. Families were ranked by total abundance across all genomes (from the most to the least abundant) and removed one by one from the category. At each removal step in this procedure, both the scaling exponent of the removed family and the exponent of the remainder of the category are considered. In other words, the i-th step evaluates the exponent of the i-th ranking family (in order of overall abundance) and of the set of families obtained by removing the i top-ranking families (with highest abundance) from the category. The resulting exponents quantify the contribution of each family to the global category scaling, as well as the collective contribution of all the families with increasingly lower overall abundance.
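The removal procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind Fig. 2.6: the helper names `loglog_slope` and `removal_scan` are hypothetical, exponents are estimated with a plain log-log least-squares fit, and the three toy families are synthetic.

```python
import numpy as np

def loglog_slope(x, y):
    """Exponent from a least-squares line in log-log space (zero counts dropped)."""
    keep = (x > 0) & (y > 0)
    return np.polyfit(np.log(x[keep]), np.log(y[keep]), 1)[0]

def removal_scan(genome_size, counts):
    """Remove families one by one, from the most to the least abundant.

    counts: array of shape (n_genomes, n_families) for a single category.
    Returns a list of (family_index, exponent_of_removed_family,
    exponent_of_remaining_families), one entry per removal step.
    """
    genome_size = np.asarray(genome_size, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Rank families by total abundance across all genomes, most abundant first.
    order = np.argsort(-counts.sum(axis=0), kind="stable")
    results = []
    for step in range(len(order) - 1):  # stop before the remainder is empty
        removed = order[step]
        remainder = counts[:, order[step + 1:]].sum(axis=1)
        results.append((int(removed),
                        loglog_slope(genome_size, counts[:, removed]),
                        loglog_slope(genome_size, remainder)))
    return results

# Toy category (assumed data): one abundant superlinear family, one linear
# family, and one family with a constant copy number.
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0])
toy = np.column_stack([1e-5 * sizes**2, 0.01 * sizes, np.full(4, 2.0)])
scan = removal_scan(sizes, toy)
for family, beta_removed, beta_rest in scan:
    print(family, round(beta_removed, 2), round(beta_rest, 2))
```

In the toy case the most abundant family carries the superlinear behavior, so the exponent of the remainder drops as soon as it is removed. This is the kind of signature expected when a few outlier families dominate the category exponent.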
The results (Fig. 2.6B) show how the heterogeneity features described above are related to family abundance. Pooled together, the low-abundance families within a functional category may show very different scaling from their category. Additionally, single families follow scaling laws that deviate from that of the corresponding functional category. One notable example is that of Transcription-Factor DNA-binding domains. If the abundance of the outlier families is large enough in terms of the fraction of domains in the functional category, they might be responsible for determining the scaling of the entire category, as happens in the case of DNA-binding (which is discussed more extensively in the following section).

Table of contents :

1 Introduction
2 Family-specific scaling laws in bacterial genomes.
2.1 Introduction
2.1.1 High-level functional categories of genes follow quantitative laws
2.1.2 The analysis of quantitative laws at the domain-family level may explain how the scaling of functional categories emerges from the evolutionary dynamics.
2.2 Families have individual scaling exponents, reflected by family-specific scaling laws
2.2.1 Data analysis
2.2.2 Comparison with a null model shows that the existence of scaling laws at the family level is not simply due to sampling effects.
2.2.3 Family exponents correlate with diversity of biochemical functions but not with contact order or evolutionary rate of domains.
2.3 The heterogeneity in scaling exponents is function-specific.
2.4 Determinants of the scaling exponent of a functional category
2.4.1 Super-linear scaling of transcription factors is determined by the behavior of a few specific highly populated families.
2.5 Grouping families with similar scaling exponents shows known associations with biological function and reveals new ones.
2.6 The main results of our analysis hold also for PFAM clans
2.7 Discussion
3 Dependency networks shape frequencies and abundances in component systems
3.1 Introduction
3.1.1 The emergence of universal regularities in empirical component systems may be the effect of underlying dependency structures of the components.
3.2 Model: description of the dependency structure and the algorithm that defines a realization.
3.3 Our positive model recovers the empirical regularities of component systems, namely the Zipf’s law and the Heaps’ law.
3.3.1 The analytical derivation of the components abundance distribution matches simulations and satisfies the Zipf’s law
3.3.2 The power-law distribution of components occurrence is a “null” result of our model
3.4 The analytical mean-field expression of the Heaps’ law matches the results of numerical simulations of the model.
3.4.1 The analytical expression of the Heaps’ law shows three different regimes
3.4.2 The stretched-exponential saturation is a remarkably good approximation of the simulated data.
3.5 Conclusion
4 Signature of gene-family scaling laws in microbial ecosystems
4.1 Introduction
4.2 Methods
4.3 The analytical implementation of family scaling laws results in the definition of a metagenome invariant.
4.3.1 Analytical derivation of the abundance of a protein family in a metagenome
4.3.2 The metagenomic invariant gives access to the moment of the distribution of genomes size in the metagenome
4.4 The mean genome size and the number of genomes in a metagenome are estimated reliably in simulated metagenomes.
4.4.1 The rescaled family abundance in simulated metagenomes shows clear scaling with family exponent.
4.4.2 The total number of sampled genomes can be estimated reliably in simulated metagenomes
4.4.3 The average genome size can be estimated reliably in simulated metagenomes.
4.4.4 The variance of the genome size distribution deviates from the predicted behavior.
4.5 The mean genome size and the number of genomes are estimated reliably in real metagenomes.
4.6 Conclusions
5 Conclusions and perspectives