Here and in the previous publications,32,36,37 we describe a rather general class of models, which are based on the classical concept of a birth-and-death process and seem to be applicable to the genome evolution process. Similar, although not identical and apparently less general, modeling approaches have been considered by others.16,31,46 Even earlier, evolution of gene families has been modeled within the distinct mathematical framework of multiplicative processes.47
The utility of birth-and-death type models in evolutionary genomics in itself is not a trivial matter and stems from fundamental features of genome evolution which, in part, have been presciently envisaged by classic geneticists and, in part, became apparent after the advent of genomics. As captured in the title of Ohno's famous book,18 although foreseen even in the early days of genetics,17,48 gene duplication probably is the principal mechanism of genome evolution. Of course, genomes cannot grow ad infinitum and, through most of the evolutionary history, the number of genes within a given phylogenetic lineage probably remains roughly constant. Hence duplication is intrinsically coupled to gene loss. The results of comparative genomics further show that many genes in each lineage cannot be obviously linked to other genes through duplication. Without necessarily specifying the biological mechanisms (these could involve rapid change after duplication, gene acquisition via horizontal transfer, and possibly, birth of genes from noncoding sequences), it is reasonable to view these unique genes as resulting from innovation. For genomes to maintain equilibrium, the combined rates of duplication and innovation over the entire ensemble of gene families should equal the rate of gene loss, at least when averaged over long time spans. Furthermore, the observed distribution of family sizes, which asymptotically tends to a power law, dictates a much more specific connection between the gene birth and death rates, namely, the second order balance (4).
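The connection between balanced birth and death rates and a power-law tail can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes linear rates of the form lam*(i + a) for duplication and delta*(i + b) for loss in a family of size i; this parameterization and the values of a and b are chosen here purely for illustration. It relies on the standard detailed-balance relation for a birth-and-death chain at equilibrium, f(i+1) = f(i) * birth(i) / death(i+1). Under this assumption innovation only sets the overall scale of the distribution, so it is left out and the frequencies are normalized instead.

```python
import math


def stationary_sizes(n_max, birth, death):
    """Relative equilibrium frequencies of family sizes 1..n_max.

    Uses detailed balance: f[i+1] = f[i] * birth(i) / death(i+1).
    """
    f = [1.0]
    for i in range(1, n_max):
        f.append(f[-1] * birth(i) / death(i + 1))
    total = sum(f)
    return [x / total for x in f]


# Hypothetical parameters (not taken from the chapter).
lam = delta = 1.0   # equal per-gene duplication and loss constants
a, b = 1.0, 2.0     # illustrative offsets of the linear rates
n_max = 2000

f = stationary_sizes(
    n_max,
    birth=lambda i: lam * (i + a),
    death=lambda i: delta * (i + b),
)

# Slope of log f versus log i over the tail; a power law gives a constant slope.
slope = (math.log(f[n_max - 1]) - math.log(f[n_max // 2 - 1])) / (
    math.log(n_max) - math.log(n_max // 2)
)
print(f"tail slope on log-log axes: {slope:.3f}")
```

With equal lam and delta, the tail slope in this toy setting comes out close to a - b - 1 (about -2 for the values above). Making lam smaller than delta adds an exponential cutoff, so the tail no longer follows a power law.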
The incentive to examine these models in detail stems from at least three rather fundamental questions: (i) are the above elementary evolutionary mechanisms sufficient to account for the empirically observed characteristics of genomes, (ii) what is the contribution of natural selection to the general quantifiable features of genomes, such as the size distribution of gene families, and (iii) how similar or how different are the models describing evolution of phylogenetically distant genomes, such as those of prokaryotes and eukaryotes. The analysis of BDIMs starts to provide some answers, although it is premature to consider these final in any sense. The critical observation made in the course of BDIM analysis was that different versions of these models could be readily distinguished on the basis of goodness of fit to the empirical data. This being the case, we found that the simplest possible model, in which all paralogs are considered independent, does not explain the data well. Thus, turning to the first of the above questions, we have to conclude that "something else" is required to model genome evolution, on top of the three elementary processes. This "something" is dependence or "interaction" between gene family members, which results in self-accelerating family growth. In order to account for the observed stationary distribution of family sizes, it is sufficient to introduce a very weak dependence as embodied in the linear BDIM. However, when we switched from the deterministic to the stochastic version of BDIMs, which allows analysis of the dynamics of the system's evolution, we found that evolution under the linear BDIM was much too slow to account for the emergence of the large families of paralogs found in all genomes during the time of life's evolution. Only higher order BDIMs, with degrees between 2 and 3, i.e., with "strong interactions" between family members, were found to provide for sufficiently fast evolution to be compatible with the real biological timescale.
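The qualitative effect of the model degree on the speed of family growth can be seen in a toy calculation. The sketch below is an illustration under simplifying assumptions, not the stochastic analysis reported in the cited work: birth and death rates are both set to i**d for a family of size i, size 1 is treated as a reflecting boundary (a family never goes extinct), and time is measured in arbitrary units of the inverse rate constant. It uses the standard first-passage recursion for a birth-and-death chain, T(i) = (1 + death(i) * T(i-1)) / birth(i), where T(i) is the mean time to step from size i to size i + 1.

```python
def mean_time_to_size(n_max, birth, death):
    """Mean time for a family to grow from size 1 to size n_max.

    Assumes a reflecting boundary at size 1 (no loss from a single-member family).
    """
    total = 0.0
    t_prev = 0.0
    for i in range(1, n_max):
        d = death(i) if i > 1 else 0.0
        t_i = (1.0 + d * t_prev) / birth(i)  # mean time for the step i -> i+1
        total += t_i
        t_prev = t_i
    return total


# Hypothetical balanced rates of degree 1 and degree 2.
n_max = 1000
t_linear = mean_time_to_size(n_max, lambda i: float(i), lambda i: float(i))
t_quadratic = mean_time_to_size(n_max, lambda i: float(i) ** 2, lambda i: float(i) ** 2)
print(f"degree 1: {t_linear:.1f}, degree 2: {t_quadratic:.1f}")
```

In this toy setting the degree-1 chain needs several times longer than the degree-2 chain to reach a family of 1000 members, and the gap widens as the target size grows. The absolute numbers carry no biological meaning, because the rate constants here are arbitrary.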
Obviously, these findings raise the question: what is the nature of the mysterious "interactions" between paralogs? This brings us to the second of the above major problems. BDIMs do not explicitly include the notion of selection. However, the simplest interpretation of the interactions implied by the higher order BDIMs seems to be that these reflect adaptive evolution of gene families driven by positive selection. Should that be the case, we are justified to conclude that very weak selection would suffice to explain the stationary distribution of family sizes, but much stronger selective pressure is needed to account for the dynamics of genome evolution. However, the interpretation of BDIM degree as a manifestation of selection is, at this point, no more than a guess. One of the further developments of genome evolution modeling involves introducing selection explicitly and determining whether the resulting more sophisticated models will be equivalent to the higher order BDIMs explored here.
BDIMs worked well in describing evolution of all analyzed genomes, from the smallest prokaryotic ones to the most complex genomes of plants and animals. However, the parameters of the resulting models, i.e., the duplication, deletion, and innovation rates, differed significantly, suggesting some tantalizing answers to the third of the questions posed above. In particular, we found that the innovation rates in prokaryotes were an order of magnitude greater than those in eukaryotes.32 An optimistic interpretation of this difference is that the relatively high innovation rates detected for prokaryotes reflect rampant horizontal gene transfer, an increasingly recognized defining feature in the evolution of bacteria and archaea.49-51 Should that be the case, we might be justified to conclude that BDIMs are telling us something new regarding the extent of this phenomenon. However, it would be premature to rule out the pessimistic explanation, i.e., that the observed differences are due to some cryptic modeling artifacts. The issue definitely deserves further investigation, through refined modeling approaches and analysis of additional comparative-genomic data.
In conclusion, it makes sense to ask the $64K question: do the models discussed in this chapter (and similar ones) reveal something new about biology? So far we seem to have only rather equivocal answers. Earlier in this section, we discussed some interesting hints on new aspects of the role of selection in genome evolution and on distinct regimes of evolution in different domains of life. Realistically, however, the principal conclusions seem to be quite general and mostly methodological. Indeed, it was observed in these and related analyses that important aspects of genome evolution can be realistically modeled with simple, straightforward approaches. Perhaps more importantly, the work summarized here makes the next step by showing (to paraphrase Einstein's famous aphorism) that models of genome evolution should be as simple as possible but not simpler and that we seem to be able to identify the minimal required level of complexity. Future developments will show whether or not a path exists from these general findings to new biology.
1. Pareto V. Cours d'Economie Politique. Paris: Rouge et Cie 1897.
2. Zipf GK. Human behaviour and the principle of least effort. Boston: Addison-Wesley, 1949.
3. Barabasi AL. Linked: The New Science of Networks. New York: Perseus Pr, 2002.
4. Mendes JF, Dorogovtsev SN. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford: Oxford University Press, 2003.
5. Gisiger T. Scale invariance in biology: Coincidence or footprint of a universal mechanism? Biol Rev Camb Philos Soc 2001; 76:161-209.
6. Luscombe N, Qian J, Zhang Z et al. The dominance of the population by a selected few: Power-law behaviour applies to a wide variety of genomic properties. Genome Biol 2002; 3: (research
7. Koonin EV, Wolf YI, Karev GP. The structure of the protein universe and genome evolution. Nature 2002; 420:218-223.
8. Kuznetsov VA. Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. EURASIP Journal on Applied Signal Processing 2001; 4:285-296.
9. Barabasi AL, Oltvai ZN. Network biology: Understanding the cell's functional organization. Nat Rev Genet 2004; 5:101-113.
10. Barabasi AL, Albert R. Emergence of scaling in random networks. Science 1999; 286:509-512.
11. Bilke S, Peterson C. Topological properties of citation and metabolic networks. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 2001; 64:036106.
12. Dorogovtsev SN, Mendes JF. Scaling properties of scale-free evolving networks: Continuous approach. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 2001; 63:056125.
13. Albert R, Jeong H, Barabasi AL. Error and attack tolerance of complex networks. Nature 2000; 406:378-382.
14. Jeong H, Tombor B, Albert R et al. The large-scale organization of metabolic networks. Nature 2000; 407:651-654.
15. Jeong H, Mason SP, Barabasi AL et al. Lethality and centrality in protein networks. Nature 2001; 411:41-42.
16. Qian J, Luscombe NM, Gerstein M. Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model. J Mol Biol 2001; 313:673-681.
17. Fisher RA. The possible modification of the response of the wild type to recurrent mutations. Am Nat 1928; 62:115-126.
18. Ohno S. Evolution by gene duplication. Berlin, Heidelberg, New York: Springer-Verlag, 1970.
19. Henikoff S, Greene EA, Pietrokovski S et al. Gene families: The taxonomy of protein paralogs and chimeras. Science 1997; 278:609-614.
20. Jordan IK, Makarova KS, Spouge JL et al. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res 2001; 11:555-565.
21. Lespinet O, Wolf YI, Koonin EV et al. The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res 2002; 12:1048-1059.
22. Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004; 428:617-624.
23. Chervitz SA, Aravind L, Sherlock G et al. Comparison of the complete protein sets of worm and yeast: Orthology and divergence. Science 1998; 282:2022-2028.
24. Lander ES, Linton LM, Birren B et al. Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921.
25. Lynch M, Force A. The probability of duplicate gene preservation by subfunctionalization. Genetics 2000; 154:459-473.
26. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science 2000; 290:1151-1155.
27. Aravind L, Watanabe H, Lipman DJ et al. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci USA 2000; 97:11319-11324.
28. Katinka MD, Duprat S, Cornillot E et al. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 2001; 414:450-453.
29. Koonin EV, Fedorova ND, Jackson JD et al. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 2004; 5:R7.
30. Gardiner CW. Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences. Berlin: Springer-Verlag, 1985.
31. Rzhetsky A, Gomez SM. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 2001; 17:988-996.
32. Karev GP, Wolf YI, Rzhetsky AY et al. Birth and death of protein domains: A simple model of evolution explains power law behavior. BMC Evol Biol 2002; 2:18.
33. Dokholyan NV, Shakhnovich B, Shakhnovich EI. Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci USA 2002; 99:14132-14136.
34. Pastor-Satorras R, Smith E, Sole RV. Evolving protein interaction networks through gene duplication. J Theor Biol 2003; 222:199-210.
35. Wagner A. How the global structure of protein interaction networks evolves. Proc R Soc Lond B Biol Sci 2003; 270:457-466.
36. Karev GP, Wolf YI, Koonin EV. Mathematical modeling of the evolution of domain composition of proteomes: A birth-and-death process with innovation. In: Galperin MY, Koonin EV, eds. Computational Genomics: From Sequence to Function. Amsterdam: Horizon Press, 2002:3:261-314.
37. Karev GP, Wolf YI, Koonin EV. Simple stochastic birth and death models of genome evolution: Was there enough time for us to evolve? Bioinformatics 2003; 19:1889-1900.
38. Marchler-Bauer A, Panchenko AR, Shoemaker BA et al. CDD: A database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 2002; 30:281-283.
39. Bhattacharya R, Waymire E. Stochastic processes with applications. New York: Wiley, 1990.
40. Ross SM. Introduction to probability models. Boston: Academic Press, 1989.
41. Karev GP, Wolf YI, Berezovskaya FS et al. Gene family evolution: An in-depth theoretical and simulation analysis of nonlinear birth-death-innovation models. BMC Evol Biol 2004; 4:32.
42. Krauss LM, Chaboyer B. Age estimates of globular clusters in the Milky Way: Constraints on cosmology. Science 2003; 299:65-69.
43. Karlin S, McGregor J. The number of mutant forms maintained in a population. In: LeCam L, Neyman J, eds. Proc Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967.
44. Hedges SB, Chen H, Kumar S et al. A genomic timescale for the origin of eukaryotes. BMC Evol Biol 2001; 1:4.
45. Hedges SB. The origin and evolution of model organisms. Nat Rev Genet 2002; 3:838-849.
46. Reed WJ, Hughes BD. A model explaining the size distribution of gene and protein families. Math Biosci 2004; 189:97-102.
47. Huynen MA, van Nimwegen E. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 1998; 15:583-589.
48. Bridges CA. Salivary chromosome maps. J Hered 1935; 26:60-64.
49. Doolittle WF. Lateral genomics. Trends Cell Biol 1999; 9:M5-8.
50. Koonin EV, Makarova KS, Aravind L. Horizontal gene transfer in prokaryotes: Quantification and classification. Annu Rev Microbiol 2001; 55:709-742.
51. Gogarten JP, Doolittle WF, Lawrence JG. Prokaryotic evolution in light of gene transfer. Mol Biol Evol 2002; 19:2226-2238.