Figure 21. Log-log plot of the frequency of 6-letter words (hexamers) versus their rank for invertebrate coding and non-coding sequences in comparison with the same graphs produced by the random dimeric repeat model.

neglecting the repeats of other types. We also neglect the possibility of imperfect repeats interrupted by several point mutations.

Finally, dimeric tandem repeats can explain the difference observed in the distribution of n-letter words in coding and non-coding DNA (see Fig. 21). As an example, we show the rank-frequency dependence of the 6-letter words (hexamers) for invertebrate coding and noncoding sequences in the form of the so-called Zipf plots.114 For natural languages, Zipf graphs show that the frequency of a word in a text is inversely proportional to its rank. For example, in an English text, the most frequent word is "the" (rank 1), the second most frequent word is "of" (rank 2), the third most frequent word is "a" (rank 3) and so on. Accordingly, the frequency of the word "of" is roughly two times smaller than the frequency of the word "the", and the frequency of the word "a" is roughly three times smaller than the frequency of "the". Thus on the log-log scale, the Zipf graph is a straight line with slope -1. In a DNA sequence, there is no precise definition of a "word", so one can define a "word" as any string of a fixed number of consecutive nucleotides that can be found in the sequence. One can notice that the Zipf graph for non-coding DNA is approximately straight but with a slope smaller than 1 in absolute value, while for coding DNA the graph is more curved and less steep. This observation led Mantegna et al.115,116 to conclude that noncoding DNA has some properties of natural languages, namely redundancy. Accordingly, noncoding DNA may contain some "hidden language". However, this conjecture was strongly opposed by the bioinformatics community.117 Indeed, Zipf graphs of coding and non-coding DNA can be trivially explained by the presence of dimeric tandem repeats (Fig. 21).
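The rank-frequency analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used to produce Fig. 21. The function names kmer_rank_frequency and zipf_slope are hypothetical, the words are counted in overlapping windows (an assumption), the slope is estimated by an ordinary least-squares fit over all ranks (also an assumption), and the input is a uniformly random toy sequence rather than real DNA.

```python
from collections import Counter
import math
import random


def kmer_rank_frequency(seq, k=6):
    """Count all overlapping k-letter words and return the counts
    sorted from the most frequent word (rank 1) to the least frequent."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(counts.values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) versus log(rank).
    A pure Zipf law (frequency proportional to 1/rank) gives -1."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input (assumption): a random sequence standing in for real DNA.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(20000))

freqs = kmer_rank_frequency(seq, k=6)
slope = zipf_slope(freqs)
print(f"{len(freqs)} distinct hexamers, fitted log-log slope = {slope:.3f}")
```

With real data one would run the same two functions separately on the coding and noncoding sequences and compare the shapes and fitted slopes of the two graphs. For the random toy sequence the differences between hexamer frequencies come from sampling fluctuations only.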

To conclude, noncoding DNA may not contain any hidden "language", but it definitely contains a lot of hidden biological information. For example, it contains transcription regulatory information, which is very difficult to extract. Application of correlation analysis may help to solve this problem.118


Long-range correlations on different length scales may develop due to different mutational mechanisms. The longest correlations, on the length scales of isochores, may originate from base-substitution mutations during replication (see ref. 77). Indeed, it is known that different parts of chromosomes replicate at different stages of cell division. The regions rich in C+G replicate earlier than those rich in A+T. On the other hand, the concentration of C+G precursors in the cell is depleted during replication. Thus the probability of substituting A/T for C/G is higher in those parts of the chromosome that replicate later. These unequal mutation rates may lead to the formation of isochores.77 Correlations on the intermediate length scale of thousands of nucleotides may originate from DNA shuffling by insertion or deletion57,58 of transposable elements such as LINEs and SINEs66,68,119 or from a mutation-duplication process proposed by W. Li56 (see also ref. 120).

Finally, the correlations on the length scale of several hundreds of nucleotides may evolve due to simple repeat expansion.106,108 As we have seen in the previous section, the distributions of simple repeats are dramatically different in coding and noncoding DNA. In coding DNA they have an exponential distribution; in noncoding DNA they have long tails that in many cases can be fit by a power-law function. The power-law distribution of simple repeats can be explained if one assumes a random multiplicative process for the mutation of the repeat length, i.e., each mutation changes the repeat length by a random factor drawn from a certain distribution (see ref. 106). Such a process may take place due to errors in replication110 or unequal crossing over (see ref. 108 and refs. therein). Simple repeat expansion in the coding regions would lead to a loss of protein functionality (as, e.g., in Huntington's disease110) and to the extinction of the organism.
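A minimal numerical sketch of such a random multiplicative process is given below. It is not the model of ref. 106 itself. The uniform distribution of the random factor, its bounds (0.4 and 1.5, chosen so that the mean logarithm of the factor is negative), and the rule that a repeat cannot become shorter than one unit are all assumptions made for illustration, and the function names are hypothetical. Multiplicative processes with a lower bound of this kind are expected to develop a long, approximately power-law tail, which can be inspected with tail_fraction.

```python
import random


def simulate_repeat_lengths(n_repeats=2000, n_steps=200,
                            low=0.4, high=1.5, seed=1):
    """Evolve n_repeats independent repeat lengths.
    At each step every length is multiplied by a random factor drawn
    uniformly from [low, high] (assumed distribution); lengths are not
    allowed to fall below 1 (assumed lower bound)."""
    rng = random.Random(seed)
    lengths = [1.0] * n_repeats
    for _ in range(n_steps):
        for i in range(n_repeats):
            lengths[i] = max(1.0, lengths[i] * rng.uniform(low, high))
    return lengths


def tail_fraction(lengths, threshold):
    """Fraction of repeats whose length is at least `threshold`."""
    return sum(1 for x in lengths if x >= threshold) / len(lengths)


lengths = simulate_repeat_lengths()
for threshold in (1, 2, 4, 8, 16, 32):
    print(threshold, tail_fraction(lengths, threshold))
```

If the tail follows a power law, the printed fractions should fall by a roughly constant factor each time the threshold doubles; an exponential distribution would instead fall off ever faster.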

Thus the weakness of long-range correlations in coding DNA is probably related to the conservation of coding DNA during biological evolution. Indeed, the proteins of bacteria and humans have many common templates, while the noncoding regions can be totally different even for closely related species. The conservation of protein coding sequences and the weakness of correlations in the amino acid sequences121 are probably related to the problem of protein folding. Monte Carlo simulations of protein folding on the cubic lattice suggest that the statistical properties of the sequences that fold into a native state resemble those of random sequences.122


The higher tolerance of noncoding regions to various mutations, especially to mutations involving the growth of DNA length (e.g., duplication, insertion of transposable elements, and simple repeat expansion), leads to strong long-range correlations in the noncoding DNA. Such tolerance is a necessary condition for biological evolution, since its main pathway is believed to be gene duplication by chromosomal rearrangements, which does not affect coding regions.123 However, the price paid for this tolerance is the growth of highly correlated junk DNA.


I am grateful to many individuals, including H.E. Stanley, S. Havlin, C.-K. Peng, A.L. Goldberger, R. Mantegna, M.E. Matsa, S.M. Ossadnik, F. Sciortino, G.M. Viswanathan, N.V. Dokholyan, I. Grosse, H. Herzel, D. Holste, and M. Simons for major contributions to those results reviewed here that represent collaborative research efforts. Financial support was provided by the National Science Foundation and National Institutes of Health (Human Genome Project).


1. Stauffer D, Stanley HE. From Newton to Mandelbrot: A Primer in Theoretical Physics. Heidelberg, New York: Springer-Verlag, 1990.

2. Stanley HE. Introduction to Phase Transitions and Critical Phenomena. London: Oxford University Press, 1971.

3. Stauffer D, Aharony A. Introduction to Percolation Theory. Philadelphia: Taylor & Francis, 1992.

4. de Gennes PG. Scaling Concepts in Polymer Physics. Ithaca: Cornell University Press, 1979.

5. Barabási AL, Stanley HE. Fractal Concepts in Surface Growth. Cambridge: Cambridge University Press, 1995.

6. Mandelbrot BB. The Fractal Geometry of Nature. San Francisco: WH Freeman, 1982.

7. Feder J. Fractals. New York: Plenum, 1988.

8. Bunde A, Havlin S, eds. Fractals and Disordered Systems. Berlin: Springer-Verlag, 1991.

9. Bunde A, Havlin S, eds. Fractals in Science. Berlin: Springer-Verlag, 1994.

10. Garcia-Ruiz JM, Louis E, Meakin P et al, eds. Growth Patterns in Physical Sciences and Biology. New York: Plenum, 1993.

11. Grosberg AY, Khokhlov AR. Statistical Physics of Macromolecules. New York: AIP Press, 1994; Grosberg AY, Khokhlov AR. Giant Molecules. London: Academic Press, 1997.

12. Bassingthwaighte JB, Liebovitch LS, West BJ. Fractal Physiology. New York: Oxford University Press, 1994.

13. Vicsek T. Fractal Growth Phenomena. Singapore: World Scientific, 1992.

14. Vicsek T, Shlesinger M, Matsushita M, eds. Fractals in Natural Sciences. Singapore: World Scientific, 1994.

15. Guyon E, Stanley HE. Fractal Forms. Amsterdam: Elsevier, 1991.

16. Li W. The study of correlation structures of DNA sequences: a critical review. Computers Chem 1997; 21:257-271.

17. Baxter RJ. Exactly Solvable Models in Statistical Mechanics. London: Academic Press, 1982.

18. Azbel MY. Random two-component, one-dimensional Ising model for heteropolymer melting. Phys Rev Lett 1973; 31:589-593.

19. Azbel MY, Kantor Y, Verkh L et al. Statistical analysis of DNA sequences. Biopolymers 1982; 21:1687-1690.

20. Azbel MY. Universality in a DNA statistical structure. Phys Rev Lett 1995; 75:168-171.

21. Feller W. An Introduction to Probability Theory and Its Applications. Vols. 1-2. New York: John Wiley & Sons, 1970.

22. Binder K, ed. Monte Carlo Methods in Statistical Physics. Berlin: Springer-Verlag, 1979.

23. Karlin S, Brendel V. Patchiness and correlations in DNA sequences. Science 1993; 259:677-680.

24. Grosberg AY, Rabin Y, Havlin S et al. Crumpled globule model of the 3-dimensional structure of DNA. Europhys Lett 1993; 23:373-378.

25. des Cloizeaux J. Short range correlation between elements of a long polymer in a good solvent. J Physique 1980; 41:223-238.

26. Bak P. How Nature Works. New York: Springer, 1996.

27. Bak P, Tang C, Wiesenfeld K. Self-organised criticality: an explanation of 1/f noise. Phys Rev Lett 1987; 59:381-384.

28. Bak P, Sneppen K. Punctuated equilibrium and criticality in a simple model of evolution. Phys Rev Lett 1993; 71:4083-4086.

29. Paczuski M, Maslov S, Bak P. Avalanche dynamics in evolution, growth and depinning models. Phys Rev E 1996; 53:414-443.

30. Jovanovic B, Buldyrev SV, Havlin S et al. Punctuated equilibrium and history-dependent percolation. Phys Rev E 1994; 50:R2403-2406.

31. Peng C-K, Buldyrev SV, Goldberger AL et al. Nature 1992; 356:168.

32. Li W, Kaneko K. Long-range correlations and partial 1/f^α spectrum in a noncoding DNA sequence. Europhys Lett 1992; 17:655.

33. Nee S. Uncorrelated DNA walks. Nature 1992; 357:450-450.

34. Voss R. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett 1992; 68:3805-3808.

35. Voss R. Long-Range Fractal Correlations in DNA Introns and Exons. Fractals 1994; 2:1-6.

36. Maddox J. Long-range correlations within DNA. Nature 1992; 358:103-103.

37. Munson PJ, Taylor RC, Michaels GS. DNA correlations. Nature 1992; 360:636-636.

38. Amato I. Mathematical biology-DNA shows unexplained patterns writ large. Science 1992; 257:747-747.

39. Prabhu VV, Claverie J-M. Correlations in intronless DNA. Nature 1992; 359:782-782.

40. Chatzidimitriou-Dreismann CA, Larhammar D. Long-range correlations in DNA. Nature 1993; 361:212-213.

41. Li W, Kaneko K. DNA correlations. Nature 1992; 360:635-636.

42. Karlin S, Cardon LR. Computational DNA sequence analysis. Annu Rev Microbiol 1994; 48:619-54.

43. Herzel H, Grosse I. Correlations in DNA sequences: The role of protein coding segments. Phys Rev E 1997; 55:800-810.

44. Grosse I, Herzel H, Buldyrev SV et al. Species independence of mutual information in coding and noncoding DNA. Phys Rev E 2000; 61:5624-5629.

45. Holste D, Grosse I, Herzel H et al. Optimization of coding potentials using positional dependence of nucleotide frequencies. J Theor Biol 206:525-537.

46. Berthelsen CL, Glazier JA, Skolnick MH. Global fractal dimension of human DNA sequences treated as pseudorandom walks. Phys Rev A 1992; 45:8902-8913.

47. Borovik AS, Grosberg AY, Frank-Kamenetski MD. Fractality of DNA texts. J Biomolec Struct Dyn 1994; 12:655-669.

48. Li WT. Are isochore sequences homogeneous? Gene 2002; 300:129-139.

49. Bernaola-Galvan P, Carpena P, Roman-Roldan R et al. Study of statistical correlations in DNA sequences. Gene 2002; 300:105-115.

50. Oliver JL, Carpena P, Roman-Roldan R et al. Isochore chromosome maps of the human genome. Gene 2002; 300:117-127.

51. Alberts B, Bray D, Lewis J et al. Molecular Biology of the Cell. New York: Garland Publishing, 1994.

52. Watson JD, Gilman M, Witkowski J et al. Recombinant DNA. New York: Scientific American Books, 1992.

53. Chen CF, Gentles AJ, Jurka J et al. Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc Natl Acad Sci USA 2002; 99:2930-2935.

54. Altschul SF, Madden TL, Schaffer AA et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997; 25:3389-3402.

55. Audit B, Vaillant C, Arneodo A et al. Wavelet analysis of DNA bending profiles reveals structural constraints on the evolution of genomic sequences. J Biol Phys 2004; 30:33-81.

56. Li WH. Expansion-modification systems: A model for spatial 1/f spectra. Phys Rev A 1991; 43:5240-5260.

57. Buldyrev SV, Goldberger AL, Havlin S et al. Generalized Levy Walk Model for DNA Nucleotide Sequences. Phys Rev E 1993; 47:4514-4523.

58. Buldyrev SV, Goldberger AL, Havlin S et al. Fractal Landscapes and Molecular Evolution: Modeling the Myosin Heavy Chain Gene Family. Biophys J 1993; 65:2673-2681.

59. Vieira MD, Herrmann HJ. A growth model for DNA evolution. Europhys Lett 1996; 33:409-414.

60. Hansen JP, McDonald IR. Theory of Simple Liquids. London: Academic Press, 1976.

61. Abramowitz M, Stegun IA, eds. Handbook of Mathematical Functions. New York: Dover, 1965.

62. Press WH, Flannery BP, Teukolsky SA et al. Numerical Recipes. Cambridge: Cambridge Univ Press, 1989.

63. Burrus CS, Parks TW. DFT/FFT and Convolution Algorithms. New York: John Wiley and Sons, Inc., 1985.

64. Peng CK, Buldyrev SV, Havlin S et al. Mosaic organization of DNA sequences. Phys Rev E 1994; 49:1685-1689.

65. Chen Z, Ivanov PC, Hu K et al. Effect of nonstationarities on detrended fluctuation analysis. Phys Rev E 2002; 65:041107.

66. Jurka J, Walichiewicz T, Milosavljevic A. Prototypic sequences for human repetitive DNA. J Mol Evol 1992; 35:286-291.

67. Hattori M, Hidaka S, Sakaki Y. Sequence analysis of a KpnI family member near the 3' end of human beta-globin gene. Nucleic Acids Res 1985; 13:7813-7827.

68. Hwu RH, Roberts JW, Davidson EH et al. Insertion and/or deletion of many repeated DNA sequences in human and higher apes evolution. Proc Natl Acad Sci USA 1986; 83:3875-3879.

69. Churchill GA. Hidden Markov chains and the analysis of genome structure. Computers Chem 1992; 16:107-116.

70. Zolotarev VM, Uchaikin VM. Chance and Stability: Stable Distributions and their Applications. Utrecht: VSP BV, 1999.

71. Shlesinger MF, Zaslavsky GM, Frisch U, eds. Lévy Flights and Related Topics in Physics. Berlin: Springer-Verlag, 1995.

72. Arneodo A, D'Aubenton-Carafa Y, Audit B et al. What can we learn with wavelets about DNA sequences? Physica A 1998; 249:439-448.

73. Voss RF, Clarke J. 1/f noise in music: music from 1/f noise. J Acoust Soc Amer 1978; 63:258-263.

74. Schenkel A, Zhang J, Zhang, YC. Long Range Correlation in Human Writings. Fractals 1993; 1:47-57.

75. Amit M, Shmerler Y, Eisenberg E et al. Language and codification dependence of long-range correlations in texts. Fractals 1994; 2:7-13.

76. Trifonov EN. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A 1998; 249:511-516.

77. Gu X, Li WH. A model for the correlation of mutation-rate with GC content and the origin of GC-rich isochores. J Mol Evol 1994; 38:468-475.

78. Viswanathan GM, Buldyrev SV, Havlin S et al. Quantification of DNA patchiness using correlation measures. Biophys J 1997; 72:866-875.

79. Viswanathan GM, Buldyrev SV, Havlin S et al. Long-range correlation measures for quantifying patchiness: Deviations from uniform power-law scaling in genomic DNA. Physica A 1998; 249:581-586.

80. Buldyrev SV, Goldberger AL, Havlin S et al. Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis. Phys Rev E 1995; 51:5084-5091.

81. Nyeo SL, Yang IC, Wu CH. Spectral classification of archaeal and bacterial genomes. J Biol Syst 2002; 10:233-241.

82. Arneodo A, Bacry E, Graves PV et al. Characterizing long-range correlations in dna-sequences from wavelet analysis. Phys Rev Lett 1995; 74:3293-3296.

83. Nikolaou C, Almirantis Y. A study of the middle-scale nucleotide clustering in DNA sequences of various origin and functionality, by means of a method based on a modified standard deviation. J Theor Biol 2002; 217:479-492.

84. Ossadnik SM, Buldyrev SV, Goldberger AL et al. Correlation approach to identify coding regions in DNA sequences. Biophys J 1994; 67:64-70.

85. Uberbacher EC, Mural RJ. Locating protein-coding regions in human DNA sequences by a multiple sensor neural network approach. Proc Natl Acad Sci USA 1991; 88:11261-11265.

86. Fickett JW, Tung CS. Assessment of protein coding measures. Nucleic Acids Research 1992; 20:6441-6450.

87. Holste D, Grosse I, Beirer S et al. Repeats and correlations in human DNA sequences. Phys Rev E 2003; 67:061913.

88. Bell GI. Roles of repetitive sequences. Comput Chem 1992; 16:135-143.

89. Bell GI. Repetitive DNA sequences: some considerations for simple sequence repeats. Comput Chem 1993; 17:185-190.

90. Bell GI. Evolution of simple sequence repeats. Comput Chem 1996; 20:41-48.

91. Bell GI, Jurka J. The length distribution of perfect dimer repetitive DNA is consistent with its evolution by an unbiased single step mutation process. J Mol Evol 1997; 44:414-421.

92. Richards RI, Sutherland GR. Simple repeat DNA is not replicated simply. Nat Genet 1994; 6:114-116.

93. Richards RI, Sutherland GR. Simple tandem DNA repeats and human genetic disease. Proc Natl Acad Sci USA 1995; 92:3636-3641.

94. Chen X, Mariappan SV, Catasti P et al. Hairpins are formed by the single DNA strands of the fragile X triplet repeats: structure and biological implications. Proc Natl Acad Sci USA 1995; 92:5199-5203.

95. Gacy AM, Goellner G, Juranic N et al. Trinucleotide repeats that expand in human disease form hairpin structures in vitro. Cell 1995; 81:533-540.

96. Orth K, Hung J, Gazdar A et al. Genetic instability in human ovarian cancer cell lines. Proc Natl Acad Sci USA 1994; 91:9495-9499.

97. Bowcock AM, Ruiz-Linares A, Tomfohrde J et al. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 1994; 368:455-457.

98. Olaisen B, Bekkemoen M, Hoff-Olsen P et al. VNTR mutation and sex. In: Pena SDJ, Chakraborty, R, Epplen JT et al, eds. DNA Fingerprinting: State of the Science. Basel: Springer-Verlag, 1993.

99. Jurka J, Pethiyagoda G. Simple repetitive DNA sequences from primates: compilation and analysis. J Mol Evol 1995; 40:120-126.

100. Li YC, Korol AB, Fahima T et al. Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol Ecol 2002; 11:2453-2465.

101. Kremer E, Pritchard M, Lynch M et al. Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p(CCG)n. Science 1991; 252:1711-1714.

102. Huntington's Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell 1993; 72:971-983.

103. Ionov Y, Peinado MA, Malkhosyan S et al. Ubiquitous somatic mutations in simple repeated sequences reveal a new mechanism for colonic carcinogenesis. Nature 1993; 363:558-561.

104. Kunkel TA. Slippery DNA and diseases. Nature 1993; 365:207-208.

105. Wooster R, Cleton-Jansen AM, Collins N et al. Instability of short tandem repeats (microsatellites) in human cancers. Nat Genet 1994; 6:152-156.

106. Dokholyan NV, Buldyrev SV, Havlin S et al. Distribution of base pair repeats in coding and noncoding DNA sequences. Phys Rev Lett 1997; 79:5182-5185.

107. Dokholyan NV, Buldyrev SV, Havlin S et al. Distributions of dimeric tandem repeats in non-coding and coding DNA sequences. J Theor Biol 2000; 202:273-282.

108. Dokholyan NV, Buldyrev SV, Havlin S et al. Model of unequal chromosomal crossing over in DNA sequences. Physica A 1998; 249:594-599.

109. Charlesworth B, Sniegowski P, Stephan W. The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 1994; 371:215-220.

110. Wells RD. Molecular basis of genetic instability of triplet repeats. J Biol Chem 1996; 271:2875-2878.

111. Levinson G, Gutman GA. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 1987; 4:203-221.

112. Buldyrev SV, Dokholyan NV, Havlin S et al. Expansion of tandem repeats and oligomer clustering in coding and noncoding DNA sequences. Physica A 1999; 273:19-32.

113. Kruglyak S, Durrett RT, Schug MD et al. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc Natl Acad Sci USA 1998; 95:10774-10778.

114. Zipf GK. Human Behavior and the Principle of Least Effort. Redwood City: Addison-Wesley, 1949.

115. Mantegna RN, Buldyrev SV, Goldberger AL et al. Linguistic features of noncoding DNA sequences. Phys Rev Lett 1994; 73:3169-3172.

116. Mantegna RN, Buldyrev SV, Goldberger AL et al. Phys Rev E 1995; 2939.

117. Bonhoeffer S, Herz AVM, Boerlijst MC et al. Explaining "linguistic features" of noncoding DNA. Science 1996; 271:14-15.

118. Makeev VJ, Lifanov AP, Nazina AG et al. Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucl Acids Res 2003; 31:6016-6026.

119. Jurka J, Kohany O, Pavlicek A et al. Duplication, coclustering, and selection of human Alu retrotransposons. Proc Natl Acad Sci USA 2004; 101:1268-1272.

120. Stanley HE, Afanasyev V, Amaral LAN et al. Anomalous fluctuations in the dynamics of complex systems: From DNA and physiology to econophysics. Physica A 1996; 224:302-321.

121. Pande V, Grosberg AY, Tanaka T. Nonrandomness in protein sequences: evidence for a physically driven stage of evolution. Proc Natl Acad Sci USA 1994; 91:12972-12975.

122. Shakhnovich EI, Gutin AM. Implications of thermodynamics of protein folding for evolution of primary sequences. Nature 1990; 346:773-775.

123. Li W-H, Marr TG, Kaneko K. Understanding long-range correlations in DNA sequences. Physica D 1994; 7:392-416.
