J

So we can see that the diagonal elements of this matrix, p(r) = 1/2 + (2p - 1)72 exponentially converge to 1/2 for r —¥ The speed of convergence determines the correlation length:

For T —¥ 0 the correlation length diverges £ = exp(2e/^7) —» while for T —¥ °° the correlation length approaches zero: 4=1/ ln(^77e) —» 0. For finite temperature the correlation length is finite. Hence for one-dimensional Ising model, there is no critical point at positive temperatures, however the absolute zero T= TC=Q can be treated as a critical point because in its vicinity the correlation length diverges faster than any power. So one can identify exponent Vc as being infinite.

The eigenvector ai gives us equal probabilities for a spin to be in positive and negative orientations, thus the spontaneous magnetization being determined as (s(*)) = ¿11 -aj\ = 1/2 — 1/2 = 0 remains zero for all temperatures. In order to compute correlation function, we must compute the average product {s(t)s(t + r)). With probability a\ \ the value s{t) = 1. Given sit) = 1, the probabilities of s(t + r) = 1 and s(t+ r)= -1 are equal to the elements of the first column of matrix Pr. Analogously for j(?) = -1, which occurs with probability ^21, the probabilities of s(t + r)= 1 and s(t+ r)= -1 are given by the elements in the second column of matrix Pr. So

{<*)<* + r)) = ai 1 [/>nM - />2lW] - *2l[/>2iW - P2i(r)] = (2 p - l)r (!6)

and, therefore,

Exponential versus Power Law Correlations

In the previous section, we see that the one-dimensional Ising model in the absence of magnetic field is equivalent to a two-state Markovian process. In general, it is clear that any one-dimensional model with short range interaction is equivalent to a Markovian process with a finite number of states, and for such a process correlations must decay exponentially as A2, where A2 < 1. Thus the correlation length must be finite and can diverge only for T 0. Intuitively, we can imagine a one dimensional model as a row of dancing people each holding hands with two neighbors: one is on the left and one is on the right. Once they are holding hands, the correlation can pass from one neighbor to the next. No matter how strong they are holding hands, there is a finite chance q that they will separate, and the correlation will stop.

The probability that the correlation spreads distance r is proportional to (1 - q)r ~ exp(-gr), and hence the correlation length is finite and is inverse proportional to q.

In contrast, if the number of dimensions is larger than one, the interactions can propagate from point A to point B not only along a straight line from one neighbor to the next, but along an infinite number of possible paths connecting A and B. Accordingly, the correlation length can diverge for T = Tc>0. Unfortunately, there are very few 2-dimensional models which can be solved exacdy17 and even those models have so complicated solutions that they are far beyond the scope of most physics textbooks. The most famous example of an exacdy solvable 2-dimensional model is the Ising model, which was solved by Onsager in 1949. The solution is based on transfer matrices much more complicated than those we use in Section IV to solve the one dimensional model. It is much easier to simulate such a model on a computer and find an approximate numerical solution.

It can be shown that two-dimensional Ising model has a critical point at temperature Tc = 2e/ln(l +s[2 )lkg = 2.269Slkg. At the vicinity of this temperature, the correlation function acquires a non-exponential behavior where 7J = 1/4 is a new critical exponent proposed by M. Fisher in 1964. The correlation length £ diverges as \T- Tc\\ which means that Vc = 1. The spontaneous magnetization for T< Tc is not equal to zero, but, for any sample, it can be either positive or negative.

The absolute value of spontaneous magnetization approaches zero for T-) Tc as (T- Tc)118, so f}c = 1/8. If the temperature increases above Tc (also known as Curie point), the sample loses its magnetization. This phenomenon can be observed by everyone in a kitchen-style experiment: take an arm from a compass, place it into the fire of the burner and keep it there until it starts to glow red. The Curie point for Iron is 700°C. Cool it and place it back into the compass. It does not show North any longer!

Figure 1 shows the results of a computer simulation of a two-dimensional Ising model on the L X L = 1024 X 1024 square lattice. The program is very simple. At any time step, a computer attempts to "mutate" a spin at a randomly chosen lattice site. It first computes the energy change A Urn such a would be mutation. If A U< 0 the mutation always happens, if A U > 0, it happens with probability exp(-Ai//£s7). This algorithm invented by Metropolis in 1953,22 leads to the Boltzmann distribution (3) of the probabilities to find a system in a state with total potential energy U. The proof of this fact is based on the theory of Markovian processes. Indeed, the set of Metropolis rules of flipping the spins can be represented as a transition matrix P with transition probabilities pjjexp(-UjlkgT) = /y-'xp(-UJkgT), where U, and Uj are the potential energies of the corresponding states. Obviously, vector with components dji = exp{-Ui/ksT) taken from the probability distribution (3) satisfies Eq. (9).

The system has periodic boundary conditions, so that pixels on the opposite edges of the system are in close proximity. In fact, the entire system can be viewed as a single line winded around the surface of a bagel. In such a system, site i has 4 neighbors i + 1, i - 1, i + L, and i - L, so the correlation can make really long jumps of length L and —L along the line.

Black and white pixels show spins with positive and negative orientations respectively. One can see patches of irregular shapes and all possible sizes from very small, of one pixel size, to the giant one spanning the entire system. This scale-free property of patches is typical for systems with long range correlations with power law decay. Indeed, exponential decay of correlations C(r) - exp(—r/£) would imply a typical size ^ of patches so that the probability to find larger patches is exponentially small. The same picture can describe the behavior of gas particles near critical point. The molecules form clusters of all possible sizes which scatter light. Does this picture have anything to do with DNA?

It is well known that the DNA sequence has a mosaic structure23 with patches of high concentration of strongly bonded CG base pairs alternating with patches of weakly bonded AT base pairs. These patches are called isochores and can span millions of base pairs. On a smaller scale of genes and exons, coding sequences have larger CG content than non-coding sequences. Finally there exist CpG islands of several hundred base pairs with high CG content.

May these patches have anything to do with Ising model? Of course DNA is not at thermal equilibrium and the concepts of temperature and potential energy cannot be applied to the study of its evolution. However, the evolution of DNA may be thought of as a Markovian process, similar to the Metropolis algorithm described above with mutation probabilities depending on the nature of neighboring nucleotides and on the pool of the surrounding nucleotides during replication process, which may be viewed as an external field or chemical potential.

There are several main objections to this idea:

1. First objection: the DNA chain is one dimensional. As we have seen above, long range correlations cannot exist in a one dimensional system.

This objection can be easily overcome by the argument that the DNA molecule has an extremely complex three dimensional structure in which distant elements along the chain are in close geometrical proximity. Thus the correlation may propagate not only along the chain but may jump many steps ahead as in a toroidal Ising model shown in Figure 1. In 1993 Grosberg et al24 proposed a model based on the distribution of loops in the polymer chain crumpled into a dense globular conformation. This simple model leads to the long range correlations decaying as a power law r"<, where J~ 2/3.

2. Second objection. The long-range correlations emerge only in the narrow vicinity of the critical point. Why in the biological system such as DNA, the probabilities of mutations are such that they correspond to the vicinity of the critical point?

This objection is more difficult to overcome. However there are examples of simple models which drive themselves to the critical behavior. The most relevant example is a polymer chain in the solvent,11 in which the probability to find a monomer in a unit volume at distance r from a given monomer decays as r1/v~3, where V ~ 0.59 is the correlation length exponent first determined by a Nobel prize winner R Flory in 1949. In 1972, another Nobel prize winner EG.de Gennes4 mapped the problem of self-avoiding walks (which are believed to describe the behavior of polymers) to a model of a magnetic similar to an Ising model. He showed that the inverse polymer chain length 1/A^is equivalent to the distance to the critical point T - Tc, and hence the correlation length | (which is equivalent to the radius of the polymer coil) grows as Nv. A polymer chain has also a power law distribution of loops, determined by Des Cloiseaux.25

In recent years, many models have been proposed that have a tendency of self organization (SOC) toward their critical points without any tuning of external parameters.26'27 These models give rise to scaling, and produce sudden avalanche-like bursts of activity distributed according to a power law. Some SOC models are one-dimensional systems and have been applied to biological evolution.28"30 Such models are of great interest and they might be relevant in studies of DNA sequences.

3. Third objection. Biological evolution is an extremely complex process which is governed by many different mechanisms acting at different length and time scales. The interplay of several characteristic length scales may lead to apparent power-law correlations, which thus lack universality of critical phenomena.23

This objection is most probably correct. Indeed, the values of the correlation exponent are different for different species and change with distance r between the nucleotides (See section "Analysis of DNA Sequences"). Never the less, in the beginning of 1990s when the first long DNA sequences became publicly available, the idea to study them by correlation analysis attracted lot of attention.31"41

Correlation Analysis of DNA Sequences

Can correlation analysis be applied to DNA sequences? For a physicist or mathematician a DNA sequence looks like a text written in an unknown language, which is encoded in a 4-letter alphabet A, C, G, T. Each letter in this text corresponds to a DNA base pair. The first question one might ask is what is the overall fraction or frequency of each letter in this text. For example the frequency of letter "A", fA is defined as/a = NA/N, where Na is the number of letters "A" and N is the total length of the sequence.42 This question is easy to answer, especially these days, when the total human genome is sequenced. In human genome,/a =fi~ 0.295 and fG=fc ~ 0.205. Note that these numbers strongly depend on the organism under study. The second question one might ask: "Is there any apparent structure in this text, or it is indistinguishable from a text that would be typed by throwing a 4-sided dice?" (This dice can be made in the form of a Jewish toy, dreidel, with letters A, C, G, T on its sides which have slighdy different surface areas, so that the probability of getting a letter on the top is equal to its frequency in the genome). For a text created by throwing such a dice, the events of getting any two letters at positions k and y are believed to be independent, so the probability of simultaneously getting letter^'at position k and letter Y at position/ is equal to the product fxfy- If there is any structure in the text, the frequency fxy (r) of finding X at position k and Y at position y = k + r will deviate significantly from the predicted value fxfy-

To fully characterize all dependencies among four letters of the DNA alphabet one must compute 16 elements of dependence matrix Dxy{r) = fxv(r) —fxfy^ These dependence coefficients are equivalent to correlation functions used in the previous section to describe Ising model if the nucleotide sequence is replaced by a numerical sequence ix(£) = 1 if nucleotide X is present at position k and sx(k) = 0 otherwise:

where (...) indicates the average over all k.

All other measures of correlations including nonlinear measures such as mutual information can be expressed via dependence coefficients. For example, one can introduce Purine-Pyrimidine (RY) correlation measure, in which any purine (A,G) is replaced by 1 and any pyrimidine (C,T) is replaced by -1. The numerical sequence for RY can be expressed as a linear combination of numerical sequences for each nucleotide sRy = sa — se + ¡g ~ ST-Accordingly,

= Daa + Dec + Dgg + Drr+ 2(Dag + DCT ~ DGT - Drr ~ DAC - DAT).

Analogously, one can introduce Csw, (S = C,G; W = A, 7) or Ckm {K = A,C; M = G, 7) or any other correlation function based on a linear combination of the elementary measures s a, sq sG and xt-.32'46'47 The coefficients of this linear combination can be presented in the form of a vector m = {mA,mc,mG,mi) which we will call a mapping rule. For example, for RY mapping rule, we define m = (1,-1,1,-1), and for Cmapping rule we define m = (0,1,0,0). Accordingly, any correlation measure could be expressed as a quadratic form (m • Dm), where D is the dependence matrix.

Definitely, some of these correlation measures such as Csw are not zero for at least the size of the isochore i.e., a chromosomal region with high or low C + G content. Isochores have a typical size of about 105 base pairs, so the correlations would be non-zero for at least r~ 105.

A physicist whose goal is to understand some general principles of DNA organization may attempt to fit the behavior Dsw(r) by a power law function. A mathematical biologist42'48"50 would rather try to characterize the size distribution of the isochores and their nucleotide content for various chromosomes and species and try to answer questions of biological relevance rather than to measure some power law exponent, which has an ambiguous biological meaning and characterize isochores in a very indirect fashion.

In general, DNA is known for its complex mosaic structure,23'42 with structural elements such as isochores, intergenic sequences, CpG islands, LINE(long interspersed elements) and SINE (short interspersed elements) repeats, genes, exons, introns, and tandem repeats.51'52'53 Each of these structural elements has its different size distribution, nucleotide frequencies, and laws of molecular evolution, so the correlations in the DNA sequence have very complex structure, are different for different spices and can not be characterized by a universal power-law exponent, in a way it is observed in critical phenomena. Correlation studies by their nature involve averaging over large portions of a sequence, so they have a tendency to gloss over particular details. This is the main reason why they are not very popular in bioinformatics whose main tool is the search for sequence similarities54 analogous to finding in an unknown language some already known words or names, which may shed some light on the meaning of their neighbors.

Never the less, characterization of correlations in DNA sequences has some intellectual merit and even practical importance for a biologist whose goal is to understand molecular evolution of DNA sequences.55 There are several reasonable models of DNA evolution in which exact power-law correlations emerge.56"59 The values of the exponents of these power laws depend on the parameters of the model, such as mutation rates and thus can be used to test certain assumptions of the models. These models are discussed in the three sections starting with section "Mutation Duplication Model of DNA Evolution".

Another problem with correlation studies, is that they can be affected by many characteristics of the system, for example sequence length. In order to avoid many potential pitfalls it is very important to understand basic properties of correlation measures and fine-tune them on the well known systems which can serve as golden standards. In the next sections we will introduce various correlation measures and illustrate their usage, applying them to the Ising model, whose correlation properties are well known. Again, an impatient reader may proceed to "Mutation Duplication Model of DNA Evolution".

Correlation Function

In the next four sections we will describe several methods of correlation analysis. To develop some intuition on their advantages and disadvantages, we will apply them to the one-dimensional and two-dimensional Ising models, whose correlation properties are known theoretically.

The most straightforward analysis is the direct computation of the correlation function C(r) defined in Eq. (5). Figure 2 shows the behavior of lnC(r) for the one-dimensional Ising model consisting of L = 21 spins for several values of 7"approaching zero. For small values of r, the graphs are straight lines with the slope equal to the inverse correlation length in complete agreement with Eq. (17). Figure 3 shows the behavior for the two-dimensional Ising model consisting of L1 = 2s X 28 spins above and below critical point. Figure 4 presents the corresponding snapshots of the system. The correlation length increases while temperature decreases toward Tc ~ 2.27 and then very quickly goes down again, as temperature continues to decrease.

This behavior may seem counterintuitive. Indeed, one can argue that correlations below Tc are so strong that the majority of spins acquire the same orientation. However, from a mathematical point of view, the majority of spins, say fraction p, has the same orientation. (In Fig. 4, T = 2.17, it is positive, but in other simulations, it may appear negative). White patches, indicating the negative orientation are small, isolated, and randomly distributed in the sample. These patches of the opposite orientation may be regarded as defects in the crystalline structure. Thus one can regard two spins at distant positions r and r + k to be two independent random variables taking value 1 with probability p and value -1 with probability 1 - p.

Figure 2. Logarithms of the correlation functions for the one-dimensional Ising model with L = 216 spins at three different temperatures T= 1.0 (O), T- 0.6 (□) and T= 0.5 (0). The lines are drawn according to theoretical predictions of Eq. (17).

Figure 2. Logarithms of the correlation functions for the one-dimensional Ising model with L = 216 spins at three different temperatures T= 1.0 (O), T- 0.6 (□) and T= 0.5 (0). The lines are drawn according to theoretical predictions of Eq. (17).

Figure 3. Logarithms of the correlation functions for the two-dimensional Ising model with L = 28 X 28 spins at five different temperatures T= 2.97, T= 2.47, T= 2.37, T= Tc = 2.27, and T= 2.17. The straight horizontal lines show 68% and 95% confidence level for apparent correlations expected to be observed in an uncorrelated data of this length. Away from critical point, the behavior of correlations is well approximated by straight lines indicating exponential decay of correlations. The slopes of these lines are inverse proportional to the correlation length. Close to critical point, correlation length becomes extremely large.

Figure 4. Snapshots of the Ising model far above the critical point T= 2.97, close to the critical point T-237, at the critical point T= 2.27, and below the critical point T= 2.17. One can see that the patch sizes are the largest at the critical point.

Although p» 1 -p, the average product {s(k)s(r + k)) of two independent variables s(k) and s(k + r) must be equal to the product of their averages {s(k)){s(k + r)), so the total correlation C{r) = 0. Note that C(0) = 4p( 1 -p), thus correlation function is small even for small r. Indeed, the graph corresponding to T = 2.17 starts at positions much below the graphs for T> Ta for which C(0) = 1, since p = 1/2.

Note that calculations of InGfr) become very inaccurate as C(r) approaches zero. This is because the statistical error of calculating the correlation function becomes comparable with its value. Indeed, simple probabilistic analysis shows that for two independent variables x and y, the variable (x — (x))(y - (y)) has variance equal to the product of the variances of the variables x and y. When we compute correlation function, we average {(x-(x))(y-{y))) = (s(k)s(r+ k)) - (s(k]){s(k + r)) over N = Lx L positions. In the best possible case, assuming all these measurements are independent, the standard error is the square root of variance divided by the square root of N. Since the variables x = s(k) and y=s(k + r) both have the variance C(0) = Ap{\ - p), where p is the probability of a positive spin, we get this error (7= 4p( 1 — p)l -JlV . Since for T> Tc the probabilities of positive and negative spins are roughly equal, we have <T = 1/256. The horizontal lines indicate levels of O, and 2(7corresponding to 68% and 95% confidence levels. Since in reality* and y are correlated, the number of independent measurements have to be divided by a factor proportional to where d = 2 is the dimensionality. The calculations of C(r) become extremely inaccurate when we approach the critical point at which the correlation length diverges.

One can see that the values of the correlation function can be well approximated by the straight lines above the estimated standard error level, except for T = 2.27 and T= 2.37, when

Figure 5. A) Double logarithmic plot of the correlation functions for the two-dimensional Ising model with L = 210 X 210 spins at critical temperature T= Tc = 2.296 for two realizations of the ferromagnetic model (dash and dotted lines) and the antiferromagnetic model (bold line). The straight line indicates theoretical fit C(r) - r-"4/ a/2 . B) Logarithm of correlation function, multiplied by r"4 -J2 . The linearity of the graph demonstrates exponential behavior of the corrected correlation function. The inverse slopes give values = 400, £ = 270, £ = 220 comparable to half of the system size LI2 =512.

Figure 5. A) Double logarithmic plot of the correlation functions for the two-dimensional Ising model with L = 210 X 210 spins at critical temperature T= Tc = 2.296 for two realizations of the ferromagnetic model (dash and dotted lines) and the antiferromagnetic model (bold line). The straight line indicates theoretical fit C(r) - r-"4/ a/2 . B) Logarithm of correlation function, multiplied by r"4 -J2 . The linearity of the graph demonstrates exponential behavior of the corrected correlation function. The inverse slopes give values = 400, £ = 270, £ = 220 comparable to half of the system size LI2 =512.

the graphs start to bend upward as r decreases. This behavior may indicate a power law decay. To see this more clearly, we simulate a much larger system L = 1024 exacdy at Tc. Figure 5A shows the behavior of correlation function on a double logarithmic plot. For small r the graph is approximately linear with slope -0.25 in agreement with the exact result for an infinite system C(r) = r'v4N2.17 However, one can see that the deviations rapidly increase with r and the agreement breaks down at about r = 10. Analyzing such a data, one can easily dismiss the possibility of power law correlations on the basis that their range is so small. In fact, this early deviation from the power law can be well explained by the finite size of the system L = 1024. Indeed, in a finite system, the correlation length cannot be larger than the radius of the system. In Figure 5B, we show that the correlation function can be well approximated by C(r) ~ r~1/4 exp(-r/§/ -J2 , where £ have different values comparable with the system radius = 512. This example demonstrates difficulties associated with correct identification of power law correlations in a finite system.

It is illuminating to study also the anti-ferromagnetic Ising model, in which neighboring spins prefer to stay in the opposite direction, or be anti-correlated. At low temperatures, an anti-ferromagnetic system looks like a checker board. Mathematically, ferromagnetic and anti-ferromagnetic Ising models are identical, so that any configuration of the anti-ferromagnetic model corresponds to exacdy one configuration of the ferromagnetic model which can be obtained by flipping all the spins according to a simple deterministic rule. Thus in both models, correlation length has the same finite value at any temperature, except at the critical point at which the correlation length in both models diverges. Nevertheless, the behaviors of correlation functions are totally different. In the anti-ferromagnetic case, correlation function is negative for all odd r and is positive for all even r (Fig. 6A.)

For T> Tn the behavior of the absolute value of the correlation function is similar to that of the ferromagnetic model, both decaying exponentially with r, but below Tc in the anti-ferromagnetic case, the absolute values of correlations do not decay at all (Fig. 6B). However, if one average odd and even values of the correlation function, this averaged correlation function decays exponentially to zero as expected. This shows that the correlation length is finite and that there is no true long range correlations.

0 0