H
I J 1—1 —
duplication \ (rate 6}
Figure 2. Three models: A) minimal model with uniform initial distribution, B) minimal model with an arbitrary initial distribution, C) gene deletion, and D) selective pressure.
One important conclusion may be drawn from the generalized model: all initial distributions ultimately lead to the same limiting distribution determined by theAm. Just as before, the dependence on the initial fold distribution N^m) decays with time, leading to the same asymptotic distribution as was found for an initial distribution of Nq folds of size 1 in (9), reflecting the dominance of fold introduction over gene duplication for large times. Of course, the details of how and when the crossover happens will depend on the particular form of Ntmt(m).
Extended Model: Including the Effects of Random Gene Deletion
Gene deletion is a major factor in evolution and is discussed briefly by Qian et al.11 In this section we incorporate an additional parameter, Q, that represents gene deletion.
The most natural extension of (2) that accounts for random gene deletion at rate Q would be the following:
dF(m,t) (m — l)F(m — l,t) mF(m,t) Am + l)F(m + l,t) ^mF(m,t) .
* = —m+Q—mQ~m " (ro"nm
The terms proportional to Q encode the dynamics for gene deletion, which are very similar to gene duplication: on average, Q deletions occur for every duplication event and the gene to be deleted is chosen randomly from all the genes in the genome. In this way the population of a given bin m can either decrease due to gene deletion if the gene to be deleted is from bin m itself, or it can increase as a result of a deletion in the neighboring bin m +1.
In this extended model, gene growth occurs at the uniform rate one would expect: G(t) = Nq + (1 + R  Q)t. In contrast, the behavior of the expected number of folds is more complicated:
Folds of size 1 that are deleted disappear from the genome so F(t) depends explicidy on the population of F{l,t); unlike the Q = 0 case, the dynamics of F(t)can not be determined without knowing the full solution to (10).
The extended model is much more complicated mathematically, primarily because the difference equations are now second order. In the minimal model, the behavior of larger folds depends only on the behavior of smaller folds, so the full solution can be constructed inductively starting from the solution for m=\. With gene deletion operating as well, the dynamics of different fold sizes are coupled together. In many respects, these dynamics are like those describing diffusion phenomena; when Q = 0 the genome exhibits growth due to drift, or directed movement alone, while nonzero Q introduces diffusive, or nondirectional movement as well.
Analytic Results
We were able to derive a full analytical solution only in the absence of any new fold introduction: R = 0. In this case, only stochastic gene deletion and duplication operate. We will restrict our discussion to when gene duplication occurs at a higher rate than gene deletion, which requires 0 < Q < 1, so the genome will still grow in size, at least in terms of number of genes: G(t) = No+(l— Q)t. Note that since R = 0, equation (11) shows that the number of folds will actually decrease with time. Losing folds while gaining genes is possible if the larger folds make up for the loss of genes from smaller folds.
An analytic solution exists for an initial distribution of Nq different folds of size 1 and is worked out in detail in Appendix D. Once again, the distribution is exponential. Figure 3A shows histograms F(m,t) corresponding to three values of Q and a fixed time.
Remarkably, the normalized distribution of fold size (the probability distribution) is independent of the gene deletion rate Q:
Figure 3. The effects of gene deletion: A) fold histogram F(m, t) for TVo = 100 and t = 1000 plotted for Q= 0, 0.4, 0.8 and R = 0; B) normalized largetime limiting fold distribution and C) the total number of folds as a function of time when R = 1.0 and Q = 0, 0.2, 0.4, 0.6; D) normalized largetime fold distribution and E) the total number of folds as a function of time for fixed overall gene growth: 1 + R  Q = 1.6 and Q = 0, 0.2, 0.3, 0.4; F) analytic approximation, shown using solid lines, for parameters plotted in (A).
Figure 3. The effects of gene deletion: A) fold histogram F(m, t) for TVo = 100 and t = 1000 plotted for Q= 0, 0.4, 0.8 and R = 0; B) normalized largetime limiting fold distribution and C) the total number of folds as a function of time when R = 1.0 and Q = 0, 0.2, 0.4, 0.6; D) normalized largetime fold distribution and E) the total number of folds as a function of time for fixed overall gene growth: 1 + R  Q = 1.6 and Q = 0, 0.2, 0.3, 0.4; F) analytic approximation, shown using solid lines, for parameters plotted in (A).
Hence gene deletion does not affect the shape of the distribution at all when R =0, only the overall normalization is changed. This can also be seen direcdy from Figure 3B.
Although an exact analytical solution does not seem possible for arbitrary R and Q, it is nonetheless possible to derive analytic expressions for the higher moments of the fold distribution. Appendix E discusses how this is done and particular, includes an expression for the second moment that will prove useful when fitting the model to genomic data.
Numerical Results for Nonzero R and Q< 1
Numerical solution of (10) reveals for that large times, the normalized histograms of fold size approach a timeinvariant limit that depends solely on R and Q. Figure 3B shows the probability distributions for a fixed rate of new fold acquisition, R=1.0, and increasing rates of gene deletion: Q = 0, 0.2, 0.4, 0.6. The powerlaw character of the distributions is retained even for large values of Q. Quite reasonably, higher rates of gene deletion encourages the dominance of smaller folds, leading to a more rapid decline ofp(m) with fold size m. Common folds require repeated gene duplication and an avoidance of gene deletion events to proliferate. As the probability of avoidance is proportional to 1  Q, the probability of multiple avoidance is suppressed as a power of 1  Q.
Figure 3D explores the effect of deletion when the overall gene growth rate is kept constant: 1 + R — Q = 1.6. In this way, we can contrast the effects of deletion and fold acquisition in a controlled manner. Note that a commensurate increase in R does not overcome an increase in Q, as large folds are suppressed more than small folds. This means that the exponent that best describes the powerlaw decay is not merely a function of R — Q.
On the other hand, the effect of gene deletion is not dramatic; not only is similarity to a powerlaw retained the actual change in exponent is not large. Even for fold of large size, there isn't much difference between the curves even for a fairly large gene deletion rate. In practice, this makes it difficult to estimate Q statistically from the shape of fold histograms derived empirically from genomic data. While the effective gene introduction rate: 1 + R  Q, should be easy to deduce from the data, an identification of Q itself from the rate of decay would require reliable occurrence data for very large folds.
When there is no gene deletion, the expected number of folds increases linearly with time at rate R. Equation (11) suggests gene deletion will lead to a less simple time dependence for F(t). Perhaps surprisingly, F(t) remains, to a good approximation, linear in time, with a slope that is no longer R, as can be seen in the numerical results of Figure 3C. Here F{t) is plotted for fixed R = 1.0 and different values of the gene deletion rate: Q = 0.0, 0.2, 0.4, 0.6. In fact, the slope in each of these cases is less than R and decreases with Q, which is consistent with the analytic solution for F(t) when R = 0, derived in Appendix D (see equation (42)).
If again we choose parameters that fix the growth rate for the expected number of genes (1 + R — Q), a commensurate increase in both R and Q leads to a greater increase in the expected number of folds, as can be seen in Figure 3E. This is entirely reasonable: in our model, the new folds that are continually acquired at rate R are all distinct, so a genome with large R and Q will end up with many small folds, each coded by only a few genes. In contrast, a genome with small R and Q will lead to fewer but larger folds.
Analytic Approximation Based on Perturbation Theory
The numerical results show that gene deletion, even for fairly large values of Q does not dramatically change the growth pattern of the genome, certainly qualitatively and to some extent, even quantitatively. Moreover, the analytic results when R = 0 showed that gene deletion is remarkably benign: in the absence of new gene acquisition, but with gene duplication operating, gene deletion does not change the probability distribution of fold occurrences, but does change expected total number of folds in the genome.
Here we consider an analytic approximation that attempts to capture the effects of gene deletion perturbatively by constructing an approximation around the Q = 0, R > 0 solution as an expansion in powers of Q. The perturbation expansion has to be handled carefully since a naive expansion, one that considers contributions only up to some finite power of Q, will not converge for all fold sizes m. The failure of conventional perturbation theory is explored in Appendix F.
To go beyond naive perturbation theory, we have adopted the following approach: (1) the dominant contribution at every order (or power) of Q is identified, (2) the dominant contribution is approximated, and (3) the resulting new infinite series in Q is summed exacdy to arrive at an approximate solution that remains finite for all Q and m. The details are presented in Appendix F. Although not rigorous, this type of rescue or augmentation of perturbation theory is practiced routinely and often quite successfully on a variety physical models, such as models of phase transitions from statistical physics.24
This approach leads to the following approximation for the limiting fold distribution:
Note that the approximation includes as a special case the exact distribution derived previously for Q = 0 (8). In fact, the approximate distribution for Q nonzero is obtained from the exact solution for Q = 0 by the substituting R* R + QR. This correspondence also makes it clear that for large m, the Q # 0 probability distribution will resemble a power law with exponent R + QR +2, just as Q = 0 distribution approached a powerlaw with exponent R + 2.
The true test of the effectiveness of the approximation rests with a comparison to the numerical results, which is done in Figure 3F. There seems to be good qualitative agreement, and fairly good quantitative agreement as well, even for Q = 0.4. As expected from the nature of the approximation, there is better agreement for large m in all cases. An approximation for the expected number of folds F(t) within the same framework is given in Appendix F.
The Effects of Selection Pressure
Selective pressure plays an important role in evolution. It is well known that different genes have different duplication rates due to the selective pressure.25 So far we have assumed that when genes are duplicated, or deleted, the target gene is chosen with equal probability from all the genes in the genome. A more realistic model would of course allow for favoritism in the selection process: presumably, genes that are useful or necessary are less likely to be deleted and perhaps more likely to be duplicated than genes that are less important. Note, however, that our model is not a differential survival model.
We explore the effects of selection pressure by extending the minimal model to allow for different duplication rates among genes. Suppose now that genes are not only identified with particular folds but also by their duplication types. For simplicity, assume that there are only two types: type "A" and type "B", and that "B" genes are y times more likely to be chosen for duplication than "A" genes. There will still be one duplication event, on average, per unit time, so the total expected number of genes will remain the same, but the allocation of the total between types "B" and "A" will depend on y. We will assume that y > 1, so it is the "B" types that are more likely to be duplicated.
To keep track of the fold population we now need two histograms: Fy^m, t) and F^m, t) to distinguish between the duplication types. The full fold histogram is the sum of both subhistograms: F(m, t) = FA(m, t) + Frfjn, t). Similarly, let CA{t) and G^t) represent the total number of genes for each type and define a new variable Gy(t):
The evolution equations that extend (2) are:
dFA(m,t) _ (m  l)FA{m  1, i) _ mFA{m, t) dt ~ G7(t) G7(t)
Note that we allow new folds to be acquired at different rates for each type: Ra can be different from Rb although we will restrict our numerical examples to the when they are equal.
The equations for the total number of genes of both types follow from the full dynamics (15) and are given in Appendix G. These confirm that the overall duplication rate is still one gene per unit time.
Once again, analytical solutions are possible for the two special parameter values addressed previously: (1) when there is no introduction of new folds, so Ra = Rb = 0; and (2) the limiting distribution when t—> When there is no introduction of new folds, a simple extension of the methodology employed in Appendix A establishes that the each of the subhistograms FA(m, t) and Fgim, t) follows an exponential distribution for all times. The full histogram is consequendy a sum of exponential distributions:
The number of distinct folds of each type, present at t = 0 is given by No and No . The variable u(t) is a rescaled time variable related to Gy(t); the exact form of the dependence appears in Appendix G but is unimportant for the present discussion.
Of greater interest is the other special case: the ultimate evolutionary fate of the genome. The analytic behavior for large times is much easier to derive than an exact solution itself. For large t, GJf) will grow linearly with time: Gy t, according to a constant C^that depends on the rate of fold acquisition and the differential rate of duplication (see Appendix G for details).
In a similar fashion, we define coefficients and Cm, akin to the coefficients Am of the solution to the minimal model (7), that describe the ultimate linear growth of the histogram bins: FA(m, t)  Cmt, and similarly for Fg(m, t). The normalized probability distribution corresponding to this limit is given by:
C7 + 1 Ra + Rb 11 C7 + i + 1 C7 + 7 Ra + Rb ^ C7 + + 1)
The important conclusion to be drawn from (17) is that powerlawlike distributions describe the ultimate fate of the genome even when there are different rates of gene duplication. The probability distribution is the sum of two powerlawlike distributions, each similar to the powerlawlike distributions of the minimal model, but characterized by its own effective exponent. Figure 4 shows a comparison of the predicted distribution and numerical results when Ra = Rb = 0.5 for y = 1, which is corresponds to the minimal model, and y = 10, so type "B" genes
lag m
Figure 4. The effect of selective pressure on the model. Larger values of y indicate larger differences in duplication rates between favored and unfavored protein folds. Large time limit for the fold probability distribution for y = 1 and y = 10. Numerical results are plotted as symbols; analytic results from Equation (16) as lines.
lag m
Figure 4. The effect of selective pressure on the model. Larger values of y indicate larger differences in duplication rates between favored and unfavored protein folds. Large time limit for the fold probability distribution for y = 1 and y = 10. Numerical results are plotted as symbols; analytic results from Equation (16) as lines.
are ten times more likely to be selected for duplication. The two distributions are remarkably close to each other, even when there is an order of magnitude difference between the relative duplication rates of type "B" genes. We have found that the parameter has much less of an effect than differences between the gene introduction rates Ra and Rg
We have also briefly considered the case of more than two duplication types. When there is no introduction of new folds into the genome equation (6) generalizes: the subhistogram for each duplication type is exponential. Furthermore, we have confirmed numerically that the terminal distribution is not dramatically affected by selection pressure, even when there are several families with significandy different rates of duplication. One particular example, involving four duplication types appears in Appendix G.
Fitting the Models to Genomic Data
Clearly, of greatest interest is to observe how our model compares with the genomic distributions. We start with the minimal model, for which we require estimates for the parameters t, No and R for each organism.
Fitting the minimal model requires estimating three parameters: t, No and R. We have determined these parameters separately for each organism by insisting that the minimal model match the number of folds: F, the number of genes: G, and the second moment of the actual fold histogram: Hj, to those predicted by the minimal model. The fitting procedure is gready simplified by the linear relation that exists between the variables (No, t) and (F, G):
Table 1. Fit of the minimal model using genomic data from 20 organisms
Mismatch of of Third
Genome Genes Folds G/F t N0 R Moment (%)
M. genitalium 

Post a comment