## Info


Figure 9. Analytic approximation for the total number of folds compared to numerical results of Figure 3.

In the presence of gene deletion, the approximation for $F(t)$ shows linear growth with time at a rate less than $R$. As expected, a greater rate of gene deletion reduces the growth of $F(t)$. However, the approximation predicts that the number of folds will always increase with time, which can be verified by taking the uppermost limit, $Q = 1$. For small $Q$, the constant $C$ itself can be approximated more simply: $C \approx 1 + 2/[(R + 2)(R + 3)]$.

Figure 9 confirms these observations. The approximation for the expected number of folds seems to work quite well and could be useful in trying to infer both $R$ and $Q$ from genomic data. Certainly the impact of gene deletion is easier to identify through $F(t)$ and $G(t)$ than through the shape of the histogram $F(m, t)$.

Appendix G: The Effects of Selection Pressure

Recall that we have assumed that there are only two duplication types, type "A" and type "B", and that "B" genes are $\gamma$ times more likely to be chosen for duplication than "A" genes. There will still be one duplication event, on average, per unit time, so the total expected number of genes will remain the same, but the allocation of the total between types "B" and "A" will depend on $\gamma$. We will assume that $\gamma > 1$, so it is the "B" types that are more likely to be duplicated.
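The weighted choice between the two duplication types can be illustrated with a small stochastic sketch. It is not the procedure used for the numerical results in the text; it assumes that each "B" gene carries weight `gamma` and each "A" gene weight 1, ignores the acquisition of new folds, and tracks only the gene totals per type. The function name, arguments, and seed are hypothetical.

```python
import random


def simulate_duplications(genes_a, genes_b, gamma, n_events, seed=0):
    """Apply n_events single-gene duplications to the two gene totals.

    Assumption: a "B" gene is gamma times more likely to be copied than an
    "A" gene, so the chance that the next copy is of type "B" is
    gamma * genes_b / (genes_a + gamma * genes_b).
    At least one of genes_a, genes_b must be positive.
    """
    rng = random.Random(seed)
    for _ in range(n_events):
        weight_b = gamma * genes_b
        p_b = weight_b / (genes_a + weight_b)
        if rng.random() < p_b:
            genes_b += 1
        else:
            genes_a += 1
    return genes_a, genes_b
```

Each event adds exactly one gene regardless of `gamma`, so the total grows by one per event; only the split between the two types changes.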

To keep track of the fold population we now need two histograms, $F_A(m, t)$ and $F_B(m, t)$, to distinguish between the duplication types. The full fold histogram is the sum of both sub-histograms: $F(m, t) = F_A(m, t) + F_B(m, t)$. Similarly, let $G_A(t)$ and $G_B(t)$ represent the total number of genes for each type and define a new variable $G_\gamma(t)$:

$$G_\gamma(t) = G_A(t) + \gamma\, G_B(t)$$

The evolution equations that extend (2) are:

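$$\frac{dF_A(m,t)}{dt} = \frac{(m-1)\,F_A(m-1,t)}{G_\gamma(t)} - \frac{m\,F_A(m,t)}{G_\gamma(t)} + R_A\,\delta_{m,1}$$

$$\frac{dF_B(m,t)}{dt} = \frac{\gamma\,(m-1)\,F_B(m-1,t)}{G_\gamma(t)} - \frac{\gamma\, m\,F_B(m,t)}{G_\gamma(t)} + R_B\,\delta_{m,1} \qquad (68)$$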

Note that we allow new folds to be acquired at different rates for each type: $R_A$ can be different from $R_B$, although we will restrict our numerical examples to the case when they are equal.

As before, we derive equations for the total number of genes from the full dynamics (68):

$$\frac{dG(t)}{dt} = \frac{dG_A(t)}{dt} + \frac{dG_B(t)}{dt} = 1 + R_A + R_B$$

This confirms that the overall duplication rate is still one gene per unit time. The evolution of $G_\gamma(t)$ is more complicated:

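$$\frac{dG_\gamma(t)}{dt} = \frac{dG_A(t)}{dt} + \gamma\,\frac{dG_B(t)}{dt} = \frac{G_A(t) + \gamma^2 G_B(t)}{G_\gamma(t)} + R_A + \gamma R_B$$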
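The gene totals can also be followed numerically. The sketch below is illustrative only: it assumes $G_\gamma = G_A + \gamma G_B$ and the per-type rates $dG_A/dt = G_A/G_\gamma + R_A$ and $dG_B/dt = \gamma G_B/G_\gamma + R_B$, and it uses a plain forward Euler step. The function name, step size, and parameter values are hypothetical and are not taken from the text.

```python
def integrate_gene_counts(g_a, g_b, gamma, r_a, r_b, t_end, dt=1e-3):
    """Forward-Euler integration of the gene totals for the two types.

    Assumed rates (see lead-in):
      dG_A/dt = G_A / G_gamma + R_A
      dG_B/dt = gamma * G_B / G_gamma + R_B
    with G_gamma = G_A + gamma * G_B.
    """
    steps = int(round(t_end / dt))
    for _ in range(steps):
        g_gamma = g_a + gamma * g_b
        dg_a = g_a / g_gamma + r_a
        dg_b = gamma * g_b / g_gamma + r_b
        g_a += dt * dg_a
        g_b += dt * dg_b
    return g_a, g_b


# Example: equal fold-acquisition rates, "B" genes favoured four to one.
g_a, g_b = integrate_gene_counts(10.0, 10.0, 4.0, 0.2, 0.2, 50.0)
total = g_a + g_b  # grows at rate 1 + R_A + R_B whatever the value of gamma
```

Under these assumptions the two duplication terms always sum to one, so the total grows linearly at rate $1 + R_A + R_B$ while the share held by type "B" increases.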

It is possible to establish the distributional properties of the genome without having to solve (68) explicitly for the special parameter values encountered previously: (1) the case when there is no introduction of new folds, so $R_A = R_B = 0$; and (2) the limiting distribution when $t \to \infty$.

When there is no introduction of new folds, a simple extension of the repeated integration employed in Appendix A establishes that each of the sub-histograms $F_A(m, t)$ and $F_B(m, t)$ follows an exponential distribution for all times:

$$F_A(m,t) = N_0^A \exp(-u(t))\,\left[1 - \exp(-u(t))\right]^{m-1}$$

The number of distinct folds of each type present at $t = 0$ is given by $N_0^A$ and $N_0^B$. The variable $u(t)$ is determined by $G_\gamma(t)$:

$$u(t) = \int_0^t \frac{ds}{G_\gamma(s)}$$

The full histogram is consequently a sum of exponential distributions.
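The form of the sub-histograms can be checked numerically. The sketch below assumes that, with no new folds, each sub-histogram obeys $dF/du = \kappa\,[(m-1)F(m-1) - mF(m)]$ in the rescaled time $u$, with $\kappa = 1$ for type "A" and $\kappa = \gamma$ for type "B". The type "B" form $N_0^B e^{-\gamma u}(1 - e^{-\gamma u})^{m-1}$ is an assumption made here by analogy with the type "A" result and does not appear in the text. Function names and parameter values are hypothetical.

```python
import math


def geometric_histogram(n0, rate, u, m_max):
    """Return n0 * p * (1 - p)**(m - 1) for m = 1..m_max, with p = exp(-rate * u)."""
    p = math.exp(-rate * u)
    return [n0 * p * (1.0 - p) ** (m - 1) for m in range(1, m_max + 1)]


def max_residual(n0, rate, u, m_max, du=1e-6):
    """Largest |dF/du - rate * ((m-1) F(m-1) - m F(m))| over m = 1..m_max.

    dF/du is estimated with a central finite difference of width 2 * du.
    """
    f = geometric_histogram(n0, rate, u, m_max)
    f_plus = geometric_histogram(n0, rate, u + du, m_max)
    f_minus = geometric_histogram(n0, rate, u - du, m_max)
    worst = 0.0
    for m in range(1, m_max + 1):
        lhs = (f_plus[m - 1] - f_minus[m - 1]) / (2.0 * du)
        previous = f[m - 2] if m > 1 else 0.0  # no folds with zero genes
        rhs = rate * ((m - 1) * previous - m * f[m - 1])
        worst = max(worst, abs(lhs - rhs))
    return worst


# Type "A" (rate 1) and type "B" (rate gamma = 4) at the same rescaled time u.
residual_a = max_residual(n0=5.0, rate=1.0, u=0.7, m_max=20)
residual_b = max_residual(n0=3.0, rate=4.0, u=0.7, m_max=20)
```

Both residuals should be at the level of finite-difference error. Summing either histogram over all $m$ recovers the initial number of folds, as expected when no new folds are introduced.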

The large time behavior of the solution is much easier to derive than an exact solution. For large $t$, $G_\gamma(t)$ will grow linearly with time, $G_\gamma \sim C_\gamma t$, according to a constant $C_\gamma$ that depends on the rate of fold acquisition and the differential rate of duplication.

Figure 10. Large time limit for the fold probability distribution for the minimal model (one duplication type) and four duplication types: B = 4, C = 8, D = 16. The total rate of new fold acquisition is the same for both genomes.