Properties of the Protein Domain Universe Graphs

We computed the size of the largest cluster in PDUG and random control graph as a function of Zm¡n.8 We found a pronounced transition of the size of the largest cluster in PDUG at Zm\a = Zc~ 9. The random graphs feature a similar transition, but at a higher value of Zmm =ZC~ 11. The distribution of cluster sizes depends significantly on whether Zmm > Zc or Zmm < Zc for both the PDUG and random graphs. We also found that the probability density P(M) of cluster sizes M for both the PDUG and random graphs follows a power-law at their respective Zc: P{M) °c Af2'5. The observed power-law behavior of P(M) is simply a consequence of criticality at Zcas it is featured prominently both for the PDUG and random graphs. The power-law probability density of cluster sizes is a generic percolation phenomenon that has been observed and explained in both percolation117,118 and random graph theories.119 Gerstein and coworkers also reported a power-law distribution for fold family sizes derived from the SCOP database7 and attributed the observed power-law distribution to a certain evolutionary mechanism. However, we showed in reference 8 that random graphs featured the same power-law distribution for fold family sizes and were simply explained by percolation theory.117"119

In order to characterize the structural properties of the PDUG we computed the probability p(k) of the number of edges per node k taken at Zmm = Zc for individual clusters. It is known that p (k) distinguishes random graphs from various graphs observed in science and technology.120 In drastic contrast with the equivalent random graph, the PDUG is scale-free with p(k) with a high degree of statistical significance (/»-value less than 10"8). The power law fit of (p (k) is most accurate at Zm¡n = Zn and noticeably deteriorates above and below Zc. The fit at Zm\n > Zc quickly becomes meaningless as the range of values of connectivity k rapidly diminishes as greater Zmin lead to mostly disconnected domains. At Zmm < Zc the power law fit also becomes problematic in the whole range of k because at large values of k (50-100) $){k) shows some nonmonotonic behavior which can be interpreted as a maximum at large k (the data are insufficient to conclude that with certainty). However the remarkable property of a maximum (p(k) at k = 0 i.e., dominance of orphans remains manifest at all Zmm values. This is in striking contrast with random graph which is not scale-free at any value of Zm¡n and where (p (k) allows almost perfect Gaussian fit with a maximum at higher values of k.

The discovery of the scale-free character of the protein domain universe is striking. It has immediate evolutionary implication by pointing out the possible origin of all proteins from a single or a few precursor folds—a scenario parallel to the origin of the Universe from Big Bang. An alternative scenario, whereby protein folds evolved de novo and independently, would have resulted in random PDUG rather than the observed scale-free one.

The rigorous method of clustering protein structures8 provides a number of insights. First of all, using graph theory for protein structure classification removes the ambiguities that are inherent in the highly useful, albeit manual, approaches to structural classification of proteins.4'43 Perhaps not surprisingly we observed that the structure of the graph representing the protein domain universe depends on the Zmm threshold value of Z-score above which protein domains are considered structurally similar and are connected by an edge of the graph. However, at a certain critical value Zmm = Z„ the structure of the PDUG becomes remarkably universal, simple and amenable to theoretical understanding from an evolutionary standpoint.

An important component of the analysis presented in reference 29 is random control where PDUG was compared with random graph. Our results showed that random weighted graph having the same weight (Z-score) distribution as PDUG featured same cluster size distribution. Since clusters in PDUG can be associated with fold level classification of protein structure, this observation suggested that nonuniform distribution of nonhomologous proteins over folds may not be due to special features of "most popular" protein folds as suggested previously by some researchers.15'85 However that does not necessarily imply that observed protein folds are not selected based on their physical properties.121 It is possible that the divergent evolution scenario described here occurs only on these selected folds while unfeasible ones are not observed in nature. However the analysis presented in reference 29 points out that explanation of the nonuniform distribution of nonhomologous proteins over observed folds does not require invoking the "designability principle"15 or related conjectures about the nonuniform density of sequences in space of protein folds.85

We discovered that the structure of the PDUG is by far nonrandom, but rather represents a scale-free network featuring power-law distribution of the number of edges per node. The most striking qualitative aspect of the observed distribution is the much greater number of "orphans" (i.e., domains that are not structurally similar to any other domains) compared with random graph control. Importandy this qualitative feature remains prominent at any value of threshold Zmm despite the fact that power-law fits of p {k) gets worse when Zmin deviates from Zc. A natural explanation of this finding is from a divergent evolution perspective. The model of divergent evolution presented in reference 8 is in qualitative agreement with PDUG as it produces a large (compared with random graph) number of orphans (Fig. 4).

Besides reproducing the scale-free behavior of the PDUG, the divergent evolution model also quantitatively captures more specific graph properties of PDUG. In particular it was shown that the distribution of clustering coefficients122 of nodes of PDUG is almost exacdy matched by the divergent evolution model. This is in contrast to the random control where the scale-free PDUG has been randomly rewired while connectivity of each node is kept intact (Deeds and Shakhnovich, unpublished results).

Orphans are created in the model mosdy via gene duplication and their subsequent divergence from a precursor. This may be meaningful biologically because duplicated genes may be under less pressure and hence prone to structural and functional divergence. The divergent

Divergent Evolution Cartoon
j New fold family is born

Figure 4. A cartoon representation of the divergent evolution model presented in reference 8. In this model, proteins diverge through a number of gene duplication events and point mutations (I—>11—>111—> IV). While a single amino acid substitution may not significantly alter the protein structure, a number of them may result in drastic changes in protein structure. If these changes result in a functional and, most importantly, stable protein, a new fold family is born. In the model, each protein is represented by a node. Nodes representing proteins with significant structural similarity are connected by edges with a weight. If in the course of evolution an edge's weight lowers below the threshold value, the nodes become disconnected. At each evolutionary step, a randomly chosen node is duplicated and an edge with a weight (chosen from uniform distribution) is created that connects progeny to its parent. If the weight is below a threshold value, the nodes become disconnected and a new protein family is born. In addition, at each evolutionary step, after gene duplication a newly created node may become connected to its pra-parent.

evolution model presented in reference 8 is a schematic one as it does not consider many structural and functional details and its assumptions about the "geometry" of the protein domain space in which structural diffusion of proteins occurs may be simplistic. However, its success in explaining the qualitative and quantitative features of PDUG supports the view that all proteins might have evolved from a few precursors.

An important aspect of the model proposed in the reference 8 is that it provides only a conceptual framework for reconstructing protein structural space. The fine details of evolution contain crucial ingredients that underlie selective pressure in the model proposed in reference 8. Recendy Deeds et al123 uncovered how the features of an underlying protein structural space might impact protein structural evolution using lattice polymers as a completely characterized model of this space. In reference 123 we developed a measure of the structural comparison of lattice structures in analogy to the one used to understand structural similarities between real proteins. We used this measure of structural relatedness to create a graph of lattice structures and compared this graph (in which nodes were lattice structures and edges were defined using structural similarity) to the graph obtained for real protein structures. In reference 123 we found that the graph obtained from all compact lattice structures exhibited a distribution of structural neighbors per node consistent with a random graph. We also found that subgraphs of 3500 nodes chosen either at random or according to physical constraints, such as selective protein designability,11 also represented random graphs. We developed a divergent evolution model based on the lattice space which produces graphs that were capable of recapitulating the scale-free behavior observed in similar graphs of real protein structures. Indeed, in contrast to this universal behavior, we observed subgraphs with power-law degree distributions only as the result of a very specific evolutionary sampling procedure. This not only demonstrated that scale-free graphs may be derived from such spaces but also that the rules underlying divergent graph evolution models are sufficient to produce this behavior.

Was this article helpful?

0 0

Post a comment