The Drosophila Protein Interaction Network May Be neither Power Law nor Scale Free

J.S. Bader*

Abstract

Scale-free networks have become a topic of intense interest because of the potential to develop theories universally applicable to networks representing social interactions, internet connectivity, and biological processes. Scale-free topology is associated with power-law distributions of connectivity, in which most network components have only few connections while a very few components are extremely highly-connected. Here we investigate the power-law and scale-free properties of the network corresponding to protein-protein interactions in Drosophila melanogaster. We examine power-law behavior with a standard statistical technique designed to distinguish whether a power-law fit is adequate to describe the vertex degree distribution. We find that the degree distribution for the entire network, consisting of baits and preys, decays faster than power law. This fit may be confounded by artifacts of the screening procedure. The prey-only degree distribution is less likely to be confounded by the screening procedure, and is fit adequately by a power-law. When only the biologically relevant interactions are considered, however, the degree distribution again decays faster than power-law. Thus, power-law behavior may reflect interactions that are observed in vitro but not in vivo. We next describe an algorithm that may be able to extract the true distribution from the incomplete data. Finally, we investigate scale-free properties by characterizing organizational patterns over increasing spatial scales. We provide evidence for the existence of a length-scale that characterizes organization in the network. The existence of such a correlation length stands in contrast to scale-free networks, in which no length scale is special. These results suggest that the Drosophila protein interaction network may not be power-law and is not scale-free.

Introduction

Technological advances now permit the elucidation of biological networks on a genome scale. A recent report described using the two-hybrid method to identify the protein-protein interactions that underlie the protein complexes and multi-complex pathways in Drosophila melanogaster.1 This was the first large-scale protein-protein interaction network determined for a metazoan and builds on earlier screens conducted for Saccharomyces cerevisiae.2,3 Protein interaction networks have also been probed using mass spectrometry of protein complexes.4,5 Chromatin immunoprecipitation experiments provide analogous data to support the identification of protein-DNA interactions and transcriptional regulatory networks.6

*J.S. Bader—Department of Biomedical Engineering, Johns Hopkins University, 201C Clark Hall, 3400 N. Charles St., Baltimore, Maryland 21218, U.S.A. Email: [email protected]

Power Laws, Scale-Free Networks and Genome Biology, edited by Eugene V. Koonin, Yuri I. Wolf and Georgy P. Karev. ©2006 Eurekah.com and Springer Science+Business Media.

The advent of these large-scale data sets has stimulated interest in developing theories that explain how biological networks are organized, and how they continue to be shaped by evolution. Biological networks are examples of small-world networks, which occupy the middle ground between completely regular networks and random networks.7 Like regular networks, small-world networks have clusters of interconnected vertices; like random networks, most pairs of vertices are connected by a short path of links.

A notable feature of biological networks, also represented in social networks and air travel networks, is the existence of hubs, a colloquial term for vertices with a high number of connections compared to typical vertices. In many examples of networks, the vertex degree distribution shows a decay much slower than the Gaussian-like distribution for random networks; the tail end of this distribution corresponds to hubs. A preferential or rich-get-richer model, in which new connections are biased towards vertices that are already highly connected, leads to a power-law vertex degree distribution and also leads to scale-free self-organization.8
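As an illustration only, a rich-get-richer growth process of this kind might be sketched in Python as follows. The function names, the rule of one new link per added vertex, and the random seed are assumptions made for the example and are not taken from reference 8.

```python
import random
from collections import Counter

def preferential_attachment(n_vertices, seed=0):
    """Grow a network one vertex at a time. Each new vertex makes one link
    to an existing vertex chosen with probability proportional to degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]       # start from a single edge
    endpoints = [0, 1]     # every vertex appears once per incident edge
    for new in range(2, n_vertices):
        # Sampling uniformly from the endpoint list is a degree-weighted choice.
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

def degree_distribution(edges):
    """Return a Counter mapping degree k to the number of vertices N(k)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

edges = preferential_attachment(2000)
n_of_k = degree_distribution(edges)
```

In this toy model most vertices end with a single link while a few early vertices accumulate many, which is the hub-dominated tail described above.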

Much is known about the statistical physics of self-organized networks and self-organized criticality. If biological networks are a realization of self-organized criticality, then universal results and scaling laws from physics should apply. Alternately, if biological networks have properties that differ from scale-free networks, then new theoretical developments may be required to describe their properties and behavior. Thus, there is great interest in determining whether evolution has shaped biological networks to resemble self-organized, scale-free networks.

Two testable hypotheses of a scale-free model are a power-law distribution of connections per vertex and the lack of a characteristic length scale for network organization. Here we examine whether the topology of the Drosophila protein interaction network follows these hypotheses.

One possible test for power-law behavior is to calculate the empirical vertex degree distribution, then check whether a power-law functional form provides a better fit than an alternate functional form, typically exponential or normal. This type of test does not confirm that the distribution follows a power law; instead, it indicates that a power law fits less poorly than other functional forms. We describe how a vertex degree distribution may be fit by a family of functions with terms corresponding to power-law decay, exponential decay, and even faster decay. We use standard statistical procedures to decide whether the optimal fit is a power law, or whether it is faster than power law.

These statistical tests are not straightforward due to experimental limitations in sampling the Drosophila network. First, at most 96 colonies were sequenced for each bait, which artificially limits the vertex degree observed for a bait protein. Next, some prey libraries were obtained from mRNA libraries with power-law distributions of transcript abundances, which influences the bait-prey combinations that are sampled. Finally, analysis of the entire network is questionable as only 25% of the network was judged to be high-confidence for biological relevance. We address each of these factors in turn in an analysis of the Drosophila vertex degree distribution. We then describe an approach to predict the true vertex degree distribution from the incomplete distribution derived from partial sampling of the network.

We investigate structure in the network by characterizing motifs that represent order. A simple motif is the existence of a triangle, three vertices connected one to the next. The ratio of the number of observed to expected triangles is synonymous with the standard definition of the clustering coefficient for a small world network. This statistic is sensitive to organization over short length scales. To investigate organization over longer length scales, we investigate the distribution of longer cycles. This distribution may be measured for an empirical network. We introduce a simple mathematical model for a network organized to have one level of clustering and show that this model is sufficient to explain the observed cycle distribution. Thus, there is no need to invoke a continuous distribution of length scales. Moreover, the one-level model immediately yields a characteristic, testable scaling length for the network, which again stands in contrast to scale-free behavior.
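To make the triangle motif concrete, the following minimal Python sketch counts triangles and computes one common form of the clustering coefficient (transitivity: three times the number of triangles divided by the number of connected triples). The normalization used in this chapter is not spelled out at this point in the text, so the choice of transitivity, the function names, and the toy edge list are assumptions.

```python
def build_adjacency(edges):
    """Undirected adjacency sets, ignoring self-interactions."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def count_triangles(adj):
    """Each triangle is found once per edge, so divide the total by 3."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += len(adj[u] & adj[v])  # common neighbors close a triangle
    return total // 3

def transitivity(adj):
    """3 * triangles / connected triples (paths of length two)."""
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return 3 * count_triangles(adj) / triples if triples else 0.0

toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_adj = build_adjacency(toy_edges)
n_triangles = count_triangles(toy_adj)
clustering = transitivity(toy_adj)
```

The toy graph contains one triangle (a, b, c) among five connected triples.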

We conclude with a discussion of models that may provide an improved theoretical framework for understanding the properties of biological networks and the evolutionary forces that have shaped them.

Observed Vertex Degree Distribution

The empirical network that serves as the basis of this study was obtained by a two-hybrid screen for protein-protein interactions in Drosophila melanogaster.1 This work describes 20,405 pair-wise interactions involving 7046 proteins. One of the difficulties in analyzing topological properties of this network is that some properties are biased by the screening procedure. For example, the number of preys for each bait is limited by the number of colonies sequenced for each mating, typically 96 (see ref. 1 and references therein for background on the two-hybrid system). Furthermore, screens conducted with a prey library obtained directly from mRNA may be biased for highly expressed proteins. Finally, interactions identified by the two-hybrid method often have questionable biological relevance.9-11

To address each of these points, we constructed a series of three vertex degree distributions. The first degree distribution considered all of the interactions observed for Drosophila, excluding a small number of self-interactions. This network included 20,278 interactions between 7000 proteins.

Next, we addressed the limited number of colonies sequenced for each bait by considering only the degree distribution for prey proteins. While each bait participates in at most 96 interactions due to the limited sampling, there should be no such limitation for the number of times a prey is observed as an interaction partner.

One possible limitation on observing a prey, however, is that it is not represented in the prey library. Or, if it is present, its abundance may be low. These preys may be systematically under-represented in two-hybrid screens that use prey libraries obtained directly from mRNA isolated from cells. Indeed, mRNA abundances may themselves follow a power-law distribution, with a few highly-represented species being responsible for the majority of the mRNA mass.

The screens in reference 1 attempted to avoid this limitation by conducting two-hybrid screens with two independent prey libraries. One library was obtained by isolating mRNA from Drosophila embryonic and adult developmental stages, then using these transcripts to generate a prey library. The second library was obtained by individually amplifying every predicted Drosophila gene from a cDNA library, with a 75% success rate in generating a prey with verified insert sequence and size. The resulting 10,787 preys were then pooled to yield a nearly perfectly normalized library. This pool was mated to each of 10,623 baits, yielding 31,270 bait-prey pairs whose sequences could be mapped to release 3.1 of the gene annotations from the Berkeley Drosophila Genome Project. After removing a small number of self-interactions, these 31,270 pairs corresponded to 10,161 unique prey-bait pairs between 3001 preys and 2657 baits.

One of the challenges in interpreting two-hybrid data is that many of the interactions observed are spurious, with questionable biological relevance. The biological relevance of the interactions reported in reference 1 was modeled statistically. Each interaction was assigned a confidence score in the range from 0 to 1, with 0.5 as the approximate dividing point between low-confidence (< 0.5) and high-confidence (> 0.5) of biological relevance. Starting with the preys from the normalized screen described above, we obtained a third degree distribution by considering only the high-confidence interactions. This network corresponded to 3574 unique prey-bait pairs between 2093 preys and 2130 baits.
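The construction of the three degree distributions can be sketched in Python as follows. This is a minimal illustration, not the pipeline used for the published analysis: the (bait, prey, confidence) tuple layout, the function names, and the toy records are assumptions, and the input is assumed to be already restricted to the screen of interest.

```python
from collections import Counter, defaultdict

def full_degree_distribution(interactions):
    """N(k) over all proteins, baits and preys, excluding self-interactions."""
    neighbors = defaultdict(set)
    for bait, prey, _conf in interactions:
        if bait == prey:
            continue
        neighbors[bait].add(prey)
        neighbors[prey].add(bait)
    return Counter(len(n) for n in neighbors.values())

def prey_degree_distribution(interactions, min_confidence=None):
    """N(k) over preys only, where k is the number of unique baits that
    found the prey. If min_confidence is given, keep only interactions
    with confidence strictly above it (0.5 for the high-confidence set)."""
    baits_per_prey = defaultdict(set)
    for bait, prey, conf in interactions:
        if bait == prey:
            continue
        if min_confidence is not None and conf <= min_confidence:
            continue
        baits_per_prey[prey].add(bait)
    return Counter(len(b) for b in baits_per_prey.values())

# Hypothetical records: (bait, prey, confidence score in [0, 1])
toy = [
    ("B1", "P1", 0.9),
    ("B2", "P1", 0.3),
    ("B1", "P2", 0.7),
    ("B3", "B3", 0.8),   # self-interaction, dropped
    ("B1", "P1", 0.9),   # repeated observation, counted once
]
all_dist = full_degree_distribution(toy)
prey_dist = prey_degree_distribution(toy)
high_conf_dist = prey_degree_distribution(toy, min_confidence=0.5)
```

Using sets of partners means that repeated observations of the same bait-prey pair count as a single unique interaction.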

Vertex Degree Distributions and Power-Law Fits

We write N(k) as the number of proteins in a network with exactly k neighbors. A typical procedure used to assess power-law behavior is to fit N(k) by a power-law, exponential, or Gaussian decay, each corresponding to a different random model,

$$
N(k) =
\begin{cases}
\exp(A + a_0 \log k) & \text{power law} \\
\exp(A + a_1 k) & \text{exponential} \\
\exp(A + a_2 k^2) & \text{Gaussian}
\end{cases}
$$

where A in each case is an appropriate normalization constant. Typically, the fit is performed on a log scale to minimize the quantity $\chi^2$,

$$
\chi^2 = \sum_k \left[ \log N(k) - \log \hat{N}(k) \right]^2,
$$

where $\hat{N}(k)$ denotes the fitted value, and the functional form giving the smallest $\chi^2$ is accepted as describing the decay. In certain cases, it may be preferable to normalize each term by its anticipated variance $1/N(k)$ or to construct bins on a logarithmic scale to avoid small counts.
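The three-way comparison described above can be sketched in Python. Each functional form is linear in a transformed variable on the log scale (log k, k, or k squared), so each fit reduces to a simple unweighted linear regression. The function names and the synthetic counts are assumptions for illustration.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = A + a * x. Returns (A, a, chi2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    chi2 = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, chi2

# Transformed variable for each candidate form of log N(k)
TRANSFORMS = {
    "power-law": math.log,                  # log N = A + a * log k
    "exponential": lambda k: float(k),      # log N = A + a * k
    "gaussian": lambda k: float(k * k),     # log N = A + a * k^2
}

def compare_fits(n_of_k):
    """Fit each form to the non-empty bins; returns name -> (A, a, chi2)."""
    ks = sorted(k for k, n in n_of_k.items() if n > 0)
    ys = [math.log(n_of_k[k]) for k in ks]
    return {name: linear_fit([f(k) for k in ks], ys)
            for name, f in TRANSFORMS.items()}

# Synthetic counts that follow N(k) = 1000 * k^-2 exactly
toy = {k: 1000.0 * k ** -2 for k in range(1, 21)}
fits = compare_fits(toy)
best = min(fits, key=lambda name: fits[name][2])
```

As noted in the Introduction, selecting the smallest chi-squared in this way shows only which form fits least poorly; it does not establish that the selected form is adequate.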

Rather than comparing three separate fits, a standard statistical procedure is to assess the significance of a series of models of increasing complexity. The power-law, exponential, and Gaussian functional forms for N(k) can be considered as the first three terms in a Taylor series expansion,

$$
\log N(k) = A + a_0 \log k + a_1 k + a_2 k^2 + \cdots
$$

Using forward regression, we can build models of increasing complexity. The first model (power-law decay) uses only the terms A and $a_0$, with all higher-order coefficients set equal to 0; the second model (power-law truncated by exponential decay) uses A, $a_0$, and $a_1$, with all higher-order coefficients set to 0; and so on. We then assess the significance of each model relative to the preceding model using analysis of variance, an F-test of the reduction of $\chi^2$.
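A minimal Python sketch of this nested-model procedure is given below, under stated assumptions: the fits are unweighted least squares solved through the normal equations, the helper names are hypothetical, and the data are synthetic (a power law truncated by an exponential, plus a small deterministic perturbation). Only the F statistic is computed; converting it to a p-value requires the F distribution with 1 and n - p degrees of freedom, for example from statistical tables or a statistics library.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def design_row(k, n_terms):
    """Columns: 1 (for A), log k (a_0), then k, k^2, ... (a_1, a_2, ...)."""
    return [1.0, math.log(k)] + [float(k) ** p for p in range(1, n_terms)]

def fit(ks, ys, n_terms):
    """Least-squares fit of log N(k) using A and a_0 .. a_{n_terms-1}."""
    rows = [design_row(k, n_terms) for k in ks]
    p = n_terms + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    coef = solve(xtx, xty)
    chi2 = sum((y - sum(c * x for c, x in zip(coef, r))) ** 2
               for r, y in zip(rows, ys))
    return coef, chi2

def forward_regression(ks, ys, max_terms=3):
    """Return [(coefficients, chi2, F vs. preceding model), ...]."""
    results, prev_chi2 = [], None
    for n_terms in range(1, max_terms + 1):
        coef, chi2 = fit(ks, ys, n_terms)
        dof = len(ks) - (n_terms + 1)   # residual degrees of freedom
        f_stat = None if prev_chi2 is None else (prev_chi2 - chi2) / (chi2 / dof)
        results.append((coef, chi2, f_stat))
        prev_chi2 = chi2
    return results

# Synthetic log-counts: a_0 = -2, a_1 = -0.03, plus a small perturbation
ks = list(range(1, 41))
ys = [7.0 - 2.0 * math.log(k) - 0.03 * k + 0.02 * math.sin(k) for k in ks]
results = forward_regression(ks, ys)
```

On these synthetic data, adding the exponential term gives a large F statistic relative to the pure power-law model, which is the signature of decay that is faster than power law.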

We used bins of width 1 for simplicity. To account for bins with a small number of counts, including empty bins where log N(k) is undefined, we performed a series of fits. First, we excluded bins with 0 counts. Next, we excluded bins with 0 or 1 counts and refit the model. Next, we excluded bins with 0, 1, or 2 counts and refit the model. We reasoned that power-law behavior is typically defined only when at least 3 orders of magnitude of power-law decay are observed. Thus, shaving off the tail of the distribution should not affect our ability to define a power-law exponent. Equivalently, a robust power-law fit should not require inclusion of the bins with the fewest number of counts.
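The series of bin exclusions can be expressed as a single minimum-count filter applied before each refit. The sketch below is illustrative only; the function name and the toy counts are assumptions.

```python
import math

def bins_for_fit(n_of_k, min_count):
    """Return (k, log N(k)) pairs for width-1 bins holding at least
    min_count proteins. min_count=1 drops only the empty bins."""
    return [(k, math.log(n))
            for k, n in sorted(n_of_k.items())
            if n >= min_count]

# Hypothetical degree counts with a sparse tail
toy_counts = {1: 120, 2: 31, 3: 14, 4: 6, 5: 2, 6: 0, 7: 1, 8: 2, 9: 1}

# Thresholds 1, 2, 3 correspond to excluding bins with 0; 0 or 1; 0, 1, or 2 counts
series = {m: bins_for_fit(toy_counts, m) for m in (1, 2, 3)}
```

Each entry of `series` would then be passed to the same fitting routine, and the parameter estimates compared across thresholds to check that they are stable.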

The empirical vertex degree distributions are depicted in Figure 1A. The values estimated for the power-law decay parameter $a_0$ and the exponential decay parameter $a_1$ are depicted in Figures 1B,C. We note first that the degree distribution for the entire network has a highly significant exponential decay component when the bins with count 1 are excluded from the fit. The estimate for the exponent is approximately -0.03. The inverse of this exponent is of the same magnitude as the 48 to 96 clones sequenced for each bait, which supports the hypothesis that the experimental design has contributed to a decay that is faster than power law.
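As a rough check on this comparison, the exponential factor in the truncated power-law form falls by a factor of e over a characteristic degree $k^{*}$ equal to the inverse magnitude of the exponential decay parameter. The symbol $k^{*}$ is introduced here only for illustration.

```latex
N(k) \propto k^{a_0} e^{a_1 k},
\qquad
k^{*} = \frac{1}{|a_1|} \approx \frac{1}{0.03} \approx 33
```

A value near 33 is of the same order as the 48 to 96 clones sequenced per bait.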

In contrast to baits, preys do not have an interaction count that is limited by the experimental design. Thus, we hypothesize that the degree distribution of preys (for each prey, the number of unique baits that identified it as an interaction partner) is less affected by sampling limitations. The vertex degree distribution for preys appears to be a power-law distribution (Fig. 1A). This appearance is borne out statistically with a power-law parameter ranging from -2.0 to -2.3 (Fig. 1B) and an exponential decay parameter that is indistinguishable from 0 at a p-value of 0.05 (Fig. 1C). These parameter estimates are stable over a range of minimum bin
