Figure 1. Vertex degree distributions and fits (axis label: minimum bin count).

counts used for the fit. Thus, the network obtained from highly-normalized preys is described well by power-law decay.

The interactions that contribute to the prey degree distribution are not all biologically relevant. Some may reflect assay-specific artifacts of the two-hybrid reporter system; other false-positive interactions may arise from weak or nonspecific binding; still others may be highly reproducible in vitro but involve proteins that are restricted to different developmental stages or tissues and never interact in vivo. We hypothesize that the power-law distribution may reflect overall properties of the distribution of binding constants between proteins rather than the number of biologically relevant interaction partners for a given protein. To test this hypothesis, we eliminated low-confidence interactions from the prey vertex degree distribution and refit the distribution. While the power-law decay parameter is close to the previous estimate of -2 (Fig. 1B), the exponential-decay parameter is now significantly different from 0. The value of the exponential-decay parameter ranges from -0.15 to -0.2 depending on the minimum bin count, at a p-value of 0.05 to 0.005.

Synthesizing these results, we propose that the power-law distribution reflects the distribution of binding constants observed between proteins in vitro, while an exponential distribution reflects the number of interaction partners that are relevant in vivo. We note that if the preferential attachment model is modified to restrict the potential interaction partners of a node, an exponential-decay degree distribution may result.12 Restriction may occur naturally in a biological system due to temporal or spatial restrictions to protein expression.

We temper the strength of our conclusion by noting that the number of interaction partners of a protein was used as one of the explanatory variables in deriving the statistical confidence scores.1

### Bait and Prey Distributions Reconciled

How far are we in identifying the complete set of protein-protein interactions that constitute the Drosophila protein interaction network? Answering this question has practical importance for estimating the cost of generating complete maps for other metazoan species, including human. Knowledge of the estimated complexity of metazoan protein interaction networks provides necessary input and tests for developing theories of biological network evolution.

Ideally, we would want to see consistency for the list of interaction partners for a protein when used as a bait and when used as a prey. A discrepancy in the interaction partners may indicate assay artifacts, for example DNA-binding activity in a protein that is used as part of the activation-domain fusion in the two-hybrid system. Unfortunately, when the number of interaction partners identified for a bait protein is limited by the number of clones selected for sequencing, it is not possible to compare the list of bait interaction partners and prey interaction partners directly. Even the simple summary statistic of the number of interaction partners may not be comparable.

Here we describe an approach that may reconcile the counts of interaction partners for a protein that is used as both a bait and a prey. The approach infers the true distribution of interaction partners for a protein when used as a bait from the limited experimental evidence. Using the subscript $i$ to label the bait, we sequence $k_i$ clones that correspond to $x_i$ unique prey proteins. Our goal is to estimate the total number of interaction partners $m_i$ from which the $x_i$ observed prey species have been drawn. Thus, we wish to estimate

$$\Pr(m \mid x, k) = \int d\theta \, \Pr(m \mid x, k, \theta) \, \Pr(\theta \mid x, k),$$

where $m$ represents the underlying interaction counts $\{m_i\}$ for each bait, $x$ represents the observed counts $\{x_i\}$, $k$ represents the number of clones sequenced for each bait, and $\theta$ represents the set of parameters describing the vertex degree distribution.

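To simplify the following discussion, we will assume that the probability distribution for $\theta$ is highly peaked near its maximum likelihood estimate

$$\theta_{ML} = \arg\max_{\theta} \Pr(\theta \mid x, k) = \arg\max_{\theta} \Pr(x \mid \theta, k),$$

which corresponds to a flat prior distribution for $\theta$. In this case,

$$\Pr(m \mid x, k) \approx \prod_i \frac{\Pr(x_i \mid m_i, k_i) \, \Pr(m_i \mid \theta_{ML})}{\sum_{m'=1}^{\infty} \Pr(x_i \mid m', k_i) \, \Pr(m' \mid \theta_{ML})}.$$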

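The maximum likelihood estimate for $\theta$ can itself be obtained as

$$\theta_{ML} = \arg\max_{\theta} \prod_i \sum_{m'=1}^{\infty} \Pr(x_i \mid m', k_i) \, \Pr(m' \mid \theta).$$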

Once functional forms for $\Pr(m \mid \theta)$ and $\Pr(x \mid m, k)$ have been specified, the maximum likelihood estimate for $\theta$ may be found by direct maximization or expectation maximization; then, $\Pr(m \mid x, k)$ is readily calculated.
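The direct-maximization route can be sketched generically. The Python below is a minimal illustration under stated assumptions, not the procedure used in this work: the function names, the truncation `m_max`, the renormalization of the prior over 1..`m_max`, and the use of a simple grid search are all hypothetical choices, and the two functional forms are supplied by the caller as log-probability functions.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    values = list(values)
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))

def marginal_log_likelihood(theta, data, log_prior, log_lik, m_max=200):
    """log Pr(x | theta, k), summing each bait's hidden m over 1..m_max.

    data is a list of (x_i, k_i) pairs. log_prior(m, theta) may be
    unnormalized because it is renormalized here over 1..m_max
    (the truncation is an assumption of this sketch).
    """
    ms = range(1, m_max + 1)
    log_z = logsumexp(log_prior(m, theta) for m in ms)
    total = 0.0
    for x, k in data:
        joint = logsumexp(log_lik(x, m, k) + log_prior(m, theta) for m in ms)
        total += joint - log_z
    return total

def grid_mle(thetas, data, log_prior, log_lik, m_max=200):
    """Return the candidate theta with the largest marginal likelihood."""
    return max(
        thetas,
        key=lambda t: marginal_log_likelihood(t, data, log_prior, log_lik, m_max),
    )
```

A grid search is used only because it is easy to verify; a numerical optimizer or expectation maximization could be substituted without changing the marginal likelihood being maximized.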

Guided by the results for the prey vertex degree distribution, we suggest that an appropriate functional form for $\Pr(m \mid \theta)$ is

$$\Pr(m \mid \theta) = \frac{\exp(-\alpha_0 \ln m - \alpha_1 m)}{\alpha_1^{\alpha_0 - 1} \, \Gamma(1 - \alpha_0, \alpha_1)},$$

where $\theta$ is defined by the pair of parameters $(\alpha_0, \alpha_1)$, we have moved to a continuous distribution for $m \ge 1$, and we have recognized that the normalization constant is simply related to the standard definition of the incomplete gamma function.

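Finally, we require the probability distribution $\Pr(x_i \mid m_i, k_i)$. We again make a simplifying assumption, that each of the $m_i$ interaction partners is equally likely to yield a clone. While this assumption is unlikely to be true even for a normalized prey library, it provides a necessary starting point for more advanced analysis. We make a second simplifying assumption that the presence or absence of each of the $m_i$ prey species in the $k_i$ clones is independent. In this case, $\Pr(x_i \mid m_i, k_i)$ is given by a binomial distribution,

$$\Pr(x_i \mid m_i, k_i) = \binom{m_i}{x_i} \, p_i^{x_i} \, (1 - p_i)^{m_i - x_i}, \qquad p_i = 1 - \left(1 - \frac{1}{m_i}\right)^{k_i},$$

where $p_i$ is the probability that a given interaction partner appears at least once among the $k_i$ clones.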
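To make the estimate for a single bait concrete, the following Python sketch computes a posterior over the number of partners $m$ given $x$ unique preys among $k$ clones. It is illustrative only. It reads the two simplifying assumptions as a binomial in which each partner is detected with probability $1 - (1 - 1/m)^k$ (an assumed form), uses the power-law with exponential cutoff as an unnormalized prior on integer $m$, and truncates at a hypothetical `m_max`; all function names are invented for this example.

```python
import math

def log_prior(m, alpha0, alpha1):
    """Unnormalized log Pr(m | theta) for theta = (alpha0, alpha1)."""
    return -alpha0 * math.log(m) - alpha1 * m

def _term(count, logp):
    """count * logp, with the convention 0 * (-inf) = 0."""
    return 0.0 if count == 0 else count * logp

def log_likelihood(x, m, k):
    """log Pr(x | m, k): x of the m partners appear among k clones."""
    if x > m:
        return float("-inf")
    # log probability that one given partner is missed by all k clones
    if m > 1:
        log_miss = k * math.log1p(-1.0 / m)
    else:
        log_miss = 0.0 if k == 0 else float("-inf")
    # log probability that it is seen at least once
    log_hit = math.log(-math.expm1(log_miss)) if log_miss < 0.0 else float("-inf")
    log_binom = math.lgamma(m + 1) - math.lgamma(x + 1) - math.lgamma(m - x + 1)
    return log_binom + _term(x, log_hit) + _term(m - x, log_miss)

def posterior_m(x, k, alpha0, alpha1, m_max=500):
    """Return {m: Pr(m | x, k)} for m = 1..m_max (truncation is an assumption)."""
    logs = {
        m: log_likelihood(x, m, k) + log_prior(m, alpha0, alpha1)
        for m in range(1, m_max + 1)
    }
    top = max(logs.values())
    if top == float("-inf"):
        raise ValueError("observation has zero probability for every m <= m_max")
    weights = {m: math.exp(v - top) for m, v in logs.items()}
    norm = sum(weights.values())
    return {m: w / norm for m, w in weights.items()}
```

Because the posterior is normalized over $m$, the prior's normalization constant cancels and never needs to be evaluated. Qualitatively, when $k$ is much larger than $x$ the posterior concentrates at $m = x$, whereas when every sequenced clone is unique it spreads to larger $m$.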

We speculate that the parameters $\alpha_0$ and $\alpha_1$ determined by the approach outlined above for the baits should agree with the power-law decay and exponential-decay parameters obtained by fitting the vertex degree distribution for the preys. We further suggest that a discrepancy between the number of interaction partners estimated for a bait and the number observed when the same protein is used as a prey could signal an assay-dependent artifact. The formulas above offer a quantitative method for making such a determination. They also link the amount of work done, measured by the parameter $k$, to the completeness of the map, measured by the ratio $x/m$.
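The link between effort and completeness can be illustrated with a short calculation. The sketch below assumes, as in the simplifying assumptions above, that each of the $m$ partners is equally likely to yield a clone, so that the expected fraction of partners seen after $k$ clones is $1 - (1 - 1/m)^k$. The function names are hypothetical.

```python
def expected_completeness(m, k):
    """Expected x/m when k clones are drawn uniformly from m partners."""
    if m < 1 or k < 0:
        raise ValueError("need m >= 1 and k >= 0")
    return 1.0 - (1.0 - 1.0 / m) ** k

def clones_needed(m, target):
    """Smallest k whose expected completeness reaches the target fraction."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    k = 0
    while expected_completeness(m, k) < target:
        k += 1
    return k

if __name__ == "__main__":
    for m in (5, 10, 50):
        print(m, clones_needed(m, 0.95))
```

Under this assumption the number of clones needed for a fixed level of completeness grows roughly in proportion to $m$, so highly connected baits dominate the sequencing cost.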

Note that the prey vertex degree distribution is also affected by the finite sampling of each bait. The approach described here for the bait distribution could be modified to yield an improved estimate for the number of interaction partners for each prey.

### Determining the Length Scale of the Network

We move now from examining the properties of the vertices to more global measures of network organization. A defining property of small-world networks is clustering: a pair of vertices connected to a third vertex has an enhanced likelihood of being connected to each other. This property has been used to infer unobserved connections in protein interaction networks.13

Clustering as typically defined measures the ratio of the number of triangles in a network (three vertices connected together) to the number of triangles observed in an equivalent randomized network. To examine clustering over longer length scales, we defined a more generalized measure by counting the number of higher-order cycles in a network, and comparing this count to the distribution observed in an equivalent randomized network.
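A generalized cycle count of this kind can be sketched in a few lines of Python. The code is illustrative only: it counts simple cycles by brute-force depth-first search, which is practical only for small graphs or short cycles, and it takes "equivalent randomized network" to mean a degree-preserving randomization by double-edge swaps, which is an assumption of this example. All function names are hypothetical.

```python
import random

def to_adjacency(edges):
    """Build {vertex: set(neighbors)} from an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def count_cycles(adj, L):
    """Count simple cycles with exactly L vertices (labels must be orderable)."""
    if L < 3:
        raise ValueError("a cycle needs at least 3 vertices")

    def extend(start, current, visited, depth):
        if depth == L:
            return 1 if start in adj[current] else 0
        found = 0
        for nxt in adj[current]:
            # Root each cycle at its smallest vertex to avoid recounting it.
            if nxt > start and nxt not in visited:
                visited.add(nxt)
                found += extend(start, nxt, visited, depth + 1)
                visited.remove(nxt)
        return found

    # Each cycle is traversed once in each direction, hence the division by 2.
    return sum(extend(s, s, {s}, 1) for s in adj) // 2

def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization by double-edge swaps (assumed null model)."""
    edges = [tuple(e) for e in edges]
    if len(edges) < 2:
        return edges
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def cycle_excess(edges, L, n_random=20, seed=0):
    """Return (observed count, mean count over randomized networks)."""
    rng = random.Random(seed)
    observed = count_cycles(to_adjacency(edges), L)
    counts = []
    for _ in range(n_random):
        shuffled = rewire(edges, 10 * len(edges), rng)
        counts.append(count_cycles(to_adjacency(shuffled), L))
    return observed, sum(counts) / n_random
```

Returning the observed and randomized counts separately, rather than their ratio, avoids dividing by zero when the randomized networks contain no cycles of the requested length.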

Solutions to the cycle-count distribution may also be obtained from mathematical models of random networks. The mathematical models we describe below permit closed-form analytic solutions for the cycle-count distribution. The key simplifying assumption of these models is a simplified vertex degree distribution. As described below, we check this assumption by also performing simulation studies of an ensemble of randomized networks. Agreement between theory and simulation bolsters the credibility of the theory and suggests that the cycle-count distribution may be insensitive to certain details of the vertex degree distribution.

We start with a mathematical model in which pairs of proteins in a network with $N$ total proteins are connected with probability $J/(N-1)$. When $J$ is much smaller than $N$, which is expected for biological networks, this yields a Poisson vertex degree distribution with mean $J$. The number of cycles of length $L$, $N(L)$, is equal to the number of ways to select $L$ proteins times the probability that each is connected to the next, divided by the symmetry factor $2L$ for a closed loop of length $L$,

$$N(L) = \frac{N!}{(N-L)!} \left(\frac{J}{N-1}\right)^{L} \frac{1}{2L}.$$

The initial combinatorial factor is

$$\frac{N!}{(N-L)!} = N^{L} \left[1 + O(L^2/N)\right],$$

where $O$ is the symbol for asymptotic order, and the factor involving $J$ is $(J/N)^{L} \, [1 + O(L/N)]$. The simplified result for the cycle-count distribution for a random network is

$$N(L) \approx \frac{J^{L}}{2L}.$$
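As a numerical check on the large-$N$ simplification, the Python sketch below compares the full expression $\frac{N!}{(N-L)!} \left[J/(N-1)\right]^L / (2L)$ with the leading-order form $J^L/(2L)$. The function names and the example values of $N$, $J$ and $L$ are chosen only for illustration.

```python
import math

def cycles_exact(N, J, L):
    """Expected number of L-cycles: N!/(N-L)! * (J/(N-1))**L / (2L)."""
    if not 3 <= L <= N:
        raise ValueError("need 3 <= L <= N")
    # Work in logs so the falling factorial does not overflow for large N.
    log_perm = math.lgamma(N + 1) - math.lgamma(N - L + 1)
    return math.exp(log_perm + L * math.log(J / (N - 1))) / (2 * L)

def cycles_approx(J, L):
    """Leading-order result J**L / (2L), valid when L**2 << N."""
    return J ** L / (2 * L)

if __name__ == "__main__":
    N, J = 7000, 5.0  # illustrative values, not taken from the data
    for L in (3, 5, 10, 20):
        ratio = cycles_exact(N, J, L) / cycles_approx(J, L)
        print(L, ratio)
```

The ratio stays close to 1 as long as $L^2$ is small compared with $N$, consistent with the $O(L^2/N)$ correction to the combinatorial factor.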

We anticipate, however, that biological networks will be characterized by structure corresponding to protein complexes, with enhanced connectivity for proteins within a complex. This picture is illustrated in Figure 2. We incorporated this behavior in a random model in which each protein is assigned to one of several protein complexes, and the probability of an interaction is enhanced for proteins residing in the same complex.

To make the model explicit, we define $K$ complexes with $P$ proteins in each complex, giving $N = KP$ total proteins. Proteins within a complex are connected with probability $J_W/(P-1)$, yielding $J_W$ within-complex neighbors on average, and proteins in different complexes are connected with probability $J_B/(N-K)$, yielding $J_B$ between-complex neighbors on average.
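A network from this model can be sampled directly. The Python sketch below is a minimal illustration with hypothetical function names; it assigns protein `v` to complex `v // P` and uses the connection probabilities exactly as stated, written here as `J_W / (P - 1)` within a complex and `J_B / (N - K)` between complexes.

```python
import random

def build_complex_network(K, P, J_W, J_B, seed=0):
    """Sample one network; protein v belongs to complex v // P."""
    rng = random.Random(seed)
    N = K * P
    p_within = J_W / (P - 1)
    p_between = J_B / (N - K)  # denominator as stated in the text
    edges = []
    for a in range(N):
        for b in range(a + 1, N):
            p = p_within if a // P == b // P else p_between
            if rng.random() < p:
                edges.append((a, b))
    return edges

def mean_degrees(edges, K, P):
    """Return (mean within-complex degree, mean between-complex degree)."""
    N = K * P
    within = sum(1 for a, b in edges if a // P == b // P)
    between = len(edges) - within
    # Each edge contributes to the degree of both of its endpoints.
    return 2 * within / N, 2 * between / N

if __name__ == "__main__":
    K, P = 20, 10  # illustrative sizes only
    edges = build_complex_network(K, P, J_W=4.0, J_B=2.0, seed=1)
    print(mean_degrees(edges, K, P))
```

With these probabilities the expected within-complex degree is exactly $J_W$, while the expected between-complex degree is $J_B (N-P)/(N-K)$, which is close to $J_B$ when $K$ and $P$ are both large.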

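Cycles in this model can exist entirely within a single complex, or can cross between complexes. We first calculate the cycle-count distribution for single-complex cycles, writing $N_W(L)$ for the number of such cycles of length $L$,

$$N_W(L) = K \, \frac{P!}{(P-L)!} \left(\frac{J_W}{P-1}\right)^{L} \frac{1}{2L}.$$

Here we keep the first two terms for the combinatorial factor,

$$\frac{P!}{(P-L)!} \approx P^{L} \exp\left(-\frac{L^2}{2P}\right).$$

Multiplying this expression with the remaining terms yields the final expression

$$N_W(L) \approx \frac{K J_W^{L}}{2L} \exp\left(-\frac{L^2}{2P}\right).$$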
