## F15kfo gktvk0

and where vk (0,t) and sk°\o,t) are defined in Appendix 1. It follows that under ha, Q1 x1(£1) as n ^to, where the noncentrality parameter £1 is given by £1 = f2^2k=1 vk)2/1'V1- For proportional hazards alternatives, f is seen to reduce to a linear combination of the 5k, and £1 reduces to (1'VD8)2/1'V 1, where 8 = (51,52, • • • , 5K)'.

Next consider the vector estimator 3 which arises from the model (1).

The asymptotic distribution of 3 under ha is shown in Appendix I to satisfy ~ £

\/[email protected] —> N(/, U), where / = (f1, f2, • • • , fK), and

5k 0° gk(t)vk(Q,t)sf\Q,t)Xk(t)dt fk = -0-TTv\-- (5)

It follows that Q2 xl(£2) as n ^ to, where £2 = , and that

Q3 xK(£3) as n ^ to, where £3 = fx'U-1/ . For proportional hazards alternatives, pk reduces to Sk and the non-centrality parameters for Q2 and

Finally, consider the arbitrary linear combination test Qc. It follows from the asymptotic normality of ¡3 that Qc converges to x1(£c), where £c = .

The optimal test in this class, say Qopt, is thus the one using the weight copt = 1'/^ , f, and has non-centrality parameter £opt = f'S-1f. Note the non-centrality parameter Qopt is the same as that of the K df test Q3. Furthermore, by comparing the optimal weight of copt with the weights forming Q1 and Q2, it is not hard to see that Q2 is the optimal test when Pi = P2 = • • • = Pk, and Qi is the optimal test when f is proportional to EVd 1 = V-1V1. We return to these results below.

### 4 Comparisons of Test Statistics

We now use the results of Section 3 to assess the relative power of Q1, Q2, Q3, and Qopt under variety of settings corresponding to homogeneous or heterogeneous treatment effects across failure times, and for special correlation structures among the failure times.

Suppose that the components of the mean of the asymptotic distribution of y/n3 under Ha are equal, i.e., pi = p2 = • • • = Pk = Po. This will result when the treatment groups have a common proportional hazards ratio for each k; that is, gi(t) = • • • = gk(t) = 1 and ¿1 = • • • = Sk. However, this can also arise when non-proportional hazards relationships exist for various k, but in a way that the mean of the resulting asymptotic distribution of i/[email protected] has equal components. As seen in Section 3, the non-centrality parameters of

Qi, Q2, Q3, and Qopt are £1 = pi , and £2 = £3 = £opt = PqI'^-11.

Thus, Q2 is the optimal 1 df directional test for this setting and has the same non-centrality parameter as the K df omnibus test Q3. It also follows that Q2 has greater asymptotic power than Q3 for all K > 1.

Figure 1 displays the asymptotic power of the directional test Qopt and the omnibus test Q3, based on a Type I error of 0.05, different values of £, and for K = 2, 3,4, 5, 6. When p1 = • • • = pK, Q2 = Qopt. Here we use this figure to discuss the choice of Q2 and Q3. The range of £ was chosen to yield powers for Q2 that range from 0.05 when £ = 0(H0) to 0.95. The successive lines beneath the top line represent the powers of Q3 for K = 2, 3, 4, 5, 6. For example, when the power of Q2 is 0.80, the power of Q3 is .71, .65, .60, .56, .52 for K = 2, 3,4, 5, 6. Thus with K = 2 failure times, the omnibus test Q3 maintains reasonably good power against the Q2, the optimal directional test when the treatment effects are homogeneous across failure times. This is consistent with the results found by Hughes [HUG97]. The relative power of the omnibus test remains relatively high when K = 3. For large K, however, the power advantage of using Q2 becomes substantial. This potential advantage must be weighed against the possibility that the true alternative to Ho may not correspond to homogeneity of treatment effects across failure times. For an arbitrary alternative, the asymptotic relative efficiency of Q2 to Q3 can be determined from £2 and £3, and K, and can be substantially lower than 1 for some alternatives, such as when one treatment is better for some K but worse for other K.

Because Qi and Q2 each have 1 df, we can assess their relative powers by examining their asymptotic relative efficiency of Qi to Q2, given by

Let Id denote the diagonal matrix for which Id Id = Vd and let C denote the corresponding correlation matrix, so that V = Id CId and l-1 = Id C-1Id . Then

From this representation, it can be seen that ARE[Q1, Q2] < 1, with equality when V is diagonal or of the form al + bJ, where J is the KxK matrix of 1s. Thus, Qi can be as good as the optimal test Q2 when |1 = 12 = ■ ■ ■ = Ik, provided that correlation structure of Ti, ■ ■ ■ , Tk leads to an asymptotic covariance matrix for y/nfi for which V has this form. This would arise, for example, when the Tk are uncorrelated.

For other correlation structures, however, the ARE of Qi to Q2 can be very low. For example, suppose that K = 4 and that the survival times have the same variance and the correlation matrix

/ 1 0.7 0.7 0.4\ 0.7 10.10.1 0.7 0.1 10.1 , \0.4 0.10.1 1J

then the asymptotic relative efficiency ARE[Qi,Q2] = 0.20, despite the homogeneity of treatment effects across strata. The fact that Qi can be substantially less efficient than Q2 when the treatment hazard ratios are equal may seem counter-intuitive since Qi is derived from model (2), which assumes the /3k are equal. However, the working likelihood function (2) is created as if the failure times were uncorrelated. While this still leads to a consistent estimator of /3, the inefficient form of this working likelihood leads to an inefficient estimator. This can be seen analytically by noting that the asymptotically equivalent linear combination test corresponding to Qi does not use the optimal inverse-weighting that Q2 uses when ^i = ^2 = ■ ■ ■ = I^k .

When the fk are not equal, the asymptotic expressions derived in Section 3 are not analytically difficult. However, because the relative efficiencies of the test statistics depend on the correlation structure among the Tk, the amount of censoring, and the magnitude of the treatment differences, the formula do not lend themselves to simple practical interpretations when the treatment differences are not homogeneous across failure times.

Since the optimal statistic Qopt depends on the unknown parameter f, use of Qopt in practice is generally not feasible. However, when viewed as a "gold standard", this test provides insight into the choice and formation of test statistics. Because the noncentrality parameter of the 1 df optimal directional test is identical to that for Q3, the top line in Figure 1 also represents the power of the optimal directional test for an arbitrary alternative to Ho. Thus, the power of the the omnibus test Q3 for a particular choice of K can be compared to the maximum power achievable by a directional test. When K = 2 or 3, the power of Q3 is surprisingly high compared to that achieved by the optimal directional test. Thus, one may not need to resort to a directional test when K is small. However, as K increases, the advantage of Qopt to Q3 increases substantially, so that the use of a directional test becomes more desirable when there is some confidence about the nature of the alternative to Ho. For example, if there is reason to believe that one treatment is uniformly superior to the other, but that the magnitude of benefit may decrease with k, then one may consider the weight c = , where e'K = (l, ^, ■ ■ ■ , K). Such a directional test would be more powerful than Q3 for alternatives that are reasonably approximated by fk a l/k for k = l, 2, ■ ■ ■ ,K.

### 4.3 Special Correlation Structures

Regardless of whether the treatment effect is homogeneous across failure times, the relative efficiency of Qi and Q2 depends on the correlation structure of Ti,T2,■ ■ ■ ,Tk. Here we note two special cases for which Qi and Q2 are equivalent. When the Tk are uncorrelated, we have V = Vd and hence S = V—1, from which it follows that Qi and Q2 are equivalent with

= £2 = f . In this case, £opt = £3=u'VD f, and thus the ARE of Qi 1 Vd 1

Another special covariance structure V is of the form v2((l — p)I + pJ), where I is identity matrix and J = 11' , i.e., all the marginal survival times have the same variance and the correlation between any two types of events are the same. It can be shown (see Appendix II) in this case v2y k= s2

that Qi and Q2 are equivalent, with £i = £2 = k+pK=-i)k , and that = £„ = v2 (ff__P(1f)2_^

5 Determining Sample Size and K

The focus of the previous section was a comparison of the test statistics for a particular choice of K. However, the design of a study also involves the choice of K, the number of failure times to analyze, and the sample size n. Suppose first that K and a test statistic have been selected for use and it is desired to determine the sample size or power for the study. Then the asymptotic results in Section 3 can be directly applied. To illustrate, suppose that we wish to use Q2 and assume that the treatment groups have proportional hazards for each of the K failure times. That is, gk (t) = 1 for each k. Then upon replacing 5k by —nak, equation (5) reduces to ¡k = yfnak. Once a form for the censoring distribution is assumed, one can evaluate S and thus the noncentrality parameter £2. Standard formula can then be used to determine the necessary sample size based on assuming that Q2 has the 1 df chi-square distribution with noncentrality parameter £2 under the alternative. Similar techniques can be applied to any of the other test statistics.

The choice of K can be complicated because the power of any test depends on the length of follow-up of subjects, the nature and magnitude of the treatment difference for the different failure times, and the correlation of the failure times. The asymptotic formula in Section 3 could be assessed to compare the relative powers of any particular test for various choices of K. However, one may not know enough about the amount of censoring of each failure time and of the relative treatment effects across failure times to evaluate these expressions before doing a trial.

In general, increasing K does not necessarily increase power of a test. One exception to this is the use of the directional test Q2 when ¡1 = ¡2 = • •• = ¡k. Denote this test by Q2(K), and consider the use of the same statistic, say Q2(K') based on K' < K failure times. Since Q2(K) is the optimal directional test in this setting, and Q2(K') can be expressed as a linear combination test based on the K-vector ¡3, it follows that the asymptotic power of Q2(K) is at least as great as that of Q2(K') for any K' < K. That is, when ¡1 = ¡2 = • • • = ik, the power of Q2 increases with K regardless of the censoring distributions or the correlation structure of the failure time. The magnitude of the power gained with increasing K depends on censoring and the correlation structure, however, and in some cases could be small.

Similarly, the power of the optimal test Qopt will increase with K for any alternative to Ho, any censoring pattern, and any correlation structure among the failure times. However, this result is of little practical value because one rarely is certain about the direction and magnitude of the treatment difference for each K. In contrast, the power of the omnibus test Q3 need not increase with K. As we see below, while the noncentrality parameter of Q3 increases with K, the increasing degrees of freedom can offset this and lead to less power as more failure times are included.

Hughes [HUG97] investigates in detail the choice between K = 1 (in which case the directional and omnibus tests are equivalent) and the use of the omnibus test when K = 2 in the setting of proportional hazards alternatives in which one treatment is superior to the other for each K. He shows that the more powerful of these tests depends on the amount of censoring as well as the correlation of the two failure times, and that the test based on a single failure time (K =1) often is as or more powerful to the omnibus test with K = 2.

To get a practical sense of the trade-offs involved in choosing K for more complex settings, we conducted a simulation study. We generated Ti,T2 — Ti,Ts — T2, ■ ■■ , Tk — Tk—i to be K independent exponential random variables, the kth with intensity hk. Thus, the gap times between events are independent. Tj is the sum of j independent exponentials, so that the covari-ance between Tj and Tk for j < k is just the variance of Tj. Two treatment groups are generated, each having a sample size of 200. For the first treatment group, we take hk = 1 for all k, so that Tj has the Gamma distribution with parameters 1 and j. For the second treatment group, we use various choices for hi,h-2, ■■■ , hK. We chose the hk to be increasing, constant, decreasing, or non-monotone. The potential censoring time is taken to be an independent uniform (0, t) random variable with t chosen to give the desired censoring percentage. Each simulation is repeated 1000 times. Despite this simple correlation structure, the resulting failure time Tk for the treatment groups does not have a proportional hazard relationship for k =1, 2, ■■■ ,K.

Table 1 displays the power of Q2, Q3 and Qc, for c = -. "F _elK with eK =

(1, 2,..., K) or (K, K — 1,..., 1), when choosing a univariate analysis (K = 1) or when analyzing K = 2, 3, or 4 events. The choice of ck might be made when one expects the treatment difference to increase (ck = 1, 2,..., K) or decrease (c = K, K — 1, ... , 1) with successive failure times. In these simulations we used t = 4, which leads to heavy censoring occurred in later events. The first column corresponds to gap time hazard ratios of (hi, h-2, h3, h.4) = (1, .8, .6, .4), corresponding to strongly decreasing hk with successive failure times. Since hi = 1, the univariate tests have power equal to the Type I error. In this case, the gain in power for K > 1 is evident for most of the test statistics, especially Q3 and Qc with increasing weight ck = (1, 2, ...K), despite of the increased rate of censoring of the second, third, and fourth failure times. Q2 and the directional test based on the weight ck = (4, 3, 2,1) do poorly here. The second column in Table 1 gives the power when the hk equal 0.8 for each k. Here the power of the directional tests with weight ck = (1, 2, 3, 4) increases with K but the power of the omnibus test Q3 does not, indicating that any gains in information from examining more survival times is more than offset by the increased censoring and degrees of freedom. The 3rd column of Table 1 gives the power of the tests for different K when the hk increase slowly. None of the tests showed power gains by analyzing more events. This is not surprising as the treatment difference no longer exists for the third and fourth gap times. The last column gives the hazard rates of an inconsistent treatment effect among the gap time. The treatment effect of the first and the fourth gap time is just the opposite of the second and the third gap time. Here none of the directional tests are oriented towards such treatment differences and thus none perform well in comparison to Q3.

We conducted another simulation study with K = 2, 3, and 4 unordered events and proportional hazards realtionships between the treatment groups. To do so, we extended Gumbel's bivariate exponential distribution [GUM60] to four exponential variables Ti, T2, T3, and T4 with joint cumulative distribution:

F (ti,t2,ts,t4) = (1 — e-hlt1 )(1 — e-h2t2 )(1 — e-h3t3 )(1 — e-h4t4)

(1 + ai2e-hltl-h2t2 )(1 + ai3e-hltl-h3t3 )(1 + ai4e-hltl-h4t4) (1 + a23e-h2t2-h3t3 )(1 + a24e-h2t2-h4t4)(1 + a34e-h3t3-h4t4)

where hi > 0 (i =1, 2, 3,4) and —1 < aij < 1(1 < i < j < 4). The values of a^j determine the correlations between Ti and Tj. For one treatment group, we set hi = h2 = h3 = h4 = 1. We choose various values of hi for the second treatment group. It follows that the treatment hazard ratio for survival time Ti is hi. The potential censoring time is taken to have the uniform distribution on (0,4). For each simulation setting, we generate a sample size of 100 for each treatment group, and generate 1000 repetitions.

Four sets of aij values are used in these simulations, representing different correlation structures. For each set of a^j value, three sets of hazard (hi ,h2 ,h3 ,h4) are examined. Table 2 lists the power of Q2 and Q3 for K = 2 (the first 2 events), 3 (the first three events), and 4. When the marginal hazard ratios are the same for all the four survival times (hi = 1.25, i = 1, 2, 3, 4), it confirms that Q2 is always superior to Q3 for a given K, the power of Q2 increases as K increases, and the power of Q2 is about 10% higher than that of Q3 when K = 2 and about 15% higher than that of Q3 when K = 3. When the marginal hazard ratios are hi = 1.67, h-2 = 1, h3 = 1.67, and h.4 = 1, there is no treatment group difference for T2 and T4. Here Q2 has poorer power than Q3 for almost all the correlation structures and choice of K in Table 2. Additionally, the power for Q3 does not always increase with K. The power for Q3 when K = 3 is similar to that when K = 4. It is interesting to see that there are not much power loss for Q3 by adding the survival time T4, which has no treatment difference. For one correlation structure, aii = ai2 = ai3 = a23 = a24 = a34 = 1, the power Q3 when K = 4 is slightly greater compared to that when K = 3. In these simulations, the correlation structure has a greater impact on the power of Q2 than on that for Q3. When the marginal hazard ratios are hi = 1.25, h2 = 0.8, h3 = 0.8, and h4 = 1.25, the second treatment group has a higher hazard than the first one for Ti and T4 but a lower hazard for T2 and T3. In this case, Q2 has almost no power to detect treatment differece irrespect of the correlation structures and the choices of K. This is merely because the treatment differences are in the opposit directions for the four survival times. Q3 has better power than Q2 and the power increases as K increases. The power does not change greatly for different correlation structures.

6 Example: Recurring Opportunistic Infections in HIV/AIDS

To illustrate the results presented in this paper, we consider an example from two companion AIDS clinical trials [DAF95], [KLR92]. Patients were followed for opportunistic infections, which could occur repeatedly, and for mortality. The primary endpoint in the trial was the time until a patient's first OI or death, whichever came first. A total of 1530 patients were randomized to receive either AZT (n=512) or ddI (1008). We exclude 10 patients due to missing or incorrect information about the times of subsequent events. Define Tj to be the time from randomization until the jth OI or death, whichever comes first, for j = 1,^ • • ,K. Thus, if K=3 and a patient has 3 OIs, T1,T2, and T3 will denote the elapsed times from randomization to these OIs. However, if a patient has 1 OI and then dies, Ti,T2, T3 are the time until the OI, the time until death, and the time until death, respectively. Extension of the primary endpoint in this way induces a correlation among the failure times.

In the ddI (AZT) group, the number of patients experiencing 1, 2, 3, and 4 distinct OIs were 342(204), 119(80), 26(15), and 4(1). A total of 199 and 95 patients died in the ddI and AZT groups, respectively. The percent of jth events that were censored for the ddI (AZT) group was 59(55), 74(72), 79(80) and 80(81). We define Z = 1 for the ddI group, so that f3j denotes the log relative hazard of ddI to AZT for Tj. The resulting estimates are

the corresponding estimated covariance matrix is f 0.0068 0^0061 0^0058 0^0057 \ 0^0061 0^0107 0^0102 0^0101 00058 0^0102 0.0142 0^0140 ■ y 0^0057 0^0101 0^0140 0^0151 J

The estimators of the common ¡3 derived from (2) and used in Qi are 0141, 0^104, and 0^073, for K = 2, 3, and 4, respectively. The corresponding standard errors are 0^084, 0^087, and 0^090.

Table 2 gives the p-values corresponding to Q1,Q*,Q2, and Q3 for K = 1,2,3,4. As expected, the results for Q1 and Q* are very similar. Note that the estimated regression coefficients steadily decrease, suggesting that the initial advantage for the ddI group is only transient. Not surprisingly, the estimate of ¡3, the common hazard ratio, also decreases but does not become negative, and the p-value corresponding to Q2 steadily increases from 0.05

when analyzing just one event (K = l) to 0.079 when analyzing all four. Despite the increased rate of censoring, the omnibus test Q3 better detects a treatment difference than Q2 when K = 3 or 4, but does less well when only analyzing the first (K = l) or first two (K = 2) events. Qi and its asymptotic equivalent Qi give the least significant p-values when K = 3 or 4.

### 7 Discussion

A goal of this paper was to build upon earlier work of Wei, Lin & Weissfeld [WLW89], Lin [LIN94], Hughes [HUG97], and shed light on whether and how to use the marginal analysis approach of WLW method for the analysis of multivariate survival data. We did so primarily by deriving and evaluating the asymptotic behavior of two directional tests, Qi and Q2, and the omnibus test Q3 that have been previously proposed, as well as a more general class of linear combination tests. Qi derives from solving a constrained working likelihood that assumes that the treatment group hazard ratios are the same across the K survival times. Q2 is also motivated by the same assumption, but instead is based on a specific linear combination of the unconstrained estimators of the treatment hazard ratios. One noteworthy finding is that Q2 is the optimal linear combination test under this assumption, and in general is asymptotically more powerful than Qi, sometimes by a substantial amount. This stems from the fact that the working likelihood function used by WLW is inefficient, and as a result constrained estimators obtained from it are in general inefficient. We thus discourage the use of Qi and recommend Q2 when it is believed that the marginal treatment effects are approximately equal for all events.

One can easily describe the power gains from using the optimal directional test as opposed to the omnibus test based on(Figure 1). These can be substantial when K is large, but are modest for K = 2 or 3. For K = 2, 3 the power of the omnibus test is often close to the maximal power achievable by a directional test, which suggests that the omnibus test should be considered when there is not a good sense of the alternatives to H0 that are likely to occur.

When the treatment hazard ratios vary among the K survival times and the data are censored, the analytic formula for the asymptotic power of these tests do not lend themselves to simple guidelines on whether to use the omnibus test Q3, the directional test Q2, or some other directional test, say Qc. The censoring settings we evaluated in simulations assumed a common censoring variable for all K survival times, as would commonly occur in a clinical trial when the survival times corresponded to recurring events. Here the amount of censoring necessarily increases with successive survival times, thus limiting the amount of information gained by analyzing larger K. In some instances, this along with the additional degrees of freedom on Q3 with increasing K offset any gains in information.

In the HIV/AIDS example, the beneficial effect of the ddl treatment over AZT decreased with successive survival times. As a result, the omnibus test gave the smallest p-values when using K = 3 or 4 events. However, if K = 1 or K = 2 events had been analyzed, the directional test Q2 would have given a more significant result. On the other hand, had the treatment effect increased with successive failure times, as illustrated in the simulation, use of larger K can increase power of a directional test. In many, if not most, settings, analysts would not know in advance how the treatment hazard ratio varies with successive failure times. If one expected a attenuated effect, as was seen in the HIV/AIDS example, then a univariate analysis may prove best, especially when later survival times are heavily censored. Alternatively, one might choose the linear combination test Qc with weights c selected to reflect the suspected trend. On the other hand, if one expects treatment hazard ratios that are approximately equal, then use of Q2 with K > 1 is justified. Exactly how to choose K is not simple, but if one had a sense of the degree of censoring, the noncentrality parameter £ <x 1'S-11 could be evaluated to approximate the power of Q2 for various choices for K.