## The Data Analysis

In chronic diseases such as cancer or HIV, statistical models are more often used to identify groups with predicted good or poor response to treatment than to predict individual outcomes. This approach is consistent with semi-parametric models such as the proportional hazards model [Cox72] and the AFT, where baseline failure rates and mean response values (the intercept in the linear model) are not included in the estimating equations. The goal of the analysis presented here was to find low dimensional predictors to rank subjects according to predicted outcome using stepwise regression, principal

 3 69, T3*, 82* 4 12, 48*, T4 5 16, 19, 20*, T2 6 13 T 14, 41 9 36 10 10* 11 62 12 15, 35 15 64, TT 19 T1* 21 3T 23 93 2T 90* 50 63

*These are the 7 out of 12 codon positions at which mutations are known to have association with resistance to SQV and/or IDV. Codons 24, 46, 54, 84, and 88 are left out since no more than two subjects with mutation at these positions.

Table 5. Codons for which 3 or more patient samples indicated mutations

Number of mutations 01234 5 6789 Number of patients 13784169462

Table 6. Distribution of mutations component regression (PCR) and partial least squares, all adapted to the Buckley-James algorithm for fitting the AFT. The information for ranking is contained in values estimated for 3' Z^, in the proportional hazards model for right-censored event times, these values denote log relative risk, and are sometimes called risk scores. In the AFT, positive values of the covariate effect 3'Zi denote changes in viral load that are larger than the mean change of the group, so in this setting, we call the value 3'Zi a "beneficial score".

Fitting the AFT model with the Buckley-James algorithm presents several computational challenges. The algorithm sometimes fails to converge, and variances of parameter estimates can be difficult to estimate. With this data set, we did not encounter convergence problems, even though the Buckley-James algorithm had to be repeated many times for both the stepwise fitting and partial least squares. We used a non-parametric bootstrap to estimate the standard errors of parameter estimates, drawing 500 bootstrap samples with replacement from the 60 observations and re-estimating regression coef-

ficients using Buckley-James algorithm. We then used the empirical variance of the estimated regression coefficients based on these bootstrap samples to approximate the variance of the estimated regression coefficients. Empirically, we have found that the inference results changed little with more than 500 bootstrap samples, which implies the sufficiency of 500 bootstrap samples. We did not use the [Tsi90] estimating equations based on the marginal likelihood of the ranks of the observed failure times because those equations are also difficult to solve numerically, and we did not use the recent iterative linear programming approach by [JLW03] because it was not easily adapted to the ideas of PLS.

Because of the large number of covariates, the forward stepwise regression with the AFT was preceded by bootstrap-based Wald tests of significance for association between change in logio RNA and the covariates in univariate regression models. Nine covariates were statistically significant at 0.1 level (Table 7).

In the last column of the table, we reported the proportion of reaching convergence within 50 iterations in the 500 bootstrap samples for each univariate regression, where the convergence of Buckley-James algorithm is reached when the relative change of the consecutive estimated coefficients is smaller than 1%. We have found that the proportions of reaching convergence are fairly high and the algorithm is often settled with few closed stable points when the convergence is not reached. The proportion of reaching convergence of Buckley-James algorithm for multivariate case is similarly high.

The final model from the stepwise forward procedure includes mutation in codon 19 and 69, the number of RT inhibitors and the number of weeks of prior saquinavir with respective point estimates (standard errors) for the centered and scaled covariates -0.112(0.033), -0.113(0.018), -0.228(0.064) and -0.192(0.083). In this setting, positive values of the regression coefficient indicate reductions in RNA levels between baseline and week 8, so the mutation at codon 19, 69, more RT inhibitors and longer time of prior saquinavir predict for a poor response. Because of the need to conduct the Buckley-James algorithm on 500 bootstrap samples for every model encountered in the stepwise subset selection procedure, this approach was computationally intensive.

With principal component regression, we made the somewhat arbitrary choice to include the first seven principal components as predictors since they explained 52% of the variation in the covariate space. The low proportion of explained variation by the first 7 principal components indicates a lack of correlation structure in the covariate space. The point estimates (standard errors) for these 7 components were -0.080 (0.045), 0.102 (0.048), -0.018 (0.053), 0.088 (0.050), 0.096 (0.053), 0.029 (0.054), and -0.024 (0.052). Only the estimated regression coefficient for the second principal component is statistically significant at 0.05 level.

The BJ-PLS approach using the leave-two-out cross validation produces a model which has one latent variable with point estimate (standard error) 0.262 (0.058). The single latent variable selected in BJ-PLS had the largest

 Covariates coef. se. p-value % of conv.* codon10 -0.195 0.273 0.476 97% codon12 0.218 0.224 0.330 82% codon13 -0.126 0.334 0.707 70% codon14 0.227 0.243 0.351 85% codon15 0.140 0.206 0.495 85% codon16 0.309 0.299 0.302 93% codon19 -0.366 0.166 0.027 86% codon20 -0.103 0.195 0.598 97% codon35 -0.033 0.182 0.856 89% codon36 -0.128 0.173 0.460 96% codon37 0.162 0.213 0.446 91% codon41 -0.062 0.189 0.741 89% codon48 0.456 0.308 0.139 99% codon62 -0.110 0.236 0.642 99% codon63 -0.325 0.241 0.176 85% codon64 0.014 0.180 0.937 87% codon69 -0.528 0.134 0.000 100% codon71 0.027 0.211 0.899 91% codon72 0.377 0.472 0.425 89% codon73 -0.178 0.289 0.538 89% codon74 -0.086 0.257 0.739 99% codon77 -0.005 0.194 0.978 90% codon82 0.137 0.261 0.599 98% codon90 -0.369 0.201 0.067 98% codon93 -0.258 0.185 0.162 99% wks. prior Saq. -0.005 0.002 0.055 99% CD4 percent 0.023 0.012 0.053 96% CD4 count 0.001 0.001 0.032 90% CD8 percent -0.009 0.007 0.234 97% CD8 count 0.000 0.000 0.593 86% screening viral load 0.024 0.197 0.904 90% Num. RT inhibitors -0.479 0.148 0.001 100% baseline log rna -0.185 0.106 0.081 93% SQVsqc -0.050 0.145 0.730 91% IDV 0.580 0.200 0.004 100%

Table 7. Univariate analysis with AFT model

* percentages of convergence reached in 500 within 50 iterations to obtain the standard bootstrap samples error estimates.

Table 7. Univariate analysis with AFT model observed z-score among all original or transformed variables. The loadings for the original 35 covariates are given in Figure 1, with the covariates listed in the caption in the order in which they appear along the horizontal axis. The latent variable is linear combination of the original covariates, and the loadings are the weights for the covariates in the combination. The loadings can be viewed as a measure of the contribution from the individual covariate to the latent variable. A closer look at the loadings for these data reveals that BJ-PLS and stepwise regression provide similar and somewhat complementary information. covariate index

covariate index

Fig. 1. Loadings for the standardized covariates. The covariates along the horizontal axis are, from right to left, (1) number of RT inhibitors, (2) number of weeks of prior saquinavir, (3) baseline log io viral RNA, (4) codon90, (5) CD8 percentile, (6) codon93, (7) codon63, (8) codon69, (9) codonlO, (10) CD8 cell count, (11) randomization to SQVsgc (12) codon19, (13) codon20, (14) codon62, (15) codon36, (16) codon74, (17) codon77, (18) codon71, (19) codon73, (20) codon13, (21) codon41, (22) codon64, (23) codon82, (24) codon35, (25) screening viral load, (26) codon72, (27) codon15, (28) codon48, (29) codon16, (30) codon37, (31) codon14, (32) codon12, (33) CD4 cell count, (34) CD4 percentile, (35) randomization to IDV

The objective of our analysis was to use baseline clinical measurements and HIV virus mutation information to classify future subjects into potentially "good" or "poor" responders. We computed estimated beneficial scores 3'Zi for all patients based on one latent variable estimated in BJ-PLS, the four variables selected in the stepwise approach to the AFT, and the first 7 principal components. All patients were divided into two groups according to whether or not their estimated beneficial scores were above or below the median score. The two groups were compared using the non-parametric Kaplan-Meier estimates of their distribution functions (Figure 2). For the groups constructed using PLS, the median reduction of HIV RNA from baseline was 0.64 log10 copies/mL (4.32 fold reduction) in the potentially good responders and -0.001 log10 copies (essentially, no reduction) in the other groups. Because of the data dependent way in which the groups were constructed, any p-value comparing these two groups would not be valid. We also divided the subjects into potentially good and poor responders using the fitted model from Step-AFT and PCR, with Kaplan-Meier estimates for the distributions of beneficial scores in two groups also shown in Figure 2. The pairs of survival curves are all similar; each method seems to adequately identify the cases whose response is in the right tail of the distribution of changes in RNA value. The principal components are least able to discriminate between individuals whose change in viral load is in the left tail (increases in viral load). The BJ-PLS and Step-AFT generated curves appear similar, but there are differences. There are 20 subjects who are placed in different groups by Step-AFT and by BJ-PLS. To examine the detailed ordering of the 20 subjects, the predictions from Step-AFT and BJ-PLS for all subjects were plotted in Figure 3. All the subjects that were inconsistently grouped were marked in the figure. Evidently, most subjects with inconsistent grouping by the two methods have relatively small estimated covariate effects by both methods and do not contradict the general trend of consistency between the two. The only exception is that the two methods made quite different prediction for one censored subject with change in viral load of at least 1.1 fold. One possible reason of the discrepancy is that the subject has high CD4 count and CD4 percentile whose beneficial effects are not reflected in the prediction of Step-AFT where the final model does not include CD4 count and percentile.

BJ-PLS Step-AFT  PCR Fig. 2. Kaplan-Meier estimates for distributions of change in logio RNA for good and poor risk groups estimated by BJ-PLS, Step-AFT and PCR

For a future subject with a covariate vector Zf, we can compute the predicted covariate effect by / Zf and the predicted response, the reduction in HIV—1 RNA level (log10 copies/mL) from baseline to week 8, by / Zf + 0.365, where / is the partial least squares estimate of the covariate coefficients with one latent variable and 0.365 is the estimated median of the error term, obtained by inverting the Kaplan-Meier estimate of the survival function of the empirical residuals.

We used a resampling experiment with these data to examine the ability of the leave-two-out cross validation method to prevent over-fitting. We ran-

o consistent prediction inconsistent prediction

prediction by BJ-PLS

Fig. 3. Scatter plot of predictions for covariate effects made by Step-AFT and BJ-PLS (subjects with inconsistent grouping are marked with cross)

domly divided the original data set into a training sample and a validation sample with sample sizes of 32 and 31, respectively. We then fit an accelerated failure time model to the training sample using partial least squares with different numbers of latent variables (k = 1, 3, 5, and 7) and used the resulting parameter estimates to predict the beneficial scores on the subjects in the validation sample. The validation sample was then divided into two groups using the median of the predicted beneficial scores as the cutoff point (high versus low). The difference between the two groups in RNA reduction was compared using the log rank test and the p-values were computed. We repeated this process for B2 = 100 times. Figure 4 gives the Q-Q plots of the p-values obtained from using different numbers of latent variables (k = 1, 3, 5, and 7) against the uniform distribution U(0,1). If the partial least squares estimates do not have much predictive power, the p-values should be uniformly distributed between 0 and 1. The observed pattern of the p-values using one latent variable differed the most from a uniform distribution.

To examine the reproducibility of predictions from these methods, we conducted two resampling experiments. The first, labeled CV I (for cross-validation I) took the models arrived at in the whole data sets from the three methods as fixed, then examined how well coefficients for these models pre-

Uniform Q-Q Plot (K=1

Uniform Q-Q Plot (K=1

 / * s S / / t f* o « sjP* fp*

0.4 0.6 0.8 Theoretical Quantiles

0.4 0.6 0.8 Theoretical Quantiles 0.4 0.6 0.8 Theoretical Quantiles Theoretical Quantiles

Fig. 4. Distribution of p-values

Theoretical Quantiles

### Theoretical Quantiles

Fig. 4. Distribution of p-values dieted outcome in validation samples when the coefficients were re-estimated in training samples. The second, CV II, re-estimated the model (including the model selection process) in each training sample then used a validation sample for predictions. In the CV II, we randomly split the data into training and validation samples with equal sizes. Then we estimated AFT, PCR and PLS model coefficients for the models arrived at earlier using the training sample; the parameter estimates were then used to split the validation sample into high and low beneficial score groups. We used a logrank test to assess the association between HIV-1 RNA change from baseline to week 8 with the two groups. A small p-value indicated a good separation between the groups. We repeated the cross-validation procedure B3 = 200 times. Models with less predictive ability will produce p-values in these 200 replicates that are closer to the uniform distribution. Models with more predictive ability will have p-values clustered near 0. Table 8 summarizes the empirical distributions of the observed p-values.

Overall, the model selected from BJ-PLS gave the smallest p-values, while the model from PCR seemed to have almost no predictive power. The differ-

Methods Min 1st Q. Median 3rd Q. Max Min 1st Q. Median 3rd Q. Max CV I CV II

BJ-PLS 0.000 0.002 0.009 0.035 0.9T1 0.000 0.094 0.22T 0.485 0.9T8 Step-AFT 0.000 0.005 0.016 0.059 0.9T5 0.000 0.085 0.251 0.585 0.988 PCR 0.000 0.200 0.419 0.T20 0.993

Table 8. Summary of the empirical distribution of p-values from cross-validation procedures for BJ-PLS, Step-AFT and PCR

ences between the modeling techniques were smaller in this experiment, most likely because the small training samples made the models chosen less reliable.

Since the PLS method led to a single latent variable, it is possible to use relatively simple methods for model checking. With one latent variable, the model reduces to a simple linear regression of the response variable on the latent variable. The Buckley-James algorithm replaces a censored response with an imputed value, an estimated conditional mean <p(Y°) = E(Y\Y > Y°). The appropriateness of the single latent variable in PLS can be checked in scatter plots of response by the latent variable, where censored responses are replaced by their imputed values, and by residual plots. These two plots are shown in Figure 5 and Figure 6. The least squares and lowess lines on the first of these plots show a strong linear relationship between the latent and response variables. Here, we used the lowess function in R (version 1.6.2) with the default smoother span of 2/3. The slope of the least squares line is 0.263, the same as the coefficient of the latent variable in PLS. The intercept (0.361) of the line can be used as an estimate of the mean of the error distribution, and is interpreted as the average RNA response across the subjects. The lowess line fit to the residual plot also suggests a strong linear relationship between the RNA response and the latent variable.

Figure 7 shows the estimated density of the values of 3 Zi from the latent variable, estimated from the observed values using BJ-PLS. The beneficial scores appear approximately normally distributed, supporting breaking the group of patients into two groups using the median score. Smaller groups could be constructed using the quartiles of this distribution.