## Simulation studies

We used simulation studies to explore the predictive power of the accelerated failure time model using partial least squares and the Buckley-James fitting algorithm. Mean squared prediction error was used to measure how well the covariate effect was predicted, and mean absolute prediction error was used to measure how well the response was predicted. Simulations were done using different numbers of explanatory variables ($p = 10, 25, 40, 50,$ and $100$), with different correlations among the covariates ($\rho = 0$ and $0.3$), and for different underlying error distributions. The simulation design modeled a situation where many variables have moderate effects, a difficult situation for model fitting when sample sizes are not sufficiently large.

The simulations used the model $\log(T_i) = \beta_0' Z_i + \varepsilon_i$, $i = 1, \ldots, n$, where $Z_i \sim N_p(0, \sigma^2[(1 - \rho)I + \rho 11'])$ and the $\{\varepsilon_i\}$ were independent and identically distributed. The initial parameter vector $\beta_0$ was selected using independent draws from a uniform distribution on $(-0.2, 0.2)$ to reflect a setting where all variables have moderate effects. We generated $\{\varepsilon_i\}$ from two different distributions. The first was an extreme value distribution (Table 1) with survival function $S_\varepsilon(x) = \exp(-\exp(x/\sigma))$, with $\sigma = 0.5$, corresponding to an increasing hazard over time. This resulted in a Weibull distribution for the response variable. The other error distribution was a normal distribution (Table 2) with variance $\sigma^2 = 0.4$, which produces a variance for $\log(T_i)$ similar to that of the chosen extreme value error distribution. The censoring times were generated from a uniform distribution $U(0, c)$, and $c$ was chosen to produce an average censoring proportion of 20%.
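The statements about the extreme value error (increasing hazard, Weibull response, variance close to 0.4) can be checked with a short calculation. This is a sketch: $W$ and $E$ are helper symbols introduced here, with $E$ a standard exponential variable and $W = \log E$, while $\sigma = 0.5$ is the scale of the extreme value error $\varepsilon$.

$$
\begin{aligned}
P(W > w) &= P(E > e^{w}) = \exp(-e^{w}), \qquad \varepsilon = \sigma W \;\Rightarrow\; P(\varepsilon > x) = \exp(-\exp(x/\sigma)), \\
P(e^{\varepsilon} > t) &= \exp(-t^{1/\sigma}) = \exp(-t^{2}) \quad \text{(Weibull with shape } 1/\sigma = 2 > 1 \text{, so the hazard increases in } t\text{)}, \\
\operatorname{Var}(\varepsilon) &= \sigma^{2}\,\frac{\pi^{2}}{6} = 0.25 \times 1.645 \approx 0.41 .
\end{aligned}
$$

The value 0.41 is close to the variance 0.4 used for the normal error distribution.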

Fixing the sample size $n$ ($= 50$), the design matrix $Z = (Z_1, \ldots, Z_n)'$, and the parameter vector $\beta_0$, we generated a training sample $\{(\min\{T_i, C_i\}, 1\{T_i \le C_i\}, Z_i), i = 1, \ldots, n\}$ and a validation sample $\{(T_i^*, Z_i^*), i = 1, \ldots, m\}$, where $m = 100$ and $\{(T_i^*, Z_i^*)\}$ had the same distribution as $(T, Z)$. The true covariate effect for subject $i$ in the validation sample was given by $\beta_0' Z_i^*$.
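The data-generating design described above could be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function name `simulate_aft`, the covariate scale `sigma_z = 1.0`, and the censoring bound `c = 20.0` are hypothetical placeholders (the text gives no value for either; in practice `c` would be tuned until about 20% of observations are censored).

```python
import numpy as np

def simulate_aft(n, p, rho, beta0, c, rng, error="extreme", sigma_z=1.0):
    """Draw one sample from log(T) = beta0'Z + eps with U(0, c) censoring."""
    # Compound-symmetric covariance sigma_z^2 * [(1 - rho) I + rho 11'].
    # sigma_z = 1.0 is an assumption; the text does not state its value.
    cov = sigma_z**2 * ((1 - rho) * np.eye(p) + rho * np.ones((p, p)))
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    if error == "extreme":
        # If E ~ Exp(1), then 0.5 * log(E) has survival exp(-exp(x / 0.5)).
        eps = 0.5 * np.log(rng.exponential(size=n))
    else:
        # Normal error with variance 0.4.
        eps = rng.normal(scale=np.sqrt(0.4), size=n)
    T = np.exp(Z @ beta0 + eps)          # true failure times
    C = rng.uniform(0.0, c, size=n)      # censoring times
    observed = np.minimum(T, C)          # min{T_i, C_i}
    delta = (T <= C).astype(int)         # 1 = failure observed, 0 = censored
    return observed, delta, Z, T

rng = np.random.default_rng(0)
p = 10
beta0 = rng.uniform(-0.2, 0.2, size=p)   # moderate effects for all covariates

# Training sample (n = 50); c = 20.0 is a placeholder, not a tuned value.
time, delta, Z, T = simulate_aft(50, p, 0.3, beta0, c=20.0, rng=rng)

# Validation sample (m = 100): only the uncensored (T*, Z*) pairs are kept.
_, _, Z_star, T_star = simulate_aft(100, p, 0.3, beta0, c=20.0, rng=rng)
true_effect = Z_star @ beta0             # true covariate effects beta0'Z*
```

The extreme value draw uses the log of a standard exponential variable, which avoids relying on any particular parameterization of a library Gumbel sampler.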

To obtain the mean squared prediction error of the covariate effects, we fit an accelerated failure time model on the training sample using the Buckley-James algorithm with all covariates in the model (when $p < n$) and with partial least squares, then used the resulting parameter estimates $\hat\beta$ to predict the covariate effects for subjects in the validation sample. We computed the mean squared prediction error $\frac{1}{m}\sum_{i=1}^{m}(\hat\beta' Z_i^* - \beta_0' Z_i^*)^2$ for various numbers of latent variables and repeated this process $B_1 = 50$ times. We calculated the average of the mean squared prediction error over the $B_1$ validation samples for different numbers of latent variables and compared the performance of partial least squares and the Buckley-James method (Table 1 and Table 2). Table 1 also appears in [HH05] and is listed here for convenience.
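As a small worked example of this error measure, the sketch below computes the mean squared prediction error of the covariate effects for one validation sample. The function name `mspe_covariate_effects` and the toy numbers are hypothetical; `beta_hat` stands for whatever estimate a fitting method returns and `beta0` for the true parameter vector.

```python
import numpy as np

def mspe_covariate_effects(beta_hat, beta0, Z_star):
    """Mean over validation subjects of (beta_hat'Z* - beta0'Z*)^2."""
    diff = np.asarray(Z_star) @ (np.asarray(beta_hat) - np.asarray(beta0))
    return float(np.mean(diff**2))

# Toy validation sample with m = 2 subjects and p = 2 covariates.
Z_star = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
beta0 = np.array([0.1, -0.1])      # true effects
beta_hat = np.array([0.2, -0.1])   # hypothetical estimate

# Prediction errors are 0.1 and 0.0, so the mean of their squares is 0.005.
mspe = mspe_covariate_effects(beta_hat, beta0, Z_star)
```

Repeating this over independent training and validation samples and averaging the results gives the run-averaged quantity reported in the tables.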

In linear regression with censored data, when the number of explanatory variables is close to the number of uncensored observations, some dimension reduction technique would very likely be used on the covariates before fitting a linear model with the Buckley-James algorithm. Because no such dimension reduction techniques have been widely studied for the accelerated failure time model, we chose to compare the performance of model estimates using partial least squares with models using all of the data.

The "optimal" mean squared prediction error and number of latent variables were computed by averaging, over the validation samples, the minimum mean squared prediction error and the corresponding number of latent variables, respectively.

The "dominant" number of latent variables was defined as the number of latent variables that provided the minimum average mean squared prediction error over all the validation samples, and the corresponding mean squared prediction error was called the "dominant" mean squared prediction error.

A leave-two-out cross-validation (CV) method was applied to each validation sample to select the number of latent variables for the partial least squares method. The CV mean squared prediction error was the average of the mean squared prediction errors given by the cross-validated number of latent variables, and the corresponding average of the number of latent variables gave the CV number of latent variables.
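The three summaries (optimal, dominant, and CV) differ only in how the number of latent variables is chosen before averaging over runs. A minimal sketch, assuming the per-run errors are stored in a hypothetical array `mspe` with one row per run and one column per number of latent variables, and the cross-validated choices in a hypothetical vector `k_cv`:

```python
import numpy as np

def summarize_runs(mspe, k_cv):
    """mspe[b, k-1]: error of run b with k latent variables; k_cv[b]: CV choice."""
    mspe = np.asarray(mspe, dtype=float)
    k_cv = np.asarray(k_cv, dtype=int)
    runs = np.arange(mspe.shape[0])

    # Optimal: the best k is picked separately in each run, then averaged.
    k_opt = mspe.argmin(axis=1) + 1
    optimal = (float(k_opt.mean()), float(mspe.min(axis=1).mean()))

    # Dominant: a single k for all runs, minimizing the run-averaged error.
    avg = mspe.mean(axis=0)
    k_dom = int(avg.argmin()) + 1
    dominant = (k_dom, float(avg[k_dom - 1]))

    # CV: the error at each run's cross-validated k, averaged over runs.
    cv = (float(k_cv.mean()), float(mspe[runs, k_cv - 1].mean()))
    return {"optimal": optimal, "dominant": dominant, "cv": cv}

# Toy example: 3 runs, 1 to 3 latent variables (numbers are made up).
mspe = [[0.10, 0.08, 0.12],
        [0.07, 0.09, 0.11],
        [0.09, 0.06, 0.10]]
k_cv = [1, 1, 2]
summary = summarize_runs(mspe, k_cv)
```

By construction the optimal error is a lower bound for both the dominant and the CV errors, and the optimal and CV numbers of latent variables are averages that need not be integers, whereas the dominant number is always an integer.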

Across all the simulations, the mean squared prediction errors of the covariate effects from the partial least squares method using leave-two-out cross-validation to select the number of latent variables were close to those from the partial least squares method using the optimal number of latent variables. The mean squared prediction error of covariate effects from partial least squares was 50% or less of that from a model fit with the Buckley-James algorithm when $p < n$. The mean squared prediction error using the cross-validated number of latent variables was comparable to the optimal mean squared prediction error, even when the number of predictors $p$ was twice the sample size $n$. This suggests that the leave-two-out cross-validation method efficiently identifies the number of latent variables.

Mean Squared Prediction Error of Covariate Effects

| Number of covariates | | BJᵃ (ρ = 0) | Optimalᵇ (ρ = 0) | Dominantᶜ (ρ = 0) | CVᵈ (ρ = 0) | BJ (ρ = 0.3) | Optimal (ρ = 0.3) | Dominant (ρ = 0.3) | CV (ρ = 0.3) |
|---|---|---|---|---|---|---|---|---|---|
| 10 | P | 10 | 1.2 | 1 | 1.2 | 10 | 1.9 | 2 | 1.7 |
| | MSE | 0.103 | 0.060 | 0.060 | 0.063 | 0.135 | 0.069 | 0.082 | 0.093 |
| | (SE) | (0.008) | (0.004) | (0.004) | (0.004) | (0.011) | (0.004) | (0.006) | (0.006) |
| 25 | P | 25 | 1.4 | 1 | 1.2 | 25 | 1.8 | 1 | 1.5 |
| | MSE | 0.487 | 0.187 | 0.195 | 0.204 | 0.590 | 0.181 | 0.191 | 0.223 |
| | (SE) | (0.030) | (0.008) | (0.007) | (0.010) | (0.057) | (0.004) | (0.001) | (0.009) |
| 40 | P | 40 | 1.5 | 1 | 1.2 | 40 | 1.9 | 2 | 1.6 |
| | MSE | 2.298 | 0.222 | 0.235 | 0.243 | 3.242 | 0.288 | 0.307 | 0.326 |
| | (SE) | (0.209) | (0.010) | (0.009) | (0.011) | (0.350) | (0.007) | (0.013) | (0.011) |
| 50 | P | | 2.4 | 2 | 1.6 | | 1.9 | 2 | 1.8 |
| | MSE | N/A | 0.452 | 0.517 | 0.542 | N/A | 0.298 | 0.301 | 0.373 |
| | (SE) | | (0.014) | (0.021) | (0.022) | | (0.009) | (0.009) | (0.018) |
| 100 | P | | 3.6 | 3 | 2.1 | | 3.6 | 3 | 5.4 |
| | MSE | N/A | 1.136 | 1.176 | 1.287 | N/A | 0.738 | 0.749 | 0.912 |
| | (SE) | | (0.026) | (0.028) | (0.027) | | (0.016) | (0.016) | (0.028) |

ᵃ The Buckley-James algorithm.

ᵇ The optimal number of latent variables used at each run.

ᶜ The same number of latent variables used for all runs.

ᵈ The cross-validated number of latent variables used at each run.

Table 1. Comparison of mean squared prediction error of covariate effects from the Buckley-James algorithm and partial least squares given n = 50 and approximately 20% censoring, assuming an extreme value error distribution.

We used the conditional median to predict the response of a future subject and mean absolute prediction error to measure the accuracy of the response prediction: