Let $Y$ be an $n \times 1$ column vector of responses for $n$ subjects and $Z$ an $n \times p$ predictor matrix of $p$-dimensional covariate vectors. Some of the methods discussed in the paper are not scale-invariant, that is, they may yield different results when the response and/or covariates are rescaled. In this paper, the columns of $Z$ will always be centered to have mean zero and scaled to have variance one, even for binary covariates. The notation $Z$ is used for the covariate matrix to emphasize that point. We denote row $i$ of $Z$ by $Z_i$. We assume temporarily that the responses are not censored. When $Z$ is singular or nearly so in the linear model $Y = Z\beta_0 + \varepsilon$, ordinary least squares (OLS) estimates of the $p \times 1$ parameter vector $\beta_0$ are not estimable or may be numerically unstable. Data analysts commonly use two classes of methods to mitigate the effect of collinearity in the predictor matrix. One set of methods selects a subset of the original predictors, and numerous subset selection methods are available [DS81, Hoc76, Mil90]. The other set of methods is based on biased (typically shrinkage) estimators of regression coefficients, which may reduce mean-squared estimation or prediction error by reducing the variance of the estimators. These shrinkage methods include well-known methods such as ridge regression, principal components regression, and partial least squares, and some newer methods, such as the LASSO [Tib96]. Both sets of methods are sometimes used when $Z$ is of full rank as well.
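As a rough illustration of the standardization and of why near-collinearity matters, the following sketch centers and scales the columns of a small simulated predictor matrix and reports the condition number of Z'Z. The simulated data, the helper name `standardize`, and the use of the sample variance (divisor n - 1) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def standardize(X):
    """Center each column to mean zero and scale it to sample variance one."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered / centered.std(axis=0, ddof=1)  # ddof=1 is an assumption

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # nearly a copy of x1 (collinear pair)
x3 = rng.binomial(1, 0.3, size=n)     # binary covariate, standardized as well
Z = standardize(np.column_stack([x1, x2, x3]))

# A very large condition number of Z'Z signals numerically unstable OLS estimates.
cond = np.linalg.cond(Z.T @ Z)
print("column means:", Z.mean(axis=0))
print("column variances:", Z.var(axis=0, ddof=1))
print("condition number of Z'Z:", cond)
```

The two nearly identical columns make the smallest eigenvalue of Z'Z close to zero, which is the situation the subset-selection and shrinkage methods are meant to address.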

Stepwise regression methods are widely used, in part because software for stepwise model selection is available in nearly all standard statistical software packages. There is an extensive literature on efficient numerical algorithms for stepwise fitting of regression models, on incorporating penalty terms such as the AIC or Schwarz criterion (BIC) to reduce the likelihood of over-fitting, and on reducing the potential bias in estimates of coefficients for the variables selected. The recent monograph by [Mil90] contains an account of both the benefits and drawbacks of stepwise selection techniques for linear regression.

[Hot33] originally proposed principal component analysis to reduce the column dimension of a data matrix of highly correlated variables while retaining a large portion of the variation in the data. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ be the eigenvalues of $Z'Z$, with corresponding orthogonal eigenvectors $v_1, v_2, \ldots, v_p$. The vectors $Zv_j$ are called the principal components of $Z'Z$. Let $r$ be the rank of $Z'Z$. Principal component regression (PCR) replaces the columns of the original predictor matrix by the $K \le r$ vectors $Zv_1, \ldots, Zv_K$ and fits a regression model using the new predictor matrix. When $K < r$, the new vectors do not span the column space of $Z$, and the estimated parameters will not be unbiased estimates of $\beta_0$. In addition, there is no theoretical basis for the new predictors satisfying any statistical optimality criteria when $K < r$. Nevertheless, the approach has some appeal, primarily because the new predictor matrix will have orthogonal columns and the fit will be numerically more stable. In addition, $Zv_1$ has the largest variance among the $Zv_j$, $Zv_2$ the second largest variance, and so on, so that the first few principal components may account for a substantial proportion of the variation in the original covariates. There are a variety of suggestions in the literature for choosing $K$ [Jol86], including minimizing a cross-validated error sum of squares or choosing $K$ so that
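A minimal sketch of principal component regression as described in this paragraph, assuming a standardized predictor matrix. The function name `pcr_fit` and the simulated data are hypothetical; the choice of K is left to the caller, since no selection rule is implemented here.

```python
import numpy as np

def pcr_fit(Z, Y, K):
    """Regress Y on the first K principal components Z v_1, ..., Z v_K."""
    # eigh handles the symmetric matrix Z'Z and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    V = eigvecs[:, order[:K]]                # p x K matrix of v_1, ..., v_K
    T = Z @ V                                # n x K principal components
    # The columns of T are orthogonal, so each coefficient is a simple projection.
    gamma = (T.T @ (Y - Y.mean())) / np.sum(T**2, axis=0)
    beta = V @ gamma                         # coefficients on the original covariates
    return beta, Y.mean()

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)       # a strongly correlated pair
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y = Z @ np.array([1.0, 0.0, -0.5, 0.0]) + rng.normal(size=n)

beta_full, intercept = pcr_fit(Z, Y, K=p)   # K = p reproduces OLS when Z has full rank
beta_ols = np.linalg.lstsq(Z, Y - Y.mean(), rcond=None)[0]
beta_k2, _ = pcr_fit(Z, Y, K=2)             # biased, but based on the two largest components
print(beta_full, beta_ols, beta_k2)
```

With K equal to the rank of Z the components span the column space of Z, so the fit coincides with OLS; dropping components trades bias for stability.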

Unlike PCR, PLS uses both response and predictor values to construct transformations of the covariates to be used as new predictors. The method of PLS was first proposed by [Wol66, Wol76] for modeling information-scarce situations in social sciences. It has also been used in chemometrics, for instance, to predict a chemical composition from a near infrared reflectance spectrum [Gou96]. [WM03] provide a detailed comparison of the use of PCR and PLS in chemometrics. The original development of PLS was motivated by a heuristically appealing representation of both the vector of responses Y and the predictor matrix Z as linear combinations (with error) of a common set of latent variables, so that

The $n \times 1$ vectors $t_i$ are the latent variables, the $p \times 1$ vectors $p_i$ are called the loading vectors, and the scalars $q_i$ are called loading scores. Wold's original algorithm for computing the latent variables and their loadings has been discussed in [Hel88] and [SB90]. We have adopted Helland's notation here; interested readers should see that paper for a heuristic motivation of the algorithm.

[Wol84] gives the following algorithm for partial least squares on data $\{(Y_i, Z_i)\}$ with a fixed number $K \ll \min(p, n)$ of latent variables:

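(1) weight vector $w_k = E_{k-1}' \varepsilon_{k-1}$ and latent variable $t_k = E_{k-1} w_k$;

(2) loading score $q_k = (t_k' t_k)^{-1} t_k' \varepsilon_{k-1} = (t_k' t_k)^{-1} t_k' Y$ and loading vector $p_k = (t_k' t_k)^{-1} E_{k-1}' t_k = (t_k' t_k)^{-1} Z' t_k$;

(3) residuals $\varepsilon_k = \varepsilon_{k-1} - q_k t_k$ and $E_k = E_{k-1} - t_k p_k'$.

3. The predicted value of the response is $\hat{Y} = n^{-1} \mathbf{1}\mathbf{1}' Y + \sum_{k=1}^{K} q_k t_k$.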
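The three sub-steps and the prediction formula can be sketched as follows. The starting values are an assumption, since the initialization step is not shown in the text: the response residual starts as the centered response and the predictor residual matrix starts as the standardized Z, which is consistent with the identities in step (2) and with the prediction formula. The function name `pls1_fit` and the simulated data are hypothetical.

```python
import numpy as np

def pls1_fit(Z, Y, K):
    """PLS with K latent variables for a single response, following steps (1)-(3)."""
    n = len(Y)
    y_bar = Y.mean()
    eps = Y - y_bar              # assumed start: centered response residual
    E = Z.copy()                 # assumed start: standardized predictor matrix
    T, q = np.zeros((n, K)), np.zeros(K)
    for k in range(K):
        w = E.T @ eps                    # (1) weight vector
        t = E @ w                        #     latent variable
        tt = t @ t
        q[k] = (t @ eps) / tt            # (2) loading score
        p_k = (E.T @ t) / tt             #     loading vector (p x 1)
        eps = eps - q[k] * t             # (3) deflate the response residual
        E = E - np.outer(t, p_k)         #     deflate the predictor residual
        T[:, k] = t
    return y_bar + T @ q, T, q           # predicted values, latent variables, scores

rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y = Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

Y_hat, T, q = pls1_fit(Z, Y, K=2)
print(q)
```

Because each deflation removes the part of the predictors explained by the current latent variable, successive latent variables are mutually orthogonal, which is why the loading scores can equivalently be computed from the original Y.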

The small data set in ACTG 333 makes model checking difficult, so in the analysis presented here we use extensions of the methods presented above to semiparametric linear models for right-censored data, called the accelerated failure time (AFT) model in the time-to-event literature. In the AFT model, no assumption is made about the form of the error distribution. As usual, right-censored data are denoted by $\{(T_i \wedge C_i, \delta_i, Z_i), i = 1, \ldots, n\}$, where $T_i > 0$ is the response variable, $C_i > 0$ is the censoring variable, $\delta_i = I\{T_i \le C_i\}$, $Z_i$ is a $p \times 1$ covariate vector, and $A \wedge B$ is the minimum of $A$ and $B$. The indicator $I\{A\}$ takes the value 1 if the event $A$ occurs and 0 otherwise. We take $T_i$ and $C_i$ to be conditionally independent given $Z_i$. The $p \times 1$ regression coefficient $\beta_0$ in the AFT model satisfies $g(T_i) = \beta_0' Z_i + \varepsilon_i$, where the $\{\varepsilon_i\}$ are independent and identically distributed with finite variance and an unspecified distribution function $F_\varepsilon$. Since the intercept is not specified in the model, $\varepsilon_i$ may have non-zero mean. The known monotone transformation $g(\cdot)$ is usually chosen to be the identity function or a logarithm transformation. Because of the presence of censoring, we leave the response variable $T$, equivalently $Y = g(T)$, in its original measurement scale.
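To make the notation concrete, the sketch below simulates right-censored data from an AFT model with a logarithm transformation. Every numerical choice (sample size, coefficients, error and censoring distributions) is an assumption made for illustration and has no connection to the ACTG 333 data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
Z = rng.normal(size=(n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardized covariates
beta0 = np.array([0.5, -0.3, 0.0])                 # hypothetical true coefficients

# Errors with non-zero mean and a non-normal shape; the AFT model leaves
# the error distribution unspecified.
eps = np.log(rng.exponential(size=n))
log_T = Z @ beta0 + eps                            # g(T) = log T
log_C = rng.normal(loc=1.0, scale=1.0, size=n)     # censoring, independent of T given Z

# For an increasing g, g(T ^ C) equals the minimum of g(T) and g(C).
Y_obs = np.minimum(log_T, log_C)
delta = (log_T <= log_C).astype(int)               # 1 = event observed, 0 = censored
print("proportion uncensored:", delta.mean())
```

Only `Y_obs`, `delta`, and `Z` would be available to the analyst; `log_T` is kept here solely so the censoring mechanism can be inspected.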

To estimate the coefficients in the semiparametric AFT model, Buckley and James (1979) used the transformation $s(\cdot)$ on the observed response $Y_i^{\circ} = g(T_i) \wedge g(C_i)$, where $s(Y_i^{\circ}) = \delta_i Y_i^{\circ} + (1 - \delta_i) E\{Y_i \mid Y_i > Y_i^{\circ}, Z_i\}$. If $s(\cdot)$ were known, then $E\{s(Y_i^{\circ}) \mid Z_i\} = E(\varepsilon_i) + \beta_0' Z_i$, and ordinary least squares could be used with the transformed responses $\{s(Y_i^{\circ})\}$. The Buckley-James estimating algorithm simultaneously updates $\{s(Y_i^{\circ})\}$ and $\beta$ at each step and proceeds iteratively:

1. Select an initial estimator $\hat{\beta}^{(0)}$, and let $\hat{Y} = Z\hat{\beta}^{(0)}$.

2. Compute the residuals $e = Y^{\circ} - \hat{Y}$ and the estimated transformation
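The formula for the estimated transformation is not reproduced in this excerpt. As a hedged illustration only, the sketch below uses the imputation commonly associated with the Buckley-James method: the conditional expectation for a censored response is estimated from the Kaplan-Meier distribution of the residuals. That choice, the fitted intercept, the naive OLS starting value, the fixed number of iterations, the convention of treating the largest residual as uncensored, and all function names are assumptions rather than details confirmed by the text. Ties among residuals are ignored.

```python
import numpy as np

def suffix_sum_excl(a):
    """s[i] = sum of a[j] over j > i."""
    total = np.cumsum(a[::-1])[::-1]
    return np.append(total[1:], 0.0)

def km_masses(delta_sorted):
    """Kaplan-Meier probability masses for residuals sorted in ascending order."""
    n = len(delta_sorted)
    d = delta_sorted.astype(float)
    d[-1] = 1.0                              # assumed convention: largest residual uncensored
    surv = np.cumprod(1.0 - d / (n - np.arange(n)))
    return np.concatenate(([1.0], surv[:-1])) - surv   # zero mass at censored residuals

def bj_impute(Y_obs, delta, fitted):
    """Replace censored responses by fitted value plus estimated E(e | e > e_i)."""
    e = Y_obs - fitted
    order = np.argsort(e)
    e_s, d_s = e[order], delta[order]
    mass = km_masses(d_s)
    tail_mass = suffix_sum_excl(mass)
    tail_mean = suffix_sum_excl(mass * e_s) / np.where(tail_mass > 0, tail_mass, 1.0)
    e_star = np.where((d_s == 1) | (tail_mass <= 0), e_s, tail_mean)
    Y_star = np.empty(len(Y_obs))
    Y_star[order] = fitted[order] + e_star
    return Y_star

def buckley_james(Z, Y_obs, delta, n_iter=20):
    X = np.column_stack([np.ones(len(Y_obs)), Z])       # intercept absorbs E(eps)
    coef = np.linalg.lstsq(X, Y_obs, rcond=None)[0]     # assumed start: naive OLS
    for _ in range(n_iter):                             # fixed iterations; may oscillate
        Y_star = bj_impute(Y_obs, delta, X @ coef)
        coef = np.linalg.lstsq(X, Y_star, rcond=None)[0]
    return coef[1:], Y_star

rng = np.random.default_rng(3)
n = 300
Z = rng.normal(size=(n, 2))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
Y = Z @ np.array([1.0, -0.5]) + rng.normal(size=n)
C = rng.normal(loc=1.0, scale=1.5, size=n)
Y_obs, delta = np.minimum(Y, C), (Y <= C).astype(int)

beta_bj, Y_star = buckley_james(Z, Y_obs, delta)
beta_all_obs, _ = buckley_james(Z, Y, np.ones(n, dtype=int))
print(beta_bj)
```

When no observation is censored, every response is left unchanged and the procedure reduces to ordinary least squares, which gives a simple sanity check on the sketch.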
