In the study of chronic diseases such as HIV infection and cancer, the rapid improvement in the technology of measuring disease characteristics at the molecular or genetic level makes it possible to collect large amounts of data on potential predictors of outcome. In cancer, these data often include measurements of gene or protein expression in the tumor; in HIV, the data may include characterizations of genetic mutations in the HIV-1 virus that lead to amino acid substitutions in the protease gene at specified codons. In either situation, the existing knowledge of the biology of the disease may not be sufficient to guide a study team to a definitive (or at least plausible) set of predictor variables for outcome. When used judiciously, data-dependent methods for variable selection or dimension reduction can be a useful part of exploratory analyses.
This paper examines methods for finding low-dimensional predictors of outcome when the response variable is a potentially censored measurement of change in HIV viral load (HIV-1 RNA response) and the predictors include both traditional prognostic measures, such as the history of prior treatment, and patient-specific biomarkers of mutations at specified locations in the HIV-1 virus. More specifically, we use the data to cluster subjects into groups with predicted good or poor response. We use the data from the AIDS Clinical Trials Group (ACTG) randomized trial 333 [PG00]. The primary outcome for the trial was the change in HIV-1 RNA level (log10 copies/mL) measured at randomization (considered baseline) compared to times during the course of therapy (weeks 2, 4, 6, 8, 16 and 24). The assay used to quantify levels of HIV-1 RNA was unable to detect virus present in blood plasma at lower than 500 (2.70 log10) copies/mL. For patients whose HIV-1 RNA level could be measured at baseline (all patients in this analysis), the change between baseline and later time points was right-censored when the RNA level was below the limit of quantification. This paper uses a particular method of dimension reduction (partial least squares) for a detailed analysis of this data set. The operation of the method used here is examined in more detail in [HH05].
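The censoring mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the change is taken to be the reduction from baseline (baseline minus follow-up, in log10 copies/mL), a follow-up value below the limit of quantification is represented either as `None` or as a number under the limit, and the function and variable names are hypothetical rather than taken from the trial's analysis code.

```python
import math

# Limit of quantification stated in the text: 500 copies/mL (about 2.70 log10).
LOQ_COPIES = 500.0
LOQ_LOG10 = math.log10(LOQ_COPIES)


def censored_reduction(baseline_log10, followup_log10, loq_log10=LOQ_LOG10):
    """Return (observed_reduction, delta) for one patient.

    Assumption: the response is the reduction baseline - follow-up.
    delta = 1 means the reduction is fully observed.
    delta = 0 means it is right-censored: the true reduction is at least
    the returned value, because the follow-up level fell below the limit.
    """
    if followup_log10 is None or followup_log10 < loq_log10:
        # Follow-up is undetectable: we only know the reduction exceeds
        # baseline minus the limit of quantification.
        return baseline_log10 - loq_log10, 0
    return baseline_log10 - followup_log10, 1


if __name__ == "__main__":
    # Hypothetical patients: (baseline, week 8) in log10 copies/mL.
    patients = [(5.0, 4.0), (5.0, None), (4.2, 2.1)]
    for baseline, week8 in patients:
        print(censored_reduction(baseline, week8))
```

Under this convention a patient with a higher baseline level has a larger censoring value, so the censoring point varies across patients even though the assay limit is fixed.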
Methods for right-censored data can be used to estimate the association of potential prognostic variables or treatment with RNA levels. Because the censored data are incomplete observations on laboratory parameters and not event times, linear models for censored data, rather than the more common proportional hazards model, can sometimes be easier to interpret. In [PG00], parametric linear regression models with normally distributed errors are used to model the dependence between changes in RNA levels and treatment or other patient-level characteristics. The justification for using these models in studies of HIV is discussed in [Mar99]. Clinical response was defined in the study as the change in viral RNA between randomization and week 8, so the study report emphasizes linear models for this change, although changes at weeks 4 and 6 are also analyzed. In this paper, we focus on methods for predicting the changes from baseline to week 8, using semiparametric linear models for censored data, the so-called accelerated failure time (AFT) model. We then use the predicted changes to construct prognostic subgroups.
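For orientation, a generic form of the linear model for right-censored data can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $Y_i$ denotes the week 8 change for subject $i$, $Z_i$ the covariate vector, $C_i$ the censoring value, and $\delta_i$ the indicator that the change is fully observed.

```latex
\begin{align*}
Y_i &= \beta^{\top} Z_i + \varepsilon_i, \qquad i = 1, \dots, n, \\
X_i &= \min(Y_i, C_i), \qquad \delta_i = I(Y_i \le C_i).
\end{align*}
```

Only $(X_i, \delta_i, Z_i)$ is observed. In the semiparametric version, the errors $\varepsilon_i$ are independent and identically distributed with an unspecified distribution, whereas the parametric analysis in [PG00] assumes that they are normally distributed.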
Some authors [Hug99, JT00] have investigated the use of linear mixed models [LW82] for the longitudinal measurements of RNA levels. The time-dependent RNA levels are left-censored when they fall below the limit of quantification for the assay, so linear mixed models must be extended to allow for partially observed measurements. We do not use the longitudinal model approach here.
The polymerase chain reaction was used in ACTG 333 to amplify genes in the HIV-1 RNA extracted from patient plasma at baseline. The HIV-1 protease gene was fully sequenced, enabling the detection of mutations relative to the wild type of this gene and of amino acid substitutions at 99 protease residues. [PG00] describes in detail the association of substitutions at 12 selected residues with the change in viral RNA between baseline and week 8. These substitutions had been implicated in the previous literature in resistance of the virus to the treatments used in this trial. The data for the trial present an opportunity to explore the value of the additional mutation data, along with clinical measurements, in predicting week 8 viral response. The data set analyzed here contained mutation data on 25 residues, or codon positions, and 10 clinical variables for 60 patients (details in Section 4). The large number of covariates compared to the number of subjects emphasizes the need for dimension reduction in the covariate space. We examine the behavior of stepwise regression (Step-AFT) in this context as well as extensions of principal component regression (PCR) and partial least squares (PLS).
This paper gives a more extended treatment of partial least squares with censored data than can be found in the companion paper [HH04] published in Lifetime Data Analysis. In Section 2, we have added the use of the conditional median of the estimated error distribution to predict the response for a future subject with a given set of covariates. Extensive simulation studies show the properties of partial least squares for small and moderate sample sizes. These simulation results are discussed in a new Section 3, and Sections 3 and 4 of [HH04] have been moved to Sections 4 and 5, respectively. In Section 5, the data analysis for the HIV data set, we have added an analysis showing the prediction of the response for a future subject and the use of resampling to examine the leave-two-out cross-validation method.