One of the purposes of a multivariate regression analysis is to determine covariates of importance for the outcome of interest. When the estimated effect of a covariate is highly statistically significant, it is easy to be led to the conclusion that such a covariate has a substantial effect on the outcome under study. However, this is not necessarily the case. Quantities assessing the extent to which the covariates actually determine the outcome are needed to avoid overinterpretation of the effect. Another purpose of a multivariate regression analysis is to enable prediction of the outcome of interest, and in this case a quantity assessing the accuracy of the predictions based on the regression model is needed.
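The gap between statistical significance and substantial effect can be illustrated with a small simulation. The following is only a sketch under assumed values that do not come from the paper: the sample size, the slope of 0.05, the unit error variance, and the helper name `slope_t_and_r2` are all chosen for illustration.

```python
import numpy as np

def slope_t_and_r2(x, y):
    """Fit y = a + b*x by ordinary least squares; return (t-statistic of b, R^2)."""
    n = len(x)
    x_c = x - x.mean()
    sxx = np.dot(x_c, x_c)
    b = np.dot(x_c, y) / sxx                 # least-squares slope
    a = y.mean() - b * x.mean()              # least-squares intercept
    resid = y - (a + b * x)
    rss = np.dot(resid, resid)               # residual sum of squares
    tss = np.dot(y - y.mean(), y - y.mean()) # total sum of squares
    se_b = np.sqrt(rss / (n - 2) / sxx)      # standard error of the slope
    return b / se_b, 1.0 - rss / tss

# Hypothetical setting: a weak covariate effect observed in a very large sample.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)

t_stat, r2 = slope_t_and_r2(x, y)
print(f"t = {t_stat:.1f}, R^2 = {r2:.4f}")
```

In this setting the t-statistic of the slope is far beyond conventional significance thresholds, while the covariate accounts for well under one percent of the variation in the outcome.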

Measures of explained variation and predictive accuracy can be used to address these questions. We carefully distinguish between the two concepts. Korn and Simon [KS91] provide a general framework, and their approach is adopted and elaborated here. To assess the importance of the covariates, the explained variation is defined on a population level. This quantity is tied to the chosen class of regression models, and if the model is misspecified, it cannot necessarily be considered a measure of the ability of the covariates to determine the outcome. To quantify the ability of the covariates and the regression model together to determine, or rather predict, the outcome, the predictive accuracy is defined on a population level. A high predictive accuracy requires a useful prediction rule as well as informative covariates. Whether the model is misspecified or not, the predictive accuracy is a meaningful quantity.

In the linear normal model, the estimator of the explained variation is asymptotically equal to the estimator of the predictive accuracy and is better known as the R²-statistic or the coefficient of determination. This statistic has become standard output from statistical software packages. Outside the linear model the estimators of the explained variation and the predictive accuracy usually differ, and because the model may be misspecified, the predictive accuracy has often been preferred to the explained variation. However, there seems to be some confusion in the literature about the distinction between the two measures and how they are actually affected by misspecification of the regression model. We here provide a more detailed discussion. In our exposition we put more weight on explicitly formulating the various underlying statistical models than most of the literature in the area does. We only consider parametric models.
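For reference, the coefficient of determination in its familiar textbook form can be written as below. The notation ($y_i$ for the observed outcomes, $\hat{y}_i$ for the fitted values, $\bar{y}$ for the sample mean) is assumed here for illustration; the estimators studied in the later sections may differ from this expression, for instance by degrees-of-freedom corrections.

```latex
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```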

Our interest is motivated by the use of explained variation and predictive accuracy in failure time models, where estimation becomes complicated because the outcome of interest is censored. There is no unique generalization of the R²-statistic to survival data, and several authors have proposed other measures and estimators in the simple failure time model; see e.g. Schemper and Stare [SS96], O'Quigley and Xu [QX01], Graf et al. [GSSS99] and Schemper and Henderson [SH00]. Some authors, e.g. Graf et al. and Schemper and Henderson, are inspired by the approach of Korn and Simon, whereas others take different approaches, some of which are only defined in the Cox regression model. So far none of the measures has been widely accepted.

Section 2 contains an introduction to the approach of Korn and Simon, including a detailed discussion of consistency of the estimators. In Section 3 the concept of model misspecification is introduced and the effect of misspecification on the estimators is discussed. The simple failure time model is discussed briefly in Section 4, in particular the estimation procedures proposed by Graf et al. and Schemper and Henderson. We do not give a detailed introduction to their work but only present their ideas. Finally, Section 5 provides some concluding remarks.
