## Misspecification and definition of the predictive accuracy

The model is said to be misspecified if the true distribution of the response $Y$ conditional on the covariate vector $Z$ does not belong to the proposed regression model indexed by $\theta \in \Theta$. That is, a true $\theta_0 \in \Theta$ does not exist. When the model is misspecified, the estimated explained variation and the explained residual variation can no longer be regarded as estimators of the explained variation, namely the degree to which the covariates determine the value of the variable of interest. It is, however, possible to state which quantities the two estimators estimate consistently, by appropriate use of the theorem provided in the Appendix.

White [Whi89] proves that, under appropriate regularity conditions, the maximum likelihood estimator in a misspecified model indexed by a finite dimensional parameter $\theta \in \Theta$ is a consistent estimator of the parameter $\theta^* \in \Theta$ minimising the Kullback-Leibler divergence. For every $\theta \in \Theta$, the Kullback-Leibler divergence is a measure of the distance from the true unknown density to the density determined by the parameter $\theta$. In this sense, the maximum likelihood estimator suggests the distribution among the proposed distributions that agrees best with the true distribution, and the parameter $\theta^*$ is therefore termed the least false or most fitting parameter.
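For reference, the least false parameter can be written out explicitly. The display below is a sketch in notation that is assumed here rather than taken from the text: $g(y \mid z)$ stands for the true conditional density of $Y$ given $Z$, $f_\theta(y \mid z)$ for the density proposed by the model, and the expectation is taken with respect to the true distribution of $(Z, Y)$.

$$
\theta^* = \arg\min_{\theta \in \Theta} E\left[ \log \frac{g(Y \mid Z)}{f_\theta(Y \mid Z)} \right]
         = \arg\max_{\theta \in \Theta} E\left[ \log f_\theta(Y \mid Z) \right]
$$

The second form follows because $E[\log g(Y \mid Z)]$ does not depend on $\theta$. It indicates why the maximum likelihood estimator, which maximises the empirical average of the log-likelihood, targets $\theta^*$ rather than a true parameter.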

By appropriate use of the theorem in the Appendix it then follows, when the prescribed conditions are fulfilled, that the estimated explained variation is a consistent estimator of $V_{\theta^*}$. This quantity is a measure of the degree to which the actual covariates would determine the value of the variable of interest, if the distribution of this variable were described by the distribution determined by $\theta^*$.

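With $L$ denoting the loss function, the explained residual variation is similarly, under appropriate conditions, a consistent estimator of

$$
W_{\theta^*} = 1 - \frac{E\left[ L\left(V, V_{\theta^*}(Z)\right) \right]}{E\left[ L\left(V, V_{\theta^*}\right) \right]},
$$
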
where the mean in the numerator is with respect to the true unknown distribution of $(V, Z) = (f(Y), Z)$, whereas the mean in the denominator is with respect to the true distribution of $V$. The numerator is the prediction error of the prediction rule $z \mapsto V_{\theta^*}(z)$ in the true distribution of $(Z, Y)$, whereas the denominator is the marginal prediction error corresponding to the marginal prediction rule $z \mapsto V_{\theta^*}$. Thus, $W_{\theta^*}$ is the predictive accuracy of the predictions based on the least false model and the covariates, compared to the marginal predictions based on the least false model. Some authors define the above quantity as the explained variation. Since it is based on a non-optimal prediction rule, we prefer to think of it as the predictive accuracy instead.

It follows that both estimators of the explained variation may be biased estimators of the true explained variation when the model is misspecified. The explained variation of the least false model, estimated by the estimated explained variation, is a measure related to the chosen, misspecified model and the covariates, whereas the predictive accuracy, estimated by the explained residual variation, is a measure of the ability of the model and the covariates to describe, namely predict, the values of the variable of interest. Given the interpretation of these two quantities, the explained residual variation appears to be the more rational estimator, since it still has a relevant interpretation when the model is misspecified. This is probably why most papers on explained variation for uncoarsened data do not even consider the estimated explained variation as an estimator of the explained variation, e.g. Mittlböck and Waldhör [MW00] and Mittlböck and Schemper [MS02]. Others argue that an estimator of the explained variation should compare the observed and the predicted values directly, as is the case for the explained residual variation but not always for the estimated explained variation.

It is our experience, however, that there will only be small, if any, differences between the quantities estimated by the two estimators, and that these quantities will be rather close to the true explained variation. Korn and Simon [KS91] claim the opposite, namely that there might be considerable differences between the population measures estimated by the two estimators if the model is 'grossly' misspecified. We found that this is usually not the case provided the proposed regression model is defined in a sensible way, namely that the parameter space $\Theta$ is not unnecessarily restricted.

Korn and Simon [KS91] base their statement on an example of a misspecified regression model for which the parameter space consists of a single point $\theta$, i.e. $\Theta = \{\theta\}$. In this case the least false parameter $\theta^*$ equals $\theta$. Using this distribution, they determine the explained variation $V_{\theta^*}$ and the predictive accuracy $W_{\theta^*}$. The two quantities differ considerably from each other, and both differ considerably from the true explained variation of the model considered. However, it is not reasonable to pick an arbitrary distribution, determine its explained variation and its predictive accuracy in the true distribution of $(Z, Y)$, and then expect these two quantities to be equal to each other as well as to the true explained variation. If instead the parameter space $\Theta$ is allowed to be as large as possible, the explained variation $V_{\theta^*}$ and the predictive accuracy $W_{\theta^*}$ of the least false model are equal and close to the true explained variation. The example of Korn and Simon [KS91] is given below:

Consider the logistic regression model where the true distribution of the binary response $Y$ conditional on the covariate $Z$ is Bernoulli with parameter $p(Z)$, where $\operatorname{logit} p(Z) = Z$ and $Z$ is uniform on $\{-1.5, -1, -0.5, 0.5, 1, 1.5\}$. Using a quadratic loss function, the explained variation of this distribution, the true model, is 0.2256.

The model is misspecified by assuming $Y$ conditional on $Z$ to be Bernoulli with parameter $p(Z) = 0.1\, I(Z < 0) + 0.9\, I(Z > 0)$. The distribution of the covariate $Z$ remains unchanged. The proposed model is indexed by the single parameter $\theta \in \Theta = \{(0.1, 0.9)\}$ and has an explained variation of $V_\theta = 0.64$. On the other hand, the predictive accuracy of this model in the true distribution of $(Z, Y)$ is $W_\theta = 0.0758$. On the basis of this example they conclude that there might be large differences between the quantities estimated by the estimated explained variation and the explained residual variation. However, if instead the parameter space is allowed to be as large as possible, i.e. $\Theta = (0, 1)^2$, then $\theta^* = (0.2763, 0.7237)$, resulting in an explained variation of $V_{\theta^*} = 0.2002$ and a predictive accuracy of $W_{\theta^*} = 0.2002$; that is, the two quantities are equal.
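The figures quoted in this example can be checked numerically. The following is a minimal sketch, assuming quadratic loss throughout; the helper names (`explained_variation`, `predictive_accuracy`, `least_false`, `p_fixed`, `p_star`) are hypothetical and do not come from Korn and Simon [KS91]. It relies on the fact that, for a Bernoulli model with one free probability per sign of $Z$, the Kullback-Leibler divergence is minimised by the average of the true success probabilities within each group.

```python
import math

# Support of the covariate Z (uniform), as stated in the example.
Z = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def p_true(z):
    """True success probability: logit p(z) = z."""
    return 1.0 / (1.0 + math.exp(-z))


def explained_variation(p_model):
    """1 - E[Var(Y | Z)] / Var(Y), with both terms computed under p_model."""
    marginal = mean(p_model(z) for z in Z)
    conditional_error = mean(p_model(z) * (1 - p_model(z)) for z in Z)
    marginal_error = marginal * (1 - marginal)
    return 1 - conditional_error / marginal_error


def predictive_accuracy(p_model):
    """1 - (error of p_model's predictions) / (error of p_model's marginal
    prediction), with both expectations taken under the true distribution."""
    marginal = mean(p_model(z) for z in Z)

    def error(z, c):
        # E[(Y - c)^2 | Z = z] for Bernoulli Y with true mean p
        p = p_true(z)
        return p * (1 - p) + (p - c) ** 2

    conditional_error = mean(error(z, p_model(z)) for z in Z)
    marginal_error = mean(error(z, marginal) for z in Z)
    return 1 - conditional_error / marginal_error


def least_false():
    """Least false (theta_1, theta_2) when Theta = (0, 1)^2: the group-wise
    averages of the true success probabilities for Z < 0 and Z > 0."""
    negative = mean(p_true(z) for z in Z if z < 0)
    positive = mean(p_true(z) for z in Z if z > 0)
    return negative, positive


def p_fixed(z):
    """Single-point parameter space Theta = {(0.1, 0.9)}."""
    return 0.1 if z < 0 else 0.9


THETA_STAR = least_false()


def p_star(z):
    """Least false model for Theta = (0, 1)^2."""
    return THETA_STAR[0] if z < 0 else THETA_STAR[1]


if __name__ == "__main__":
    print("true model, explained variation:", explained_variation(p_true))
    print("fixed model, V:", explained_variation(p_fixed))
    print("fixed model, W:", predictive_accuracy(p_fixed))
    print("least false parameter:", THETA_STAR)
    print("least false model, V:", explained_variation(p_star))
    print("least false model, W:", predictive_accuracy(p_star))
```

In this sketch the equality of the two quantities for the least false model is not a numerical coincidence: within each group the prediction error in the true distribution reduces to $\theta^*(1 - \theta^*)$, which is also the conditional variance under the least false model, and the marginal prediction error is 0.25 in both cases.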

We have furthermore considered examples of misspecification of the linear predictor in the normal and the logistic regression model but did not succeed in finding examples for which the explained variation and the predictive accuracy differed appreciably. These examples were examined analytically and by simulation studies.