## Measures of explained variation

Suppose $(Z, Y)$ is a random variable, $Z$ being a $q$-dimensional vector of covariates and $Y$ being a $p$-dimensional response variable. A parametric regression model indexed by a finite-dimensional parameter $\theta \in \Theta \subset \mathbb{R}^d$ is proposed for the conditional distribution of $Y$ given $Z$. The distribution of the vector of covariates is assumed not to depend on the parameter $\theta$ and is left completely unspecified. Expectations with respect to the conditional distribution of $Y$ given $Z$ determined by $\theta$, the marginal distribution of $Z$, and the marginal distribution of $Y$ determined by the parameter $\theta$ will be denoted by $E_\theta[\,\cdot \mid Z]$, $E[\,\cdot\,]$ and $E_\theta[\,\cdot\,] = E[E_\theta[\,\cdot \mid Z]]$, respectively. As a starting point it is assumed that the true distribution of $Y$ conditional on $Z$ belongs to the model, that is, a true parameter $\theta_0 \in \Theta$ is required to exist.

Let $V$ denote the variable of interest. It is assumed that $V$ is a one-dimensional transformation of the response variable $Y$, i.e. $V = f(Y)$ for some function $f : \mathbb{R}^p \to \mathbb{R}$. Unless coarsened data is considered (see Section 4 on survival analysis below), $f$ is usually the identity function. Using the regression model, our two purposes are to determine how much of the variation in $V$ is explained by the covariates and to make accurate predictions of $V$.

### 2.1 Definition of the explained variation

Following the loss function approach proposed by Korn and Simon [KS91], a loss function $L$ has to be defined. Then $L(v, \hat v)$ denotes the loss incurred when making the prediction $\hat v$ of an observation $v$ of the variable of interest $V$. The loss function $L$ is assumed to be bounded below by 0 and to attain the value 0 when the correct value $\hat v = v$ is predicted. Quadratic loss $L(v, \hat v) = (v - \hat v)^2$, absolute loss $L(v, \hat v) = |v - \hat v|$ and entropy loss $L(v, \hat v) = -(v \log \hat v + (1 - v) \log(1 - \hat v))$ are the most commonly used loss functions (the latter only when predicting binary variables); see e.g. Korn and Simon [KS90] and Korn and Simon [KS91].
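The three loss functions can be written down directly. The following is a minimal Python sketch, not part of the original text; the function names and the clipping constant `eps` are illustrative assumptions, and the entropy loss assumes a binary observation `v` and a predicted probability `v_hat`.

```python
import math


def quadratic_loss(v, v_hat):
    """Quadratic loss (v - v_hat)^2."""
    return (v - v_hat) ** 2


def absolute_loss(v, v_hat):
    """Absolute loss |v - v_hat|."""
    return abs(v - v_hat)


def entropy_loss(v, v_hat, eps=1e-12):
    """Entropy loss for a binary observation v in {0, 1}.

    v_hat is a predicted probability; it is clipped to [eps, 1 - eps]
    (an assumption of this sketch) so that log(0) never occurs.
    """
    p = min(max(v_hat, eps), 1.0 - eps)
    return -(v * math.log(p) + (1 - v) * math.log(1 - p))
```

All three are non-negative and vanish (up to the clipping constant for the entropy loss) when the prediction equals the observation, as required of a loss function here.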

A prediction of the variable of interest $V$ based on the vector of covariates $Z$ can be defined by any function $\hat v : \mathbb{R}^q \to \mathbb{R}$ ($z \mapsto \hat v(z)$), since such a function determines a prediction rule. For every $\theta \in \Theta$, a measure of the ability of the covariates and the prediction rule $\hat v$ to predict the variable of interest $V$ is the prediction error, defined as the expected loss $E[E_\theta[L(V, \hat v(Z)) \mid Z]]$. Since interest is in making accurate predictions, the focus will be on the prediction rules giving rise to the smallest possible prediction error: for every $\theta \in \Theta$, the $\theta$-optimal prediction rule is defined as the prediction rule $\hat v_\theta$ minimising the prediction error, i.e.

$$E[E_\theta[L(V, \hat v_\theta(Z)) \mid Z]] \le E[E_\theta[L(V, \hat v(Z)) \mid Z]] \quad \text{for all } \hat v : \mathbb{R}^q \to \mathbb{R}.$$

Note that the $\theta$-optimal prediction rule depends on the choice of loss function: using quadratic, absolute and entropy loss, the $\theta$-optimal prediction rules are given by the means, the medians and the means, respectively, of the conditional distributions of $V = f(Y)$ given $Z = z \in \mathbb{R}^q$ determined by the parameter $\theta$.
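The link between loss function and optimal prediction can be checked numerically for a single conditional distribution. The sketch below is an illustration under stated assumptions: the five values stand in for an equally weighted conditional distribution of $V$ at one fixed covariate value, and the helper names and the search grid are hypothetical.

```python
import statistics


def expected_loss(values, v_hat, loss):
    """Average loss of the constant prediction v_hat over equally likely values."""
    return sum(loss(v, v_hat) for v in values) / len(values)


def best_constant(values, loss, grid):
    """Grid point with the smallest expected loss."""
    return min(grid, key=lambda c: expected_loss(values, c, loss))


# Illustrative, deliberately skewed distribution so that mean and median differ.
values = [0.0, 1.0, 2.0, 3.0, 10.0]
grid = [i / 10 for i in range(0, 101)]  # candidate predictions 0.0, 0.1, ..., 10.0

best_quadratic = best_constant(values, lambda v, c: (v - c) ** 2, grid)
best_absolute = best_constant(values, lambda v, c: abs(v - c), grid)

mean_value = statistics.mean(values)
median_value = statistics.median(values)
```

In this example the grid search under quadratic loss lands on the mean and the search under absolute loss lands on the median, which differ because the distribution is skewed.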

The prediction error corresponding to the $\theta$-optimal prediction rule will be denoted $\pi_\theta$ in the following, i.e. $\pi_\theta = E[E_\theta[L(V, \hat v_\theta(Z)) \mid Z]]$.

Since the prediction error is a positive number, it is difficult to judge whether it is small or large, that is, whether the covariates and the prediction rule are good or bad at predicting the variable of interest. It is helpful to compare it to the prediction error of a prediction rule that does not depend on the covariate values. Thus, consider a prediction rule of the form $z \mapsto v_0$ for a fixed $v_0 \in \mathbb{R}$. Such a prediction rule will be termed a marginal prediction rule. In this case the marginal prediction error is $E_\theta[L(V, v_0)]$. The $\theta$-optimal marginal prediction rule is similarly defined as the prediction rule $z \mapsto \hat v_\theta^0$ minimising the marginal prediction error, i.e.

$$E_\theta[L(V, \hat v_\theta^0)] \le E_\theta[L(V, v_0)] \quad \text{for all } v_0 \in \mathbb{R}.$$

The prediction error corresponding to the $\theta$-optimal marginal prediction rule will be denoted $\pi_\theta^0$, i.e. $\pi_\theta^0 = E_\theta[L(V, \hat v_\theta^0)]$.

When considering the $\theta$-optimal prediction rules, the prediction error based on the covariates, $\pi_\theta = E[E_\theta[L(V, \hat v_\theta(Z)) \mid Z]]$, and the marginal prediction error can be compared by the explained variation

$$V_\theta = 1 - \frac{\pi_\theta}{\pi_\theta^0}$$

for every $\theta \in \Theta$. This quantity attains values between zero and one. Values close to zero correspond to the two prediction errors being almost equal, i.e. the covariates and the prediction rule do not determine the variable of interest particularly accurately, since the marginal prediction rule is almost as accurate. Values close to one, on the other hand, correspond to the covariates and the prediction rule determining the variable of interest to a large extent. Since the explained variation compares the best possible rules of prediction, it becomes a measure of the degree to which the covariates determine the variable of interest.
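As a small worked illustration, take the explained variation to be one minus the ratio of the optimal prediction error to the optimal marginal prediction error. The sketch below evaluates it under quadratic loss for a hypothetical toy model with a discrete covariate and a binary variable of interest; the function name and the probabilities are assumptions made for this example only.

```python
def explained_variation_binary(p_given_z, prob_z):
    """Quadratic-loss explained variation for binary V.

    p_given_z[z] is P(V = 1 | Z = z) and prob_z[z] is P(Z = z).
    Under quadratic loss the optimal rules are the conditional and the
    marginal mean, so the prediction errors are the corresponding variances.
    """
    # Prediction error of the optimal rule: expected conditional variance.
    pi = sum(prob_z[z] * p_given_z[z] * (1 - p_given_z[z]) for z in prob_z)
    # Prediction error of the optimal marginal rule: marginal variance.
    p_marginal = sum(prob_z[z] * p_given_z[z] for z in prob_z)
    pi_0 = p_marginal * (1 - p_marginal)
    return 1 - pi / pi_0


# A covariate that shifts the success probability from 0.2 to 0.8.
informative = explained_variation_binary({0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5})
# A covariate that carries no information about V.
uninformative = explained_variation_binary({0: 0.3, 1: 0.3}, {0: 0.5, 1: 0.5})
```

In the first case the prediction errors are 0.16 and 0.25, giving an explained variation of 0.36; in the second the two errors coincide and the explained variation is zero.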

When squared error loss is considered, $V_\theta$ reduces to the variance of the conditional mean divided by the marginal variance of the variable of interest: $V_\theta = \mathrm{Var}(E_\theta(V \mid Z)) / \mathrm{Var}_\theta(V)$. In this case it thus measures the reduction in the variance of the variable of interest when the information on the covariates is included in the model.
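This reduction can be verified in a few lines, assuming the explained variation is defined as $V_\theta = 1 - \pi_\theta / \pi_\theta^0$, with $\pi_\theta$ and $\pi_\theta^0$ the prediction errors of the $\theta$-optimal prediction rule and the $\theta$-optimal marginal prediction rule. Under quadratic loss these rules are the conditional mean $E_\theta(V \mid Z)$ and the marginal mean $E_\theta(V)$, so that

$$
\begin{aligned}
\pi_\theta &= E[\mathrm{Var}_\theta(V \mid Z)], \qquad \pi_\theta^0 = \mathrm{Var}_\theta(V), \\
\mathrm{Var}_\theta(V) &= E[\mathrm{Var}_\theta(V \mid Z)] + \mathrm{Var}(E_\theta(V \mid Z)), \\
V_\theta &= 1 - \frac{E[\mathrm{Var}_\theta(V \mid Z)]}{\mathrm{Var}_\theta(V)} = \frac{\mathrm{Var}(E_\theta(V \mid Z))}{\mathrm{Var}_\theta(V)}.
\end{aligned}
$$

The second line is the law of total variance, and the third follows by substituting it into the numerator.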

In this context the explained variation $V_\theta$ is the quantity of interest. However, another quantity measuring the accuracy of a non-optimal prediction rule based on the covariates turns out to be of interest too. We postpone the introduction of this quantity, the population concept of predictive accuracy, until we have discussed estimation of the explained variation and misspecification of the model.

### 2.2 Estimation of the explained variation

Suppose $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ is a sample of independent random variables distributed as $(Z, Y)$. Based on this sample, the distribution of $Y$ conditional on $Z$ is estimated by a parameter $\hat\theta_n$, whereas the marginal distribution of the vector of covariates $Z$ is estimated by the empirical distribution of $Z_1, \ldots, Z_n$.

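Korn and Simon [KS91] suggest two estimators of the explained variation. An obvious choice is the explained variation of the estimated model, that is

$$V_{\hat\theta_n} = 1 - \frac{\frac{1}{n} \sum_{i=1}^{n} E_{\hat\theta_n}\bigl(L(V, \hat v_{\hat\theta_n}(Z)) \mid Z = Z_i\bigr)}{\pi^0_{\hat\theta_n}},$$

where $\pi^0_{\hat\theta_n}$ denotes the prediction error of the optimal marginal prediction rule in the estimated model.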

This estimator is termed the estimated explained variation. Note that the estimated explained variation is indeed based on the estimated model, since it is a function of the expected losses in the distribution determined by $\hat\theta_n$, and it depends on the values of the sample only through the estimated parameter $\hat\theta_n$ and the covariate values $Z_1, \ldots, Z_n$.

Korn and Simon [KS91] also consider the explained residual variation,

$$\hat V_{\mathrm{res}} = 1 - \frac{\sum_{i=1}^{n} L\bigl(V_i, \hat v_{\hat\theta_n}(Z_i)\bigr)}{\sum_{i=1}^{n} L\bigl(V_i, \hat v^0_{\hat\theta_n}\bigr)},$$

where $V_i = f(Y_i)$. This estimator only depends on the model through the $\hat\theta_n$-optimal prediction rules $z \mapsto \hat v_{\hat\theta_n}(z)$ and $z \mapsto \hat v^0_{\hat\theta_n}$. In the numerator, the values of the variable of interest are compared to the predicted values based on the covariates by the loss function $L$. Similarly, the values of the variable of interest are compared to the marginal predicted value $\hat v^0_{\hat\theta_n}$ in the denominator. The explained residual variation is therefore, besides being a measure of explained variation, also a measure of how accurate the predictions based on the $\hat\theta_n$-optimal prediction rule and the covariates actually are compared to the $\hat\theta_n$-optimal marginal prediction rule.
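The difference between a model-based and a data-based estimator can be made concrete with a minimal Python sketch. It is an illustration, not the authors' implementation. It assumes quadratic loss, a simple normal linear regression of $V$ on a single covariate, and the maximum likelihood variance estimate (divisor $n$); the function names and the data are hypothetical.

```python
from statistics import fmean


def fit_simple_linear(z, v):
    """Least-squares fit of v = alpha + beta * z, with the ML variance estimate."""
    z_bar, v_bar = fmean(z), fmean(v)
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_zv = sum((zi - z_bar) * (vi - v_bar) for zi, vi in zip(z, v))
    beta = s_zv / s_zz
    alpha = v_bar - beta * z_bar
    sse = sum((vi - alpha - beta * zi) ** 2 for zi, vi in zip(z, v))
    sigma2 = sse / len(z)  # divisor n: an assumption of this sketch
    return alpha, beta, sigma2


def estimated_explained_variation(z, beta, sigma2):
    """Model-based estimator: expected losses under the fitted model."""
    z_bar = fmean(z)
    # The optimal rule is the conditional mean, so the expected loss given
    # Z = Z_i equals sigma2 for every i.
    numerator = sigma2
    # Marginal variance of V in the fitted model, with Z following its
    # empirical distribution.
    var_of_fitted_mean = fmean([(beta * (zi - z_bar)) ** 2 for zi in z])
    denominator = sigma2 + var_of_fitted_mean
    return 1 - numerator / denominator


def explained_residual_variation(z, v, alpha, beta):
    """Data-based estimator: observed losses of the fitted prediction rules."""
    v_marginal = alpha + beta * fmean(z)  # marginal mean in the fitted model
    numerator = sum((vi - alpha - beta * zi) ** 2 for zi, vi in zip(z, v))
    denominator = sum((vi - v_marginal) ** 2 for vi in v)
    return 1 - numerator / denominator


# Illustrative data only.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
v = [1.1, 1.9, 3.2, 3.8, 5.1]
alpha, beta, sigma2 = fit_simple_linear(z, v)
v_model = estimated_explained_variation(z, beta, sigma2)
v_resid = explained_residual_variation(z, v, alpha, beta)
```

The first function never looks at the observed values of the variable of interest once the parameters are fitted, while the second compares each observation with its prediction. With the divisor-$n$ variance estimate used here the two numbers coincide in this model; a different variance estimate, for example a degrees-of-freedom corrected one, would make them differ.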

Korn and Simon [KS91] do not formulate conditions under which the two estimators are consistent estimators of the explained variation $V_{\theta_0}$ of the true model. In the Appendix we provide a theorem stating sufficient conditions. This theorem ensures that it is possible to obtain consistent estimators by averaging terms which are dependent through their common dependence on the estimated parameter $\hat\theta_n$, as is the case for the numerators and the denominators of the above estimators. How the theorem is used to guarantee the consistency of the two estimators above is also demonstrated in the Appendix.

When considering quadratic loss in the normal linear regression model, the explained variation is equal to the squared multiple correlation coefficient. The two estimators of the explained variation, the estimated explained variation and the explained residual variation, are almost identical. Traditionally the explained residual variation is used as the estimator of the explained variation (for reasons to be described below in Section 3 on misspecification of the model) and is probably better known as the $R^2$-statistic. However, it is well known that for small samples this estimator has a positive bias as an estimator of the explained variation, and therefore the adjusted $R^2$-statistic $R^2_{\mathrm{adj}}$ is used instead (Helland [Hel87]):

$$R^2_{\mathrm{adj}} = 1 - \frac{n - 1}{n - q - 1}\,(1 - R^2),$$

for a model with an intercept and $q$ covariates.

In the normal linear model, the adjusted $R^2$-statistic is exactly the estimated explained variation. In other regression models too, the explained residual variation might be an inflated estimator of the explained variation when small samples are considered. Mittlböck and Waldhör [MW00] propose a similar adjustment of the explained residual variation for the Poisson regression model, whereas Mittlböck and Schemper [MS02] propose similar and other adjustments for the logistic regression model.
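The small-sample adjustment can be sketched in Python. The sketch assumes the usual form of the adjusted statistic, $1 - \frac{n-1}{n-q-1}(1 - R^2)$, for a model with an intercept and $q$ covariates; the function names are illustrative.

```python
def r_squared(v, fitted):
    """Explained residual variation under quadratic loss: 1 - SSE / SST."""
    v_bar = sum(v) / len(v)
    sse = sum((vi - fi) ** 2 for vi, fi in zip(v, fitted))
    sst = sum((vi - v_bar) ** 2 for vi in v)
    return 1 - sse / sst


def adjusted_r_squared(r2, n, q):
    """Adjusted R^2 for n observations, q covariates and an intercept."""
    if n - q - 1 <= 0:
        raise ValueError("need n > q + 1 observations")
    return 1 - (n - 1) / (n - q - 1) * (1 - r2)


# With n = 11 observations and q = 2 covariates, an R^2 of 0.5 is shrunk.
example = adjusted_r_squared(0.5, 11, 2)
```

The adjustment always pulls the statistic downwards when $q \ge 1$ and $R^2 < 1$, and the shrinkage vanishes as $n$ grows with $q$ fixed, in line with the positive small-sample bias described above.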