## Selecting a semiparametric estimator by the expected loglikelihood

Benoit Liquet1 and Daniel Commenges2

1 Laboratoire de Statistique et Analyse des Données, BHSM, 1251 avenue centrale BP 47

38040 Grenoble Cedex 09, FRANCE [email protected]

2 INSERM E0338, Université Victor Segalen Bordeaux 2, 146 rue Léo Saignat, 33076 Bordeaux Cedex, FRANCE. [email protected]

A criterion for choosing an estimator in a family of semi-parametric estimators from incomplete data is proposed. This criterion is the expected observed log-likelihood (ELL). Adapted versions of this criterion in case of censored data and in presence of explanatory variables are exhibited. We show that likelihood cross-validation (LCV) is an estimator of ELL and we exhibit three bootstrap estimators. A simulation study considering both families of kernel and penalized likelihood estimators of the hazard function (indexed on a smoothing parameter) demonstrates good results of LCV and a bootstrap estimator called ELLbboot. When using penalized likelihood an approximated version of LCV also performs very well. The use of these estimators of ELL is exemplified on the more complex problem of choosing between stratified and unstratified proportional hazards models. An example is given for modeling the effect of sex and educational level on the risk of developing dementia.

Key words: bootstrap, cross-validation, Kullback-Leibler information, proportional hazard model, semi-parametric, smoothing.

### 1 Introduction

The problem of model choice is obviously one of the most important in statistics. Probably one of the first solution to a model choice problem was given by Mallows [Mal73] who proposed a criterion (Cp) for selecting explanatory variables in linear regression problems. This problem of selection of variables was studied by many authors in more general regression models ([Cop83]; [Mil02]). The celebrated Akaike criterion [Aka74] brought a solution to the problem of parametric model selection. This criterion called AIC (An Information Criterion or Akaike Information Criterion) was based on an approximation of the Kullback-Leibler distance [Aka73]. Criterions improving AIC for small samples have been proposed: AICc [HT89] and EIC which is a bootstrap estimation [ISK97]. Finally in the case of missing data, Cavanaugh and Shumway [CS98] proposed a variant of AIC. A closely related, but more difficult problem, is that of choice of a smoothing parameter in smoothed semi-(or non-) parametric estimation of functions. These functions may be density function [Sil86], effect functions of an explanatory variable [HT90] or hazard functions ([O'S88], [JCL98]). Smoothing methods are in particular kernel smoothing methods and penalized likelihood. In simple regression problems, versions of AIC and AICc are available [HST98] and simple versions of the cross-validation criterion have been proposed: CV, GCV [CW79]. However in general problems only the likelihood cross-validation criterion (LCV) [O'S88] and bootstrap techniques, in particular, extension of EIC [LSC03] are available. In some problems approximations of the mean integrated square error (MISE) are available ([RH83], [MP87], [Fer99]).

Liquet et al. [LSC03] have introduced a general point of view which is to choose an estimator among parametric or semi-parametric families of estimators according to a criterion which is an approximation of the Kullback-Leibler distance; they have shown on some simulation studies that the best criterions were EIC and LCV. They treated a general multivariate regression problem. The aim of this paper is to extend this point of view to the case where incomplete data are observed. The data may be incomplete because of right or interval-censoring for instance. This is not a trivial extension: indeed, it becomes clear that for using relatively simply the bootstrap approach, the theoretical criterion to be estimated must be changed. The proposed criterion is the expectation of the (observed) log-likelihood (ELL) rather than the Kullback-Leibler distance. This paper uses much of the material of Liquet and Commenges [LC04] and applies the approach to the choice between stratified and unstratified proportional hazards models.

We define the ELL criterion in section 2 and give useful versions of it for use with right-censored data and with explanatory variables (where partial and conditional likelihood respectively are used). In section 3 we exhibit three bootstrap estimators of ELL and show that LCV also estimates ELL. Section 4 presents simulation studies for comparing the four estimators together with the Ramlau-Hansen approach for hazard functions using kernel smoothing methods or penalized likelihood.

In section 5, we show an application of these criterions to a more complex problem, which is to compare stratified and unstratified proportional hazards models. Our particular application is modeling onset of dementia as a function of sex and education level (coded as a binary variable). We could consider a proportional hazard for both variables or stratified models on one variable, or making four strata. No method has been proposed to our knowledge to compare such different semi-parametric models. We propose to compare them using the ELL criterion, in practice using LCV or a bootstrap estimator, and apply these methods to the data of the PAQUID study, a large cohort study on dementia [LCDBG94].

2 The expected log-likelihood as theoretical criterion

### 2.1 Definitions and notations

Let T be the time of the events of interest. Let f and F be the density function and the cumulative distribution function of T. The hazard function is defined by H(t) = St) where S = 1 — F is the survival function of T. However, we do not observe the realizations of T but only a sample W = {W1,..., Wn} of independent and identically distributed (i.i.d.) variables which bring information on the variable T. For instance, in the case of right-censored observations, the Wj's are copies of the random variable W = (T, 5) where T = min(T, C) and 5 = where C is a censoring variable independent of T. In the sequel, we denote by fc the probability density functions and Sc the survival functions of C. Other cases of censoring are left and interval censoring. We denote by aW (■) a family of estimators of H(■), where h most often represents a smoothing parameter. To any particular estimator AW(■) corresponds an estimator f^(■) = (■ )exp(— f (u)du). Our aim is to propose an information criterion to choose the smoothing parameter for a family of estimators and also to choose between different families of estimators.

### 2.2 The expected log-likelihood

For uncensored data, the useful part of the Kullback-Leibler information criterion, measuring the distance between fh (■) and f, is the conditional expectation of the log-likelihood of a future observation T given T

tion F and being independent of the sample T. Based on KL, Akaike [Aka74] (see also DeLeeuw, [DeL92]), in a parametric framework for complete data, defined the popular criterion AIC (—2 log L + 2p; when L is the likelihood and p is the number of estimated parameter) as an estimator of the expectation of the Kullback-Leibler information EKL=E [KL(T)]. In presence of incomplete data, even EKL is difficult to estimate. In particular, because T is not observed, it is not possible to directly estimate the different expectations by bootstrap.

Instead, we define a criterion as the expectation of the observed log-likelihood of a new sample which is a copy of the original sample, given the original sample:

This criterion does not depend on W and judges a procedure of estimation Ah that can be applied to any W of same distribution. Thus the criterion that we propose is, in accordance with the pinciple of Akaike (see also DeLeeuw [DeL92]), the non-conditional expectation of the log-likelihood, ELL; ELL can be considered as equivalent to the expectation of the Kullback-Leibler information for observed data. Indeed it is relatively easy to show that for a parametric model, AIC defined as —2 log L + 2p (p being the number of parameters) is an estimator of ELL (more precisely of -2ELL) (see Cavanaugh and Shumway [CS98].

2.3 Case of right-censored data

In presence of right-censored data as defined in section 2.1, the likelihood

LPW (W') = n{fW(T)}< {SW(T)}1-< {fc(T )}1-': {Sc(T[)}< i=i where fW(•) and §W(•), the estimators of f and S are deduced from aW(•). The criterion defined in (2) can be decomposed in two parts:

where

and n

which is the partial likelihood (in the sense of Andersen et al. [ABGK93]). The second term in (3) does not depend on Ah; thus maximizing ELL is equivalent to maximize ELLP. Finally our criterion is:

that is the ELL criterion applied to the partial likelihood; this is very fortunate because this avoids estimating the distribution of the censoring variable. Note however that the ELL criterion cannot be applied to the Cox partial likelihood, at least not directly: we need a smooth estimate of the hazard function to apply our criterion. Any non-smooth estimate has a value —to and is rejected.

### 2.4 Case of explanatory variable

We consider the case of presence of explanatory variables. We note Wi = (Ti,Xi) with Ti the survival time and Xi a vector of covariates for the ith individual. It is assumed that Ti has conditional density function f ( • jxi) given Xi = xi. Our aim is to estimate A( • | • ) the corresponding conditional hazard function. We note this estimator AW( • | • ) and f^( • | • ) the corresponding density. The likelihood (W') is:

i=i where fX( • ) is the marginal density of X'. With the same reasoning as in 2.3, the criterion in (2) can be decomposed in two parts:

and n

i=i which is the conditionnal likelihood. The second term of (9) does not depend on Ah; thus maximizing ELL is equivalent to maximizing ELLc. Finally our criterion is:

that is the ELL criterion applied to the conditionnal likelihood; this is very fortunate because this avoids estimating the distribution of the explanatory variable. Both tricks can be applied when there are both explanatory variables and censoring.

3 Estimation of ELL

In order to obtain practical selection criteria, it is necessary to estimate ELL. Several estimators may be considered.

3.1 Likelihood cross-validation : LCV

Throughout this subsection, we index the sample W by its size n and thus use the notation Wn. We recall that the likelihood cross-validation is defined as:

where (Wi) is the likelihood contribution of Wi for the estimator defined on the sample W-i in which Wi is removed. The LCV choice for ^ is the estimator which maximizes LCV. An important property of LCV is that the expectation of LCV is approximatively equal to ELL and it is shown in Liquet and Commenges (2004) that when n —► to,

E [LCV(Wn)] : 1 ELL(Ah(n)) ^ , where Fh(n) is an estimator applied to a sample of size n. If n is large, the computation of LCV is intensive. An approximation based on a first-order f w-i f w expansion of log(Wi) around log(Wi) can be used. This leads to an expression of the form

LCVa(Wn) &W (Wi) - mdf, i=i where the term mdf can be interpreted as the model degrees of freedom, and this expression is analogous to an AIC criterion. For instance, in the spline approximation of the penalized likelihood, we have mdf = tiace([U -2hQ]-1H) where H is the converged Hessian matrix of the log-likelihood, and Q is the penalized part of the converged Hessian matrix, see Joly et al. [JCL98] for more details.

3.2 Direct bootstrap method for estimating ELL (ELLboot and

ELLiboot)

We can directly estimate by bootstrap the expectation of the log-likelihood (ELL). We define this bootstrap estimator as

where W * = (W*, ...,W*) , W* - Fw , W * = (W1*,..., Wn*) and Wj * -Fw, FW being the empirical distribution of Wi based on W. We use the notation E* to remind that the expectation is taken relatively to the estimated distribution Fw. In practice, the expectation is approximated by a mean of B repetitions of bootstrap samples (Wj = W j = W*)

To improve this criterion, we can iterate the bootstrap method [Hal92]. We define this new estimator as:

where W ** = (W***,..., W**) , W** - FW *, W ** = W**,..., W'r**) and Wj** - Fw *, FFw * being the empirical distribution of W* based on W*. E** is calculated with respect to the distribution fw*. The expectation is also approximated by a mean of B repetitions of bootstrap samples (Wj = W j =

More explicitly, for each j, we take a bootstrap sample from W* from W, then we take Wj a bootstrap sample from W*; we obtain W j by the same way; X^1 is the estimator of A for fixed h based on Wj.

### 3.3 Bias corrected bootstrap estimators

To construct this estimator, we first propose the log-likelihood as naive estimator of the ELL criterion and then correct it by estimating its bias by bootstrap [Hal92]. This approach is similar to that used for deriving the EIC [LSC03] criterion available for complete data. Our corrected estimator of ELL is:

where b(W) - — ^ \ log Lxh (Wj) - log Lxh (WU is the bootstrap esti-

mate of the bias (bias = E{log(W)}—ELL), B is the number of bootstrap sample Wj taken at random from the distribution of W*. For more details see Liquet and Commenges (2004).

Remark : for all the bootstrap methods when treating right-censored observations the bootstrap expectations have to be conditionned on having at least one uncensored observation because the estimator are not defined otherwise.

### 4 Simulation

We have compared ELLboot, ELLiboot, ELLbboot and LCV using both families of kernel and penalized likelihood estimators of hazard functions. We have included the Ramlau-Hansen method when using kernels, a popular method for estimating hazard functions [ABGK93]. We compare the criteria when using kernel smoothing in 4.1 and penalized likelihood in 4.2.

4.1 Kernel estimator

The smoothed Nelson-Aalen estimator is

where K(■) is a kernel function, A(■) is the Nelson-Aalen estimator of A(■), the cumulative hazard function, and h is the bandwith parameter. RamlauHansen [RH83] has proposed an estimator of the MISE (mean integrated square error) based on an approximated cross-validation method for estimating h; we call it the RH method. We apply gaussian kernels to allow the use of the different criteria. Indeed, if we used a kernel with compact support, we risk for small h to have LCV criteria equal to —x. For the criteria based on bootstrap, kernels with compact support are prohibited since the bootstrap expectations are theoretically equal to —x for bandwidth lower than the range of the observed event times. We consider problems where the density near zero is very low so there is no edge effect near zero. The data were generated from a mixture of gamma distributions. We generated random samples Ti,...,Tn of i.i.d. failure times and Ci,...,Cn of i.i.d. censoring times; the Ci were independent of the Ti. So the observed samples were (T1: ¿1),..., (Tn, Sn) where T1 = min(Ti; Ci) and 5i = I[Ti<Ci]. The density of T was a a mixture of Gamma {0.4r(t; 40,1)+0.6r(t; 880,1)}, with the probability density functions r(t; a,j) = aJt"'r^-. The probability density function of Ci was a simple Gamma: r(t; 90,1), r(t; 90,1.1) and r(t; 90,1.3) corresponding to a percentage of censoring around 15%, 25% and 50% respectively. Samples of sizes 30, 50 and 100 were generated. Figure 1 displays the smoothed Nelson-Aalen estimate chosen by ELLbboot and the true hazard function for one simulated example from a mixture of gamma.

Each bootstrap estimator was computed using B=400 samples. Each simulation involved 100 replications. For each replication we computed the useful part of the Kullback-Leibler information (KL) between the true density function f and the estimators chosen by each criterion

KL(f; fW) = j\og fW(t)f (t)dt where J =]0;Tmax]. We do not take Tmax equal to +x, because for large times t when there is censoring, we do not have enough information to determine fW (t). Tmax was chosen for each simulation such as

where nTmax represents the risk set at time Tmax. We computed, for each simulation presented, the average of KL and its standard error. Since KL generally takes negative values we give in tables 1, 2, 4 the values of -KL: low

Fig. 1. True hazard function (solid line), smoothed Nelson-Aalen estimator (dashed line) and penalized likelihood estimate (dotted line) chosen by ELLbboot for a simulated example. The sample size is 50, with 15% right-censored observations.

values then correspond to estimators close to the true distribution. First we present in table 1 the results of the simulation comparing the optimal criterion KL and the new criterion ELL. The two theoretical criteria give practically the same results. We note some differences only when there is little information (small sample size and high censoring level). The average of -KL obtained for the pratical criteria are given in table 2. These averages can be compared to an optimal value, the value of KL when estimators are chosen using the true ELL.

We may note that RH yielded in all cases much higher (worse) values of -KL than the other criteria. The ELLboot criterion, although better than RH, had in pratically all the cases higher values than the other criteria. The differences were very small between LCV, ELLiboot and ELLbboot although for high censoring level and small sample sizes, LCV tended to perform not as well as the bootstrap methods. For all the simulations, the three competitive criteria had values of KL quite close to the values given by ELL. Although the simulations were based on only 100 replications some differences were large comparatively to the standard errors. To make the comparisons more rigorous we performed paired t-tests for comparing the criteria in the case of 25% of censoring. All the tests (except one) of ELLboot and RH versus LCV, ELLiboot and ELLbboot were significant at the 0.001 level; the three tests comparing ELLboot with RH were also significant. This confirms that

Table 1. Average Kullback-Leibler information —KL (AW) for the kernel estimator for estimating the hazard function of the mixture of gamma (0.4.T(t,a,b) + 0.6r(t, c, d)) for bandwith chosen by ELL and KL, based on 100 replications. Standard errors are given in parentheses.

—KL(AW) for kernel estimators

KL ELL

—KL(AW) for kernel estimators

KL ELL

 15% censoring 30 3.96(0.005) 3.96(0.005) 50 3.98(0.003) 3.99(0.003) 100 4.01(0.002) 4.01(0.002) 25% censoring 30 3.89(0.004) 3.91(0.005) 50 3.93(0.004) 3.93(0.004) 100 3.95(0.002) 3.95(0.002) 50% censoring 30 3.81(0.02) 3.92(0.04) 50 3.80(0.009) 3.84(0.02) 100 3.80(0.005) 3.80(0.005)

Table 2. Average Kullback-Leibler information —KL(AW) for the kernel estimator for estimating the hazard function of the mixture of gamma {0.4r(t,a,b) + 0.6r(t,c,d)} for each criterion based on 100 replications. Standard errors are given in parentheses.

Table 2. Average Kullback-Leibler information —KL(AW) for the kernel estimator for estimating the hazard function of the mixture of gamma {0.4r(t,a,b) + 0.6r(t,c,d)} for each criterion based on 100 replications. Standard errors are given in parentheses.

—KL(AW) for kernel estimators n ELL ELLbboot LCV ELLgoot ELLboot RH 15% censoring

30 3.96(0.005) 4.00(0.009) 4.01(0.02) 3.98(0.005) 4.04(0.01) 4.19(0.06) 50 3.99(0.003) 4.00(0.006) 4.00(0.008) 4.00(0.005) 4.04(0.01) 4.22(0.06) 100 4.01(0.002) 4.02(0.002) 4.02(0.002) 4.02(0.002) 4.05(0.005) 4.12(0.02)

25% censoring

30 3.91(0.005) 3.94(0.009) 3.96(0.01) 3.92(0.006) 3.98(0.01) 4.26(0.08) 50 3.93(0.004) 3.95(0.007) 3.96(0.01) 3.94(0.006) 3.99(0.01) 4.2(0.06) 100 3.95(0.002) 3.96(0.002) 3.96(0.002) 3.96(0.002) 3.99(0.007) 4.10(0.03)

50% censoring

30 3.92(0.04) 3.99(0.07) 4.04(0.07) 4.01(0.07) 4.02(0.08) 4.36(0.1)

50 3.84(0.02) 3.85(0.02) 3.91(0.03) 3.85(0.02) 3.88(0.03) 4.18(0.09)

100 3.80(0.005) 3.81(0.005) 3.83(0.02) 3.80(0.005) 3.84(0.008) 3.95(0.03)

n the criteria can be classified in three groups ordered from best to worst : 1) LCV, ELLiboot and ELLbboot; 2) ELLboot; 3) RH.

We also compared the different criteria in term of MISE (mean integrated squared error). The result of this simulation are summarized in table 3. Although the RH criterion was based on minimizing the MISE, it gave the worst result. Since Marron and Padgett [MP87] proved an optimality property of cross-validation for bandwidth choice, this may be due to the approximation done for obtaining the RH criterion.

Table 3. Comparison of the criteria by the MISE distance for the kernel estimator for estimating the hazard function of the mixture of gamma {0.4r(t,a,b) + 0.6r(t,c,d)} based on 100 replications. Standard errors are given in parentheses.

MISE for kernel estimators n ELLbboot_LCV_ELLboot ELLboot_RH

25% censoring

30 0.039(0.005) 0.044(0.008) 0.034(0.004) 0.043(0.006) 0.095(0.02) 50 0.038(0.004) 0.039(0.005) 0.035(0.004) 0.04l(0.006) 0.110(0.02) 100 0.034(0.004) 0.033(0.004) 0.034(0.004) 0.046(0.004) 0.073(0.012)

4.2 Penalized likelihood estimator

Another approach to estimate the hazard function is to use penalized likelihood: