Statistical Inference for Two-Sample and Regression Models with Heterogeneity Effect: A Collected-Sample Perspective

Hong-Dar Isaac Wu

School of Public Health, China Medical University, 91 Hsueh-Shih Rd., Taichung 404, TAIWAN. [email protected]

Summary. Heterogeneity effect is an important issue in the analysis of clinical trials, survival data, and epidemiological cohort studies. This article reviews inference for heterogeneity effects, drawing on a series of works by Hsieh, who used the empirical process approach, and on related works by Bagdonavicius, Nikulin, and coworkers. The review covers two-sample models and Cox-type relative risk regression models. The heterogeneity property over the covariate space and the non-constancy property are discussed for several models. In survival analysis, the log-relative risk is plotted as a function of time and of the covariates to present the heterogeneity property of Hsieh's and of Bagdonavicius and Nikulin's hazards regression models.

Key words: Heterogeneity, two-sample problem, location-scale model, transformation model, Cox model, Hsieh model, Bagdonavicius-Nikulin model

1 Introduction

This article reviews inference for heterogeneity effects, drawing on a series of works, mainly by Hsieh, and on related works by Bagdonavicius, Nikulin, and coworkers. Before starting our discussion, we briefly define the meaning of 'heterogeneity'. Consider a measure of 'effect' and a set of subpopulations indexed by a variable $W$: if the effect is constant over $W$, we say the effect is homogeneous; otherwise, there is heterogeneity. In some situations, heterogeneity can be dealt with by stratified analysis or by random effect analysis. Such approaches, however, usually assume that there are unobserved or unmeasured factors along which the effect heterogeneity exists, and that this heterogeneity will be averaged out or eliminated. The work of Hsieh proposed another possibility: the heterogeneity is a result of observable variables, and the impact of these variables 'should be' (and 'can be') estimated. A simple location-scale model serves as an illustration. Consider the case of logistic regression: let

$$X \sim F_X(x) = \mathrm{Logistic}(a_1, b_1) = \frac{1}{1 + e^{-(a_1 + b_1 x)}} \quad\text{and}\quad Y \sim G_Y(x) = \mathrm{Logistic}(a_2, b_2) = \frac{1}{1 + e^{-(a_2 + b_2 x)}},$$

where $F(\cdot)$ and $G(\cdot)$ are cdfs. For ordinary 2 x 2 table analysis, a possibly unknown cutoff value $x_0$ is assumed such that the odds of the first ($F$) and second ($G$) groups are

$$\mathrm{odds}_F(x_0) = \frac{P(X \le x_0)}{P(X > x_0)} \quad\text{and}\quad \mathrm{odds}_G(x_0) = \frac{P(Y \le x_0)}{P(Y > x_0)}$$

respectively, which results in the odds ratio (OR) of G-group versus F-group:

$$\mathrm{OR} = \frac{\mathrm{odds}_G(x_0)}{\mathrm{odds}_F(x_0)} = \frac{G(x_0)\{1 - F(x_0)\}}{F(x_0)\{1 - G(x_0)\}}.$$

The above two-sample problem simplifies when the two distributions have an identical 'dispersion' (or 'scale'), that is, $b_1 = b_2$. In that case, $\mathrm{OR} = e^{a_2 - a_1}$. The situation of 'identical dispersion' can be extended to ordinary logistic regression if $a_j$ is suitably modeled by a set of covariates: $a_j = \beta_0 + \beta_1 z_{1j} + \cdots + \beta_p z_{pj}$.

However, when the dispersions are not identical (referred to as a case of 'heterogeneity'), the odds ratio in the two-sample setting is

$$\mathrm{OR} = e^{(a_2 - a_1) + (b_2 - b_1) x_0}.$$

Without loss of generality, we can set $a_1 = 0$, $a_2 = a$, $b_1 = 1$, and $b_2 = b$. Then $\mathrm{OR} = e^{a + x_0(b-1)}$, which depends on the location difference $a$ as well as on the scale parameter $b$ and the cutoff value $x_0$. The phenomenon of heterogeneity becomes more apparent if the dispersion parameter $b\,(> 0)$ is further expressed in a regression setting as $e^{\gamma^T z}$ through the same set of covariates $z$.
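As a quick numerical illustration of how the odds ratio depends on the cutoff, the following Python sketch evaluates the OR both directly from the two logistic cdfs and from the closed form $e^{a + x_0(b-1)}$. It assumes the parametrization $F(x) = 1/(1 + e^{-(a + bx)})$ with odds defined as $P(X \le x_0)/P(X > x_0)$; the function names and the values $a = 0.5$, $b = 2$ are illustrative choices, not taken from the works reviewed here.

```python
import math

def logistic_cdf(x, a, b):
    """Logistic cdf in the assumed form 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def odds_ratio(x0, a1, b1, a2, b2):
    """Odds ratio of the G-group versus the F-group at cutoff x0."""
    F = logistic_cdf(x0, a1, b1)
    G = logistic_cdf(x0, a2, b2)
    return (G / (1.0 - G)) / (F / (1.0 - F))

a, b = 0.5, 2.0  # illustrative location difference and scale parameter
for x0 in (-1.0, 0.0, 1.0):
    direct = odds_ratio(x0, 0.0, 1.0, a, b)    # a1 = 0, b1 = 1, a2 = a, b2 = b
    closed = math.exp(a + x0 * (b - 1.0))      # closed form from the text
    print(f"x0={x0:+.1f}  OR(direct)={direct:.4f}  OR(closed form)={closed:.4f}")
```

With $b = 1$ the printed odds ratio would be the same at every cutoff; with $b \neq 1$ it changes with $x_0$, which is the heterogeneity described above.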

There are, of course, other indices that can be used as a measure of effect. If the variable that indexes different locations (0 vs. $a$) also indexes different scales (1 vs. $b$), the heterogeneity effect is said to be 'from the observable variable itself'. This is particularly important when the variable (say $X$) is continuous: the heterogeneity effect cannot be stratified out, even by grouping the $X$-variable, because the 'effect' of $X$ itself is to be estimated. This point will become clearer later, in the context of a regression model with heterogeneity. To make reliable inference, the heterogeneity parameter needs to be estimated explicitly. This is very different from other heterogeneity models, in which the variable causing individual or cluster heterogeneity is not observed. The heterogeneity discussed in this paper is therefore neither of the same type nor at the same level as that in, for example, random effect models.

2 Two-Sample Models

The two-sample problem plays an important role in the development of statistical inference. In clinical trials or epidemiological cohort studies, for example, data collected prospectively according to two treatments, or retrospectively from diseased and healthy groups, are analyzed to assess the effect of a treatment or the association between an exposure and the disease of concern. In what follows, we refer to the measure of interest as the 'treatment effect' or simply the 'effect'. If the two-sample relation is described by a location-scale model and the goal is to make inference about the treatment effect, both the location and the scale parameters must be estimated precisely and simultaneously. Ignoring the scale parameter (dispersion) leads to a biased effect estimate. In this section, we introduce Hsieh's work on two-sample problems through the empirical process approach (EPA).

2.1 Two-sample location-scale model

The two-sample location-scale model studied in Hsieh [HSI95, HSI96a] assumes two distributions, say $F(\cdot)$ and $G(\cdot)$, satisfying

$$G(x) = F\!\left(\frac{x - \mu}{\sigma}\right) \quad\text{or}\quad G^{-1}(t) = \mu + \sigma F^{-1}(t), \quad 0 < t < 1,$$

and two sets of samples $X_1, \dots, X_m \sim F$ and $Y_1, \dots, Y_n \sim G$. Let $u = (u_1, \dots, u_J)^T$ be a set of grid (or cutoff) points, $0 < u_1 < \cdots < u_J < 1$, where $J$ depends on $n$: $J = J(n)$. The EPA of Hsieh builds up the following regression-type setting for the specified points $u_1, \dots, u_J$:

$$\hat G_n^{-1}(u) = \mu + \sigma \hat F_m^{-1}(u) + D K_{m,n}(u), \quad (1)$$

where $D = \mathrm{diag}(\dots, 1/f(F^{-1}(u_j)), \dots)$, and $\hat G_n^{-1}(\cdot)$ and $\hat F_m^{-1}(\cdot)$ are the empirical quantile processes (Csörgő [CS83]). The process $K_{m,n}(u)$ differs between complete (no-censoring) and censored data problems. For complete two-sample data, $K_{m,n}(u)$ is a linear combination of two independent Brownian bridge processes arising from the strong approximations of the two quantile processes; for right-censored data, $K_{m,n}(u)$ is a combination of two independent generalized Kiefer processes.

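To estimate $\theta = (\mu, \sigma)^T$, equation (1) is treated (at $u$) as a regression setting and the least squares method is used. However, because the covariance matrix of $D K_{m,n}(u)$ may involve the unknown $\mu$ and $\sigma$, a generalized least squares (GLS) estimate is adopted. Let $X_{J \times 2} = (1_{J \times 1}, \hat F_m^{-1}(u))$, where $1_{J \times 1} = (1, \dots, 1)^T$ is a $J \times 1$ column vector; further define $\Sigma_\varepsilon = D \Sigma_K D$, where $\Sigma_K$ is the covariance matrix of the $K(\cdot)$-process. Then we have the GLS estimate of $\theta$:

$$\hat\theta = (X^T \Sigma_\varepsilon^{-1} X)^{-1} X^T \Sigma_\varepsilon^{-1}\, \hat G_n^{-1}(u).$$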

If a reweighting procedure is needed, Hsieh suggested only a 'one-step' iteration. Further, $f(F^{-1}(u_k))$ can be replaced by its kernel-smoothed estimate. The estimation is in the same spirit as the minimum chi-square method and, as a companion result, yields a test statistic for overall model checking. In addition to being convenient to implement, Hsieh's GLS estimate for the location-scale model has an important feature: it achieves the semiparametric Fisher information bound (Bickel et al. [BKRW93]) for large samples, and is thus asymptotically efficient.
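The GLS idea for complete data can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Hsieh's implementation: the covariance of the error term is taken to be proportional to $(1/m + 1/n)\{\min(u_j, u_k) - u_j u_k\}/\{f(F^{-1}(u_j)) f(F^{-1}(u_k))\}$ (the Brownian-bridge form, with the common factor $\sigma^2$ dropped because it cancels in the estimate), the density is estimated with a Gaussian kernel and Silverman's rule-of-thumb bandwidth, and the function name `gls_location_scale`, the grid, and the simulated normal samples are hypothetical choices.

```python
import numpy as np

def gls_location_scale(x, y, u):
    """GLS fit of G^{-1}(u) = mu + sigma * F^{-1}(u) on a grid u (complete data)."""
    m, n = len(x), len(y)
    qx = np.quantile(x, u)   # empirical quantiles of the F-sample
    qy = np.quantile(y, u)   # empirical quantiles of the G-sample

    # Gaussian-kernel estimate of f at the empirical quantiles (assumed bandwidth rule).
    h = 1.06 * np.std(x, ddof=1) * m ** (-0.2)
    z = (qx[:, None] - x[None, :]) / h
    f_hat = np.exp(-0.5 * z**2).sum(axis=1) / (m * h * np.sqrt(2.0 * np.pi))
    D = np.diag(1.0 / f_hat)

    # Assumed Brownian-bridge covariance; the factor sigma^2 is omitted on purpose.
    sigma_k = (np.minimum.outer(u, u) - np.outer(u, u)) * (1.0 / m + 1.0 / n)
    sigma_eps = D @ sigma_k @ D

    X = np.column_stack([np.ones_like(u), qx])   # design matrix (1, F_m^{-1}(u))
    W = np.linalg.inv(sigma_eps)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ qy)   # (mu_hat, sigma_hat)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)                # sample from F
y = 1.0 + 2.0 * rng.normal(0.0, 1.0, size=2000)    # sample from G: mu = 1, sigma = 2
u = np.linspace(0.1, 0.9, 9)
mu_hat, sigma_hat = gls_location_scale(x, y, u)
print(mu_hat, sigma_hat)
```

Because the weights here do not depend on $\mu$ or $\sigma$, no reweighting step is needed in this simplified setting; with censored data the covariance would have to be replaced by the one for the generalized Kiefer processes.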

2.2 Two-sample transformation model

Now suppose that the two populations are related by $F G^{-1}(u) = \Psi(\mu + \sigma \Psi^{-1}(u))$ $(0 < u < 1)$ for a specified transformation $\Psi$. For complete data, Hsieh [HSI95] proposed an EPA estimation procedure based on a strong approximation of the empirical receiver operating characteristic (ROC) curve $\hat F_m(\hat G_n^{-1}(u))$ to the true curve $F(G^{-1}(u))$:

$$\hat F_m(\hat G_n^{-1}(u)) = F(G^{-1}(u)) + K_{m,n}(u). \quad (2)$$

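Here $K_{m,n}(u)$ is a combination of two independent Brownian bridges. See also Hsieh [HSI96b] for the problem of ROC curve estimation. According to (2), for a set of points $u = (u_1, \dots, u_J)^T$,

$$\sqrt{n}\,\{\hat F_m \hat G_n^{-1}(u) - F G^{-1}(u)\} \to_d N(0, \Sigma_K), \quad (3)$$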

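where $\Sigma_K$ is the covariance matrix of $\sqrt{n}\, K_{m,n}(\cdot)$. The following asymptotic distribution can be obtained by the $\delta$-method and the derivative of an inverse function:

$$\sqrt{n}\,\{\Psi^{-1}(\hat F_m \hat G_n^{-1}(u)) - (\mu + \sigma \Psi^{-1}(u))\} \to_d N(0, C \Sigma_K C), \quad (4)$$

where $C = \mathrm{diag}(\dots, 1/\psi(\mu + \sigma \Psi^{-1}(u_j)), \dots)$ and $\psi(\cdot)$ is the derivative of $\Psi(\cdot)$. The previous formula implies

$$\Psi^{-1}(\hat F_m \hat G_n^{-1}(u)) = \mu + \sigma \Psi^{-1}(u) + \varepsilon, \quad (5)$$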

in which the covariance of $\varepsilon$ is $(1/n)\, C \Sigma_K C = \Sigma_\varepsilon$. In view of this, a regression setting is built up. The case where $\Psi$ is the cumulative standard normal distribution is studied in Hsieh [HSI96b]. For censored data, a similar setting was derived in Hsieh [HSI96c]:

$$\Psi^{-1}(\hat S_{1,m}(\hat S_{0,n}^{-1}(u))) = \mu + \sigma \Psi^{-1}(u) + K_{m,n}(u), \quad (6)$$

where $\hat S_{1,m}$ and $\hat S_{0,n}$ are the Kaplan-Meier estimators of the two true survivor functions, and $K_{m,n}(u)$ is again a combination of two independent generalized Kiefer processes. For unified exposition, we still denote the covariance of $K_{m,n}(\cdot)$ in (6) by $\Sigma_\varepsilon$. The regression settings (5) and (6) lead to the following least-squares-type estimation. For complete data, let $\mathrm{ROC}(u) = \hat F_m \hat G_n^{-1}(u)$; for right-censored data, let $\mathrm{ROC}(u) = \hat S_{1,m}(\hat S_{0,n}^{-1}(u))$. Further define $D(u) = \Psi^{-1}(\mathrm{ROC}(u)) - (\mu + \sigma \Psi^{-1}(u))$. Then, because of the normality of $\varepsilon$ and $K_{m,n}(\cdot)$, the (log-)likelihood comprises the quadratic form $\{D(u)\}^T \Sigma_\varepsilon^{-1} \{D(u)\}$ plus a remainder term. Because the information about $\theta = (\mu, \sigma)^T$ contained in the remainder is asymptotically negligible compared with that contained in the quadratic term (Hsieh [HSI95, HSI96c]), taking derivatives of the quadratic term results in the estimating equation

$$\left(\frac{\partial D(u)}{\partial \theta}\right)^T \Sigma_\varepsilon^{-1} D(u) = 0. \quad (7)$$

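This equation is convenient to use because, as in linear regression with normal errors, a generalized least squares (GLS) estimate can be obtained by

$$\hat\theta = -\left\{\left(\frac{\partial D(u)}{\partial \theta}\right)^T \hat\Sigma_\varepsilon^{-1} \left(\frac{\partial D(u)}{\partial \theta}\right)\right\}^{-1} \left(\frac{\partial D(u)}{\partial \theta}\right)^T \hat\Sigma_\varepsilon^{-1}\, \Psi^{-1}(\mathrm{ROC}(u)), \quad (8)$$

where $\partial D(u)/\partial \theta = -(1_{J \times 1}, \Psi^{-1}(u))$ and $\hat\Sigma_\varepsilon$ is a consistent estimator of $\Sigma_\varepsilon$.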
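For the binormal case ($\Psi$ equal to the standard normal cdf) with complete data, the regression setting and the GLS estimate can be sketched in Python as below. Several ingredients are assumptions rather than details from the works reviewed: the covariance of $\sqrt{n}\, K_{m,n}$ is taken in the usual large-sample form for an empirical ROC curve $R = F G^{-1}$, namely $(n/m)\{R(s \wedge t) - R(s) R(t)\} + R'(s) R'(t)(s \wedge t - st)$; the weights are built from an ordinary least squares start (a one-step reweighting); and the function name `fit_binormal_roc`, the grid, and the simulated samples are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()
probit = np.vectorize(_nd.inv_cdf)   # Psi^{-1}: standard normal quantile function
norm_pdf = np.vectorize(_nd.pdf)     # psi: standard normal density
norm_cdf = np.vectorize(_nd.cdf)     # Psi: standard normal cdf

def bridge_cov(t):
    """Brownian-bridge covariance min(t_j, t_k) - t_j * t_k."""
    return np.minimum.outer(t, t) - np.outer(t, t)

def fit_binormal_roc(x, y, u):
    """One-step GLS fit of Psi^{-1}(ROC(u)) = mu + sigma * Psi^{-1}(u)."""
    m, n = len(x), len(y)
    roc = np.array([np.mean(x <= q) for q in np.quantile(y, u)])  # F_m(G_n^{-1}(u))
    roc = np.clip(roc, 0.5 / m, 1.0 - 0.5 / m)   # keep the probit transform finite
    resp = probit(roc)
    z = probit(u)
    X = np.column_stack([np.ones_like(z), z])

    # Ordinary least squares start, used only to build the weights.
    mu0, sigma0 = np.linalg.lstsq(X, resp, rcond=None)[0]
    lin = mu0 + sigma0 * z
    R = norm_cdf(lin)                             # fitted ROC values
    dR = sigma0 * norm_pdf(lin) / norm_pdf(z)     # fitted derivative of the ROC curve

    sigma_k = (n / m) * bridge_cov(R) + np.outer(dR, dR) * bridge_cov(u)  # assumed form
    C = np.diag(1.0 / norm_pdf(lin))
    sigma_eps = C @ sigma_k @ C / n

    W = np.linalg.inv(sigma_eps)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ resp)   # (mu_hat, sigma_hat)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)   # sample from F = N(0, 1)
y = rng.normal(0.5, 1.5, size=2000)   # sample from G = N(0.5, 1.5^2): mu = 0.5, sigma = 1.5
u = np.linspace(0.1, 0.9, 9)
mu_hat, sigma_hat = fit_binormal_roc(x, y, u)
print(mu_hat, sigma_hat)
```

For right-censored data the same regression step would apply with the Kaplan-Meier based ROC curve, but the covariance would need the form appropriate to the generalized Kiefer processes, which this sketch does not attempt.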

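The above estimation procedure has the following merit: it combines the estimation and hypothesis testing problems in a unified quadratic form, which is asymptotically chi-square distributed. This resembles the spirit of minimum chi-square inference. To elucidate, note that the quantity $A = \{D_\theta(u)\}^T \Sigma_\varepsilon^{-1} \{D_\theta(u)\} \sim \chi^2_J$, where $D_\theta(u)$ denotes $D(u)$ evaluated at $\theta$. The quadratic term $A$ can be decomposed as

$$A = \{D_{\hat\theta}(u)\}^T \Sigma_\varepsilon^{-1} \{D_{\hat\theta}(u)\} + (\hat\theta - \theta)^T \left\{\left(\frac{\partial D}{\partial \theta}\right)^T \Sigma_\varepsilon^{-1} \left(\frac{\partial D}{\partial \theta}\right)\right\}_{\hat\theta} (\hat\theta - \theta) + o_p(1),$$

where $Q_g = \{D_{\hat\theta}(u)\}^T \Sigma_\varepsilon^{-1} \{D_{\hat\theta}(u)\} \sim \chi^2_{J-2}$ is used as a statistic for testing the global goodness-of-fit of the model, and $Q_l = (\hat\theta - \theta)^T \{(\partial D/\partial \theta)^T \Sigma_\varepsilon^{-1} (\partial D/\partial \theta)\}_{\hat\theta} (\hat\theta - \theta) \sim \chi^2_2$ can be used to test a local hypothesis such as $H_0: \theta = \theta_0$ vs. $H_a: \theta \neq \theta_0$ (for some specified $\theta_0$), provided the global model is valid. This issue will also be explored in the following discussion on hazards regression models.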
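The decomposition of the quadratic form can be checked numerically in a simplified setting. The Python sketch below assumes a linear regression with a known error covariance, in which case the split into a goodness-of-fit part and a local-hypothesis part holds exactly (the $o_p(1)$ remainder arises only when the covariance has to be estimated). The design matrix, covariance, and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
J = 9
u = np.linspace(0.1, 0.9, J)
X = np.column_stack([np.ones(J), u])    # hypothetical J x 2 design matrix
theta0 = np.array([0.5, 1.5])           # hypothesized value of theta

# Hypothetical known covariance: Brownian-bridge pattern with a small scale.
sigma_eps = 0.01 * (np.minimum.outer(u, u) - np.outer(u, u))
W = np.linalg.inv(sigma_eps)

# Simulate the response under theta0 with correlated normal errors.
y = X @ theta0 + rng.multivariate_normal(np.zeros(J), sigma_eps)

theta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # GLS estimate

def quad(r, M):
    """Quadratic form r^T M r."""
    return float(r @ M @ r)

A = quad(y - X @ theta0, W)                    # total quadratic form, J degrees of freedom
Q_g = quad(y - X @ theta_hat, W)               # global goodness-of-fit part, J - 2
Q_l = quad(theta_hat - theta0, X.T @ W @ X)    # local-hypothesis part, 2
print(A, Q_g + Q_l)
```

The two printed numbers agree up to rounding, because the cross term vanishes by the GLS normal equations. In practice $Q_g$ would be compared with a $\chi^2_{J-2}$ critical value and $Q_l$ with a $\chi^2_2$ critical value.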
