Veronique Sebille1 and Mounir Mesbah2

1 Laboratoire de Biostatistiques, Faculté de Pharmacie, Université de Nantes, 1 rue Gaston Veil, BP 53508, 44035 Nantes Cedex 1, France. [email protected]

2 Laboratoire de Statistique Théorique et Appliquée (LSTA), Université Pierre et Marie Curie - Paris VI, Boîte 158, - Bureau 8A25 - Plateau A. 175 rue du Chevaleret, 75013 Paris, France [email protected]

Summary. Early stopping of clinical trials either in case of beneficial or deleterious effect of treatment on quality of life (QoL) is an important issue. QoL is usually evaluated using self-assessment questionnaires and responses to the items are combined into scores assumed to be normally distributed (which is rarely the case). An alternative is to use item response theory (IRT) models such as the Rasch model for binary items which takes into account the categorical nature of the items.

Sequential analysis and mixed Rasch models (MRM) were combined in the context of phase II non-comparative trials. The statistical properties of the Sequential Probability Ratio Test (SPRT) and of the Triangular Test (TT) were compared using MRM and traditional average scores methods (ASM) by means of simulations.

The type I error of the SPRT and TT was correctly maintained for both methods. While remaining a bit underpowered, MRM displayed higher power than the ASM for both sequential tests. Both methods allowed substantial reductions in average sample numbers as compared with fixed sample designs (about 60%).

The use of IRT models in sequential analysis of QoL endpoints is promising and should provide a more powerful method to detect therapeutic effects than the traditional ASM.

Key words: Quality of life; Item Response Theory; Rasch models; Sequential Probability Ratio Test; Triangular Test; Clinical Trials

Clinical trials usually focus on endpoints that traditionally are biomedical measures, such as disease progression or survival for cancer trials, or survival or hospitalization for heart failure trials. However, such endpoints do not reflect the patient's perception of his or her well-being and satisfaction with therapy. Health-Related Quality of Life (QoL), which refers to "the extent to which one's usual or expected physical, emotional and social well-being are affected by a medical condition or its treatment", is an important health outcome (Cella and Bonomi, 1995; Fairclough, 2002).

Non-comparative phase II trials, which are commonly designed to evaluate therapeutic efficacy and to further investigate the side-effects and potential risks associated with therapy, often use QoL endpoints. Early stopping of such trials in case of either a beneficial or a deleterious effect of the treatment on QoL is an important matter (Cannistra, 2004). Ethical concerns and economic reasons for the use of early stopping rules include the fact that patients are recruited sequentially in a trial and that data from early recruited patients are available for analysis while later patients are still being included in the trial. Such a framework offers the possibility of using the emerging evidence to stop the study as soon as the treatment effect on QoL becomes clear. Early stopping of a trial can occur for efficacy (when the trial seems to show a clear treatment advantage), for safety (when the trial seems to show clear treatment harm) or for futility (when the trial no longer has much chance of showing any treatment benefit). However, it is well known that multiple looks at the data inflate the type I error $\alpha$ and increase the risk of over-interpreting interim results. Thus, specific early termination procedures have been developed to allow for repeated statistical analyses on accumulating data and for stopping a trial as soon as the information is sufficient to conclude. Among the sequential methods that have been developed over the last few decades (Pocock, 1977; O'Brien and Fleming, 1979; Lan and De Mets, 1983), the Sequential Probability Ratio Test (SPRT) and the Triangular Test (TT), which were initially developed by Wald (Wald, 1947) and Anderson (Anderson, 1960) and later extended by Whitehead to allow for sequential analyses on groups of patients (Whitehead and Jones, 1979; Whitehead and Stratton, 1983), have several interesting features.
They allow for: (i) early stopping under $H_0$ or under $H_1$, (ii) the analysis of quantitative, qualitative or censored endpoints, (iii) type I and II errors to be correctly maintained at the values specified at the planning phase, (iv) substantial sample size reductions as compared with the single-stage design (reductions of about 30% can often be achieved).
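To make the idea of stopping early under either hypothesis concrete, the following is a minimal sketch of Wald's SPRT in the simplest possible setting: a normal mean with known standard deviation. The function name `wald_sprt`, the parameter values, and the use of Wald's classical approximate boundaries, log((1 - beta)/alpha) and log(beta/(1 - alpha)), are illustrative assumptions and are not taken from this chapter.

```python
import math
import random


def wald_sprt(observations, mu0, mu1, sigma, alpha=0.05, beta=0.10):
    """Wald's SPRT for the mean of N(mu, sigma^2) data, sigma known.

    Returns (decision, n_used), where decision is 'reject H0',
    'accept H0', or 'continue' if no boundary was crossed.
    """
    # Wald's approximate boundaries on the log-likelihood-ratio scale
    # (assumption: overshoot of the boundaries is ignored).
    log_a = math.log((1 - beta) / alpha)
    log_b = math.log(beta / (1 - alpha))

    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # log f(x | mu1) - log f(x | mu0) for a normal density
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma**2
        if llr >= log_a:
            return "reject H0", n
        if llr <= log_b:
            return "accept H0", n
    return "continue", n


if __name__ == "__main__":
    random.seed(1)
    # Hypothetical trial: true mean equal to the alternative value 0.5
    sample = [random.gauss(0.5, 1.0) for _ in range(200)]
    print(wald_sprt(sample, mu0=0.0, mu1=0.5, sigma=1.0))
```

The test inspects the data after every observation and stops as soon as the cumulative log-likelihood ratio leaves the interval between the two boundaries, which is what allows a conclusion in favour of either hypothesis before a fixed sample size is reached.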

Patients' QoL is usually evaluated using self-assessment questionnaires, which consist of a set of questions, often called items (which can be dichotomous or polytomous), that are frequently combined to give scores for scales or subscales. The common practice is to work on average scores, which are generally assumed to be normally distributed. However, these average scores are rarely normally distributed and usually do not satisfy a number of basic measurement properties including sufficiency, unidimensionality, or reliability.

More importantly, these scores are often used, knowingly or not, as a reduction of a larger amount of data (each score is a "sufficient statistic" for a given set of observed categorical items, and is then used as a surrogate for this set of items), without clearly introducing the mechanism of such a reduction in the likelihood.

In the framework of educational sciences, or more generally in psychometry or sociometry, models relating a set of observed items to a hidden latent concept are called measurement models, whereas models relating concepts (directly observed or latent) to each other are called analysis models. Item Response Theory (IRT), which was first developed mostly in educational testing, takes into account the multiplicity and categorical nature of the items by introducing an underlying response model (Fischer and Molenaar, 1995) relating those items to a latent parameter that has the nice property of being interpretable as the true individual QoL. In this framework, the probability of a patient's response to an item depends upon two different parameters: the "ability level" of the person (which reflects his/her current QoL) and the "difficulty" of the item (which reflects the capacity of that specific item to discriminate between good and bad QoL). IRT models are specific generalized linear models which were developed more from a "measurement" point of view than from an "analysis" one. However, an equivalent modeling framework could be repeated measures logistic regression, since IRT modeling deals with repeated items aimed at measuring an unobserved latent trait. IRT modeling, as a tool for scientific measurement, is not yet well established in the clinical trial framework, despite a number of advantages it offers for analyzing clinical trial data, including helpful solutions to missing data problems, the possibility to determine whether items are biased against certain subgroups, and an appropriate tool for dealing with ceiling and floor effects (Holman et al., 2003a). Moreover, it has been suggested that IRT modeling offers a more accurate measurement of health status and thus should be more powerful to detect treatment effects (McHorney et al., 1997; Kosinski et al., 2003).
Hence, IRT modeling could be an interesting alternative to traditional sequential analysis of QoL endpoints based only on average scores. Thus, we tried to evaluate the benefit of combining sequential analysis and IRT methodologies in the context of phase II non-comparative trials. We performed sequential analysis of QoL endpoints (obtained from the observed data) using IRT modeling and we compared the use of IRT modeling methods with the traditional use of average scores methods.

The basic assumption for IRT models is the unidimensionality property, stating that all items of a questionnaire should measure the same underlying concept (e.g., QoL), often called the latent trait and denoted $\theta$. Another important assumption of IRT models, closely related to the former, is local independence, meaning that items should be conditionally independent given the latent trait $\theta$. It can be expressed mathematically by writing the joint probability of a response pattern given the latent trait $\theta$ as a product of marginal probabilities. Let $X_{ij}$ be the answer of subject $i$ to item $j$ and let $\theta_i$ be the unobserved latent variable (also called the ability; in our context, the QoL) for subject $i$ ($i = 1, \ldots, N$; $j = 1, \ldots, k$).

$$P(X_{i1} = x_{i1}, X_{i2} = x_{i2}, \ldots, X_{ik} = x_{ik} \mid \theta_i) = \prod_{j=1}^{k} P(X_{ij} = x_{ij} \mid \theta_i)$$

where $(X_{i1}, X_{i2}, \ldots, X_{ik})$ is a set of items (either dichotomous or polytomous). In other words, the person's ability, or the person's QoL, should be the only variable affecting individual item responses. For any person $i$, or more accurately for any given $\theta_i$, the corresponding response values $X_{ij}$ to the various items $j$ ($j = 1$ to $k$) are independent, as if they were chosen randomly.

For binary items, one of the most commonly used IRT models is the Rasch model, sometimes called the one-parameter logistic model (Rasch, 1960). The Rasch model specifies the conditional probability of a patient's response $x_{ij}$ given the latent variable $\theta_i$ and the item parameter $\beta_j$:

$$P(X_{ij} = x_{ij} \mid \theta_i, \beta_j) = \frac{e^{x_{ij}(\theta_i - \beta_j)}}{1 + e^{\theta_i - \beta_j}}$$

where $\beta_j$ is called the difficulty parameter for item $j$ ($j = 1, \ldots, k$). In contrast with other IRT models, in the Rasch model a subject's total score,

$$S_i = \sum_{j=1}^{k} X_{ij}$$

is a sufficient statistic for the specific latent trait or ability $\theta_i$.

Thus, when the total score of a questionnaire with binary items is used as a measure of QoL, it is "knowingly or not" assumed that the Rasch model is the true underlying model.
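The sufficiency of the total score can be checked numerically. The sketch below assumes the usual logistic form of the Rasch model, exp(x(theta - beta)) / (1 + exp(theta - beta)), and uses hypothetical item difficulties; the function names are illustrative only. It computes the probability of a response pattern conditional on its total score and shows that this quantity is the same for very different values of the latent trait.

```python
import itertools
import math


def rasch_prob(x, theta, beta):
    """P(X = x | theta, beta) for a binary item under the Rasch model."""
    return math.exp(x * (theta - beta)) / (1.0 + math.exp(theta - beta))


def pattern_prob(pattern, theta, betas):
    """Joint probability of a response pattern (local independence)."""
    p = 1.0
    for x, b in zip(pattern, betas):
        p *= rasch_prob(x, theta, b)
    return p


def conditional_given_score(pattern, theta, betas):
    """P(pattern | total score, theta); free of theta under the Rasch model."""
    score = sum(pattern)
    same_score = [
        q for q in itertools.product((0, 1), repeat=len(betas))
        if sum(q) == score
    ]
    denom = sum(pattern_prob(q, theta, betas) for q in same_score)
    return pattern_prob(pattern, theta, betas) / denom


betas = [-1.0, 0.0, 0.5, 1.5]   # hypothetical item difficulties
pattern = (1, 1, 0, 0)          # one response pattern with total score 2

for theta in (-2.0, 0.0, 2.0):
    # The printed conditional probability does not change with theta.
    print(theta, round(conditional_given_score(pattern, theta, betas), 6))
```

Because the conditional probability does not involve the latent trait, all the information about a person's QoL carried by the binary responses is contained in the total score, which is the property that conditional maximum likelihood estimation relies on.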

Several methods are available for estimating the parameters (the $\theta$s and $\beta$s) in the Rasch model (Hamon and Mesbah, 2002), including joint maximum likelihood (JML), conditional maximum likelihood (CML), and marginal maximum likelihood (MML). JML is used when person and item parameters are considered as unknown fixed parameters. However, this method gives asymptotically biased and inconsistent estimates (Haberman, 1977). The second method, CML, consists in maximizing the conditional likelihood given the total score in order to obtain the item parameter estimates. The person parameters are then estimated by maximizing the likelihood using the previously obtained item parameter estimates. This method has been shown to give consistent and asymptotically normally distributed estimates of the item parameters (Andersen, 1970). The last method, MML, is used when the Rasch model is interpreted as a mixed model with $\theta$ as a random effect having distribution $h(\theta, Z)$ with unknown parameters $Z$. The distribution $h$ is often assumed to belong to some family of distributions (often Gaussian) and its parameters are jointly estimated with the item parameters. As with the CML method, the MML estimators for the item parameters are asymptotically efficient (Thissen, 1982). Furthermore, since MML does not presume the existence of a sufficient statistic (unlike CML), it is applicable to virtually any type of IRT model.
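As an illustration of the marginal approach, the sketch below evaluates the marginal likelihood of a mixed Rasch model in which the latent trait is assumed Gaussian with mean `mu` and standard deviation `sigma`. The integral over the latent trait is approximated by Gauss-Hermite quadrature, which is one common choice and not necessarily the one used by the authors. The function names, the item difficulties and the toy data are hypothetical, and the maximization step itself is omitted.

```python
import itertools
import numpy as np


def marginal_pattern_prob(pattern, betas, mu, sigma, n_nodes=30):
    """Marginal probability of a binary pattern when theta ~ N(mu, sigma^2)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variable so that the nodes cover the N(mu, sigma^2) density
    thetas = mu + np.sqrt(2.0) * sigma * nodes
    x = np.asarray(pattern, dtype=float)
    b = np.asarray(betas, dtype=float)
    eta = thetas[:, None] - b[None, :]            # shape (nodes, items)
    # Rasch probability of each observed response at each quadrature node
    p_item = np.exp(x[None, :] * eta) / (1.0 + np.exp(eta))
    p_pattern = p_item.prod(axis=1)               # local independence
    return float(np.sum(weights * p_pattern) / np.sqrt(np.pi))


def marginal_loglik(data, betas, mu, sigma):
    """Marginal log-likelihood of a list of response patterns."""
    return sum(
        np.log(marginal_pattern_prob(row, betas, mu, sigma)) for row in data
    )


betas = [-1.0, 0.0, 1.0]                          # hypothetical difficulties
data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0)]  # toy data

for mu in (-1.0, 0.0, 1.0):
    # Compare the marginal log-likelihood for a few candidate means
    print(mu, marginal_loglik(data, betas, mu, sigma=1.0))
```

In an actual MML fit, a numerical optimizer would maximize `marginal_loglik` jointly over the item parameters and the parameters of the latent distribution, usually with an identifiability constraint such as fixing the mean of the difficulties.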

2.4 Sequential Analysis

Traditional Sequential Analysis

In the traditional framework of sequential analysis (Wald, 1947; Whitehead, 1997; Jennison and Turnbull, 1999), $\theta_i$ is assumed to be observed (not to be a latent value) and the observed score $S_i$ is used as a "surrogate" of the true latent trait $\theta_i$. In that setting, we generally assume that $\theta_1, \theta_2, \ldots, \theta_N$ are $N$ independent variables following distributions $f_{\varphi_1,\psi_1}(\theta_1), f_{\varphi_2,\psi_2}(\theta_2), \ldots, f_{\varphi_N,\psi_N}(\theta_N)$ with unknown individual parameters $\varphi_i$ and $\psi_i$ ($i = 1, \ldots, N$). We shall assume that those individual parameters are the same, i.e., that $\forall i$ ($i = 1, \ldots, N$), $\varphi_i = \varphi$ (parameter of interest) and $\psi_i = \psi$ (vector of nuisance parameters), and that the trial involves the comparison of the two following hypotheses: $H_0$: $\varphi < 0$ against $H_1$: $\varphi > 0$. In that classical setting, the decision is based on the likelihood of the data, i.e. on:

$$L(\theta_1, \theta_2, \ldots, \theta_N \mid \varphi, \psi) = f_{\varphi,\psi}(\theta_1) \cdot f_{\varphi,\psi}(\theta_2) \cdots f_{\varphi,\psi}(\theta_N)$$

Values $\varphi_0 < \varphi_1$ are chosen and the following continuation region is used for the sequential test, for suitable values of $B_{\alpha,\beta} < 1 < A_{\alpha,\beta}$ (Wald, 1947):

$$B_{\alpha,\beta} < \frac{L(\theta_1, \ldots, \theta_N \mid \varphi_1, \hat{\psi}(\varphi_1))}{L(\theta_1, \ldots, \theta_N \mid \varphi_0, \hat{\psi}(\varphi_0))} < A_{\alpha,\beta}$$

where $\hat{\psi}(\varphi_0)$ ($\hat{\psi}(\varphi_1)$) denotes the maximum likelihood estimate of $\psi$ for $\varphi = \varphi_0$ ($\varphi = \varphi_1$).

If the terminal value of the likelihood ratio is below $B_{\alpha,\beta}$, then $H_0$ is not rejected; if it is above $A_{\alpha,\beta}$, then $H_0$ is rejected. It is well known that if $\varphi_0$ and $\varphi_1$ are assumed to be small (Whitehead, 1997), the log-likelihood function $l(\varphi, \hat{\psi}(\varphi))$ can be approximated using a Taylor expansion up to quadratic terms in $\varphi$ for $\varphi = \varphi_0$ or $\varphi = \varphi_1$. Thus, the continuation region can be simplified in the following way:
