
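where the $Z(S)$ statistic is the efficient score for $\varphi$ depending on the observed scores $S$, and the $V(S)$ statistic is Fisher's information for $\varphi$.

More precisely: $Z(S) = l_{\varphi}(0, \hat{\psi}(0))$ and $V(S) = -\{ l^{\varphi\varphi}(0, \hat{\psi}(0)) \}^{-1}$, where:

- $l_{\varphi}(0, \hat{\psi}(0))$ denotes the first partial derivative of $l(\varphi, \psi)$ with respect to $\varphi$, evaluated at $(0, \hat{\psi}(0))$,

- the leading element of the inverse of the matrix of second derivatives is denoted by $l^{\varphi\varphi}(\varphi, \psi)$, that is:

$$l^{\varphi\varphi}(\varphi, \psi) = \left\{ l_{\varphi\varphi}(\varphi, \psi) - l_{\varphi\psi}(\varphi, \psi)\, l_{\psi\psi}^{-1}(\varphi, \psi)\, l_{\psi\varphi}(\varphi, \psi) \right\}^{-1}$$

where $l_{\varphi\varphi}(0, \hat{\psi}(0))$ denotes the second partial derivative of $l(\varphi, \psi)$ with respect to $\varphi$, evaluated at $(0, \hat{\psi}(0))$, and $l_{\varphi\psi}(0, \hat{\psi}(0))$ denotes the mixed derivative.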

Asymptotic distributional results have shown that for large samples and small $\varphi$, $Z(S)$ follows a normal distribution: $Z(S) \sim N(\varphi V(S), V(S))$. More precisely, let a sequential study with up to $K$ analyses produce the sequence of test statistics $(Z_1(S), Z_2(S), \ldots, Z_K(S))$. This sequence is multivariate normal with $Z_k(S) \sim N(\varphi V_k(S), V_k(S))$ and $\mathrm{Cov}(Z_{k_1}(S), Z_{k_2}(S)) = V_{k_1}(S)$ for $k = 1, 2, \ldots, K$ and $1 \le k_1 < k_2 \le K$ (Whitehead, 1997; Jennison and Turnbull, 1999).
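The covariance structure above is the one obtained when the score statistic accumulates independent normal increments between analyses. The following minimal Python sketch illustrates this under stated assumptions: the information levels (20, 40, 60), the parameter value 0.5, the seed and the function names are all hypothetical choices made for illustration, not values taken from the study.

```python
import math
import random


def simulate_score_path(phi, info_levels, rng):
    """Draw (Z_1, ..., Z_K) with independent normal increments.

    Each increment has mean phi * dV and variance dV, so that
    Z_k ~ N(phi * V_k, V_k) and Cov(Z_k1, Z_k2) = V_k1 for k1 < k2.
    """
    z, v_prev, path = 0.0, 0.0, []
    for v in info_levels:
        dv = v - v_prev
        z += rng.gauss(phi * dv, math.sqrt(dv))
        path.append(z)
        v_prev = v
    return path


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(2024)          # illustrative seed
info_levels = [20.0, 40.0, 60.0]   # hypothetical values of V_1, V_2, V_3
paths = [simulate_score_path(0.5, info_levels, rng) for _ in range(20000)]

z1 = [p[0] for p in paths]
z2 = [p[1] for p in paths]
mean_z1 = sum(z1) / len(z1)        # should be close to phi * V_1
cov_z1_z2 = sample_cov(z1, z2)     # should be close to V_1
```

With these settings the empirical mean of the first statistic should be near $\varphi V_1$ and the empirical covariance between the first two statistics near $V_1$, up to Monte Carlo error.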

### Sequential Analysis based on Rasch measurements

We now turn to the latent case, i.e., the case where $\theta_i$ is unobserved. The likelihood is then different, because a likelihood is traditionally a function of the observations, not of unobserved variables. The following steps are used to obtain the likelihood needed for sequential testing.

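1. The Rasch model specifies the conditional distribution of the item response given the latent variable $\theta_i$ and the item parameter $\beta_j$:

$$f(x_{ij} \mid \theta_i; \beta_j) = \frac{e^{(\theta_i - \beta_j) x_{ij}}}{1 + e^{\theta_i - \beta_j}} = f_{\beta_j}(x_{ij} \mid \theta_i)$$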

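2. We can then write:

$$f_{\beta_j, \varphi, \psi}(x_{ij}, \theta_i) = f_{\beta_j}(x_{ij} \mid \theta_i)\, f_{\varphi, \psi}(\theta_i)$$

and get the joint distribution of the observed $x_{ij}$ and the latent variable $\theta_i$, where $f_{\varphi, \psi}$ denotes the density of the latent variable.

3. The likelihood is obtained after marginalizing over the unobserved latent variable $\theta_i$:

$$f_{\beta_j, \varphi, \psi}(x_{ij}) = \int f_{\beta_j}(x_{ij} \mid \theta_i)\, f_{\varphi, \psi}(\theta_i)\, d\theta_i$$

4. Local independence of items allows us to derive the likelihood of a subject:

$$f_{\beta_1, \beta_2, \ldots, \beta_k, \varphi, \psi}(x_{i1}, x_{i2}, \ldots, x_{ik}) = \int f_{\varphi, \psi}(\theta_i) \prod_j f_{\beta_j}(x_{ij} \mid \theta_i)\, d\theta_i$$

5. Finally, independence of subjects allows us to obtain the likelihood. Using the notation $\eta = (\beta_1, \beta_2, \ldots, \beta_k, \psi)$, we can write:

$$L(\theta_1, \theta_2, \ldots, \theta_N; \varphi, \eta(\varphi)) = \prod_i \int f_{\varphi, \psi}(\theta_i) \prod_j f_{\beta_j}(x_{ij} \mid \theta_i)\, d\theta_i$$

or, more precisely:

$$L(\theta_1, \theta_2, \ldots, \theta_N; \varphi, \eta(\varphi)) = \prod_i \int f_{\varphi, \psi}(\theta_i) \prod_j \frac{e^{(\theta_i - \beta_j) x_{ij}}}{1 + e^{\theta_i - \beta_j}}\, d\theta_i \qquad (1)$$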
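The steps above can be sketched numerically. The following Python fragment is a minimal illustration, assuming a normal latent density (the choice made in the next section) and using a plain midpoint rule on a wide grid instead of the adaptive Gauss-Hermite quadrature used in the study. All function names, the grid size and the integration range are hypothetical.

```python
import math


def rasch_prob(x, theta, beta):
    """P(X = x | theta) under the Rasch model, for x in {0, 1}."""
    eta = theta - beta
    return math.exp(eta * x) / (1.0 + math.exp(eta))


def normal_pdf(theta, mu, sigma):
    """Density of N(mu, sigma^2), used here as the latent density."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def subject_likelihood(responses, betas, mu, sigma, n_grid=2000, half_width=8.0):
    """Marginal likelihood of one subject's responses (step 4).

    The latent variable is integrated out with a midpoint rule over
    mu +/- half_width * sigma (a simple stand-in for quadrature).
    """
    lo = mu - half_width * sigma
    step = 2.0 * half_width * sigma / n_grid
    total = 0.0
    for g in range(n_grid):
        theta = lo + (g + 0.5) * step
        cond = 1.0
        for x, beta in zip(responses, betas):  # local independence of items
            cond *= rasch_prob(x, theta, beta)
        total += cond * normal_pdf(theta, mu, sigma) * step
    return total


def log_likelihood(data, betas, mu, sigma):
    """Log of the likelihood in step 5 (independent subjects)."""
    return sum(math.log(subject_likelihood(r, betas, mu, sigma)) for r in data)
```

A quick sanity check on such a sketch is that the subject likelihoods summed over all possible response patterns equal one.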

### Estimation of parameters

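We assumed that the latent trait $\theta$ followed a normal distribution $N(\mu, \sigma^2)$ and that we are testing $H_0: \mu = \mu_0 = 0$ against $H_1: \mu > 0$. In this framework, the parameter of interest is $\mu$ and the vector of nuisance parameters is $\eta = (\beta_1, \beta_2, \ldots, \beta_k, \sigma)$. From (1), the log likelihood is:

$$l(\theta_1, \theta_2, \ldots, \theta_N; \mu, \eta(\mu)) = \sum_i \log \int \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\theta_i - \mu)^2}{2\sigma^2}} \prod_j \frac{e^{(\theta_i - \beta_j) x_{ij}}}{1 + e^{\theta_i - \beta_j}}\, d\theta_i$$

With the change of variable $\xi_i = (\theta_i - \mu)/\sigma$, this becomes:

$$l(\xi_1, \xi_2, \ldots, \xi_N; \mu, \eta(\mu)) = \sum_i \log \int \prod_j \frac{e^{(\sigma\xi_i + \mu - \beta_j) x_{ij}}}{1 + e^{\sigma\xi_i + \mu - \beta_j}}\, g(\xi_i)\, d\xi_i$$

where $g$ is the density of the standard normal distribution.

### Z and V statistics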

The statistic $Z$, previously defined and noted $Z(S)$, now depends on $X$, the responses to the items, which contain all the information on the items: $Z(X) = l_{\mu}(0, \hat{\eta}(0))$, where $\hat{\eta}(0)$ is the MLE of $\eta$ under $H_0$ ($\mu = \mu_0 = 0$), and $\hat{\eta}(0) = \eta^* = (\beta_1^*, \beta_2^*, \ldots, \beta_k^*, \sigma^*)$, with $\beta_1^* = \hat{\beta}_1(0), \ldots, \sigma^* = \hat{\sigma}(0)$. Then, we can write: $Z(X) = l_{\mu}(0, \beta_1^*, \beta_2^*, \ldots, \beta_k^*, \sigma^*)$. We assumed that the item parameters $\beta_1, \beta_2, \ldots, \beta_k$ were known, and we computed the MLE of $\sigma$ under the null hypothesis in order to then estimate the $Z(X)$ and $V(X)$ statistics. More details are given in Appendix 1. Estimation of the statistics $Z(X)$ and $V(X)$ was done by maximising the marginal likelihood, obtained by integrating out the random effects. Numerical integration methods had to be used because no analytical solution is available. We used the well-known adaptive Gauss-Hermite quadrature to obtain numerical approximations (Pinheiro and Bates, 1995).
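To make the definition of $Z(X)$ concrete: differentiating the log likelihood with respect to $\mu$ under the integral sign gives, for each subject, a posterior-weighted average of the residual $\sum_j (x_{ij} - p_{ij})$, where $p_{ij}$ is the Rasch probability of a positive response. The Python sketch below evaluates this at $\mu = 0$ for known item parameters and a given $\sigma$. It is an illustration under assumptions: a fixed midpoint grid replaces the adaptive Gauss-Hermite quadrature, $\sigma$ is passed in rather than estimated, $V(X)$ is not computed, and all names are hypothetical.

```python
import math


def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def latent_grid(mu, sigma, n_grid=400, half_width=8.0):
    """Midpoint nodes theta = mu + sigma * xi with weights g(xi) * d(xi)."""
    step = 2.0 * half_width / n_grid
    nodes = []
    for g in range(n_grid):
        xi = -half_width + (g + 0.5) * step
        weight = math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi) * step
        nodes.append((mu + sigma * xi, weight))
    return nodes


def subject_terms(responses, betas, mu, sigma):
    """Return (marginal likelihood, its derivative with respect to mu)."""
    lik, dlik = 0.0, 0.0
    for theta, weight in latent_grid(mu, sigma):
        probs = [expit(theta - beta) for beta in betas]
        cond = 1.0
        for x, p in zip(responses, probs):
            cond *= p if x == 1 else 1.0 - p
        residual = sum(responses) - sum(probs)  # d/d(mu) of log conditional lik.
        lik += weight * cond
        dlik += weight * cond * residual
    return lik, dlik


def log_lik(data, betas, mu, sigma):
    """Marginal log likelihood over independent subjects."""
    return sum(math.log(subject_terms(r, betas, mu, sigma)[0]) for r in data)


def efficient_score(data, betas, sigma):
    """Z(X): derivative of the log likelihood in mu, evaluated at mu = 0."""
    z = 0.0
    for responses in data:
        lik, dlik = subject_terms(responses, betas, 0.0, sigma)
        z += dlik / lik
    return z
```

One way to check such an implementation is to compare `efficient_score` with a central finite difference of `log_lik` around $\mu = 0$.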

### 2.5 The Sequential Probability Ratio Test and the Triangular Test

The statistics $Z$ and $V$ were noted $Z(S)$ and $V(S)$ in the case of traditional sequential analysis based on sufficient scores, and $Z(X)$ and $V(X)$ in the case of a joint sequential and Rasch analysis based directly on observed items. For ease of presentation, we use the notations $Z$ and $V$ here. The SPRT and the TT use a sequential plan defined by two perpendicular axes: the horizontal axis corresponds to Fisher's information $V$, and the vertical axis corresponds to the efficient score $Z$, which represents the benefit as compared with $H_0$. The TT appears in figure 1.1. For a one-sided test, the boundaries of the test delineate a continuation region (situated between these lines) from the regions of non-rejection of $H_0$ (situated beneath the bottom line) and of rejection of $H_0$ (situated above the top line). The boundaries depend on the statistical hypotheses (values of the expected treatment benefit, $\alpha$ and $\beta$) and on the number of subjects included between two analyses. They can be adapted at each analysis when this number varies from one analysis to the other, using the "Christmas tree" correction (Siegmund, 1979). The expressions of the boundaries for one-sided tests (Sebille and Bellissant, 2001) are given in Appendix 2. At each analysis, the values of the two statistics $Z$ and $V$ are computed and $Z$ is plotted against $V$, thus forming a sample path as the trial goes on. The trial is continued as long as the sample path remains in the continuation region. A conclusion is reached as soon as the sample path crosses one of the boundaries of the test: non-rejection of $H_0$ if the sample path crosses the lower boundary, and rejection of $H_0$ if it crosses the upper boundary.
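The stopping rule described above can be sketched in a few lines of Python. The boundary expressions used here are an assumption: they are the commonly quoted form of the one-sided triangular test for the special case $\alpha = \beta$ (upper boundary $Z = a + cV$, lower boundary $Z = -a + 3cV$, with $a = (2/\theta_R)\ln(1/(2\alpha))$ and $c = \theta_R/4$), with the Christmas tree correction $0.583\sqrt{V_k - V_{k-1}}$. The expressions actually used in the study are those of Appendix 2, which may differ, in particular when $\alpha \neq \beta$. Function names and the example paths are hypothetical.

```python
import math


def triangular_constants(theta_r, alpha):
    """Intercept a and slope c of the triangular test (alpha = beta case)."""
    a = (2.0 / theta_r) * math.log(1.0 / (2.0 * alpha))
    c = theta_r / 4.0
    return a, c


def triangular_boundaries(v, theta_r, alpha, dv=0.0):
    """Lower and upper boundary at information v.

    dv is the information gained since the previous analysis; it drives
    the Christmas tree correction, which narrows the continuation region.
    """
    a, c = triangular_constants(theta_r, alpha)
    shift = 0.583 * math.sqrt(dv)
    upper = a + c * v - shift
    lower = -a + 3.0 * c * v + shift
    return lower, upper


def monitor(path, theta_r, alpha):
    """Apply the stopping rule to a list of (V_k, Z_k) pairs.

    Returns (analysis number, decision). Near the apex the corrected
    boundaries can cross; the upper boundary is checked first here.
    """
    v_prev = 0.0
    for k, (v, z) in enumerate(path, start=1):
        lower, upper = triangular_boundaries(v, theta_r, alpha, v - v_prev)
        if z >= upper:
            return k, "reject H0"
        if z <= lower:
            return k, "do not reject H0"
        v_prev = v
    return len(path), "continue"


# Hypothetical sample path with an effect size of 0.5 and alpha = beta = 0.05.
decision = monitor([(20.0, 5.0), (40.0, 12.0)], theta_r=0.5, alpha=0.05)
```

In this form the two uncorrected boundaries meet at $V = a/c$, which is what closes the continuation region and bounds the maximum sample size.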

### 2.6 Study framework

Fig. 1. Stopping boundaries based on the Triangular Test (TT) for $\alpha = \beta = 0.05$ with an effect size (ES) of 0.5.

We simulated 1000 non-comparative clinical trials with patients' item responses generated according to a Rasch model. The latent trait $\theta_i$ was assumed to follow a normal distribution with mean $\mu$ and variance $\sigma^2 = 1$, and the trial involved the comparison of the two hypotheses $H_0: \mu = \mu_0 = 0$ against $H_1: \mu > 0$. In clinical trial practice, the minimum clinically relevant difference (a difference worth detecting) is often expressed as an effect size (the ratio of the minimum clinically relevant difference to the standard deviation). Since $\mu_0 = 0$ and the standard deviation of $\theta$ is equal to one, the effect size is equal to $\mu$ in our case.

In practice, effect sizes of interest seen in published research range from 0.2 (small) to 0.8 (large), but the magnitude of the effect size primarily depends on the subject matter. Indeed, in medical research effect sizes of 0.5 or 0.6 may be considered large. To our knowledge, there are no well-known conventional values for the effect size that would be most appropriate for QoL endpoints, since they closely depend on the medical context under consideration. However, it seems that effect sizes ranging from 0.4 to 0.6 could be of interest (Lacasse et al., 1996).

We assumed that the items under consideration formed part of a calibrated item bank, meaning that item parameters were assumed to be known (Holman et al., 2003b). The item parameters were uniformly distributed in the interval (-2, 2) with $\sum_j \beta_j = 0$. The average scores method simply used the sum of item scores $S$ for each patient, assuming a normal distribution, and the $Z(S)$ and $V(S)$ statistics were computed within the well-known framework of normally distributed endpoints (Whitehead, 1997).
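The data-generating step and the average scores statistics can be sketched as follows. Several points are assumptions made for illustration only: the item parameters are taken evenly spaced inside (-2, 2) so that they sum to zero, the sum score is centred at its null expectation $k/2$ (which holds by symmetry for such items when $\mu = 0$), and $Z(S)$ and $V(S)$ use the form obtained for a single normal sample with unknown variance, $Z = \sum x / \sqrt{\sum x^2 / n}$ and $V = n - Z^2/(2n)$. The exact scoring and standardisation used in the study are not given in this section. All names and the seed are hypothetical.

```python
import math
import random


def item_difficulties(k):
    """k item parameters evenly spaced inside (-2, 2), summing to zero."""
    return [-2.0 + 4.0 * (j + 0.5) / k for j in range(k)]


def simulate_scores(n, betas, mu, sigma, rng):
    """Sum scores of n patients with theta ~ N(mu, sigma^2), Rasch responses."""
    scores = []
    for _ in range(n):
        theta = rng.gauss(mu, sigma)
        score = 0
        for beta in betas:
            p = 1.0 / (1.0 + math.exp(-(theta - beta)))
            if rng.random() < p:
                score += 1
        scores.append(score)
    return scores


def score_statistics(scores, null_mean):
    """Efficient score Z and Fisher information V for centred sum scores.

    Assumes a normal model with unknown variance for x = score - null_mean.
    """
    n = len(scores)
    x = [s - null_mean for s in scores]
    sum_x = sum(x)
    sum_xx = sum(v * v for v in x)
    if sum_xx == 0.0:
        raise ValueError("all centred scores are zero; Z is undefined")
    z = sum_x / math.sqrt(sum_xx / n)
    v = n - z * z / (2.0 * n)
    return z, v


rng = random.Random(7)                      # illustrative seed
betas = item_difficulties(10)               # 10-item scale
scores = simulate_scores(40, betas, 0.0, 1.0, rng)   # 40 patients under H0
z_s, v_s = score_statistics(scores, null_mean=len(betas) / 2.0)
```

Under this form, $V$ lies between $n/2$ and $n$, and is close to $n - 0.5$ on average under the null hypothesis.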

We compared, in the context of sequential analysis of QoL endpoints, the use of Rasch modelling methods with traditional average scores methods. The statistical properties of the SPRT and of the TT were studied in the setting of one-sided non-comparative trials. Using simulations, we studied the type I error ($\alpha$), power ($1-\beta$), average sample number (ASN) and 90th percentile (P90) of the number of patients required to reach a conclusion. The sequential tests were compared with the traditional method using the SPRT or TT based on the averages of patients' scores. We investigated scales with 10 or 20 items and 3 expected effect sizes (0.4, 0.5 and 0.6), and sequential analyses were performed every 20 included patients. The SAS procedure NLMIXED was used, with quasi-Newton optimisation to maximise the likelihood and adaptive Gaussian quadrature to integrate out the random effects. The sequential tests were all programmed in C++.

### 3 Results

Table 1.1 shows the type I error for different values of the effect size, number of items and nominal power for the TT using either the average scores or the Rasch modelling method. The significance level was close to the target value of 0.05 for the average scores method, but was slightly higher with a 10-item scale than with a 20-item scale. The significance level was always lower than the target value of 0.05 for the Rasch modelling method, for all effect sizes, numbers of items, and nominal power values. Moreover, the significance level seemed to decrease as the effect size increased.

| Effect size | Nb of items | Average scores, power 0.90 | Average scores, power 0.95 | Rasch model, power 0.90 | Rasch model, power 0.95 |
|---|---|---|---|---|---|
| 0.4 | 10 | 0.056 (0.007) | 0.062 (0.008) | 0.033 (0.006) | 0.035 (0.006) |
| 0.4 | 20 | 0.049 (0.007) | 0.049 (0.007) | 0.036 (0.006) | 0.034 (0.006) |
| 0.5 | 10 | 0.057 (0.007) | 0.055 (0.007) | 0.027 (0.005) | 0.033 (0.006) |
| 0.5 | 20 | 0.050 (0.007) | 0.044 (0.006) | 0.020 (0.004) | 0.033 (0.006) |
| 0.6 | 10 | 0.052 (0.007) | 0.053 (0.007) | 0.020 (0.004) | 0.028 (0.005) |
| 0.6 | 20 | 0.049 (0.007) | 0.046 (0.007) | 0.014 (0.004) | 0.018 (0.004) |
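The values in parentheses are not described in this section, but they appear consistent with binomial Monte Carlo standard errors, $\sqrt{\hat{p}(1-\hat{p})/1000}$, for a proportion estimated from 1000 simulated trials. The short Python sketch below shows that arithmetic; the function name is hypothetical and the interpretation is an assumption.

```python
import math


def monte_carlo_se(p_hat, n_sim):
    """Binomial standard error of a proportion estimated from n_sim runs."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sim)


# Estimated type I error of 0.056 from 1000 simulated trials.
se_type_one = round(monte_carlo_se(0.056, 1000), 3)
```

The same formula applied to an estimated power of 0.723 gives about 0.014, so the simulation error is larger for the power estimates than for the type I error estimates.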

Table 1.2 shows the power for different values of the effect size, number of items and nominal power for the TT using either the average scores or the Rasch modelling method. The TT was underpowered, especially when using the average scores method as compared with the Rasch modelling method. For instance, as compared with the target power value of 0.95, there were decreases in power of approximately 12% and 7% with 10 and 20 items, respectively, for the average scores method. By contrast, the decrease in power was only about 5% for the Rasch modelling method, whatever the number of items used. Moreover, with the Rasch modelling method, the power seemed to decrease as the effect size increased.

| Effect size | Nb of items | Average scores, power 0.90 | Average scores, power 0.95 | Rasch model, power 0.90 | Rasch model, power 0.95 |
|---|---|---|---|---|---|
| 0.4 | 10 | 0.723 (0.014) | 0.821 (0.012) | 0.848 (0.011) | 0.927 (0.008) |
| 0.4 | 20 | 0.812 (0.012) | 0.869 (0.011) | 0.852 (0.011) | 0.919 (0.009) |
| 0.5 | 10 | 0.769 (0.013) | 0.837 (0.012) | 0.831 (0.012) | 0.907 (0.009) |
| 0.5 | 20 | 0.825 (0.012) | 0.902 (0.009) | 0.812 (0.012) | 0.910 (0.009) |
| 0.6 | 10 | 0.782 (0.013) | 0.846 (0.011) | 0.807 (0.012) | 0.898 (0.010) |
| 0.6 | 20 | 0.834 (0.012) | 0.891 (0.010) | 0.772 (0.013) | 0.874 (0.010) |

Table 1.3 shows the ASN required to reach a conclusion under $H_0$ and $H_1$ for different values of the effect size, number of items and nominal power for the TT using either the average scores or the Rasch modelling method. We also computed, for comparison purposes, the approximate number of patients required by a single-stage design (SSD) using IRT modelling from the results published in a recent paper (Holman et al., 2003a). As expected, the ASNs all decreased as the expected effect sizes increased, whatever the method used. The ASNs under $H_0$ and $H_1$ were always much smaller for both methods, based either on average scores or on Rasch modelling, than for the SSD, whatever the values of effect size, number of items or nominal power considered. The decreases in the ASNs were a bit larger for the average scores method than for the Rasch modelling method. For instance, under $H_0$ ($H_1$), as compared with the SSD, there were decreases of approximately 70% (65%) and 60% (55%) in sample size for the average scores and the Rasch modelling method, respectively.

Table 1.4 shows the P90 of the number of patients required to reach a conclusion under $H_0$ and $H_1$ for different values of the effect size, number of items and nominal power for the TT using either the average scores or the Rasch modelling method. In most cases, the P90 values of the sample size distribution under $H_0$ and $H_1$ were of the same order of magnitude for the average scores and the Rasch modelling method. Moreover, the P90 always remained lower for both methods, based either on average scores or on Rasch modelling, than for the SSD, whatever the values of effect size or number of items considered.

Table 3. ASN required to reach a conclusion under $H_0$ and $H_1$ for the Triangular Test using either the average scores method or the Rasch modelling method for different values of the effect size, number of items and power (nominal $\alpha$ = 0.05). (* When using IRT: approximate number of subjects required in a single-stage design (SSD).)

| Effect size | Nb of items | IRT*, power 0.90 | IRT*, power 0.95 | Average scores, power 0.90 ($H_0$ / $H_1$) | Average scores, power 0.95 ($H_0$ / $H_1$) | Rasch model, power 0.90 ($H_0$ / $H_1$) | Rasch model, power 0.95 ($H_0$ / $H_1$) |
|---|---|---|---|---|---|---|---|
| 0.4 | 10 | 75 | 125 | 34.24 / 42.58 | 41.50 / 50.70 | 49.70 / 62.62 | 60.78 / 71.44 |
| 0.4 | 20 | 70 | 120 | 34.30 / 40.62 | 42.42 / 46.86 | 42.10 / 52.50 | 51.44 / 59.16 |
| 0.5 | 10 | 60 | 100 | 24.74 / 29.10 | 29.04 / 33.84 | 33.72 / 43.10 | 41.34 / 48.60 |
| 0.5 | 20 | 55 | 95 | 24.46 / 28.50 | 29.02 / 33.06 | 29.68 / 37.16 | 35.02 / 42.52 |
| 0.6 | 10 | 45 | 80 | 21.24 / 22.78 | 23.16 / 25.60 | 26.24 / 32.46 | 30.38 / 36.88 |
| 0.6 | 20 | 45 | 80 | 21.20 / 22.78 | 22.96 / 24.98 | 23.38 / 27.64 | 26.92 / 31.52 |

Table 4. P90 of the number of patients required to reach a conclusion under $H_0$ and $H_1$ for the Triangular Test using either the average scores method or the Rasch modelling method for different values of the effect size, number of items and power (nominal $\alpha$ = 0.05). (* When using IRT: approximate number of subjects required in a single-stage design (SSD).)

| Effect size | Nb of items | IRT*, power 0.90 | IRT*, power 0.95 | Average scores, power 0.90 ($H_0$ / $H_1$) | Average scores, power 0.95 ($H_0$ / $H_1$) | Rasch model, power 0.90 ($H_0$ / $H_1$) | Rasch model, power 0.95 ($H_0$ / $H_1$) |
|---|---|---|---|---|---|---|---|
| 0.4 | 10 | ~ 75 | ~ 125 | 60 / 60 | 60 / 80 | 80 / 100 | 100 / 100 |
| 0.4 | 20 | ~ 70 | ~ 120 | 60 / 60 | 60 / 80 | 60 / 80 | 80 / 100 |
| 0.5 | 10 | ~ 60 | ~ 100 | 40 / 40 | 40 / 60 | 60 / 60 | 60 / 80 |
| 0.5 | 20 | ~ 55 | ~ 95 | 40 / 40 | 40 / 60 | 40 / 60 | 60 / 60 |
| 0.6 | 10 | ~ 45 | ~ 80 | 20 / 40 | 40 / 40 | 40 / 40 | 40 / 60 |
| 0.6 | 20 | ~ 45 | ~ 80 | 20 / 40 | 40 / 40 | 40 / 40 | 40 / 40 |

Fig. 2. Operating characteristic (OC) function (probability of accepting $H_0$) computed for the Sequential Probability Ratio Test (SPRT) using either the average scores or the Rasch modelling method under $H_0$ and under $H_1$ with an effect size of 0.5 and a nominal power of 0.95.

Fig. 3. Average Sample Number (ASN) under $H_0$ and $H_1$ (effect size of 0.5) for the Sequential Probability Ratio Test (SPRT) using the average scores or the Rasch modelling method and approximate sample size required by the SSD using IRT modelling (SSD_IRT).

The operating characteristic (OC) function (figure 1.2), which is the probability of accepting $H_0$, was computed for the SPRT using either the average scores or the Rasch modelling method under $H_0$ (where it should be equal to 0.95) and under $H_1$ with an effect size of 0.5 and a nominal power of 0.95 (where it should be equal to 0.05). As observed with the TT, the OC functions of the SPRT are quite similar under $H_0$ for both methods, whereas under $H_1$ the Rasch modelling method seems more accurate, that is, closer to the nominal value of 0.05 than the average scores method, which is higher. Figure 1.3 shows the ASNs under $H_0$ and $H_1$ (effect size of 0.5) for the SPRT using the average scores or the Rasch modelling method, as well as the approximate sample size required by the SSD using IRT modelling. As with the TT, the ASNs were always much lower using either the average scores or the Rasch modelling method as compared with the sample size required by the SSD. Moreover, we observed that the ASN of the SPRT was a bit higher using the Rasch model as compared with the average scores method.

Table 5. Distributions of the Z(S), V(S), Z(X), and V(X) statistics under $H_0$ estimated with the average scores (A) or the Rasch modeling (R) method. *: cumulated number of included patients since the beginning of the trial. Data are sample means of Z(S), V(S), Z(X), and V(X); (Var): variance of Z(S) or Z(X). §: p (Kolmogorov-Smirnov) = 0.005.

| Number of patients* | Number of items | Method A: Z(S) (Var) | Method A: V(S) | Method R: Z(X) (Var) | Method R: V(X) |
|---|---|---|---|---|---|
| 40 | 10 | 0.087§ (38.897) | 39.514 | 0.070 (23.158) | 27.457 |
| 40 | 20 | -0.007 (39.135) | 39.511 | 0.015 (31.456) | 33.362 |
| 60 | 10 | 0.076 (61.188) | 59.491 | -0.056 (35.628) | 39.923 |
| 60 | 20 | -0.060 (56.238) | 59.532 | 0.102 (44.488) | 48.598 |
| 100 | 10 | -0.179 (104.283) | 99.479 | -0.262 (62.286) | 65.186 |
| 100 | 20 | -0.381 (96.579) | 99.517 | -0.247 (76.248) | |

### 4 Discussion

We evaluated the benefit of combining sequential analysis and IRT methodologies in the context of phase II non-comparative clinical trials with QoL endpoints. We studied and compared the statistical properties of the SPRT and of the TT using either a Rasch modeling method or the traditional average scores method. Simulation studies showed that: (i) the type I error $\alpha$ was correctly maintained but seemed to be lower for the Rasch modeling method as compared with the average scores method, (ii) both methods seemed to be underpowered, especially the average scores method, the power being higher when using the Rasch modeling method, (iii) as expected using sequential analysis, both methods allowed substantial reductions in ASNs as compared with the SSD, the average scores method allowing smaller ASNs than the Rasch modeling method.

The fact that the Rasch modeling method seemed to be more conservative than the average scores method in terms of significance level might be partly explained by looking at the distributions of the $Z(S)$, $V(S)$, $Z(X)$, and $V(X)$ statistics under $H_0$ (table 1.5) under different conditions. According to asymptotic distributional results, we might expect the sequences of test statistics $(Z_1(S), Z_2(S), \ldots, Z_K(S))$ and $(Z_1(X), Z_2(X), \ldots, Z_K(X))$ to be multivariate normal with $Z_k(S) \sim N(0, V_k(S))$ and $Z_k(X) \sim N(0, V_k(X))$, respectively, under $H_0$ for $k = 1, 2, \ldots, K$ analyses (Whitehead, 1997; Jennison and Turnbull, 1999). The normality assumption was not rejected using a Kolmogorov-Smirnov test, except for the average scores method with a 10-item scale when $Z(S)$ was estimated on only 40 patients (corresponding to the second interim analysis). Moreover, the variances of $Z(S)$ and of $Z(X)$ were quite close to $V(S)$ and $V(X)$, respectively, in most cases. However, the variance of $Z$ was always lower when the estimation was performed using the Rasch modeling method ($Z(X)$) as compared with the average scores method ($Z(S)$, p < 0.001 for all cases), suggesting that the estimator of $Z$ using Rasch modeling might be more efficient. The same feature was observed under $H_1$ (data not shown), except for the normality assumption, which did not hold when a 20-item scale was used, for both methods. This might explain why the SPRT and TT were underpowered, especially when using the average scores method. However, a more thorough theoretical study of the distributions of the statistics $Z(S)$, $Z(X)$ and $V(S)$, $V(X)$ obtained using both methods would be worthwhile.
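As an illustration of the normality check mentioned above, the one-sample Kolmogorov-Smirnov statistic of simulated $Z$ values against a centred normal distribution can be computed as in the Python sketch below. This is a sketch under assumptions: it returns only the distance $D$ (no p-value), it takes the reference variance as an input (for example the mean of $V$), and the text does not say which variance was used in the study. Function names are hypothetical.

```python
import math


def normal_cdf(x, variance):
    """Distribution function of N(0, variance)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0 * variance)))


def ks_statistic(sample, variance):
    """Largest gap between the empirical CDF and the N(0, variance) CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, variance)
        # Compare the model CDF with the empirical CDF just after and
        # just before the jump at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

A small value of $D$ relative to its null distribution for the given number of replicates supports the normal approximation.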

Several limitations of our study are worth mentioning. Firstly, we assumed all item parameters to be known, which is unrealistic (at least for most scales). An option could be to investigate 2-stage estimation (Andersen, 1977), using item parameter estimates as known constants. However, problems with small sample sizes might occur, especially in the context of sequential analysis of clinical trials, where interim analyses are often performed on fewer than 50 patients, and further work is needed. Secondly, some further sensitivity analyses could be worthwhile, such as investigating the effects on the results of: (i) changing the number of items (either <10 or >20), (ii) looking at smaller or larger effect sizes than the ones investigated, and (iii) changing the frequency of the sequential analyses (set here to every 20 patients). Other types of investigations could also be interesting, such as applying these combined methodologies to comparative clinical trials (phase III trials), or evaluating the impact on the statistical properties of the sequential tests of the amount of missing data (often encountered in practice) and of missing data mechanisms (missing completely at random, missing at random, non-ignorable missing data). In addition, other group sequential methods could also be investigated, such as spending functions (Lan and De Mets, 1983) and Bayesian sequential methods (Grossman et al., 1994). Finally, we only worked on binary items, whereas polytomous items appear more frequently in health-related QoL scales used in clinical trial practice. Other IRT models such as the Partial Credit Model or the Rating Scale Model (Andrich, 1978; Masters, 1982) would certainly be more appropriate in this context and are currently being investigated (work in progress).

### 5 Conclusion

Item response theory usually provides more accurate assessment of health status as compared with summation methods (McHorney et al., 1997; Kosinski et al., 2003). The use of IRT methods in the context of sequential analysis of QoL endpoints seems to be promising and might provide a more powerful method to detect therapeutic effects than the traditional summation method. Even though the number of subjects required to reach a conclusion seemed to be a bit higher using IRT (one more sequential analysis was needed), the trade-off between a small ASN and a satisfactory precision of the estimation of the treatment effect is an open question.

Finally, there are a number of challenges for medical statisticians using IRT that are worth mentioning. IRT was originally developed in educational research using samples of thousands or even tens of thousands. Such large sample sizes are very rarely (almost never) attained in medical research, where medical interventions are often assessed using fewer than 200 patients. The problem is even more crucial in the sequential analysis framework, where the first interim analysis is often performed on fewer patients. Moreover, IRT and associated estimation procedures are conceptually more difficult than the summation methods often used in medical research. Perhaps one of the biggest challenges for medical statisticians will be to explain these methods well enough so that clinical researchers will accept them and use them. As in all clinical research, but maybe even more in this context, there is a real need for good communication and collaboration between clinicians and statisticians.

### 6 References

Andersen, E. B. (1970) Asymptotic properties of conditional maximum likelihood estimators. J. R. Statist. Soc. B, 32, 283-301.

Andersen, E. B. (1977) Estimating the parameters of the latent population distribution. Psychometrika, 42, 357-374.

Anderson, T. W. (1960) A modification of the sequential probability ratio test to reduce the sample size. Ann. Math. Stat., 31, 165-197.

Andrich, D. (1978) A rating formulation for ordered response categories. Psychometrika, 43, 561-573.

Cannistra, S. A. (2004) The ethics of early stopping rules: who is protecting whom? J. Clin. Oncol., 22, 1542-1545.

Cella, D. F. and Bonomi, A. E. (1995) Measuring quality of life: 1995 update. Oncology, 9, 47-60.

Fairclough, D. L. (2002) Design and analysis of quality of life studies in clinical trials. Boca Raton: Chapman & Hall/CRC.

Fischer, G. H. and Molenaar, I. W. (1995) Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer-Verlag.

Grossman, J., Parmar, M. K., Spiegelhalter, D. J., Freedman, L. S. (1994) A unified method for monitoring and analysing controlled trials. Statist. Med., 13, 1815-1826.

Haberman, S. J. (1977) Maximum likelihood estimates in exponential response models. Ann. Statist., 5, 815-841.

Hamon, A. and Mesbah, M. (2002) Questionnaire reliability under the Rasch model. In Mesbah, M., Cole, B. F., Lee, M. L. T. (eds.) Statistical Methods for Quality of Life Studies: Design, Measurements and Analysis. Amsterdam: Kluwer.

Holman, R., Glas, C. A., and de Haan, R. J. (2003a) Power analysis in randomized clinical trials based on item response theory. Control. Clin. Trials, 24, 390-410.

Holman, R., Lindeboom, R., Glas, C. A. W., Vermeulen M., and de Haan, R. J. (2003b) Constructing an item bank using item response theory: the AMC linear disability score project. Health. Serv. Out. Res. Meth., 4, 19-33.

Jennison, C. and Turnbull, B. W. (1999) Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman & Hall/CRC.

Kosinski, M., Bjorner, J. B., Ware, J. E. Jr, Batenhorst, A., and Cady R. K. (2003) The responsiveness of headache impact scales scored using 'classical' and 'modern' psychometric methods: a re-analysis of three clinical trials. Qual. Life. Res., 12, 903-912.

Lacasse, Y., Wong, E., Guyatt, G. H., King, D., Cook, D. J., and Goldstein R. S. (1996) Meta-analysis of respiratory rehabilitation in chronic obstructive pulmonary disease. Lancet, 348, 1115-1119.

Lan, K. K. G. and De Mets, D. L. (1983) Discrete sequential boundaries for clinical trials. Biometrika, 70, 659-663.

Masters, G. N. (1982) A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

McHorney, C. A., Haley, S. M., and Ware, J.E. Jr. (1997) Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J. Clin. Epidemiol., 50, 451-461.

O'Brien, P. C. and Fleming, T. R. (1979) A multiple testing procedure for clinical trials. Biometrics, 35, 549-556.

Pinheiro, J. C. and Bates, D. M. (1995) Approximations to the log-likelihood function in the nonlinear mixed-effects model. J. Comput. Graph. Statist., 4, 12-35.

Pocock, S. J. (1977) Group sequential methods in the design and analysis of clinical trials. Biometrika, 64, 191-199.

Rasch, G. (1960) Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen & Lydiche. [Expanded edition, 1980, Chicago: The University of Chicago Press].

Sebille, V. and Bellissant, E. (2001) Comparison of the two-sided single triangular test to the double triangular test. Control. Clin. Trials, 22, 503-514.

Siegmund, D. (1979) Corrected diffusion approximations in certain random walk problems. Adv. Appl. Probab., 11, 701-719.

Thissen, D. (1982) Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.

Wald, A. (1947) Sequential Analysis. New York: Wiley.

Whitehead, J. and Jones, D. R. (1979) The analysis of sequential clinical trials. Biometrika, 66, 443-452.

Whitehead, J. and Stratton, I. (1983) Group sequential clinical trials with triangular continuation regions. Biometrics, 39, 227-236.

Whitehead, J. (1997) The Design and Analysis of Sequential Clinical Trials, revised 2nd edition. Chichester, U.K.:Wiley.

### 7 Appendix 1

The first derivative of the log likelihood with respect to $\sigma$ is: