## Tests of Fit based on Products of Spacings

Paul Deheuvels¹ and Gérard Derzko²

¹ L.S.T.A., Université Paris VI, 7 avenue du Château, F 92340 Bourg-la-Reine, France [email protected]

² Sanofi-Synthélabo Recherche, 371 rue du Professeur Joseph Blayac, 34184 Montpellier Cedex 04, France [email protected]

Summary. Let $Z = Z_1,\ldots,Z_n$ be an i.i.d. sample from the distribution $F(z) = P(Z \le z)$ with density $f(z) = \frac{d}{dz}F(z)$. Let $Z_{1,n} < \ldots < Z_{n,n}$ be the order statistics generated by $Z_1,\ldots,Z_n$. Let $Z_{0,n} = a = \inf\{z : F(z) > 0\}$ and $Z_{n+1,n} = b = \sup\{z : F(z) < 1\}$ denote the end-points of the common distribution of these observations, and assume that $f$ is continuous and positive on $(a,b)$. We establish the asymptotic normality of the sum of logarithms of the spacings $Z_{i,n} - Z_{i-1,n}$, for $i = 1,\ldots,n+1$, under minimal additional conditions on $f$. Our results largely extend previous results in the literature due to Blumenthal [Blu68] and other authors.

## 1 Introduction and Main Results.

### 1.1 Introduction.

Let $Z = Z_1, Z_2, \ldots$ be independent and identically distributed [i.i.d.] random variables with distribution function $F(z) = P(Z \le z)$ and density $f(z)$, assumed throughout to be continuous and positive on $(a,b)$ for $-\infty \le a < b \le \infty$, and equal to 0 otherwise. Here, $a = \inf\{z : F(z) > 0\}$ and $b = \sup\{z : F(z) < 1\}$ denote the distribution end-points. For each $n \ge 1$, denote by $a < Z_{1,n} < \ldots < Z_{n,n} < b$ the order statistics of $Z_1,\ldots,Z_n$, and set, for convenience, $Z_{0,n} = a$ and $Z_{n+1,n} = b$, with $F(Z_{0,n}) = F(a) = 0$ and $F(Z_{n+1,n}) = F(b) = 1$, for $n \ge 0$. Denote by

$$D_{i,n} = Z_{i,n} - Z_{i-1,n} \quad \text{for } i = 1,\ldots,n+1, \tag{1}$$

the spacings of order $n \ge 1$ based upon $\{Z_{i,n} : 1 \le i \le n\}$. Darling [Dar53] introduced the class of statistics

$$T_n = T_n(p,q) = \sum_{i=p}^{n-q+1} \{-\log(n D_{i,n})\} = -\log\Big(n^{\,n-p-q+2} \prod_{i=p}^{n-q+1} D_{i,n}\Big), \tag{2}$$

to test the null hypothesis (when $-\infty < a < b < \infty$)

(H.0) $f(z) = (b-a)^{-1}\mathbf{1}_{(a,b)}(z)$ for $a < z < b$, against the alternative (H.1) that $f$ is arbitrary on $(a,b)$. In (2), $p$ and $q$ are fixed integers such that $1 \le p \le n-q+1 \le n$. When the distribution end-points $a$ and $b$ are finite and known, a standard choice for $p$ and $q$ is given by $p = q = 1$. On the other hand, when $a$ (resp. $b$) is unknown (or possibly infinite), $D_{1,n}$ (resp. $D_{n+1,n}$) is unknown (or possibly infinite), so it is more appropriate to choose $p \ge 2$ (resp. $q \ge 2$), otherwise $T_n(p,q)$ becomes meaningless. The aim of the present paper is to investigate the limiting behavior of $T_n = T_n(p,q)$ as $n \to \infty$. It will become obvious later on that the results we shall obtain are essentially independent of the choices of $p$, $q$, subject to the restrictions that

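$$p \ge p_0 = \begin{cases} 1 & \text{when } a > -\infty, \\ 2 & \text{when } a = -\infty, \end{cases} \quad\text{and}\quad q \ge q_0 = \begin{cases} 1 & \text{when } b < \infty, \\ 2 & \text{when } b = \infty. \end{cases} \tag{3}$$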

Because of this, we will use throughout the notation Tn = Tn(p, q), and specify the values of p, q only in case of need.

Under rather stringent regularity assumptions on $f$ (assuming, in particular, that $f$ is twice differentiable on $(a,b)$; see, e.g., (2.3a) in [Blu68]), implying finiteness of $\mathrm{Var}(\log f(Z))$, Blumenthal [Blu68] (see also Cressie [Cre76]) showed that, as $n \to \infty$,

$$n^{-1/2}\{T_n - n\gamma - nE(\log f(Z))\} \xrightarrow{d} N\big(0,\ \zeta(2) - 1 + \mathrm{Var}(\log f(Z))\big), \tag{4}$$

where "$\xrightarrow{d}$" denotes weak convergence. In (4), $\zeta(\cdot)$ and $\gamma$ denote, respectively, the Riemann zeta function and Euler's constant, conveniently defined by

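$$\zeta(2) = \frac{\pi^2}{6}, \qquad \gamma = \int_0^\infty (-\log t)\,e^{-t}\,dt = \lim_{r \downarrow 1}\Big\{\zeta(r) - \frac{1}{r-1}\Big\} = \lim_{n\to\infty}\Big\{\sum_{j=1}^{n}\frac{1}{j} - \log n\Big\} = 0.577215\ldots, \tag{5}$$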

(see, e.g., Spanier and Oldham [SO87]). Here, $\Gamma(\cdot)$ stands for Euler's Gamma function, namely

$$\Gamma(r) = \int_0^\infty t^{r-1}e^{-t}\,dt \quad \text{for } r > 0. \tag{6}$$
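The constants entering the centering and the limiting variance are easy to check numerically. The following Python sketch, with hypothetical helper names, approximates Euler's constant by the harmonic-sum limit $\sum_{j \le n} 1/j - \log n$ and compares a partial sum of $\sum_j j^{-2}$ with $\pi^2/6$.

```python
import math


def euler_gamma_from_harmonic(n):
    """Approximate gamma by sum_{j<=n} 1/j - log n (error of order 1/(2n))."""
    return sum(1.0 / j for j in range(1, n + 1)) - math.log(n)


def zeta2_partial(n):
    """Partial sum of zeta(2) = sum_{j>=1} 1/j^2 (tail of order 1/n)."""
    return sum(1.0 / (j * j) for j in range(1, n + 1))


gamma_approx = euler_gamma_from_harmonic(10**6)
zeta2_exact = math.pi ** 2 / 6
limit_variance_h0 = zeta2_exact - 1.0    # variance constant when log f is constant
```

Both approximations converge slowly, so in practice one would use the tabulated value of $\gamma$ and the closed form $\pi^2/6$.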

One of the purposes of the present paper is to give simple conditions implying the validity of (4). Our main result concerning this problem is stated in the following theorem.

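Theorem 1.1 Assume that

$$E\big((\log f(Z))^2\big) < \infty, \tag{7}$$

and either

(i) $f$ is continuous and bounded away from 0 on $[a,b]$; or

(ii) $f$ is monotone in a right neighborhood of $a$, and monotone in a left neighborhood of $b$.

Then, for each $p \ge p_0$ and $q \ge q_0$, we have, as $n \to \infty$,

$$n^{-1/2}\{T_n(p,q) - n\gamma - nE(\log f(Z))\} \xrightarrow{d} N\big(0,\ \zeta(2) - 1 + \mathrm{Var}(\log f(Z))\big). \tag{8}$$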
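A small Monte Carlo experiment illustrates the normalization used in Theorem 1.1. The Python sketch below is restricted, as an assumption made for simplicity, to the uniform density on $(0,1)$ with $p = q = 1$, for which $\log f = 0$, so that the centering reduces to $n\gamma$ and the limiting variance suggested by (4) reduces to $\zeta(2) - 1 \approx 0.645$. All names, the seed and the sample sizes are hypothetical choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329
ZETA2 = math.pi ** 2 / 6


def t_n_uniform(n, rng):
    """T_n(1, 1) for n uniform(0, 1) observations (a = 0, b = 1)."""
    z = [0.0] + sorted(rng.random() for _ in range(n)) + [1.0]
    return -sum(math.log(n * (z[i] - z[i - 1])) for i in range(1, n + 2))


def standardized_draws(n, reps, seed=12345):
    """Draws of n^{-1/2} {T_n - n*gamma - n*E(log f(Z))} under (H.0) on (0, 1).

    For the uniform density, log f = 0, hence E(log f(Z)) = Var(log f(Z)) = 0.
    """
    rng = random.Random(seed)
    return [(t_n_uniform(n, rng) - n * EULER_GAMMA) / math.sqrt(n)
            for _ in range(reps)]


draws = standardized_draws(n=200, reps=400)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
limit_var = ZETA2 - 1.0      # zeta(2) - 1 + Var(log f(Z)) with Var(log f(Z)) = 0
```

The empirical mean and variance of `draws` should be close to 0 and `limit_var`, up to Monte Carlo error and a finite-sample bias of order $n^{-1/2}$ in the mean. For a non-uniform density one would replace the sampler and subtract $nE(\log f(Z))$ as well.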

### 1.2 Some Relations with the Kullback-Leibler Information.

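The limiting result in (8) is related to the Kullback-Leibler information in the following way. In general, for any two random variables $Y_0$ and $Y_1$ with densities $g_0$ and $g_1$ on $\mathbb{R}$, with respect to the Lebesgue measure, the Kullback-Leibler information $K(g_1,g_0)$ of $g_1$ with respect to $g_0$ is defined by (with the convention $0/0 = 1$)

$$K(g_1,g_0) = E\Big(\log\frac{g_1(Y_1)}{g_0(Y_1)}\Big) = \int_{\mathbb{R}} \log\Big(\frac{g_1(y)}{g_0(y)}\Big)\,g_1(y)\,dy, \tag{9}$$

when $g_1(y)dy \ll g_0(y)dy$ (which we denote by $g_1 \ll g_0$), and

$$K(g_1,g_0) = \infty \quad \text{otherwise}. \tag{10}$$

The well-known property that

$$K(g_1,g_0) \ge 0, \tag{11}$$

with equality if and only if $g_1 = g_0$ a.e., follows from the fact that the function

$$h(x) = \begin{cases} x\log x - x + 1 & \text{for } x \ge 0 \ (\text{with } 0\log 0 = 0), \\ \infty & \text{for } x < 0, \end{cases} \tag{12}$$

fulfills $h(x) \ge 0$ with equality if and only if $x = 1$. This, in turn, implies that

$$K(g_1,g_0) = \int_{\mathbb{R}} h\Big(\frac{g_1(y)}{g_0(y)}\Big)\,g_0(y)\,dy \ge 0, \tag{13}$$

with equality if and only if $g_1 = g_0$ a.e. (with $g_1 \ll g_0$). The inequality (13) also holds when $g_1 \not\ll g_0$, since then, by definition, $K(g_1,g_0) = \infty$. By applying (13) to $g_1 = f$ and $g_0 = (b-a)^{-1}\mathbf{1}_{(a,b)}$ when $-\infty < a < b < \infty$, we see that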

$$K\big(f, (b-a)^{-1}\mathbf{1}_{(a,b)}\big) = \int_a^b f(z)\log f(z)\,dz + \log(b-a) \tag{14}$$

$$= E(\log f(Z)) + \log(b-a) \ge 0, \tag{15}$$

with equality if and only if (H.0) holds, namely, when $f(t) = (b-a)^{-1}$ a.e. on $(a,b)$. When the constants $a$ and $b$ are unknown (but finite), we may estimate these quantities by $Z_{1,n}$ and $Z_{n,n}$, respectively. Under (H.0), it is straightforward that, as $n \to \infty$,

$$Z_{1,n} = a + O_P(1/n) > a \quad\text{and}\quad Z_{n,n} = b + O_P(1/n) < b. \tag{16}$$
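To make the quantity in (14) concrete, the Kullback-Leibler information of a density $f$ on $(a,b)$ with respect to the uniform density can be approximated by numerical integration of $\int_a^b f\log f\,dz + \log(b-a)$. The Python sketch below uses a plain midpoint rule; the helper name `kl_to_uniform` and the test density $f(z) = 2z$ on $(0,1)$ are assumptions chosen for illustration. For this density the closed form is $\log 2 - 1/2$.

```python
import math


def kl_to_uniform(f, a, b, m=100_000):
    """Midpoint-rule approximation of int_a^b f log f dz + log(b - a)."""
    h = (b - a) / m
    total = 0.0
    for k in range(m):
        fz = f(a + (k + 0.5) * h)
        if fz > 0.0:                          # convention 0 log 0 = 0
            total += fz * math.log(fz) * h
    return total + math.log(b - a)


k_uniform = kl_to_uniform(lambda z: 1.0, 0.0, 1.0)       # uniform density: (H.0) holds
k_linear = kl_to_uniform(lambda z: 2.0 * z, 0.0, 1.0)    # f(z) = 2z on (0, 1)
k_closed_form = math.log(2.0) - 0.5                      # exact value for f(z) = 2z
```

The value is 0 under (H.0) and strictly positive otherwise, which is the property exploited by the tests discussed next.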

By (16), the test rejecting (H.0) when either (for a and b specified)

$$T_n > c^{*}_{n,\alpha} := n\gamma - n\log(b-a) + n^{1/2}\nu_\alpha\Big\{\frac{\pi^2}{6} - 1\Big\}^{1/2}, \tag{17}$$

or (for a and b unspecified)

$$T_n > c^{**}_{n,\alpha} := n\gamma - n\log(Z_{n,n} - Z_{1,n}) + n^{1/2}\nu_\alpha\Big\{\frac{\pi^2}{6} - 1\Big\}^{1/2}, \tag{18}$$

where $\nu_\alpha$ denotes the upper quantile of order $\alpha \in (0,1)$ of the normal $N(0,1)$ law, is asymptotically consistent, with size tending to $\alpha$ as $n \to \infty$, against all alternatives for which (4) is satisfied. Moreover, the obvious inequality $Z_{n,n} - Z_{1,n} < b - a$ implies that $c^{**}_{n,\alpha} > c^{*}_{n,\alpha}$, so that we have always

$$P(T_n > c^{**}_{n,\alpha}) \le P(T_n > c^{*}_{n,\alpha}). \tag{19}$$

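The exact critical value $c_{n,\alpha} = c^{*}_{n,\alpha} + o(n^{1/2}) = c^{**}_{n,\alpha} + o(n^{1/2})$ defined by

$$P\big(T_n > c_{n,\alpha} \mid (\mathrm{H}.0)\big) = \alpha, \tag{20}$$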

can be computed, making use of the methods of Deheuvels and Derzko [DD03], who described several methods to evaluate numerically the distribution of Tn under (H.0). In particular, they gave a simple proof of the fact that, under (H.0), with a = 0, b = 1 and p = q = 1,

$$E(\exp(sT_n)) = \Gamma(1-s)^{n+1}\Big\{\frac{n^{-(n+1)s}\,\Gamma(n+1)}{\Gamma((n+1)(1-s))}\Big\} \quad \text{for } s < 1. \tag{21}$$
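The moment-generating function (21) is convenient to evaluate on the log scale. The Python sketch below assumes the form $E\exp(sT_n) = \Gamma(1-s)^{n+1}\,n^{-(n+1)s}\,\Gamma(n+1)/\Gamma((n+1)(1-s))$, which is what the Dirichlet$(1,\ldots,1)$ distribution of the $n+1$ uniform spacings gives for $p = q = 1$. The function names are hypothetical. As a sanity check, differentiating the logarithm at $s = 0$ should return $E(T_n) = (n+1)\{\sum_{j \le n} 1/j - \log n\}$.

```python
import math


def log_mgf_h0(s, n):
    """log E exp(s T_n) under (H.0) with a = 0, b = 1, p = q = 1, for s < 1."""
    if s >= 1:
        raise ValueError("the moment-generating function is finite only for s < 1")
    return ((n + 1) * math.lgamma(1 - s)      # Gamma(1 - s)^(n + 1)
            - (n + 1) * s * math.log(n)       # n^(-(n + 1) s)
            + math.lgamma(n + 1)              # Gamma(n + 1)
            - math.lgamma((n + 1) * (1 - s)))


def mean_t_n_h0(n, eps=1e-5):
    """E(T_n) under (H.0), via a central difference of log_mgf_h0 at s = 0."""
    return (log_mgf_h0(eps, n) - log_mgf_h0(-eps, n)) / (2 * eps)


# For n = 1, T_1 = -log(U) - log(1 - U), so E exp(-T_1) = E[U (1 - U)] = 1/6.
mgf_n1 = math.exp(log_mgf_h0(-1.0, 1))
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow for moderately large $n$.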

We note that a version of (21) was obtained originally by Darling [Dar53] by different methods.
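The asymptotic tests (17)-(18) may be sketched as follows. The names `critical_value` and `spacings_test` are hypothetical. The sketch relies on two assumptions: the scale factor is $\{\pi^2/6 - 1\}^{1/2}$, as in the limiting variance of (4) under (H.0), and, when $a$ and $b$ are unspecified, the statistic is computed with $p = q = 2$, so that only the spacings between consecutive observations are used.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def critical_value(n, alpha, length):
    """n*gamma - n*log(length) + sqrt(n) * nu_alpha * sqrt(pi^2/6 - 1)."""
    nu_alpha = NormalDist().inv_cdf(1.0 - alpha)    # upper alpha-quantile of N(0, 1)
    scale = math.sqrt(math.pi ** 2 / 6 - 1.0)
    return n * EULER_GAMMA - n * math.log(length) + math.sqrt(n) * nu_alpha * scale


def spacings_test(sample, alpha=0.05, a=None, b=None):
    """Return (T_n, critical value, reject?) for the asymptotic test of (H.0)."""
    z = sorted(sample)
    n = len(z)
    if a is not None and b is not None:
        pts = [a] + z + [b]                  # p = q = 1, critical value as in (17)
        crit = critical_value(n, alpha, b - a)
    else:
        pts = z                              # p = q = 2, critical value as in (18)
        crit = critical_value(n, alpha, z[-1] - z[0])
    t_n = -sum(math.log(n * (hi - lo)) for lo, hi in zip(pts, pts[1:]))
    return t_n, crit, t_n > crit
```

These critical values are only asymptotic approximations. For small $n$ the exact critical value under (H.0) would have to be obtained from the distribution of $T_n$ itself.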

Unfortunately, the consistency of tests of the form (17)-(18), rejecting (H.0) for values of $T_n$ exceeding $c^{*}_{n,\alpha}$ or $c^{**}_{n,\alpha}$, is known to hold only for the rather narrow alternative class of density functions $f(\cdot)$ described in [Blu68] as sufficient to imply (4). One of the purposes of the present paper is to overcome this drawback by extending the validity of (4) to a wider class of distributions. Theorem 1.1 above provides this result by giving a new proof of (4), under much weaker conditions than those imposed by Blumenthal [Blu68] and Cressie [Cre76]. In the sequel, we will limit ourselves, unless otherwise specified, to giving details of the proof in the case where $-\infty < a < b < \infty$, and we will then set $a = 0$ and $b = 1$ without loss of generality. The following proposition, which will turn out to be an easy consequence of Theorem 1.1, gives an example of how these results apply in the present framework.

Proposition 1.1 Let f be continuous and positive on (a,b), and either:

(ii) monotone in a right neighborhood of $a$, monotone in a left neighborhood of $b$, and such that, for some $\varepsilon > 0$,

Proof. We observe that the conditions (23), (22) and (24) readily imply that $E((\log f(Z))^2) < \infty$. Therefore, the proposition is a direct consequence of Theorem 1.1. □

Example 1.1 Let $F(x) = 1/(\log(e/x))^r$ for $0 < x \le 1$ and $r > 0$. Obviously, $f(x) = r/(x(\log(e/x))^{r+1})$ and $\log f(x) = (1 + o(1))\log(e/x)$ as $x \downarrow 0$. Thus,

$$E\big((\log f(X))^2\big) < \infty \iff r > 2.$$

This shows the sharpness of the conditions in Proposition 1.1, since the finiteness of $E((\log f(X))^2)$ is a minimal requirement for (24) to hold. The arguments used in our proofs, given in the next section, mix the methods of Deheuvels and Derzko [DD03] with classical empirical process arguments.


## 2 Proofs.

### 2.1 A Useful Theorem.

We start by proving the following useful theorem, of independent interest.

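Theorem 2.1 Assume that $E((\log f(Z))^2) < \infty$. Then, for each $p \ge p_0$ and $q \ge q_0$, we have, as $n \to \infty$,

$$n^{-1/2}\sum_{i=p}^{n-q+1}\Big[-\log\Big\{\frac{n\,(F(Z_{i,n}) - F(Z_{i-1,n}))}{f(Z_{i,n})}\Big\} - \gamma - E(\log f(Z))\Big] \xrightarrow{d} N\big(0,\ \zeta(2) - 1 + \mathrm{Var}(\log f(Z))\big). \tag{25}$$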

Remark 2.1 It will become obvious from the arguments given later on that the conclusion (25) of Theorem 2.1 remains valid when we formally replace, in (25), $F(Z_{i,n}) - F(Z_{i-1,n})$ by $F(Z_{i+1,n}) - F(Z_{i,n})$.

The remainder of the present sub-section is devoted to proving Theorem 2.1 in the case $p = 2$ and $q = 1$. The proof for arbitrary $p \ge p_0$ and $q \ge q_0$ is very similar, and left to the reader. We will show later on how Theorem 2.1 may be applied to prove Theorem 1.1. Below, the following notation will be in force. We will set $U_{0,n} = F(Z_{0,n}) = 0$ and $U_{n+1,n} = F(Z_{n+1,n}) = 1$, for each $n \ge 0$, and let $0 < U_{1,n} = F(Z_{1,n}) < \ldots < U_{n,n} = F(Z_{n,n}) < 1$ denote the order statistics of the first $n \ge 1$ observations from the i.i.d. sequence $U_1 = F(Z_1), U_2 = F(Z_2), \ldots$, of uniform $(0,1)$ random variables, defined on the probability space $(\Omega, \mathcal{A}, P)$ on which sit $Z_1, Z_2, \ldots$, as given in §1.

We set $Y_n = -\log(1 - U_n) = -\log(1 - F(Z_n))$ for $n = 1, 2, \ldots$, and observe that these random variables form an i.i.d. sequence of exponentially distributed random variables. Moreover, setting $Y_{0,n} = -\log(1 - F(Z_{0,n})) = 0$ for $n \ge 0$, the order statistics

$$Y_{0,n} = 0 < Y_{1,n} = -\log(1 - F(Z_{1,n})) < \ldots < Y_{n,n} = -\log(1 - F(Z_{n,n})),$$

$$Y_{i,n} = -\log(1 - U_{i,n}) = -\log(1 - F(Z_{i,n})) \quad \text{for } 0 \le i \le n. \tag{27}$$

Set now = $(n - i + 1)(Y_{i,n} - Y_{i-1,n})$ for $1 \le i \le n$, so that