## Efficient influence functions

In this section we calculate the efficient influence function for estimating the expected value $E[h(X, Y)]$ with observations $(X, ZY, Z)$ as described in the Introduction. The joint distribution $P(dx, dy, dz)$ of the observations depends on the marginal distribution $G(dx)$ of $X$, the conditional probability $\pi(x)$ of $Z = 1$ given $X = x$, and the conditional distribution $Q(x, dy)$ of $Y$ given $X = x$. More precisely, we have

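$$P(dx, dy, dz) = G(dx)\, B_{\pi(x)}(dz)\, \bigl(z\, Q(x, dy) + (1 - z)\, \delta_0(dy)\bigr),$$

where $B_p = p\,\delta_1 + (1 - p)\,\delta_0$ denotes the Bernoulli distribution with parameter $p$ and $\delta_t$ the Dirac measure at $t$. Consider perturbations $G_{nu}$, $Q_{nv}$ and $\pi_{nw}$ of $G$, $Q$ and $\pi$ that are Hellinger differentiable in the following sense:

$$\int \Bigl(n^{1/2}\bigl(dG_{nu}^{1/2} - dG^{1/2}\bigr) - \tfrac{1}{2}\, u\, dG^{1/2}\Bigr)^2 \to 0,$$

$$\iint \Bigl(n^{1/2}\bigl(dQ_{nv}^{1/2}(x, \cdot) - dQ^{1/2}(x, \cdot)\bigr) - \tfrac{1}{2}\, v(x, \cdot)\, dQ^{1/2}(x, \cdot)\Bigr)^2 G(dx) \to 0,$$

$$\iint \Bigl(n^{1/2}\bigl(dB_{\pi_{nw}(x)}^{1/2} - dB_{\pi(x)}^{1/2}\bigr) - \tfrac{1}{2}\, (\cdot - \pi(x))\, w(x)\, dB_{\pi(x)}^{1/2}\Bigr)^2 G(dx) \to 0.$$

This requires that $u$ belongs to

$$L_{2,0}(G) = \Bigl\{u \in L_2(G) : \int u\, dG = 0\Bigr\};$$

that $v$ belongs to

$$V_0 = \Bigl\{v \in L_2(M) : \int v(x, y)\, Q(x, dy) = 0\Bigr\},$$

with $M(dx, dy) = Q(x, dy)\,G(dx)$; and that $w$ belongs to $L_2(G_\pi)$, where $G_\pi(dx) = \pi(x)(1 - \pi(x))\, G(dx)$.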

We have local asymptotic normality: with $P_{nuvw}$ denoting the joint distribution of the observations $(X, ZY, Z)$ under the perturbed parameters $G_{nu}$, $Q_{nv}$ and $\pi_{nw}$,

$$\sum_{i=1}^n \log \frac{dP_{nuvw}}{dP}(X_i, Z_iY_i, Z_i) = n^{-1/2} \sum_{i=1}^n t_{uvw}(X_i, Z_iY_i, Z_i) - \tfrac{1}{2}\, E[t_{uvw}^2(X, ZY, Z)] + o_p(1),$$

where $t_{uvw}(X, ZY, Z) = u(X) + Zv(X, Y) + (Z - \pi(X))w(X)$ and

$$E[t_{uvw}^2(X, ZY, Z)] = E[u^2(X)] + E[Zv^2(X, Y)] + E[(Z - \pi(X))^2 w^2(X)] = \int u^2\, dG + \iint \pi(x)\, v^2(x, y)\, Q(x, dy)\,G(dx) + \int w^2\, dG_\pi.$$
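The second moment of $t_{uvw}$ splits into three terms because its three components are mutually orthogonal. The following Monte Carlo sketch checks this numerically. The toy model is purely an assumption for illustration and does not come from the text: $X$ uniform on $(0, 1)$, $\pi(x) = 0.3 + 0.5x$, $Y$ given $X = x$ normal with mean $x$ and unit variance, and the hypothetical directions $u(x) = x - 1/2$, $v(x, y) = y - x$ and $w(x) = 1$.

```python
import random


def check_variance_decomposition(n=200_000, seed=0):
    """Monte Carlo check of E[t^2] = E[u^2] + E[Z v^2] + E[(Z - pi)^2 w^2].

    Toy model (an assumption for illustration only):
    X ~ Uniform(0, 1), pi(x) = 0.3 + 0.5 x, Y | X = x ~ N(x, 1).
    """
    rng = random.Random(seed)
    lhs = 0.0                # accumulates t^2
    parts = [0.0, 0.0, 0.0]  # accumulate the three squared components
    for _ in range(n):
        x = rng.random()
        pi = 0.3 + 0.5 * x
        z = 1 if rng.random() < pi else 0
        y = rng.gauss(x, 1.0)
        t1 = x - 0.5         # u(X), mean zero under G
        t2 = z * (y - x)     # Z v(X, Y), v has conditional mean zero
        t3 = (z - pi) * 1.0  # (Z - pi(X)) w(X) with w = 1
        lhs += (t1 + t2 + t3) ** 2
        parts[0] += t1 * t1
        parts[1] += t2 * t2
        parts[2] += t3 * t3
    return lhs / n, [p / n for p in parts]


second_moment, components = check_variance_decomposition()
print(second_moment, sum(components))
```

Under this toy model the three terms are $1/12$, $E[\pi(X)] = 0.55$ and $E[\pi(X)(1 - \pi(X))] \approx 0.227$, so both printed numbers should be close to $0.86$ up to simulation error.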

If we have models for the parameters $G$, $Q$ and $\pi$, then, in order for the perturbations $G_{nu}$, $Q_{nv}$ and $\pi_{nw}$ to be within these models, the functions $u$, $v$ and $w$ must be restricted to subsets $U$ of $L_{2,0}(G)$, $V$ of $V_0$, and $W$ of $L_2(G_\pi)$. The choices $U = L_{2,0}(G)$ and $V = V_0$ correspond to fully nonparametric models for $G$ and $Q$. Parametric models for $G$ and $Q$ result in finite-dimensional $U$ and $V$. In what follows the spaces $U$, $V$ and $W$ will be assumed to be closed and linear.

Let now $\kappa$ be a functional of $G$, $Q$ and $\pi$. The functional is differentiable with gradient $g \in L_2(P)$ if, for all $u \in U$, $v \in V$ and $w \in W$,

$$n^{1/2}\bigl(\kappa(G_{nu}, Q_{nv}, \pi_{nw}) - \kappa(G, Q, \pi)\bigr) \to E[g(X, ZY, Z)\, t_{uvw}(X, ZY, Z)].$$

The gradient $g$ is not unique. The canonical gradient is $g_*$, where $g_*(X, ZY, Z)$ is the projection of $g(X, ZY, Z)$ onto the tangent space

$$T = \{t_{uvw}(X, ZY, Z) : u \in U,\ v \in V,\ w \in W\}.$$

Since $T$ is a sum of orthogonal spaces

$$T_1 = \{u(X) : u \in U\}, \quad T_2 = \{Zv(X, Y) : v \in V\}, \quad T_3 = \{(Z - \pi(X))w(X) : w \in W\},$$

the random variable $g_*(X, ZY, Z)$ is the sum

$$g_*(X, ZY, Z) = u_*(X) + Zv_*(X, Y) + (Z - \pi(X))w_*(X),$$

where $u_*(X)$, $Zv_*(X, Y)$ and $(Z - \pi(X))w_*(X)$ are the projections of the random variable $g(X, ZY, Z)$ onto $T_1$, $T_2$ and $T_3$, respectively. We assume that $E[g_*^2(X, ZY, Z)]$ is positive.

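An estimator $\hat\kappa$ for $\kappa$ is regular with limit $L$ if $L$ is a random variable such that, for all $u \in U$, $v \in V$ and $w \in W$,

$$n^{1/2}\bigl(\hat\kappa - \kappa(G_{nu}, Q_{nv}, \pi_{nw})\bigr) \Rightarrow L \quad \text{under } P_{nuvw}.$$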

The Hájek-Le Cam convolution theorem says that $L$ is distributed as the sum of a normal random variable with mean zero and variance $E[g_*^2(X, ZY, Z)]$ and some independent random variable. This justifies calling an estimator $\hat\kappa$ efficient if it is regular and its limit is such a normal random variable.

An estimator $\hat\kappa$ for $\kappa$ is asymptotically linear with influence function $\psi \in L_{2,0}(P)$ if

$$n^{1/2}\bigl(\hat\kappa - \kappa(G, Q, \pi)\bigr) = n^{-1/2} \sum_{i=1}^n \psi(X_i, Z_iY_i, Z_i) + o_p(1).$$

As a consequence of the convolution theorem, a regular estimator is efficient if and only if it is asymptotically linear with influence function $g_*$. A reference for the convolution theorem and the characterization is Bickel, Klaassen, Ritov and Wellner [BKRW98].

We are interested in estimating

$$\kappa(G, Q, \pi) = E[h(X, Y)] = \iint h(x, y)\, Q(x, dy)\,G(dx) = \int h\, dM.$$

Let $M_{nuv}(dx, dy) = Q_{nv}(x, dy)\,G_{nu}(dx)$. Then $M_{nuv}$ is Hellinger differentiable in the following sense:

$$\int \Bigl(n^{1/2}\bigl(dM_{nuv}^{1/2} - dM^{1/2}\bigr) - \tfrac{1}{2}\, t\, dM^{1/2}\Bigr)^2 \to 0$$

with $t(x, y) = u(x) + v(x, y)$. If $M_{nuv}$ satisfies $\limsup_n \int h^2\, dM_{nuv} < \infty$, then

$$n^{1/2}\Bigl(\int h\, dM_{nuv} - \int h\, dM\Bigr) \to E[h(X, Y)(u(X) + v(X, Y))];$$

see e.g. Ibragimov and Has'minskii [IH81], p. 67, Lemma 7.2. Thus the canonical gradient of $E[h(X, Y)]$ is determined by

$$E[u_*(X)u(X)] + E[Zv_*(X, Y)v(X, Y)] + E[(Z - \pi(X))^2 w_*(X)w(X)] = E[h(X, Y)(u(X) + v(X, Y))]$$

for all $u \in U$, $v \in V$ and $w \in W$. Setting first $u = 0$ and $v = 0$, we see that $w_* = 0$. Setting $v = 0$, we see that $u_*(X)$ is the projection of $h(X, Y)$ onto $T_1$. Taking $u = 0$, we see that the projection of $Zv_*(X, Y)$ onto $\tilde V = \{v(X, Y) : v \in V\}$ must equal the projection of $h(X, Y)$ onto $\tilde V$.

We are mainly interested in a fully nonparametric model for $G$, for which $U = L_{2,0}(G)$. Then $u_*(X) = \chi(X) - E[\chi(X)]$, where $\chi(X) = E(h(X, Y) \mid X)$. We now give explicit formulas for $v_*$, and hence for the canonical gradient of $E[h(X, Y)]$, in four cases: fully nonparametric conditional distribution, with $V = V_0$; parametric conditional distribution, with $V$ finite-dimensional; and two semiparametric models, namely linear regression with and without independence of covariate and error.

1. Nonparametric conditional distribution. If $V = V_0$, then the projections of $h(X, Y)$ and $Zv_*(X, Y)$ onto $\tilde V$ are $h(X, Y) - \chi(X)$ and $\pi(X)v_*(X, Y)$. Thus

$$v_*(X, Y) = \frac{h(X, Y) - \chi(X)}{\pi(X)}.$$

Hence, if $U = L_{2,0}(G)$, the canonical gradient of $E[h(X, Y)]$ is

$$\psi_{np}(X, ZY, Z) = \chi(X) - E[\chi(X)] + \frac{Z}{\pi(X)}\bigl(h(X, Y) - \chi(X)\bigr).$$

For the important special case $h(X, Y) = Y$ we obtain

$$\psi_{np}(X, ZY, Z) = E(Y \mid X) - E[Y] + \frac{Z}{\pi(X)}\bigl(Y - E(Y \mid X)\bigr).$$
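To make the nonparametric gradient for $h(X, Y) = Y$ concrete, here is a minimal simulation sketch. It averages $E(Y \mid X) + Z\,(Y - E(Y \mid X))/\pi(X)$ using the true $\pi$ and the true regression function, which is an oracle shortcut assumed only for illustration; in practice both would have to be estimated. The toy model (uniform $X$, $\pi(x) = 0.3 + 0.5x$, $E(Y \mid X = x) = x$, unit error variance) and all function names are hypothetical.

```python
import random


def pi(x):
    # Assumed known response probability P(Z = 1 | X = x).
    return 0.3 + 0.5 * x


def cond_mean(x):
    # Assumed known regression function E(Y | X = x).
    return x


def simulate(n=200_000, seed=1):
    """Return observations (X, ZY, Z); Y enters only through the product ZY."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        z = 1 if rng.random() < pi(x) else 0
        y = rng.gauss(cond_mean(x), 1.0)
        data.append((x, z * y, z))
    return data


def oracle_estimate(data):
    """Average of chi(X) + Z / pi(X) * (Y - chi(X)) with the true pi and chi."""
    terms = [cond_mean(x) + z / pi(x) * (zy - cond_mean(x)) for x, zy, z in data]
    return sum(terms) / len(terms), terms


data = simulate()
estimate, terms = oracle_estimate(data)
# Sample variance of the summands approximates E[psi_np^2].
variance = sum((t - estimate) ** 2 for t in terms) / len(terms)
# Naive mean of the observed responses only, for comparison.
complete_case = sum(zy for _, zy, _ in data) / sum(z for _, _, z in data)
print(estimate, variance, complete_case)
```

In this toy model $E[Y] = 0.5$ and $E[\psi_{np}^2] = 1/12 + 2\log(8/3) \approx 2.045$, while the complete-case mean converges to $E[\pi(X)X]/E[\pi(X)] \approx 0.576$, so the sketch also illustrates the bias from ignoring the missingness mechanism.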

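2. Parametric conditional distribution. Let $Q(x, dy) = q_\vartheta(x, y)\, dy$, where $\vartheta$ is an $m$-dimensional parameter. In this case, $V$ will be the span of the components of the score function $\ell_\vartheta$, the Hellinger derivative of the parametric model $q_\vartheta$ at $\vartheta$:

$$\iint \Bigl(q_{\vartheta + t}^{1/2}(x, y) - q_\vartheta^{1/2}(x, y) - \tfrac{1}{2}\, t^\top \ell_\vartheta(x, y)\, q_\vartheta^{1/2}(x, y)\Bigr)^2 dy\, G(dx) = o(|t|^2).$$

We also assume that $E[Z\ell_\vartheta(X, Y)\ell_\vartheta(X, Y)^\top]$ is positive definite. If $q_\vartheta$ is differentiable in $\vartheta$, then $\ell_\vartheta = \dot q_\vartheta / q_\vartheta$, where $\dot q_\vartheta$ is the derivative of $q_\vartheta$ with respect to $\vartheta$. If we set $L = \ell_\vartheta(X, Y)$, then $\tilde V = \{c^\top L : c \in \mathbb{R}^m\}$. Thus $v_*(X, Y)$ is of the form $c_*^\top L$. Since the projections of $h(X, Y)$ and $Zv_*(X, Y)$ onto $\tilde V$ are $a^\top L$ and $b^\top L$ with $a = (E[LL^\top])^{-1}E[Lh(X, Y)]$ and $b = (E[LL^\top])^{-1}E[ZLL^\top]\, c_*$, we obtain $c_* = (E[ZLL^\top])^{-1}E[Lh(X, Y)]$. Thus, if $U = L_{2,0}(G)$, the canonical gradient of $E[h(X, Y)]$ is

$$\chi(X) - E[\chi(X)] + Z\, c_*^\top \ell_\vartheta(X, Y).$$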
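As a small worked example of the coefficient $c_* = (E[ZLL^\top])^{-1}E[Lh(X, Y)]$, the sketch below approximates it by Monte Carlo in the scalar case $m = 1$. The model is an assumption chosen for illustration, not taken from the text: $Y$ given $X = x$ is normal with mean $\vartheta x$ and unit variance, so the score is $\ell_\vartheta(x, y) = x(y - \vartheta x)$, and $h(x, y) = y$. Note that this is a population-level calculation that uses all simulated responses; it is not an estimator based on the incomplete data.

```python
import random


def canonical_coefficient(theta=1.0, n=200_000, seed=2):
    """Monte Carlo approximation of c_* = E[L h] / E[Z L^2] for m = 1.

    Toy model (an assumption for illustration only):
    X ~ Uniform(0, 1), pi(x) = 0.3 + 0.5 x, Y | X = x ~ N(theta * x, 1),
    so the score is l_theta(x, y) = x * (y - theta * x); h(x, y) = y.
    """
    rng = random.Random(seed)
    sum_lh = 0.0   # accumulates L * h(X, Y)
    sum_zll = 0.0  # accumulates Z * L^2
    for _ in range(n):
        x = rng.random()
        z = 1 if rng.random() < 0.3 + 0.5 * x else 0
        y = rng.gauss(theta * x, 1.0)
        score = x * (y - theta * x)
        sum_lh += score * y
        sum_zll += z * score * score
    return (sum_lh / n) / (sum_zll / n)


c_star = canonical_coefficient()
print(c_star)
```

For this toy model $E[Lh(X, Y)] = E[X] = 0.5$ and $E[ZL^2] = E[\pi(X)X^2] = 0.225$, so the printed value should be near $0.5/0.225 \approx 2.22$. For $m > 1$ the division would be replaced by solving an $m \times m$ linear system.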

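3. Linear regression with independence. We consider the linear regression model $Y = \vartheta X + \varepsilon$ with $\varepsilon$ and $X$ independent. We assume that $\varepsilon$ has an unknown density $f$ with finite Fisher information $J$ for location and $X$ has finite and positive variance. We do not assume that $\varepsilon$ has mean zero. In this model, $Q(x, dy) = f(y - \vartheta x)\, dy$. Write $F$ for the distribution function with density $f$. As shown in Bickel [Bic82],

$$V = \{aX\ell(\varepsilon) + b(\varepsilon) : a \in \mathbb{R},\ b \in L_{2,0}(F)\}.$$

Here $\ell$ denotes the score function $\ell(y) = -f'(y)/f(y)$ for location. The space $V$ can be written as the orthogonal sum of the spaces $V_1 = \{a\xi : a \in \mathbb{R}\}$ with $\xi = (X - E[X])\ell(\varepsilon)$, and $V_2 = \{b(\varepsilon) : b \in L_{2,0}(F)\}$. The projection of $h(X, Y)$ onto $V_1$ is $c_h\xi/E[\xi^2]$ with $c_h = E[h(X, Y)\xi]$, and the projection of $h(X, Y)$ onto $V_2$ is $\bar h(\varepsilon) - E[\bar h(\varepsilon)]$ with $\bar h(\varepsilon) = E(h(X, Y) \mid \varepsilon)$. For $b \in L_2(F)$, the projection of $Zb(\varepsilon)$ onto $V_1$ is $c\,\xi/E[\xi^2]$ with $c = E[Zb(\varepsilon)\xi] = E[Z]\bigl(E(X \mid Z = 1) - E[X]\bigr)E[b(\varepsilon)\ell(\varepsilon)]$, and the projection of $Zb(\varepsilon)$ onto $V_2$ is $E[Z]\bigl(b(\varepsilon) - E[b(\varepsilon)]\bigr)$. Let $\xi_* = \bigl(X - E(X \mid Z = 1)\bigr)\ell(\varepsilon)$.

Then $Z\xi_*$ is orthogonal to $V_2$, and its projection onto $V_1$ is $a_*\xi/E[\xi^2]$ with $a_* = E[Z\xi_*\xi] = E[Z\xi_*^2]$. Since $c_h = E[h(X, Y)\xi] = E[h(X, Y)\xi_*] + \bigl(E(X \mid Z = 1) - E[X]\bigr)E[h(X, Y)\ell(\varepsilon)]$, it follows that

$$v_*(X, Y) = \frac{E[h(X, Y)\xi_*]}{E[Z\xi_*^2]}\, \xi_* + \frac{1}{E[Z]}\bigl(\bar h(\varepsilon) - E[\bar h(\varepsilon)]\bigr).$$

Thus, if $U = L_{2,0}(G)$, the canonical gradient of $E[h(X, Y)]$ is

$$\psi(X, ZY, Z) = \chi(X) - E[\chi(X)] + Z\Bigl(\frac{E[h(X, Y)\xi_*]}{E[Z\xi_*^2]}\, \xi_* + \frac{1}{E[Z]}\bigl(\bar h(\varepsilon) - E[\bar h(\varepsilon)]\bigr)\Bigr).$$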

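For $h(X, Y) = Y$ we can use the identity $E[\varepsilon\ell(\varepsilon)] = 1$ to simplify the canonical gradient to

$$\psi(X, ZY, Z) = \vartheta(X - E[X]) + \frac{Z\bigl(E[X] - E(X \mid Z = 1)\bigr)}{E[Z\xi_*^2]}\, \xi_* + \frac{Z(\varepsilon - E[\varepsilon])}{E[Z]}.$$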
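The simplification for $h(X, Y) = Y$ rests on two facts: the identity $E[\varepsilon\ell(\varepsilon)] = 1$, and the resulting value $E[Y\xi_*] = E[X] - E(X \mid Z = 1)$ for the coefficient of $\xi_*$. The sketch below checks both by Monte Carlo under assumptions made only for illustration: normal errors with mean `mu` and standard deviation `sd`, for which the location score is $\ell(y) = (y - \mu)/\mathrm{sd}^2$, uniform $X$, and $\pi(x) = 0.3 + 0.5x$. It is a population-level check that uses all simulated errors, not an estimator.

```python
import random


def check_regression_identities(theta=2.0, mu=1.0, sd=0.5, n=200_000, seed=3):
    """Monte Carlo check of E[eps l(eps)] = 1 and E[Y xi_*] = E[X] - E(X | Z = 1).

    Toy model (an assumption for illustration only):
    X ~ Uniform(0, 1), pi(x) = 0.3 + 0.5 x, eps ~ N(mu, sd^2) independent of X.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        x = rng.random()
        z = 1 if rng.random() < 0.3 + 0.5 * x else 0
        eps = rng.gauss(mu, sd)
        draws.append((x, z, eps))

    def score(e):
        # Location score -f'(e) / f(e) of the N(mu, sd^2) density.
        return (e - mu) / sd ** 2

    mean_x = sum(x for x, _, _ in draws) / n
    # Sample version of E(X | Z = 1).
    mean_x_resp = sum(x * z for x, z, _ in draws) / sum(z for _, z, _ in draws)
    eps_score = sum(e * score(e) for _, _, e in draws) / n
    # Sample version of E[Y xi_*] with Y = theta X + eps.
    y_xi = sum((theta * x + e) * (x - mean_x_resp) * score(e)
               for x, _, e in draws) / n
    return eps_score, y_xi, mean_x - mean_x_resp


eps_score, y_xi, gap = check_regression_identities()
print(eps_score, y_xi, gap)
```

The first printed value should be close to 1, and the second and third should agree up to simulation error; in this toy model $E(X \mid Z = 1) = E[\pi(X)X]/E[\pi(X)] \approx 0.576$, so both are near $-0.076$.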

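4. Linear regression without independence. Now we consider the linear regression model $Y = \vartheta X + \varepsilon$ with $E(\varepsilon \mid X) = 0$. We write $\sigma^2(X) = E(\varepsilon^2 \mid X)$ and $\rho_h(X) = E(h(X, Y)\varepsilon \mid X)$. In this model, we have only the constraint $\int y\, Q(x, dy) = \vartheta x$ on the transition distribution $Q$. In this case, the space $V$ is the sum of the two orthogonal spaces

$$V_1 = \{a\sigma^{-2}(X)X\varepsilon : a \in \mathbb{R}\}, \quad V_2 = \{v(X, Y) : v \in V_0,\ E(v(X, Y)\varepsilon \mid X) = 0\}.$$

For details see Müller, Schick and Wefelmeyer [MSW04]. The projection of $h(X, Y)$ onto $V_1$ is $a_h\sigma^{-2}(X)X\varepsilon$ with $a_h = E[h(X, Y)\sigma^{-2}(X)X\varepsilon]/E[\sigma^{-2}(X)X^2]$, while the projection onto $V_2$ is $h_2 = h(X, Y) - \chi(X) - \rho_h(X)\sigma^{-2}(X)\varepsilon$. It is now easy to check that $v_*(X, Y) = a_*\sigma^{-2}(X)X\varepsilon + h_2/\pi(X)$, where $a_* = E[h(X, Y)\sigma^{-2}(X)X\varepsilon]/E[\pi(X)\sigma^{-2}(X)X^2]$ makes the projections of $Zv_*(X, Y)$ and $h(X, Y)$ onto $V_1$ agree. Thus, if $U = L_{2,0}(G)$, the canonical gradient of $E[h(X, Y)]$ is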

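$$\psi(X, ZY, Z) = \chi(X) - E[\chi(X)] + \frac{Z}{\pi(X)}\bigl(h(X, Y) - \chi(X)\bigr) + Z\Bigl(a_* X - \frac{\rho_h(X)}{\pi(X)}\Bigr)\sigma^{-2}(X)\varepsilon.$$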