Imputing responses that are not missing

Ursula U. Müller1, Anton Schick2, and Wolfgang Wefelmeyer3

1 Fachbereich 3, Universität Bremen, Postfach 330 440, 28334 Bremen, Germany

2 Department of Mathematical Sciences, Binghamton University, Binghamton, NY 13902-6000, USA

3 Mathematisches Institut, Universität zu Köln, Weyertal 86-90, 50931 Köln, Germany

We consider estimation of linear functionals of the joint law of regression models in which responses are missing at random. The usual approach is to work with the fully observed data, and to replace unobserved quantities by estimators of appropriate conditional expectations. Another approach is to replace all quantities by such estimators. We show that the second method is usually better than the first.

1 Introduction

Let $(X,Y)$ be a random vector. We want to estimate $E[h(X,Y)]$, the expectation of some known square-integrable function $h$. If we are able to sample from $(X,Y)$, we can use the empirical estimator $\frac{1}{n}\sum_{i=1}^{n} h(X_i,Y_i)$. If nothing is known about the distribution of $(X,Y)$, this estimator is efficient. We are interested in the situation where we always observe $X$, but $Y$ only if some indicator $Z$ equals one. We assume that $Z$ and $Y$ are conditionally independent given $X$. Then one says that $Y$ is missing at random. In this case the empirical estimator is not available unless all $Z_i$ are one. Let $\pi(X) = E(Z \mid X) = P(Z = 1 \mid X)$. If $\pi$ is known and positive, we could use the estimator $\frac{1}{n}\sum_{i=1}^{n} Z_i h(X_i,Y_i)/\pi(X_i)$. If $\pi$ is unknown, one could replace $\pi$ by an estimator $\hat\pi$, resulting in

$$\frac{1}{n}\sum_{i=1}^{n} \frac{Z_i\, h(X_i,Y_i)}{\hat\pi(X_i)}.$$
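As a rough numerical illustration of the weighted estimators just described, the following sketch simulates responses missing at random and computes the estimator once with the true propensity $\pi$ and once with an estimated one. Everything specific here is an assumption made for the example and not taken from the paper: the three-point covariate, the propensities 0.3, 0.5 and 0.8, the choice $h(x,y) = y$, the function names, and the use of cell frequencies as $\hat\pi$.

```python
import numpy as np

# Hypothetical propensities pi(x) for a covariate X taking values 0, 1, 2.
PI = np.array([0.3, 0.5, 0.8])

def simulate(n, rng):
    """Draw (X, Y, Z) with Y missing at random; unobserved Y is set to NaN."""
    x = rng.integers(0, 3, size=n)          # X uniform on {0, 1, 2}
    y = x + rng.normal(size=n)              # so E[Y] = E[X] = 1
    z = (rng.random(n) < PI[x]).astype(float)
    y = np.where(z == 1, y, np.nan)         # Y is seen only when Z = 1
    return x, y, z

def weighted_known(h_vals, x, z):
    """(1/n) * sum of Z_i h_i / pi(X_i), using the true propensity."""
    return np.mean(np.where(z == 1, h_vals / PI[x], 0.0))

def weighted_estimated(h_vals, x, z):
    """Same estimator with pi replaced by the observed fraction in each cell.

    Assumes every cell contains at least one observed response.
    """
    counts = np.bincount(x, minlength=3)
    observed = np.bincount(x, weights=z, minlength=3)
    pi_hat = observed / counts
    return np.mean(np.where(z == 1, h_vals / pi_hat[x], 0.0))

def replicate(n, reps, seed=0):
    """Return arrays of both estimates over independent replications."""
    rng = np.random.default_rng(seed)
    known = np.empty(reps)
    estimated = np.empty(reps)
    for r in range(reps):
        x, y, z = simulate(n, rng)
        known[r] = weighted_known(y, x, z)          # h(x, y) = y
        estimated[r] = weighted_estimated(y, x, z)
    return known, estimated

if __name__ == "__main__":
    known, estimated = replicate(n=200, reps=2000)
    print("known pi:     mean", known.mean(), "variance", known.var())
    print("estimated pi: mean", estimated.mean(), "variance", estimated.var())
```

Repeating the simulation over many samples, as `replicate` does, gives a rough check that both versions are centred near the true value $E[Y] = 1$ in this toy design and lets one compare their sampling variability.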

Surprisingly, even if $\pi$ is known, replacing $\pi$ by an estimator can decrease the asymptotic variance. Such an improvement is given by Schisterman and Rotnitzky [SR01]. A similar result, on average treatment effects, is in Hirano, Imbens and Ridder [HIR03]. Another estimator for $E[h(X,Y)]$ is the partially imputed estimator