Advanced Econometrics Takeshi Amemiya
Independent and Identically Distributed Case
Let $y_1, y_2, \ldots, y_T$ be independent observations from a symmetric distribution function $F[(y - \mu)/\sigma]$ such that $F(0) = \frac{1}{2}$. Thus $\mu$ is both the population mean and the median. Here $\sigma$ represents a scale parameter that may not necessarily be the standard deviation. Let the order statistics be $y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(T)}$. We define the sample median $\tilde{\mu}$ to be $y_{((T+1)/2)}$ if $T$ is odd and any arbitrarily determined point between $y_{(T/2)}$ and $y_{((T/2)+1)}$ if $T$ is even. It has long been known that $\tilde{\mu}$ would be a better estimator of $\mu$ than the sample mean $\hat{\mu} = T^{-1} \sum_{t=1}^{T} y_t$ if $F$ has heavier tails than the normal distribution. Intuitively speaking, this is because $\tilde{\mu}$ is much less sensitive to the effect of a few wild observations than $\hat{\mu}$ is.
It can be shown that $\tilde{\mu}$ is asymptotically normally distributed with mean $\mu$ and variance $[4Tf(0)^2]^{-1}$, where $f$ is the density of $F$.¹⁰ Using this, we can compare the asymptotic variance of $\tilde{\mu}$ with that of $\hat{\mu}$ for various choices of $F$. Consider the three densities
Normal    $\dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$
Laplace   $\dfrac{1}{2}\, e^{-|x|}$
Cauchy    $\dfrac{1}{\pi(1 + x^2)}$
Table 2.3 shows the asymptotic variances of $\hat{\mu}$ and $\tilde{\mu}$ under these three distributions.

Table 2.3 Asymptotic variances (times sample size) of the sample mean and the sample median under selected distributions

Distribution    $TV(\hat{\mu})$    $TV(\tilde{\mu})$
Normal          1                  1.57
Laplace         2                  1
Cauchy          $\infty$           2.47

Clearly, the mean is better than the median under normality, but the median outperforms the mean in the case of the other two long-tailed distributions. (Note that the mean is the maximum likelihood estimator under normality and that the median, because it minimizes $\sum_{t=1}^{T} |y_t - b|$, is the maximum likelihood estimator under the Laplace distribution.) In general, we call an estimator, such as the median, that performs relatively well under distributions heavier-tailed than normal a "robust" estimator. To be precise, therefore, the word robustness should be used in reference to the particular class of possible distributions imagined. A comparison of the variances of the mean and the median when the underlying class of distributions is a mixture of two normal distributions can be found in Bickel and Doksum (1977, p. 371), which has an excellent elementary discussion of robust estimation.
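The entries of Table 2.3 can be checked directly from the variance of each density and from the formula $[4f(0)^2]^{-1}$ for the median. The following is a minimal sketch; the dictionary names are illustrative and not from the text.

```python
import math

# Density at zero for each of the three standardized distributions.
f0 = {
    "Normal": 1.0 / math.sqrt(2.0 * math.pi),
    "Laplace": 0.5,
    "Cauchy": 1.0 / math.pi,
}

# T times the variance of the sample mean, i.e. the variance of the density.
# The Cauchy distribution has no finite variance.
tv_mean = {"Normal": 1.0, "Laplace": 2.0, "Cauchy": math.inf}

# T times the asymptotic variance of the sample median: 1 / (4 f(0)^2).
tv_median = {name: 1.0 / (4.0 * d ** 2) for name, d in f0.items()}

for name in f0:
    print(f"{name:8s} TV(mean) = {tv_mean[name]:.2f}   TV(median) = {tv_median[name]:.2f}")
```

The median entries work out to $\pi/2$ for the normal, 1 for the Laplace, and $\pi^2/4$ for the Cauchy density.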
Another robust estimator of location that has long been in use is called the $\alpha$-trimmed mean, which is simply the mean of the sample after the proportion $\alpha$ of largest and smallest observations have been removed. These and other similar robust estimators were often used by statisticians in the nineteenth century (see Huber, 1972; Stigler, 1973). However, the popularity of these robust estimators declined at the turn of the century, and in the first half of the present century the sample mean or the least squares estimator became the dominant estimator. This change occurred probably because many sophisticated testing procedures have been developed under the normality assumption (mathematical convenience) and because statisticians have put an undue confidence in the central limit theorem (rationalization). In the last twenty years we have witnessed a resurgence of interest in robust estimation among statisticians who have recognized that the distributions of real data are often significantly different from normal and have heavier tails than the normal in most cases. Tukey and his associates in Princeton have been the leading proponents of this movement. We should also mention Mandelbrot (1963), who has gone so far as to maintain that many economic variables have infinite variance. However, it should be noted that the usefulness of robust estimation is by no means dependent on the unboundedness of the variance; the occurrence of heavier tails than the normal is sufficient to ensure its efficacy. For a survey of recent developments, the reader is referred to Andrews et al. (1972), who reported on Monte Carlo studies of 68 robust estimators, and to Huber (1972, 1977, 1981), Hogg (1974), and Koenker (1981a).
Robust estimators of location can be classified into four groups: M, $L_p$, L, and R estimators. M, L, and R estimators are the terms used by Huber (1972). $L_p$ estimators constitute a subset of the class M, but we have singled them out because of their particular importance. We shall briefly explain these classes of estimators and then generalize them to the regression case.
M Estimator. The M estimator (which stands for "maximum-likelihood-type" estimator) is defined as the value of $b$ that minimizes $\sum_{t=1}^{T} \rho[(y_t - b)/s]$, where $s$ is an estimate of the scale parameter $\sigma$ and $\rho$ is a chosen function. If $\rho$ is twice differentiable and its second derivative is piecewise continuous with $E\rho'[(y_t - \mu)/s_0] = 0$, where $s_0$ is the probability limit of $s$,¹¹ we can use the results of Chapter 4 to show that the M estimator is asymptotically normal with mean $\mu$ and variance
$$T^{-1} s_0^2\, \frac{E\{\rho'[s_0^{-1}(y_t - \mu)]^2\}}{\left(E\{\rho''[s_0^{-1}(y_t - \mu)]\}\right)^2}. \tag{2.3.1}$$
Note that when $\rho(\lambda) = \lambda^2$, this formula reduces to the familiar formula for the variance of the sample mean.
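The reduction to the variance of the sample mean can be verified in one line. The sketch below assumes the asymptotic variance has the standard M-estimator form stated in its first comment; the shorthand $z_t$ is introduced here only for the check.

```latex
% Assumed form of the asymptotic variance (standard M-estimator result):
%   V = T^{-1} s_0^2 E{\rho'(z_t)^2} / (E{\rho''(z_t)})^2,  z_t = (y_t - \mu)/s_0.
\begin{align*}
\rho(\lambda) = \lambda^2
  &\;\Rightarrow\; \rho'(\lambda) = 2\lambda, \quad \rho''(\lambda) = 2, \\
V &= \frac{s_0^2}{T}\,
     \frac{E\{4 s_0^{-2}(y_t - \mu)^2\}}{2^2}
   = \frac{E(y_t - \mu)^2}{T}.
\end{align*}
```

The scale $s_0$ cancels, so the result does not depend on how the scale is estimated in this case.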
Consider an M estimator proposed by Huber (1964). It is defined by
$$\rho(z) = \begin{cases} \tfrac{1}{2} z^2 & \text{if } |z| < c \\ c|z| - \tfrac{1}{2} c^2 & \text{if } |z| \geq c, \end{cases} \tag{2.3.2}$$
where $z = (y - \mu)/s$ and $c$ is to be chosen by the researcher. (The Monte Carlo studies of Andrews et al. (1972) considered several values of $c$ between 0.7 and 2.) Huber (1964) arrived at the $\rho$ function in (2.3.2) as the minimax choice (doing the best against the least favorable distribution) when $F(z) = (1 - \epsilon)\Phi(z) + \epsilon H(z)$, where $z = (y - \mu)/\sigma$, $H$ varies among all the symmetric distributions, $\Phi$ is the standard normal distribution function, and $\epsilon$ is a given constant between 0 and 1. The value of $c$ depends on $\epsilon$ in a certain way. As for $s$, one may choose any robust estimate of the scale parameter. Huber (1964) proposed the simultaneous solution of
(2.3.4)
in terms of $b$ and $s$. Huber's estimate of $\mu$ converges to the sample mean or the sample median as $c$ tends to $\infty$ or 0, respectively.
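A minimal sketch of a Huber-type location estimate follows. It does not implement Huber's simultaneous solution for $b$ and $s$; instead it assumes the scale is held fixed at the normalized median absolute deviation (the factor 1.4826 makes it consistent for the standard deviation under normality) and solves $\sum_t \rho'[(y_t - b)/s] = 0$ by iterated weighted means. The function name and default values are illustrative.

```python
import statistics

def huber_location(y, c=1.5, tol=1e-10, max_iter=200):
    """Huber-type M-estimate of location with the scale held fixed.

    Assumption (not from the text): s is the normalized median absolute
    deviation rather than the simultaneous solution for b and s.
    """
    med = statistics.median(y)
    s = 1.4826 * statistics.median([abs(v - med) for v in y])
    if s == 0.0:
        return med  # degenerate sample: at least half the points coincide
    b = med
    for _ in range(max_iter):
        # rho'(z) = z for |z| < c and c*sign(z) otherwise, so the condition
        # sum rho'(z_t) = 0 is a weighted mean with weights min(1, c/|z_t|).
        w = [1.0 if abs(v - b) <= c * s else c * s / abs(v - b) for v in y]
        b_new = sum(wi * vi for wi, vi in zip(w, y)) / sum(w)
        if abs(b_new - b) < tol:
            return b_new
        b = b_new
    return b

sample = [1, 2, 3, 4, 100]
print(huber_location(sample))          # close to the median, far from the mean
print(huber_location(sample, c=1e6))   # a very large c reproduces the sample mean
```

With a very large $c$ no observation is downweighted and the estimate equals the sample mean, which illustrates one of the two limits mentioned above.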
Another M estimator shown to be robust in the study of Andrews et al. (1972) is the following one proposed by Andrews (1974). Its $\rho$ function is defined by
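$$-\rho(z) = \begin{cases} 1 + \cos z & \text{if } |z| \leq \pi \\ 0 & \text{if } |z| > \pi, \end{cases} \tag{2.3.5}$$
where $z = (y - b)/s$ as before. Andrews' choice of $s$ is $2.1 \cdot \text{Median}\{|y_t - \tilde{\mu}|\}$.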
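Because $\rho'(z) = \sin z$ for $|z| \leq \pi$ and 0 otherwise, observations far from the center receive no weight at all. The sketch below is one way to compute such an estimate, assuming the scale is fixed at 2.1 times the median absolute deviation from the sample median and the iteration starts at the median; since the objective can have several local minima, the starting point matters. The function name is illustrative.

```python
import math
import statistics

def andrews_location(y, tol=1e-10, max_iter=200):
    """Location estimate with Andrews' sine-type rho, scale held fixed."""
    med = statistics.median(y)
    s = 2.1 * statistics.median([abs(v - med) for v in y])
    if s == 0.0:
        return med
    b = med
    for _ in range(max_iter):
        w = []
        for v in y:
            z = (v - b) / s
            if abs(z) > math.pi:
                w.append(0.0)               # observation is ignored entirely
            elif z == 0.0:
                w.append(1.0)               # limit of sin(z)/z as z -> 0
            else:
                w.append(math.sin(z) / z)   # weight implied by rho'(z) = sin z
        total = sum(w)
        if total == 0.0:
            return b
        b_new = sum(wi * vi for wi, vi in zip(w, y)) / total
        if abs(b_new - b) < tol:
            return b_new
        b = b_new
    return b

print(andrews_location([1, 2, 3, 4, 100]))  # the point 100 gets zero weight
```

In this example the estimate settles at the center of the four remaining points, since the outlier lies more than $\pi$ scale units away.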
$L_p$ Estimator. The $L_p$ estimator is defined as the value of $b$ that minimizes $\sum_{t=1}^{T} |y_t - b|^p$. Values of $p$ between 1 and 2 are the ones usually considered. Clearly, $p = 2$ yields the sample mean and $p = 1$ the sample median. For any $p \neq 1$, the asymptotic variance of the estimator is given by (2.3.1). The approximate variance for the case $p = 1$ (the median) was given earlier. Note that an estimate of a scale parameter need not be used in defining the $L_p$ estimator.
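Since $\sum_t |y_t - b|^p$ is convex in $b$ for $p \geq 1$, the minimizer can be located by a simple bracketing search. The following sketch uses ternary search over the range of the data; the function name and the tolerance are illustrative choices, and the search stops when the bracket is narrower than `tol`.

```python
def lp_location(y, p=1.5, tol=1e-9):
    """Minimize sum |y_t - b|**p over b by ternary search (needs p >= 1)."""
    def loss(b):
        return sum(abs(v - b) ** p for v in y)

    lo, hi = min(y), max(y)  # the minimizer lies within the data range
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loss(m1) <= loss(m2):
            hi = m2   # a minimizer lies in [lo, m2]
        else:
            lo = m1   # a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)

sample = [1, 2, 3, 4, 10]
print(lp_location(sample, p=2.0))  # approximately the sample mean, 4
print(lp_location(sample, p=1.0))  # approximately the sample median, 3
print(lp_location(sample, p=1.5))  # lies between the median and the mean
```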
L Estimator. The L estimator is defined as a linear combination of order statistics $y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(T)}$. The sample median and the $\alpha$-trimmed mean discussed at the beginning of this subsection are members of this class. (As we have seen, the median is a member of $L_p$ and hence of M as well.)
Another member of this class is the Winsorized mean, which is similar to a trimmed mean. Whereas a trimmed mean discards largest and smallest observations, a Winsorized mean “accumulates” them at each truncation point. More precisely, it is defined as
$$T^{-1}\left[(g + 1)\,y_{(g+1)} + y_{(g+2)} + \cdots + y_{(T-g-1)} + (g + 1)\,y_{(T-g)}\right] \tag{2.3.6}$$
for some integer $g$.
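The difference between trimming and Winsorizing is easy to see in code. The sketch below takes the integer $g$ directly (both functions require $2g < T$); how $g$ is tied to a trimming proportion $\alpha$, for instance as the integer part of $\alpha T$, is left to the user and is an assumption rather than something fixed by the text.

```python
def trimmed_mean(y, g):
    """Mean after discarding the g smallest and g largest observations."""
    ys = sorted(y)
    kept = ys[g:len(ys) - g]
    return sum(kept) / len(kept)

def winsorized_mean(y, g):
    """Mean after replacing the g smallest observations by y_(g+1) and the
    g largest by y_(T-g), so each end point is counted g + 1 times."""
    ys = sorted(y)
    T = len(ys)
    kept = ys[g:T - g]
    return (g * kept[0] + sum(kept) + g * kept[-1]) / T

sample = [1, 2, 4, 8, 100]
print(trimmed_mean(sample, 1))     # (2 + 4 + 8) / 3
print(winsorized_mean(sample, 1))  # (2*2 + 4 + 2*8) / 5
```

With $g = 0$ both functions return the ordinary sample mean.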
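A generalization of the median is the $\theta$th sample quantile, $0 < \theta < 1$; this is defined as $y_{(k)}$, where $k$ is the smallest integer satisfying $k > T\theta$ if $T\theta$ is not an integer, and an arbitrarily determined point between $y_{(T\theta)}$ and $y_{(T\theta+1)}$ if $T\theta$ is an integer. Thus $\theta = \frac{1}{2}$ corresponds to the median. It can be shown that the $\theta$th sample quantile, denoted by $\tilde{\mu}(\theta)$, minimizes
$$\sum_{y_t \geq b} \theta\,|y_t - b| + \sum_{y_t < b} (1 - \theta)\,|y_t - b|. \tag{2.3.7}$$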
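Previously we have written $\tilde{\mu}(\frac{1}{2})$ simply as $\tilde{\mu}$. Gastwirth (1966) proposed a linear combination of quantiles $0.3\tilde{\mu}(\frac{1}{3}) + 0.4\tilde{\mu}(\frac{1}{2}) + 0.3\tilde{\mu}(\frac{2}{3})$ as a robust estimator of location. Its asymptotic distribution can be obtained from the following general result attributed to Mosteller (1946): Suppose $0 < \theta_1 < \theta_2 < \cdots < \theta_n < 1$. Then $\tilde{\mu}(\theta_1), \tilde{\mu}(\theta_2), \ldots, \tilde{\mu}(\theta_n)$ are asymptotically jointly normal with the means equal to their respective population quantiles, $\mu(\theta_1), \mu(\theta_2), \ldots, \mu(\theta_n)$ (that is, $\theta_i = F[\mu(\theta_i)]$), and variances and covariances given by
$$\text{Cov}[\tilde{\mu}(\theta_i), \tilde{\mu}(\theta_j)] = \frac{\theta_i (1 - \theta_j)}{T f[\mu(\theta_i)]\, f[\mu(\theta_j)]}, \quad i \leq j. \tag{2.3.8}$$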
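The sample quantile and Gastwirth's combination can be sketched as follows. Two choices here are assumptions: $\theta$ is passed as an exact fraction so that the test "is $T\theta$ an integer" is reliable, and when $T\theta$ is an integer the midpoint of the two adjacent order statistics is used (the definition allows any point between them). Function names are illustrative.

```python
import math
from fractions import Fraction

def sample_quantile(y, theta):
    """theta-th sample quantile, 0 < theta < 1, following the definition in
    the text. Midpoint tie-breaking for integer T*theta is an assumption."""
    ys = sorted(y)
    t_theta = len(ys) * Fraction(theta)
    if t_theta.denominator == 1:
        k = int(t_theta)
        return 0.5 * (ys[k - 1] + ys[k])   # between y_(k) and y_(k+1)
    k = math.floor(t_theta) + 1            # smallest integer exceeding T*theta
    return ys[k - 1]

def gastwirth(y):
    """Gastwirth's estimator: 0.3, 0.4, 0.3 weights on the 1/3, 1/2, 2/3 quantiles."""
    return (0.3 * sample_quantile(y, Fraction(1, 3))
            + 0.4 * sample_quantile(y, Fraction(1, 2))
            + 0.3 * sample_quantile(y, Fraction(2, 3)))

print(sample_quantile([4, 1, 3, 2], Fraction(1, 2)))  # midpoint of 2 and 3
print(gastwirth([1, 2, 3, 4, 5, 6, 7]))               # 0.3*3 + 0.4*4 + 0.3*5
```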
R Estimator. The rank is a mapping from $n$ real numbers to the integers 1 through $n$ in such a way that the smallest number is given rank 1 and the next smallest rank 2 and so on. The rank estimator of $\mu$, denoted by $\mu^*$, is defined as follows: Construct a sequence of $n = 2T$ observations $x_1, x_2, \ldots, x_n$ by defining $x_i = y_i - b$, $i = 1, 2, \ldots, T$, and $x_{T+i} = b - y_i$, $i = 1, 2, \ldots, T$, and let their ranks be $R_1, R_2, \ldots, R_n$. Then $\mu^*$ is the value of $b$ that satisfies
$$\sum_{i=1}^{T} J\!\left(\frac{R_i}{2T + 1}\right) = 0, \tag{2.3.9}$$
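where $J$ is a function with the property $\int_0^1 J(\lambda)\, d\lambda = 0$. Hodges and Lehmann (1963) proposed setting $J(\lambda) = \lambda - \frac{1}{2}$. For this choice of $J(\lambda)$, $\mu^*$ can be shown to be equal to $\text{Median}\{(y_i + y_j)/2\}$, $1 \leq i \leq j \leq T$. It is asymptotically normal with mean $\mu$ and variance
$$\frac{1}{12T\left[\int f^2(y)\, dy\right]^2}. \tag{2.3.10}$$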
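For the Hodges and Lehmann choice of $J$, the estimator has the closed form given above, the median of all pairwise averages with $i \leq j$, so no rank equation needs to be solved. A minimal sketch follows; the function name is illustrative, and the direct enumeration takes on the order of $T^2$ operations, so it is meant for small samples.

```python
import statistics

def hodges_lehmann(y):
    """Median of the pairwise averages (y_i + y_j)/2 over all i <= j."""
    T = len(y)
    averages = [(y[i] + y[j]) / 2.0 for i in range(T) for j in range(i, T)]
    return statistics.median(averages)

print(hodges_lehmann([1, 2, 3, 4, 100]))  # 3.0; the sample mean is 22
```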
Remarks
We have covered most of the major robust estimators of location that have been proposed. Of course, we can make numerous variations on these estimators. Note that in some of the estimation methods discussed earlier there are parameters that are left to the discretion of the researcher to determine. One systematic way to determine them is the adaptive procedure, in which the values of these parameters are determined on the basis of the information contained in the sample. Hogg (1974) surveyed many such procedures. For example, the $\alpha$ of the $\alpha$-trimmed mean may be chosen so as to minimize an estimate of the variance of the estimator. Similarly, the weights to be used in the linear combination of order statistics may be determined using the asymptotic variances and covariances of order statistics given in (2.3.8).
Most of the estimators discussed in this subsection were included among the nearly seventy robust estimators considered in the Monte Carlo studies of Andrews et al. (1972). Their studies showed that the performance of the sample mean is clearly inferior. This finding is, however, contested by Stigler (1977), who used real data (eighteenth and nineteenth century observations on physical constants such as the speed of light and the mean density of the earth for which we now know the true values fairly accurately). He found that with his data a slightly trimmed mean did best and the more “drastic” robust estimators did poorly. He believes that the conclusions of Andrews et al. are biased in favor of the drastic robust estimators because they used distributions with significantly heavy tails as the underlying distributions. Andrews et al. did not offer definite advice regarding which robust estimator should be used. This is inevitable because the performance of an estimator depends on the assumed distributions. These observations indicate that it is advisable to perform a preliminary study to narrow the range of distributions that given data are supposed to follow and decide on which robust estimator to use, if any. Adaptive procedures mentioned earlier will give the researcher an added flexibility.
The exact distributions of these robust estimators are generally hard to obtain. However, in many situations they may be well approximated by methods such as the jackknife and the bootstrap (see Section 4.3.4).
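As an illustration of the bootstrap idea, the sketch below approximates the standard error of a location estimator by resampling the data with replacement. It is a generic nonparametric bootstrap under assumed settings (2000 replications, a fixed seed for reproducibility), not a procedure taken from Section 4.3.4; the function name and defaults are illustrative.

```python
import random
import statistics

def bootstrap_se(y, estimator=statistics.median, n_boot=2000, seed=0):
    """Bootstrap standard error of `estimator` applied to the sample y."""
    rng = random.Random(seed)
    T = len(y)
    # Each replication draws T observations from y with replacement.
    reps = [estimator(rng.choices(y, k=T)) for _ in range(n_boot)]
    return statistics.stdev(reps)

sample = [1, 2, 3, 4, 100]
print(bootstrap_se(sample))                              # median
print(bootstrap_se(sample, estimator=statistics.mean))   # mean
```

Any of the estimators discussed in this subsection can be passed as `estimator`, provided it accepts a list of observations and returns a number.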