Advanced Econometrics Takeshi Amemiya
Maximum Likelihood Estimator
4.2.1 Definition
Let $L_T(\theta) = L(y, \theta)$ be the joint density of a $T$-vector of random variables $y = (y_1, y_2, \ldots, y_T)'$ characterized by a $K$-vector of parameters $\theta$. When we regard it as a function of $\theta$, we call it the likelihood function. The term maximum likelihood estimator (MLE) is often used to mean two different concepts: (1) the value of $\theta$ that globally maximizes the likelihood function $L(y, \theta)$ over the parameter space $\Theta$; or (2) any root of the likelihood equation
$$\frac{\partial L_T(\theta)}{\partial \theta} = 0 \qquad (4.2.1)$$
that corresponds to a local maximum. We use it only in the second sense and use the term global maximum likelihood estimator to refer to the first concept. We sometimes use the term local maximum likelihood estimator to refer to the second concept.
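The distinction between the two senses can be made concrete numerically. The following sketch is not part of the original text; it assumes Python with NumPy and SciPy, and the Cauchy location model, sample size, seed, and search interval are illustrative choices only. It first maximizes the log likelihood over a compact grid (the first sense) and then locates a nearby stationary point of $\log L$, that is, a root of the likelihood equation (the second sense).

```python
# A minimal numerical sketch (not from the text) contrasting the two senses of "MLE":
# (1) a global maximizer of the likelihood over a compact parameter space, and
# (2) a root of the likelihood equation d log L / d theta = 0 at a local maximum.
# The Cauchy location model is used because its log likelihood can have several local maxima.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.standard_cauchy(7) + 3.0            # a small, illustrative sample

def loglik(theta):
    # log L_T(theta) for the Cauchy(theta, 1) density
    return -np.sum(np.log(np.pi * (1.0 + (y - theta) ** 2)))

def score(theta):
    # d log L_T / d theta
    return np.sum(2.0 * (y - theta) / (1.0 + (y - theta) ** 2))

# Sense (1): crude global maximization over a compact parameter space [-20, 20].
grid = np.linspace(-20.0, 20.0, 4001)
theta_global = grid[np.argmax([loglik(t) for t in grid])]

# Sense (2): a local maximum near the grid point, i.e., a root of the likelihood equation.
theta_local = minimize_scalar(lambda t: -loglik(t),
                              bounds=(theta_global - 0.5, theta_global + 0.5),
                              method="bounded").x

print("global maximizer (grid):", theta_global)
print("root of the likelihood equation:", theta_local, "  score there:", score(theta_local))
```

With a well-behaved likelihood the two estimates coincide; with a multimodal likelihood a root-finder started elsewhere may settle at a different local maximum than the global search.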
4.2.2 Consistency
The conditions for the consistency of the global MLE or the local MLE can be obtained immediately from Theorem 4.1.1 or Theorem 4.1.2 by putting $Q_T(\theta) = \log L_T(\theta)$. We consider the logarithm of the likelihood function because $T^{-1}\log L_T(\theta)$ usually converges to a finite constant. Clearly, taking the logarithm does not change the location of either global or local maxima.
So far we have not made any assumption about the distribution of $y$. If we assume that $\{y_t\}$ are i.i.d. with common density function $f(\cdot, \theta)$, we can write
$$\log L(y, \theta) = \sum_{t=1}^{T} \log f(y_t, \theta). \qquad (4.2.2)$$
In this case we can replace assumption C of either Theorem 4.1.1 or Theorem 4.1.2 by the following two assumptions:
$$E \sup_{\theta \in \Theta} |\log f(y_t, \theta)| < M \quad \text{for some positive constant } M, \qquad (4.2.3)$$
and
$$\log f(y_t, \theta) \text{ is a continuous function of } \theta \text{ for each } y_t. \qquad (4.2.4)$$
In Theorem 4.2.1 we shall show that assumptions (4.2.3) and (4.2.4) imply
$$\operatorname*{plim}_{T\to\infty}\; T^{-1}\sum_{t=1}^{T}\log f(y_t, \theta) = E \log f(y_t, \theta) \quad \text{uniformly in } \theta \in \Theta. \qquad (4.2.5)$$
Furthermore, we have by Jensen's inequality (see Rao, 1973, p. 58, for the proof)
$$E \log \frac{f(y_t, \theta)}{f(y_t, \theta_0)} < \log E\, \frac{f(y_t, \theta)}{f(y_t, \theta_0)} \quad \text{for } \theta \ne \theta_0, \qquad (4.2.6)$$
where the expectation is taken using the true value $\theta_0$. Because $E\, f(y_t, \theta)/f(y_t, \theta_0) = \int f(y, \theta)\,dy = 1$, the right-hand side of (4.2.6) is zero, and therefore
$$E \log f(y_t, \theta) < E \log f(y_t, \theta_0) \quad \text{for } \theta \ne \theta_0. \qquad (4.2.7)$$
As in (4.2.7), we have $T^{-1}E \log L_T(\theta) < T^{-1}E \log L_T(\theta_0)$ for $\theta \ne \theta_0$ and for all $T$. However, when we take the limit of both sides of the inequality as $T$ goes to infinity, we have only
$$\lim_{T\to\infty} T^{-1}E \log L_T(\theta) \le \lim_{T\to\infty} T^{-1}E \log L_T(\theta_0).$$
Hence, one generally needs to assume that $\lim_{T\to\infty} T^{-1}E \log L_T(\theta)$ is uniquely maximized at $\theta = \theta_0$.
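As a hedged numerical check, the following Monte Carlo sketch (not part of the original text) illustrates the identification argument of (4.2.6) and (4.2.7); the normal location model, the true value $\theta_0 = 2$, and the grid are illustrative assumptions.

```python
# A Monte Carlo sketch (not from the text) of the identification argument in
# (4.2.6)-(4.2.7): for i.i.d. data from f(., theta_0), E log f(y_t, theta) is maximized
# at theta = theta_0, so T^{-1} log L_T(theta) peaks near theta_0 when T is large.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta0 = 2.0                                   # illustrative true value
y = rng.normal(theta0, 1.0, size=100_000)      # large T, so the average is near its expectation

grid = np.linspace(0.0, 4.0, 401)
avg_loglik = np.array([norm.logpdf(y, loc=t, scale=1.0).mean() for t in grid])
print("argmax of T^{-1} log L_T over the grid:", grid[np.argmax(avg_loglik)])   # close to 2.0
```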
That (4.2.3) and (4.2.4) imply (4.2.5) follows from the following theorem when we put $g(y_t, \theta) = \log f(y_t, \theta) - E \log f(y_t, \theta)$.
Theorem 4.2.1. Let $g(y, \theta)$ be a measurable function of $y$ in Euclidean space for each $\theta \in \Theta$, a compact subset of $R^K$ (Euclidean $K$-space), and a continuous function of $\theta \in \Theta$ for each $y$. Assume $E\,g(y_t, \theta) = 0$. Let $\{y_t\}$ be a sequence of i.i.d. random vectors such that $E \sup_{\theta \in \Theta} |g(y_t, \theta)| < \infty$. Then $T^{-1}\sum_{t=1}^{T} g(y_t, \theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$.
Proof. Partition $\Theta$ into $n$ nonoverlapping regions $\Theta_1^n, \Theta_2^n, \ldots, \Theta_n^n$ in such a way that the distance between any two points within each $\Theta_i^n$ goes to 0 as $n$ goes to $\infty$. Let $\theta_1, \theta_2, \ldots, \theta_n$ be an arbitrary sequence of $K$-vectors such that $\theta_i \in \Theta_i^n$, $i = 1, 2, \ldots, n$. Then, writing $g_t(\theta)$ for $g(y_t, \theta)$, we have for any $\epsilon > 0$
$$P\left[\sup_{\theta \in \Theta}\left|T^{-1}\sum_{t=1}^{T} g_t(\theta)\right| > \epsilon\right] \le \sum_{i=1}^{n} P\left[\sup_{\theta \in \Theta_i^n}\left|T^{-1}\sum_{t=1}^{T} g_t(\theta)\right| > \epsilon\right] \qquad (4.2.8)$$
$$\le \sum_{i=1}^{n} P\left[T^{-1}\sum_{t=1}^{T}\sup_{\theta \in \Theta_i^n}\left|g_t(\theta) - g_t(\theta_i)\right| + \left|T^{-1}\sum_{t=1}^{T} g_t(\theta_i)\right| > \epsilon\right],$$
where the first inequality follows from the fact that if $A$ implies $B$ then $P(A) \le P(B)$, and the last inequality follows from the triangle inequality. Because $g_t(\theta)$ is uniformly continuous in $\theta \in \Theta$, we have for every $i$
$$\lim_{n\to\infty} \sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| = 0. \qquad (4.2.9)$$
But, because
$$\sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| \le 2 \sup_{\theta \in \Theta} |g_t(\theta)| \qquad (4.2.10)$$
and the right-hand side of the inequality (4.2.10) is integrable by our assumptions, (4.2.9) implies by the Lebesgue convergence theorem (Royden, 1968, p. 88)
$$\lim_{n\to\infty} E \sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| = 0 \qquad (4.2.11)$$
uniformly for $i$. Take $n$ so large that the expected value in (4.2.11) is smaller than $\epsilon/2$. Finally, the conclusion of the theorem follows from Theorem 3.3.2 (Kolmogorov LLN 2) by taking $T$ to infinity.
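The uniform convergence asserted by Theorem 4.2.1 can be illustrated numerically. In the sketch below (not part of the original text), $g(y, \theta) = \log f(y, \theta) - E \log f(y, \theta)$ for the $N(\theta, 1)$ location model, for which $E_{\theta_0}\log f(y_t, \theta) = -\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}[1 + (\theta_0 - \theta)^2]$ in closed form; the compact grid standing in for $\Theta$ and the sample sizes are illustrative choices.

```python
# A numerical sketch (not from the text) of the uniform law of large numbers in
# Theorem 4.2.1: g(y, theta) = log f(y, theta) - E log f(y, theta) for the N(theta, 1)
# location model, where E_{theta0} log f(y_t, theta) = -0.5 log 2pi - 0.5 [1 + (theta0 - theta)^2].
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.0
grid = np.linspace(-3.0, 3.0, 301)             # a compact grid standing in for Theta

def sup_abs_mean_g(T):
    y = rng.normal(theta0, 1.0, size=T)
    m1, m2 = y.mean(), (y ** 2).mean()
    # T^{-1} sum_t log f(y_t, theta) = -0.5 log 2pi - 0.5 (m2 - 2 theta m1 + theta^2)
    sample_mean = -0.5 * np.log(2 * np.pi) - 0.5 * (m2 - 2 * grid * m1 + grid ** 2)
    expected = -0.5 * np.log(2 * np.pi) - 0.5 * (1.0 + (theta0 - grid) ** 2)
    return np.abs(sample_mean - expected).max()

for T in (100, 1_000, 10_000, 100_000):
    print(T, sup_abs_mean_g(T))                # the supremum over the grid shrinks toward 0
```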
This theorem can be generalized to the extent that $T^{-1}\sum_{t=1}^{T} g_t(\theta_i)$ and $T^{-1}\sum_{t=1}^{T}\sup_{\theta \in \Theta_i^n}|g_t(\theta) - g_t(\theta_i)|$ can be subjected to a law of large numbers. The following theorem, which is a special case of a theorem attributed to Hoadley (1971) (see also White, 1980b), can be proved by making a slight modification of the proof of Theorem 4.2.1 and using Markov's law of large numbers (Chapter 3, note 10).
Theorem 4.2.2. Let $g_t(y, \theta)$ be a measurable function of $y$ in Euclidean space for each $t$ and for each $\theta \in \Theta$, a compact subset of $R^K$ (Euclidean $K$-space), and a continuous function of $\theta$ for each $y$ uniformly in $t$. Assume $E\,g_t(y_t, \theta) = 0$. Let $\{y_t\}$ be a sequence of independent and not necessarily identically distributed random vectors such that $E \sup_{\theta \in \Theta} |g_t(y_t, \theta)|^{1+\delta} \le M < \infty$ for some $\delta > 0$. Then $T^{-1}\sum_{t=1}^{T} g_t(y_t, \theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$.
We will need a similar theorem (Jennrich, 1969, p. 636) for the case where $y_t$ is a vector of constants rather than of random variables.
Theorem 4.2.3. Let $y_1, y_2, \ldots, y_T$ be vectors of constants. We define the empirical distribution function of $(y_1, y_2, \ldots, y_T)$ by $F_T(\alpha) = T^{-1}\sum_{t=1}^{T}\chi(y_t < \alpha)$, where $\chi$ takes the value 1 or 0 depending on whether the event in its argument occurs or not. Note that $y_t < \alpha$ means every element of the vector $y_t$ is smaller than the corresponding element of $\alpha$. Assume that $g(y, \theta)$ is a bounded and continuous function of $y$ in Euclidean space and of $\theta$ in a compact set $\Theta$. Also assume that $F_T$ converges to a distribution function $F$. Then $\lim_{T\to\infty} T^{-1}\sum_{t=1}^{T} g(y_t, \theta) = \int g(y, \theta)\,dF(y)$ uniformly in $\theta$.
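A small numerical sketch (not part of the original text) of Theorem 4.2.3: with the constants $y_t = t/T$, the empirical distribution $F_T$ converges to the uniform distribution on $[0, 1]$, and for the bounded continuous $g(y, \theta) = \sin(\theta y)$ the average $T^{-1}\sum_t g(y_t, \theta)$ approaches $\int_0^1 \sin(\theta y)\,dy = (1 - \cos\theta)/\theta$. Both the choice of $g$ and of the constants are illustrative assumptions.

```python
# An illustrative check (not from the text) of Theorem 4.2.3 with the constants
# y_t = t/T, whose empirical distribution F_T converges to Uniform(0, 1), and the
# bounded continuous function g(y, theta) = sin(theta * y).
import numpy as np

grid = np.linspace(0.5, 5.0, 10)               # a compact set of theta values

def max_gap(T):
    y = np.arange(1, T + 1) / T                # the constants y_1, ..., y_T
    avg = np.array([np.mean(np.sin(th * y)) for th in grid])
    limit = (1.0 - np.cos(grid)) / grid        # integral of sin(theta * y) dF(y) over [0, 1]
    return np.abs(avg - limit).max()

for T in (10, 100, 1_000, 10_000):
    print(T, max_gap(T))                       # the gap shrinks uniformly over the grid
```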
There are many results in the literature concerning the consistency of the maximum likelihood estimator in the i.i.d. case. Rao (1973, p. 364) has presented a proof of the consistency of the local MLE, originally due to Cramér (1946). Wald (1949) proved the consistency of the global MLE without assuming compactness of the parameter space, but his conditions are difficult to verify in practice. Many other references concerning the asymptotic properties of the MLE can be found in survey articles by Norden (1972, 1973).
As an example of the application of Theorem 4.1.1, we shall prove the consistency of the maximum likelihood estimators of $\beta$ and $\sigma^2$ in Model 1. Because in this case the maximum likelihood estimators can be written as explicit functions of the sample, it is easier to prove consistency by a direct method, as we have already done in Section 3.5. We are considering this more complicated proof strictly for the purpose of illustration.
Example 4.2.1. Prove the consistency of the maximum likelihood estimators of the parameters of Model 1 with normality using Theorem 4.1.1, assuming that $\lim_{T\to\infty} T^{-1}X'X$ is a finite positive definite matrix.
In Section 1.1 we used the symbols $\beta$ and $\sigma^2$ to denote the true values because we did not need to distinguish them from the domain of the likelihood function. But now we shall put the subscript 0 on them to denote the true values; therefore we can write Eq. (1.1.4) as
$$y = X\beta_0 + u, \qquad (4.2.12)$$
where $Vu_t = \sigma_0^2$. From (1.3.1) we have
$$\log L_T = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log \sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \qquad (4.2.13)$$
$$= -\frac{T}{2}\log 2\pi - \frac{T}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\left[u'u + 2(\beta_0 - \beta)'X'u + (\beta_0 - \beta)'X'X(\beta_0 - \beta)\right],$$
where the second equality is obtained by using (4.2.12). Therefore
$$\operatorname*{plim}_{T\to\infty}\; T^{-1}\log L_T = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma^2 - \frac{\sigma_0^2}{2\sigma^2} - \frac{1}{2\sigma^2}(\beta_0 - \beta)'\left[\lim_{T\to\infty} T^{-1}X'X\right](\beta_0 - \beta), \qquad (4.2.14)$$
because $T^{-1}u'u$ converges to $\sigma_0^2$ and $T^{-1}X'u$ converges to 0 in probability.
Define a compact parameter space $\Theta$ by
$$c_1 \le \sigma^2 \le c_2, \qquad \beta'\beta \le c_3, \qquad (4.2.15)$$
where $c_1$ is a small positive constant and $c_2$ and $c_3$ are large positive constants, and assume that $(\beta_0', \sigma_0^2)$ is an interior point of $\Theta$. Then, clearly, the convergence in (4.2.14) is uniform in $\Theta$ and the right-hand side of (4.2.14) is uniquely maximized at $(\beta_0', \sigma_0^2)$. Put $\theta = (\beta', \sigma^2)'$ and define $\hat{\theta}_T$ by
$$\log L_T(\hat{\theta}_T) = \max_{\theta \in \Theta} \log L_T(\theta). \qquad (4.2.16)$$
Then $\hat{\theta}_T$ is clearly consistent by Theorem 4.1.1. Now define $\tilde{\theta}_T$ by
$$\log L_T(\tilde{\theta}_T) = \max_{\theta} \log L_T(\theta), \qquad (4.2.17)$$
where the maximization in (4.2.17) is over the whole Euclidean $(K + 1)$-space.
Then the consistency of $\tilde{\theta}_T$, which we set out to prove, would follow from
$$\lim_{T\to\infty} P(\hat{\theta}_T = \tilde{\theta}_T) = 1. \qquad (4.2.18)$$
The proof of (4.2.18) would be simple if we used our knowledge of the explicit formulae for $\tilde{\theta}_T$ in this example. But that would be cheating. The proof of (4.2.18) using condition D given after the proof of Theorem 4.1.1 is left as an exercise.
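The consistency established in Example 4.2.1 can also be seen in a simulation. The sketch below is not part of the original text; it uses the explicit ML formulas for Model 1 (which the example deliberately avoids in its argument) purely to display the estimates approaching $(\beta_0, \sigma_0^2)$, and the regressor design, true values, and sample sizes are illustrative assumptions.

```python
# A simulation sketch (not from the text) of the consistency shown in Example 4.2.1.
# The explicit ML formulas for Model 1 (beta_hat = OLS, sigma2_hat = SSR/T) are used
# here only to display the estimates approaching (beta_0, sigma_0^2) as T grows.
import numpy as np

rng = np.random.default_rng(3)
beta0 = np.array([1.0, -2.0])                  # illustrative true values
sigma2_0 = 4.0

for T in (50, 500, 5_000, 50_000):
    X = np.column_stack([np.ones(T), rng.uniform(-1, 1, T)])   # T^{-1} X'X settles to a p.d. matrix
    y = X @ beta0 + rng.normal(0.0, np.sqrt(sigma2_0), T)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / T
    print(T, beta_hat.round(3), round(sigma2_hat, 3))
```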
There are cases where the global maximum likelihood estimator is inconsistent, whereas a root of the likelihood equation (4.2.1) can be consistent, as in the following example.
Example 4.2.2. Let $y_t$, $t = 1, 2, \ldots, T$, be independent with the common distribution defined by
$$y_t \sim N(\mu_1, \sigma_1^2) \quad \text{with probability } \lambda, \qquad (4.2.19)$$
$$y_t \sim N(\mu_2, \sigma_2^2) \quad \text{with probability } 1 - \lambda.$$
This distribution is called a mixture of normal distributions. The likelihood function is given by
$$L = \prod_{t=1}^{T}\left[\frac{\lambda}{\sqrt{2\pi}\,\sigma_1}\exp\left\{-\frac{(y_t - \mu_1)^2}{2\sigma_1^2}\right\} + \frac{1 - \lambda}{\sqrt{2\pi}\,\sigma_2}\exp\left\{-\frac{(y_t - \mu_2)^2}{2\sigma_2^2}\right\}\right]. \qquad (4.2.20)$$
If we put $\mu_1 = y_1$ and let $\sigma_1$ approach 0, the factor of the product that corresponds to $t = 1$ goes to infinity, and, consequently, $L$ goes to infinity. Hence, the global MLE cannot be consistent. Note that this example violates assumption C of Theorem 4.1.1 because $Q(\theta)$ does not attain a global maximum at $\theta_0$. However, the conditions of Theorem 4.1.2 are generally satisfied by this model. An extension of this model to the regression case is called the switching regression model (see Quandt and Ramsey, 1978).
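A brief numerical sketch (not part of the original text) of this degeneracy: with $\mu_1$ pinned at $y_1$, the $t = 1$ factor of (4.2.20) behaves like $\lambda/(\sqrt{2\pi}\,\sigma_1)$, so the log likelihood eventually grows without bound as $\sigma_1 \to 0$, even though it may first decrease while the first component still carries mass at the other observations. The mixture parameters, sample size, and seed below are illustrative assumptions.

```python
# A numerical sketch (not from the text) of the degeneracy in Example 4.2.2: with
# mu_1 pinned at y_1, the t = 1 factor of (4.2.20) behaves like lambda / (sqrt(2 pi) sigma_1),
# so log L eventually grows without bound as sigma_1 -> 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
lam, mu2, s2 = 0.5, 3.0, 1.0                   # illustrative mixture parameters
T = 20
comp1 = rng.random(T) < lam
y = np.where(comp1, rng.normal(0.0, 1.0, T), rng.normal(mu2, s2, T))

def log_lik(mu_1, sigma_1):
    dens = lam * norm.pdf(y, mu_1, sigma_1) + (1 - lam) * norm.pdf(y, mu2, s2)
    return np.sum(np.log(dens))

for sigma_1 in (1.0, 1e-2, 1e-10, 1e-50, 1e-100):
    print(f"sigma_1 = {sigma_1:g}   log L = {log_lik(y[0], sigma_1):.1f}")
```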
It is hard to construct examples in which the maximum likelihood estimator (assuming the likelihood function is correctly specified) is not consistent while another estimator is. Neyman and Scott (1948) have presented an interesting example of this type. In their example the MLE is not consistent because the number of incidental (or nuisance) parameters goes to infinity as the sample size goes to infinity.
4.2.3 Asymptotic Normality
The asymptotic normality of the maximum likelihood estimator or, more precisely, of a consistent root of the likelihood equation (4.2.1), can be analyzed by putting $Q_T = \log L_T$ in Theorem 4.1.3. If $\{y_t\}$ are independent, we can write
$$\log L_T = \sum_{t=1}^{T} \log f_t(y_t, \theta), \qquad (4.2.21)$$
where $f_t$ is the marginal density of $y_t$. Thus, under general conditions on $f_t$, we can apply a law of large numbers to $\partial^2 \log L_T/\partial\theta\,\partial\theta'$ and a central limit theorem to $\partial \log L_T/\partial\theta$. Even if $\{y_t\}$ are not independent, a law of large numbers and a central limit theorem may still be applicable as long as the degree of dependence is limited in a certain manner, as we shall show in later chapters. Thus we see that assumptions B and C of Theorem 4.1.3 are expected to hold generally in the case of the maximum likelihood estimator.
Moreover, when we use the characteristics of $L_T$ as a joint density function, we can get more specific results than Theorem 4.1.3; namely, as we have shown in Section 1.3.2, the regularity conditions on the likelihood function given in assumptions A' and B' of Section 1.3.2 imply
$$A(\theta_0) = -B(\theta_0). \qquad (4.2.22)$$
Therefore, we shall make (4.2.22) an additional assumption and state it formally as a theorem.
Theorem 4.2.4. Under the assumptions of Theorem 4.1.3 and assumption (4.2.22), the maximum likelihood estimator $\hat{\theta}_T$ satisfies
$$\sqrt{T}(\hat{\theta}_T - \theta_0) \to N\left(0,\; -\left[\lim_{T\to\infty} T^{-1} E\,\frac{\partial^2 \log L_T}{\partial\theta\,\partial\theta'}\bigg|_{\theta_0}\right]^{-1}\right). \qquad (4.2.23)$$
If $\{y_t\}$ are i.i.d. with the common density function $f(\cdot, \theta)$, we can replace assumptions B and C of Theorem 4.1.3 as well as the additional assumption (4.2.22) with the following conditions on $f(\cdot, \theta)$ itself:
$$\frac{\partial \log f}{\partial\theta} \text{ and } \frac{\partial^2 \log f}{\partial\theta\,\partial\theta'} \text{ exist in an open neighborhood of } \theta_0, \text{ and } \int f(y, \theta)\,dy = 1 \text{ may be twice differentiated under the integral sign there;} \qquad (4.2.24)$$
$$E\left[\frac{\partial \log f}{\partial\theta}\,\frac{\partial \log f}{\partial\theta'}\right]_{\theta_0} \text{ is finite and nonsingular;} \qquad (4.2.25)$$
$$\operatorname*{plim}_{T\to\infty}\; T^{-1}\sum_{t=1}^{T}\frac{\partial^2 \log f(y_t, \theta)}{\partial\theta\,\partial\theta'} = E\,\frac{\partial^2 \log f(y_t, \theta)}{\partial\theta\,\partial\theta'} \qquad (4.2.26)$$
uniformly in $\theta$ in an open neighborhood of $\theta_0$.
A sufficient set of conditions for (4.2.26) can be found by putting $g(y_t, \theta) = \partial^2 \log f(y_t, \theta)/\partial\theta_i\,\partial\theta_j - E\,\partial^2 \log f(y_t, \theta)/\partial\theta_i\,\partial\theta_j$ in Theorem 4.2.1. Because $\log L_T = \sum_{t=1}^{T}\log f(y_t, \theta)$ in this case, (4.2.26) implies assumption B of Theorem 4.1.3 because of Theorem 4.1.5. Assumption C of Theorem 4.1.3 follows from (4.2.24) and (4.2.25) on account of Theorem 3.3.4 (Lindeberg-Levy CLT), since (4.2.24) implies $E(\partial \log f/\partial\theta)_{\theta_0} = 0$. Finally, it is easy to show that assumptions (4.2.24)-(4.2.26) imply (4.2.22).
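The information equality (4.2.22) that these conditions deliver can be checked by simulation. The sketch below is not part of the original text and uses the Poisson($\theta$) density as an illustrative $f$; the true value and sample size are arbitrary choices.

```python
# A Monte Carlo sketch (not from the text) of the information equality (4.2.22) in the
# i.i.d. case, using the Poisson(theta) density as an illustrative f: the average of
# d^2 log f / d theta^2 and minus the average of (d log f / d theta)^2 both settle near -1/theta_0.
import numpy as np

rng = np.random.default_rng(5)
theta0 = 2.5                                   # illustrative true value
x = rng.poisson(theta0, size=1_000_000)

score = x / theta0 - 1.0                       # d log f / d theta  for f(x, theta) = e^{-theta} theta^x / x!
hessian = -x / theta0 ** 2                     # d^2 log f / d theta^2

print("average of d^2 log f / d theta^2    :", hessian.mean())          # about -1/theta_0 = -0.4
print("minus average of (d log f/d theta)^2:", -np.mean(score ** 2))    # also about -0.4
```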
We shall use the same model as that used in Example 4.2.1 and shall illustrate how the assumptions of Theorem 4.1.3 and the additional assumption (4.2.22) are satisfied. As for Example 4.2.1, the sole purpose of Example 4.2.3 is as an illustration, as the same results have already been obtained by a direct method in Chapter 1.
Example 4.2.3. Under the same assumptions made in Example 4.2.1, prove the asymptotic normality of the maximum likelihood estimator $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$.
We first obtain the first and second derivatives of $\log L$:
$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}X'(y - X\beta), \qquad (4.2.27)$$
$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta), \qquad (4.2.28)$$
$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2}X'X, \qquad (4.2.29)$$
$$\frac{\partial^2 \log L}{\partial\beta\,\partial\sigma^2} = -\frac{1}{\sigma^4}X'(y - X\beta), \qquad (4.2.30)$$
$$\frac{\partial^2 \log L}{\partial(\sigma^2)^2} = \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta). \qquad (4.2.31)$$
From (4.2.29), (4.2.30), and (4.2.31) we can clearly see that assumptions A and B of Theorem 4.1.3 are satisfied. Also from these equations we can evaluate the elements of $A(\theta_0)$:
$$\lim_{T\to\infty} T^{-1}E\,\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\bigg|_{\theta_0} = -\frac{1}{\sigma_0^2}\lim_{T\to\infty} T^{-1}X'X, \qquad (4.2.32)$$
$$\lim_{T\to\infty} T^{-1}E\,\frac{\partial^2 \log L}{\partial\beta\,\partial\sigma^2}\bigg|_{\theta_0} = 0, \qquad (4.2.33)$$
$$\lim_{T\to\infty} T^{-1}E\,\frac{\partial^2 \log L}{\partial(\sigma^2)^2}\bigg|_{\theta_0} = -\frac{1}{2\sigma_0^4}. \qquad (4.2.34)$$
From (4.2.27) and (4.2.28) we obtain
$$\frac{1}{\sqrt{T}}\frac{\partial \log L}{\partial\beta}\bigg|_{\theta_0} = \frac{1}{\sigma_0^2\sqrt{T}}X'u \quad \text{and} \quad \frac{1}{\sqrt{T}}\frac{\partial \log L}{\partial\sigma^2}\bigg|_{\theta_0} = \frac{1}{2\sigma_0^4\sqrt{T}}\sum_{t=1}^{T}(u_t^2 - \sigma_0^2).$$
Thus, by applying either the Lindeberg-Feller or Liapounov CLT to a sequence of an arbitrary linear combination of the $(K + 1)$-vector $(x_{1t}u_t, x_{2t}u_t, \ldots, x_{Kt}u_t, u_t^2 - \sigma_0^2)$, we can show
$$\frac{1}{\sqrt{T}}\frac{\partial \log L}{\partial\beta}\bigg|_{\theta_0} \to N\left(0,\; \frac{1}{\sigma_0^2}\lim_{T\to\infty} T^{-1}X'X\right) \qquad (4.2.35)$$
and
$$\frac{1}{\sqrt{T}}\frac{\partial \log L}{\partial\sigma^2}\bigg|_{\theta_0} \to N\left(0,\; \frac{1}{2\sigma_0^4}\right), \qquad (4.2.36)$$
with zero asymptotic covariance between (4.2.35) and (4.2.36). Thus assumption C of Theorem 4.1.3 has been shown to hold. Finally, results (4.2.32) through (4.2.36) show that assumption (4.2.22) is satisfied. We write the conclusion (4.2.23) specifically for the present example as
$$\begin{bmatrix}\sqrt{T}(\hat{\beta} - \beta_0)\\ \sqrt{T}(\hat{\sigma}^2 - \sigma_0^2)\end{bmatrix} \to N\left(0,\; \begin{bmatrix}\sigma_0^2\left(\lim_{T\to\infty} T^{-1}X'X\right)^{-1} & 0\\ 0 & 2\sigma_0^4\end{bmatrix}\right). \qquad (4.2.37)$$
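A simulation sketch (not part of the original text) of the conclusion (4.2.37): across replications, the sampling variance of $\sqrt{T}(\hat{\beta} - \beta_0)$ should approximate $\sigma_0^2(T^{-1}X'X)^{-1}$ and that of $\sqrt{T}(\hat{\sigma}^2 - \sigma_0^2)$ should approximate $2\sigma_0^4$. The regressor design, true values, sample size, and replication count are illustrative assumptions.

```python
# A simulation sketch (not from the text) of the conclusion (4.2.37): across replications,
# the sampling variance of sqrt(T)(beta_hat - beta_0) should be close to
# sigma_0^2 (T^{-1} X'X)^{-1}, and that of sqrt(T)(sigma2_hat - sigma_0^2) close to 2 sigma_0^4.
import numpy as np

rng = np.random.default_rng(6)
T, R = 2_000, 2_000                            # illustrative sample size and replication count
beta0, sigma2_0 = np.array([1.0, -2.0]), 4.0
X = np.column_stack([np.ones(T), rng.uniform(-1, 1, T)])    # regressors held fixed across replications
XtX_inv = np.linalg.inv(X.T @ X / T)

draws_beta, draws_s2 = [], []
for _ in range(R):
    y = X @ beta0 + rng.normal(0.0, np.sqrt(sigma2_0), T)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ b) ** 2) / T
    draws_beta.append(np.sqrt(T) * (b - beta0))
    draws_s2.append(np.sqrt(T) * (s2 - sigma2_0))

print("simulated var of sqrt(T)(beta_hat - beta_0):\n", np.cov(np.array(draws_beta).T))
print("theory: sigma_0^2 (T^{-1} X'X)^{-1}:\n", sigma2_0 * XtX_inv)
print("simulated var of sqrt(T)(sigma2_hat - sigma_0^2):", np.var(draws_s2),
      "   theory: 2 sigma_0^4 =", 2 * sigma2_0 ** 2)
```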
There are cases where the global maximum likelihood estimator exists but does not satisfy the likelihood equation (4.2.1). Then Theorem 4.1.3 cannot be used to prove the asymptotic normality of the MLE. The model of Aigner, Amemiya, and Poirier (1976) is such an example. In their model, $\operatorname{plim} T^{-1}\log L_T$ exists and is maximized at the true parameter value $\theta_0$, so that the MLE is consistent. However, problems arise because $\operatorname{plim} T^{-1}\log L_T$ is not smooth at $\theta_0$; it looks like Figure 4.1. In such a case, it is generally difficult to prove asymptotic normality.

[Figure 4.1  The log likelihood function in a nonregular case]
4.2.4 Asymptotic Efficiency
The asymptotic normality (4.2.23) means that if $T$ is large the variance-covariance matrix of the maximum likelihood estimator may be approximated by
$$-\left[E\,\frac{\partial^2 \log L_T}{\partial\theta\,\partial\theta'}\bigg|_{\theta_0}\right]^{-1}. \qquad (4.2.38)$$
But (4.2.38) is precisely the Cramer-Rao lower bound of an unbiased estimator derived in Section 1.3.2. At one time statisticians believed that a consistent and asymptotically normal estimator with the asymptotic covariance matrix
(4.2.38) was asymptotically minimum variance among all consistent and asymptotically normal estimators. But this was proved wrong by the following counterexample, attributed to Hodges and reported in LeCam (1953).
Example 4.2.4. Let $\hat{\theta}_T$ be an estimator of a scalar parameter $\theta$ such that $\operatorname{plim}\hat{\theta}_T = \theta$ and $\sqrt{T}(\hat{\theta}_T - \theta) \to N[0, v(\theta)]$. Define the estimator $\theta_T^* = w_T\hat{\theta}_T$, where
$$w_T = 0 \quad \text{if } |\hat{\theta}_T| < T^{-1/4},$$
$$w_T = 1 \quad \text{if } |\hat{\theta}_T| \ge T^{-1/4}.$$
It can be shown (the proof is left as an exercise) that $\sqrt{T}(\theta_T^* - \theta) \to N[0, v^*(\theta)]$, where $v^*(0) = 0$ and $v^*(\theta) = v(\theta)$ if $\theta \ne 0$.
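A simulation sketch (not part of the original text) of this example: with $\hat{\theta}_T$ the sample mean of $N(\theta, 1)$ observations, so that $v(\theta) = 1$, the simulated variance of $\sqrt{T}(\theta_T^* - \theta)$ is near 0 at $\theta = 0$ and near 1 at $\theta = 1$. The sample size and replication count are illustrative assumptions.

```python
# A simulation sketch (not from the text) of Hodges' estimator: theta_hat is the sample
# mean of N(theta, 1) observations, so v(theta) = 1; the simulated variance of
# sqrt(T)(theta* - theta) is near 0 at theta = 0 and near 1 at theta = 1.
import numpy as np

rng = np.random.default_rng(7)
T, R = 10_000, 100_000                         # illustrative sample size and replication count

def sim_var(theta):
    # theta_hat ~ N(theta, 1/T) exactly, so draw it directly instead of averaging raw data
    theta_hat = rng.normal(theta, 1.0 / np.sqrt(T), size=R)
    w = (np.abs(theta_hat) >= T ** (-0.25)).astype(float)   # w_T = 0 if |theta_hat| < T^{-1/4}
    theta_star = w * theta_hat
    return np.var(np.sqrt(T) * (theta_star - theta))

print("simulated asymptotic variance at theta = 0:", sim_var(0.0))   # near v*(0) = 0
print("simulated asymptotic variance at theta = 1:", sim_var(1.0))   # near v*(1) = v(1) = 1
```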
The estimator $\theta_T^*$ of Example 4.2.4 is said to be superefficient. Despite the existence of superefficient estimators, we can still say something good about an estimator with the asymptotic variance-covariance matrix (4.2.38). We shall state two such results without proof. One is the result of LeCam (1953), which states that the set of $\theta$ points on which a superefficient estimator has an asymptotic variance smaller than the Cramér-Rao lower bound is of Lebesgue measure zero. The other is the result of Rao (1973, p. 350) that the matrix
(4.2.38) is the lower bound for the asymptotic variance-covariance matrix of all the consistent and asymptotically normal estimators for which the convergence to a normal distribution is uniform over compact intervals of $\theta$. These results seem to justify our use of the term asymptotically efficient in the following sense:
Definition 4.2.1. A consistent estimator is said to be asymptotically efficient if it satisfies statement (4.2.23).
Thus the maximum likelihood estimator under the appropriate assumptions is asymptotically efficient by definition. An asymptotically efficient estimator is also referred to as best asymptotically normal (BAN for short). There are BAN estimators other than MLE. Many examples of these will be discussed in subsequent chapters: for example, the weighted least squares estimator will be discussed in Section 6.5.3, the two-stage and three-stage least squares estimators in Sections 7.3.3 and 7.4, and the minimum chi-square estimators in Section 9.2.5. Barankin and Gurland (1951) have presented a general method of generating BAN estimators. Because their results are mathematically too abstract to present here, we shall state only a simple corollary of their results: Let $\{y_t\}$ be an i.i.d. sequence of random vectors with $E\,y_t = \mu(\theta)$, $E(y_t - \mu)(y_t - \mu)' = \Sigma(\theta)$, and with exponential family density
$$f(y, \theta) = \exp\left[\alpha_0(\theta) + \beta_0(y) + \alpha(\theta)'h(y)\right],$$
and define $z_T = T^{-1}\sum_{t=1}^{T} y_t$. Then the minimization of $[z_T - \mu(\theta)]'\Sigma(\theta)^{-1}[z_T - \mu(\theta)]$ yields a BAN estimator of $\theta$ (see also Taylor, 1953; Ferguson, 1958).4
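A sketch (not part of the original text) of this construction for an illustrative case: take $y_t = (x_t, 1\{x_t = 0\})$ with $x_t$ Poisson($\theta$), so that $\mu(\theta) = (\theta, e^{-\theta})$ and $\Sigma(\theta)$ is the exact covariance of $y_t$; minimizing the quadratic form in $z_T$ gives a BAN estimator, which here lands very close to the MLE (the sample mean). The moment vector, distribution, and true value are assumptions made only for this illustration.

```python
# A sketch (not from the text) of the Barankin-Gurland construction for an illustrative
# case: y_t = (x_t, 1{x_t = 0}) with x_t ~ Poisson(theta), so mu(theta) = (theta, e^{-theta})
# and Sigma(theta) is the exact covariance of y_t. Minimizing the quadratic form in z_T
# gives a BAN estimator, which here lands very close to the MLE (the sample mean).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
theta0 = 1.5                                   # illustrative true value
x = rng.poisson(theta0, size=5_000)
z = np.array([x.mean(), (x == 0).mean()])      # z_T = T^{-1} sum_t y_t

def mu(th):
    return np.array([th, np.exp(-th)])

def Sigma(th):
    p0 = np.exp(-th)
    return np.array([[th, -th * p0],
                     [-th * p0, p0 * (1.0 - p0)]])

def objective(th):
    d = z - mu(th)
    return d @ np.linalg.solve(Sigma(th), d)

theta_mcs = minimize_scalar(objective, bounds=(0.1, 10.0), method="bounded").x
print("minimum chi-square estimate:", theta_mcs, "   MLE (sample mean):", x.mean())
```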
Different BAN estimators may have different finite sample distributions. Recently many interesting articles in the statistical literature have compared the approximate distribution (with a higher degree of accuracy than the asymptotic distribution) of the MLE with those of other BAN estimators. For example, Ghosh and Subramanyam (1974) have shown that in the exponential family the mean squared error up to $O(T^{-2})$ of the MLE after correcting its bias up to $O(T^{-1})$ is smaller than that of any other BAN estimator with similarly corrected bias. This result is referred to as the second-order efficiency of the MLE,5 and examples of it will be given in Sections 7.3.5 and 9.2.6.