A COMPANION TO Theoretical Econometrics
Asymptotic Properties
At the beginning, we presented an intuitive justification for the GMM estimation framework. In this section, we provide a more rigorous argument by establishing the consistency and asymptotic normality of the estimator. The latter facilitates the construction of large sample confidence intervals for $\theta_0$. These intervals depend on the long-run variance of the sample moment, and so we briefly discuss how this variance can be consistently estimated. We also derive the asymptotic distribution of the estimated sample moment. The latter analysis provides further evidence of the connection between $g_T(\hat\theta_T)$ and the overidentifying restrictions, and plays an important role in the model specification test discussed in Section 6.
The analysis rests on applications of the laws of large numbers (LLN) and the central limit theorem (CLT) to functions of $v_t$. So far, we have only restricted $v_t$ to be stationary, and this is insufficient by itself to guarantee these limit theorems. Accordingly, we impose the following condition.
Assumption 7. Ergodicity. The random process $\{v_t; -\infty < t < \infty\}$ is ergodic.
A formal definition of ergodicity involves rather sophisticated mathematical ideas and is beyond the scope of this chapter. Instead we refer the interested reader to Davidson (1994, pp. 199-203). It is sufficient for ergodicity that the dependence between $v_t$ and $v_{t-m}$ decreases at a certain rate to zero as $m \to \infty$. If $v_t$ satisfies this type of restriction then it is called a mixing process. Certain other regularity conditions must also be imposed but due to space constraints we shall not explore them here. Instead we refer the interested reader to Hansen's (1982) original article or the surveys by Newey and McFadden (1994) and Wooldridge (1994).
Recall that $\hat\theta_T$ is consistent for $\theta_0$ if $\hat\theta_T \xrightarrow{p} \theta_0$. If there is a closed form solution for $\hat\theta_T$ then it is relatively straightforward to establish whether or not the estimator is consistent by examining the limiting behavior of appropriate functions of the data. Unfortunately, as remarked above, we do not have this luxury in most nonlinear models. However, although there is no closed form, $\hat\theta_T$ is clearly defined by (11.8). The key to a proof of consistency is the consideration of what happens if we perform a similar minimization on the population analog to $Q_T(\theta)$,
$$Q_0(\theta) = \{E[f(v_t, \theta)]\}'\, W\, \{E[f(v_t, \theta)]\}. \tag{11.15}$$
The answer follows directly from our earlier assumptions. The population moment condition implies $Q_0(\theta_0) = 0$. The global identification condition and the positive definiteness of $W$ imply $Q_0(\theta) > 0$ for all $\theta \neq \theta_0$. Taken together these two properties imply $Q_0(\theta)$ has a unique minimum at $\theta = \theta_0$. Intuition suggests that if $\hat\theta_T$ minimizes $Q_T(\theta)$ and $Q_T(\theta)$ converges in probability to a function, $Q_0(\theta)$, with a unique minimum at $\theta_0$, then $\hat\theta_T$ converges in probability to $\theta_0$. In essence this intuition is correct but there is one mathematical detail which needs to be taken into account. It is not necessarily the case that the minimum of a sequence of functions converges to the minimum of the limit of the sequence of functions. For this to be the case, it is sufficient that $Q_T(\theta)$ converges uniformly to $Q_0(\theta)$.$^{8}$
Assumption 8. Uniform Convergence in Probability of $Q_T(\theta)$.
$\sup_{\theta \in \Theta} |Q_T(\theta) - Q_0(\theta)| \xrightarrow{p} 0$.
Once uniform convergence is imposed, consistency can be established along the lines described above; e.g., see Hansen (1982) or Wooldridge (1994).
Theorem 1. Consistency of the Parameter Estimator. If Assumptions 1-4, 6-8 and certain other regularity conditions hold then $\hat\theta_T \xrightarrow{p} \theta_0$.
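To see why pointwise convergence alone does not deliver consistency, it may help to look at a deterministic toy example. The functions below are hypothetical and are not taken from the chapter: on the parameter space [0, 2] the limit objective is Q_0(theta) = theta, with unique minimum at 0, while Q_T subtracts a tent-shaped dip of depth 2 on the shrinking interval (1, 1 + 1/T). At each fixed theta, Q_T(theta) equals Q_0(theta) once T is large enough, yet the sup-distance stays near 2 and the minimizer of Q_T stays near 1. A minimal sketch, using a brute-force grid search:

```python
def q_limit(theta):
    """Hypothetical limit objective Q_0 on [0, 2]; unique minimum at theta = 0."""
    return theta

def q_sample(theta, T):
    """Hypothetical Q_T: Q_0 minus a tent-shaped dip of depth 2 on (1, 1 + 1/T)."""
    center = 1.0 + 0.5 / T
    tent = max(0.0, 1.0 - 2.0 * T * abs(theta - center))
    return theta - 2.0 * tent

def grid(lo=0.0, hi=2.0, n=20_001):
    """Evenly spaced grid on [lo, hi] (step 1e-4 with the defaults)."""
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]

def grid_argmin(func):
    """Brute-force minimizer of func over the grid."""
    return min(grid(), key=func)

def sup_gap(T):
    """Grid approximation to the sup over theta of |Q_T(theta) - Q_0(theta)|."""
    return max(abs(q_sample(th, T) - q_limit(th)) for th in grid())

for T in (10, 100, 1000):
    theta_T = grid_argmin(lambda th: q_sample(th, T))
    # The minimizer stays just above 1 and the sup-gap does not shrink.
    print(T, theta_T, sup_gap(T))
```

Assumption 8 rules out exactly this kind of behavior: the dip never flattens, so the sequence fails to converge uniformly and the argument behind Theorem 1 does not apply to it.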
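To develop the asymptotic distribution of the estimator, we require an asymptotically valid closed form representation for $T^{1/2}(\hat\theta_T - \theta_0)$. This representation comes from an application of the Mean Value Theorem$^{9}$ which relates $f(\cdot)$ to its first derivatives $\partial f(v_t, \theta)/\partial\theta'$. So, to pursue this approach, it is necessary to impose Assumption 5. The Mean Value Theorem implies that
$$g_T(\hat\theta_T) = g_T(\theta_0) + G_T(\hat\theta_T, \theta_0, \lambda_T)(\hat\theta_T - \theta_0), \tag{11.16}$$
where $G_T(\hat\theta_T, \theta_0, \lambda_T)$ is the $(q \times p)$ matrix whose $i$th row is the corresponding row of $G_T(\bar\theta_T^{(i)})$, where $G_T(\theta) = T^{-1}\sum_{t=1}^{T} \partial f(v_t, \theta)/\partial\theta'$, $\bar\theta_T^{(i)} = \lambda_{T,i}\theta_0 + (1 - \lambda_{T,i})\hat\theta_T$ for some $0 \le \lambda_{T,i} \le 1$, and $\lambda_T$ is the $(q \times 1)$ vector with $i$th element $\lambda_{T,i}$. Premultiplication of (11.16) by $G_T(\hat\theta_T)'W_T$ yields
$$G_T(\hat\theta_T)'W_T\, g_T(\hat\theta_T) = G_T(\hat\theta_T)'W_T\, g_T(\theta_0) + G_T(\hat\theta_T)'W_T\, G_T(\hat\theta_T, \theta_0, \lambda_T)(\hat\theta_T - \theta_0). \tag{11.17}$$
Now the first order conditions in (11.9) imply the left-hand side of (11.17) is zero and so with some rearrangement it follows from (11.17) that
$$\begin{aligned} T^{1/2}(\hat\theta_T - \theta_0) &= -[G_T(\hat\theta_T)'W_T\, G_T(\hat\theta_T, \theta_0, \lambda_T)]^{-1} G_T(\hat\theta_T)'W_T\, T^{1/2} g_T(\theta_0) \\ &= M_T\, T^{1/2} g_T(\theta_0), \text{ say.} \end{aligned} \tag{11.18}$$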
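The representation in (11.18) can be checked numerically in a case where the mean value expansion is exact. The sketch below assumes a linear instrumental variables moment, f(v_t, theta) = z_t (y_t - x_t theta), for which G_T does not depend on theta; the simulated design, the weighting matrix and all variable names are illustrative assumptions rather than part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
T, theta0 = 500, 1.5

# Hypothetical linear IV design: one regressor x_t, two instruments z_t (q = 2, p = 1).
z = rng.normal(size=(T, 2))
u = rng.normal(size=T)
x = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=T)
y = theta0 * x + u

W = np.array([[2.0, 0.3], [0.3, 1.0]])  # any positive definite weighting matrix

def g_T(theta):
    """Sample moment g_T(theta) = T^{-1} sum_t z_t (y_t - x_t theta), shape (q,)."""
    return z.T @ (y - x * theta) / T

# G_T = d g_T / d theta' is constant in theta for this linear moment.
G = -(z.T @ x / T).reshape(2, 1)

# First order conditions G' W g_T(theta_hat) = 0, with g_T(theta) = g_T(0) + G theta.
theta_hat = -np.linalg.solve(G.T @ W @ G, G.T @ W @ g_T(0.0)).item()

# M_T as in (11.18); both G_T arguments coincide because G_T is constant.
M_T = -np.linalg.solve(G.T @ W @ G, G.T @ W)

lhs = np.sqrt(T) * (theta_hat - theta0)
rhs = (M_T @ (np.sqrt(T) * g_T(theta0))).item()
print(lhs, rhs)  # the two sides agree up to floating point error
```

In nonlinear models the two sides agree only through the unknown intermediate points, which is why the argument proceeds via probability limits rather than exact identities.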
Equation (11.18) implies that $T^{1/2}(\hat\theta_T - \theta_0)$ behaves like the product of a random matrix, $M_T$, and a random vector, $T^{1/2} g_T(\theta_0)$. Therefore, we can derive the asymptotic distribution of the estimator from the limiting behavior of these two components. The asymptotic behavior of $T^{1/2} g_T(\theta_0)$ is given by a version of the CLT.
Assumption 9. Central Limit Theorem for $T^{1/2} g_T(\theta_0)$. $T^{1/2} g_T(\theta_0) \xrightarrow{d} N(0, S)$ where $S$ is a positive definite matrix of constants.
Now consider $M_T$. Since $\hat\theta_T \xrightarrow{p} \theta_0$ and $\bar\theta_T^{(i)}$ lies on the line segment between $\hat\theta_T$ and $\theta_0$, it follows that $\bar\theta_T^{(i)} \xrightarrow{p} \theta_0$ for $i = 1, 2, \ldots, q$. Intuition suggests that this should imply both $G_T(\hat\theta_T)$ and $G_T(\hat\theta_T, \theta_0, \lambda_T)$ converge in probability to $G_0 = E[\partial f(v_t, \theta_0)/\partial\theta']$. In essence this is correct, but the argument can only be formally justified if $G_T(\theta)$ converges uniformly and certain other regularity conditions apply. For brevity, we adopt the high level assumption that the desired behavior occurs, and refer the interested reader to Newey and McFadden (1994) for the necessary underlying regularity conditions.
Assumption 10. Convergence of $G_T(\hat\theta_T)$ and $G_T(\hat\theta_T, \theta_0, \lambda_T)$. $G_T(\hat\theta_T) \xrightarrow{p} G_0$ and $G_T(\hat\theta_T, \theta_0, \lambda_T) \xrightarrow{p} G_0$.
Assumptions 6 and 10 can be combined with Slutsky's Theorem to deduce that $M_T \xrightarrow{p} -(G_0'WG_0)^{-1}G_0'W$; the leading sign carries over from (11.18) and has no effect on the asymptotic variance. Therefore, $T^{1/2}(\hat\theta_T - \theta_0)$ is the product of a random matrix which converges in probability to a constant, and a random vector which converges to a normal distribution. This structure implies:$^{10}$
Theorem 2. Asymptotic distribution of the estimator. If Assumptions 1-10 and certain other regularity conditions hold then: $T^{1/2}(\hat\theta_T - \theta_0) \xrightarrow{d} N(0, MSM')$ where $M = (G_0'WG_0)^{-1}G_0'W$.
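The step from the product structure to Theorem 2 can be spelled out as follows. This is a sketch under Assumptions 9 and 10; the symbol $Z$ for the limiting normal vector is a label introduced here, and the minus sign follows the convention of (11.18).

```latex
\begin{align*}
T^{1/2} g_T(\theta_0) &\xrightarrow{d} Z \sim N(0, S),
  \qquad M_T \xrightarrow{p} -M, \quad M = (G_0' W G_0)^{-1} G_0' W, \\
T^{1/2}(\hat\theta_T - \theta_0) = M_T \, T^{1/2} g_T(\theta_0)
  &\xrightarrow{d} -M Z, \\
-M Z &\sim N\bigl(0, \, (-M) S (-M)'\bigr) = N(0, \, M S M').
\end{align*}
```

The second line is the combination of convergence in probability and convergence in distribution, and the third uses the fact that a fixed linear transformation of a normal vector is normal. The sign of $M$ cancels in the covariance matrix.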
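Theorem 2 implies that an approximate $100(1 - \alpha)\%$ confidence interval for $\theta_{0,i}$ in large samples is given by
$$\hat\theta_{T,i} \pm z_{\alpha/2} \sqrt{\hat V_{T,ii}/T}, \tag{11.19}$$
where $\hat V_{T,ii}$ is the $i$-$i$th element of a consistent estimator of $MSM'$ and $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution. In practice, the asymptotic variance can be consistently estimated by $\hat V_T = \hat M_T \hat S_T \hat M_T'$ where $\hat M_T = [G_T(\hat\theta_T)'W_T G_T(\hat\theta_T)]^{-1}G_T(\hat\theta_T)'W_T$, and $\hat S_T$ is a consistent estimator of $S$. The construction of $\hat S_T$ depends on the time series properties of $f(v_t, \theta_0)$. With certain relatively mild additional conditions, it can be shown that$^{11}$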
$$S = \Gamma_0 + \sum_{i=1}^{\infty} (\Gamma_i + \Gamma_i'), \tag{11.20}$$
where $\Gamma_j = E[(f_t - E[f_t])(f_{t-j} - E[f_{t-j}])']$ is known as the $j$th autocovariance matrix$^{12}$ of $f_t = f(v_t, \theta_0)$. For brevity, we distinguish only two cases of interest. First, if $f_t$ is a martingale difference (MD) sequence and hence serially uncorrelated (that is, $\Gamma_i = 0$, $i \neq 0$) then $S$ can be consistently estimated by$^{13}$
$$\hat S_{MD} = T^{-1} \sum_{t=1}^{T} \hat f_t \hat f_t', \tag{11.21}$$
where $\hat f_t = f(v_t, \hat\theta_T)$. It can be shown that $\hat S_{MD} \xrightarrow{p} S$ if the martingale difference assumption is valid; for example see White (1994, Theorem 8.27, p. 193). Second, and more generally, $S$ can be estimated by a member of the class of heteroskedasticity autocorrelation consistent covariance (HACC) estimators,
$$\hat S_{HACC} = \hat\Gamma_0 + \sum_{i=1}^{b(T)} \omega_{i,T} (\hat\Gamma_i + \hat\Gamma_i'), \tag{11.22}$$
where $\hat\Gamma_i = T^{-1} \sum_{t=i+1}^{T} \hat f_t \hat f_{t-i}'$, $\{\omega_{i,T}\}$ are known as weights and $b(T)$ is the bandwidth. The weights and bandwidth must satisfy certain conditions if $\hat S_{HACC}$ is to be both positive semi-definite$^{14}$ and consistent. Various combinations have been proposed in the literature.$^{15}$ One example is the "Bartlett" kernel proposed in this context by Newey and West (1987) for which $\omega_{i,T} = 1 - i/[b(T) + 1]$. Andrews (1991) shows that this choice yields a consistent estimator if $b(T) \to \infty$ as $T \to \infty$ at a suitably slow rate.
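The estimators in (11.21) and (11.22) can be sketched in a few lines. The function below assumes the Bartlett weights described above and takes the bandwidth as a user-supplied integer; the function name and argument layout (a T x q array whose rows are the estimated moments) are illustrative choices, not notation from the chapter. Setting the bandwidth to zero returns the martingale difference estimator.

```python
import numpy as np

def hac_bartlett(f_hat, bandwidth):
    """Bartlett-kernel HACC estimate of S from a (T x q) array of moments.

    Row t of f_hat holds f(v_t, theta_hat)'. The moments are not demeaned,
    in line with the maintained assumption E[f(v_t, theta_0)] = 0.
    bandwidth = 0 gives the martingale difference estimator (11.21).
    """
    f_hat = np.asarray(f_hat, dtype=float)
    T = f_hat.shape[0]
    S = f_hat.T @ f_hat / T                       # Gamma_hat_0
    for i in range(1, bandwidth + 1):
        weight = 1.0 - i / (bandwidth + 1.0)      # Bartlett weight
        # Gamma_hat_i = T^{-1} sum_{t=i+1}^{T} f_t f_{t-i}'
        gamma_i = f_hat[i:].T @ f_hat[:-i] / T
        S += weight * (gamma_i + gamma_i.T)
    return S

# Small hand-checkable case: f_t alternates between 1 and -1 (T = 4, q = 1).
f_demo = [[1.0], [-1.0], [1.0], [-1.0]]
print(hac_bartlett(f_demo, 0))   # Gamma_hat_0 = 1
print(hac_bartlett(f_demo, 1))   # 1 + 0.5 * (-0.75 - 0.75) = 0.25
```

The declining weights are what keep the estimate positive semi-definite; replacing them with equal weights can produce a matrix with negative eigenvalues in finite samples.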
In practice, the researcher must choose both the bandwidth and the weights. While this choice can be guided by asymptotic theory, there is no consensus to date upon what choice is best in the sample sizes encountered in economics and finance.$^{16}$ It should be noted that the consistency of both $\hat S_{MD}$ and $\hat S_{HACC}$ is predicated on $E[f(v_t, \theta_0)] = 0$. If the model is misspecified, and hence Assumption 3 is violated, then neither estimator is consistent. This inconsistency can have important consequences for the model specification test described in Section 6, and we return to this issue there.
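Once an estimate of $S$ is available, the large sample intervals implied by Theorem 2 follow mechanically. The sketch below assumes the usual symmetric normal-based form, estimate plus or minus a standard normal critical value times the square root of the relevant diagonal element of the estimated variance divided by T. The function name and its arguments are hypothetical, and the inputs are assumed to have been computed beforehand.

```python
import numpy as np
from statistics import NormalDist

def gmm_confidence_intervals(theta_hat, G_hat, W, S_hat, T, alpha=0.05):
    """Approximate 100(1 - alpha)% intervals for each element of theta_0.

    theta_hat: length-p estimate; G_hat: (q x p) matrix G_T(theta_hat);
    W: (q x q) weighting matrix; S_hat: (q x q) estimate of S; T: sample size.
    """
    theta_hat = np.atleast_1d(np.asarray(theta_hat, dtype=float))
    G_hat = np.asarray(G_hat, dtype=float)
    W = np.asarray(W, dtype=float)
    S_hat = np.asarray(S_hat, dtype=float)
    # M_hat = [G'WG]^{-1} G'W, shape (p x q)
    M_hat = np.linalg.solve(G_hat.T @ W @ G_hat, G_hat.T @ W)
    V_hat = M_hat @ S_hat @ M_hat.T               # estimate of M S M'
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # standard normal critical value
    half_width = z * np.sqrt(np.diag(V_hat) / T)
    return theta_hat - half_width, theta_hat + half_width

# Scalar check: G = 2, W = 1, S = 4 gives M = 0.5 and M S M' = 1,
# so with T = 100 the standard error is 0.1.
lower, upper = gmm_confidence_intervals([1.0], [[2.0]], [[1.0]], [[4.0]], T=100)
print(lower, upper)
```

If the moment condition fails, the estimate of $S$ fed into this calculation is no longer consistent, so the resulting intervals should not be relied upon.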
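Finally, we consider the asymptotic distribution of the estimated sample moment. It is most convenient to work with the transformed moment, $W_T^{1/2}T^{1/2}g_T(\hat\theta_T)$. Equation (11.16) implies
$$W_T^{1/2}T^{1/2}g_T(\hat\theta_T) = W_T^{1/2}T^{1/2}g_T(\theta_0) + W_T^{1/2}G_T(\hat\theta_T, \theta_0, \lambda_T)\,T^{1/2}(\hat\theta_T - \theta_0). \tag{11.23}$$
If we substitute for $T^{1/2}(\hat\theta_T - \theta_0)$ from (11.18) then (11.23) can be written as
$$W_T^{1/2}T^{1/2}g_T(\hat\theta_T) = N_T(\hat\theta_T)\,W_T^{1/2}T^{1/2}g_T(\theta_0), \tag{11.24}$$
where
$$N_T(\hat\theta_T) = I_q - W_T^{1/2}G_T(\hat\theta_T, \theta_0, \lambda_T)\,[G_T(\hat\theta_T)'W_T\,G_T(\hat\theta_T, \theta_0, \lambda_T)]^{-1}\,G_T(\hat\theta_T)'(W_T^{1/2})'.$$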
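The substitution behind (11.24) can be written out term by term. The sketch assumes the square root is defined so that $W_T = (W_T^{1/2})'W_T^{1/2}$, and uses the shorthand $\hat G_T = G_T(\hat\theta_T)$ and $\bar G_T = G_T(\hat\theta_T, \theta_0, \lambda_T)$, both introduced here only to shorten the display.

```latex
\begin{align*}
W_T^{1/2} T^{1/2} g_T(\hat\theta_T)
 &= W_T^{1/2} T^{1/2} g_T(\theta_0)
    + W_T^{1/2} \bar G_T \, T^{1/2}(\hat\theta_T - \theta_0) \\
 &= W_T^{1/2} T^{1/2} g_T(\theta_0)
    - W_T^{1/2} \bar G_T \, [\hat G_T' W_T \bar G_T]^{-1} \hat G_T' W_T \, T^{1/2} g_T(\theta_0) \\
 &= \bigl\{ I_q - W_T^{1/2} \bar G_T \, [\hat G_T' W_T \bar G_T]^{-1}
    \hat G_T' (W_T^{1/2})' \bigr\} \, W_T^{1/2} T^{1/2} g_T(\theta_0).
\end{align*}
```

The first line is (11.23), the second inserts (11.18), and the third factors $W_T = (W_T^{1/2})'W_T^{1/2}$ so that $W_T^{1/2}T^{1/2}g_T(\theta_0)$ appears on the right of both terms. When $W_T^{1/2}$ is chosen to be symmetric the final transpose can be dropped.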
Equation (11.24) implies $W_T^{1/2}T^{1/2}g_T(\hat\theta_T)$ has the same generic structure as the expression for $T^{1/2}(\hat\theta_T - \theta_0)$ in (11.18), namely: a random matrix, which converges to a matrix of constants, times a random vector which converges to a normal distribution. Therefore, we can use the same logic as before to deduce the following result; see Hansen (1982).
Theorem 3. Asymptotic distribution of the estimated sample moment. If Assumptions 1-10 and certain other regularity conditions hold then: $W_T^{1/2}T^{1/2}g_T(\hat\theta_T) \xrightarrow{d} N(0, NSN')$ where $N = [I_q - P(\theta_0)]W^{1/2}$ and $P(\theta_0) = F(\theta_0)[F(\theta_0)'F(\theta_0)]^{-1}F(\theta_0)'$.
The connection between the estimated sample moment and the overidentifying restrictions manifests itself in the asymptotic distribution. Equation (11.24) implies that
$$W_T^{1/2}T^{1/2}g_T(\hat\theta_T) = [I_q - P(\theta_0)]\,W^{1/2}T^{1/2}g_T(\theta_0) + o_p(1). \tag{11.25}$$
Inspection of (11.25) reveals that the asymptotic behavior of the estimated sample moment is governed by the function of the data which appears in the overidentifying restrictions. Therefore, the mean of the asymptotic distribution in Theorem 3 is zero because the overidentifying restrictions are satisfied at $\theta_0$. This relationship also has an impact on the properties of the variance of the limiting distribution. Since $W^{1/2}$ and $S$ are nonsingular, it follows that$^{17}$ $\mathrm{rank}\{NSN'\} = \mathrm{rank}\{I_q - P(\theta_0)\} = q - p$, and so the covariance matrix is singular.$^{18}$ This rank is easily recognized to be the number of overidentifying restrictions.
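The rank result can be illustrated numerically. The sketch below takes $F(\theta_0) = W^{1/2}G_0$, which is assumed here to match the definition used in the earlier discussion of the overidentifying restrictions, and builds $W^{1/2}$ from a Cholesky factor so that $W = (W^{1/2})'W^{1/2}$. The particular matrices are arbitrary illustrative values with q = 3 and p = 1.

```python
import numpy as np

q, p = 3, 1

# Arbitrary illustrative inputs: G0 has full column rank, W and S are positive definite.
G0 = np.array([[1.0], [2.0], [0.5]])
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])
S = np.array([[1.0, 0.3, 0.1],
              [0.3, 2.0, 0.4],
              [0.1, 0.4, 1.5]])

# Square root with W = W_half' W_half (transpose of the lower Cholesky factor).
W_half = np.linalg.cholesky(W).T

F = W_half @ G0                              # assumed definition F = W^{1/2} G0
P = F @ np.linalg.solve(F.T @ F, F.T)        # projection onto the column space of F
N = (np.eye(q) - P) @ W_half
V = N @ S @ N.T                              # covariance matrix in Theorem 3

rank_V = np.linalg.matrix_rank(V, tol=1e-10)
print(rank_V, q - p)                         # both equal the number of overidentifying restrictions
```

Because $I_q - P(\theta_0)$ is a projection matrix that annihilates the p columns of $F(\theta_0)$, it has rank q - p, and pre- and post-multiplying by nonsingular matrices cannot change that.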