Asymptotic OLS Inference
In practice, we don’t usually know what the CEF or the population regression vector is. We therefore draw statistical inferences about these quantities using samples. Statistical inference is what much of traditional econometrics is about. Although this material is covered in any Econometrics text, we don’t want to skip the inference step completely. A review of basic asymptotic theory allows us to highlight the important fact that the process of statistical inference is entirely distinct from the question of how a particular set of regression
estimates should be interpreted. Whatever a regression coefficient may mean, it has a sampling distribution that is easy to describe and use for statistical inference.[9]
We are interested in the distribution of the sample analog of
$$\beta = E[X_iX_i']^{-1}E[X_iY_i]$$

in a sample of size N. A natural estimator of the first population moment, $E[W_i]$, is the sample average, $\frac{1}{N}\sum_{i=1}^{N}W_i$. By the law of large numbers, this sample moment gets arbitrarily close to the corresponding population moment as the sample size grows. We might similarly consider higher-order moments of the elements of $W_i$, e.g., the matrix of second moments, $E[W_iW_i']$, with sample analog $\frac{1}{N}\sum_{i=1}^{N}W_iW_i'$. Following this principle, the method of moments estimator of $\beta$ replaces each expectation by a sum. This logic leads to the Ordinary Least Squares (OLS) estimator

$$\hat\beta = \Big[\sum_i X_iX_i'\Big]^{-1}\sum_i X_iY_i.$$
Although we derived $\hat\beta$ as a method of moments estimator, it is called the OLS estimator of $\beta$ because it solves the sample analog of the least-squares problem described at the beginning of Section 3.1.2.[10]
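To make the method-of-moments logic concrete, here is a minimal sketch in Python with numpy (our own illustration, using simulated data and made-up variable names, not from the original text): it forms the sample analogs of $E[X_iX_i']$ and $E[X_iY_i]$ and checks that the resulting $\hat\beta$ matches the usual least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Simulated data: a constant plus one regressor (think years of schooling)
schooling = rng.integers(8, 21, size=N).astype(float)
X = np.column_stack([np.ones(N), schooling])
y = 2.0 + 0.1 * schooling + rng.normal(scale=0.5, size=N)

# Method-of-moments estimator: replace E[X_i X_i'] and E[X_i Y_i] by sample averages
Sxx = X.T @ X / N              # sample analog of E[X_i X_i']
Sxy = X.T @ y / N              # sample analog of E[X_i Y_i]
beta_hat = np.linalg.solve(Sxx, Sxy)

# The same numbers solve the sample least-squares problem
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_ls)
```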
A - Individual-level data

. regress earnings school, robust

      Source |       SS           df       MS
-------------+----------------------------------
       Model |  22631.4793         1  22631.4793
    Residual |   188648.31    409433  .460755019
-------------+----------------------------------
       Total |  211279.789    409434   .51602893

B - Means by years of schooling

. regress average_earnings school [aweight=count], robust
(sum of wgt is 4.0944e+05)
The asymptotic sampling distribution of $\hat\beta$ depends solely on the definition of the estimand (i.e., the nature of the thing we're trying to estimate, $\beta$) and the assumption that the data constitute a random sample. Before deriving this distribution, it helps to record the general asymptotic distribution theory that covers our needs. This basic theory can be stated mostly in words. For the purposes of these statements, we assume the reader is familiar with the core terms and concepts of statistical theory (e.g., moments, mathematical expectation, probability limits, and asymptotic distributions). For definitions of these terms and a formal mathematical statement of the theoretical propositions given below, see, e.g., Knight (2000).
THE LAW OF LARGE NUMBERS Sample moments converge in probability to the corresponding population moments. In other words, the probability that the sample mean is close to the population mean can be made as high as you like by taking a large enough sample.
THE CENTRAL LIMIT THEOREM Sample moments are asymptotically Normally distributed (after subtracting the corresponding population moment and multiplying by the square root of the sample size). The covariance matrix is given by the variance of the underlying random variable. In other words, in large enough samples, appropriately normalized sample moments are approximately Normally distributed.
SLUTSKY’S THEOREM
(a) Consider the sum of two random variables, one of which converges in distribution and the other converges in probability to a constant: the asymptotic distribution of this sum is unaffected by replacing the one that converges to a constant by this constant. Formally, let $a_N$ be a statistic with a limiting distribution and let $b_N$ be a statistic with probability limit $b$. Then $a_N + b_N$ and $a_N + b$ have the same limiting distribution.
(b) Consider the product of two random variables, one of which converges in distribution and the other converges in probability to a constant: the asymptotic distribution of this product is unaffected by replacing the one that converges to a constant by this constant. This allows us to replace some sample moments by population moments (i.e., by their probability limits) when deriving distributions. Formally, let $a_N$ be a statistic with a limiting distribution and let $b_N$ be a statistic with probability limit $b$. Then $a_Nb_N$ and $a_Nb$ have the same asymptotic distribution.
THE CONTINUOUS MAPPING THEOREM Probability limits pass through continuous functions. For example, the probability limit of any continuous function of a sample moment is the function evaluated at the corresponding population moment. Formally, the probability limit of $h(b_N)$ is $h(b)$, where plim $b_N = b$ and $h(\cdot)$ is continuous at $b$.
THE DELTA METHOD Consider a vector-valued random variable that is asymptotically Normally distributed. Most scalar functions of this random variable are also asymptotically Normally distributed, with covariance matrix given by a quadratic form with the covariance matrix of the random variable on the inside and the gradient of the function evaluated at the probability limit of the random variable on the outside. Formally, the asymptotic distribution of $h(b_N)$ is Normal with covariance matrix $\nabla h(b)'\Omega\nabla h(b)$, where plim $b_N = b$, $h(\cdot)$ is continuously differentiable at $b$ with gradient $\nabla h(b)$, and $b_N$ has asymptotic covariance matrix $\Omega$.[11]
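As a quick check on the last of these results (a sketch of our own, not from the text), the simulation below applies the delta method to $h(\bar{x}) = \bar{x}^2$: the simulated standard deviation of $\sqrt{N}(\bar{x}^2 - \mu^2)$ should be close to the delta-method prediction $|h'(\mu)|\sigma = 2|\mu|\sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, reps = 2.0, 1.5, 1_000, 20_000

# Monte Carlo distribution of sqrt(N) * (h(sample mean) - h(mu)) for h(x) = x^2
draws = rng.normal(mu, sigma, size=(reps, N))
xbar = draws.mean(axis=1)
stat = np.sqrt(N) * (xbar**2 - mu**2)

print("simulated sd:   ", stat.std())
print("delta-method sd:", 2 * abs(mu) * sigma)   # |h'(mu)| * sigma
```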
We can use these results to derive the asymptotic distribution of $\hat\beta$ in two ways. A conceptually straightforward but somewhat inelegant approach is to use the delta method: $\hat\beta$ is a function of sample moments, and is therefore asymptotically Normally distributed. It remains only to find the covariance matrix of the asymptotic distribution from the gradient of this function. (Note that consistency of $\hat\beta$ comes immediately from the continuous mapping theorem.) An easier and more instructive derivation uses the Slutsky and central limit theorems. Note first that we can write

$$Y_i = X_i'\beta + e_i, \qquad (3.1.6)$$
where the residual $e_i$ is defined as the difference between the dependent variable and the population regression function, as before. This is as good a place as any to point out that these residuals are uncorrelated with the regressors by definition of $\beta$. In other words, $E[X_ie_i] = 0$ is a consequence of $\beta = E[X_iX_i']^{-1}E[X_iY_i]$ and $e_i = Y_i - X_i'\beta$, and not an assumption about an underlying economic relation. We return to this important point in the discussion of causal regression models in Section 3.2.[12]
Substituting the identity (3.1.6) for $Y_i$ in the formula for $\hat\beta$, we have[13]

$$\hat\beta = \beta + \Big[\sum_i X_iX_i'\Big]^{-1}\sum_i X_ie_i.$$
The asymptotic distribution of $\hat\beta$ is the asymptotic distribution of $\sqrt{N}(\hat\beta - \beta) = \big[\frac{1}{N}\sum_i X_iX_i'\big]^{-1}\frac{1}{\sqrt{N}}\sum_i X_ie_i$. By the Slutsky theorem, this has the same asymptotic distribution as $E[X_iX_i']^{-1}\frac{1}{\sqrt{N}}\sum_i X_ie_i$. Since $E[X_ie_i] = 0$, $\frac{1}{\sqrt{N}}\sum_i X_ie_i$ is a root-N-normalized and centered sample moment. By the central limit theorem, this is asymptotically Normally distributed with mean zero and covariance matrix $E[X_iX_i'e_i^2]$, since this fourth moment is the covariance matrix of $X_ie_i$. Therefore, $\hat\beta$ has an asymptotic Normal distribution, with probability limit $\beta$, and covariance matrix

$$E[X_iX_i']^{-1}E[X_iX_i'e_i^2]E[X_iX_i']^{-1}. \qquad (3.1.7)$$
The standard errors used to construct t-statistics are the square roots of the diagonal elements of this
matrix. In practice these standard errors are estimated by substituting sums for expectations, and using the estimated residuals, $\hat{e}_i = Y_i - X_i'\hat\beta$, to form the empirical fourth moment, $\sum_i[X_iX_i'\hat{e}_i^2]/N$.
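Here is a hedged sketch of this recipe in Python with numpy (our own illustration on simulated, deliberately heteroskedastic data): it substitutes sums for the expectations in (3.1.7) and uses estimated residuals, which is the robust (HC0) calculation without the small-sample corrections packaged software may add.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5_000

# Simulated data with residual variance that grows with the regressor
x = rng.uniform(0, 4, size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 + 0.4 * x)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
e_hat = y - X @ beta_hat                       # estimated residuals

# Sandwich formula (3.1.7) with sums in place of expectations
bread = np.linalg.inv(XtX)                     # [sum X_i X_i']^{-1}
meat = X.T @ (X * (e_hat ** 2)[:, None])       # sum X_i X_i' e_i^2
V_robust = bread @ meat @ bread
print("robust SEs:", np.sqrt(np.diag(V_robust)))
```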
Asymptotic standard errors computed in this way are known as heteroskedasticity-consistent standard errors, White (1980a) standard errors, or Eicker-White standard errors in recognition of Eicker's (1967) derivation. They are also known as "robust" standard errors (e.g., in Stata). These standard errors are said to be robust because, in large enough samples, they provide accurate hypothesis tests and confidence intervals given minimal assumptions about the data and model. In particular, our derivation of the limiting distribution makes no assumptions other than those needed to ensure that basic statistical results like the central limit theorem go through. These are not, however, the standard errors that you get by default from packaged software. Default standard errors are derived under a homoskedasticity assumption, specifically, that $E[e_i^2|X_i] = \sigma^2$, a constant. Given this assumption, we have
$$E[X_iX_i'e_i^2] = E\big(X_iX_i'E[e_i^2|X_i]\big) = \sigma^2E[X_iX_i'],$$
by iterating expectations. The asymptotic covariance matrix of $\hat\beta$ then simplifies to

$$E[X_iX_i']^{-1}E[X_iX_i'e_i^2]E[X_iX_i']^{-1} = \sigma^2E[X_iX_i']^{-1}. \qquad (3.1.8)$$
The diagonal elements of (3.1.8) are what SAS or Stata report unless you request otherwise.
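To see the practical difference, here is a short sketch (our own, assuming the statsmodels package is available) that fits the same kind of heteroskedastic simulated data and compares the default standard errors, which use (3.1.8), with the robust ones, which use (3.1.7):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
N = 5_000
x = rng.uniform(0, 4, size=N)
X = sm.add_constant(x)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 + 0.4 * x)   # heteroskedastic errors

fit = sm.OLS(y, X).fit()
print("default SEs:", fit.bse)                                 # homoskedastic formula (3.1.8)
print("robust SEs: ", fit.get_robustcov_results("HC0").bse)    # sandwich formula (3.1.7)
```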
Our view of regression as an approximation to the CEF makes heteroskedasticity seem natural. If the CEF is nonlinear and you use a linear model to approximate it, then the quality of fit between the regression line and the CEF will vary with $X_i$. Hence, the residuals will be larger, on average, at values of $X_i$ where the fit is poorer. Even if you are prepared to assume that the conditional variance of $Y_i$ given $X_i$ is constant, the fact that the CEF is nonlinear means that $E[(Y_i - X_i'\beta)^2|X_i]$ will vary with $X_i$. To see this, note that, as a rule,
$$E[(Y_i - X_i'\beta)^2|X_i] = E\big\{[(Y_i - E[Y_i|X_i]) + (E[Y_i|X_i] - X_i'\beta)]^2\,\big|\,X_i\big\} \qquad (3.1.9)$$
$$= V[Y_i|X_i] + (E[Y_i|X_i] - X_i'\beta)^2.$$
Therefore, even if $V[Y_i|X_i]$ is constant, the residual variance increases with the square of the gap between the regression line and the CEF, a fact noted in White (1980b).[14]
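A small simulation (our own illustration, not from the text) makes the point: the conditional variance of $Y_i$ is held fixed at one, the CEF is quadratic, and the squared residuals from the linear fit are nonetheless systematically larger where the linear approximation is worse.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000

# Nonlinear CEF with constant conditional variance: E[Y|X] = X^2, V[Y|X] = 1
x = rng.uniform(-2, 2, size=N)
y = x**2 + rng.normal(size=N)

# Linear approximation to the CEF
X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e_hat = y - X @ beta_hat

# Mean squared residual by region of x: smallest near where the fitted line crosses the CEF
for lo, hi in [(0.0, 0.5), (0.9, 1.3), (1.5, 2.0)]:
    band = (np.abs(x) >= lo) & (np.abs(x) < hi)
    print(f"|x| in [{lo}, {hi}): mean squared residual = {np.mean(e_hat[band]**2):.2f}")
```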
In the same spirit, it’s also worth noting that while a linear CEF makes homoskedasticity possible, this is
not a sufficient condition for homoskedasticity. Our favorite example in this context is the linear probability model (LPM). A linear probability model is any regression where the dependent variable is zero-one, i.e., a dummy variable such as an indicator for labor force participation. Suppose the regression model is saturated, so the CEF is linear. Because the CEF is linear, the residual variance is also the conditional variance, $V[Y_i|X_i]$. But the dependent variable is a Bernoulli trial, and the variance of a Bernoulli trial is $P[Y_i = 1|X_i](1 - P[Y_i = 1|X_i])$. We conclude that LPM residuals are necessarily heteroskedastic unless the only regressor is a constant.
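For a concrete version of the LPM point (our own sketch, with made-up probabilities): a single dummy regressor makes the model saturated, so the CEF is linear, yet the residual variance equals $P(1-P)$ within each group and so differs across groups whenever the two conditional probabilities differ.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000

# Saturated linear probability model: one dummy regressor, so the CEF is linear
d = rng.integers(0, 2, size=N)
p = np.where(d == 1, 0.8, 0.3)       # P[Y_i = 1 | D_i] differs across groups
y = rng.binomial(1, p)

X = np.column_stack([np.ones(N), d])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e_hat = y - X @ beta_hat

# Residual variance by group: roughly .3*.7 = .21 versus .8*.2 = .16
print(e_hat[d == 0].var(), e_hat[d == 1].var())
```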
These points of principle notwithstanding, as an empirical matter, heteroskedasticity may matter little. In the micro-data schooling regression depicted in Figure 3.1.3, the robust standard error is .0003447, while the old-fashioned standard error is .0003043, only slightly smaller. The standard errors from the grouped-data regression, which is necessarily heteroskedastic if group sizes differ, change somewhat more; compare the .004 robust standard error to the .0029 conventional standard error. Based on our experience, these differences are typical. If heteroskedasticity matters too much, say, more than a 30% increase or any marked decrease in standard errors, you should worry about possible programming errors or other problems (for example, robust standard errors below conventional ones may be a sign of finite-sample bias in the robust calculation; see Chapter 8, below).