Mostly Harmless Econometrics: An Empiricist’s Companion
Economic Relationships and the Conditional Expectation Function
Empirical economic research in our field of Labor Economics is typically concerned with the statistical analysis of individual economic circumstances, and especially differences between people that might account for differences in their economic fortunes. Such differences in economic fortune are notoriously hard to explain; they are, in a word, random. As applied econometricians, however, we believe we can summarize and interpret randomness in a useful way. An example of “systematic randomness” mentioned in the introduction is the connection between education and earnings. On average, people with more schooling earn more than people with less schooling. The connection between schooling and average earnings has considerable predictive power, in spite of the enormous variation in individual circumstances that sometimes clouds this fact. Of course, the fact that more educated people earn more than less educated people does not mean that schooling causes earnings to increase. The question of whether the earnings-schooling relationship is causal is of enormous importance, and we will come back to it many times. Even without resolving the difficult question of causality, however, it’s clear that education predicts earnings in a narrow statistical sense. This predictive power is compellingly summarized by the conditional expectation function (CEF).
The CEF for a dependent variable, $Y_i$, given a $K \times 1$ vector of covariates, $X_i$ (with elements $x_{ki}$), is the expectation, or population average, of $Y_i$ with $X_i$ held fixed. The population average can be thought of as the mean in an infinitely large sample, or the average in a completely enumerated finite population. The CEF is written $E[Y_i|X_i]$ and is a function of $X_i$. Because $X_i$ is random, the CEF is random, though sometimes we work with a particular value of the CEF, say $E[Y_i|X_i=42]$, assuming 42 is a possible value for $X_i$. In Chapter 2, we briefly considered the CEF $E[Y_i|D_i]$, where $D_i$ is a zero-one variable. This CEF takes on two values, $E[Y_i|D_i=1]$ and $E[Y_i|D_i=0]$. Although this special case is important, we are most often interested in CEFs that are functions of many variables, conveniently subsumed in the vector $X_i$. For a specific value of $X_i$, say $X_i = x$, we write $E[Y_i|X_i=x]$. For continuous $Y_i$ with conditional density $f_y(\cdot|X_i=x)$, the CEF is
$$E[Y_i|X_i=x] = \int t f_y(t|X_i=x)\,dt.$$
If $Y_i$ is discrete, $E[Y_i|X_i=x]$ equals the sum $\sum_t t f_y(t|X_i=x)$, with $f_y$ read as the conditional probability mass function.
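As a small illustration of the discrete formula, the sketch below evaluates the probability-weighted sum at a single value of the covariates. The probabilities and the helper name `discrete_cef` are invented for this example and do not come from any data set.

```python
def discrete_cef(cond_pmf):
    """E[Y | X = x] for discrete Y: the sum over t of t * f_y(t | X = x).

    cond_pmf maps each possible value t of Y to its conditional probability.
    """
    total_prob = sum(cond_pmf.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("conditional probabilities must sum to one")
    return sum(t * p for t, p in cond_pmf.items())


# Hypothetical conditional distribution of Y at one value of X (an assumption).
pmf_given_x = {0: 0.2, 1: 0.5, 2: 0.3}
cef_at_x = discrete_cef(pmf_given_x)
print(round(cef_at_x, 6))
```

By hand, the same sum is 0(0.2) + 1(0.5) + 2(0.3) = 1.1.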
Expectation is a population concept. In practice, data usually come in the form of samples and rarely consist of an entire population. We therefore use samples to make inferences about the population. For example, the sample CEF is used to learn about the population CEF. This is always necessary, but we postpone a discussion of the formal inference step taking us from sample to population until Section 3.1.3. Our “population first” approach to econometrics is motivated by the fact that we must define the objects of interest before we can use data to study them.
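To make the idea of a sample CEF concrete, here is a minimal sketch that computes it as a set of cell means when the covariate is discrete. The data are simulated under assumed parameters (an intercept of 5.0, a slope of 0.10 per year of schooling, normal noise); they are not the Census sample, and `sample_cef` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def sample_cef(x_values, y_values):
    """Return {x: average of y over observations with that x}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(x_values, y_values):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


# Simulated data under assumed parameters (illustration only).
rng = random.Random(0)
schooling = [rng.choice([8, 12, 16]) for _ in range(30000)]
log_wage = [5.0 + 0.10 * s + rng.gauss(0.0, 0.5) for s in schooling]

cef = sample_cef(schooling, log_wage)
for s in sorted(cef):
    print(s, round(cef[s], 3))
```

With many observations per cell, each cell mean should sit close to the population CEF used to simulate the data, which is the sense in which the sample CEF is used to learn about the population CEF.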
Figure 3.1.1 plots the CEF of log weekly wages given schooling for a sample of middle-aged white men from the 1980 Census. The distribution of earnings is also plotted for a few key values of schooling: 4, 8, 12, and 16 years. The CEF in the figure captures the fact that, the enormous variation in individual circumstances notwithstanding, people with more schooling generally earn more, on average. The average earnings gain associated with a year of schooling is typically about 10 percent.
Figure 3.1.1: Raw data and the CEF of average log weekly wages given schooling. The sample includes white men aged 40-49 in the 1980 IPUMS 5 percent file.
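An important complement to the CEF is the law of iterated expectations. This law says that an unconditional expectation can be written as the population average of the CEF. In other words,
$$E[Y_i] = E\{E[Y_i|X_i]\}, \tag{3.1.1}$$
where the outer expectation uses the distribution of $X_i$. Here is a proof of the law of iterated expectations for continuously distributed $(X_i, Y_i)$ with joint density $f_{xy}(u,t)$, where $f_y(t|X_i=x)$ is the conditional density of $Y_i$ given $X_i=x$, and $g_y(t)$ and $g_x(u)$ are the marginal densities:
$$\begin{aligned}
E\{E[Y_i|X_i]\} &= \int E[Y_i|X_i=u]\, g_x(u)\,du \\
&= \int \left[\int t f_y(t|X_i=u)\,dt\right] g_x(u)\,du \\
&= \int\!\!\int t f_y(t|X_i=u)\, g_x(u)\,du\,dt \\
&= \int t \left[\int f_{xy}(u,t)\,du\right] dt \\
&= \int t g_y(t)\,dt.
\end{aligned}$$
The integrals in this derivation run over the possible values of $X_i$ and $Y_i$ (indexed by $u$ and $t$). We’ve laid out these steps because the CEF and its properties are central to the rest of this chapter.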
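The law of iterated expectations also holds exactly in a sample with a discrete covariate, which gives a quick numerical check. The sketch below uses simulated data with made-up parameters: it averages the cell means using the cell shares as weights and compares the result with the overall mean.

```python
import random
from collections import defaultdict

# Simulated data under assumed parameters (illustration only).
rng = random.Random(1)
x = [rng.choice([0, 1, 2]) for _ in range(10000)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Group outcomes by covariate value.
cells = defaultdict(list)
for xi, yi in zip(x, y):
    cells[xi].append(yi)

n = len(y)
cef = {xi: sum(ys) / len(ys) for xi, ys in cells.items()}   # sample E[Y | X = x]
share = {xi: len(ys) / n for xi, ys in cells.items()}       # sample P(X = x)

overall_mean = sum(y) / n                                   # sample E[Y]
iterated_mean = sum(share[xi] * cef[xi] for xi in cef)      # sample E{E[Y | X]}

print(round(overall_mean, 6), round(iterated_mean, 6))
```

The two numbers agree up to floating-point rounding, because weighting each cell mean by its share simply reassembles the overall average.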
Theorem 3.1.1 The CEF-Decomposition Property
$$Y_i = E[Y_i|X_i] + \varepsilon_i,$$
where (i) $\varepsilon_i$ is mean-independent of $X_i$, i.e., $E[\varepsilon_i|X_i] = 0$, and, therefore, (ii) $\varepsilon_i$ is uncorrelated with any function of $X_i$.
Proof. (i) $E[\varepsilon_i|X_i] = E[Y_i - E[Y_i|X_i] \,|\, X_i] = E[Y_i|X_i] - E[Y_i|X_i] = 0$; (ii) This follows from (i): Let $h(X_i)$ be any function of $X_i$. By the law of iterated expectations, $E[h(X_i)\varepsilon_i] = E\{h(X_i)E[\varepsilon_i|X_i]\}$, and by mean-independence, $E[\varepsilon_i|X_i] = 0$. ■
This theorem says that any random variable, $Y_i$, can be decomposed into a piece that’s “explained by $X_i$”, i.e., the CEF, and a piece left over which is orthogonal to (i.e., uncorrelated with) any function of $X_i$.
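The decomposition property can be checked numerically. The sketch below simulates data with a deliberately nonlinear CEF (an assumption made for illustration), forms the residual as the outcome minus its cell mean, and computes the sample analogue of the expected product of the residual with several functions of the covariate.

```python
import random
from collections import defaultdict

# Simulated data with a nonlinear CEF (assumed parameters, illustration only).
rng = random.Random(2)
x = [rng.choice([0, 1, 2, 3]) for _ in range(8000)]
y = [(xi - 1.5) ** 2 + rng.gauss(0.0, 1.0) for xi in x]

cells = defaultdict(list)
for xi, yi in zip(x, y):
    cells[xi].append(yi)
cef = {xi: sum(ys) / len(ys) for xi, ys in cells.items()}

# CEF residual: the part of y left over after subtracting the cell mean.
resid = [yi - cef[xi] for xi, yi in zip(x, y)]


def cross_moment(h):
    """Sample analogue of E[h(X) * residual]."""
    return sum(h(xi) * ei for xi, ei in zip(x, resid)) / len(resid)


checks = {
    "h(x) = 1": cross_moment(lambda v: 1.0),
    "h(x) = x": cross_moment(lambda v: float(v)),
    "h(x) = x^2": cross_moment(lambda v: float(v * v)),
    "h(x) = 1{x = 3}": cross_moment(lambda v: 1.0 if v == 3 else 0.0),
}
for name, value in checks.items():
    print(name, f"{value:.2e}")
```

Every cross moment is zero up to rounding error, whatever function of the covariate is used, and nothing in the check relies on the CEF being linear.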
Theorem 3.1.2 The CEF-Prediction Property.
Let $m(X_i)$ be any function of $X_i$. The CEF solves
$$E[Y_i|X_i] = \arg\min_{m(X_i)} E\left[(Y_i - m(X_i))^2\right],$$
so it is the minimum mean squared error (MMSE) predictor of $Y_i$ given $X_i$.
Proof. Write
$$\begin{aligned}
(Y_i - m(X_i))^2 &= \left((Y_i - E[Y_i|X_i]) + (E[Y_i|X_i] - m(X_i))\right)^2 \\
&= (Y_i - E[Y_i|X_i])^2 + 2\,(E[Y_i|X_i] - m(X_i))\,(Y_i - E[Y_i|X_i]) \\
&\quad + (E[Y_i|X_i] - m(X_i))^2.
\end{aligned}$$
The first term doesn’t matter because it doesn’t involve $m(X_i)$. The second term can be written $h(X_i)\varepsilon_i$, where $h(X_i) = 2\,(E[Y_i|X_i] - m(X_i))$, and therefore has expectation zero by the CEF-decomposition property. The last term is minimized at zero when $m(X_i)$ is the CEF. ■
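A rough numerical counterpart to the prediction property: in the sketch below, the cell-mean predictor is compared with a few alternative functions of the covariate on simulated data. The data-generating process and the alternative predictors are arbitrary choices made for illustration.

```python
import random
from collections import defaultdict

# Simulated data with a nonlinear CEF (assumed parameters, illustration only).
rng = random.Random(3)
x = [rng.choice([0, 1, 2, 3]) for _ in range(8000)]
y = [(xi - 1.5) ** 2 + rng.gauss(0.0, 1.0) for xi in x]

cells = defaultdict(list)
for xi, yi in zip(x, y):
    cells[xi].append(yi)
cef = {xi: sum(ys) / len(ys) for xi, ys in cells.items()}


def mse(predict):
    """Sample mean squared error of the predictor m(x) = predict(x)."""
    return sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


y_bar = sum(y) / len(y)
mse_cef = mse(lambda v: cef[v])
mse_alternatives = {
    "constant (overall mean)": mse(lambda v: y_bar),
    "linear guess 0.5 + 0.5x": mse(lambda v: 0.5 + 0.5 * v),
    "simulation curve shifted by 0.1": mse(lambda v: (v - 1.5) ** 2 + 0.1),
}

print("CEF", round(mse_cef, 4))
for name, value in mse_alternatives.items():
    print(name, round(value, 4))
```

In the sample, no function of the covariate can beat the cell means on squared error, so each alternative has an MSE at least as large as the one for the CEF.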
A final property of the CEF, closely related to both the CEF decomposition and prediction properties, is the Analysis-of-Variance (ANOVA) Theorem:
Theorem 3.1.3 The ANOVA Theorem
$$V(Y_i) = V(E[Y_i|X_i]) + E[V(Y_i|X_i)],$$
where $V(\cdot)$ denotes variance and $V(Y_i|X_i)$ is the conditional variance of $Y_i$ given $X_i$.
Proof. The CEF-decomposition property implies the variance of $Y_i$ is the variance of the CEF plus the variance of the residual, $\varepsilon_i = Y_i - E[Y_i|X_i]$, since $\varepsilon_i$ and $E[Y_i|X_i]$ are uncorrelated. The variance of $\varepsilon_i$ is
$$E[\varepsilon_i^2] = E[E[\varepsilon_i^2|X_i]] = E[V[Y_i|X_i]],$$
where $E[\varepsilon_i^2|X_i] = V[Y_i|X_i]$ because $\varepsilon_i = Y_i - E[Y_i|X_i]$. ■
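The variance decomposition can be verified the same way. The sketch below, again on simulated data with assumed parameters, computes the total variance, the variance of the cell means, and the share-weighted average of within-cell variances, using population-style variances (dividing by the number of observations) throughout.

```python
import random
from collections import defaultdict
from statistics import pvariance

# Simulated data; the noise spread depends on x so that the conditional
# variance differs across cells (assumed parameters, illustration only).
rng = random.Random(4)
x = [rng.choice([0, 1, 2]) for _ in range(6000)]
y = [2.0 * xi + rng.gauss(0.0, 1.0 + 0.5 * xi) for xi in x]

cells = defaultdict(list)
for xi, yi in zip(x, y):
    cells[xi].append(yi)

n = len(y)
cef = {xi: sum(ys) / len(ys) for xi, ys in cells.items()}

total_var = pvariance(y)                                  # V(Y)
var_of_cef = pvariance([cef[xi] for xi in x])             # V(E[Y | X])
mean_cond_var = sum(len(ys) / n * pvariance(ys)           # E[V(Y | X)]
                    for ys in cells.values())

print(round(total_var, 6), round(var_of_cef + mean_cond_var, 6))
```

The two printed numbers match up to rounding, and the identity holds even though the conditional variance is not constant across cells.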
The two CEF properties and the ANOVA theorem may have a familiar ring. You might be used to seeing an ANOVA table in your regression output, for example. ANOVA is also important in research on inequality, where labor economists decompose changes in the income distribution into parts that can be accounted for by changes in worker characteristics and changes in what’s left over after accounting for these factors (see, e.g., Autor, Katz, and Kearney, 2005). What may be unfamiliar is the fact that the CEF properties and the ANOVA variance decomposition work in the population as well as in samples, and do not turn on the assumption of a linear CEF. In fact, the validity of linear regression as an empirical tool does not turn on linearity either.