Mostly Harmless Econometrics: An Empiricist’s Companion
Two-Stage Least Squares
The reduced-form equation, (4.1.4b), can be derived by substituting the first stage equation, (4.1.4a), into the causal relation of interest, (4.1.6), which is also called a “structural equation” in simultaneous equations language. We then have:
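$$
\begin{aligned}
Y_i &= \alpha' X_i + \rho[X_i'\pi_{10} + \pi_{11} z_i + \xi_{1i}] + \eta_i \\
    &= X_i'[\alpha + \rho\pi_{10}] + \rho\pi_{11} z_i + [\rho\xi_{1i} + \eta_i] \\
    &= X_i'\pi_{20} + \pi_{21} z_i + \xi_{2i}, \qquad (4.1.7)
\end{aligned}
$$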
where $\pi_{20} = \alpha + \rho\pi_{10}$, $\pi_{21} = \rho\pi_{11}$, and $\xi_{2i} = \rho\xi_{1i} + \eta_i$ in equation (4.1.4b). Equation (4.1.7) again shows why $\rho = \pi_{21}/\pi_{11}$. Note also that a slight re-arrangement of (4.1.7) gives
$$
Y_i = \alpha' X_i + \rho[X_i'\pi_{10} + \pi_{11} z_i] + \xi_{2i}, \qquad (4.1.8)
$$
where $[X_i'\pi_{10} + \pi_{11} z_i]$ is the population fitted value from the first-stage regression of $s_i$ on $X_i$ and $z_i$. Because $z_i$ and $X_i$ are uncorrelated with the reduced-form error, $\xi_{2i}$, the coefficient on $[X_i'\pi_{10} + \pi_{11} z_i]$ in the population regression of $Y_i$ on $X_i$ and $[X_i'\pi_{10} + \pi_{11} z_i]$ equals $\rho$.
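The algebra above can be checked numerically with sample analogs. The following is a minimal sketch on simulated data, not the Angrist and Krueger sample: the data-generating process, the value of `rho`, and the helper `ols` are all assumptions made for illustration. It estimates the first stage and the reduced form by OLS, forms the ratio of the two instrument coefficients, and compares it with the coefficient on the first-stage fitted value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
rho = 0.1  # assumed "true" causal effect, chosen only for illustration

x = rng.normal(size=n)                          # one covariate
z = rng.integers(0, 2, size=n).astype(float)    # binary instrument
ability = rng.normal(size=n)                    # unobserved confounder
s = 1.0 + 0.5 * x + 0.8 * z + ability + rng.normal(size=n)
y = 2.0 + 0.3 * x + rho * s + ability + rng.normal(size=n)


def ols(dep, *regressors):
    """OLS coefficients of dep on a constant and the given regressors."""
    design = np.column_stack([np.ones(len(dep)), *regressors])
    return np.linalg.lstsq(design, dep, rcond=None)[0]


pi1 = ols(s, x, z)       # first stage, coefficients ordered [const, x, z]
pi2 = ols(y, x, z)       # reduced form, coefficients ordered [const, x, z]
ils = pi2[2] / pi1[2]    # sample analog of pi_21 / pi_11

# Regress Y on X and the first-stage fitted value, the sample analog of (4.1.8)
s_fit = pi1[0] + pi1[1] * x + pi1[2] * z
coef_on_fit = ols(y, x, s_fit)[2]

print(f"reduced form / first stage:  {ils:.4f}")
print(f"coefficient on fitted value: {coef_on_fit:.4f}")
```

With a single instrument the two numbers agree up to floating-point error, because the fitted value and the instrument span the same space once the covariates are included.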
In practice, of course, we almost always work with data from samples. Given a random sample, the first-stage fitted values in the population are consistently estimated by
$$
\hat{s}_i = X_i'\hat{\pi}_{10} + \hat{\pi}_{11} z_i,
$$
where $\hat{\pi}_{10}$ and $\hat{\pi}_{11}$ are OLS estimates from equation (4.1.4a). The coefficient on $\hat{s}_i$ in the regression of $Y_i$ on $X_i$ and $\hat{s}_i$ is called the Two-Stage Least Squares (2SLS) estimator of $\rho$. In other words, 2SLS estimates can be constructed by OLS estimation of the “second-stage equation,”
$$
Y_i = \alpha' X_i + \rho\hat{s}_i + [\eta_i + \rho(s_i - \hat{s}_i)]. \qquad (4.1.9)
$$
This is called 2SLS because it can be done in two steps, the first estimating $\hat{s}_i$ using equation (4.1.4a), and the second estimating equation (4.1.9). The resulting estimator is consistent for $\rho$ because (a) first-stage estimates are consistent; and (b) the covariates, $X_i$, and instruments, $z_i$, are uncorrelated with both $\eta_i$ and $(s_i - \hat{s}_i)$.
The 2SLS name notwithstanding, we don’t usually construct 2SLS estimates in two steps. For one thing, the resulting standard errors are wrong, as we discuss later. Typically, we let specialized software routines (such as are available in SAS or Stata) do the calculation for us. This gets the standard errors right and helps to avoid other mistakes (see Section 4.6.1, below). Still, the fact that the 2SLS estimator can be computed by a sequence of OLS regressions is one way to remember why it works. Intuitively, conditional on covariates, 2SLS retains only the variation in $s_i$ that is generated by quasi-experimental variation, i.e., generated by the instrument, $z_i$.
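To make the two-step logic and the standard-error problem concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data-generating process, the variable names, and the use of conventional (homoskedastic) standard errors with an n - k degrees-of-freedom correction, which packaged routines may handle differently. The point estimate from the manual second stage matches the direct IV formula, but the residuals from that manual regression are built from the fitted values rather than from the endogenous variable itself, so the implied standard errors differ from the 2SLS ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)                          # covariate
z = rng.integers(0, 2, size=n).astype(float)    # instrument
u = rng.normal(size=n)                          # unobserved confounder
s = 0.5 * x + 0.8 * z + u + rng.normal(size=n)
y = 0.3 * x + 0.1 * s + u + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([const, x, z])   # covariates plus instrument
W = np.column_stack([const, x, s])   # covariates plus endogenous regressor

# Step 1: first-stage fitted values of s
s_hat = Z @ np.linalg.lstsq(Z, s, rcond=None)[0]
W_hat = np.column_stack([const, x, s_hat])

# Step 2: OLS of y on covariates and fitted values
beta = np.linalg.lstsq(W_hat, y, rcond=None)[0]

# Direct just-identified IV formula, for comparison
beta_iv = np.linalg.solve(Z.T @ W, Z.T @ y)

# The manual second stage uses residuals built from s_hat;
# the 2SLS residuals are built from the actual s.
resid_naive = y - W_hat @ beta
resid_2sls = y - W @ beta

k = W.shape[1]
bread = np.linalg.inv(W_hat.T @ W_hat)
se_naive = np.sqrt(resid_naive @ resid_naive / (n - k) * np.diag(bread))
se_2sls = np.sqrt(resid_2sls @ resid_2sls / (n - k) * np.diag(bread))

print("coefficient on s:", beta[2])
print("naive second-stage SE:", se_naive[2])
print("2SLS SE:", se_2sls[2])
```

The only difference between the two standard errors is which residual is used to estimate the error variance, which is why a routine that knows it is doing 2SLS can get them right while a hand-rolled second stage cannot.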
2SLS is a many-splendored thing. For one, it is an instrumental variables estimator: the 2SLS estimate of $\rho$ in (4.1.9) is the sample analog of $\mathrm{Cov}(Y_i, \hat{s}_i^*)/\mathrm{Cov}(s_i, \hat{s}_i^*)$, where $\hat{s}_i^*$ is the residual from a regression of $\hat{s}_i$ on $X_i$. This follows from the multivariate regression anatomy formula and the fact that $\mathrm{Cov}(s_i, \hat{s}_i^*) = V(\hat{s}_i^*)$. It is also easy to show that, in a model with a single endogenous variable and a single instrument, the 2SLS estimator is the same as the corresponding ILS estimator.[41]
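The IV representation can be verified directly. The sketch below uses simulated data and hypothetical names (`fitted`, `s_star`); it residualizes the first-stage fitted values on the covariates, checks that the covariance of schooling with this residual equals the residual's variance, and compares the resulting IV ratio with the second-stage coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)                          # covariate
z = rng.integers(0, 2, size=n).astype(float)    # instrument
u = rng.normal(size=n)                          # unobserved confounder
s = 0.5 * x + 0.8 * z + u + rng.normal(size=n)
y = 0.3 * x + 0.1 * s + u + rng.normal(size=n)


def fitted(dep, design):
    """Fitted values from an OLS regression of dep on the columns of design."""
    return design @ np.linalg.lstsq(design, dep, rcond=None)[0]


X = np.column_stack([np.ones(n), x])            # covariates incl. constant
s_hat = fitted(s, np.column_stack([X, z]))      # first-stage fitted values
s_star = s_hat - fitted(s_hat, X)               # residual of s_hat on X

cov_s_sstar = np.cov(s, s_star)[0, 1]
var_sstar = np.var(s_star, ddof=1)
rho_iv = np.cov(y, s_star)[0, 1] / cov_s_sstar

# Second-stage coefficient on s_hat, controlling for X
rho_2sls = np.linalg.lstsq(np.column_stack([X, s_hat]), y, rcond=None)[0][-1]

print("Cov(s, s*) =", cov_s_sstar, " V(s*) =", var_sstar)
print("IV ratio =", rho_iv, " 2SLS =", rho_2sls)
```

Both equalities hold exactly in the sample, not just in large samples, since they are consequences of the OLS orthogonality conditions.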
The link between 2SLS and IV warrants a bit more elaboration in the multi-instrument case. Assuming each instrument captures the same causal effect (a strong assumption that is relaxed below), we might want to combine these alternative IV estimates into a single more precise estimate. In models with multiple instruments, 2SLS provides just such a linear combination by combining multiple instruments into a single instrument. Suppose, for example, we have three instrumental variables, $z_{1i}$, $z_{2i}$, and $z_{3i}$. In the Angrist and Krueger (1991) application, these are dummies for first, second, and third-quarter births. The first-stage equation then becomes
$$
s_i = X_i'\pi_{10} + \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} + \xi_{1i}, \qquad (4.1.10a)
$$
while the 2SLS second stage is the same as (4.1.9), except that the fitted values are from (4.1.10a) instead of (4.1.4a). The IV interpretation of this 2SLS estimator is the same as before: the instrument is the residual from a regression of first-stage fitted values on covariates. The exclusion restriction in this case is the claim that all of the quarter of birth dummies in (4.1.10a) are uncorrelated with $\eta_i$ in equation (4.1.6).
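A small simulation can illustrate how 2SLS folds several instruments into one. This is a sketch under stated assumptions: the data are simulated, there are no covariates other than a constant, and the first-stage coefficients are hypothetical and much larger than anything in the quarter-of-birth data. With a constant and three quarter-of-birth dummies the first stage is saturated, so the fitted values are simply mean schooling by quarter of birth, and that single variable serves as the instrument.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8_000
qob = rng.integers(1, 5, size=n)                 # quarter of birth, 1..4
u = rng.normal(size=n)                           # unobserved confounder
# Hypothetical first stage: schooling rises with quarter of birth
s = 12.0 + 0.5 * qob + u + rng.normal(size=n)
y = 5.0 + 0.1 * s + 0.5 * u + rng.normal(size=n)

const = np.ones(n)
z = np.column_stack([(qob == q) for q in (1, 2, 3)]).astype(float)
Z = np.column_stack([const, z])

# First stage with three dummies: fitted values are cell means of s
s_hat = Z @ np.linalg.lstsq(Z, s, rcond=None)[0]
cell_means = np.array([s[qob == q].mean() for q in (1, 2, 3, 4)])

# 2SLS two ways: second-stage OLS, and IV with s_hat as the single instrument
rho_2sls = np.linalg.lstsq(np.column_stack([const, s_hat]), y, rcond=None)[0][1]
rho_iv = np.cov(y, s_hat)[0, 1] / np.cov(s, s_hat)[0, 1]

# Just-identified IV estimates using one dummy at a time
rho_single = [
    np.cov(y, z[:, j])[0, 1] / np.cov(s, z[:, j])[0, 1] for j in range(3)
]

print("single-instrument IV estimates:", np.round(rho_single, 3))
print("2SLS with all three dummies:   ", round(rho_2sls, 3))
```

The single-instrument estimates each use one contrast between quarters, while the 2SLS estimate uses all of the between-quarter variation in schooling at once.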
The results of 2SLS estimation of a schooling equation using three quarter-of-birth dummies, as well as other interactions, are shown in Table 4.1.1, which reports OLS and 2SLS estimates of models similar to those estimated by Angrist and Krueger (1991). Each column in the table contains OLS and 2SLS estimates of $\rho$ from an equation like (4.1.6), estimated with different combinations of instruments and control variables. The OLS estimate in column 1 is from a regression of log wages on schooling with no control variables, while the OLS estimate in column 2 is from a model adding dummies for year of birth and state of birth as control variables. In both cases, the estimated return to schooling is around .075.
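[41] With a single instrument, the 2SLS estimator is the sample analog of $\hat{\pi}_{11}^{-1}\,\mathrm{Cov}(Y_i, \tilde{z}_i)/V(\tilde{z}_i)$, where $\tilde{z}_i$ is the residual from a regression of $z_i$ on $X_i$. But the sample analog of the numerator, $\mathrm{Cov}(Y_i, \tilde{z}_i)/V(\tilde{z}_i)$, is the OLS estimate of $\pi_{21}$ in the reduced form, (4.1.4b), while $\hat{\pi}_{11}$ is the OLS estimate of the first-stage effect, $\pi_{11}$, in (4.1.4a). Hence, 2SLS with a single instrument is ILS, i.e., the ratio of the reduced-form effect of the instrument to the corresponding first-stage effect, where both the first stage and the reduced form include covariates.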
Table 4.1.1: 2SLS estimates of the economic returns to schooling
| | OLS (1) | OLS (2) | 2SLS (3) | 2SLS (4) | 2SLS (5) | 2SLS (6) | 2SLS (7) | 2SLS (8) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Years of education | 0.075 | 0.072 | 0.103 | 0.112 | 0.106 | 0.108 | 0.089 | 0.061 |
| | (0.0004) | (0.0004) | (0.024) | (0.021) | (0.026) | (0.019) | (0.016) | (0.031) |
| Covariates: | | | | | | | | |
| Age (in quarters) | | | | | | | | ✓ |
| Age (in quarters) squared | | | | | | | | ✓ |
| 9 year of birth dummies | | ✓ | | | ✓ | ✓ | ✓ | ✓ |
| 50 state of birth dummies | | ✓ | | | ✓ | ✓ | ✓ | ✓ |
| Instruments: | | | dummy for QOB=1 | dummy for QOB=1 or QOB=2 | dummy for QOB=1 | full set of QOB dummies | full set of QOB dummies int. with year of birth dummies | full set of QOB dummies int. with year of birth dummies |
Notes: The table reports OLS and 2SLS estimates of the returns to schooling using the Angrist and Krueger (1991) 1980 Census sample. This sample includes native-born men, born 1930-1939, with positive earnings and non-allocated values for key variables. The sample size is 329,509. Robust standard errors are reported in parentheses.
The first pair of IV estimates, reported in columns 3 and 4, are from models without controls. The instrument used to construct the estimate in column 3 is a single dummy for first-quarter births, while the instruments used to construct the estimate in column 4 are a pair of dummies indicating first- and second-quarter births. The resulting estimates range from .10 to .11. The results from models including year of birth and state of birth dummies as control variables are similar, not surprisingly, since quarter of birth is not closely related to either of these controls. Overall, the 2SLS estimates are mostly a bit larger than the corresponding OLS estimates. This suggests that the observed association between schooling and earnings is not driven by omitted variables like ability and family background.
Column 7 in Table 4.1.1 shows the results of adding interaction terms to the instrument list. In particular, each specification adds interactions of the three quarter-of-birth dummies with 9 dummies for year of birth (the sample includes cohorts born 1930-39), for a total of 30 excluded instruments (3 main effects plus 27 interactions). The first stage equation becomes
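$$
\begin{aligned}
s_i = {} & X_i'\pi_{10} + \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} \\
& + \sum_j (B_{ij} z_{1i})\kappa_{1j} + \sum_j (B_{ij} z_{2i})\kappa_{2j} + \sum_j (B_{ij} z_{3i})\kappa_{3j} + \xi_{1i}, \qquad (4.1.10b)
\end{aligned}
$$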
where $B_{ij}$ is a dummy equal to one if individual $i$ was born in year $j$, for $j$ equal to 1931-39. The coefficients $\kappa_{1j}$, $\kappa_{2j}$, $\kappa_{3j}$ are the corresponding year-of-birth interactions. These interaction terms capture differences in the relation between quarter-of-birth and schooling across cohorts. The rationale for adding these interaction terms is an increase in precision that comes from increasing the first-stage $R^2$, which goes up because the quarter of birth pattern in schooling differs across cohorts. In this example, the addition of interaction terms to the instrument list leads to a modest gain in precision; the standard error declines from .0194 to .0161.[42]
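As a sketch of how such an instrument list might be assembled, the snippet below builds the 3 quarter-of-birth dummies and their 27 interactions with year-of-birth dummies for simulated birth dates. The array names and the simulated sample are assumptions for illustration; the construction simply mirrors the terms in (4.1.10b).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4_000
yob = rng.integers(1930, 1940, size=n)   # year of birth, 1930..1939
qob = rng.integers(1, 5, size=n)         # quarter of birth, 1..4

# z_1i, z_2i, z_3i: dummies for first, second, and third quarter births
qob_dummies = np.column_stack([(qob == q) for q in (1, 2, 3)]).astype(float)

# B_ij: dummies for birth years 1931..1939 (1930 is the omitted year)
yob_dummies = np.column_stack(
    [(yob == j) for j in range(1931, 1940)]
).astype(float)

# All products B_ij * z_qi, flattened to an n x 27 block
interactions = (yob_dummies[:, :, None] * qob_dummies[:, None, :]).reshape(n, -1)

instruments = np.column_stack([qob_dummies, interactions])   # n x 30
covariates = np.column_stack([np.ones(n), yob_dummies])      # n x 10

full_first_stage = np.column_stack([covariates, instruments])
print(instruments.shape)
print(np.linalg.matrix_rank(full_first_stage))
```

Together with a constant and the year-of-birth main effects, the 30 instruments give one free parameter per year-by-quarter cell, so this first stage fits mean schooling separately for each of the 40 birth cohorts defined by year and quarter.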
The last 2SLS model reported in Table 4.1.1 includes controls for linear and quadratic terms in age-in-quarters in the list of covariates, $X_i$. In other words, someone who was born in the first quarter of 1930 is recorded as being 50 years old on census day (April 1), 1980, while someone born in the fourth quarter is recorded as being 49.25 years old. This finely coded age variable, entered into the model with a linear and quadratic term, provides a partial control for the fact that small differences in age may be an omitted variable that confounds the quarter-of-birth identification strategy. As long as the effects of age are similarly smooth, the quadratic age-in-quarters model will pick them up.
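The age coding described here can be written as a one-line formula. The function below is a hypothetical helper, not code from the original study; it assumes that each quarter of birth after the first subtracts a quarter of a year from age on census day, which reproduces the two examples in the text.

```python
def age_in_quarters(year_of_birth, quarter_of_birth, census_year=1980):
    """Age on census day (April 1), in years, measured to the nearest quarter."""
    if quarter_of_birth not in (1, 2, 3, 4):
        raise ValueError("quarter_of_birth must be 1, 2, 3, or 4")
    return census_year - year_of_birth - (quarter_of_birth - 1) / 4


def age_controls(year_of_birth, quarter_of_birth):
    """Linear and quadratic age terms for the covariate list."""
    age = age_in_quarters(year_of_birth, quarter_of_birth)
    return age, age ** 2


print(age_in_quarters(1930, 1))   # 50.0
print(age_in_quarters(1930, 4))   # 49.25
```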
This variation in the 2SLS set-up illustrates the interplay between identification and estimation. For the 2SLS procedure to work, there must be some variation in the first-stage fitted values conditional on whatever control variables (covariates) are included in the model. If the first-stage fitted values are a linear combination of the included covariates, then the 2SLS estimate simply does not exist. In equation (4.1.9) this is manifest by perfect multicollinearity. 2SLS estimates with quadratic age exist. But the variability “left over” in the first-stage fitted values is reduced when the covariates include variables like age in quarters, which are closely related to the instruments (quarter of birth dummies). Because this variability is the primary determinant of 2SLS standard errors, the estimate in column 8 is markedly less precise than that in column 7, though it is still close to the corresponding OLS estimate.
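The existence condition can be illustrated by checking the rank of the second-stage design matrix. The sketch below uses simulated data, a hypothetical schooling process, and an assumed age-in-quarters coding. It compares two covariate sets: a quadratic in age, which leaves some variation in the fitted values, and a full set of dummies for every age-in-quarters value, which absorbs the quarter-of-birth instruments entirely.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4_000
yob = rng.integers(1930, 1940, size=n)
qob = rng.integers(1, 5, size=n)
age = 1980 - yob - (qob - 1) / 4            # age in quarters (assumed coding)
# Hypothetical schooling process with a jump for fourth-quarter births
s = 12.0 + 0.3 * (qob == 4) + rng.normal(size=n)

const = np.ones(n)
z = np.column_stack([(qob == q) for q in (1, 2, 3)]).astype(float)


def second_stage_design(covariates):
    """Covariates plus first-stage fitted values of s, as in (4.1.9)."""
    first_stage = np.column_stack([covariates, z])
    s_hat = first_stage @ np.linalg.lstsq(first_stage, s, rcond=None)[0]
    return np.column_stack([covariates, s_hat])


age_c = age - age.mean()                    # centered for numerical stability
quad = second_stage_design(np.column_stack([const, age_c, age_c ** 2]))

# One dummy per distinct age-in-quarters value (40 cells)
saturated_age = (age[:, None] == np.unique(age)[None, :]).astype(float)
sat = second_stage_design(saturated_age)

rank_quad = np.linalg.matrix_rank(quad)
rank_sat = np.linalg.matrix_rank(sat)
print("quadratic age:", quad.shape[1], "columns, rank", rank_quad)
print("saturated age:", sat.shape[1], "columns, rank", rank_sat)
```

In the saturated case the design matrix has one more column than its rank, so the second-stage coefficient on the fitted values is not determined; in the quadratic case the matrix has full column rank and the estimate exists, though with less variation to work with.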
Recap of IV and 2SLS Lingo
As we’ve seen, the endogenous variables are the dependent variable and the independent variable(s) to be instrumented; in a simultaneous equations model, endogenous variables are determined by solving a system of stochastic linear equations. To treat an independent variable as endogenous is to instrument it, i.e., to replace it with fitted values in the second stage of a 2SLS procedure. The independent endogenous variable in the Angrist and Krueger (1991) study is schooling. The exogenous variables include the exogenous covariates that are not instrumented and the instruments themselves. In a simultaneous equations model, exogenous variables are determined outside the system. The exogenous covariates in the Angrist and Krueger (1991) study are dummies for year of birth and state of birth. We think of exogenous covariates as controls. 2SLS aficionados live in a world of mutually exclusive labels: in any empirical study involving instrumental variables, the random variables to be studied are either dependent variables, independent endogenous variables, instrumental variables, or exogenous covariates. Sometimes we shorten this to: dependent and endogenous variables, instruments and covariates (fudging the fact that the dependent variable is also endogenous in a traditional SEM).