Mostly Harmless Econometrics: An Empiricist’s Companion
Saturated Models, Main Effects, and Other Regression Talk
We often discuss regression models using terms like saturated and main effects. These terms originate in an experimentalist tradition that uses regression to model discrete treatment-type variables. This language is now used more widely in many fields, however, including applied econometrics. For readers unfamiliar with these terms, this section provides a brief review.
Saturated regression models are regression models with discrete explanatory variables, where the model includes a separate parameter for each possible value taken on by the explanatory variables. For example, when working with a single explanatory variable indicating whether a worker is a college graduate, the model is saturated by including a single dummy for college graduates and a constant. We can also saturate when the regressor takes on many values. Suppose, for example, that $S_i = 0, 1, 2, \ldots, \tau$. A saturated regression model for $S_i$ is
$$Y_i = \beta_0 + \beta_1 d_{1i} + \beta_2 d_{2i} + \cdots + \beta_\tau d_{\tau i} + \varepsilon_i,$$
where $d_{ji} = 1[S_i = j]$ is a dummy variable indicating schooling level $j$, and $\beta_j$ is said to be the $j$th-level schooling effect. Note that
$$\beta_j = E[Y_i \mid S_i = j] - E[Y_i \mid S_i = 0],$$
while $\beta_0 = E[Y_i \mid S_i = 0]$. In practice, you can pick any value of $S_i$ for the reference group; a regression model is saturated as long as it has one parameter for every possible $j$ in $E[Y_i \mid S_i = j]$. Saturated models fit the CEF perfectly because the CEF is linear in the dummy regressors used to saturate. This is an important special case of the regression-CEF theorem.
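The claim that a saturated model reproduces the CEF can be checked numerically. The following is a minimal sketch, assuming simulated data: the helper name `saturated_fit`, the sample size, and the data-generating process are all made up for illustration and are not from the text. It regresses an outcome on a constant and one dummy per nonzero schooling level, then compares the coefficients with differences in cell means.

```python
import numpy as np

def saturated_fit(y, s):
    """OLS of y on a constant and one dummy per non-reference value of s."""
    levels = np.unique(s)  # sorted distinct values; the first is the reference group
    dummies = (s[:, None] == levels[None, 1:]).astype(float)
    X = np.column_stack([np.ones(len(s)), dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return levels, coef

# Simulated data (assumption): schooling takes values 0..tau and the CEF is nonlinear in s
rng = np.random.default_rng(0)
tau = 4
s = rng.integers(0, tau + 1, size=2000)
y = 1.0 + 0.3 * s**1.5 + rng.normal(size=s.size)

levels, coef = saturated_fit(y, s)
cell_means = np.array([y[s == j].mean() for j in levels])

# Intercept should equal the reference-group mean; each dummy coefficient
# should equal the gap between that level's mean and the reference-group mean.
intercept_matches = np.isclose(coef[0], cell_means[0])
effects_match = np.allclose(coef[1:], cell_means[1:] - cell_means[0])
print(intercept_matches, effects_match)
```

The CEF here is deliberately nonlinear in the schooling level, so the agreement comes from saturation rather than from the outcome being linear in schooling.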
If there are two explanatory variables, say one dummy indicating college graduates and one dummy indicating sex, the model is saturated by including these two dummies, their product, and a constant. The coefficients on the dummies are known as main effects, while the product is called an interaction term. This is not the only saturated parameterization; any set of indicators (dummies) that can be used to identify each value taken on by the covariates produces a saturated model. For example, an alternative saturated model includes dummies for male college graduates, male dropouts, female college graduates, and female dropouts, but no intercept.
Here’s some notation to make this more concrete. Let $x_{1i}$ indicate college graduates and $x_{2i}$ indicate women. The CEF given $x_{1i}$ and $x_{2i}$ takes on four values:
$$E[Y_i \mid x_{1i} = 0, x_{2i} = 0], \quad E[Y_i \mid x_{1i} = 1, x_{2i} = 0], \quad E[Y_i \mid x_{1i} = 0, x_{2i} = 1], \quad E[Y_i \mid x_{1i} = 1, x_{2i} = 1].$$
We can label these using the following scheme:
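$$\begin{aligned}
E[Y_i \mid x_{1i} = 0, x_{2i} = 0] &= \alpha \\
E[Y_i \mid x_{1i} = 1, x_{2i} = 0] &= \alpha + \beta \\
E[Y_i \mid x_{1i} = 0, x_{2i} = 1] &= \alpha + \gamma \\
E[Y_i \mid x_{1i} = 1, x_{2i} = 1] &= \alpha + \beta + \gamma + \delta
\end{aligned}$$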
Since there are four Greek letters and the CEF takes on four values, this parameterization does not restrict the CEF. It can be written in terms of Greek letters as
$$E[Y_i \mid x_{1i}, x_{2i}] = \alpha + \beta x_{1i} + \gamma x_{2i} + \delta (x_{1i} x_{2i}),$$
a parameterization with two main effects and one interaction term.[15] The saturated regression equation becomes
$$Y_i = \alpha + \beta x_{1i} + \gamma x_{2i} + \delta (x_{1i} x_{2i}) + \varepsilon_i.$$
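To see that the two parameterizations described above are the same model, here is a minimal sketch on simulated data. The variable names, the sample size, and the coefficient values in the data-generating process are illustrative assumptions. It fits the version with a constant, two main effects, and an interaction, then the version with four cell dummies and no intercept, and checks that both reproduce the four cell means.

```python
import numpy as np

# Simulated data (assumption): x1 = college graduate, x2 = woman
rng = np.random.default_rng(1)
n = 4000
x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)
y = 2.0 + 0.5 * x1 - 0.3 * x2 + 0.2 * x1 * x2 + rng.normal(size=n)

# Parameterization A: constant, two main effects, one interaction
XA = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coefA = np.linalg.lstsq(XA, y, rcond=None)[0]
a, b, g, d = coefA

# Parameterization B: one dummy per cell, no intercept
cells = [(0, 0), (1, 0), (0, 1), (1, 1)]
XB = np.column_stack([((x1 == u) & (x2 == v)).astype(float) for u, v in cells])
coefB = np.linalg.lstsq(XB, y, rcond=None)[0]

# Sample mean of y in each of the four cells
cell_means = np.array([y[(x1 == u) & (x2 == v)].mean() for u, v in cells])

# Cell means implied by parameterization A
implied = np.array([a, a + b, a + g, a + b + g + d])

# Both parameterizations should give identical fitted values
same_fit = np.allclose(XA @ coefA, XB @ coefB)
print(np.allclose(implied, cell_means), np.allclose(coefB, cell_means), same_fit)
```

The coefficients differ across the two versions, but the fitted values do not: each is a relabeling of the same four conditional means.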
Finally, we can combine the multi-valued schooling variable with sex to produce a saturated model that has $\tau$ main effects for schooling, one main effect for sex, and $\tau$ sex-schooling interactions:
$$Y_i = \beta_0 + \sum_{j=1}^{\tau} \beta_j d_{ji} + \gamma x_{2i} + \sum_{j=1}^{\tau} \delta_j (d_{ji} x_{2i}) + \varepsilon_i. \tag{3.1.10}$$
The interaction terms, $\delta_j$, tell us how each of the schooling effects differs by sex. The CEF in this case takes on $2(\tau + 1)$ values, and the regression has the same number of parameters.
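The parameter count can be verified by building the design matrix directly. This is a sketch under stated assumptions: `saturated_design` is a hypothetical helper, and the simulated data and coefficient values are invented for illustration. It constructs the regressors for the schooling-by-sex model, counts the columns, and confirms that the fitted value in every schooling-sex cell equals that cell's sample mean.

```python
import numpy as np

def saturated_design(s, x2, tau):
    """Columns: constant, tau schooling dummies, sex dummy, tau sex-schooling interactions."""
    d = (s[:, None] == np.arange(1, tau + 1)[None, :]).astype(float)  # d_ji = 1[S_i = j], j = 1..tau
    return np.column_stack([np.ones(len(s)), d, x2, d * x2[:, None]])

# Simulated data (assumption)
rng = np.random.default_rng(2)
tau, n = 3, 5000
s = rng.integers(0, tau + 1, size=n)
x2 = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 0.4 * s - 0.2 * x2 + 0.1 * s * x2 + rng.normal(size=n)

X = saturated_design(s, x2, tau)
n_params = X.shape[1]            # 1 + tau + 1 + tau
rank = np.linalg.matrix_rank(X)  # full column rank when every cell is non-empty

coef = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ coef

# Largest gap between a cell's fitted value and its sample mean
max_gap = max(
    abs(fitted[(s == j) & (x2 == v)][0] - y[(s == j) & (x2 == v)].mean())
    for j in range(tau + 1)
    for v in (0, 1)
)
print(n_params, 2 * (tau + 1), rank, max_gap < 1e-8)
```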
Note that there is a natural hierarchy of modeling strategies with saturated models at the top. It’s natural to start with a saturated model because this fits the CEF. On the other hand, saturated models generate a lot of interaction terms, many of which may be uninteresting or imprecise. You might therefore sensibly choose to omit some or all of these. Equation (3.1.10) without interaction terms approximates the CEF with a purely additive model for schooling and sex. This is a good approximation if the returns to college are similar for men and women. And, in any case, schooling coefficients in the additive specification give a (weighted) average return across both sexes, as discussed in Section 3.3.1, below. On the other hand, it would be strange to estimate a model which included interaction terms but omitted the corresponding main effects. In the case of schooling, this would be something like
$$Y_i = \beta_0 + \gamma x_{2i} + \sum_{j=1}^{\tau} \delta_j (d_{ji} x_{2i}) + \varepsilon_i. \tag{3.1.11}$$
This model allows schooling to shift wages only for women, something very far from the truth. Consequently, the results of estimating (3.1.11) are likely to be hard to interpret.
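A small simulation illustrates why a specification with interactions but no schooling main effects is hard to interpret. This is a sketch only: the data-generating process, in which schooling raises the outcome equally for both sexes, is an assumption chosen for illustration. The model's fitted values for men are identical at every schooling level, even though the men's sample means rise with schooling.

```python
import numpy as np

# Simulated data (assumption): schooling raises y for both men and women
rng = np.random.default_rng(3)
tau, n = 3, 6000
s = rng.integers(0, tau + 1, size=n)
x2 = rng.integers(0, 2, size=n).astype(float)  # 1 = woman
y = 1.0 + 0.5 * s - 0.2 * x2 + rng.normal(size=n)

# Schooling dummies d_ji = 1[S_i = j] for j = 1..tau
d = (s[:, None] == np.arange(1, tau + 1)[None, :]).astype(float)

# Interactions kept, schooling main effects dropped
X = np.column_stack([np.ones(n), x2, d * x2[:, None]])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ coef

men = x2 == 0
# Fitted value for men at each schooling level: the model forces these to be equal
men_fitted_by_s = np.array([fitted[men & (s == j)][0] for j in range(tau + 1)])
# Actual sample mean for men at each schooling level: these rise with schooling
men_mean_by_s = np.array([y[men & (s == j)].mean() for j in range(tau + 1)])

print(men_fitted_by_s)
print(men_mean_by_s)
```

In this sketch the intercept simply estimates the pooled mean for men, so the women's interaction coefficients are measured against a baseline that mixes all male schooling levels together.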