Mostly Harmless Econometrics: An Empiricist’s Companion
Limited Dependent Variables and Marginal Effects
Many empirical studies involve variables that take on only a limited number of values. An example is the Angrist and Evans (1998) investigation of the effect of childbearing on female labor supply, discussed in Section 3.4.2 in this chapter and in the chapter on instrumental variables, below. This study is concerned with the causal effects of childbearing on parents’ work and earnings. Because childbearing is likely to be correlated with potential earnings, the study reports instrumental variables estimates based on sibling-sex composition and multiple births, as well as OLS estimates. Almost every outcome in this study is either binary (like employment status) or non-negative (like hours worked, weeks worked, and earnings). Should the fact that a dependent variable is limited affect empirical practice? Many econometrics textbooks argue that, while OLS is fine for continuous dependent variables, when the outcome of interest is a limited dependent variable (LDV), linear regression models are inappropriate and nonlinear models such as Probit and Tobit are preferred. In contrast, our view of regression as inheriting its legitimacy from the CEF makes LDVness seem less central.
As always, a useful benchmark is a randomized experiment, where regression is simply a treatment-control difference. Consider regressions of various outcome variables on a randomly assigned regressor that indicates one of the treatment groups in the Rand Health Insurance Experiment (HIE; Manning, et al, 1987). In this ambitious experiment, probably the most expensive in American social science, the Rand Corporation set up a small health insurance company that charged no premium. Nearly 6,000 participants in the study were randomly assigned to health insurance plans with different features.
One of the most important features of any insurance plan is the portion of health care costs the insured individual is expected to pay. The HIE randomly assigned individuals to many different plans. One plan provided entirely free care, while the others included various combinations of co-payments, expenditure caps, and deductibles so that patients covered some of their health care costs out-of-pocket. The main purpose of the experiment was to learn whether the use of medical care is sensitive to cost and, if so, whether this affects health. The HIE results showed that those offered free or low-cost medical care used more of it, but they were not, for the most part, any healthier as a result. These findings helped pave the way for cost-sensitive health insurance plans and managed care.
Most of the outcomes in the HIE are LDVs. These include dummies indicating whether an experimental subject incurred any medical expenditures or was hospitalized in a given year and non-negative outcomes such as the number of face-to-face doctor visits and gross annual medical expenses (whether paid by patient or insurer). The expenditure variable is zero for about 20 percent of the sample. Results for two of the HIE treatment groups are reproduced in Table 3.4.1, derived from the estimates reported in Table 2 of Manning, et al. (1987). Table 3.4.1 shows average outcomes in the free care and individual deductible groups. The latter group faced a deductible of $150 per person or $450 per family per year for outpatient care, after which all costs were covered (There was no charge for inpatient care). The overall sample size in these two groups was a little over 3,000.
Table 3.4.1: Average outcomes in two of the HIE treatment groups
Notes: Adapted from Manning et al. (1987), Table 2. All standard errors (shown in parentheses) are corrected for intertemporal and intrafamily correlations. Amounts are in June 1984 dollars. Visits are face-to-face contacts with MD, DO, or other health providers; excludes visits only for radiology, anesthesiology or pathology services. Visits and expenses exclude dental care and outpatient psychotherapy.
To simplify the LDV discussion, suppose that the comparison between free care and deductible plans is the only comparison of interest and that treatment was determined by simple random assignment.[33] Let $D_i = 1$ denote assignment to the deductible group. By virtue of random assignment, the difference in means between those with $D_i = 1$ and $D_i = 0$ identifies the effect of treatment on the treated. As in our earlier discussion of experiments (Chapter 2):
$$
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] &= E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1] \\
&= E[Y_{1i} - Y_{0i}]
\end{aligned}
\tag{3.4.1}
$$
because $D_i$ is independent of potential outcomes. Also, as before, $E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$ is the slope coefficient in a regression of $Y_i$ on $D_i$.
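The claim that the treatment-control difference in means is the regression slope can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the seed, and the two success rates are illustrative assumptions, not HIE microdata.

```python
import random

random.seed(0)

# Hypothetical data: d is a randomly assigned dummy and y is a binary outcome.
# The two success rates below are illustrative assumptions.
d = [random.randint(0, 1) for _ in range(5000)]
y = [1 if random.random() < (0.72 if di else 0.87) else 0 for di in d]


def mean(values):
    return sum(values) / len(values)


# Treatment-control difference in means.
treated = [yi for yi, di in zip(y, d) if di == 1]
controls = [yi for yi, di in zip(y, d) if di == 0]
diff_in_means = mean(treated) - mean(controls)

# Bivariate OLS slope: Cov(y, d) / Var(d).
d_bar, y_bar = mean(d), mean(y)
cov_yd = mean([(yi - y_bar) * (di - d_bar) for yi, di in zip(y, d)])
var_d = mean([(di - d_bar) ** 2 for di in d])
ols_slope = cov_yd / var_d

# The two numbers agree up to floating-point error.
print(round(diff_in_means, 6), round(ols_slope, 6))
```

Nothing in this calculation depends on whether `y` is binary, non-negative, or continuous; swapping in a different outcome leaves the equality intact.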
Equation (3.4.1) suggests that the estimation of causal effects in experiments presents no special challenges whether $Y_i$ is binary, non-negative, or continuously distributed. The interpretation of the right-hand side changes for different sorts of dependent variables, but you do not need to do anything special to get the average causal effect. For example, one of the HIE outcomes is a dummy denoting any medical expenditure.
Since the outcome here is a Bernoulli trial, we have
$$E[Y_{1i} - Y_{0i}] = E[Y_{1i}] - E[Y_{0i}] = P[Y_{1i} = 1] - P[Y_{0i} = 1]. \tag{3.4.2}$$
This relation might affect the language we use to describe the results but not the underlying calculation. In the HIE, for example, comparisons across experimental groups, as on the left-hand side of (3.4.1), show that 87 percent of those assigned to the free-care group used at least some care in a given year, while only 72 percent of those assigned to the deductible plan used care. The relatively modest 150-dollar deductible therefore had a marked effect on use of care. The difference between these two rates, -.15 (s.e. = .017), is an estimate of $E[Y_{1i} - Y_{0i}]$, where $Y_i$ is a dummy indicating any medical expenditure. Because the outcome here is a dummy variable, the average causal effect is also a causal effect on usage rates or probabilities.
Recognizing that the outcome variable here is a probability, suppose instead that you use Probit to fit the CEF in this case. No harm in trying! The Probit model is usually motivated by the assumption that participation is determined by a latent variable, $Y_i^*$, that satisfies
$$Y_i^* = \beta_0^* + \beta_1^* D_i - \nu_i, \tag{3.4.3}$$
where $\nu_i$ is distributed $N(0, \sigma_\nu^2)$. Note that this latent variable cannot be actual medical expenditure since expenditure is non-negative and therefore non-Normal, while Normally distributed variables are continuously distributed on the real line and can therefore be negative. Given the latent index model,
$$Y_i = 1[Y_i^* > 0],$$
the CEF can be written
$$E[Y_i \mid D_i] = \Phi\left[\frac{\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right],$$
where $\Phi[\cdot]$ is the Normal CDF. Therefore
$$E[Y_i \mid D_i] = \Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right] + \left\{\Phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right] - \Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]\right\} D_i.$$
This is a linear function of the regressor, $D_i$, so the slope coefficient in the regression of $Y_i$ on $D_i$ is exactly the difference in Probit fitted values, $\Phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right] - \Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]$. Note, however, that the Probit coefficients, $\beta_0^*/\sigma_\nu$ and $\beta_1^*/\sigma_\nu$, do not give us the size of the effect of $D_i$ on participation until we feed them back into the Normal CDF (though they do have the right sign).
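To make the last point concrete, here is a small sketch using the two usage rates quoted above (87 and 72 percent). It relies on the fact that a Probit model with a single dummy regressor is saturated, so its maximum-likelihood fitted probabilities equal the two group means. The variable names are ours, and the coefficients are backed out from the rates rather than estimated from microdata.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Normal: cdf is Phi, inv_cdf is its inverse

# Usage rates quoted in the text: free care (D = 0) and deductible (D = 1).
p_free, p_deductible = 0.87, 0.72

# With one dummy regressor the Probit fit reproduces the two cell means,
# so the implied index coefficients (already divided by sigma) are:
b0 = std_normal.inv_cdf(p_free)              # beta0* / sigma
b1 = std_normal.inv_cdf(p_deductible) - b0   # beta1* / sigma

# Feeding the coefficients back through the Normal CDF recovers the
# difference in usage rates, which is also the OLS slope.
probit_effect = std_normal.cdf(b0 + b1) - std_normal.cdf(b0)

print(f"index coefficients: b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"effect on P[Y = 1]: {probit_effect:.3f}")
```

The index coefficient `b1` has the right sign but is not itself the effect on the probability of use; only the difference in fitted values is.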
One of the most important outcomes in the HIE is gross medical expenditure, in other words, health care costs. Did subjects who faced a deductible use less care, as measured by the cost? In the HIE, the average difference in expenditures between the deductible and free-care groups was -141 dollars (s.e. = 60), about 19% of the expenditure level in the free-care group. This calculation suggests that making patients pay a portion of costs reduces expenditures quite a bit, though the estimate is not very precise.
Because expenditure outcomes are non-negative random variables, and sometimes equal to zero, their expectation can be written
$$E[Y_i \mid D_i] = E[Y_i \mid Y_i > 0, D_i]\, P[Y_i > 0 \mid D_i].$$
The difference in expenditure outcomes across treatment groups is
$$
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] &= E[Y_i \mid Y_i > 0, D_i = 1]\, P[Y_i > 0 \mid D_i = 1] - E[Y_i \mid Y_i > 0, D_i = 0]\, P[Y_i > 0 \mid D_i = 0] \\
&= \underbrace{\{P[Y_i > 0 \mid D_i = 1] - P[Y_i > 0 \mid D_i = 0]\}\, E[Y_i \mid Y_i > 0, D_i = 1]}_{\text{participation effect}} \\
&\quad + \underbrace{\{E[Y_i \mid Y_i > 0, D_i = 1] - E[Y_i \mid Y_i > 0, D_i = 0]\}\, P[Y_i > 0 \mid D_i = 0]}_{\text{COP effect}}.
\end{aligned}
\tag{3.4.4}
$$
So the overall difference in average expenditure can be broken up into two parts: the difference in the probability that expenditures are positive (often called a participation effect), and the difference in means conditional on participation, a conditional-on-positive (COP) effect. Again, however, this has no special implications for the estimation of causal effects; equation (3.4.1) remains true: the regression of Yi on Di gives the population average treatment effect for expenditures.
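The decomposition in (3.4.4) is an accounting identity, which a short sketch can confirm. The data below are simulated under assumed participation rates and spending levels (the helper names `simulate_group` and `summarize` are ours); they are not the HIE estimates.

```python
import random

random.seed(1)


def simulate_group(n, p_positive, mean_positive):
    """Hypothetical spending: zero with probability 1 - p_positive,
    otherwise an exponential draw with the given mean."""
    return [
        random.expovariate(1 / mean_positive) if random.random() < p_positive else 0.0
        for _ in range(n)
    ]


def summarize(y):
    positives = [v for v in y if v > 0]
    p_pos = len(positives) / len(y)              # P[Y > 0 | D]
    cop_mean = sum(positives) / len(positives)   # E[Y | Y > 0, D]
    return p_pos, cop_mean, sum(y) / len(y)      # last entry is E[Y | D]


# Illustrative parameters only.
y0 = simulate_group(4000, 0.87, 860.0)  # D = 0 (free care)
y1 = simulate_group(4000, 0.72, 840.0)  # D = 1 (deductible)

p0, m0, mean0 = summarize(y0)
p1, m1, mean1 = summarize(y1)

participation_effect = (p1 - p0) * m1
cop_effect = (m1 - m0) * p0
total = mean1 - mean0

# The two parts add up to the overall difference in means.
print(round(total, 4), round(participation_effect + cop_effect, 4))
```

The identity holds in any sample, so it says nothing by itself about whether either piece has a causal interpretation.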
Good COP, Bad COP: Conditional-on-positive effects
Because the effect on a non-negative random variable like expenditure has two parts, some applied researchers feel they should look at these parts separately. In fact, many use a "two-part model," where the first part is an evaluation of effect on participation and the second part looks at the COP effects (see, e.g., Duan et al., 1983 and 1984, for such models applied to the HIE). The first part of (3.4.4) raises no special issues, because, as noted above, the fact that Yi is a dummy means only that average treatment effects are also differences in probabilities. The problem with the two-part model is that the COP effects do not have a causal interpretation, even in a randomized trial. This is exactly the same selection problem raised in Section 3.2.3, on bad control.
To analyze the COP effect further, write
$$
\begin{aligned}
E[Y_i \mid Y_i > 0, D_i = 1] - E[Y_i \mid Y_i > 0, D_i = 0] &= E[Y_{1i} \mid Y_{1i} > 0] - E[Y_{0i} \mid Y_{0i} > 0] \\
&= \underbrace{E[Y_{1i} - Y_{0i} \mid Y_{1i} > 0]}_{\text{causal effect}} + \underbrace{\{E[Y_{0i} \mid Y_{1i} > 0] - E[Y_{0i} \mid Y_{0i} > 0]\}}_{\text{selection bias}}.
\end{aligned}
\tag{3.4.5}
$$
This decomposition shows that the COP effect is composed of two terms: a causal effect for the subpopulation that uses medical care even when facing a deductible ($Y_{1i} > 0$), and the difference in $Y_{0i}$ between those who use medical care when they have to pay something and those who use medical care when it is free. This second term is a form of selection bias, though it is more subtle than the selection bias in Chapter 2.
Here selection bias arises because the experiment changes the composition of the group with positive expenditures. The $Y_{0i} > 0$ population probably includes some low-cost users who would opt out of care if they had to pay a deductible. In other words, it is larger and probably has lower costs on average than the $Y_{1i} > 0$ group. The selection bias term is therefore positive, with the result that COP effects are closer to zero than the negative causal effect, $E[Y_{1i} - Y_{0i} \mid Y_{1i} > 0]$. This is a version of the bad control problem from Section 3.2.3: in a causal-effects setting, $Y_i > 0$ is an outcome variable and therefore unkosher for conditioning unless the treatment has no effect on the likelihood that $Y_i$ is positive.
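A simulation with both potential outcomes in hand makes the composition effect visible. The sketch below assumes a stylized behavioral rule (low-cost users drop out under the deductible and everyone else spends 30 percent less); the rule and all numbers are hypothetical. It computes the population versions of the terms in (3.4.5), which is what the observed COP contrast estimates under random assignment.

```python
import random

random.seed(2)

# Hypothetical potential outcomes (illustrative, not HIE data).
# y0: spending under free care; y1: spending under the deductible.
people = []
for _ in range(20000):
    y0 = random.expovariate(1 / 800.0) if random.random() < 0.87 else 0.0
    # Assumed behavior: users with y0 below 150 opt out of care when they
    # face a deductible; everyone else spends 30 percent less.
    y1 = 0.7 * y0 if y0 >= 150 else 0.0
    people.append((y0, y1))


def mean(values):
    return sum(values) / len(values)


users_if_free = [(y0, y1) for y0, y1 in people if y0 > 0]
users_if_deductible = [(y0, y1) for y0, y1 in people if y1 > 0]

# COP contrast: E[Y1 | Y1 > 0] - E[Y0 | Y0 > 0].
cop_effect = mean([y1 for _, y1 in users_if_deductible]) - mean([y0 for y0, _ in users_if_free])
# Causal effect for those who use care under the deductible.
causal_effect = mean([y1 - y0 for y0, y1 in users_if_deductible])
# Selection bias: E[Y0 | Y1 > 0] - E[Y0 | Y0 > 0].
selection_bias = mean([y0 for y0, _ in users_if_deductible]) - mean([y0 for y0, _ in users_if_free])

print(f"COP effect:     {cop_effect:8.1f}")
print(f"causal effect:  {causal_effect:8.1f}")
print(f"selection bias: {selection_bias:8.1f}")
```

Under these assumptions the selection term is positive, so the COP contrast understates the reduction in spending among those who keep using care.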
One resolution of the non-causality of COP effects relies on censored regression models like Tobit. These models postulate a latent expenditure outcome for nonparticipants (e.g., Hay and Olsen, 1984). A traditional Tobit formulation for the expenditure problem stipulates that the observed $Y_i$ is generated by
$$Y_i = 1[Y_i^* > 0]\, Y_i^*,$$
where $Y_i^*$ is a Normally distributed latent expenditure variable that can take on negative values. Because $Y_i^*$ is not an LDV, Tobit proponents feel comfortable linking this to $D_i$ with a traditional linear model, say, equation (3.4.3). In this case, $\beta_1^*$ is the causal effect of $D_i$ on latent expenditure, $Y_i^*$. This equation is defined for everyone, whether $Y_i$ is positive or not. There is no COP-style selection problem if we are happy to study effects on $Y_i^*$.
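But we are not happy with effects on $Y_i^*$. The first problem is that "latent health care expenditure" is a puzzling construct.[34] Health care expenditure really is zero for some people; this is not a statistical artifact or due to some kind of censoring. So the notion of latent and potentially negative $Y_i^*$ is hard to grasp. There is no data on $Y_i^*$ and there never will be. A second problem is that the link between the parameter $\beta_1^*$ in the latent model and causal effects on the observed outcome, $Y_i$, turns on distributional assumptions about the latent variable. To establish this link we evaluate the expectation of $Y_i$ given $D_i$ to find
$$E[Y_i \mid D_i] = \Phi\left[\frac{\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]\left[\beta_0^* + \beta_1^* D_i\right] + \sigma_\nu\, \phi\left[\frac{\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right], \tag{3.4.6}$$
where $\sigma_\nu$ is the standard deviation of $\nu_i$ and $\phi[\cdot]$ is the Normal density (see, e.g., McDonald and Moffitt, 1980). This expression involves the assumed Normality and homoskedasticity of $\nu_i$ and the assumption that $Y_i$ can be represented as $1[Y_i^* > 0]\, Y_i^*$, as well as the latent coefficients.
The Tobit CEF provides us with an expression for a treatment effect on observed expenditure. Specifically,
$$
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] &= \left\{\Phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right]\left[\beta_0^* + \beta_1^*\right] + \sigma_\nu\, \phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right]\right\} \\
&\quad - \left\{\Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]\beta_0^* + \sigma_\nu\, \phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]\right\},
\end{aligned}
\tag{3.4.7}
$$
a rather daunting expression. But since the only conditioning variable is a dummy variable, $D_i$, none of this is necessary for the estimation of $E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$. The slope coefficient from an OLS regression of $Y_i$ on $D_i$ recovers the CEF difference on the left-hand side of (3.4.7) whether or not you adopt a Tobit model to explain the underlying structure.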
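As a rough check, the sketch below generates data from a Tobit-style latent model with made-up parameters and compares the simple difference in means with the standard Tobit conditional-mean formula, $\Phi[z]\mu + \sigma\phi[z]$ with $\mu = \beta_0^* + \beta_1^* D_i$ and $z = \mu/\sigma$. The parameter values and function names are assumptions for illustration.

```python
import random
from statistics import NormalDist

random.seed(3)
std_normal = NormalDist()

# Hypothetical latent-model parameters (assumptions for illustration).
beta0, beta1, sigma = 0.5, -0.4, 1.0


def tobit_cef(d):
    """E[Y | D = d] implied by Y = 1[Y* > 0] Y* with Normal latent errors."""
    index = beta0 + beta1 * d
    z = index / sigma
    return std_normal.cdf(z) * index + sigma * std_normal.pdf(z)


def simulated_mean(d, n=100_000):
    """Sample mean of Y for draws from the same latent model."""
    total = 0.0
    for _ in range(n):
        y_star = beta0 + beta1 * d - random.gauss(0.0, sigma)
        total += max(y_star, 0.0)  # observed outcome is censored at zero
    return total / n


formula_effect = tobit_cef(1) - tobit_cef(0)
difference_in_means = simulated_mean(1) - simulated_mean(0)

print(f"Tobit formula:       {formula_effect:.4f}")
print(f"difference in means: {difference_in_means:.4f}")
print(f"latent coefficient:  {beta1:.4f}")
```

Even when the Tobit model is exactly right, the difference in means delivers the effect on the observed outcome without any of the latent-variable machinery, and the latent coefficient overstates that effect in absolute value.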
COP effects are sometimes motivated by a researcher’s sense that when the outcome distribution has a mass point - that is, it piles up on particular values like zero - or a heavily skewed distribution, or both, then an analysis of effects on averages misses something. Analyses of effects on averages indeed miss some things, like changes in the probability of specific values, or a shift in quantiles away from the median. But why not look at these distribution effects directly? A sensible alternative to COP effects looks directly at effects on distributions or quantiles. Distribution outcomes include the likelihood that annual medical expenditures exceed zero, 100 dollars, 200 dollars, and so on. This puts 1[Yi > c] for different choices of c on the left-hand side of the regression of interest. Econometrically, these outcomes are all in the category of equation (3.4.2). The idea of looking directly at distribution effects with linear probability models is illustrated by Angrist (2001), in an analysis of the effects of childbearing on hours worked. Alternately, if quantiles provide a focal point, we can use quantile regression to model them. Chapter 7 discusses this idea in detail.
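Distribution effects of this kind take only a few lines to compute. The sketch below uses simulated spending for two hypothetical arms (all parameters are assumptions) and reports the treatment-control difference in the share of people with spending above each threshold c, which is the slope in a regression of the indicator 1[Y > c] on the treatment dummy.

```python
import random

random.seed(4)


def draw_spending(p_positive, mean_positive):
    """Hypothetical spending draw: zero or an exponential amount."""
    return random.expovariate(1 / mean_positive) if random.random() < p_positive else 0.0


# Hypothetical spending data for the two arms (not HIE microdata).
free = [draw_spending(0.87, 860.0) for _ in range(5000)]
deductible = [draw_spending(0.72, 840.0) for _ in range(5000)]


def share_above(y, c):
    """Sample analog of P[Y > c]."""
    return sum(1 for v in y if v > c) / len(y)


thresholds = [0, 100, 200, 500, 1000]
distribution_effects = {
    c: share_above(deductible, c) - share_above(free, c) for c in thresholds
}

for c, effect in distribution_effects.items():
    print(f"P[Y > {c}]: effect of deductible = {effect:+.3f}")
```

Each entry is an average causal effect on a dummy outcome in a randomized comparison, so none of them suffers from the COP selection problem.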
Do Tobit-type latent-variable models ever make sense? Yes, if the data you are working with are truly censored. True censoring means the latent variable has an empirical counterpart that is the outcome of primary interest. A leading example from labor economics is CPS earnings data, which topcodes (censors) very high values of earnings to protect respondent confidentiality. Typically, we’re interested in the causal effect of schooling on earnings as it appears on respondents’ tax returns, not their CPS-topcoded earnings. Chamberlain (1994) shows that in some years, CPS topcoding reduces the measured returns to schooling considerably, and proposes an adjustment for censoring based on a Tobit-style adaptation of quantile regression. The use of quantile regression to model censored data is also discussed in Chapter 7.[35]
Covariates lead to nonlinearity
True censoring as with the CPS topcode is rare, a fact that leaves limited scope for constructive applications of Tobit-type models in applied work. At this point, however, we have to hedge a bit. Part of the neatness in the discussion of experiments comes from the fact that $E[Y_i \mid D_i]$ is necessarily a linear function of $D_i$, so that regression and the CEF are one and the same. In fact, this CEF is linear for any function of $Y_i$, including the distribution indicators, $1[Y_i > c]$. In practice, of course, the explanatory variable of interest isn’t always a dummy, and there are usually additional covariates in the CEF, in which case $E[Y_i \mid X_i, D_i]$ is almost certainly nonlinear for LDVs. Intuitively, as predicted means get close to the dependent variable boundaries, say because some covariate cells are close to the boundaries, the derivatives of the CEF for LDVs get smaller (think, for example, of how the Normal CDF flattens at extreme values).
The upshot is that in LDV models with covariates, regression need not fit the CEF perfectly. It remains true, however, that the underlying CEF has a causal interpretation if the CIA holds. And if the CEF has a causal interpretation, it seems fair to say that regression has a causal interpretation as well, because it still provides the MMSE approximation to the CEF. Moreover, if the model for covariates is saturated, then regression also estimates a weighted average treatment effect similar to (3.3.1) and (3.3.3). Likewise, if the regressor of interest is multi-valued or continuous, we get a weighted average derivative, as described by the formulas in subsection 3.3.1.
And yet, we don’t often have enough data for the saturated-covariate regression specification to be very attractive. Regression will therefore miss some features of the CEF. For one thing, it may generate fitted values outside the LDV boundaries. This fact bothers some researchers and has certainly generated a lot of bad press for the linear probability model. One attractive feature of nonlinear models like Probit and Tobit is that they produce CEFs that respect LDV boundaries. In particular, Probit fitted values are always between zero and one, while Tobit fitted values are positive (this is not obvious from equation 3.4.6). We might therefore prefer nonlinear models on simple curve-fitting grounds.
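The boundary problem is easy to reproduce. In the sketch below the true CEF is assumed to be $\Phi(1.5x)$ on an evenly spaced grid of a single hypothetical covariate x, and we compute the best linear approximation to that CEF on the grid. Both the functional form and the grid are illustrative assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Hypothetical CEF: P[Y = 1 | x] = Phi(1.5 x) on an evenly spaced grid of x.
xs = [-3 + 6 * k / 200 for k in range(201)]
cef = [std_normal.cdf(1.5 * x) for x in xs]


def mean(values):
    return sum(values) / len(values)


# Least-squares line through the CEF (the best linear approximation on this grid).
x_bar, p_bar = mean(xs), mean(cef)
slope = sum((x - x_bar) * (p - p_bar) for x, p in zip(xs, cef)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = p_bar - slope * x_bar
linear_fit = [intercept + slope * x for x in xs]

print(f"linear fit ranges from {min(linear_fit):.3f} to {max(linear_fit):.3f}")
print(f"CEF ranges from {min(cef):.3f} to {max(cef):.3f}")
```

The linear fit strays outside the unit interval at the extremes of x while the CEF stays inside it, which is the curve-fitting point conceded to nonlinear models here.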
Point conceded. It’s important to emphasize, however, that the output from nonlinear models must be converted into marginal effects to be useful. Marginal effects are the (average) changes in CEF implied by a nonlinear model. Without marginal effects, it’s hard to talk about the impact on observed dependent variables. Continuing to assume the regressor of interest is $D_i$, population average marginal effects can be constructed either by differencing,
$$E\{E[Y_i \mid X_i, D_i = 1] - E[Y_i \mid X_i, D_i = 0]\},$$
or by differentiation,
$$E\left\{\frac{\partial E[Y_i \mid X_i, D_i]}{\partial D_i}\right\}.$$
The derivative form can be used with continuous or multi-valued regressors as well.
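How close do OLS regression estimates come to the marginal effects induced by a nonlinear model like Probit or Tobit? We first derive the marginal effects, and then show an empirical example. The Probit CEF for a model with covariates is
$$E[Y_i \mid X_i, D_i] = \Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right].$$
The average finite difference is therefore
$$E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^*}{\sigma_\nu}\right] - \Phi\left[\frac{X_i'\beta_0^*}{\sigma_\nu}\right]\right\}. \tag{3.4.8}$$
In practice, this can also be approximated by the average derivative (with $\phi[\cdot]$ denoting the Normal density),
$$E\left\{\phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]\right\} \cdot \left(\frac{\beta_1^*}{\sigma_\nu}\right)$$
(Stata computes marginal effects both ways but defaults to (3.4.8) for dummy regressors). Similarly, generalizing equation (3.4.6) to a model with covariates, we have
$$E[Y_i \mid X_i, D_i] = \Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]\left[X_i'\beta_0^* + \beta_1^* D_i\right] + \sigma_\nu\, \phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]$$
for a non-negative LDV. Tobit marginal effects are almost always cast in terms of the average derivative, which can be shown to be the surprisingly simple expression
$$E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]\right\} \cdot \beta_1^*. \tag{3.4.9}$$
See, e.g., Wooldridge (2006). One immediate implication of (3.4.9) is that the Tobit coefficient, $\beta_1^*$, is always too big relative to the effect of $D_i$ on $Y_i$. Intuitively, this is because, given the linear model for latent $Y_i^*$, the latent outcome always changes when $D_i$ switches on or off. But real $Y_i$ need not change: for many people, it’s zero either way.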
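The marginal-effect formulas above are simple averages over the sample. The sketch below evaluates them for one hypothetical covariate and made-up index coefficients (taken as given rather than estimated, with sigma normalized to one); it is meant only to show the mechanics of the finite-difference and derivative calculations, not to reproduce any software package's output.

```python
import random
from statistics import NormalDist

random.seed(5)
std_normal = NormalDist()

# Hypothetical inputs: one covariate x, a dummy d, and made-up latent-index
# coefficients. Sigma is normalized to one, so the same coefficients serve
# for the Probit index and the Tobit latent model in this illustration.
n = 10_000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
d = [random.randint(0, 1) for _ in range(n)]
b_const, b_x, b_d = 0.2, 0.5, -0.4


def index(xi, di):
    return b_const + b_x * xi + b_d * di


# Probit, average finite difference: mean of Phi(index at d=1) - Phi(index at d=0).
probit_finite_diff = sum(
    std_normal.cdf(index(xi, 1)) - std_normal.cdf(index(xi, 0)) for xi in x
) / n

# Probit, average derivative: mean of phi(index) times the coefficient on d.
probit_derivative = sum(std_normal.pdf(index(xi, di)) for xi, di in zip(x, d)) / n * b_d

# Tobit, average derivative: mean of Phi(index / sigma) times the latent coefficient.
tobit_derivative = sum(std_normal.cdf(index(xi, di)) for xi, di in zip(x, d)) / n * b_d

print(f"Probit finite difference: {probit_finite_diff:.4f}")
print(f"Probit average derivative: {probit_derivative:.4f}")
print(f"Tobit average derivative:  {tobit_derivative:.4f} (latent coefficient {b_d})")
```

Because the average of a Normal CDF is below one, the Tobit marginal effect is always smaller in absolute value than the latent coefficient, in line with the implication drawn from (3.4.9).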
Table 3.4.2 compares regression and nonlinear marginal effects for a regression of female employment and hours of work, both LDVs, on measures of fertility. The estimates were constructed using one of the 1980 Census samples used by Angrist and Evans (1998). This sample includes married women aged 21-35 with at least two children. The childbearing variables consist of either a dummy indicating additional childbearing beyond two, or the total number of births. The covariates include linear terms in mothers’ age, age at first birth, race dummies (black and Hispanic), and mother’s education (dummies for high school graduates, some college, and college graduates). The covariate model is not saturated; rather, there are linear effects and no interactions, so the underlying CEF in this example is surely nonlinear.
Probit marginal effects for the effect of a dummy variable indicating more than two children are indistinguishable from OLS estimates of the same relation. This can be seen in columns 2, 3, and 4 of Table 3.4.2, the first row of which compares the estimates from different methods for the full 1980 sample. The OLS estimate of the effect of a third child is -.162, while the corresponding Probit marginal effects are -.163 and -.162. These were estimated using (3.4.8) in the first case and
$$E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^*}{\sigma_\nu}\right] - \Phi\left[\frac{X_i'\beta_0^*}{\sigma_\nu}\right] \,\Big|\, D_i = 1\right\}$$
in the second (hence, a marginal effect on the treated).
Tobit marginal effects for the relation between fertility and hours worked are also quite close to the corresponding OLS estimates, though not indistinguishable. This can be seen in columns 5 and 6. Compare, for example, the Tobit estimates of -6.56 and -5.87 with the OLS estimate of -5.92 in column 2. Although one Tobit estimate is 10 percent larger in absolute value, this seems unlikely to be of substantive importance. The remaining columns of the table compare OLS to marginal effects for an ordinal childbearing variable instead of a dummy. These calculations all use derivatives to compute marginal effects (labeled MFX). Here too, the OLS and nonlinear marginal effects estimates are similar for both Probit and Tobit.
It is sometimes said that Probit models can be expected to generate marginal effects close to OLS when the fitted values are close to .5 because the nonlinear CEF is roughly linear in the middle. We therefore replicated the comparison of OLS and marginal effects in a subsample with relatively high average employment rates, non-white women over 30 who attended college and whose first birth was before age 20. Although the average employment rate is 83 percent in this group, the OLS estimates and marginal effects are again similar.
Table 3.4.2: Comparison of alternative estimates of the effect of childbearing on LDVs
Right-hand side variable: more than two children in columns (2)-(6); number of children in columns (7)-(10).
| Dependent variable | (1) Mean | (2) OLS | (3) Probit: avg effect, full sample | (4) Probit: avg effect on treated | (5) Tobit: avg effect, full sample | (6) Tobit: avg effect on treated | (7) OLS | (8) Probit MFX: avg effect, full sample | (9) Tobit MFX: avg effect, full sample | (10) Tobit MFX: avg effect on treated |
|---|---|---|---|---|---|---|---|---|---|---|
| Panel A: Full Sample | | | | | | | | | | |
| Employment | .528 | -.162 | -.163 | -.162 | - | - | -.113 | -.114 | - | - |
| | (.499) | (.002) | (.002) | (.002) | | | (.001) | (.001) | | |
| Hours worked | 16.7 | -5.92 | - | - | -6.56 | -5.87 | -4.07 | - | -4.66 | -4.23 |
| | (18.3) | (.074) | | | (.081) | (.073) | (.047) | | (.054) | (.049) |
| Panel B: Non-white College Attendees over 30, first birth before age 20 | | | | | | | | | | |
| Employment | .832 | -.061 | -.064 | -.070 | - | - | -.054 | -.048 | - | - |
| | (.374) | (.028) | (.028) | (.031) | | | (.016) | (.013) | | |
| Hours worked | 30.8 | -4.69 | - | - | -4.97 | -4.90 | -2.83 | - | -3.20 | -3.15 |
| | (16.0) | (1.18) | | | (1.33) | (1.31) | (.645) | | (.670) | (.659) |
Notes: The table reports OLS estimates, average treatment effects, and marginal effects (MFX) for the effect of childbearing on mothers’ labor supply. The sample in Panel A includes 254,654 observations and is the same as the married-women 1980 Census sample used by Angrist and Evans (1998). Covariates include age, age at first birth, and dummies for boys at first and second birth. The sample in Panel B includes 746 nonwhites with at least some college aged over 30 whose first birth was before age 20. Standard deviations are reported in parentheses in column 1. Standard errors are shown in parentheses in other columns. The sample used to estimate average effects on the treated includes women with more than two children.
The upshot of this discussion is that while a nonlinear model may fit the CEF for LDVs more closely than a linear model, when it comes to marginal effects this probably matters little. This optimistic conclusion is not a theorem, but as in the empirical example here, it seems to be fairly robustly true.
Why, then, should we bother with nonlinear models and marginal effects? One answer is that the marginal effects are easy enough to compute now that they are automated in packages like Stata. But there are a number of decisions to make along the way (e.g., the weighting scheme, derivatives versus finite differences) while OLS is standardized. Nonlinear life also promises to get considerably more complicated when we start to think about IV and panel data. Finally, extra complexity comes into the inference step as well, since we need standard errors for marginal effects. The principle of Occam’s razor advises, "Entities should not be multiplied unnecessarily." In this spirit, we quote our former teacher, Angus Deaton (1997), pondering the nonlinear regression function generated by Tobit-type models:
Absent knowledge of F [the distribution of the errors], this regression function does not even identify the β’s [Tobit coefficients] - see Powell (1989) - but more fundamentally, we should ask how it has come about that we have to deal with such an awkward, difficult, and non-robust object.