Mostly Harmless Econometrics: An Empiricist’s Companion
Sharp RD
$$
D_i = \begin{cases} 1 & \text{if } x_i \geq x_0 \\ 0 & \text{if } x_i < x_0 \end{cases} \tag{6.1.1}
$$
where $x_0$ is a known threshold or cutoff. This assignment mechanism is a deterministic function of $x_i$ because once we know $x_i$ we know $D_i$. It’s a discontinuous function because no matter how close $x_i$ gets to $x_0$, treatment is unchanged until $x_i = x_0$.
This may seem a little abstract, so here is an example. American high school students are awarded National Merit Scholarship Awards on the basis of scores on the PSAT, a test taken by most college-bound high school juniors, especially those who will later take the SAT. The question that motivated the first discussions of RD is whether students who win these awards are more likely to finish college (Thistlethwaite and Campbell, 1960; Campbell, 1969). Sharp RD compares the college completion rates of students with PSAT scores just above and just below the National Merit Award thresholds. In general, we might expect students with higher PSAT scores to be more likely to finish college, but this effect can be controlled by fitting a regression to the relationship between college completion and PSAT scores, at least in the neighborhood of the award cutoff. In this example, jumps in the relationship between PSAT scores and college completion in the neighborhood of the award threshold are taken as evidence of a treatment effect. It is this jump in regression lines that gives RD its name.[93]
An interesting and important feature of RD, highlighted in a recent survey of RD by Imbens and Lemieux (2008), is that there is no value of $x_i$ at which we get to observe both treatment and control observations. Unlike full-covariate matching strategies, which are based on treatment-control comparisons conditional on covariate values where there is some overlap, the validity of RD turns on our willingness to extrapolate across covariate values, at least in a neighborhood of the discontinuity. This is one reason why sharp RD is usually seen as distinct from other control strategies. For this same reason, we typically cannot afford to be as agnostic about regression functional form in the RD world as in the world of Chapter 3.
Figure 6.1.1 illustrates a hypothetical RD scenario where those with $x_i \geq 0.5$ are treated. In Panel A, the trend relationship between $Y_i$ and $x_i$ is linear, while in Panel B, it’s nonlinear. In both cases, there is a discontinuity in the relation between $E[Y_i \mid x_i]$ and $x_i$ around the point $x_0$.
A simple model formalizes the RD idea. Suppose that in addition to the assignment mechanism, (6.1.1), potential outcomes can be described by a linear, constant-effects model
$$
\begin{aligned}
E[Y_{0i} \mid x_i] &= \alpha + \beta x_i \\
Y_{1i} &= Y_{0i} + \rho.
\end{aligned}
$$
This leads to the regression,
$$
Y_i = \alpha + \beta x_i + \rho D_i + \eta_i, \tag{6.1.2}
$$
where $\rho$ is the causal effect of interest. The key difference between this regression and others we’ve used to estimate treatment effects (e.g., in Chapter 3) is that $D_i$, the regressor of interest, is not only correlated with $x_i$, it is a deterministic function of $x_i$. RD captures causal effects by distinguishing the nonlinear and discontinuous function, $1(x_i \geq x_0)$, from the smooth and (in this case) linear function, $x_i$.
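To see the mechanics of (6.1.2), here is a minimal Python sketch on simulated data. The parameter values (a cutoff of 0.5, $\alpha = 1$, $\beta = 2$, $\rho = 0.4$), the function names, and the use of NumPy least squares are illustrative assumptions, not taken from any study discussed here.

```python
import numpy as np

def simulate_sharp_rd(n=2000, cutoff=0.5, alpha=1.0, beta=2.0, rho=0.4, seed=0):
    """Draw data from a linear, constant-effects model with a sharp cutoff."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)       # running variable x_i
    d = (x >= cutoff).astype(float)         # D_i = 1(x_i >= x_0), as in (6.1.1)
    eta = rng.normal(0.0, 0.1, size=n)      # regression error
    y = alpha + beta * x + rho * d + eta    # observed outcome
    return x, d, y

def fit_linear_rd(x, d, y):
    """OLS of y on a constant, x, and d; returns [alpha_hat, beta_hat, rho_hat]."""
    design = np.column_stack([np.ones_like(x), x, d])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

x, d, y = simulate_sharp_rd()
alpha_hat, beta_hat, rho_hat = fit_linear_rd(x, d, y)
print(f"alpha={alpha_hat:.3f}  beta={beta_hat:.3f}  rho={rho_hat:.3f}")
```

Because $D_i$ is a function of $x_i$ alone, the coefficient on `d` is identified only through the jump at the cutoff; the linear term in `x` absorbs the smooth trend.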
But what if the trend relation, $E[Y_{0i} \mid x_i]$, is nonlinear? To be precise, suppose that $E[Y_{0i} \mid x_i] = f(x_i)$ for some reasonably smooth function, $f(x_i)$. Panel B in Figure 6.1.1 suggests there is still hope even in this more general case. Now we can construct RD estimates by fitting
$$
Y_i = f(x_i) + \rho D_i + \eta_i, \tag{6.1.3}
$$
where again, $D_i = 1(x_i \geq x_0)$ is discontinuous in $x_i$ at $x_0$. As long as $f(x_i)$ is continuous in a neighborhood of $x_0$, it should be possible to estimate a model like (6.1.3), even with a flexible functional form for $f(x_i)$. For example, modeling $f(x_i)$ with a $p$th-order polynomial, RD estimates can be constructed from the regression
$$
Y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p + \rho D_i + \eta_i. \tag{6.1.4}
$$
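A generalization of RD based on (6.1.4) allows different trend functions for $E[Y_{0i} \mid x_i]$ and $E[Y_{1i} \mid x_i]$. Modeling both of these CEFs with $p$th-order polynomials, we have
$$
\begin{aligned}
E[Y_{0i} \mid x_i] &= f_0(x_i) = \alpha + \beta_{01}\tilde{x}_i + \beta_{02}\tilde{x}_i^2 + \dots + \beta_{0p}\tilde{x}_i^p \\
E[Y_{1i} \mid x_i] &= f_1(x_i) = \alpha + \rho + \beta_{11}\tilde{x}_i + \beta_{12}\tilde{x}_i^2 + \dots + \beta_{1p}\tilde{x}_i^p,
\end{aligned}
$$
where $\tilde{x}_i \equiv x_i - x_0$. Centering $x_i$ at $x_0$ is just a normalization; it ensures that the treatment effect at $x_i = x_0$ is still the coefficient on $D_i$ in the regression model with interactions.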
To derive a regression model that can be used to estimate the effects of interest in this case, we use the fact that $D_i$ is a deterministic function of $x_i$ to write
$$
E[Y_i \mid x_i] = E[Y_{0i} \mid x_i] + E[Y_{1i} - Y_{0i} \mid x_i] D_i.
$$
Substituting polynomials for conditional expectations, we then have
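$$
\begin{aligned}
Y_i = {} & \alpha + \beta_{01}\tilde{x}_i + \beta_{02}\tilde{x}_i^2 + \dots + \beta_{0p}\tilde{x}_i^p \\
& + \rho D_i + \beta_1^* D_i \tilde{x}_i + \beta_2^* D_i \tilde{x}_i^2 + \dots + \beta_p^* D_i \tilde{x}_i^p + \eta_i,
\end{aligned} \tag{6.1.6}
$$
where $\beta_1^* = \beta_{11} - \beta_{01}$, $\beta_2^* = \beta_{12} - \beta_{02}$, and $\beta_p^* = \beta_{1p} - \beta_{0p}$, and the error term, $\eta_i$, is the CEF residual.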
Equation (6.1.4) is a special case of (6.1.6) where $\beta_1^* = \beta_2^* = \dots = \beta_p^* = 0$. In the more general model, the treatment effect at $x_i - x_0 = c > 0$ is $\rho + \beta_1^* c + \beta_2^* c^2 + \dots + \beta_p^* c^p$, while the treatment effect at $x_0$ is $\rho$. The model with interactions has the attraction that it imposes no restrictions on the underlying conditional mean functions. But in our experience, RD estimates of $\rho$ based on the simpler model, (6.1.4), usually turn out to be similar to those based on (6.1.6).
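The comparison between the common-trend and interacted polynomial models can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the simulated data, the quadratic trend, the effect that grows with distance from the cutoff, and the helper name `polynomial_rd`. The common-trend fit uses the centered running variable, which spans the same polynomial space as the uncentered one and so leaves the coefficient on $D_i$ unchanged.

```python
import numpy as np

def polynomial_rd(x, y, cutoff, order, interactions):
    """Estimated jump at the cutoff from a polynomial RD regression.

    interactions=False: one polynomial trend on both sides, as in (6.1.4).
    interactions=True: adds D_i times each power of (x_i - x_0), as in (6.1.6).
    """
    xt = x - cutoff                                   # centered running variable
    d = (xt >= 0).astype(float)
    powers = np.column_stack([xt ** k for k in range(1, order + 1)])
    columns = [np.ones_like(xt), powers, d]
    if interactions:
        columns.append(powers * d[:, None])           # D_i * xt^k terms
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[order + 1]                            # coefficient on D_i

# Hypothetical data: quadratic trend, treatment effect 0.4 + 1.0 * (x - x_0).
rng = np.random.default_rng(1)
n, cutoff = 4000, 0.5
x = rng.uniform(0.0, 1.0, size=n)
xt = x - cutoff
d = (xt >= 0).astype(float)
y = 1.0 + 2.0 * xt + 1.5 * xt ** 2 + d * (0.4 + 1.0 * xt) + rng.normal(0.0, 0.1, size=n)

rho_simple = polynomial_rd(x, y, cutoff, order=2, interactions=False)
rho_interacted = polynomial_rd(x, y, cutoff, order=2, interactions=True)
print(f"common trend: {rho_simple:.3f}   with interactions: {rho_interacted:.3f}")
```

In this symmetric design the two estimates of the effect at the cutoff should be close to each other, though nothing guarantees that in general.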
The validity of RD estimates based on (6.1.4) or (6.1.6) turns on whether polynomial models provide an adequate description of $E[Y_{0i} \mid x_i]$. If not, then what looks like a jump due to treatment might simply be an unaccounted-for nonlinearity in the counterfactual conditional mean function. This possibility is illustrated in Panel C of Figure 6.1.1, which shows how a sharp turn in $E[Y_{0i} \mid x_i]$ might be mistaken for a jump from one regression line to another. To reduce the likelihood of such mistakes, we can look only at data in a neighborhood around the discontinuity, say the interval $[x_0 - \delta, x_0 + \delta]$ for some small number $\delta$. Then we have
$$
\begin{aligned}
E[Y_i \mid x_0 - \delta < x_i < x_0] &\simeq E[Y_{0i} \mid x_i = x_0] \\
E[Y_i \mid x_0 < x_i < x_0 + \delta] &\simeq E[Y_{1i} \mid x_i = x_0],
\end{aligned}
$$
so that
$$
\lim_{\delta \to 0} E[Y_i \mid x_0 < x_i < x_0 + \delta] - E[Y_i \mid x_0 - \delta < x_i < x_0] = E[Y_{1i} - Y_{0i} \mid x_i = x_0]. \tag{6.1.7}
$$
In other words, comparisons of average outcomes in a small enough neighborhood to the left and right of $x_0$ should provide an estimate of the treatment effect that does not depend on the correct specification of a model for $E[Y_{0i} \mid x_i]$. Moreover, the validity of this nonparametric estimation strategy does not turn on the constant-effects assumption, $Y_{1i} - Y_{0i} = \rho$; the estimand in (6.1.7) is the average causal effect, $E[Y_{1i} - Y_{0i} \mid x_i = x_0]$.
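A small simulation makes the limiting argument in (6.1.7) concrete. The sketch below compares mean outcomes in shrinking windows on either side of the cutoff; the data-generating process (a quadratic trend and an effect that varies with $x_i$ but averages 0.4 at the cutoff), the window widths, and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def window_difference(x, y, cutoff, delta):
    """Mean of y just right of the cutoff minus mean of y just left of it."""
    left = (x > cutoff - delta) & (x < cutoff)
    right = (x >= cutoff) & (x < cutoff + delta)   # treated side includes the cutoff
    return y[right].mean() - y[left].mean()

rng = np.random.default_rng(2)
n, cutoff = 20000, 0.5
x = rng.uniform(0.0, 1.0, size=n)
d = x >= cutoff
y0 = 1.0 + 2.0 * x + x ** 2 + rng.normal(0.0, 0.1, size=n)   # nonlinear E[Y0 | x]
effect = 0.4 + 0.5 * (x - cutoff)                            # effect at the cutoff is 0.4
y = y0 + d * effect

estimates = {delta: window_difference(x, y, cutoff, delta)
             for delta in (0.4, 0.2, 0.1, 0.05, 0.01)}
for delta, est in estimates.items():
    print(f"delta={delta:<5} difference in means={est:.3f}")
```

Wide windows mix the treatment effect with the upward trend in the outcome; as the window shrinks, the difference in means moves toward the effect at the cutoff, at the cost of using fewer observations.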
The nonparametric approach to RD requires good estimates of the mean of $Y_i$ in small neighborhoods to the right and left of $x_0$. Obtaining such estimates is tricky. The first problem is that working in a small neighborhood of the cutoff means that you don’t have much data. Also, the sample average is biased for the population average in the neighborhood of a boundary (in this case, $x_0$). Solutions to these problems include the use of a nonparametric version of regression called local linear regression (Hahn, Todd, and van der Klaauw, 2001) and the partial-linear and local-polynomial estimators developed by Porter (2003). Local linear regression amounts to weighted least squares estimation of an equation like (6.1.6), with linear terms only and more weight given to points close to the cutoff.
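As a rough illustration of the weighted least squares idea, the sketch below fits (6.1.6) with linear terms only, weighting observations by their distance from the cutoff. The triangular kernel, the bandwidth of 0.1, the simulated data, and the function name `local_linear_rd` are assumptions made for this example; the text does not specify a kernel or a bandwidth rule.

```python
import numpy as np

def local_linear_rd(x, y, cutoff, bandwidth):
    """Weighted least squares estimate of the jump in E[y | x] at the cutoff."""
    xt = x - cutoff
    # Triangular kernel (an assumed choice): weight falls linearly to zero at |xt| = bandwidth.
    weights = np.clip(1.0 - np.abs(xt) / bandwidth, 0.0, None)
    keep = weights > 0
    xt, yk, weights = xt[keep], y[keep], weights[keep]
    d = (xt >= 0).astype(float)
    design = np.column_stack([np.ones_like(xt), xt, d, d * xt])
    # WLS is OLS after scaling each row by the square root of its weight.
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], yk * root_w, rcond=None)
    return coef[2]                                   # coefficient on D_i

rng = np.random.default_rng(3)
n, cutoff = 20000, 0.5
x = rng.uniform(0.0, 1.0, size=n)
d = x >= cutoff
y = 1.0 + 2.0 * x + x ** 2 + d * (0.4 + 0.5 * (x - cutoff)) + rng.normal(0.0, 0.1, size=n)

jump = local_linear_rd(x, y, cutoff, bandwidth=0.1)
print(f"local linear estimate of the jump: {jump:.3f}")
```

Fitting a separate line on each side lets the estimator track the local slope, which is what reduces the boundary bias that a simple average in the same window would carry.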
Sophisticated nonparametric RD methods have not yet found wide application in empirical practice; most applied RD work is still parametric. But the idea of focusing on observations near the cutoff value (what Angrist and Lavy (1999) call a "discontinuity sample") suggests a valuable robustness check: although RD estimates get less precise as the window used to select a discontinuity sample gets smaller, the number of polynomial terms needed to model $f(x_i)$ should go down. Hopefully, as you zero in on $x_0$ with fewer and fewer controls, the estimated effect of $D_i$ remains stable.[94] A second important check looks at the behavior of pre-treatment variables near the discontinuity. Since pre-treatment variables are unaffected by treatment, there should be no jump in the CEF of these variables at $x_0$.
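Both checks are easy to script. The sketch below re-estimates the jump in progressively narrower discontinuity samples with progressively fewer polynomial terms, and then runs the same regressions on a pre-treatment covariate. The simulated data, the particular schedule of windows and polynomial orders, and the names `rd_jump` and `z` are hypothetical choices for illustration.

```python
import numpy as np

def rd_jump(x, y, cutoff, window, order):
    """Coefficient on D_i from a common-polynomial RD fit within |x - cutoff| <= window."""
    xt = x - cutoff
    keep = np.abs(xt) <= window
    xt, yk = xt[keep], y[keep]
    d = (xt >= 0).astype(float)
    columns = [np.ones_like(xt)] + [xt ** k for k in range(1, order + 1)] + [d]
    coef, *_ = np.linalg.lstsq(np.column_stack(columns), yk, rcond=None)
    return coef[-1]

rng = np.random.default_rng(4)
n, cutoff = 20000, 0.5
x = rng.uniform(0.0, 1.0, size=n)
d = (x >= cutoff).astype(float)
y = 1.0 + 2.0 * x + x ** 2 + 0.4 * d + rng.normal(0.0, 0.1, size=n)
z = 0.5 * x + rng.normal(0.0, 0.1, size=n)   # pre-treatment covariate: no jump by construction

schedule = [(0.5, 3), (0.25, 2), (0.1, 1)]   # (window half-width, polynomial order)
outcome_jumps = [rd_jump(x, y, cutoff, w, p) for w, p in schedule]
covariate_jumps = [rd_jump(x, z, cutoff, w, p) for w, p in schedule]
for (w, p), jy, jz in zip(schedule, outcome_jumps, covariate_jumps):
    print(f"window={w:<5} order={p}  outcome jump={jy:.3f}  covariate jump={jz:.3f}")
```

A stable outcome jump across rows, together with covariate jumps near zero, is the pattern one hopes to see; large movements in either column would be a reason to doubt the specification.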
Lee’s (2008) study of the effect of party incumbency on re-election probabilities illustrates the sharp RD design. Lee is interested in whether the Democratic candidate for a seat in the U.S. House of Representatives has an advantage if his party won the seat last time. The widely noted success of House incumbents raises the question of whether representatives use the privileges and resources of their office to gain advantage for themselves or their parties. This conjecture sounds plausible, but the success of incumbents need not reflect a real electoral advantage. Incumbents (by definition, candidates and parties who have shown they can win) may simply be better at satisfying voters or getting the vote out.
To capture the causal effect of incumbency, Lee looks at the likelihood a Democratic candidate wins as a function of relative vote shares in the previous election. Specifically, he exploits the fact that an election winner is determined by $D_i = 1(x_i \geq 0)$, where $x_i$ is the vote share margin of victory (e.g., the difference between the Democratic and Republican vote shares when these are the two largest parties). Note that, because $D_i$ is a deterministic function of $x_i$, there are no confounding variables other than $x_i$. This is a signal feature of the RD setup.
Figure 6.1.2a, from Lee (2008), shows the sharp RD design in action. This figure plots the probability a Democrat wins against the difference between Democratic and Republican vote shares in the previous election. The dots in the figure are local averages (the average win rate in non-overlapping windows of share margins that are 0.005 wide); the lines in the figure are fitted values from a parametric model with a discontinuity at zero.[95] [96] The probability of a Democratic win is an increasing function of past vote share. The most important feature of the plot is the dramatic jump in win rates at the 0 percent mark, the point where a Democratic candidate gets more votes. Based on the size of the jump, incumbency appears to raise party re-election probabilities by about 40 percentage points.
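Local averages of this kind are simple to compute. The sketch below bins a running variable into non-overlapping windows 0.005 wide and averages a binary win indicator within each bin. The data are simulated with made-up win probabilities, not Lee’s election data, and the function name `local_averages` is hypothetical.

```python
import numpy as np

def local_averages(x, y, width):
    """Average y within non-overlapping bins of x of the given width.

    Bin edges are multiples of width, so no bin straddles a cutoff at zero.
    Returns (bin_centers, bin_means); empty bins get NaN.
    """
    index = np.floor(x / width).astype(int)
    shifted = index - index.min()                 # non-negative labels for bincount
    counts = np.bincount(shifted)
    sums = np.bincount(shifted, weights=y)
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    centers = (np.arange(index.min(), index.max() + 1) + 0.5) * width
    return centers, means

# Made-up data: win probability rises with the past margin and jumps by 0.4 at zero.
rng = np.random.default_rng(5)
n = 20000
margin = rng.uniform(-0.25, 0.25, size=n)
won_before = (margin >= 0).astype(float)
p_win = 0.35 + 0.6 * margin + 0.4 * won_before
wins = (rng.random(n) < p_win).astype(float)

centers, means = local_averages(margin, wins, width=0.005)
left, right = means[centers < 0], means[centers > 0]
print(f"bins: {len(centers)}  jump between bins adjacent to zero: {right[0] - left[-1]:.3f}")
```

Plotting `means` against `centers` gives the dots in a figure of this type; a parametric fit such as (6.1.4) or (6.1.6) would supply the lines.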
Figure 6.1.2b checks the sharp RD identification assumptions by looking at Democratic victories before the last election. Democratic win rates in older elections should be unrelated to the cutoff in the last election, a specification check that works out well and increases our confidence in the RD design in this case. Lee’s investigation of pre-treatment victories is a version of the idea that covariates should be balanced by treatment status in a (quasi-) randomized trial. A related check examines the density of $x_i$ around the discontinuity, looking for bunching in the distribution of $x_i$ near $x_0$. The concern here is that individuals with a stake in $D_i$ might try to manipulate $x_i$ near the cutoff, in which case observations on either side may not be comparable (McCrary, 2008, proposes a formal test for this). Until recently, we would have said this is unlikely in election studies like Lee’s. But the recount in Florida after the 2000 presidential election suggests we probably should worry about manipulable vote shares when U.S. elections are close.
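A crude version of the bunching check simply compares the number of observations just left and just right of the cutoff. The sketch below is not McCrary’s test; it is a back-of-the-envelope count comparison that assumes the density is roughly flat within a narrow window, so that each observation in the window falls on either side with probability one half. The simulated margins, the window width of 0.02, the manipulation rule, and the function name `bunching_z` are all assumptions for illustration.

```python
import numpy as np

def bunching_z(x, cutoff, width):
    """z-statistic comparing counts just right and just left of the cutoff.

    Under a locally flat density the right-minus-left count difference has
    mean zero and variance equal to the total count in the window.
    """
    n_left = np.sum((x >= cutoff - width) & (x < cutoff))
    n_right = np.sum((x >= cutoff) & (x < cutoff + width))
    return (n_right - n_left) / np.sqrt(n_right + n_left)

rng = np.random.default_rng(6)
n = 20000
clean = rng.normal(0.0, 0.25, size=n)            # margins with a smooth density at zero

# Hypothetical manipulation: half of the narrow losers are pushed just over the line.
manipulated = clean.copy()
narrow_loss = (clean > -0.02) & (clean < 0)
flip = narrow_loss & (rng.random(n) < 0.5)
manipulated[flip] = -manipulated[flip]

z_clean = bunching_z(clean, cutoff=0.0, width=0.02)
z_manipulated = bunching_z(manipulated, cutoff=0.0, width=0.02)
print(f"z without manipulation: {z_clean:.2f}   z with manipulation: {z_manipulated:.2f}")
```

A large positive statistic points to an excess of observations just above the cutoff. A formal density test would also account for the slope of the density near the cutoff, which this count comparison ignores.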
Figure 6.1.2: Probability of winning an election by past and future vote share (from Lee, 2008). (a) Candidate’s probability of winning election t + 1, by margin of victory in election t: local averages and parametric fit. (b) Candidate’s accumulated number of past election victories, by margin of victory in election t: local averages and parametric fit. Horizontal axis in both panels: Democratic vote share margin of victory, election t (panel b ticks run from -0.25 to 0.25).