Fewer than 42 clusters
Bias from few clusters is a risk in both the Moulton and the serial correlation contexts because in both cases inference is cluster-based. With few clusters, we tend to underestimate either the serial correlation in a random shock like $v_{st}$ or the intra-class correlation, $\rho$, in the Moulton problem. The relevant dimension for counting clusters in the Moulton problem is the number of groups, G. In a differences-in-differences scenario where you’d like to cluster on state (or some other cross-sectional dimension), the relevant dimension for counting clusters is the number of states or cross-sectional groups. Therefore, following Douglas Adams’s dictum that the ultimate answer to life, the universe, and everything is 42, we believe the question is: How many clusters are enough for reliable inference using a standard cluster adjustment derived from (8.2.6)?
1. Bias correction of clustered standard errors. Clustered standard errors are biased in small samples because $E(\hat{e}_g \hat{e}_g') \neq E(e_g e_g') = \Phi_g$, just as in Section 8.1. Usually, $E(\hat{e}_g \hat{e}_g')$ is too small. One solution is to inflate residuals in the hopes of reducing bias. Bell and McCaffrey (2002) suggest a procedure (called bias-reduced linearization or BRL) that adjusts residuals by

$$\tilde{\Phi}_g = a \tilde{e}_g \tilde{e}_g'$$
$$\tilde{e}_g = A_g \hat{e}_g,$$

where $A_g$ solves

$$A_g' A_g = (I - H_g)^{-1}$$

and

$$H_g = X_g (X'X)^{-1} X_g'.$$
This is a version of HC2 for the clustered case. BRL works for the straight-up Moulton problem with few clusters but for technical reasons cannot be used for the typical differences-in-differences serial correlation problem.[127] (A code sketch of the BRL adjustment appears after this list.)
2. Recognizing that the fundamental unit of observation is a cluster and not an individual unit within clusters, Bell and McCaffrey (2002) and Donald and Lang (2007) suggest that inference be based on a t-distribution with G − K degrees of freedom rather than on the standard Normal distribution. For small G, this makes a big difference: confidence intervals will be much wider, thereby avoiding some mistakes. Cameron, Gelbach, and Miller (2008) report Monte Carlo examples where the combination of a BRL adjustment and use of t-tables works well. (The critical-value comparison is sketched in code after this list.)
3. Donald and Lang (2007) argue that estimation using group means works well with small G in the Moulton problem, and even better when inference is based on a t-distribution with G − K degrees of freedom. But, as we discussed in the previous section, the regressor must be fixed within groups. The level of aggregation is the level at which you’d like to cluster, e.g., schools in Angrist and Lavy (2007). For serial correlation, this is the state, but state averages cannot be used to estimate a model with a full set of state effects. Also, since treatment status varies within states, averaging up to the state level averages the regressor of interest as well, changing the rules of the game in a way we may not like (the estimator becomes instrumental variables using group dummies as instruments). The group means approach is therefore out of bounds for the serial correlation problem.[128] Note also that if the grouped residuals are heteroskedastic, and you therefore use robust standard errors, you must worry about bias of the form discussed in Section 8.1. If both the random effect and the underlying micro residual are homoskedastic, you can fix heteroskedasticity in the group means by weighting by the group size. But weighting changes the estimand when the CEF is nonlinear, so this is not open-and-shut (Angrist and Lavy, 1999, chose not to weight school-level averages because the variation in their study comes mostly from small schools). Weighted or not, the safest course when working with group-level averages is to use our rule of thumb from Section 8.1: take the maximum of robust and conventional standard errors as your best measure of precision. (A group-means sketch incorporating this rule appears after the list.)
4. Cameron, Gelbach, and Miller (2008) report that some forms of a block bootstrap work well with small numbers of groups, and that the block bootstrap typically outperforms Stata-clustered standard errors without the bias correction. This appears to be true for both the Moulton and the serial correlation problems. But Cameron, Gelbach, and Miller (2008) focus on rejection rates using (pivotal) test statistics, while we like to see standard errors. (A bare-bones cluster bootstrap is sketched after the list.)
5. Parametric corrections: For the Moulton problem, this amounts to use of the Moulton factor. With serial correlation, this means correcting your standard errors for first-order serial correlation at the group level. Based on our sampling experiments with the Moulton problem and a reading of the literature, parametric approaches may work well, and better than the nonparametric estimator (8.2.6), especially if the parametric model is not too far off (see, e.g., Hansen, 2007a, which also proposes a bias correction for estimates of serial correlation parameters). Unfortunately, however, beyond the greenhouse world of controlled Monte Carlo studies, we’re unlikely to know whether parametric assumptions are a good fit. (The simplest Moulton-factor correction is sketched after this list.)
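To make point 1 concrete, here is a minimal sketch (ours, in Python; not from Bell and McCaffrey, 2002) of a BRL-style adjustment to clustered standard errors. The function and variable names (brl_clustered_se, y, X, cluster_ids) are illustrative, we take $A_g$ to be the symmetric square root of $(I - H_g)^{-1}$, and the degrees-of-freedom constant $a$ from (8.2.6) is omitted for simplicity.

```python
import numpy as np

def brl_clustered_se(y, X, cluster_ids):
    """Cluster-robust SEs with BRL-style (HC2-for-clusters) residual inflation."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        idx = cluster_ids == g
        Xg, eg = X[idx], resid[idx]
        Hg = Xg @ XtX_inv @ Xg.T                 # cluster block of the hat matrix
        # A_g is the symmetric square root of (I - H_g)^{-1}; note that
        # (I - H_g) can be (near-)singular when the model includes a full set
        # of group dummies -- the technical reason BRL breaks down in the
        # differences-in-differences serial correlation case.
        vals, vecs = np.linalg.eigh(np.eye(idx.sum()) - Hg)
        Ag = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
        eg_tilde = Ag @ eg                       # inflated residuals
        meat += Xg.T @ np.outer(eg_tilde, eg_tilde) @ Xg

    V = XtX_inv @ meat @ XtX_inv                 # sandwich variance matrix
    return beta, np.sqrt(np.diag(V))
```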
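Point 2 is easy to illustrate: replacing Normal critical values with those from a t-distribution with G − K degrees of freedom widens confidence intervals substantially when G is small. A throwaway comparison (assuming K = 2 for illustration):

```python
from scipy import stats

K = 2  # illustrative: an intercept plus one regressor
for G in (10, 20, 42, 51):
    t_crit = stats.t.ppf(0.975, df=G - K)   # t critical value with G - K df
    z_crit = stats.norm.ppf(0.975)          # standard Normal critical value
    print(f"G = {G:2d}: 95% critical value t(G-K) = {t_crit:.2f} vs Normal = {z_crit:.2f}")
```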
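For point 3, here is a minimal sketch (ours, not Donald and Lang's code) of the group-means route combined with the Section 8.1 rule of thumb of taking the larger of conventional and robust standard errors. The column names (y, treat, group) are hypothetical, and the regressor is assumed fixed within groups so that averaging does not change the estimand.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def group_means_inference(df, y_col="y", treat_col="treat", group_col="group"):
    """OLS on G group averages, with t(G - K) critical values."""
    means = df.groupby(group_col)[[y_col, treat_col]].mean()   # G observations
    X = sm.add_constant(means[treat_col])
    conventional = sm.OLS(means[y_col], X).fit()               # homoskedastic SEs
    robust = sm.OLS(means[y_col], X).fit(cov_type="HC2")       # robust SEs
    se = np.maximum(conventional.bse, robust.bse)              # rule-of-thumb SE
    G, K = X.shape
    half_width = stats.t.ppf(0.975, df=G - K) * se[treat_col]  # 95% CI half-width
    b = conventional.params[treat_col]
    return b, se[treat_col], (b - half_width, b + half_width)
```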
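A bare-bones version of the block (cluster) bootstrap mentioned in point 4 resamples whole clusters with replacement and reports the spread of the resampled coefficients. This sketch is ours and purely illustrative; the procedures Cameron, Gelbach, and Miller (2008) favor, such as bootstrap-t schemes based on pivotal statistics, are more elaborate.

```python
import numpy as np

def cluster_bootstrap_se(y, X, cluster_ids, n_boot=999, seed=0):
    """Pairs block bootstrap: resample whole clusters, refit, report the SD of betas."""
    rng = np.random.default_rng(seed)
    groups = np.unique(cluster_ids)
    ols = lambda yy, XX: np.linalg.lstsq(XX, yy, rcond=None)[0]

    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(groups, size=len(groups), replace=True)  # draw clusters, not rows
        idx = np.concatenate([np.flatnonzero(cluster_ids == g) for g in sampled])
        draws.append(ols(y[idx], X[idx]))
    return ols(y, X), np.std(np.asarray(draws), axis=0, ddof=1)
```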
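Finally, the simplest parametric fix from point 5 for the Moulton case: inflate conventional standard errors by the Moulton factor. Our sketch covers only the textbook setting of equal group sizes and a regressor fixed within groups, where the factor reduces to $\sqrt{1 + (n - 1)\rho}$.

```python
import numpy as np

def moulton_adjusted_se(conventional_se, group_size, rho):
    """Inflate a conventional SE by the Moulton factor.

    group_size: common number of observations per cluster (n)
    rho: intra-class correlation of the residual
    """
    return conventional_se * np.sqrt(1 + (group_size - 1) * rho)

# For example, with 50 observations per group and rho = 0.1, conventional
# standard errors understate uncertainty by a factor of about sqrt(5.9), or 2.4.
```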
Alas, the bottom line here is not entirely clear, nor is the more basic question of when few clusters are fatal for inference. The severity of the resulting bias seems to depend on the nature of your problem, in particular whether you confront straight-up Moulton or serial correlation issues. Aggregation to the group level as in Donald and Lang (2007) seems to work well in the Moulton case as long as the regressor of interest is fixed within groups and there is not too much underlying heteroskedasticity. At a minimum, you’d like to show that your conclusions are consistent with the inferences that arise from an analysis of group averages, since this is a conservative and transparent approach. Angrist and Lavy (2007) go with BRL standard errors to adjust for clustering at the school level but validate these by showing that key results come out the same using covariate-adjusted group averages.
As far as serial correlation goes, most of the evidence suggests that when you are lucky enough to do research on US states, giving 51 clusters, you are on reasonably safe ground with a naive application of Stata’s cluster command at the state level. But you might have to study Canada, which offers only 10 clusters in the form of provinces, well below 42. Hansen (2007b) finds that Liang and Zeger (1986) (i.e., Stata-clustered) standard errors are reasonably good at correcting for serial correlation in panels, even in the Canadian scenario. Hansen also recommends use of a t-distribution with G − K degrees of freedom for critical values.
Clustering problems have forced applied microeconometricians to eat a little humble pie. Proud of working with large micro data sets, we like to sneer at macroeconomists toying with small time series samples. But he who laughs last laughs best: if the regressor of interest varies only at a coarse group level - such as over time or across states or countries - then it’s the macroeconomists who have had the most realistic mode of inference all along.