Clustering and Serial Correlation in Panels
8.2.1 Clustering and the Moulton Factor
Bias problems aside, heteroskedasticity rarely leads to dramatic changes in inference. In large samples where bias is not likely to be a problem, we might see standard errors increase by about 25 percent when moving from the conventional to the HC1 estimator. In contrast, clustering can make all the difference.
The clustering problem can be illustrated using a simple bivariate regression estimated in data with a group structure. Suppose we’re interested in the bivariate regression,
$Y_{ig} = \beta_0 + \beta_1 x_g + e_{ig},$ (8.2.1)
where $Y_{ig}$ is the dependent variable for individual $i$ in cluster or group $g$, with $G$ groups. Importantly, the regressor of interest, $x_g$, varies only at the group level. For example, data from the STAR experiment analyzed by Krueger (1999) come in the form of $Y_{ig}$, the test score of student $i$ in class $g$, and class size, $x_g$.
Although students were randomly assigned to classes in the STAR experiment, the data are unlikely to be independent across observations. The test scores of students in the same class tend to be correlated because students in the same class share background characteristics and are exposed to the same teacher and classroom environment. It’s therefore prudent to assume that, for students i and j in the same class, g,
$E[e_{ig} e_{jg}] = \rho \sigma_e^2 > 0,$ (8.2.2)
where $\rho$ is the intra-class correlation coefficient and $\sigma_e^2$ is the residual variance.[121]
Correlation within groups is often modeled using an additive random effects model. Specifically, we assume that the residual, $e_{ig}$, has a group structure:

$e_{ig} = v_g + \eta_{ig},$ (8.2.3)

where $v_g$ is a random component specific to class $g$ and $\eta_{ig}$ is a mean-zero student-level component that's left over. We focus here on the correlation problem, so both of these error components are assumed to be homoskedastic.
When the regressor of interest varies only at the group level, an error structure like (8.2.3) can increase standard errors sharply. This unfortunate fact is not news - Kloek (1981) and Moulton (1986) both made the point - but it seems fair to say that clustering didn’t really become part of the applied econometrics
Zeitgeist until about 15 years ago.
Given the error structure, (8.2.3), the intra-class correlation coefficient becomes

$\rho = \dfrac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2},$

where $\sigma_v^2$ is the variance of $v_g$ and $\sigma_\eta^2$ is the variance of $\eta_{ig}$. A word on terminology: $\rho$ is called the intra-class correlation coefficient even when the groups of interest are not classrooms.
Let $V_c(\hat\beta_1)$ be the conventional OLS variance formula for the regression slope (generated using $\hat\Omega_c$ in the previous section), while $V(\hat\beta_1)$ denotes the correct sampling variance given the error structure, (8.2.3). With regressors fixed at the group level and groups of equal size, $n$, we have

$\dfrac{V(\hat\beta_1)}{V_c(\hat\beta_1)} = 1 + (n - 1)\rho,$ (8.2.4)
a formula derived in the appendix to this chapter. We call the square root of this ratio the Moulton factor, after Moulton's (1986) influential study. Equation (8.2.4) tells us how much we over-estimate precision by ignoring intra-class correlation. Conventional standard errors become increasingly misleading as $n$ and $\rho$ increase. Suppose, for example, that $\rho = 1$. In this case, all the errors within a group are the same, so the $Y_{ig}$'s are the same as well. Making a data set larger by copying a smaller one $n$ times generates no new information. The variance $V_c(\hat\beta_1)$ should therefore be scaled up by a factor of $n$, exactly as (8.2.4) requires. The Moulton factor increases with group size because, with a fixed overall sample size, larger groups mean fewer clusters, in which case there is less independent information in the sample (the data are independent across clusters but not within).[122]
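To see (8.2.4) at work, consider the following simulation sketch (our illustration, not part of the original text; all names are ours). It draws grouped data with the error structure (8.2.3) and a group-level regressor, then compares the sampling variance of the OLS slope across replications with the average conventional variance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
G, n, rho = 200, 10, 0.3                         # groups, group size, intra-class corr.
sig_v, sig_eta = np.sqrt(rho), np.sqrt(1 - rho)  # variance components summing to 1

slopes, conv_vars = [], []
for _ in range(2000):
    x = np.repeat(rng.normal(size=G), n)         # regressor fixed within groups
    e = np.repeat(sig_v * rng.normal(size=G), n) + sig_eta * rng.normal(size=G * n)
    y = 1.0 + 2.0 * x + e                        # model (8.2.1) with beta_1 = 2
    X = np.column_stack([np.ones(G * n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (G * n - 2)             # conventional residual variance
    conv_vars.append(s2 * np.linalg.inv(X.T @ X)[1, 1])
    slopes.append(b[1])

print(np.var(slopes) / np.mean(conv_vars))       # simulated variance ratio
print(1 + (n - 1) * rho)                         # Moulton prediction: 3.7
```

The simulated ratio should settle near $1 + (n - 1)\rho$, confirming that the conventional formula understates sampling variance by the square of the Moulton factor.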
Even small intra-class correlation coefficients can generate a big Moulton factor. In Angrist and Lavy (2007), for example, 4000 students are grouped in 40 schools, so the average $n$ is 100. The regressor of interest is school-level treatment status: all students in treated schools were eligible to receive cash rewards for passing their matriculation exams. The intra-class correlation in this study fluctuates around 0.1. Applying formula (8.2.4), the Moulton factor is over 3 (specifically, $\sqrt{1 + 99 \times 0.1} \approx 3.3$): the standard errors reported by default are only one-third of what they should be.
Equation (8.2.4) covers an important special case, where the regressors are fixed within groups and group size is constant. The general formula allows the regressor, $x_{ig}$, to vary at the individual level and allows for different group sizes, $n_g$. In this case, the Moulton factor is the square root of

$\dfrac{V(\hat\beta_1)}{V_c(\hat\beta_1)} = 1 + \left[\dfrac{V(n_g)}{\bar n} + \bar n - 1\right] \rho_x \rho,$ (8.2.5)

where $\bar n$ is the average group size, and $\rho_x$ is the intra-class correlation of $x_{ig}$:

$\rho_x = \dfrac{\sum_g \sum_{i \neq k} (x_{ig} - \bar x)(x_{kg} - \bar x)}{V(x_{ig}) \sum_g n_g (n_g - 1)}.$ (8.2.6)
Note that $\rho_x$ does not impose a variance-components structure like (8.2.3); here, $\rho_x$ is a generic measure of the correlation of regressors within groups. The general Moulton formula, (8.2.5), tells us that clustering has a bigger impact on standard errors with variable group sizes and when $\rho_x$ is large. The impact vanishes when $\rho_x = 0$. In other words, if the $x_{ig}$'s are uncorrelated within groups, the grouped error structure does not matter for the estimation of standard errors. That's why we worry most about clustering when the regressor of interest is fixed within groups.
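To make (8.2.6) concrete, here is a minimal sketch (ours, not from the text) of how $\rho_x$ might be computed from micro data, where `x` holds the regressor and `groups` labels the clusters:

```python
import numpy as np

def rho_x(x, groups):
    """Intra-class correlation of the regressor, as in (8.2.6)."""
    x, groups = np.asarray(x, dtype=float), np.asarray(groups)
    d = x - x.mean()
    num, pairs = 0.0, 0
    for g in np.unique(groups):
        dg = d[groups == g]
        # sum over i != k within group g of (x_ig - xbar)(x_kg - xbar)
        num += dg.sum() ** 2 - (dg ** 2).sum()
        pairs += len(dg) * (len(dg) - 1)
    return num / (x.var() * pairs)
```

For a regressor that is fixed within groups, this returns 1 with equal group sizes (and approximately 1 otherwise); for a regressor uncorrelated within groups, it hovers around 0.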
We illustrate formula (8.2.5) using the Tennessee STAR example. A regression of kindergartners' percentile scores on class size yields an estimate of -0.62 with a robust (HC1) standard error of 0.09. In this case, $\rho_x = 1$ because class size is fixed within classes, while $V(n_g)$ is positive because classes vary in size (in this case, $V(n_g) = 17.1$). The intra-class correlation coefficient for residuals is 0.31 and the average class size is 19.4, so (8.2.5) implies that the conventional standard errors should be multiplied by a factor of $2.65 = \sqrt{7}$. The corrected standard error is therefore about 0.24.
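Plugging the STAR numbers into (8.2.5) is simple arithmetic; the sketch below (ours) reproduces the correction factor and the adjusted standard error:

```python
import math

rho, rho_x = 0.31, 1.0     # residual and regressor intra-class correlations
v_ng, n_bar = 17.1, 19.4   # variance and mean of class size
se_conv = 0.09             # conventional robust standard error

moulton_sq = 1 + (v_ng / n_bar + n_bar - 1) * rho_x * rho  # formula (8.2.5)
print(math.sqrt(moulton_sq))            # Moulton factor, about 2.65
print(math.sqrt(moulton_sq) * se_conv)  # corrected standard error, about 0.24
```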
The Moulton factor works similarly with 2SLS, except that $\rho_x$ should be computed for the instrumental variable rather than the regressor. In particular, use (8.2.5) replacing $\rho_x$ with $\rho_z$, where $\rho_z$ is the intra-class correlation coefficient of the instrumental variable (Shore-Sheppard, 1996) and $\rho$ is the intra-class correlation of the second-stage residuals. To understand why this works, recall that conventional standard errors for 2SLS are derived from the residual variance of the second-stage equation divided by the variance of the first-stage fitted values. This is the same asymptotic variance formula as for OLS, with the first-stage fitted values playing the role of regressor.[123]
Here are some solutions to the Moulton problem:
1. Parametric: Fix conventional standard errors using (8.2.5). The intra-class correlations $\rho$ and $\rho_x$ are easy to compute and supplied as descriptive statistics in some software packages.[124]
2. Cluster standard errors: Liang and Zeger (1986) generalize the White (1980a) robust covariance matrix
to allow for clustering as well as heteroskedasticity:
$(X'X)^{-1} \left( \sum_g X_g' \hat\Psi_g X_g \right) (X'X)^{-1}, \quad \text{where} \quad \hat\Psi_g = a\, \hat e_g \hat e_g'.$
Here, $X_g$ is the matrix of regressors for group $g$, $\hat e_g$ is the vector of OLS residuals for group $g$, and $a$ is a degrees-of-freedom adjustment factor similar to the one that appears in HC1. The clustered variance estimator is consistent as the number of groups gets large, under any within-group correlation structure and not just the parametric model in (8.2.3). It is not consistent with a fixed number of groups, however, even when the group size tends to infinity. To see why, note that the sums in the formula above are over $g$ and not $i$. Consistency is determined by the law of large numbers, which says that we can rely on sample moments to converge to population moments (Section 3.1.3). But here the sums are at the group level and not over individuals. Clustered standard errors are therefore unlikely to be reliable with few clusters, a point we return to below.
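In practice, clustered standard errors are usually one option away in regression software. The sketch below (our illustration, with synthetic stand-in data; all variable names are ours) uses Python's statsmodels, whose `cov_type="cluster"` option implements the Liang-Zeger estimator:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in for class-level data: classes share an error component v_g
rng = np.random.default_rng(0)
G, n = 50, 20
classid = np.repeat(np.arange(G), n)
classsize = np.repeat(rng.integers(13, 26, size=G), n).astype(float)
score = (50 - 0.5 * classsize
         + np.repeat(rng.normal(0, 5, size=G), n)   # v_g, shared within class
         + rng.normal(0, 10, size=G * n))           # eta_ig, student-level
df = pd.DataFrame({"score": score, "classsize": classsize, "classid": classid})

# point estimates are identical; only the covariance matrix changes
conv = smf.ols("score ~ classsize", data=df).fit()
clus = smf.ols("score ~ classsize", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["classid"]})
print(conv.bse["classsize"], clus.bse["classsize"])  # clustered SE is larger
```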
3. Use group averages instead of micro data: let $\bar Y_g$ be the mean of $Y_{ig}$ in group $g$. Estimate

$\bar Y_g = \beta_0 + \beta_1 x_g + \bar e_g$
by weighted least squares, using the group sizes as weights (see the sketch below). This is equivalent to OLS using the micro data, but the standard errors are asymptotically correct given the group structure, (8.2.3). Again, the asymptotics here are based on the number of groups and not the group size. Importantly, however, because group means are close to Normally distributed even with modest group sizes, we can expect the good finite-sample properties of regression with Normal errors to kick in. The standard errors that come out of grouped estimation are therefore likely to be more reliable than clustered standard errors in samples with few clusters.
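Continuing the synthetic example above, grouped estimation might look like this (a sketch under our assumptions; `df` is the data frame constructed in the previous block):

```python
import statsmodels.api as sm

# collapse to group means; x is fixed within classes, so "first" recovers it
grp = df.groupby("classid").agg(
    ybar=("score", "mean"), x=("classsize", "first"), n=("score", "size"))

# WLS of group means on the group-level regressor, weighted by group size;
# the point estimate matches micro-data OLS when x is fixed within groups
wls = sm.WLS(grp["ybar"], sm.add_constant(grp["x"]), weights=grp["n"]).fit()
print(wls.params, wls.bse)
```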
Grouped-data estimation can be generalized to models with micro covariates using a two-step procedure. Suppose the equation of interest is

$Y_{ig} = \beta_0 + \beta_1 x_g + w_{ig}'\delta + e_{ig},$ (8.2.7)

where $w_{ig}$ is a vector of covariates that varies within groups. In step 1, construct the covariate-adjusted group effects, $\hat\mu_g$, by estimating

$Y_{ig} = \mu_g + w_{ig}'\delta + \eta_{ig}.$
The $\mu_g$, called group effects, are coefficients on a full set of group dummies. The estimated $\hat\mu_g$ are group means adjusted for the effect of the individual-level variables, $w_{ig}$. Note that by virtue of (8.2.7) and (8.2.3), $\mu_g = \beta_0 + \beta_1 x_g + v_g$. In step 2, therefore, we regress the estimated group effects on group-level variables:

$\hat\mu_g = \beta_0 + \beta_1 x_g + \{v_g + (\hat\mu_g - \mu_g)\}.$ (8.2.8)
The efficient GLS estimator for (8.2.8) is weighted least squares, using the reciprocal of the estimated variance of the group-level residual, $\{v_g + (\hat\mu_g - \mu_g)\}$, as weights. This can be a problem because the variance of $v_g$ is not estimated very well with few groups. We might therefore weight by the reciprocal of the variance of the estimated group effects, by group size, or use no weights at all.[125] In an effort to better approximate the relevant finite-sample distribution, Donald and Lang (2007) suggest that inference in grouped procedures be based on a t-distribution with $G - K$ degrees of freedom.
Note that the grouping approach does not work when $x_{ig}$ varies within groups. Averaging $x_{ig}$ to $\bar x_g$ is a version of IV, as we saw in Section 4. So with micro-variation in the regressor of interest, grouping estimates parameters that differ from the target parameters in a model like (8.2.7).
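A sketch of the two-step procedure, again in Python (our illustration; `df` is the synthetic data frame from above, and `w` is a hypothetical within-group covariate added for this example):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

df["w"] = np.random.default_rng(1).normal(size=len(df))  # hypothetical covariate

# Step 1: group effects mu_g are coefficients on a full set of class dummies
step1 = smf.ols("score ~ w + C(classid) - 1", data=df).fit()
mu = step1.params.filter(like="C(classid)")

# Step 2: regress the estimated group effects on the group-level regressor
# (unweighted here; group-size or GLS weights are the other options above)
xg = df.groupby("classid")["classsize"].first()
step2 = sm.OLS(mu.values, sm.add_constant(xg.values)).fit()
print(step2.params, step2.bse)  # compare with a t distribution, G - K d.f.
```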
4. Block bootstrap: In general, bootstrap inference uses the empirical distribution of the data by resampling. But simple random resampling won’t do in this case. The trick with clustered data is to preserve the dependence structure in the target population. We do this by block bootstrapping - that is, drawing blocks of data defined by the groups g. In the Tennessee STAR data, for example, we’d block bootstrap by re-sampling entire classes instead of individual students.
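A minimal block bootstrap sketch (ours), resampling whole classes with replacement from the synthetic `df` above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
ids = df["classid"].unique()
boot_slopes = []
for _ in range(999):
    draw = rng.choice(ids, size=len(ids), replace=True)   # resample classes
    sample = pd.concat([df[df["classid"] == g] for g in draw])
    boot_slopes.append(
        smf.ols("score ~ classsize", data=sample).fit().params["classsize"])
print(np.std(boot_slopes, ddof=1))  # block-bootstrap standard error
```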
5. Estimate a parametric GLS or maximum likelihood model based on a version of (8.2.1). This fixes the clustering problem but also changes the estimand unless the CEF is linear, as detailed in Section 3.4.1. We therefore prefer other approaches.
Table 8.2.1 compares standard-error fix-ups in the STAR example. The table reports six estimates of the standard error: conventional robust standard errors (using HC1); two versions of parametrically corrected standard errors using the Moulton formula (8.2.5), the first using the formula for the intra-class correlation given by Moulton and the second using Stata's estimator from the loneway command; clustered standard errors; block-bootstrapped standard errors; and standard errors from weighted estimation at the group level. The coefficient estimate is -0.62. In this case, all the adjustments deliver similar results, a standard error of about 0.23. This happy outcome is due in large part to the fact that, with 318 classrooms, we have enough clusters for group-level asymptotics to work well. With few clusters, however, things are much dicier, a point we return to at the end of the chapter.