Regression Diagnostics and Specification Tests
Sources of influential observations include: (i) improperly recorded data, (ii) observational errors in the data, (iii) misspecification and (iv) outlying data points that are legitimate and contain valuable information which improve the efficiency of the estimation. It is constructive to isolate extreme points and to determine the extent to which the parameter estimates depend upon these desirable data.
One should always run descriptive statistics on the data, see Chapter 2. This will often reveal outliers, skewness or multimodal distributions. Scatter diagrams should also be examined, but these diagnostics are only the first line of attack and are inadequate in detecting multivariate discrepant observations or the way each observation affects the estimated regression model.
In regression analysis, we emphasize the importance of plotting the residuals against the explanatory variables or the predicted values $\hat{y}$ to identify patterns in these residuals that may indicate nonlinearity, heteroskedasticity, serial correlation, etc., see Chapter 3. In this section, we learn how to identify significantly large residuals and compute regression diagnostics that may identify influential observations. We study the extent to which the deletion of any observation affects the estimated coefficients, the standard errors, predicted values, residuals and test statistics. These represent the core of diagnostic tools in regression analysis.
Accordingly, Belsley, Kuh and Welsch (1980, p. 11) define an influential observation as "... one which, either individually or together with several other observations, has demonstrably larger impact on the calculated values of various estimates (coefficients, standard errors, t-values, etc.) than is the case for most of the other observations."
First, what is a significantly large residual? We have seen that the least squares residuals of $y$ on $X$ are given by $e = (I_n - P_X)u$, see equation (7.7), where $y$ is $n \times 1$ and $X$ is $n \times k$. If $u \sim IID(0, \sigma^2 I_n)$, then $e$ has zero mean and variance $\sigma^2(I_n - P_X)$. Therefore, the OLS residuals are correlated and heteroskedastic with $var(e_i) = \sigma^2(1 - h_{ii})$, where $h_{ii}$ is the i-th diagonal element of the hat matrix $H = P_X$, since $\hat{y} = Hy$.
The diagonal elements $h_{ii}$ have the following properties:

$\sum_{i=1}^{n} h_{ii} = tr(P_X) = k$ and $h_{ii} = \sum_{j=1}^{n} h_{ij}^2 \geq h_{ii}^2 \geq 0$

The last property follows from the fact that $P_X$ is symmetric and idempotent. Therefore, $h_{ii}^2 - h_{ii} \leq 0$ or $h_{ii}(h_{ii} - 1) \leq 0$. Hence, $0 \leq h_{ii} \leq 1$, see problem 1. $h_{ii}$ is called the leverage of the i-th observation. For a simple regression with a constant,

$h_{ii} = \frac{1}{n} + \frac{x_i^2}{\sum_{i=1}^{n} x_i^2}$

where $x_i = X_i - \bar{X}$; $h_{ii}$ can be interpreted as a measure of the distance between the X value of the i-th observation and the mean of the X values over all n observations. A large $h_{ii}$ indicates that the i-th observation is distant from the center of the observations. This means that the i-th observation with large $h_{ii}$ (a function only of the $X_i$ values) exercises substantial leverage in determining the fitted value $\hat{y}_i$. Also, the larger $h_{ii}$, the smaller the variance of the residual $e_i$. Since observations
with high leverage tend to have smaller residuals, it may not be possible to detect them by an examination of the residuals alone. But what is a large leverage? $h_{ii}$ is considered large if it is more than twice the mean leverage value $2\bar{h} = 2k/n$. Hence, observations with $h_{ii} > 2k/n$ are considered outlying with respect to their X values.
An alternative representation of $h_{ii}$ is simply $h_{ii} = d_i' P_X d_i = \|P_X d_i\|^2 = x_i'(X'X)^{-1}x_i$, where $d_i$ denotes the i-th observation's dummy variable, i.e., a vector of dimension n with 1 in the i-th position and 0 elsewhere. $x_i'$ is the i-th row of $X$ and $\|\cdot\|$ denotes the Euclidean length. Note that $d_i'X = x_i'$.
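As an illustration (not part of the original text), the leverages can be computed directly from this last representation. The following minimal Python/numpy sketch assumes $X$ is the $n \times k$ design matrix including the constant column; the function names are illustrative.

import numpy as np

def leverages(X):
    # h_ii = x_i'(X'X)^{-1}x_i, the diagonal of the hat matrix H = P_X
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

def high_leverage_points(X):
    # flag observations whose leverage exceeds twice the mean leverage, 2k/n
    h = leverages(X)
    n, k = X.shape
    return h, np.where(h > 2 * k / n)[0]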
Let us standardize the i-th OLS residual by dividing it by an estimate of its standard deviation. A standardized residual would then be:
$\tilde{e}_i = e_i/(s\sqrt{1 - h_{ii}})$   (8.1)
where $\sigma^2$ is estimated by $s^2$, the MSE of the regression. This is an internal studentization of the residuals, see Cook and Weisberg (1982). Alternatively, one could use an estimate of $\sigma^2$ that is independent of $e_i$. Defining $s_{(i)}^2$ as the MSE from the regression computed without the i-th observation, it can be shown, see equation (8.18) below, that
$s_{(i)}^2 = \frac{(n-k)s^2 - e_i^2/(1 - h_{ii})}{n - k - 1} = s^2\left(\frac{n - k - \tilde{e}_i^2}{n - k - 1}\right)$   (8.2)
Under normality, $s_{(i)}^2$ and $e_i$ are independent and the externally studentized residuals are defined by
$e_i^* = e_i/(s_{(i)}\sqrt{1 - h_{ii}}) \sim t_{n-k-1}$   (8.3)
Thus, if the normality assumption holds, we can readily assess the significance of any single studentized residual. Of course, the $e_i^*$'s will not be independent. Since this is a t-statistic, it is natural to think of $e_i^*$ as large if its value exceeds 2 in absolute value.
Substituting (8.2) into (8.3) and comparing the result with (8.1), it is easy to show that $e_i^*$ is a monotonic transformation of $\tilde{e}_i$:

$e_i^* = \tilde{e}_i\left(\frac{n - k - 1}{n - k - \tilde{e}_i^2}\right)^{1/2}$   (8.4)
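A sketch of how (8.1)-(8.3) can be computed without rerunning the regression n times, using the deletion formula (8.2), is given below in Python/numpy; the function name and array shapes are illustrative assumptions.

import numpy as np

def studentized_residuals(y, X):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                                        # OLS residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)             # leverages h_ii
    s2 = e @ e / (n - k)                                    # MSE of the regression
    e_tilde = e / np.sqrt(s2 * (1 - h))                     # internal, eq. (8.1)
    s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)    # eq. (8.2)
    e_star = e / np.sqrt(s2_i * (1 - h))                    # external, eq. (8.3)
    return e_tilde, e_star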
Cook and Weisberg (1982) show that $e_i^*$ can be obtained as a t-statistic from the following augmented regression:
$y = X\beta^* + d_i\varphi + u$   (8.5)
where $d_i$ is the dummy variable for the i-th observation. In fact, $\hat{\varphi} = e_i/(1 - h_{ii})$ and $e_i^*$ is the t-statistic for testing that $\varphi = 0$ (see problem 4 and the proof given below). Hence, whether the i-th residual is large can be simply determined from the regression (8.5): a dummy variable for the i-th observation is included in the original regression and the t-statistic on this dummy tests whether this i-th residual is large. This is repeated for all observations $i = 1, \ldots, n$.
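The equivalence just described is easy to exploit in practice: add a dummy for observation i and read off its t-statistic. A minimal Python sketch using statsmodels follows; the function name is ours and X is assumed to already contain the constant column.

import numpy as np
import statsmodels.api as sm

def residual_t_stat(y, X, i):
    # regression (8.5): y on X and d_i; the t-statistic on d_i equals e_i*
    n = len(y)
    d = np.zeros(n)
    d[i] = 1.0
    fit = sm.OLS(y, np.column_stack([X, d])).fit()
    return fit.tvalues[-1]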
This can be generalized easily to testing for a group of significantly large residuals:

$y = X\beta^* + D_p\varphi^* + u$

where $D_p$ is an $n \times p$ matrix of dummy variables for the $p$ suspected observations. One can test $\varphi^* = 0$ using the Chow test described in (4.17).
To prove the claim made below (8.5), consider the regression in (8.14), where the i-th observation's dependent and independent variables are replaced by zeros and this observation is moved to the bottom of the data, without loss of generality. The last observation has no effect on the least squares estimate of $\beta^*$ since both the dependent and independent variables are zero. This regression will yield $\hat{\beta}^* = \hat{\beta}_{(i)}$, and the i-th observation's residual is clearly zero. By the Frisch-Waugh-Lovell Theorem given in section 7.3, the least squares estimates and the residuals from (8.14) are numerically identical to those from (8.5). Therefore, $\hat{\beta}^* = \hat{\beta}_{(i)}$ in (8.5) and the i-th observation's residual from (8.5) must be zero. This implies that $\hat{\varphi} = y_i - x_i'\hat{\beta}_{(i)}$, and the fitted values from this regression are given by $\hat{y} = X\hat{\beta}_{(i)} + d_i\hat{\varphi}$, whereas those from the original regression (7.1) are given by $X\hat{\beta}$. The difference in residuals is therefore
$e - e_{(i)} = X\hat{\beta}_{(i)} + d_i\hat{\varphi} - X\hat{\beta}$   (8.15)
Premultiplying (8.15) by $\bar{P}_X = I_n - P_X$ and using the fact that $\bar{P}_X X = 0$, one gets $\bar{P}_X(e - e_{(i)}) = \bar{P}_X d_i\hat{\varphi}$. But $\bar{P}_X e = e$ and $\bar{P}_X e_{(i)} = e_{(i)}$, hence $\bar{P}_X d_i\hat{\varphi} = e - e_{(i)}$. Premultiplying both sides by $d_i'$, one gets $d_i'\bar{P}_X d_i\hat{\varphi} = e_i$ since the i-th residual of $e_{(i)}$ from (8.5) is zero. By definition, $d_i'\bar{P}_X d_i = 1 - h_{ii}$, therefore
$\hat{\varphi} = e_i/(1 - h_{ii})$   (8.16)
Premultiplying (8.15) by $(X'X)^{-1}X'$, one gets $0 = \hat{\beta}_{(i)} - \hat{\beta} + (X'X)^{-1}x_i\hat{\varphi}$. This uses the fact that both residuals are orthogonal to $X$. Rearranging terms and substituting $\hat{\varphi}$ from (8.16), one gets

$\hat{\beta} - \hat{\beta}_{(i)} = (X'X)^{-1}x_i\hat{\varphi} = (X'X)^{-1}x_i e_i/(1 - h_{ii})$

as given in (8.13).
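The identity (8.13) can be checked numerically: delete observation i, re-estimate, and compare with the closed-form expression. The following Python/numpy sketch (function name ours) does so.

import numpy as np

def check_deletion_formula(y, X, i):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h_i = X[i] @ XtX_inv @ X[i]
    formula = XtX_inv @ X[i] * e[i] / (1 - h_i)               # eq. (8.13)
    keep = np.arange(len(y)) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # drop obs. i and re-fit
    return beta - beta_i, formula                              # the two should agree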
Note that $s_{(i)}^2$ given in (8.2) can now be written in terms of $\hat{\beta}_{(i)}$:

$s_{(i)}^2 = \sum_{t \neq i}(y_t - x_t'\hat{\beta}_{(i)})^2/(n - k - 1)$   (8.17)

Upon substituting (8.13) in (8.17) we get

$(n - k - 1)s_{(i)}^2 = (n - k)s^2 - e_i^2/(1 - h_{ii})$   (8.18)

which is (8.2). This uses the facts that $He = 0$ and $H^2 = H$. Hence, $\sum_{t=1}^{n} h_{it}e_t = 0$ and $\sum_{t=1}^{n} h_{it}^2 = h_{ii}$.
To assess whether the change in $\hat{\beta}_j$ (the j-th component of $\hat{\beta}$) that results from the deletion of the i-th observation is large or small, we scale by the variance of $\hat{\beta}_j$, i.e., $\sigma^2(X'X)^{-1}_{jj}$. This is denoted by

$DFBETAS_{ij} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X'X)^{-1}_{jj}}}$   (8.19)
Note that $s_{(i)}$ is used in order to make the denominator stochastically independent of the numerator in the Gaussian case. Absolute values of DFBETAS larger than 2 are considered influential. However, Belsley, Kuh, and Welsch (1980) suggest $2/\sqrt{n}$ as a size-adjusted cutoff. In fact, it would be most unusual for the removal of a single observation from a sample of 100 or more to result in a change in any estimate by two or more standard errors. The size-adjusted cutoff tends to expose approximately the same proportion of potentially influential observations, regardless of sample size. The size-adjusted cutoff is particularly important for large data sets.
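A vectorized computation of (8.19) for all observations at once, using (8.2) and (8.13) so that the regression is estimated only once, might look as follows; this is a Python/numpy sketch and the function name is ours.

import numpy as np

def dfbetas(y, X):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = e @ e / (n - k)
    s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)       # eq. (8.2)
    delta = (XtX_inv @ X.T) * (e / (1 - h))                    # k x n matrix of beta changes, eq. (8.13)
    scale = np.sqrt(s2_i) * np.sqrt(np.diag(XtX_inv))[:, None]
    dfb = (delta / scale).T                                    # n x k, eq. (8.19)
    return dfb, np.abs(dfb) > 2 / np.sqrt(n)                   # size-adjusted cutoff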
In case of Normality, it can also be useful to look at the change in the t-statistics as a means of assessing the sensitivity of the regression output to the deletion of the i-th observation:

$\frac{\hat{\beta}_j}{s\sqrt{(X'X)^{-1}_{jj}}} - \frac{\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X_{(i)}'X_{(i)})^{-1}_{jj}}}$   (8.20)
Another way to summarize coefficient changes and gain insight into forecasting effects when the i-th observation is deleted is to look at the change in fit, defined as

$DFFIT_i = \hat{y}_i - \hat{y}_{i(i)} = x_i'[\hat{\beta} - \hat{\beta}_{(i)}] = h_{ii}e_i/(1 - h_{ii})$   (8.21)
where the last equality is obtained from (8.13).
We scale this measure by the estimated standard deviation of $\hat{y}_i$, i.e., $\sigma\sqrt{h_{ii}}$, giving

$DFFITS_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}\frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}e_i^*$   (8.22)

where $\sigma$ has been estimated by $s_{(i)}$ and $e_i^*$ denotes the externally studentized residual given in (8.3). Values of DFFITS larger than 2 in absolute value are considered influential. A size-adjusted cutoff for DFFITS suggested by Belsley, Kuh and Welsch (1980) is $2\sqrt{k/n}$.
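In the same spirit, (8.22) can be computed for all observations from a single regression; the sketch below (Python/numpy, function name ours) also flags the size-adjusted cutoff.

import numpy as np

def dffits(y, X):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = e @ e / (n - k)
    s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)   # eq. (8.2)
    e_star = e / np.sqrt(s2_i * (1 - h))                   # eq. (8.3)
    d = np.sqrt(h / (1 - h)) * e_star                      # eq. (8.22)
    return d, np.abs(d) > 2 * np.sqrt(k / n)               # size-adjusted cutoff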
In (8.3), the studentized residual $e_i^*$ was interpreted as a t-statistic that tests for the significance of the coefficient $\varphi$ of $d_i$, the dummy variable which takes the value 1 for the i-th observation and 0 otherwise, in the regression of $y$ on $X$ and $d_i$. This can now be easily proved as follows:
Consider the Chow test for the significance of $\varphi$. The $RRSS = (n - k)s^2$, the $URSS = (n - k - 1)s_{(i)}^2$, and the Chow F-test described in (4.17) becomes

$F = \frac{(n - k)s^2 - (n - k - 1)s_{(i)}^2}{s_{(i)}^2} = \frac{e_i^2/(1 - h_{ii})}{s_{(i)}^2} = e_i^{*2} \sim F(1, n - k - 1)$   (8.23)

The square root of (8.23) is $e_i^* \sim t_{n-k-1}$. These studentized residuals provide a better way to examine the information in the residuals, but they do not tell the whole story, since some of the most influential data points can have small $e_i^*$ (and very small $e_i$).
One overall measure of the impact of the i-th observation on the estimated regression coefficients is Cook's (1977) distance measure $D_i^2$. Recall that the confidence region for all k regression coefficients is $(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)/ks^2 \sim F(k, n - k)$. Cook's (1977) distance measure $D_i^2$ uses the same structure for measuring the combined impact of the differences in the estimated regression coefficients when the i-th observation is deleted:
$D_i^2(s) = (\hat{\beta} - \hat{\beta}_{(i)})'X'X(\hat{\beta} - \hat{\beta}_{(i)})/ks^2$   (8.24)
Even though $D_i^2(s)$ does not follow the above F-distribution, Cook suggests computing the percentile value from this F-distribution and declaring the observation influential if this percentile value exceeds 50%. In this case, the distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$ will be large, implying that the i-th observation has a substantial influence on the fit of the regression. Cook's distance measure can be equivalently computed as:

$D_i^2(s) = \frac{\tilde{e}_i^2}{k}\cdot\frac{h_{ii}}{1 - h_{ii}} = \frac{e_i^2}{ks^2}\cdot\frac{h_{ii}}{(1 - h_{ii})^2}$   (8.25)
$D_i^2(s)$ depends on $e_i$ and $h_{ii}$; the larger $e_i$ or $h_{ii}$, the larger is $D_i^2(s)$. Note the relationship between Cook's $D_i^2(s)$ and the Belsley, Kuh, and Welsch (1980) $DFFITS_i(\sigma)$ in (8.22), i.e.,

$DFFITS_i(\sigma) = \sqrt{k}\,D_i(\sigma) = (\hat{y}_i - x_i'\hat{\beta}_{(i)})/(\sigma\sqrt{h_{ii}})$
Belsley, Kuh, and Welsch (1980) suggest nominating DFFITS based on $s_{(i)}$ exceeding $2\sqrt{k/n}$ for special attention. Cook's 50th percentile recommendation is equivalent to $DFFITS > \sqrt{k}$, which is more conservative, see Velleman and Welsch (1981).
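Cook's distance (8.25) and the percentile comparison just described can be coded as follows; this is a Python sketch (function name ours), with the percentile taken from the F(k, n-k) distribution.

import numpy as np
from scipy.stats import f

def cooks_distance(y, X):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = e @ e / (n - k)
    D2 = (e**2 / (k * s2)) * h / (1 - h)**2     # eq. (8.25)
    pct = f.cdf(D2, k, n - k)                   # declare influential if this exceeds 0.50
    return D2, pct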
Next, we study the influence of the i-th observation's deletion on the covariance matrix of the regression coefficients. One can compare the two covariance matrices using the ratio of their determinants:

$COVRATIO_i = \frac{\det[s_{(i)}^2(X_{(i)}'X_{(i)})^{-1}]}{\det[s^2(X'X)^{-1}]}$
Using the fact that $\det[X_{(i)}'X_{(i)}] = (1 - h_{ii})\det[X'X]$, see problem 8, one obtains

$COVRATIO_i = \left(\frac{s_{(i)}^2}{s^2}\right)^k\frac{1}{1 - h_{ii}} = \frac{1}{\left[\frac{n - k - 1 + e_i^{*2}}{n - k}\right]^k(1 - h_{ii})}$   (8.28)

where the last equality follows from (8.18) and the definition of $e_i^*$ in (8.3). Values of COVRATIO not near unity identify possible influential observations and warrant further investigation. Belsley, Kuh and Welsch (1980) suggest investigating points with $|COVRATIO_i - 1|$ near to or larger than $3k/n$. The COVRATIO depends upon both $h_{ii}$ and $e_i^{*2}$. In fact, from (8.28), COVRATIO is large when $h_{ii}$ is large and small when $e_i^{*2}$ is large. The two factors can offset each other, which is why it is important to look at $h_{ii}$ and $e_i^*$ separately as well as in combination as in COVRATIO.
Finally, one can look at how the variance of $\hat{y}_i$ changes when an observation is deleted:

$var(\hat{y}_i) = s^2 h_{ii}$ and $var(\hat{y}_{i(i)}) = var(x_i'\hat{\beta}_{(i)}) = s_{(i)}^2\,h_{ii}/(1 - h_{ii})$

and the ratio is

$FVARATIO_i = \frac{s_{(i)}^2}{s^2(1 - h_{ii})}$   (8.29)
This expression is similar to COVRATIO except that $[s_{(i)}^2/s^2]$ is not raised to the k-th power. As a diagnostic measure it will exhibit the same patterns of behavior with respect to different configurations of $h_{ii}$ and the studentized residual as described for COVRATIO.
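Both variance-based diagnostics, (8.28) and (8.29), follow from the same deletion quantities; a combined Python/numpy sketch (function name ours) is given below.

import numpy as np

def variance_ratios(y, X):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = e @ e / (n - k)
    s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)   # eq. (8.2)
    covratio = (s2_i / s2)**k / (1 - h)                    # eq. (8.28)
    fvaratio = s2_i / (s2 * (1 - h))                       # eq. (8.29)
    flag = np.abs(covratio - 1) >= 3 * k / n               # BKW screening rule
    return covratio, fvaratio, flag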
Table 8.1 Cigarette Regression

Dependent Variable: LNC

Analysis of Variance

Source     DF    Sum of Squares    Mean Square    F Value    Prob>F
Model       2           0.50098        0.25049      9.378    0.0004
Error      43           1.14854        0.02671
C Total    45           1.64953

Root MSE      0.16343     R-square    0.3037
Dep Mean      4.84784     Adj R-sq    0.2713
C.V.          3.37125

Parameter Estimates

                  Parameter      Standard     T for H0:
Variable    DF     Estimate         Error     Parameter=0    Prob > |T|
INTERCEP     1     4.299662    0.90892571           4.730        0.0001
LNP          1    -1.338335    0.32460147          -4.123        0.0002
LNY          1     0.172386    0.19675440           0.876        0.3858
Example 1: For the cigarette data given in Table 3.2, Table 8.1 gives the SAS least squares regression for logC on logP and logY.
$\log C = 4.30 - 1.34\,\log P + 0.172\,\log Y + \text{residuals}$
         (0.909)  (0.325)      (0.197)
The standard error of the regression is $s = 0.16343$ and $\bar{R}^2 = 0.271$. Table 8.2 gives the data along with the predicted values of logC, the least squares residuals $e$, the internally studentized residuals $\tilde{e}$ given in (8.1), the externally studentized residuals $e^*$ given in (8.3), the Cook statistic given in (8.25), the leverage of each observation $h_{ii}$, the DFFITS given in (8.22) and the COVRATIO given in (8.28).
Using the leverage column, one can identify four observations with high leverage, i.e., greater than $2\bar{h} = 2k/n = 6/46 = 0.13043$. These are the observations belonging to the following states: Connecticut (CT), Kentucky (KY), New Hampshire (NH) and New Jersey (NJ), with leverages 0.13535, 0.19775, 0.13081 and 0.13945, respectively. Note that the corresponding OLS residuals are $-0.078$, 0.234, 0.160 and $-0.059$, which are not necessarily large. The internally studentized residuals are computed using equation (8.1). For KY this gives

$\tilde{e}_{KY} = \frac{e_{KY}}{s\sqrt{1 - h_{KY}}} = \frac{0.23428}{0.16343\sqrt{1 - 0.19775}} = 1.600$
From Table 8.2, two observations with high internally studentized residuals are those belonging to Arkansas (AR) and Utah (UT), with values of 2.102 and $-2.679$ respectively, both larger than 2 in absolute value.
The externally studentized residuals are computed from (8.3). For KY, we first compute $s_{(KY)}^2$, the MSE from the regression computed without the KY observation. From (8.2), this is

$s_{(KY)}^2 = \frac{(n - k)s^2 - e_{KY}^2/(1 - h_{KY})}{n - k - 1} = \frac{(46 - 3)(0.16343)^2 - (0.23428)^2/(1 - 0.19775)}{46 - 3 - 1} = 0.02572$
From (8.3) we get

$e_{KY}^* = \frac{e_{KY}}{s_{(KY)}\sqrt{1 - h_{KY}}} = \frac{0.23428}{0.16036\sqrt{1 - 0.19775}} = 1.631$

This externally studentized residual is distributed as a t-statistic with 42 degrees of freedom. However, $e_{KY}^*$ does not exceed 2 in absolute value. Again, $e_{AR}^*$ and $e_{UT}^*$ are 2.193 and $-2.901$, both larger than 2 in absolute value. From (8.13), the change in the regression coefficients due to the omission of the KY observation is given by
$\hat{\beta} - \hat{\beta}_{(KY)} = (X'X)^{-1}x_{KY}e_{KY}/(1 - h_{KY})$
Using $(X'X)^{-1}$ and $x_{KY}' = (1, -0.03260, 4.64937)$ with $e_{KY} = 0.23428$ and $h_{KY} = 0.19775$, one gets
$(\hat{\beta} - \hat{\beta}_{(KY)})' = (-0.082249, -0.230954, 0.028492)$
In order to assess whether this change is large or small, we compute the DFBETAS given in (8.19). For the KY observation, the first of these is

$DFBETAS_{KY,1} = \frac{\hat{\beta}_1 - \hat{\beta}_{1(KY)}}{s_{(KY)}\sqrt{(X'X)^{-1}_{11}}} = \frac{-0.082249}{(0.16036)(0.90893/0.16343)} = -0.0922$

where $s\sqrt{(X'X)^{-1}_{11}} = 0.90893$ is the standard error of the intercept reported in Table 8.1.
Similarly, $DFBETAS_{KY,2} = -0.7251$ and $DFBETAS_{KY,3} = 0.14758$. These are not larger than 2 in absolute value. However, $DFBETAS_{KY,2}$ is larger than $2/\sqrt{n} = 2/\sqrt{46} = 0.2949$ in absolute value. This is the size-adjusted cutoff recommended by Belsley, Kuh and Welsch (1980) for large n.
The change in the fit due to the omission of the KY observation is given by (8.21). In fact,

$\hat{y}_{KY} - \hat{y}_{KY(KY)} = x_{KY}'[\hat{\beta} - \hat{\beta}_{(KY)}] = h_{KY}e_{KY}/(1 - h_{KY})$

or simply

$DFFIT_{KY} = (0.19775)(0.23428)/(1 - 0.19775) = 0.0577$

Scaling it by the estimated standard deviation of $\hat{y}_{KY}$, we get from (8.22)

$DFFITS_{KY} = \left(\frac{h_{KY}}{1 - h_{KY}}\right)^{1/2}e_{KY}^* = \left(\frac{0.19775}{0.80225}\right)^{1/2}(1.631) = 0.810$
This is not larger than 2 in absolute value, but it is larger than the size-adjusted cutoff of $2\sqrt{k/n} = 2\sqrt{3/46} = 0.511$. Note also that both $DFFITS_{AR} = 0.667$ and $DFFITS_{UT} = -0.888$ are larger than 0.511 in absolute value.
Cook's distance measure is given in (8.25) and for KY can be computed as

$D_{KY}^2(s) = \frac{e_{KY}^2}{ks^2}\cdot\frac{h_{KY}}{(1 - h_{KY})^2} = \frac{(0.23428)^2}{3(0.16343)^2}\cdot\frac{0.19775}{(0.80225)^2} = 0.2105$
The other two large Cook's distance measures are $D_{AR}^2(s) = 0.13623$ and $D_{UT}^2(s) = 0.22399$, respectively. COVRATIO omitting the KY observation can be computed from (8.28) as

$COVRATIO_{KY} = \frac{1}{\left[\frac{n - k - 1 + e_{KY}^{*2}}{n - k}\right]^k(1 - h_{KY})} = \frac{1}{\left[\frac{42 + (1.631)^2}{43}\right]^3(0.80225)} = 1.1125$
which means that $|COVRATIO_{KY} - 1| = 0.1125$ is less than $3k/n = 9/46 = 0.1956$. Finally, FVARATIO omitting the KY observation can be computed from (8.29) as

$FVARATIO_{KY} = \frac{s_{(KY)}^2}{s^2(1 - h_{KY})} = \frac{0.02572}{(0.16343)^2(0.80225)} = 1.200$
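As a check on the arithmetic above, the short Python snippet below recomputes the KY diagnostics from the quantities quoted in the text ($n = 46$, $k = 3$, $s = 0.16343$, $e_{KY} = 0.23428$, $h_{KY} = 0.19775$); the variable names are ours.

import numpy as np

n, k, s, e_ky, h_ky = 46, 3, 0.16343, 0.23428, 0.19775

s2_ky = ((n - k) * s**2 - e_ky**2 / (1 - h_ky)) / (n - k - 1)   # eq. (8.2)
e_star = e_ky / np.sqrt(s2_ky * (1 - h_ky))                     # eq. (8.3)
dffit = h_ky * e_ky / (1 - h_ky)                                # eq. (8.21)
dffits = np.sqrt(h_ky / (1 - h_ky)) * e_star                    # eq. (8.22)
cook = (e_ky**2 / (k * s**2)) * h_ky / (1 - h_ky)**2            # eq. (8.25)
covratio = (s2_ky / s**2)**k / (1 - h_ky)                       # eq. (8.28)
fvaratio = s2_ky / (s**2 * (1 - h_ky))                          # eq. (8.29)
print(s2_ky, e_star, dffit, dffits, cook, covratio, fvaratio)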
By several diagnostic measures, AR, KY and UT are influential observations that deserve special attention. The first two states are characterized by large sales of cigarettes. KY is a producer state with a very low price on cigarettes, while UT is a low consumption state due to its high percentage of Mormon population (a religion that forbids smoking). Table 8.3 gives the predicted consumption along with the 95% confidence band, the OLS residuals, the internally studentized residuals, Cook's D-statistic and a plot of these residuals. This last plot highlights the fact that AR, UT and KY have large studentized residuals.