Mostly Harmless Econometrics: An Empiricist’s Companion
Why is Regression Called Regression and What Does Regression-to-the-mean Mean?
The term regression originates with Francis Galton’s (1886) study of height. Galton, who worked with samples of roughly-normally-distributed data on parents and children, noted that the CEF of a child’s height given his parents’ height is linear, with parameters given by the bivariate regression slope and intercept. Since height is stationary (its distribution is not changing [much] over time), the bivariate regression slope is also the correlation coefficient, i.e., between zero and one.
The single regressor in Galton’s set-up, $X_i$, is average parent height and the dependent variable, $Y_i$, is the height of adult children. The regression slope coefficient, as always, is $\beta_1 = \frac{\operatorname{Cov}(Y_i, X_i)}{V(X_i)}$, and the intercept is $\alpha = E[Y_i] - \beta_1 E[X_i]$. But because height is not changing across generations, the mean and variance of $Y_i$ and $X_i$ are the same. Therefore,
$$\beta_1 = \frac{\operatorname{Cov}(Y_i, X_i)}{V(X_i)} = \frac{\operatorname{Cov}(Y_i, X_i)}{\sqrt{V(X_i)}\sqrt{V(Y_i)}} = \rho_{xy}$$
$$\alpha = E[Y_i] - \beta_1 E[X_i] = \mu(1 - \beta_1) = \mu(1 - \rho_{xy})$$
where $\rho_{xy}$ is the intergenerational correlation coefficient in height and $\mu = E[Y_i] = E[X_i]$ is population average height. From this we get the linear CEF
$$E[Y_i \mid X_i] = \mu(1 - \rho_{xy}) + \rho_{xy} X_i,$$
so the expected height of a child given his parents’ height is a weighted average of his parents’ height and the population average height. The child of tall parents will therefore not be as tall as they are, on average, and likewise the child of short parents will not be as short. To be specific, Pischke, who is 6’3", can expect his children to be tall, though not as tall as he is. Thankfully, however, Angrist, who is 5’6", can expect his children to be taller than he is. Galton called this property "regression toward mediocrity in hereditary stature." Today, we call this "regression to the mean."
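The argument above can be checked with a short simulation. The sketch below draws parent and child heights from a bivariate normal distribution with equal means and variances, then compares the regression slope with the correlation coefficient. The population mean (69 inches), standard deviation (3 inches), and correlation (0.5) are illustrative assumptions, not values taken from Galton's data; the two parent heights, 75 and 66 inches, correspond to 6’3" and 5’6".

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the text): heights in inches,
# population mean 69, standard deviation 3, intergenerational correlation 0.5.
mu, sigma, rho = 69.0, 3.0, 0.5
n = 200_000

# Draw (parent, child) pairs with equal means and variances (stationarity).
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
parent, child = rng.multivariate_normal([mu, mu], cov, size=n).T

# Bivariate regression slope and intercept computed from sample moments.
slope = np.cov(child, parent)[0, 1] / np.var(parent, ddof=1)
intercept = child.mean() - slope * parent.mean()

# Sample correlation; under stationarity it should be close to the slope.
corr = np.corrcoef(child, parent)[0, 1]


def predicted_child_height(x):
    """Fitted linear CEF evaluated at average parent height x."""
    return intercept + slope * x


tall, short = 75.0, 66.0
print(f"slope={slope:.3f}, correlation={corr:.3f}")
print(f"intercept={intercept:.2f}, mu*(1-rho)={mu * (1 - rho):.2f}")
print(f"parents at {tall}: predicted child {predicted_child_height(tall):.2f}")
print(f"parents at {short}: predicted child {predicted_child_height(short):.2f}")
```

Under these assumptions the slope and correlation should nearly coincide, and each prediction should fall between the parents' height and the population mean. A larger assumed correlation would pull the predictions closer to the parents' height; a smaller one would pull them closer to the mean.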
Galton, who was Charles Darwin’s cousin, is also remembered for having founded the Eugenics Society, dedicated to breeding better people. Indeed, his interest in regression came largely from this quest. We conclude from this that the value of scientific ideas should not be judged by their author’s politics.
Galton does not seem to have shown much interest in multiple regression, our chief concern in this chapter. Indeed, the regressions in Galton’s work are mechanical properties of distributions of stationary random variables, almost identities, and certainly not causal. Galton would have said so himself because he objected to the Lamarckian idea (later promoted in Stalin’s Russia) that acquired traits could be inherited.
The idea that regression can be used for statistical control satisfyingly originates in an inquiry into the determinants of poverty rates by George Udny Yule (1899). Yule, a statistician and student of Karl Pearson’s (Pearson was Galton’s protégé), realized that Galton’s regression coefficient could be extended to multiple variables by solving the least squares normal equations that had been derived long before by Legendre and Gauss. Yule’s (1899) paper appears to be the first publication containing multivariate regression estimates. His model links changes in poverty rates in an area to changes in the administration of the English Poor Laws, while controlling for population growth and the age distribution in the area. He was particularly interested in whether out-relief, the practice of providing income support for poor people without requiring them to move to the poorhouse, did not itself contribute to higher poverty rates. This is a well-defined causal question of a sort that still occupies us today.[36]
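To make the idea of statistical control concrete, here is a minimal sketch of multiple regression computed by solving the least squares normal equations, $(X'X)b = X'y$. The data are simulated and entirely hypothetical (they are not Yule's poverty data): the variable names `policy`, `control`, and `outcome` and all coefficients are assumptions chosen for illustration, with `control` affecting both `policy` and `outcome`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical simulated data: `control` influences both `policy` and `outcome`.
control = rng.normal(size=n)
policy = 0.8 * control + rng.normal(size=n)
outcome = 1.0 * policy + 2.0 * control + rng.normal(size=n)


def ols(columns, y):
    """Solve the normal equations (X'X) b = X'y, with an intercept column."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.solve(X.T @ X, X.T @ y)


short_fit = ols([policy], outcome)           # omits the control variable
long_fit = ols([policy, control], outcome)   # includes the control variable

print(f"policy coefficient without control: {short_fit[1]:.3f}")
print(f"policy coefficient with control:    {long_fit[1]:.3f}")
```

In this simulated setting, the coefficient on `policy` should be close to the assumed value of 1.0 only when `control` is included. Omitting it lets the policy coefficient absorb part of the control variable's effect.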
Finally, we note that the history of regression is beautifully detailed in the book by Stephen Stigler (1986). Stigler is a famous statistician at the University of Chicago, but not quite as famous as his father, the economist and Nobel laureate, George Stigler.