Springer Texts in Business and Economics
Simple Linear Regression
3.1 Introduction
In this chapter, we study extensively the estimation of a linear relationship between two variables, $Y_i$ and $X_i$, of the form:
$$Y_i = \alpha + \beta X_i + u_i, \qquad i = 1, 2, \ldots, n \tag{3.1}$$
where $Y_i$ denotes the $i$-th observation on the dependent variable $Y$, which could be consumption, investment or output, and $X_i$ denotes the $i$-th observation on the independent variable $X$, which could be disposable income, the interest rate or an input. These observations could be collected on firms or households at a given point in time, in which case we call the data a cross-section. Alternatively, these observations may be collected over time for a specific industry or country, in which case we call the data a time-series. $n$ is the number of observations, which could be the number of firms or households in a cross-section, or the number of years if the observations are collected annually.
$\alpha$ and $\beta$ are the intercept and slope of this simple linear relationship between $Y$ and $X$. They are assumed to be unknown parameters to be estimated from the data. A plot of the data, i.e., $Y$ versus $X$, would be very illustrative, showing what type of relationship exists empirically between these two variables. For example, if $Y$ is consumption and $X$ is disposable income, then we would expect a positive relationship between these variables, and the data may look like Figure 3.1 when plotted for a random sample of households. If $\alpha$ and $\beta$ were known, one could draw the straight line $(\alpha + \beta X)$ as shown in Figure 3.1. It is clear that not all the observations $(X_i, Y_i)$ lie on the straight line $(\alpha + \beta X)$. In fact, equation (3.1) states that the difference between each $Y_i$ and the corresponding $(\alpha + \beta X_i)$ is due to a random error $u_i$. This error may be due to (i) the omission of relevant factors that could influence consumption, other than disposable income, like real wealth or varying tastes, or unforeseen events that induce households to consume more or less, (ii) measurement error, which could be the result of households not reporting their consumption or income accurately, or (iii) wrong choice of a linear relationship between consumption and income, when the true relationship may be nonlinear.
These different causes of the error term will have different effects on the distribution of this error. In what follows, we consider only disturbances that satisfy some restrictive assumptions. In later chapters we relax these assumptions to account for more general kinds of error terms.
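To make the data-generating process in (3.1) concrete, the following is a minimal simulation sketch. The parameter values, the income figures, and the choice of normally distributed disturbances with mean zero are all assumptions made for illustration; the text itself does not specify them.

```python
import random


def simulate(alpha, beta, xs, sigma, seed=0):
    """Generate Y_i = alpha + beta * X_i + u_i for each X_i in xs.

    The disturbances u_i are drawn as N(0, sigma^2). Normality is an
    assumption of this sketch, not something the model (3.1) requires.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    us = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [alpha + beta * x + u for x, u in zip(xs, us)]
    return ys, us


# Hypothetical disposable incomes for a small cross-section of households.
income = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]

# Hypothetical "true" intercept and slope, known here only because we simulate.
consumption, disturbances = simulate(alpha=2.0, beta=0.8, xs=income, sigma=1.5)

for x, y, u in zip(income, consumption, disturbances):
    on_line = 2.0 + 0.8 * x  # point on the true line alpha + beta * X
    print(f"X={x:5.1f}  true line={on_line:6.2f}  u={u:6.2f}  Y={y:6.2f}")
```

In a simulation the disturbances are visible because we generated them ourselves. With real data only the pairs of income and consumption would be observed.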
In real life, $\alpha$ and $\beta$ are not known, and have to be estimated from the observed data $\{(X_i, Y_i)$ for $i = 1, 2, \ldots, n\}$. This also means that the true line $(\alpha + \beta X)$ as well as the true disturbances (the $u_i$'s) are unobservable. In this case, $\alpha$ and $\beta$ could be estimated by the best fitting line through the data. Different researchers may draw different lines through the same data. What makes one line better than another? One measure of misfit is the amount of error from the observed $Y_i$ to the guessed line. Let us call the latter $\hat{Y}_i = \hat{\alpha} + \hat{\beta}X_i$, where the hat (^) denotes a guess on the appropriate parameter or variable. Each observation $(X_i, Y_i)$ will have a corresponding observable error attached to it, which we will call $e_i = Y_i - \hat{Y}_i$; see Figure 3.2. In other words, we obtain the guessed $Y_i$, $(\hat{Y}_i)$, corresponding to each $X_i$ from the guessed line,
B. H. Baltagi, Econometrics, Springer Texts in Business and Economics, DOI 10.1007/978-3-642-20059-5_3, 49
© Springer-Verlag Berlin Heidelberg 2011
$\hat{\alpha} + \hat{\beta}X_i$. Next, we find our error in guessing that $Y_i$ by subtracting the guessed $\hat{Y}_i$ from the actual $Y_i$. The only difference between Figure 3.1 and Figure 3.2 is the fact that Figure 3.1 draws the true consumption line, which is unknown to the researcher, whereas Figure 3.2 is a guessed consumption line drawn through the data. Therefore, while the $u_i$'s are unobservable, the $e_i$'s are observable. Note that there will be $n$ errors for each line, one error corresponding to every observation.
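As a small illustration of how the observable errors arise from a guessed line, the sketch below computes one error per observation as the actual Y minus the guessed Y. The data and the guessed intercept and slope are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
def residuals(xs, ys, alpha_hat, beta_hat):
    """Return e_i = Y_i - (alpha_hat + beta_hat * X_i), one per observation."""
    return [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]


# Hypothetical household data: disposable income X and consumption Y.
X = [10.0, 20.0, 30.0, 40.0]
Y = [11.0, 17.0, 27.0, 33.0]

# One researcher's guessed line (hypothetical intercept and slope).
e = residuals(X, Y, alpha_hat=3.0, beta_hat=0.75)
print(e)  # [0.5, -1.0, 1.5, 0.0]
```

There are four observations and therefore four errors for this guessed line. A different guess for the intercept or slope would produce a different set of four errors.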
Similarly, there will be another set of $n$ errors for another guessed line drawn through the data. For each guessed line, we can summarize its corresponding errors by one number, the sum of squares of these errors, which seems to be a natural criterion for penalizing a wrong guess. Note that a simple sum of these errors is not a good choice for a measure of misfit, since positive errors end up canceling negative errors when both should be counted in our measure. However, this does not mean that the sum of squared errors is the only single measure of misfit. Other measures include the sum of absolute errors, but this latter measure is mathematically more difficult to handle. Once the measure of misfit is chosen, $\alpha$ and $\beta$ could then be estimated by minimizing this measure. In fact, this is the idea behind least squares estimation.
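The comparison of misfit measures can be sketched numerically. The example below uses hypothetical data and two hypothetical guessed lines to show how the simple sum of errors can favor a poor line through cancellation. It then searches a coarse grid of candidate lines for the smallest sum of squared errors. The grid search is a crude stand-in chosen only to illustrate "estimate by minimizing the measure"; it is not the method developed in the text.

```python
def misfit(xs, ys, a, b):
    """Summarize the errors of the guessed line a + b * X in three ways."""
    errors = [y - (a + b * x) for x, y in zip(xs, ys)]
    return {
        "sum": sum(errors),                      # positive and negative errors cancel
        "sum_sq": sum(e * e for e in errors),    # least squares criterion
        "sum_abs": sum(abs(e) for e in errors),  # sum of absolute errors
    }


# Hypothetical household data: disposable income X and consumption Y.
X = [10.0, 20.0, 30.0, 40.0]
Y = [11.0, 17.0, 27.0, 33.0]

# A flat line through the mean of Y (a poor fit) versus a sloped guess.
flat = misfit(X, Y, a=22.0, b=0.0)
guess = misfit(X, Y, a=3.0, b=0.75)
print(flat)   # {'sum': 0.0, 'sum_sq': 292.0, 'sum_abs': 32.0}
print(guess)  # {'sum': 1.0, 'sum_sq': 3.5, 'sum_abs': 3.0}

# Crude grid search (illustrative only): intercepts 0.0 to 10.0 in steps of
# 0.1, slopes 0.00 to 1.50 in steps of 0.01.
grid = [(i / 10, j / 100) for i in range(101) for j in range(151)]
best = min(grid, key=lambda ab: misfit(X, Y, ab[0], ab[1])["sum_sq"])
print(best)  # (3.0, 0.76)
```

The flat line has a simple error sum of exactly zero even though it ignores income altogether, while both the squared and absolute measures correctly rank the sloped guess as the better line.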