Introduction to the Mathematical and Statistical Foundations of Econometrics
Conditional Expectations as the Best Forecast Schemes
I will now show that the conditional expectation of a random variable $Y$ given a random variable or vector $X$ is the best forecasting scheme for $Y$ in the sense that the mean-square forecast error is minimal. Let $f(X)$ be a forecast of $Y$, where $f$ is a Borel-measurable function. The mean-square forecast error (MSFE) is defined by $\text{MSFE} = E[(Y - f(X))^2]$. The question is: for which function $f$ is the MSFE minimal? The answer is
Theorem 3.13: If $E[Y^2] < \infty$, then $E[(Y - f(X))^2]$ is minimal for $f(X) = E[Y|X]$.
Proof: According to Theorem 3.10 there exists a Borel-measurable function $g$ such that $E[Y|X] = g(X)$ with probability 1. Let $U = Y - E[Y|X] = Y - g(X)$. It follows from Theorems 3.3, 3.4, and 3.9 that
$$
\begin{aligned}
E[(Y - f(X))^2 \mid X] &= E[(U + g(X) - f(X))^2 \mid X] \\
&= E[U^2 \mid X] + 2E[(g(X) - f(X))U \mid X] + E[(g(X) - f(X))^2 \mid X] \\
&= E[U^2 \mid X] + 2(g(X) - f(X))E[U \mid X] + (g(X) - f(X))^2,
\end{aligned} \tag{3.24}
$$
where the last equality follows from Theorems 3.4 and 3.9. Because, by Theorem 3.6, $E[U|X] = 0$ with probability 1, equation (3.24) becomes
$$E[(Y - f(X))^2 \mid X] = E[U^2 \mid X] + (g(X) - f(X))^2. \tag{3.25}$$
Applying Theorem 3.1 to (3.25), it follows now that
$$E[(Y - f(X))^2] = E[U^2] + E[(g(X) - f(X))^2],$$
which is minimal if $E[(g(X) - f(X))^2] = 0$. According to Lemma 3.1, this condition is equivalent to the condition that $P[g(X) = f(X)] = 1$. Q.E.D.
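Theorem 3.13 is easy to check numerically. The following minimal sketch uses a hypothetical data-generating process with $E[Y|X] = X^2$; by the last display, any forecast $f(X)$ other than the conditional expectation adds the nonnegative term $E[(g(X) - f(X))^2]$ to the mean-square forecast error.

```python
# Simulation sketch of Theorem 3.13 (hypothetical DGP): Y = X^2 + U with
# U independent of X and E[U] = 0, so that E[Y|X] = X^2.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.standard_normal(n)
U = rng.standard_normal(n)          # E[U|X] = 0 by independence
Y = X**2 + U

def msfe(f):
    return np.mean((Y - f(X)) ** 2)

print(msfe(lambda x: x**2))         # conditional expectation: ~1.0 = E[U^2]
print(msfe(lambda x: x**2 + 0.5))   # biased forecast: ~1.25
print(msfe(lambda x: np.abs(x)))    # wrong functional form: ~1.81
```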
Theorem 3.13 is the basis for regression analysis. In parametric regression analysis, a dependent variable $Y$ is “explained” by a vector $X$ of explanatory (also called “independent”) variables according to a regression model of the type $Y = g(X, \theta_0) + U$, where $g(x, \theta)$ is a known function of $x$ and a vector $\theta$ of parameters, and $U$ is the error term, which is assumed to satisfy the condition $E[U|X] = 0$ (with probability 1). The problem is then to estimate the unknown parameter vector $\theta_0$. For example, a Mincer-type wage equation explains the log of the wage, $Y$, of a worker from the years of education, $X_1$, and the years of experience on the job, $X_2$, by a regression model of the type $Y = \alpha + \beta X_1 + \gamma X_2 + \delta X_2^2 + U$; thus, in this case, $\theta = (\alpha, \beta, \gamma, \delta)^{\mathrm{T}}$, $X = (X_1, X_2)^{\mathrm{T}}$, and $g(X, \theta) = \alpha + \beta X_1 + \gamma X_2 + \delta X_2^2$. The condition that $E[U|X] = 0$ with probability 1 now implies that $E[Y|X] = g(X, \theta_0)$ with probability 1 for some parameter vector $\theta_0$. It therefore follows from Theorem 3.13 that $\theta_0$ minimizes the mean-square error function $E[(Y - g(X, \theta))^2]$:
$$\theta_0 = \operatorname{argmin}_{\theta} E[(Y - g(X, \theta))^2], \tag{3.26}$$
where “argmin” stands for the argument for which the function involved is minimal.
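Because the mean-square error function and its minimizer are population quantities, in practice one minimizes the sample analog of (3.26). Below is a minimal simulation sketch for the Mincer-type wage equation above; the data-generating process and parameter values are hypothetical, and because this $g(X, \theta)$ is linear in $\theta$, the sample minimizer is simply the ordinary least-squares estimator.

```python
# Sample analog of (3.26) for the Mincer-type wage equation (a sketch;
# the data-generating process and parameter values are hypothetical).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X1 = rng.uniform(8, 20, n)        # years of education
X2 = rng.uniform(0, 40, n)        # years of experience on the job
U = rng.normal(0.0, 0.3, n)       # error term with E[U|X] = 0
theta_true = np.array([1.0, 0.08, 0.04, -0.0007])   # (alpha, beta, gamma, delta)

# Regressors of g(X, theta) = alpha + beta*X1 + gamma*X2 + delta*X2^2:
design = np.column_stack([np.ones(n), X1, X2, X2**2])
Y = design @ theta_true + U       # log wage

# g is linear in theta, so the minimizer of the sample mean-square error
# is the least-squares estimator:
theta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(theta_hat)                  # close to theta_true for large n
```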
Next, consider a strictly stationary time series process $Y_t$.
Definition 3.4: A time series process $Y_t$ is said to be strictly stationary if, for arbitrary integers $m_1 < m_2 < \cdots < m_k$, the joint distribution of $Y_{t-m_1}, \ldots, Y_{t-m_k}$ does not depend on the time index $t$.
Consider the problem of forecasting $Y_t$ on the basis of the past $Y_{t-j}$, $j \geq 1$, of $Y_t$. Usually we do not observe the whole past of $Y_t$ but only $Y_{t-j}$ for $j = 1, \ldots, t-1$, for instance. It follows from Theorem 3.13 that the optimal MSFE forecast of $Y_t$ given the information on $Y_{t-j}$ for $j = 1, \ldots, m$ is the conditional expectation of $Y_t$ given $Y_{t-j}$ for $j = 1, \ldots, m$. Thus, if $E[Y_t^2] < \infty$, then
$$E[Y_t \mid Y_{t-1}, \ldots, Y_{t-m}] = \operatorname{argmin}_{f} E[(Y_t - f(Y_{t-1}, \ldots, Y_{t-m}))^2].$$
As before, the minimum is taken over all Borel-measurable functions $f$ on $\mathbb{R}^m$. Moreover, because of the strict stationarity assumption, there exists a Borel-measurable function $g_m$ on $\mathbb{R}^m$ that does not depend on the time index $t$ such that, with probability 1,
$$E[Y_t \mid Y_{t-1}, \ldots, Y_{t-m}] = g_m(Y_{t-1}, \ldots, Y_{t-m})$$
for all t. Theorem 3.12 now tells us that
$$\lim_{m \to \infty} E[Y_t \mid Y_{t-1}, \ldots, Y_{t-m}] = \lim_{m \to \infty} g_m(Y_{t-1}, \ldots, Y_{t-m}) = E[Y_t \mid Y_{t-1}, Y_{t-2}, Y_{t-3}, \ldots], \tag{3.27}$$
where the latter is the conditional expectation of $Y_t$ given its whole past $Y_{t-j}$, $j \geq 1$. More formally, let $\mathscr{F}_{t-m}^{t-1} = \sigma(Y_{t-1}, \ldots, Y_{t-m})$ and $\mathscr{F}_{-\infty}^{t-1} = \bigvee_{m=1}^{\infty} \mathscr{F}_{t-m}^{t-1}$. Then (3.27) reads
$$\lim_{m \to \infty} E[Y_t \mid \mathscr{F}_{t-m}^{t-1}] = E[Y_t \mid \mathscr{F}_{-\infty}^{t-1}].$$
The latter conditional expectation is also denoted by $E_{t-1}[Y_t]$:
$$E_{t-1}[Y_t] = E[Y_t \mid Y_{t-1}, Y_{t-2}, Y_{t-3}, \ldots] \stackrel{\text{def}}{=} E[Y_t \mid \mathscr{F}_{-\infty}^{t-1}]. \tag{3.28}$$
In practice we do not observe the whole past of time series processes. However, it follows from Theorem 3.12 that if $t$ is large, then approximately, $E[Y_t \mid Y_{t-1}, \ldots, Y_1] \approx E_{t-1}[Y_t]$.
In time series econometrics the focus is often on modeling (3.28) as a function of past values of $Y_t$ and an unknown parameter vector $\theta$, for instance. For example, an autoregressive model of order 1, denoted by AR(1), takes the form $E_{t-1}[Y_t] = \alpha + \beta Y_{t-1}$, $\theta = (\alpha, \beta)^{\mathrm{T}}$, where $|\beta| < 1$. Then $Y_t = \alpha + \beta Y_{t-1} + U_t$, where $U_t$ is called the error term. If this model is true, then $U_t = Y_t - E_{t-1}[Y_t]$, which by Theorem 3.6 satisfies $P(E_{t-1}[U_t] = 0) = 1$.
The condition $|\beta| < 1$ is one of the two necessary conditions for strict stationarity of $Y_t$; the other one is that $U_t$ be strictly stationary. To see this, observe that by backwards substitution we can write $Y_t = \alpha/(1 - \beta) + \sum_{j=0}^{\infty} \beta^j U_{t-j}$, provided that $|\beta| < 1$. The strict stationarity of $Y_t$ now follows from the strict stationarity of $U_t$.
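The following minimal simulation sketch (with hypothetical parameter values) illustrates both points: a $Y_t$ generated with $|\beta| < 1$ and strictly stationary $U_t$ fluctuates around the stationary mean $\alpha/(1 - \beta)$, and the conditional expectation $E_{t-1}[Y_t] = \alpha + \beta Y_{t-1}$ attains a smaller mean-square forecast error than the unconditional mean.

```python
# AR(1) sketch: Y_t = alpha + beta*Y_{t-1} + U_t with |beta| < 1
# (hypothetical parameter values).
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, T = 0.5, 0.8, 200_000
U = rng.standard_normal(T)               # strictly stationary (i.i.d.) errors

Y = np.empty(T)
Y[0] = alpha / (1 - beta)                # start at the stationary mean
for t in range(1, T):
    Y[t] = alpha + beta * Y[t - 1] + U[t]

cond = alpha + beta * Y[:-1]             # E_{t-1}[Y_t]
uncond = alpha / (1 - beta)              # forecast ignoring the past
print(np.mean((Y[1:] - cond) ** 2))      # ~1.0 = E[U_t^2], the minimal MSFE
print(np.mean((Y[1:] - uncond) ** 2))    # ~1/(1 - beta^2), about 2.78
```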
EXERCISES

1. Why is property (3.6) equivalent to (3.8)?
2. Why is the set A defined by (3.11) contained in $\mathscr{F}_X$?
3. Why does (3.12) hold?
4. Prove Theorem 3.5.
5. Prove Theorem 3.6.
6. Verify (3.20). Why does Theorem 3.8 follow from (3.20)?
7. Why does (3.21) imply that Theorem 3.9 holds?
8. Complete the proof of Theorem 3.9 for the general case by writing, for instance, $X = \max(0, X) - \max(0, -X) = X_1 - X_2$ and $Y = \max(0, Y) - \max(0, -Y) = Y_1 - Y_2$ and applying the result of part (b) of the proof to each pair $X_i$, $Y_j$.
9. Prove (3.22).
10. Let $Y$ and $X$ be random variables with $E[|Y|] < \infty$, and let $\Phi$ be a Borel-measurable one-to-one mapping from $\mathbb{R}$ into $\mathbb{R}$. Prove that $E[Y|X] = E[Y|\Phi(X)]$ with probability 1.
11. Let $Y$ and $X$ be random variables with $E[Y^2] < \infty$, $P(X = 1) = P(X = 0) = 0.5$, $E[Y] = 0$, and $E[X \cdot Y] = 1$. Derive $E[Y|X]$. Hint: Use Theorems 3.10 and 3.13.
APPENDIX: PROOF OF THEOREM 3.12
Let $Z_n = E[Y|\mathscr{F}_n]$ and $Z = E[Y|\mathscr{F}_\infty]$, and let $A \in \bigcup_{n=1}^{\infty} \mathscr{F}_n$ be arbitrary. Note that the latter implies $A \in \mathscr{F}_\infty$. Because of the monotonicity of $\{\mathscr{F}_n\}$ there exists an index $k_A$ (depending on $A$) such that for all $n \geq k_A$,
$$\int_A Z_n(\omega)\,dP(\omega) = \int_A Y(\omega)\,dP(\omega). \tag{3.29}$$
If $Y$ is bounded: $P[|Y| < M] = 1$ for some positive real number $M$, then $Z_n$ is uniformly bounded: $|Z_n| = |E[Y|\mathscr{F}_n]| \leq E[|Y| \mid \mathscr{F}_n] \leq M$; hence, it follows from (3.29), the dominated convergence theorem, and the definition of $Z$ that
$$\int_A \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \int_A Z(\omega)\,dP(\omega) \tag{3.30}$$
for all sets $A \in \bigcup_{n=1}^{\infty} \mathscr{F}_n$. Although $\bigcup_{n=1}^{\infty} \mathscr{F}_n$ is not necessarily a $\sigma$-algebra, it is easy to verify from the monotonicity of $\{\mathscr{F}_n\}$ that $\bigcup_{n=1}^{\infty} \mathscr{F}_n$ is an algebra. Now let $\mathscr{F}^*$ be the collection of all sets in $\mathscr{F}_\infty$ satisfying the following two conditions:
(a) For each set $B \in \mathscr{F}^*$, equality (3.30) holds with $A = B$.
(b) For each pair of sets $B_1 \in \mathscr{F}^*$ and $B_2 \in \mathscr{F}^*$, equality (3.30) holds with $A = B_1 \cup B_2$.
Given that (3.30) holds for $A = \Omega$ because $\Omega \in \bigcup_{n=1}^{\infty} \mathscr{F}_n$, it is trivial that (3.30) also holds for the complement $\tilde{A}$ of $A$:
$$\int_{\tilde{A}} \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \int_{\Omega} \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) - \int_A \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \int_{\Omega} Z(\omega)\,dP(\omega) - \int_A Z(\omega)\,dP(\omega) = \int_{\tilde{A}} Z(\omega)\,dP(\omega);$$
hence, if $B \in \mathscr{F}^*$, then $\tilde{B} \in \mathscr{F}^*$. Thus, $\mathscr{F}^*$ is an algebra. Note that this algebra exists because $\bigcup_{n=1}^{\infty} \mathscr{F}_n$ is an algebra satisfying the conditions (a) and (b). Thus,
$$\bigcup_{n=1}^{\infty} \mathscr{F}_n \subset \mathscr{F}^* \subset \mathscr{F}_\infty.$$
I will now show that $\mathscr{F}^*$ is a $\sigma$-algebra, and thus that $\mathscr{F}_\infty = \mathscr{F}^*$ because the former is the smallest $\sigma$-algebra containing $\bigcup_{n=1}^{\infty} \mathscr{F}_n$. For any sequence of disjoint sets $A_j \in \mathscr{F}^*$, it follows from (3.30) that
$$\int_{\bigcup_{j=1}^{\infty} A_j} \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \sum_{j=1}^{\infty} \int_{A_j} \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \sum_{j=1}^{\infty} \int_{A_j} Z(\omega)\,dP(\omega) = \int_{\bigcup_{j=1}^{\infty} A_j} Z(\omega)\,dP(\omega);$$
hence, $\bigcup_{j=1}^{\infty} A_j \in \mathscr{F}^*$. This implies that $\mathscr{F}^*$ is a $\sigma$-algebra containing $\bigcup_{n=1}^{\infty} \mathscr{F}_n$ because we have seen in Chapter 1 that an algebra closed under countable unions of disjoint sets is a $\sigma$-algebra. Hence, $\mathscr{F}_\infty = \mathscr{F}^*$; consequently, (3.30) holds for all sets $A \in \mathscr{F}_\infty$. This implies that $P[Z = \lim_{n \to \infty} Z_n] = 1$ if $Y$ is bounded.
Next, let $Y$ be nonnegative: $P[Y \geq 0] = 1$, and denote, for natural numbers $m \geq 1$, $B_m = \{\omega \in \Omega : m - 1 \leq Y(\omega) < m\}$, $Y_m = Y \cdot I(m - 1 \leq Y < m)$, $Z_n^{(m)} = E[Y_m | \mathscr{F}_n]$, and $Z^{(m)} = E[Y_m | \mathscr{F}_\infty]$. I have just shown that for fixed $m \geq 1$ and arbitrary $A \in \mathscr{F}_\infty$,
$$\int_A \lim_{n \to \infty} Z_n^{(m)}(\omega)\,dP(\omega) = \int_A Z^{(m)}(\omega)\,dP(\omega) = \int_A Y_m(\omega)\,dP(\omega) = \int_{A \cap B_m} Y(\omega)\,dP(\omega), \tag{3.31}$$
where the last two equalities follow from the definitions of $Z^{(m)}$ and $Y_m$. Because $Y_m(\omega) I(\omega \in \tilde{B}_m) = 0$, it follows that $Z_n^{(m)}(\omega) I(\omega \in \tilde{B}_m) = 0$; hence,
$$\int_A \lim_{n \to \infty} Z_n^{(m)}(\omega)\,dP(\omega) = \int_{A \cap B_m} \lim_{n \to \infty} Z_n^{(m)}(\omega)\,dP(\omega) + \int_{A \cap \tilde{B}_m} \lim_{n \to \infty} Z_n^{(m)}(\omega)\,dP(\omega) = \int_{A \cap B_m} \lim_{n \to \infty} Z_n^{(m)}(\omega)\,dP(\omega),$$
and thus by (3.31),
$$\int_{A \cap B_m} \lim_{n \to \infty} Z_n^{(m)}(\omega)\,dP(\omega) = \int_{A \cap B_m} Y(\omega)\,dP(\omega).$$
Moreover, it follows from the definition of conditional expectations and Theorem 3.7 that
$$Z_n^{(m)} = E[Y \cdot I(m - 1 \leq Y < m) \mid \mathscr{F}_n] = E[Y \mid B_m \cap \mathscr{F}_n] = E[E(Y \mid \mathscr{F}_n) \mid B_m \cap \mathscr{F}_n] = E[Z_n \mid B_m \cap \mathscr{F}_n];$$
hence, for every set $A \in \bigcup_{n=1}^{\infty} \mathscr{F}_n$,
$$\int_{A \cap B_m} Z_n(\omega)\,dP(\omega) = \int_{A \cap B_m} Z_n^{(m)}(\omega)\,dP(\omega) = \int_{A \cap B_m} Y(\omega)\,dP(\omega), \tag{3.32}$$
which by the same argument as in the bounded case carries over to the sets $A \in \mathscr{F}_\infty$. It follows now from (3.31) and (3.32) that
$$\int_{A \cap B_m} \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \int_{A \cap B_m} Y(\omega)\,dP(\omega)$$
for all sets $A \in \mathscr{F}_\infty$. Consequently, because the sets $B_m$ partition $\Omega$,
$$\int_A \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \sum_{m=1}^{\infty} \int_{A \cap B_m} \lim_{n \to \infty} Z_n(\omega)\,dP(\omega) = \sum_{m=1}^{\infty} \int_{A \cap B_m} Y(\omega)\,dP(\omega) = \int_A Y(\omega)\,dP(\omega)$$
for all sets $A \in \mathscr{F}_\infty$. This proves the theorem for the case $P[Y \geq 0] = 1$. The general case is now easy using the decomposition $Y = \max(0, Y) - \max(0, -Y)$.
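To make the convergence asserted by Theorem 3.12 concrete, here is a minimal numerical sketch under hypothetical choices: take $\mathscr{F}_n = \sigma(X_1, \ldots, X_n)$ for i.i.d. random signs $X_j$, and $Y = \sum_{j=1}^{\infty} 2^{-j} X_j$ (truncated numerically). Then $Z_n = E[Y|\mathscr{F}_n] = \sum_{j=1}^{n} 2^{-j} X_j$ because the omitted terms have mean zero and are independent of $\mathscr{F}_n$, and $Y$ is $\mathscr{F}_\infty$-measurable, so that $Z = E[Y|\mathscr{F}_\infty] = Y$ and $Z_n \to Z$ with probability 1.

```python
# Numerical sketch of Theorem 3.12 (hypothetical setup): Z_n = E[Y|F_n]
# converges to Z = E[Y|F_infinity] = Y along every realization.
import numpy as np

rng = np.random.default_rng(3)
J = 50                                  # numerical truncation of the sum
X = rng.choice([-1.0, 1.0], size=J)     # i.i.d. random signs X_1, ..., X_J
w = 0.5 ** np.arange(1, J + 1)          # weights 2^{-j}
Y = np.sum(w * X)                       # Y is F_infinity-measurable

for n in (1, 2, 5, 10, 20):
    Z_n = np.sum(w[:n] * X[:n])         # E[Y|F_n]: the tail has mean zero
    print(n, abs(Z_n - Y))              # shrinks like 2^{-n}
```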