INTRODUCTION TO STATISTICS AND ECONOMETRICS
MAXIMUM LIKELIHOOD ESTIMATOR: DEFINITION AND COMPUTATION
Suppose we want to estimate the probability ($p$) that a head will appear for a particular coin; we toss it ten times and a head appears nine times. Call this event $A$. Then we suspect that the coin is loaded in favor of heads: in other words, we conclude that $p = 1/2$ is not likely. If $p$ were $1/2$, event $A$ would be expected to occur only once in a hundred times, since we have $P(A \mid p = 1/2) = C^{10}_{9}(1/2)^{10} \cong 0.01$. In the same situation $p = 3/4$ is more likely, because $P(A \mid p = 3/4) = C^{10}_{9}(3/4)^{9}(1/4) \cong 0.19$, and $p = 9/10$ is even more likely, because $P(A \mid p = 9/10) = C^{10}_{9}(9/10)^{9}(1/10) \cong 0.39$. Thus it makes sense to call $P(A \mid p) = C^{10}_{9}p^{9}(1 - p)$ the likelihood function of $p$ given event $A$. Note that it is the probability of event $A$ given $p$, but we give it a different name when we regard it as a function of $p$. The maximum likelihood estimator of $p$ is the value of $p$ that maximizes $P(A \mid p)$, which in our example is equal to $9/10$. More generally, we state
DEFINITION 7.3.1 Let $(X_1, X_2, \ldots, X_n)$ be a random sample on a discrete population characterized by a vector of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$ and let $x_i$ be the observed value of $X_i$. Then we call
$L = \prod_{i=1}^{n} P(X_i = x_i \mid \theta)$
the likelihood function of $\theta$ given $(x_1, x_2, \ldots, x_n)$, and we call the value of $\theta$ that maximizes $L$ the maximum likelihood estimator.
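The coin-tossing figures that motivate this definition can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `likelihood` and the grid step of 0.001 are illustrative choices, and the grid search is only a crude stand-in for exact maximization.

```python
from math import comb

def likelihood(p, n=10, k=9):
    """P(A | p): probability of k heads in n tosses when P(head) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The three candidate values of p discussed in the text.
for p in (1/2, 3/4, 9/10):
    print(f"p = {p:.2f}: likelihood = {likelihood(p):.2f}")

# Crude grid search over p = 0.000, 0.001, ..., 1.000 (an assumed resolution).
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print("grid maximizer:", p_hat)
```

The printed likelihoods round to 0.01, 0.19, and 0.39, and the grid maximizer coincides with 9/10, the value of the maximum likelihood estimator in this example.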
Recall that the purpose of estimation is to pick a probability distribution among the many (usually infinitely many) probability distributions that could have generated the given observations. Maximum likelihood estimation means choosing that probability distribution under which the observed values could have occurred with the highest probability. It therefore makes good intuitive sense. In addition, we shall show in Section 7.4 that the maximum likelihood estimator has good asymptotic properties. The following two examples show how to derive the maximum likelihood estimator in the case of a discrete sample.
EXAMPLE 7.3.1 Suppose X ~ B(n, p) and the observed value of X is k. The likelihood function of p is given by
(7.3.1) $L = C^{n}_{k}\, p^{k}(1 - p)^{n-k}$.
We shall maximize log L rather than L because it is simpler (“log” refers to natural logarithm throughout this book). Since log is a monotonically increasing function, the value of the maximum likelihood estimator is unchanged by this transformation. We have
(7.3.2) $\log L = \log C^{n}_{k} + k \log p + (n - k)\log(1 - p)$.
Setting the derivative with respect to p equal to 0 yields
(7.3.3) $\dfrac{\partial \log L}{\partial p} = \dfrac{k}{p} - \dfrac{n - k}{1 - p} = 0$,
and denoting the maximum likelihood estimator by $\hat{p}$, we obtain
(7.3.4) $\hat{p} = \dfrac{k}{n}$.
To be complete, we should check to see that (7.3.4) gives a maximum rather than any other stationary point by showing that $\partial^{2} \log L/\partial p^{2}$ evaluated at $p = k/n$ is negative.
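The second-order check can be written out as follows. This is a sketch that assumes $0 < k < n$, so that $k/n$ lies strictly inside the unit interval; it differentiates (7.3.2) twice and substitutes $p = k/n$.

```latex
\begin{align*}
\frac{\partial^{2} \log L}{\partial p^{2}}
  &= -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}}, \\
\left.\frac{\partial^{2} \log L}{\partial p^{2}}\right|_{p = k/n}
  &= -\frac{n^{2}}{k} - \frac{n^{2}}{n-k}
   = -\frac{n^{3}}{k(n-k)} < 0 .
\end{align*}
```

In fact the second derivative is negative for every $p$ in $(0, 1)$, so $\log L$ is strictly concave there and the stationary point is the unique maximum. When $k = 0$ or $k = n$, $\log L$ is monotone in $p$ and the maximum is attained on the boundary, which again equals $k/n$.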
This example arises if we want to estimate the probability of heads on the basis of the information that heads came up $k$ times in $n$ tosses. Suppose that we are given more complete information: whether each toss has resulted in a head or a tail. Define $X_i = 1$ if the $i$th toss shows a head and $X_i = 0$ if it is a tail. Let $x_i$ be the observed value of $X_i$, which is, of course, also 1 or 0. The likelihood function is given by
(7.3.5) $L = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i}$.
Taking the logarithm, we have
(7.3.6) $\log L = \left(\sum_{i=1}^{n} x_i\right)\log p + \left(n - \sum_{i=1}^{n} x_i\right)\log(1 - p)$.
But, since $k = \sum_{i=1}^{n} x_i$, (7.3.6) is the same as (7.3.2) aside from a constant term, which does not matter in the maximization. Therefore the maximum likelihood estimator is the same as before, meaning that the extra information is irrelevant in this case. In other words, as far as the estimation of $p$ is concerned, what matters is the total number of heads and not the particular order in which heads and tails appear. A function of a sample, such as $\sum_{i=1}^{n} x_i$ in the present case, that contains all the necessary information about a parameter is called a sufficient statistic.
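The irrelevance of the ordering can be illustrated numerically. The sketch below is not from the original text: the two toss records are hypothetical, and the function name `log_likelihood` and the grid of candidate values are illustrative assumptions. It evaluates the logarithm of (7.3.5) toss by toss, without using the total number of heads as a shortcut.

```python
from math import log

def log_likelihood(p, tosses):
    """Log of (7.3.5): sum over tosses of x*log(p) + (1 - x)*log(1 - p)."""
    return sum(x * log(p) + (1 - x) * log(1 - p) for x in tosses)

# Two hypothetical records of ten tosses, each with nine heads,
# differing only in where the single tail occurs.
seq_a = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
seq_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

# Interior grid points only, so that log(0) is never evaluated.
grid = [i / 100 for i in range(1, 100)]

p_hat_a = max(grid, key=lambda p: log_likelihood(p, seq_a))
p_hat_b = max(grid, key=lambda p: log_likelihood(p, seq_b))

# Largest discrepancy between the two log likelihoods over the grid.
max_gap = max(abs(log_likelihood(p, seq_a) - log_likelihood(p, seq_b))
              for p in grid)

print(p_hat_a, p_hat_b)  # both equal the sample proportion of heads
print(max_gap)           # zero up to floating-point rounding
```

Both records give the same log likelihood at every candidate value of $p$, and hence the same maximizer, because they share the same value of the sufficient statistic.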
EXAMPLE 7.3.2 This is a generalization of Example 7.3.1. Let $X_i$, $i = 1, 2, \ldots, n$, be a discrete random variable which takes $K$ integer values $1, 2, \ldots, K$ with probabilities $p_1, p_2, \ldots, p_K$. This is called the multinomial distribution. (The subsequent argument is valid if $X_i$ takes a finite number of distinct values, not necessarily integers.) Let $n_j$, $j = 1, 2, \ldots, K$, be the number of times we observe $X = j$. (Thus $\sum_{j=1}^{K} n_j = n$.) The likelihood function is given by
(7.3.7) $L = c \prod_{j=1}^{K} p_j^{n_j}$,
where $c = n!/(n_1!\, n_2! \cdots n_K!)$. The log likelihood function is given by
(7.3.8) $\log L = \log c + \sum_{j=1}^{K} n_j \log p_j$.
Differentiate (7.3.8) with respect to $p_1, p_2, \ldots, p_{K-1}$, noting that $p_K = 1 - p_1 - p_2 - \cdots - p_{K-1}$, and set the derivatives equal to zero:
(7.3.9) $\dfrac{\partial \log L}{\partial p_j} = \dfrac{n_j}{p_j} - \dfrac{n_K}{p_K} = 0, \quad j = 1, 2, \ldots, K - 1$.
Adding the identity $n_K/p_K = n_K/p_K$ to the above, we can write the $K$ equations as
(7.3.10) $p_j = a n_j, \quad j = 1, 2, \ldots, K$,
where $a$ is a constant which does not depend on $j$. Summing both sides of (7.3.10) with respect to $j$ and noting that $\sum_{j=1}^{K} p_j = 1$ and $\sum_{j=1}^{K} n_j = n$ yields
(7.3.11) $a = \dfrac{1}{n}$.
Therefore, from (7.3.10) and (7.3.11) we obtain the maximum likelihood estimator
(7.3.12) $\hat{p}_j = \dfrac{n_j}{n}, \quad j = 1, 2, \ldots, K$.
The die example of Section 7.1.2 is a special case of this example.
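As an illustration of (7.3.12), the sketch below computes the relative frequencies for a small sample and compares the log likelihood at the estimate with its value under equal probabilities. The twelve die rolls are made-up data, not those of Section 7.1.2, and the function names `multinomial_mle` and `log_likelihood_kernel` are hypothetical.

```python
from collections import Counter
from math import log

def multinomial_mle(observations, categories):
    """Relative frequencies n_j / n, the estimator in (7.3.12)."""
    counts = Counter(observations)
    n = len(observations)
    return {j: counts[j] / n for j in categories}

def log_likelihood_kernel(probs, counts):
    """sum_j n_j log p_j, that is, (7.3.8) without the constant log c."""
    return sum(n_j * log(probs[j]) for j, n_j in counts.items() if n_j > 0)

# Hypothetical data: twelve rolls of a six-sided die.
rolls = [1, 3, 3, 6, 2, 3, 5, 6, 6, 3, 4, 1]
faces = range(1, 7)
counts = Counter(rolls)

p_hat = multinomial_mle(rolls, faces)
uniform = {j: 1 / 6 for j in faces}  # a fair die, used only for comparison

print(p_hat)
print(log_likelihood_kernel(p_hat, counts))    # value at the estimate
print(log_likelihood_kernel(uniform, counts))  # smaller value under a fair die
```

The constant $\log c$ is omitted because it is the same for every choice of the probabilities and therefore does not affect the comparison.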