Advanced Econometrics Takeshi Amemiya
Multinomial Models
9.3.1 Statistical Inference
In this section we shall define a general multinomial QR model and shall discuss maximum likelihood and minimum chi-square estimation of this model, along with the associated test statistics. In the subsequent sections we shall discuss various types of the multinomial QR model and the specific problems that arise with them.
Assuming that the dependent variable $y_i$ takes $m_i + 1$ values $0, 1, 2, \ldots, m_i$, we define a general multinomial QR model as
$$P(y_i = j) = F_{ij}(x^*, \theta), \quad i = 1, 2, \ldots, n \ \text{and} \ j = 1, 2, \ldots, m_i, \tag{9.3.1}$$
where $x^*$ and $\theta$ are vectors of independent variables and parameters, respectively. (Strictly speaking, we should write $j$ as $j_i$, but we shall suppress the subscript $i$.) We sometimes write (9.3.1) simply as $P_{ij} = F_{ij}$. We shall allow the possibility that not all the independent variables and parameters are included in the argument of every $F_{ij}$. Note that $P(y_i = 0)$ $(\equiv F_{i0})$ need not be specified because it must be equal to one minus the sum of the $m_i$ probabilities defined in (9.3.1).
It is important to let $m_i$ depend on $i$ because in many applications individuals face different choice sets. For example, in transport modal choice analysis, traveling by train is not included in the choice set of those who live outside of its service area.
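To define the maximum likelihood estimator of $\theta$ in the model (9.3.1) it is useful to define $\sum_{i=1}^{n} (m_i + 1)$ binary variables
$$y_{ij} = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{if } y_i \neq j, \end{cases} \quad i = 1, 2, \ldots, n \ \text{and} \ j = 0, 1, \ldots, m_i. \tag{9.3.2}$$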
Then we can write the log likelihood function as
$$\log L = \sum_{i=1}^{n} \sum_{j=0}^{m_i} y_{ij} \log F_{ij}, \tag{9.3.3}$$
which is a natural generalization of (9.2.7). The MLE $\hat\theta$ of $\theta$ is defined as a solution of the normal equation $\partial \log L / \partial \theta = 0$.
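As a minimal sketch of (9.3.2) and (9.3.3), the following Python fragment evaluates the log likelihood when the probabilities $F_{ij}$ are already available. The function name `multinomial_loglik` and the two-individual data set are hypothetical and serve only as an illustration; the choice sets have different sizes to reflect the dependence of $m_i$ on $i$.

```python
import numpy as np

def multinomial_loglik(y, probs):
    """Log likelihood (9.3.3): sum over i and j of y_ij * log F_ij.

    y     : observed outcomes, y[i] in {0, ..., m_i}
    probs : list of 1-D arrays, probs[i][j] = F_ij for j = 0, ..., m_i
            (lengths may differ across i because m_i may depend on i)
    """
    total = 0.0
    for y_i, F_i in zip(y, probs):
        F_i = np.asarray(F_i, dtype=float)
        y_ind = np.zeros(len(F_i))      # binary variables y_ij of (9.3.2)
        y_ind[y_i] = 1.0
        total += float(y_ind @ np.log(F_i))
    return total

# Hypothetical data: individual 0 faces three alternatives, individual 1 only two.
y = [2, 0]
probs = [np.array([0.2, 0.3, 0.5]), np.array([0.6, 0.4])]
ll = multinomial_loglik(y, probs)       # equals log(0.5) + log(0.6)
```

Only the probability of the observed alternative enters each individual's contribution, so in practice the indicator vector need not be formed explicitly.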
Many of the results about the MLE in the binary case hold for the model (9.3.1) as well. A reasonable set of sufficient conditions for its consistency and asymptotic normality can be found using the relevant theorems of Chapter 4, as we have done for the binary case. The equivalence of the method of scoring and the NLWLS (NLGLS to be exact) iteration can also be shown. However, we shall demonstrate these things under a slightly less general model than (9.3.1). We assume⁶
$$P(y_i = j) = F_j(x_{i1}'\beta_1, x_{i2}'\beta_2, \ldots, x_{iH}'\beta_H), \quad i = 1, 2, \ldots, n \ \text{and} \ j = 1, 2, \ldots, m, \tag{9.3.4}$$
where $H$ is a fixed integer. In most specific examples of the multinomial QR model we have $H = m$, but this need not be so for this analysis. Note that we have now assumed that $m$ does not depend on $i$ to simplify the analysis.
We restate Assumptions 9.2.1, 9.2.2, and 9.2.3 with obvious modifications as follows.
Assumption 9.3.1. $F_{ij}$ has partial derivatives $f_{ij}^k \equiv \partial F_{ij} / \partial(x_{ik}'\beta_k)$ and partial second-order derivatives $f_{ij}^{kl} \equiv \partial f_{ij}^k / \partial(x_{il}'\beta_l)$ for every $i$, $j$, $k$, $l$, and $0 < F_{ij} < 1$ and $f_{ij}^k > 0$ for every $i$, $j$, and $k$.
Assumption 9.3.2. The parameter space $B$ is an open bounded subset of a Euclidean space.
Assumption 9.3.3. $\{x_{ih}\}$ are uniformly bounded in $i$ for every $h$ and $\lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} x_{ih} x_{ih}'$ is a finite nonsingular matrix for every $h$. Furthermore, the empirical distribution function of $\{x_{ih}\}$ converges to a distribution function.
Under these assumptions the MLE $\hat\beta$ of $\beta = (\beta_1', \beta_2', \ldots, \beta_H')'$ can be shown to be consistent and asymptotically normal. We shall derive its asymptotic covariance matrix.
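Differentiating (9.3.3) with respect to $\beta_k$, we obtain
$$\frac{\partial \log L}{\partial \beta_k} = \sum_{i=1}^{n} \sum_{j=0}^{m} y_{ij} F_{ij}^{-1} f_{ij}^k x_{ik}. \tag{9.3.5}$$
Differentiating (9.3.5) with respect to $\beta_l'$ yields
$$\frac{\partial^2 \log L}{\partial \beta_k \, \partial \beta_l'} = -\sum_{i=1}^{n} \sum_{j=0}^{m} y_{ij} F_{ij}^{-2} f_{ij}^k f_{ij}^l x_{ik} x_{il}' + \sum_{i=1}^{n} \sum_{j=0}^{m} y_{ij} F_{ij}^{-1} f_{ij}^{kl} x_{ik} x_{il}'. \tag{9.3.6}$$
Taking the expectation and noting $\sum_{j=0}^{m} f_{ij}^{kl} = 0$, we obtain
$$-E \frac{\partial^2 \log L}{\partial \beta_k \, \partial \beta_l'} = \sum_{i=1}^{n} \sum_{j=0}^{m} F_{ij}^{-1} f_{ij}^k f_{ij}^l x_{ik} x_{il}' \equiv A_{kl}. \tag{9.3.7}$$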
Define $A = \{A_{kl}\}$. Then we can show under Assumptions 9.3.1, 9.3.2, and 9.3.3
$$\sqrt{n}(\hat\beta - \beta) \to N(0, \lim_{n \to \infty} n A^{-1}). \tag{9.3.8}$$
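To show the equivalence between the method-of-scoring and the NLWLS iteration (see Amemiya, 1976b), it is convenient to rewrite (9.3.5) and (9.3.7) using the following vector notation:
$$\begin{aligned} y_i &= (y_{i1}, y_{i2}, \ldots, y_{im})' \\ F_i &= (F_{i1}, F_{i2}, \ldots, F_{im})' \\ f_i^k &= (f_{i1}^k, f_{i2}^k, \ldots, f_{im}^k)' \\ \Lambda_i &= E(y_i - F_i)(y_i - F_i)' = D(F_i) - F_i F_i', \end{aligned} \tag{9.3.9}$$
where $D(F_i)$ is the diagonal matrix the $j$th diagonal element of which is $F_{ij}$. Then, using the identities $\sum_{j=0}^{m} F_{ij} = 1$ and $\sum_{j=0}^{m} f_{ij}^k = 0$ and noting
$$\Lambda_i^{-1} = D^{-1}(F_i) + F_{i0}^{-1} l l', \tag{9.3.10}$$
where $l$ is an $m$-vector of ones, we obtain from (9.3.5) and (9.3.7)
$$\frac{\partial \log L}{\partial \beta_k} = \sum_{i=1}^{n} x_{ik} f_i^{k\prime} \Lambda_i^{-1} (y_i - F_i) \tag{9.3.11}$$
and
$$-E \frac{\partial^2 \log L}{\partial \beta_k \, \partial \beta_l'} = \sum_{i=1}^{n} x_{ik} f_i^{k\prime} \Lambda_i^{-1} f_i^l x_{il}'. \tag{9.3.12}$$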
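The inverse formula (9.3.10), and the agreement between the scalar form of the information term in (9.3.7) and its matrix form in (9.3.12), can be checked numerically for a single individual. The sketch below assumes hypothetical values for the probabilities and derivatives; the only structure imposed is $\sum_{j=0}^{m} F_{ij} = 1$ and $\sum_{j=0}^{m} f_{ij}^k = 0$.

```python
import numpy as np

# Hypothetical probabilities F_i1, ..., F_im for one individual (m = 3).
F = np.array([0.2, 0.3, 0.1])
F0 = 1.0 - F.sum()                      # F_i0, the omitted category
ones = np.ones(len(F))                  # the m-vector l of ones

Lam = np.diag(F) - np.outer(F, F)       # Lambda_i = D(F_i) - F_i F_i'
Lam_inv = np.diag(1.0 / F) + np.outer(ones, ones) / F0   # right side of (9.3.10)
identity_gap = np.abs(Lam @ Lam_inv - np.eye(len(F))).max()

# Hypothetical derivatives f_i1^k, ..., f_im^k; f_i0^k follows from sum_j f_ij^k = 0.
f = np.array([0.05, -0.02, 0.01])
f0 = -f.sum()
scalar_form = (f**2 / F).sum() + f0**2 / F0   # sum over j = 0, ..., m (k = l case)
matrix_form = f @ Lam_inv @ f                 # quadratic form in Lambda_i^{-1}
```

Here `identity_gap` should be zero up to rounding error, and `scalar_form` and `matrix_form` should agree.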
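Suppose $\bar\beta$ is the initial estimate of $\beta$, let $\bar F_i$, $\bar f_i^k$, and $\bar\Lambda_i$ be, respectively, $F_i$, $f_i^k$, and $\Lambda_i$ evaluated at $\bar\beta$ and define
$$\bar A_{kl} = \sum_{i=1}^{n} x_{ik} \bar f_i^{k\prime} \bar\Lambda_i^{-1} \bar f_i^l x_{il}' \tag{9.3.13}$$
and
$$\bar c_k = \sum_{i=1}^{n} x_{ik} \bar f_i^{k\prime} \bar\Lambda_i^{-1} \sum_{h=1}^{H} \bar f_i^h x_{ih}' \bar\beta_h + \sum_{i=1}^{n} x_{ik} \bar f_i^{k\prime} \bar\Lambda_i^{-1} (y_i - \bar F_i). \tag{9.3.14}$$
Then the second-round estimator $\bar{\bar\beta}$ obtained by the method-of-scoring iteration started from $\bar\beta$ is given by
$$\bar{\bar\beta} = \bar A^{-1} \bar c, \tag{9.3.15}$$
where $\bar A = \{\bar A_{kl}\}$ and $\bar c = (\bar c_1', \bar c_2', \ldots, \bar c_H')'$.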
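We can interpret (9.3.15) as an NLGLS iteration. From (9.3.4) we obtain
$$y_i = F_i + u_i, \quad i = 1, 2, \ldots, n, \tag{9.3.16}$$
where $E u_i = 0$, $E u_i u_i' = \Lambda_i$, and $E u_i u_j' = 0$ for $i \neq j$. Expanding $F_i$ in Taylor series around $x_{ih}'\bar\beta_h$, $h = 1, 2, \ldots, H$, we obtain
$$y_i - \bar F_i + \sum_{h=1}^{H} \bar f_i^h x_{ih}' \bar\beta_h \cong \sum_{h=1}^{H} \bar f_i^h x_{ih}' \beta_h + u_i, \quad i = 1, 2, \ldots, n. \tag{9.3.17}$$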
The approximation (9.3.17) can be regarded as a linear regression model of $mn$ observations with a nonscalar covariance matrix. If we estimate $\Lambda_i$ by $\bar\Lambda_i$ and apply FGLS to (9.3.17), the resulting estimator is precisely (9.3.15).
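The equivalence between one method-of-scoring step and FGLS applied to the linearized regression (9.3.17) can be illustrated with a short sketch. It assumes a multinomial logit specification with $H = m$ and the same regressor vector in every index, a choice made here only for illustration; in that special case the vector of derivatives $f_i^k$ is the $k$th column of $\Lambda_i$. The function names and the simulated data are hypothetical.

```python
import numpy as np

def logit_probs(X, B):
    """F_i1, ..., F_im for a multinomial logit; column k of B is beta_k."""
    e = np.exp(X @ B)
    return e / (1.0 + e.sum(axis=1, keepdims=True))

def scoring_and_fgls(X, Y, B_bar):
    """One scoring step and one FGLS step on (9.3.17), both started at B_bar."""
    n, p = X.shape
    m = Y.shape[1]
    F = logit_probs(X, B_bar)
    b_bar = B_bar.T.reshape(-1)            # stacked (beta_1', ..., beta_m')'
    A = np.zeros((m * p, m * p))
    score = np.zeros(m * p)
    c = np.zeros(m * p)
    for i in range(n):
        Lam = np.diag(F[i]) - np.outer(F[i], F[i])
        J = Lam                            # logit only: column k of J is f_i^k
        Z = np.kron(J, X[i][None, :])      # column block k is f_i^k x_i'
        Lam_inv = np.linalg.inv(Lam)
        A += Z.T @ Lam_inv @ Z             # blocks as in (9.3.13)
        score += Z.T @ Lam_inv @ (Y[i] - F[i])   # blocks as in (9.3.11)
        w = Y[i] - F[i] + Z @ b_bar        # left-hand side of (9.3.17)
        c += Z.T @ Lam_inv @ w             # blocks as in (9.3.14)
    b_scoring = b_bar + np.linalg.solve(A, score)
    b_fgls = np.linalg.solve(A, c)
    return b_scoring, b_fgls

# Hypothetical simulated data with m = 2 and an intercept plus one regressor.
rng = np.random.default_rng(0)
n, m = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B_true = np.array([[0.5, -0.3], [1.0, 0.8]])     # column k is beta_k
P = logit_probs(X, B_true)
P_full = np.column_stack([1.0 - P.sum(axis=1), P])
y = np.array([rng.choice(m + 1, p=P_full[i]) for i in range(n)])
Y = np.eye(m + 1)[y][:, 1:]                      # y_i1, ..., y_im as in (9.3.9)

b_scoring, b_fgls = scoring_and_fgls(X, Y, np.zeros((2, m)))
```

The two returned vectors coincide up to rounding error, because the FGLS right-hand side equals $\bar A \bar\beta$ plus the score evaluated at $\bar\beta$.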
Next we shall derive the MIN $\chi^2$ estimator of $\beta$ in model (9.3.4), following Amemiya (1976b). As in Section 9.2.5, we assume that there are many observations with the same value of independent variables. Assuming that the vector of independent variables takes $T$ distinct values, we can write (9.3.4) as
$$P_{tj} = F_j(x_{t1}'\beta_1, x_{t2}'\beta_2, \ldots, x_{tH}'\beta_H), \quad t = 1, 2, \ldots, T \ \text{and} \ j = 1, 2, \ldots, m. \tag{9.3.18}$$
To define the MIN $\chi^2$ estimator of $\beta$, we must be able to invert $m$ equations in (9.3.18) for every $t$ and solve them for $H$ variables $x_{t1}'\beta_1, x_{t2}'\beta_2, \ldots, x_{tH}'\beta_H$. Thus for the moment, assume $H = m$. Assuming, furthermore, that the Jacobian does not vanish, we obtain
$$x_{tk}'\beta_k = G_k(P_{t1}, P_{t2}, \ldots, P_{tm}), \quad t = 1, 2, \ldots, T \ \text{and} \ k = 1, 2, \ldots, m. \tag{9.3.19}$$
As in Section 9.2.5 we define $r_{tj} = \sum_{i \in I_t} y_{ij}$, where $I_t$ is the set of $i$ for which $x_{ih} = x_{th}$ for all $h$, and $\hat P_{tj} = r_{tj}/n_t$, where $n_t$ is the number of integers in $I_t$. Expanding $G_k(\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm})$ in Taylor series around $(P_{t1}, P_{t2}, \ldots, P_{tm})$ and using (9.3.19), we obtain
$$G_k(\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm}) \cong x_{tk}'\beta_k + \sum_{j=1}^{m} g_{tk}^j (\hat P_{tj} - F_{tj}), \quad t = 1, 2, \ldots, T \ \text{and} \ k = 1, 2, \ldots, m, \tag{9.3.20}$$
where $g_{tk}^j = \partial G_k / \partial P_{tj}$. Equation (9.3.20) is a generalization of (9.2.30), but here we write it as an approximate equation, ignoring an error term that corresponds to $w_t$ in (9.2.30). Equation (9.3.20) is an approximate linear regression equation with a nonscalar covariance matrix that depends on $\Lambda_t \equiv E(\hat P_t - F_t)(\hat P_t - F_t)' = n_t^{-1}[D(F_t) - F_t F_t']$, where $\hat P_t = (\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm})'$ and $F_t = (F_{t1}, F_{t2}, \ldots, F_{tm})'$. The MIN $\chi^2$ estimator $\tilde\beta$ of $\beta$ is defined as FGLS applied to (9.3.20), using $\hat\Lambda_t = n_t^{-1}[D(\hat P_t) - \hat P_t \hat P_t']$ as an estimator of $\Lambda_t$. An example of the MIN $\chi^2$ estimator will be given in Example 9.3.1 in Section 9.3.2.
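A minimal sketch of the MIN $\chi^2$ procedure follows. It assumes a multinomial logit specification with the same regressor vector in every index, so that the inverse functions in (9.3.19) are $G_k(P_t) = \log(P_{tk}/P_{t0})$ with derivatives $g_{tk}^j = \delta_{jk}/P_{tk} + 1/P_{t0}$; this choice, the function name `min_chi2_logit`, and the simulated cell counts are illustrative assumptions and are not taken from the text. All cell proportions must be strictly positive.

```python
import numpy as np

def min_chi2_logit(X, R, n_cell):
    """FGLS on (9.3.20) for a multinomial logit with grouped data.

    X      : T x p matrix, row t is x_t' (shared by all m indices)
    R      : T x (m+1) matrix of counts r_tj, column 0 is category 0
    n_cell : length-T vector of cell sizes n_t
    Returns an m x p array whose row k estimates beta_k'.
    """
    T, p = X.shape
    m = R.shape[1] - 1
    P_hat = R / n_cell[:, None]
    lhs = np.zeros((m * p, m * p))
    rhs = np.zeros(m * p)
    for t in range(T):
        P0, P = P_hat[t, 0], P_hat[t, 1:]
        g = np.log(P / P0)                            # G_k evaluated at P_hat_t
        Gt = np.diag(1.0 / P) + np.ones((m, m)) / P0  # entry (k, j) is dG_k/dP_tj
        Lam = (np.diag(P) - np.outer(P, P)) / n_cell[t]   # Lambda_t estimate
        V = Gt @ Lam @ Gt.T                           # error covariance in (9.3.20)
        W = np.linalg.inv(V)
        Z = np.kron(np.eye(m), X[t][None, :])         # row k holds x_t' for beta_k
        lhs += Z.T @ W @ Z
        rhs += Z.T @ W @ g
    return np.linalg.solve(lhs, rhs).reshape(m, p)

# Hypothetical design: T = 6 cells, intercept plus one regressor, m = 2.
X = np.column_stack([np.ones(6), np.linspace(-1.0, 1.0, 6)])
B_true = np.array([[0.5, 1.0], [-0.3, 0.8]])          # row k is beta_k'
e = np.exp(X @ B_true.T)
P = e / (1.0 + e.sum(axis=1, keepdims=True))
P_full = np.column_stack([1.0 - P.sum(axis=1), P])
n_cell = np.full(6, 500)

rng = np.random.default_rng(1)
R = np.array([rng.multinomial(n_cell[t], P_full[t]) for t in range(6)])
B_hat = min_chi2_logit(X, R, n_cell)
```

When the observed proportions equal the true probabilities exactly, the regression (9.3.20) fits without error and the function returns `B_true`; with sampled counts, `B_hat` differs from `B_true` by sampling error.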
The consistency and the asymptotic normality of $\tilde\beta$ can be proved by using a method similar to the one employed in Section 9.2.5. Let $\Omega^{-1}$ be the asymptotic covariance matrix of $\tilde\beta$, that is, $\sqrt{n}(\tilde\beta - \beta) \to N(0, \lim_{n \to \infty} n\Omega^{-1})$. Then it can be deduced from (9.3.20) that the $k,l$th subblock of $\Omega$ is given by
$$\Omega_{kl} = \sum_{t} x_{tk} \{G_t' \Lambda_t G_t\}^{-1}_{kl} x_{tl}', \tag{9.3.21}$$
where $G_t'$ is an $m \times m$ matrix the $k$th row of which is equal to $(g_{tk}^1, g_{tk}^2, \ldots, g_{tk}^m)$ and $\{\ \}^{-1}_{kl}$ denotes the $k,l$th element of the inverse of the matrix inside $\{\ \}$. Now, we obtain from (9.3.12)
$$A_{kl} = \sum_{t} x_{tk} \{F_t' \Lambda_t^{-1} F_t\}_{kl} x_{tl}', \tag{9.3.22}$$
where $F_t'$ is an $m \times m$ matrix the $k$th row of which is equal to $(f_{t1}^k, f_{t2}^k, \ldots, f_{tm}^k)$. Thus the asymptotic equivalence of MIN $\chi^2$ and MLE follows from the identity $G_t^{-1} = F_t'$.
In the preceding discussion we assumed $H = m$. If $H < m$, we can still invert (9.3.18) and obtain
$$x_{tk}'\beta_k = G_k(P_{t1}, P_{t2}, \ldots, P_{tm}), \quad t = 1, 2, \ldots, T \ \text{and} \ k = 1, 2, \ldots, H, \tag{9.3.23}$$
but the choice of function $G_k$ is not unique. Amemiya (1976b) has shown that the MIN $\chi^2$ estimator is asymptotically efficient only when we choose the correct function $G_k$ from the many possible ones. This fact diminishes the usefulness of the method. For this case Amemiya (1977c) proposed the following method, which always leads to an asymptotically efficient estimator.
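Step 1. Use some $G_k$ in (9.3.23) and define $\hat\mu_{tk} = G_k(\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm})$.
Step 2. In (9.3.17) replace $\bar F_i$ and $\bar f_i^h$ by $F_t$ and $f_t^h$ evaluated at $\hat\mu_{tk}$ and replace $x_{ih}'\bar\beta_h$ by $\hat\mu_{th}$.
Step 3. Apply FGLS on (9.3.17) using $\Lambda_t$ evaluated at $\hat\mu_{tk}$.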
We shall conclude this subsection by generalizing WSSR defined by (9.2.42) and (9.2.45) to the multinomial case.
We shall not write the analog of (9.2.42) explicitly because it would require cumbersome notation, although the idea is simple. Instead, we shall merely point out that it is of the form $(y - X\tilde\beta)'\hat\Sigma^{-1}(y - X\tilde\beta)$ obtained from the regression equation (9.3.20). It is asymptotically distributed as chi-square with degrees of freedom equal to $mT$ minus the number of regression parameters.
Next, consider the generalization of (9.2.45). Define the vector F, =
Then, the analog of (9.2.45) is
г-і j-o 1 tj
It is asymptotically equivalent to the analog of (9.2.42).
9.3.2 Ordered Models
Multinomial QR models can be classified into ordered and unordered models. In this subsection we shall discuss ordered models, and in the remainder of Section 9.3 we shall discuss various types of unordered models.
A general definition of the ordered model is as follows.
Definition 9.3.1. The ordered model is defined by
$$P(y = j \mid x, \theta) = p(S_j), \quad j = 0, 1, \ldots, m,$$
for some probability measure $p$ depending on $x$ and $\theta$ and a finite sequence of successive intervals $\{S_j\}$ depending on $x$ and $\theta$ such that $\bigcup_j S_j = R$, the real line.
A model is unordered if it is not ordered. In other words, in the ordered model the values that $y$ takes correspond to a partition of the real line, whereas in the unordered model they correspond either to a nonsuccessive partition of the real line or to a partition of a higher-dimensional Euclidean space.
In most applications the ordered model takes the simpler form
$$P(y = j \mid x, \alpha, \beta) = F(\alpha_{j+1} - x'\beta) - F(\alpha_j - x'\beta), \quad j = 0, 1, \ldots, m, \tag{9.3.25}$$
$$\alpha_0 = -\infty, \quad \alpha_j \leq \alpha_{j+1}, \quad \alpha_{m+1} = \infty,$$
for some distribution function $F$. If $F = \Phi$, (9.3.25) defines the ordered probit model; and if $F = \Lambda$, it defines the ordered logit model. Pratt (1981) showed that the log likelihood function of the model (9.3.25) based on observations $(y_i, x_i)$, $i = 1, 2, \ldots, n$, on $(y, x)$ is globally concave if $f$, the derivative of $F$, is positive and $\log f$ is concave.
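The probabilities in (9.3.25) are differences of $F$ evaluated at successive thresholds. The sketch below computes them for the ordered logit case; the function names and the numerical values of $x$, $\beta$, and the thresholds are hypothetical, and any other distribution function could be passed in place of the logistic one.

```python
import numpy as np

def logistic_cdf(z):
    """Logistic distribution function Lambda(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def ordered_probs(x, beta, alpha, F=logistic_cdf):
    """P(y = j | x), j = 0, ..., m, from (9.3.25).

    alpha holds the interior thresholds alpha_1 <= ... <= alpha_m;
    alpha_0 = -inf and alpha_{m+1} = +inf are appended here.
    """
    cuts = np.concatenate(([-np.inf], alpha, [np.inf]))
    cdf = F(cuts - x @ beta)
    return np.diff(cdf)

# Hypothetical values: x'beta = 0 and thresholds at -1 and 1 (m = 2).
x = np.array([1.0, 2.0])
beta = np.array([0.5, -0.25])
alpha = np.array([-1.0, 1.0])
p = ordered_probs(x, beta, alpha)   # three probabilities that sum to one
```

Because $x'\beta = 0$ and the thresholds are symmetric about zero in this example, the first and last probabilities are equal.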
The model (9.3.25) is motivated by considering an unobserved continuous random variable $y^*$ that determines the outcome of $y$ by the rule
$$y = j \quad \text{if and only if} \quad \alpha_j < y^* < \alpha_{j+1}, \quad j = 0, 1, \ldots, m. \tag{9.3.26}$$
If the distribution function of $y^* - x'\beta$ is $F$, (9.3.26) implies (9.3.25).
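The link between the latent-variable rule (9.3.26) and the probabilities (9.3.25) can be illustrated by simulation. The sketch assumes a logistic error, a single hypothetical value of $x'\beta$, and two hypothetical thresholds; the observed frequencies of $y$ should be close to the probabilities implied by (9.3.25).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha = np.array([-1.0, 1.0])          # hypothetical thresholds alpha_1, alpha_2
xb = 0.3                               # hypothetical value of x'beta

y_star = xb + rng.logistic(size=n)     # y* - x'beta has distribution function Lambda
y = np.searchsorted(alpha, y_star)     # y = j when alpha_j < y* <= alpha_{j+1}
freq = np.bincount(y, minlength=len(alpha) + 1) / n

cdf = 1.0 / (1.0 + np.exp(-(alpha - xb)))               # Lambda(alpha_j - x'beta)
theory = np.diff(np.concatenate(([0.0], cdf, [1.0])))   # right side of (9.3.25)
```

Whether the inequalities in (9.3.26) are strict does not matter here, because $y^*$ is continuous and ties with a threshold occur with probability zero.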
In empirical applications of the ordered model, $y^*$ corresponds to a certain interpretative concept. For example, in the study of the effect of an insecticide by Gurland, Lee, and Dahm (1960), $y^*$ signifies the tolerance of an insect against the insecticide. Depending on the value of $y^*$, $y$ takes three discrete values corresponding to the three states of an insect: dead, moribund, and alive. In the study by David and Legg (1975), $y^*$ is the unobserved price of a house, and the observed values of $y$ correspond to various ranges of the price of a house. In the study by Silberman and Talley (1974), $y^*$ signifies the excess demand for banking in a particular location and $y$ the number of chartered bank offices in the location. See also Example 9.4.1 in Section 9.4.1.
The use of the ordered model is less common in econometric applications than in biometric applications. This must be due to the fact that economic phenomena are complex and difficult to explain in terms of only a single unobserved index variable. We should be cautious in using an ordered model because if the true model is unordered, an ordered model can lead to serious biases in the estimation of the probabilities. On the other hand, the cost of using an unordered model when the true model is ordered is a loss of efficiency rather than consistency.
We shall conclude this subsection by giving an econometric example of an ordered model, which is also an interesting application of the MIN $\chi^2$ method discussed in Section 9.3.1.
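Example 9.3.1 (Deacon and Shapiro, 1975). In this article Deacon and Shapiro analyzed the voting behavior of Californians in two recent referenda: Rapid Transit Initiative (November 1970) and Coastal Zone Conservation Act (November 1972). We shall take up only the former. Let $\Delta U_i$ be the difference between the utilities resulting from rapid transit and no rapid transit for the $i$th individual. Deacon and Shapiro assumed that $\Delta U_i$ is distributed logistically with mean $\mu_i$, that is, $P(\Delta U_i < x) = \Lambda(x - \mu_i)$, and that the individual vote is determined by the rule
$$\text{Vote yes if } \Delta U_i > \delta_i, \quad \text{vote no if } \Delta U_i < -\delta_i, \quad \text{abstain otherwise.} \tag{9.3.27}$$
Writing $P_i(Y)$ and $P_i(N)$ for the probabilities that the $i$th individual votes yes and no, respectively, we have
$$P_i(Y) = \Lambda(\mu_i - \delta_i) \tag{9.3.28}$$
and
$$P_i(N) = \Lambda(-\mu_i - \delta_i). \tag{9.3.29}$$
Deacon and Shapiro assumed $\mu_i = x_i'\beta_1$ and $\delta_i = x_i'\beta_2$, where $x_i$ is a vector of independent variables and some elements of $\beta_1$ and $\beta_2$ are a priori specified to be zeros. (Note that if $\delta_i = 0$, the model becomes a univariate binary logit model.)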
The model (9.3.27) could be estimated by MLE if the individual votes were recorded and $x_i$ were observable. But, obviously, they are not: We only know the proportion of yes votes and no votes in districts and observed average values of $x_i$ or their proxies in each district. Thus we are forced to use a method suitable for the case of many observations per cell.⁷ Deacon and Shapiro used data on 334 California cities. For this analysis it is necessary to invoke the assumption that $x_i = x_t$ for all $i \in I_t$, where $I_t$ is the set of individuals living in the $t$th city. Then we obtain from (9.3.28) and (9.3.29)
$$\log \frac{P_t(Y)}{1 - P_t(Y)} = x_t'(\beta_1 - \beta_2) \tag{9.3.30}$$
and
$$\log \frac{P_t(N)}{1 - P_t(N)} = -x_t'(\beta_1 + \beta_2). \tag{9.3.31}$$
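Let $\hat P_t(Y)$ and $\hat P_t(N)$ be the proportion of yes and no votes in the $t$th city. Then, evaluating the left-hand sides of (9.3.30) and (9.3.31) at these proportions and expanding them by Taylor series around $P_t(Y)$ and $P_t(N)$, respectively, we obtain the approximate regression equations
$$\log \frac{\hat P_t(Y)}{1 - \hat P_t(Y)} \cong x_t'(\beta_1 - \beta_2) + \frac{1}{P_t(Y)[1 - P_t(Y)]} [\hat P_t(Y) - P_t(Y)] \tag{9.3.32}$$
and
$$\log \frac{\hat P_t(N)}{1 - \hat P_t(N)} \cong -x_t'(\beta_1 + \beta_2) + \frac{1}{P_t(N)[1 - P_t(N)]} [\hat P_t(N) - P_t(N)]. \tag{9.3.33}$$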
Note that (9.3.32) and (9.3.33) constitute a special case of (9.3.20). The error terms of these two equations are heteroscedastic and, moreover, correlated with each other. The covariance between the error terms can be obtained from the result $\operatorname{Cov}[\hat P_t(Y), \hat P_t(N)] = -n_t^{-1} P_t(Y) P_t(N)$. The MIN $\chi^2$ estimates of $(\beta_1 - \beta_2)$ and $-(\beta_1 + \beta_2)$ are obtained by applying generalized least squares to (9.3.32) and (9.3.33), taking into account both heteroscedasticity and the correlation.⁸
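The two-equation generalized least squares step can be sketched as follows. This is an illustration under simplifying assumptions, not a reproduction of Deacon and Shapiro's computation: the same regressor vector $x_t$ enters both equations with no zero restrictions, the unknown probabilities in the error covariance are replaced by the observed proportions, and the function name `min_chi2_votes` and all data are hypothetical. The error variances and covariance follow from (9.3.32), (9.3.33), and the covariance result quoted above.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def min_chi2_votes(X, pY, pN, n_city):
    """GLS on (9.3.32)-(9.3.33); returns estimates of beta_1 and beta_2."""
    T, k = X.shape
    lhs = np.zeros((2 * k, 2 * k))
    rhs = np.zeros(2 * k)
    for t in range(T):
        z = np.array([logit(pY[t]), logit(pN[t])])
        vY = 1.0 / (n_city[t] * pY[t] * (1.0 - pY[t]))    # variance, yes equation
        vN = 1.0 / (n_city[t] * pN[t] * (1.0 - pN[t]))    # variance, no equation
        cYN = -1.0 / (n_city[t] * (1.0 - pY[t]) * (1.0 - pN[t]))  # covariance
        V = np.array([[vY, cYN], [cYN, vN]])
        W = np.linalg.inv(V)
        Z = np.kron(np.eye(2), X[t][None, :])   # rows [x_t', 0] and [0, x_t']
        lhs += Z.T @ W @ Z
        rhs += Z.T @ W @ z
    g = np.linalg.solve(lhs, rhs)
    g1, g2 = g[:k], g[k:]          # g1 ~ beta_1 - beta_2, g2 ~ -(beta_1 + beta_2)
    return (g1 - g2) / 2.0, -(g1 + g2) / 2.0

# Hypothetical cities with an intercept and one regressor.
X = np.column_stack([np.ones(8), np.linspace(-1.0, 1.0, 8)])
beta1 = np.array([0.2, 0.5])
beta2 = np.array([0.4, 0.0])       # delta_t = 0.4 > 0 in every city
Lam = lambda v: 1.0 / (1.0 + np.exp(-v))
pY = Lam(X @ (beta1 - beta2))      # (9.3.28) at the city level
pN = Lam(-X @ (beta1 + beta2))     # (9.3.29) at the city level
n_city = np.full(8, 1000)
b1_hat, b2_hat = min_chi2_votes(X, pY, pN, n_city)
```

Because the proportions used here equal the model probabilities exactly, the estimates reproduce `beta1` and `beta2`; with sampled vote shares they would differ by sampling error.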