Advanced Econometrics Takeshi Amemiya
Discriminant Analysis
The purpose of discriminant analysis is to measure the characteristics of an individual or an object and, on the basis of the measurements, to classify the individual or the object into one of two possible groups. For example, one may wish to accept or reject a college applicant on the basis of examination scores, or to determine whether a particular skull belongs to a man or an anthropoid on the basis of its measurements.
We can state the problem statistically as follows: Supposing that the vector of random variables $x^*$ is generated according to either a density $g_1$ or $g_0$, we are to classify a given observation on $x^*$, denoted $x_i^*$, into the group characterized by either $g_1$ or $g_0$. It is useful to define $y_i = 1$ if $x_i^*$ is generated by $g_1$ and $y_i = 0$ if it is generated by $g_0$. We are to predict $y_i$ on the basis of $x_i^*$. The essential information needed for the prediction is the conditional probability $P(y_i = 1 \mid x_i^*)$. We shall ignore the problem of prediction given the conditional probability and address the question of how to specify and estimate the conditional probability.
By Bayes’s rule we have
$$P(y_i = 1 \mid x_i^*) = \frac{g_1(x_i^*)\,q_1}{g_1(x_i^*)\,q_1 + g_0(x_i^*)\,q_0}, \tag{9.2.46}$$
where $q_1$ and $q_0$ denote the marginal probabilities $P(y_i = 1)$ and $P(y_i = 0)$, respectively. We shall evaluate (9.2.46), assuming that $g_1$ and $g_0$ are the densities of $N(\mu_1, \Sigma_1)$ and $N(\mu_0, \Sigma_0)$, respectively. We state this assumption formally as
$$x_i^* \mid (y_i = 1) \sim N(\mu_1, \Sigma_1), \qquad x_i^* \mid (y_i = 0) \sim N(\mu_0, \Sigma_0). \tag{9.2.47}$$
This is the most commonly used form of discriminant analysis, sometimes referred to as normal discriminant analysis.
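As a numerical illustration of (9.2.46) under (9.2.47), the following minimal sketch evaluates the posterior probability for a scalar $x^*$. The scalar restriction, the function names, and the parameter values are assumptions introduced here for brevity; they do not come from the text.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x (scalar case, assumed for brevity)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_y1(x, mu1, var1, mu0, var0, q1):
    """P(y = 1 | x) from Bayes's rule (9.2.46) with normal g1 and g0."""
    q0 = 1.0 - q1
    numerator = normal_pdf(x, mu1, var1) * q1
    return numerator / (numerator + normal_pdf(x, mu0, var0) * q0)

# Hypothetical values: equal variances, equal priors, x halfway between the means.
p = posterior_y1(x=0.5, mu1=1.0, var1=1.0, mu0=0.0, var0=1.0, q1=0.5)
```

With these symmetric inputs the two weighted densities coincide, so `p` equals 0.5, as one would expect for a point equidistant from both group means.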
Under (9.2.47), (9.2.46) reduces to the following quadratic logit model:
$$P(y_i = 1 \mid x_i^*) = \Lambda(\beta_{(1)} + \beta_{(2)}' x_i^* + x_i^{*\prime} A x_i^*), \tag{9.2.48}$$
where
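$$\beta_{(1)} = \tfrac{1}{2}\mu_0'\Sigma_0^{-1}\mu_0 - \tfrac{1}{2}\mu_1'\Sigma_1^{-1}\mu_1 + \log q_1 - \log q_0 - \tfrac{1}{2}\log|\Sigma_1| + \tfrac{1}{2}\log|\Sigma_0|, \tag{9.2.49}$$
$$\beta_{(2)} = \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0, \tag{9.2.50}$$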
and
$$A = \tfrac{1}{2}(\Sigma_0^{-1} - \Sigma_1^{-1}). \tag{9.2.51}$$
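The reduction can be verified by writing out the log odds implied by (9.2.46) under the normality assumption (9.2.47). The worked step below is a sketch supplied for the reader and is not part of the original derivation; it uses $\Lambda(z) = e^z/(1 + e^z)$, the logistic distribution function.

```latex
\begin{aligned}
\log\frac{g_1(x_i^*)\,q_1}{g_0(x_i^*)\,q_0}
 &= \log q_1 - \log q_0 - \tfrac{1}{2}\log|\Sigma_1| + \tfrac{1}{2}\log|\Sigma_0| \\
 &\quad - \tfrac{1}{2}(x_i^* - \mu_1)'\Sigma_1^{-1}(x_i^* - \mu_1)
        + \tfrac{1}{2}(x_i^* - \mu_0)'\Sigma_0^{-1}(x_i^* - \mu_0) \\
 &= \beta_{(1)} + \beta_{(2)}' x_i^* + x_i^{*\prime} A x_i^* ,
\end{aligned}
\qquad\text{so}\qquad
P(y_i = 1 \mid x_i^*) = \frac{1}{1 + \exp\{-(\beta_{(1)} + \beta_{(2)}' x_i^* + x_i^{*\prime} A x_i^*)\}} .
```

Expanding the two quadratic forms and collecting the constant, linear, and quadratic terms in $x_i^*$ gives the three coefficient expressions stated above.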
In the special case $\Sigma_1 = \Sigma_0$, which is often assumed in econometric applications, we have $A = 0$; therefore (9.2.48) further reduces to a linear logit model:
$$P(y_i = 1 \mid x_i^*) = \Lambda(x_i'\beta), \tag{9.2.52}$$
where we have written $\beta_{(1)} + \beta_{(2)}' x_i^* = x_i'\beta$ to conform with the notation of Section 9.2.1.
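The following sketch checks numerically, for a scalar $x^*$, that the coefficients in (9.2.49) through (9.2.51) reproduce the log odds obtained directly from Bayes's rule, and that the quadratic coefficient vanishes when the two variances are equal. The scalar setting, the function names, and all parameter values are illustrative assumptions.

```python
import math

def quadratic_logit_coefficients(mu1, var1, mu0, var0, q1):
    """Scalar versions of (9.2.49)-(9.2.51); var1 and var0 stand in for Sigma_1 and Sigma_0."""
    q0 = 1.0 - q1
    beta1 = (0.5 * mu0 ** 2 / var0 - 0.5 * mu1 ** 2 / var1
             + math.log(q1) - math.log(q0)
             - 0.5 * math.log(var1) + 0.5 * math.log(var0))
    beta2 = mu1 / var1 - mu0 / var0
    a = 0.5 * (1.0 / var0 - 1.0 / var1)
    return beta1, beta2, a

def log_normal_pdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def log_odds_bayes(x, mu1, var1, mu0, var0, q1):
    """log[g1(x) q1 / (g0(x) q0)] computed directly from the two densities."""
    return (log_normal_pdf(x, mu1, var1) + math.log(q1)
            - log_normal_pdf(x, mu0, var0) - math.log(1.0 - q1))

# Hypothetical parameter values with unequal variances.
params = dict(mu1=1.0, var1=2.0, mu0=-0.5, var0=0.5, q1=0.3)
x = 0.7
beta1, beta2, a = quadratic_logit_coefficients(**params)
index_value = beta1 + beta2 * x + a * x * x   # argument of Lambda in (9.2.48)
bayes_value = log_odds_bayes(x, **params)     # should agree with index_value

# Equal variances: the quadratic coefficient is zero, giving the linear logit model.
_, _, a_equal = quadratic_logit_coefficients(mu1=1.0, var1=2.0, mu0=-0.5, var0=2.0, q1=0.3)
```

Agreement between `index_value` and `bayes_value` at an arbitrary point is only a spot check, not a proof, but it is a quick way to catch a sign error in any of the three coefficient formulas.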
Let us consider the ML estimation of the parameters $\mu_1$, $\mu_0$, $\Sigma_1$, $\Sigma_0$, $q_1$, and $q_0$ based on observations $(y_i, x_i^*)$, $i = 1, 2, \ldots, n$. The determination of $q_1$ and $q_0$ varies with authors. We shall adopt the approach of Warner (1963) and treat $q_1$ and $q_0$ as unknown parameters to estimate. The likelihood function can be written as
$$L = \prod_{i=1}^{n} [g_1(x_i^*)\,q_1]^{y_i}\,[g_0(x_i^*)\,q_0]^{1-y_i}. \tag{9.2.53}$$
Equating the derivatives of log L to 0 yields the following ML estimators:
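$$\hat q_1 = \frac{n_1}{n}, \tag{9.2.54}$$
where $n_1 = \sum_{i=1}^{n} y_i$,
$$\hat q_0 = \frac{n_0}{n}, \tag{9.2.55}$$
where $n_0 = n - n_1$,
$$\hat\mu_1 = \frac{1}{n_1}\sum_{i=1}^{n} y_i x_i^*, \tag{9.2.56}$$
$$\hat\mu_0 = \frac{1}{n_0}\sum_{i=1}^{n} (1 - y_i)\,x_i^*, \tag{9.2.57}$$
$$\hat\Sigma_1 = \frac{1}{n_1}\sum_{i=1}^{n} y_i\,(x_i^* - \hat\mu_1)(x_i^* - \hat\mu_1)', \tag{9.2.58}$$
and
$$\hat\Sigma_0 = \frac{1}{n_0}\sum_{i=1}^{n} (1 - y_i)\,(x_i^* - \hat\mu_0)(x_i^* - \hat\mu_0)'. \tag{9.2.59}$$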
If $\Sigma_1 = \Sigma_0$ $(\equiv \Sigma)$, as is often assumed, (9.2.58) and (9.2.59) should be replaced by
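$$\hat\Sigma = \frac{1}{n}\left[\sum_{i=1}^{n} y_i\,(x_i^* - \hat\mu_1)(x_i^* - \hat\mu_1)' + \sum_{i=1}^{n} (1 - y_i)\,(x_i^* - \hat\mu_0)(x_i^* - \hat\mu_0)'\right]. \tag{9.2.60}$$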
The ML estimators of $\beta_{(1)}$, $\beta_{(2)}$, and $A$ are obtained by inserting these estimates into the right-hand side of (9.2.49), (9.2.50), and (9.2.51).
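The plug-in procedure can be sketched as follows for a scalar $x^*$ with a common variance, so that the log-determinant terms in (9.2.49) cancel and $A = 0$. The function name `da_estimator` and the six-observation data set are hypothetical and serve only to make the arithmetic easy to follow.

```python
import math

def da_estimator(x, y):
    """Plug-in estimates of (beta_(1), beta_(2)) for scalar x* under a common variance."""
    n = len(y)
    n1 = sum(y)
    n0 = n - n1
    q1_hat, q0_hat = n1 / n, n0 / n                              # sample proportions
    mu1_hat = sum(xi * yi for xi, yi in zip(x, y)) / n1          # mean of x* in group 1
    mu0_hat = sum(xi * (1 - yi) for xi, yi in zip(x, y)) / n0    # mean of x* in group 0
    # Pooled variance with divisor n (the ML estimate, not the unbiased one).
    sigma_hat = sum(yi * (xi - mu1_hat) ** 2 + (1 - yi) * (xi - mu0_hat) ** 2
                    for xi, yi in zip(x, y)) / n
    # (9.2.49) and (9.2.50) with Sigma_1 = Sigma_0.
    beta1 = (0.5 * mu0_hat ** 2 / sigma_hat - 0.5 * mu1_hat ** 2 / sigma_hat
             + math.log(q1_hat) - math.log(q0_hat))
    beta2 = (mu1_hat - mu0_hat) / sigma_hat
    return beta1, beta2

# Hypothetical data: three observations in each group.
x = [2.0, 3.0, 4.0, 0.0, 1.0, 2.0]
y = [1, 1, 1, 0, 0, 0]
beta1_da, beta2_da = da_estimator(x, y)
```

For these data the group means are 3 and 1 and the pooled variance is 2/3, so the estimates are approximately -6 and 3; the implied classification boundary, where the index is zero, lies at $x^* = 2$, midway between the two means.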
Discriminant analysis is frequently used in transport modal choice analysis. See, for example, articles by Warner (1962) and McGillivray (1972).
We call the model defined by (9.2.47) with $\Sigma_1 = \Sigma_0$ and by (9.2.52) the discriminant analysis (DA) model and call the estimator of $\beta \equiv (\beta_{(1)}, \beta_{(2)}')'$ obtained by inserting (9.2.56), (9.2.57), and (9.2.60) into (9.2.49) and (9.2.50) with $\Sigma_1 = \Sigma_0$ the DA estimator, denoted $\hat\beta_{DA}$. In contrast, if we assume only (9.2.52) and not (9.2.47), we have a logit model. We denote the logit MLE of $\beta$ by $\hat\beta_\Lambda$. In the remainder of this section we shall compare these two estimators.
The relative performance of the two estimators will critically depend on the assumed true distribution for $x_i^*$. If (9.2.47) with $\Sigma_1 = \Sigma_0$ is assumed in addition to (9.2.52), the DA estimator is the genuine MLE and therefore should be asymptotically more efficient than the logit MLE. However, if (9.2.47) is not assumed, the DA estimator loses its consistency in general, whereas the logit MLE retains its consistency. Thus we would expect the logit MLE to be more robust.
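For contrast with the plug-in DA formulas, the logit MLE uses only (9.2.52) and maximizes the conditional likelihood of $y_i$ given $x_i^*$. The sketch below computes it by Newton-Raphson iteration for a scalar regressor plus a constant. The choice of Newton-Raphson, the function names, and the small overlapping data set are assumptions made for illustration; the text does not prescribe a particular algorithm.

```python
import math

def logistic(z):
    """Logistic distribution function Lambda(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_mle(x, y, iterations=25):
    """Newton-Raphson for P(y = 1 | x) = Lambda(b1 + b2 * x) with scalar x."""
    b1, b2 = 0.0, 0.0
    for _ in range(iterations):
        p = [logistic(b1 + b2 * xi) for xi in x]
        # Score vector of the log likelihood.
        g1 = sum(yi - pi for yi, pi in zip(y, p))
        g2 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (a symmetric 2 x 2 matrix).
        w = [pi * (1.0 - pi) for pi in p]
        h11 = sum(w)
        h12 = sum(wi * xi for wi, xi in zip(w, x))
        h22 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h11 * h22 - h12 * h12
        # Newton step: add (negative Hessian)^(-1) times the score.
        b1 += (h22 * g1 - h12 * g2) / det
        b2 += (h11 * g2 - h12 * g1) / det
    return b1, b2

# Hypothetical data in which the two groups overlap, so the MLE is finite.
x = [0.0, 1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
b1_logit, b2_logit = logit_mle(x, y)
```

The data are symmetric about $x^* = 2.5$, so the fitted boundary $-b_1/b_2$ falls at that point. If the two groups were perfectly separated, the logit MLE would not exist and the iteration would diverge, whereas the DA formulas would still return finite values.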
Efron (1975) assumed the DA model to be the correct model and studied the loss of efficiency that results if $\beta$ is estimated by the logit MLE. He used the asymptotic mean of the error rate as a measure of the inefficiency of an estimator. Conditional on a given estimator $\hat\beta$ (be it $\hat\beta_{DA}$ or $\hat\beta_\Lambda$), the error rate is defined by
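$$\begin{aligned}
\text{Error Rate} &= P[x'\hat\beta \ge 0 \mid x \sim N(\mu_0, \Sigma)]\,q_0 + P[x'\hat\beta < 0 \mid x \sim N(\mu_1, \Sigma)]\,q_1 \\
&= q_0\,\Phi[(\hat\beta_{(2)}'\Sigma\hat\beta_{(2)})^{-1/2}(\mu_0'\hat\beta_{(2)} + \hat\beta_{(1)})] + q_1\,\Phi[-(\hat\beta_{(2)}'\Sigma\hat\beta_{(2)})^{-1/2}(\mu_1'\hat\beta_{(2)} + \hat\beta_{(1)})].
\end{aligned} \tag{9.2.61}$$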
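The error rate in (9.2.61) can be evaluated directly once an estimate is given, because the index $\hat\beta_{(1)} + \hat\beta_{(2)}' x^*$ is normally distributed within each group. The sketch below does this for a scalar $x^*$; the function names and the parameter values are hypothetical, and the standard normal distribution function $\Phi$ is computed from `math.erf`.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_rate(beta1, beta2, mu0, mu1, var, q1):
    """Error rate (9.2.61) for scalar x*, conditional on the estimate (beta1, beta2).

    Assumes beta2 is nonzero, so that the index has positive variance.
    """
    q0 = 1.0 - q1
    scale = math.sqrt(beta2 * var * beta2)   # standard deviation of beta2 * x*
    misclassified_group0 = std_normal_cdf((mu0 * beta2 + beta1) / scale)
    misclassified_group1 = std_normal_cdf(-(mu1 * beta2 + beta1) / scale)
    return q0 * misclassified_group0 + q1 * misclassified_group1

# Hypothetical DA model: mu0 = 0, mu1 = 2, common variance 1, q1 = 0.5.
# The true coefficients from (9.2.49)-(9.2.50) are then beta1 = -2 and beta2 = 2.
rate = error_rate(beta1=-2.0, beta2=2.0, mu0=0.0, mu1=2.0, var=1.0, q1=0.5)
```

At the true coefficients both groups are misclassified with probability $\Phi(-1)$, so `rate` is about 0.1587. Any other coefficient pair yields an error rate at least as large, and the excess is the inefficiency that Efron averaged over the asymptotic distribution of each estimator.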
Efron derived the asymptotic mean of (9.2.61) for each of the cases $\hat\beta = \hat\beta_{DA}$ and $\hat\beta = \hat\beta_\Lambda$, using the asymptotic distributions of the two estimators. Defining the relative efficiency of the logit ML estimator as the ratio of the asymptotic mean of the error rate of the DA estimator to that of the logit ML estimator, Efron found that the efficiency ranges between 40 and 90% for the various experimental parameter values he chose.
Press and Wilson (1978) compared the classification derived from the two estimators in two real data examples in which many of the independent variables are binary and therefore clearly violate the DA assumption (9.2.47). Their results indicated a surprisingly good performance by DA (only slightly worse than the logit MLE) in terms of the percentage of correct classification both for the sample observations and for the validation set.
Amemiya and Powell (1983), motivated by the studies of Efron and of Press and Wilson, considered a simple model with characteristics similar to the two examples of Press and Wilson and analyzed it using asymptotic techniques analogous to those of Efron. They compared the two estimators in a logit model with two binary independent variables. The criteria they used were the asymptotic mean of the probability of correct classification (PCC) (that is, one minus the error rate) and the asymptotic mean squared error. They found that in terms of the PCC criterion, the DA estimator does very well (only slightly worse than the logit MLE), thus confirming the results of Press and Wilson. For all the experimental parameter values they considered, the lowest efficiency of the DA estimator in terms of the PCC criterion was 97%. The DA estimator performed quite well in terms of the mean squared error criterion as well, although it did not do as well as it did in terms of the PCC criterion and it did poorly for some parameter values. Although the DA estimator is inconsistent in the model they considered, the degree of inconsistency (the difference between the probability limit and the true value) was surprisingly small in a majority of the cases. Thus normal discriminant analysis seems more robust against nonnormality than we would intuitively expect.
We should point out, however, that their study was confined to the case of binary independent variables; the DA estimator may not be robust against a different type of nonnormality. McFadden (1976a) illustrated a rather significant asymptotic bias of a DA estimator in a model in which the marginal distribution of the independent variable is normal. [Note that when we spoke of normality we referred to each of the two conditional distributions given in (9.2.47). The marginal distribution of x* is not normal in the DA model but, rather, is a mixture of normals.] Lachenbruch, Sneeringer, and Revo (1973) also reported a poor performance of the DA estimator in certain nonnormal models.