A COMPANION TO Theoretical Econometrics

# Binary and Multinomial Response Models

In this section we present the basic analysis of models with a single dependent variable that is observed as a dichotomous (binary or binomial) or polychotomous (multinomial) variable.

Both binary and multinomial response models are models in which the dependent variable assumes discrete values. The simplest of these models is that in which the dependent variable y is binary; for instance, y can be defined as 1 if the individual is in the labor force, 0 otherwise.

When a dependent variable y can assume more than two values, it can be classified as either (i) a categorical variable or (ii) a count (noncategorical) variable. For instance, a categorical variable y may be defined as: y = 1 if the individual earns less than $10,000; y = 2 if the individual earns between $10,000 and $30,000; and y = 3 if the individual earns more than $30,000. Note that, as the name indicates, the variable sorts individuals into different categories. A count variable is discrete but does not categorize; an example is the number of strikes in a country in a given year. The methods of analysis are different for models with categorical and count variables (see Cameron and Trivedi, Chapter 15 in this companion).

Categorical variables can be further classified as (i) unordered, (ii) ordered, and (iii) sequential. Unordered categorical variables can be defined in any order desired, for instance: y = 1 if occupation is lawyer; y = 2 if occupation is teacher; and y = 3 if occupation is doctor. An example of an ordered categorical variable is the one above concerning the level of earnings. Finally, a sequential categorical variable can be illustrated as: y = 1 if the individual has not completed high school; y = 2 if the individual has completed high school but not college; y = 3 if the individual has completed college but not a higher degree; and y = 4 if the individual has completed a professional degree.

Let us now turn our attention to binary response models. We motivate these models by introducing the linear probability model. This is a regression model in which the dependent variable y is a binary variable. The model can be written as:

$$y_i = \beta' x_i + u_i, \tag{17.1}$$

with $E(u_i) = 0$. The conditional expectation $E(y_i \mid x_i)$ is equal to $\beta' x_i$, which is interpreted as the probability that the event will occur given the $x_i$. The fitted value, $\hat{y}_i = \hat{\beta}' x_i$, will give the estimated probability that the event will occur given the particular value of $x_i$.

Since the model is heteroskedastic (the reader can easily check it), it has to be estimated by weighted least squares. However, the linear probability model is seldom used because the least squares method is not fully efficient due to the nonnormality of the residuals $u_i$ and, more importantly, because in many cases $E(y_i \mid x_i)$, interpreted as a probability, can lie outside the limits (0, 1).
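
The heteroskedasticity of the linear probability model can be checked in two lines. As a sketch, write $p_i = \beta' x_i$ for the conditional probability (a shorthand introduced here for brevity). Since $y_i$ is either 1 or 0, the residual $u_i$ equals $1 - p_i$ with probability $p_i$ and $-p_i$ with probability $1 - p_i$, so that:

$$
\begin{aligned}
E(u_i \mid x_i) &= p_i(1 - p_i) + (1 - p_i)(-p_i) = 0, \\
\operatorname{Var}(u_i \mid x_i) &= p_i(1 - p_i)^2 + (1 - p_i)p_i^2 = p_i(1 - p_i) = \beta' x_i\,(1 - \beta' x_i).
\end{aligned}
$$

The variance changes with $x_i$, which is why weighted rather than ordinary least squares is called for.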

Two alternative models that avoid the previous two problems are widely used for the estimation of binary response models: the probit and the logit. Both models assume that there is an underlying response variable $y_i^*$ defined by the regression relationship $y_i^* = \beta' x_i + u_i$. In practice, $y_i^*$ is unobservable but we observe a dummy variable defined by:

$$y_i = 1 \ \text{if } y_i^* > 0; \quad y_i = 0 \ \text{otherwise}. \tag{17.2}$$

Note that in this formulation, $\beta' x_i$ is not $E(y_i \mid x_i)$ as in the linear probability model; it is $E(y_i^* \mid x_i)$. From the regression relationship for $y_i^*$ and (17.2) we get:

$$P(y_i = 1) = P(u_i > -\beta' x_i) = 1 - F(-\beta' x_i), \tag{17.3}$$

where F is the cumulative distribution function for u.

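In this case the observed values of $y_i$ are just realizations of a binomial process with probabilities given by (17.3) and varying from trial to trial (depending on $x_i$). Hence, when the distribution of $u_i$ is symmetric about zero, so that $1 - F(-\beta' x_i) = F(\beta' x_i)$, the loglikelihood function is:

$$\log L = \sum_{i=1}^{n} y_i \log F(\beta' x_i) + \sum_{i=1}^{n} (1 - y_i) \log[1 - F(\beta' x_i)]. \tag{17.4}$$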
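
The loglikelihood above translates almost line by line into code. The following is a minimal sketch, assuming a symmetric F so that P(y_i = 1) = F(β'x_i); the function names and the toy data are illustrative and do not come from the text.

```python
import math

def logistic_cdf(z):
    """Logistic cdf, arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binary_loglik(beta, X, y, cdf):
    """Sum over i of y_i*log F(b'x_i) + (1 - y_i)*log[1 - F(b'x_i)].

    Assumes 0 < F(b'x_i) < 1 for every observation; otherwise log() fails.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        index = sum(b * v for b, v in zip(beta, x_i))
        p = cdf(index)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

# Hypothetical toy data: each row is (constant, one regressor).
X = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]

print(binary_loglik((0.0, 0.0), X, y, logistic_cdf))  # logit loglikelihood
print(binary_loglik((0.0, 0.0), X, y, normal_cdf))    # probit loglikelihood
```

Passing the cdf as an argument mirrors the text: the same loglikelihood yields the logit or the probit model depending only on the choice of F.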

The functional form for F depends on the assumptions made about $u_i$. If the cumulative distribution assumed is the logistic, we have the logit model; if it is assumed to be the normal distribution, we have the probit model. The logistic and the normal distributions are very close to each other, except at the tails. Even though the parameter estimates are usually close, they are not directly comparable because the logistic distribution has variance $\pi^2/3$ rather than being normalized to 1 as for the normal. Amemiya (1981) suggests that the logit estimates be multiplied by 0.625, instead of the exact $\sqrt{3}/\pi$, arguing that it produces a closer approximation.
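
The two scaling factors can be compared numerically. The sketch below computes the variance-matching factor √3/π and then measures how far the rescaled normal cdf lies from the logistic cdf. The closeness measure used (largest absolute gap on a grid over [-3, 3]) is an assumption made for this illustration and is not necessarily the criterion Amemiya had in mind.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_gap(scale, lo=-3.0, hi=3.0, points=601):
    """Largest |Phi(scale*z) - Lambda(z)| over an evenly spaced grid of z."""
    step = (hi - lo) / (points - 1)
    grid = [lo + k * step for k in range(points)]
    return max(abs(normal_cdf(scale * z) - logistic_cdf(z)) for z in grid)

exact = math.sqrt(3.0) / math.pi  # equates the logistic variance pi^2/3 to 1
amemiya = 0.625                   # factor suggested by Amemiya (1981)

print("variance-matching factor:", exact)
print("max cdf gap, exact factor:", max_gap(exact))
print("max cdf gap, 0.625:", max_gap(amemiya))
```

On this grid the factor 0.625 leaves the smaller maximum gap, which is in line with the argument that it gives a closer approximation over the central range of the distribution.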

For the purpose of predicting effects of changes in one of the independent variables on the probability of belonging to a group, the derivative of the probability with respect to the particular independent variable needs to be computed. Letting $x_{ik}$ and $\beta_k$ be the kth elements of the vector of explanatory variables and parameters, respectively, the derivatives for the logit and probit models are given by:

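$$\frac{\partial}{\partial x_{ik}} L(x_i'\beta) = \frac{\exp(x_i'\beta)}{[1 + \exp(x_i'\beta)]^2}\,\beta_k \quad \text{for the logit}, \tag{17.5}$$

$$\frac{\partial}{\partial x_{ik}} \Phi(x_i'\beta) = \phi(x_i'\beta)\,\beta_k \quad \text{for the probit}, \tag{17.6}$$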

where $L$, $\Phi$, and $\phi$ are the logistic cdf, and the standard normal cdf and pdf, respectively. In both models, we need to calculate the derivatives at different levels of the explanatory variables to get an idea of the range of variation of the resulting changes in probabilities. A common practice in empirical work is to evaluate them at the mean of the vector of independent variables.
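
As an illustration of evaluating the derivatives at the mean, the sketch below uses the standard results that the logit derivative equals L(1 - L)β_k and the probit derivative equals φβ_k, both evaluated at the index x'β. The coefficient values and the mean vector are hypothetical, chosen only to make the example run.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def index(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def logit_derivatives(beta, x):
    """dP/dx_k = L(x'b) * [1 - L(x'b)] * b_k for each k."""
    p = logistic_cdf(index(beta, x))
    return [p * (1.0 - p) * b for b in beta]

def probit_derivatives(beta, x):
    """dP/dx_k = phi(x'b) * b_k for each k."""
    dens = normal_pdf(index(beta, x))
    return [dens * b for b in beta]

# Hypothetical estimates and sample means: (constant, one regressor).
# The entry for the constant is computed too but has no interpretation.
x_bar = (1.0, 0.5)
beta_logit = (-0.4, 1.6)
beta_probit = tuple(0.625 * b for b in beta_logit)  # rough rescaling

print(logit_derivatives(beta_logit, x_bar))
print(probit_derivatives(beta_probit, x_bar))
```

Calling the same functions at other values of x (for instance at quartiles of a regressor) gives the range of variation that the text recommends inspecting.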

The estimation of logit and probit models is done by maximization of the loglikelihood function (17.4) after substituting either the logistic or normal distribution for the functional form F. Since the derivatives of the loglikelihood function are nonlinear in $\beta$, we have to use iterative methods like the Newton-Raphson or the scoring methods. The asymptotic covariance matrix, which can be used for hypothesis testing, is obtained by inverting the corresponding information matrix. In practice, the logit and probit models are readily available in statistical software. When dealing with multiple observations (as with grouped data), a general method based on weighted least squares, known as the minimum chi-square method, can be used (see Maddala, 1983, section 2.8).
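
To make the iterative step concrete, here is a minimal Newton-Raphson sketch for a logit model with a constant and a single regressor. It assumes the logit probability P(y_i = 1) = L(b0 + b1 x_i); the data are hypothetical and deliberately not perfectly separated, since otherwise the maximum would not exist. For the logit, the information matrix equals minus the Hessian, so Newton-Raphson and scoring coincide in this case.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_info(b0, b1, x, y):
    """Score vector and information matrix of the logit loglikelihood."""
    g0 = g1 = 0.0
    a00 = a01 = a11 = 0.0
    for x_i, y_i in zip(x, y):
        p = logistic_cdf(b0 + b1 * x_i)
        w = p * (1.0 - p)
        g0 += y_i - p
        g1 += (y_i - p) * x_i
        a00 += w
        a01 += w * x_i
        a11 += w * x_i * x_i
    return (g0, g1), ((a00, a01), (a01, a11))

def invert_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return ((a[1][1] / det, -a[0][1] / det),
            (-a[1][0] / det, a[0][0] / det))

def newton_logit(x, y, tol=1e-10, max_iter=50):
    """Return the ML estimates (b0, b1) and the inverse information matrix."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), info = score_and_info(b0, b1, x, y)
        inv = invert_2x2(info)
        step0 = inv[0][0] * g0 + inv[0][1] * g1
        step1 = inv[1][0] * g0 + inv[1][1] * g1
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Re-evaluate the information matrix at the final estimates.
    _, info = score_and_info(b0, b1, x, y)
    return (b0, b1), invert_2x2(info)

# Hypothetical data, not perfectly separated.
x_data = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_data = [0, 0, 1, 0, 1, 1]

(b0_hat, b1_hat), cov = newton_logit(x_data, y_data)
print("estimates:", b0_hat, b1_hat)
print("asymptotic standard errors:", math.sqrt(cov[0][0]), math.sqrt(cov[1][1]))
```

The square roots of the diagonal of the inverted information matrix give the asymptotic standard errors used for hypothesis testing.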

Regarding multinomial response models, we will only cover the case of unordered categorical variables. The estimation of models with ordered and sequential categorical variables follows the same rationale. The reader is referred to sections 2.13 and 2.14 of Maddala (1983) for the particulars involved. Models with count data are often estimated using Poisson regression, which is covered in section 2.15 of the same monograph and in Cameron and Trivedi's chapter in this companion.

The multinomial logit (MNL) and probit (MNP) models are often used to estimate models with unordered categorical variables. The MNP model, however, involves the computation of multidimensional integrals which for models with variables taking more than 3 or 4 values are infeasible to compute by direct means. Nonetheless, we can use simulation methods to evaluate such integrals. These methods are reviewed in Section 6 below and in Chapter 22 by Geweke, Houser, and Keane in this companion.

Both MNL and MNP models can be motivated by a random utility formulation. Let $y_{ij}^*$ ($i = 1, \dots, n$; $j = 1, \dots, m$) be the stochastic utility associated with the jth alternative for individual i, with

$$y_{ij}^* = \beta_j' x_i + u_{ij}, \tag{17.7}$$

where $x_i$ are explanatory variables, $\beta_j$ are unknown parameters and $u_{ij}$ is an unobservable random variable. We assume that the individual chooses the alternative for which the associated utility is highest. Define a set of dummy variables $y_{ij} = 1$ if the ith individual chooses the jth alternative, $y_{ij} = 0$ otherwise. Then, for example, the probability that alternative 1 is chosen is given by:

$$P_{i1} = P(y_{i1} = 1) = P(y_{i1}^* > y_{ij}^*,\ j = 2, \dots, m) = P\big(\eta_{j1} < (\beta_1 - \beta_j)' x_i,\ j = 2, \dots, m\big), \tag{17.8}$$

where $\eta_{kj} = u_{ik} - u_{ij}$. Considering the observations as arising from a multinomial distribution with probabilities given by $P_{ij}$, the loglikelihood function for either the MNL or MNP models can be written as:

$$\log L = \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log P_{ij}. \tag{17.9}$$

For the MNL model, we assume that the $u_{ij}$s follow independent extreme-value distributions. McFadden (1974) showed that the probabilities in (17.9) for the MNL model are given by:

$$P_{ij} = \frac{\exp(\beta_j' x_i)}{\sum_{k=1}^{m} \exp(\beta_k' x_i)}. \tag{17.10}$$

This model is computationally convenient since it avoids the problem of evaluating multidimensional integrals, as opposed to the MNP model (see below). The estimation of the MNL model is through maximum likelihood (ML) using an iterative method (since again the derivatives are nonlinear in $\beta$). The asymptotic covariance matrix is also obtained by inverting the corresponding information matrix.
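
The computational convenience of the MNL model shows in how little code the choice probabilities and the loglikelihood require. The sketch below assumes the standard MNL form P_ij = exp(β_j'x_i) / Σ_k exp(β_k'x_i) and sets the first coefficient vector to zero as an identifying normalization (a common convention, assumed here). All names and numbers are illustrative.

```python
import math

def mnl_probabilities(betas, x_i):
    """P_ij = exp(b_j'x_i) / sum_k exp(b_k'x_i) for one individual."""
    indices = [sum(b * v for b, v in zip(beta_j, x_i)) for beta_j in betas]
    top = max(indices)  # subtracting the maximum avoids overflow in exp()
    expo = [math.exp(v - top) for v in indices]
    denom = sum(expo)
    return [e / denom for e in expo]

def mnl_loglik(betas, X, choices):
    """sum_i sum_j y_ij log P_ij, where choices[i] is the j with y_ij = 1."""
    total = 0.0
    for x_i, j in zip(X, choices):
        total += math.log(mnl_probabilities(betas, x_i)[j])
    return total

# Hypothetical setup: three alternatives, regressors (constant, one variable).
betas = [(0.0, 0.0), (0.5, 1.0), (-0.2, 0.3)]  # first vector normalized to zero
X = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.5), (1.0, 0.7)]
choices = [0, 2, 1, 1]  # index of the chosen alternative for each individual

print(mnl_probabilities(betas, X[0]))
print(mnl_loglik(betas, X, choices))
```

Because only one y_ij equals 1 for each individual, the double sum in the loglikelihood reduces to the log probability of the chosen alternative, which is what `mnl_loglik` accumulates.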

McFadden (1974) also suggested the conditional logit model. The main difference between this model and the MNL considered in (17.10) is that the former considers the effect of choice characteristics on the determinants of choice probabilities as well, whereas the MNL model makes the choice probabilities dependent on individual characteristics only. To illustrate this, let $x_{ij}$ denote the vector of the values of the characteristics of choice j as perceived by individual i. Then, the probability that individual i chooses alternative j is

$$P_{ij} = \frac{\exp(\beta' x_{ij})}{\sum_{k=1}^{m} \exp(\beta' x_{ik})}. \tag{17.11}$$

Note that, as opposed to (17.10), $P_{ij}$ does not have different coefficient vectors $\beta_j$. In (17.11) the vector $\beta$ gives the vector of implicit prices for the characteristics.¹ The conditional logit model is similarly estimated by ML.

Both MNL and conditional logit models have the property referred to as "independence of irrelevant alternatives" (IIA). This is because the odds ratio for any two choices $i$ and $j$ is $\exp(\beta' x_i)/\exp(\beta' x_j)$, which is the same irrespective of the total number $m$ of choices considered. If the individual is offered an expanded choice set, this odds ratio does not change. This property is in fact a drawback in many applications. Debreu (1960) pointed out that these models predict too high a joint probability of selection for two alternatives that are in fact perceived as similar rather than independent by the individual. To see this, consider the following choices: (i) red bus, (ii) blue bus, and (iii) auto. Suppose that consumers treat the two buses as equivalent and are indifferent between auto and bus. Then, the relative odds of alternatives (i) and (iii) depend on the presence of alternative (ii): they are 1:1 if choice (ii) is not present and 1:2 if it is present. This, however, is inconsistent with the IIA property.
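
The red bus/blue bus example can be reproduced numerically. The sketch below assigns the same systematic utility (zero, an arbitrary choice) to every alternative, reflecting indifference between auto and bus, and computes logit shares with and without the blue bus. The function name and the utility values are illustrative.

```python
import math

def logit_shares(utilities):
    """Logit choice probabilities from a dict of systematic utilities."""
    expo = {name: math.exp(v) for name, v in utilities.items()}
    denom = sum(expo.values())
    return {name: e / denom for name, e in expo.items()}

two = logit_shares({"red bus": 0.0, "auto": 0.0})
three = logit_shares({"red bus": 0.0, "blue bus": 0.0, "auto": 0.0})

odds_two = two["red bus"] / two["auto"]
odds_three = three["red bus"] / three["auto"]

print("odds red bus : auto without blue bus:", odds_two)
print("odds red bus : auto with blue bus:", odds_three)
print("logit share of auto with blue bus:", three["auto"])
print("logit joint share of the two buses:", three["red bus"] + three["blue bus"])
```

The logit odds of red bus to auto stay at 1:1 after the blue bus is added, so the model assigns the two buses a joint share of 2/3 and the auto only 1/3. The behavior described in the text would instead keep the auto at 1/2 and split the remaining 1/2 between the buses, giving odds of 1:2.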

The MNP model does not have the IIA property and thus would be preferred to the MNL model when such a property is inappropriate. Nonetheless, recall the MNP model computational problems mentioned above. To illustrate this, consider only three alternatives in the formulation (17.7) (i.e. $m = 3$), and assume that the residuals have a trivariate normal distribution with mean vector zero and some covariance matrix $\Sigma$. Using the same definitions as above, to compute the corresponding probabilities as in (17.8), the $\eta_{kj}$s will have a bivariate normal distribution with covariance matrix $\Omega_1$ (that can be derived from $\Sigma$ by standard formulae for normal densities), thus requiring the computation of bivariate integrals. It is easy to see that we must deal with trivariate integrals when considering four alternatives, and so on.
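
One way to see what such integrals represent is to approximate the choice probabilities by simulation instead of integration. The sketch below is a crude frequency simulator, offered only as an illustration and not as the simulation methods reviewed elsewhere in this companion: it draws the three residuals from a trivariate normal with a given covariance matrix and records how often each alternative has the highest utility. The systematic utilities and the covariance matrix are hypothetical.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(a[r][r] - s)
            else:
                low[r][c] = (a[r][c] - s) / low[c][c]
    return low

def simulate_mnp_probs(v, sigma, draws=20000, seed=0):
    """Share of draws in which each alternative attains the highest utility."""
    rng = random.Random(seed)
    low = cholesky(sigma)
    m = len(v)
    counts = [0] * m
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlated residuals u = L z, so that Cov(u) = L L' = sigma.
        u = [sum(low[r][k] * z[k] for k in range(r + 1)) for r in range(m)]
        utilities = [v[j] + u[j] for j in range(m)]
        counts[utilities.index(max(utilities))] += 1
    return [c / draws for c in counts]

# Hypothetical systematic utilities b_j'x_i for one individual, and a
# covariance matrix in which alternatives 1 and 2 are positively correlated.
v = [0.2, 0.0, -0.1]
sigma = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 1.0]]

print(simulate_mnp_probs(v, sigma))
```

The simulated shares are noisy and not smooth in the parameters, which is one reason more refined simulators are used in practice; the sketch is meant only to show what the bivariate integral in the three-alternative case is measuring.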

There are other models for the analysis of polychotomous variables. The elimination-by-aspects (EBA) model assumes that each alternative is described by a set of aspects (characteristics) and that at each stage of the process an aspect is selected. The selection of the aspect eliminates alternatives that do not contain the selected aspect, and the selection continues until a single alternative remains. Aspects common to all alternatives do not affect the choice probabilities, thus avoiding the IIA property. Another model is the hierarchical elimination-by-aspects (HEBA) model. When aspects have a tree structure, the EBA model reduces to the HEBA model. McFadden (1981) introduced the generalized extreme value (GEV) and the nested multinomial logit (NMNL) models, which are models based on a random utility formulation where the error distribution is a multivariate generalization of the extreme-value distribution, and are formulated in such a way that the multivariate integrals analogous to (17.8) are analytically tractable. McFadden (1984) considers the GEV model as an elimination model that can be expressed in latent variable form and the NMNL model as a hierarchical elimination model based on the GEV structure. None of these models has the IIA property. References for the EBA, HEBA, GEV, and NMNL models are in Maddala (1983, ch. 3) and McFadden (1984).

So far, we have only discussed models with univariate qualitative variables. There are also models with multivariate qualitative variables in the literature. Among such models are, for instance, simultaneous equation models with qualitative variables, simultaneous equation models with qualitative variables and structural shift, and others. Multivariate qualitative response models are reviewed in Maddala (1983, ch. 5).