A COMPANION TO Theoretical Econometrics
Essentials of Count Data Regression
A. Colin Cameron and Pravin K. Trivedi
In many economic contexts the dependent or response variable of interest (y) is a nonnegative integer or count which we wish to explain or analyze in terms of a set of covariates (x). Unlike the classical regression model, the response variable is discrete with a distribution that places probability mass at nonnegative integer values only. Regression models for counts, like other limited or discrete dependent variable models such as the logit and probit, are nonlinear with many properties and special features intimately connected to discreteness and nonlinearity.
Let us consider some examples from microeconometrics, beginning with samples of independent cross section observations. Fertility studies often model the number of live births over a specified age interval of the mother, with interest in analyzing its variation in terms of, say, mother's schooling, age, and household income (Winkelmann, 1995). Accident analysis studies model airline safety, for example, as measured by the number of accidents experienced by an airline over some period, and seek to determine its relationship to airline profitability and other measures of the financial health of the airline (Rose, 1990). Recreational demand studies seek to place a value on natural resources such as national forests by modeling the number of trips to a recreational site (Gurmu and Trivedi, 1996). Health demand studies model data on the number of times that individuals consume a health service, such as visits to a doctor or days in hospital in the past year (Cameron, Trivedi, Milne and Piggott, 1988), and estimate the impact of health status and health insurance.
Examples of count data regression based on time series and panel data are also available. A time series example is the annual number of bank failures over some period, which may be analyzed using explanatory variables such as bank profitability, corporate profitability, and bank borrowings from the Federal Reserve Bank (Davutyan, 1989). A panel data example that has attracted much attention in the industrial organization literature on the benefits of research and development expenditures is the number of patents received annually by firms (Hausman, Hall, and Griliches, 1984).
In some cases, such as number of births, the count is the variable of ultimate interest. In other cases, such as medical demand and results of research and development expenditure, the variable of ultimate interest is continuous, often expenditures or receipts measured in dollars, but the best data available are, instead, a count.
In all cases the data are concentrated on a few small discrete values, say 0, 1, and 2; skewed to the right, with a long upper tail; and intrinsically heteroskedastic with variance increasing with the mean. In many examples, such as number of births, virtually all the data are restricted to single digits, and the mean number of events is quite low. But in other cases, such as number of patents, the tail can be very long with, say, one-quarter of the sample being awarded no patents while one firm is awarded 400 patents.
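These features can be checked on simulated data. The following is a minimal sketch, not taken from the chapter, which assumes Poisson-distributed counts with two illustrative means (0.5 and 3.0). The helper names draw_poisson and skewness, the sample size, and the seed are all hypothetical choices made for the illustration. It computes the share of small values, the sample skewness, and the sample variance at each mean.

```python
import math
import random
from statistics import mean, pvariance


def draw_poisson(mu, rng):
    """Draw one Poisson(mu) variate by Knuth's multiplication method (adequate for small mu)."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def skewness(values):
    """Sample skewness; a positive value indicates a long upper tail."""
    m = mean(values)
    sd = math.sqrt(pvariance(values, m))
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


rng = random.Random(0)  # fixed seed so the illustration is reproducible
n = 20000               # hypothetical sample size

low = [draw_poisson(0.5, rng) for _ in range(n)]   # low-mean counts (assumed mean 0.5)
high = [draw_poisson(3.0, rng) for _ in range(n)]  # higher-mean counts (assumed mean 3.0)

share_small = sum(v <= 2 for v in low) / n  # share of observations equal to 0, 1, or 2
skew_low = skewness(low)
var_low, var_high = pvariance(low), pvariance(high)

print(f"share of 0, 1, 2 when the mean is 0.5: {share_small:.3f}")
print(f"skewness when the mean is 0.5:         {skew_low:.3f}")
print(f"variance at mean 0.5 and at mean 3.0:  {var_low:.3f}, {var_high:.3f}")
```

Under the Poisson assumption the variance equals the mean, so the two variances should lie close to 0.5 and 3.0 respectively. Real count data often show more dispersion than this.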
These features motivate the application of special methods and models for count regression. There are two ways to proceed. The first approach is a fully parametric one that completely specifies the distribution of the data, fully respecting the restriction of y to nonnegative integer values. The second approach is a mean-variance approach, which specifies the conditional mean to be nonnegative, and specifies the conditional variance to be a function of the conditional mean.
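The distinction between the two approaches can be made concrete with a short sketch. It rests on assumptions not stated in the paragraph above: an exponential conditional mean, a Poisson distribution as the fully parametric example, and a quadratic variance function with a hypothetical dispersion parameter alpha. The covariate and coefficient values are invented for the illustration.

```python
import math


def conditional_mean(x, beta):
    """Exponential mean exp(x'beta), positive for any real x and beta (assumed functional form)."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))


def poisson_pmf(y, mu):
    """Fully parametric approach: the complete distribution of y given its mean mu.

    The Poisson is used here only as an example; it places probability mass
    on the nonnegative integers and nowhere else.
    """
    if y < 0 or y != int(y):
        return 0.0
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def conditional_variance(mu, alpha):
    """Mean-variance approach: only the first two moments are specified.

    The variance is a function of the mean, here mu + alpha * mu**2
    (an assumed form; alpha = 0 gives variance equal to the mean).
    """
    return mu + alpha * mu ** 2


x = [1.0, 0.5]      # hypothetical regressors: an intercept and one covariate
beta = [0.3, -0.4]  # hypothetical coefficients

mu = conditional_mean(x, beta)
total_mass = sum(poisson_pmf(y, mu) for y in range(60))  # mass on 0, 1, ..., 59
off_support = poisson_pmf(1.5, mu)                       # a non-integer value
var_equal = conditional_variance(mu, 0.0)
var_over = conditional_variance(mu, 0.5)

print(f"conditional mean: {mu:.4f}")
print(f"probability mass on 0..59: {total_mass:.6f}")
print(f"variance with alpha = 0 and alpha = 0.5: {var_equal:.4f}, {var_over:.4f}")
```

The first approach requires a function such as poisson_pmf to be correct for every value of y. The second requires only conditional_mean and conditional_variance to be correctly specified, and so makes weaker assumptions about the data.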
These approaches are presented for cross section data in Sections 2 to 4. Section 2 details the Poisson regression model. This model is often too restrictive and other, more commonly used, fully parametric count models are presented in Section 3. Less-used alternative parametric approaches for counts, such as discrete choice models and duration models, are also presented in that section. The partially parametric approach of modeling the conditional mean and conditional variance is detailed in Section 4. Extensions to other types of data, notably time series, multivariate and panel data, are given in Section 5. In Section 6 practical recommendations are provided. For pedagogical reasons the Poisson regression model for cross section data is presented in some detail. The other models, many superior to Poisson, are presented in less detail for space reasons. For more complete treatment see Cameron and Trivedi (1998) and the guide to further reading in Section 7.