Mostly Harmless Econometrics: An Empiricist’s Companion
Censored Quantile Regression
Quantile regression allows us to look at features of the conditional distribution of $Y_i$ when part of the distribution is hidden. Suppose you have data of the form
$$ Y_{i,obs} = Y_i \cdot 1[Y_i < c], \qquad (7.1.5) $$
where $Y_{i,obs}$ is what you get to see and $Y_i$ is the variable you would like to see. The variable $Y_{i,obs}$ is censored: information about $Y_i$ in $Y_{i,obs}$ is limited for confidentiality reasons or because it was too difficult or time-consuming to collect more information. In the CPS, for example, high earnings are topcoded to protect respondent confidentiality. This means data above the topcode are recoded to have the topcode value. Duration data may also be censored: in a study of the effects of unemployment insurance on the duration of employment, we might follow new UI claimants for up to 40 weeks. Anyone out of work for longer has an unemployment spell length that is censored at 40. Note that limited dependent variables like hours worked or medical expenditure, discussed in Section 3.4.2, are not censored; they commonly take on the value zero by their nature, just as dummy variables like employment status do.
When dealing with censored dependent variables, quantile regression can be used to estimate the effect of covariates on conditional quantiles that are below the censoring point (assuming censoring is from above). This reflects the fact that recoding earnings above the upper decile to be equal to the upper decile has no effect on the median. So if CPS topcoding affects relatively few people (as is often true), censoring has no effect on estimates of the conditional median or even $\beta_\tau$ for $\tau = .75$. Likewise, if less than 10 percent of the sample is censored conditional on all values of $X_i$, then when estimating $\beta_\tau$ for $\tau$ up to .9 you can simply ignore the censoring. Alternatively, you can limit the sample to values of $X_i$ where $Q_\tau(Y_i \mid X_i)$ is below $c$ (or above, if censoring is from the bottom with $Y_{i,obs} = Y_i \cdot 1[Y_i \geq c]$).
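The claim that topcoding leaves lower quantiles untouched is easy to check numerically. The sketch below is only an illustration under stated assumptions: the log-normal "earnings" are simulated, the topcode `c` is set at roughly the 90th percentile of the simulated data, the `quantile` helper is a simple order-statistic rule introduced here, and top-coded values are recoded to `c` as in the CPS description above.

```python
import random

random.seed(0)

# Hypothetical latent earnings (simulated, for illustration only).
latent = [random.lognormvariate(10, 0.7) for _ in range(10_001)]

# Topcode at (roughly) the 90th percentile of the latent data:
# anything above c is recoded to c.
c = sorted(latent)[int(0.9 * len(latent))]
observed = [min(y, c) for y in latent]


def quantile(values, tau):
    """Order-statistic quantile: the element at position floor(tau * n) after sorting."""
    ordered = sorted(values)
    return ordered[int(tau * len(ordered))]


# Compare latent and top-coded quantiles below and above the censoring point.
for tau in (0.5, 0.75, 0.95):
    print(tau, quantile(latent, tau), quantile(observed, tau))
```

The median and the .75 quantile should be identical in the two series, while the .95 quantile of the top-coded series collapses to `c`.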
Powell (1986) formalizes this idea with the censored quantile regression estimator. Because we may not know which conditional quantiles are below the censoring point (continuing to think of top codes), Powell proposes we work with
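$$ Q_\tau(Y_i \mid X_i) = \min(c, X_i'\beta^c_\tau). $$
The parameter vector $\beta^c_\tau$ solves
$$ \beta^c_\tau = \arg\min_{b \in \mathbb{R}^d} E\{ 1[X_i'\beta^c_\tau < c] \cdot \rho_\tau(Y_i - X_i'b) \}. \qquad (7.1.6) $$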
In other words, we solve the quantile regression minimization problem for values of $X_i$ such that $X_i'\beta^c_\tau < c$. In practice, we minimize the sample analog of (7.1.6). As long as there is enough uncensored data, the resulting estimates give us the quantile regression function we would have gotten had the data not been censored (assuming the conditional quantile function is, in fact, linear). And if it turns out that the conditional quantiles you are estimating are below the censoring point, then you are back to regular quantile regression.
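For concreteness, one way to write that sample analog is sketched here. The hats, the sample size $N$, and the explicit form of the check function are notation assumed for this illustration rather than taken from the text. The observed outcome appears in the objective because the latent one is unavailable for censored observations.

```latex
\hat{\beta}^c_\tau
  = \arg\min_{b \in \mathbb{R}^d}
    \frac{1}{N} \sum_{i=1}^{N}
    1[X_i'\hat{\beta}^c_\tau < c] \cdot \rho_\tau(Y_{i,obs} - X_i'b),
\qquad
\rho_\tau(u) = u \, (\tau - 1[u < 0]).
```

Because $\hat{\beta}^c_\tau$ appears on both sides, this is a fixed-point problem rather than a single minimization over a known sample.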
The sample analog of (7.1.6) is no longer a linear programming problem, but Buchinsky (1994) proposes a simple iterated linear programming algorithm that appears to work well. The iterations go like this: first, estimate $\beta_\tau$ ignoring the censoring. Then find the cells with $X_i'\beta_\tau < c$. Then estimate the quantile regression again using these cells only, and so on. This algorithm is not guaranteed to converge, but it appears to do so in practice. Standard errors can be bootstrapped. Buchinsky (1994) and Chamberlain (1994) use this approach to estimate the returns to schooling for highly experienced workers who may have earnings above the CPS topcode. The censoring adjustment tends to increase the returns to schooling for this group.
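The iterations described above can be sketched in a few lines. This is a toy illustration, not Buchinsky's implementation, and everything named in it is an assumption made for the example. The data are deterministic, with one regressor, ten cells, and a true median line of 1 + 0.5x. `fit_line_quantile` is a brute-force stand-in for a linear programming solver. The small tolerance `tol` treats a fitted value numerically equal to `c` as censored.

```python
def check_loss(u, tau):
    """Quantile regression check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def fit_line_quantile(xs, ys, tau):
    """Brute-force quantile regression of y on (1, x) for tiny samples.

    A minimizer of the check-loss objective exists among lines through two
    data points, so we try every such line. A real application would use a
    linear programming routine instead.
    """
    best = None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue  # no finite-slope line through two points with equal x
            b = (ys[j] - ys[i]) / (xs[j] - xs[i])
            a = ys[i] - b * xs[i]
            loss = sum(check_loss(y - a - b * x, tau) for x, y in zip(xs, ys))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def iterated_cqr(xs, ys, c, tau, max_iter=20, tol=1e-9):
    """Refit on observations whose fitted quantile is below c until the set stops changing."""
    a, b = fit_line_quantile(xs, ys, tau)  # first pass: ignore the censoring
    naive = (a, b)
    keep = None
    for _ in range(max_iter):
        new_keep = [k for k, x in enumerate(xs) if a + b * x < c - tol]
        if new_keep == keep:
            break  # same observations selected twice in a row: stop
        keep = new_keep
        a, b = fit_line_quantile([xs[k] for k in keep], [ys[k] for k in keep], tau)
    return naive, (a, b), keep


# Toy data: latent y = 1 + 0.5 * x + e, top-coded at c (values above c are set to c).
noise = [-0.4, -0.2, 0.0, 0.2, 0.4]
c = 4.25
xs, ys = [], []
for x in range(10):
    for e in noise:
        xs.append(float(x))
        ys.append(min(1.0 + 0.5 * x + e, c))

naive, final, keep = iterated_cqr(xs, ys, c, tau=0.5)
print("ignoring censoring (intercept, slope):", naive)
print("after iterating    (intercept, slope):", final)
```

In this toy data the median fit that ignores censoring has a flatter slope than the 0.5 used to generate the data, and the iterations recover the generating line once the heavily censored cells are dropped. With real data, as the text notes, convergence is not guaranteed, which is why the loop is capped by `max_iter`.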