Last modified: 18 May 2015

#### Abstract

A challenge for many discrete choice modeling applications is the level of detail that can be used when defining the competing options in the choice set. One frequent problem is the tradeoff between level of detail and the resulting size of the choice set: adding more detail can quickly lead to very large choice sets that exceed the practical capabilities for model estimation. In the area of vehicle choice, researchers have resorted to modeling the choice of “vehicle class,” where a vehicle class represents an aggregation of a much larger set of vehicles (frequently by averaging attributes). (An alternative approach is to randomly draw a sample of more highly detailed alternatives to represent the choice set. This approach is outside the scope of this study.) A related issue is the level of detail that data sources such as household surveys record about the choices made. For example, most surveys (including the National Household Travel Survey, NHTS) only collect the model year, make, and model of each household vehicle (e.g. 2008 Honda Civic), but this is not enough information to uniquely identify the exact vehicle chosen from a set of at least 6 distinct 2008 Honda Civic varieties. In this case, the recorded choice is only “partially observed” relative to the level of detail that could otherwise be possible. This is potentially critical, because important vehicle attributes (performance, fuel operating cost, and price) can vary substantially across these varieties. The effect of the level of detail and/or aggregation on the properties of estimators has been largely unexplored.

McFadden (1978) provides an early exploration of the “large choice set” issue, considering the case of household residential location choice where only the neighborhood (not the exact house) is observed. He showed that if the distribution of relevant housing characteristics within a neighborhood is approximated by a multivariate normal distribution, then the higher moments of this distribution can be used to adjust the representative utility specifications in the nested logit structure and yield consistent parameter estimates and inferences. McFadden proved that the approach is valid for conditional logit or for aggregation at the bottom level of a nested logit structure, and many researchers have used it (even when their models are not nested logit).
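To make the idea of a moment-based adjustment concrete, the following is a minimal sketch for the simplest case: a conditional logit with a single attribute. It assumes the standard result that, when the attribute is normally distributed within a group, the log of the summed exponentiated utilities is approximately the utility at the group mean, plus half the squared coefficient times the within-group variance, plus the log of the group size. The coefficient value, the attribute draws, and all function names are hypothetical and chosen only for illustration; the specification used in the paper may differ.

```python
import math
import random


def exact_inclusive_value(beta, xs):
    """Log of the summed exponentiated utilities over all elemental alternatives."""
    vs = [beta * x for x in xs]
    m = max(vs)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vs))


def approx_inclusive_value(beta, xs):
    """Normal approximation: beta*mean + 0.5*beta^2*variance + log(group size)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return beta * mean + 0.5 * beta ** 2 * var + math.log(n)


def naive_average_utility(beta, xs):
    """Utility evaluated at the averaged attribute, with no correction terms."""
    return beta * sum(xs) / len(xs)


random.seed(0)
beta = 0.8  # hypothetical taste coefficient
# Hypothetical attribute values for the elemental alternatives in one group.
xs = [random.gauss(2.0, 1.0) for _ in range(20000)]

exact = exact_inclusive_value(beta, xs)
approx = approx_inclusive_value(beta, xs)
naive = naive_average_utility(beta, xs)
print(f"exact={exact:.3f} approx={approx:.3f} naive={naive:.3f}")
```

In this sketch the corrected utility tracks the exact value closely, while the utility computed from averaged attributes alone omits both the variance term and the group-size term.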

One approach to the problem of partially observed data would be to aggregate alternatives up to the level where the choice variable is observed. For example, in the 2008 NHTS data we only observe whether a 2008 model year Honda Civic is a hybrid or not. The usual approach would be to average over attributes of the underlying non-hybrid models (e.g. DX, LX, or EX plus possibly different accessory packages) to produce attributes for the aggregate alternative “2008 non-hybrid Honda Civic” and then fit a discrete choice model using such aggregated alternatives. Even then, the choice set is so large that many researchers would find the estimation computationally challenging, and would further aggregate the vehicles into, e.g., 30 classes (domestic sub-compact, imported small SUV, etc.).

However, this widely used approach leads to inconsistent parameter estimates, and this paper investigates the nature of this inconsistency using both a Monte Carlo study and a model of new vehicle purchase behavior using 2008 NHTS data. We explore the properties of McFadden’s approach, and we also examine the performance of an efficient “broad choice” maximum likelihood estimator (see Brownstone and Li, and Lloro and Brownstone). We find that the aggregation bias in the parameter estimates can be large, but the bias in the estimated confidence intervals is much larger (sometimes by a factor of more than 10).
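As a rough illustration of what a “broad choice” likelihood looks like, the sketch below assumes the usual construction: the model is specified over the detailed (elemental) alternatives, and each observation contributes the log of the summed probabilities of all elemental alternatives consistent with the recorded choice. The attribute values, group definitions, and function names are hypothetical, and the estimators in the cited papers may include features omitted here.

```python
import math


def logit_probs(utilities):
    """Conditional logit probabilities over the elemental alternatives."""
    m = max(utilities)  # subtract the maximum for numerical stability
    expu = [math.exp(u - m) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


def broad_choice_loglik(beta, attributes, observed_groups):
    """Log-likelihood when only a group of alternatives is observed.

    attributes: one attribute value per elemental alternative.
    observed_groups: for each decision maker, the indices of the elemental
        alternatives consistent with the recorded (partially observed) choice.
    """
    probs = logit_probs([beta * x for x in attributes])
    return sum(math.log(sum(probs[j] for j in group)) for group in observed_groups)


# Hypothetical example: alternatives 0-2 are non-hybrid trims, 3 is the hybrid.
attributes = [1.0, 1.5, 2.0, 0.5]
non_hybrid, hybrid = [0, 1, 2], [3]
observed = [non_hybrid, hybrid, non_hybrid]

ll_at_zero = broad_choice_loglik(0.0, attributes, observed)
print(ll_at_zero)
```

Unlike the averaging approach, this formulation keeps the attributes of every elemental alternative in the model and aggregates probabilities rather than attributes.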

Like many household surveys, the NHTS does not yield a representative sample of the US residential population. Even when the NHTS sampling weights are used, missing data in key variables (vehicle model year, type, and date of purchase) lead to a biased sample. Since we have market share data available, we use this information to improve estimation using both the Weighted Exogenous Sample Maximum Likelihood Estimator (WESMLE) and the Berry, Levinsohn, and Pakes (BLP) estimator extended to the broad choice situation. The broad choice BLP estimator is computationally demanding, but is more efficient than the WESMLE. Our results reflect the inefficiency of the WESMLE, but all consistent estimators yield very wide confidence bands when applied to the new vehicle choice model with NHTS data.
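The weighting idea behind the WESMLE can be sketched as follows, assuming the standard form in which each observation is weighted by the population (market) share of its chosen alternative divided by that alternative's share in the sample. The shares, alternative names, and function names below are hypothetical; how the weights are constructed for partially observed choices in the paper is not shown here.

```python
import math


def wesml_weights(population_shares, sample_shares):
    """Weight for each alternative: population share divided by sample share."""
    return {alt: population_shares[alt] / sample_shares[alt]
            for alt in population_shares}


def weighted_loglik(choice_probs, choices, weights):
    """Weighted log-likelihood.

    choice_probs: per observation, a dict mapping alternative -> model probability.
    choices: per observation, the chosen alternative.
    weights: dict mapping alternative -> weight from wesml_weights.
    """
    return sum(weights[c] * math.log(p[c]) for p, c in zip(choice_probs, choices))


# Hypothetical shares: cars are over-represented in the sample.
population_shares = {"car": 0.6, "truck": 0.4}
sample_shares = {"car": 0.75, "truck": 0.25}
weights = wesml_weights(population_shares, sample_shares)

# Hypothetical model probabilities for three sampled households.
choice_probs = [{"car": 0.7, "truck": 0.3}] * 3
choices = ["car", "car", "truck"]
print(weights, weighted_loglik(choice_probs, choices, weights))
```

With these weights, the weighted sample shares reproduce the market shares, which is the sense in which the external share data corrects for the unrepresentative sample.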

A long-standing problem with vehicle choice models is that at least some important vehicle attributes are frequently not observable (or are only observed with substantial error). This missing or noisy information exacerbates estimation issues because vehicle price is endogenous, and it is the motivation for the two-step BLP approach that uses instruments to model vehicle price. We expect that our broad choice approach somewhat mitigates this problem, since our attribute data are not contaminated by aggregating over many different vehicles. We check this by comparing the aggregate and broad choice BLP estimates, using instruments suggested by Train and Winston (2007). We find that using these instruments does not improve our results, and that the bias in the standard errors from using the two-stage BLP approach is very large without some type of correction.

We conclude that better data and better models are needed before we can make accurate quantitative predictions of the impact of policies designed to mitigate problems associated with household automobile use. The new broad choice BLP estimators used in this work are theoretically superior but computationally demanding. We speculate that data on multiple markets with different prices are needed to get more accurate estimates using these methods. Finally, our results show that failing to account for aggregation across alternatives in discrete choice models can lead to very substantial biases in both parameter estimates and confidence intervals.