International Choice Modelling Conference 2015

A New Generalized Heterogeneous Data Model (GHDM) to Jointly Model Mixed Types of Dependent Variables
Chandra R Bhat

Last modified: 18 May 2015


The joint modeling of data with mixed types of dependent variables (including ordered-response or ordinal variables, unordered-response or nominal variables, count variables, and continuous variables) is of interest in several fields, including biology, developmental toxicology, finance, economics, epidemiology, social science, and transportation. The interest in mixed model systems has been spurred particularly by the recent availability of high-dimensional heterogeneous data with complex dependence structures, thanks to technology that allows the collection and archival of voluminous amounts of data ("big data").

This paper proposes a new model formulation, the generalized heterogeneous data model (GHDM), to jointly model data containing mixed types of dependent variables, including multiple continuous variables, multiple ordinal variables, multiple count variables, and multiple nominal variables. Within this integrated model system, the covariance relationships among high-dimensional heterogeneous outcomes are explained by a much smaller number of latent continuous factors. Sufficient conditions for identification of the GHDM parameters are presented. The paper develops a comprehensive blueprint for estimating the GHDM using Bhat's maximum approximate composite marginal likelihood (MACML) approach. With this approach, the dimensionality of integration in the function that must be maximized to obtain a consistent estimator (under standard regularity conditions) is independent of the number of latent factors. The approach also easily accommodates general covariance structures for the structural equation and for the utilities of the discrete alternatives for each nominal outcome.
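The idea that a few latent factors can explain the covariance among many outcomes can be illustrated with a minimal sketch. It assumes a purely linear factor structure in which the implied covariance is Lambda * Gamma * Lambda' plus a diagonal matrix of unique variances. The function name, loadings, and numerical values are hypothetical and chosen only for illustration; the GHDM itself also handles ordinal, count, and nominal outcomes, which this toy example does not attempt to represent.

```python
def implied_covariance(loadings, factor_cov, unique_var):
    """Covariance implied by a linear factor structure (illustrative only).

    loadings   : H x L list of lists (outcomes by latent factors)
    factor_cov : L x L covariance matrix of the latent factors
    unique_var : length-H list of outcome-specific error variances
    """
    n_out = len(loadings)
    n_fac = len(factor_cov)
    # Lambda * Gamma
    lg = [[sum(loadings[h][k] * factor_cov[k][l] for k in range(n_fac))
           for l in range(n_fac)] for h in range(n_out)]
    # (Lambda * Gamma) * Lambda'
    cov = [[sum(lg[h][l] * loadings[g][l] for l in range(n_fac))
            for g in range(n_out)] for h in range(n_out)]
    # Add outcome-specific variances on the diagonal
    for h in range(n_out):
        cov[h][h] += unique_var[h]
    return cov


# Hypothetical example: 4 outcomes driven by 2 correlated latent factors.
loadings = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.5, 0.5]]
factor_cov = [[1.0, 0.5], [0.5, 1.0]]
unique_var = [1.0, 1.0, 1.0, 1.0]
cov = implied_covariance(loadings, factor_cov, unique_var)

# Free elements in an unrestricted H x H covariance matrix
n_out = len(loadings)
unrestricted_elements = n_out * (n_out + 1) // 2
```

In this sketch every off-diagonal covariance arises only through the shared factors. With many outcomes and few factors, the number of loadings grows linearly in the number of outcomes, whereas an unrestricted covariance matrix grows quadratically.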

A simulation experiment within the virtual context of the integrated modeling of residential location choice and travel behavior is undertaken to evaluate the ability of the MACML approach to recover parameters in the GHDM from finite samples. The simulation results show that the MACML estimation approach does reasonably well in recovering the parameters, regardless of the sample size (N=1000, 2000, and 3000) used in estimation. The MACML estimator exhibits good empirical efficiency: the asymptotic standard errors (ASEs) and the finite sample standard errors (FSSEs) are only a small proportion of the true parameter values, and the ASEs (derived from the inverse of the Godambe information matrix) perform well in estimating the FSSEs. Further, the approximation error due to the use of only a single permutation for approximating the multivariate normal cumulative distribution (MVNCD) function is extremely small. However, the results also indicate that two sets of parameters are more difficult to recover both accurately and precisely: the effects of exogenous variables on the latent variables (in the structural equation system) and the effects of the latent variables on the outcomes (in the measurement equation system). These are harder to recover than the effects of exogenous variables on the outcomes in the measurement equation system and the inter-relationships between the endogenous variables. This suggests caution when GHDMs with latent variables are estimated with few observations. Our results suggest that 3000 observations or so may be needed for good accuracy and precision in the estimated coefficients when more than two or three latent psychological constructs are used.
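The comparison between ASEs and FSSEs described above can be sketched as follows for a single parameter. This is a minimal illustration, assuming that each simulated data set yields one point estimate and one ASE; the function name, the absolute percentage bias measure, and the numbers are hypothetical and are not taken from the paper.

```python
import statistics


def recovery_summary(true_value, estimates, ases):
    """Summarize parameter recovery across simulated data sets.

    estimates : one point estimate per simulated data set
    ases      : one asymptotic standard error per simulated data set
    """
    mean_est = statistics.mean(estimates)
    # Absolute percentage bias of the mean estimate (assumed metric)
    apb = abs((mean_est - true_value) / true_value) * 100.0
    # FSSE: sample standard deviation of the estimates across data sets
    fsse = statistics.stdev(estimates)
    # ASE: mean of the per-data-set asymptotic standard errors
    mean_ase = statistics.mean(ases)
    return {
        "mean_estimate": mean_est,
        "apb_percent": apb,
        "fsse": fsse,
        "ase": mean_ase,
        "ase_to_fsse": mean_ase / fsse,
    }


# Hypothetical values for one parameter with true value 2.0
summary = recovery_summary(
    2.0,
    [1.9, 2.1, 2.0, 2.2, 1.8],
    [0.15, 0.16, 0.16, 0.17, 0.15],
)
```

An `ase_to_fsse` ratio close to one would indicate that the asymptotic formula is a good guide to the actual sampling variability, which is the sense in which the ASEs are said to perform well in estimating the FSSEs.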

The simulation experiment also examines the implications of ignoring the presence of latent variables, so that the unobserved covariances among the multidimensional outcomes are not considered. In the virtual integrated land use-transportation modeling context used in the simulation, this is equivalent to ignoring all potential self-selection effects, which should in turn corrupt the endogenous variable effects and lead to inaccurate and inefficient estimation of other parameters as well. The results indeed reveal a substantial degradation of parameter recovery across the board if the latent constructs are ignored, especially for the parameters associated with the endogenous variable effects. In addition, land use effects (residential built environment in the current paper) on travel choices can be substantially biased if the multi-dimensional bundled nature of residential and travel-related choices is not considered, which can lead to potentially inappropriate policy decisions regarding infrastructure investment. Overall, the simulation design and results emphasize that integrated land use-transportation (LU-T) modeling is not simply of academic interest, but can have substantial real implications for variable effects and subsequent policy analysis. The GHDM proposed and used in the current paper can serve as a valuable tool for such integrated LU-T modeling efforts. More generally, the GHDM should be widely applicable in numerous empirical contexts due to its ability to accommodate data with mixed types of dependent variables, including multiple ordinal variables, multiple continuous variables, multiple count variables, and multiple nominal variables.
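The self-selection problem described above can be illustrated with a deliberately simplified toy simulation. It assumes a single latent factor z that raises both a continuous "residential" variable x and a continuous "travel" outcome y, with all relationships linear. None of this reproduces the paper's simulation design; the variable names, coefficients, and the use of ordinary least squares are assumptions made only to show the direction of the bias.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, beta=0.5, gamma=1.0, seed=0):
    """Toy self-selection setup with one latent factor (hypothetical)."""
    rng = random.Random(seed)
    x, y, y_net = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)          # latent factor, unobserved in practice
        xi = z + rng.gauss(0.0, 1.0)     # residential variable depends on z
        yi = beta * xi + gamma * z + rng.gauss(0.0, 1.0)  # travel outcome
        x.append(xi)
        y.append(yi)
        y_net.append(yi - gamma * z)     # outcome net of the latent effect
    naive = ols_slope(x, y)              # ignores the latent factor
    oracle = ols_slope(x, y_net)         # infeasible benchmark that knows z
    return naive, oracle


naive_slope, oracle_slope = simulate()
```

With these assumed values the naive slope converges to beta + gamma / 2 = 1.0 rather than the true effect of 0.5, so the apparent effect of the residential variable on travel is doubled. The oracle benchmark is not available in practice, which is why a joint model with latent factors is needed to separate the two sources of association.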
