International Choice Modelling Conference 2015

Investigating Bagging Predictors Method in Multinomial Logit Model
Milad Ghasri


Abstract


Introduction

The multinomial logit (MNL) model is the most commonly used discrete choice model in diverse fields, including transport engineering, and it can be considered the foundation of more advanced models such as random parameter models. Despite its rich literature and ample application, there is still room for improving the accuracy of these models. For instance, the process of selecting the independent variables to be incorporated in the model remains a challenging issue (Prinzie and Van den Poel, 2008). This study aims to employ the bagging predictors method (Breiman, 1996), which has been explored in the data mining paradigm, in the discrete choice modelling context and to examine its merits and demerits. The bagging predictors method refers to developing a combination of models (alternatively referred to as predictors) that share the same structure but have different coefficients. The coefficients in each version of the model are estimated by fitting it to a random subsample of the main database. This technique was initially used to structure a random forest (RF) model, where each individual model is a decision tree. Bagging predictors has not only been shown to increase model accuracy, but it also enables modellers to select independent variables more efficiently.
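
To make the procedure concrete, the following Python fragment sketches the basic bagging loop under the assumption that the data sit in a pandas DataFrame with one row per individual, a chosen-mode column and individual-specific covariates; the column names, the function fit_bagged_mnl and the use of statsmodels' MNLogit are illustrative assumptions rather than the implementation used in this study.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_bagged_mnl(data, y_col, x_cols, n_models=100, seed=0):
    """Fit one MNL per bootstrap resample and return the fitted models."""
    rng = np.random.default_rng(seed)
    # Fix the coding of the alternatives (sketch assumes every alternative
    # appears in each resample).
    alt_levels = sorted(data[y_col].unique())
    models = []
    for _ in range(n_models):
        # Draw a bootstrap resample: observations sampled with replacement.
        rows = rng.integers(0, len(data), size=len(data))
        sample = data.iloc[rows]
        y = pd.Categorical(sample[y_col], categories=alt_levels).codes
        X = sm.add_constant(sample[x_cols])
        # Same model structure each time, but different coefficients,
        # because each version is fitted to a different random subsample.
        models.append(sm.MNLogit(y, X).fit(disp=0))
    return models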

Methodology

In the first step, this study follows the terminology of RF and structures a combination of MNL models, as opposed to a single MNL model. It then examines the distribution of the estimated coefficients for each independent variable and investigates the statistical relationship between this distribution and the value that would be obtained if a single model were estimated on the whole population. It should be emphasized that, as in RF, only a random subset of the independent variables is considered when developing each individual model; in other words, not all independent variables are present in every individual model.
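
Continuing the same sketch, the fragment below adds this variable-subsetting step and records the estimates of every variable across the ensemble so that their distribution can be inspected; the function name coefficient_distributions, the subset size n_vars and the use of statsmodels are again illustrative assumptions.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def coefficient_distributions(data, y_col, x_cols, n_models=200, n_vars=3, seed=0):
    """Bootstrap the rows and draw a random subset of variables for each
    individual model, then collect the coefficient estimates per variable."""
    rng = np.random.default_rng(seed)
    alt_levels = sorted(data[y_col].unique())
    draws = {v: [] for v in x_cols}
    for _ in range(n_models):
        rows = rng.integers(0, len(data), size=len(data))               # bootstrap rows
        subset = list(rng.choice(x_cols, size=n_vars, replace=False))   # random variables
        sample = data.iloc[rows]
        y = pd.Categorical(sample[y_col], categories=alt_levels).codes
        res = sm.MNLogit(y, sm.add_constant(sample[subset])).fit(disp=0)
        for v in subset:
            # One estimate per non-reference alternative for this variable.
            draws[v].append(res.params.loc[v].values)
    return draws

The resulting empirical distribution of each coefficient can then be compared with the estimate obtained from a single MNL fitted to the full sample.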

The bagging predictors method facilitates the variable selection process in a systematic way, which results in a more flexible structure for examining policy-sensitive variables. There are several reasons why a policy-sensitive variable might not end up being included in an MNL model: it may not be statistically significant when estimated with the full sample, or it may be correlated with other independent variables that have a stronger relationship with the dependent variable. If a variable is not included, the model cannot capture the elasticity of the dependent variable with respect to changes in that variable, and hence cannot estimate the sensitivity of the results to policies affecting the excluded variable. By contrast, in the proposed method every independent variable is incorporated in a number of individual models and therefore contributes to the final result of the model ensemble. However, calculating elasticities in this case is not as straightforward as for a single MNL model, and numerical procedures are required, as sketched below.
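
One minimal form of such a numerical procedure is to increase a single independent variable by a small proportion, recompute the ensemble-averaged choice probabilities, and form a finite-difference (arc) elasticity of the aggregate choice shares. The sketch below assumes fitted models and their per-model variable subsets (for example from the fragments above); the helper names ensemble_probs, numerical_elasticity and x_cols_list are hypothetical.

import numpy as np
import statsmodels.api as sm

def ensemble_probs(models, X_list):
    """Average the predicted choice probabilities over the individual models;
    X_list[m] is the design matrix for model m (its own variable subset)."""
    return np.mean([np.asarray(m.predict(X)) for m, X in zip(models, X_list)], axis=0)

def numerical_elasticity(models, data, x_cols_list, var, delta=0.01):
    # Choice probabilities at the observed values of the data.
    base_X = [sm.add_constant(data[cols]) for cols in x_cols_list]
    p0 = ensemble_probs(models, base_X)
    # Probabilities after increasing `var` by delta (e.g. one per cent);
    # individual models that do not include `var` are simply unaffected.
    bumped = data.copy()
    bumped[var] = bumped[var] * (1.0 + delta)
    bump_X = [sm.add_constant(bumped[cols]) for cols in x_cols_list]
    p1 = ensemble_probs(models, bump_X)
    # Arc elasticity of each alternative's aggregate share.
    s0, s1 = p0.mean(axis=0), p1.mean(axis=0)
    return (s1 - s0) / (s0 * delta)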

In terms of data, a publicly available dataset on travel mode choice for intercity travel in Australia is used for model estimation[1]. The dataset consists of 840 observations for 210 individuals who choose a travel mode for travelling from Sydney to Melbourne. It contains two individual-specific attributes, income and party size, and four alternative-specific variables: waiting time, vehicle cost, generalized cost and travel time. The competing modes for each individual are car, air, train and bus.
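
For reproducibility, the snippet below shows one way the raw text file referenced in [1] might be loaded; the assumption that it is whitespace-delimited with a header row is an illustration and may need adjusting after inspecting the file.

import pandas as pd

URL = "http://people.stern.nyu.edu/wgreene/Text/tables/TableF21-2.txt"

# 840 rows are expected: 210 individuals x 4 alternatives (air, train, bus, car),
# stacked with one row per individual-alternative combination.
raw = pd.read_csv(URL, sep=r"\s+")
print(raw.shape)
print(raw.head())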

Lastly, as a reference point, the model ensemble is compared against a random parameter logit model (McFadden and Train, 2000). Developing separate models on subsamples of the database can be viewed as an alternative way of capturing taste variation, so this comparison serves to illustrate the merits and demerits of the proposed approach. The comparison covers the distribution of coefficients and the overall accuracy of the models in predicting the observed data. In the simulation process, the final result of the model ensemble is derived either by averaging the results of the individual models, when the dependent variable is continuous, or by plurality vote, when a discrete variable is predicted. Modellers usually divide the dataset into training data and test data in order to obtain an unbiased estimate of a model's accuracy, whereas the bagging predictors technique inherently provides an unbiased estimate of accuracy through the concept of out-of-bag (OOB) data. Technically, the set of all observations that are not used in developing a given individual model is called the OOB set for that model. Evaluating each observation only with the models for which it belongs to the OOB set gives an unbiased estimate of the ensemble's accuracy for that observation, and aggregating these estimates over all observations yields an overall unbiased estimate of accuracy. In this fashion, the model development process benefits from using all observations, and partitioning the data into training and test sets becomes redundant, as sketched below.
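
The OOB accuracy calculation described above can be sketched as follows, assuming the bootstrap row indices used for each model were stored (boot_idx), together with each model's design matrix over the full sample (X_list) and the integer-coded chosen alternatives (y_codes); these names are illustrative.

import numpy as np

def oob_accuracy(models, boot_idx, X_list, y_codes):
    """Plurality-vote OOB accuracy: each observation is scored only by the
    models whose bootstrap sample did not contain it."""
    y_codes = np.asarray(y_codes)
    n, n_alts = len(y_codes), int(y_codes.max()) + 1
    votes = np.zeros((n, n_alts))
    for m, model in enumerate(models):
        in_bag = np.zeros(n, dtype=bool)
        in_bag[boot_idx[m]] = True
        oob = ~in_bag
        if oob.any():
            # Predicted alternative for this model's out-of-bag observations.
            pred = np.asarray(model.predict(X_list[m][oob])).argmax(axis=1)
            votes[np.flatnonzero(oob), pred] += 1
    y_hat = votes.argmax(axis=1)
    scored = votes.sum(axis=1) > 0      # observations that were OOB at least once
    return np.mean(y_hat[scored] == y_codes[scored])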

 

References

Breiman, L. (1996) Bagging predictors. Machine Learning 24, 123-140.

McFadden, D., Train, K. (2000) Mixed MNL models for discrete response. Journal of Applied Econometrics 15, 447-470.

Prinzie, A., Van den Poel, D. (2008) Random Forests for multiclass classification: Random MultiNomial Logit. Expert Systems with Applications 34, 1721-1732.

 


[1] http://people.stern.nyu.edu/wgreene/Text/tables/TableF21-2.txt

