4. Data analysis results
4.1. Logistic regression overall models for all chosen stations
The correlations between distance and travel time, travel mode and travel time, purpose and InboundOut, and waiting time and InboundOut are 0.53, -0.45, -0.44 and -0.36, respectively (see Table 3). The variables travel time and InboundOut were therefore removed before model selection. In addition, travel purpose was found to have a 95% confidence interval of (0, ∞), so it was not considered in the modelling process. Table 4 presents the best-fitting logistic regression model for predicting the nearest station choice across all seven stations. There are 833 records for all the stations (Table 2), but the sample size for this regression model is 732, because 101 records with missing values were removed for the analysis. Three variables in the model were found to be statistically significant: line-haul cost, the shortest network distance from origin to station, and waiting time. The line-haul cost is the cost of travelling from the chosen station to the train station nearest to the destination. The lower the line-haul cost, the more likely the chosen station is a non-nearest station. For example, a commuter could choose a transit station further along the way towards their destination, instead of the nearest station, in order to save on train fares. This suggests an effect of large fare jumps between zones (Jansson and Angell, 2012). The shortest network distance from an origin to a station was also found to have an important influence on the nearest station choice: the shorter the distance from origin to station, the more likely the chosen station is the nearest station. Finally, as revealed by the model, the shorter the waiting time at a chosen station, the more likely that station is a non-nearest station.
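The variable screening and sample-size bookkeeping described above can be checked with a short sketch. It is illustrative only: the correlation values and record counts are those quoted in the text, while the variable labels (for example `travel_time`, `inbound_out`) are shorthand assumed here and are not the names used in the original dataset. The sketch confirms that removing travel time and InboundOut leaves no correlated pair with both members still in the candidate set.

```python
# Pairwise correlations quoted in the text (attributed to Table 3).
# The variable labels are illustrative shorthand, not the dataset's own names.
correlated_pairs = {
    ("distance", "travel_time"): 0.53,
    ("travel_mode", "travel_time"): -0.45,
    ("purpose", "inbound_out"): -0.44,
    ("waiting_time", "inbound_out"): -0.36,
}

# Variables the text says were removed before model selection.
removed = {"travel_time", "inbound_out"}

# A pair remains a concern only if neither of its members was removed.
unresolved = [pair for pair in correlated_pairs if not (set(pair) & removed)]

# Sample-size bookkeeping from the text: 833 records, 101 removed as missing.
total_records = 833
missing_records = 101
modelled_records = total_records - missing_records

print("unresolved correlated pairs:", unresolved)
print("records used in the model:", modelled_records)
```

Each of the four correlated pairs contains one of the two removed variables, so dropping just those two is enough to separate every pair.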