In this paper, we proposed a detailed methodology for
finding risk factors from the imbalanced RFID airport baggage
tracking data. We presented the pre-processing steps for
converting the raw RFID tracking data into FlightLeg records.
We estimated the risk score of a bag being mishandled. In
order to compute the risk scores, we learned classifiers that
assigned scores and then evaluated the quality of the scores
with the AUC measure. We dealt with the imbalance problem,
applied different data mining techniques, and, based on AUCs
and Precision-Recall curves, found that the decision tree is
the best classifier for our data set. We fragmented the data set
into transit, non-transit, shorter-transit, and longer-transit bags
and obtained appropriate models for the different fragments. We also
found that re-balancing the data set by under-sampling helps
to achieve a better predictive model for the longer transit bags.
We conducted comprehensive experiments with real baggage
tracking data, which showed that fragmenting the data set and
mining each fragment separately was the right choice. The
extracted patterns show that the overall available handling time
for a bag is a critical factor; more specifically, a bag is
considered to be at high risk if it has less than 54 minutes
in the transit airport. For non-transit bags, the factors depend
on the departure airport. We also found that a longer stay
between baggage handling locations and the total number of
bags during the flight hour are important factors for
predicting mishandling. The proposed methodology can help
the aviation industry with examining baggage management
problems for further improvement in the system.
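
As a rough illustration of two steps summarized above, re-balancing by under-sampling and judging risk scores with AUC, the following minimal Python sketch may be helpful. It is not the implementation used in this work: the record fields `transit_minutes` and `mishandled`, the helper names `undersample` and `auc`, the toy records, and the hand-made score (negative transit time instead of a learned decision tree) are all assumptions made for illustration.

```python
import random


def undersample(records, label_key="mishandled", seed=0):
    """Randomly drop majority-class records until both classes have equal size."""
    rng = random.Random(seed)
    pos = [r for r in records if r[label_key]]
    neg = [r for r in records if not r[label_key]]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up FlightLeg-like records (hypothetical, not from the real data set).
records = [
    {"transit_minutes": 30, "mishandled": True},
    {"transit_minutes": 60, "mishandled": True},
    {"transit_minutes": 50, "mishandled": False},
    {"transit_minutes": 90, "mishandled": False},
    {"transit_minutes": 120, "mishandled": False},
    {"transit_minutes": 200, "mishandled": False},
]

# Re-balance the training data: 2 mishandled bags plus 2 sampled normal bags.
balanced = undersample(records)

# Stand-in risk score (an assumption): shorter transit time means higher risk.
labels = [r["mishandled"] for r in records]
scores = [-r["transit_minutes"] for r in records]
print(auc(labels, scores))
```

In this sketch the score is evaluated on the original, imbalanced records, while only the training portion would be re-balanced; a learned classifier would take the place of the stand-in score.
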
Several directions for future work exist. First, a more
thorough study of the root causes of mishandling is needed, which
is non-trivial given the low probability of Mishandled events.
Second, baggage handling sequences can be analyzed to find
problems in the system. Third, spatio-temporal outliers can be
identified in the RFID baggage tracking data. Fourth, native
support can be developed in data mining tools, such as automatic
methods for finding the most appropriate models.