Abstract
Recent advances in computing technology, in terms of speed and cost as well as access to tremendous amounts of
computing power and the ability to process huge amounts of data in reasonable time, have spurred increased interest in
data mining applications to extract useful knowledge from data. Machine learning has been one of the methods used in
most of these data mining applications. It is widely acknowledged that about 80% of the resources in a majority of data
mining applications are spent on cleaning and preprocessing the data. However, there have been relatively few studies
on preprocessing data used as input in these data mining systems. In this study, we evaluate the effectiveness of several
inter-class and probabilistic distance-based feature selection methods in preprocessing input data for inducing
decision trees. We use real-world data to evaluate these feature selection methods. The results show that, in general,
inter-class distance measures yield better performance than probabilistic measures.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Artificial intelligence; Feature selection; Decision trees; Credit risk analysis