A data set may exhibit two kinds of imbalance. One is between-class imbalance,
in which some classes have many more examples than others [1]. The other is
within-class imbalance, in which some subsets of one class have far fewer
examples than other subsets of the same class [2]. By convention, in imbalanced
data sets, the classes with more examples are called the majority classes and
those with fewer examples the minority classes.
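Between-class imbalance is often summarized by the ratio of the largest class size to the smallest. The following sketch (the function name `imbalance_ratio` and the toy labels are our own, for illustration only) computes this ratio for a small synthetic data set:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size
    (a simple measure of between-class imbalance)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# A toy data set: 95 majority-class examples and 5 minority-class examples.
labels = ["majority"] * 95 + ["minority"] * 5
ratio = imbalance_ratio(labels)  # 95 / 5 = 19.0
```

A ratio of 1 indicates a perfectly balanced data set; the larger the ratio, the more severe the between-class imbalance.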
The problem of imbalance has received increasing attention in recent years. Imbalanced
data sets exist in many real-world domains, such as spotting unreliable telecommunication
customers [3], detection of oil spills in satellite radar images [4],
learning word pronunciations [5], text classification [6], detection of fraudulent telephone
calls [7], and information retrieval and filtering tasks [8]. In these domains,
what we are really interested in is the minority class rather than the majority
class, so we need fairly high prediction accuracy on the minority class. However,
traditional data mining algorithms behave undesirably on imbalanced data sets,
because the class distribution is not taken into consideration when these
algorithms are designed.
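One way to see why algorithms that maximize overall accuracy behave undesirably here: on a heavily imbalanced data set, a trivial classifier that always predicts the majority class already achieves high accuracy while never detecting a single minority example. A minimal sketch (the function name `majority_baseline_accuracy` and the toy labels are our own, for illustration):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a trivial classifier that always predicts
    the most frequent class in the data set."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# A toy data set with 990 majority-class and 10 minority-class examples.
labels = [0] * 990 + [1] * 10
acc = majority_baseline_accuracy(labels)  # 0.99, yet recall on class 1 is 0
```

This is why evaluation on imbalanced data typically relies on measures that account for the class distribution rather than plain accuracy.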
The rest of this paper is organized as follows. Section 2 gives a brief introduction
to recent developments in the field of imbalanced data sets. Section 3