The key idea of the ADASYN algorithm is to use a density
distribution $\hat{r}_i$ as a criterion to automatically decide the
number of synthetic samples that need to be generated for
each minority data example. Physically, $\hat{r}_i$ is a measurement
of the distribution of weights for different minority class
examples according to their level of difficulty in learning.
The data set resulting from ADASYN will not only provide a
balanced representation of the data distribution (according to
the desired balance level defined by the β coefficient), but it
will also force the learning algorithm to focus on those
difficult-to-learn examples. This is a major difference compared to the
SMOTE [15] algorithm, in which equal numbers of synthetic
samples are generated for each minority data example. Our
objective here is similar to that of the SMOTEBoost [16] and
DataBoost-IM [17] algorithms: providing different weights for
different minority examples to compensate for the skewed
distributions. However, the approach used in ADASYN is
more efficient since both SMOTEBoost and DataBoost-IM
rely on the evaluation of hypothesis performance to update
the distribution function, whereas our algorithm adaptively
updates the distribution based on the data distribution characteristics.
Hence, there is no hypothesis evaluation required
for generating synthetic data samples in our algorithm.
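
To make the adaptive sampling mechanism concrete, the following Python
sketch implements this idea under standard ADASYN assumptions: each
minority example's difficulty r_i is estimated as the fraction of
majority-class examples among its K nearest neighbors, normalized into
the density distribution $\hat{r}_i$, and a total of G = (m_l − m_s) · β
synthetic samples is distributed across minority examples in proportion
to $\hat{r}_i$. The function name, the use of scikit-learn's
NearestNeighbors, and the interpolation-based generation step are
illustrative choices, not details prescribed by this section.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_sample(X_min, X_maj, beta=1.0, k=5, seed=None):
    """Minimal sketch of ADASYN-style adaptive synthetic sampling (illustrative)."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])     # minority examples occupy indices [0, n_min)
    n_min, n_maj = len(X_min), len(X_maj)

    # Total number of synthetic samples: G = (m_l - m_s) * beta.
    G = int((n_maj - n_min) * beta)

    # r_i: fraction of majority examples among the k nearest neighbors of each
    # minority example, i.e. its level of difficulty in learning.
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn_all.kneighbors(X_min)     # first neighbor is the point itself
    r = np.array([(nbrs[1:] >= n_min).mean() for nbrs in idx])

    # Normalize into the density distribution r_hat and assign per-example counts g_i.
    r_hat = r / r.sum() if r.sum() > 0 else np.full(n_min, 1.0 / n_min)
    g = np.rint(r_hat * G).astype(int)

    # Generate g_i samples for x_i by interpolating toward random minority neighbors.
    nn_min = NearestNeighbors(n_neighbors=min(k, n_min - 1) + 1).fit(X_min)
    _, idx_min = nn_min.kneighbors(X_min)
    synthetic = []
    for i, g_i in enumerate(g):
        for _ in range(g_i):
            z = rng.choice(idx_min[i][1:])   # a random minority-class neighbor of x_i
            lam = rng.random()
            synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.asarray(synthetic)
```

With β = 1 the sketch generates enough samples to fully balance the two
classes, while smaller values of β yield the partially balanced data
sets examined in the experiment of Fig. 1.
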
Fig. 1 shows the classification error performance for different
values of the β coefficient on an artificial two-class imbalanced data
set. The training data set includes 50 minority class examples
and 200 majority class examples, and the testing data set
includes 200 examples. All data examples are generated by
multidimensional Gaussian distributions with different mean
and covariance matrix parameters. These results are based
on the average of 100 runs with a decision tree as the base
classifier. In Fig. 1, β = 0 corresponds to the classification
error based on the original imbalanced data set, while β = 1
represents a fully balanced data set generated by the ADASYN