Classification of Imbalanced Data by Using the SMOTE Algorithm and
Locally Linear Embedding
Juanjuan Wang1, Mantao Xu2, Hui Wang2, Jiwu Zhang2
(1Department of Biomedical Engineering, Shanghai Jiaotong University, Shanghai, 200030)
(2Kodak Health Group Global R&D Center, Shanghai, 201206)
E-mail: wangjuanjuan@sjtu.edu.cn
Abstract
Classifying imbalanced data is a common
task in medical imaging intelligence.
The synthetic minority oversampling technique
(SMOTE) is a powerful approach to tackling this
problem. This paper presents a novel
approach to improving the conventional SMOTE
algorithm by incorporating the locally linear
embedding algorithm (LLE). The LLE algorithm is
first applied to map the high-dimensional data into a
low-dimensional space, where the input data is more
separable, and thus can be oversampled by SMOTE.
The synthetic data points generated by SMOTE
are then mapped back to the original input space,
again through LLE. Experimental results demonstrate
that the proposed approach attains a performance
superior to that of the traditional SMOTE.
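As an illustration of the oversampling step mentioned in the abstract, the following is a minimal sketch of SMOTE-style interpolation, not the authors' implementation. The function name `smote`, its parameters (`n_synthetic`, `k`, `seed`), and the choice to pick base samples at random are assumptions made for this example. In the pipeline described above, such a routine would presumably be applied to the LLE-embedded coordinates rather than to the raw input; the LLE mapping itself is not shown here.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Hypothetical helper: create n_synthetic points by interpolating between
    minority-class samples and their k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Assumption: the base sample is drawn uniformly at random.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the remaining minority samples by Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(minority_points, n_synthetic=3, k=2, seed=1):
        print(point)
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies on the line segment joining them; this is why the separability of the space in which the interpolation is performed matters.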
1. Introduction
Imbalanced data classification arises in many
practical applications of medical pattern
recognition and data mining. Most existing
state-of-the-art classification approaches are
developed under the assumption that the training set is
evenly distributed across classes. However, they suffer from a
severe bias problem when the training set is highly
imbalanced (i.e., the data comprises two
classes, the minority class C+ and the majority class
C