Abstract
This study investigates the effect of class imbalance in training data when developing neural network classifiers for computer-aided medical
diagnosis. The investigation is performed in the presence of other characteristics that are typical of medical data, namely a small training sample
size, a large number of features, and correlations between features. Two methods of neural network training are explored: classical backpropagation
(BP) and particle swarm optimization (PSO) with clinically relevant training criteria. An experimental study is performed using simulated data
and the conclusions are further validated on real clinical data for breast cancer diagnosis. The results show that classifier performance deteriorates
with even modest class imbalance in the training data. Further, it is shown that BP is generally preferable to PSO for imbalanced training data,
especially with a small data sample and a large number of features. Finally, it is shown that there is no clear preference between oversampling and the
no-compensation approach, and some guidance is provided regarding a proper selection.
© 2007 Elsevier Ltd. All rights reserved.