Conclusions
The use of different evaluation parameters prevents the software engineering
community from easily comparing research results with previous works. In this
study, we investigated 85 fault prediction papers based on their performance
evaluation metrics and categorized these metrics into two main groups. The first
group of metrics is used for prediction systems that classify modules as faulty
or non-faulty, and the second group is applied to systems that predict the
number of faults in each module of the next release of a system. This
study showed that researchers have used numerous evaluation parameters for
software fault prediction up to now, and the selection of common evaluation
parameters is still a critical issue in the context of software engineering
experiments. From the first group, the most common metric in software fault
prediction research is the area under the ROC curve (AUC). Because AUC is a
single scalar value rather than a set of metrics, it is easy to compare several
machine learning algorithms using this parameter. In addition to AUC, the PD,
PF, and balance metrics are also widely used. In this study, we suggest using
the AUC value to evaluate the performance of fault prediction models.
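As a brief illustration of how these classification metrics relate to one another, the following minimal sketch computes PD, PF, balance, and AUC from module labels and predicted fault probabilities. It is only a sketch: scikit-learn is assumed to be available, the data are hypothetical, and the balance formula follows the commonly used definition balance = 1 - sqrt((0 - PF)^2 + (1 - PD)^2) / sqrt(2).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def classification_metrics(y_true, y_score, threshold=0.5):
    """PD, PF, balance, and AUC for a binary fault predictor.

    y_true  : 1 = faulty module, 0 = non-faulty module
    y_score : predicted probability that a module is faulty
    """
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    pd_ = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)    # probability of false alarm
    # balance: normalized distance from the ideal point (PF = 0, PD = 1)
    balance = 1 - np.sqrt((0 - pf) ** 2 + (1 - pd_) ** 2) / np.sqrt(2)
    auc = roc_auc_score(y_true, y_score)  # threshold-independent

    return {"PD": pd_, "PF": pf, "balance": balance, "AUC": auc}

# Hypothetical example: eight modules, three of them faulty
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.6, 0.1, 0.3, 0.7, 0.2]
print(classification_metrics(y_true, y_score))
```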
From the second group of metrics, R² and the AAE and ARE measures can be used to
evaluate the performance of a system that predicts the number of faults in each
module; a brief sketch of these metrics is given after the list below. We
suggest the following changes in software fault prediction research:
• Conduct more studies on performance evaluation metrics for software fault
prediction. Researchers are still working on finding a new performance
evaluation metric for fault prediction [19], but we need more research in this
area because this software engineering problem is inherently different from
other imbalanced-dataset problems. For example, it is not easy to determine
the misclassification cost ratio (Jiang et al., 2008), and therefore
evaluation with cost curves remains difficult.
• Apply a widely used performance evaluation metric. Researchers would like
to be able to easily compare their current results with previous works. If the
performance metrics used in previous studies differ entirely from the widely
used ones, comparison becomes difficult.
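For completeness, a minimal sketch of the second-group metrics is given below. Again, scikit-learn is assumed and the data are hypothetical; note that ARE is defined differently across studies, and the +1 term in the denominator used here is only one common convention for handling modules with no observed faults.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def regression_metrics(y_true, y_pred):
    """R^2, AAE, and ARE for a model that predicts fault counts per module."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    aae = mean_absolute_error(y_true, y_pred)  # average absolute error
    # Average relative error; the +1 avoids division by zero for fault-free
    # modules (one common convention -- definitions differ between studies).
    are = np.mean(np.abs(y_true - y_pred) / (y_true + 1))
    r2 = r2_score(y_true, y_pred)              # coefficient of determination

    return {"AAE": aae, "ARE": are, "R2": r2}

# Hypothetical example: observed vs. predicted fault counts for six modules
observed = [0, 2, 1, 5, 0, 3]
predicted = [0.4, 1.5, 1.2, 4.0, 0.1, 2.6]
print(regression_metrics(observed, predicted))
```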