As mentioned in Section 1, the main contributions of this paper are: (a) an evaluation of the performance of seven BN classifiers, aimed at determining how effectively they diagnose breast cancer on two real-world breast cancer datasets, and (b) empirical evidence that the interobserver variability problem is implicitly contained in these data.
For the first contribution, the results of Experiments 1–4 show that only with the dataset collected by a single observer (Experiment 1) is it possible to carry out the cytodiagnosis of breast cancer consistently and accurately. For the remaining experiments, the results show that the subjective observation of the samples leads, in general, to poor performance of the BN classifiers. Even the Naïve Bayesian classifier, which is very robust in the presence of noise, reflects this anomaly. These results therefore strongly suggest that the effectiveness of this kind of classifier is significantly reduced when data from different observers are taken into account. In other words, it is not possible to generalize which features are the most relevant for determining the presence or absence of breast cancer.
Regarding the second contribution, the results of Experiments 2–4 show that interobserver variability is implicitly present in the data: globally, the observers see the same thing; locally, they see different things. This implies that they are taking into account more information than that portrayed in the data; i.e., they are using additional knowledge to make a decision. It is important to mention that the process or processes that cytopathologists follow to reach their final diagnoses have not yet been fully understood and can only be partially explained in terms of pattern recognition with occasional use of heuristic logic [8]. Our results support this finding. Furthermore, as Cross et al. point out, all the features in the breast cancer datasets used in the present study were coded by expert cytopathologists, who thereby carried out most of the processing that is probably required to solve the diagnostic problem. If this is true, then there is little work left for any classifier that uses these datasets. Hence, the information provided there is subjective rather than objective.
To ameliorate this problem, alternative data collection methods such as image analysis techniques could be used, so that objective measures can be extracted from raw digitized images of the samples. It would then be interesting, in future work, to investigate the possibility of building a pre-processing vision module capable of extracting “objective” features from raw images, to integrate more information into the data (such as clinical details), and then to check again the performance of the BN classifiers on this extended dataset.
An important problem regarding the nature of the databases is the loss of richness in the representation of the cytopathological features: cytopathologists are forced to represent naturally continuous features in a dichotomized way. A relevant line of exploration would be to encode such characteristics with more expressive power than a binary coding allows. To do so, it would be necessary to allow pathologists to code the features using a wider range of possible values for each variable. It would also be very interesting to explore the incorporation of prior knowledge into the different procedures presented here, to check whether their performance can be significantly improved.
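The loss of richness caused by dichotomizing a continuous feature can be illustrated with a small sketch. The data below are synthetic and purely hypothetical (they are not taken from the datasets used in this paper), and the names `mutual_information`, `discretize`, `scores`, `binary` and `graded`, as well as the cut points, are assumptions made only for illustration. The sketch compares how much information about the class label is retained by a two-level coding and by a four-level coding of the same underlying score.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y), in bits, from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def discretize(value, cut_points):
    """Return the index of the bin that `value` falls into."""
    return sum(value >= c for c in cut_points)


# Synthetic, hypothetical data: a continuous feature score whose mean is
# shifted upward for the positive class. Not derived from the paper's datasets.
rng = random.Random(0)
labels = [rng.random() < 0.5 for _ in range(2000)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

# Binary coding (present/absent) versus a four-level graded coding.
# The binary threshold is one of the graded cut points, so the binary
# coding is a strict coarsening of the graded one.
binary = [discretize(s, [0.5]) for s in scores]
graded = [discretize(s, [-0.5, 0.5, 1.5]) for s in scores]

mi_binary = mutual_information(binary, labels)
mi_graded = mutual_information(graded, labels)
print(f"I(binary; class) = {mi_binary:.3f} bits")
print(f"I(graded; class) = {mi_graded:.3f} bits")
```

Because the binary coding here is a function of the graded coding, the graded coding can never carry less information about the class than the binary one. How large the gain would be on real cytopathological features, and whether it would translate into better classifier performance, remains an open empirical question.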