network models classify bad loans more accurately than either linear discriminant analysis or
logistic regression [13]. Piramuthu [16] compares a multilayer perceptron neural network and
a neural-fuzzy model in three credit scoring applications: Quinlan's credit card data [19], a small
loan default data set [20], and the Texas bank failure data [7]. The neural network model achieved
an overall accuracy of 83.56% on Quinlan's credit card data based on 10 repetitions and a single
data partition. The neural fuzzy model was almost 6% less accurate than the multilayer perceptron
across all three data sets [16]. Davis et al. [9] study the use of decision trees and a multilayer perceptron neural network
for credit card application scoring. Their results are based on a single
data partition and a single neural network trial. The authors conclude that the multilayer
perceptron neural network and the decision tree model both have a comparable level of decision
accuracy [9]. Jensen [14] develops a multilayer perceptron neural network for credit scoring with
three outcomes: obligation charged off (11.2%), obligation delinquent (9.6%), and obligation
paid-off. Jensen reports a correct classification result of 76–80% with a false positive rate (bad
credit risk classified as good credit) of 16% and a false negative rate (good credit risk classified as
bad credit) of 4%. Jensen concludes that the neural network has potential for credit scoring
applications based on results from a single data partition tested on only 50 examples [14].
The research available on predicting financial distress, whether conducted at the firm or
individual level, suggests that neural network models show potential yet lack an overwhelming
advantage over classical statistical techniques. In the quest for small fractional improvement in
predictive accuracy, it is necessary to investigate several neural network architectures and to use
a rigorous experimental methodology to establish performance differences between models. Many
of the previous studies are exploratory in nature, using only a single data partition to establish
training and test samples. This may lead to bias in the estimation of classification accuracy on
holdout samples. In many cases, only a single trial of the neural network model is employed. The
stochastic nature of the neural network training process requires a number of repetitions to
estimate an expected accuracy level. The most common test used to establish statistically significant
differences between credit scoring models is a difference of two proportions or a paired
difference t test. Dietterich has shown that these tests should never be used for comparing supervised
learning models because of the high probability of Type I error associated with them [21].
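
To illustrate the methodological point above, the following Python sketch rotates the test sample through k disjoint folds and repeats training several times per fold, so that the accuracy estimate does not rest on one data partition or one random initialization. It is only a sketch under stated assumptions: the function names and the train_and_score callback are hypothetical and are not taken from any of the cited studies.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Split the indices 0..n_examples-1 into k disjoint test folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def repeated_cv_accuracy(train_and_score, n_examples, k=10, repetitions=10, seed=0):
    """Average test accuracy over k folds and several training repetitions.

    train_and_score(train_idx, test_idx, seed) is a hypothetical callback that
    trains a model on train_idx and returns its accuracy on test_idx.
    """
    folds = k_fold_indices(n_examples, k, seed)
    scores = []
    for fold_id, test_idx in enumerate(folds):
        # Every example outside the current test fold is used for training.
        train_idx = [i for j, fold in enumerate(folds) if j != fold_id for i in fold]
        for rep in range(repetitions):
            # A different seed per repetition stands in for a new random
            # initialization of the network weights.
            scores.append(train_and_score(train_idx, test_idx, seed=rep))
    return sum(scores) / len(scores)
```

With k = 10 and 10 repetitions, the callback is invoked 100 times, and every example is used for testing in exactly one fold.
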
This research investigates the potential for small improvements in credit scoring accuracy by
exploring five neural network architectures, with two real-world data sets partitioned into training
and test sets with 10-fold cross validation. Ten repetitions of each neural network trial are used and
then the models are tested for significant differences with McNemar's chi-square test, which has
been shown to be the most powerful test of model di!erences for supervised learning algorithms
[21].
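
For readers unfamiliar with McNemar's test, a minimal Python sketch follows. It compares two classifiers evaluated on the same test examples and uses only the examples on which they disagree. The sketch assumes the commonly stated form of the statistic with a continuity correction and one degree of freedom; whether this exact form was used in the present study is an assumption, and the function name and input format are illustrative.

```python
import math


def mcnemar_test(correct_a, correct_b):
    """McNemar's chi-square test (with continuity correction) for paired results.

    correct_a and correct_b are equal-length sequences of booleans saying
    whether model A and model B classified each test example correctly.
    Returns (statistic, p_value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same test examples")

    # Count only the discordant pairs; examples on which both models are
    # right or both are wrong carry no information about their difference.
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)  # A wrong, B right
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)  # A right, B wrong

    if n01 + n10 == 0:
        return 0.0, 1.0  # the models never disagree

    statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Upper-tail probability of a chi-square variable with 1 degree of
    # freedom: P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

A statistic above 3.841 corresponds to a difference that is significant at the 0.05 level for one degree of freedom.
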