2.3.8 Evaluating Classifiers
The most commonly accepted evaluation measures for RS are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) of the predicted interest (or rating) with respect to the measured one. These measures compute accuracy without making any assumption about the purpose of the RS. However, as McNee et al. point out [51], there is much more than accuracy
to deciding whether an item should be recommended. Herlocker et al. [42] provide
a comprehensive review of algorithmic evaluation approaches to RS. They suggest
that some measures may be more appropriate for certain tasks. However, they are unable to validate these measures empirically, since their evaluation covers only a single class of recommendation algorithms and a single set of data.
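As an illustration, the following Python sketch computes MAE and RMSE over paired lists of predicted and observed ratings; the rating values are hypothetical and serve only to show the computation.

import math

def mae(predicted, actual):
    # Mean Absolute Error: average absolute deviation of predictions
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    # Root Mean Squared Error: penalizes large errors more heavily than MAE
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical predicted vs. observed ratings on a 1-5 scale
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 3.5, 1.0, 4.5]
print(mae(predicted, actual))   # 0.625
print(rmse(predicted, actual))  # ~0.6614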
A step forward is to consider that the purpose of a “real” RS is to produce a top-N list of recommendations, and to evaluate an RS by how well it can classify items as being recommendable. If we regard recommendation as a classification problem, we can make use of well-known measures for classifier evaluation, such as precision and recall. In the following paragraphs, we will review some of these measures and their application to RS evaluation. Note, however, that learning algorithms and classifiers can be evaluated by multiple criteria: how accurately they perform the classification, their computational complexity during training, their complexity during classification, their sensitivity to noisy data, their scalability, and so on. In this section we will focus only on classification performance.
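To make these measures concrete, the following sketch computes precision and recall for a top-N recommendation list against the set of items a user is known to find relevant; the item identifiers and the cutoff N are hypothetical.

def precision_recall_at_n(recommended, relevant, n):
    # Treat the top-n recommendations as the "positive" classifications.
    top_n = recommended[:n]
    hits = len(set(top_n) & relevant)  # recommended items that are relevant
    precision = hits / n               # fraction of recommendations that are relevant
    recall = hits / len(relevant)      # fraction of relevant items that were recommended
    return precision, recall

# Hypothetical example: item IDs ranked by the recommender
recommended = ["i3", "i7", "i1", "i9", "i4"]
relevant = {"i1", "i4", "i8"}          # items the user actually liked
print(precision_recall_at_n(recommended, relevant, n=5))  # (0.4, 0.666...)

Precision rewards returning only relevant items, while recall rewards covering all of them; in top-N evaluation the cutoff N trades one against the other.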