The principle of this technique is rather simple. Let
us suppose that the similarity between messages can be
measured by a distance between their characteristic
vectors. To decide whether a message is
legitimate or not, we look at the class of the messages
that are closest to it. Generally, this technique does not
use a separate training phase, and the comparison
between the vectors is performed in real time. Classifying
one message therefore has a time complexity of O(nl),
where n is the size of the characteristic vector and l the
sample size. This cost can be reduced by using traditional
indexing methods [13][19][35]. To adjust the risk of false
classification, the t/k rule is introduced. It reads:
If at least t messages among the k nearest neighbors
of the message m are unsolicited, then m is
unsolicited email; otherwise, it is legitimate.
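
The rule can be illustrated with a minimal sketch, assuming a Euclidean distance and a brute-force search over the sample. The names classify_tk, euclidean and sample are hypothetical and do not come from the cited works, and the choice of distance is an assumption made only for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Distance between two characteristic vectors of size n, in O(n)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_tk(message: Vector,
                sample: List[Tuple[Vector, bool]],
                k: int,
                t: int) -> bool:
    """Return True (unsolicited) if at least t of the k nearest
    sample messages are labelled unsolicited, False otherwise."""
    # Brute force: one O(n) distance per sample message, so O(nl)
    # distance work for a sample of size l (plus the selection of
    # the k smallest distances).
    nearest = heapq.nsmallest(
        k, sample, key=lambda item: euclidean(message, item[0])
    )
    spam_votes = sum(1 for _, is_spam in nearest if is_spam)
    return spam_votes >= t


# Toy sample (illustrative data only): (vector, is_unsolicited)
sample = [
    ((1.0, 0.0), True),
    ((0.9, 0.1), True),
    ((0.0, 1.0), False),
    ((0.1, 0.9), False),
    ((0.2, 0.8), False),
]

# Two of the three nearest neighbours are unsolicited, so the
# decision depends on the threshold t.
print(classify_tk((0.8, 0.2), sample, k=3, t=2))
print(classify_tk((0.8, 0.2), sample, k=3, t=3))
```

Raising t for a fixed k makes the filter more conservative: more neighbors must agree before a message is rejected, which lowers the risk of classifying a legitimate message as unsolicited at the price of letting more unsolicited messages through.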
We should note that using an indexing method to
reduce the comparison time means that each update of
the sample has a complexity of O(m), where m is
the sample size. A variant of this technique is
known as the memory-based approach [2][46].
TiMBL [11] is a software package developed by the ILK
Research Group that implements a collection of machine
learning algorithms. Results of applying this technique
to spam filtering, reported in [2], seem to
be comparable to those of the Bayesian classifier