The Bag of Words method outperforms other methods for
object detection. It represents an image as an orderless
collection of local features [7] (e.g., in face
representations, local features can be an eye, an ear, or a
mouth). In face detection, however, all object images belong
to the same category (face images), so histograms of
orderless local features computed over the whole face do
not show large enough between-class variation [9].
In Bag of Words [7], orderless local features are extracted
from images of different categories (face or non-face) as
candidates for the basic elements, i.e., “words”. Each
feature descriptor is represented as a numerical vector. A
clustering method groups these vectors, and each cluster
center becomes a “codeword”; the set of codewords forms
the “codebook”, and the total number of clusters is the
codebook size. Each feature in an image is thus mapped to
the codeword of its cluster, and the image is represented
by the histogram of its codewords (see Figure 4).
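
The codebook construction and histogram mapping described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes plain k-means as the clustering method (the text does not name one), toy 2-D descriptors in place of real local features, and a deterministic, evenly spaced choice of initial centers so that the result is reproducible. The function names (build_codebook, bow_histogram, and so on) are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor, codebook):
    """Index of the codeword (cluster center) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def build_codebook(descriptors, codebook_size, iterations=20):
    """Cluster descriptors with plain k-means; the centers are the codewords."""
    if len(descriptors) < codebook_size:
        raise ValueError("need at least as many descriptors as codewords")
    # Assumption: evenly spaced initial centers, chosen for reproducibility.
    step = len(descriptors) // codebook_size
    codebook = [list(d) for d in descriptors[::step][:codebook_size]]
    for _ in range(iterations):
        clusters = [[] for _ in range(codebook_size)]
        for d in descriptors:
            clusters[nearest_codeword(d, codebook)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bow_histogram(image_descriptors, codebook):
    """Count how many descriptors of one image fall on each codeword."""
    hist = [0] * len(codebook)
    for d in image_descriptors:
        hist[nearest_codeword(d, codebook)] += 1
    return hist


# Toy training descriptors forming two well-separated groups (made-up data).
training = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
            (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
codebook = build_codebook(training, codebook_size=2)

# Descriptors of one hypothetical image: one near the first group, two near the second.
image = [(0.05, 0.05), (5.0, 5.0), (4.8, 5.2)]
hist = bow_histogram(image, codebook)
print(hist)  # [1, 2]
```

Here the codebook size is 2, so every image is summarized by a two-bin histogram regardless of how many local features it contains. In practice the descriptors would be high-dimensional and the codebook much larger.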