two different ways. Firstly, we intuitively decide the importance
of visual words by analysing the number of images
they appear in. Suppose a visual word Vi is indexed to ni
images. We set an upper threshold H and a lower threshold
L. If ni <= L or ni >= H, we remove this visual
word, arguing that it is less likely to be discriminative.
We analyse the results of this approach on the standard Oxford
Buildings dataset. We observe that the mean Average
Precision (mAP) decreases as the vocabulary size is reduced.
However, Precision-at-5 and Precision-at-10 remain
unaffected (see Figure 3).
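The threshold rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `prune_by_document_frequency`, the dictionary-based index, and the toy data are assumptions made for the example, not the implementation used in the experiments.

```python
from collections import Counter


def prune_by_document_frequency(image_words, low, high):
    """Return the visual words kept after threshold pruning.

    image_words: dict mapping an image id to the visual word ids it contains
                 (hypothetical in-memory stand-in for the inverted index).
    A word indexed to n images is removed when n <= low or n >= high.
    """
    doc_freq = Counter()
    for words in image_words.values():
        # Count each visual word once per image, however often it occurs.
        doc_freq.update(set(words))
    return {word for word, n in doc_freq.items() if low < n < high}


# Toy index: word 1 occurs in 3 images, word 3 in 2, words 2 and 4 in 1 each.
toy_index = {
    "img_a": [1, 2, 3],
    "img_b": [1, 3],
    "img_c": [1, 4],
}
kept = prune_by_document_frequency(toy_index, low=1, high=3)
```

With L = 1 and H = 3, only the word that occurs in exactly two images survives; the word present in every image and the two words present in a single image are dropped.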
In another approach, we follow a supervised pruning technique.
We use the ground truth images to identify those
visual words that result in wrong retrievals. We start with
a training set of labeled images. Initially, each visual word
Vi is given a score of zero. We perform retrieval for each image
in the training set. Consider the retrieval process for
an image Ii. A visual word Vj occurring in the image contributes
TF-IDF scores to the other database images, say Jk, in which it
occurs. Let gi denote the ground truth set for image Ii.
If Jk ∈ gi, then Vj's score is incremented by the TF-IDF
value; otherwise its score is decremented.
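The scoring procedure can be illustrated with the following Python sketch. Several details are assumptions made for the example: the TF-IDF weight is taken as the normalised term frequency times log(N / document frequency), the decrement is assumed to equal the same TF-IDF value, and the function name `score_visual_words` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def score_visual_words(train_images, database, ground_truth):
    """Accumulate a signed TF-IDF score for every visual word.

    train_images: dict query image id -> list of visual word ids
    database:     dict database image id -> list of visual word ids
    ground_truth: dict query image id -> set of relevant database image ids
    """
    n_db = len(database)
    term_freq = {img: Counter(words) for img, words in database.items()}
    postings = defaultdict(set)  # visual word -> database images containing it
    for img, words in database.items():
        for word in set(words):
            postings[word].add(img)

    scores = defaultdict(float)  # every word implicitly starts at zero
    for query, words in train_images.items():
        relevant = ground_truth.get(query, set())
        for word in set(words):
            if word not in postings:
                continue
            # Assumed weighting: normalised TF times log inverse document frequency.
            idf = math.log(n_db / len(postings[word]))
            for img in postings[word]:
                if img == query:
                    continue  # only the *other* database images receive votes
                weight = term_freq[img][word] / len(database[img]) * idf
                # Reward votes for ground-truth images, penalise the rest.
                scores[word] += weight if img in relevant else -weight
    return dict(scores)


toy_database = {"d1": [1, 2], "d2": [1, 3], "d3": [3, 3]}
toy_queries = {"q1": [1, 3]}
toy_truth = {"q1": {"d1", "d2"}}
word_scores = score_visual_words(toy_queries, toy_database, toy_truth)
```

In this toy setting, word 1 votes only for relevant images and ends with a positive score, while word 3 places its largest vote on the irrelevant image d3 and ends with a negative score.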
Hence, after iterating through each Ii, every visual word
Vi gets a final score Si. We observed that, out of a total