Instead of using 128-dimensional SIFT descriptors, we
simply use the quantized visual words corresponding to the
keypoint locations to check whether the retrieved image is
spatially consistent. We compare visual-word matching against SIFT matching
(using the full-length descriptors) on the standard Oxford Dataset. Using the
full-length descriptors gives a mean Average Precision (mAP) of 60.73%,
whereas using only the visual words reduces the mAP to 57.55%. However, our
application is concerned only with the top image, which acts as the annotation
source. Hence, we re-rank only the top 5 images and compute the precision
at rank 5 using visual-word matching for spatial re-ranking. It comes to 90%,
which remains the same even when the SIFT descriptors are used for matching.
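For concreteness, the sketch below (not from the paper) shows how precision at rank k can be computed from a re-ranked list; the 0/1 relevance labels in the usage example are hypothetical.

```python
def precision_at_k(ranked_relevance, k=5):
    """Precision at rank k: fraction of the top-k retrieved images
    that are relevant (relevance flags are 0/1)."""
    top_k = ranked_relevance[:k]
    return sum(top_k) / float(k)

# e.g. a re-ranked top-5 list where 4 of the 5 images are correct -> 0.8
# precision_at_k([1, 1, 0, 1, 1], k=5)
```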
With this, our storage and RAM requirements are lowered.
We store the keypoints and their corresponding visual
words, which take up 36 MB, preferably on the SD-card of the
mobile phone. During spatial verification, the corresponding
file for the retrieved image, of around 8 KB, is read from
the SD-card and copied into the RAM.
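As an illustration, the following sketch reads such a per-image file of keypoints and their visual words into memory; the binary record layout (two float32 coordinates plus one int32 word ID) and the file path are assumptions made for the example, not the format used in the paper.

```python
import struct

def load_keypoint_words(path):
    """Read a per-image file of (x, y, visual_word) records.
    Assumed layout: float32 x, float32 y, int32 word ID (12 bytes per record)."""
    records = []
    with open(path, "rb") as f:
        while chunk := f.read(12):
            if len(chunk) < 12:          # ignore a trailing partial record
                break
            x, y, word = struct.unpack("<ffi", chunk)
            records.append((x, y, word))
    return records

# e.g. keypoints of the top-ranked image, read from the SD-card (hypothetical path)
# records = load_keypoint_words("/sdcard/index/img_0421.kpw")
```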
During spatial verification, a keypoint Ki with visual word
Vi in the query frame matches a keypoint Kj with visual word
Vj in the retrieved image if both are represented by the
same visual word, i.e. if Vi = Vj. Therefore, instead of
computing and comparing the L2-distance for each pair of
128-dimensional descriptors, we compare only two integers.
This speeds up our spatial verification step.
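A minimal sketch of the two matching rules is given below, assuming hypothetical arrays of visual-word IDs and 128-dimensional SIFT descriptors; the distance threshold in the baseline is illustrative and not taken from the paper.

```python
import numpy as np

def match_by_visual_word(query_words, db_words):
    """Match keypoints by quantized visual word: keypoints Ki and Kj match
    when their word IDs are equal, i.e. a single integer comparison."""
    matches = []
    for i, vi in enumerate(query_words):
        for j, vj in enumerate(db_words):
            if vi == vj:                 # integer equality instead of L2 distance
                matches.append((i, j))
    return matches

def match_by_sift(query_desc, db_desc, thresh=200.0):
    """Baseline: match by L2 distance between 128-D SIFT descriptors
    (the costlier comparison that the visual-word test replaces).
    thresh is an illustrative cutoff on the distance."""
    matches = []
    for i, di in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - di, axis=1)   # 128-D L2 distances
        j = int(np.argmin(dists))
        if dists[j] < thresh:
            matches.append((i, j))
    return matches
```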