4.2 Determination of document relevance in VSM
Once the documents are indexed, a search system can rank the documents by their calculated similarity to a query. The query is represented in the same way as the documents, as a term vector with a weight for each stored term, except that normalizing the query vector is not essential: dividing by the query length scales every document's score by the same factor, so the ranking does not change.
The similarity between a single document and the query is calculated as the cosine similarity between their two vectors. If the two vectors are drawn in an N-dimensional Cartesian coordinate system (where N is the total number of distinct terms across both vectors, and each axis represents the weight of one term), then the cosine similarity equals the cosine of the angle between the two vectors.
To calculate the cosine similarity, the weight of each term in one vector is multiplied by the weight of the same term in the other vector (a weight of zero is assumed if the term does not exist there), and all of these products are summed. Finally, the sum is divided by the length of the first vector and by the length of the second vector.
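The calculation described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `{term: weight}` dictionary representation, the function name `cosine_similarity`, and the sample weights are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse term vectors given as {term: weight} dicts."""
    # A term missing from either vector has weight zero, so only shared terms
    # contribute to the sum of products.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    len_a = math.sqrt(sum(w * w for w in a.values()))
    len_b = math.sqrt(sum(w * w for w in b.values()))
    if len_a == 0 or len_b == 0:
        return 0.0  # assumption: an empty vector is treated as matching nothing
    return dot / (len_a * len_b)

# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
doc = {"vector": 3.0, "space": 4.0}   # length 5
query = {"vector": 1.0}               # length 1
score = cosine_similarity(doc, query) # 3 / (5 * 1)
```

Only the term shared by both vectors contributes to the sum, which is why sparse dictionaries are a convenient fit here.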
Because the term vector of each document is normalized during indexing, its length is equal to 1 for all documents and the division by it can be omitted. The same applies to the query term vector: it can be normalized once, before it is compared with the documents.
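With unit-length vectors the similarity reduces to the sum of products alone. The sketch below illustrates this under stated assumptions: the helper names `normalize`, `dot`, and `rank`, the in-memory `index` dictionary, and the document identifiers are all hypothetical and stand in for whatever structures the indexing step actually produces.

```python
import math

def normalize(vec):
    """Scale a {term: weight} vector to unit length (returned unchanged if empty)."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()} if length else dict(vec)

def dot(a, b):
    """Sum of products over the terms present in both vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def rank(index, query):
    """Order documents by similarity to the query, most similar first."""
    q = normalize(query)  # normalized once, then reused for every document
    scores = [(doc_id, dot(vec, q)) for doc_id, vec in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Document vectors are normalized at indexing time, so no division is needed later.
index = {
    "d1": normalize({"vector": 3.0, "space": 4.0}),
    "d2": normalize({"vector": 1.0}),
    "d3": normalize({"model": 2.0}),
}
ranking = rank(index, {"vector": 2.0})
```

Here `ranking` lists `d2` first, then `d1`, then `d3`, since `d3` shares no terms with the query.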
Figure 1.3 shows an example of two normalized vectors, V1 and V2; the cosine similarity between them is calculated below.