4.3 Example of VSM application
The example considers four documents and one query.
During indexing, all terms are extracted from the documents to create a representation of each document. In the process, an Inverse Document Frequency (IDF) vector has to be generated. To generate the IDF vector, the indexer first creates a Document Frequency (DF) vector that, for every term, counts the number of documents that contain the term. Subsequently, the total number of documents is divided by the number of documents that contain a specific term, and the logarithm of that value is stored in the IDF vector for that term.
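The DF and IDF steps described above can be sketched as follows. This is a minimal illustration, not the indexer's actual implementation: the four tokenized documents, the function names, and the choice of a base-10 logarithm are all assumptions (the text does not state the base).

```python
import math

def document_frequency(docs):
    """For every term, count the number of documents that contain it."""
    df = {}
    for tokens in docs:
        # set() ensures a term is counted once per document,
        # no matter how often it is repeated inside that document.
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df

def inverse_document_frequency(docs):
    """IDF(term) = log(N / DF(term)); base 10 is an assumption."""
    n = len(docs)
    df = document_frequency(docs)
    return {term: math.log10(n / count) for term, count in df.items()}

# Hypothetical collection of four already-tokenized documents.
docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["apple", "banana", "banana"],
    ["date"],
]

df = document_frequency(docs)
idf = inverse_document_frequency(docs)
```

In this made-up collection "apple" appears in three of the four documents, so it receives a lower IDF than "date", which appears in only one.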
All generated vectors should be normalized to eliminate the advantage given to longer documents: even if a term is repeated many times in a long document, it should not be considered relevant to that document if it is swamped by other terms. Normalizing a vector simply means dividing the weight of each term stored in that vector by the length of that vector.
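A minimal sketch of this normalization step is shown below. It assumes that "length" means the Euclidean norm of the vector and that a vector is stored as a term-to-weight dictionary; both are illustrative choices, and the weights are hypothetical.

```python
import math

def normalize(weights):
    """Divide every term weight by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        # An all-zero vector has no direction; return it unchanged.
        return dict(weights)
    return {term: w / length for term, w in weights.items()}

# Hypothetical raw weights for one document.
raw = {"apple": 3.0, "banana": 4.0}
unit = normalize(raw)

# Length of the normalized vector, expected to be 1 up to rounding.
unit_length = math.sqrt(sum(w * w for w in unit.values()))
```

Here the raw vector has length 5, so the normalized weights are 3/5 and 4/5, and the resulting vector has unit length regardless of how long the original document was.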
After the indexing process is completed, the system is ready to respond to queries. To retrieve the search results for a specific query, the similarity between the user query and each of the documents has to be calculated.
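The query step can be sketched as below. The text does not name the similarity measure at this point; the sketch assumes cosine similarity, which for vectors that are already normalized to unit length reduces to a dot product. The document identifiers, weights, and function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two sparse term-to-weight vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def rank(query_vec, doc_vecs):
    """Score every document against the query, highest score first."""
    scores = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical unit-length vectors for four documents and one query.
doc_vecs = {
    "d1": {"apple": 0.6, "banana": 0.8},
    "d2": {"apple": 1.0},
    "d3": {"cherry": 1.0},
    "d4": {"banana": 1.0},
}
query_vec = {"banana": 1.0}

ranked = rank(query_vec, doc_vecs)
```

With these made-up vectors, "d4" ranks first and "d1" second, while "d2" and "d3" share no term with the query and score zero.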
