There are two scenarios for near-duplicate detection.
One is the search scenario, where the goal is to find near-duplicates of a given document D.
This, like all search problems, conceptually involves the comparison of the query document to all other documents.
For a collection containing N documents, the number of comparisons required will be O(N).
The other scenario, discovery, involves finding all pairs of near-duplicate documents in the collection.
This process requires O(N²) comparisons.
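The difference in cost between the two scenarios can be illustrated with a minimal sketch. It assumes a simple word-set (Jaccard) overlap as the similarity measure and a threshold of 0.8; both are illustrative choices, not something the text prescribes, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-set overlap; an assumed stand-in for any word-based similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def search(query, docs, threshold=0.8):
    """Search scenario: compare one query document against every document."""
    query_words = set(query.split())
    hits, comparisons = [], 0
    for i, doc in enumerate(docs):
        comparisons += 1  # one comparison per document, N in total
        if jaccard(query_words, set(doc.split())) >= threshold:
            hits.append(i)
    return hits, comparisons


def discover(docs, threshold=0.8):
    """Discovery scenario: compare every pair of documents in the collection."""
    word_sets = [set(doc.split()) for doc in docs]
    pairs, comparisons = [], 0
    for i, j in combinations(range(len(docs)), 2):
        comparisons += 1  # one comparison per pair, N(N-1)/2 in total
        if jaccard(word_sets[i], word_sets[j]) >= threshold:
            pairs.append((i, j))
    return pairs, comparisons
```

With four documents, `search` performs 4 comparisons while `discover` performs 6; with a million documents the pairwise count grows to roughly 5 × 10^11, which is why the discovery scenario is so much more demanding.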
Information retrieval techniques that measure similarity using word-based representations of documents have been shown to be effective for identifying near-duplicates in the search scenario.
The computational requirements of the discovery scenario, however, have led to the development of new techniques for deriving compact representations of documents.
These compact representations are known as fingerprints.