Near-duplicate documents are found by comparing the fingerprints that represent them.
Near-duplicate pairs are defined by the number of shared fingerprints or the ratio of shared fingerprints to the total number of fingerprints used to represent the pair of documents.
Fingerprints do not capture all of the information in a document,
however,
which leads to errors in the detection of near-duplicates.
Appropriate selection techniques can reduce these errors,
but not eliminate them.
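The two criteria described above, a count of shared fingerprints and a ratio of shared to total fingerprints, can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, fingerprints are assumed to be hashable values such as integers, and "the total number of fingerprints used to represent the pair" is assumed to mean the number of distinct fingerprints across both documents.

```python
def shared_fingerprints(fp_a, fp_b):
    """Count the distinct fingerprints that appear in both documents."""
    return len(set(fp_a) & set(fp_b))


def fingerprint_overlap_ratio(fp_a, fp_b):
    """Ratio of shared fingerprints to all distinct fingerprints of the pair.

    Assumption: the "total" is the size of the union of the two sets.
    Returns 0.0 when neither document has any fingerprints.
    """
    a, b = set(fp_a), set(fp_b)
    total = a | b
    if not total:
        return 0.0
    return len(a & b) / len(total)


# Hypothetical fingerprint values for two documents.
doc_a = [101, 202, 303, 404]
doc_b = [303, 404, 505, 606]

count = shared_fingerprints(doc_a, doc_b)
ratio = fingerprint_overlap_ratio(doc_a, doc_b)
```

A pair would then be flagged as near-duplicate when the count or the ratio exceeds some threshold; the threshold itself is a tuning choice that the text does not specify.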
As we mentioned,
evaluations have shown that comparing word-based representations using a similarity measure such as the cosine correlation (see section 7.1.2) is generally significantly more effective than fingerprinting methods for finding near-duplicates.
The problem with these word-based methods is their poor efficiency.
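To make the word-based comparison concrete, the sketch below computes a cosine measure between two documents represented as raw term-frequency vectors. It is only an illustration under simplifying assumptions: tokenization is lowercasing plus whitespace splitting, no term weighting is applied, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())

    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    # An empty document has no direction, so treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sim = cosine_similarity("tropical fish tank", "tropical fish aquarium")
```

Each call touches every distinct word in both documents, and applying it to every pair in a collection of n documents requires n(n-1)/2 comparisons, which illustrates why efficiency becomes the limiting factor.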