The detection of near-duplicate documents is more difficult.
Even defining a near-duplicate is challenging.
Web pages, for example, could have the same text content but differ in the advertisements, dates, or formatting.
Other pages could have small differences in their content from revisions or updates.
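
To make the difficulty concrete, the following is a minimal sketch in Python, not a method prescribed by the text. It compares two hypothetical page texts that differ only in a date line. The helper names `checksum` and `jaccard`, the sample strings, and the choice of word-set overlap as the similarity measure are all assumptions made for illustration.

```python
import hashlib


def checksum(text: str) -> str:
    """Exact fingerprint: any change to the text changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts (an assumed, illustrative measure)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical pages: same body text, different date line.
page_a = "Posted 2 May. Tropical fish include fish found in tropical environments."
page_b = "Posted 9 June. Tropical fish include fish found in tropical environments."

same_checksum = checksum(page_a) == checksum(page_b)
overlap = jaccard(page_a, page_b)

print("identical checksums:", same_checksum)
print("word overlap:", round(overlap, 2))
```

An exact checksum treats the two pages as entirely different, while the overlap score shows that most of their words are shared. How high that score must be before the pages count as near-duplicates is a judgment call, which is why the definition itself is hard to pin down.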