Documents with very similar content generally provide little or no new information to the user, but consume significant resources during crawling, indexing, and search.
In response to this problem, algorithms have been developed to detect duplicate documents, so that the duplicates can be removed or treated as a group during indexing and ranking.
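As an illustration only, the sketch below shows one common way such detection can work: compare documents by the overlap of their word shingles (Jaccard similarity) and group those above a threshold. The text does not specify a particular algorithm, so the function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for the example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_duplicates(docs, threshold=0.5, k=3):
    """Group document indices whose shingle sets are similar (hypothetical helper).

    Documents are linked when their Jaccard similarity is at least
    `threshold`; linked documents are merged with a small union-find.
    """
    sets = [shingles(d, k) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group could then be collapsed to a single representative, or indexed and ranked as a unit. The pairwise comparison here is quadratic in the number of documents, so it only suits small collections; it is meant to show the idea, not a production design.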