Duplicate and near-duplicate documents occur in many situations.
Making copies and creating new versions of documents is a constant activity in offices,
and keeping track of these is an important part of information management.
On the Web, however, the situation is more extreme.
In addition to the normal sources of duplication, plagiarism and spam are common.
Moreover, multiple URLs pointing to the same web page, together with mirror sites, can cause a crawler to generate large numbers of duplicate pages.
Studies have shown that about 30% of the web pages in a large crawl are exact or near-duplicates of pages in the other 70% (e.g., Fetterly et al., 2003).
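The exact-duplicate case can be handled with a simple content checksum: two pages whose text hashes to the same value are byte-identical copies. The following is a minimal sketch (the URLs and function names are illustrative, not from the source); note that this catches only exact duplicates, while near-duplicates require more tolerant fingerprinting techniques.

```python
import hashlib

def content_hash(page_text: str) -> str:
    # Hash the page body; stripping surrounding whitespace is a
    # trivial normalization so otherwise identical bodies match.
    return hashlib.sha256(page_text.strip().encode("utf-8")).hexdigest()

def find_exact_duplicates(pages: dict) -> dict:
    """Group URLs whose page content hashes to the same digest."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(content_hash(text), []).append(url)
    # Keep only buckets containing more than one URL, i.e. duplicates.
    return {h: urls for h, urls in groups.items() if len(urls) > 1}

# Hypothetical crawl result: two URLs serving the same content.
pages = {
    "http://example.com/a": "Same content",
    "http://example.com/b": "Same content",
    "http://example.com/c": "Different content",
}
print(find_exact_duplicates(pages))
```

Because the hash changes completely under any edit, this approach cannot detect near-duplicates (pages differing only in a date, an advertisement, or a navigation bar); those call for similarity-preserving fingerprints rather than exact checksums.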