In general,
a near-duplicate is defined using a threshold value for some similarity measure between pairs of documents.
For example,
a document D1 could be defined as a near-duplicate of documentD2 if more than 90% of the words in the documents were the same.