There are a number of fingerprinting algorithms that use this general approach,
and they differ mainly in how subsets of the n-grams are selected.
Selecting a fixed number of n-grams at random performs poorly at finding near-duplicates.
Consider two near-identical documents, D1 and D2.
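The fingerprints generated from n-grams selected randomly from document D1 are unlikely to have a high overlap with the fingerprints generated from a different set of n-grams selected randomly from D2, because two independent random samples rarely pick the same n-grams even when the documents share almost all of them.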
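
The effect can be seen in a small experiment. The following Python sketch is only an illustration under stated assumptions: the word-level 3-grams, the CRC32 hash, the sample size of 20, and all function names are choices made for this example, not part of any particular fingerprinting algorithm. It builds two documents that differ in a single word and compares the overlap of their complete n-gram fingerprints with the overlap of two independently drawn random samples.

```python
import random
import zlib

def word_ngrams(text, n=3):
    """Return all overlapping word n-grams of the text (n=3 is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def fingerprint(ngrams):
    """Hash each n-gram to an integer; CRC32 stands in for any hash function."""
    return {zlib.crc32(g.encode("utf-8")) for g in ngrams}

def random_fingerprint(text, k, rng, n=3):
    """Fingerprint built from k n-grams chosen uniformly at random."""
    grams = word_ngrams(text, n)
    return fingerprint(rng.sample(grams, min(k, len(grams))))

def overlap(a, b):
    """Fraction of the smaller fingerprint that also occurs in the other one."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Two near-identical synthetic documents: D2 differs from D1 in one word.
words = [f"w{i}" for i in range(1000)]
d1 = " ".join(words)
d2 = " ".join(words[:500] + ["changed"] + words[501:])

rng = random.Random(0)  # fixed seed so the run is repeatable

# Overlap when every n-gram is kept versus when 20 are sampled per document.
full = overlap(fingerprint(word_ngrams(d1)), fingerprint(word_ngrams(d2)))
sampled = overlap(random_fingerprint(d1, 20, rng), random_fingerprint(d2, 20, rng))

print(f"all n-grams: {full:.3f}")
print(f"random sample of 20: {sampled:.3f}")
```

With roughly a thousand n-grams per document and only 20 sampled from each, the two samples are expected to share less than one n-gram on average, so the sampled overlap stays near zero even though the full overlap is close to one. A selection rule that makes the same choice for the same n-gram in every document avoids this problem.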