There are a number of fingerprinting algorithms that use this general approach,
and they differ mainly in how subsets of the n-grams are selected.
Selecting a fixed number of n-grams at random does not work well for finding near-duplicates.
Consider two near-identical documents, D1 and D2.
The fingerprints generated from n-grams selected randomly from document D1 are unlikely to have a high overlap with the fingerprints generated from a different set of n-grams selected randomly from D2.
A more effective technique uses prespecified combinations of characters, and selects n-grams that begin with those characters.
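As a rough illustration of prefix-based selection, the sketch below keeps only the n-grams that begin with one of a small set of character combinations. The function names, the use of word n-grams with n = 3, and the particular prefixes are assumptions made for this example, not values prescribed by the technique.

```python
def word_ngrams(text, n=3):
    """Split text into lowercase words and return all overlapping word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def select_by_prefix(text, prefixes=("th", "fo"), n=3):
    """Keep only the n-grams that begin with one of the given prefixes.

    The prefixes are hypothetical; a real system would choose them
    in advance and apply the same set to every document.
    """
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of options
    return {g for g in word_ngrams(text, n) if g.startswith(prefixes)}
```

Because the selection rule depends only on the content of each n-gram, an n-gram that appears in both D1 and D2 is either selected in both documents or in neither.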
Another popular technique, called 0 mod p, is to select all n-grams whose hash value modulo p is zero, where p is a parameter.
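The 0 mod p rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses word n-grams with n = 3, takes CRC-32 from Python's zlib module as the hash function, and sets p = 4. None of these choices comes from the text, and the function names are hypothetical.

```python
import zlib


def word_ngrams(text, n=3):
    """Split text into lowercase words and return all overlapping word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def zero_mod_p_fingerprints(text, n=3, p=4):
    """Return the hash values of the n-grams whose hash is divisible by p."""
    selected = set()
    for gram in word_ngrams(text, n):
        # CRC-32 is an assumed stand-in; any hash that is stable across
        # runs and documents would serve the same purpose.
        h = zlib.crc32(gram.encode("utf-8"))
        if h % p == 0:
            selected.add(h)
    return selected


def overlap(fp_a, fp_b):
    """Jaccard overlap between two fingerprint sets (0.0 if both are empty)."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0
```

With a well-distributed hash, roughly one n-gram in p is selected, so p trades fingerprint size against coverage. Since the rule is deterministic, any n-gram shared by D1 and D2 receives the same decision in both, which is what random selection fails to guarantee.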