Figure 3.15 shows an example of this process for an 8-bit fingerprint.
Note that common words (stopwords) are removed as part of the text processing.
In practice, much larger values of b are used.
Henzinger (2006) describes a large-scale Web-based evaluation where the fingerprints had 384 bits.
A web page is defined as a near-duplicate of another page if the simhash fingerprints agree on more than 372 of the 384 bits, that is, if they differ in at most 11 bit positions.
This study showed significant effectiveness advantages for the simhash approach compared to fingerprints based on n-grams.
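As a rough illustration of fingerprint generation and the bit-agreement test, the following Python sketch computes a b-bit simhash and compares two fingerprints. It is a minimal sketch under stated assumptions, not the implementation used in the study: the stopword list, the use of word frequencies as weights, the SHA-256-based word hash, and the function names `word_hash`, `simhash`, `agreeing_bits`, and `is_near_duplicate` are all choices made here for illustration.

```python
import hashlib
from collections import Counter

# Illustrative stopword list (an assumption; real systems use longer lists).
STOPWORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})


def word_hash(word, b):
    """Return a b-bit integer hash of a word.

    SHA-256 is an assumption made for this sketch; any hash that spreads
    words evenly over b bits would serve. Digests are concatenated until
    at least b bits are available, then truncated to b bits.
    """
    digest = b""
    counter = 0
    while len(digest) * 8 < b:
        digest += hashlib.sha256(f"{counter}:{word}".encode("utf-8")).digest()
        counter += 1
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - b)


def simhash(text, b=8):
    """Compute a b-bit simhash fingerprint of text, returned as an integer."""
    # Remove stopwords and weight each remaining word by its frequency
    # (frequency weighting is an assumption; other weights are possible).
    weights = Counter(w for w in text.lower().split() if w not in STOPWORDS)

    # For each bit position, add the weight if the word's hash has a 1
    # there and subtract it if the hash has a 0.
    v = [0] * b
    for word, weight in weights.items():
        h = word_hash(word, b)
        for i in range(b):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight

    # Bit i of the fingerprint is 1 when the i-th total is positive.
    fingerprint = 0
    for i in range(b):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def agreeing_bits(f1, f2, b):
    """Count the bit positions (out of b) where two fingerprints agree."""
    return b - bin(f1 ^ f2).count("1")


def is_near_duplicate(f1, f2, b=384, threshold=372):
    """Apply the test from the text: more than `threshold` agreeing bits."""
    return agreeing_bits(f1, f2, b) > threshold
```

With these definitions, `is_near_duplicate(simhash(doc1, 384), simhash(doc2, 384))` would apply the 384-bit, 372-bit-agreement test to two documents, where `doc1` and `doc2` are hypothetical strings. Counting agreeing bits through an XOR keeps the comparison cheap even for long fingerprints.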