For collections of handwritten manuscripts written by a
single author (or a few authors) – for example the George
Washington collection used in this paper – the images of
multiple instances of the same word are likely to look similar.
For such collections, the word spotting idea [5] provides
an alternative approach to index generation: first, each page
in the document collection is segmented into words, and the
different instances of a word are clustered together using
image matching. Then, a human can tag the m most interesting
clusters for indexing with the appropriate ASCII equivalent,
which could be used to build a partial index for
the analyzed collection. Historical handwritten documents
are often of poor quality and, unlike printed documents,
exhibit variation in the way words are written. Thus,
both segmentation of a page into words and the matching of
word images are challenging problems for such documents.
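
The cluster-then-tag pipeline described above can be sketched roughly as follows. This is only an illustration under several assumptions that are not part of the method itself: each segmented word image is assumed to be already reduced to a small numeric feature vector, a simple greedy threshold clustering stands in for image matching, and "most interesting" is taken to mean "largest". The names `cluster_words`, `build_partial_index`, `features`, `locations` and `labels` are hypothetical.

```python
from collections import defaultdict
from math import dist


def cluster_words(features, threshold):
    """Greedy clustering (a stand-in for image matching, not the paper's method).

    Each word joins the first cluster whose seed word lies within
    `threshold` (Euclidean distance); otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of indices into `features`
    for i, f in enumerate(features):
        for c in clusters:
            if dist(features[c[0]], f) <= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def build_partial_index(clusters, locations, labels, m):
    """Index the m largest clusters, using human-supplied ASCII labels.

    `labels` maps a cluster number to its ASCII equivalent; clusters
    without a label are skipped, so the resulting index is partial.
    """
    ranked = sorted(range(len(clusters)),
                    key=lambda k: len(clusters[k]), reverse=True)
    index = defaultdict(list)
    for k in ranked[:m]:
        if k in labels:
            index[labels[k]].extend(locations[i] for i in clusters[k])
    return dict(index)


# Toy data: hypothetical 2-D features and (page, word position) locations.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0),
            (5.1, 5.0), (0.0, 0.1), (9.0, 0.0)]
locations = [(1, 0), (1, 3), (1, 5), (2, 1), (2, 4), (3, 2)]

clusters = cluster_words(features, threshold=0.5)
labels = {0: "Washington", 1: "letter"}  # tags a human might assign
index = build_partial_index(clusters, locations, labels, m=2)
print(index)
```

In this toy run the third cluster (a single word instance) falls outside the m = 2 tagged clusters and is left out of the index, which is why the index is only partial.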