There are many historical manuscripts written in a single hand which it would be useful to index.
Examples include the early Presidential papers at the Library of Congress and the collected
works of W. E. B. Du Bois at the library of the University of Massachusetts.
The standard technique for indexing documents is to scan them, convert them to machine-readable form (ASCII) using Optical Character Recognition (OCR), and then index them using a text retrieval engine.
However, OCR does not work well on handwriting.
Here, an alternative scheme is proposed for indexing such texts. Each page of the document is segmented into words. The images of the words are then matched against each other to create equivalence classes (each equivalence class contains multiple instances of the same word). The user then provides ASCII equivalents for, say, the top 2000 equivalence classes.
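
The grouping and labeling steps described above can be illustrated with a minimal sketch. It assumes that word images have already been segmented and binarized to a common size, and it uses a simple pixel-difference (XOR) measure with a fixed threshold as a stand-in for the matching step, which this section does not specify. The names xor_distance, group_words, build_index and the threshold value are hypothetical and chosen only for illustration.

```python
def xor_distance(a, b):
    """Fraction of pixels that differ between two binary images (lists of rows).

    Assumption: images of different sizes are treated as non-matching.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def group_words(images, threshold=0.2):
    """Group word images into equivalence classes, largest class first.

    Two images are linked when their distance is at most `threshold`
    (an assumed value); links are closed transitively with union-find.
    Returns a list of classes, each a list of image indices.
    """
    parent = list(range(len(images)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if xor_distance(images[i], images[j]) <= threshold:
                parent[find(j)] = find(i)

    classes = {}
    for i in range(len(images)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values(), key=len, reverse=True)


def build_index(classes, labels, locations):
    """Map each user-supplied label to the locations of its class members.

    `labels` covers only the largest classes; unlabeled classes are skipped.
    """
    return {text: [locations[i] for i in members]
            for members, text in zip(classes, labels)}


# Toy data: three near-identical images of one word and one of another.
images = [
    [[1, 1, 0], [0, 1, 0]],
    [[0, 0, 1], [1, 0, 1]],
    [[1, 1, 0], [0, 1, 0]],
    [[1, 1, 0], [0, 1, 1]],
]
locations = [(1, 0), (1, 1), (1, 5), (2, 3)]  # (page, word position)

classes = group_words(images)
index = build_index(classes, ["the", "and"], locations)
print(index)
```

Because the user labels classes rather than individual words, the labeling effort grows with the number of distinct frequent words and not with the length of the manuscript. The pairwise comparison here is quadratic in the number of word images, so a real collection would need some pruning before matching.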