Parsing a document
What format is it in?
What language is it in?
What character set is in use?
Each of these is a classification problem, which we will study later in the course.
In practice, however, these tasks are often done heuristically.
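To make "done heuristically" concrete, here is a minimal sketch of rule-based guesses for the three questions above. It is an illustration under stated assumptions, not a production detector: the function names, the handful of byte signatures, the Latin-1 fallback, and the tiny stopword lists are all hypothetical choices made for this example.

```python
import codecs

def guess_format(data: bytes) -> str:
    """Guess a file format from its leading bytes (a few common signatures only)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container; modern Word/Excel files (.docx/.xlsx) are ZIP archives.
        return "zip-container"
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"

def guess_charset(data: bytes) -> str:
    """Guess a character encoding: byte-order mark first, then trial decoding."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Note: UTF-32 byte-order marks are not distinguished here.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 decodes any byte sequence, so it never fails.
        return "latin-1"

# Illustrative stopword lists (hypothetical and far too small for real use).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}

def guess_language(text: str) -> str:
    """Pick the language whose stopwords occur most often in the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Each function is a fixed rule rather than a trained classifier, which is why such guesses are cheap but can be wrong on short or unusual inputs.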
Complications: Format/language
The documents being indexed may be written in many different languages.
A single index may contain terms from many languages.
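The point about a single index holding terms from several languages can be sketched with a toy inverted index. The three documents, their language tags, and the `build_index` helper are made up for illustration; whitespace tokenization is an assumption that would not suit every language.

```python
from collections import defaultdict

# Hypothetical collection: doc_id -> (language tag, text).
docs = {
    1: ("en", "the cat sat"),
    2: ("fr", "le chat noir"),
    3: ("en", "live chat support"),
}

def build_index(docs):
    """Build one inverted index: term -> list of (doc_id, language) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        lang, text = docs[doc_id]
        # Each distinct term in a document contributes one posting.
        for term in sorted(set(text.lower().split())):
            index[term].append((doc_id, lang))
    return dict(index)

index = build_index(docs)

# The same dictionary now mixes English and French terms, and a string such
# as "chat" has postings from documents in both languages.
print(index["chat"])
```

Storing the language tag in each posting is one possible design; it lets a query be restricted to one language even though all terms share a single dictionary.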