3. Text Filtering: In a corpus of several thousand documents, you will likely have many terms that are
irrelevant to either differentiating documents from each other or to summarizing the documents. You
will have to manually browse through the terms to eliminate irrelevant terms. This is often one of the
most time-consuming and subjective tasks in all of the text mining steps. It requires a fair amount of
subject matter knowledge (or domain expertise). In addition to term filtering, you can use keyword
searches to find documents that are irrelevant to the analysis. Documents can be filtered out if they do
not contain certain terms, or filtered on other document variables such as date or category. Term
filtering or document filtering alters the term-by-document matrix, because the corresponding terms or
documents are removed from it. As shown in Table 1.1, each cell of the term-by-document matrix contains
the frequency of occurrence of the term in the document. Alternatively, each cell could hold the log of
the frequency, or simply a 1 or 0 indicating the presence or absence of the term in the document. From
this frequency matrix, a weighted term-by-document matrix is generated using various term-weighting
techniques.
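
The filtering step and the alternative cell values can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four-document corpus, the list of dropped terms, the keyword set, and the function names are invented, and the log cell is computed as log2(frequency + 1) so that absent terms stay at 0. It is not the procedure of any particular text mining tool.

```python
import math
from collections import Counter

# Toy corpus invented for illustration: doc_id -> (text, category).
DOCS = {
    "d1": ("the engine failed and the engine stalled", "auto"),
    "d2": ("the brakes failed", "auto"),
    "d3": ("the recipe needs butter", "food"),
    "d4": ("the paint looks new", "auto"),
}

# Hypothetical choices an analyst might make by hand.
DROP_TERMS = {"the", "and"}      # terms judged irrelevant
KEYWORDS = {"engine", "brakes"}  # a document must contain at least one


def filter_documents(docs, category, keywords):
    """Keep documents in the given category that contain any keyword."""
    kept = {}
    for doc_id, (text, cat) in docs.items():
        if cat == category and keywords & set(text.split()):
            kept[doc_id] = text
    return kept


def term_by_document(texts, drop_terms):
    """Return {term: {doc_id: frequency}}, skipping dropped terms."""
    matrix = {}
    for doc_id, text in texts.items():
        counts = Counter(t for t in text.split() if t not in drop_terms)
        for term, freq in counts.items():
            matrix.setdefault(term, {})[doc_id] = freq
    return matrix


def transform(matrix, doc_ids, cell):
    """Apply `cell` to each frequency; absent terms have frequency 0."""
    return {
        term: [cell(row.get(d, 0)) for d in doc_ids]
        for term, row in sorted(matrix.items())
    }


texts = filter_documents(DOCS, "auto", KEYWORDS)
doc_ids = sorted(texts)
freq = term_by_document(texts, DROP_TERMS)

raw = transform(freq, doc_ids, lambda f: f)                     # frequency
binary = transform(freq, doc_ids, lambda f: 1 if f > 0 else 0)  # presence
logged = transform(freq, doc_ids, lambda f: math.log2(f + 1))   # assumed log form

for term in raw:
    print(term, raw[term], binary[term], logged[term])
```

In this sketch, document filtering removes columns (d3 fails the category test, d4 contains no keyword) and term filtering removes rows ("the" and "and"), which is how both kinds of filtering change the shape of the matrix before any term weighting is applied.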