Tokenization is the process of breaking a stream of text into phrases, words, symbols, or other meaningful elements called tokens. The goal of tokenization is to identify the words in a sentence. Textual data begins as nothing more than a block of characters, while information retrieval requires the individual words of the data set. A parser is therefore required to tokenize the documents. This may seem trivial, since the text is already stored in machine-readable formats. Nevertheless, some problems remain, such as the removal of punctuation marks and of other characters like brackets and hyphens. The main use of tokenization is the identification of meaningful keywords. Another problem is abbreviations and acronyms, which need to be transformed into a standard form.
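
As a minimal sketch of the tokenization step described above, the following Python function splits raw text into word tokens while discarding punctuation such as brackets, hyphens, and commas. The regular expression and lowercasing are assumptions chosen for this illustration, not a prescribed implementation:

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens, discarding
    punctuation such as brackets, hyphens, and commas."""
    # \w+ matches runs of letters, digits, and underscores;
    # everything else (punctuation, whitespace) acts as a separator.
    return re.findall(r"\w+", text.lower())

print(tokenize("Text-mining (information retrieval) requires tokens!"))
# ['text', 'mining', 'information', 'retrieval', 'requires', 'tokens']
```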
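
Similarly, the transformation of abbreviations and acronyms into a standard form can be sketched as a simple lookup-and-replace pass applied before tokenization. The mapping table here is a made-up example; the choice of canonical forms is application-specific:

```python
# Hypothetical abbreviation table; a real system would load a
# domain-specific dictionary instead.
ABBREVIATIONS = {
    "e.g.": "for example",
    "etc.": "et cetera",
    "U.S.A.": "USA",
}

def expand_abbreviations(text):
    """Replace known abbreviations with their standard form
    before tokenization, so their periods are not mistaken
    for sentence boundaries or stripped as punctuation."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text

print(expand_abbreviations("Punctuation, brackets, etc. are removed."))
# Punctuation, brackets, et cetera are removed.
```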