Matching the words to the new master dictionary makes it possible to tabulate simple
statistics on the frequency of each word and the distribution of word length in the
sample. Table 4.6 presents the distribution of word lengths. The shortest words have
two characters, because one-character words were removed when the text was read in.
The longest word, “overcollateralization”, has 21 characters and occurs three times.
Two-character words are the most frequent, with
1407778 occurrences, after which the frequency tends to decrease as the length
increases. This pattern is illustrated in Figure 5 and is unsurprising given the word
frequencies reported in Table 4.7, which show that ten of the most frequent words have
four characters or fewer, and six of these have only two characters.
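
The tabulation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used to produce Tables 4.6 and 4.7: the function name `tabulate_words`, the regular-expression tokenizer, and the toy `master_dictionary` set are all assumptions introduced for the example.

```python
import re
from collections import Counter


def tabulate_words(text, master_dictionary):
    """Return (word frequencies, word-length distribution) for a text.

    Assumptions (not taken from the source): words are runs of ASCII
    letters, matching is case-insensitive, and the master dictionary
    is a plain set of lower-case words.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop one-character words and keep only words found in the dictionary.
    words = [w for w in tokens if len(w) >= 2 and w in master_dictionary]
    word_freq = Counter(words)                     # occurrences per word
    length_dist = Counter(len(w) for w in words)   # occurrences per length
    return word_freq, length_dist


# Toy input, purely for illustration.
sample_text = "It is to be noted that overcollateralization of a loan is in place"
master_dictionary = {
    "it", "is", "to", "be", "noted", "that",
    "overcollateralization", "of", "loan", "in", "place",
}

word_freq, length_dist = tabulate_words(sample_text, master_dictionary)

for length in sorted(length_dist):
    print(length, length_dist[length])
print(word_freq.most_common(3))
```

Sorting `length_dist` by key gives a table in the shape of Table 4.6, while `word_freq.most_common(n)` gives the ranking that a table such as Table 4.7 reports.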