To produce the simple text statistics for the total sample, all letters were read into Perl.
Regular expressions were used to remove punctuation and to retain tokens, i.e. character
combinations, of two or more alphabetic characters. One-character tokens were removed to
reduce noise from stray, meaningless characters that may have resulted from the OCR
conversions. The tokens were then matched against the words in Loughran and McDonald’s
(2011) master dictionary.
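
The procedure described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used for the actual analysis, and it is not the original script. The function names, the tiny `SAMPLE_DICTIONARY` word set, and the case-insensitive matching are assumptions made for illustration only; the real master dictionary is far larger.

```python
import re
from collections import Counter

# Hypothetical stand-in for the master dictionary (assumption: a set of
# upper-case words; the real word list is much larger).
SAMPLE_DICTIONARY = {"DEAR", "SHAREHOLDERS", "OUR", "REVENUE", "GREW", "IN"}


def tokenize(text):
    """Remove punctuation, then keep alphabetic tokens of two or more characters."""
    # Delete every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", text)
    # Split on whitespace and drop one-character or non-alphabetic tokens.
    return [
        tok for tok in no_punct.split()
        if len(tok) >= 2 and re.fullmatch(r"[A-Za-z]+", tok)
    ]


def text_statistics(text, dictionary):
    """Count retained tokens and how many of them match a dictionary word."""
    tokens = tokenize(text)
    # Assumption: matching ignores case by upper-casing each token.
    matched = [tok.upper() for tok in tokens if tok.upper() in dictionary]
    return {
        "tokens": len(tokens),
        "matched": len(matched),
        "counts": Counter(matched),
    }


if __name__ == "__main__":
    sample = "Dear shareholders, our revenue grew 12% in 2011. l x qzv"
    print(text_statistics(sample, SAMPLE_DICTIONARY))
```

In this sketch, stray single characters such as "l" and "x" are discarded before matching, while a multi-character OCR fragment such as "qzv" is retained as a token but does not match any dictionary word.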