In this work we investigated the task of fully unsupervised POS induction in five different languages.
We identified and proposed solutions for three major problems of the simple hidden Markov model
that has been used extensively for this task: i) treating words atomically, ignoring orthographic
and morphological information – which we addressed by replacing multinomial word distributions
with small maximum-entropy models; ii) an excessive number of parameters that allows models to
fit irrelevant correlations – which we addressed by discarding parameters with small support in the
corpus; iii) a training regime (maximum likelihood) that allows very high word ambiguity – which
we addressed by training within the PR framework using a word ambiguity penalty. We show that all
of these solutions improve the model's performance and that the improvements are additive. Against
the regular HMM baseline, we achieve an improvement of 10.4% on average.
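For concreteness, the emission model in i) can be written as a locally normalized maximum-entropy distribution over the vocabulary $V$; the form below is a schematic sketch in our own notation, in which the feature function $f(w,t)$ is assumed to include the word identity together with orthographic cues such as suffixes, capitalization, and digit or hyphen indicators:
\[
P_\theta(w \mid t) \;=\; \frac{\exp\bigl(\theta^{\top} f(w,t)\bigr)}{\sum_{w' \in V} \exp\bigl(\theta^{\top} f(w',t)\bigr)}.
\]
Similarly, one common instantiation of the word ambiguity penalty in iii), again in our notation, is an $\ell_1/\ell_\infty$ term on the posterior tag assignments, $\sigma \sum_{w,t} \max_{i:\, w_i = w} q(t_i = t)$, which is small only when each word type concentrates its posterior mass on a few tags.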
We also compared our system against the main competing systems and showed that our approach
performs better in every language except English. Moreover, our approach performs well across
languages and learning conditions, even when its hyperparameters are not tuned to those conditions.
When the induced clusters are used as features in a semi-supervised POS tagger trained with a small
amount of supervised data, we observe significant improvements. Moreover, the clusters induced by
our system always perform as well as or better than the clusters produced by other systems.