Abstract
The rapid growth of information in the digital world, especially on the web, calls for
automated methods of organizing digital information for convenient access and efficient
information retrieval. Topic modeling is a branch of machine learning and probabilistic
graphical modeling that helps arrange web pages according to their topical
structure.
The topic distribution over a set of documents (web pages) and the affinity of
a document toward a specific topic can be revealed using topic modeling. Topic modeling
algorithms are typically computationally expensive due to their iterative nature.
Recent research efforts have successfully parallelized specific topic models.
These parallel algorithms, however, have tightly coupled parallel processes
that require frequent synchronization, and they are also tightly coupled to the underlying
topic model used for inferring the topic hierarchy.
In this paper, we propose a parallel
algorithm to infer topic hierarchies from a large-scale document corpus. A key feature
of the proposed algorithm is that it exploits coarse-grained parallelism: the components
running in parallel need not synchronize after every iteration, so the algorithm lends
itself to implementation on a geographically dispersed set of processing elements
interconnected through a network. The parallel algorithm achieves a speedup of 53.5 on a
32-node cluster of dual-core workstations while attaining approximately
the same likelihood, or predictive accuracy, as the sequential algorithm with respect
to the performance of information retrieval tasks.