The rapid and extensive spread of information through the web has fostered the diffusion of a huge amount of unstructured natural-language textual resources. Great interest has arisen in the last decade in discovering, accessing and sharing such a vast source of knowledge. For this reason, processing very large data volumes in a reasonable time frame is becoming a major challenge and a crucial requirement for many commercial and research fields. Distributed systems, computer clusters and parallel computing
paradigms have been increasingly adopted in recent years, since they bring significant performance improvements in data-intensive contexts such as Big Data mining and analysis. Natural Language Processing, and particularly the tasks of text annotation and key feature extraction, is an application area with high computational requirements; these tasks can therefore benefit significantly from parallel architectures. This
paper presents a distributed framework for crawling web documents and running Natural Language Processing tasks in parallel. The system is based on the Apache Hadoop ecosystem and its parallel programming paradigm, MapReduce. Specifically, we implemented a MapReduce adaptation of a GATE application (GATE is a widely used open-source framework for text engineering and NLP). We also validate the solution by using it to extract keywords and keyphrases from web documents on a multi-node Hadoop cluster. Performance scalability has been evaluated against a real corpus of web pages and documents.
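To illustrate the MapReduce paradigm the framework builds on, the sketch below shows a keyword-count job in the classic map/shuffle/reduce shape. This is a minimal illustration and not the paper's implementation: the actual system runs GATE annotation pipelines inside Hadoop tasks, whereas here the mapper is a hypothetical tokenizer with a small stopword list, and the shuffle phase is simulated in-process.

```python
# Illustrative MapReduce-style keyword count (not the paper's GATE-based code).
# Mapper emits (keyword, 1) pairs; shuffle groups pairs by key; reducer sums.
import re
from collections import defaultdict

# Hypothetical stopword list, stands in for real keyword-extraction logic.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}

def mapper(doc_id, text):
    """Emit (keyword, 1) for every non-stopword token in one document."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOPWORDS:
            yield token, 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    """Sum the per-document counts for one keyword."""
    return key, sum(values)

def run_job(documents):
    """Run the whole map/shuffle/reduce pipeline over a dict of documents."""
    mapped = [pair for doc_id, text in documents.items()
              for pair in mapper(doc_id, text)]
    return dict(reducer(key, values) for key, values in shuffle(mapped))
```

In a real deployment the mapper and reducer would run as Hadoop tasks over documents stored in HDFS, with the mapper wrapping a text-engineering pipeline rather than a simple tokenizer; the point here is only the shape of the computation.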