A web crawler (also known as a robot or a spider) is a system for the
bulk downloading of web pages. Web crawlers are used for a variety of
purposes. Most prominently, they are one of the main components of
web search engines, systems that assemble a corpus of web pages, index
them, and allow users to issue queries against the index and find the web
pages that match the queries. A related use is web archiving (a service
provided by, e.g., the Internet Archive [77]), where large sets of web pages
are periodically collected and archived for posterity. A third use is web
data mining, where web pages are analyzed for statistical properties,
or where data analytics is performed on them (an example would be
Attributor [7], a company that monitors the web for copyright and
trademark infringements). Finally, web monitoring services allow their
clients to submit standing queries, or triggers; these services continuously
crawl the web and notify clients of pages that match those queries (an
example would be GigaAlert [64]).
The raison d’être for web crawlers lies in the fact that the web is
not a centrally managed repository of information, but rather consists