In many applications, the crawler component has the primary responsibility for
identifying and acquiring documents for the search engine. There are a number of
different types of crawlers, but the most common is the general web crawler. A web
crawler is designed to follow the links on web pages to discover and download new
pages. Although this sounds deceptively simple, there are significant challenges in
designing a web crawler that can efficiently handle the huge volume of new pages
on the Web, while at the same time keeping the search engine's copy of each page
“fresh” by revisiting pages that may have changed since the crawler's last visit. A
web crawler can be restricted to a single site, such as a university's website, as the basis for
site search. Focused, or topical, web crawlers use classification techniques to restrict
the pages that are visited to those that are likely to be about a specific topic. This
type of crawler may be used by a vertical or topical search application, such as a
search engine that provides access to medical information on web pages.
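As a rough illustration of the link-following behavior described above, the Python sketch below implements a minimal breadth-first crawler. The seed URL, page limit, allowed_host restriction, and is_on_topic hook are illustrative assumptions rather than details from any particular system: the host restriction stands in for a single-site crawl, and the classifier hook marks where a focused crawler would decide whether a page is likely to be about the target topic. It deliberately omits the politeness, revisiting, and freshness machinery that a production crawler requires.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a downloaded page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50, allowed_host=None, is_on_topic=None):
    """Breadth-first crawl starting from seed_url.

    allowed_host restricts the crawl to a single site (site search);
    is_on_topic is an optional classifier hook that a focused crawler
    would use to decide whether a page is worth keeping and expanding.
    """
    frontier = deque([seed_url])   # URLs waiting to be fetched
    seen = {seed_url}              # avoid downloading the same page twice
    pages = {}                     # url -> HTML of downloaded pages

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue               # skip pages that fail to download

        # A focused crawler would drop off-topic pages at this point.
        if is_on_topic is not None and not is_on_topic(url, html):
            continue

        pages[url] = html

        # Follow the links on the page to discover new pages.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                if allowed_host and urlparse(absolute).netloc != allowed_host:
                    continue       # stay within the single site
                seen.add(absolute)
                frontier.append(absolute)

    return pages


if __name__ == "__main__":
    # Restrict the crawl to one host, as a site-search crawler would.
    downloaded = crawl("https://example.com/", max_pages=10,
                       allowed_host="example.com")
    print(len(downloaded), "pages downloaded")
```

The frontier queue gives breadth-first expansion from the seed; swapping it for a priority queue ordered by estimated topic relevance or expected change frequency is one way a focused crawler or a freshness-aware crawler would depart from this sketch.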