WebDB. WebDB is a persistent custom database that tracks every known page and relevant link. It maintains a small set of facts about each, such as the last-crawled date. WebDB is meant to exist for a long time, across many months of operation.
Since WebDB knows when each link was last fetched, it can easily generate a set of fetchlists. These lists contain every URL we’re interested in downloading. WebDB splits the overall workload into several lists, one for each fetcher process. URLs are distributed almost randomly; all the links for a single domain are fetched by the same process, so it can obey politeness constraints.
The fetchers consume the fetchlists and start downloading from the Internet. The fetchers are “polite,” meaning they don’t overload a single site with requests, and they observe the Robots Exclusion Protocol. (This allows Web-site owners to mark parts of the site as off-limits to automated clients such as our fetcher.) Otherwise, the fetcher blindly marches down the fetchlist, writing down the resulting downloaded text.
Fetchers output WebDB updates and Web content. The updates tell WebDB about pages that have appeared or disappeared since the last fetch attempt. The Web content is used to generate the searchable index that users will actually query.
Note that the WebDB-fetch cycle is designed to repeat forever, maintaining an up-to-date image of the Web graph.
Indexing and Querying. Once we have the Web content, Nutch can get ready to process queries. The indexer uses the content to generate an inverted index of all terms and all pages. We divide the document set into a set of index segments, each of which is fed to a single searcher process.
We can thus distribute the current set of index segments over an arbitrary number of searcher processes, allowing us to scale easily with the query load. Further, we can copy an index segment to multiple machines and run a searcher over each one; that allows more good scaling behavior and reliability in case one or more of the searcher machines fail.
Each searcher also draws upon the Web content from earlier, so it can provide a cached copy of any Web page.
Finally, a pool of Web servers handle interactions with users and contact the searchers for results. Each Web server interacts with many different searchers to learn about the entire document set. In this way, the Web server is simultaneously acting as an HTTP server and a Nutch-search client.
Web servers contain very little state and can be easily reproduced to handle increased load. They need to be told only about the existing pool of searcher machines. The only state they do maintain is a list of which searcher processes are available at any time; if a given segment’s searcher fails, the Web server will query a different one instead.
Quality. Generating high-quality results, of course, is the most important barrier for Nutch to overcome. If it cannot find relevant pages as well as commercial engines do, Nutch isn’t much use. But how can it ever compete with large, paid engineering staffs?