Architecture
Nutch divides naturally into two pieces: the crawler and the
searcher. The crawler fetches pages and turns them into an inverted
index, which the searcher uses to answer users' search queries. The
interface between the two pieces is the index, so apart from an
agreement about the fields in the index, the two are highly
decoupled. (In practice the coupling is slightly tighter: page content
is not stored in the index, so the searcher also needs access to the
segments described below in order to produce page summaries and to
serve cached pages.)
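The division of labor described above can be illustrated with a toy sketch. This is not Nutch code and does not use Nutch's APIs: the names PAGES, crawl and search are hypothetical, and the in-memory dictionaries merely stand in for the real index and segments. It only aims to show that the searcher depends on nothing from the crawler except the index it produced, plus the stored page content for summaries.

```python
from collections import defaultdict

# Hypothetical stand-in for fetched pages: URL -> page text.
PAGES = {
    "http://example.com/a": "nutch is an open source crawler",
    "http://example.com/b": "the searcher answers search queries",
}

def crawl(pages):
    """Crawler side: build an inverted index and a separate content store."""
    index = defaultdict(set)   # term -> set of URLs containing that term
    segments = {}              # URL -> raw page content, kept outside the index
    for url, text in pages.items():
        segments[url] = text
        for term in text.lower().split():
            index[term].add(url)
    return dict(index), segments

def search(index, segments, query, summary_len=20):
    """Searcher side: match all query terms in the index, then read
    summaries from the content store rather than from the index."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return [(url, segments[url][:summary_len]) for url in sorted(hits)]

index, segments = crawl(PAGES)
results = search(index, segments, "search queries")
```

Here search never calls crawl; it only reads the two data structures, which is the sense in which the two halves are decoupled and could run on different machines.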
The main practical benefit of this design is that the crawler
and searcher systems can be scaled independently on separate
hardware platforms. For instance, a heavily used search page
covering a relatively modest set of sites may need only a
correspondingly modest investment in crawler infrastructure,
while requiring more substantial resources to support the
searcher.
We will look at the Nutch crawler here, and leave discussion of
the searcher to part two.