One of the reasons why GoogleTM is such an effective search engine is the
PageRankTM algorithm developed by Google’s founders, Larry Page and Sergey
Brin, when they were graduate students at Stanford University. PageRank is determined
entirely by the link structure of the World Wide Web. It is recomputed
about once a month and does not involve the actual content of any Web pages or
individual queries. Then, for any particular query, Google finds the pages on the
Web that match that query and lists those pages in the order of their PageRank.
Imagine surfing the Web, going from page to page by randomly choosing an
outgoing link from one page to get to the next. This can lead to dead ends at
pages with no outgoing links, or cycles around cliques of interconnected pages.
So, a certain fraction of the time, simply choose a random page from the Web.
This theoretical random walk is known as a Markov chain or Markov process. The
limiting probability that an infinitely dedicated random surfer visits any particular
page is its PageRank. A page has high rank if other pages with high rank link to
it.
Let W be the set of Web pages that can be reached by following a chain of
hyperlinks starting at some root page, and let n be the number of pages in W. For
Google, the set W actually varies with time, but by June 2004, n was over 4 billion.
Let G be the n-by-n connectivity matrix of a portion of the Web, that is, gij = 1
if there is a hyperlink to page i from page j and gij = 0 otherwise. The matrix G
can be huge, but it is very sparse. Its jth column shows the links on the jth page.
The number of nonzeros in G is the total number of hyperlinks in W.