Paper Review Series : Google

This paper proposes Google, a large-scale hypertextual web search engine. It describes in detail how to crawl and index the Web efficiently, along with mechanisms for much higher-precision search results. Google uses a combination of data structures and algorithms to make crawling and indexing efficient in both time and space. It introduces "PageRank", a metric computed from backlinks that is used to prioritise search results. The paper reports performance results and discusses future work such as high-quality search and scalability.

Strengths

  1. The paper identifies the quality problem in search and addresses it with "PageRank", a metric defined on web pages that orders them by relevance/importance, together with anchor propagation.
  2. Optimised data structures such as BigFile make crawling, indexing, and searching cheap in time and space.
  3. The distributed architecture, such as the crawling servers, is conducive to future scale-up.

Weaknesses (i.e. Limitations or Possible Improvements)
  4. Common techniques such as query caching and sub-indexing are not mentioned.
  5. The paper does not mention that parsing, indexing, and searching are, or could be, partitioned and distributed, although Google at the time was a prototype.
  6. Manipulation of the system to gain higher ranking in search results seems an important issue but is not sufficiently elaborated.

While previous search engines could achieve fast and complete indexing, it was realised that people are only able, and interested, to view the most relevant results. Quality, or precision, is therefore where newer search-engine effort should be directed. This paper proposes a system, "Google", that addresses this issue. It introduces a new metric, "PageRank", by which search results are prioritised. The "PageRank" of a web page is computed iteratively from the weighted "PageRank" of its backlinks, with a damping factor. The system also applies the anchor-propagation technique used in the World Wide Web Worm: the text of a link is associated not only with the page the link is on, but also with the page the link points to. This has two benefits. First, the text of links often provides a more accurate description of a web page than the page itself. Second, resources that are not hypertextual and thus not crawlable, such as videos, images, and databases, can still be returned as results.
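The iterative computation described above can be sketched as a simple power iteration, where PR(p) = (1 - d) + d * Σ PR(q)/outdegree(q) over pages q linking to p. The toy link graph and function names below are illustrative, not from the paper:

```python
# Sketch of the iterative PageRank computation with damping factor d.
# The graph and parameter values are illustrative assumptions.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {p: 1.0 for p in pages}  # start with uniform rank
    for _ in range(iterations):
        new = {}
        for p in pages:
            # weighted sum of the rank of every page q that links to p
            incoming = sum(pr[q] / len(targets)
                           for q, targets in links.items() if p in targets)
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

# A links to B and C; B links to C; C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph, C ends up with the highest rank because it receives links from both A and B, illustrating how backlinks propagate importance.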

A search engine operating over a large corpus of web pages could require impractically large storage, and consequently be slow, without proper data structures. Google is designed to avoid disk I/O as much as possible. Although not detailed in the paper, the BigFile abstraction, the hand-optimised hit lists, and the lexicons in their several forms presumably play important roles in reducing space consumption while keeping record fetching efficient.

Google anticipated the explosion of web content and adopted a distributed architecture for its crawling servers, a major part of the search engine. It attains impressive performance and can scale up easily with more hardware; in a sense it pioneered and directed the advancement of Internet technology. However, other processes such as indexing could also be made distributed, and the paper does not mention this. Further, a distributed system could be partitioned for many benefits, which is also not brought up in the paper.
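One common way such partitioning could work, sketched here purely as an assumption and not something the paper specifies, is to hash each URL to an indexing shard so that work spreads evenly across machines:

```python
# Hypothetical URL-hash partitioning of indexing work across shards.
# The shard count and hashing scheme are illustrative assumptions.
import hashlib

NUM_SHARDS = 4  # assumed number of indexing servers

def shard_for(url: str) -> int:
    # Hash the URL and map it deterministically to a shard.
    digest = hashlib.md5(url.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

urls = ["http://example.com/a", "http://example.org/b"]
shards = {u: shard_for(u) for u in urls}
```

Determinism matters here: the same URL always maps to the same shard, so any machine can route a document or a query term lookup without coordination.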

Common techniques like query caching and sub-indexing (secondary indexing) are also not touched upon. The reasons may well be legitimate; for instance, web pages change so constantly that caching makes little sense, or sub-indexing is not cost-effective given its build time versus the search speedup. Still, a brief discussion would have been welcome.
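For concreteness, the query-caching idea amounts to something as simple as the following LRU sketch, which is purely illustrative and not part of Google's described design:

```python
# Minimal LRU query-result cache sketch; names and capacity are assumed.
from collections import OrderedDict

class QueryCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, query):
        if query not in self._data:
            return None
        self._data.move_to_end(query)  # mark as recently used
        return self._data[query]

    def put(self, query, results):
        self._data[query] = results
        self._data.move_to_end(query)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = QueryCache(capacity=2)
cache.put("web search", ["page1", "page2"])
```

The staleness concern raised above is exactly the weakness of such a cache: entries serve old results until evicted or expired, which may be unacceptable if pages change rapidly.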

Last, and a rather important issue: how the system can be made immune to manipulation for higher ranking is unclear. Based on the information given in the paper, it seems feasible to reverse-engineer web pages so that they rank higher in potential search results. The paper mentions that the damping factor in the "PageRank" computation enables personalisation, which could prevent such a scenario, but no concrete mechanism or idea is discussed.

Written on September 7, 2018