UbiCrawler: a scalable fully distributed web crawler
Paolo Boldi, Bruno Codenotti, Massimo Santini and Sebastiano Vigna 27th January 2003
Abstract We report our experience in implementing UbiCrawler, a scalable distributed web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, fault tolerance, a very effective assignment function for partitioning the domain to crawl and, more generally, the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitations of the Java APIs, which prompted the authors to partially reimplement them.
1 Introduction
In this paper we present the design and implementation of UbiCrawler, a scalable, fault-tolerant and fully distributed web crawler, and we evaluate its performance both a priori and a posteriori. The overall structure of the UbiCrawler design was preliminarily described in [2]1, [5] and [4].

Our interest in distributed web crawlers lies in the possibility of gathering large data sets to study the structure of the web. Applications range from the statistical analysis of specific web domains [3] to estimates of the distribution of classical parameters, such as PageRank [19]. Moreover, we have provided the main tools for the redesign of the largest Italian search engine, Arianna.

Since the first stages of the project, we realized that centralized crawlers are no longer sufficient to crawl meaningful portions of the web. Indeed, it has been recognized that, as the size of the web grows, it becomes imperative to parallelize the crawling process in order to finish downloading pages in a reasonable amount of time [9, 1].

Many commercial companies and research institutions run web crawlers to gather data about the web. Even if no code is available, in several cases the basic design has been made public: this is the case, for instance, of Mercator [17] (the Altavista crawler),
of the original Google crawler [6], and of some research crawlers developed by the
academic community [22, 23, 21]. Nonetheless, little published work actually investigates the fundamental issues underlying the parallelization of the different tasks involved in the crawling process. In
1At the time, the name of the crawler was Trovatore, later changed to UbiCrawler when the authors
learned about the existence of an Italian search engine named Trovatore.
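To make the notion of an assignment function concrete, the following is a minimal Java sketch of one standard way to build such a function, namely consistent hashing, in which each agent owns several points on a hash circle and each host is assigned to the agent owning the first point at or after the host's hash. The class, the replica count and the use of String.hashCode() are illustrative assumptions for exposition, not UbiCrawler's actual implementation:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hashing assignment of hosts to crawling agents.
// Each agent is mapped to several points on a hash circle; a host is
// assigned to the agent owning the first point at or after the host's
// hash value, wrapping around the circle if necessary.
class HostAssignment {
    private static final int REPLICAS = 64; // points per agent (arbitrary choice)
    private final SortedMap<Integer, String> circle = new TreeMap<>();

    void addAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++)
            circle.put((agent + "#" + i).hashCode(), agent);
    }

    void removeAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++)
            circle.remove((agent + "#" + i).hashCode());
    }

    String agentFor(String host) {
        if (circle.isEmpty()) throw new IllegalStateException("no agents");
        // First circle point at or after the host's hash, wrapping around.
        SortedMap<Integer, String> tail = circle.tailMap(host.hashCode());
        return circle.get(tail.isEmpty() ? circle.firstKey() : tail.firstKey());
    }
}
```

The property that makes this kind of function attractive for a fully decentralized crawler is locality: when an agent joins or dies, only the hosts mapped to that agent's points move, while the assignment of every other host is untouched.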