
SLIDE 1

Web Crawling

Najork and Heydon, High-Performance Web Crawling, Compaq SRC Research Report 173, 2001. Also in Handbook of Massive Data Sets, Kluwer, 2001.

Heydon and Najork, Mercator: A scalable, extensible Web crawler. World Wide Web, 4, 1999.

Najork and Wiener, Breadth-first search crawling yields high-quality pages. Proc. 10th Int. WWW Conf., 2001.

Arasu et al., Searching the Web. ACM Trans. Internet Technology, 1, 2001.


SLIDE 2

Web Crawling

Web Crawling = Graph Traversal

S = {startpage}
repeat
    remove an element s from S
    foreach (s, v)
        if v not crawled before
            insert v in S
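A minimal Python sketch of the traversal above, run on a small in-memory link graph rather than the live web. The `links` dictionary, the `crawl` function name, and the toy graph are illustrative assumptions; choosing a FIFO for S gives breadth-first order.

```python
from collections import deque

def crawl(links, start_page):
    """Traverse a link graph from start_page; return pages in crawl order.

    `links` maps each page to the pages it links to (a stand-in for
    downloading a page and extracting its links).
    """
    frontier = deque([start_page])   # the set S; a FIFO gives breadth-first order
    seen = {start_page}              # pages already inserted into S
    order = []
    while frontier:
        s = frontier.popleft()       # remove an element s from S
        order.append(s)
        for v in links.get(s, []):   # foreach link (s, v)
            if v not in seen:        # v not crawled before
                seen.add(v)
                frontier.append(v)   # insert v in S
    return order

# Hypothetical toy graph
toy_links = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl(toy_links, "a"))  # ['a', 'b', 'c', 'd']
```

Replacing `popleft()` with `pop()` would turn the same loop into a depth-first style traversal, which is the sense in which the choice of s defines the crawl strategy.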


SLIDE 3

Issues

Theoretical:

Startset S
Choice of s (crawl strategy)
Refreshing of changing pages.

Practical:

Load balancing (own resources and resources of crawled sites)

Size of data (compact representations)
Performance (I/Os).


SLIDE 4

Crawl Strategy

Breadth First Search
Depth First Search
Random
Priority Search

Possible priorities:

Often changing pages (how to estimate change rate?).
Using global ranking scheme for queries (e.g. PageRank).
Using query dependent ranking scheme for queries (“focused crawling”, “collection building”).
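Priority search can be sketched by replacing the FIFO with a heap keyed on a per-page score. The `priority` mapping below is a hypothetical stand-in for whatever estimate is used (change rate, PageRank, query relevance); pages without a score default to 0, and the toy data is invented for illustration.

```python
import heapq

def priority_crawl(links, start_page, priority):
    """Crawl `links` from start_page, always taking the highest-priority page next."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in insertion order.
    counter = 0
    heap = [(-priority.get(start_page, 0.0), counter, start_page)]
    seen = {start_page}
    order = []
    while heap:
        _, _, s = heapq.heappop(heap)
        order.append(s)
        for v in links.get(s, []):
            if v not in seen:
                seen.add(v)
                counter += 1
                heapq.heappush(heap, (-priority.get(v, 0.0), counter, v))
    return order

# Hypothetical toy graph and scores
toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
toy_priority = {"a": 1.0, "b": 0.1, "c": 0.9, "d": 0.8, "e": 0.2}
print(priority_crawl(toy_links, "a", toy_priority))  # ['a', 'c', 'e', 'b', 'd']
```

Note that page d, despite its high score, is reached late because it is only discovered through the low-priority page b; a priority crawl can only rank pages it already knows about.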


SLIDE 5

BFS is Good

Figure 1: Average PageRank score by day of crawl
[Plot: average PageRank (y-axis) against day of crawl (x-axis)]

Figure 2: Average day on which the top N pages were crawled
[Plot: average day top N pages were crawled (y-axis) against top N (x-axis, logarithmic from 1 to 1e+08)]

[From: Najork and Wiener, 2001]

Statistics for crawl of 328 million pages.


SLIDE 6

PageRank Priority is Even Better

(but computationally expensive to use...)

[Plot: hot pages crawled against pages crawled, both from 0% to 100%, for four ordering metrics: PageRank, backlink, breadth, random]

Figure 2: The performance of various ordering metrics for IB(P); G = 100

[From: Arasu et al., 2001]

Statistics for crawl of 225,000 pages at Stanford.


SLIDE 8

Load Balancing

Own resources:

Bandwidth (control global rate of requests)
Storage (compact representations, compression)
Industrial-strength crawlers must be distributed (e.g. partition the url-space)

Resources of others:

Bandwidth: control local rate of requests (e.g. 30 sec. between requests to the same site).

Identify yourself in request. Give contact info.
Monitor the crawl.
Obey the Robots Exclusion Protocol (www.robotstxt.org).

[Also read the other material there.]
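A hedged sketch of two of the points above, using Python's standard `urllib.robotparser`: check a URL against robots.txt rules and enforce a minimum delay between requests to the same host. The `PoliteGate` class, the `ExampleBot` user agent string, and the sample robots.txt text are illustrative assumptions; a real crawler would download each site's robots.txt and send its own contact details in the User-Agent header.

```python
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical identity; the contact address is a placeholder.
USER_AGENT = "ExampleBot/0.1 (+mailto:admin@example.org)"

class PoliteGate:
    def __init__(self, min_delay=30.0):
        self.min_delay = min_delay   # seconds between requests to one host
        self.last_request = {}       # host -> time of last permitted request
        self.robots = {}             # host -> parsed robots.txt rules

    def set_robots(self, host, robots_txt):
        """Register the robots.txt text already downloaded for `host`."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def may_fetch(self, url, now):
        """Return True if `url` may be requested at time `now` (seconds)."""
        host = urlsplit(url).netloc
        parser = self.robots.get(host)
        # Assumption of this sketch: hosts without known rules are allowed.
        if parser is not None and not parser.can_fetch(USER_AGENT, url):
            return False
        last = self.last_request.get(host)
        if last is not None and now - last < self.min_delay:
            return False
        self.last_request[host] = now
        return True

gate = PoliteGate(min_delay=30.0)
gate.set_robots("example.org", "User-agent: *\nDisallow: /private/\n")
```

Passing `now` explicitly (rather than reading the clock inside the class) keeps the sketch easy to test; a caller would normally supply `time.monotonic()`.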


SLIDE 9

Efficiency

RAM: never enough for serious crawls. Efficient use of disk based storage important. I/O when accessing data structures is often a bottleneck.

CPU cycles: not a problem (Java and scripting languages are fine).

DNS lookup can be a bottleneck (as normally synchronized). Asynchronous DNS: check GNU adns library.

Rates reported for serious crawlers: 200-400 pages/sec.
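GNU adns is a C library; as a stand-in, the sketch below overlaps blocking lookups with a thread pool from Python's standard library, so that one slow lookup does not stall the others. The `resolve_all` helper and its `resolver` parameter are assumptions made for illustration and are not part of any crawler described here.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve_all(hosts, resolver=socket.gethostbyname, max_workers=16):
    """Resolve many host names concurrently; failed lookups map to None."""
    def safe_resolve(host):
        try:
            return resolver(host)
        except OSError:   # socket.gaierror is a subclass of OSError
            return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `hosts`
        addresses = list(pool.map(safe_resolve, hosts))
    return dict(zip(hosts, addresses))
```

The `resolver` argument exists so that a crawler-side cache or a test double can be swapped in without touching the concurrency logic.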


SLIDE 10

Example: Mercator

[Diagram of Mercator, with steps numbered 1 to 8. Labelled components: the Internet; DNS Resolver; protocol modules (HTTP, FTP, Gopher); RIS; Content Seen? (backed by Doc FPs); processing modules (Link Extractor, Tag Counter, GIF Stats, with Log outputs); URL Filter; DUE (backed by URL Set); URL Frontier (backed by Queue Files)]

Figure 1: Mercator’s main components.

[From: Najork and Heydon, 2001]


SLIDE 11

Mercator

Further ideas:

Fingerprinting (a (sparse) hash function on strings).
Continuous crawling: crawled pages are put back in the queue (prioritized using update history).

Checkpointing (crash recovery).
Very modular structure.
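Fingerprinting, the first idea listed above, maps a URL or document to a short fixed-size value so that membership tests compare integers instead of full strings. The sketch below assumes a 64-bit fingerprint cut from a SHA-1 digest, which is a convenient stand-in from Python's standard library and not necessarily the function Mercator uses; the names `fingerprint` and `is_new` are hypothetical.

```python
import hashlib

def fingerprint(text, bits=64):
    """Return an integer fingerprint of `text` with the given number of bits."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")

seen_fps = set()   # in-memory stand-in for the set of seen fingerprints

def is_new(url):
    """Return True the first time a URL is offered, False afterwards."""
    fp = fingerprint(url)
    if fp in seen_fps:
        return False
    seen_fps.add(fp)
    return True
```

Because the fingerprint space is much larger than the number of URLs in a crawl ("sparse"), two distinct URLs are very unlikely to collide, though a collision would cause one of them to be skipped.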


SLIDE 12

Details: Politeness

[Diagram: Polite, Dynamic, Prioritizing Frontier. A prioritizer fills front-end FIFO queues 1..k (one per priority level); a random queue chooser with bias to high-priority queues feeds the back-end queue router, which consults a host-to-queue table and fills back-end FIFO queues 1..n (many more than worker threads); a back-end queue selector picks the next queue using a priority queue (e.g., heap)]

Figure 3: Our best URL frontier implementation

[From: Najork and Heydon, 2001]
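A simplified sketch of the back-end half of such a frontier: one FIFO queue per host plus a heap ordered by the earliest time each host may be contacted again. It omits the front-end priority queues and the fixed pool of back-end queues shown in the figure; the `PoliteFrontier` class, its method names, and the 30-second default are assumptions for illustration.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class PoliteFrontier:
    def __init__(self, delay=30.0):
        self.delay = delay
        self.queues = {}     # host -> FIFO of pending URLs
        self.ready_at = {}   # host -> earliest time of next contact
        self.heap = []       # (earliest contact time, host) for hosts with pending URLs

    def add(self, url, now=0.0):
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            # Respect a delay left over from an earlier request to this host.
            start = max(now, self.ready_at.get(host, now))
            heapq.heappush(self.heap, (start, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        queue = self.queues[host]
        url = queue.popleft()
        self.ready_at[host] = now + self.delay
        if queue:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            del self.queues[host]   # host re-enters the heap on its next add()
        return url
```

Keeping each host in the heap at most once is what guarantees that two worker threads never contact the same site at the same time.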


SLIDE 13

Details: Efficient URL Elimination

Fingerprinting
Sorted file of fingerprints of seen URLs.
Cache most used URLs.
Non-cached URLs checked in batches (merge with file I/O).

[Diagram: an FP cache (2^16 entries); a front-buffer containing FPs and URL indices (2^21 entries) with a disk file containing URLs (one per front-buffer entry); a back-buffer of the same form (2^21 entries) with its own disk file of URLs; and the FP disk file (100m to 1b entries)]

Figure 4: Our most efficient disk-based DUE implementation

[From: Najork and Heydon, 2001]
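The batch check against the sorted fingerprint file can be sketched as a single merge pass. In this simplified version an in-memory sorted list stands in for the FP disk file, a dictionary stands in for the front-buffer, and the cache and the separate URL files are left out; the `BatchedDUE` class and its tiny `buffer_limit` are assumptions for illustration.

```python
class BatchedDUE:
    def __init__(self, buffer_limit=4):
        self.disk_fps = []    # sorted fingerprints of seen URLs (stand-in for the FP file)
        self.buffer = {}      # fingerprint -> URL, not yet checked against the file
        self.buffer_limit = buffer_limit

    def add(self, fp, url):
        """Buffer a URL; return the new URLs if this triggered a merge, else []."""
        self.buffer.setdefault(fp, url)   # duplicates within a batch collapse here
        if len(self.buffer) >= self.buffer_limit:
            return self.merge()
        return []

    def merge(self):
        """One sequential pass over the sorted file: emit unseen URLs, rewrite the file."""
        merged, new_urls = [], []
        i = 0
        for fp, url in sorted(self.buffer.items()):
            # Copy file entries smaller than the current buffered fingerprint.
            while i < len(self.disk_fps) and self.disk_fps[i] < fp:
                merged.append(self.disk_fps[i])
                i += 1
            if i < len(self.disk_fps) and self.disk_fps[i] == fp:
                continue              # already seen; the file entry is copied later
            merged.append(fp)
            new_urls.append(url)
        merged.extend(self.disk_fps[i:])
        self.disk_fps = merged
        self.buffer = {}
        return new_urls
```

New URLs come out in fingerprint order rather than discovery order, which is the price of replacing many random lookups with one sequential read and write of the file.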


SLIDE 14

Some Experiences

200 − OK (81.36%)
404 − Not Found (5.94%)
302 − Moved temporarily (3.04%)
Excluded by robots.txt (3.92%)
TCP error (3.12%)
DNS error (1.02%)
Other (1.59%)

Figure 6: Outcome of download attempts

text/html (65.34%)
image/gif (15.77%)
image/jpeg (14.36%)
text/plain (1.24%)
application/pdf (1.04%)
Other (2.26%)

Figure 7: Distribution of content types

[Histogram: share of documents (axis marked 5%, 10%, 15%) by document size, in powers of two from 1 to 1M]

Figure 8: Distribution of document sizes


SLIDE 15

Some Experiences

[Plots with logarithmic axes: (a) Distribution of pages over web servers; (b) Distribution of bytes over web servers]

Figure 9: Document and web server size distributions

(a) Distribution of hosts over top-level domains:
.com (47.20%), .de (7.93%), .net (7.88%), .org (4.63%), .uk (3.29%), raw IP addresses (3.25%), .jp (1.80%), .edu (1.53%), .ru (1.35%), .br (1.31%), .kr (1.30%), .nl (1.05%), .pl (1.02%), .au (0.95%), Other (15.52%)

(b) Distribution of pages over top-level domains:
.com (51.44%), .net (6.74%), .org (6.31%), .edu (5.56%), .jp (4.09%), .de (3.37%), .uk (2.45%), raw IP addresses (1.43%), .ca (1.36%), .gov (1.19%), .us (1.14%), .cn (1.08%), .au (1.08%), .ru (1.00%), Other (11.76%)

[From: Najork and Heydon, 2001]


SLIDE 16

Further Resources

Further resources for implementing a crawler:

Another good paper with practical info: Shkapenyuk and Suel: Design and Implementation of a High-Performance Distributed Web Crawler. IEEE Int. Conf. on Data Engineering (ICDE), February 2002. (http://cis.poly.edu/suel/papers/crawl.ps)
HTML specification (www.w3.org)
A free book on programming web agents. (http://www.oreilly.com/openbook/webclient)
Software libraries (Java, Perl, Python, C++) for net programming.
