Web Crawling

SLIDE 1

Web Crawling

  • Najork and Heydon, High-Performance Web Crawling, Compaq SRC Research Report 173, 2001. Also in Handbook of Massive Data Sets, Kluwer, 2001.
  • Najork and Wiener, Breadth-first search crawling yields high-quality pages. Proc. 10th Int. WWW Conf., 2001.

SLIDE 2

Web Crawling

Web Crawling = Graph Traversal

    S = {start pages}
    repeat
        remove an element s from S
        foreach edge (s, v)
            if v not crawled before
                insert v in S
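The traversal above can be sketched in Python. This is a minimal illustration, not a real crawler: the dictionary `toy_web` and the function name `crawl` are assumptions made for the example, and a dictionary lookup stands in for fetching a page and extracting its links. S is kept as a FIFO queue here, which makes the traversal breadth-first; other choices of s give other crawl strategies.

```python
from collections import deque


def crawl(start_pages, out_links):
    """Traverse the link graph and return pages in the order they were crawled.

    out_links is a hypothetical stand-in for "fetch page, extract links":
    a dict mapping each URL to the list of URLs it links to.
    """
    frontier = deque(start_pages)       # the set S, kept as a FIFO queue
    seen = set(start_pages)             # every page ever inserted into S
    order = []
    while frontier:
        s = frontier.popleft()          # remove an element s from S
        order.append(s)                 # "crawl" s
        for v in out_links.get(s, []):  # foreach edge (s, v)
            if v not in seen:           # v not crawled (or queued) before
                seen.add(v)
                frontier.append(v)      # insert v in S
    return order


# Toy link graph, invented for the example.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
print(crawl(["a"], toy_web))
```

Marking a page as seen when it is queued, rather than when it is fetched, keeps the same URL from entering S twice.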

SLIDE 3

Issues

Theoretical:

  • Startset S
  • Choice of s (crawl strategy)
  • Refreshing of changing pages.

Practical:

  • Load balancing (own resources and resources of crawled sites)

  • Size of data (compact representations)
  • Performance (I/Os).

SLIDE 4

Crawl Strategy

  • Breadth First Search
  • Depth First Search
  • Random
  • Priority Search

Possible priorities:

  • Often changing pages (how to estimate change rate?).
  • Using global ranking scheme for queries (e.g. PageRank).
  • Using query dependent ranking scheme for queries (“focused crawling”, “collection building”).
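A priority search can be sketched with a heap-based frontier. The class name `PriorityFrontier` and the numeric scores below are assumptions for illustration; how the priority is computed (change rate, PageRank, query relevance) is left open, as on the slide.

```python
import heapq
import itertools


class PriorityFrontier:
    """Frontier that always hands out the URL with the highest priority."""

    def __init__(self):
        self._heap = []
        # Tie-breaker: URLs with equal priority come out in insertion order.
        self._counter = itertools.count()

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("http://low.example/", 0.1)    # hypothetical URLs and scores
frontier.push("http://high.example/", 0.9)
frontier.push("http://mid.example/", 0.5)
print([frontier.pop() for _ in range(len(frontier))])
```

With a constant priority the tie-breaker makes this frontier behave like a FIFO queue, so breadth-first search falls out as a special case.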

SLIDE 5

BFS is Good

Figure 1: Average PageRank score by day of crawl (x-axis: day of crawl; y-axis: average PageRank).

Figure 2: Average day on which the top N pages were crawled (x-axis: top N, from 1 to 1e+08; y-axis: average day on which the top N pages were crawled).

[From: Najork and Wiener, 2001]

Statistics for crawl of 328 million pages.

SLIDE 6

PageRank Priority is Even Better

(but computationally expensive to use...)

Figure 2: The performance of various ordering metrics for IB(P); G = 100 (x-axis: pages crawled, 0% to 100%; y-axis: hot pages crawled, 0% to 100%; ordering metrics: PageRank, backlink, breadth, random).

[From: Arasu et al., Searching the Web. ACM Trans. Internet Technology, 1, 2001]

Statistics for crawl of 225,000 pages at Stanford.

SLIDE 7

Load Balancing

Own resources:

  • Bandwidth (control global rate of requests)
  • Storage (compact representations, compression)
  • Industrial-strength crawlers must be distributed (e.g. partition the URL space)

SLIDE 8

Load Balancing

Resources of others:

  • Bandwidth: control the local rate of requests (e.g. 30 sec. between requests to the same site).
  • Identify yourself in the request. Give contact info (mail and www).
  • Monitor the crawl.
  • Obey the Robots Exclusion Protocol (see www.robotstxt.org).
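The per-site rate limit and the self-identification can be sketched as follows. The class `PolitenessGate`, the crawler name, and the contact addresses in `HEADERS` are hypothetical; the 30-second delay is the example value from the slide. The current time is passed in explicitly so that the logic can be exercised without waiting.

```python
from urllib.parse import urlsplit

# Hypothetical identification: crawler name plus contact URL and mail address.
HEADERS = {
    "User-Agent": "ExampleCrawler/0.1 (+http://www.example.org/crawler; crawler@example.org)",
}


class PolitenessGate:
    """Tracks, per host, the earliest time at which the next request may be sent."""

    def __init__(self, delay_seconds=30.0):
        self.delay = delay_seconds
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds to wait before url may be requested (0.0 if allowed now)."""
        host = urlsplit(url).hostname
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that a request to url's host was sent at time now."""
        host = urlsplit(url).hostname
        self._next_allowed[host] = now + self.delay


gate = PolitenessGate(delay_seconds=30.0)
gate.record_request("http://a.example/page1", now=100.0)
print(gate.wait_time("http://a.example/page2", now=110.0))  # same host: must wait
print(gate.wait_time("http://b.example/", now=110.0))       # other host: no wait
```

In a real crawler `now` would come from a clock such as `time.monotonic()`, and URLs that must wait would go back into a per-host queue instead of blocking a worker thread.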

SLIDE 9

Efficiency

  • RAM: never enough for serious crawls. Efficient use of disk-based storage is important. I/O when accessing data structures is often a bottleneck.
  • CPU cycles: not a problem (Java and scripting languages are fine).
  • DNS lookup can be a bottleneck if using a synchronous version. Use asynchronous DNS (e.g. GNU adns library).

Rates reported for serious crawlers: 200-400 pages/sec.

SLIDE 10

Crawler Example: Mercator

Figure 1: Mercator’s main components (labels in the figure: Internet; protocol modules HTTP, FTP, Gopher; processing modules Link Extractor, GIF Stats, Tag Counter; Content Seen?; DNS Resolver; RIS; URL Filter; DUE; URL Frontier; Queue Files; URL Set; Log; Doc FPs; processing steps numbered 1 to 8).

[From: Najork and Heydon, 2001]

SLIDE 11

Mercator

Further features:

  • Uses fingerprinting ((sparse) hash function on strings) for URL IDs (e.g. MD5 (128 bits) or the SHA family (160-512 bits)).
  • Continuous crawling: crawled pages are put back in the queue (prioritized using update history).

  • Checkpointing (crash recovery).
  • Very modular structure.
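Fingerprinting can be sketched with Python's `hashlib`. MD5 is used only because the slide names it as an example; the function name `fingerprint` and the truncation to 64 bits are assumptions for illustration, not a description of Mercator's actual fingerprint function.

```python
import hashlib


def fingerprint(url, bits=64):
    """Map a URL string to a fixed-size integer ID.

    The MD5 digest is 128 bits; keeping only the first bits/8 bytes gives a
    more compact ID at the cost of a (still tiny) collision probability.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return int.from_bytes(digest[: bits // 8], "big")


fp = fingerprint("http://www.example.org/index.html")   # hypothetical URL
print(f"{fp:016x}")
```

Storing an 8-byte fingerprint instead of the full URL string is what makes the "URL seen" set compact enough to sort and merge on disk.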

SLIDE 12

Details: Politeness

Figure 3: Our best URL frontier implementation (a polite, dynamic, prioritizing frontier; labels in the figure: prioritizer; front-end FIFO queues, one per priority level; random queue chooser with bias to high-priority queues; back-end queue router; host-to-queue table; back-end FIFO queues, many more than worker threads; priority queue, e.g. a heap; back-end queue selector).

[From: Najork and Heydon, 2001]

SLIDE 13

Details: Efficient URL Elimination

  • Fingerprinting
  • Sorted file of fingerprints of seen URLs.
  • Cache most used URLs.
  • Non-cached URLs checked in batches (merge with file I/O).

Figure 4: Our most efficient disk-based DUE implementation (labels in the figure: FP cache, 2^16 entries; front-buffer containing FPs and URL indices, 2^21 entries, with a disk file containing URLs, one per front-buffer entry; back-buffer containing FPs and URL indices, 2^21 entries, with a disk file containing URLs, one per back-buffer entry; FP disk file, 100m to 1b entries. Sample buffer rows pair a fingerprint and an index with a URL, e.g. 035f4ca8, 1, http://u.gov/gw).

[From: Najork and Heydon, 2001]
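The batching idea can be sketched in memory. Everything here is a simplification made for illustration: the class `BatchedDUE` is hypothetical, a sorted Python list stands in for the fingerprint disk file, the cache is an unbounded set rather than a fixed-size cache of the most used entries, and there is a single buffer instead of the front and back buffers in the figure.

```python
class BatchedDUE:
    """Duplicate URL eliminator that checks non-cached fingerprints in batches."""

    def __init__(self, batch_size=4):
        self.disk = []       # sorted fingerprints; stands in for the FP disk file
        self.cache = set()   # fingerprints known to be on "disk" (never evicted here)
        self.buffer = {}     # pending fingerprint -> URL, awaiting the next merge
        self.batch_size = batch_size

    def add(self, fp, url):
        """Offer a URL; returns the URLs newly confirmed as unseen (often empty)."""
        if fp in self.cache or fp in self.buffer:
            return []        # known duplicate, no file access needed
        self.buffer[fp] = url
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        """Merge the sorted buffer with the sorted file in one sequential pass."""
        pending = sorted(self.buffer)
        merged, new_urls = [], []
        i = 0
        for fp in pending:
            while i < len(self.disk) and self.disk[i] < fp:
                merged.append(self.disk[i])   # copy smaller on-disk entries
                i += 1
            if i < len(self.disk) and self.disk[i] == fp:
                continue                      # already on disk: duplicate
            merged.append(fp)                 # genuinely new fingerprint
            new_urls.append(self.buffer[fp])
        merged.extend(self.disk[i:])          # copy the rest of the file
        self.disk = merged
        self.cache.update(pending)
        self.buffer.clear()
        return new_urls


due = BatchedDUE(batch_size=2)
due.disk = [5, 9]                 # pretend fingerprints 5 and 9 were seen earlier
print(due.add(9, "http://seen.example/"))   # buffered, nothing decided yet
print(due.add(3, "http://new.example/"))    # triggers the merge
```

The point of the batch is that the file is read and rewritten sequentially once per batch, instead of one random disk access per URL.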

SLIDE 14

Details: Parallelization

Figure 2: A four-node distributed crawling hive (each of the four nodes contains: HTTP module, Link-extractor, URL Filter, Host Splitter, DUE, URL Frontier, RIS).

[From: Najork and Heydon, 2001]
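One way to partition the URL space across nodes is to hash the host name, so that all URLs of a site land on the same node. The sketch below shows that idea only; the function name `node_for_url` and the use of SHA-1 are assumptions, not the routing rule used in the figure.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes=4):
    """Pick the crawler node responsible for url, based on its host name only."""
    host = urlsplit(url).hostname or ""   # hostname is already lower-cased
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Both URLs share a host, so they are routed to the same node.
print(node_for_url("http://www.example.org/a.html"))
print(node_for_url("http://WWW.example.org/b.html?x=1"))
```

Keeping a whole host on one node means the per-site politeness delay can be enforced locally, without coordination between nodes.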

SLIDE 15

Some Experiences

Figure 6: Outcome of download attempts

  • 200 - OK (81.36%)
  • 404 - Not Found (5.94%)
  • 302 - Moved temporarily (3.04%)
  • Excluded by robots.txt (3.92%)
  • TCP error (3.12%)
  • DNS error (1.02%)
  • Other (1.59%)

Figure 7: Distribution of content types

  • text/html (65.34%)
  • image/gif (15.77%)
  • image/jpeg (14.36%)
  • text/plain (1.24%)
  • application/pdf (1.04%)
  • Other (2.26%)

Figure 8: Distribution of document sizes (x-axis: document size, from 1 to 1M in powers of two; y-axis: percentage, ticks at 5%, 10%, 15%).

SLIDE 16

Some Experiences

Figure 9: Document and web server size distributions. (a) Distribution of pages over web servers. (b) Distribution of bytes over web servers.

(a) Distribution of hosts over top-level domains

  • .com (47.20%)
  • .de (7.93%)
  • .net (7.88%)
  • .org (4.63%)
  • .uk (3.29%)
  • raw IP addresses (3.25%)
  • .jp (1.80%)
  • .edu (1.53%)
  • .ru (1.35%)
  • .br (1.31%)
  • .kr (1.30%)
  • .nl (1.05%)
  • .pl (1.02%)
  • .au (0.95%)
  • Other (15.52%)

(b) Distribution of pages over top-level domains

  • .com (51.44%)
  • .net (6.74%)
  • .org (6.31%)
  • .edu (5.56%)
  • .jp (4.09%)
  • .de (3.37%)
  • .uk (2.45%)
  • raw IP addresses (1.43%)
  • .ca (1.36%)
  • .gov (1.19%)
  • .us (1.14%)
  • .cn (1.08%)
  • .au (1.08%)
  • .ru (1.00%)
  • Other (11.76%)

[From: Najork and Heydon, 2001]

SLIDE 17

Robot Exclusion Protocol

Simple protocol suggested by Martijn Koster in 1994. De facto standard for robot exclusion. Full details at www.robotstxt.org.

  • Single file named robots.txt in root of server.
  • Contains simple directions for exclusion of parts of site.

Example:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /joe/

    User-agent: BadBot
    Disallow: /
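The example rules can be checked with Python's standard `urllib.robotparser` module. The rules are fed in directly as text, so nothing is fetched; the agent name "GoodBot" and the host www.example.org are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt example from the slide.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /joe/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# An ordinary robot may fetch most pages, but not the excluded directories.
print(parser.can_fetch("GoodBot", "http://www.example.org/index.html"))
print(parser.can_fetch("GoodBot", "http://www.example.org/tmp/scratch.html"))
# BadBot is excluded from the whole site.
print(parser.can_fetch("BadBot", "http://www.example.org/index.html"))
```

A crawler would normally fetch /robots.txt once per host and cache the parsed rules, rather than re-reading the file for every URL.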

SLIDE 18

Robot Exclusion in HTML

Per page exclusion through the META tag in HTML.

Example:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Further details at www.w3.org/TR/html4/ (the HTML 4.01 specification) and at www.robotstxt.org
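A crawler can read these directives while parsing the page. The sketch below uses the standard `html.parser` module; the class `RobotsMetaParser` and the helper `robots_directives` are names invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <META NAME="ROBOTS"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names; values are kept as-is.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></HEAD></HTML>'
print(robots_directives(page))
```

A page carrying "noindex" should not be stored in the index, and one carrying "nofollow" should not contribute its links to the frontier.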

SLIDE 19

HTTP Protocol

One request message, one response message (over a single TCP connection).

Format of messages:

Request:

    Request line
    Header line
    ...
    Header line
    (Body)

Response:

    Response line
    Header line
    ...
    Header line
    Body

SLIDE 20

HTTP Example

Request:

    GET /somedir/page.html HTTP/1.1
    Host: www.somefirm.com
    Accept: text/*
    User-Agent: Mozilla 7.0 [en]

Response:

    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Length: 345

    <HTML>
    <HEAD>
    ...
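The message format can be made concrete with two small helpers that build a request and split a response into its parts. Both function names are assumptions for the example, and the parser is deliberately naive: it expects a status line with a reason phrase, one header per line, and text already decoded to a string. Real code would use a library such as `http.client`.

```python
def build_request(host, path, user_agent):
    """Build a GET request: request line, header lines, then an empty line."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Accept: text/*\r\n"
        f"User-Agent: {user_agent}\r\n"
        "\r\n"
    )


def parse_response(raw):
    """Split a response into (version, status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")      # empty line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(code), reason, headers, body


# "ExampleCrawler/0.1" is a made-up agent name.
print(build_request("www.somefirm.com", "/somedir/page.html", "ExampleCrawler/0.1"))
raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"
print(parse_response(raw))
```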

SLIDE 21

URLs

Absolute:

    http://www.somefirm.dk:80/main/test
    http://www.somefirm.dk/main/test#thirdEntry
    http://www.somefirm.dk/cgi-bin?item=123

Relative:

    ./dir/test.html

Relative to

  • URL of doc containing URL
  • URL specified in <BASE> HTML tag.

Encoded characters:

    www.sdu.dk/~rolf → www.sdu.dk/%7Erolf

SLIDE 22

Normalizing URLs

  • Add port number if not present (:80).
  • Convert escaped chars to real chars.
  • Remove ...#target from URL.
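The three rules, plus resolution of relative URLs against a base, can be sketched with `urllib.parse`. The function name `normalize` and the `DEFAULT_PORTS` table are assumptions for the example. Unescaping the whole path is a simplification: it also decodes characters such as %2F, which a production crawler would leave escaped.

```python
from urllib.parse import unquote, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url, base=None):
    """Return a canonical form of url so that equivalent URLs compare equal."""
    if base is not None:
        url = urljoin(base, url)                 # resolve relative URLs
    parts = urlsplit(url)
    host = parts.hostname or ""                  # hostname is lower-cased
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = f"{host}:{port}" if port else host  # add port number if not present
    path = unquote(parts.path) or "/"            # convert escaped chars to real chars
    # The empty last component drops the #target fragment.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


print(normalize("http://www.sdu.dk/%7Erolf#top"))
print(normalize("./dir/test.html", base="http://www.somefirm.dk/main/test"))
```

Normalizing before fingerprinting matters: otherwise http://www.somefirm.dk/main/test and http://www.somefirm.dk:80/main/test#thirdEntry would be crawled as two different pages.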
