

SLIDE 1

Web Crawling

gzsun@ustc.edu.cn

SLIDE 2

Reference

 [ACGPR01] Searching the Web. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. ACM Transactions on Internet Technology, 1, pp. 2-43, 2001.

 [BP98] The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page. Proceedings of the 7th International WWW Conference, 1998.

SLIDE 3

Search Engines for the Web

SLIDE 4

SLIDE 5

What is it?

 Web Crawling = Graph Traversal

 Abstract algorithm:

  S = { start pages }
  repeat
   remove an element s from S
   foreach (s, v)  // s has a link to v
    if v not crawled before, insert v into S

 Let's look at an example
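The abstract algorithm above can be sketched directly in Python over an in-memory link graph (the graph and page names below are made up to stand in for real web pages):

```python
def crawl(graph, start_pages):
    """Abstract crawl: S is a set, and which element is removed is unspecified."""
    S = set(start_pages)                # S = {start pages}
    crawled = set()
    while S:                            # repeat
        s = S.pop()                     # remove an element s from S
        crawled.add(s)
        for v in graph.get(s, []):      # foreach (s, v): s has a link to v
            if v not in crawled and v not in S:  # if v not crawled before
                S.add(v)                # insert v into S
    return crawled

# Toy link graph standing in for the example on the next slide:
graph = {
    "index": ["news", "bbs"],
    "news": ["index"],
    "bbs": ["main", "index"],
}
print(sorted(crawl(graph, ["index"])))  # ['bbs', 'index', 'main', 'news']
```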

SLIDE 6

http://www.ustc.edu.cn/ch/index.php
http://news.ustc.edu.cn/
http://zsb.ustc.edu.cn/
http://ustcaf.org/
http://www.job.ustc.edu.cn/
http://email.ustc.edu.cn/
http://alumni.ustc.edu.cn/
http://netfee.ustc.edu.cn/
http://bbs.ustc.edu.cn/
http://bbs.ustc.edu.cn/main.html

SLIDE 7

Why is it not trivial? (Theoretical Issues)

 How to choose S? (start pages)

 e.g. we choose http://www.ustc.edu.cn/ch/index.php, because it is believed that from it we can reach almost all the significant pages at USTC; but this is not always true on the Internet.

SLIDE 8

Why is it not trivial? (Theoretical Issues)

 How to choose s from S? (crawl strategy)

 e.g. we use a DFS strategy to choose such an s:

  1. email.ustc.edu.cn
  2. netfee.ustc.edu.cn
  3. bbs.ustc.edu.cn
  ...

 In the worst case, we get into http://bbs.ustc.edu.cn and then into some board like test or water. Then, for a long period, we get nothing but rubbish. It gets even worse if we can fetch only a limited number of pages (for lack of resources).

SLIDE 9

Why is it not trivial? (Theoretical Issues)

 How to handle dynamic pages?

 Every dynamic page may be modified after we fetch it, so we need to refresh our copies.
 e.g. every article in our BBS may be modified, deleted, or have its marks changed; it is hard to keep up to date.

SLIDE 10

Why is it not trivial? (Practical Issues)

 Limited resources may bring many problems

 limited storage: pages must be compressed efficiently or even stored in a distributed way
 limited CPU resources: may need parallel techniques
 limited network resources: fetch pages as well as possible (but how to define *good*?)

SLIDE 11

A simple, special-purpose example

 fetch all articles from our bbs

 visit http://bbs.ustc.edu.cn/cgi/bbsall to get all the board names
 for each board name bn (e.g. Algorithm):

  fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=1 and extract the article links in this page
  fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=21 and extract the article links in this page
  ...
  until we cannot get any new pages

 for each article file name we get (e.g. M3DFB1F2C):

  fetch http://bbs.ustc.edu.cn/cgi/bbscon?bn=Algorithm&fn=M3DFB1F2C
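Since the bbsdoc/bbscon URLs follow a fixed pattern, the crawl plan above can be sketched as plain string construction; the fetching and link extraction are omitted, and the board name and article id are the ones from the slide:

```python
BASE = "http://bbs.ustc.edu.cn/cgi"

def board_index_urls(board, pages, per_page=20):
    # bbsdoc lists articles starting at 1, 21, 41, ... (20 articles per page)
    return [f"{BASE}/bbsdoc?board={board}&start={1 + i * per_page}"
            for i in range(pages)]

def article_url(board, fn):
    # bbscon fetches one article by board name and article file name
    return f"{BASE}/bbscon?bn={board}&fn={fn}"

print(board_index_urls("Algorithm", 2))
print(article_url("Algorithm", "M3DFB1F2C"))
```

In a real run, each index page would be fetched in turn until no new article links appear, as the slide describes.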

SLIDE 12

A complex, general-purpose example

 stored URLs for downloading
 choose a URL for downloading (PROBLEM 1: how to choose? Discussed later!)
 resolve host names quickly
 download the page (PROBLEM 2: what's REP?)
 temporarily store pages for multiple reading by multiple components
 store the page if it is new (PROBLEM 3: how?)
 extract URLs for downloading
 filter out unwanted URLs
 has a URL been encountered before? if not, store it (PROBLEM 4: how?)
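The components of this pipeline can be sketched as one loop; `fetch`, `extract_urls`, and `wanted` stand in for the downloader, parser, and URL filter, and the stub page data is invented so the sketch runs without a network:

```python
from collections import deque

def crawl_pipeline(seeds, fetch, extract_urls, wanted):
    frontier = deque(seeds)      # stored URLs for downloading
    seen = set(seeds)            # "encountered before?" test (PROBLEM 4)
    store = {}                   # stored pages
    while frontier:
        url = frontier.popleft()             # choose a URL (PROBLEM 1)
        store[url] = fetch(url)              # download the page (obeying REP: PROBLEM 2)
        for u in extract_urls(store[url]):   # extract URLs for downloading
            if wanted(u) and u not in seen:  # filter unwanted, skip seen URLs
                seen.add(u)
                frontier.append(u)
    return store

# Stub fetcher/extractor so the sketch is self-contained:
pages = {"a": "links:b,c", "b": "links:a", "c": "links:"}
extract = lambda page: [u for u in page.split(":")[1].split(",") if u]
print(sorted(crawl_pipeline(["a"], pages.get, extract, lambda u: True)))
# ['a', 'b', 'c']
```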

SLIDE 13

Testing whether something is new

 hashing is good, and good enough

 e.g. MD5
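A minimal sketch of the hash-based novelty test with MD5 (a 128-bit digest, so accidental collisions are vanishingly unlikely at crawl scale):

```python
import hashlib

seen_digests = set()

def is_new(page_bytes):
    # Two pages with the same MD5 digest are treated as duplicates
    h = hashlib.md5(page_bytes).hexdigest()
    if h in seen_digests:
        return False
    seen_digests.add(h)
    return True

print(is_new(b"<html>hello</html>"))  # True  (first time seen)
print(is_new(b"<html>hello</html>"))  # False (duplicate content)
```

MD5 is no longer safe against adversarial collisions, but for duplicate detection it is, as the slide says, good enough.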

SLIDE 14

What’s REP?

 A crawler should obey many rules such as the so-called Robots Exclusion Protocol

 a crawler should fetch /robots.txt before any crawling and read it, then determine whether to crawl the site at all, and what to crawl
 it is also important to give your contact information in your requests. Look at an example:

 66.249.72.244 - - [07/Mar/2006:16:08:21 +0800] "GET /userstatus.php?user=Quester HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

 Do you know who submitted this query?

SLIDE 15

Now, how to choose a URL? (Crawl strategy)

 Breadth-First Search

 maintain the URLs in a FIFO structure
 drawback: we may get into one host and crawl too much from it; this causes many problems, such as exhausting bandwidth and overloading the server, especially when many crawlers run in parallel

 example: 1000 crawlers crawling the BBS server at the same time vs. only about 10 of them crawling the same server at the same time

SLIDE 16

How to choose a URL? (Crawl strategy)

 Depth First Search

 maintain the URLs in a LIFO fashion
 shares the same drawback as BFS
 our first example unfortunately falls into this case
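BFS and DFS differ only in the frontier discipline, so one implementation covers both; a sketch with a deque used FIFO or LIFO (the graph and page names are invented):

```python
from collections import deque

def traverse(graph, start, strategy="bfs"):
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first
        u = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(u)
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return order

g = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
print(traverse(g, "root", "bfs"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
print(traverse(g, "root", "dfs"))  # ['root', 'b', 'b1', 'a', 'a2', 'a1']
```

Either way the crawler can hammer a single host for long stretches, which is the shared drawback the slides point out.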

SLIDE 17

How to choose a URL? (Crawl strategy)

 Random

 Random is random: choose the next URL from the frontier at random; we will not describe it further with probabilities

SLIDE 18

How to choose a URL? (Crawl strategy)

 Priority Search

 use some significance measure to decide which URL should be crawled first
 e.g. in our BBS, the following should be crawled first:

  on-top articles
  new articles
  hot articles (such as the top 10)
  notices
  maybe: articles posted by your lovely girl (or even girls)
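A priority frontier maps naturally onto a min-heap; the scoring function below is a made-up ranking of the slide's BBS article types (smaller score = crawled sooner):

```python
import heapq

def priority(url):
    # Hypothetical scores for the slide's categories; smaller = sooner
    if "ontop" in url:  return 0
    if "new" in url:    return 1
    if "hot" in url:    return 2
    if "notice" in url: return 3
    return 9

urls = ["bbs/hot10", "bbs/notice", "bbs/ontop1", "bbs/newpost", "bbs/other"]
frontier = [(priority(u), u) for u in urls]
heapq.heapify(frontier)               # min-heap ordered by (score, url)

order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
print(order)
# ['bbs/ontop1', 'bbs/newpost', 'bbs/hot10', 'bbs/notice', 'bbs/other']
```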

SLIDE 19

How to choose a URL? (Crawl strategy)

 Possible priority

 often-changing pages
 pages with high global ranks (e.g. PageRank)
 pages that you are focusing on (in crawlers built for special aims, such as a crawler for our BBS, or even just for yourself)

SLIDE 20

How to choose a URL? (Crawl strategy)

 How to estimate the goodness of a strategy?

 if it has crawled 1,000,000 pages, how many of them are hot, that is, of high importance?

SLIDE 21

Comparison between these strategies

SLIDE 22

Advanced Issues

 1. How to keep your pages up to date?
 2. How to handle dynamic pages?

 e.g. Google shows you dynamic results depending on your query and the pages it has collected so far. Can you crawl out all pages from Google? If you could, you would be another Google.

 3. How to balance your resources to achieve high performance?

SLIDE 23

Thank you!