Web Crawling
gzsun@ustc.edu.cn
Reference

[ACGPR01] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1, pp. 2-43, 2001.

[BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International WWW Conference, 1998.
Search Engines for the Web
What is it?
Web Crawling = Graph Traversal

Abstract algorithm:
    S = {start pages}
    repeat
        remove an element s from S
        foreach (s, v)    // s has a link to v
            if v not crawled before, insert v into S
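A minimal sketch of this traversal in Python. It is only an illustration: the third-party requests and beautifulsoup4 libraries are assumed to be available, and the max_pages limit is an added safeguard, not part of the abstract algorithm.

    # Minimal graph-traversal crawler (illustrative sketch, not production code)
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_pages, max_pages=100):
        frontier = list(start_pages)            # S = {start pages}
        crawled = set()
        while frontier and len(crawled) < max_pages:
            s = frontier.pop()                  # remove an element s from S
            if s in crawled:
                continue
            try:
                html = requests.get(s, timeout=10).text
            except requests.RequestException:
                continue
            crawled.add(s)
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                v = urljoin(s, a["href"])       # (s, v): s has a link to v
                if v not in crawled:            # if v not crawled before, insert v into S
                    frontier.append(v)
        return crawled

    # crawl(["http://www.ustc.edu.cn/ch/index.php"])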
Let's look at an example
http://www.ustc.edu.cn/ch/index.php
http://news.ustc.edu.cn/
http://zsb.ustc.edu.cn/
http://ustcaf.org/
http://www.job.ustc.edu.cn/
http://email.ustc.edu.cn/
http://alumni.ustc.edu.cn/
http://netfee.ustc.edu.cn/
http://bbs.ustc.edu.cn/
http://bbs.ustc.edu.cn/main.html
Why is it not trivial? (Theoretical Issues)
How to choose S? (start pages)
e.g. we choose http://www.ustc.edu.cn/ch/index.php, because it is believed that from it we can reach almost all the significant pages in USTC; but this is not always true on the Internet.
Why is it not trivial? (Theoretical Issues)
How to choose s from S? (crawl strategy)
e.g. we use a DFS strategy to choose such an s.
1. email.ustc.edu.cn
2. netfee.ustc.edu.cn
3. bbs.ustc.edu.cn
....
In the worst case, we get into http://bbs.ustc.edu.cn and then into some board like test or water. Then for a long period we get nothing but rubbish. It gets even worse if we can fetch only a limited number of pages (because of a lack of resources).
Why is it not trivial? (Theoretical Issues)
How to deal with dynamic pages?
Every dynamic page may be modified after we fetch it, so we need to refresh our copies. E.g. every article in our bbs may be modified, deleted, or have its marks changed; it is hard to keep our copy up to date.
Why is it not trivial? (Practical Issues)
Limited resources may bring many problems
limited storage: pages must be compressed efficiently or even stored in a distributed way
limited CPU resources: may need parallel techniques
limited network resources: fetch pages as well as possible (but how to define *good*?)
A simple and special-purpose example
fetch all articles from our bbs
visit http://bbs.ustc.edu.cn/cgi/bbsall to get all the board names
for each board name bn (e.g. Algorithm):
    fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=1 and extract the article links in this page
    fetch http://bbs.ustc.edu.cn/cgi/bbsdoc?board=Algorithm&start=21 and extract the article links in this page
    ... till we cannot get any new pages
    for each article file name we get (e.g. M3DFB1F2C):
        fetch http://bbs.ustc.edu.cn/cgi/bbscon?bn=Algorithm&fn=M3DFB1F2C
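A sketch of this special-purpose crawler in Python. The URL patterns come from the slides, but the regular expressions for board names and article file names, the 20-articles-per-page step, and the use of the requests library are all assumptions that would need adjusting to the real bbs markup.

    # Special-purpose bbs crawler (sketch; regexes and paging step are guesses)
    import re
    import requests

    BASE = "http://bbs.ustc.edu.cn/cgi"

    def crawl_board(board):
        articles, start, seen = [], 1, set()
        while True:
            listing = requests.get(f"{BASE}/bbsdoc?board={board}&start={start}", timeout=10).text
            found = set(re.findall(r"fn=(M[0-9A-F]+)", listing)) - seen   # hypothetical file-name pattern
            if not found:                        # ... till we cannot get any new pages
                break
            seen |= found
            for fn in found:
                articles.append(requests.get(f"{BASE}/bbscon?bn={board}&fn={fn}", timeout=10).text)
            start += 20                          # assumed page size of the listing
        return articles

    def crawl_bbs():
        index = requests.get(f"{BASE}/bbsall", timeout=10).text
        boards = re.findall(r"board=(\w+)", index)   # hypothetical pattern for board names
        return {bn: crawl_board(bn) for bn in set(boards)}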
A complex and universal example
stored URLs for downloading
choose a URL for downloading (PROBLEM 1: how to choose? discussed later)
resolve host names quickly
download the page (PROBLEM 2: what is REP?)
temporarily store pages for reading by multiple components
store the page if it is new (PROBLEM 3: how?)
extract URLs for downloading
filter out unwanted URLs
has the URL been encountered before? if not, store it (PROBLEM 4: how?)
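A skeleton of this general crawler in Python, with one step per box above. This is only a sketch under simplifying assumptions: Problems 1-4 are handled in the crudest possible way here (FIFO choice, REP ignored, MD5 page fingerprints, an in-memory URL set), and the link-extraction regex and the filter are placeholders.

    # General crawler skeleton (sketch only; see later slides for Problems 1-4)
    import collections, hashlib, re
    import requests

    def crawler_loop(seed_urls, max_pages=50):
        frontier = collections.deque(seed_urls)   # stored URLs for downloading
        seen_urls = set(seed_urls)                # Problem 4: URL encountered before?
        page_hashes, stored = set(), {}
        while frontier and len(stored) < max_pages:
            url = frontier.popleft()              # Problem 1: here simply FIFO
            try:                                  # Problem 2 (REP) is ignored in this sketch
                page = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            digest = hashlib.md5(page.encode("utf-8", "ignore")).hexdigest()
            if digest in page_hashes:             # Problem 3: store the page only if it is new
                continue
            page_hashes.add(digest)
            stored[url] = page
            for link in re.findall(r'href="(http[^"]+)"', page):     # extract URLs (crude)
                if link not in seen_urls and not link.endswith((".jpg", ".gif", ".zip")):  # filter unwanted URLs
                    seen_urls.add(link)
                    frontier.append(link)
        return stored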
Testing whether something (a page or a URL) is new

A hash is good, and good enough
e.g. MD5
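A sketch of this idea in Python with the standard hashlib module: keep the MD5 digest of everything stored so far and treat an item as new only if its digest has not been seen. The in-memory set is an assumption; a large crawl would keep the digests on disk or in a database.

    # Duplicate detection by hashing (sketch)
    import hashlib

    seen_digests = set()

    def is_new(content: bytes) -> bool:
        digest = hashlib.md5(content).hexdigest()   # 128-bit fingerprint, e.g. MD5
        if digest in seen_digests:
            return False
        seen_digests.add(digest)
        return True

    # is_new(b"<html>...</html>") -> True the first time, False for an identical copy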
What’s REP?
A crawler should obey many rules, such as the so-called Robots Exclusion Protocol (REP).
A crawler should fetch /robots.txt from a site before any crawling and read it to determine whether to crawl this site at all and, if so, which parts to crawl. It is also important to give your contact information in your requests. Look at an example:
66.249.72.244 - - [07/Mar/2006:16:08:21 +0800] "GET /userstatus.php?user=Quester HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

Do you know who submitted this query?
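A sketch of honouring REP with Python's standard library: fetch and parse /robots.txt before crawling, check every URL against it, and send a User-Agent header that names the crawler and gives contact information, just as Googlebot identifies itself in the log line above. The crawler name, URL, and e-mail address below are placeholders.

    # Robots Exclusion Protocol check (sketch; identification strings are placeholders)
    import urllib.robotparser
    import requests

    USER_AGENT = "MyCrawler/0.1 (+http://example.edu/crawler; contact: someone@example.edu)"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://bbs.ustc.edu.cn/robots.txt")
    rp.read()                                   # fetch /robots.txt before any crawling

    def polite_get(url):
        if not rp.can_fetch(USER_AGENT, url):   # is this URL allowed by robots.txt?
            return None
        return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)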
Now, how to choose a URL? (Crawl strategy)
Breadth-First Search
Maintain the URLs in a FIFO structure. Drawback: we may get into one host and crawl too much from it, which can cause many problems, such as wasting bandwidth and overloading the server, especially when many crawlers run in parallel.
example: 1000 crawlers crawling the bbs server at the same time vs. only about 10 of them crawling the same server at the same time
How to choose a URL? (Crawl strategy)
Depth-First Search
Maintain the URLs in a LIFO fashion. Shares the same drawback with BFS. Our first example unfortunately falls into this case.
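In code, the only difference between the two strategies is which end of the frontier the next URL is taken from. A sketch with collections.deque (the frontier contents are illustrative):

    # BFS vs. DFS crawl order: same frontier, different end
    import collections

    frontier = collections.deque(["http://www.ustc.edu.cn/ch/index.php"])

    def next_url_bfs(frontier):
        return frontier.popleft()   # FIFO: oldest URL first (breadth-first)

    def next_url_dfs(frontier):
        return frontier.pop()       # LIFO: newest URL first (depth-first)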
How to choose a URL? (Crawl strategy)
Random
Random is random, so we randomly choose not to describe it with probability.
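A sketch of the random strategy, assuming the frontier is kept in a plain Python list: pick the next URL uniformly at random and remove it in O(1) by swapping it to the end.

    # Random crawl strategy (sketch)
    import random

    def next_url_random(frontier):
        i = random.randrange(len(frontier))
        frontier[i], frontier[-1] = frontier[-1], frontier[i]   # swap chosen URL to the end
        return frontier.pop()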