Crawling
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2018
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Sec. 20.2 Basic crawler operation
- Begin with
- Must avoid trying to fetch them all at the same time
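A minimal sketch of one way to honor this constraint, assuming "them" refers to frontier URLs that share a host. The function name `interleave_by_host` and the example URLs are hypothetical; the idea is only to reorder a batch so that pages from one host are spread out rather than requested back to back.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts (illustrative helper, not from the slides)."""
    by_host = defaultdict(deque)
    for url in urls:
        # Assumes absolute URLs, so that hostname is never None.
        by_host[urlsplit(url).hostname].append(url)

    ordered = []
    queues = deque(by_host.values())
    while queues:
        queue = queues.popleft()
        ordered.append(queue.popleft())   # take one URL from this host
        if queue:
            queues.append(queue)          # host still has work: go to the back
    return ordered
```

Reordering alone does not enforce a minimum time gap between requests to the same host; it only avoids bursts within a batch.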
- E.g.,
- All steps are distributed
- Spam pages
- Spider traps
  - Malicious server that generates an infinite sequence of linked pages
  - Sophisticated traps generate pages that are not easily identified as dynamic
- Latency/bandwidth to remote servers vary
- Webmasters’ stipulations
- How “deep” should you crawl a site’s URL hierarchy?
- Site mirrors and duplicate pages
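Two of the complications above, spider traps and duplicate pages, can be illustrated with simple guards. The sketch below is a heuristic under stated assumptions, not a method given in the slides: the thresholds `MAX_DEPTH` and `MAX_URL_LEN`, the function `looks_like_trap`, and the class `DuplicateFilter` are all hypothetical names, and the duplicate check catches only byte-identical pages (near-duplicates need a different technique).

```python
import hashlib
from urllib.parse import urlsplit

MAX_DEPTH = 8       # assumed cap on the number of path segments
MAX_URL_LEN = 256   # assumed cap on total URL length


def looks_like_trap(url, max_depth=MAX_DEPTH, max_len=MAX_URL_LEN):
    """Heuristic: flag very long, very deep, or repetitive URLs."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(url) > max_len or len(segments) > max_depth:
        return True
    # The same segment appearing three or more times (e.g. /a/b/a/b/a/b)
    # is treated here as a symptom of an endlessly generated hierarchy.
    return any(segments.count(s) >= 3 for s in set(segments))


class DuplicateFilter:
    """Remembers content hashes to skip exact duplicate pages."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

As the slide notes, sophisticated traps are not easily identified, so URL-shape checks like these would miss many of them and may also reject legitimate deep pages.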
[Figure: back queue selector, heap, and a table mapping host name to back queue (entries: …, 3, 1, 20)]
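The components named in this figure (a host-to-back-queue table, a heap, and a back queue selector) can be sketched roughly as follows. This is a simplified illustration under assumptions, not the slides' exact design: the class name `BackQueues`, the fixed `delay` between hits to one host, and the explicit `now` timestamps are all hypothetical, and each back queue is keyed directly by host name instead of by a queue number.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class BackQueues:
    """Per-host queues plus a heap of (earliest allowed fetch time, host)."""

    def __init__(self, delay=2.0):
        self.delay = delay      # assumed politeness gap, in seconds
        self.queues = {}        # host name -> deque of pending URLs
        self.ready_at = {}      # host name -> earliest time of next fetch
        self.heap = []          # one entry per host with a non-empty queue

    def add(self, url, now):
        host = urlsplit(url).hostname   # assumes absolute URLs
        if host not in self.queues:
            self.queues[host] = deque()
            ready = max(now, self.ready_at.get(host, now))
            heapq.heappush(self.heap, (ready, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        queue = self.queues[host]
        url = queue.popleft()
        self.ready_at[host] = now + self.delay
        if queue:
            # Host still has URLs: re-enter the heap at its next allowed time.
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            del self.queues[host]
        return url
```

The heap keeps the host that may be contacted soonest at the top, so the selector does constant work to decide whether any fetch is currently allowed.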