CAI: Cerca i Anàlisi d’Informació Grau en Ciència i Enginyeria de Dades, UPC
4. Searching the Web. PageRank
October 17, 2019
Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldà, Department of Computer Science, UPC
◮ by keywords
◮ implicitly in choice of seed pages
◮ Fast duplicate detection is a problem in itself
◮ Fingerprints or k-shingles (similar to n-grams)
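As an illustration of the k-shingle idea, here is a minimal sketch that compares two texts by the overlap of their shingle sets. The word-level shingles, the choice k = 3, the Jaccard similarity measure, and the names `shingles` and `jaccard` are assumptions made for this example; the slides do not prescribe them.

```python
def shingles(text, k=3):
    """Return the set of word-level k-shingles (runs of k consecutive words)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield at most one (short) shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

# Near-duplicates share most shingles; a single changed word
# only affects the k shingles that contain it.
similarity = jaccard(shingles(doc1), shingles(doc2))
print(f"Jaccard similarity: {similarity:.2f}")
```

In a real crawler the shingle sets would typically be hashed into compact fingerprints rather than compared directly, since comparing full sets for every pair of pages does not scale.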
◮ /robots.txt file: administrator preferences
◮ “If you are agent X, please don’t explore directory Y”
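A crawler can check these preferences before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on an in-memory robots.txt; the agent names `X` and `OtherBot`, the directory `/Y/`, and the host `example.org` are placeholders chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: agent X must stay out of /Y/,
# every other agent may explore everything.
robots_txt = """\
User-agent: X
Disallow: /Y/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Agent X is asked not to explore directory Y ...
x_in_y = parser.can_fetch("X", "https://example.org/Y/page.html")
# ... but may fetch pages elsewhere on the site.
x_elsewhere = parser.can_fetch("X", "https://example.org/other.html")
# Other agents fall under the catch-all "*" entry.
other_in_y = parser.can_fetch("OtherBot", "https://example.org/Y/page.html")

print(x_in_y, x_elsewhere, other_in_y)
```

Note that robots.txt expresses a request, not an enforcement mechanism: a polite crawler honours it voluntarily.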
◮ Text in an anchor very relevant for target page
◮ Font, placement in page makes some terms extra relevant
◮ E.g., p_i(0) is the probability of being at state i at time 0
◮ Starts at node i with probability p(0)(i). Then repeats
◮ With probability 1 − λ, jump to a randomly chosen node
◮ Else, if out(i) = 0, jump to a randomly chosen node
◮ Else jump to any successor of i chosen at random.
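The random walk described above can be turned into a score per node by repeatedly updating the probability distribution over nodes until it stabilises. The following is a minimal sketch under stated assumptions: the graph is given as a list of successor lists, "a randomly chosen node" means uniformly at random over all nodes, the start distribution p(0) is uniform, and λ = 0.85 with 100 iterations; the function name `pagerank` and the example graph are hypothetical.

```python
def pagerank(successors, lam=0.85, iterations=100):
    """Iterate the random-walk distribution; successors[i] lists the nodes i links to."""
    n = len(successors)
    p = [1.0 / n] * n  # assumed start distribution p(0): uniform
    for _ in range(iterations):
        nxt = [0.0] * n
        jump_mass = 0.0  # probability mass that jumps to a random node
        for i, out in enumerate(successors):
            if out:
                # With probability 1 - lam, jump to a randomly chosen node.
                jump_mass += (1.0 - lam) * p[i]
                # Otherwise follow one of the out-links, chosen at random.
                share = lam * p[i] / len(out)
                for j in out:
                    nxt[j] += share
            else:
                # out(i) = 0: always jump to a randomly chosen node.
                jump_mass += p[i]
        # Random jumps land uniformly on all n nodes.
        p = [x + jump_mass / n for x in nxt]
    return p


# Small example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; node 3 has no links.
graph = [[1, 2], [2], [0], []]
ranks = pagerank(graph)
for node, score in enumerate(ranks):
    print(node, round(score, 4))
```

Because every step redistributes all of the probability mass, the scores always sum to 1, and a node with no in-links (node 3 here) receives only its share of the random jumps.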
◮ TrustRank, SpamMass, ...
◮ (see Leskovec, Rajaraman, Ullman ch. 5.4)
◮ Computed off-line
◮ Collective reputation
◮ Insensitive to a particular user’s needs