1998: Enter Link Analysis



  1. 1998: Enter Link Analysis
     • uses hyperlink structure to focus the relevant set
     • combines traditional IR score with popularity score
     Page and Brin, 1998; Kleinberg


  7. Web Information Retrieval
     IR before the Web = traditional IR
     IR on the Web = web IR
     How is the Web different from other document collections?
     • It’s huge.
       – over 10 billion pages, average page size of 500KB
       – 20 times the size of the Library of Congress print collection
       – Deep Web is 400 times bigger than the Surface Web
     • It’s dynamic.
       – content changes: 40% of pages change in a week, 23% of .com pages change daily
       – size changes: billions of pages added each year
     • It’s self-organized.
       – no standards, review process, formats
       – errors, falsehoods, link rot, and spammers!
     A Herculean Task!

  8. Web Information Retrieval
     • Ah, but it’s hyperlinked!
       – Vannevar Bush’s 1945 memex

  9. Elements of a Web Search Engine
     [Diagram. Labels: WWW, Crawler Module, Page Repository, Indexing Module, Indexes (Content Index, Structure Index, Special-purpose indexes), query-independent, User, Queries, Results, Query Module, Ranking Module]


  14. The Ranking Module (generates popularity scores)
      • Measure the importance of each page
      • The measure should be
        – independent of any query
        – primarily determined by the link structure of the Web
        – tempered by some content considerations
      • Compute these measures off-line, long before any queries are processed
      • Google’s PageRank© technology distinguishes it from all competitors
      Google’s PageRank = Google’s $$$$$


  18. Google’s PageRank (Lawrence Page & Sergey Brin 1998)
      The Google Goals
      • Create a PageRank r(P) that is not query dependent
        ⊲ Off-line calculations: no query-time computation
      • Let the Web vote with in-links
        ⊲ But not by simple link counts
          – One link to P from Yahoo! is important
          – Many links to P from me is not
      • Share the vote
        ⊲ Yahoo! casts many “votes”
          – the value of each vote from Yahoo! is diluted
        ⊲ If Yahoo! “votes” for n pages
          – then P receives only r(Y)/n credit from Y
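The vote-sharing rule can be illustrated with a minimal sketch. The function name `vote_credit` and all rank values and link counts below are hypothetical, chosen only to show how a vote is diluted by the number of out-links of the voting page; they are not figures from the slides.

```python
def vote_credit(voter_rank: float, voter_out_links: int) -> float:
    """Credit a voting page Y passes to each page it links to: r(Y) / n."""
    if voter_out_links <= 0:
        raise ValueError("a voting page must have at least one out-link")
    return voter_rank / voter_out_links


# Hypothetical values: a high-rank hub that links to 100 pages,
# and a low-rank personal page that links to only 2 pages.
hub_credit = vote_credit(voter_rank=0.30, voter_out_links=100)
small_credit = vote_credit(voter_rank=0.001, voter_out_links=2)

# Even after dilution, the single link from the hub is worth more here.
print(hub_credit, small_credit)
```

With these assumed numbers the diluted hub vote still outweighs the vote from the small page, which matches the intuition on the slide that one link from an important page counts for more than a simple link tally would suggest.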


  23. PageRank
      The Definition
      $$ r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|} $$
      where $B_P$ = {all pages pointing to $P$} and $|Q|$ = number of out-links from $Q$.
      Successive Refinement
      Start with $r_0(P_i) = 1/n$ for all pages $P_1, P_2, \ldots, P_n$.
      Iteratively refine the rankings for each page:
      $$ r_1(P_i) = \sum_{Q \in B_{P_i}} \frac{r_0(Q)}{|Q|} $$
      $$ r_2(P_i) = \sum_{Q \in B_{P_i}} \frac{r_1(Q)}{|Q|} $$
      ...
      $$ r_{j+1}(P_i) = \sum_{Q \in B_{P_i}} \frac{r_j(Q)}{|Q|} $$
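One refinement step of the summation formula can be sketched in plain Python. The three-page graph (pages "A", "B", "C") and the helper name `refine` are assumptions made for illustration and do not come from the slides. Pages without out-links are simply skipped here, which is a simplification of this sketch.

```python
def refine(ranks: dict, out_links: dict) -> dict:
    """One successive-refinement step: each page Q passes r(Q)/|Q| to every page it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for q, targets in out_links.items():
        if not targets:
            continue  # page with no out-links: passes on nothing in this basic sketch
        share = ranks[q] / len(targets)
        for p in targets:
            new_ranks[p] += share
    return new_ranks


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
n = len(out_links)

r0 = {page: 1.0 / n for page in out_links}  # r_0(P_i) = 1/n
r1 = refine(r0, out_links)                  # r_1
r2 = refine(r1, out_links)                  # r_2
print(r1)
print(r2)
```

In this assumed graph, page C collects half of A's rank plus all of B's rank after the first step, so it moves from 1/3 to 1/2 while the total rank stays at 1.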


  26. In Matrix Notation
      After step k:
      – $\pi_k^T = [\, r_k(P_1), r_k(P_2), \ldots, r_k(P_n) \,]$
      – $\pi_{k+1}^T = \pi_k^T H$, where $h_{ij} = 1/|P_i|$ if $i \to j$ and $h_{ij} = 0$ otherwise
      – PageRank vector = $\pi^T = \lim_{k \to \infty} \pi_k^T$ = eigenvector for $H$, provided that the limit exists
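The matrix form can be sketched as a row-vector times matrix iteration. The three-page link structure, the helper names `build_h` and `step`, and the choice of 100 iterations are assumptions for illustration only. The sketch uses plain Python lists and does not handle the case where the limit fails to exist.

```python
def build_h(out_links: dict, n: int) -> list:
    """Build H with h[i][j] = 1/|P_i| if page i links to page j, and 0 otherwise."""
    h = [[0.0] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            h[i][j] = 1.0 / len(targets)
    return h


def step(pi: list, h: list) -> list:
    """One iteration pi_{k+1}^T = pi_k^T H (row vector times matrix)."""
    n = len(pi)
    return [sum(pi[i] * h[i][j] for i in range(n)) for j in range(n)]


# Hypothetical graph on pages 0, 1, 2: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
out_links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
H = build_h(out_links, n)

pi = [1.0 / n] * n        # pi_0^T, the uniform starting vector
for _ in range(100):      # iteration count is an arbitrary choice for this sketch
    pi = step(pi, H)
print(pi)
```

For this assumed graph the iteration settles near (0.4, 0.2, 0.4), and applying `step` once more leaves that vector essentially unchanged, which is the eigenvector property stated on the slide.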

  28. Tiny Web
      [Figure: web graph with six pages numbered 1 to 6]
      H is the 6 × 6 matrix with rows and columns indexed by P1, ..., P6. Only the P1 row is filled in on this slide; its entries as extracted are 1/2, 1/2, 0, 0, 0, 0.
