Link Analysis — Stony Brook University, CSE545, Fall 2016
  1. Link Analysis Stony Brook University CSE545, Fall 2016

  4. The Web, circa 1998 — Two ways to find pages: match keywords, language (information retrieval), or explore a directory.

  5. The Web, circa 1998 — Match keywords, language (information retrieval): easy to game with “term spam”. Explore a directory: time-consuming; not open-ended.

  6. Enter PageRank ...

  7. PageRank Key Idea: Consider the citations of the website in addition to keywords.

  8. PageRank Key Idea: Consider the citations of the website in addition to keywords. Who links to it? and what are their citations?

  9. PageRank Key Idea: Consider the citations of the website in addition to keywords. Who links to it? and what are their citations? The Web as a directed graph:

  10. PageRank Key Idea: Consider the citations of the website in addition to keywords. Who links to it? and what are their citations? Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  11. PageRank Key Idea: Consider the citations of the website in addition to keywords. Flow Model: in-links as votes Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  12. PageRank Key Idea: Consider the citations of the website in addition to keywords. Flow Model: in-links (citations) as votes But citations from important pages should count more. Use recursion to figure out if each page is important. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  14. PageRank Key Idea: Consider the citations of the website in addition to keywords. Flow Model: in-links (citations) as votes. How to compute? Each page j has an importance (i.e. rank, r_j), and n_j is |out-links| of page j. A page’s rank is the sum of the votes it receives: r_j = Σ_{i→j} r_i / n_i. But citations from important pages should count more, so use recursion to figure out if each page is important. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  15. PageRank How to compute? [Figure: directed graph on four pages A, B, C, D.] Each page j has an importance (i.e. rank, r_j); n_j is |out-links|. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  16. PageRank How to compute? [Figure: the same four-page graph; D’s in-links come from A, B, and C.] Each page j has an importance (i.e. rank, r_j); n_j is |out-links|. Example: r_D = r_A/1 + r_B/4 + r_C/2 (A has 1 out-link, B has 4, C has 2). Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?
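The flow equation on this slide can be evaluated directly: each in-link of D contributes the linking page’s rank divided by its out-degree. A minimal sketch — the rank values for A, B, and C below are made-up illustrative numbers, not given on the slides:

```python
# Flow model: page j's rank is the sum over its in-links i of r_i / n_i,
# where n_i is the number of out-links of the linking page i.
def vote(rank, out_links):
    """Contribution a page with this rank and out-degree sends along each out-link."""
    return rank / out_links

# Slide example: D is linked to by A (1 out-link), B (4 out-links), C (2 out-links).
r_A, r_B, r_C = 0.4, 0.2, 0.2   # assumed example values
r_D = vote(r_A, 1) + vote(r_B, 4) + vote(r_C, 2)
print(r_D)  # 0.4 + 0.05 + 0.1 = 0.55
```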

  18. PageRank How to compute? A system of equations? Each page j has an importance (i.e. rank, r_j); n_j is |out-links|. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  20. PageRank How to compute? A system of equations? Provides intuition, but impractical to solve at scale. Each page j has an importance (i.e. rank, r_j); n_j is |out-links|. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?
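To make “impractical at scale” concrete, the rank equations can be solved directly for a tiny graph. A sketch, assuming the 4-page transition matrix that appears on the later slides: replace one redundant equation of (M − I)·r = 0 with the constraint sum(r) = 1 and run Gaussian elimination. Fine for 4 pages; hopeless for billions, which is why power iteration is used instead.

```python
from fractions import Fraction as F

# Transition matrix from the later slides (columns = "from" page, rows = "to" page).
M = [[F(0), F(1, 2), F(1), F(0)],
     [F(1, 3), F(0), F(0), F(1, 2)],
     [F(1, 3), F(0), F(0), F(1, 2)],
     [F(1, 3), F(1, 2), F(0), F(0)]]
n = 4

# Build A·r = b: rows 0..2 from (M - I)·r = 0, row 3 is the constraint sum(r) = 1.
A = [[M[i][j] - (F(1) if i == j else F(0)) for j in range(n)] for i in range(3)]
A.append([F(1)] * n)
b = [F(0), F(0), F(0), F(1)]

# Gaussian elimination with a nonzero-pivot search, then back substitution.
for col in range(n):
    piv = next(row for row in range(col, n) if A[row][col] != 0)
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for row in range(col + 1, n):
        f = A[row][col] / A[col][col]
        for k in range(col, n):
            A[row][k] -= f * A[col][k]
        b[row] -= f * b[col]
r = [F(0)] * n
for i in reversed(range(n)):
    r[i] = (b[i] - sum(A[i][k] * r[k] for k in range(i + 1, n))) / A[i][i]
print(r)  # [1/3, 2/9, 2/9, 2/9]
```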

  21. PageRank “Transition Matrix”, M (entry = probability of moving to the row page from the column page):

      to \ from   A     B     C     D
      A           0    1/2    1     0
      B          1/3    0     0    1/2
      C          1/3    0     0    1/2
      D          1/3   1/2    0     0

      Innovation 1: What pages would a “random Web surfer” end up at?

  22. PageRank With the transition matrix M from the previous slide — Innovation 1: What pages would a “random Web surfer” end up at? To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼].

  23. PageRank With the transition matrix M — to start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]. After first iteration: M·r = [3/8, 5/24, 5/24, 5/24].

  24. PageRank “Transition Matrix”, M (to \ from):

      to \ from   A     B     C     D
      A           0    1/2    1     0
      B          1/3    0     0    1/2
      C          1/3    0     0    1/2
      D          1/3   1/2    0     0

      Innovation 1: What pages would a “random Web surfer” end up at? To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]. After first iteration: M·r = [3/8, 5/24, 5/24, 5/24]. After second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
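The two iterations above can be checked exactly with Python’s `fractions` module; `M` is the transition matrix from the slide:

```python
from fractions import Fraction as F

# Transition matrix M (columns = "from" page, rows = "to" page).
M = [
    [F(0), F(1, 2), F(1), F(0)],     # to A
    [F(1, 3), F(0), F(0), F(1, 2)],  # to B
    [F(1, 3), F(0), F(0), F(1, 2)],  # to C
    [F(1, 3), F(1, 2), F(0), F(0)],  # to D
]

def matvec(M, r):
    """Multiply matrix M by vector r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

r = [F(1, 4)] * 4       # uniform start: [1/4, 1/4, 1/4, 1/4]
r1 = matvec(M, r)       # first iteration:  [3/8, 5/24, 5/24, 5/24]
r2 = matvec(M, r1)      # second iteration: [15/48, 11/48, 11/48, 11/48]
print(r1, r2)
```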

  25. PageRank Power iteration algorithm (with the transition matrix M and starting vector from the previous slides):

      Initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while err_norm(r[t], r[t-1]) > min_err:
          r[t+1] = M·r[t]
          t += 1
      solution = r[t]

      err_norm(v1, v2) = |v1 - v2|   # L1 norm

      To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]. After first iteration: M·r = [3/8, 5/24, 5/24, 5/24]. After second iteration: M(M·r) = M²·r = [15/48, 11/48, …]

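A runnable sketch of the power iteration pseudocode above, using the slide’s 4-page transition matrix; the L1 `err_norm` follows the slide, and the `min_err` threshold is an assumed value:

```python
def err_norm(v1, v2):
    """L1 norm of the difference between two vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

def power_iteration(M, n, min_err=1e-10):
    """Repeatedly apply M to a uniform start vector until the L1 change is small."""
    r = [1.0 / n] * n            # r[0] = [1/N, ..., 1/N]
    prev = [0.0] * n             # r[-1] = [0, ..., 0]
    while err_norm(r, prev) > min_err:
        prev = r
        r = [sum(M[i][j] * prev[j] for j in range(n)) for i in range(n)]
    return r

# Transition matrix from the slide (columns = "from" page, rows = "to" page).
M = [[0, 1/2, 1, 0],
     [1/3, 0, 0, 1/2],
     [1/3, 0, 0, 1/2],
     [1/3, 1/2, 0, 0]]
r = power_iteration(M, 4)
print(r)  # converges toward [1/3, 2/9, 2/9, 2/9]
```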
  27. PageRank As err_norm gets smaller we are moving toward r = M·r: we are actually just finding the eigenvector of M. (Power iteration algorithm and example iterations as on slide 25.)

  28. PageRank As err_norm gets smaller we are moving toward r = M·r: we are actually just finding the eigenvector of M. (x is an eigenvector of A, with eigenvalue λ, if A·x = λ·x.)

  29. PageRank As err_norm gets smaller we are moving toward r = M·r: power iteration finds the eigenvector of M. x is an eigenvector of A, with eigenvalue λ, if A·x = λ·x. Here λ = 1, since the columns of M sum to 1; thus 1·r = M·r, and the converged r is the eigenvector of M with eigenvalue 1.

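The eigenvalue-1 claim can be checked numerically: for the slide’s transition matrix, the fixed point satisfies M·r = 1·r. The rank vector below, [1/3, 2/9, 2/9, 2/9], is obtained by solving r = M·r for this 4-page graph and is an assumption of this sketch, not a value printed on the slides:

```python
# Verify M·r = r (eigenvalue λ = 1) for the converged rank vector.
M = [[0, 1/2, 1, 0],
     [1/3, 0, 0, 1/2],
     [1/3, 0, 0, 1/2],
     [1/3, 1/2, 0, 0]]
r = [1/3, 2/9, 2/9, 2/9]   # fixed point of r = M·r for this graph
Mr = [sum(M[i][j] * r[j] for j in range(4)) for i in range(4)]
print(all(abs(Mr[i] - r[i]) < 1e-9 for i in range(4)))  # True: M·r = 1·r
```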