Link Analysis, Stony Brook University CSE545, Spring 2019

  1. Link Analysis Stony Brook University CSE545, Spring 2019

  2. The Web, circa 1998

  3. The Web, circa 1998. Two ways to find pages: match keywords and language (information retrieval), or explore a directory.

  4. The Web, circa 1998. Matching keywords and language (information retrieval): easy to game with “term spam”. Exploring a directory: time-consuming; not open-ended.

  5. Enter PageRank ...

  6. PageRank. Key idea: consider the citations of the website.

  7. PageRank. Key idea: consider the citations of the website. Who links to it, and what are their citations?

  8. PageRank. Key idea: consider the citations of the website. Who links to it, and what are their citations?
     Innovation 1: What pages would a “random Web surfer” end up at?
     Innovation 2: Not just a page's own terms, but what terms are used by its citations?

  9. PageRank. View 1: Flow Model: in-links as votes. [Figure: example graph with nodes A, B, C, D, E, F]

  11. PageRank. View 1: Flow Model: in-links (citations) as votes, but citations from important pages should count more. => Use recursion to figure out whether each page is important.

  12. PageRank. View 1: Flow Model. [Figure: graph with nodes A, B, C, D] How to compute? Each page $j$ has an importance (i.e. rank) $r_j$, and $n_j$ is its number of out-links, |out-links|.

  13. PageRank. View 1: Flow Model. [Figure: the in-links of D are labeled $r_A/1$, $r_B/4$, and $r_C/2$]
      $r_D = r_A/1 + r_B/4 + r_C/2$
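
The sum on slide 13 looks like one instance of a general flow equation. As a sketch in the slides' notation (the arrow $i \to j$, read as "page $i$ links to page $j$", is notation assumed here and does not appear in the transcript):

```latex
r_j = \sum_{i \to j} \frac{r_i}{n_i}
```

For page D, whose in-links come from A, B, and C with 1, 4, and 2 out-links respectively, this reduces to $r_D = r_A/1 + r_B/4 + r_C/2$.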

  15. PageRank. View 1: Flow Model. [Figure: graph with nodes A, B, C, D] A system of equations: one flow equation per page, with the ranks $r_j$ as the unknowns.

  17. PageRank. View 1: Flow Model. Solve the system for the ranks $r_j$.

  18. PageRank. [Figure: graph with nodes A, B, C, D] Transition matrix, M (columns: from; rows: to):

        to \ from    A     B     C     D
            A        0    1/2    1     0
            B       1/3    0     0    1/2
            C       1/3    0     0    1/2
            D       1/3   1/2    0     0

  19. View 2: Matrix Formulation. Same graph (A, B, C, D) and transition matrix M as on slide 18.
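
A minimal sketch of how a matrix like M could be built from out-links. The edge list below is an assumption: the figure is not in the transcript, so the edges are inferred from the nonzero entries in each column of M. The names `OUT_LINKS` and `transition_matrix` are hypothetical.

```python
from fractions import Fraction

# Out-links inferred from the columns of M on slide 18 (assumption: the
# original figure is missing, so these edges are read off the matrix).
OUT_LINKS = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def transition_matrix(out_links):
    """Return (nodes, M) with M[i][j] = 1/n_j if page j links to page i, else 0."""
    nodes = sorted(out_links)
    index = {name: k for k, name in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in out_links.items():
        for dst in targets:
            # Column = source page, row = destination page.
            M[index[dst]][index[src]] = Fraction(1, len(targets))
    return nodes, M

nodes, M = transition_matrix(OUT_LINKS)
for name, row in zip(nodes, M):
    print(name, [str(x) for x in row])
```

Each column sums to 1 because every page here has at least one out-link and splits its vote evenly among them.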

  20. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? [Graph and transition matrix M as on slide 18]

  21. To start, all four pages are equally likely, at ¼ each.

  22. Suppose the surfer ends up at D (probability ¼).

  23. From D, C and B are then equally likely: ->D->B = ¼*½; ->D->C = ¼*½.

  24. If the surfer ends up at C, then A is the only option: ->D->C->A = ¼*½*1.
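
The path probabilities in this walk-through can be checked with a short sketch. The dictionary `P` and the function `path_probability` are hypothetical names; the probabilities are read off the columns of the slides' transition matrix M.

```python
from fractions import Fraction

# P[src][dst]: probability of following a link from src to dst,
# taken from the columns of the transition matrix M in the slides.
P = {
    "A": {"B": Fraction(1, 3), "C": Fraction(1, 3), "D": Fraction(1, 3)},
    "B": {"A": Fraction(1, 2), "D": Fraction(1, 2)},
    "C": {"A": Fraction(1)},
    "D": {"B": Fraction(1, 2), "C": Fraction(1, 2)},
}

def path_probability(path, n_pages=4):
    """Probability that a surfer who starts uniformly at random follows `path`."""
    prob = Fraction(1, n_pages)  # uniform start: 1/N for the first page
    for src, dst in zip(path, path[1:]):
        prob *= P[src].get(dst, Fraction(0))  # 0 if src has no link to dst
    return prob

print(path_probability(["D", "B"]))       # 1/8
print(path_probability(["D", "C", "A"]))  # 1/8
```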

  25. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? ... [Graph and transition matrix M as on slide 18]

  28. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start: N = 4 nodes, so $r = [1/4, 1/4, 1/4, 1/4]$. [Graph and transition matrix M as on slide 18]

  29. After the 1st iteration: $M \cdot r = [3/8, 5/24, 5/24, 5/24]$

  30. After the 2nd iteration: $M(M \cdot r) = M^2 \cdot r = [15/48, 11/48, …]$
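
The two iterations can be reproduced exactly with rational arithmetic. This is a sketch only; `mat_vec` is a hypothetical helper, and M is the slides' transition matrix (rows: to, columns: from).

```python
from fractions import Fraction as F

# Transition matrix M from the slides, rows and columns ordered A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

def mat_vec(M, r):
    """Matrix-vector product M·r for a list-of-rows matrix."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]

r0 = [F(1, 4)] * 4   # uniform start, N = 4
r1 = mat_vec(M, r0)  # after the 1st iteration
r2 = mat_vec(M, r1)  # after the 2nd iteration
print([str(x) for x in r1])  # ['3/8', '5/24', '5/24', '5/24']
print([str(x) for x in r2])  # ['5/16', '11/48', '11/48', '11/48']
```

`Fraction` prints in lowest terms, so 15/48 appears as 5/16. The last two entries of the second iterate are computed here, not taken from the slide, which elides them.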

  31. Power iteration algorithm: start from r[0] = [1/N, …, 1/N] and keep iterating while err_norm(r[t], r[t-1]) > min_err, where err_norm(v1, v2) = sum(|v1 - v2|) is the L1 norm. [Graph and transition matrix M as on slide 18]

  32. Power iteration algorithm:
        initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, ..., 0], t = 0
        while (err_norm(r[t], r[t-1]) > min_err):
            r[t+1] = M·r[t]
            t += 1
        solution = r[t]
        err_norm(v1, v2) = sum(|v1 - v2|)   #L1 norm
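
A minimal runnable sketch of the power iteration pseudocode, in plain Python with floats. Two details are assumptions rather than part of the slides: the `max_iter` safety cap, and keeping only the latest two vectors instead of the full history r[0], r[1], ….

```python
def power_iteration(M, min_err=1e-10, max_iter=1000):
    """Repeat r <- M·r from the uniform vector until the L1 change is <= min_err."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):  # max_iter is an added safety cap (assumption)
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        err = sum(abs(a - b) for a, b in zip(r_next, r))  # L1 norm of the change
        r = r_next
        if err <= min_err:
            break
    return r

# Transition matrix M from the slides (rows: to, columns: from; order A, B, C, D).
M = [
    [0,   1/2, 1, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0],
]

r = power_iteration(M)
print([round(x, 4) for x in r])  # [0.3333, 0.2222, 0.2222, 0.2222]
```

For this graph the iteration settles on A being the most important page, with B, C, and D tied.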

  33. View 3: Eigenvectors. As err_norm gets smaller we are moving toward: $r = M \cdot r$. [Power iteration algorithm as on slide 32]

  34. View 3: Eigenvectors. As err_norm gets smaller we are moving toward: $r = M \cdot r$. We are actually just finding the eigenvector of M: this is what the power iteration algorithm finds. x is an eigenvector of A if: $A \cdot x = \mu \cdot x$. (Leskovec et al., 2014; http://www.mmds.org/)

  35. View 3: Eigenvectors. $\mu = 1$ (the eigenvalue for the first, principal eigenvector) since the columns of M sum to 1. Thus, if r is x, then $M r = 1 r$.
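
The eigenvector claim can be checked exactly for the slides' matrix. In this sketch the candidate vector r = [1/3, 2/9, 2/9, 2/9] is not stated in the slides; it was obtained by solving r = M·r with the entries summing to 1 for this particular M.

```python
from fractions import Fraction as F

# Transition matrix M from the slides (rows: to, columns: from; order A, B, C, D).
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

# Candidate eigenvector for eigenvalue 1 (worked out for this example only).
r = [F(1, 3), F(2, 9), F(2, 9), F(2, 9)]

col_sums = [sum(M[i][j] for i in range(4)) for j in range(4)]
Mr = [sum(M[i][j] * r[j] for j in range(4)) for i in range(4)]

print(col_sums == [1, 1, 1, 1])  # True: M is column-stochastic
print(Mr == r)                   # True: M·r = 1·r
```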

  36. View 4: Markov Process. Where is the surfer at time t+1? $p(t+1) = M \cdot p(t)$. Suppose $p(t+1) = p(t)$; then p(t) is a stationary distribution of a random walk. Thus, r is a stationary distribution: the probability of being at a given node.

  37. View 4: Markov Process, aka 1st-order Markov process.
      ● Rich probabilistic theory. One finding:
        ○ The stationary distribution is unique if:
          ■ No “dead-ends”: nodes that can’t propagate their rank
          ■ No “spider traps”: sets of nodes with no way out
      Also known as being stochastic, irreducible, and aperiodic.
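
A small sketch of why the two conditions matter. Both two-page graphs below are hypothetical examples, not from the slides, and `iterate` is an illustrative helper that applies r <- M·r a fixed number of times.

```python
def iterate(M, steps):
    """Apply r <- M·r `steps` times, starting from the uniform vector."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r

# Hypothetical graphs over pages (A, B); M[i][j] = P(move to i | currently at j).
dead_end = [[0, 0],
            [1, 0]]     # A -> B, and B has no out-links: column B sums to 0
spider_trap = [[0, 0],
               [1, 1]]  # A -> B, B -> B: no way out of {B}

print(iterate(dead_end, 3))     # [0.0, 0.0]
print(iterate(spider_trap, 3))  # [0.0, 1.0]
```

With the dead end, rank leaks out of the system until nothing is left; with the spider trap, the trapped page absorbs all of the rank.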
