
http://cs246.stanford.edu Web pages are not equally important - PowerPoint PPT Presentation


  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  2. Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu
     - We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure
     (2/7/2012, Jure Leskovec, Stanford CS246: Mining Massive Datasets)

  3. We will cover the following link analysis approaches to computing the importance of nodes in a graph:
     - PageRank
     - Hubs and Authorities (HITS)
     - Topic-Specific (Personalized) PageRank
     - Web Spam Detection Algorithms

  4. Idea: links as votes
     - A page is more important if it has more links
     - In-coming links? Out-going links?
     - Think of in-links as votes:
       - www.stanford.edu has 23,400 in-links
       - www.joe-schmoe.com has 1 in-link
     - Are all in-links equal?
       - Links from important pages count more
       - Recursive question!

  5. Each link’s vote is proportional to the importance of its source page
     - If page $p$ with importance $x$ has $n$ out-links, each link gets $x/n$ votes
     - Page $p$’s own importance is the sum of the votes on its in-links

  6. The web in 1839 (figure: three pages y, a, m; y links to itself and to a, a links to y and m, m links to a)
     - A “vote” from an important page is worth more
     - A page is important if it is pointed to by other important pages
     - Define a “rank” $r_j$ for node $j$: $r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$
     - Flow equations:
       $r_y = r_y/2 + r_a/2$
       $r_a = r_y/2 + r_m$
       $r_m = r_a/2$

  7. Flow equations:
     $r_y = r_y/2 + r_a/2$
     $r_a = r_y/2 + r_m$
     $r_m = r_a/2$
     - 3 equations, 3 unknowns, no constants
       - No unique solution
       - All solutions are equivalent modulo a scale factor
     - An additional constraint forces uniqueness:
       - $r_y + r_a + r_m = 1$
       - $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
     - Gaussian elimination works for small examples, but we need a better method for large web-size graphs
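
The small system on this slide can be checked by hand or by machine. The following is a minimal sketch, not part of the original slides: it assumes the unknowns are ordered (r_y, r_a, r_m), drops one of the three redundant flow equations, adds the normalization constraint, and solves exactly with rational arithmetic. The helper name `solve_linear_system` is hypothetical.

```python
from fractions import Fraction as F

def solve_linear_system(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries.
    aug = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Unknowns ordered (r_y, r_a, r_m). Two flow equations rewritten as "... = 0",
# plus the normalization r_y + r_a + r_m = 1 (the third flow equation is redundant).
A = [
    [F(-1, 2), F(1, 2), 0],  # r_y/2 + r_a/2 - r_y = 0
    [F(1, 2), -1, 1],        # r_y/2 + r_m - r_a = 0
    [1, 1, 1],               # r_y + r_a + r_m = 1
]
b = [0, 0, 1]
r_y, r_a, r_m = solve_linear_system(A, b)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination costs on the order of n³ operations, which is why it does not scale to web-size graphs.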

  8. Stochastic adjacency matrix $M$
     - Let page $j$ have $d_j$ out-links
     - If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
     - $M$ is a column-stochastic matrix: its columns sum to 1
     - Rank vector $r$: a vector with one entry per page
       - $r_i$ is the importance score of page $i$
       - $\sum_i r_i = 1$
     - The flow equations can be written $r = M \cdot r$
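
As an illustration of this definition, here is a small sketch that builds $M$ from out-link lists. The function name `stochastic_matrix` and the dictionary layout are assumptions made for this example. The links used (y to y and a, a to y and m, m to a) are read off the flow equations of the three-page example.

```python
from fractions import Fraction

def stochastic_matrix(out_links, pages):
    """Return M as a list of rows, with M[i][j] = 1/d_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in out_links.items():
        d_j = len(targets)  # out-degree of page j
        for i in targets:
            M[index[i]][index[j]] = Fraction(1, d_j)
    return M

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(out_links, pages)

# Every column of a column-stochastic matrix sums to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
```

Storing the matrix densely is only reasonable for toy graphs. A real web graph would need a sparse representation.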

  9. Suppose page $j$ links to 3 pages, including $i$. Then $M_{ij} = 1/3$, so in $r = M \cdot r$ page $j$ contributes $r_j/3$ to $r_i$. (Figure: column $j$ of $M$ with entry $1/3$ in row $i$.)

  10. The flow equations can be written $r = M \cdot r$
      - So the rank vector is an eigenvector of the stochastic web matrix
      - In fact, it is its first or principal eigenvector, with corresponding eigenvalue 1

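  11. Example (rows and columns ordered y, a, m), $r = M \cdot r$:
      $$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
      Flow equations:
      $r_y = r_y/2 + r_a/2$
      $r_a = r_y/2 + r_m$
      $r_m = r_a/2$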

  12. Given a web graph with n nodes, where the nodes are pages and the edges are hyperlinks
      - Power iteration: a simple iterative scheme
      - Suppose there are $N$ web pages
      - Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
      - Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
      - Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
        - $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
        - Can use any other vector norm, e.g. Euclidean
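
The scheme on this slide can be sketched in a few lines. This is an illustrative version under stated assumptions: a dense list-of-rows matrix, floating-point arithmetic, and a hypothetical `max_iter` safety cap that is not part of the slide. The example matrix is the three-page y, a, m graph.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r(0) = [1/N, ..., 1/N]
    t = 0
    for t in range(1, max_iter + 1):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        l1_change = sum(abs(x - y) for x, y in zip(r_next, r))
        r = r_next
        if l1_change < eps:
            break
    return r, t

# Column-stochastic matrix of the y, a, m example (rows and columns ordered y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r, iterations = power_iteration(M)
print([round(x, 4) for x in r])  # [0.4, 0.4, 0.2]
```

Each step is a single matrix-vector product, so with a sparse matrix its cost is proportional to the number of links rather than to N².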

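  13. Power iteration on the y, a, m example:
      - Set $r_j = 1/N$
      - $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, equivalently $r_i = \sum_j M_{ij} \cdot r_j$
      - And iterate
      $$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \qquad \begin{aligned} r_y &= r_y/2 + r_a/2 \\ r_a &= r_y/2 + r_m \\ r_m &= r_a/2 \end{aligned}$$
      - Example (iteration 0, 1, 2, …):
      $$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}, \begin{bmatrix} 1/3 \\ 3/6 \\ 1/6 \end{bmatrix}, \begin{bmatrix} 5/12 \\ 1/3 \\ 3/12 \end{bmatrix}, \begin{bmatrix} 9/24 \\ 11/24 \\ 1/6 \end{bmatrix}, \ldots, \begin{bmatrix} 6/15 \\ 6/15 \\ 3/15 \end{bmatrix}$$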

  14. Imagine a random web surfer (figure: pages $i_1$, $i_2$, $i_3$ link to page $j$, with $r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$):
      - At any time $t$, the surfer is on some page $u$
      - At time $t+1$, the surfer follows an out-link from $u$ uniformly at random
      - Ends up on some page $v$ linked from $u$
      - Process repeats indefinitely
      - Let $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$
      - $p(t)$ is a probability distribution over pages

  15. Where is the surfer at time $t+1$? (figure: pages $i_1$, $i_2$, $i_3$ link to page $j$)
      - The surfer follows a link uniformly at random: $p(t+1) = M \cdot p(t)$
      - Suppose the random walk reaches a state where $p(t+1) = M \cdot p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk
      - Our rank vector $r$ satisfies $r = M \cdot r$
      - So it is a stationary distribution for the random walk
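
The random-surfer view can be checked empirically. The sketch below is an illustration rather than anything from the slides: it simulates one long walk on the y, a, m example and measures how often each page is visited. For a chain like this one, long-run visit frequencies approach the stationary distribution. The function name, step count and seed are arbitrary choices for this example.

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow uniformly random out-links for `steps` steps; return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])  # pick an out-link uniformly at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# Links of the y, a, m example, read off its flow equations.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, "y", 200_000)
print(freq)  # roughly y: 0.4, a: 0.4, m: 0.2
```

The frequencies only approximate the rank vector, and the error shrinks slowly with the number of steps, so this is a sanity check rather than a practical way to compute ranks.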

  16. $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r = M \cdot r$
      - Does this converge?
      - Does it converge to what we want?
      - Are results reasonable?

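  17. $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$ (figure: two pages a and b that link to each other)
      - Example (iteration 0, 1, 2, …):
      $$\begin{bmatrix} r_a \\ r_b \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \ldots$$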

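  18. $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$ (figure: page a links to page b, and b has no out-links)
      - Example (iteration 0, 1, 2, …):
      $$\begin{bmatrix} r_a \\ r_b \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \ldots$$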
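
The two-page examples on slides 17 and 18 can be reproduced directly. This is a small sketch, with the link structure inferred from the iterates shown: in the first graph a and b link to each other, and in the second a links to b while b has no out-links. The helper `iterate` is a hypothetical name.

```python
def iterate(M, r, steps):
    """Apply r <- M r `steps` times and return every iterate, starting with r itself."""
    n = len(M)
    history = [list(r)]
    for _ in range(steps):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        history.append(r)
    return history

# Rows and columns ordered (a, b); M[i][j] = 1/d_j if j links to i.
M_cycle = [[0, 1],
           [1, 0]]  # a -> b and b -> a
M_dead = [[0, 0],
          [1, 0]]   # a -> b, b is a dead end (all-zero column)

cycle_history = iterate(M_cycle, [1, 0], 3)
dead_history = iterate(M_dead, [1, 0], 3)
print(cycle_history)  # [[1, 0], [0, 1], [1, 0], [0, 1]]
print(dead_history)   # [[1, 0], [0, 1], [0, 0], [0, 0]]
```

In the first case the iterates oscillate forever and never converge. In the second the total importance drops to zero because the dead-end column does not sum to 1.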

  19. 2 problems:
      - Some pages are “dead ends” (have no out-links)
        - Such pages cause importance to “leak out”
      - Spider traps (all out-links are within the group)
        - Eventually spider traps absorb all importance

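  20. Power iteration with a spider trap (m links only to itself):
      - Set $r_j = 1/N$
      - $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
      - And iterate
      $$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} \qquad \begin{aligned} r_y &= r_y/2 + r_a/2 \\ r_a &= r_y/2 \\ r_m &= r_a/2 + r_m \end{aligned}$$
      - Example (iteration 0, 1, 2, …):
      $$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}, \begin{bmatrix} 2/6 \\ 1/6 \\ 3/6 \end{bmatrix}, \begin{bmatrix} 3/12 \\ 2/12 \\ 7/12 \end{bmatrix}, \begin{bmatrix} 5/24 \\ 3/24 \\ 16/24 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$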

  21. The Google solution for spider traps: at each time step, the random surfer has two options:
      - With probability $\beta$, follow a link at random
      - With probability $1 - \beta$, jump to some page uniformly at random
      - Common values for $\beta$ are in the range 0.8 to 0.9
      - The surfer will teleport out of a spider trap within a few time steps
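
One way to turn the two surfer options into an update rule is $r_j = \beta \sum_{i \to j} r_i / d_i + (1 - \beta)/N$. The sketch below applies that rule to the spider-trap graph in which m links only to itself. It assumes $\beta = 0.8$, taken from the stated range, and a graph without dead ends. The function name and stopping parameters are hypothetical.

```python
def pagerank_with_teleport(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration where each step follows a link w.p. beta and teleports w.p. 1 - beta.

    Assumes M is column-stochastic (no dead ends).
    """
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + (1.0 - beta) / n
            for i in range(n)
        ]
        change = sum(abs(x - y) for x, y in zip(r_next, r))
        r = r_next
        if change < eps:
            break
    return r

# Spider-trap example, rows and columns ordered (y, a, m); m links only to itself.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
r_trap = pagerank_with_teleport(M_trap)
print([round(x, 4) for x in r_trap])  # [0.2121, 0.1515, 0.6364], i.e. 7/33, 5/33, 21/33
```

With teleports the trap page m still ends up with the largest score, but it no longer absorbs all of the importance.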

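  22. Power iteration with a dead end (m has no out-links):
      - Set $r_j = 1/N$
      - $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
      - And iterate
      $$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{bmatrix} \qquad \begin{aligned} r_y &= r_y/2 + r_a/2 \\ r_a &= r_y/2 \\ r_m &= r_a/2 \end{aligned}$$
      - Example (iteration 0, 1, 2, …):
      $$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}, \begin{bmatrix} 2/6 \\ 1/6 \\ 1/6 \end{bmatrix}, \begin{bmatrix} 3/12 \\ 2/12 \\ 1/12 \end{bmatrix}, \begin{bmatrix} 5/24 \\ 3/24 \\ 2/24 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$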

  23. Teleports: follow random teleport links with probability 1.0 from dead ends
      - Adjust the matrix accordingly (rows and columns ordered y, a, m; m is the dead end):
      $$\begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{bmatrix}$$

  24. $r^{(t+1)} = M \cdot r^{(t)}$
      Markov chains:
      - Set of states $X$
      - Transition matrix $P$ where $P_{ij} = P(X_t = i \mid X_{t-1} = j)$
      - $\pi$ specifying the probability of being at each state $x \in X$
      - Goal is to find $\pi$ such that $\pi = P \pi$

  25. Theory of Markov chains
      - Fact: for any start vector, the power method applied to a Markov transition matrix $P$ will converge to a unique positive stationary vector as long as $P$ is stochastic, irreducible and aperiodic.

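  26. Stochastic: every column sums to 1
      - A possible solution: add green links
        $S = M + a^T \left( \frac{1}{n} \mathbf{1} \right)$
        - $a_i = 1$ if node $i$ has out-degree 0, $a_i = 0$ otherwise
        - $\mathbf{1}$ is the vector of all 1s
      - Example (rows and columns ordered y, a, m; m is the dead end), where every entry of the dead-end column becomes $1/n = 1/3$:
      $$S = \begin{bmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{bmatrix} \qquad \begin{aligned} r_y &= r_y/2 + r_a/2 + r_m/3 \\ r_a &= r_y/2 + r_m/3 \\ r_m &= r_a/2 + r_m/3 \end{aligned}$$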
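
The dead-end adjustment amounts to replacing every all-zero column of $M$ with a column of $1/n$ entries. The sketch below does exactly that on a dense matrix with exact fractions. The function name `fix_dead_ends` is hypothetical, and the input is the y, a, m example in which m has no out-links.

```python
from fractions import Fraction

def fix_dead_ends(M):
    """Return a copy of M in which every all-zero column is replaced by 1/n in each row."""
    n = len(M)
    S = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):  # page j has out-degree 0
            for i in range(n):
                S[i][j] = Fraction(1, n)
    return S

half = Fraction(1, 2)
# Rows and columns ordered (y, a, m); the m column is all zeros (dead end).
M_dead = [
    [half, half, 0],
    [half, 0, 0],
    [0, half, 0],
]
S = fix_dead_ends(M_dead)

# After the fix every column sums to 1, so S is column-stochastic.
S_column_sums = [sum(S[i][j] for i in range(3)) for j in range(3)]
```

On a large graph one would avoid materializing the dense columns and instead redistribute the total rank held by dead ends evenly across all pages at each iteration, which gives the same result.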

  27. A chain is periodic if there exists $k > 1$ such that the interval between two visits to some state $s$ is always a multiple of $k$.
      - A possible solution: add green links

  28. Irreducible: from any state, there is a non-zero probability of reaching any other state
      - A possible solution: add green links
