  1. Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020) Part 4: Analyzing Graphs (2/2) Ali Abedi Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University) These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/ 1

  2. Structure of the Course: Analyzing Text, Analyzing Graphs, Analyzing Relational Data, and Data Mining, all built on top of "Core" framework features and algorithm design. 2

  3. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3

  4. Query: "University of Waterloo". Both fakeuw.ca and uwaterloo.ca can contain the phrase "University of waterloo" over and over, so ranking by term matching alone cannot tell them apart: ranked retrieval fails! 4

  5.  Web contains many sources of information Who to “trust”? ▪ Trick: Trustworthy pages may point to each other! J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5

  6.  All web pages are not equally “important” www.joeschmoe.com vs. www.stanford.edu  There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure! J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6

  7. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7

  8.  Idea: Links as votes ▪ Page is more important if it has more links ▪ In-coming links? Out-going links?  Think of in-links as votes: ▪ www.stanford.edu has 23,400 in-links ▪ www.joeschmoe.com has 1 in-link  Are all in-links equal? ▪ Links from important pages count more ▪ Recursive question! J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
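
To make the "links as votes" idea concrete, here is a minimal sketch (not from the slides) that counts in-links over a toy edge list; the page names and counts are made up for illustration.

    from collections import Counter

    # Hypothetical toy edge list: (source, destination) pairs.
    edges = [
        ("page1", "www.stanford.edu"),
        ("page2", "www.stanford.edu"),
        ("page3", "www.stanford.edu"),
        ("page4", "www.joeschmoe.com"),
    ]

    # One "vote" per in-link: count how many edges point at each page.
    in_link_votes = Counter(dst for _, dst in edges)
    print(in_link_votes.most_common())
    # -> [('www.stanford.edu', 3), ('www.joeschmoe.com', 1)]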

  9. [Figure: example web graph with PageRank scores. A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; the remaining five nodes score 1.6 each.] J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9

  10. Each link's vote is proportional to the importance of its source page ▪ If page j with importance r_j has n out-links, each link gets r_j / n votes ▪ Page j's own importance is the sum of the votes on its in-links. In the figure, pages i (with 3 out-links) and k (with 4 out-links) both link to j, so r_j = r_i/3 + r_k/4, and j in turn passes r_j/3 along each of its 3 out-links. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10

  11. Define a "rank" r_j for page j: r_j = Σ_{i→j} r_i / d_i, where d_i is the out-degree of node i. Example graph: y links to y and a, a links to y and m, m links to a, so each out-link of y and a carries half of its source's rank, while m's single link carries all of r_m. "Flow" equations: r_y = r_y/2 + r_a/2; r_a = r_y/2 + r_m; r_m = r_a/2. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11

  12. Flow equations: r_y = r_y/2 + r_a/2; r_a = r_y/2 + r_m; r_m = r_a/2. Three equations, three unknowns, no constants ▪ No unique solution ▪ All solutions are equivalent up to a scale factor. An additional constraint forces uniqueness: ▪ r_y + r_a + r_m = 1 ▪ Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5. Gaussian elimination works for small examples like this, but we need a better method for large, web-scale graphs. We need a new formulation! J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
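
For a graph this small, the Gaussian-elimination route the slide mentions can be sketched directly: drop one of the redundant flow equations, add the normalization constraint r_y + r_a + r_m = 1, and solve the resulting linear system. A minimal sketch with NumPy (not part of the original slides):

    import numpy as np

    # Flow equations for the y/a/m example, with one (redundant) equation
    # replaced by the normalization constraint r_y + r_a + r_m = 1:
    #   r_a = r_y/2 + r_m   ->  -0.5*r_y + 1.0*r_a - 1.0*r_m = 0
    #   r_m = r_a/2         ->   0.0*r_y - 0.5*r_a + 1.0*r_m = 0
    #   r_y + r_a + r_m = 1
    A = np.array([[-0.5,  1.0, -1.0],
                  [ 0.0, -0.5,  1.0],
                  [ 1.0,  1.0,  1.0]])
    b = np.array([0.0, 0.0, 1.0])

    r_y, r_a, r_m = np.linalg.solve(A, b)
    print(r_y, r_a, r_m)   # -> 0.4 0.4 0.2, i.e. 2/5, 2/5, 1/5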

  13. Stochastic adjacency matrix M ▪ Let page i have d_i out-links ▪ If i → j, then M_ji = 1/d_i, else M_ji = 0 ▪ M is a column stochastic matrix: columns sum to 1. For the y/a/m example (y → y, a; a → y, m; m → a), with columns y, a, m:
        y  [ ½  ½  0 ]
        a  [ ½  0  1 ]
        m  [ 0  ½  0 ]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
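
A small sketch (assumed, not from the slides) of how the column-stochastic matrix M can be built from the out-link lists of the y/a/m example; M[j, i] = 1/d_i whenever i links to j:

    import numpy as np

    # Out-link structure of the y/a/m example: node -> list of destinations.
    out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
    nodes = ["y", "a", "m"]
    idx = {n: i for i, n in enumerate(nodes)}

    # Column-stochastic adjacency matrix: M[j, i] = 1/d_i if i -> j, else 0.
    M = np.zeros((len(nodes), len(nodes)))
    for i, dests in out_links.items():
        for j in dests:
            M[idx[j], idx[i]] = 1.0 / len(dests)

    print(M)              # matches the matrix on the slide
    print(M.sum(axis=0))  # [1. 1. 1.]: columns sum to 1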

  14. Power Iteration on the y/a/m example, with column-stochastic matrix M (columns y, a, m):
        y  [ ½  ½  0 ]
        a  [ ½  0  1 ]
        m  [ 0  ½  0 ]
▪ Set r_j = 1/N ▪ 1: r'_j = Σ_{i→j} r_i / d_i ▪ 2: r = r' ▪ Goto 1. Flow equations: r_y = r_y/2 + r_a/2; r_a = r_y/2 + r_m; r_m = r_a/2. Example (iterations 0, 1, 2, ...):
        r_y   1/3   1/3   5/12    9/24   ...   6/15
        r_a   1/3   3/6   1/3    11/24   ...   6/15
        r_m   1/3   1/6   3/12    1/6    ...   3/15
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14
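
A minimal power-iteration sketch for the y/a/m example, assuming NumPy and a simple L1 stopping criterion (both are choices of this sketch, not prescribed by the slides):

    import numpy as np

    # Column-stochastic matrix for the y/a/m example (columns: y, a, m).
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])

    N = M.shape[0]
    r = np.full(N, 1.0 / N)          # r_j = 1/N

    for _ in range(100):             # iterate r <- M r until it stops changing
        r_next = M @ r
        done = np.abs(r_next - r).sum() < 1e-10
        r = r_next
        if done:
            break

    print(r)   # -> [0.4, 0.4, 0.2]

The first iterates reproduce the table above (1/3, 1/3, 5/12, 9/24, ... for r_y), and the limit agrees with the flow-equation solution r_y = r_a = 2/5, r_m = 1/5.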

  16. Imagine a random web surfer: ▪ At any time t, the surfer is on some page i ▪ At time t+1, the surfer follows an out-link from i uniformly at random ▪ It ends up on some page j linked from i ▪ The process repeats indefinitely. This random walk gives the same rank equation: r_j = Σ_{i→j} r_i / d_out(i). Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t; p(t) is then a probability distribution over pages. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16

  17. Where is the surfer at time t+1? It follows a link uniformly at random, so p(t+1) = M · p(t). Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
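
Because the stationary distribution satisfies p = M · p, it is an eigenvector of M with eigenvalue 1. A short cross-check of the power-iteration result via eigendecomposition (a sketch, assuming NumPy and the same y/a/m matrix):

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])

    # Eigenvector of M for the eigenvalue closest to 1, normalized to sum to 1.
    vals, vecs = np.linalg.eig(M)
    p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    p = p / p.sum()

    print(p)                      # -> [0.4, 0.4, 0.2]
    print(np.allclose(M @ p, p))  # True: p is stationary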

  18. A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 18

  19. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19

  20. The power-iteration update, written out per step: r_j(t+1) = Σ_{i→j} r_i(t) / d_i. Does this converge? Does it converge to what we want? Are results reasonable? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20

  21. r_j(t+1) = Σ_{i→j} r_i(t) / d_i. Example: two pages a and b that link only to each other (a → b and b → a). Iterations 0, 1, 2, ...:
        r_a   1   0   1   0   ...
        r_b   0   1   0   1   ...
The scores oscillate forever; the iteration does not converge. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21

  22. r_j(t+1) = Σ_{i→j} r_i(t) / d_i. Example: page a links to page b, and b is a dead end with no out-links. Iterations 0, 1, 2, ...:
        r_a   1   0   0   0   ...
        r_b   0   1   0   0   ...
The iteration converges, but to the all-zero vector: the score "leaks out". J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22
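
A compact sketch (assumptions: NumPy, and the same starting vector (1, 0) used in the tables above) that reproduces both failure modes: the a↔b cycle of slide 21 never settles, and the a→b dead end of slide 22 converges to the all-zero vector.

    import numpy as np

    r0 = np.array([1.0, 0.0])            # starting point used on the slides

    # Periodic case: a <-> b. The iterates flip back and forth forever.
    M_cycle = np.array([[0.0, 1.0],
                        [1.0, 0.0]])
    # Dead-end case: a -> b, b has no out-links (zero column). Score leaks away.
    M_dead = np.array([[0.0, 0.0],
                       [1.0, 0.0]])

    for name, M in [("cycle", M_cycle), ("dead end", M_dead)]:
        r = r0.copy()
        history = [list(r)]
        for _ in range(4):
            r = M @ r
            history.append(list(r))
        print(name, history)
    # cycle:    (1,0) (0,1) (1,0) (0,1) (1,0)   never converges
    # dead end: (1,0) (0,1) (0,0) (0,0) (0,0)   converges, but to zero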

  23. Dead end 2 problems:  (1) Some pages are dead ends (have no out-links) ▪ Random walk has “nowhere” to go to ▪ Such pages cause importance to “leak out”  (2) Spider traps: (all out-links are within the group) ▪ Random walker gets “stuck” in a trap ▪ And eventually spider traps absorb all importance J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23

  24. Power Iteration on a modified y/a/m graph in which m links only to itself, so m is a spider trap. Column-stochastic matrix (columns y, a, m):
        y  [ ½  ½  0 ]
        a  [ ½  0  0 ]
        m  [ 0  ½  1 ]
Flow equations: r_y = r_y/2 + r_a/2; r_a = r_y/2; r_m = r_a/2 + r_m. Set r_j = 1/N, update r'_j = Σ_{i→j} r_i / d_i, and iterate. Example (iterations 0, 1, 2, ...):
        r_y   1/3   2/6   3/12    5/24   ...   0
        r_a   1/3   1/6   2/12    3/24   ...   0
        r_m   1/3   3/6   7/12   16/24   ...   1
All the PageRank score gets "trapped" in node m. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24
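
A short check of the spider-trap behaviour under plain power iteration, assuming the trap matrix above (a sketch, not from the slides):

    import numpy as np

    # y/a/m with m as a spider trap (columns: y, a, m).
    M_trap = np.array([[0.5, 0.5, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.5, 1.0]])

    r = np.full(3, 1.0 / 3)
    for _ in range(100):
        r = M_trap @ r
    print(r.round(3))   # -> [0. 0. 1.]: all score ends up in m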

  25. The Google solution for spider traps: at each time step, the random surfer has two options ▪ With probability β, follow a link at random ▪ With probability 1-β, jump to some random page ▪ Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 25
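
With teleportation the update becomes r' = β·M·r + (1-β)/N, i.e. follow the graph with probability β and jump to a uniformly random page otherwise. A sketch on the spider-trap example, assuming β = 0.8 (one of the common values quoted above):

    import numpy as np

    beta = 0.8                                   # common choice: 0.8 to 0.9

    # y/a/m with m as a spider trap (columns: y, a, m).
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])
    N = M.shape[0]

    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r = beta * (M @ r) + (1.0 - beta) / N    # follow link / teleport
    print(r.round(3))   # roughly [0.21, 0.15, 0.64]

m still gets the largest score, but it no longer absorbs everything.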

  26. Power Iteration on a modified y/a/m graph in which m is a dead end (no out-links). The matrix (columns y, a, m) is no longer column stochastic, since column m is all zeros:
        y  [ ½  ½  0 ]
        a  [ ½  0  0 ]
        m  [ 0  ½  0 ]
Flow equations: r_y = r_y/2 + r_a/2; r_a = r_y/2; r_m = r_a/2. Set r_j = 1/N, update r'_j = Σ_{i→j} r_i / d_i, and iterate. Example (iterations 0, 1, 2, ...):
        r_y   1/3   2/6   3/12   5/24   ...   0
        r_a   1/3   1/6   2/12   3/24   ...   0
        r_m   1/3   1/6   1/12   2/24   ...   0
Here the PageRank "leaks out" since the matrix is not column stochastic. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26

  27. Teleports: from a dead end, follow a random teleport link with probability 1.0 ▪ Adjust the matrix accordingly. For the y/a/m example where m is a dead end, the all-zero column m is replaced by ⅓ in every entry:
        before (columns y, a, m)          after
        y  [ ½  ½  0 ]                    y  [ ½  ½  ⅓ ]
        a  [ ½  0  0 ]                    a  [ ½  0  ⅓ ]
        m  [ 0  ½  0 ]                    m  [ 0  ½  ⅓ ]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
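
A sketch of the dead-end adjustment shown above: detect all-zero columns and overwrite them with 1/N, which encodes "always teleport from a dead end" (NumPy assumed; not part of the original slides):

    import numpy as np

    # y/a/m where m is a dead end (columns: y, a, m); column m is all zeros.
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
    N = M.shape[0]

    dead_ends = (M.sum(axis=0) == 0)         # columns that sum to 0
    M_fixed = M.copy()
    M_fixed[:, dead_ends] = 1.0 / N          # teleport uniformly from dead ends

    print(M_fixed)              # column m becomes [1/3, 1/3, 1/3]
    print(M_fixed.sum(axis=0))  # [1. 1. 1.]: now column stochastic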

  28. Why are dead-ends and spider traps a problem and why do teleports solve the problem?  Spider-traps are not a problem, but with traps PageRank scores are not what we want ▪ Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps  Dead-ends are a problem ▪ The matrix is not column stochastic, so our initial assumptions are not met ▪ Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
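
Putting the pieces together, a compact end-to-end sketch: build the column-stochastic matrix from an edge list, patch dead-end columns, and iterate with teleportation. The function name, edge-list format, and the β = 0.85 default are choices of this sketch, not something prescribed by the slides.

    import numpy as np

    def pagerank(edges, nodes, beta=0.85, iters=100):
        """Power iteration with teleportation over a directed edge list."""
        idx = {n: i for i, n in enumerate(nodes)}
        N = len(nodes)

        # Column-stochastic matrix: M[j, i] = 1/d_i for each edge i -> j.
        out_deg = {n: 0 for n in nodes}
        for src, _ in edges:
            out_deg[src] += 1
        M = np.zeros((N, N))
        for src, dst in edges:
            M[idx[dst], idx[src]] = 1.0 / out_deg[src]

        # Dead ends: always teleport (make the zero columns stochastic).
        M[:, M.sum(axis=0) == 0] = 1.0 / N

        # Power iteration with random teleports.
        r = np.full(N, 1.0 / N)
        for _ in range(iters):
            r = beta * (M @ r) + (1.0 - beta) / N
        return dict(zip(nodes, r))

    # The y/a/m graph of slide 26 (m is a dead end).
    print(pagerank([("y", "y"), ("y", "a"), ("a", "y"), ("a", "m")], ["y", "a", "m"]))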
