Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020) - - PDF document



slide-1
SLIDE 1

Data-Intensive Distributed Computing

Part 4: Analyzing Graphs (2/2)

CS 431/631 451/651 (Fall 2020) Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University)

1

slide-2
SLIDE 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining

2

slide-3
SLIDE 3
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

3

slide-4
SLIDE 4

Query: University of Waterloo

Two pages, uwaterloo.ca and fakeuw.ca, can both repeat the phrase “University of Waterloo” many times, so term-frequency scoring cannot tell the real page from the fake one.

Ranked retrieval fails!

4

slide-5
SLIDE 5

 Web contains many sources of information

Who to “trust”?

▪ Trick: Trustworthy pages may point to each other!

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

5

slide-6
SLIDE 6

 Not all web pages are equally “important”

www.joeschmoe.com vs. www.stanford.edu

 There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

6

slide-7
SLIDE 7
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

7

slide-8
SLIDE 8

 Idea: Links as votes

▪ Page is more important if it has more links

▪ In-coming links? Out-going links?

 Think of in-links as votes:

▪ www.stanford.edu has 23,400 in-links
▪ www.joeschmoe.com has 1 in-link

 Are all in-links equal?

▪ Links from important pages count more
▪ Recursive question!

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

8

slide-9
SLIDE 9

[Figure: example PageRank scores on a small graph: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and 1.6 for each remaining node]

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

9

slide-10
SLIDE 10

 Each link’s vote is proportional to the importance of its source page

 If page j with importance rj has n out-links, each link gets rj / n votes

 Page j’s own importance rj is the sum of the votes on its in-links

[Figure: node j has in-links from i (3 out-links) and k (4 out-links) and 3 out-links of its own, so rj = ri/3 + rk/4, and each of j’s out-links carries rj/3]

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

10

slide-11
SLIDE 11

 Define a “rank” rj for page j:

rj = Σi→j ri / ei

𝒆𝒊 … out-degree of node 𝒊

“Flow” equations (for the graph y→{y,a}, a→{y,m}, m→{a}):

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

11

slide-12
SLIDE 12

 3 equations, 3 unknowns, no constants

▪ No unique solution
▪ All solutions equivalent modulo the scale factor

 Additional constraint forces uniqueness:

▪ ry + ra + rm = 1
▪ Solution: ry = 2/5, ra = 2/5, rm = 1/5

 Gaussian elimination works for small examples, but we need a better method for large web-size graphs

 We need a new formulation!

Flow equations:

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

12
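The claimed unique solution is easy to check; a minimal Python sketch using exact fractions (the flow equations are the ones on this slide, with the usual normalization ry + ra + rm = 1):

```python
from fractions import Fraction as F

# Flow equations for the 3-page example (y, a, m):
#   ry = ry/2 + ra/2,  ra = ry/2 + rm,  rm = ra/2
# plus the normalization constraint ry + ra + rm = 1.
ry, ra, rm = F(2, 5), F(2, 5), F(1, 5)

assert ry == ry / 2 + ra / 2
assert ra == ry / 2 + rm
assert rm == ra / 2
assert ry + ra + rm == 1
```

Exact fractions avoid any floating-point doubt: these three values satisfy all four constraints, and any other solution of the flow equations is a scalar multiple of this one.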

slide-13
SLIDE 13

 Stochastic adjacency matrix 𝑵

▪ Let page 𝑗 have 𝑒𝑗 out-links
▪ If 𝑗 → 𝑘, then 𝑁𝑘𝑗 = 1/𝑒𝑗, else 𝑁𝑘𝑗 = 0
▪ 𝑵 is a column stochastic matrix: columns sum to 1

For the example graph (y→{y,a}, a→{y,m}, m→{a}):

      y    a    m
  y   ½    ½    0
  a   ½    0    1
  m   0    ½    0

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

13
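Building 𝑵 from out-link lists takes only a few lines; a sketch in plain Python for the y/a/m example (variable names are illustrative):

```python
# Out-links of the example graph: y -> {y, a}, a -> {y, m}, m -> {a}
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

# N[k][j] = 1/e_j if j -> k, else 0 (column j describes page j's out-links)
N = [[0.0] * n for _ in range(n)]
for j, links in out_links.items():
    for k in links:
        N[idx[k]][idx[j]] = 1.0 / len(links)

# Column stochastic: every column sums to 1
for j in range(n):
    assert abs(sum(N[i][j] for i in range(n)) - 1.0) < 1e-12
```

Note the column convention: entry N[k][j] is the probability of moving from page j to page k, which is what makes the column sums (not the row sums) equal 1.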

slide-14
SLIDE 14

 Power Iteration:

▪ Set 𝑠𝑘 = 1/N
▪ 1: 𝑠′𝑘 = Σ𝑗→𝑘 𝑠𝑗 / 𝑒𝑗
▪ 2: 𝑠 = 𝑠′
▪ Goto 1

 Example (iteration 0, 1, 2, …):

ry   1/3   1/3   5/12    9/24   …   6/15
ra = 1/3   3/6   1/3    11/24   …   6/15
rm   1/3   1/6   3/12    1/6    …   3/15

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

14
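The power iteration is a short loop; a Python sketch for the same example graph, which converges to 6/15, 6/15, 3/15:

```python
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

s = {p: 1.0 / len(out_links) for p in out_links}      # s_k = 1/N
for _ in range(200):
    s_new = {p: 0.0 for p in out_links}
    for j, links in out_links.items():
        for k in links:
            s_new[k] += s[j] / len(links)             # each link gets s_j / e_j
    s = s_new

assert abs(s["y"] - 6 / 15) < 1e-9
assert abs(s["a"] - 6 / 15) < 1e-9
assert abs(s["m"] - 3 / 15) < 1e-9
```

In practice one iterates until the change between s and s′ drops below a tolerance; a fixed 200 rounds is more than enough for this 3-node example.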

slide-15
SLIDE 15

(Same content as Slide 14.)

15

slide-16
SLIDE 16

 Imagine a random web surfer:

▪ At any time 𝒕, the surfer is on some page 𝒋
▪ At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒋 uniformly at random
▪ Ends up on some page 𝒌 linked from 𝒋
▪ Process repeats indefinitely

 Let:

▪ 𝒒(𝒕) … vector whose 𝒋th coordinate is the prob. that the surfer is at page 𝒋 at time 𝒕
▪ So, 𝒒(𝒕) is a probability distribution over pages

[Figure: page j with in-links from i1, i2, i3]

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

16

slide-17
SLIDE 17

 Where is the surfer at time 𝒕 + 𝟏?

▪ Follows a link uniformly at random

𝒒(𝒕 + 𝟏) = 𝑵 ⋅ 𝒒(𝒕)

 Suppose the random walk reaches a state where

𝒒(𝒕 + 𝟏) = 𝑵 ⋅ 𝒒(𝒕) = 𝒒(𝒕)

then 𝒒(𝒕) is a stationary distribution of the random walk

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

17
slide-18
SLIDE 18

 A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0

18

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
slide-19
SLIDE 19
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

19

slide-20
SLIDE 20

 Does this converge?  Does it converge to what we want?  Are results reasonable?

rj(t+1) = Σi→j ri(t) / ei

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

20

slide-21
SLIDE 21

 Example (two-node cycle: a → b and b → a):

ra   1   0   1   0   …
rb = 0   1   0   1   …

Iteration 0, 1, 2, … : the scores oscillate forever and never converge

rj(t+1) = Σi→j ri(t) / ei

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

21

slide-22
SLIDE 22

 Example (a → b, where b has no out-links):

ra   1   0   0   …
rb = 0   1   0   …

Iteration 0, 1, 2, … : the score reaches b and then leaks out entirely

rj(t+1) = Σi→j ri(t) / ei

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

22

slide-23
SLIDE 23

2 problems:

 (1) Some pages are

dead ends (have no out-links)

▪ Random walk has “nowhere” to go to
▪ Such pages cause importance to “leak out”

 (2) Spider traps:

(all out-links are within the group)

▪ Random walker gets “stuck” in a trap
▪ And eventually spider traps absorb all importance

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Dead end

23
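Both failure modes are easy to reproduce with the plain iteration; a sketch using two tiny hypothetical graphs:

```python
def iterate(out_links, steps):
    """Plain power iteration; returns the score vector after `steps` updates."""
    s = {p: 1.0 / len(out_links) for p in out_links}
    for _ in range(steps):
        s_new = {p: 0.0 for p in out_links}
        for j, links in out_links.items():
            for k in links:
                s_new[k] += s[j] / len(links)
        s = s_new
    return s

# (1) Dead end: b has no out-links, so importance "leaks out" to nothing
leak = iterate({"a": ["b"], "b": []}, steps=10)
assert sum(leak.values()) == 0.0

# (2) Spider trap: c only links to itself and absorbs all importance
trap = iterate({"a": ["a", "c"], "c": ["c"]}, steps=100)
assert trap["c"] > 0.999 and trap["a"] < 0.001
```

The dead end drives the total mass to zero within two steps, while the spider trap preserves the total mass but concentrates essentially all of it on the trap node.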

slide-24
SLIDE 24

 Power Iteration:

▪ Set 𝑠𝑘 = 1/N
▪ 𝑠′𝑘 = Σ𝑗→𝑘 𝑠𝑗 / 𝑒𝑗
▪ And iterate

 Example (graph y→{y,a}, a→{y,m}, m→{m}; iteration 0, 1, 2, …):

ry   1/3   2/6   3/12    5/24   …   0
ra = 1/3   1/6   2/12    3/24   …   0
rm   1/3   3/6   7/12   16/24   …   1

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm

      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    1

m is a spider trap: all the PageRank score gets “trapped” in node m.

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

24

slide-25
SLIDE 25

 The Google solution for spider traps: at each time step, the random surfer has two options

▪ With prob. 𝛾, follow a link at random
▪ With prob. 1−𝛾, jump to some random page
▪ Common values for 𝛾 are in the range 0.8 to 0.9

 Surfer will teleport out of a spider trap within a few time steps

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

25

slide-26
SLIDE 26

 Power Iteration:

▪ Set 𝑠𝑘 = 1/N
▪ 𝑠′𝑘 = Σ𝑗→𝑘 𝑠𝑗 / 𝑒𝑗
▪ And iterate

 Example (graph y→{y,a}, a→{y,m}, m has no out-links; iteration 0, 1, 2, …):

ry   1/3   2/6   3/12   5/24   …   0
ra = 1/3   1/6   2/12   3/24   …   0
rm   1/3   1/6   1/12   2/24   …   0

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2

      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    0

Here the PageRank “leaks” out since the matrix is not column stochastic.

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

26

slide-27
SLIDE 27

 Teleports: follow random teleport links with probability 1.0 from dead-ends

▪ Adjust matrix accordingly

Before (m is a dead end):

      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    0

After (column m teleports uniformly):

      y    a    m
  y   ½    ½    ⅓
  a   ½    0    ⅓
  m   0    ½    ⅓

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

slide-28
SLIDE 28

Why are dead-ends and spider traps a problem and why do teleports solve the problem?

 Spider-traps are not a problem, but with traps PageRank scores are not what we want

▪ Solution: never get stuck in a spider trap by teleporting out of it in a finite number of steps

 Dead-ends are a problem

▪ The matrix is not column stochastic, so our initial assumptions are not met
▪ Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

28

slide-29
SLIDE 29

 Google’s solution that does it all:

At each step, the random surfer has two options:

▪ With probability 𝛾, follow a link at random
▪ With probability 1−𝛾, jump to some random page

 PageRank equation [Brin-Page, 98]:

𝑠𝑘 = Σ𝑗→𝑘 𝛾 𝑠𝑗 / 𝑒𝑗 + (1 − 𝛾) · 1/𝑛

𝑒𝑗 … out-degree of node 𝑗; 𝑛 … number of pages

This formulation assumes that 𝑵 has no dead ends. We can either preprocess matrix 𝑵 to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

29
slide-30
SLIDE 30

Example with 𝛾 = 0.8 (graph y→{y,a}, a→{y,m}, m→{m}):

A = 0.8·M + 0.2·[1/N]N×N

          ½  ½  0          ⅓  ⅓  ⅓         y   7/15  7/15   1/15
A = 0.8 · ½  0  0  + 0.2 · ⅓  ⅓  ⅓    =    a   7/15  1/15   1/15
          0  ½  1          ⅓  ⅓  ⅓         m   1/15  7/15  13/15

Power iteration:

 y     1/3   0.33   0.24   0.26          7/33
 a  =  1/3   0.20   0.20   0.18    …  =  5/33
 m     1/3   0.46   0.52   0.56         21/33

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

30
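The whole computation fits in one short loop; a Python sketch with 𝛾 = 0.8 on the graph y→{y,a}, a→{y,m}, m→{m}, converging to 7/33, 5/33, 21/33:

```python
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
gamma = 0.8
n = len(out_links)

s = {p: 1.0 / n for p in out_links}
for _ in range(300):
    s_new = {p: (1 - gamma) / n for p in out_links}   # teleport share
    for j, links in out_links.items():
        for k in links:
            s_new[k] += gamma * s[j] / len(links)     # link-following share
    s = s_new

assert abs(s["y"] - 7 / 33) < 1e-9
assert abs(s["a"] - 5 / 33) < 1e-9
assert abs(s["m"] - 21 / 33) < 1e-9
```

Even though m is a spider trap, the teleport term keeps y and a at nonzero scores; this sparse per-edge formulation avoids ever materializing the dense matrix A.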

slide-31
SLIDE 31

31

PageRank MapReduce Implementation

slide-32
SLIDE 32

32

Simplified PageRank

First, tackle the simple case:

No random jump factor
No dangling (dead end) nodes

slide-33
SLIDE 33

PageRank in MapReduce

[Figure: a 5-node example graph with adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]; map emits each node’s structure plus PageRank contributions to its out-neighbors, and reduce regroups them by destination node]

33

slide-34
SLIDE 34

PageRank Pseudo-Code

class Mapper {
  def map(id: Long, n: Node) = {
    // Pass along the graph structure
    emit(id, n)
    // Divide this node's PageRank evenly among its out-links
    val p = n.PageRank / n.adjacencyList.length
    for (m <- n.adjacencyList) {
      emit(m, p)
    }
  }
}

class Reducer {
  def reduce(id: Long, objects: Iterable[Object]) = {
    var s = 0.0
    var n: Node = null
    for (p <- objects) {
      if (isNode(p)) n = p   // recover the node structure
      else s += p            // accumulate incoming PageRank mass
    }
    n.PageRank = s
    emit(id, n)
  }
}

34
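The pseudo-code can be simulated on a single machine to see the dataflow; a hedged Python sketch in which a grouping dict plays the role of Hadoop's shuffle, using the 5-node example graph from the earlier slide:

```python
from collections import defaultdict

# Simplified PageRank: no random jump factor, no dangling nodes
graph = {1: [2, 4], 2: [3, 5], 3: [4], 4: [5], 5: [1, 2, 3]}
rank = {node: 1.0 / len(graph) for node in graph}

def one_iteration(graph, rank):
    shuffled = defaultdict(list)
    # Map phase: emit the node structure, plus rank/|adj| to each neighbor
    for node, adj in graph.items():
        shuffled[node].append(("node", adj))
        for m in adj:
            shuffled[m].append(("mass", rank[node] / len(adj)))
    # Reduce phase: keep the structure, sum the incoming mass
    return {node: sum(v for tag, v in values if tag == "mass")
            for node, values in shuffled.items()}

rank = one_iteration(graph, rank)
assert abs(sum(rank.values()) - 1.0) < 1e-12   # no dead ends: mass conserved
```

The ("node", adj) records mirror the pseudo-code's emit(id, n): the graph structure must be passed through the shuffle so the next iteration still knows each node's out-links.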

slide-35
SLIDE 35

PageRank vs. BFS

            PageRank   BFS
  map       PR/N       d+1
  reduce    sum        min

A large class of graph algorithms involve:

Local computations at each node
Propagating results: “traversing” the graph

35

slide-36
SLIDE 36

Complete PageRank

Two additional complexities:

What is the proper treatment of dangling nodes?
How do we factor in the random jump factor?

Solution: second pass to redistribute “missing PageRank mass” and account for random jumps
One final optimization: fold into a single MR job

𝑠𝑘 = Σ𝑗→𝑘 𝛾 𝑠𝑗 / 𝑒𝑗 + (1 − 𝛾) · 1/𝑛

36
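One way the second pass can be sketched: if m is the PageRank mass that fell into dangling nodes during an iteration, every score can be adjusted as s′ = γ·(s + m/n) + (1 − γ)/n, which restores a proper distribution. The helper below is a hypothetical illustration of that adjustment, not the course's reference implementation:

```python
def redistribute(scores, missing_mass, gamma=0.85):
    """Fold lost dangling-node mass and the random jump back into each score."""
    n = len(scores)
    return {k: gamma * (s + missing_mass / n) + (1 - gamma) / n
            for k, s in scores.items()}

# Example: the main pass left 0.4 of the total mass on a dangling node
scores = {"a": 0.35, "b": 0.25}            # sums to 0.6, so 0.4 is missing
fixed = redistribute(scores, missing_mass=0.4)
assert abs(sum(fixed.values()) - 1.0) < 1e-12
```

Because the adjustment is the same affine function for every node, it needs only the global missing-mass total, which is why it can run as a cheap second map-only pass.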

slide-37
SLIDE 37

Implementation Practicalities

[Figure: iterated MapReduce jobs, map → reduce → HDFS → map → …, with a convergence check between iterations]

Optimization: fold into one MapReduce job

37

slide-38
SLIDE 38

PageRank Convergence

Alternative convergence criteria

Iterate until PageRank values don’t change
Iterate until PageRank rankings don’t change
Fixed number of iterations

38

slide-39
SLIDE 39

Log Probs

PageRank values are really small…
Product of probabilities = addition of log probs
Addition of probabilities? Solution?

39
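Multiplying probabilities turns into adding logs, but adding probabilities needs more care: the usual answer is the log-sum-exp trick. A minimal sketch:

```python
import math

def log_add(log_a, log_b):
    """log(exp(log_a) + exp(log_b)), computed stably (log-sum-exp)."""
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    return hi + math.log1p(math.exp(lo - hi))

# Add two tiny probabilities without ever leaving log space
la, lb = math.log(1e-300), math.log(3e-300)
total = log_add(la, lb)
assert abs(total - math.log(4e-300)) < 1e-9
```

Factoring out the larger term keeps exp() applied only to a non-positive argument, so nothing underflows even when the probabilities themselves are far too small to add directly.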

slide-40
SLIDE 40

Beyond PageRank

Variations of PageRank

Weighted edges
Personalized PageRank (A3/A4 ☺)

40

slide-41
SLIDE 41

[Figure: iterated MapReduce jobs, map → reduce → HDFS → map → …, with a convergence check between iterations]

Implementation Practicalities

41

slide-42
SLIDE 42

MapReduce Sucks

Java verbosity
Hadoop task startup time
Stragglers
Needless graph shuffling
Checkpointing at each iteration

42

slide-43
SLIDE 43

[Figure: chained MapReduce jobs, each iteration writing to and reading from HDFS]

Let’s Spark!

43

slide-44
SLIDE 44

[Figure: the same map → reduce chain with no HDFS writes between iterations]

44

slide-45
SLIDE 45

[Figure: each iteration carries both the adjacency lists and the PageRank mass through map → reduce, with no intermediate HDFS writes]

45

slide-46
SLIDE 46

[Figure: the reduce stage becomes a join of the adjacency lists with the PageRank mass, map → join, with no intermediate HDFS writes]

46

slide-47
SLIDE 47

[Figure: Spark dataflow, the adjacency lists joined with the PageRank vector each iteration via join → flatMap → reduceByKey; HDFS is touched only at the start and the end]

47

slide-48
SLIDE 48

[Figure: same dataflow as the previous slide]

Cache! Keep the adjacency lists in memory across iterations.

48

slide-49
SLIDE 49

[Bar chart: time per iteration (s) vs. number of machines (30 and 60). Hadoop: 171 s and 80 s; Spark: 72 s and 28 s]

Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-2-amp-camp-2012-standalone-programs.pdf

MapReduce vs. Spark

49