Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020) - - PDF document



slide-1
SLIDE 1

Data-Intensive Distributed Computing

Part 4: Analyzing Graphs (2/2)

CS 431/631 451/651 (Fall 2020) Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University)

1

slide-2
SLIDE 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining

2

slide-3
SLIDE 3
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

3

slide-4
SLIDE 4

Query: University of Waterloo

Two pages, uwaterloo.ca and fakeuw.ca, can both repeat the phrase “University of Waterloo” many times, so term-frequency scoring cannot tell the real page from the fake one.

Ranked retrieval fails!

4

slide-5
SLIDE 5

 Web contains many sources of information

Who to “trust”?

▪ Trick: Trustworthy pages may point to each other!

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

5

slide-6
SLIDE 6

 Not all web pages are equally “important”

www.joeschmoe.com vs. www.stanford.edu

 There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

6

slide-7
SLIDE 7
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

7

slide-8
SLIDE 8

 Idea: Links as votes

▪ Page is more important if it has more links

▪ In-coming links? Out-going links?

 Think of in-links as votes:

▪ www.stanford.edu has 23,400 in-links
▪ www.joeschmoe.com has 1 in-link

 Are all in-links equal?

▪ Links from important pages count more
▪ Recursive question!

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

8

slide-9
SLIDE 9

[Figure: example PageRank scores on a small graph: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and 1.6 for each remaining node]

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

9

slide-10
SLIDE 10

 Each link’s vote is proportional to the importance of its source page

 If page j with importance rj has n out-links, each link gets rj / n votes

 Page j’s own importance rj is the sum of the votes on its in-links

[Figure: node j has in-links from i (3 out-links) and k (4 out-links) and 3 out-links of its own, so rj = ri/3 + rk/4, and each of j’s out-links carries rj/3]

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

10

slide-11
SLIDE 11

 Define a “rank” rj for page j:

rj = Σi→j ri / ei

𝒆𝒊 … out-degree of node 𝒊

“Flow” equations (for the graph y→{y,a}, a→{y,m}, m→{a}):

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

11

slide-12
SLIDE 12

 3 equations, 3 unknowns, no constants

▪ No unique solution
▪ All solutions equivalent modulo the scale factor

 Additional constraint forces uniqueness:

▪ ry + ra + rm = 1
▪ Solution: ry = 2/5, ra = 2/5, rm = 1/5

 Gaussian elimination works for small examples, but we need a better method for large web-size graphs

 We need a new formulation!

Flow equations:

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

12
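The claimed unique solution is easy to check; a minimal Python sketch using exact fractions (the flow equations are the ones on this slide, with the usual normalization ry + ra + rm = 1):

```python
from fractions import Fraction as F

# Flow equations for the 3-page example (y, a, m):
#   ry = ry/2 + ra/2,  ra = ry/2 + rm,  rm = ra/2
# plus the normalization constraint ry + ra + rm = 1.
ry, ra, rm = F(2, 5), F(2, 5), F(1, 5)

assert ry == ry / 2 + ra / 2
assert ra == ry / 2 + rm
assert rm == ra / 2
assert ry + ra + rm == 1
```

Exact fractions avoid any floating-point doubt: these three values satisfy all four constraints, and any other solution of the flow equations is a scalar multiple of this one.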

slide-13
SLIDE 13

 Stochastic adjacency matrix 𝑵

▪ Let page 𝑗 have 𝑒𝑗 out-links
▪ If 𝑗 → 𝑘, then 𝑁𝑘𝑗 = 1/𝑒𝑗, else 𝑁𝑘𝑗 = 0
▪ 𝑵 is a column stochastic matrix: columns sum to 1

For the example graph (y→{y,a}, a→{y,m}, m→{a}):

      y    a    m
  y   ½    ½    0
  a   ½    0    1
  m   0    ½    0

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

13
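Building 𝑵 from out-link lists takes only a few lines; a sketch in plain Python for the y/a/m example (variable names are illustrative):

```python
# Out-links of the example graph: y -> {y, a}, a -> {y, m}, m -> {a}
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

# N[k][j] = 1/e_j if j -> k, else 0 (column j describes page j's out-links)
N = [[0.0] * n for _ in range(n)]
for j, links in out_links.items():
    for k in links:
        N[idx[k]][idx[j]] = 1.0 / len(links)

# Column stochastic: every column sums to 1
for j in range(n):
    assert abs(sum(N[i][j] for i in range(n)) - 1.0) < 1e-12
```

Note the column convention: entry N[k][j] is the probability of moving from page j to page k, which is what makes the column sums (not the row sums) equal 1.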

slide-14
SLIDE 14

 Power Iteration:

▪ Set 𝑠𝑘 = 1/N
▪ 1: 𝑠′𝑘 = Σ𝑗→𝑘 𝑠𝑗 / 𝑒𝑗
▪ 2: 𝑠 = 𝑠′
▪ Goto 1

 Example (iteration 0, 1, 2, …):

ry   1/3   1/3   5/12    9/24   …   6/15
ra = 1/3   3/6   1/3    11/24   …   6/15
rm   1/3   1/6   3/12    1/6    …   3/15

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

14
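The power iteration is a short loop; a Python sketch for the same example graph, which converges to 6/15, 6/15, 3/15:

```python
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

s = {p: 1.0 / len(out_links) for p in out_links}      # s_k = 1/N
for _ in range(200):
    s_new = {p: 0.0 for p in out_links}
    for j, links in out_links.items():
        for k in links:
            s_new[k] += s[j] / len(links)             # each link gets s_j / e_j
    s = s_new

assert abs(s["y"] - 6 / 15) < 1e-9
assert abs(s["a"] - 6 / 15) < 1e-9
assert abs(s["m"] - 3 / 15) < 1e-9
```

In practice one iterates until the change between s and s′ drops below a tolerance; a fixed 200 rounds is more than enough for this 3-node example.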

slide-15
SLIDE 15

(Same content as Slide 14.)

15

slide-16
SLIDE 16

 Imagine a random web surfer:

▪ At any time 𝒕, the surfer is on some page 𝒋
▪ At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒋 uniformly at random
▪ Ends up on some page 𝒌 linked from 𝒋
▪ Process repeats indefinitely

 Let:

▪ 𝒒(𝒕) … vector whose 𝒋th coordinate is the prob. that the surfer is at page 𝒋 at time 𝒕
▪ So, 𝒒(𝒕) is a probability distribution over pages

[Figure: page j with in-links from i1, i2, i3]

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

16

slide-17
SLIDE 17

 Where is the surfer at time 𝒕 + 𝟏?

▪ Follows a link uniformly at random

𝒒(𝒕 + 𝟏) = 𝑵 ⋅ 𝒒(𝒕)

 Suppose the random walk reaches a state where

𝒒(𝒕 + 𝟏) = 𝑵 ⋅ 𝒒(𝒕) = 𝒒(𝒕)

then 𝒒(𝒕) is a stationary distribution of the random walk

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

17
slide-18
SLIDE 18

 A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0

18

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
slide-19
SLIDE 19
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

19

slide-20
SLIDE 20

 Does this converge?  Does it converge to what we want?  Are results reasonable?

rj(t+1) = Σi→j ri(t) / ei

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

20

slide-21
SLIDE 21

 Example (two-node cycle: a → b and b → a):

ra   1   0   1   0   …
rb = 0   1   0   1   …

Iteration 0, 1, 2, … : the scores oscillate forever and never converge

rj(t+1) = Σi→j ri(t) / ei

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

21

slide-22
SLIDE 22

 Example (a → b, where b has no out-links):

ra   1   0   0   …
rb = 0   1   0   …

Iteration 0, 1, 2, … : the score reaches b and then leaks out entirely

rj(t+1) = Σi→j ri(t) / ei

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

22

slide-23
SLIDE 23

2 problems:

 (1) Some pages are

dead ends (have no out-links)

▪ Random walk has “nowhere” to go to
▪ Such pages cause importance to “leak out”

 (2) Spider traps:

(all out-links are within the group)

▪ Random walker gets “stuck” in a trap
▪ And eventually spider traps absorb all importance

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Dead end

23
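Both failure modes are easy to reproduce with the plain iteration; a sketch using two tiny hypothetical graphs:

```python
def iterate(out_links, steps):
    """Plain power iteration; returns the score vector after `steps` updates."""
    s = {p: 1.0 / len(out_links) for p in out_links}
    for _ in range(steps):
        s_new = {p: 0.0 for p in out_links}
        for j, links in out_links.items():
            for k in links:
                s_new[k] += s[j] / len(links)
        s = s_new
    return s

# (1) Dead end: b has no out-links, so importance "leaks out" to nothing
leak = iterate({"a": ["b"], "b": []}, steps=10)
assert sum(leak.values()) == 0.0

# (2) Spider trap: c only links to itself and absorbs all importance
trap = iterate({"a": ["a", "c"], "c": ["c"]}, steps=100)
assert trap["c"] > 0.999 and trap["a"] < 0.001
```

The dead end drives the total mass to zero within two steps, while the spider trap preserves the total mass but concentrates essentially all of it on the trap node.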

slide-24
SLIDE 24

 Power Iteration:

▪ Set 𝑠𝑘 = 1/N
▪ 𝑠′𝑘 = Σ𝑗→𝑘 𝑠𝑗 / 𝑒𝑗
▪ And iterate

 Example (graph y→{y,a}, a→{y,m}, m→{m}; iteration 0, 1, 2, …):

ry   1/3   2/6   3/12    5/24   …   0
ra = 1/3   1/6   2/12    3/24   …   0
rm   1/3   3/6   7/12   16/24   …   1

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm

      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    1

m is a spider trap: all the PageRank score gets “trapped” in node m.

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

24

slide-25
SLIDE 25

 The Google solution for spider traps: at each time step, the random surfer has two options

▪ With prob. 𝛾, follow a link at random
▪ With prob. 1−𝛾, jump to some random page
▪ Common values for 𝛾 are in the range 0.8 to 0.9

 Surfer will teleport out of a spider trap within a few time steps

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

25

slide-26
SLIDE 26

 Power Iteration:

▪ Set 𝑠𝑘 = 1/N
▪ 𝑠′𝑘 = Σ𝑗→𝑘 𝑠𝑗 / 𝑒𝑗
▪ And iterate

 Example (graph y→{y,a}, a→{y,m}, m has no out-links; iteration 0, 1, 2, …):

ry   1/3   2/6   3/12   5/24   …   0
ra = 1/3   1/6   2/12   3/24   …   0
rm   1/3   1/6   1/12   2/24   …   0

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2

      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    0

Here the PageRank “leaks” out since the matrix is not column stochastic.

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

26

slide-27
SLIDE 27

 Teleports: follow random teleport links with probability 1.0 from dead-ends

▪ Adjust matrix accordingly

Before (m is a dead end):

      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    0

After (column m teleports uniformly):

      y    a    m
  y   ½    ½    ⅓
  a   ½    0    ⅓
  m   0    ½    ⅓

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

slide-28
SLIDE 28

Why are dead-ends and spider traps a problem and why do teleports solve the problem?

 Spider-traps are not a problem, but with traps PageRank scores are not what we want

▪ Solution: never get stuck in a spider trap by teleporting out of it in a finite number of steps

 Dead-ends are a problem

▪ The matrix is not column stochastic, so our initial assumptions are not met
▪ Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

28

slide-29
SLIDE 29

 Google’s solution that does it all:

At each step, the random surfer has two options:

▪ With probability 𝛾, follow a link at random
▪ With probability 1−𝛾, jump to some random page

 PageRank equation [Brin-Page, 98]:

𝑠𝑘 = Σ𝑗→𝑘 𝛾 𝑠𝑗 / 𝑒𝑗 + (1 − 𝛾) · 1/𝑛

𝑒𝑗 … out-degree of node 𝑗; 𝑛 … number of pages

This formulation assumes that 𝑵 has no dead ends. We can either preprocess matrix 𝑵 to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

29
slide-30
SLIDE 30

Example with 𝛾 = 0.8 (graph y→{y,a}, a→{y,m}, m→{m}):

A = 0.8·M + 0.2·[1/N]N×N

          ½  ½  0          ⅓  ⅓  ⅓         y   7/15  7/15   1/15
A = 0.8 · ½  0  0  + 0.2 · ⅓  ⅓  ⅓    =    a   7/15  1/15   1/15
          0  ½  1          ⅓  ⅓  ⅓         m   1/15  7/15  13/15

Power iteration:

 y     1/3   0.33   0.24   0.26          7/33
 a  =  1/3   0.20   0.20   0.18    …  =  5/33
 m     1/3   0.46   0.52   0.56         21/33

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

30
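The whole computation fits in one short loop; a Python sketch with 𝛾 = 0.8 on the graph y→{y,a}, a→{y,m}, m→{m}, converging to 7/33, 5/33, 21/33:

```python
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
gamma = 0.8
n = len(out_links)

s = {p: 1.0 / n for p in out_links}
for _ in range(300):
    s_new = {p: (1 - gamma) / n for p in out_links}   # teleport share
    for j, links in out_links.items():
        for k in links:
            s_new[k] += gamma * s[j] / len(links)     # link-following share
    s = s_new

assert abs(s["y"] - 7 / 33) < 1e-9
assert abs(s["a"] - 5 / 33) < 1e-9
assert abs(s["m"] - 21 / 33) < 1e-9
```

Even though m is a spider trap, the teleport term keeps y and a at nonzero scores; this sparse per-edge formulation avoids ever materializing the dense matrix A.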

slide-31
SLIDE 31

31

PageRank MapReduce Implementation

slide-32
SLIDE 32

32

Simplified PageRank

First, tackle the simple case:

No random jump factor
No dangling (dead end) nodes

slide-33
SLIDE 33

PageRank in MapReduce

[Figure: a 5-node example graph with adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]; map emits each node’s structure plus PageRank contributions to its out-neighbors, and reduce regroups them by destination node]

33

slide-34
SLIDE 34

PageRank Pseudo-Code

class Mapper {
  def map(id: Long, n: Node) = {
    // Pass along the graph structure
    emit(id, n)
    // Divide this node's PageRank evenly among its out-links
    val p = n.PageRank / n.adjacencyList.length
    for (m <- n.adjacencyList) {
      emit(m, p)
    }
  }
}

class Reducer {
  def reduce(id: Long, objects: Iterable[Object]) = {
    var s = 0.0
    var n: Node = null
    for (p <- objects) {
      if (isNode(p)) n = p   // recover the node structure
      else s += p            // accumulate incoming PageRank mass
    }
    n.PageRank = s
    emit(id, n)
  }
}

34
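The pseudo-code can be simulated on a single machine to see the dataflow; a hedged Python sketch in which a grouping dict plays the role of Hadoop's shuffle, using the 5-node example graph from the earlier slide:

```python
from collections import defaultdict

# Simplified PageRank: no random jump factor, no dangling nodes
graph = {1: [2, 4], 2: [3, 5], 3: [4], 4: [5], 5: [1, 2, 3]}
rank = {node: 1.0 / len(graph) for node in graph}

def one_iteration(graph, rank):
    shuffled = defaultdict(list)
    # Map phase: emit the node structure, plus rank/|adj| to each neighbor
    for node, adj in graph.items():
        shuffled[node].append(("node", adj))
        for m in adj:
            shuffled[m].append(("mass", rank[node] / len(adj)))
    # Reduce phase: keep the structure, sum the incoming mass
    return {node: sum(v for tag, v in values if tag == "mass")
            for node, values in shuffled.items()}

rank = one_iteration(graph, rank)
assert abs(sum(rank.values()) - 1.0) < 1e-12   # no dead ends: mass conserved
```

The ("node", adj) records mirror the pseudo-code's emit(id, n): the graph structure must be passed through the shuffle so the next iteration still knows each node's out-links.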

slide-35
SLIDE 35

PageRank vs. BFS

            PageRank   BFS
  map       PR/N       d+1
  reduce    sum        min

A large class of graph algorithms involve:

Local computations at each node
Propagating results: “traversing” the graph

35

slide-36
SLIDE 36

Complete PageRank

Two additional complexities:

What is the proper treatment of dangling nodes?
How do we factor in the random jump factor?

Solution: second pass to redistribute “missing PageRank mass” and account for random jumps
One final optimization: fold into a single MR job

𝑠𝑘 = Σ𝑗→𝑘 𝛾 𝑠𝑗 / 𝑒𝑗 + (1 − 𝛾) · 1/𝑛

36
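One way the second pass can be sketched: if m is the PageRank mass that fell into dangling nodes during an iteration, every score can be adjusted as s′ = γ·(s + m/n) + (1 − γ)/n, which restores a proper distribution. The helper below is a hypothetical illustration of that adjustment, not the course's reference implementation:

```python
def redistribute(scores, missing_mass, gamma=0.85):
    """Fold lost dangling-node mass and the random jump back into each score."""
    n = len(scores)
    return {k: gamma * (s + missing_mass / n) + (1 - gamma) / n
            for k, s in scores.items()}

# Example: the main pass left 0.4 of the total mass on a dangling node
scores = {"a": 0.35, "b": 0.25}            # sums to 0.6, so 0.4 is missing
fixed = redistribute(scores, missing_mass=0.4)
assert abs(sum(fixed.values()) - 1.0) < 1e-12
```

Because the adjustment is the same affine function for every node, it needs only the global missing-mass total, which is why it can run as a cheap second map-only pass.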

slide-37
SLIDE 37

Implementation Practicalities

[Figure: iterated MapReduce jobs, map → reduce → HDFS → map → …, with a convergence check between iterations]

Optimization: fold into one MapReduce job

37

slide-38
SLIDE 38

PageRank Convergence

Alternative convergence criteria

Iterate until PageRank values don’t change
Iterate until PageRank rankings don’t change
Fixed number of iterations

38

slide-39
SLIDE 39

Log Probs

PageRank values are really small…
Product of probabilities = addition of log probs
Addition of probabilities? Solution?

39
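Multiplying probabilities turns into adding logs, but adding probabilities needs more care: the usual answer is the log-sum-exp trick. A minimal sketch:

```python
import math

def log_add(log_a, log_b):
    """log(exp(log_a) + exp(log_b)), computed stably (log-sum-exp)."""
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    return hi + math.log1p(math.exp(lo - hi))

# Add two tiny probabilities without ever leaving log space
la, lb = math.log(1e-300), math.log(3e-300)
total = log_add(la, lb)
assert abs(total - math.log(4e-300)) < 1e-9
```

Factoring out the larger term keeps exp() applied only to a non-positive argument, so nothing underflows even when the probabilities themselves are far too small to add directly.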

slide-40
SLIDE 40

Beyond PageRank

Variations of PageRank

Weighted edges
Personalized PageRank (A3/A4 ☺)

40

slide-41
SLIDE 41

[Figure: iterated MapReduce jobs, map → reduce → HDFS → map → …, with a convergence check between iterations]

Implementation Practicalities

41

slide-42
SLIDE 42

MapReduce Sucks

Java verbosity
Hadoop task startup time
Stragglers
Needless graph shuffling
Checkpointing at each iteration

42

slide-43
SLIDE 43

[Figure: chained MapReduce jobs, each iteration writing to and reading from HDFS]

Let’s Spark!

43

slide-44
SLIDE 44

[Figure: the same map → reduce chain with no HDFS writes between iterations]

44

slide-45
SLIDE 45

[Figure: each iteration carries both the adjacency lists and the PageRank mass through map → reduce, with no intermediate HDFS writes]

45

slide-46
SLIDE 46

[Figure: the reduce stage becomes a join of the adjacency lists with the PageRank mass, map → join, with no intermediate HDFS writes]

46

slide-47
SLIDE 47

[Figure: Spark dataflow, the adjacency lists joined with the PageRank vector each iteration via join → flatMap → reduceByKey; HDFS is touched only at the start and the end]

47

slide-48
SLIDE 48

[Figure: same dataflow as the previous slide]

Cache! Keep the adjacency lists in memory across iterations.

48

slide-49
SLIDE 49

[Bar chart: time per iteration (s) vs. number of machines (30 and 60). Hadoop: 171 s and 80 s; Spark: 72 s and 28 s]

Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-2-amp-camp-2012-standalone-programs.pdf

MapReduce vs. Spark

49