FrogWild! Fast PageRank Approximations on Graph Engines. Ioannis Mitliagkas, Michael Borokhovich, Alex Dimakis, Constantine Caramanis (PowerPoint PPT Presentation).



SLIDE 1

FrogWild!

Fast PageRank Approximations on Graph Engines

Ioannis Mitliagkas, Michael Borokhovich, Alex Dimakis, Constantine Caramanis


SLIDE 4

Web Ranking

Given a web graph, find "important" pages.

Classic approach: rank based on in-degree.

Susceptible to manipulation by spammer networks.

[Figure: example web graph with pages A-E; spammer nodes S inflate the in-degree of a target page]


SLIDE 7

PageRank [Page et al., 1999]

Page importance is described by a distribution π.

Recursive definition: important pages are pointed to by
❖ important pages, which are pointed to by
❖ important pages, which are pointed to by…

Robust to manipulation by spammer networks.

[Figure: example web graph with pages A-E]


SLIDE 12

PageRank - Continuous Interpretation

Start: a gallon of water distributed evenly over the vertices.

Every iteration:
❖ Each vertex spreads its water evenly to its successors.
❖ A fraction, pT = 0.15, of all the water is redistributed evenly over all vertices.

Repeat until convergence; the water levels converge to π. Power iteration is usually employed in practice.

[Figure: example web graph with pages A-E]
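The water-spreading process above is exactly power iteration with teleportation. A minimal sketch, assuming a toy example graph and function names of my own choosing (illustrative, not the paper's code):

```python
import numpy as np

def pagerank_power_iteration(out_edges, n, p_t=0.15, iters=100):
    """Power iteration for PageRank on a graph given as successor lists."""
    pi = np.full(n, 1.0 / n)                # start: mass spread evenly
    for _ in range(iters):
        new = np.zeros(n)
        for v, succs in out_edges.items():
            if succs:
                # Each vertex spreads its mass evenly to its successors.
                np.add.at(new, succs, pi[v] / len(succs))
            else:
                new += pi[v] / n            # dangling vertex: spread evenly
        # Redistribute a fraction p_t of all mass evenly (teleportation).
        pi = (1 - p_t) * new + p_t / n
    return pi

# Hypothetical 3-page graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
pi = pagerank_power_iteration({0: [1, 2], 1: [2], 2: [0]}, 3)
```

Page 2 ends up heaviest here since it is pointed to by both 0 and 1.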


SLIDE 19

Discrete Interpretation

A frog walks randomly on the graph; the next vertex is chosen uniformly at random among the successors.

Teleportation: at every step, the frog teleports to a uniformly random vertex w.p. pT.

Sampling after t steps: the frog's location gives a sample from π.

PageRank vector: release many frogs to estimate the vector π.

[Figure: example web graph with pages A-E and uniform transition probabilities on the out-edges]
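The frog process can be sketched directly. The names (`frog_walk`, `estimate_pagerank`) and the toy graph are my own; this illustrates the sampling idea, not FrogWild itself:

```python
import random
from collections import Counter

def frog_walk(out_edges, n, t, p_t=0.15):
    """One frog: t steps of the teleporting random walk; returns its final vertex."""
    v = random.randrange(n)              # start at a uniformly random vertex
    for _ in range(t):
        succs = out_edges.get(v, [])
        if random.random() < p_t or not succs:
            v = random.randrange(n)      # teleport w.p. p_t (or at a dead end)
        else:
            v = random.choice(succs)     # next vertex uniform among successors
    return v

def estimate_pagerank(out_edges, n, num_frogs=20000, t=30):
    """Release many frogs; their empirical distribution estimates pi."""
    counts = Counter(frog_walk(out_edges, n, t) for _ in range(num_frogs))
    return [counts[v] / num_frogs for v in range(n)]

random.seed(0)
# Hypothetical 3-page graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
pi_hat = estimate_pagerank({0: [1, 2], 1: [2], 2: [0]}, 3)
```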


SLIDE 21

PageRank Approximation

Looking for the k "heavy" nodes; we do not need the full PageRank vector.

Random walk sampling favors heavy nodes.

Captured mass metric: for a node set S, the captured mass is π(S). Example with k = 2: return the set {E, D}; captured mass = π({E, D}).

[Figure: example web graph with pages A-E; samples concentrate on the heavy nodes]
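The captured-mass metric is easy to state in code. A sketch with a hypothetical PageRank vector (the numbers and names are illustrative, not from the paper):

```python
def top_k(pi_hat, k):
    """Top-k nodes of an estimated PageRank vector."""
    return sorted(range(len(pi_hat)), key=lambda v: pi_hat[v], reverse=True)[:k]

def captured_mass(pi, S):
    """Fraction of the true PageRank mass captured by node set S."""
    return sum(pi[v] for v in S)

pi = [0.10, 0.05, 0.40, 0.30, 0.15]        # hypothetical true PageRank, nodes 0..4
pi_hat = [0.12, 0.04, 0.38, 0.31, 0.15]    # hypothetical noisy estimate
S = top_k(pi_hat, 2)                        # the estimate's top-2 set
mass = captured_mass(pi, S)                 # pi(S) = 0.40 + 0.30 = 0.70
```

Even though the estimate is noisy, its top-k set can still capture almost all of the heavy mass, which is exactly what the metric measures.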

SLIDE 22

Platform


SLIDE 26

Graph Engines

❖ The engine splits the graph across the cluster
❖ A vertex program describes the logic

GAS abstraction:
1. Gather
2. Apply
3. Scatter

Other approaches: Giraph [Avery, 2011], Galois [Nguyen et al., 2013], GraphX [Xin et al., 2013]

[Figure: example graph with vertices A-E]
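A PageRank round in the GAS model can be sketched as follows. The names are my own, and real engines like PowerGraph run gather/apply/scatter per vertex in parallel rather than in this sequential loop:

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    rank: float
    out_degree: int

def gas_pagerank_round(vertices, in_edges, p_t=0.15):
    """One synchronous Gather-Apply-Scatter round of PageRank (illustrative)."""
    n = len(vertices)
    new_ranks = []
    for v in range(n):
        # Gather: sum the rank mass arriving along in-edges.
        gathered = sum(vertices[u].rank / vertices[u].out_degree
                       for u in in_edges.get(v, []))
        # Apply: combine the gathered mass with teleportation.
        new_ranks.append(p_t / n + (1 - p_t) * gathered)
    # Scatter: signal successors for the next round; in this synchronous
    # sketch every vertex is simply active every round.
    for v, r in enumerate(new_ranks):
        vertices[v].rank = r
    return vertices

# Hypothetical 3-vertex graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
vs = [Vertex(1/3, 2), Vertex(1/3, 1), Vertex(1/3, 1)]
in_edges = {0: [2], 1: [0], 2: [0, 1]}
for _ in range(100):
    gas_pagerank_round(vs, in_edges)
```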


SLIDE 28

Edge Cuts

❖ Assign vertices to machines
❖ Cross-machine edges require network communication
❖ Pregel, GraphLab 1.0
❖ High-degree nodes generate a large volume of traffic
❖ Computational load imbalance

[Figure: vertices A-E partitioned across Machines 1-3; edges cross machine boundaries]


SLIDE 31

Vertex Cuts

❖ Assign edges to machines
❖ High-degree nodes are replicated
❖ One replica is designated the master
❖ Need for synchronization:

1. Gather
2. Apply [on master]
3. Synchronize mirrors
4. Scatter

❖ GraphLab 2.0 - PowerGraph
❖ Balanced, but the network is still the bottleneck

[Figure: edges partitioned across Machines 1-3; high-degree node B is replicated on each machine]


SLIDE 34

Random Walks on GraphLab

The master node decides the step. The decision is synced to all mirrors, but only machine M needs it: unnecessary network traffic. The average replication factor is ~8.

[Figure: vertex B replicated on Machines 1 through M; the chosen step Z is sent to every mirror]

SLIDE 35

Objective

Faster PageRank approximation on GraphLab.

Idea: only synchronize the mirror that will receive the frog. Doable, but requires
1. Serious engine hacking
2. Exposing an ugly/complicated API to the programmer

Simpler: pick mirrors to synchronize at random! Each mirror synchronizes independently with probability pS.
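The saving from randomized synchronization is easy to quantify in expectation. A back-of-the-envelope sketch: the pS value below is an arbitrary assumption; only the replication factor ~8 comes from the slides:

```python
def expected_mirror_updates(replication_factor, p_s):
    """Expected mirror updates per step: full sync vs. independent Ber(p_s) sync."""
    full = replication_factor - 1                # every mirror is updated
    randomized = p_s * (replication_factor - 1)  # each mirror updated w.p. p_s
    return full, randomized

full, randomized = expected_mirror_updates(8, 0.1)
# full = 7 updates; randomized = 0.7 updates in expectation
```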


SLIDE 42

FrogWild!

Release N frogs in parallel.

Vertex program:
1. Each frog dies w.p. pT (gives a sample). Assume K frogs survive.
2. For every mirror, draw a bridge w.p. pS (independent Ber(pS) draws).
3. Spread the surviving frogs evenly among the synchronized mirrors.

Bridges introduce dependencies!

[Figure: vertex B replicated on Machines 1 through M; with two bridges drawn, K/2 frogs go to each synchronized mirror]
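The three-step vertex program above can be sketched in plain Python. This is illustrative only; the names are mine, and the fallback when no bridge is drawn is my assumption, not something stated on the slides:

```python
import random

def frogwild_step(num_frogs, num_mirrors, p_t=0.15, p_s=0.7):
    """One FrogWild step at a replicated vertex.

    Returns (samples, per_mirror): the frogs that died (each yields a
    PageRank sample) and how many survivors each mirror receives.
    """
    # 1. Each frog dies w.p. p_t and gives a sample; K frogs survive.
    k = sum(1 for _ in range(num_frogs) if random.random() >= p_t)
    samples = num_frogs - k
    # 2. For every mirror, draw a bridge independently w.p. p_s.
    synced = [m for m in range(num_mirrors) if random.random() < p_s]
    if not synced:                       # assumed fallback: sync one mirror
        synced = [random.randrange(num_mirrors)]
    # 3. Spread the K surviving frogs evenly among the synchronized mirrors.
    per_mirror = [0] * num_mirrors
    for i in range(k):
        per_mirror[synced[i % len(synced)]] += 1
    return samples, per_mirror

random.seed(0)
samples, per_mirror = frogwild_step(100, 4)
```

No frog is lost: every frog either dies (becoming a sample) or lands on exactly one synchronized mirror, which is why the bridges create dependencies between frogs sharing a vertex.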

SLIDE 43

Contributions

1. An algorithm for approximate PageRank
2. A modification of GraphLab that exposes a very simple API extension (pS) and allows for randomized synchronization
3. A speedup of 7-10x
4. Theoretical guarantees for the solution despite the introduced dependencies

SLIDE 44

Theoretical Guarantee

Mass captured by the top-k set S of the estimate, from N frogs after t steps: with probability 1 − δ,

π(S) ≥ OPT − 2ε,

where

ε < √( (kt²/2 + k) · (1/N + (1 − pS²) p∩(t)) )

and

p∩(t) ≤ 1/n + t ‖π‖∞ / pT

is the probability that two frogs meet within the first t steps.

SLIDE 45

Experiments

SLIDE 46

Experimental Results

[Figures: experimental results omitted]


SLIDE 49

Thank you!

SLIDE 50

References

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web.
Malewicz, G., et al. (2010). Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., & Hellerstein, J. M. (2010). GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990.
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., & Guestrin, C. (2012). PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI (Vol. 12, No. 1, p. 2).
Nguyen, D., Lenharth, A., & Pingali, K. (2013). A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (pp. 456-471). ACM.
Avery, C. (2011). Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of Hadoop Summit. Santa Clara, USA.
Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I. (2013). GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems (p. 2). ACM.
SLIDE 51

Backup Slides


SLIDE 56

PageRank [Page et al., 1999]

Normalized adjacency matrix:
Pij = 1/dout(j) for (j, i) ∈ G

Augmented matrix:
Qij = (1 − pT) Pij + pT/n, with pT ∈ [0, 1]

PageRank vector: π = Qπ

Power method: Qᵗ p₀ → π

[Figure: example web graph with pages A-E]
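The matrix form can be checked numerically. A sketch using a hypothetical 3-page graph (0 → {1, 2}, 1 → {2}, 2 → {0}):

```python
import numpy as np

p_t = 0.15
# P[i, j] = 1/dout(j) for each edge (j, i) of the hypothetical graph.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n = P.shape[0]
Q = (1 - p_t) * P + p_t / n     # augmented matrix: columns still sum to 1

# Power method: Q^t p0 converges to the PageRank vector pi.
p = np.full(n, 1.0 / n)
for _ in range(200):
    p = Q @ p
```

At convergence, p is (numerically) a fixed point of Q, i.e. π = Qπ.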

SLIDE 57

Here be dragons.
