SLIDE 1

Families of distributed graph algorithms

Divide and conquer

Márton Balassi¹ (mbalassi@ilab.sztaki.hu)

¹ Hungarian Academy of Sciences, Institute for Computer Science and Control, Data Mining & Search Group

June 24, 2014

SLIDE 2

Table of contents

◮ Distributing data-intensive algorithms
    ◮ Motivation
    ◮ MapReduce & Pregel
    ◮ Counting the number of triangles in a graph
◮ Families of distributed graph algorithms
    ◮ Local algorithms
    ◮ Graph traversal based algorithms
    ◮ Matrix multiplication based algorithms
◮ Experiments
    ◮ Representative algorithms
    ◮ Results

SLIDE 4

A bit about myself

My background

◮ BSc, MSc in Computer Science, Eötvös University Budapest
◮ BA in Economics, TU Budapest
◮ Distributed algorithms
◮ Big data architecture

SLIDE 8

Motivation

Let’s do a PageRank on this graph...

◮ A large Portuguese web crawl¹
◮ 3.1 · 10⁹ nodes
◮ 1.1 · 10¹¹ edges
◮ 80 GB of compressed data
◮ Divide and conquer is almost mandatory

¹ A large crawl of the Portuguese Web Archive, obtained from Daniel Gomes.

SLIDE 13

MapReduce [DG04]

SLIDE 14

Pregel [MAB+10]

Traits

◮ Bulk Synchronous Parallel [Val90]
◮ "Think like a vertex"
◮ Graph kept in memory

Figure: Scheme of the BSP system (Wikipedia, public domain)

SLIDE 17

Pregel [MAB+10]

Figure: Pregel schema as perceived from a vertex. The vertex has incoming edges In1 ... Inn, carrying messages labelled t − 1, and outgoing edges Out1 ... Outm, carrying messages labelled t.
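
The vertex view above can be illustrated with a small single-process simulation. This is only a sketch under assumptions: the toy task (propagating the maximum value, a common introductory Pregel example), the function name `pregel_max` and the dict-based graph are illustrative and are not the API of Giraph or of any Pregel system.

```python
def pregel_max(graph, values):
    """Propagate the maximum value through the graph, superstep by superstep.

    graph:  vertex -> list of out-neighbours (every neighbour is also a key)
    values: vertex -> initial numeric value
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value along its out-edges.
    inbox = {v: [] for v in graph}
    for v, outs in graph.items():
        for w in outs:
            inbox[w].append(values[v])
    supersteps = 1
    # Later supersteps: a vertex only does work if it received messages.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)        # update from messages sent in step t - 1
                for w in graph[v]:
                    outbox[w].append(values[v])  # delivered in the next superstep
        inbox = outbox                           # synchronisation barrier between supersteps
        supersteps += 1
    return values, supersteps


graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values, steps = pregel_max(graph, {0: 3, 1: 6, 2: 2, 3: 1})
print(values, steps)
```

A vertex that receives no larger value stays silent, which plays the role of voting to halt. The computation ends when no messages are in flight.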

SLIDE 19

Triangle Counter – Sequential algorithm

Sequential algorithm

Every vertex runs a search for itself, bounded to depth three. Thus every triangle is counted three times, once from each of its vertices. You can do better by making use of the ordering on the vertices.
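
One way to use the vertex ordering is sketched below, assuming an undirected simple graph stored as a dict of neighbour sets with comparable vertex IDs. The function name `count_triangles` is hypothetical. Each triangle u < v < w is found only from its smallest vertex, so nothing is counted more than once.

```python
def count_triangles(adj):
    """adj: vertex -> set of neighbours of an undirected simple graph."""
    # Keep only the neighbours with a higher ID, so every edge is stored once.
    higher = {u: {v for v in nbrs if v > u} for u, nbrs in adj.items()}
    count = 0
    for u, hs in higher.items():
        for v in hs:
            # A common higher neighbour w gives u < v < w: one triangle, found once.
            count += len(hs & higher[v])
    return count


# Complete graph on four vertices.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(count_triangles(k4))
```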

SLIDE 20

Triangle Counter – distributed algorithm

Representation

[Figure: a small example graph with vertex IDs 0 to 3 and its representation]

SLIDE 21

Triangle Counter – distributed algorithm

First Map

Let’s send our ID to all of our neighbours that have a higher ID than ours. Let’s send our neighbours to ourselves.

First Reduce

Let’s write out the information received.

SLIDE 22

Triangle Counter – distributed algorithm

Second Map

If the ID received is smaller than ours, let’s pass it on to our neighbours. Let’s send our neighbours to ourselves.

Second Reduce

If the ID received is our neighbour, then let’s increment a global counter.
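
The two rounds can be simulated in a single process as follows. This is a sketch, not the Hadoop implementation: the `shuffle` helper stands in for the framework's grouping by key, the `"id"` and `"nbrs"` tags are made up for illustration, and it assumes that the second map forwards IDs only to higher-ID neighbours so that each triangle is counted exactly once.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def count_triangles_mr(adj):
    """adj: vertex -> set of neighbours of an undirected simple graph."""
    # First map: send our ID to higher-ID neighbours, our neighbour list to ourselves.
    pairs = []
    for u, nbrs in adj.items():
        pairs.append((u, ("nbrs", frozenset(nbrs))))
        pairs.extend((v, ("id", u)) for v in nbrs if v > u)
    # First reduce: write out what was received (identity).
    round1 = shuffle(pairs)

    # Second map: pass every received (smaller) ID on to our higher-ID neighbours.
    pairs = []
    for v, values in round1.items():
        nbrs = next(val for tag, val in values if tag == "nbrs")
        pairs.append((v, ("nbrs", nbrs)))
        for tag, u in values:
            if tag == "id":
                pairs.extend((w, ("id", u)) for w in nbrs if w > v)

    # Second reduce: a forwarded ID that is also our neighbour closes a triangle.
    counter = 0
    for w, values in shuffle(pairs).items():
        nbrs = next(val for tag, val in values if tag == "nbrs")
        counter += sum(1 for tag, u in values if tag == "id" and u in nbrs)
    return counter


k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(count_triangles_mr(k4))
```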

SLIDE 25

Local algorithms

Traits

◮ Dependent on a small neighbourhood of the given vertex or edge.
◮ "Trivial" candidates for parallel computing.
◮ Examples are fingerprint computation, the local clustering coefficient and the number of triangles.
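
As an illustration of locality, the local clustering coefficient of a vertex needs only the neighbour lists of that vertex and of its neighbours. The sketch below assumes an undirected simple graph stored as a dict of neighbour sets, and the function name `local_clustering` is hypothetical.

```python
def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count edges among the neighbours of v; each edge is seen from both ends.
    links = sum(len(adj[u] & nbrs) for u in nbrs) // 2
    return 2.0 * links / (k * (k - 1))


adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print([local_clustering(adj, v) for v in sorted(adj)])
```

Because every vertex can be processed independently of the rest of the graph, the per-vertex calls parallelise without any coordination.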

SLIDE 28

Graph traversal based algorithms

Traits

◮ Dependent on taking long routes in the graph.
◮ Difficult to implement in a distributed environment.
◮ The distributed algorithm can be less efficient than the sequential one, as the distributed representation is less powerful.
◮ Examples include reachability, betweenness centrality and strongly connected components.

SLIDE 32

Matrix multiplication based algorithms

Traits

◮ Essentially power iterations of matrix-vector multiplications, typically with fast convergence.
◮ Well suited to Pregel-based systems, but not that straightforward in plain MapReduce. [KBEK14]
◮ Representatives include eigenvector centrality, PageRank and LineRank.
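
A power iteration of this kind can be sketched for PageRank as below. The damping factor of 0.85, the fixed iteration count and the even redistribution of rank from vertices without out-links are common conventions assumed here, not details taken from the talk. The function name `pagerank` is hypothetical.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: vertex -> list of out-neighbours (every neighbour is also a key)."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v, outs in out_links.items() if not outs)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in out_links}
        # One sparse matrix-vector multiplication: push rank along the out-edges.
        for v, outs in out_links.items():
            for w in outs:
                new_rank[w] += damping * rank[v] / len(outs)
        rank = new_rank
    return rank


ranks = pagerank({0: [2], 1: [2], 2: [0]})
print(ranks)
```

Each iteration touches every edge once and only exchanges values between adjacent vertices, which is why the pattern maps naturally onto one Pregel superstep per iteration.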

SLIDE 37

Representative algorithms

Local

TriangleCounter: already presented.

SLIDE 39

Representative algorithms

Matrix multiplication based

LineRank: Construct the line graph, whose vertices represent the edges of the original graph, with a directed edge pointing from e1 to e2 if the target of e1 is the source of e2. Run a PageRank on this line graph. [KPST11, PBMW98]
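
The line graph construction used by LineRank can be sketched as follows, assuming the original graph is given as a list of distinct directed (source, target) pairs. The function name `line_graph` is hypothetical, and the sketch materialises the line graph explicitly, which may differ from how a production implementation handles it.

```python
def line_graph(edges):
    """edges: list of distinct directed (source, target) pairs.

    Returns a dict: edge -> list of edges that it points to in the line graph.
    """
    by_source = {}
    for e in edges:
        by_source.setdefault(e[0], []).append(e)
    # e1 -> e2 whenever the target of e1 is the source of e2.
    return {e1: by_source.get(e1[1], []) for e1 in edges}


lg = line_graph([(0, 1), (1, 2), (1, 3), (2, 0)])
for edge, successors in lg.items():
    print(edge, "->", successors)
```

The result has one vertex per original edge, so its size grows with the edge count of the input rather than with its vertex count.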

SLIDE 42

Representative algorithms

Graph traversal based

SCC: Run a label propagation to detect connected components. Remove the edges whose endpoints have different labels, then do a reverse label propagation. [MIHPR05] Remove the SCCs found and iterate.

The sequential algorithm is more efficient. [BCVDP11]
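
One possible reading of these steps is sketched below as a single-process simulation. It assumes that the forward label is the smallest vertex ID that can reach a vertex, and that the reverse propagation starts from every vertex that kept its own ID as its label. These details and all function names are assumptions for illustration, not the exact method of [MIHPR05] or of the experiments.

```python
def forward_labels(vertices, edges):
    """Propagate the minimum vertex ID along the edge direction until stable."""
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if label[u] < label[v]:
                label[v] = label[u]
                changed = True
    return label


def backward_reach(root, edges):
    """Vertices that can reach `root` using only the given edges."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, []).append(u)
    seen, stack = {root}, [root]
    while stack:
        x = stack.pop()
        for p in preds.get(x, []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def strongly_connected_components(vertices, edges):
    vertices, edges = set(vertices), set(edges)
    components = []
    while vertices:
        label = forward_labels(vertices, edges)
        # Drop the edges whose endpoints received different labels.
        same = {(u, v) for u, v in edges if label[u] == label[v]}
        found = set()
        # Roots kept their own ID; whatever reaches a root backwards is its SCC.
        for root in [v for v in vertices if label[v] == v]:
            component = backward_reach(root, same)
            components.append(component)
            found |= component
        # Remove the SCCs found and iterate on the rest of the graph.
        vertices -= found
        edges = {(u, v) for u, v in edges if u in vertices and v in vertices}
    return components


edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
print(strongly_connected_components(range(5), edges))
```

Every round repeats a full propagation over the remaining graph, and each propagation needs as many passes as the longest route it has to follow, which hints at why a sequential algorithm can be more efficient here.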

SLIDE 43

Results [EBKK14]

Name              Vertices     Edges        Source
web-Google        8.7 · 10⁵    5.0 · 10⁷    [LLDM08]
wiki-Talk         2.4 · 10⁶    5.1 · 10⁷    [LHK10]
soc-LiveJournal   4.8 · 10⁶    6.9 · 10⁸    [BHKL06]
forest10M         10⁷          2.4 · 10⁸    generated²
forest20M         2 · 10⁷      4.8 · 10⁸    generated²

Table: Summary of the graphs used

² A slightly modified version of the ForestFire model. [LF06, EBKK14]

SLIDE 44

Results [EBKK14]

Implementations

◮ Sequential: plain Java.
◮ MapReduce: Apache Hadoop on 20 cores, 40 reducers.
◮ Pregel: Apache Giraph on 17 cores, 3 occupied by ZooKeeper.

SLIDE 51

Summary

Take home messages

◮ If your data is "big", you have to go multi-core or multi-machine.
◮ If multi-machine, then try MapReduce.
◮ For distributed graph algorithms it is useful to distinguish three families.
◮ The families behave differently.
◮ It is worth distributing even for data that is not that "big".

SLIDE 56

Márton Balassi
mbalassi@ilab.sztaki.hu

Hungarian Academy of Sciences, Institute for Computer Science and Control, Data Mining & Search Group

SLIDE 57

References

[BCVDP11] Jiří Barnat, Jakub Chaloupka, and Jaco van de Pol. Distributed algorithms for SCC decomposition. J. Log. and Comput., 21(1):23–44, February 2011.

[BHKL06] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44–54, 2006.

[DG04] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, 2004.

[EBKK14] Péter Englert, Márton Balassi, Balázs Kósa, and Attila Kiss. Efficiency issues of computing graph properties of social networks. In Proceedings of the 9th International Conference on Applied Informatics, to appear, 2014.

[KBEK14] Balázs Kósa, Márton Balassi, Péter Englert, and Attila Kiss. Betweenness versus LineRank, 2014. Sent to the 6th International Conference on Computational Collective Intelligence Technologies and Applications. Download from: http://people.inf.elte.hu/balhal/publications/BetweennessVsLinerank.pdf

[KPST11] U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. Centralities in large networks: Algorithms and observations. In SDM, pages 119–130. SIAM / Omnipress, 2011.

[LF06] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 631–636, 2006.

[LHK10] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1361–1370, 2010.

[LLDM08] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, 2008. arXiv:0810.1355.

[MAB+10] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 135–146, 2010.

[MIHPR05] William McLendon III, Bruce Hendrickson, Steven J. Plimpton, and Lawrence Rauchwerger. Finding strongly connected components in distributed graphs. Journal of Parallel and Distributed Computing, 65(8):901–910, August 2005.

[PBMW98] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, 1998.

[Val90] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.