

SLIDE 1

HelP: High-level Primitives for Large-Scale Graph Processing

Semih Salihoglu (Stanford University)
Jennifer Widom (Stanford University)

SLIDE 2

Large-scale Graph Processing

10s or 100s of billions of vertices and edges
Distributed shared-nothing systems

[Figure: graph data partitioned across Machine 1 … Machine k over distributed storage; example systems: Pregel, PowerGraph]

SLIDE 3

APIs of Existing Systems


Specialized map()- and reduce()-style APIs:

  • Pregel’s compute()
  • PowerGraph’s gather(), apply(), scatter()

Vertex-centric / graph-parallel, message-passing programming model (see the sketch below)

[Figure: graph data partitioned across Machine 1 … Machine k over distributed storage]
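To illustrate the vertex-centric, message-passing style, here is a minimal single-source shortest paths program. It is written against GraphX's Pregel operator rather than Pregel's or PowerGraph's own APIs, and the Double vertex and edge types are assumptions made only for this sketch.

```scala
import org.apache.spark.graphx._

// A minimal sketch of the vertex-centric, message-passing style, written against
// GraphX's Pregel operator (not Pregel's or PowerGraph's own APIs).
// Single-source shortest paths: vertex values hold tentative distances,
// edge attributes hold non-negative edge weights.
def sssp(graph: Graph[Double, Double], source: VertexId): Graph[Double, Double] = {
  // Initialize: distance 0 at the source, +infinity everywhere else.
  val init = graph.mapVertices((id, _) =>
    if (id == source) 0.0 else Double.PositiveInfinity)

  Pregel(init, Double.PositiveInfinity)(
    // Vertex program (the "compute()"): keep the smaller of old and incoming distance.
    (_, dist, newDist) => math.min(dist, newDist),
    // Send messages (the "scatter"): relax edges whose source vertex improved.
    triplet =>
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      else Iterator.empty,
    // Merge messages (the "gather"): take the minimum incoming distance.
    (a, b) => math.min(a, b))
}
```

Even this small example shows the pattern the rest of the talk addresses: all algorithm logic, including initialization and aggregation, is expressed inside per-vertex user-defined functions.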

SLIDE 4

Advantages

  • Transparent parallelism
  • Flexible: can express many graph algorithms:

PageRank, HITS, Shortest Paths, Collaborative Filtering, Affinity Propagation, Loopy Belief Propagation, Weakly Connected Components, Triangle Counting, Strongly Connected Components, Betweenness Centrality, Minimum Spanning Tree, Diameter Estimation, …

SLIDE 5

Disadvantages


Custom code for common operations, such as:

  • Initializing vertex values
  • Aggregating neighbor values

Difficult to read and understand some programs:

  • Complex UDFs hide higher-level graph operations

Too low-level for some operations

  • E.g., forming supervertices in minimum spanning tree
  • Requires multiple rounds of complex messaging inside compute()

…
graph = Pregel.compute(UDF1)
graph = Pregel.compute(UDF2)
graph = Pregel.compute(UDF3)
…

SLIDE 6

HelP Primitives

Large-scale data processing:
  • Low-level: map(), reduce()
  • Higher-level (Pig and Hive): join, group by, select, …

Large-scale graph processing:
  • Low-level: compute(), gather(), apply(), scatter()
  • Higher-level (HelP): ?

SLIDE 7

Steps in Our Work

  1. Implemented a wide suite of distributed graph algorithms
  2. Identified the commonly appearing operations
  3. Abstracted the operations into HelP primitives
  4. Implemented HelP on GraphX
  5. Reimplemented the suite of algorithms on GraphX

SLIDE 8

Graph Algorithms We Implemented

Algorithms:
  • PageRank
  • HITS
  • Conductance
  • Approx. Betweenness Centrality
  • Clustering Coefficient
  • Semi-clustering
  • Multi-level Clustering
  • Approx. Maximum Weight Matching
  • Random Bipartite Matching
  • Weakly Connected Components
  • Strongly Connected Components
  • Single Source Shortest Paths
  • Graph Coloring
  • Maximal Independent Set
  • K-core
  • Triangle Counting
  • Diameter Estimation
  • K-truss
  • Minimum Spanning Forest

SLIDE 9

HelP Primitives

Primitive | Type of Operation
Aggregate Neighbor Values (ANV) | Vertex-centric Update
Local Update of Vertices (LUV) | Vertex-centric Update
Update Vertices Using One Other Vertex (UVUOV) | Vertex-centric Update
Filter | Topology Modification
Form Supervertices (FS) | Topology Modification
Aggregate Global Value (AGV) | Global Aggregation
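As a rough illustration of how such primitives might be exposed to a programmer, the interface sketch below is reconstructed only from the examples on slides 11 and 12. Apart from aggregateNeighborValues and propagateAndAggregate, every name and parameter list here is a hypothetical stand-in, not HelP's actual API.

```scala
import org.apache.spark.graphx.EdgeDirection

// Hypothetical interface sketch (functional rendering); only the two methods
// shown on slides 11-12 are named in the deck, and even their parameter lists
// below are reconstructed from those examples.
trait HelpGraph[V] {
  // ANV: each selected vertex aggregates values from selected neighbors once.
  def aggregateNeighborValues[A](
      vertexPred: V => Boolean,           // which vertices aggregate
      neighborPred: V => Boolean,         // which neighbors contribute
      neighborValue: V => A,              // value taken from each neighbor
      aggregate: (A, A) => A,             // e.g. SUM, MAX
      update: (V, A) => V): HelpGraph[V]  // fold the aggregate into the vertex

  // Iterative ANV: repeat the aggregation until vertex values converge.
  def propagateAndAggregate[A](
      direction: EdgeDirection,
      startPred: V => Boolean,
      vertexValue: V => A,
      aggregate: (A, A) => A,
      update: (V, A) => V): HelpGraph[V]

  // LUV (hypothetical name): update each vertex using only its own value.
  def updateVertices(update: V => V): HelpGraph[V]

  // Filter (hypothetical name): keep only vertices satisfying a predicate.
  def filterVertices(pred: V => Boolean): HelpGraph[V]

  // AGV (hypothetical name): reduce a per-vertex value to one global value.
  def aggregateGlobalValue[A](value: V => A, aggregate: (A, A) => A): A

  // UVUOV and Form Supervertices are omitted from this sketch.
}
```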

SLIDE 10

Algorithms & HelP Primitives

Number of the six primitives (Filter, ANV, LUV, UVUOV, FS, AGV) used per algorithm:
  • PageRank: 2
  • HITS: 3
  • Conductance: 2
  • Approx. Betweenness Centrality: 3
  • Clustering Coefficient: 2
  • Semi-clustering: 3
  • Multi-level Clustering: 3
  • Approx. Maximum Weight Matching: 2
  • Random Bipartite Matching: 3
  • Weakly Connected Components: 2
  • Strongly Connected Components: 4
  • Single Source Shortest Paths: 2
  • Graph Coloring: 3
  • Maximal Independent Set: 3
  • K-core: 2
  • Triangle Counting: 1
  • Diameter Estimation: 3
  • K-truss: 1
  • Minimum Spanning Forest: 4

SLIDE 11

Example: Aggregate Neighbor Values

Vertices aggregate some or all of their neighbors’ values and update their own value with the aggregated result.

Version 1: Non-iterative => aggregateNeighborValues

Ex: PageRank

…
for (i = 0; i < 10; ++i) {
  g.aggregateNeighborValues(
    v -> true,                        /* aggregate at all vertices */
    nbr -> true,                      /* which neighbors to aggregate: all */
    nbr -> nbr.val.pr / nbr.degree,   /* value contributed by each neighbor */
    AggrFnc.SUM,                      /* how to combine the contributions */
    (v, sumPr) -> { v.val.pr = 0.85*sumPr + 0.15/g.numV; })
}
…

SLIDE 12

Version 2: Iterative => propagateAndAggregate

Continue aggregations until vertex values converge.

Ex: Weakly Connected Components

[Figure: label-propagation iterations of WCC on two components {1, 2, 3, 4, 5} and {7, 8, 9}; each vertex repeatedly adopts the maximum component ID among its neighbors until the labels converge to 5 and 9, respectively]

…
g.propagateAndAggregate(
  EdgeDirection.BOTH,
  v -> true,                  /* start propagation from all vertices */
  v -> v.val.wccID,           /* value to propagate and aggregate */
  AggrFnc.MAX,                /* keep the maximum component ID seen */
  (v, aggrWCCID) -> { v.val.wccID = aggrWCCID; })
…
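For comparison, here is a minimal sketch (not HelP's implementation) of the same max-label propagation written directly against GraphX's Pregel operator; this is roughly the iterate-until-convergence loop that a single propagateAndAggregate call packages up.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Sketch only: weakly connected components by max-label propagation, written
// directly against GraphX's Pregel operator. The default active direction
// (Either) corresponds to the slide's EdgeDirection.BOTH.
def wcc[V, E: ClassTag](graph: Graph[V, E]): Graph[VertexId, E] = {
  // Start every vertex with its own ID as its component label.
  val init = graph.mapVertices((id, _) => id)

  Pregel(init, Long.MinValue)(
    // Keep the largest label seen so far (AggrFnc.MAX on the slide).
    (_, label, incoming) => math.max(label, incoming),
    // Propagate a label across an edge only if it improves the other endpoint.
    triplet => {
      if (triplet.srcAttr > triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr))
      else if (triplet.dstAttr > triplet.srcAttr)
        Iterator((triplet.srcId, triplet.dstAttr))
      else
        Iterator.empty
    },
    // Merge concurrent messages by taking the maximum.
    (a, b) => math.max(a, b))
}
```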

SLIDE 13

Related Work (see paper)

  • Vertex-centric APIs
  • MapReduce-based APIs
  • Higher-Level Data Analysis Languages
  • Domain-Specific Graph Languages
  • MPI-based Libraries

SLIDE 14

GraphX Implementation, Limitations, Future Work


See Our Paper & Poster!

SLIDE 15


Questions?

SLIDE 16

GraphX Implementation (Non-iterative Version)

EdgesRDD:
v1.ID | v2.ID | e1
v1.ID | v3.ID | e2
v2.ID | v3.ID | e3
v3.ID | v1.ID | e4
v4.ID | v2.ID | e5
v4.ID | v1.ID | e6

VerticesRDD:
v1.ID | v1.val
v2.ID | v2.val
v3.ID | v3.val
v4.ID | v4.val

mapReduceTriplets (join + map + reduceBy) over EdgesRDD and VerticesRDD produces

MessagesRDD:
v1.ID | aggrMsg1
v2.ID | aggrMsg2
v3.ID | aggrMsg3

join with VerticesRDD produces

VerticesMsgsRDD:
v1.ID | v1.val | aggrMsg1
v2.ID | v2.val | aggrMsg2
v3.ID | v3.val | aggrMsg3
v4.ID | v4.val | aggrMsg4

map produces

NewVerticesRDD:
v1.ID | v1.newval
v2.ID | v2.newval
v3.ID | v3.newval
v4.ID | v4.newval

Replace VerticesRDD with NewVerticesRDD in the graph.
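A minimal sketch of this dataflow in Scala on GraphX (not the authors' code): aggregateMessages, GraphX's current name for the mapReduceTriplets step shown above, builds MessagesRDD, and outerJoinVertices performs the join + map that produces NewVerticesRDD. The Double vertex-value type, the SUM aggregation, and the update callback are illustrative assumptions.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Sketch of the non-iterative dataflow shown above (not HelP's actual code):
//   EdgesRDD + VerticesRDD --aggregateMessages--> MessagesRDD
//   MessagesRDD join VerticesRDD --map--> NewVerticesRDD, which replaces VerticesRDD.
def aggregateNeighborValuesOnce[E: ClassTag](
    g: Graph[Double, E],                 // vertex value: a single Double, for simplicity
    neighborValue: Double => Double,     // value each neighbor contributes
    update: (Double, Double) => Double   // combine old value with the aggregate
): Graph[Double, E] = {
  // mapReduceTriplets step: map over edge triplets, reduce (here: SUM) per destination vertex.
  val messages: VertexRDD[Double] = g.aggregateMessages[Double](
    ctx => ctx.sendToDst(neighborValue(ctx.srcAttr)),
    _ + _)
  // join + map step: attach each vertex's aggregated message and compute its new value;
  // vertices that received no message keep their old value.
  g.outerJoinVertices(messages) { (_, oldVal, aggrOpt) =>
    aggrOpt.map(a => update(oldVal, a)).getOrElse(oldVal)
  }
}
```

An iterative primitive such as propagateAndAggregate would repeat this step until no vertex value changes.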