

slide-1
SLIDE 1

Graph Processing Frameworks

Lecture 24 CSCI 4974/6971 5 Dec 2016

1 / 13
slide-2
SLIDE 2

Today’s Biz

  • 1. Reminders
  • 2. Review
  • 3. Graph Processing Frameworks
  • 4. 2D Partitioning
2 / 13
slide-3
SLIDE 3

Reminders

◮ Assignment 6: due date Dec 8th
◮ Final Project Presentation: December 8th
◮ Project Report: December 11th
  ◮ Intro, Background and Prior Work, Methodology, Experiments, Results
  ◮ Include: Report as PDF, compilable source, data if small or link if large (Google Drive, linux.cs.rpi.edu, or CCI filesystems)
◮ Office hours: Tuesday & Wednesday 14:00-16:00, Lally 317
  ◮ Or email me for other availability

3 / 13
slide-4
SLIDE 4

Today’s Biz

  • 1. Reminders
  • 2. Review
  • 3. Graph Processing Frameworks
  • 4. 2D Partitioning
4 / 13
slide-5
SLIDE 5

Quick Review

Graphs on Manycores:

◮ Manycores: Xeon Phis and GPUs

◮ Hundreds to thousands of cores, even more threads
◮ Work balance among threads is king

5 / 13
slide-6
SLIDE 6

Today’s Biz

  • 1. Reminders
  • 2. Review
  • 3. Graph Processing Frameworks
  • 4. 2D Partitioning
6 / 13
slide-7
SLIDE 7

PREGEL

A System for Large-Scale Graph Processing

slide-8
SLIDE 8

The Problem

  • Large graphs are often part of the computations required in modern systems (social networks, Web graphs, etc.).
  • There are many graph computing problems, like shortest path, clustering, PageRank, minimum cut, and connected components, but there exists no scalable general-purpose system for implementing them.

2 Pregel
slide-9
SLIDE 9

Characteristics of the algorithms

  • They often exhibit poor locality of memory access.
  • Very little computation work is required per vertex.
  • Changing degree of parallelism over the course of execution.

Refer [1, 2]

3 Pregel
slide-10
SLIDE 10

Possible solutions

  • Crafting a custom distributed framework for every new algorithm.
  • Existing distributed computing platforms like MapReduce.
    – These are sometimes used to mine large graphs [3, 4], but often give sub-optimal performance and have usability issues.
  • Single-computer graph algorithm libraries.
    – Limiting the scale of the graph is necessary.
    – BGL, LEDA, NetworkX, JDSL, Stanford GraphBase, or FGL.
  • Existing parallel graph systems, which do not handle fault tolerance and other issues.
    – The Parallel BGL [5] and CGMgraph [6].

Pregel 4
slide-11
SLIDE 11

Pregel

To overcome these challenges, Google came up with Pregel.

  • Provides scalability
  • Fault-tolerance
  • Flexibility to express arbitrary algorithms

The high level organization of Pregel programs is inspired by Valiant’s Bulk Synchronous Parallel model[7].

Pregel 5
slide-12
SLIDE 12

Message passing model

A pure message passing model has been used, omitting remote reads and other ways to emulate shared memory, because:

  • 1. The message passing model was found sufficient for all graph algorithms.
  • 2. The message passing model performs better than reading remote values, because latency can be amortized by delivering large batches of messages asynchronously.

Pregel 6
slide-13
SLIDE 13

Message passing model

Pregel 7
slide-14
SLIDE 14

Example

Find the largest value of a vertex in a strongly connected graph

8 Pregel
slide-15
SLIDE 15

[Figure: supersteps of the max-value example; the value 6 propagates along edges until every vertex holds 6 and votes to halt]

Blue Arrows are messages Blue vertices have voted to halt

9 Pregel

6 Finding the largest value in a graph

slide-16
SLIDE 16

Basic Organization

  • Computations consist of a sequence of iterations called supersteps.
  • During a superstep, the framework invokes a user-defined function for each vertex, which specifies the behavior at a single vertex V in a single superstep S. The function can:
    – Read messages sent to V in superstep S-1
    – Send messages to other vertices that will be received in superstep S+1
    – Modify the state of V and of its outgoing edges
    – Make topology changes (introduce/delete/modify edges/vertices)

10 Pregel
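To make the superstep contract above concrete, here is a minimal sketch of what such a user-defined vertex function looks like in Pregel's C++ style. The Vertex template, MessageIterator, and helper methods mirror the names used in the code shown later in these slides; this is an illustrative shape, not a complete compilable Pregel program.

    // Sketch: a vertex program that keeps a running sum of incoming messages.
    class SumVertex : public Vertex<double, void, double> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        double sum = GetValue();
        for (; !msgs->Done(); msgs->Next())   // read messages from superstep S-1
          sum += msgs->Value();
        *MutableValue() = sum;                // modify the state of vertex V
        SendMessageToAllNeighbors(sum);       // delivered in superstep S+1
        VoteToHalt();                         // go inactive until a message arrives
      }
    };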
slide-17
SLIDE 17

Basic Organization - Superstep

11 Pregel
slide-18
SLIDE 18

Model Of Computation: Entities

VERTEX

  • Identified by a unique identifier.
  • Has a modifiable, user defined value.

EDGE

  • Source vertex and Target vertex identifiers.
  • Has a modifiable, user defined value.
Pregel 12
slide-19
SLIDE 19

Model Of Computation: Progress

  • In superstep 0, all vertices are active.
  • Only active vertices participate in a superstep.

– They can go inactive by voting for halt. – They can be reactivated by an external message from another vertex.

  • The algorithm terminates when all vertices

have voted for halt and there are no messages in transit.

13 Pregel
slide-20
SLIDE 20

Model Of Computation: Vertex

State machine for a vertex

14 Pregel
slide-21
SLIDE 21

Comparison with MapReduce

Graph algorithms can be implemented as a series of MapReduce invocations, but this requires passing the entire state of the graph from one stage to the next, which is not the case with Pregel. The Pregel framework also reduces programming complexity by using supersteps.

15 Pregel
slide-22
SLIDE 22

The C++ API

Creating a Pregel program typically involves subclassing the predefined Vertex class.

  • The user overrides the virtual Compute() method, which is executed for every active vertex in each superstep.
  • Compute() can get the vertex's associated value with GetValue() or modify it using MutableValue().
  • Values of edges can be inspected and modified using the out-edge iterator.

16 Pregel
slide-23
SLIDE 23

The C++ API – Message Passing

Each message consists of a value and the name of the destination vertex.

  – The type of the value is specified in the template parameter of the Vertex class.

Any number of messages can be sent in a superstep.

  – The framework guarantees delivery and non-duplication, but not in-order delivery.

A message can be sent to any vertex if its identifier is known.

17 Pregel
slide-24
SLIDE 24

The C++ API – Pregel Code

Pregel code for finding the max value:

class MaxFindVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    double currMax = GetValue();
    SendMessageToAllNeighbors(currMax);
    for ( ; !msgs->Done(); msgs->Next()) {
      if (msgs->Value() > currMax)
        currMax = msgs->Value();
    }
    if (currMax > GetValue())
      *MutableValue() = currMax;
    else
      VoteToHalt();
  }
};

18 Pregel
slide-25
SLIDE 25

The C++ API – Combiners

Sending a message to a vertex that exists on a different machine has some overhead. However, if the algorithm doesn't need each message individually but only a function of them (for example, their sum), then combiners can be used. This is done by overriding the Combine() method.

  • It can be used only for associative and commutative operations.

19 Pregel
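A hedged sketch of a Max combiner for the earlier MaxValue example: the Combiner template, MessageIterator, and an Output() call that emits the combined message are assumed to follow the combiner interface described in the Pregel paper, which is not reproduced on these slides.

    // Sketch: collapse many max-value messages headed to one vertex into a single message.
    class MaxCombiner : public Combiner<double> {
     public:
      virtual void Combine(MessageIterator* msgs) {
        double max_val = msgs->Value();
        for (msgs->Next(); !msgs->Done(); msgs->Next())
          if (msgs->Value() > max_val) max_val = msgs->Value();  // max is associative and commutative
        Output("combined_max", max_val);   // one message replaces the whole batch
      }
    };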
slide-26
SLIDE 26

The C++ API – Combiners Example:

Say we want to count the number of incoming links to all the pages in a set of interconnected pages. In the first iteration, for each link from a vertex (page) we send a message to the destination page. Here, a count function over the incoming messages can be used as a combiner to optimize performance. In the MaxValue example, a Max combiner would reduce the communication load.

20 Pregel
slide-27
SLIDE 27

The C++ API – Combiners

21 Pregel
slide-28
SLIDE 28

The C++ API – Aggregators

Aggregators are used for global communication, monitoring, and data.

Each vertex can produce a value in a superstep S for the aggregator to use. The aggregated value is available to all vertices in superstep S+1.

Aggregators can be used for statistics and for global communication. They are implemented by subclassing the Aggregator class.

Commutativity and associativity are required.

22 Pregel
slide-29
SLIDE 29

The C++ API – Aggregators

Example:

A Sum operator applied to the out-edge count of each vertex can be used to generate the total number of edges in the graph and communicate it to all the vertices.

More complex reduction operators can even generate histograms. In the MaxValue example, we can finish the entire program in a single superstep by using a Max aggregator.

23 Pregel
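As an illustration of the aggregator idea, here is a sketch of a sum aggregator that could total the out-edge counts contributed by each vertex. The method names (Init/Aggregate/GetValue) are assumptions made for illustration, not the documented Pregel interface.

    // Sketch: a commutative, associative sum aggregator.
    class EdgeCountAggregator : public Aggregator<int64> {
     public:
      virtual void Init() { sum_ = 0; }
      virtual void Aggregate(int64 out_degree) { sum_ += out_degree; }  // called once per contributing vertex
      virtual int64 GetValue() const { return sum_; }  // visible to all vertices in superstep S+1
     private:
      int64 sum_;
    };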
slide-30
SLIDE 30

The C++ API – Topology Mutations

The Compute() function can also be used to modify the structure of the graph.

Example: Hierarchical Clustering

Mutations take effect in the superstep after the requests were issued. Ordering of mutations, with

  – deletions taking place before additions,
  – deletion of edges before vertices, and
  – addition of vertices before edges

resolves most of the conflicts. Rest are handled by user-defined handlers.

24 Pregel
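As a purely hypothetical sketch of how Compute() might request mutations during hierarchical clustering: the request-style call names and helper functions below are illustrative assumptions; the actual Pregel mutation API is not shown on these slides.

    // Hypothetical sketch of issuing mutation requests from Compute().
    virtual void Compute(MessageIterator* msgs) {
      if (ShouldCollapseIntoCenter()) {                    // hypothetical clustering decision
        RemoveEdgeRequest(vertex_id(), old_neighbor_id_);  // hypothetical call; deletions are applied first
        AddEdgeRequest(vertex_id(), center_id_, 1.0);      // hypothetical call; additions are applied after
      }
      VoteToHalt();                                        // mutations take effect in the next superstep
    }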
slide-31
SLIDE 31

Implementation

Pregel is designed for the Google cluster architecture. The architecture schedules jobs to optimize resource allocation, involving killing instances or moving them to different locations. Persistent data is stored as files on a distributed storage system like GFS[8] or BigTable.

25 Pregel
slide-32
SLIDE 32

Basic Architecture

The Pregel library divides a graph into partitions based on the vertex ID, each consisting of a set of vertices and all of those vertices' outgoing edges. The default partition function is hash(ID) mod N, where N is the number of partitions. The next few slides describe the several stages of the execution of a Pregel program.
26 Pregel
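A minimal standalone sketch of the default assignment just described, with std::hash standing in for whatever hash function Pregel actually uses:

    #include <cstdint>
    #include <functional>

    // Default-style partitioning: partition = hash(ID) mod N.
    uint64_t PartitionOf(uint64_t vertex_id, uint64_t num_partitions) {
      return std::hash<uint64_t>{}(vertex_id) % num_partitions;
    }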
slide-33
SLIDE 33

Pregel Execution

  • 1. Many copies of the user program begin

executing on a cluster of machines. One of these copies acts as the master. The master is not assigned any portion of the graph, but is responsible for coordinating worker activity.

27 Pregel
slide-34
SLIDE 34

Pregel Execution

  • 2. The master determines how many partitions

the graph will have and assigns one or more partitions to each worker machine. Each worker is responsible for maintaining the state of its section of the graph, executing the user’s Compute() method on its vertices, and managing messages to and from other workers.

28 Pregel
slide-35
SLIDE 35

Pregel Execution

29 Pregel


slide-36
SLIDE 36

Pregel Execution

  • 3. The master assigns a portion of the user's input to each worker. The input is treated as a set of records, each of which contains an arbitrary number of vertices and edges. After the input has finished loading, all vertices are marked as active.

30 Pregel
slide-37
SLIDE 37

Pregel Execution

  • 4. The master instructs each worker to perform a superstep. The worker loops through its active vertices and calls Compute() for each active vertex. It also delivers messages that were sent in the previous superstep. When the worker finishes, it responds to the master with the number of vertices that will be active in the next superstep.

31 Pregel
slide-38
SLIDE 38

Pregel Execution

32 Pregel
slide-39
SLIDE 39

Pregel Execution

33 Pregel
slide-40
SLIDE 40

Fault Tolerance

  • Checkpointing is used to implement fault

tolerance.

  – At the start of every superstep, the master may instruct the workers to save the state of their partitions to stable storage.
  – This includes vertex values, edge values, and incoming messages.

  • Master uses “ping“ messages to detect worker

failures.

34 Pregel
slide-41
SLIDE 41

Fault Tolerance

  • When one or more workers fail, their

associated partitions’ current state is lost.

  • Master reassigns these partitions to available

set of workers.

– They reload their partition state from the most recent available checkpoint. This can be many steps old. – The entire system is restarted from this superstep.

  • Confined recovery can be used to reduce this

load

35 Pregel
slide-42
SLIDE 42

Applications

PageRank

36 Pregel
slide-43
SLIDE 43

PageRank

PageRank is a link analysis algorithm that is used to determine the importance of a document based on the number of references to it and the importance of the source documents themselves. [This was named after Larry Page (and not after rank of a webpage)]

37 Pregel
slide-44
SLIDE 44

PageRank

A = a given page
T1 ... Tn = pages that point to page A (citations)
d = damping factor between 0 and 1 (usually kept at 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )

38 Pregel
slide-45
SLIDE 45

PageRank

Courtesy: Wikipedia

39 Pregel
slide-46
SLIDE 46

PageRank

40 Pregel

PageRank can be solved in 2 ways:

  • A system of linear equations
  • An iterative loop till convergence

We look at the pseudo code of iterative version

Initial value of PageRank of all pages = 1.0;
while (sum of PageRank of all pages - numPages > epsilon) {
  for each page Pi in list {
    PageRank(Pi) = (1 - d);
    for each page Pj linking to page Pi {
      PageRank(Pi) += d * (PageRank(Pj) / numOutLinks(Pj));
    }
  }
}

slide-47
SLIDE 47

PageRank in MapReduce – Phase I

Parsing HTML

  • Map task takes (URL, page content) pairs and

maps them to (URL, (PRinit, list-of-urls))

– PRinit is the “seed” PageRank for URL – list-of-urls contains all pages pointed to by URL

  • Reduce task is just the identity function
41 Pregel
slide-48
SLIDE 48

PageRank in MapReduce – Phase 2

PageRank Distribution

  • Map task takes (URL, (cur_rank, url_list))

– For each u in url_list, emit (u, cur_rank/|url_list|) – Emit (URL, url_list) to carry the points-to list along through iterations

  • Reduce task gets (URL, url_list) and many

(URL, val) values

– Sum vals and fix up with d – Emit (URL, (new_rank, url_list))

42 Pregel
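To make the Phase 2 map and reduce concrete, here is a small single-process C++ sketch that simulates one iteration: "emit" is modeled by writing into in-memory maps, the type and function names are illustrative, and d = 0.85.

    #include <map>
    #include <string>
    #include <vector>

    struct PageState { double rank; std::vector<std::string> out_links; };

    // One Phase-2 iteration over all pages (map and reduce collapsed into one pass each).
    std::map<std::string, PageState> PhaseTwo(const std::map<std::string, PageState>& pages,
                                              double d = 0.85) {
      std::map<std::string, double> partial;                      // reduce input: summed (URL, val) pairs
      std::map<std::string, std::vector<std::string>> structure;  // carried points-to lists
      // Map: for each u in url_list, emit (u, cur_rank / |url_list|); also emit (URL, url_list).
      for (const auto& [url, st] : pages) {
        structure[url] = st.out_links;
        for (const auto& u : st.out_links)
          partial[u] += st.rank / st.out_links.size();
      }
      // Reduce: sum the vals, fix up with d, emit (URL, (new_rank, url_list)).
      std::map<std::string, PageState> next;
      for (const auto& [url, links] : structure)
        next[url] = PageState{(1.0 - d) + d * partial[url], links};
      return next;
    }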
slide-49
SLIDE 49

PageRank in MapReduce - Finalize

  • A non-parallelizable component

determines whether convergence has been achieved

  • If so, write out the PageRank lists -

done

  • Otherwise, feed output of Phase 2

into another Phase 2 iteration

43 Pregel
slide-50
SLIDE 50

PageRank in Pregel

class PageRankVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    if (superstep() < 30) {
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();
    }
  }
};

44 Pregel
slide-51
SLIDE 51

PageRank in Pregel

The Pregel implementation contains the PageRankVertex class, which inherits from the Vertex class. The class uses the vertex value type double to store the tentative PageRank and the message type double to carry PageRank fractions. The graph is initialized so that in superstep 0, the value of each vertex is 1.0.

45 Pregel
slide-52
SLIDE 52

PageRank in Pregel

In each superstep, each vertex sends out along each outgoing edge its tentative PageRank divided by the number of outgoing edges. Each vertex also sums up the values arriving on messages into sum and sets its own tentative PageRank to 0.15 + 0.85 × sum. For convergence, either there is a limit on the number of supersteps or aggregators are used to detect convergence.

46 Pregel


slide-53
SLIDE 53

Apache Giraph

Large-scale Graph Processing on Hadoop Claudio Martella <claudio@apache.org> @claudiomartella Hadoop Summit @ Amsterdam - 3 April 2014

slide-54
SLIDE 54

2

slide-55
SLIDE 55

Graphs are simple

3

slide-56
SLIDE 56

A computer network

4

slide-57
SLIDE 57

A social network

5

slide-58
SLIDE 58

A semantic network

6

slide-59
SLIDE 59

A map

7

slide-60
SLIDE 60

Graphs are huge

  • Google's index contains 50B pages
  • Facebook has around 1.1B users
  • Google+ has around 570M users
  • Twitter has around 530M users

VERY rough estimates!

8

slide-61
SLIDE 61

9

slide-62
SLIDE 62

Graphs aren’t easy

10

slide-63
SLIDE 63

Graphs are nasty.

11

slide-64
SLIDE 64

Each vertex depends on its neighbours, recursively.

12

slide-65
SLIDE 65

Recursive problems are nicely solved iteratively.

13

slide-66
SLIDE 66

PageRank in MapReduce

  • Record: < v_i, pr, [ v_j, ..., v_k ] >
  • Mapper: emits < v_j, pr / #neighbours >
  • Reducer: sums the partial values

14

slide-67
SLIDE 67

MapReduce dataflow

15

slide-68
SLIDE 68

Drawbacks

  • Each job is executed N times
  • Job bootstrap
  • Mappers send PR values and structure
  • Extensive IO at input, shuffle & sort, output

16

slide-69
SLIDE 69

17

slide-70
SLIDE 70

Timeline

  • Inspired by Google Pregel (2010)
  • Donated to ASF by Yahoo! in 2011
  • Top-level project in 2012
  • 1.0 release in January 2013
  • 1.1 release expected within days (as of this talk, 2014)

18

slide-71
SLIDE 71

Plays well with Hadoop

19

slide-72
SLIDE 72

Vertex-centric API

20

slide-73
SLIDE 73

BSP machine

21

slide-74
SLIDE 74

BSP & Giraph

22

slide-75
SLIDE 75

Advantages

  • No locks: message-based

communication

  • No semaphores: global

synchronization

  • Iteration isolation: massively

parallelizable

23

slide-76
SLIDE 76

Architecture

24

slide-77
SLIDE 77

Giraph job lifetime

25

slide-78
SLIDE 78

Designed for iterations

  • Stateful (in-memory)
  • Only intermediate values

(messages) sent

  • Hits the disk at input, output,

checkpoint

  • Can go out-of-core

26

slide-79
SLIDE 79

A bunch of other things

  • Combiners (minimises messages)
  • Aggregators (global aggregations)
  • MasterCompute (executed on

master)

  • WorkerContext (executed per worker)
  • PartitionContext (executed per

partition)

27

slide-80
SLIDE 80

Shortest Paths

28

slide-81
SLIDE 81

Shortest Paths

29

slide-82
SLIDE 82

Shortest Paths

30

slide-83
SLIDE 83

Shortest Paths

31

slide-84
SLIDE 84

Shortest Paths

32

slide-85
SLIDE 85

Composable API

33

slide-86
SLIDE 86

Checkpointing

34

slide-87
SLIDE 87

No SPoFs

35

slide-88
SLIDE 88

Giraph scales

36

ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-e dges/10151617006153920
slide-89
SLIDE 89

Giraph is fast

  • 100x over MR (PageRank)
  • jobs run within minutes
  • given you have the resources ;-)

37

slide-90
SLIDE 90

Serialised objects

38

slide-91
SLIDE 91

Primitive types

  • Autoboxing is expensive
  • Object overhead (JVM)
  • Use primitive types on your own
  • Use primitive-type-based libs (e.g. fastutil)

39

slide-92
SLIDE 92

Sharded aggregators

40

slide-93
SLIDE 93

Many stores with Gora

41

slide-94
SLIDE 94

And graph databases

42

slide-95
SLIDE 95

Current and next steps

  • Out-of-core graph and messages
  • Jython interface
  • Remove Writable from < I V E M >
  • Partitioned supernodes
  • More documentation

43

slide-96
SLIDE 96

GraphLab: A New Framework for Parallel Machine Learning

Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov.8, 2010

slide-97
SLIDE 97

Overview

 Programming ML Algorithms in Parallel

  • Common Parallelism and MapReduce
  • Global Synchronization Barriers

 GraphLab

  • Data Dependency as a Graph
  • Synchronization as Fold/Reduce

 Implementation and Experiments
 From Multicore to Distributed Environment

slide-98
SLIDE 98

Parallel Processing for ML

 Parallel ML is a Necessity

  • 13 Million Wikipedia Pages
  • 3.6 Billion photos on Flickr
  • etc

 Parallel ML is Hard to Program

  • Concurrency v.s. Deadlock
  • Load Balancing
  • Debug
  • etc
slide-99
SLIDE 99

MapReduce is the Solution?

 High-level abstraction: Statistical Query

Model [Chu et al, 2006]

Weighted Linear Regression: only sufficient statistics are needed:
θ = A⁻¹ b,  where A = Σᵢ wᵢ (xᵢ xᵢᵀ) and b = Σᵢ wᵢ (xᵢ yᵢ)

slide-100
SLIDE 100

MapReduce is the Solution?

 High-level abstraction: Statistical Query

Model [Chu et al, 2006]

K-Means: only data assignments are needed; each class mean = avg(xᵢ) over the xᵢ in that class.
Embarrassingly parallel: independent computation, no communication needed.

slide-101
SLIDE 101

ML in MapReduce

Multiple Mapper Single Reducer

 Iterative MapReduce needs global

synchronization at the single reducer

  • K-means
  • EM for graphical models
  • gradient descent algorithms, etc
slide-102
SLIDE 102

Not always Embarrassingly Parallel

 Data Dependency: not MapReducable

  • Gibbs Sampling
  • Belief Propagation
  • SVM
  • etc

 Capture Dependency as a Graph!

slide-103
SLIDE 103

Overview

 Programming ML Algorithms in Parallel

  • Common Parallelism and MapReduce
  • Global Synchronization Barriers

 GraphLab

  • Data Dependency as a Graph
  • Synchronization as Fold/Reduce

 Implementation and Experiments
 From Multicore to Distributed Environment

slide-104
SLIDE 104

Key Idea of GraphLab

 Sparse Data Dependencies
 Local Computations


slide-105
SLIDE 105

GraphLab for ML

 High-level Abstract

  • Express data dependencies
  • Iterative

 Automatic Multicore Parallelism

  • Data Synchronization
  • Consistency
  • Scheduling
slide-106
SLIDE 106

Main Components of GraphLab

  • Data Graph
  • Shared Data Table
  • Scheduling
  • Update Functions and Scopes

slide-107
SLIDE 107

Data Graph

 A Graph with data associated with every

vertex and edge.

[Figure annotations: x3 = sample value; C(X3) = sample counts; Φ(X6, X9) = binary potential]


slide-108
SLIDE 108

Update Functions

 Operations applied on a vertex that

transform data in the scope of the vertex

Gibbs Update:

  • Read samples on adjacent

vertices

  • Read edge potentials
  • Compute a new sample for

the current vertex
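GraphLab itself is a C++ framework; the following is only a rough sketch of the shape of such an update function operating on a vertex's scope. The Scope/Scheduler types and accessor names are illustrative stand-ins rather than the real GraphLab API.

    // Rough shape of an Ising-style Gibbs update written against an assumed scope interface.
    void gibbs_update(Scope& scope, Scheduler& sched) {
      double field = 0.0;
      // Read samples on adjacent vertices and the edge potentials (needs vertex-read scope).
      for (EdgeId e : scope.in_edge_ids())
        field += scope.edge_data(e).potential * scope.neighbor_data(e).sample;
      // Compute a new sample for the current vertex (simplified to a deterministic threshold).
      scope.vertex_data().sample = (field >= 0.0) ? 1.0 : -1.0;
      // Reschedule neighbors whose conditional distribution has changed.
      for (EdgeId e : scope.out_edge_ids())
        sched.add_task(scope.neighbor_id(e), gibbs_update);
    }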

slide-109
SLIDE 109

Scope Rules

 Consistency v.s. Parallelism

  • Belief Propagation: Only uses edge data
  • Gibbs Sampling: Needs to read adjacent

vertices

slide-110
SLIDE 110

Scheduling

 Scheduler determines the order of

Update Function evaluations

 Static Scheduling

  • Round Robin, etc

 Dynamic Scheduling

  • FIFO, Priority Queue, etc
slide-111
SLIDE 111

Dynamic Scheduling

[Figure: a shared queue of vertex update tasks (a, b, ..., k) consumed in parallel by CPU 1 and CPU 2]

slide-112
SLIDE 112

Global Information

 Shared Data Table in Shared Memory

  • Model parameters (updatable)
  • Sufficient statistics (updatable)
  • Constants, etc (fixed)

 Sync Functions for Updatable Shared Data

  • Accumulate performs an aggregation over

vertices

  • Apply makes a final modification to the

accumulated data

slide-113
SLIDE 113

Sync Functions

 Much like Fold/Reduce

  • Execute Aggregate over every vertex in turn
  • Execute Apply once at the end

 Can be called

  • Periodically when update functions are active

(asynchronous) or

  • By the update function or user code

(synchronous)
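A small sketch of the Accumulate/Apply pair in the fold/reduce style described above; the type names (VertexData, SharedDataTable) are illustrative assumptions, not the actual GraphLab types.

    // Sketch: a sync that totals a per-vertex counter into the Shared Data Table.
    struct SampleCountSync {
      long total = 0;
      void Accumulate(const VertexData& v) { total += v.sample_count; }        // run over every vertex in turn
      void Apply(SharedDataTable& sdt)     { sdt.set("total_samples", total); } // run once at the end
    };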

slide-114
SLIDE 114

GraphLab

Model: Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes

slide-115
SLIDE 115

Overview

 Programming ML Algorithms in Parallel

  • Common Parallelism and MapReduce
  • Global Synchronization Barriers

 GraphLab

  • Data Dependency as a Graph
  • Synchronization as Fold/Reduce

 Implementation and Experiments
 From Multicore to Distributed Environment

slide-116
SLIDE 116

Implementation and Experiments

 Shared-memory implementation in C++ using Pthreads

 Applications:

  • Belief Propagation
  • Gibbs Sampling
  • CoEM
  • Lasso
  • etc (more on the project page)
slide-117
SLIDE 117

Parallel Performance

[Plot: parallel speedup vs. number of CPUs (2-16), comparing a round-robin schedule and a colored schedule against optimal speedup]

slide-118
SLIDE 118

From Multicore to Distributed Environment

 MapReduce and GraphLab work well for

Multicores

  • Simple High-level Abstract
  • Local computation + global synchronization

 When Migrating to Clusters

  • Rethink Scope → synchronization
  • Rethink Shared Data → single "reducer"
  • Think about Load Balancing
  • Maybe rethink the abstract model?
slide-119
SLIDE 119

22.06.2015 DIMA – TU Berlin 1

Fachgebiet Datenbanksysteme und Informationsmanagement Technische Universität Berlin http://www.dima.tu-berlin.de/

Hot Topics in Information Management PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

Igor Shevchenko Mentor: Sebastian Schelter

slide-120
SLIDE 120

22.06.2015 DIMA – TU Berlin 2

Agenda

  • 1. Natural Graphs: Properties and Problems;
  • 2. PowerGraph: Vertex Cut and Vertex Programs;
  • 3. GAS Decomposition;
  • 4. Vertex Cut Partitioning;
  • 5. Delta Caching;
  • 6. Applications and Evaluation;

Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.

slide-121
SLIDE 121

22.06.2015 DIMA – TU Berlin 3

■ Natural graphs are graphs derived from real-world or natural phenomena;

■ Graphs are big: billions of vertices and edges and rich metadata;

■ Natural graphs have a Power-Law Degree Distribution;

Natural Graphs

slide-122
SLIDE 122

22.06.2015 DIMA – TU Berlin 4

Power-Law Degree Distribution

(Andrei Broder et al. Graph structure in the web)

slide-123
SLIDE 123

22.06.2015 DIMA – TU Berlin 5

■ We want to analyze natural graphs;
■ Essential for Data Mining and Machine Learning;

Goal

Identify influential people and information; identify special nodes and communities; model complex data dependencies; target ads and products; find communities; flow scheduling;

slide-124
SLIDE 124

22.06.2015 DIMA – TU Berlin 6

■ Existing distributed graph computation systems perform poorly on natural graphs (Gonzalez et al. OSDI ’12); ■ The reason is presence of high degree vertices; Problem High Degree Vertices: Star-like motif

slide-125
SLIDE 125

22.06.2015 DIMA – TU Berlin 7

Possible problems with high degree vertices: ■ Limited single-machine resources; ■ Work imbalance; ■ Sequential computation; ■ Communication costs; ■ Graph partitioning; Applicable to: ■ Hadoop; GraphLab; Pregel (Piccolo); Problem Continued

slide-126
SLIDE 126

22.06.2015 DIMA – TU Berlin 8

■ High degree vertices can exceed the memory capacity of a single machine; ■ Store edge meta-data and adjacency information; Problem: Limited Single-Machine Resources

slide-127
SLIDE 127

22.06.2015 DIMA – TU Berlin 9

■ The power-law degree distribution can lead to significant work imbalance and frequent barriers; ■ For ex. with synchronous execution (Pregel): Problem: Work Imbalance

slide-128
SLIDE 128

22.06.2015 DIMA – TU Berlin 10

■ No parallelization of individual vertex-programs; ■ Edges are processed sequentially; ■ Locking does not scale well to high degree vertices (for ex. in GraphLab); Problem: Sequential Computation

Sequentially process edges Asynchronous execution requires heavy locking

slide-129
SLIDE 129

22.06.2015 DIMA – TU Berlin 11

■ Generate and send large amount of identical messages (for ex. in Pregel); ■ This results in communication asymmetry; Problem: Communication Costs

slide-130
SLIDE 130

22.06.2015 DIMA – TU Berlin 12

■ Natural graphs are difficult to partition; ■ Pregel and GraphLab use random (hashed) partitioning on natural graphs thus maximizing the network communication; Problem: Graph Partitioning

slide-131
SLIDE 131

22.06.2015 DIMA – TU Berlin 13

■ Natural graphs are difficult to partition;
■ Pregel and GraphLab use random (hashed) partitioning on natural graphs, thus maximizing the network communication;

Expected fraction of edges that are cut: 1 - 1/p, where p = number of machines.

Examples:
■ 10 machines: 90% of edges cut;
■ 100 machines: 99% of edges cut;

Problem: Graph Partitioning Continued

slide-132
SLIDE 132

22.06.2015 DIMA – TU Berlin 14

■ GraphLab and Pregel are not well suited for computations on natural graphs; Reasons: ■ Challenges of high-degree vertices; ■ Low quality partitioning; Solution: ■ PowerGraph new abstraction; In Summary

slide-133
SLIDE 133

22.06.2015 DIMA – TU Berlin 15

PowerGraph

slide-134
SLIDE 134

22.06.2015 DIMA – TU Berlin 16

Two approaches for partitioning the graph in a distributed environment:

■ Edge Cut; ■ Vertex Cut;

Partition Techniques

slide-135
SLIDE 135

22.06.2015 DIMA – TU Berlin 17

■ Used by Pregel and GraphLab abstractions; ■ Evenly assign vertices to machines; Edge Cut

slide-136
SLIDE 136

22.06.2015 DIMA – TU Berlin 18

■ Used by PowerGraph abstraction; ■ Evenly assign edges to machines; Vertex Cut

The strong point of the paper. [Figure: a high-degree vertex cut so that its edges are split across machines, 4 edges each]

slide-137
SLIDE 137

22.06.2015 DIMA – TU Berlin 19

Think like a Vertex

[Malewicz et al. SIGMOD’10]

User-defined Vertex-Program:

  • 1. Runs on each vertex;
  • 2. Interactions are constrained by graph structure;

Pregel and GraphLab also use this concept, where parallelism is achieved by running multiple vertex programs simultaneously; Vertex Programs

slide-138
SLIDE 138

22.06.2015 DIMA – TU Berlin 20

■ Vertex cut distributes a single vertex-program across several machines; ■ Allows to parallelize high-degree vertices; GAS Decomposition The strong point of the paper

slide-139
SLIDE 139

22.06.2015 DIMA – TU Berlin 21

Generalize the vertex-program into three phases:

  • 1. Gather
  • Accumulate information about neighborhood;
  • 2. Apply
  • Apply accumulated value to center vertex;
  • 3. Scatter
  • Update adjacent edges and vertices;

GAS Decomposition Gather, Apply and Scatter are user-defined functions;

The strong point of the paper
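A sketch of PageRank written in the gather-apply-scatter shape just described. The struct layout, argument types, and the engine.signal() call are illustrative assumptions rather than the exact PowerGraph API.

    // Gather / Apply / Scatter for PageRank (sketch).
    struct PageRankProgram {
      // Gather: executed on the in-edges in parallel; accumulate neighborhood information.
      double gather(const VertexData& u, const EdgeData& e, const VertexData& nbr) const {
        return nbr.rank / nbr.num_out_edges;
      }
      // Apply: executed once on the central vertex with the summed gather results.
      void apply(VertexData& u, double sum) const {
        double new_rank = 0.15 + 0.85 * sum;
        u.changed = (new_rank - u.rank > 1e-4) || (u.rank - new_rank > 1e-4);
        u.rank = new_rank;
      }
      // Scatter: executed on the out-edges in parallel; update/activate adjacent vertices.
      void scatter(Engine& engine, const VertexData& u, const EdgeData& e, VertexData& nbr) const {
        if (u.changed) engine.signal(nbr);   // reschedule only if this vertex moved noticeably
      }
    };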

slide-140
SLIDE 140

22.06.2015 DIMA – TU Berlin 22

■ Executed on the edges in parallel; ■ Accumulate information about neighborhood; Gather Phase

slide-141
SLIDE 141

22.06.2015 DIMA – TU Berlin 23

■ Executed on the central vertex; ■ Apply accumulated value to center vertex; Apply Phase

slide-142
SLIDE 142

Today’s Biz

  • 1. Reminders
  • 2. Review
  • 3. Graph Processing Frameworks
  • 4. 2D Partitioning
7 / 13
slide-143
SLIDE 143

2D Partitioning Aydin Buluc and Kamesh Madduri

8 / 13
slide-144
SLIDE 144

Graph Partitioning for Scalable Distributed Graph Computations

Aydın Buluç Kamesh Madduri

ABuluc@lbl.gov madduri@cse.psu.edu

10th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering February 13-14, 2012 Atlanta, GA

slide-145
SLIDE 145

Overview of our study

  • We assess the impact of graph partitioning for

computations on ‘low diameter’ graphs

  • Does minimizing edge cut lead to lower

execution time?

  • We choose parallel Breadth-First Search as a

representative distributed graph computation

  • Performance analysis on DIMACS Challenge

instances

2
slide-146
SLIDE 146

Key Observations for Parallel BFS

  • Well-balanced vertex and edge partitions do not

guarantee load-balanced execution, particularly for real-world graphs

– Range of relative speedups (8.8-50X, 256-way parallel concurrency) for low-diameter DIMACS graph instances.

  • Graph partitioning methods reduce overall edge cut

and communication volume, but lead to increased computational load imbalance

  • Inter-node communication time is not the dominant

cost in our tuned bulk-synchronous parallel BFS implementation

3
slide-147
SLIDE 147

Talk Outline

  • Level-synchronous parallel BFS on distributed-

memory systems

– Analysis of communication costs

  • Machine-independent counts for inter-node

communication cost

  • Parallel BFS performance results for several

large-scale DIMACS graph instances

4
slide-148
SLIDE 148

Parallel BFS strategies

5
  • 1. Expand current frontier (level-synchronous approach, suited for low-diameter graphs)
    – O(D) parallel steps
    – Adjacencies of all vertices in the current frontier are visited in parallel
  • 2. Stitch multiple concurrent traversals (Ullman-Yannakakis, for high-diameter graphs)
    – Path-limited searches from "super vertices"
    – APSP between "super vertices"

slide-149
SLIDE 149
  • Consider a logical 2D processor grid (pr * pc = p) and

the dense matrix representation of the graph

  • Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)

“2D” graph distribution

[Figure: 9 vertices, 9 processors, 3x3 processor grid; the adjacency matrix is split into sub-matrices, and each processor's sub-matrix is flattened into a local sparse-matrix (per-processor local graph) representation]
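A small self-contained sketch of one reasonable checkerboard assignment of edges to a pr × pc processor grid, in the spirit of the figure above; it illustrates the idea and is not necessarily the exact scheme used in the study.

    #include <cstdint>

    struct Grid2D { uint64_t n, pr, pc; };   // n vertices, pr x pc = p processors

    // Edge (i, j) goes to the processor whose row block contains i and whose column block contains j.
    uint64_t OwnerOfEdge(const Grid2D& g, uint64_t i, uint64_t j) {
      uint64_t rows_per_block = (g.n + g.pr - 1) / g.pr;   // ceil(n / pr)
      uint64_t cols_per_block = (g.n + g.pc - 1) / g.pc;   // ceil(n / pc)
      return (i / rows_per_block) * g.pc + (j / cols_per_block);  // row-major rank in the grid
    }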

slide-150
SLIDE 150

BFS with a 1D-partitioned graph

Steps: 1. Local discovery: Explore adjacencies of vertices in current frontier. 2. Fold: All-to-all exchange of adjacencies. 3. Local update: Update distances/parents for unvisited vertices.

[Example: a small graph whose vertices are distributed over processors P0-P3; each processor stores the adjacency pairs of the vertices it owns]

Consider an undirected graph with n vertices and m edges. Each processor 'owns' n/p vertices and stores their adjacencies (~2m/p per processor, assuming balanced partitions).
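The three steps above can be sketched as a small serial simulation in which the fold (all-to-all) step is modeled by writing adjacencies directly into per-processor inboxes; a real distributed implementation would replace the inboxes with an MPI all-to-all exchange.

    #include <cstdint>
    #include <vector>
    #include <utility>

    using Pair = std::pair<int64_t, int64_t>;   // (discovered vertex, parent)

    // Owner of vertex v when each of p processors holds a contiguous block of ~n/p vertices.
    int Owner(int64_t v, int64_t n, int p) { return static_cast<int>(v / ((n + p - 1) / p)); }

    // One BFS level with a 1D-partitioned graph (serial simulation of p processors).
    void BfsLevel(const std::vector<std::vector<int64_t>>& adj,  // adjacency lists
                  std::vector<int64_t>& parent,                  // -1 means unvisited
                  std::vector<int64_t>& frontier, int p) {
      const int64_t n = static_cast<int64_t>(adj.size());
      std::vector<std::vector<Pair>> inbox(p);
      // 1. Local discovery: explore adjacencies of current-frontier vertices.
      for (int64_t u : frontier)
        for (int64_t v : adj[u])
          inbox[Owner(v, n, p)].push_back({v, u});   // 2. Fold: (v, u) is sent to v's owner.
      // 3. Local update: each owner updates distances/parents of its unvisited vertices.
      std::vector<int64_t> next;
      for (int proc = 0; proc < p; ++proc)
        for (const auto& [v, u] : inbox[proc])
          if (parent[v] == -1) { parent[v] = u; next.push_back(v); }
      frontier.swap(next);
    }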

slide-151
SLIDE 151

BFS with a 1D-partitioned graph

Steps: 1. Local discovery: Explore adjacencies of vertices in current frontier. 2. Fold: All-to-all exchange of adjacencies. 3. Local update: Update distances/parents for unvisited vertices.

1 2 3 6 5 4

Current frontier: vertices 1 (partition Blue) and 6 (partition Green)

  • 1. Local discovery:

[1,0] [6,2] [6, 3] [1,4] [1,6] [6,1] [6,4] P0 P3 P1 P2

No work No work

slide-152
SLIDE 152

BFS with a 1D-partitioned graph

Steps: 1. Local discovery: Explore adjacencies of vertices in current frontier. 2. Fold: All-to-all exchange of adjacencies. 3. Local update: Update distances/parents for unvisited vertices.

1 2 3 6 5 4

Current frontier: vertices 1 (partition Blue) and 6 (partition Green)

  • 2. All-to-all exchange:

[1,0] [6,2] [6, 3] [1,4] [1,6] [6,1] [6,4] P0 P3 P1 P2

No work No work

slide-153
SLIDE 153

BFS with a 1D-partitioned graph

Steps: 1. Local discovery: Explore adjacencies of vertices in current frontier. 2. Fold: All-to-all exchange of adjacencies. 3. Local update: Update distances/parents for unvisited vertices.

1 2 3 6 5 4

Current frontier: vertices 1 (partition Blue) and 6 (partition Green)

  • 2. All-to-all exchange:

[1,0] [6,2] [6, 3] [1,4] [1,6] [6,1] [6,4] P0 P3 P1 P2

slide-154
SLIDE 154

BFS with a 1D-partitioned graph

Steps: 1. Local discovery: Explore adjacencies of vertices in current frontier. 2. Fold: All-to-all exchange of adjacencies. 3. Local update: Update distances/parents for unvisited vertices.

1 2 3 6 5 4

Current frontier: vertices 1 (partition Blue) and 6 (partition Green)

  • 3. Local update:

[1,0] [6,2] [6, 3] [1,4] [1,6] [6,1] [6,4] P0 P3 P1 P2 2, 3 4 Frontier for next iteration

slide-155
SLIDE 155

Modeling parallel execution time

  • Time dominated by local memory references and inter-node

communication

  • Assuming perfectly balanced computation and

communication, we have

12

Local memory references:

  α_{L,n/p} · (n/p) + β_L · (m/p)

  where α_{L,n/p} is the local latency on a working set of size n/p and β_L is the inverse local RAM bandwidth.

Inter-node communication:

  β_{N,a2a(p)} · (edgecut(p) / p)

  where β_{N,a2a(p)} is the inverse all-to-all remote bandwidth with p participating processors.

slide-156
SLIDE 156

BFS with a 2D-partitioned graph

  • Avoid expensive p-way All-to-all

communication step

  • Each process collectively ‘owns’

n/pr vertices

  • Additional ‘Allgather’

communication step for processes in a row

13

Local memory references:

p m p n p m

r c

p n L p n L L , ,

    

Inter-node communication:

c N c r c gather N r N r a a N

p p n p p p p edgecut p                 1 1 ) ( ) (

, 2 ,

slide-157
SLIDE 157

Temporal effects and communication-minimizing tuning prevent us from obtaining tighter bounds

  • The volume of communication can be further reduced by

maintaining state of non-local visited vertices

14

[Figure: local pruning of duplicate adjacencies on each processor prior to the All-to-all step reduces the number of (frontier, adjacency) pairs communicated]

slide-158
SLIDE 158

Predictable BFS execution time for synthetic small-world graphs

  • Randomly permuting vertex IDs ensures load balance on

R-MAT graphs (used in the Graph 500 benchmark).

  • Our tuned parallel implementation for the NERSC Hopper

system (Cray XE6) is ranked #2 on the current Graph 500 list.

15

Buluc & Madduri, Parallel BFS on distributed memory systems, Proc. SC’11, 2011. Execution time is dominated by work performed in a few parallel phases

slide-159
SLIDE 159

Modeling BFS execution time for real-world graphs

  • Can we further reduce communication time

utilizing existing partitioning methods?

  • Does the model predict execution time for

arbitrary low-diameter graphs?

  • We try out various partitioning and graph

distribution schemes on the DIMACS Challenge graph instances

– Natural ordering, Random, Metis, PaToH

16
slide-160
SLIDE 160

Experimental Study

  • The (weak) upper bound on aggregate data volume

communication can be statically computed (based on partitioning of the graph)

  • We determine runtime estimates of

  – Total aggregate communication volume
  – Sum of max. communication volume during each BFS iteration
  – Intra-node computational work balance
  – Communication volume reduction with 2D partitioning

  • We obtain and analyze execution times (at several

different parallel concurrencies) on a Cray XE6 system (Hopper, NERSC)

17
slide-161
SLIDE 161

Orderings for the CoPapersCiteseer graph

18

Natural Random PaToH checkerboard PaToH Metis

slide-162
SLIDE 162

BFS All-to-all phase total communication volume normalized to # of edges (m)

[Chart: communication volume as a % of m vs. # of partitions, per graph, comparing Natural, Random, and PaToH orderings]

19
slide-163
SLIDE 163

Ratio of max. communication volume across iterations to total communication volume

[Chart: ratio of max. per-iteration volume to total volume vs. # of partitions, per graph, comparing Natural, Random, and PaToH orderings]

20
slide-164
SLIDE 164

Reduction in total All-to-all communication volume with 2D partitioning

21

[Chart: All-to-all volume ratio compared to 1D vs. # of partitions, per graph, comparing Natural, Random, and PaToH orderings]

slide-165
SLIDE 165

Edge count balance with 2D partitioning

[Chart: max/avg edge-count ratio vs. # of partitions, per graph, comparing Natural, Random, and PaToH orderings]

slide-166
SLIDE 166

Parallel speedup on Hopper with 16-way partitioning

23
slide-167
SLIDE 167

Execution time breakdown

24

[Charts: BFS time (ms) and communication time (ms) for the eu-2005 and kron-simple-logn18 graphs, broken into Computation, Fold, and Expand components, for the Random-1D, Random-2D, Metis-1D, and PaToH-1D partitioning strategies]

slide-168
SLIDE 168

Imbalance in parallel execution

25

eu-2005, 16 processes* PaToH Random

* Timeline of 4 processes shown in figures. PaToH-partitioned graph suffers from severe load imbalance in computational phases.

slide-169
SLIDE 169

Conclusions

  • Randomly permuting vertex identifiers improves

computational and communication load balance, particularly at higher process concurrencies

  • Partitioning methods reduce overall communication

volume, but introduce significant load imbalance

  • Substantially lower parallel speedup with real-world

graphs compared to synthetic graphs (8.8X vs 50X at 256- way parallel concurrency)

– Points to the need for dynamic load balancing

26
slide-170
SLIDE 170

Today: In class work

◮ Develop 2D partitioning strategy
◮ Implement BFS

Blank code and data available on website (Lecture 24): www.cs.rpi.edu/~slotag/classes/FA16/index.html

9 / 13