Graph Processing Frameworks
Lecture 24 CSCI 4974/6971 5 Dec 2016
Today's Biz
1. Reminders
2. Review
3. Graph Processing Frameworks
4. 2D Partitioning
Reminders
◮ Assignment 6: due date Dec 8th
◮ Final Project Presentation: December 8th
◮ Project Report: December 11th
  ◮ Intro, Background and Prior Work, Methodology, Experiments, Results
  ◮ Include: report as PDF, compilable source, data (if small)
◮ Office hours: Tuesday & Wednesday 14:00-16:00, Lally 317
  ◮ Or email me for other availability
Quick Review
Graphs on Manycores:
◮ Manycores: Xeon Phis and GPUs
◮ Hundreds to thousands of cores, even more threads
◮ Work balance among threads is king
Pregel: A System for Large-Scale Graph Processing
The Problem
Processing of large graphs is required in modern systems (social networks, Web graphs, etc.).
There are many algorithms of interest, like shortest path, clustering, PageRank, minimum cut, connected components, etc., but there exists no scalable general-purpose system for implementing them.
Characteristics of the algorithms
– They typically exhibit poor locality of memory access.
– They do very little work per vertex.
– They have a changing degree of parallelism over the course of execution.
Refer [1, 2]
Possible solutions
– Craft a custom distributed infrastructure for each new algorithm.
– Rely on an existing distributed computing platform such as MapReduce.
  – These are sometimes used to mine large graphs [3, 4], but they suffer from performance and usability issues.
– Use a single-machine graph library – BGL, LEDA, NetworkX, JDSL, Stanford GraphBase or FGL.
  – Limiting the scale of the graph is necessary.
– Use an existing parallel graph system – the Parallel BGL [5] and CGMgraph [6] – which do not address fault tolerance and other issues.
Pregel
To overcome these challenges, Google came up with Pregel.
The high-level organization of Pregel programs is inspired by Valiant's Bulk Synchronous Parallel model [7].
Pregel 5Message passing model
A pure message passing model has been used,
shared memory because:
sufficient for all graph algorithms
than reading remote values because latency can be amortized by delivering larges batches of messages asynchronously.
Pregel 6Message passing model
Example
Find the largest value of a vertex in a strongly connected graph.
[Figure: finding the largest value in a graph – four vertices with initial values 3, 6, 2, 1; in each superstep every vertex sends its value to its neighbors and adopts the maximum it has seen, until all vertices hold 6. Blue arrows are messages; blue vertices have voted to halt.]
Basic Organization
Computations consist of a sequence of iterations called supersteps.
During a superstep, the framework invokes a user-defined function for each vertex, which specifies the behavior at a single vertex V and a single superstep S. The function can:
– Read messages sent to V in superstep S-1
– Send messages to other vertices that will be received in superstep S+1
– Modify the state of V and of its outgoing edges
– Make topology changes (introduce/delete/modify edges/vertices)
Model of Computation: Entities
VERTEX: identified by a unique identifier; has a modifiable, user-defined value.
EDGE: associated with a source vertex and a target vertex; has a modifiable, user-defined value.
Model of Computation: Progress
In superstep 0, all vertices are active.
– They can go inactive by voting to halt.
– They can be reactivated by an external message from another vertex.
The algorithm terminates when all vertices have voted to halt and there are no messages in transit.
Model of Computation: Vertex
[Figure: state machine for a vertex – active vs. inactive, as described above.]
Comparison with MapReduce
Graph algorithms can be implemented as a series of MapReduce invocations, but this requires passing the entire state of the graph from one stage to the next, which is far more overhead than with Pregel. The Pregel framework also simplifies the programming complexity by using supersteps.
The C++ API
Creating a Pregel program typically involves subclassing the predefined Vertex class.
– Its virtual Compute() method is executed for every active vertex in each superstep.
– Compute() can inspect the value associated with its vertex via GetValue() or modify it using MutableValue().
– It can inspect and modify the values of out-edges using the out-edge iterator.
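For reference, the Vertex base class is declared roughly as follows in the Pregel paper; MessageIterator, OutEdgeIterator, string, and int64 are framework types, so this is an interface sketch rather than compilable code:

template <typename VertexValue,
          typename EdgeValue,
          typename MessageValue>
class Vertex {
 public:
  // Implemented by the user; called once per active vertex per superstep.
  virtual void Compute(MessageIterator* msgs) = 0;

  const string& vertex_id() const;
  int64 superstep() const;

  const VertexValue& GetValue();
  VertexValue* MutableValue();
  OutEdgeIterator GetOutEdgeIterator();

  void SendMessageTo(const string& dest_vertex,
                     const MessageValue& message);
  void VoteToHalt();
};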
The C++ API – Message Passing
Each message consists of a value and the name of the destination vertex.
– The type of the value is specified in the template parameter of the Vertex class.
Any number of messages can be sent in a superstep.
– The framework guarantees delivery and non-duplication, but not in-order delivery.
A message can be sent to any vertex if its identifier is known.
The C++ API – Pregel Code
Pregel code for finding the max value:

class MaxFindVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    double currMax = GetValue();
    SendMessageToAllNeighbors(currMax);
    for ( ; !msgs->Done(); msgs->Next()) {
      if (msgs->Value() > currMax)
        currMax = msgs->Value();
    }
    if (currMax > GetValue())
      *MutableValue() = currMax;
    else
      VoteToHalt();
  }
};
The C++ API – Combiners
Sending a message to a vertex that resides on another machine incurs some overhead.
However, if the algorithm does not require each message explicitly but only a function of them (for example, their sum), then combiners can be used to reduce that overhead. This is done by overriding the virtual Combine() method.
It can be used only for associative and commutative operations.
The C++ API – Combiners Example:
Say we want to count the number of incoming links to all the pages in a set of interconnected pages. In the first iteration, for each link from a vertex (page) we send a message to the destination page. Here, a count function over the incoming messages can be used as a combiner to optimize performance. In the MaxValue example, a Max combiner would reduce the communication load.
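The Combiner API itself is not shown in these slides, so the following stand-alone C++ snippet only illustrates the reduction a Max combiner would perform: all messages destined for one vertex are collapsed to a single value before they cross the network. CombineMax is a made-up helper, not a Pregel call:

#include <algorithm>
#include <iostream>
#include <vector>

// Stand-alone illustration of what a Max combiner computes: all messages
// destined for one vertex are reduced to a single value before sending.
// Assumes at least one message is present.
double CombineMax(const std::vector<double>& messages) {
  double combined = messages.front();
  for (double m : messages) combined = std::max(combined, m);
  return combined;
}

int main() {
  std::vector<double> msgs = {3.0, 6.0, 2.0, 1.0};
  std::cout << CombineMax(msgs) << "\n";  // prints 6, the only value sent
  return 0;
}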
The C++ API – Aggregators
They are used for global communication, monitoring, and data.
Each vertex can produce a value in a superstep S for the aggregator to use; the aggregated value is available to all vertices in superstep S+1.
Aggregators can be used for statistics and for global communication. They can be implemented by subclassing the Aggregator class.
Commutativity and associativity of the aggregation operator are required.
The C++ API – Aggregators Example:
A Sum aggregator applied to the out-edge count of each vertex can be used to generate the total number of edges in the graph and communicate it to all the vertices.
More complex reduction operators can even generate histograms.
In the MaxValue example, we could finish the entire program in a single superstep by using a Max aggregator.
The C++ API – Topology Mutations
The Compute() function can also be used to modify the structure of the graph.
Example: Hierarchical Clustering
Mutations take effect in the superstep after the requests were issued. Ordering of mutations, with
– deletions taking place before additions,
– deletion of edges before vertices, and
– addition of vertices before edges,
resolves most of the conflicts. The rest are handled by user-defined handlers.
Implementation
Pregel is designed for the Google cluster architecture. The architecture schedules jobs to optimize resource allocation, involving killing instances or moving them to different locations. Persistent data is stored as files on a distributed storage system like GFS[8] or BigTable.
Basic Architecture
The Pregel library divides a graph into partitions based on the vertex ID, each consisting of a set of vertices and all of those vertices' outgoing edges. The default partition function is hash(ID) mod N, where N is the number of partitions; a small sketch of this rule is shown below. The next few slides describe the several stages of a Pregel program's execution.
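A minimal sketch of the hash-based partitioning rule just described; std::hash here stands in for Pregel's hash function, which the slides do not specify:

#include <functional>
#include <string>

// Assign a vertex to one of N partitions using hash(ID) mod N.
// std::hash is a stand-in for Pregel's (unspecified) hash function.
int PartitionFor(const std::string& vertex_id, int num_partitions) {
  std::size_t h = std::hash<std::string>{}(vertex_id);
  return static_cast<int>(h % num_partitions);
}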
Pregel Execution
Many copies of the user program begin executing on a cluster of machines. One of these copies acts as the master. The master is not assigned any portion of the graph, but is responsible for coordinating worker activity.
The master determines how many partitions the graph will have and assigns one or more partitions to each worker machine. Each worker is responsible for maintaining the state of its section of the graph, executing the user's Compute() method on its vertices, and managing messages to and from other workers.
[Figure: an example graph with vertices 1-12 divided into partitions assigned across workers.]
The master assigns a portion of the user's input to each worker. The input is treated as a set of records, each of which contains an arbitrary number of vertices and edges. After the input has finished loading, all vertices are marked as active.
The master then instructs each worker to perform a superstep. The worker loops through its active vertices and calls Compute() for each active vertex, delivering the messages that were sent to it in the previous superstep. When the worker finishes, it responds to the master with the number of vertices that will be active in the next superstep.
Fault Tolerance
Checkpointing is used to provide fault tolerance.
– At the start of every superstep, the master may instruct the workers to save the state of their partitions in stable storage.
– This includes vertex values, edge values, and incoming messages.
The master uses regular "ping" messages to detect worker failures.
When one or more workers fail, their associated partitions' current state is lost.
The master reassigns those graph partitions to the currently available set of workers.
– They reload their partition state from the most recent available checkpoint. This can be many supersteps old.
– The entire system is restarted from this superstep.
This recovery can involve substantial recomputation load.
Applications
PageRank
PageRank is a link analysis algorithm that is used to determine the importance of a document based on the number of references to it and the importance of the source documents themselves. (It was named after Larry Page, not after the rank of a webpage.)
PageRank
A = a given page
T1 ... Tn = pages that point to page A (citations)
d = damping factor between 0 and 1 (usually kept as 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )
[Figure: PageRank illustration. Courtesy: Wikipedia]
PageRank can be solved in two ways: as a system of linear equations, or with an iterative loop until convergence.
We look at the pseudocode of the iterative version:

Initial value of PageRank of all pages = 1.0;
While (sum of PageRank of all pages - numPages > epsilon) {
  for each Page Pi in list {
    PageRank(Pi) = (1-d);
    for each page Pj linking to page Pi {
      PageRank(Pi) += d * (PageRank(Pj) / numOutLinks(Pj));
    }
  }
}
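For concreteness, here is a small runnable C++ rendering of the iterative version above. The example graph, the convergence test (total absolute change between passes instead of the sum-minus-numPages test), and the use of a fresh vector per pass are illustrative choices, not taken from the slides:

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // outlinks[j] lists the pages that page j links to (tiny example graph).
  std::vector<std::vector<int>> outlinks = {{1, 2}, {2}, {0}, {0, 2}};
  const int n = static_cast<int>(outlinks.size());
  const double d = 0.85;        // damping factor, as in the slides
  const double epsilon = 1e-8;  // convergence threshold (illustrative)

  // inlinks[i] = pages linking to page i (the T1..Tn of the formula).
  std::vector<std::vector<int>> inlinks(n);
  for (int j = 0; j < n; ++j)
    for (int i : outlinks[j]) inlinks[i].push_back(j);

  std::vector<double> pr(n, 1.0);  // initial PageRank of all pages = 1.0
  double change = 1.0;
  while (change > epsilon) {       // iterate until the ranks stop moving
    change = 0.0;
    std::vector<double> next(n);
    for (int i = 0; i < n; ++i) {
      next[i] = 1.0 - d;
      for (int j : inlinks[i])
        next[i] += d * pr[j] / outlinks[j].size();
      change += std::fabs(next[i] - pr[i]);
    }
    pr = next;
  }

  for (int i = 0; i < n; ++i)
    std::cout << "PR(" << i << ") = " << pr[i] << "\n";
  return 0;
}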
PageRank in MapReduce – Phase I: Parsing HTML
The Map task takes (URL, page content) pairs and maps them to (URL, (PRinit, list-of-urls))
– PRinit is the "seed" PageRank for URL
– list-of-urls contains all pages pointed to by URL
The Reduce task is just the identity function.
PageRank in MapReduce – Phase 2: PageRank Distribution
The Map task takes (URL, (cur_rank, url_list))
– For each u in url_list, emit (u, cur_rank/|url_list|)
– Emit (URL, url_list) to carry the points-to list along through iterations
The Reduce task gets (URL, url_list) and many (URL, val) values
– Sum vals and fix up with d
– Emit (URL, (new_rank, url_list))
PageRank in MapReduce – Finalize
A non-parallel component determines whether convergence has been achieved.
– If so, write out the PageRank lists; done.
– Otherwise, feed the output of Phase 2 into another Phase 2 iteration.
PageRank in Pregel

class PageRankVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    if (superstep() < 30) {
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();
    }
  }
};
The Pregel implementation contains the PageRankVertex class, which inherits from the Vertex class. The class has vertex value type double to store the tentative PageRank, and message type double to carry PageRank fractions. The graph is initialized so that in superstep 0, the value of each vertex is 1.0.
In each superstep, each vertex sends out along each outgoing edge its tentative PageRank divided by the number of outgoing edges. Each vertex also sums up the values arriving on messages into sum and sets its own tentative PageRank to 0.15 + 0.85 * sum. For convergence, either there is a limit on the number of supersteps, or aggregators are used to detect convergence.
Large-scale Graph Processing on Hadoop Claudio Martella <claudio@apache.org> @claudiomartella Hadoop Summit @ Amsterdam - 3 April 2014
Graphs are simple
Examples: a computer network, a social network, a semantic network, a map.
Graphs are huge
Twitter has around 530M users (VERY rough estimates!).
Graphs aren't easy
Graphs are nasty: each vertex depends on its neighbours, recursively.
Recursive problems are nicely solved iteratively.
PageRank in MapReduce
Each vertex distributes its current rank across its #neighbours.
MapReduce dataflow
Drawbacks
– The graph structure must be re-read and re-written at every iteration.
– Each iteration pays the full map, shuffle/sort, and output cost.
Timeline
Plays well with Hadoop
Vertex-centric API
BSP machine
BSP & Giraph
Advantages
– batched (message) communication
– barrier synchronization between supersteps, so no locks or data races
– vertex programs are easily parallelizable
Architecture
Giraph job lifetime
Designed for iterations
– The graph is kept in memory across supersteps; only (messages) are sent over the network.
– State hits disk only at input, output, and checkpoint time.
A bunch of other things
– combiners and aggregators
– a master computation (executed on the master)
– per-worker and per-partition computation (executed per partition)
Shortest Paths
(worked example across several supersteps; see the sketch below)
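The vertex program behind these shortest-paths slides is essentially the single-source shortest paths example from the Pregel paper; below is a sketch in the same C++ style used earlier in this lecture (Giraph's own API is Java, and IsSource and INF are assumed helpers, not defined here):

class ShortestPathVertex : public Vertex<int, int, int> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    // Shortest distance known so far: 0 at the source, "infinity" elsewhere.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = std::min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      // Found a shorter path: record it and relax all out-edges.
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();  // re-activated automatically if a message arrives later
  }
};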
Composable API
Checkpointing
No SPoFs (no single points of failure)
Giraph scales
ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Giraph is fast, given you have the resources ;-)
Serialised objects
Primitive types (e.g. via fastutil)
Sharded aggregators
Many stores with Gora
And graph databases
Current and next steps
GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov.8, 2010
Overview
Programming ML Algorithms in Parallel
GraphLab
Implementation and Experiments
From Multicore to Distributed Environment
Parallel Processing for ML
Parallel ML is a Necessity
Parallel ML is Hard to Program
MapReduce is the Solution?
High-level abstraction: Statistical Query Model [Chu et al, 2006]
Weighted Linear Regression needs only sufficient statistics:
θ = A⁻¹ b, where A = Σᵢ wᵢ (xᵢ xᵢᵀ) and b = Σᵢ wᵢ (xᵢ yᵢ)
MapReduce is the Solution?
High-level abstraction: Statistical Query Model [Chu et al, 2006]
K-Means needs only data assignments: class mean = avg(xᵢ) over the xᵢ in the class.
Embarrassingly parallel, independent computation; no communication needed.
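The point of these Statistical Query Model examples is that per-example statistics can be summed independently over data chunks and then combined, which is exactly the map/reduce pattern. A small stand-alone C++ sketch of that pattern for the k-means class mean (the data, chunking, and names are invented for illustration):

#include <iostream>
#include <vector>

struct Partial { double sum = 0.0; int count = 0; };

// "Map": each worker computes a partial sum/count over its own chunk of the
// points assigned to one class; no communication between workers is needed.
Partial LocalStats(const std::vector<double>& chunk) {
  Partial p;
  for (double x : chunk) { p.sum += x; p.count += 1; }
  return p;
}

int main() {
  // Two chunks of 1-D points that all belong to the same class.
  std::vector<std::vector<double>> chunks = {{1.0, 2.0, 3.0}, {4.0, 5.0}};

  // "Reduce": combine the partial statistics, then take the mean once.
  Partial total;
  for (const auto& c : chunks) {
    Partial p = LocalStats(c);
    total.sum += p.sum;
    total.count += p.count;
  }
  std::cout << "class mean = " << total.sum / total.count << "\n";  // 3
  return 0;
}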
ML in MapReduce
Multiple Mappers, Single Reducer
Iterative MapReduce needs global synchronization at the single reducer.
Not always Embarrassingly Parallel
Data Dependency: not MapReduce-able
Capture the Dependency as a Graph!
Overview
Programming ML Algorithms in Parallel
GraphLab
Implementation and Experiments
From Multicore to Distributed Environment
Key Idea of GraphLab
Sparse Data Dependencies
Local Computations
[Figure: a sparse dependency graph over variables X1-X9.]
GraphLab for ML
High-level Abstraction
Automatic Multicore Parallelism
Main Components of GraphLab:
– Data Graph
– Shared Data Table
– Scheduling
– Update Functions and Scopes
GraphLab Model: Data Graph
A graph with data associated with every vertex and edge.
[Figure: a data graph over vertices X1-X11; for example, x3 is a sample value, C(X3) stores sample counts, and Φ(X6,X9) is a binary potential on the edge between X6 and X9.]
Update Functions
Operations applied on a vertex that transform the data in the scope of the vertex.
Gibbs Update: reads the values of neighboring vertices and resamples the value of the current vertex.
Scope Rules
Consistency vs. Parallelism
The scope of an update (which vertices and edges it may read and write) trades off consistency against parallelism.
Scheduling
The scheduler determines the order of Update Function evaluations.
Static Scheduling
Dynamic Scheduling
[Figure: a dynamic task queue of vertices (a-k) consumed by CPU 1 and CPU 2; completed updates can push new vertex tasks onto the queue.]
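GraphLab's real C++ API is not reproduced in these slides, so the following is only a conceptual sketch of the two ideas above: an update function that reads its scope (the vertex and its neighbours) and a dynamic scheduler that re-queues neighbours whenever an update changes a value. All names and data are illustrative; the update here is a simple max-propagation so the loop is guaranteed to terminate:

#include <algorithm>
#include <iostream>
#include <queue>
#include <vector>

struct Graph {
  std::vector<double> value;                 // data attached to each vertex
  std::vector<std::vector<int>> neighbors;   // undirected adjacency lists
};

// Conceptual update function: read the scope (the vertex and its neighbours)
// and raise the vertex value to the largest value seen in the scope.
// Returns true if the value changed, so the caller can reschedule neighbours.
bool Update(Graph& g, int v) {
  double next = g.value[v];
  for (int u : g.neighbors[v]) next = std::max(next, g.value[u]);
  bool changed = next > g.value[v];
  g.value[v] = next;
  return changed;
}

int main() {
  Graph g;
  g.value = {1.0, 0.0, 0.0, 5.0};
  g.neighbors = {{1}, {0, 2}, {1, 3}, {2}};

  // Dynamic scheduling: seed the task queue with every vertex, and whenever
  // an update changes a value, re-schedule that vertex's neighbours.
  std::queue<int> tasks;
  for (int v = 0; v < (int)g.value.size(); ++v) tasks.push(v);
  while (!tasks.empty()) {
    int v = tasks.front(); tasks.pop();
    if (Update(g, v))
      for (int u : g.neighbors[v]) tasks.push(u);
  }

  for (double x : g.value) std::cout << x << " ";  // all values become 5
  std::cout << "\n";
  return 0;
}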
Global Information
Shared Data Table in Shared Memory
Sync Functions for Updatable Shared Data
– A sync function aggregates data over all vertices into the shared data table (accumulated data).
Sync Functions
– Much like Fold/Reduce
– Can be called in the background (asynchronous) or on demand (synchronous)
GraphLab Model (recap): Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes.
Overview
Programming ML Algorithms in Parallel
GraphLab
Implementation and Experiments
From Multicore to Distributed Environment
Implementation and Experiments
Shared Memory Implementation in C++ using Pthreads
Applications:
Parallel Performance
[Figure: speedup vs. number of CPUs (2-16), comparing a round-robin schedule and a colored schedule against the optimal (linear) speedup; higher is better.]
From Multicore to Distributed Environment
MapReduce and GraphLab work well on multicores.
What happens when we migrate to clusters?
Database Systems and Information Management (DIMA) Group, Technische Universität Berlin, http://www.dima.tu-berlin.de/
Hot Topics in Information Management PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
Igor Shevchenko Mentor: Sebastian Schelter
Agenda
Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
Natural Graphs
■ Natural graphs are graphs derived from real-world phenomena;
■ Graphs are big: billions of vertices and edges and rich metadata;
■ Natural graphs have a Power-Law Degree Distribution.
Power-Law Degree Distribution
(Andrei Broder et al. Graph structure in the web)
Goal
■ We want to analyze natural graphs;
■ Essential for Data Mining and Machine Learning;
Identify influential people and information; identify special nodes and communities; model complex data dependencies; target ads and products; find communities; flow scheduling.
Problem
■ Existing distributed graph computation systems perform poorly on natural graphs (Gonzalez et al. OSDI ’12);
■ The reason is the presence of high-degree vertices.
[Figure: high-degree vertices form a star-like motif.]
Problem Continued
Possible problems with high-degree vertices:
■ Limited single-machine resources;
■ Work imbalance;
■ Sequential computation;
■ Communication costs;
■ Graph partitioning.
Applicable to: Hadoop, GraphLab, Pregel (Piccolo).
Problem: Limited Single-Machine Resources
■ High-degree vertices can exceed the memory capacity of a single machine;
■ The machine must store all of their edge meta-data and adjacency information.
Problem: Work Imbalance
■ The power-law degree distribution can lead to significant work imbalance and frequent barriers;
■ For example, with synchronous execution (Pregel), each superstep must wait for the slowest (highest-degree) vertices.
Problem: Sequential Computation
■ No parallelization of individual vertex-programs;
■ Edges are processed sequentially;
■ Locking does not scale well to high-degree vertices (e.g. in GraphLab).
[Figure: edges are processed sequentially; asynchronous execution requires heavy locking.]
Problem: Communication Costs
■ High-degree vertices generate and send large amounts of identical messages (e.g. in Pregel);
■ This results in communication asymmetry.
Problem: Graph Partitioning
■ Natural graphs are difficult to partition;
■ Pregel and GraphLab use random (hashed) partitioning on natural graphs, thus maximizing the network communication.
Problem: Graph Partitioning Continued
■ Natural graphs are difficult to partition;
■ Pregel and GraphLab use random (hashed) partitioning on natural graphs, thus maximizing the network communication.
With random partitioning, the expected fraction of edges that are cut is 1 - 1/p, where p = number of machines.
Examples: 10 machines: 90% of edges cut; 100 machines: 99% of edges cut.
In Summary
■ GraphLab and Pregel are not well suited for computations on natural graphs;
Reasons: the challenges of high-degree vertices; low quality partitioning.
Solution: PowerGraph, a new abstraction.
PowerGraph
Partition Techniques
Two approaches for partitioning the graph in a distributed environment:
■ Edge Cut;
■ Vertex Cut.
Edge Cut
■ Used by the Pregel and GraphLab abstractions;
■ Evenly assign vertices to machines.
Vertex Cut
■ Used by the PowerGraph abstraction;
■ Evenly assign edges to machines.
[Figure: a high-degree vertex is cut, with its edges split evenly (4 edges / 4 edges) across machines.] The strong point of the paper.
Vertex Programs
"Think like a Vertex" [Malewicz et al. SIGMOD'10]
User-defined Vertex-Program:
Pregel and GraphLab also use this concept, where parallelism is achieved by running multiple vertex-programs simultaneously.
GAS Decomposition
■ A vertex cut distributes a single vertex-program across several machines;
■ This allows parallelizing high-degree vertices. The strong point of the paper.
GAS Decomposition
Generalize the vertex-program into three phases: Gather, Apply, and Scatter, which are user-defined functions. The strong point of the paper.
Gather Phase
■ Executed on the edges in parallel;
■ Accumulates information about the neighborhood.
Apply Phase
■ Executed on the central vertex;
■ Applies the accumulated value to the central vertex.
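To make the gather/apply split concrete, here is a small single-machine C++ simulation of the GAS phases for PageRank. It mirrors the abstraction (gather over in-edges, sum with '+', apply at the vertex), not the actual PowerGraph API; the scatter phase is reduced to a comment, and the graph data is made up:

#include <iostream>
#include <vector>

int main() {
  // Tiny directed graph: in_nbrs[v] lists vertices with an edge into v,
  // out_degree[v] is the number of out-edges of v (illustrative data).
  std::vector<std::vector<int>> in_nbrs = {{2, 3}, {0}, {0, 1, 3}, {}};
  std::vector<int> out_degree = {2, 1, 1, 2};
  const int n = 4;

  std::vector<double> rank(n, 1.0);
  for (int iter = 0; iter < 30; ++iter) {
    std::vector<double> next(n);
    for (int v = 0; v < n; ++v) {
      // Gather: executed per in-edge, collecting rank/out_degree of each
      // in-neighbour; Sum: the gathered values are combined with '+'.
      double acc = 0.0;
      for (int u : in_nbrs[v]) acc += rank[u] / out_degree[u];
      // Apply: executed once on the central vertex with the accumulator.
      next[v] = 0.15 + 0.85 * acc;
      // Scatter (not shown): would run per out-edge, e.g. to re-activate
      // neighbours whose input changed significantly.
    }
    rank = next;
  }

  for (int v = 0; v < n; ++v)
    std::cout << "rank(" << v << ") = " << rank[v] << "\n";
  return 0;
}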
Today’s Biz
2D Partitioning Aydin Buluc and Kamesh Madduri
Graph Partitioning for Scalable Distributed Graph Computations
Aydın Buluç Kamesh Madduri
ABuluc@lbl.gov madduri@cse.psu.edu
10th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering February 13-14, 2012 Atlanta, GA
Overview of our study
Does graph partitioning help scalable distributed-memory graph computations on 'low diameter' graphs?
How do partitioning and vertex ordering affect execution time?
We use breadth-first search (BFS) as a representative distributed graph computation, evaluated on DIMACS Challenge graph instances.
Key Observations for Parallel BFS
Graph partitioning does not guarantee load-balanced execution, particularly for real-world graphs.
– Range of relative speedups (8.8-50X, 256-way parallel concurrency) for low-diameter DIMACS graph instances.
Partitioners reduce the edge cut and communication volume, but lead to increased computational load imbalance.
Reducing communication volume alone does not necessarily reduce the overall cost in our tuned bulk-synchronous parallel BFS implementation.
Talk Outline
– Parallel BFS on distributed-memory systems
– Analysis of communication costs
– 1D vs. 2D partitioning and its effect on communication cost
– Experimental study on large-scale DIMACS graph instances
Parallel BFS strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs): starting from the source vertex, the adjacencies of all vertices in the current frontier are visited in parallel.
2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs): path-limited searches from "super vertices", followed by a search over the "super vertices".
Graph representation: think of the (sparse) adjacency matrix of the graph rather than a dense matrix representation; with a "2D" graph distribution, each processor owns a sub-matrix (the edges that fall within the sub-matrix).
[Figure: 9 vertices on 9 processors in a 3x3 processor grid; flattening the sparse sub-matrices gives the per-processor local graph representation.]
BFS with a 1D-partitioned graph
Steps:
1. Local discovery: explore adjacencies of vertices in the current frontier.
2. Fold: all-to-all exchange of adjacencies.
3. Local update: update distances/parents for unvisited vertices.
[Figure: an example undirected graph on vertices 0-6, distributed over processors P0-P3.]
Per-processor adjacency pairs: [0,1] [0,3] [0,3] [1,0] [1,4] [1,6] [2,3] [2,5] [2,5] [2,6] [3,0] [3,0] [3,2] [3,6] [4,1] [4,5] [4,6] [5,2] [5,2] [5,4] [6,1] [6,2] [6,3] [6,4]
Consider an undirected graph with n vertices and m edges. Each processor "owns" n/p vertices and stores their adjacencies (~2m/p per processor, assuming balanced partitions).
BFS with a 1D-partitioned graph (worked example)
Current frontier: vertices 1 (partition Blue) and 6 (partition Green).
– Local discovery produces the pairs [1,0] [1,4] [1,6] and [6,1] [6,2] [6,3] [6,4]; processors P1 and P2 have no work this level.
– Fold: the pairs are exchanged (all-to-all) so that each pair reaches the owner of its target vertex.
– Local update: newly reached vertices are assigned distances/parents; the frontier for the next iteration is 2, 3, 4.
A sketch of this three-step loop appears below.
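A compact single-process C++ simulation of the three steps above, with the p "processors" represented as per-owner inboxes; in a real distributed code the fold step would be an all-to-all exchange (e.g. MPI). The graph, the block ownership rule (vertex v owned by processor v / (n/p)), and all names are illustrative:

#include <iostream>
#include <utility>
#include <vector>

int main() {
  const int n = 8, p = 4, blk = n / p;
  // Symmetric adjacency lists of an illustrative 8-vertex graph.
  std::vector<std::vector<int>> adj = {
      {1, 3}, {0, 4, 6}, {3, 5, 6}, {0, 2, 6},
      {1, 5}, {2, 4}, {1, 2, 3, 7}, {6}};
  auto owner = [&](int v) { return v / blk; };  // block ownership rule

  std::vector<int> parent(n, -1);
  int source = 0;
  parent[source] = source;
  std::vector<int> frontier = {source};

  while (!frontier.empty()) {
    // 1. Local discovery: explore adjacencies of frontier vertices,
    //    producing (neighbour, parent) pairs.
    // 2. Fold: route each pair to the processor that owns `neighbour`
    //    (simulated here by per-owner inboxes).
    std::vector<std::vector<std::pair<int, int>>> inbox(p);
    for (int u : frontier)
      for (int v : adj[u])
        inbox[owner(v)].push_back({v, u});

    // 3. Local update: each owner marks its unvisited vertices and builds
    //    its piece of the next frontier.
    std::vector<int> next;
    for (int proc = 0; proc < p; ++proc)
      for (auto [v, u] : inbox[proc])
        if (parent[v] == -1) { parent[v] = u; next.push_back(v); }
    frontier = next;
  }

  for (int v = 0; v < n; ++v)
    std::cout << "parent(" << v << ") = " << parent[v] << "\n";
  return 0;
}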
Modeling parallel execution time
Accounting for both local computation and inter-node communication:
Local memory references: terms proportional to n/p and m/p, weighted by the local latency on a working set of size n/p and by the inverse local RAM bandwidth.
Inter-node communication: a term proportional to the per-process edge cut, edgecut(A, p)/p, weighted by the all-to-all remote bandwidth with p participating processors.
BFS with a 2D-partitioned graph
The processors are viewed as a pr x pc grid, replacing the single p-way all-to-all with two smaller collective steps:
– an "expand" communication step involving the frontier information for n/pr vertices;
– a "fold" communication step for the processes in a row.
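One simple way to realize the 2D idea is to view the adjacency matrix as a pr x pc grid of blocks and map each edge (u, v) to the block containing it. The contiguous-range blocking rule below is an assumption made for illustration, not necessarily the exact scheme used in the paper:

#include <iostream>
#include <utility>

// Map an edge (u, v) of an n-vertex graph to a processor in a pr x pc grid
// by blocking rows (source vertices) and columns (destination vertices).
std::pair<int, int> Owner2D(int u, int v, int n, int pr, int pc) {
  int row_block = u / ((n + pr - 1) / pr);  // which block of rows holds u
  int col_block = v / ((n + pc - 1) / pc);  // which block of columns holds v
  return {row_block, col_block};
}

int main() {
  // 9 vertices on a 3x3 processor grid, as in the earlier figure.
  auto [r, c] = Owner2D(4, 7, 9, 3, 3);
  std::cout << "edge (4,7) -> processor (" << r << "," << c << ")\n";  // (1,2)
  return 0;
}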
Local memory references: similar n/p and m/p terms as in the 1D case, plus the cost of maintaining state for non-local visited vertices.
Inter-node communication: the per-process edge cut term is split between a gather step within processor columns and an all-to-all step within processor rows.
Temporal effects and communication-minimizing tuning prevent us from obtaining tighter bounds.
[Figure: with 2D partitioning, duplicate adjacency pairs (e.g. [0,3], [1,6]) can be pruned locally prior to the all-to-all step, reducing communication volume.]
Predictable BFS execution time for synthetic small-world graphs
R-MAT graphs (used in the Graph 500 benchmark).
Our BFS implementation on this system (Cray XE6) is ranked #2 on the current Graph 500 list.
Execution time is dominated by work performed in a few parallel phases.
Buluç & Madduri, Parallel BFS on distributed memory systems, Proc. SC'11, 2011.
Modeling BFS execution time for real-world graphs
Can we reduce communication costs by utilizing existing partitioning methods?
Can we model and predict BFS execution time for arbitrary low-diameter graphs?
We study several partitioning and distribution schemes on the DIMACS Challenge graph instances:
– Natural ordering, Random, Metis, PaToH
Experimental Study
Several communication metrics can be statically computed (based on the partitioning of the graph):
– Total aggregate communication volume
– Sum of max. communication volume during each BFS iteration
– Intra-node computational work balance
– Communication volume reduction with 2D partitioning
We also measure actual BFS performance (at different parallel concurrencies) on a Cray XE6 system (Hopper, NERSC).
Orderings for the CoPapersCiteseer graph
[Figure: adjacency-matrix sparsity plots under Natural, Random, Metis, PaToH, and PaToH checkerboard orderings.]
BFS All-to-all phase total communication volume, normalized to the number of edges (m)
[Chart: % of m vs. number of partitions, per graph, for Natural, Random, and PaToH orderings.]
Ratio of max. communication volume across iterations to total communication volume
[Chart: ratio over total volume vs. number of partitions, per graph, for Natural, Random, and PaToH orderings.]
Reduction in total All-to-all communication volume with 2D partitioning
[Chart: ratio compared to 1D vs. number of partitions, per graph, for Natural, Random, and PaToH orderings.]
Edge count balance with 2D partitioning
[Chart: max/avg. edge-count ratio vs. number of partitions, per graph, for Natural, Random, and PaToH orderings.]
Parallel speedup on Hopper with 16-way partitioning
Execution time breakdown
[Charts: BFS time (ms) split into Computation, Fold, and Expand phases for the Random-1D, Random-2D, Metis-1D, and PaToH-1D partitioning strategies, shown for the eu-2005 and kron-simple-logn18 graphs.]
Imbalance in parallel execution
eu-2005, 16 processes: PaToH vs. Random*
* Timeline of 4 processes shown in the figures. The PaToH-partitioned graph suffers from severe load imbalance in computational phases.
Conclusions
Both computational and communication load balance matter, particularly at higher process concurrencies.
Partitioners reduce communication volume, but introduce significant load imbalance.
Speedups are lower for real-world graphs compared to synthetic graphs (8.8X vs 50X at 256-way parallel concurrency).
– This points to the need for dynamic load balancing.
Today: In-class work
◮ Develop 2D partitioning strategy
◮ Implement BFS
Blank code and data available on website (Lecture 24): www.cs.rpi.edu/~slotag/classes/FA16/index.html