GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC - - PowerPoint PPT Presentation



SLIDE 1

GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems

Dipanjan Sengupta, Kapil Agarwal, Karsten Schwan (CERCS, Georgia Tech)

Shuaiwen Leon Song (Pacific Northwest National Lab)

SLIDE 2

Talk Outline

— Motivation — Background on GAS — Hybrid Programming model — GraphReduce Architecture — Experimental Results — Conclusion — Future Work

SLIDE 3

Motivation

— Why use GPUs? GPU-based frameworks are orders of magnitude faster.

— Previous GPU-based graph processing doesn't handle datasets that don't fit in GPU memory.

— The Yahoo-web graph, with 1.4 billion vertices, requires 6.6 GB of memory just to store its vertex values.

— Several challenges in large-scale graph processing:
  — How to partition the graph?
  — How and when to move the partitions between the host and the GPU?
  — How best to extract multi-level parallelism on GPUs?

SLIDE 4

Background – GAS model

[Figure: the three GAS phases (Gather, Apply, Scatter) illustrated on a vertex v with neighbor vertices U1-U4 and edges a-d, one panel per phase]

— Gather phase: each vertex aggregates values associated with its incoming edges and source vertices.

— Apply phase: each vertex updates its state using the gather result.

— Scatter phase: each vertex updates the state of every outgoing edge.
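As a concrete illustration of the three phases (not from the slides; all names here are illustrative), one GAS superstep for a PageRank-style computation can be sketched in Python:

```python
# Illustrative sketch of one GAS superstep for a PageRank-style
# computation. Assumed data layout: plain dicts, not GraphReduce's.

def gas_superstep(in_edges, out_degree, rank, damping=0.85):
    """in_edges: dict vertex -> list of source vertices (incoming edges).
    out_degree: dict vertex -> number of outgoing edges.
    rank: dict vertex -> current value."""
    new_rank = {}
    for v in rank:
        # Gather: aggregate values from incoming edges / source vertices.
        acc = sum(rank[u] / out_degree[u] for u in in_edges.get(v, []))
        # Apply: update the vertex state using the gather result.
        new_rank[v] = (1 - damping) + damping * acc
    # Scatter is implicit here: the updated rank is what neighbors
    # read along out-edges in the next superstep.
    return new_rank
```

Iterating this to convergence yields the usual PageRank values.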
SLIDE 5

Hybrid Programming Model

— Existing systems choose either a vertex-centric or an edge-centric GAS programming model for graph execution.

— Different processing phases have different types of parallelism and memory-access characteristics.

— GraphReduce adopts a hybrid model combining the vertex-centric and edge-centric models.

Vertex-centric GAS:

    vertex_scatter(vertex v):
        send updates over outgoing edges of v
    vertex_gather(vertex v):
        apply updates from inbound edges of v

    while not done:
        for all vertices v that need to scatter updates:
            vertex_scatter(v)
        for all vertices v that have updates:
            vertex_gather(v)

Edge-centric GAS:

    edge_scatter(edge e):
        send update over e
    update_gather(update u):
        apply update u to u.destination

    while not done:
        for all edges e:
            edge_scatter(e)
        for all updates u:
            update_gather(u)

SLIDE 6

GraphReduce Architecture

SLIDE 7

GraphReduce Architecture Contd…

— Three major components:
  — Partition Engine
  — Data Movement Engine
  — Computation Engine

— The Partition Engine has two responsibilities:
  — Load-balanced shard creation, such that each shard contains an approximately equal number of edges
  — Ordering the edges in a shard by their source or destination vertices, for efficient data movement and memory access
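A minimal sketch of the Partition Engine's two responsibilities, under assumed names and data layout (this is not the actual GraphReduce code): split a (src, dst) edge list into shards with roughly equal edge counts, ordered by destination vertex:

```python
# Illustrative sketch of load-balanced shard creation (assumed names,
# not the actual GraphReduce implementation): equal-sized shards of
# edges, each ordered by source or destination vertex.

def make_shards(edges, num_shards, order_by="dst"):
    """edges: list of (src, dst) pairs. Returns up to num_shards lists
    with approximately equal numbers of edges, ordered by src or dst."""
    key = (lambda e: e[1]) if order_by == "dst" else (lambda e: e[0])
    edges = sorted(edges, key=key)            # global order for locality
    per_shard = max(1, -(-len(edges) // num_shards))  # ceil division
    return [edges[i:i + per_shard] for i in range(0, len(edges), per_shard)]
```

Ordering by destination makes the Gather Reduce phase read each vertex's incoming updates contiguously; ordering by source does the same for Scatter.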

— The Data Movement Engine has the following responsibilities:
  — Moving shards in and out of limited GPU memory to process large-scale graphs
  — Efficiently utilizing GPU hardware resources (CUDA streams and Hyper-Q) to achieve high performance
  — Saturating the data-transfer bandwidth of the PCIe bus connecting the host and the GPUs
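The overlap that CUDA streams provide can be mimicked conceptually on the host with a double-buffered pipeline; the following is an illustrative CPU stand-in (assumed names, not GPU code) for how shard transfer and compute can proceed concurrently:

```python
# Conceptual sketch of double-buffered shard streaming (a CPU stand-in
# for CUDA streams): while shard i is being processed, shard i+1 is
# already being copied in, overlapping "transfer" with "compute".
from concurrent.futures import ThreadPoolExecutor

def stream_shards(shards, copy_to_device, process):
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_to_device, shards[0])
        for i in range(len(shards)):
            on_device = pending.result()        # wait for copy of shard i
            if i + 1 < len(shards):             # prefetch shard i+1
                pending = copier.submit(copy_to_device, shards[i + 1])
            results.append(process(on_device))  # compute on shard i
    return results
```

On a real GPU the same pattern uses `cudaMemcpyAsync` on one stream while kernels run on another, which is what lets the PCIe bus stay saturated.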


SLIDE 8

Compute Engine

— Four phases of computation:
  — Gather Map: fetch all the updates/messages along the in-edges
  — Gather Reduce: reduce all the collected updates for each vertex
  — Apply: apply the update to each vertex
  — Scatter: distribute the updated states of the vertices along the out-edges

— Combination of vertex-centric and edge-centric implementations:
  — Gather Map: edge-centric
  — Gather Reduce: vertex-centric
  — Apply: vertex-centric
  — Scatter: edge-centric
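Putting the four phases together, a host-side sketch of one compute step (illustrative names and data layout; the real engine runs these phases as GPU kernels) might look like:

```python
# Illustrative four-phase compute step (assumed data layout, not the
# actual GraphReduce kernels). Edge-centric phases loop over edges;
# vertex-centric phases loop over vertices.

def compute_step(edges, state, gather, reduce_op, apply_fn, init):
    """edges: list of (src, dst); state: dict vertex -> value."""
    # Gather Map (edge-centric): one message per in-edge.
    messages = [(dst, gather(state[src])) for src, dst in edges]
    # Gather Reduce (vertex-centric): combine messages per vertex.
    acc = {}
    for dst, msg in messages:
        acc[dst] = reduce_op(acc.get(dst, init), msg)
    # Apply (vertex-centric): update each vertex from its accumulator.
    new_state = {v: apply_fn(val, acc.get(v, init)) for v, val in state.items()}
    # Scatter (edge-centric): in this sketch it is implicit -- new_state
    # is what the next step's Gather Map reads along out-edges.
    return new_state
```

With `gather = level + 1`, `reduce_op = min`, and `apply_fn = min`, repeated steps compute BFS levels; other `gather`/`reduce_op`/`apply_fn` choices give PageRank and similar algorithms.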

SLIDE 9

Experimental Setup

— Node configuration:
  — Two Intel Xeon E5-2670 processors running at 2.6 GHz, with 32 GB of RAM
  — NVIDIA Tesla K20c GPU with 4.8 GB of DRAM

SLIDE 10

Benchmarks and Dataset

— Graph algorithms used are BFS and PageRank.

— 9 real-world and synthetic graph datasets, as shown in the table.

SLIDE 11

Results

SLIDE 12

Conclusions

— GraphReduce develops a graph processing framework for input datasets that may or may not fit in GPU memory.

— Adopts a combination of both edge-centric and vertex-centric implementations of the GAS programming model.

— Leverages CUDA streams and hardware support such as Hyper-Q to stream data in and out of the GPU for high performance.

— Outperforms CPU-based out-of-core graph processing frameworks across a variety of real datasets.

SLIDE 13

Future Work

— Extending the GraphReduce framework to multiple nodes in a cluster, using communication models like MPI.

— Addressing the limited on-node memory size through the use of SSDs and other storage devices.

— Processing dynamically evolving graphs.

— Understanding how dynamic profiling could be integrated into GraphReduce.

SLIDE 14

Thank You!