GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC - - PowerPoint PPT Presentation



SLIDE 1

GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems

Dipanjan Sengupta, Kapil Agarwal, Karsten Schwan (CERCS, Georgia Tech)

Shuaiwen Leon Song (Pacific Northwest National Lab)

SLIDE 2

Talk Outline

— Motivation — Background on GAS — Hybrid Programming model — GraphReduce Architecture — Experimental Results — Conclusion — Future Work

SLIDE 3

Motivation

— Why use GPUs? GPU-based frameworks are orders of magnitude faster.

— Previous GPU-based graph processing doesn't handle datasets that don't fit in GPU memory.

— The Yahoo-web graph, with 1.4 billion vertices, requires 6.6 GB of memory just to store its vertex values.

— Several challenges in large-scale graph processing:
  — How to partition the graph?
  — How and when to move the partitions between the host and the GPU?
  — How best to extract multi-level parallelism on GPUs?

SLIDE 4

Background – GAS model

[Figure: the three GAS phases (Gather, Apply, Scatter) illustrated on a vertex v with neighbor vertices U1-U4 and edges a-d, one panel per phase]

— Gather phase: each vertex aggregates values associated with its incoming edges and source vertices.

— Apply phase: each vertex updates its state using the gather result.

— Scatter phase: each vertex updates the state of every outgoing edge.
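As a concrete illustration of the three phases (not from the slides; all names here are illustrative), one GAS superstep for a PageRank-style computation can be sketched in Python:

```python
# Illustrative sketch of one GAS superstep for a PageRank-style
# computation. Assumed data layout: plain dicts, not GraphReduce's.

def gas_superstep(in_edges, out_degree, rank, damping=0.85):
    """in_edges: dict vertex -> list of source vertices (incoming edges).
    out_degree: dict vertex -> number of outgoing edges.
    rank: dict vertex -> current value."""
    new_rank = {}
    for v in rank:
        # Gather: aggregate values from incoming edges / source vertices.
        acc = sum(rank[u] / out_degree[u] for u in in_edges.get(v, []))
        # Apply: update the vertex state using the gather result.
        new_rank[v] = (1 - damping) + damping * acc
    # Scatter is implicit here: the updated rank is what neighbors
    # read along out-edges in the next superstep.
    return new_rank
```

Iterating this to convergence yields the usual PageRank values.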
SLIDE 5

Hybrid Programming Model

— Existing systems choose either a vertex-centric or an edge-centric GAS programming model for graph execution.

— Different processing phases have different types of parallelism and memory-access characteristics.

— GraphReduce adopts a hybrid model combining the vertex-centric and edge-centric models.

Vertex-centric GAS:

    vertex_scatter(vertex v):
        send updates over outgoing edges of v
    vertex_gather(vertex v):
        apply updates from inbound edges of v

    while not done:
        for all vertices v that need to scatter updates:
            vertex_scatter(v)
        for all vertices v that have updates:
            vertex_gather(v)

Edge-centric GAS:

    edge_scatter(edge e):
        send update over e
    update_gather(update u):
        apply update u to u.destination

    while not done:
        for all edges e:
            edge_scatter(e)
        for all updates u:
            update_gather(u)

SLIDE 6

GraphReduce Architecture

SLIDE 7

GraphReduce Architecture Contd…

— Three major components:
  — Partition Engine
  — Data Movement Engine
  — Computation Engine

— The Partition Engine has two responsibilities:
  — Load-balanced shard creation, such that each shard contains an approximately equal number of edges
  — Ordering the edges in a shard by their source or destination vertices, for efficient data movement and memory access
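A minimal sketch of the Partition Engine's two responsibilities, under assumed names and data layout (this is not the actual GraphReduce code): split a (src, dst) edge list into shards with roughly equal edge counts, ordered by destination vertex:

```python
# Illustrative sketch of load-balanced shard creation (assumed names,
# not the actual GraphReduce implementation): equal-sized shards of
# edges, each ordered by source or destination vertex.

def make_shards(edges, num_shards, order_by="dst"):
    """edges: list of (src, dst) pairs. Returns up to num_shards lists
    with approximately equal numbers of edges, ordered by src or dst."""
    key = (lambda e: e[1]) if order_by == "dst" else (lambda e: e[0])
    edges = sorted(edges, key=key)            # global order for locality
    per_shard = max(1, -(-len(edges) // num_shards))  # ceil division
    return [edges[i:i + per_shard] for i in range(0, len(edges), per_shard)]
```

Ordering by destination makes the Gather Reduce phase read each vertex's incoming updates contiguously; ordering by source does the same for Scatter.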

— The Data Movement Engine has the following responsibilities:
  — Moving shards in and out of limited GPU memory to process large-scale graphs
  — Efficiently utilizing GPU hardware resources (CUDA streams and Hyper-Q) to achieve high performance
  — Saturating the data-transfer bandwidth of the PCIe bus connecting the host and the GPUs
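The overlap that CUDA streams provide can be mimicked conceptually on the host with a double-buffered pipeline; the following is an illustrative CPU stand-in (assumed names, not GPU code) for how shard transfer and compute can proceed concurrently:

```python
# Conceptual sketch of double-buffered shard streaming (a CPU stand-in
# for CUDA streams): while shard i is being processed, shard i+1 is
# already being copied in, overlapping "transfer" with "compute".
from concurrent.futures import ThreadPoolExecutor

def stream_shards(shards, copy_to_device, process):
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_to_device, shards[0])
        for i in range(len(shards)):
            on_device = pending.result()        # wait for copy of shard i
            if i + 1 < len(shards):             # prefetch shard i+1
                pending = copier.submit(copy_to_device, shards[i + 1])
            results.append(process(on_device))  # compute on shard i
    return results
```

On a real GPU the same pattern uses `cudaMemcpyAsync` on one stream while kernels run on another, which is what lets the PCIe bus stay saturated.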


SLIDE 8

Compute Engine

— Four phases of computation:
  — Gather Map: fetch all the updates/messages along the in-edges
  — Gather Reduce: reduce all the collected updates for each vertex
  — Apply: apply the update to each vertex
  — Scatter: distribute the updated states of the vertices along the out-edges

— Combination of vertex-centric and edge-centric implementations:
  — Gather Map: edge-centric
  — Gather Reduce: vertex-centric
  — Apply: vertex-centric
  — Scatter: edge-centric
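Putting the four phases together, a host-side sketch of one compute step (illustrative names and data layout; the real engine runs these phases as GPU kernels) might look like:

```python
# Illustrative four-phase compute step (assumed data layout, not the
# actual GraphReduce kernels). Edge-centric phases loop over edges;
# vertex-centric phases loop over vertices.

def compute_step(edges, state, gather, reduce_op, apply_fn, init):
    """edges: list of (src, dst); state: dict vertex -> value."""
    # Gather Map (edge-centric): one message per in-edge.
    messages = [(dst, gather(state[src])) for src, dst in edges]
    # Gather Reduce (vertex-centric): combine messages per vertex.
    acc = {}
    for dst, msg in messages:
        acc[dst] = reduce_op(acc.get(dst, init), msg)
    # Apply (vertex-centric): update each vertex from its accumulator.
    new_state = {v: apply_fn(val, acc.get(v, init)) for v, val in state.items()}
    # Scatter (edge-centric): in this sketch it is implicit -- new_state
    # is what the next step's Gather Map reads along out-edges.
    return new_state
```

With `gather = level + 1`, `reduce_op = min`, and `apply_fn = min`, repeated steps compute BFS levels; other `gather`/`reduce_op`/`apply_fn` choices give PageRank and similar algorithms.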

SLIDE 9

Experimental Setup

— Node configuration:
  — Two Intel Xeon E5-2670 processors running at 2.6 GHz, with 32 GB of RAM
  — NVIDIA Tesla K20c GPU with 4.8 GB of DRAM

SLIDE 10

Benchmarks and Dataset

— Graph algorithms used are BFS and PageRank.

— 9 real-world and synthetic graph datasets, as shown in the table.

SLIDE 11

Results

SLIDE 12

Conclusions

— GraphReduce develops a graph processing framework for input datasets that may or may not fit in GPU memory.

— Adopts a combination of both edge-centric and vertex-centric implementations of the GAS programming model.

— Leverages CUDA streams and hardware support such as Hyper-Q to stream data in and out of the GPU for high performance.

— Outperforms CPU-based out-of-core graph processing frameworks across a variety of real datasets.

SLIDE 13

Future Work

— Extending the GraphReduce framework to multiple nodes in a cluster, using communication models like MPI.

— Addressing the limited on-node memory size through the use of SSDs and other storage devices.

— Processing dynamically evolving graphs.

— Understanding how dynamic profiling could be integrated into GraphReduce.

SLIDE 14

Thank You!