Networked Systems Laboratory (NetSysLab) University of British - PowerPoint PPT Presentation

How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform?
Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia

Networked Systems Laboratory (NetSysLab), University of British Columbia: a golf course … a (nudist) beach (… and 199 days of rain each year)

Graphs are Everywhere: 1B users and 150B friendships; 100B neurons and 700T connections

Challenges in Graph Processing
• Poor locality
• Data-dependent memory access patterns
• Low compute-to-memory access ratio
• Large memory footprint (the Graph500 "mini" graph requires 128 GB)
• Varying degrees of parallelism (both intra- and inter-stage)

Processing Elements Characteristics
• Poor locality: GPUs have caches; CPUs have large caches
• Data-dependent memory access patterns and low compute-to-memory access ratio: GPUs offer massive hardware multithreading
• Large memory footprint (the Graph500 "mini" graph requires 128 GB): GPUs ~16 GB; CPUs >1 TB
• Varying degrees of parallelism (both intra- and inter-stage): assemble a hybrid platform?

Graph Processing Frameworks
• Programming model: Vertex Programming or Linear Algebra
• Architecture: single-node or distributed; CPU, GPU or hybrid
• Goal: high performance

Motivation: How does the combination of architecture and programming model affect the performance and efficiency of the system as a whole?

Graph Processing Frameworks Model

| Framework | Origin | Architecture | Programming Model |
|---|---|---|---|
| Galois | UTexas, Austin | CPU | Vertex Programming |
| GraphMat | Intel | CPU + Distributed | Linear Algebra |
| Gunrock | UC Davis | Multi-GPU | Vertex Programming |
| Nvgraph | Nvidia | GPU | Linear Algebra |
| Totem | UBC | CPU + multi-GPU | Vertex Programming |
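To make the two programming models concrete, here is a minimal Python sketch of one BFS frontier expansion written both ways. It is illustrative only and does not reproduce the API of any framework in the table; the toy graph, `vertex_centric_step` and `linear_algebra_step` are hypothetical names. The vertex-centric version runs a small "program" per active vertex; the linear-algebra version computes the next frontier as a Boolean matrix-vector product (OR of ANDs) masked by the visited set.

```python
# Hypothetical toy graph (out-neighbour lists); same edges as the deck's CSR example.
GRAPH = {0: [1], 1: [2, 3], 2: [], 3: [0, 2, 4], 4: [0, 2]}
N = len(GRAPH)
# Dense Boolean adjacency matrix: ADJ[u][v] is True when edge u -> v exists.
ADJ = [[v in GRAPH[u] for v in range(N)] for u in range(N)]


def vertex_centric_step(graph, frontier, visited):
    """Vertex programming: every active vertex pushes to its unvisited neighbours."""
    next_frontier = set()
    for u in frontier:  # conceptually, one vertex program per active vertex
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                next_frontier.add(v)
    return next_frontier


def linear_algebra_step(adj, frontier_vec, visited_vec):
    """Linear algebra: next = (A^T x) AND NOT visited, over the Boolean (OR, AND) semiring."""
    n = len(frontier_vec)
    next_vec = []
    for v in range(n):
        # (A^T x)[v] = OR over u of (A[u][v] AND x[u])
        reached = any(adj[u][v] and frontier_vec[u] for u in range(n))
        next_vec.append(reached and not visited_vec[v])
    return next_vec


if __name__ == "__main__":
    visited = {0}
    print(vertex_centric_step(GRAPH, {0}, visited))  # {1}
    x = [v == 0 for v in range(N)]
    print(linear_algebra_step(ADJ, x, x))  # [False, True, False, False, False]
```

Both steps produce the same frontier; a real linear-algebra framework would store the matrix in a sparse format rather than the dense `ADJ` used here for brevity.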

Benchmark Algorithms
• PageRank: ranking web pages; compute intensive
• Single Source Shortest Paths (SSSP): IP routing, transportation networks
• Breadth-First Search (BFS): finding connected components, used as a subroutine; memory intensive

Evaluation Metrics
• Raw performance: Traversed Edges Per Second (TEPS) = traversed edges / execution time
• Energy consumption: average power consumed × execution time
• Scalability: strong scaling with respect to processing units
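The metrics reduce to simple arithmetic. A minimal sketch with made-up input numbers; `pagerank_teps_per_iteration` assumes that every PageRank iteration traverses each edge once, which is the usual reading of a per-iteration TEPS figure but is an assumption here. A watt-second equals one joule.

```python
def teps(traversed_edges: int, execution_time_s: float) -> float:
    """Traversed Edges Per Second = traversed edges / execution time."""
    return traversed_edges / execution_time_s


def pagerank_teps_per_iteration(num_edges: int, total_time_s: float, iterations: int) -> float:
    """Assumes each PageRank iteration traverses every edge exactly once."""
    time_per_iteration = total_time_s / iterations
    return num_edges / time_per_iteration


def energy_watt_seconds(avg_power_w: float, execution_time_s: float) -> float:
    """Energy = average power (W) x execution time (s); 1 watt-second = 1 joule."""
    return avg_power_w * execution_time_s


if __name__ == "__main__":
    # Hypothetical measurements, not results from the talk.
    print(teps(4_000_000_000, 2.0) / 1e9, "billion TEPS")   # 2.0 billion TEPS
    print(pagerank_teps_per_iteration(1_000, 10.0, 5))       # 500.0
    print(energy_watt_seconds(250.0, 2.0), "watt-seconds")   # 500.0 watt-seconds
```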

Testbed Characteristics (System 1)

| Component | Specification |
|---|---|
| CPU | 2x Intel Xeon E5-2695 v3 (Haswell) |
| #CPU cores | 28 |
| Host memory | 512 GB DDR4 |
| L3 cache | 70 MB |
| Interconnect | PCIe 3.0 x16 |
| GPU | 2x Nvidia Tesla K40c |
| GPU thread count | 2880 |
| GPU memory | 12 GB |

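Datasets

| Type | Graph | #Vertices | #Edges | Max Degree | Avg. Degree |
|---|---|---|---|---|---|
| Real world | Com-Orkut | 3 M | 234 M | 33,313 | 78 |
| Real world | LiveJournal | 4.8 M | 68 M | 20,292 | 14 |
| Real world | Road-USA | 28.8 M | 47.9 M | 9 | 1.6 |
| Real world | Twitter | 52 M | 3.9 B | 3,691,240 | 75 |
| Synthetic | RMAT22 | 4 M | 128 M | 168,729 | 32 |
| Synthetic | RMAT23 | 8 M | 256 M | 272,808 | 32 |
| Synthetic | RMAT24 | 16 M | 512 M | 439,994 | 32 |
| Synthetic | RMAT27 | 128 M | 4 B | 3,910,241 | 32 |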
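The RMAT rows follow a regular pattern: an RMAT-n graph has 2^n vertices and 32 edges per vertex (the Avg. Degree column), with "M" and "B" read as 2^20 and 2^30. The sketch below reproduces those sizes. The 4-bytes-per-edge figure is an assumption, chosen because it matches the 512 MB edge list quoted for RMAT22 and the 128 GB quoted for RMAT30 elsewhere in the talk.

```python
EDGE_FACTOR = 32     # edges per vertex: the Avg. Degree column for every RMAT row
BYTES_PER_EDGE = 4   # assumption: one 32-bit vertex id stored per edge
MI = 2**20


def rmat_size(scale: int):
    """Return (#vertices, #edges, edge-list size in MiB) for an RMAT-<scale> graph."""
    vertices = 2**scale
    edges = EDGE_FACTOR * vertices
    edge_list_mib = edges * BYTES_PER_EDGE / MI
    return vertices, edges, edge_list_mib


if __name__ == "__main__":
    for scale in (22, 23, 24, 27, 30):
        v, e, mib = rmat_size(scale)
        print(f"RMAT{scale}: {v // MI} Mi vertices, {e / MI:g} Mi edges, {mib:g} MiB")
```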

WDC, 2012

Memory Consumption (in MB) for the RMAT22 graph (edge list size: 512 MB)

| Framework | Memory layout | PageRank | SSSP | BFS |
|---|---|---|---|---|
| Nvgraph | CSC (PageRank, SSSP) and CSR (BFS) | 1,159 (1.8x) | 1,111 (1.0x) | 683 (1.0x) |
| Gunrock | CSR and COO | 641 (1.0x) | 1,582 (1.4x) | 1,443 (2.1x) |
| Galois | CSR | 1,599 (2.5x) | 2,074 (1.9x) | 1,432 (2.1x) |
| GraphMat* | DCSC | 2,818 (4.4x) | 2,786 (2.5x) | 2,980 (4.4x) |
| Totem-2S | CSR | 1,275 (2.0x) | 2,198 (2.0x) | 1,282 (1.9x) |
| Totem-2S2G | CSR | 1,628 (2.5x) | 2,587 (2.3x) | 1,658 (2.4x) |

Factors in parentheses are relative to the lowest value in each column. Gunrock uses 9,354 MB during its pre-processing step.
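The factors in parentheses can be recomputed directly: each entry is divided by the smallest value in its column. A short Python check, where the dictionary simply transcribes the megabyte values from the table.

```python
# Memory consumption (MB) for RMAT22, transcribed from the table: (PageRank, SSSP, BFS).
MEMORY_MB = {
    "Nvgraph": (1159, 1111, 683),
    "Gunrock": (641, 1582, 1443),
    "Galois": (1599, 2074, 1432),
    "GraphMat": (2818, 2786, 2980),
    "Totem-2S": (1275, 2198, 1282),
    "Totem-2S2G": (1628, 2587, 1658),
}


def relative_factors(table):
    """Divide each entry by the smallest value in its column, rounded to one decimal."""
    columns = list(zip(*table.values()))
    minima = [min(col) for col in columns]
    return {
        name: tuple(round(mb / lowest, 1) for mb, lowest in zip(row, minima))
        for name, row in table.items()
    }


if __name__ == "__main__":
    for name, factors in relative_factors(MEMORY_MB).items():
        print(f"{name:<11} PageRank {factors[0]}x  SSSP {factors[1]}x  BFS {factors[2]}x")
```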

Experimental Results: 1. Raw Performance - PageRank
[Figure: billion TEPS per iteration for Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S and Totem-2S2G on Orkut, LiveJournal, RMAT22, RMAT23, RMAT24, RMAT27 and Twitter]
Fastest: Totem-2S. Nvgraph vs GraphMat.

Experimental Results: 1. Raw Performance - SSSP
[Figure: billion TEPS for Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S and Totem-2S2G on Orkut, LiveJournal, Road_USA, RMAT22, RMAT24, RMAT27 and Twitter]
Fastest: Totem-2S. CSC is suitable for PageRank.
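For SSSP, one common formulation on both CPUs and GPUs is frontier-based Bellman-Ford over CSR: each round relaxes the out-edges of vertices whose distance just improved. This is a minimal sketch of that general technique, not the implementation used by any framework in the study. The graph is the 5-vertex CSR example from the Graph Layout in Memory slide; the edge `WEIGHTS` are invented for illustration.

```python
import math

# The deck's 5-vertex example in CSR form; WEIGHTS are invented for illustration.
ROW_PTR = [0, 1, 3, 3, 6, 8]
COL_IDX = [1, 2, 3, 0, 2, 4, 0, 2]
WEIGHTS = [4, 1, 5, 2, 3, 1, 7, 1]


def sssp_csr(row_ptr, col_idx, weights, source):
    """Frontier-based Bellman-Ford: relax out-edges of vertices whose distance improved."""
    n = len(row_ptr) - 1
    dist = [math.inf] * n
    dist[source] = 0
    frontier = {source}
    while frontier:
        next_frontier = set()
        for u in frontier:
            for e in range(row_ptr[u], row_ptr[u + 1]):  # contiguous out-edges of u
                v = col_idx[e]
                if dist[u] + weights[e] < dist[v]:
                    dist[v] = dist[u] + weights[e]
                    next_frontier.add(v)
        frontier = next_frontier
    return dist


if __name__ == "__main__":
    print(sssp_csr(ROW_PTR, COL_IDX, WEIGHTS, 0))  # [0, 4, 5, 9, 10]
```

The loop stops once a round improves no distance, which is guaranteed for non-negative weights.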

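Graph Layout in Memory

CSR representation (out-edges grouped by source vertex):
  VertexId  0  1  2  3  4  5*
  rowPtr    0  1  3  3  6  8
  edgeList  1  2  3  0  2  4  0  2   (indices 0-7)

CSC representation (in-edges grouped by destination vertex):
  VertexId  0  1  2  3  4  5*
  colPtr    0  2  3  6  7  8
  edgeList  3  4  0  1  3  4  1  3   (indices 0-7)

* Sentinel entry: the offset one past the last edge.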
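Both layouts can be built from an edge list with a counting sort: count degrees, prefix-sum them into offsets, then scatter neighbours into place. A minimal sketch that reproduces the slide's arrays; `build_csr` and `build_csc` are hypothetical helper names, and `row_ptr`/`col_idx` correspond to the slide's rowPtr/colPtr and edgeList. CSC needs no separate code because it is simply the CSR of the transposed graph.

```python
# Edges of the slide's example graph as (source, destination) pairs.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 0), (3, 2), (3, 4), (4, 0), (4, 2)]
NUM_VERTICES = 5


def build_csr(num_vertices, edges):
    """Group out-edges by source: row_ptr[v]..row_ptr[v+1] indexes v's neighbours in col_idx."""
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:              # count out-degrees
        row_ptr[src + 1] += 1
    for v in range(num_vertices):     # prefix sum turns counts into offsets
        row_ptr[v + 1] += row_ptr[v]
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1]             # next free slot per vertex (slicing makes a copy)
    for src, dst in edges:            # stable placement keeps input order per vertex
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def build_csc(num_vertices, edges):
    """CSC of a graph is the CSR of its transpose: in-edges grouped by destination."""
    return build_csr(num_vertices, [(dst, src) for src, dst in edges])


if __name__ == "__main__":
    print(build_csr(NUM_VERTICES, EDGES))  # ([0, 1, 3, 3, 6, 8], [1, 2, 3, 0, 2, 4, 0, 2])
    print(build_csc(NUM_VERTICES, EDGES))  # ([0, 2, 3, 6, 7, 8], [3, 4, 0, 1, 3, 4, 1, 3])
```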

Experimental Results: 1. Raw Performance - BFS
[Figure: billion TEPS for Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S and Totem-2S2G on Orkut, LiveJournal, RMAT22, RMAT24, RMAT27 and Twitter]
Fastest: Totem-2S. Nvgraph vs GraphMat. CSR is suitable for BFS. Hybrid: ~2x.
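CSR fits top-down BFS because each visited vertex reads its contiguous run of out-neighbours `col_idx[row_ptr[u]:row_ptr[u+1]]`. A minimal queue-based sketch over the slide's CSR arrays; the scanned-edge counter is one possible numerator for TEPS. Parallel frameworks process whole frontiers at once and some switch to bottom-up traversal, which this sequential sketch does not attempt.

```python
from collections import deque

# The slide's CSR arrays (rowPtr and edgeList).
ROW_PTR = [0, 1, 3, 3, 6, 8]
COL_IDX = [1, 2, 3, 0, 2, 4, 0, 2]


def bfs_csr(row_ptr, col_idx, source):
    """Top-down BFS; returns each vertex's level (-1 if unreachable) and the edges scanned."""
    n = len(row_ptr) - 1
    level = [-1] * n
    level[source] = 0
    queue = deque([source])
    edges_scanned = 0
    while queue:
        u = queue.popleft()
        for e in range(row_ptr[u], row_ptr[u + 1]):  # u's out-neighbours are contiguous
            edges_scanned += 1
            v = col_idx[e]
            if level[v] == -1:
                level[v] = level[u] + 1
                queue.append(v)
    return level, edges_scanned


if __name__ == "__main__":
    print(bfs_csr(ROW_PTR, COL_IDX, 0))  # ([0, 1, 2, 2, 3], 8)
```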

Experimental Results: 2. Energy Consumption - GPU Frameworks - Orkut Workload
[Figure: energy (watt-seconds, log scale) of Nvgraph, Gunrock, Totem-1G, Totem-2S and Totem-2S2G for PageRank, SSSP and BFS]

Experimental Results: 2. Energy Consumption - CPU Frameworks - Twitter Workload
[Figure: energy (watt-seconds, log scale) of Galois, GraphMat, Totem-2S and Totem-2S2G for PageRank, SSSP and BFS]
Energy efficient: Totem-2S

Summary
• GPU + Linear Algebra and CPU + Vertex Programming: good matches
• GPU-based frameworks: ?
• CPU-based frameworks: Totem-2S
• Totem hybrid: greenest
• CSC: PageRank
• CSR: BFS, SSSP
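Why CSC suits PageRank: in the pull formulation each vertex sums contributions from its in-neighbours, which CSC stores contiguously, so every vertex writes only its own rank and no atomic updates are needed. A minimal sketch over the slide's CSC arrays; the damping factor 0.85, the fixed 50 iterations and the uniform redistribution of rank held by vertices without out-edges are assumptions, not settings reported in the talk.

```python
# The slide's CSC arrays (colPtr and edgeList): in-neighbours grouped by destination.
COL_PTR = [0, 2, 3, 6, 7, 8]
ROW_IDX = [3, 4, 0, 1, 3, 4, 1, 3]
# Out-degrees, read off the differences of the CSR rowPtr on the same slide.
OUT_DEGREE = [1, 2, 0, 3, 2]


def pagerank_pull_csc(col_ptr, row_idx, out_degree, damping=0.85, iterations=50):
    """Pull-style PageRank: each vertex sums rank/out-degree over its in-neighbours."""
    n = len(col_ptr) - 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread uniformly (an assumption).
        dangling = sum(rank[u] for u in range(n) if out_degree[u] == 0)
        contrib = [rank[u] / out_degree[u] if out_degree[u] else 0.0 for u in range(n)]
        new_rank = []
        for v in range(n):
            pulled = sum(contrib[row_idx[e]] for e in range(col_ptr[v], col_ptr[v + 1]))
            new_rank.append((1.0 - damping) / n + damping * (pulled + dangling / n))
        rank = new_rank  # each vertex wrote only its own entry: no write conflicts
    return rank


if __name__ == "__main__":
    ranks = pagerank_pull_csc(COL_PTR, ROW_IDX, OUT_DEGREE)
    print([round(r, 3) for r in ranks], round(sum(ranks), 6))
```

A push-style variant over CSR would instead scatter each vertex's contribution to its out-neighbours, which on parallel hardware requires atomic additions.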

Discussion

Does hybrid processing have future potential?
[Figure: execution time (seconds) and energy (watt-seconds) of Totem-4S vs Totem-2S2G for BFS, SSSP and PageRank (PR) on RMAT30 (edge list size: 128 GB)]
4S machine: 4x Intel Xeon E7-4870 v2 (Ivy Bridge) with 1,536 GB memory.

Hybrid Graph Processing
• Poor locality: GPUs use caches + summary data structures; CPUs use large caches + summary data structures
• Data-dependent memory access patterns and low compute-to-memory access ratio: GPUs offer massive hardware multithreading
• Large memory footprint: GPUs 16 GB!; CPUs >1 TB
• Varying degrees of parallelism (both intra- and inter-stage): low-degree vertices on GPUs; high-degree vertices on CPUs
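The hybrid slide pairs low-degree vertices with GPUs (limited memory, massive multithreading) and high-degree vertices with CPUs (large memory and caches). Below is a hypothetical greedy sketch of such a degree-based split, assuming the GPU's capacity can be expressed as an edge budget; `partition_by_degree` and `gpu_edge_budget` are illustrative names, and this is not Totem's partitioner, which the talk does not describe.

```python
def partition_by_degree(out_degree, gpu_edge_budget):
    """Hypothetical greedy split: lowest-degree vertices fill the GPU's edge budget,
    every remaining (higher-degree) vertex stays on the CPU."""
    order = sorted(range(len(out_degree)), key=lambda v: out_degree[v])
    gpu, cpu = [], []
    edges_on_gpu = 0
    for v in order:
        # Once one vertex spills to the CPU, all higher-degree vertices follow it.
        if not cpu and edges_on_gpu + out_degree[v] <= gpu_edge_budget:
            gpu.append(v)
            edges_on_gpu += out_degree[v]
        else:
            cpu.append(v)
    return sorted(gpu), sorted(cpu)


if __name__ == "__main__":
    degrees = [1, 2, 0, 3, 2]  # out-degrees of the deck's 5-vertex example
    print(partition_by_degree(degrees, gpu_edge_budget=5))  # ([0, 1, 2, 4], [3])
```

Keeping the GPU set a contiguous low-degree prefix means the GPU gets many vertices with short, uniform neighbour lists, while the few heavy vertices (and most of the edge memory) stay in host RAM.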

Questions? Code: netsyslab.ece.ubc.ca
