
Networked Systems Laboratory (NetSysLab), University of British Columbia



  1. How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu. Networked Systems Laboratory (NetSysLab), University of British Columbia

  2. Networked Systems Laboratory (NetSysLab), University of British Columbia: a golf course … … a (nudist) beach (… and 199 days of rain each year)

  3. Graphs are Everywhere: 1B users, 150B friendships; 100B neurons, 700T connections

  4. Challenges in Graph Processing
       • Poor locality
       • Data-dependent memory access patterns
       • Low compute-to-memory access ratio
       • Large memory footprint (the Graph500 “mini” graph requires 128 GB)
       • Varying degrees of parallelism (both intra- and inter-stage)

  5. Processing Elements Characteristics
       Challenges: poor locality; data-dependent memory access patterns; low compute-to-memory access ratio; large memory footprint (the Graph500 “mini” graph requires 128 GB); varying degrees of parallelism (both intra- and inter-stage)
       CPUs: large caches; >1 TB of memory
       GPUs: caches; massive hardware multithreading; ~16 GB of memory
       Assemble a hybrid platform?

  6. Graph Processing Frameworks
       • Programming Model (Vertex Programming / Linear Algebra)
       • Architecture (Single-node or Distributed; CPU / GPU / Hybrid)
       • High Performance
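As a rough illustration of the two programming models named on this slide, the sketch below runs one PageRank iteration twice on a toy graph: once as a per-vertex function that gathers from in-neighbours, and once as a sparse matrix-vector product. The toy graph, the damping factor of 0.85 and every function name are assumptions made for the example; none of the frameworks in this talk exposes this API, and rank from dangling vertices is simply dropped to keep the code short.

```python
# Toy directed graph as an edge list (src, dst); purely illustrative.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 0), (3, 2), (3, 4), (4, 0), (4, 2)]
N = 5
DAMPING = 0.85  # conventional value, assumed here

out_degree = [0] * N
in_neighbors = [[] for _ in range(N)]
for src, dst in EDGES:
    out_degree[src] += 1
    in_neighbors[dst].append(src)


def pagerank_step_vertex(rank):
    """Vertex programming: every vertex runs the same small function."""
    def vertex_program(v):
        gathered = sum(rank[u] / out_degree[u] for u in in_neighbors[v])
        return (1 - DAMPING) / N + DAMPING * gathered
    return [vertex_program(v) for v in range(N)]


def pagerank_step_linear_algebra(rank):
    """Linear algebra: the same update written as (1-d)/N + d * (M x rank),
    with M[dst][src] = 1/out_degree[src] stored as (row, col, value) triples."""
    matrix = [(dst, src, 1.0 / out_degree[src]) for src, dst in EDGES]
    product = [0.0] * N
    for row, col, value in matrix:  # sparse matrix-vector multiply
        product[row] += value * rank[col]
    return [(1 - DAMPING) / N + DAMPING * p for p in product]


initial = [1.0 / N] * N
by_vertex = pagerank_step_vertex(initial)
by_matrix = pagerank_step_linear_algebra(initial)
```

Both functions compute the same ranks; the difference is only in how the computation is expressed, which is what lets a framework map it to different hardware.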

  7. Motivation: How does the combination of architecture and programming model improve the performance and efficiency of the system as a whole?

  8. Graph Processing Frameworks
       Framework (origin)        Architecture       Programming Model
       Galois (UTexas, Austin)   CPU                Vertex Programming
       GraphMat (Intel)          CPU + Distributed  Linear Algebra
       Gunrock (UC, Davis)       Multi-GPU          Vertex Programming
       Nvgraph (Nvidia)          GPU                Linear Algebra
       Totem (UBC)               CPU + multi-GPU    Vertex Programming

  9. Benchmark Algorithms
       • PageRank
           • Ranking web pages
           • Compute intensive
       • Single Source Shortest Paths (SSSP)
           • IP routing, transportation networks
       • Breadth-First Search (BFS)
           • Finding connected components, subroutine
           • Memory intensive
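For readers who want a concrete picture of two of these benchmarks, here is a minimal sequential sketch of BFS and SSSP. It is not how any of the evaluated frameworks implements them (those run parallel variants on CPUs and GPUs); the function names, the toy graph and the edge weights are assumptions for illustration, and SSSP is shown as a plain Bellman-Ford relaxation loop.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Return {vertex: hop distance from source} for reachable vertices."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adjacency.get(u, []):
            if v not in level:  # first visit fixes the BFS level
                level[v] = level[u] + 1
                frontier.append(v)
    return level


def sssp_bellman_ford(num_vertices, weighted_edges, source):
    """Return shortest distances from source; assumes no negative cycles."""
    dist = [float("inf")] * num_vertices
    dist[source] = 0
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in weighted_edges:  # relax every edge
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:  # converged early
            break
    return dist


# Hypothetical toy inputs.
toy_adjacency = {0: [1], 1: [2, 3], 3: [0, 2, 4], 4: [0, 2]}
toy_weighted_edges = [(0, 1, 4), (1, 2, 1), (1, 3, 2), (3, 0, 1),
                      (3, 2, 5), (3, 4, 3), (4, 0, 2), (4, 2, 1)]

levels = bfs_levels(toy_adjacency, 0)
distances = sssp_bellman_ford(5, toy_weighted_edges, 0)
```

BFS touches each edge once and does almost no arithmetic per edge, which is consistent with the slide calling it memory intensive.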

  10. Evaluation Metrics
       • Raw Performance
           • Traversed Edges Per Second (TEPS): Traversed Edges / Execution Time
       • Energy Consumption
           • Average Power consumed * Execution Time
       • Scalability
           • Strong scaling w.r.t. processing units
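The first two metrics are simple ratios and products, sketched below exactly as the slide defines them. The strong-scaling helper uses the conventional speedup definition (time on one unit divided by time on n units for a fixed problem size), which the slide does not spell out; the input numbers are invented and are not measurements from this study.

```python
def teps(traversed_edges, execution_time_s):
    """Traversed Edges Per Second = traversed edges / execution time."""
    return traversed_edges / execution_time_s


def energy_watt_seconds(average_power_w, execution_time_s):
    """Energy = average power consumed * execution time (watt-seconds)."""
    return average_power_w * execution_time_s


def strong_scaling_speedup(time_one_unit_s, time_n_units_s):
    """Assumed conventional definition: same problem, more processing units."""
    return time_one_unit_s / time_n_units_s


# Hypothetical run: 128 million edges traversed in 0.5 s at 200 W average power.
example_teps = teps(128_000_000, 0.5)
example_energy = energy_watt_seconds(200.0, 0.5)
example_speedup = strong_scaling_speedup(2.0, 0.5)
```

Because energy is power multiplied by time, a platform that draws more power can still use less energy if it finishes sufficiently faster.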

  11. Testbed Characteristics (System 1)
       CPU               2x Intel Xeon E5-2695 v3 (Haswell)
       #CPU Cores        28
       Host Memory       512 GB DDR4
       L3 Cache          70 MB
       PCIe              3.0 – x16
       GPU               2x Nvidia Tesla K40c
       GPU Thread Count  2880
       GPU Memory        12 GB

  12. Datasets
       Type        Graph        #Vertices  #Edges  Max Degree  Avg. Degree
       Real World  Com-Orkut    3 M        234 M   33,313      78
       Real World  LiveJournal  4.8 M      68 M    20,292      14
       Real World  Road-USA     28.8 M     47.9 M  9           1.6
       Real World  Twitter      52 M       3.9 B   3,691,240   75
       Synthetic   RMAT22       4 M        128 M   168,729     32
       Synthetic   RMAT23       8 M        256 M   272,808     32
       Synthetic   RMAT24       16 M       512 M   439,994     32
       Synthetic   RMAT27       128 M      4 B     3,910,241   32
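The "Avg. Degree" column appears to be the edge count divided by the vertex count. The short check below recomputes it from the rounded figures on the slide; because those figures are rounded (and the RMAT sizes are presumably powers of two), the results only match the printed column to within about one unit. The dictionary and function names are made up for this example.

```python
# (vertices, edges, average degree as printed on the slide)
DATASETS = {
    "Com-Orkut": (3_000_000, 234_000_000, 78),
    "LiveJournal": (4_800_000, 68_000_000, 14),
    "Road-USA": (28_800_000, 47_900_000, 1.6),
    "Twitter": (52_000_000, 3_900_000_000, 75),
    "RMAT22": (4_000_000, 128_000_000, 32),
    "RMAT23": (8_000_000, 256_000_000, 32),
    "RMAT24": (16_000_000, 512_000_000, 32),
    "RMAT27": (128_000_000, 4_000_000_000, 32),
}


def average_degree(num_vertices, num_edges):
    """Average degree taken as edges per vertex."""
    return num_edges / num_vertices


recomputed = {name: average_degree(v, e) for name, (v, e, _) in DATASETS.items()}
```

The maximum degree is far above the average on the Twitter and RMAT graphs, while Road-USA is nearly uniform; that spread is the kind of skew a degree-aware framework can exploit.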

  13. WDC, 2012

  14. Memory Consumption (in MB) for RMAT22 graph (edge list size: 512 MB)
       Framework   Memory layout                       PageRank      SSSP          BFS
       Nvgraph     CSC (PageRank, SSSP) and CSR (BFS)  1,159 (1.8x)  1,111 (1.0x)  683 (1.0x)
       Gunrock     CSR and COO                         641 (1.0x)    1,582 (1.4x)  1,443 (2.1x)
       Galois      CSR                                 1,599 (2.5x)  2,074 (1.9x)  1,432 (2.1x)
       GraphMat*   DCSC                                2,818 (4.4x)  2,786 (2.5x)  2,980 (4.4x)
       Totem-2S    CSR                                 1,275 (2.0x)  2,198 (2.0x)  1,282 (1.9x)
       Totem-2S2G  CSR                                 1,628 (2.5x)  2,587 (2.3x)  1,658 (2.4x)
       Callout: 9,354 MB during pre-processing step
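The multipliers in parentheses on this slide are consistent with each framework's footprint divided by the smallest footprint for the same algorithm. The sketch below reproduces them from the MB values on the slide; the variable and function names are illustrative only.

```python
# Memory consumption in MB for RMAT22, copied from the slide.
MEMORY_MB = {
    "Nvgraph": {"PageRank": 1159, "SSSP": 1111, "BFS": 683},
    "Gunrock": {"PageRank": 641, "SSSP": 1582, "BFS": 1443},
    "Galois": {"PageRank": 1599, "SSSP": 2074, "BFS": 1432},
    "GraphMat": {"PageRank": 2818, "SSSP": 2786, "BFS": 2980},
    "Totem-2S": {"PageRank": 1275, "SSSP": 2198, "BFS": 1282},
    "Totem-2S2G": {"PageRank": 1628, "SSSP": 2587, "BFS": 1658},
}


def relative_to_smallest(memory_mb):
    """Map (framework, algorithm) to footprint / smallest footprint, 1 decimal."""
    algorithms = list(next(iter(memory_mb.values())))
    ratios = {}
    for algorithm in algorithms:
        smallest = min(row[algorithm] for row in memory_mb.values())
        for framework, row in memory_mb.items():
            ratios[(framework, algorithm)] = round(row[algorithm] / smallest, 1)
    return ratios


ratios = relative_to_smallest(MEMORY_MB)
```

Read this way, Gunrock is the baseline for PageRank and Nvgraph is the baseline for SSSP and BFS.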

  15. Experimental Results 1. Raw Performance - PageRank. [Figure: bar chart of Billion TEPS / Iteration (axis 0 to 18) for Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S and Totem-2S2G on Orkut, LiveJournal, RMAT22, RMAT23, RMAT24, RMAT27 and Twitter.] Callouts: Fastest: Totem-2S; Nvgraph vs GraphMat

  16. Experimental Results 1. Raw Performance - SSSP. [Figure: bar chart of Billion TEPS (axis 0 to 4.5) for Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S and Totem-2S2G on Orkut, LiveJournal, Road_USA, RMAT22, RMAT24, RMAT27 and Twitter.] Callouts: Fastest: Totem-2S; CSC is suitable for PageRank

  17. Graph Layout in Memory (example graph with vertices 0 to 4)
       CSR Representation
         VertexId  0 1 2 3 4 5*
         rowPtr    0 1 3 3 6 8
         position  0 1 2 3 4 5 6 7
         edgeList  1 2 3 0 2 4 0 2
       CSC Representation
         VertexId  0 1 2 3 4 5*
         colPtr    0 2 3 6 7 8
         position  0 1 2 3 4 5 6 7
         edgeList  3 4 0 1 3 4 1 3
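The two layouts on this slide can be reproduced from an edge list. The sketch below reconstructs the example graph from the CSR arrays (vertex 0 points to 1; vertex 1 to 2 and 3; vertex 3 to 0, 2 and 4; vertex 4 to 0 and 2) and builds both representations with a counting pass and a prefix sum. The function names are assumptions for illustration, not taken from any of the frameworks.

```python
def build_compressed(num_vertices, edges, by_source=True):
    """Return (ptr, adjacency) for a directed edge list of (src, dst) pairs.

    by_source=True  gives CSR: out-neighbours grouped by source vertex.
    by_source=False gives CSC: in-neighbours grouped by destination vertex.
    """
    key, other = (0, 1) if by_source else (1, 0)
    ptr = [0] * (num_vertices + 1)  # one extra entry marks the end
    for edge in edges:
        ptr[edge[key] + 1] += 1     # count edges per vertex
    for v in range(num_vertices):
        ptr[v + 1] += ptr[v]        # prefix sum turns counts into offsets
    adjacency = [0] * len(edges)
    cursor = ptr[:-1].copy()        # next free slot for each vertex
    for edge in sorted(edges):      # sorted so each neighbour list is ascending
        adjacency[cursor[edge[key]]] = edge[other]
        cursor[edge[key]] += 1
    return ptr, adjacency


def neighbors(ptr, adjacency, v):
    """Neighbours of v are the slice between two consecutive pointers."""
    return adjacency[ptr[v]:ptr[v + 1]]


SLIDE_EDGES = [(0, 1), (1, 2), (1, 3), (3, 0), (3, 2), (3, 4), (4, 0), (4, 2)]
row_ptr, csr_edges = build_compressed(5, SLIDE_EDGES, by_source=True)
col_ptr, csc_edges = build_compressed(5, SLIDE_EDGES, by_source=False)
```

CSR gives quick access to a vertex's out-neighbours, and CSC gives quick access to its in-neighbours, which is what a pull-style PageRank update reads.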

  18. Experimental Results 1. Raw Performance - BFS. [Figure: bar chart of Billion TEPS (axis 0 to 120) for Nvgraph, Gunrock, Totem-1G, Galois, GraphMat, Totem-2S and Totem-2S2G on Orkut, LiveJournal, RMAT22, RMAT24, RMAT27 and Twitter.] Callouts: Fastest: Totem-2S; Nvgraph vs GraphMat; CSR suitable for BFS; Hybrid: ~2x

  19. Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload. [Figure: energy (watt-sec, log axis 1 to 1,000) for Nvgraph, Gunrock, Totem-1G, Totem-2S and Totem-2S2G, grouped by PageRank, SSSP and BFS.]

  20. Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload. [Figure: same chart as slide 19.]

  21. Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload. [Figure: energy (watt-second, log axis 1 to 100,000) for Galois, GraphMat, Totem-2S and Totem-2S2G, grouped by PageRank, SSSP and BFS.]

  22. Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload. [Figure: energy (watt-second, log axis 1 to 100,000) for Galois, GraphMat, Totem-2S and Totem-2S2G, grouped by PageRank, SSSP and BFS.] Callout: Energy Efficient: Totem-2S

  23. Summary
       • GPU + Linear Algebra | CPU + Vertex Programming = Good Match
       • GPU based frameworks: ?
       • CPU based frameworks: Totem-2S
       • Totem Hybrid: Greenest
       • CSC: PageRank
       • CSR: BFS, SSSP

  24. Discussion

  25. Does hybrid have future potential? [Figure: execution time (seconds, axis 0 to 18) and energy (Watt-sec, axis 0 to 18,000) for BFS, SSSP and PR, comparing 4S and 2S2G.] Totem-4S vs Totem-2S2G for RMAT30 (edge list size: 128 GB). 4S machine: 4x Intel Xeon E7-4870 v2 (Ivy Bridge), with 1,536 GB memory

  26. Hybrid Graph Processing
       Graph processing challenges: poor locality; data-dependent memory access patterns; low compute-to-memory access ratio; large memory footprint; varying degrees of parallelism (both intra- and inter-stage)
       CPUs: large caches + summary data structures; >1 TB of memory
       GPUs: caches + summary data structures; massive hardware multithreading; 16 GB of memory!
       Labels on the diagram: Low Degree, High Degree
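The "Low Degree" and "High Degree" labels suggest that a hybrid platform splits the vertex set by degree across processor types. The following is only a sketch of such a split under that assumption: the threshold, the function name, and the decision about which processor receives which group are not stated in the slide text and are left to the caller.

```python
def partition_by_degree(degrees, threshold):
    """Split vertex ids into high-degree and low-degree groups.

    degrees[v] is the degree of vertex v; threshold is a hypothetical cutoff.
    """
    high = [v for v, d in enumerate(degrees) if d >= threshold]
    low = [v for v, d in enumerate(degrees) if d < threshold]
    return {"high_degree": high, "low_degree": low}


# Out-degrees of a hypothetical five-vertex graph.
toy_degrees = [1, 2, 0, 3, 2]
partitions = partition_by_degree(toy_degrees, threshold=2)
```

On skewed graphs such as Twitter or the RMAT family, a small high-degree group can account for a large share of the edges, so the cutoff also controls how much memory each side needs.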

  27. Questions. code@: netsyslab.ece.ubc.ca
