The BFS algorithm: Graph500 && GreenGraph500
Kolganov A.S., MSU
Outline:
- The BFS algorithm: Graph500 && GreenGraph500
- Implementation of BFS on shared memory (CPU / GPU)
- Predicted scalability
Breadth-first search (BFS) is one of the most important and fundamental graph-processing algorithms.
Algorithmic difficulties of BFS:
- very few computations per memory access;
- irregular memory access.
The BFS algorithm is used for ranking supercomputers in Graph500 (TEPS: traversed edges per second).
The GreenGraph500 rating of energy-efficient supercomputers ranks by the MTEPS/Watt metric.
Both lists are not yet full:
- Graph500: 201 positions in the list;
- GreenGraph500: 63 positions in the list.
The Graph500 benchmark:
- generate edges;
- build a graph from the edges (timed, included in the table);
- generate 64 random vertices;
- for each vertex: run the BFS algorithm (timed, included in the rating) and check the result;
- print the resulting information.
[Chart: nodes, cores, and scale (each doubling, x2) vs. power consumption: 17.8 MW, 7.8 MW, 12.6 MW, 3.9 MW]
[Chart: nodes, cores, and scale for the Big DATA and Small DATA categories]
GreenGraph500, Small DATA list (top 7):

Rank  MTEPS/W  Machine           Scale  GTEPS   Nodes  Cores
1     815.68   TitanX (GPU)      26     132.14  1      28
2     540.94   Titan (GPU)       25     114.68  1      20
3     445.92   Colonial (GPU)    20     112.18  1      12
4     243.42   Monty Pi-thon     26     35.83   32     128
5     235.15   GraphCREST (ARM)  20     1.03    1      4
6     230.4    GraphCREST (ARM)  20     0.74    1      4
7     204.38   EBD               21     1.64    1      5

GreenGraph500, Big DATA list (top 5):

Rank  MTEPS/W  Machine           Scale  GTEPS   Nodes  Cores
1     62.93    GraphCREST (CPU)  30     31.33   1      32
2     61.48    GraphCREST (CPU)  30     28.61   1      32
3     51.95    GraphCREST (CPU)  32     59.9    1      60
4     48.28    GraphCREST (CPU)  30     31.95   1      32
5     44.42    GraphCREST (CPU)  32     55.74   1      60

Big DATA: scale of 30 and above (256 GB for int64 and 128 GB for int32).
- GTX Titan X: 12 GB; Tesla K80: 24 GB;
- computing scale 30 requires ~192 GB;
- <GTX Titan X> x 16 = 192 GB: 4 kW peak!
- <Tesla K80> x 8 = 192 GB: 2.4 kW peak!
Implementation of BFS on shared memory (CPU / GPU)
Phase 1:
- reconstruction and transformation of the graph;
- loading it into GPU memory.
Phase 2:
- the main loop of the algorithm;
- hybrid BFS (Top-Down + Bottom-Up).
The main ideas were taken from GraphCREST:
"Fast and Energy-efficient Breadth-First Search on a Single NUMA System", 2014.
Transformation to CSR (compressed sparse rows):
[Diagram: COO representation: start vertex, final vertex, and weight arrays]
[Diagram: CSR representation: adj_ptr, adjacency, and weight arrays]
Global sorting of vertices by degree (number of neighbors);
Local sorting of each vertex's neighbors by degree.
Top-Down, synchronized on levels:
[Diagram: vertices V1..Vn at level K (current front) expand to level K+1 (the next iteration front)]
foreach (i = [0, N]) {
    if (levels[i] == lvl - 1) {              // i is on the current front
        foreach (k = [rInd[i], rInd[i+1])) {
            unsigned v = endV[k];
            if (levels[v] == 0) {            // v is not visited yet
                levels[v] = lvl;
                parents[v] = i;
            }
        }
    }
}
Bottom-Up, synchronized on levels:
[Diagram: level K, level K+1]
foreach (i = [0, N]) {
    if (levels[i] == 0) {                    // i is not visited yet
        foreach (k = [rInd[i], rInd[i+1])) {
            unsigned endk = endV[k];
            if (levels[endk] == lvl - 1) {   // found a parent on the front
                parents[i] = endk;
                levels[i] = lvl;
                break;
            }
        }
    }
}
Hybrid algorithm: Top-Down + Bottom-Up (direction optimization).

Edges viewed per level (graph of SCALE 26: |V| = 2^26 = 67,108,864; |E| = 2^30 = 1,073,741,824):

Level   Top-Down        Bottom-Up       Hybrid
0       2               2,103,840,895   2
1       66,206          1,766,587,029   66,206
2       346,918,235     52,677,691      52,677,691
3       1,727,195,615   12,820,854      12,820,854
4       29,557,400      103,184         103,184
5       82,357          21,467          21,467
6       221             21,240          221
Total:  2,103,820,036   3,936,072,360   65,689,625
        100% (= 2x|E|)  187%            3.12%

A significant decrease in the number of edges viewed.
Optimizations:
- CUDA Dynamic Parallelism for load balancing in Top-Down;
- vectorization in each thread;
- aligned reordering for better memory access in Bottom-Up;
- a queue in Bottom-Up at the last iterations.
Results:
- 1st position in GreenGraph500, Small DATA: GTX Titan X, 132 GTEPS, 815 MTEPS/W, SCALE 26;
- 2nd position in GreenGraph500, Small DATA: GTX Titan, 114 GTEPS, 540 MTEPS/W, SCALE 25;
- 15th position in GreenGraph500, Small DATA: Intel Xeon E5, 10.6 GTEPS, 81 MTEPS/W, SCALE 27;
- reached GPU memory bandwidth of 140-150 GB/s (50-60% of peak);
- reached energy consumption of 50% of peak.
Predicted scalability
[Diagram: 2 CPUs + 16x GTX Titan X]
Time of BFS on 1 GPU, SCALE 26: ~8.43 ms.
Total copying GPU->HOST: ~128 MB.
Total copying back HOST->GPU: ~2000 MB.
Time of BFS on 16 GPUs, SCALE 30: ~9 ms; total copying time, SCALE 30: ~140 ms.
Predicted: 115 GTEPS, ~100 MTEPS/W (the current 1st position: 62.93 MTEPS/W).
With NVLink:
Time of BFS on 16 GPUs, SCALE 30: ~9 ms; total copying time, SCALE 30: ~30 ms.
Predicted: 440 GTEPS, ~300 MTEPS/W.