 
              Kolganov A.S., MSU
 The BFS algorithm  Graph500 && GGraph500  Implementation of BFS on shared memory (CPU / GPU)  Predicted scalability 2
 The BFS algorithm  Graph500 && GGraph500  Implementation of BFS on shared memory (CPU / GPU)  Predicted scalability 3
 Breadth-first search – one of the most important and fundamental processing algorithms in graphs;  Алгоритмические трудности BFS: ◦ Very few computations; ◦ An irregular memory access. 4
 The BFS algorithm  Graph500 && GGraph500  Implementation of BFS on shared memory (CPU / GPU)  Predicted scalability 5
 Using BFS algorithm for ranking supercomputers (TEPS PS – traversed edges per second);  Using the MTEPS / WATT metrics for ranking in the GreenGraph500 rating of energy-efficient supercomputers ;  The both lists have not yet been filled: ◦ Graph500 (201 positions in the list); ◦ GreenGraph500 (63 positions in the list). 6
 Generating of edges;  Building a graph from edges (timed, included in the table);  Generating of 64 random vertices;  For each vertex: ◦ Running BFS algorithm; (timed, included in the rating); ◦ Checking the result;  Printing the resulting information. 7
nodes cores scale 8
nodes cores scale 12,6 2,6 MW 7,8 ,8 MW 3,9 ,9 MW 17,8 ,8 MW x2 x2 x2 x2 9
Big DATA Rank nk MTEPS/W PS/W Mach chine ne Sca cale le GTEPS PS Nodes es Cores es 1 62,93 GraphCREST (CPU) 30 31,33 1 32 2 61,48 GraphCREST (CPU) 30 28,61 1 32 3 51,95 GraphCREST (CPU) 32 59,9 1 60 4 48,28 GraphCREST (CPU) 30 31,95 1 32 5 44,42 GraphCREST (CPU) 32 55,74 1 60 Small DATA Rank nk MTEPS/W PS/W Mach chine ne Sca cale le GTEPS PS Nodes es Cores es 1 815,68 TitanX (GPU PU) 26 132,14 1 28 2 540,94 Titan (GPU PU) 25 114,68 1 20 3 445,92 Colonial (GPU PU) 20 112,18 1 12 4 243,42 Monty Pi-thon 26 35,83 32 128 5 235,15 GraphCREST (ARM) 20 1,03 1 4 6 230,4 GraphCREST (ARM) 20 0,74 1 4 7 204,38 EBD 21 1,64 1 5 10
Big DATA: scale up to 30 (256 ГБ for int64 and 128 ГБ for int32) Rank nk MTEPS/W PS/W Mach chine ne Sca cale le GTEPS PS Nodes es Cores es 1 62,93 GraphCREST (CPU) 30 31,33 1 32 2 61,48 GraphCREST (CPU) 30 28,61 1 32 3 51,95 GraphCREST (CPU) 32 59,9 1 60 4 48,28 GraphCREST (CPU) 30 31,95 1 32 5 44,42 GraphCREST (CPU) 32 55,74 1 60 Small DATA Rank nk MTEPS/W PS/W Mach chine ne Sca cale le GTEPS PS Nodes es Cores es 1 815,68 TitanX (GPU PU) 26 132,14 1 28 2 540,94 Titan (GPU PU) 25 114,68 1 20 3 445,92 Colonial (GPU PU) 20 112,18 1 12 4 243,42 Monty Pi-thon 26 35,83 32 128 5 235,15 GraphCREST (ARM) 20 1,03 1 4 6 230,4 GraphCREST (ARM) 20 0,74 1 4 7 204,38 EBD 21 1,64 1 5 11
Big DATA: scale up to 30 (256 ГБ for int64 and 128 ГБ for int32)  GTX Titan X – 12ГБ, Tesla K80 – 24ГБ ;  For computing scale 30 needed ~192 ГБ ;  <GTX Titan X> x 16 = 192 ГБ 4 kW peak!  <Tesla K80> x 8 = 192 ГБ 2.4 kW peak! Small DATA Rank nk MTEPS/W PS/W Mach chine ne Sca cale le GTEPS PS Nodes es Cores es 1 815,68 TitanX (GPU PU) 26 132,14 1 28 2 540,94 Titan (GPU PU) 25 114,68 1 20 3 445,92 Colonial (GPU PU) 20 112,18 1 12 4 243,42 Monty Pi-thon 26 35,83 32 128 5 235,15 GraphCREST (ARM) 20 1,03 1 4 6 230,4 GraphCREST (ARM) 20 0,74 1 4 7 204,38 EBD 21 1,64 1 5 12
 The BFS algorithm  Graph500 && GGraph500  Implementation of BFS on shared memory (CPU / GPU)  Predicted scalability 13
 Phase 1: ◦ reconstruction and transformation of graph; ◦ loading to GPU memory;  Phase 2: ◦ The main cycle of algorithm; ◦ Use the hybrid BFS (Top Down + Bottom Up).  The main ideas were taken from GraphCREST: « Fast and Energy-efficient Breadth-First Search on a Single NUMA System , 2014» 14
 Transformation to CSR (compressed sparse rows) COO CSR start vertex adj_ptr …….. …….. final vertex adjacency weights weights 15
 Global sorting of vertices by the degree of connectedness 16
 Local sorting of neighbors by the degree of connectedness V1 V2 V3 .. .. .. .. Vn V1 V2 V3 .. .. .. .. Vn 17
 Synchronized on levels To Top-Down own foreach (i = [0, N]) Next { iteration foreach (k = =[rInd[i], rInd[i+1] ) front { unsigned v = endV[k]; if (levels[v] == 0) { levels[v] = lvl; Current Level K+1 parents[v] = i; front } } Level K } 18
 Synchronized on levels Bottom-Up Up foreach ach (i = [0, N]) { if if (levels[i] == 0) { foreach ach (k=[rInd[i], rInd[i+1] ]) { unsigned endk = endV[k]; if if (levels[endk] == lvl - 1) { parents[i] = endk; levels[i] = lvl; break ak; Level K+1 } } } Level K } 19
 Hybrid algorithm: Top-Down wn + Bottom-Up Up (direction optimization) The graph SCALE 26 |V| = 2^26 (67,108,864) |E| = 2^30 (1,073,741,824) Leve vel Top-Down Bottom-Up Hybrid Hybrid 0 2 2,103,840,895 2 1 66,206 206 1,766,587,029 66,206 2 346,918,235 52,677 677,69 691 52,677,691 3 1,727,195,615 12,820 820,85 854 12,820,854 4 29,557,400 103, 3,184 184 103,184 5 82,357 21,467 467 21,467 6 221 221 21,240 221 Total: 2,103,820,036 3,936,072,360 65,689,625 100% 187% 3.12% = 2x|E| A significant decrease the number of edges viewed
 Using the CUDA Dynamic Parallelism for balancing load in Top-Down;  Using vectorization in each thread;  Using the align reordering for better access to memory in Bottom-Up;  Using queue in Bottom-Up at the last iterations. 21
 The first position in GGraph500: Small DATA: GT GTX Titan X – 132 GTEPS, 815 MTEPS/W, SCALE:26 26;  The second position in GGraph500: Small DATA: GT GTX Titan – 114 GTEPS, 540 MTEPS/W, SCALE:25 25;  The 15th position in GGraph500: Small DATA: Intel Xeon E5 – 10.6 GTEPS, 81 MTEPS/W, SCALE:27 27;  Reached memory bandwidth GP GPU at 140-150 GB/s (50- 60% of peak);  Reached energy consumption at 50% of peak. 22
 The BFS algorithm  Graph500 && GGraph500  Implementation of BFS on shared memory (CPU / GPU)  Predicted scalability 23
The time of BFS on 1GPU GTX Titan X GTX Titan X SCALE 26 ~ 8.43m 3ms GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X CPU CPU GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X 24
GTX Titan X GTX Titan X The all coping GTX Titan X GTX Titan X GPU->HOST: ~128 МБ GTX Titan X GTX Titan X GTX Titan X GTX Titan X CPU CPU GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X 25
GTX Titan X GTX Titan X The all back coping GTX Titan X GTX Titan X HOST->GPU: ~2000 МБ GTX Titan X GTX Titan X GTX Titan X GTX Titan X CPU CPU GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X GTX Titan X 26
The time of BFS on 16 GPU GTX Titan X GTX Titan X SCALE 30 ~ 9 ms ms GTX Titan X GTX Titan X The total time of coping SCALE 30 ~ ~ 140 140 ms ms GTX Titan X GTX Titan X GTX Titan X GTX Titan X CPU CPU GTX Titan X GTX Titan X GTX Titan X GTX Titan X 115 GTEPS 115 PS GTX Titan X GTX Titan X ~ 100 M MTEPS S / W W GTX Titan X GTX Titan X (the current 1 st position – 62,93 MTEPS / W ) 27
The time of BFS on 16 GPU GTX Titan X GTX Titan X SCALE 30 ~ 9 ms ms GTX Titan X GTX Titan X The total time of coping SCALE 30 ~ ~ 30 30 ms ms GTX Titan X GTX Titan X GTX Titan X GTX Titan X CPU CPU GTX Titan X GTX Titan X GTX Titan X GTX Titan X NVlink GTX Titan X GTX Titan X 440 440 GTEPS GTX Titan X GTX Titan X ~ 3 300 MTEPS / W W 28
Alexander Kolganov, MSU, alexander.k.s@mail.ru 29
Recommend
More recommend