The BFS algorithm: Graph500 && GreenGraph500
Kolganov A.S., MSU
Outline:
- The BFS algorithm: Graph500 && GreenGraph500
- Implementation of BFS on shared memory (CPU / GPU)
- Predicted scalability
Breadth-first search (BFS) is one of the most important and fundamental graph-processing algorithms.
Algorithmic difficulties of BFS:
- very few computations per memory access;
- irregular memory access.
The BFS algorithm is used for ranking supercomputers in Graph500 (TEPS: traversed edges per second).
The GreenGraph500 rating of energy-efficient supercomputers ranks by the MTEPS/Watt metric.
Both lists are not yet full:
- Graph500: 201 positions in the list;
- GreenGraph500: 63 positions in the list.
The Graph500 benchmark:
- generate edges;
- build a graph from the edges (timed, included in the table);
- generate 64 random vertices;
- for each vertex: run the BFS algorithm (timed, included in the rating) and check the result;
- print the resulting information.
[Chart: nodes, cores, and scale (each doubling, x2) vs. power consumption: 17.8 MW, 7.8 MW, 12.6 MW, 3.9 MW]
[Chart: nodes, cores, and scale for the Big DATA and Small DATA categories]
GreenGraph500, Small DATA list (top 7):

Rank  MTEPS/W  Machine           Scale  GTEPS   Nodes  Cores
1     815.68   TitanX (GPU)      26     132.14  1      28
2     540.94   Titan (GPU)       25     114.68  1      20
3     445.92   Colonial (GPU)    20     112.18  1      12
4     243.42   Monty Pi-thon     26     35.83   32     128
5     235.15   GraphCREST (ARM)  20     1.03    1      4
6     230.4    GraphCREST (ARM)  20     0.74    1      4
7     204.38   EBD               21     1.64    1      5

GreenGraph500, Big DATA list (top 5):

Rank  MTEPS/W  Machine           Scale  GTEPS   Nodes  Cores
1     62.93    GraphCREST (CPU)  30     31.33   1      32
2     61.48    GraphCREST (CPU)  30     28.61   1      32
3     51.95    GraphCREST (CPU)  32     59.9    1      60
4     48.28    GraphCREST (CPU)  30     31.95   1      32
5     44.42    GraphCREST (CPU)  32     55.74   1      60

Big DATA: scale of 30 and above (256 GB for int64 and 128 GB for int32).
- GTX Titan X: 12 GB; Tesla K80: 24 GB;
- computing scale 30 requires ~192 GB;
- <GTX Titan X> x 16 = 192 GB: 4 kW peak!
- <Tesla K80> x 8 = 192 GB: 2.4 kW peak!
Implementation of BFS on shared memory (CPU / GPU)
Phase 1:
- reconstruction and transformation of the graph;
- loading it into GPU memory.
Phase 2:
- the main loop of the algorithm;
- hybrid BFS (Top-Down + Bottom-Up).
The main ideas were taken from GraphCREST:
"Fast and Energy-efficient Breadth-First Search on a Single NUMA System", 2014.
Transformation to CSR (compressed sparse rows):
[Diagram: COO representation: start vertex, final vertex, and weight arrays]
[Diagram: CSR representation: adj_ptr, adjacency, and weight arrays]
Global sorting of vertices by degree (number of neighbors);
Local sorting of each vertex's neighbors by degree.
Top-Down, synchronized on levels:
[Diagram: vertices V1..Vn at level K (current front) expand to level K+1 (the next iteration front)]
foreach (i = [0, N]) {
    if (levels[i] == lvl - 1) {              // i is on the current front
        foreach (k = [rInd[i], rInd[i+1])) {
            unsigned v = endV[k];
            if (levels[v] == 0) {            // v is not visited yet
                levels[v] = lvl;
                parents[v] = i;
            }
        }
    }
}
Bottom-Up, synchronized on levels:
[Diagram: level K, level K+1]
foreach (i = [0, N]) {
    if (levels[i] == 0) {                    // i is not visited yet
        foreach (k = [rInd[i], rInd[i+1])) {
            unsigned endk = endV[k];
            if (levels[endk] == lvl - 1) {   // found a parent on the front
                parents[i] = endk;
                levels[i] = lvl;
                break;
            }
        }
    }
}
Hybrid algorithm: Top-Down + Bottom-Up (direction optimization).

Edges viewed per level (graph of SCALE 26: |V| = 2^26 = 67,108,864; |E| = 2^30 = 1,073,741,824):

Level   Top-Down        Bottom-Up       Hybrid
0       2               2,103,840,895   2
1       66,206          1,766,587,029   66,206
2       346,918,235     52,677,691      52,677,691
3       1,727,195,615   12,820,854      12,820,854
4       29,557,400      103,184         103,184
5       82,357          21,467          21,467
6       221             21,240          221
Total:  2,103,820,036   3,936,072,360   65,689,625
        100% (= 2x|E|)  187%            3.12%

A significant decrease in the number of edges viewed.
Optimizations:
- CUDA Dynamic Parallelism for load balancing in Top-Down;
- vectorization in each thread;
- aligned reordering for better memory access in Bottom-Up;
- a queue in Bottom-Up at the last iterations.
Results:
- 1st position in GreenGraph500, Small DATA: GTX Titan X, 132 GTEPS, 815 MTEPS/W, SCALE 26;
- 2nd position in GreenGraph500, Small DATA: GTX Titan, 114 GTEPS, 540 MTEPS/W, SCALE 25;
- 15th position in GreenGraph500, Small DATA: Intel Xeon E5, 10.6 GTEPS, 81 MTEPS/W, SCALE 27;
- reached GPU memory bandwidth of 140-150 GB/s (50-60% of peak);
- reached energy consumption of 50% of peak.
Predicted scalability
[Diagram: 2 CPUs + 16x GTX Titan X]
Time of BFS on 1 GPU, SCALE 26: ~8.43 ms.
Total copying GPU->HOST: ~128 MB.
Total copying back HOST->GPU: ~2000 MB.
Time of BFS on 16 GPUs, SCALE 30: ~9 ms; total copying time, SCALE 30: ~140 ms.
Predicted: 115 GTEPS, ~100 MTEPS/W (the current 1st position: 62.93 MTEPS/W).
With NVLink:
Time of BFS on 16 GPUs, SCALE 30: ~9 ms; total copying time, SCALE 30: ~30 ms.
Predicted: 440 GTEPS, ~300 MTEPS/W.