The BFS algorithm: Graph500 && GreenGraph500
Kolganov A.S., MSU


SLIDE 1

Kolganov A.S., MSU

SLIDE 2

• The BFS algorithm
• Graph500 && GreenGraph500
• Implementation of BFS on shared memory (CPU / GPU)
• Predicted scalability

SLIDE 3

• The BFS algorithm
• Graph500 && GreenGraph500
• Implementation of BFS on shared memory (CPU / GPU)
• Predicted scalability

SLIDE 4

• Breadth-first search is one of the most important and fundamental graph-processing algorithms;
• Algorithmic difficulties of BFS:
  • very few computations;
  • irregular memory access.

SLIDE 5

• The BFS algorithm
• Graph500 && GreenGraph500
• Implementation of BFS on shared memory (CPU / GPU)
• Predicted scalability

SLIDE 6

• The BFS algorithm is used for ranking supercomputers in Graph500 (TEPS: traversed edges per second);
• The MTEPS/W metric is used for ranking energy-efficient supercomputers in GreenGraph500;
• Both lists are not yet full:
  • Graph500: 201 positions in the list;
  • GreenGraph500: 63 positions in the list.

SLIDE 7

• Generating edges;
• Building a graph from the edges (timed, included in the table);
• Generating 64 random root vertices;
• For each root vertex:
  • running the BFS algorithm (timed, included in the rating);
  • checking the result;
• Printing the resulting information.

(A sketch of this flow follows below.)
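For reference, a minimal host-side sketch of the benchmark flow. Every helper here (generate_edges, build_csr, run_bfs, validate) is a hypothetical stub standing in for the real Kronecker generator, graph construction, BFS kernel and validator; only the timing structure mirrors the list above.

    // Sketch of the Graph500 benchmark flow; all helper names are hypothetical.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <utility>
    #include <vector>

    struct Graph { uint64_t num_edges = 0; /* CSR arrays would live here */ };

    static std::vector<std::pair<uint32_t, uint32_t>> generate_edges(int) { return {}; }
    static Graph build_csr(const std::vector<std::pair<uint32_t, uint32_t>>& e) {
        Graph g; g.num_edges = e.size(); return g;
    }
    static void run_bfs(const Graph&, uint32_t) {}
    static bool validate(const Graph&, uint32_t) { return true; }

    int main() {
        const int scale = 26;                               // |V| = 2^scale
        auto edges = generate_edges(scale);                 // not timed
        auto t0 = std::chrono::steady_clock::now();
        Graph g = build_csr(edges);                         // timed, reported in the table
        std::chrono::duration<double> build = std::chrono::steady_clock::now() - t0;

        std::mt19937 rng(12345);
        std::uniform_int_distribution<uint32_t> pick(0, (1u << scale) - 1);
        for (int i = 0; i < 64; ++i) {                      // 64 random root vertices
            uint32_t root = pick(rng);
            auto b0 = std::chrono::steady_clock::now();
            run_bfs(g, root);                               // timed, goes into the rating
            std::chrono::duration<double> t = std::chrono::steady_clock::now() - b0;
            if (!validate(g, root)) return 1;               // checking the result
            std::printf("root %u: %.1f MTEPS\n", root, g.num_edges / t.count() / 1e6);
        }
        std::printf("construction time: %.3f s\n", build.count());
        return 0;
    }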

SLIDE 8

[Chart: Graph500 list statistics by nodes, cores and scale]

SLIDE 9

[Chart: nodes, cores and scale with x2 growth markers; power figures: 17.8 MW, 7.8 MW, 12.6 MW, 3.9 MW]

SLIDE 10

GreenGraph500, Small Data list:

Rank  MTEPS/W  Machine           Scale  GTEPS   Nodes  Cores
1     815.68   TitanX (GPU)      26     132.14  1      28
2     540.94   Titan (GPU)       25     114.68  1      20
3     445.92   Colonial (GPU)    20     112.18  1      12
4     243.42   Monty Pi-thon     26     35.83   32     128
5     235.15   GraphCREST (ARM)  20     1.03    1      4
6     230.4    GraphCREST (ARM)  20     0.74    1      4
7     204.38   EBD               21     1.64    1      5

GreenGraph500, Big Data list:

Rank  MTEPS/W  Machine           Scale  GTEPS   Nodes  Cores
1     62.93    GraphCREST (CPU)  30     31.33   1      32
2     61.48    GraphCREST (CPU)  30     28.61   1      32
3     51.95    GraphCREST (CPU)  32     59.9    1      60
4     48.28    GraphCREST (CPU)  30     31.95   1      32
5     44.42    GraphCREST (CPU)  32     55.74   1      60

SLIDE 11

Big Data: scale 30 and above (256 GB for int64 and 128 GB for int32). Small Data: smaller scales.

[The GreenGraph500 tables from SLIDE 10 are repeated here]

SLIDE 12

Big Data: scale 30 and above (256 GB for int64 and 128 GB for int32). Small Data: smaller scales.

• GTX Titan X: 12 GB; Tesla K80: 24 GB;
• Computing scale 30 needs ~192 GB;
• <GTX Titan X> x 16 = 192 GB, 4 kW peak!
• <Tesla K80> x 8 = 192 GB, 2.4 kW peak!

[The Small Data GreenGraph500 table from SLIDE 10 is repeated here]

SLIDE 13

• The BFS algorithm
• Graph500 && GreenGraph500
• Implementation of BFS on shared memory (CPU / GPU)
• Predicted scalability

SLIDE 14

• Phase 1:
  • reconstruction and transformation of the graph;
  • loading it into GPU memory.
• Phase 2:
  • the main loop of the algorithm;
  • a hybrid BFS (Top-Down + Bottom-Up) is used.
• The main ideas were taken from GraphCREST: "Fast and Energy-efficient Breadth-First Search on a Single NUMA System", 2014.

SLIDE 15

• Transformation to CSR (compressed sparse rows); a conversion sketch follows below.

[Diagram: COO arrays (start vertex, final vertex, weights) transformed into CSR arrays (adj_ptr, adjacency, weights)]
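A minimal host-side sketch of this conversion for an unweighted graph, using a counting sort over the start vertices. The array names rInd/endV match the pseudocode on SLIDES 18-19; this is an illustration, not the author's code.

    // Sketch: converting a COO edge list to CSR.
    #include <cstdint>
    #include <vector>

    struct CSR {
        std::vector<uint32_t> rInd;  // rInd[i]..rInd[i+1] indexes i's neighbors
        std::vector<uint32_t> endV;  // flattened adjacency (final vertices)
    };

    CSR coo_to_csr(uint32_t n,
                   const std::vector<uint32_t>& src,   // COO start vertices
                   const std::vector<uint32_t>& dst) { // COO final vertices
        CSR g;
        g.rInd.assign(n + 1, 0);
        for (uint32_t s : src) g.rInd[s + 1]++;                       // count out-degrees
        for (uint32_t i = 0; i < n; i++) g.rInd[i + 1] += g.rInd[i];  // prefix sum
        g.endV.resize(src.size());
        std::vector<uint32_t> pos(g.rInd.begin(), g.rInd.end() - 1);
        for (size_t e = 0; e < src.size(); e++)                       // scatter edges
            g.endV[pos[src[e]]++] = dst[e];
        return g;
    }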

SLIDE 16

• Global sorting of vertices by degree (a relabeling sketch follows below).
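One possible implementation, sketched under the assumption that the graph is already in CSR (struct from the SLIDE 15 sketch): sort vertex IDs by descending degree, build the old-to-new permutation, and rewrite both CSR arrays. Packing the hubs at the low IDs is the usual motivation, since it improves locality of the levels/parents accesses.

    // Sketch: relabel vertices in order of descending degree (assumed approach).
    #include <algorithm>
    #include <numeric>

    CSR sort_by_degree(const CSR& g) {
        const uint32_t n = (uint32_t)g.rInd.size() - 1;
        std::vector<uint32_t> order(n);                  // new ID -> old ID
        std::iota(order.begin(), order.end(), 0u);
        std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
            return g.rInd[a + 1] - g.rInd[a] > g.rInd[b + 1] - g.rInd[b];
        });
        std::vector<uint32_t> new_id(n);                 // old ID -> new ID
        for (uint32_t i = 0; i < n; i++) new_id[order[i]] = i;

        CSR out;
        out.rInd.assign(n + 1, 0);
        out.endV.reserve(g.endV.size());
        for (uint32_t i = 0; i < n; i++) {               // copy rows in the new order
            const uint32_t old = order[i];
            for (uint32_t k = g.rInd[old]; k < g.rInd[old + 1]; k++)
                out.endV.push_back(new_id[g.endV[k]]);   // remap neighbor IDs too
            out.rInd[i + 1] = (uint32_t)out.endV.size();
        }
        return out;
    }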

SLIDE 17

• Local sorting of each vertex's neighbors by degree (a sketch follows below).

[Diagram: adjacency lists V1 .. Vn before and after sorting]
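An assumed sketch of the local reordering: sort each CSR adjacency segment by the neighbors' degrees, descending. With hub-like neighbors first, the Bottom-Up loop on SLIDE 19 tends to find a frontier parent and break earlier.

    // Sketch: sort each adjacency segment by neighbor degree (descending).
    // Uses the CSR struct from the SLIDE 15 sketch.
    #include <algorithm>

    void sort_neighbors(CSR& g) {
        const uint32_t n = (uint32_t)g.rInd.size() - 1;
        auto deg = [&](uint32_t v) { return g.rInd[v + 1] - g.rInd[v]; };
        for (uint32_t i = 0; i < n; i++)
            std::sort(g.endV.begin() + g.rInd[i],
                      g.endV.begin() + g.rInd[i + 1],
                      [&](uint32_t a, uint32_t b) { return deg(a) > deg(b); });
    }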

SLIDE 18

• Level-synchronized Top-Down.

[Diagram: the current front at level K expands into the next-iteration front at level K+1]

    for (int i = 0; i < N; i++) {
        if (levels[i] != lvl - 1) continue;   // expand only the current front
        for (int k = rInd[i]; k < rInd[i + 1]; k++) {
            unsigned v = endV[k];
            if (levels[v] == 0) {             // v not visited yet
                levels[v] = lvl;
                parents[v] = i;
            }
        }
    }

SLIDE 19

• Level-synchronized Bottom-Up.

[Diagram: level K and level K+1]

    for (int i = 0; i < N; i++) {
        if (levels[i] == 0) {                     // i not visited yet
            for (int k = rInd[i]; k < rInd[i + 1]; k++) {
                unsigned endk = endV[k];
                if (levels[endk] == lvl - 1) {    // neighbor on the current front
                    parents[i] = endk;
                    levels[i] = lvl;
                    break;                        // one parent is enough
                }
            }
        }
    }

SLIDE 20

• Hybrid algorithm: Top-Down + Bottom-Up (direction optimization).

The graph: SCALE 26, |V| = 2^26 (67,108,864), |E| = 2^30 (1,073,741,824). Edges viewed per level (the Top-Down total is roughly 2 x |E|, since each edge is seen from both endpoints):

Level  Top-Down       Bottom-Up      Hybrid
0      2              2,103,840,895  2
1      66,206         1,766,587,029  66,206
2      346,918,235    52,677,691     52,677,691
3      1,727,195,615  12,820,854     12,820,854
4      29,557,400     103,184        103,184
5      82,357         21,467         21,467
6      221            21,240         221
Total  2,103,820,036  3,936,072,360  65,689,625
       100%           187%           3.12%

A significant decrease in the number of edges viewed (a sketch of the direction switch follows below).
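The switch between the two directions is typically driven by frontier size. A minimal host-side sketch in the spirit of Beamer's direction-optimizing BFS heuristic; the alpha/beta thresholds and the counters passed in are assumptions, not necessarily what this implementation uses.

    // Sketch of a per-level direction switch (Beamer-style heuristic).
    #include <cstdint>

    enum class Dir { TopDown, BottomUp };

    // m_frontier: edges incident to the current front; m_unexplored: edges
    // incident to still-unvisited vertices; n_frontier/n_total: vertex counts.
    Dir choose_direction(Dir cur, uint64_t m_frontier, uint64_t m_unexplored,
                         uint64_t n_frontier, uint64_t n_total) {
        const uint64_t alpha = 14, beta = 24;   // tuning parameters
        if (cur == Dir::TopDown && m_frontier > m_unexplored / alpha)
            return Dir::BottomUp;               // front grew heavy: go Bottom-Up
        if (cur == Dir::BottomUp && n_frontier < n_total / beta)
            return Dir::TopDown;                // front shrank: go back Top-Down
        return cur;
    }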

SLIDE 21

• Using CUDA Dynamic Parallelism for load balancing in Top-Down (a kernel sketch follows below);
• Using vectorization in each thread;
• Using aligned reordering for better memory access in Bottom-Up;
• Using a queue in Bottom-Up at the last iterations.
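A sketch of how Dynamic Parallelism can balance the Top-Down step: a heavy adjacency list is handed to a child grid instead of being scanned by one thread. The DP_THRESHOLD value and all kernel names are illustrative assumptions; building this requires a device of compute capability 3.5+ and nvcc with -rdc=true.

    // Sketch: Top-Down with a child-grid launch for high-degree vertices.
    __global__ void expand_row(const unsigned *endV, unsigned *levels,
                               unsigned *parents, unsigned i,
                               unsigned begin, unsigned end, unsigned lvl)
    {
        unsigned k = begin + blockIdx.x * blockDim.x + threadIdx.x;
        if (k < end) {
            unsigned v = endV[k];
            if (levels[v] == 0) {   // benign race: any parent found is valid
                levels[v] = lvl;
                parents[v] = i;
            }
        }
    }

    #define DP_THRESHOLD 1024       // assumed degree cut-off

    __global__ void top_down(const unsigned *rInd, const unsigned *endV,
                             unsigned *levels, unsigned *parents,
                             unsigned N, unsigned lvl)
    {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N || levels[i] != lvl - 1) return;    // not on the current front
        unsigned begin = rInd[i], end = rInd[i + 1];
        if (end - begin > DP_THRESHOLD) {
            // Dynamic Parallelism: offload this row to a child grid.
            unsigned blocks = (end - begin + 255) / 256;
            expand_row<<<blocks, 256>>>(endV, levels, parents, i, begin, end, lvl);
        } else {
            for (unsigned k = begin; k < end; k++) {
                unsigned v = endV[k];
                if (levels[v] == 0) { levels[v] = lvl; parents[v] = i; }
            }
        }
    }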

SLIDE 22

• The 1st position in GreenGraph500, Small Data: GTX Titan X, 132 GTEPS, 815 MTEPS/W, SCALE 26;
• The 2nd position in GreenGraph500, Small Data: GTX Titan, 114 GTEPS, 540 MTEPS/W, SCALE 25;
• The 15th position in GreenGraph500, Small Data: Intel Xeon E5, 10.6 GTEPS, 81 MTEPS/W, SCALE 27;
• Reached GPU memory bandwidth of 140-150 GB/s (50-60% of peak);
• Reached energy consumption of 50% of peak.

SLIDE 23

• The BFS algorithm
• Graph500 && GreenGraph500
• Implementation of BFS on shared memory (CPU / GPU)
• Predicted scalability

SLIDE 24

[Diagram: two CPUs hosting 16 GTX Titan X GPUs]

The time of BFS on 1 GPU, SCALE 26: ~8.43 ms

SLIDE 25

[Diagram: two CPUs hosting 16 GTX Titan X GPUs]

Total data copied GPU->HOST: ~128 MB

SLIDE 26

[Diagram: two CPUs hosting 16 GTX Titan X GPUs]

Total data copied back HOST->GPU: ~2000 MB

SLIDE 27

[Diagram: two CPUs hosting 16 GTX Titan X GPUs]

The time of BFS on 16 GPUs, SCALE 30: ~9 ms
The total time of copying, SCALE 30: ~140 ms

115 GTEPS, ~100 MTEPS/W
(the current 1st position: 62.93 MTEPS/W)
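The predicted figures are consistent with TEPS = |E| / (t_BFS + t_copy). With the Graph500 edge factor of 16 (as on SLIDE 20, where SCALE 26 gives |E| = 2^30), SCALE 30 yields |E| = 2^34, about 17.18 x 10^9 edges:

    17.18 x 10^9 / (0.009 s + 0.140 s) ≈ 115 GTEPS    (PCIe copying, this slide)
    17.18 x 10^9 / (0.009 s + 0.030 s) ≈ 440 GTEPS    (NVLink, next slide)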

SLIDE 28

[Diagram: two CPUs hosting 16 GTX Titan X GPUs, connected via NVLink]

The time of BFS on 16 GPUs, SCALE 30: ~9 ms
The total time of copying, SCALE 30: ~30 ms

440 GTEPS, ~300 MTEPS/W

SLIDE 29

Alexander Kolganov, MSU, alexander.k.s@mail.ru