

SLIDE 1

MULTI-GPU PROGRAMMING MODELS

Jiri Kraus, Senior Devtech Compute
Jan Stephan, Intern Devtech Compute

SLIDE 2

MOTIVATION

Why use multiple GPUs?

• Need to compute larger problems, e.g. bigger networks, car models, …
• Need to compute faster, e.g. weather prediction
• Better energy efficiency with dense nodes with multiple GPUs

SLIDE 3

DGX-1V

• Two fully connected quads, connected at corners
• 300 GB/s bidirectional per GPU to peers
• Load/store access to peer memory
• Full atomics to peer GPUs
• High-speed copy engines for bulk data copies
• PCIe to/from CPU

[Topology diagram: GPU0-GPU7 in two NVLink quads; CPU 0 (cores 0-19), CPU 1 (cores 20-39)]

SLIDE 4

EXAMPLE: JACOBI SOLVER

Solves the 2D Laplace equation on a rectangle:

    Δφ(x, y) = 0   ∀ (x, y) ∈ Ω \ δΩ

• Dirichlet boundary conditions (constant values) on the left and right boundaries
• Periodic boundary conditions on the top and bottom boundaries
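For reference, the Jacobi update used on the next slides follows from the standard five-point finite-difference discretization of the Laplacian; a short derivation sketch (h is the grid spacing):

    \Delta\varphi \approx \frac{\varphi_{ix+1,iy} + \varphi_{ix-1,iy} + \varphi_{ix,iy+1} + \varphi_{ix,iy-1} - 4\,\varphi_{ix,iy}}{h^2} = 0
    \quad\Longrightarrow\quad
    \varphi^{new}_{ix,iy} = \tfrac{1}{4}\left(\varphi_{ix+1,iy} + \varphi_{ix-1,iy} + \varphi_{ix,iy+1} + \varphi_{ix,iy-1}\right)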

SLIDE 5

EXAMPLE: JACOBI SOLVER

Single GPU

While not converged:
    Do Jacobi step:
        for ( int iy = 1; iy < ny-1; iy++ )
            for ( int ix = 1; ix < nx-1; ix++ )
                a_new[iy*nx+ix] = 0.25 * ( a[iy*nx+(ix+1)] + a[iy*nx+(ix-1)]
                                         + a[(iy-1)*nx+ix] + a[(iy+1)*nx+ix] );
    Apply periodic boundary conditions
    Swap a_new and a
    Next iteration
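A minimal CUDA kernel for this single-GPU step might look like the following sketch; the full kernel, including the norm computation, appears later on SLIDE 41, and `real`, grid and block names here are illustrative:

    typedef float real;  // the deck uses a `real` type; float assumed here

    __global__ void jacobi_step( real* a_new, const real* a, int nx, int ny )
    {
        const int ix = blockIdx.x*blockDim.x + threadIdx.x;
        const int iy = blockIdx.y*blockDim.y + threadIdx.y;
        // Update interior points only; boundaries are handled separately.
        if ( ix >= 1 && ix < nx-1 && iy >= 1 && iy < ny-1 )
            a_new[iy*nx+ix] = 0.25f * ( a[iy*nx+(ix+1)] + a[iy*nx+(ix-1)]
                                      + a[(iy-1)*nx+ix] + a[(iy+1)*nx+ix] );
    }

    // Possible launch configuration:
    //   dim3 dim_block(32, 8);
    //   dim3 dim_grid((nx+31)/32, (ny+7)/8);
    //   jacobi_step<<<dim_grid,dim_block>>>(a_new, a, nx, ny);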

SLIDE 6

DOMAIN DECOMPOSITION

Different ways to split the work between processes:

• Minimize the number of neighbors: communicate with fewer neighbors (optimal for latency-bound communication)
• Minimize the surface-to-volume ratio: communicate less data (optimal for bandwidth-bound communication)

A stripe decomposition is contiguous in memory if it follows the storage order: row-wise stripes for row-major data, column-wise stripes for column-major data.
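For the row-wise split used in the rest of the talk, the per-device row ranges (iy_start[dev_id], iy_end[dev_id]) can be computed as in this sketch; the remainder handling is one common choice, and it uses global row indices, while the benchmark code works with per-device arrays and local indices:

    // Distribute the ny-2 interior rows over num_devices devices; the first
    // (ny-2) % num_devices devices get one extra row.
    void compute_row_ranges( int ny, int num_devices, int* iy_start, int* iy_end )
    {
        const int interior = ny - 2;                  // rows 1 .. ny-2
        const int base     = interior / num_devices;
        const int rest     = interior % num_devices;
        int row = 1;
        for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
            iy_start[dev_id] = row;
            iy_end[dev_id]   = row + base + ( dev_id < rest ? 1 : 0 );  // exclusive
            row = iy_end[dev_id];
        }
    }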

SLIDE 7

EXAMPLE: JACOBI SOLVER

Multi GPU: one step with ring exchange

While not converged:
    Do Jacobi step:
        for ( int iy = iy_start; iy < iy_end; iy++ )
            for ( int ix = 1; ix < nx-1; ix++ )
                a_new[iy*nx+ix] = 0.25 * ( a[iy*nx+(ix+1)] + a[iy*nx+(ix-1)]
                                         + a[(iy-1)*nx+ix] + a[(iy+1)*nx+ix] );
    Apply periodic boundary conditions
    Exchange halo with 2 neighbors
    Swap a_new and a
    Next iteration

SLIDE 8

SINGLE THREADED MULTI GPU PROGRAMMING

while ( l2_norm > tol && iter < iter_max ) {
    // Launch work on every device in turn from one host thread.
    for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
        const int top    = dev_id > 0 ? dev_id - 1 : (num_devices-1);
        const int bottom = (dev_id+1) % num_devices;
        cudaSetDevice( dev_id );
        cudaMemsetAsync( l2_norm_d[dev_id], 0, sizeof(real) );
        jacobi_kernel<<<dim_grid,dim_block>>>( a_new[dev_id], a[dev_id], l2_norm_d[dev_id],
                                               iy_start[dev_id], iy_end[dev_id], nx );
        cudaMemcpyAsync( l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real), cudaMemcpyDeviceToHost );
        // Push this device's boundary rows into the neighbors' halo rows.
        cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);
        cudaMemcpyAsync( a_new[bottom], a_new[dev_id]+(iy_end[dev_id]-1)*nx, nx*sizeof(real), ...);
    }
    // Synchronize all devices and reduce the norm on the host.
    l2_norm = 0.0;
    for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
        cudaSetDevice( dev_id );
        cudaDeviceSynchronize();
        l2_norm += *(l2_norm_h[dev_id]);
    }
    l2_norm = std::sqrt( l2_norm );
    for ( int dev_id = 0; dev_id < num_devices; ++dev_id )
        std::swap( a_new[dev_id], a[dev_id] );
    iter++;
}

SLIDE 9

EXAMPLE JACOBI

Top/Bottom Halo: push this device's first interior row into the top neighbor's bottom halo row.

cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);

SLIDE 11

EXAMPLE JACOBI

Top/Bottom Halo: each device pushes its top boundary row (1) and its bottom boundary row (2) into the corresponding neighbor's halo row.

cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);
cudaMemcpyAsync( a_new[bottom], a_new[dev_id]+(iy_end[dev_id]-1)*nx, nx*sizeof(real), ...);

SLIDE 12

SCALABILITY METRICS FOR SUCCESS

• Serial time T_s: how long it takes to run the problem with a single process
• Parallel time T_p: how long it takes to run the problem with multiple processes
• Number of processes P: the number of processes operating on the task at hand
• Speedup S = T_s / T_p: how much faster the parallel version is vs. the serial one (optimal is P)
• Efficiency E = S / P: how efficiently the processors are used (optimal is 1)

For example, if one GPU takes 4 s and 4 GPUs take 1.25 s, then S = 3.2 and E = 0.8.

SLIDE 13

EXAMPLE: JACOBI SOLVER

[Chart: single-GPU performance vs. problem size (nx = ny, 512 to 8192) on a Tesla V100 SXM2; series: Performance (Mcells/s), Efficiency (%), Achieved Occupancy (%)]

SLIDE 14

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: runtime (s) and parallel efficiency vs. #GPUs (1-8); variant: Single Threaded Copy]

SLIDE 15

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Single Threaded Copy, 4x V100 on DGX-1V]

SLIDE 16

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Single Threaded Copy, 4x V100 on DGX-1V]

SLIDE 17

GPUDIRECT P2P

• Maximizes intra-node inter-GPU bandwidth
• Avoids host memory and system topology bottlenecks

[Diagram: GPU0-GPU7, each with its own memory; peer-to-peer paths directly between GPU memories]
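Once peer access is enabled (next slide), halo rows can be copied directly between peer memories; a sketch using cudaMemcpyPeerAsync (the Jacobi code instead uses cudaMemcpyAsync with unified virtual addressing, which takes the same P2P path; d_dst, d_src, dst_dev, src_dev and stream are illustrative):

    // Copy one halo row (nx elements of `real`) from src_dev to dst_dev
    // without staging through host memory.
    cudaMemcpyPeerAsync( d_dst, dst_dev,     // destination pointer, destination device
                         d_src, src_dev,     // source pointer, source device
                         nx*sizeof(real),
                         stream );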

SLIDE 18

GPUDIRECT P2P

Enable P2P:

for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
    cudaSetDevice( dev_id );
    const int top = dev_id > 0 ? dev_id - 1 : (num_devices-1);
    int canAccessPeer = 0;
    cudaDeviceCanAccessPeer( &canAccessPeer, dev_id, top );
    if ( canAccessPeer )
        cudaDeviceEnablePeerAccess( top, 0 );
    const int bottom = (dev_id+1) % num_devices;
    if ( top != bottom ) {
        cudaDeviceCanAccessPeer( &canAccessPeer, dev_id, bottom );
        if ( canAccessPeer )
            cudaDeviceEnablePeerAccess( bottom, 0 );
    }
}

SLIDE 19

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Single Threaded Copy, 4x V100 on DGX-1V with P2P]

SLIDE 20

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy, Single Threaded Copy P2P]

SLIDE 21

1D RING EXCHANGE

• Halo updates for a 1D domain decomposition with periodic boundary conditions
• Unidirectional rings are an important building block for collective algorithms
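Spelled out, the neighbor indices for the unidirectional ring are the same modular arithmetic the copy loops above already use:

    // Neighbors in a ring of `size` ranks (or num_devices GPUs):
    const int top    = rank > 0 ? rank - 1 : size - 1;  // wrap around at rank 0
    const int bottom = (rank + 1) % size;               // wrap around at rank size-1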

SLIDE 22

MAPPING 1D RING EXCHANGE TO DGX-1V

[Diagram: domains 0-6 and rank 7 mapped onto GPU0-GPU7 in default device order]

SLIDE 23

MAPPING 1D RING EXCHANGE TO DGX-1V

[Diagram: domains 0-7 remapped onto GPU0-GPU7 so that neighboring domains sit on NVLink-connected GPUs]

export CUDA_VISIBLE_DEVICES="0,3,2,1,5,6,7,4"

SLIDE 24

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy, Single Threaded Copy P2P (no opt), Single Threaded Copy P2P]

SLIDE 25

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Single Threaded Copy, 4x V100 on DGX-1V with P2P]

SLIDE 26

MULTI THREADED MULTI GPU PROGRAMMING

Using OpenMP: one host thread per GPU.

int num_devices = 0;
cudaGetDeviceCount( &num_devices );

#pragma omp parallel num_threads( num_devices )
{
    int dev_id = omp_get_thread_num();
    cudaSetDevice( dev_id );
}
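A build line for this pattern might look like the following; the exact OpenMP flags depend on the host compiler, so treat this as an assumption:

    nvcc -Xcompiler -fopenmp -o jacobi jacobi.cu -lgomp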

SLIDE 27

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Multi Threaded Copy, 4x V100 on DGX-1V with P2P]

SLIDE 28

MULTI GPU JACOBI RUNTIME

DGX-1V, 1024 x 1024, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy P2P, Multi Threaded Copy (no thread pinning)]

SLIDE 29

GPU/CPU AFFINITY

[Diagram: threads 0-7 pinned to GPU0-GPU7; GPUs 0-3 attach to CPU 0 (cores 0-19), GPUs 4-7 to CPU 1 (cores 20-39)]

SLIDE 30

GPU/CPU AFFINITY

Querying the system topology with nvidia-smi topo -m:

$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
GPU0     X    NV1   NV1   NV2   NV2   SOC   SOC   SOC   PIX     SOC     PHB     SOC     0-19
GPU1    NV1    X    NV2   NV1   SOC   NV2   SOC   SOC   PIX     SOC     PHB     SOC     0-19
GPU2    NV1   NV2    X    NV2   SOC   SOC   NV1   SOC   PHB     SOC     PIX     SOC     0-19
GPU3    NV2   NV1   NV2    X    SOC   SOC   SOC   NV1   PHB     SOC     PIX     SOC     0-19
GPU4    NV2   SOC   SOC   SOC    X    NV1   NV1   NV2   SOC     PIX     SOC     PHB     20-39
GPU5    SOC   NV2   SOC   SOC   NV1    X    NV2   NV1   SOC     PIX     SOC     PHB     20-39
GPU6    SOC   SOC   NV1   SOC   NV1   NV2    X    NV2   SOC     PHB     SOC     PIX     20-39
GPU7    SOC   SOC   SOC   NV1   NV2   NV1   NV2    X    SOC     PHB     SOC     PIX     20-39
mlx5_0  PIX   PIX   PHB   PHB   SOC   SOC   SOC   SOC    X      SOC     PHB     SOC
mlx5_2  SOC   SOC   SOC   SOC   PIX   PIX   PHB   PHB   SOC      X      SOC     PHB
mlx5_1  PHB   PHB   PIX   PIX   SOC   SOC   SOC   SOC   PHB     SOC      X      SOC
mlx5_3  SOC   SOC   SOC   SOC   PHB   PHB   PIX   PIX   SOC     PHB     SOC      X

GPUs 0-3 have affinity to CPU 0 (cores 0-19), GPUs 4-7 to CPU 1 (cores 20-39).

SLIDE 31

GPU/CPU AFFINITY

Using CUDA_VISIBLE_DEVICES and OpenMP environment variables:

export OMP_PROC_BIND=TRUE
export CUDA_VISIBLE_DEVICES="0,3,2,1,5,6,7,4"
export OMP_PLACES="{0},{1},{2},{3},{20},{21},{22},{23}"
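A quick way to verify the pinning is to let each thread report its CPU core and GPU; a Linux-specific sketch, not part of the deck (sched_getcpu is glibc):

    #include <cstdio>
    #include <omp.h>
    #include <sched.h>            // sched_getcpu()
    #include <cuda_runtime.h>

    int main()
    {
        int num_devices = 0;
        cudaGetDeviceCount( &num_devices );
        #pragma omp parallel num_threads( num_devices )
        {
            const int dev_id = omp_get_thread_num();
            cudaSetDevice( dev_id );
            // With the OMP_PLACES/OMP_PROC_BIND settings above, each thread
            // should report a core on the socket closest to its GPU.
            printf( "thread %d -> GPU %d on CPU core %d\n",
                    dev_id, dev_id, sched_getcpu() );
        }
        return 0;
    }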

SLIDE 32

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy P2P, Multi Threaded Copy (no thread pinning), Multi Threaded Copy]

SLIDE 33

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Multi Threaded Copy, 4x V100 on DGX-1V with P2P]

SLIDE 34

COMMUNICATION + COMPUTATION OVERLAP

No overlap: process the whole domain, then communicate.

Overlap: process the boundary part of the domain first, exchange the halos while the inner domain is still being processed.

Boundary and inner domain processing can overlap because the inner domain does not depend on the halo exchange; the hidden communication time is the possible gain.

SLIDE 35

COMMUNICATION + COMPUTATION OVERLAP

// Compute bulk (inner domain): waits for last iteration's halo pushes to finish
cudaStreamWaitEvent( compute_stream, push_top_done[(iter%2)][dev_id], 0 );
cudaStreamWaitEvent( compute_stream, push_bottom_done[(iter%2)][dev_id], 0 );
jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
    a_new[dev_id], a, l2_norm_d, (iy_start+1), (iy_end[dev_id]-1), nx );

// Compute boundaries: single-row kernels on separate streams
cudaStreamWaitEvent( push_top_stream, reset_l2norm_done, 0 );
cudaStreamWaitEvent( push_top_stream, push_bottom_done[(iter%2)][top], 0 );
jacobi_kernel<<<nx/128+1,128,0,push_top_stream>>>(
    a_new[dev_id], a, l2_norm_d, iy_start, (iy_start+1), nx );

cudaStreamWaitEvent( push_bottom_stream, reset_l2norm_done, 0 );
cudaStreamWaitEvent( push_bottom_stream, push_top_done[(iter%2)][bottom], 0 );
jacobi_kernel<<<nx/128+1,128,0,push_bottom_stream>>>(
    a_new[dev_id], a, l2_norm_d, (iy_end[dev_id]-1), iy_end[dev_id], nx );

// Apply periodic boundary conditions and exchange halo
cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start*nx,
                 nx*sizeof(real), cudaMemcpyDeviceToDevice, push_top_stream );
cudaEventRecord( push_top_done[((iter+1)%2)][dev_id], push_top_stream );
cudaMemcpyAsync( a_new[bottom], a_new[dev_id]+(iy_end[dev_id]-1)*nx,
                 nx*sizeof(real), cudaMemcpyDeviceToDevice, push_bottom_stream );
cudaEventRecord( push_bottom_done[((iter+1)%2)][dev_id], push_bottom_stream );

SLIDE 36

COMMUNICATION + COMPUTATION OVERLAP

High priority streams: the boundary kernels and halo copies run at the highest priority, so they are scheduled ahead of the bulk kernel and the halo exchange starts as early as possible.

int leastPriority = 0;
int greatestPriority = leastPriority;
cudaDeviceGetStreamPriorityRange( &leastPriority, &greatestPriority );
cudaStreamCreateWithPriority( &compute_stream,     cudaStreamDefault, leastPriority );
cudaStreamCreateWithPriority( &push_top_stream,    cudaStreamDefault, greatestPriority );
cudaStreamCreateWithPriority( &push_bottom_stream, cudaStreamDefault, greatestPriority );

SLIDE 37

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Multi Threaded Copy Overlap, 4x V100 on DGX-1V with P2P]

SLIDE 38

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy P2P, Multi Threaded Copy, Multi Threaded Copy Overlap]

SLIDE 39

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Multi Threaded Copy Overlap, 4x V100 on DGX-1V with P2P]

SLIDE 40

MULTI THREADED MULTI GPU PROGRAMMING

Using OpenMP and P2P mappings: the kernel writes halo rows directly into the neighbors' memory, so no separate copies are needed.

while ( l2_norm > tol && iter < iter_max ) {
    cudaMemsetAsync( l2_norm_d, 0, sizeof(real), compute_stream );

    #pragma omp barrier
    // Wait until both neighbors have finished the previous iteration.
    cudaStreamWaitEvent( compute_stream, compute_done[iter%2][top], 0 );
    cudaStreamWaitEvent( compute_stream, compute_done[iter%2][bottom], 0 );

    jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
        a_new[dev_id], a, l2_norm_d, iy_start, iy_end[dev_id], nx,
        a_new[top], iy_end[top], a_new[bottom], 0 );
    cudaEventRecord( compute_done[(iter+1)%2][dev_id], compute_stream );

    cudaMemcpyAsync( l2_norm, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream );
    // l2_norm reduction btw threads skipped ...

    #pragma omp barrier
    std::swap( a_new[dev_id], a );
    iter++;
}

SLIDE 41

MULTI THREADED MULTI GPU PROGRAMMING

Using OpenMP and P2P mappings:

__global__ void jacobi_kernel( ... )
{
    const int ix = blockIdx.x*blockDim.x + threadIdx.x;
    const int iy = blockIdx.y*blockDim.y + threadIdx.y + iy_start;
    real local_l2_norm = 0.0;
    if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
        const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
                                    + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
        a_new[ iy * nx + ix ] = new_val;
        // Write boundary rows directly into the neighbors' halo rows (P2P).
        if ( iy_start == iy )     a_new_top[ top_iy * nx + ix ]       = new_val;
        if ( (iy_end - 1) == iy ) a_new_bottom[ bottom_iy * nx + ix ] = new_val;
        real residue = new_val - a[ iy * nx + ix ];
        local_l2_norm += residue * residue;
    }
    atomicAdd( l2_norm, local_l2_norm );
}

SLIDE 42

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy P2P, Multi Threaded Copy, Multi Threaded Copy Overlap, Multi Threaded P2P]

SLIDE 43

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Multi Threaded P2P, 4x V100 on DGX-1V with P2P]

SLIDE 44

MULTI THREADED MULTI GPU PROGRAMMING

L2 norm reduction across threads:

cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream );
#pragma omp barrier
#pragma omp single
{ l2_norm = 0.0; }
#pragma omp barrier
cudaStreamSynchronize( compute_stream );
#pragma omp atomic
l2_norm += *(l2_norm_h);
#pragma omp barrier
#pragma omp single
{ l2_norm = std::sqrt( l2_norm ); }
#pragma omp barrier

This synchronization cost can be hidden if the L2 norm check is delayed by one iteration.

SLIDE 45

MULTI THREADED MULTI GPU PROGRAMMING

Delayed L2 norm reduction: issue the D2H copy of the current norm, but check the norm of the previous iteration, so the copy is hidden behind compute.

cudaMemcpyAsync( l2_norm_h[curr], l2_norm_d[curr], sizeof(real),
                 cudaMemcpyDeviceToHost, compute_stream );
cudaEventRecord( copy_done[curr], compute_stream );

cudaEventSynchronize( copy_done[prev] );
#pragma omp atomic
l2_norm[prev] += *( l2_norm_h[prev] );
#pragma omp barrier
l2_norm[prev] = std::sqrt( l2_norm[prev] );
#pragma omp barrier
l2_norm[prev] = 0.0;

SLIDE 46

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: Multi Threaded P2P (norm opt), 4x V100 on DGX-1V with P2P]

SLIDE 47

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Single Threaded Copy P2P, Multi Threaded Copy, Multi Threaded Copy Overlap, Multi Threaded P2P (norm opt)]

SLIDE 48

MESSAGE PASSING INTERFACE - MPI

• Standard to exchange data between processes via messages
• Defines an API to exchange messages:
  - Point to point: e.g. MPI_Send, MPI_Recv
  - Collectives: e.g. MPI_Reduce
• Multiple implementations (open source and commercial):
  - Bindings for C/C++, Fortran, Python, …
  - E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …
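As a minimal illustration of the point-to-point API named above (buffer names and tag are illustrative):

    // Rank 0 sends nx floats to rank 1, which receives them.
    if ( rank == 0 )
        MPI_Send( sendbuf, nx, MPI_FLOAT, 1 /*dest*/, 0 /*tag*/, MPI_COMM_WORLD );
    else if ( rank == 1 )
        MPI_Recv( recvbuf, nx, MPI_FLOAT, 0 /*source*/, 0 /*tag*/,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );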

SLIDE 49

MPI - SKELETON

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    /* Initialize the MPI library */
    MPI_Init(&argc, &argv);
    /* Determine the calling process' rank and the total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Call MPI routines like MPI_Send, MPI_Recv, ... */
    ...
    /* Shut down the MPI library */
    MPI_Finalize();
    return 0;
}

SLIDE 50

MPI

Compiling and launching:

$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>

mpirun starts four copies of myapp, with ranks 0 through 3.

SLIDE 51

EXAMPLE JACOBI

Top/Bottom Halo (1): send the first interior row to the top neighbor, receive the bottom halo row from the bottom neighbor.

MPI_Sendrecv( a_new+iy_start*nx,  nx, MPI_FLOAT, top,    0,
              a_new+(iy_end*nx),  nx, MPI_FLOAT, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );


SLIDE 53

EXAMPLE JACOBI

Top/Bottom Halo (1, 2): both directions of the ring exchange in two MPI_Sendrecv calls.

MPI_Sendrecv( a_new+iy_start*nx,   nx, MPI_FLOAT, top,    0,
              a_new+(iy_end*nx),   nx, MPI_FLOAT, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_FLOAT, bottom, 0,
              a_new,               nx, MPI_FLOAT, top,    0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

SLIDE 54

HANDLING MULTIPLE MULTI GPU NODES

[Diagram: 32 MPI ranks spread over four nodes: ranks 0-7, 8-15, 16-23, 24-31]

SLIDE 55

HANDLING MULTIPLE MULTI GPU NODES

How to determine the local rank? MPI-3:

MPI_Comm local_comm;
MPI_Info info;
MPI_Info_create(&info);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm);

int local_rank = -1;
MPI_Comm_rank(local_comm, &local_rank);

MPI_Comm_free(&local_comm);
MPI_Info_free(&info);

SLIDE 56

HANDLING MULTIPLE MULTI GPU NODES

[Diagram: MPI_COMM_WORLD split into one shared-memory communicator per node; global ranks 0-31 map to local ranks 0-7 on each node]

SLIDE 57

HANDLING MULTIPLE MULTI GPU NODES

GPU affinity: use the local rank.

int local_rank = -1;
MPI_Comm_rank(local_comm, &local_rank);

int num_devices = 0;
cudaGetDeviceCount(&num_devices);
cudaSetDevice(local_rank % num_devices);
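Putting the two snippets together, per-rank GPU selection at startup might look like this sketch (MPI_INFO_NULL instead of an explicit info object is an equivalent choice):

    // After MPI_Init: derive the node-local rank, then bind one GPU per rank.
    MPI_Comm local_comm;
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                         MPI_INFO_NULL, &local_comm );
    int local_rank = -1;
    MPI_Comm_rank( local_comm, &local_rank );
    MPI_Comm_free( &local_comm );

    int num_devices = 0;
    cudaGetDeviceCount( &num_devices );
    cudaSetDevice( local_rank % num_devices );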

SLIDE 58

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Multi Threaded Copy Overlap, MPI]

SLIDE 59

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: MPI, 4x V100 on DGX-1V]

SLIDE 60

COMMUNICATION + COMPUTATION OVERLAP

// Boundary rows first, on the push streams; bulk on the compute stream.
launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start,     (iy_start+1), nx, push_top_stream );
launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1),   iy_end,       nx, push_bottom_stream );
launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1),   nx, compute_stream );

const int top    = rank > 0 ? rank - 1 : (size-1);
const int bottom = (rank+1) % size;

// Exchange halos as soon as the corresponding boundary kernel is done.
// Device pointers are passed to MPI, which relies on a CUDA-aware MPI.
cudaStreamSynchronize( push_top_stream );
MPI_Sendrecv( a_new+iy_start*nx,   nx, MPI_REAL_TYPE, top,    0,
              a_new+(iy_end*nx),   nx, MPI_REAL_TYPE, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
cudaStreamSynchronize( push_bottom_stream );
MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0,
              a_new,               nx, MPI_REAL_TYPE, top,    0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

SLIDE 61

MULTI GPU JACOBI NVVP TIMELINE

[NVVP timeline: MPI Overlap, 4x V100 on DGX-1V]

SLIDE 62

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Multi Threaded Copy Overlap, MPI, MPI Overlap]

SLIDE 63

NVSHMEM

Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library:

• Defines an API to (symmetrically) allocate memory that is remotely accessible
• Defines an API to access remote data:
  - One-sided: e.g. shmem_putmem, shmem_getmem
  - Collectives: e.g. shmem_broadcast

NVSHMEM features:

• Symmetric memory allocations in device memory
• Communication API calls on the CPU (standard and stream-ordered)
• Allows kernel-side communication (API and LD/ST) between GPUs (within a single OS instance for the first release)
• Interoperable with MPI

SLIDE 64

NVSHMEM - SKELETON

#include <shmem.h>
#include <shmemx.h>

int main(int argc, char *argv[])
{
    ...
    MPI_Comm mpi_comm;
    shmemx_init_attr_t attr;
    mpi_comm = MPI_COMM_WORLD;
    attr.mpi_comm = &mpi_comm;
    // Initialize NVSHMEM on top of an existing MPI communicator.
    shmemx_init_attr(SHMEMX_INIT_WITH_MPI_COMM, &attr);
    int npes = shmem_n_pes();
    int mype = shmem_my_pe();
    ...
    return 0;
}

SLIDE 65

NVSHMEM - ALLOCATE MEMORY

// Symmetric allocations: every PE allocates the same size from the symmetric heap.
a     = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
a_new = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
...
while ( l2_norm > tol && iter < iter_max ) {
    ...
    jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
        a_new, a, l2_norm_d, iy_start, iy_end, nx,
        top, iy_end_top, bottom, iy_start_bottom );
    shmem_barrier_all_stream(compute_stream);
    ...
}
shmem_barrier_all();
shmem_free(a);
shmem_free(a_new);

SLIDE 66

NVSHMEM - KERNEL

__global__ void jacobi_kernel( ... )
{
    const int ix = blockIdx.x*blockDim.x + threadIdx.x;
    const int iy = blockIdx.y*blockDim.y + threadIdx.y + iy_start;
    real local_l2_norm = 0.0;
    if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
        const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
                                    + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
        a_new[ iy * nx + ix ] = new_val;
        // Kernel-side communication: put boundary values into the neighbors' halos.
        if ( iy_start == iy )     shmem_float_p(a_new + top_iy*nx + ix,    new_val, top_pe);
        if ( (iy_end - 1) == iy ) shmem_float_p(a_new + bottom_iy*nx + ix, new_val, bottom_pe);
        real residue = new_val - a[ iy * nx + ix ];
        local_l2_norm += residue * residue;
    }
    atomicAdd( l2_norm, local_l2_norm );
}

SLIDE 67

MULTI GPU JACOBI RUNTIME

DGX-1V, 7168 x 7168, 1000 iterations

[Chart: parallel efficiency vs. #GPUs (1-8); series: Multi Threaded Copy Overlap, MPI, MPI Overlap, NVSHMEM]

SLIDE 68

NOT COVERED IN THIS TALK I

MPI with GPUDirect Async support (under development):

• Enables MPI communication to follow CUDA stream ordering
• Avoids unwanted CPU/GPU synchronization

SLIDE 69

NOT COVERED IN THIS TALK II

NCCL: accelerating multi-GPU collective communications.

GOAL:
• A library for collective communication using CUDA kernels for reduction and data movement.

APPROACH:
• Allreduce, Reduce, Broadcast, ReduceScatter and Allgather primitives, similar to MPI primitives.
• CUDA oriented: works on CUDA pointers only, enqueues operations to CUDA streams.
• Supports any mapping between processes, threads and GPUs per thread, to integrate into any hybrid model.
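For illustration, an allreduce over four GPUs managed by a single process might look like this sketch; ncclCommInitAll, ncclAllReduce and the group calls are the real NCCL API, while the buffer and stream names are illustrative:

    #include <nccl.h>

    // One communicator per GPU, all owned by a single process.
    ncclComm_t comms[4];
    int devs[4] = { 0, 1, 2, 3 };
    ncclCommInitAll( comms, 4, devs );

    // Sum `count` floats across the 4 GPUs; sendbuf[i]/recvbuf[i] are device
    // pointers on GPU i and streams[i] a stream on GPU i (allocation not shown).
    ncclGroupStart();
    for ( int i = 0; i < 4; ++i )
        ncclAllReduce( sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                       comms[i], streams[i] );
    ncclGroupEnd();

    for ( int i = 0; i < 4; ++i )
        ncclCommDestroy( comms[i] );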


SLIDE 70

CONCLUSION

                     Programming Model                GPUDirect P2P    Multi Node
Single Threaded      CUDA                             Improves perf.   No
Multi Threaded       CUDA + OpenMP/TBB/…              Improves perf.   No
Multi Threaded P2P   CUDA + OpenMP/TBB/…              Required         No
MPI                  CUDA + MPI                       Improves perf.   Yes
NVSHMEM              CUDA + NVSHMEM (MPI interop.)    Required         Yes*

*Initial version will only support a single OS instance.