Jiri Kraus, Senior Devtech Compute Jan Stephan, Intern Devtech Compute
MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute - - PowerPoint PPT Presentation
MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute - - PowerPoint PPT Presentation
MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute Jan Stephan, Intern Devtech Compute MOTIVATION Why use multiple GPUs? Need to compute larger, e.g. bigger networks, car models, Need to compute faster, e.g. weather
3
MOTIVATION
Need to compute larger, e.g. bigger networks, car models, … Need to compute faster, e.g. weather prediction Better energy efficiency with dense nodes with multiple GPUs
Why use multiple GPUs?
4
DGX-1V
Two fully connected quads, connected at corners 300GB/s per GPU bidirectional to Peers Load/store access to Peer Memory Full atomics to Peer GPUs High speed copy engines for bulk data copy PCIe to/from CPU
GPU1 GPU0 GPU3 GPU2 GPU4 GPU5 GPU6 GPU7 CPU 0 0 - 19 CPU 1 20-39
5
EXAMPLE: JACOBI SOLVER
Solves the 2D-Laplace Equation on a rectangle
∆𝒗 𝒚, 𝒛 = 𝟏 ∀ 𝒚, 𝒛 ∈ Ω\𝜺Ω
Dirichlet boundary conditions (constant values on boundaries) on left and right boundary Periodic boundary conditions on top and bottom boundary
6
EXAMPLE: JACOBI SOLVER
While not converged Do Jacobi step: for( int iy = 1; iy < ny-1; iy++ ) for( int ix = 1; ix < nx-1; ix++ ) a_new[iy*nx+ix] = -0.25 *
- ( a[ iy
*nx+(ix+1)] + a[ iy *nx+ix-1] + a[(iy-1)*nx+ ix ] + a[(iy+1)*nx+ix ] ); Apply periodic boundary conditions Swap a_new and a Next iteration
Single GPU
7
DOMAIN DECOMPOSITION
Minimize number of neighbors: Communicate to less neighbors Optimal for latency bound communication Minimize surface area/volume ratio: Communicate less data Optimal for bandwidth bound communication
Different Ways to split the work between processes:
Contiguous if data is row-major Contiguous if data is column-major
8
While not converged Do Jacobi step: for (int iy = iy_start; iy < iy_end; iy++ ) for( int ix = 1; ix < nx-1; ix++ ) a_new[iy*nx+ix] = -0.25 *
- ( a[ iy
*nx+(ix+1)] + a[ iy *nx+ix-1] + a[(iy-1)*nx+ ix ] + a[(iy+1)*nx+ix ] ); Apply periodic boundary conditions Exchange halo with 2 neighbors Swap a_new and a Next iteration
EXAMPLE: JACOBI SOLVER
Multi GPU
One-step with ring exchange
9
SINGLE THREADED MULTI GPU PROGRAMMING
while ( l2_norm > tol && iter < iter_max ) { for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) { const int top = dev_id > 0 ? dev_id - 1 : (num_devices-1); const int bottom = (dev_id+1)%num_devices; cudaSetDevice( dev_id ); cudaMemsetAsync(l2_norm_d[dev_id], 0 , sizeof(real) ); jacobi_kernel<<<dim_grid,dim_block>>>( a_new[dev_id], a[dev_id], l2_norm_d[dev_id], iy_start[dev_id], iy_end[dev_id], nx ); cudaMemcpyAsync( l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real), cudaMemcpyDeviceToHost ); cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...); cudaMemcpyAsync( a_new[bottom], a_new[dev_id]+(iy_end[dev_id]-1)*nx, nx*sizeof(real), ...); } l2_norm = 0.0; for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) { cudaSetDevice( dev_id ); cudaDeviceSynchronize(); l2_norm += *(l2_norm_h[dev_id]); } l2_norm = std::sqrt( l2_norm ); for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) std::swap(a_new[dev_id],a[dev_id]); iter++; }
10
EXAMPLE JACOBI
Top/Bottom Halo
cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);
11
EXAMPLE JACOBI
Top/Bottom Halo
1
cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);
1
12
EXAMPLE JACOBI
Top/Bottom Halo
1 1 2 2
cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...); cudaMemcpyAsync( a_new[bottom], a_new[dev_id]+(iy_end[dev_id]-1)*nx, nx*sizeof(real), ...); cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);
13
SCALABILTY METRICS FOR SUCCESS
Serial Time: 𝑈
𝑡: How long it takes to run the problem with a single process
Parallel Time: 𝑈
𝑞: How long it takes to run the problem with multiple processes
Number of Processes: 𝑄: The number of Processes operating on the task at hand Speedup: 𝑇 =
𝑈
𝑡
𝑈
𝑞: How much faster is the parallel version vs. serial. (optimal is 𝑄)
Efficiency: 𝐹 =
𝑇 𝑄: How efficient are the processors used (optimal is 1)
14
EXAMPLE: JACOBI SOLVER
0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 12000 12500 13000 13500 14000 14500 15000 15500 16000 512 1024 1536 2048 2560 3072 3584 4096 4608 5120 5632 6144 6656 7168 7680 8192
Efficiency/Occupancy Performance (Mcells/s) Problem size (nx=ny)
Performance (Mcells/s) Efficiency (%) Achieved Occupancy (%)
Single GPU performance vs. problem size – Tesla V100 SXM2
15
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 0.5 1 1.5 2 2.5 3 3.5 4 1 2 3 4 5 6 7 8
Parallel Efficiency Runtime (s) #GPUs
Chart Title
Single Threaded Copy Parallel Efficiency
DGX-1V - 7168 x 7168, 1000 iterations
16
MULTI GPU JACOBI NVVP TIMELINE
Single Threaded Copy 4 V100 on DGX-1V
17
MULTI GPU JACOBI NVVP TIMELINE
Single Threaded Copy 4 V100 on DGX-1V
18
GPUDIRECT P2P
Maximizes intra node inter GPU Bandwidth Avoids Host memory and system topology bottlenecks
GPU1 GPU0 GPU3 GPU2 GPU4 GPU5 GPU6 GPU7 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM
19
GPUDIRECT P2P
for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) { cudaSetDevice( dev_id ); const int top = dev_id > 0 ? dev_id - 1 : (num_devices-1); int canAccessPeer = 0; cudaDeviceCanAccessPeer ( &canAccessPeer, dev_id, top ); if ( canAccessPeer ) cudaDeviceEnablePeerAccess ( top, 0 ); const int bottom = (dev_id+1)%num_devices; if ( top != bottom ) { cudaDeviceCanAccessPeer ( &canAccessPeer, dev_id, bottom ); if ( canAccessPeer ) cudaDeviceEnablePeerAccess ( bottom, 0 ); } }
Enable P2P
20
MULTI GPU JACOBI NVVP TIMELINE
Single Threaded Copy 4 V100 on DGX-1V with P2P
21
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy Single Threaded Copy P2P
DGX-1V - 7168 x 7168, 1000 iterations
22
1D RING EXCHANGE
Halo updates for 1D domain decomposition with periodic boundary conditions Unidirectional rings are important building block for collective algorithms
…
23
GPU1 GPU0 GPU3 GPU2 GPU4 GPU5 GPU6 GPU7
MAPPING 1D RING EXCHANGE TO DGX-1V
Dom. Dom. 1 Dom. 2 Dom. 3 Dom. 4 Dom. 5 Dom. 6 Rank 7
24
GPU1 GPU0 GPU3 GPU2 GPU4 GPU5 GPU6 GPU7
MAPPING 1D RING EXCHANGE TO DGX-1V
Dom. Dom. 1 Dom. 2 Dom. 3 Dom. 4 Dom. 5 Dom. 6 Rank 7
export CUDA_VISIBLE_DEVICES="0,3,2,1,5,6,7,4“
25
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy Single Threaded Copy P2P (no opt) Single Threaded Copy P2P
DGX-1V - 7168 x 7168, 1000 iterations
26
MULTI GPU JACOBI NVVP TIMELINE
Single Threaded Copy 4 V100 on DGX-1V with P2P
27
MULTI THREADED MULTI GPU PROGRAMMING
int num_devices = 0; cudaGetDeviceCount( &num_devices ); #pragma omp parallel num_threads( num_devices ) { int dev_id = omp_get_thread_num(); cudaSetDevice( dev_id ); }
Using OpenMP
28
MULTI GPU JACOBI NVVP TIMELINE
Multi Threaded Copy 4 V100 on DGX-1V with P2P
29
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy P2P Multi Threaded Copy (no thread pinning)
DGX1 - 1024 x 1024, 1000 iterations
30
GPU1 GPU0 GPU3 GPU2 GPU4 GPU5 GPU6 GPU7 CPU 0 0 - 19 CPU 1 20-39
GPU/CPU AFFINITY
thread thread 1 thread 2 thread 3 thread 4 thread 5 thread 6 thread 7
31
GPU/CPU AFFINITY
$ nvidia-smi topo -m GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_2 mlx5_1 mlx5_3 CPU Affinity GPU0 X NV1 NV1 NV2 NV2 SOC SOC SOC PIX SOC PHB SOC 0-19 GPU1 NV1 X NV2 NV1 SOC NV2 SOC SOC PIX SOC PHB SOC 0-19 GPU2 NV1 NV2 X NV2 SOC SOC NV1 SOC PHB SOC PIX SOC 0-19 GPU3 NV2 NV1 NV2 X SOC SOC SOC NV1 PHB SOC PIX SOC 0-19 GPU4 NV2 SOC SOC SOC X NV1 NV1 NV2 SOC PIX SOC PHB 20-39 GPU5 SOC NV2 SOC SOC NV1 X NV2 NV1 SOC PIX SOC PHB 20-39 GPU6 SOC SOC NV1 SOC NV1 NV2 X NV2 SOC PHB SOC PIX 20-39 GPU7 SOC SOC SOC NV1 NV2 NV1 NV2 X SOC PHB SOC PIX 20-39 mlx5_0 PIX PIX PHB PHB SOC SOC SOC SOC X SOC PHB SOC mlx5_2 SOC SOC SOC SOC PIX PIX PHB PHB SOC X SOC PHB mlx5_1 PHB PHB PIX PIX SOC SOC SOC SOC PHB SOC X SOC mlx5_3 SOC SOC SOC SOC PHB PHB PIX PIX SOC PHB SOC X Legend:
Querying system topology with nvidia-smi topo –m
CPU 0 CPU 1
32
GPU/CPU AFFINITY
export OMP_PROC_BIND=TRUE export CUDA_VISIBLE_DEVICES="0,3,2,1,5,6,7,4“ export OMP_PLACES="{0},{1},{2},{3},{20},{21},{22},{23}"
Using CUDA_VISIBLE_DEVICES and OpenMP env. vars.
34
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy P2P Multi Threaded Copy (no thread pinning) Multi Threaded Copy
DGX-1V - 7168 x 7168, 1000 iterations
35
MULTI GPU JACOBI NVVP TIMELINE
Multi Threaded Copy 4 V100 on DGX-1V with P2P
36
COMMUNICATION + COMPUTATION OVERLAP
Process Whole Domain COMM No Overlap Process inner domain COMM Process boundary domain Dependency Boundary and inner domain processing can
- verlap
Overlap Possible gain
37
COMMUNICATION + COMPUTATION OVERLAP
//Compute bulk cudaStreamWaitEvent(compute_stream,push_top_done[(iter%2)][dev_id],0); cudaStreamWaitEvent(compute_stream,push_bottom_done[(iter%2)][dev_id],0); jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(a_new[dev_id],a,l2_norm_d,(iy_start+1),(iy_end[dev_id]-1),nx); //Compute boundaries cudaStreamWaitEvent( push_top_stream, reset_l2norm_done, 0 ); cudaStreamWaitEvent( push_top_stream, push_bottom_done[(iter%2)][top], 0 ); jacobi_kernel<<<nx/128+1,128,0,push_top_stream>>>( a_new[dev_id],a,l2_norm_d,iy_start,(iy_start+1),nx); cudaStreamWaitEvent(push_bottom_stream,reset_l2norm_done,0); cudaStreamWaitEvent(push_bottom_stream,push_top_done[(iter%2)][bottom], 0 ) ; jacobi_kernel<<<nx/128+1,128,0,push_bottom_stream>>>( a_new[dev_id],a,l2_norm_d,(iy_end[dev_id]-1),iy_end[dev_id],nx); //Apply periodic boundary conditions and exchange halo cudaMemcpyAsync(a_new[top]+(iy_end[top]*nx),a_new[dev_id]+iy_start*nx,nx*sizeof(real),cudaMemcpyDeviceToDevice,push_top_stream); cudaEventRecord(push_top_done[((iter+1)%2)][dev_id],push_top_stream); cudaMemcpyAsync(a_new[bottom],a_new[dev_id]+(iy_end[dev_id]-1)*nx,nx*sizeof(real),cudaMemcpyDeviceToDevice,push_bottom_stream); cudaEventRecord(push_bottom_done[((iter+1)%2)][dev_id],push_bottom_stream);
38
COMMUNICATION + COMPUTATION OVERLAP
int leastPriority = 0; int greatestPriority = leastPriority; cudaDeviceGetStreamPriorityRange ( &leastPriority, &greatestPriority ); cudaStreamCreateWithPriority ( &compute_stream, cudaStreamDefault, leastPriority ); cudaStreamCreateWithPriority ( &push_top_stream, cudaStreamDefault, greatestPriority ); cudaStreamCreateWithPriority ( &push_bottom_stream, cudaStreamDefault, greatestPriority );
High Priority Streams
39
MULTI GPU JACOBI NVVP TIMELINE
Multi Threaded Copy Overlap 4 V100 on DGX-1V with P2P
40
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy P2P Multi Threaded Copy Multi Threaded Copy Overlap
DGX-1V - 7168 x 7168, 1000 iterations
41
MULTI GPU JACOBI NVVP TIMELINE
Multi Threaded Copy Overlap 4 V100 on DGX-1V with P2P
42
MULTI THREADED MULTI GPU PROGRAMMING
while ( l2_norm > tol && iter < iter_max ) { cudaMemsetAsync(l2_norm_d, 0 , sizeof(real), compute_stream ); #pragma omp barrier cudaStreamWaitEvent( compute_stream, compute_done[iter%2][top], 0 ); cudaStreamWaitEvent( compute_stream, compute_done[iter%2][bottom], 0 ); jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>( a_new[dev_id], a, l2_norm_d, iy_start, iy_end[dev_id], nx, a_new[top], iy_end[top], a_new[bottom], 0 ); cudaEventRecord( compute_done[(iter+1)%2][dev_id], compute_stream ); cudaMemcpyAsync(l2_norm,l2_norm_d,sizeof(real),cudaMemcpyDeviceToHost,compute_stream); // l2_norm reduction btw threads skipped ... #pragma omp barrier std::swap(a_new[dev_id],a); iter++; }
Using OpenMP and P2P Mappings
43
MULTI THREADED MULTI GPU PROGRAMMING
__global__ void jacobi_kernel( ... ) { const int ix = bIdx.x*bDim.x+tIdx.x; const int iy = bIdx.y*bDim.y+tIdx.y + iy_start; real local_l2_norm = 0.0; if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) { const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ] + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] ); a_new[ iy * nx + ix ] = new_val; if ( iy_start == iy ) a_new_top[ top_iy *nx + ix ] = new_val; if ( (iy_end - 1) == iy ) a_new_bottom[ bottom_iy*nx + ix ] = new_val; real residue = new_val - a[ iy * nx + ix ]; local_l2_norm += residue * residue; } atomicAdd( l2_norm, local_l2_norm ); }
Using OpenMP and P2P Mappings
44
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy P2P Multi Threaded Copy Multi Threaded Copy Overlap Multi Threaded P2P
DGX-1V - 7168 x 7168, 1000 iterations
45
MULTI GPU JACOBI NVVP TIMELINE
Multi Threaded P2P 4 V100 on DGX-1V with P2P
46
MULTI THREADED MULTI GPU PROGRAMMING
cudaMemcpyAsync( l2_norm_h,l2_norm_d,sizeof(real),cudaMemcpyDeviceToHost,compute_stream ); #pragma omp barrier #pragma omp single { l2_norm = 0.0; } #pragma omp barrier cudaStreamSynchronize( compute_stream ); #pragma omp atomic l2_norm += *(l2_norm_h); #pragma omp barrier #pragma omp single { l2_norm = std::sqrt( l2_norm ); } #pragma omp barrier
L2 norm reduction
Can be hidden if L2 norm check is delayed.
47
MULTI THREADED MULTI GPU PROGRAMMING
cudaMemcpyAsync( l2_norm_h[ curr ],l2_norm_d[ curr ],sizeof( real ), cudaMemcpyDeviceToHost,compute_stream ); cudaEventRecord( copy_done[ curr ], compute_stream ); cudaEventSynchronize( copy_done[ prev ] ); #pragma omp atomic l2_norm[ prev ] += *( l2_norm_h[ prev ] ); #pragma omp barrier l2_norm[ prev ] = std::sqrt( l2_norm[ prev ] ); #pragma omp barrier l2_norm[ prev ] = 0.0;
Delayed L2 norm reduction
Issue H2D copy of current L2 norm Check L2 norm of last iteration (hidden)
48
MULTI GPU JACOBI NVVP TIMELINE
Multi Threaded P2P 4 V100 on DGX-1V with P2P
49
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Single Threaded Copy P2P Multi Threaded Copy Multi Threaded Copy Overlap Multi Threaded P2P (norm opt)
DGX-1V - 7168 x 7168, 1000 iterations
50
MESSAGE PASSING INTERFACE - MPI
Standard to exchange data between processes via messages
Defines API to exchanges messages
Point to Point: e.g. MPI_Send, MPI_Recv Collectives: e.g. MPI_Reduce
Multiple implementations (open source and commercial)
Bindings for C/C++, Fortran, Python, … E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …
51
MPI - SKELETON
#include <mpi.h> int main(int argc, char *argv[]) { int rank,size; /* Initialize the MPI library */ MPI_Init(&argc,&argv); /* Determine the calling process rank and total number of ranks */ MPI_Comm_rank(MPI_COMM_WORLD,&rank); MPI_Comm_size(MPI_COMM_WORLD,&size); /* Call MPI routines like MPI_Send, MPI_Recv, ... */ ... /* Shutdown MPI library */ MPI_Finalize(); return 0; }
52
MPI
Compiling and Launching
$ mpicc -o myapp myapp.c $ mpirun -np 4 ./myapp <args>
myapp myapp myapp myapp rank = 0 rank = 1 rank = 2 rank = 3
53
EXAMPLE JACOBI
Top/Bottom Halo
MPI_Sendrecv(a_new+iy_start*nx, nx, MPI_FLOAT, top , 0, a_new+(iy_end*nx), nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
54
EXAMPLE JACOBI
Top/Bottom Halo
1 1
MPI_Sendrecv(a_new+iy_start*nx, nx, MPI_FLOAT, top , 0, a_new+(iy_end*nx), nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
55
EXAMPLE JACOBI
Top/Bottom Halo
1 1 2 2
MPI_Sendrecv(a_new+iy_start*nx, nx, MPI_FLOAT, top , 0, a_new+(iy_end*nx), nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); MPI_Sendrecv(a_new+(iy_end-1)*nx, nx, MPI_FLOAT, bottom, 0, a_new, nx, MPI_FLOAT, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
56
HANDLING MULTIPLE MULTI GPU NODES
1 7
…
8 9 15
…
16 12 23
…
24 25 31
…
57
HANDLING MULTIPLE MULTI GPU NODES
How to determine the local rank? – MPI-3
10-Oct-17
MPI_Comm local_comm; MPI_Info info; MPI_Info_create(&info); MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm); int local_rank = -1; MPI_Comm_rank(local_comm,&local_rank); MPI_Comm_free(&local_comm); MPI_Info_free(&info);
58
HANDLING MULTIPLE MULTI GPU NODES
Shared Comm Shared Comm Shared Comm Shared Comm
1 7
…
1 7
…
8 9 15
…
16 12 23
…
24 25 31
…
1 7
…
1 7
…
1 7
…
59
HANDLING MULTIPLE MULTI GPU NODES
GPU-affinity
Use local rank:
int local_rank = -1; MPI_Comm_rank(local_comm,&local_rank); int num_devices = 0; cudaGetDeviceCount(&num_devices); cudaSetDevice(local_rank % num_devices); 10-Oct-17
60
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Multi Threaded Copy Overlap MPI
DGX-1V - 7168 x 7168, 1000 iterations
61
MULTI GPU JACOBI NVVP TIMELINE
MPI 4 V100 on DGX-1V
62
COMMUNICATION + COMPUTATION OVERLAP
launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, (iy_start+1), nx, push_top_stream ); launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1), iy_end, nx, push_bottom_stream ); launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1), nx, compute_stream ); const int top = rank > 0 ? rank - 1 : (size-1); const int bottom = (rank+1)%size; cudaStreamSynchronize( push_top_stream ); MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_REAL_TYPE, top , 0, a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ); cudaStreamSynchronize( push_bottom_stream ); MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0, a_new, nx, MPI_REAL_TYPE, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
10-Oct-17
63
MULTI GPU JACOBI NVVP TIMELINE
MPI Overlaping 4 V100 on DGX-1V
64
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Chart Title
Multi Threaded Copy Overlap MPI MPI Overlap
DGX-1V - 7168 x 7168, 1000 iterations
65
NVSHMEM
Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library
Defines API to (symmetrically) allocate memory that is remotely accessible Defines API to access remote data
One-sided: e.g. shmem_putmem, shmem_getmem Collectives: e.g. shmem_broadcast
NVSHMEM features Symmetric memory allocations in device memory Communication API calls on CPU (standard and stream-ordered) Allows kernel-side communication (API and LD/ST*) between GPUs (within a single OS instance for the first release) Interoperable with MPI
66
NVSHMEM - SKELETON
#include <shmem.h> #include <shmemx.h> int main(int argc, char *argv[]) { ... MPI_Comm mpi_comm; shmemx_init_attr_t attr; mpi_comm = MPI_COMM_WORLD; attr.mpi_comm = &mpi_comm; shmemx_init_attr (SHMEMX_INIT_WITH_MPI_COMM, &attr); int npes = shmem_n_pes(); int mype = shmem_my_pe(); ... return 0; }
67
NVSHMEM – ALLOCATE MEMORY
a = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real)); a_new = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real)); … while ( l2_norm > tol && iter < iter_max ) { … jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>( a_new, a, l2_norm_d, iy_start, iy_end, nx, top, iy_end_top, bottom, iy_start_bottom ); shmem_barrier_all_stream(compute_stream); … } shmem_barrier_all(); shmem_free(a); shmem_free(a_new);
68
NVSHEM - KERNEL
__global__ void jacobi_kernel( ... ) { const int ix = bIdx.x*bDim.x+tIdx.x; const int iy = bIdx.y*bDim.y+tIdx.y + iy_start; real local_l2_norm = 0.0; if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) { const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ] + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] ); a_new[ iy * nx + ix ] = new_val; if ( iy_start == iy ) shmem_float_p(a_new + top_iy*nx + ix, new_val, top_pe); if ( iy_end == iy ) shmem_float_p(a_new + bottom_iy*nx + ix, new_val, bottom_pe); real residue = new_val - a[ iy * nx + ix ]; } atomicAdd( l2_norm, local_l2_norm ); }
69
MULTI GPU JACOBI RUNTIME
0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% 1 2 3 4 5 6 7 8
Parallel Efficiency #GPUs
Multi Threaded Copy Overlap MPI MPI Overlap NVSHMEM
DGX-1V - 7168 x 7168, 1000 iterations
70
NOT COVERED IN THIS TALK I
Enables MPI communication to follow CUDA stream ordering Avoids unwanted CPU/GPU synchronization
MPI with GPUDirect Async support (under development)
71
NOT COVERED IN THIS TALK II
NCCL: Accelerating multi-GPU collective communications
GOAL:
- A library for collective communication using CUDA kernels for reduction and data
movement. APPROACH:
- Allreduce, Reduce, Broadcast, ReduceScatter and Allgather primitives, similar to
MPI primitives.
- CUDA oriented : works on CUDA pointers only, enqueues operations to CUDA
streams.
- Supports any mapping between processes, threads and GPUs per thread to
integrate into any hybrid model.
10-Oct-17
72
CONCLUSION
Programming Models GPUDirect P2P Multi Node Single Threaded CUDA Improves Perf. No Multi Threaded CUDA + OpenMP/TBB/… Improves Perf. No Multi Threaded P2P CUDA + OpenMP/TBB/… Required No MPI CUDA + MPI Improves Perf. Yes NVSHMEM CUDA + NVSHMEM (MPI interop.) Required Yes* *Initial version will only support a single OS instance.