  1. MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute Jan Stephan, Intern Devtech Compute

  2. MOTIVATION Why use multiple GPUs? Need to compute larger problems, e.g. bigger networks, car models, …; need to compute faster, e.g. weather prediction; better energy efficiency with dense nodes that have multiple GPUs.

  3. DGX-1V Two fully connected quads, connected at the corners. 300 GB/s per GPU bidirectional bandwidth to peers; load/store access to peer memory; full atomics to peer GPUs; high-speed copy engines for bulk data copies; PCIe to/from the CPUs. [Diagram: NVLink topology of GPU0-GPU7 with CPU 0 (cores 0-19) and CPU 1 (cores 20-39)]
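
Which GPU pairs have direct peer access can be queried at runtime. A minimal sketch (not from the slides, standard CUDA runtime API) that prints the peer-access matrix for the node:

    // Print which GPU pairs support direct peer access (1 = yes).
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        for (int i = 0; i < num_devices; ++i) {
            for (int j = 0; j < num_devices; ++j) {
                int can_access = 0;
                if (i != j) cudaDeviceCanAccessPeer(&can_access, i, j);
                printf("%d ", can_access);
            }
            printf("\n");
        }
        return 0;
    }

On a DGX-1V, the NVLink-connected pairs in the diagram above are all direct peers.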

  4. EXAMPLE: JACOBI SOLVER Solves the 2D Laplace equation on a rectangle, $\Delta u(x,y) = 0 \;\forall\; (x,y) \in \Omega \setminus \partial\Omega$, with Dirichlet boundary conditions (constant values) on the left and right boundaries and periodic boundary conditions on the top and bottom boundaries.
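
The step from the PDE to the update rule on the next slide is the standard 5-point finite-difference discretization (not spelled out on the slide): on a uniform grid, the Jacobi iteration computes each interior point as the average of its four neighbors,

    $a^{\mathrm{new}}_{iy,ix} = \tfrac{1}{4}\left(a_{iy,ix+1} + a_{iy,ix-1} + a_{iy-1,ix} + a_{iy+1,ix}\right)$

which is exactly the loop body below.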

  5. EXAMPLE: JACOBI SOLVER Single GPU. While not converged, do a Jacobi step:

    for (int iy = 1; iy < ny - 1; iy++)
        for (int ix = 1; ix < nx - 1; ix++)
            a_new[iy * nx + ix] = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1]
                                        + a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]);

then apply the periodic boundary conditions, swap a_new and a, and start the next iteration.
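
For reference, a self-contained serial version of the whole loop. This is a sketch: the grid size, tolerance, and boundary value 1.0 are assumptions for illustration; the slides show only the inner step.

    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        const int nx = 512, ny = 512, iter_max = 1000;
        const double tol = 1.0e-8;
        std::vector<double> a(nx * ny, 0.0), a_new(nx * ny, 0.0);
        for (int iy = 0; iy < ny; ++iy) {          // Dirichlet: left/right columns = 1
            a[iy * nx] = a_new[iy * nx] = 1.0;
            a[iy * nx + nx - 1] = a_new[iy * nx + nx - 1] = 1.0;
        }
        double l2_norm = 1.0;
        int iter = 0;
        while (l2_norm > tol && iter < iter_max) {
            l2_norm = 0.0;
            for (int iy = 1; iy < ny - 1; ++iy)    // Jacobi step on the interior
                for (int ix = 1; ix < nx - 1; ++ix) {
                    const double v = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1]
                                           + a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]);
                    a_new[iy * nx + ix] = v;
                    const double d = v - a[iy * nx + ix];
                    l2_norm += d * d;
                }
            for (int ix = 1; ix < nx - 1; ++ix) {  // periodic top/bottom rows
                a_new[ix] = a_new[(ny - 2) * nx + ix];
                a_new[(ny - 1) * nx + ix] = a_new[nx + ix];
            }
            l2_norm = std::sqrt(l2_norm);
            std::swap(a, a_new);
            ++iter;
        }
        printf("iterations: %d, residual: %g\n", iter, l2_norm);
        return 0;
    }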

  6. DOMAIN DECOMPOSITION Different ways to split the work between processes. Minimize the number of neighbors: communicate with fewer neighbors; optimal for latency-bound communication; halo data is contiguous if the data is row-major. Minimize the surface-to-volume ratio: communicate less data; optimal for bandwidth-bound communication; halo data is contiguous if the data is column-major. This example uses the first option, a 1D split by rows (see the sketch below).
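
A small sketch of how each GPU's row range can be derived for the 1D split, distributing the remainder rows to the first chunks so chunk sizes differ by at most one row. The function name is hypothetical; the actual example stores each GPU's rows in a per-device array with extra halo rows.

    #include <vector>

    // Partition the ny-2 interior rows (rows 1 .. ny-2) across num_devices GPUs.
    void partition_rows(int ny, int num_devices,
                        std::vector<int>& iy_start, std::vector<int>& iy_end) {
        const int interior = ny - 2;
        const int base = interior / num_devices;
        const int rem  = interior % num_devices;
        int row = 1;
        for (int dev = 0; dev < num_devices; ++dev) {
            const int rows = base + (dev < rem ? 1 : 0);
            iy_start.push_back(row);
            iy_end.push_back(row + rows);          // exclusive upper bound
            row += rows;
        }
    }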

  7. EXAMPLE: JACOBI SOLVER Multi GPU. While not converged, do a Jacobi step on this GPU's rows:

    for (int iy = iy_start; iy < iy_end; iy++)
        for (int ix = 1; ix < nx - 1; ix++)
            a_new[iy * nx + ix] = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1]
                                        + a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]);

then apply the periodic boundary conditions, exchange halos with the 2 neighbors (one step of a ring exchange), swap a_new and a, and start the next iteration.

  8. SINGLE THREADED MULTI GPU PROGRAMMING One host thread drives all GPUs:

    while (l2_norm > tol && iter < iter_max) {
        // launch a Jacobi step on every device; all calls are asynchronous
        for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
            const int top    = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
            const int bottom = (dev_id + 1) % num_devices;
            cudaSetDevice(dev_id);
            cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(real));
            jacobi_kernel<<<dim_grid, dim_block>>>(
                a_new[dev_id], a[dev_id], l2_norm_d[dev_id],
                iy_start[dev_id], iy_end[dev_id], nx);
            cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real),
                            cudaMemcpyDeviceToHost);
            // halo exchange with the two ring neighbors
            cudaMemcpyAsync(a_new[top] + iy_end[top] * nx,
                            a_new[dev_id] + iy_start[dev_id] * nx,
                            nx * sizeof(real), ...);
            cudaMemcpyAsync(a_new[bottom],
                            a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
                            nx * sizeof(real), ...);
        }
        // synchronize and reduce the residual across devices
        l2_norm = 0.0;
        for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
            cudaSetDevice(dev_id);
            cudaDeviceSynchronize();
            l2_norm += *(l2_norm_h[dev_id]);
        }
        l2_norm = std::sqrt(l2_norm);
        for (int dev_id = 0; dev_id < num_devices; ++dev_id)
            std::swap(a_new[dev_id], a[dev_id]);
        iter++;
    }
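
The "..." in the copy calls is elided on the slide itself and left as-is above. For a concrete picture, one plausible completion (an assumption, not the talk's code) is the device-to-device copy kind plus a per-device stream; the stream array is hypothetical:

    // Assumed completion of the halo copy to the top neighbor; with UVA the
    // runtime resolves which device owns each pointer.
    cudaMemcpyAsync(a_new[top] + iy_end[top] * nx,
                    a_new[dev_id] + iy_start[dev_id] * nx,
                    nx * sizeof(real), cudaMemcpyDeviceToDevice,
                    stream[dev_id] /* hypothetical per-device stream */);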

  9. EXAMPLE JACOBI Top/Bottom Halo Each GPU posts two copies per iteration: (1) its first interior row goes into the top neighbor's bottom halo row, (2) its last interior row goes into the bottom neighbor's top halo row (row 0):

    cudaMemcpyAsync(a_new[top] + iy_end[top] * nx,              // (1)
                    a_new[dev_id] + iy_start[dev_id] * nx,
                    nx * sizeof(real), ...);
    cudaMemcpyAsync(a_new[bottom],                              // (2)
                    a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
                    nx * sizeof(real), ...);

  12. SCALABILITY METRICS FOR SUCCESS Serial time $T_s$: how long it takes to run the problem with a single process. Parallel time $T_p$: how long it takes to run the problem with multiple processes. Number of processes $P$: the number of processes operating on the task at hand. Speedup $S = T_s / T_p$: how much faster the parallel version is vs. the serial one (optimal is $P$). Efficiency $E = S / P$: how efficiently the processors are used (optimal is $1$).
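
With made-up numbers for illustration: if the serial run takes $T_s = 10\,\mathrm{s}$ and the run on $P = 8$ processes takes $T_p = 1.5\,\mathrm{s}$, then $S = T_s/T_p \approx 6.7$ and $E = S/P \approx 0.83$, i.e. 83% parallel efficiency.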

  13. EXAMPLE: JACOBI SOLVER Single GPU performance vs. problem size, Tesla V100 SXM2. [Chart: performance (Mcells/s), efficiency (%), and achieved occupancy (%) for problem sizes nx = ny from 512 to 8192]

  14. MULTI GPU JACOBI RUNTIME DGX-1V, 7168 x 7168, 1000 iterations. [Chart: runtime (s) and parallel efficiency (%) for 1-8 GPUs, Single Threaded Copy variant]

  15. MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy, 4 V100 on DGX-1V. [NVVP timeline screenshot]

  16. MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy, 4 V100 on DGX-1V. [NVVP timeline screenshot, zoomed view]

  17. GPUDIRECT P2P Maximizes intra-node inter-GPU bandwidth; avoids host memory and system topology bottlenecks. [Diagram: GPU0-GPU7 with per-GPU memories, peer copies going directly between GPU memories]
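
Besides UVA-based cudaMemcpyAsync, the runtime also offers an explicit peer-copy entry point. A minimal self-contained sketch (not from the slides) copying a buffer from device 0 to device 1 with cudaMemcpyPeerAsync:

    #include <cuda_runtime.h>

    int main() {
        float *buf0 = nullptr, *buf1 = nullptr;
        const size_t n = 1 << 20;
        cudaSetDevice(0); cudaMalloc(&buf0, n * sizeof(float));
        cudaSetDevice(1); cudaMalloc(&buf1, n * sizeof(float));
        // Explicit peer copy: source and destination devices are named, so the
        // runtime uses direct P2P if available and stages through the host otherwise.
        cudaMemcpyPeerAsync(buf1, 1, buf0, 0, n * sizeof(float));
        cudaDeviceSynchronize();
        cudaFree(buf1); cudaSetDevice(0); cudaFree(buf0);
        return 0;
    }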

  18. GPUDIRECT P2P Enable P2P:

    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        // enable peer access to both ring neighbors, if supported
        const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
        int canAccessPeer = 0;
        cudaDeviceCanAccessPeer(&canAccessPeer, dev_id, top);
        if (canAccessPeer) cudaDeviceEnablePeerAccess(top, 0);
        const int bottom = (dev_id + 1) % num_devices;
        if (top != bottom) {
            cudaDeviceCanAccessPeer(&canAccessPeer, dev_id, bottom);
            if (canAccessPeer) cudaDeviceEnablePeerAccess(bottom, 0);
        }
    }

  19. MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy, 4 V100 on DGX-1V with P2P. [NVVP timeline screenshot]

  20. MULTI GPU JACOBI RUNTIME DGX-1V, 7168 x 7168, 1000 iterations. [Chart: parallel efficiency (%) for 1-8 GPUs, Single Threaded Copy vs. Single Threaded Copy P2P]

  21. 1D RING EXCHANGE Halo updates for a 1D domain decomposition with periodic boundary conditions. Unidirectional rings are an important building block for collective algorithms (a sketch follows below).
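
To make the "building block" claim concrete, a small host-only sketch (illustrative, not from the talk) of a ring allgather: in each of n-1 steps, every rank forwards the block it received last to its successor, so after n-1 steps every rank holds all n blocks.

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 8;
        std::vector<std::vector<int>> have(n);
        for (int r = 0; r < n; ++r) have[r].push_back(r);  // each rank owns one block
        for (int step = 0; step < n - 1; ++step) {
            std::vector<int> incoming(n);
            for (int r = 0; r < n; ++r)
                incoming[(r + 1) % n] = have[r][step];     // pass newest block to successor
            for (int r = 0; r < n; ++r) have[r].push_back(incoming[r]);
        }
        for (int b : have[0]) printf("%d ", b);            // rank 0 now holds all n blocks
        printf("\n");
        return 0;
    }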

  22. MAPPING 1D RING EXCHANGE TO DGX-1V [Diagram: domains/ranks 0-7 placed on GPU0-GPU7 of the DGX-1V topology]

  23. MAPPING 1D RING EXCHANGE TO DGX-1V Reordering the devices makes consecutive ranks land on directly connected GPUs: export CUDA_VISIBLE_DEVICES="0,3,2,1,5,6,7,4" [Diagram: domains/ranks 0-7 on the reordered GPUs]
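
CUDA_VISIBLE_DEVICES renumbers the devices the runtime sees, so after this export, logical device 1 in the application is physical GPU 3. A quick way to check the mapping (a sketch using the standard runtime API):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int dev = 0; dev < n; ++dev) {
            char bus_id[32];
            cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), dev);
            printf("logical device %d -> PCI bus id %s\n", dev, bus_id);
        }
        return 0;
    }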

  24. MULTI GPU JACOBI RUNTIME DGX-1V, 7168 x 7168, 1000 iterations. [Chart: parallel efficiency (%) for 1-8 GPUs, Single Threaded Copy vs. Single Threaded Copy P2P (no opt) vs. Single Threaded Copy P2P]

  25. MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy, 4 V100 on DGX-1V with P2P. [NVVP timeline screenshot]

  26. MULTI THREADED MULTI GPU PROGRAMMING Using OpenMP, one thread per GPU:

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    #pragma omp parallel num_threads(num_devices)
    {
        int dev_id = omp_get_thread_num();
        cudaSetDevice(dev_id);
        // per-thread work for this GPU goes here
    }
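
A sketch of how the loop from slide 8 maps onto this pattern: one thread per GPU, with OpenMP barriers replacing the host-side serialization. This is an assumption of how the pieces fit, not code from the talk; it reuses the variables and the elided halo exchange from the earlier slides.

    #pragma omp parallel num_threads(num_devices)
    {
        const int dev_id = omp_get_thread_num();
        cudaSetDevice(dev_id);
        while (l2_norm > tol && iter < iter_max) {
            cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(real));
            jacobi_kernel<<<dim_grid, dim_block>>>(
                a_new[dev_id], a[dev_id], l2_norm_d[dev_id],
                iy_start[dev_id], iy_end[dev_id], nx);
            cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real),
                            cudaMemcpyDeviceToHost);
            // halo exchange with the two ring neighbors as on slide 9 ...
            cudaDeviceSynchronize();
            #pragma omp barrier              // all copies into a_new are complete
            std::swap(a_new[dev_id], a[dev_id]);
            #pragma omp single               // one thread reduces the residual
            {
                l2_norm = 0.0;
                for (int i = 0; i < num_devices; ++i)
                    l2_norm += *(l2_norm_h[i]);
                l2_norm = std::sqrt(l2_norm);
                ++iter;
            }                                // implicit barrier after single
        }
    }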

  27. MULTI GPU JACOBI NVVP TIMELINE Multi Threaded Copy, 4 V100 on DGX-1V with P2P. [NVVP timeline screenshot]

  28. MULTI GPU JACOBI RUNTIME DGX-1V, 1024 x 1024, 1000 iterations. [Chart: parallel efficiency (%) for 1-8 GPUs, Single Threaded Copy P2P vs. Multi Threaded Copy (no thread pinning)]

  29. GPU/CPU AFFINITY Pin each thread to a core of the CPU socket that is local to its GPU. [Diagram: threads 0-7 pinned next to their GPUs; CPU 0 has cores 0-19, CPU 1 has cores 20-39]
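
One way to realize this pinning, as a sketch using the Linux scheduler API. The core ranges come from the DGX-1V diagram above, and it assumes the reordered device list from slide 23, so logical devices 0-3 sit on CPU 0 and 4-7 on CPU 1; the function name is hypothetical.

    // GNU/Linux API; g++ defines _GNU_SOURCE by default.
    #include <sched.h>

    // Pin the calling thread to the cores of the CPU socket local to its GPU:
    // cores 0-19 (CPU 0) for dev_id 0-3, cores 20-39 (CPU 1) for dev_id 4-7.
    void pin_to_local_socket(int dev_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        const int first = (dev_id < 4) ? 0 : 20;   // socket ranges from the diagram
        for (int core = first; core < first + 20; ++core)
            CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);   // 0 = calling thread
    }

This would be called right after cudaSetDevice(dev_id) inside the OpenMP parallel region; OMP_PLACES/OMP_PROC_BIND or numactl are common alternatives.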
