
dCUDA: Hardware Supported Overlap of Computation and Communication



  1. dCUDA: Hardware Supported Overlap of Computation and Communication
     Tobias Gysi, Jeremia Bär, and Torsten Hoefler (SPCL, ETH Zürich)

  2. GPU computing has gained a lot of popularity in various application
     domains: weather & climate, machine learning, molecular dynamics.

  3. GPU cluster programming using MPI and CUDA

     (Diagram: two nodes, each with a device and device memory attached
     via PCI-Express to host memory, and the hosts linked by the
     interconnect.)

     // run compute kernel
     __global__ void mykernel( … ) { }

     // launch compute kernel
     mykernel<<<64,128>>>( … );

     // on-node data movement
     cudaMemcpy(psize, &size, sizeof(int), cudaMemcpyDeviceToHost);

     // inter-node data movement
     MPI_Send(pdata, size, MPI_FLOAT, … );
     MPI_Recv(pdata, size, MPI_FLOAT, … );
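
     The slide shows only fragments; a minimal self-contained version of
     the same pattern might look as follows. The kernel body, buffer
     sizes, and the rank-0-to-rank-1 exchange are illustrative
     assumptions, not taken from the slide.

         #include <mpi.h>
         #include <cuda_runtime.h>

         // illustrative compute kernel (the slide leaves the body open)
         __global__ void mykernel(float *data, int n) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n) data[i] *= 2.0f;
         }

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             const int n = 64 * 128;              // one element per thread
             float *ddata;                        // device buffer
             float *hdata = new float[n];         // host staging buffer
             cudaMalloc(&ddata, n * sizeof(float));
             cudaMemset(ddata, 0, n * sizeof(float));

             // launch compute kernel
             mykernel<<<64, 128>>>(ddata, n);

             // on-node data movement: device to host over PCI-Express
             cudaMemcpy(hdata, ddata, n * sizeof(float),
                        cudaMemcpyDeviceToHost);

             // inter-node data movement over the interconnect
             if (rank == 0)
                 MPI_Send(hdata, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
             else if (rank == 1)
                 MPI_Recv(hdata, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);

             cudaFree(ddata);
             delete[] hdata;
             MPI_Finalize();
             return 0;
         }

     Note how every inter-node transfer is staged through host memory
     and wrapped in host-side MPI calls; this is exactly the duplication
     the next slide criticizes.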

  4. Disadvantages of the MPI-CUDA approach

     (Timeline diagram: the host launches mykernel<<< >>>( … ), then
     issues cudaMemcpy( … ) and a device sync, then mpi_send( … ) /
     mpi_recv( … ) with a cluster sync, before the next kernel launch.)

     complexity
     • two programming models
     • duplicated functionality

     performance
     • encourages sequential execution
     • low utilization of the costly hardware

  5. Achieve high resource utilization using oversubscription & hardware threads

     (Pipeline diagram: three hardware threads each issue the sequence
     ld %r0,%r1 / mul %r0,%r0,3 / st %r0,%r1; whenever one thread
     stalls, another has a ready instruction for the pipeline.)

     GPU cores use “parallel slack” to hide instruction pipeline
     latencies.
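
     In CUDA terms, exploiting this parallel slack simply means
     launching far more threads than the GPU has cores, so a warp
     scheduler can always find a ready warp while others stall on
     memory. A minimal sketch; the kernel and the sizes are
     illustrative:

         #include <cuda_runtime.h>

         // Each thread performs a load, a multiply by 3, and a store,
         // mirroring the ld / mul / st sequence in the diagram above.
         __global__ void scale3(float *buf, int n) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n)
                 buf[i] = buf[i] * 3.0f;
         }

         int main() {
             const int n = 1 << 24;    // ~16M elements: far more threads
             float *buf;               // than any GPU has cores
             cudaMalloc(&buf, n * sizeof(float));
             cudaMemset(buf, 0, n * sizeof(float));

             // Oversubscription: 131072 blocks of 128 threads keep the
             // warp schedulers supplied with ready work while individual
             // warps stall on memory.
             int threads = 128;
             int blocks = (n + threads - 1) / threads;
             scale3<<<blocks, threads>>>(buf, n);
             cudaDeviceSynchronize();

             cudaFree(buf);
             return 0;
         }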

  6. Use oversubscription & hardware threads to hide remote memory latencies

     (Pipeline diagram: as before, but the loads and stores are replaced
     by get and put operations; remote accesses stall much longer, yet
     with enough ready threads the pipeline stays busy.)

     Introduce put & get operations to access distributed memory.

  7. How much “parallel slack” is necessary to fully utilize the interconnect?

     Little’s law:  concurrency = latency × throughput

                    device memory   interconnect
     latency             1 µs           19 µs
     bandwidth         200 GB/s          6 GB/s
     concurrency       200 kB          114 kB
     #threads         ~12000          ~7000
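
     Plugging in the numbers reproduces the concurrency row, and the
     thread counts follow if each thread keeps roughly 16 bytes in
     flight (an assumption chosen to be consistent with the slide’s
     figures):

         \text{concurrency} = \text{latency} \times \text{throughput}

         \text{device memory: } 1\,\mu\text{s} \times 200\,\text{GB/s}
           = 200\,\text{kB}, \qquad 200\,\text{kB} / 16\,\text{B}
           \approx 12{,}500 \text{ threads}

         \text{interconnect: } 19\,\mu\text{s} \times 6\,\text{GB/s}
           = 114\,\text{kB}, \qquad 114\,\text{kB} / 16\,\text{B}
           \approx 7{,}100 \text{ threads}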

  8. dCUDA (distributed CUDA) extends CUDA with MPI-3 RMA and notifications

     • iterative stencil kernel with thread-specific computation
     • ranks are mapped to blocks
     • device-side put/get operations
     • notifications for synchronization
     • shared and distributed memory

     for (int i = 0; i < steps; ++i) {
       // thread-specific computation
       for (int idx = from; idx < to; idx += jstride)
         out[idx] = -4.0 * in[idx] +
                    in[idx + 1] + in[idx - 1] +
                    in[idx + jstride] + in[idx - jstride];

       // communication: push halo rows to the left and right neighbors
       if (lsend)
         dcuda_put_notify(ctx, wout, rank - 1,
                          len + jstride, jstride, &out[jstride], tag);
       if (rsend)
         dcuda_put_notify(ctx, wout, rank + 1,
                          0, jstride, &out[len], tag);

       // wait until the expected halos have arrived
       dcuda_wait_notifications(ctx, wout,
                                DCUDA_ANY_SOURCE, tag, lsend + rsend);

       swap(in, out);
       swap(win, wout);
     }

  9. Advantages of the dCUDA approach

     (Timeline diagrams for two devices with two ranks each: MPI-CUDA
     separates stencil( … ) phases with device-wide syncs, while dCUDA
     ranks interleave stencil( … ), put( … ), and wait( … ) with no
     device synchronization.)

     performance
     • avoid device synchronization
     • latency hiding at cluster scale

     complexity
     • unified programming model
     • one communication mechanism

  10. Implementation of the dCUDA runtime system

      (Architecture diagram: the host side runs an event handler and one
      block manager per block, communicating over MPI and GPUDirect; the
      device side runs a per-block device library that implements
      put( … ), get( … ), and wait( … ).)
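
      The slide names the components but not their interfaces. As a
      loose sketch of how a device-side put could travel through a
      host-polled command queue to an MPI-3 RMA call: all names, struct
      fields, and the queue protocol below are hypothetical, not the
      actual dCUDA implementation.

          #include <mpi.h>

          // Hypothetical command record; device blocks fill these in,
          // and the host-side block manager turns them into MPI-3 RMA.
          enum OpKind { OP_PUT = 0, OP_GET = 1 };

          struct Command {
              int      op;           // OP_PUT or OP_GET
              int      target_rank;  // destination rank
              MPI_Aint offset;       // displacement in the target window
              int      size;         // transfer size in bytes
              void    *buf;          // origin buffer (GPUDirect-accessible)
          };

          // Device-library side (sketch): a rank claims a queue slot in
          // host-visible memory and publishes a put command. A real
          // implementation needs a separate publication flag so the host
          // never observes a half-written record.
          __device__ void enqueue_put(Command *queue, int *tail,
                                      int target, MPI_Aint offset,
                                      int size, void *buf) {
              int slot = atomicAdd(tail, 1);
              queue[slot].op = OP_PUT;
              queue[slot].target_rank = target;
              queue[slot].offset = offset;
              queue[slot].size = size;
              queue[slot].buf = buf;
              __threadfence_system();  // make the record host-visible
          }

          // Host side (sketch): the block manager drains the queue and
          // issues one-sided MPI calls on behalf of its block; GET
          // handling and notification matching are elided.
          void block_manager(Command *queue, volatile int *tail,
                             int *head, MPI_Win win) {
              while (*head < *tail) {
                  Command c = queue[(*head)++];
                  if (c.op == OP_PUT)
                      MPI_Put(c.buf, c.size, MPI_BYTE, c.target_rank,
                              c.offset, c.size, MPI_BYTE, win);
              }
              MPI_Win_flush_all(win);  // complete outstanding RMA
          }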

  11. Overlap of a copy kernel with halo exchange communication,
      benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

      (Plot: execution time [ms] versus # of copy iterations per
      exchange; series “no overlap”, “compute & exchange”, “halo
      exchange”, and “compute only”. As the compute per exchange grows,
      “compute & exchange” approaches “compute only”, i.e. the
      communication is fully overlapped.)

  12. Weak scaling of MPI-CUDA and dCUDA for a stencil program,
      benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

      (Plot: execution time [ms] versus # of nodes; series MPI-CUDA,
      dCUDA, and halo exchange.)

  13. Weak scaling of MPI-CUDA and dCUDA for a particle simulation,
      benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

      (Plot: execution time [ms] versus # of nodes; series MPI-CUDA,
      dCUDA, and halo exchange.)

  14. Weak scaling of MPI-CUDA and dCUDA for sparse-matrix vector
      multiplication, benchmarked on Greina (8 Haswell nodes with
      1x Tesla K80 per node)

      (Plot: execution time [ms] versus # of nodes (1, 4, 9); series
      dCUDA, MPI-CUDA, and communication.)

  15. Conclusions

      • unified programming model for GPU clusters
        - device-side remote memory access operations with notifications
        - transparent support of shared and distributed memory
      • extends the latency hiding technique of CUDA to the full cluster
        - inter-node communication without device synchronization
        - oversubscription & hardware threads hide remote memory latencies
      • automatic overlap of computation and communication
        - synthetic benchmarks demonstrate perfect overlap
        - example applications demonstrate the applicability to real codes

      https://spcl.inf.ethz.ch/Research/Parallel_Programming/dCUDA/
