
Progress in automatic GPU compilation and why you want to run MPI on your GPU



  1. Torsten Hoefler: Progress in automatic GPU compilation and why you want to run MPI on your GPU. With Tobias Grosser and Tobias Gysi @ SPCL. Presented at CCDSC, Lyon, France, 2016.

  2. #pragma ivdep
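  (Not from the slides: a minimal sketch of how a directive like #pragma ivdep is typically used. It asserts to compilers such as icc that the marked loop carries no dependences, so it may be vectorized; the function and array names here are hypothetical.)

    // Sketch: #pragma ivdep promises the compiler that iterations of the
    // following loop are independent, enabling vectorization.
    void scale(float *a, const float *b, int n) {
        #pragma ivdep
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];   // no loop-carried dependence
    }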

  3. !$ACC DATA &
     !$ACC PRESENT(density1,energy1) &
     !$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
     !$ACC PRESENT(pre_vol,post_vol,ener_flux)
     !$ACC KERNELS
     IF(dir.EQ.g_xdir) THEN
       IF(sweep_number.EQ.1) THEN
         !$ACC LOOP INDEPENDENT
         DO k=y_min-2,y_max+2
           !$ACC LOOP INDEPENDENT
           DO j=x_min-2,x_max+2
             pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
             post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
           ENDDO
         ENDDO
       ELSE
         !$ACC LOOP INDEPENDENT
         DO k=y_min-2,y_max+2
           !$ACC LOOP INDEPENDENT
           DO j=x_min-2,x_max+2
             pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
             post_vol(j,k)=volume(j,k)
           ENDDO
         ENDDO
       ENDIF

  4. Heitlager et al.: A Practical Model for Measuring Maintainability

  5. sloccount *.f90: 6,440 lines, of which !$ACC directives: 833 (13%)

     !$ACC DATA &
     !$ACC COPY(chunk%tiles(1)%field%density0) &
     !$ACC COPY(chunk%tiles(1)%field%density1) &
     !$ACC COPY(chunk%tiles(1)%field%energy0) &
     !$ACC COPY(chunk%tiles(1)%field%energy1) &
     !$ACC COPY(chunk%tiles(1)%field%pressure) &
     !$ACC COPY(chunk%tiles(1)%field%soundspeed) &
     !$ACC COPY(chunk%tiles(1)%field%viscosity) &
     !$ACC COPY(chunk%tiles(1)%field%xvel0) &
     !$ACC COPY(chunk%tiles(1)%field%yvel0) &
     !$ACC COPY(chunk%tiles(1)%field%xvel1) &
     !$ACC COPY(chunk%tiles(1)%field%yvel1) &
     !$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
     !$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
     !$ACC COPY(chunk%tiles(1)%field%mass_flux_x) &
     !$ACC COPY(chunk%tiles(1)%field%mass_flux_y) &
     !$ACC COPY(chunk%tiles(1)%field%volume) &
     !$ACC COPY(chunk%tiles(1)%field%work_array1) &
     !$ACC COPY(chunk%tiles(1)%field%work_array2) &
     !$ACC COPY(chunk%tiles(1)%field%work_array3) &
     !$ACC COPY(chunk%tiles(1)%field%work_array4) &
     !$ACC COPY(chunk%tiles(1)%field%work_array5) &
     !$ACC COPY(chunk%tiles(1)%field%work_array6) &
     !$ACC COPY(chunk%tiles(1)%field%work_array7) &
     !$ACC COPY(chunk%tiles(1)%field%cellx) &
     !$ACC COPY(chunk%tiles(1)%field%celly) &
     !$ACC COPY(chunk%tiles(1)%field%celldx) &
     !$ACC COPY(chunk%tiles(1)%field%celldy) &
     !$ACC COPY(chunk%tiles(1)%field%vertexx) &
     !$ACC COPY(chunk%tiles(1)%field%vertexdx) &
     !$ACC COPY(chunk%tiles(1)%field%vertexy) &
     !$ACC COPY(chunk%tiles(1)%field%vertexdy) &
     !$ACC COPY(chunk%tiles(1)%field%xarea) &
     !$ACC COPY(chunk%tiles(1)%field%yarea) &
     !$ACC COPY(chunk%left_snd_buffer) &
     !$ACC COPY(chunk%left_rcv_buffer) &
     !$ACC COPY(chunk%right_snd_buffer) &
     !$ACC COPY(chunk%right_rcv_buffer) &
     !$ACC COPY(chunk%bottom_snd_buffer) &
     !$ACC COPY(chunk%bottom_rcv_buffer) &
     !$ACC COPY(chunk%top_snd_buffer) &
     !$ACC COPY(chunk%top_rcv_buffer)

  6. (figure-only slide, no text)

  7. do i = 0, N
       do j = 0, i
         y(i,j) = ( y(i,j) + y(i,j+1) ) / 2
       end do
     end do
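  (Not from the slides: the same triangular loop nest in C, as a sketch of the kind of static control part a polyhedral compiler like Polly analyzes; the function name and the flat row-major layout are assumptions. Note that iteration j reads y(i,j+1), which iteration j+1 overwrites, so the inner loop is not trivially parallel.)

    // Sketch: C rendition of the slide's loop nest. Assumes y has at least
    // N+1 rows and N+2 columns, stored row-major with 'cols' columns per row.
    void halve_pairs(int N, int cols, double *y) {
        for (int i = 0; i <= N; ++i)
            for (int j = 0; j <= i; ++j)
                y[i * cols + j] = (y[i * cols + j] + y[i * cols + j + 1]) / 2.0;
    }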

  8. Some results: Polybench 3.2, speedup over icc -O3. Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop): geomean ~6x, arithmean ~30x. T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16

  9. Compiles all of SPEC CPU 2006 – example: LBM. (Chart: runtime in m:s for icc, icc -openmp, clang, and Polly ACC on two machines: Mobile, "essentially my 4-core x86 laptop with the (free) GPU that's in there", and Workstation, Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop); gains of ~20% and ~4x are annotated in the chart.) T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16

  10. (figure-only slide, no text)

  11. GPU latency hiding vs. MPI
      • CUDA: over-subscribe the hardware; use spare parallel slack for latency hiding
      • MPI: host controlled; full device synchronization
      (Figure: instruction streams of active threads on the device compute cores, with ld/st latency hidden by switching threads.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
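  (Not from the talk: a minimal sketch of the traditional MPI-CUDA pattern this slide contrasts against. The host drives communication, so it must synchronize the whole device before each halo exchange, and computation cannot overlap with communication. The kernel, neighbor ranks, and halo layout are hypothetical; CUDA-aware MPI is assumed.)

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <utility>

    __global__ void stencil_kernel(double *out, const double *in, int nx, int ny);  // hypothetical

    void run(double *d_in, double *d_out, int nx, int ny, int steps, int left, int right) {
        for (int step = 0; step < steps; ++step) {
            stencil_kernel<<<(nx * ny + 255) / 256, 256>>>(d_out, d_in, nx, ny);
            cudaDeviceSynchronize();                 // full device synchronization
            // host-controlled halo exchange with both neighbors (one row each)
            MPI_Sendrecv(d_out + nx,            nx, MPI_DOUBLE, left,  0,
                         d_out + nx * (ny - 1), nx, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::swap(d_in, d_out);
        }
    }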

  12. Hardware latency hiding at the cluster level? dCUDA (distributed CUDA)
      • unified programming model for GPU clusters
      • avoid unnecessary device synchronization to enable system-wide latency hiding
      (Figure: instruction streams of active threads on the device compute cores, now interleaving ld/st with remote put operations.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  13. dCUDA: MPI-3 RMA extensions
      • iterative stencil kernel
      • thread-specific idx computation
      • communication: map ranks to blocks, device-side put/get operations, notifications for synchronization
      • shared and distributed memory

      for (int i = 0; i < steps; ++i) {
        for (int idx = from; idx < to; idx += jstride)
          out[idx] = -4.0 * in[idx] +
                     in[idx + 1] + in[idx - 1] +
                     in[idx + jstride] + in[idx - jstride];

        if (lsend)
          dcuda_put_notify(ctx, wout, rank - 1,
                           len + jstride, jstride, &out[jstride], tag);
        if (rsend)
          dcuda_put_notify(ctx, wout, rank + 1,
                           0, jstride, &out[len], tag);

        dcuda_wait_notifications(ctx, wout,
                                 DCUDA_ANY_SOURCE, tag, lsend + rsend);

        swap(in, out);
        swap(win, wout);
      }
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  14. Hardware supported communication overlap. (Figure: execution timelines of active blocks 1-8 on the device compute cores, comparing dCUDA with traditional MPI-CUDA.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  15. The dCUDA runtime system. (Architecture figure: on the host side, an event handler and a block manager talk to MPI; on the device side, the device library and its context; host and device are connected through ack, notification, logging, command, and more-blocks queues.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
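  (Not the actual dCUDA implementation: a purely illustrative sketch of the host-side proxy idea the figure conveys. An event-handler thread drains a device-filled command queue and forwards each request as an MPI RMA operation. PutRequest, pop_from_device, and the queue layout are hypothetical; a passive-target epoch (MPI_Win_lock_all) is assumed to be open on the window.)

    #include <mpi.h>

    struct PutRequest {            // hypothetical device-issued command
        int      target_rank;
        MPI_Aint target_offset;    // displacement in the target window
        int      count;            // number of doubles
        double  *origin;           // host-visible source buffer
    };

    bool pop_from_device(PutRequest *req);   // hypothetical: drains the command queue

    void event_handler(MPI_Win win) {
        PutRequest req;
        while (pop_from_device(&req)) {      // forward device requests to MPI
            MPI_Put(req.origin, req.count, MPI_DOUBLE,
                    req.target_rank, req.target_offset, req.count, MPI_DOUBLE, win);
            MPI_Win_flush(req.target_rank, win);   // complete the put before acknowledging
        }
    }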

  16. (Very) simple stencil benchmark. Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Chart: execution time [ms] vs. # of copy iterations per exchange, with curves for no overlap, compute & exchange, halo exchange, and compute only.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  17. Real stencil (COSMO weather/climate code). Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Chart: execution time [ms] vs. # of nodes, for MPI-CUDA, dCUDA, and halo exchange.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  18. Particle simulation code (Barnes-Hut). Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Chart: execution time [ms] vs. # of nodes, for MPI-CUDA, dCUDA, and halo exchange.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  19. Sparse matrix-vector multiplication. Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Chart: execution time [ms] vs. # of nodes, for dCUDA, MPI-CUDA, and communication.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  20. Summary. Polly-ACC (http://spcl.inf.ethz.ch/Polly-ACC): automatic, "regression free", high performance. dCUDA: distributed memory, overlap, high performance. (The dCUDA stencil code from slide 13 is repeated on this slide.)

  21. LLVM Nightly Test Suite. (Bar chart, log scale from 1 to 10000: number of detected SCoPs, split by dimensionality (0-dim, 1-dim, 2-dim, 3-dim), with and without heuristics.)

  22. Cactus ADM (SPEC 2006). (Chart panels: Workstation and Mobile.)

  23. Evading various “ends” – the hardware view
