Progress in automatic GPU compilation and why you want to run MPI on your GPU


SLIDE 1

TORSTEN HOEFLER

Progress in automatic GPU compilation and why you want to run MPI on your GPU

with Tobias Grosser and Tobias Gysi @ SPCL presented at CCDSC, Lyon, France, 2016

SLIDE 2

#pragma ivdep
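A minimal, hypothetical C example (not from the slides) of the annotation burden this directive represents: the programmer asserts that the loop carries no dependences so that the compiler may vectorize it.

/* hypothetical example: 'ivdep' promises the compiler there are no
   loop-carried dependences, so it may vectorize */
void scale(float *dst, const float *src, int n)
{
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];   /* every iteration writes a distinct element */
}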

SLIDE 3

!$ACC DATA &
!$ACC PRESENT(density1,energy1) &
!$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
!$ACC PRESENT(pre_vol,post_vol,ener_flux)
!$ACC KERNELS
IF(dir.EQ.g_xdir) THEN
  IF(sweep_number.EQ.1) THEN
!$ACC LOOP INDEPENDENT
    DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
      DO j=x_min-2,x_max+2
        pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
        post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
      ENDDO
    ENDDO
  ELSE
!$ACC LOOP INDEPENDENT
    DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
      DO j=x_min-2,x_max+2
        pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
        post_vol(j,k)=volume(j,k)
      ENDDO
    ENDDO
  ENDIF

SLIDE 4

Heitlager et al.: A Practical Model for Measuring Maintainability

SLIDE 5

!$ACC DATA &
!$ACC COPY(chunk%tiles(1)%field%density0) &
!$ACC COPY(chunk%tiles(1)%field%density1) &
!$ACC COPY(chunk%tiles(1)%field%energy0) &
!$ACC COPY(chunk%tiles(1)%field%energy1) &
!$ACC COPY(chunk%tiles(1)%field%pressure) &
!$ACC COPY(chunk%tiles(1)%field%soundspeed) &
!$ACC COPY(chunk%tiles(1)%field%viscosity) &
!$ACC COPY(chunk%tiles(1)%field%xvel0) &
!$ACC COPY(chunk%tiles(1)%field%yvel0) &
!$ACC COPY(chunk%tiles(1)%field%xvel1) &
!$ACC COPY(chunk%tiles(1)%field%yvel1) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%volume) &
!$ACC COPY(chunk%tiles(1)%field%work_array1) &
!$ACC COPY(chunk%tiles(1)%field%work_array2) &
!$ACC COPY(chunk%tiles(1)%field%work_array3) &
!$ACC COPY(chunk%tiles(1)%field%work_array4) &
!$ACC COPY(chunk%tiles(1)%field%work_array5) &
!$ACC COPY(chunk%tiles(1)%field%work_array6) &
!$ACC COPY(chunk%tiles(1)%field%work_array7) &
!$ACC COPY(chunk%tiles(1)%field%cellx) &
!$ACC COPY(chunk%tiles(1)%field%celly) &
!$ACC COPY(chunk%tiles(1)%field%celldx) &
!$ACC COPY(chunk%tiles(1)%field%celldy) &
!$ACC COPY(chunk%tiles(1)%field%vertexx) &
!$ACC COPY(chunk%tiles(1)%field%vertexdx) &
!$ACC COPY(chunk%tiles(1)%field%vertexy) &
!$ACC COPY(chunk%tiles(1)%field%vertexdy) &
!$ACC COPY(chunk%tiles(1)%field%xarea) &
!$ACC COPY(chunk%tiles(1)%field%yarea) &
!$ACC COPY(chunk%left_snd_buffer) &
!$ACC COPY(chunk%left_rcv_buffer) &
!$ACC COPY(chunk%right_snd_buffer) &
!$ACC COPY(chunk%right_rcv_buffer) &
!$ACC COPY(chunk%bottom_snd_buffer) &
!$ACC COPY(chunk%bottom_rcv_buffer) &
!$ACC COPY(chunk%top_snd_buffer) &
!$ACC COPY(chunk%top_rcv_buffer)

Sloccount: 6,440 lines of *.f90, of which 833 (13%) are !$ACC directives

SLIDE 6

SLIDE 7

do i = 0, N
  do j = 0, i
    y(i,j) = ( y(i,j) + y(i,j+1) )/2
  end do
end do

SLIDE 8

Some results: Polybench 3.2

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Speedup over icc -O3: arithmetic mean ~30x, geometric mean ~6x.

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
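To see why the two means differ so much, consider hypothetical speedups of 1x, 2x, and 90x (illustrative numbers, not from the benchmark): the arithmetic mean is (1 + 2 + 90) / 3 = 31x, while the geometric mean is (1 · 2 · 90)^(1/3) ≈ 5.6x. A few very large speedups dominate the arithmetic mean but barely move the geometric mean.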

SLIDE 9

Compiles all of SPEC CPU 2006 – Example: LBM

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

[Bar chart: runtime (m:s, 0:00 to 8:24) of icc, icc -openmp, clang, and Polly ACC on the Mobile and Workstation systems; annotated ~20% and ~4x]

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop); the Mobile system is essentially my 4-core x86 laptop with the (free) GPU that’s in there

SLIDE 10

SLIDE 11

GPU latency hiding vs. MPI

[Diagram: device compute cores running many active threads; load/store instruction latency is hidden by switching between threads]

CUDA

  • over-subscribe hardware
  • use spare parallel slack for latency hiding

MPI

  • host controlled
  • full device synchronization

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
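For contrast, a minimal sketch (not from the slides) of the host-controlled MPI-CUDA pattern the bullets above describe; stencil_kernel, the device fields d_in/d_out, the host halo buffers h_send/h_recv, the neighbor ranks left/right, and the halo sizes are all assumed names, and the offsets into the halo regions are elided. Every iteration pays a full device synchronization before the host can communicate:

// sketch of a traditional MPI-CUDA time loop (hypothetical names, see lead-in)
for (int step = 0; step < steps; ++step) {
    stencil_kernel<<<blocks, threads>>>(d_out, d_in, n);             // compute one step on the GPU
    cudaDeviceSynchronize();                                         // host blocks until the whole device is idle
    cudaMemcpy(h_send, d_out, halo_bytes, cudaMemcpyDeviceToHost);   // stage boundary rows on the host
    MPI_Sendrecv(h_send, halo_n, MPI_DOUBLE, left,  0,               // host-driven halo exchange
                 h_recv, halo_n, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_out, h_recv, halo_bytes, cudaMemcpyHostToDevice);   // write received halo back (offsets elided)
    double *tmp = d_in; d_in = d_out; d_out = tmp;                   // updated field becomes next step's input
}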
SLIDE 12

Hardware latency hiding at the cluster level?

dCUDA (distributed CUDA)

  • unified programming model for GPU clusters
  • avoid unnecessary device synchronization to enable system-wide latency hiding

[Diagram: as on the previous slide, but loads, stores, and remote put operations interleave across active threads, so communication latency is hidden by the same oversubscription mechanism]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 13

dCUDA: MPI-3 RMA extensions

for (int i = 0; i < steps; ++i) {
  // computation
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
               in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  // communication
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
                     len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
                     0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout,
                           DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}

  • iterative stencil kernel
  • thread-specific idx
  • map ranks to blocks
  • device-side put/get operations
  • notifications for synchronization
  • shared and distributed memory
  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 14

Hardware supported communication overlap

[Diagram: execution timelines of active blocks 1-8 on device compute cores, comparing traditional MPI-CUDA with dCUDA]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 15

The dCUDA runtime system

[Diagram: host-side components (MPI context, event handler, block manager, logging/command/ack/notification queues) and device-side components (device library, blocks)]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
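A generic, hypothetical sketch (not the actual dCUDA implementation) of what such a host-side event handler can look like: device blocks enqueue communication requests into a host-mapped queue, and the host forwards them as MPI-3 RMA operations and acknowledges completion. The Command layout and the helpers pop_command, push_ack, and shutdown_requested are assumptions; a passive-target epoch (e.g. MPI_Win_lock_all) is assumed to be open on win.

/* hypothetical host-side proxy loop forwarding device requests to MPI-3 RMA */
struct Command { void *src; int bytes; int target; MPI_Aint disp; int tag; };

void event_handler(MPI_Win win)
{
    struct Command cmd;
    while (!shutdown_requested()) {
        if (!pop_command(&cmd))                      /* next device request, if any */
            continue;
        MPI_Put(cmd.src, cmd.bytes, MPI_BYTE, cmd.target,
                cmd.disp, cmd.bytes, MPI_BYTE, win); /* one-sided put into the target window */
        MPI_Win_flush(cmd.target, win);              /* wait for remote completion */
        push_ack(cmd.tag);                           /* notify the waiting device block */
    }
}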
SLIDE 16

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

(Very) simple stencil benchmark

[Plot: execution time [ms] vs. number of copy iterations per exchange; series: compute & exchange, compute only, halo exchange; annotation: no overlap]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

SLIDE 17

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Real stencil (COSMO weather/climate code)

[Plot: execution time [ms] vs. number of nodes (2-8); series: dCUDA, MPI-CUDA, halo exchange]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 18

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Particle simulation code (Barnes Hut)

[Plot: execution time [ms] vs. number of nodes (2-8); series: dCUDA, MPI-CUDA, halo exchange]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 19

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Sparse matrix-vector multiplication

[Plot: execution time [ms] vs. number of nodes (1, 4, 9); series: dCUDA, MPI-CUDA, communication]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 20

for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
               in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
                     len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
                     0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout,
                           DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}


http://spcl.inf.ethz.ch/Polly-ACC

Automatic “Regression Free” High Performance

dCUDA – distributed memory

Automatic Overlap High Performance

SLIDE 21

LLVM Nightly Test Suite

[Chart (log scale, 1 to 10,000): number of SCoPs detected, by dimensionality (0-dim through 3-dim), with and without heuristics]

SLIDE 22


Cactus ADM (SPEC 2006)

[Chart: results on the Workstation and Mobile systems]

SLIDE 23


Evading various “ends” – the hardware view