Progress in automatic GPU compilation and why you want to run MPI on your GPU


SLIDE 1

TORSTEN HOEFLER

Progress in automatic GPU compilation and why you want to run MPI on your GPU

with Tobias Grosser and Tobias Gysi @ SPCL presented at CCDSC, Lyon, France, 2016

SLIDE 2

#pragma ivdep
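A minimal, hypothetical C example (not from the slides) of the annotation burden this directive represents: the programmer asserts that the loop carries no dependences so that the compiler may vectorize it.

/* hypothetical example: 'ivdep' promises the compiler there are no
   loop-carried dependences, so it may vectorize */
void scale(float *dst, const float *src, int n)
{
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];   /* every iteration writes a distinct element */
}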

SLIDE 3

!$ACC DATA &
!$ACC PRESENT(density1,energy1) &
!$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
!$ACC PRESENT(pre_vol,post_vol,ener_flux)
!$ACC KERNELS
IF(dir.EQ.g_xdir) THEN
  IF(sweep_number.EQ.1) THEN
!$ACC LOOP INDEPENDENT
    DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
      DO j=x_min-2,x_max+2
        pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
        post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
      ENDDO
    ENDDO
  ELSE
!$ACC LOOP INDEPENDENT
    DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
      DO j=x_min-2,x_max+2
        pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
        post_vol(j,k)=volume(j,k)
      ENDDO
    ENDDO
  ENDIF

SLIDE 4

Heitlager et al.: A Practical Model for Measuring Maintainability

SLIDE 5

!$ACC DATA &
!$ACC COPY(chunk%tiles(1)%field%density0) &
!$ACC COPY(chunk%tiles(1)%field%density1) &
!$ACC COPY(chunk%tiles(1)%field%energy0) &
!$ACC COPY(chunk%tiles(1)%field%energy1) &
!$ACC COPY(chunk%tiles(1)%field%pressure) &
!$ACC COPY(chunk%tiles(1)%field%soundspeed) &
!$ACC COPY(chunk%tiles(1)%field%viscosity) &
!$ACC COPY(chunk%tiles(1)%field%xvel0) &
!$ACC COPY(chunk%tiles(1)%field%yvel0) &
!$ACC COPY(chunk%tiles(1)%field%xvel1) &
!$ACC COPY(chunk%tiles(1)%field%yvel1) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%volume) &
!$ACC COPY(chunk%tiles(1)%field%work_array1) &
!$ACC COPY(chunk%tiles(1)%field%work_array2) &
!$ACC COPY(chunk%tiles(1)%field%work_array3) &
!$ACC COPY(chunk%tiles(1)%field%work_array4) &
!$ACC COPY(chunk%tiles(1)%field%work_array5) &
!$ACC COPY(chunk%tiles(1)%field%work_array6) &
!$ACC COPY(chunk%tiles(1)%field%work_array7) &
!$ACC COPY(chunk%tiles(1)%field%cellx) &
!$ACC COPY(chunk%tiles(1)%field%celly) &
!$ACC COPY(chunk%tiles(1)%field%celldx) &
!$ACC COPY(chunk%tiles(1)%field%celldy) &
!$ACC COPY(chunk%tiles(1)%field%vertexx) &
!$ACC COPY(chunk%tiles(1)%field%vertexdx) &
!$ACC COPY(chunk%tiles(1)%field%vertexy) &
!$ACC COPY(chunk%tiles(1)%field%vertexdy) &
!$ACC COPY(chunk%tiles(1)%field%xarea) &
!$ACC COPY(chunk%tiles(1)%field%yarea) &
!$ACC COPY(chunk%left_snd_buffer) &
!$ACC COPY(chunk%left_rcv_buffer) &
!$ACC COPY(chunk%right_snd_buffer) &
!$ACC COPY(chunk%right_rcv_buffer) &
!$ACC COPY(chunk%bottom_snd_buffer) &
!$ACC COPY(chunk%bottom_rcv_buffer) &
!$ACC COPY(chunk%top_snd_buffer) &
!$ACC COPY(chunk%top_rcv_buffer)

Sloccount: 6,440 lines of *.f90, of which 833 (13%) are !$ACC directives

SLIDE 6

SLIDE 7

do i = 0, N
  do j = 0, i
    y(i,j) = ( y(i,j) + y(i,j+1) )/2
  end do
end do

SLIDE 8

Some results: Polybench 3.2

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Speedup over icc -O3: arithmetic mean ~30x, geometric mean ~6x.

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
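To see why the two means differ so much, consider hypothetical speedups of 1x, 2x, and 90x (illustrative numbers, not from the benchmark): the arithmetic mean is (1 + 2 + 90) / 3 = 31x, while the geometric mean is (1 · 2 · 90)^(1/3) ≈ 5.6x. A few very large speedups dominate the arithmetic mean but barely move the geometric mean.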

SLIDE 9

Compiles all of SPEC CPU 2006 – Example: LBM

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

[Bar chart: runtime (m:s, 0:00 to 8:24) of icc, icc -openmp, clang, and Polly ACC on the Mobile and Workstation systems; annotated ~20% and ~4x]

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop); the Mobile system is essentially my 4-core x86 laptop with the (free) GPU that’s in there

SLIDE 10

SLIDE 11

GPU latency hiding vs. MPI

[Diagram: device compute cores running many active threads; load/store instruction latency is hidden by switching between threads]

CUDA

  • over-subscribe hardware
  • use spare parallel slack for latency hiding

MPI

  • host controlled
  • full device synchronization

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
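For contrast, a minimal sketch (not from the slides) of the host-controlled MPI-CUDA pattern the bullets above describe; stencil_kernel, the device fields d_in/d_out, the host halo buffers h_send/h_recv, the neighbor ranks left/right, and the halo sizes are all assumed names, and the offsets into the halo regions are elided. Every iteration pays a full device synchronization before the host can communicate:

// sketch of a traditional MPI-CUDA time loop (hypothetical names, see lead-in)
for (int step = 0; step < steps; ++step) {
    stencil_kernel<<<blocks, threads>>>(d_out, d_in, n);             // compute one step on the GPU
    cudaDeviceSynchronize();                                         // host blocks until the whole device is idle
    cudaMemcpy(h_send, d_out, halo_bytes, cudaMemcpyDeviceToHost);   // stage boundary rows on the host
    MPI_Sendrecv(h_send, halo_n, MPI_DOUBLE, left,  0,               // host-driven halo exchange
                 h_recv, halo_n, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_out, h_recv, halo_bytes, cudaMemcpyHostToDevice);   // write received halo back (offsets elided)
    double *tmp = d_in; d_in = d_out; d_out = tmp;                   // updated field becomes next step's input
}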
SLIDE 12

Hardware latency hiding at the cluster level?

dCUDA (distributed CUDA)

  • unified programming model for GPU clusters
  • avoid unnecessary device synchronization to enable system-wide latency hiding

[Diagram: as on the previous slide, but loads, stores, and remote put operations interleave across active threads, so communication latency is hidden by the same oversubscription mechanism]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 13

dCUDA: MPI-3 RMA extensions

for (int i = 0; i < steps; ++i) {
  // computation
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
               in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  // communication
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
                     len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
                     0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout,
                           DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}

  • iterative stencil kernel
  • thread-specific idx
  • map ranks to blocks
  • device-side put/get operations
  • notifications for synchronization
  • shared and distributed memory
  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 14

Hardware supported communication overlap

[Diagram: execution timelines of active blocks 1-8 on device compute cores, comparing traditional MPI-CUDA with dCUDA]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 15

The dCUDA runtime system

[Diagram: host-side components (MPI context, event handler, block manager, logging/command/ack/notification queues) and device-side components (device library, blocks)]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
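A generic, hypothetical sketch (not the actual dCUDA implementation) of what such a host-side event handler can look like: device blocks enqueue communication requests into a host-mapped queue, and the host forwards them as MPI-3 RMA operations and acknowledges completion. The Command layout and the helpers pop_command, push_ack, and shutdown_requested are assumptions; a passive-target epoch (e.g. MPI_Win_lock_all) is assumed to be open on win.

/* hypothetical host-side proxy loop forwarding device requests to MPI-3 RMA */
struct Command { void *src; int bytes; int target; MPI_Aint disp; int tag; };

void event_handler(MPI_Win win)
{
    struct Command cmd;
    while (!shutdown_requested()) {
        if (!pop_command(&cmd))                      /* next device request, if any */
            continue;
        MPI_Put(cmd.src, cmd.bytes, MPI_BYTE, cmd.target,
                cmd.disp, cmd.bytes, MPI_BYTE, win); /* one-sided put into the target window */
        MPI_Win_flush(cmd.target, win);              /* wait for remote completion */
        push_ack(cmd.tag);                           /* notify the waiting device block */
    }
}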
SLIDE 16

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

(Very) simple stencil benchmark

[Plot: execution time [ms] vs. number of copy iterations per exchange; series: compute & exchange, compute only, halo exchange; annotation: no overlap]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

SLIDE 17

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Real stencil (COSMO weather/climate code)

[Plot: execution time [ms] vs. number of nodes (2-8); series: dCUDA, MPI-CUDA, halo exchange]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 18

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Particle simulation code (Barnes Hut)

[Plot: execution time [ms] vs. number of nodes (2-8); series: dCUDA, MPI-CUDA, halo exchange]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 19

  • Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Sparse matrix-vector multiplication

[Plot: execution time [ms] vs. number of nodes (1, 4, 9); series: dCUDA, MPI-CUDA, communication]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 20

for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
               in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
                     len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
                     0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout,
                           DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}


http://spcl.inf.ethz.ch/Polly-ACC

Automatic “Regression Free” High Performance

dCUDA – distributed memory

Automatic Overlap High Performance

SLIDE 21

LLVM Nightly Test Suite

[Chart (log scale, 1 to 10,000): number of SCoPs detected, by dimensionality (0-dim through 3-dim), with and without heuristics]

SLIDE 22


Cactus ADM (SPEC 2006)

[Chart: results on the Workstation and Mobile systems]

SLIDE 23


Evading various “ends” – the hardware view