
GPU Programming 101

GridKa School 2017: make science && run

Andreas Herten, Forschungszentrum Jülich, 31 August 2017

About, Outline

Andreas Herten: Physics in
— Aachen (Dipl. at CMS)
— Jülich/Bochum (Dr. at PANDA)

[Figure: line Hough transform around an isochrone (5° step size); x and y axes in cm]

Since then: NVIDIA Application Lab
Optimizing scientific applications for/on GPUs at Jülich Supercomputing Centre

Outline:
— Motivation
— Platform: Hardware Features, High Throughput, Summary
— Programming GPUs: Libraries, Directives, Languages, Abstraction Libraries/DSL, Tools
— Conclusions


Status Quo

GPUs all around

— 1999: General computations with shaders of graphics hardware
— 2001: NVIDIA GeForce 3 with programmable shaders [1]
— 2003: DirectX 9 at ATI
— 2007: CUDA
— 2017: Top 500: 15 % with GPUs; Green 500: 9 of the top 10 with GPUs

Status Quo: Peak performance, single precision

[Plot: theoretical peak performance, single precision (GFLOP/sec, log scale), end of year 2008-2016, for Intel Xeon CPUs, NVIDIA GeForce GPUs, AMD Radeon GPUs, and Intel Xeon Phis. Graphic: Rupp [2]]

Status Quo: Performance per watt

[Plot: theoretical peak floating point operations per watt, single precision (GFLOP/sec per watt, log scale), end of year 2008-2016, for the same device families. Graphic: Rupp [2]]

Status Quo: Peak performance, double precision

[Plot: theoretical peak performance, double precision (GFLOP/sec, log scale), end of year 2008-2016, for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis. Graphic: Rupp [2]]

Status Quo

JURECA: Top 500 #70

But why?! Let's find out!

Platform

CPU vs. GPU

A matter of specialties: transporting one vs. transporting many.

[Graphics: a motorcycle (transporting one) next to a coach (transporting many); Lee [3] and Shearings Holidays [4]]

[Diagram: CPU chip with few ALUs but large control logic and cache, plus its DRAM, next to a GPU chip with many ALUs and small control/cache, plus its own DRAM]

GPU Architecture: Overview

Aim: hide latency. Everything else follows.

— SIMT
— Asynchronicity
— Memory

→ High throughput

Memory

GPU memory ain't no CPU memory

[Diagram: host (CPU with control, cache, ALUs, and DRAM) and device (GPU with its DRAM), connected via PCIe (<16 GB/s) or NVLink (≈160 GB/s for P100, ≈300 GB/s for V100); device-local HBM2 (<720 GB/s on P100, <900 GB/s on V100)]

— GPU: accelerator / extension card → separate device from the CPU
— Separate memory, but UVA and UM
— Memory transfers need special consideration! Do as little as possible!
— Formerly: explicitly copy data to/from GPU
— Now: done automatically (performance…?)

Example values:
— P100: 16 GB RAM, 720 GB/s
— V100: 16 GB RAM, 900 GB/s
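Both transfer styles named above (explicit copies formerly, automatic migration now) can be put side by side in a minimal sketch; everything here (the kernel scale(), the size n) is invented for illustration and not from the slides:

#include <cuda_runtime.h>

__global__ void scale(int n, float a, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;

    // Explicit style: separate host and device buffers, manual copies over the bus
    float *h_x = new float[n];                 // fill h_x ...
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<blocks, 256>>>(n, 2.0f, d_x);
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    delete[] h_x;

    // Unified Memory style: one pointer, the runtime migrates pages on demand
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // fill x ...
    scale<<<blocks, 256>>>(n, 2.0f, x);
    cudaDeviceSynchronize();                   // kernels run asynchronously; wait before host access
    cudaFree(x);
    return 0;
}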


Asynchronicity

Following different streams

— Problem: memory transfer is comparably slow
— Solution: do something else in the meantime (computation)! → Overlap tasks
— Copy and compute engines run separately (streams)

[Diagram: copy and compute stages alternating and overlapping across streams]

— GPU needs to be fed: schedule many computations
— CPU can do other work while the GPU computes; synchronization needed
— Also: fast switching of contexts to keep the GPU busy
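A minimal sketch of the overlap idea, assuming pinned host memory and an arbitrary kernel work() (both invented here, not from the slides): split the data in two, give each half its own stream, and let the copies of one half overlap with compute on the other.

#include <cuda_runtime.h>

__global__ void work(float *x, int n) { /* ... some computation ... */ }

void pipelined(float *h_x, float *d_x, int n) {  // h_x: pinned (cudaMallocHost) for true overlap
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);
    const int half = n / 2;                      // assume n is even, for brevity
    for (int i = 0; i < 2; i++) {
        float *h = h_x + i * half;
        float *d = d_x + i * half;
        cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        work<<<(half + 255) / 256, 256, 0, s[i]>>>(d, half);
        cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                     // the CPU could do other work before this point
    for (int i = 0; i < 2; i++) cudaStreamDestroy(s[i]);
}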


SIMT

Of threads and warps

CPU:
— Single Instruction, Multiple Data (SIMD)
— Simultaneous Multithreading (SMT)

GPU: Single Instruction, Multiple Threads (SIMT)
— CPU core ≊ GPU multiprocessor (SM)
— Working unit: set of threads (32, a warp)
— Fast switching of threads (large register file)
— Branching if

[Diagram: scalar (one A + B = C per instruction), vector (A0..A3 + B0..B3 = C0..C3 in one instruction), SMT (two threads sharing a core's vector units), SIMT (vector lanes × threads)]

Tesla V100 multiprocessor: among its features are Tensor Cores, 120 TFLOP/s for deep learning. Graphics: Nvidia Corporation [5]
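To illustrate the branching point in the list above with an invented kernel: all 32 threads of a warp share one instruction stream, so if they disagree on a condition the warp runs both paths one after the other (divergence), costing throughput:

__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 32 < 16)    // half of each warp takes this path ...
        x[i] = 2.0f * x[i];
    else                          // ... the other half this one: both execute serially
        x[i] = x[i] + 1.0f;
}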

Low Latency vs. High Throughput

Maybe GPU's ultimate feature

— CPU: minimizes latency within each thread
— GPU: hides latency with computations from other thread warps

[Diagram: CPU core (low latency) processing threads T1-T4 vs. GPU streaming multiprocessor (high throughput) cycling warps W1-W4 through waiting, ready, context switch, and processing states]

CPU vs. GPU

Let's summarize this!

CPU: optimized for low latency
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses costly
− Low performance per watt

GPU: optimized for high throughput
+ High-bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card

Programming GPUs

Preface: CPU

A simple CPU program as reference!

SAXPY: y = a·x + y, with single precision. Part of BLAS, level 1.

void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];

// fill x, y

saxpy(n, a, x, y);

Libraries

The truth is out there!

Programming GPUs is easy: Just don't!
Use applications & libraries!

cuBLAS · cuSPARSE · cuFFT · cuRAND · CUDA Math — plus many third-party libraries (the logo wall includes, e.g., Theano)

Wizard: Breazell [6]

cuBLAS

Parallel algebra

— GPU-parallel BLAS (all 152 routines)
— Single, double, complex data types
— Constant competition with Intel's MKL
— Multi-GPU support

→ https://developer.nvidia.com/cublas
→ http://docs.nvidia.com/cuda/cublas

cuBLAS: Code example

float a = 42.f;
int n = 10;
float x[n], y[n];

// fill x, y

cublasHandle_t handle;
cublasCreate(&handle);                             // Initialize
float *d_x, *d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));         // Allocate GPU memory
cudaMallocManaged(&d_y, n * sizeof(y[0]));
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);    // Copy data to GPU
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);        // Call BLAS routine
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);    // Copy result to host
cudaFree(d_x);
cudaFree(d_y);
cublasDestroy(handle);                             // Finalize
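For reference, a typical build line for this example (the file name saxpy_cublas.cu is an assumption, not from the slides); linking against cuBLAS is the essential part:

nvcc saxpy_cublas.cu -lcublas -o saxpy_cublas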


Thrust

Iterators! Iterators everywhere!

Thrust : CUDA ≈ STL : C++

— Template library, based on iterators
— Data-parallel primitives (scan(), sort(), reduce(), …)
— Fully compatible with plain CUDA C (comes with the CUDA Toolkit)
— Great with [](){} lambdas!

→ http://thrust.github.io/
→ http://docs.nvidia.com/cuda/thrust/

Thrust: Code example

int a = 42;
int n = 10;
thrust::host_vector<float> x(n), y(n);

// fill x, y

thrust::device_vector<float> d_x = x, d_y = y;
using namespace thrust::placeholders;
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2);
y = d_y;   // copy the result back to the host

Thrust: Code example with lambdas

#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

constexpr int gGpuThreshold = 10000;

void saxpy(float *x, float *y, float a, int N) {
    auto r = thrust::counting_iterator<int>(0);
    auto lambda = [=] __host__ __device__ (int i) { y[i] = a * x[i] + y[i]; };
    if (N > gGpuThreshold)
        thrust::for_each(thrust::device, r, r + N, lambda);
    else
        thrust::for_each(thrust::host, r, r + N, lambda);
}

(Source)

Programming GPUs

Directives

GPU Programming with Directives

Keepin' you portable

Annotate usual source code with directives:

#pragma acc loop
for (int i = 0; i < 1; i++) {};

Also: generalized API functions:

acc_copy();

The compiler interprets the directives and creates the according instructions.

Pro:
— Portability: other compiler? No problem! To it, it's a serial program. Different target architectures from the same code
— Easy to program

Con:
— Compiler support limited
— Raw power hidden
— Somewhat harder to debug

GPU Programming with Directives

The power of… two.

OpenMP: Standard for multithreaded programming on the CPU; GPU support since 4.0, better since 4.5.

#pragma omp target map(tofrom:y), map(to:x)
#pragma omp teams num_teams(10) num_threads(10)
#pragma omp distribute
for ( ) {
    #pragma omp parallel for
    for ( ) {
        // …
    }
}

OpenACC: Similar to OpenMP, but more specifically for GPUs. Might eventually be re-merged into the OpenMP standard. (A complete SAXPY in the OpenMP style is sketched below.)
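A minimal sketch of that SAXPY, assuming a compiler with OpenMP 4.5 offloading support; the combined construct below is one common spelling and is not taken from the slide:

void saxpy_omp(int n, float a, float *x, float *y) {
    // Copy x to the device, y to and back from it; spread iterations over teams and threads
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}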

OpenACC: Code example

void saxpy_acc(int n, float a, float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

…or with an explicit parallel loop and data clauses:

void saxpy_acc(int n, float a, float *x, float *y) {
    #pragma acc parallel loop copy(y) copyin(x)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];

// fill x, y

saxpy_acc(n, a, x, y);

GPU tutorial this afternoon!

Programming GPUs

Languages

Programming GPUs Directly

Finally… Two solutions:

OpenCL — Open Computing Language by the Khronos Group (Apple, IBM, NVIDIA, …), 2009
— Platform: programming language (OpenCL C/C++), API, and compiler
— Targets CPUs, GPUs, FPGAs, and other many-core machines
— Fully open source
— Different compilers available

CUDA — NVIDIA's GPU platform, 2007
— Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, …
— Only NVIDIA GPUs
— Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step
— Also: CUDA Fortran

Choose whichever flavor you like and whatever your colleagues/collaboration are using. The hardest part: coming up with a parallelized algorithm.

CUDA Threading Model

Warp the kernel, it's a thread!

Methods to exploit parallelism:
— Threads → block
— Blocks → grid
— Threads & blocks in 3D

Parallel function: kernel
— __global__ void kernel(int a, float *b) { }
— Access its own ID through global variables threadIdx.x, blockIdx.y, …

Execution entity: threads
— Lightweight → fast switching!
— 1000s of threads execute simultaneously → order non-deterministic!

⇒ SAXPY!
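Before the SAXPY example on the next slide, a small sketch (invented here, kernel name iota included) of the usual launch-configuration arithmetic: round the block count up so that n elements are covered, and guard against the overhang inside the kernel:

__global__ void iota(int n, float *b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID from block and thread indices
    if (i < n) b[i] = i;                             // guard: the last block may overshoot n
}

void launch(int n, float *b) {
    const int threads = 256;                         // threads per block
    const int blocks = (n + threads - 1) / threads;  // integer round-up
    iota<<<blocks, threads>>>(n, b);
}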

CUDA SAXPY

With runtime-managed data transfers

__global__ void saxpy_cuda(int n, float a, float *x, float *y) {   // Specify kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;                 // ID variables
    if (i < n)                                                     // Guard against too many threads
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float *x, *y;
cudaMallocManaged(&x, n * sizeof(float));    // Allocate GPU-capable memory
cudaMallocManaged(&y, n * sizeof(float));

// fill x, y

saxpy_cuda<<<2, 5>>>(n, a, x, y);            // Call kernel: 2 blocks, each 5 threads
cudaDeviceSynchronize();                     // Wait for kernel to finish
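For reference, such a program (file name saxpy.cu is an assumption) is built with nvcc and run like any other executable:

nvcc saxpy.cu -o saxpy
./saxpy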

Programming GPUs

Abstraction Libraries/DSL

Abstraction Libraries & DSLs

— Libraries with ready-programmed abstractions; partly a compiler/transpiler is necessary
— Have different backends to choose from for the targeted accelerator
— Sit between Thrust, OpenACC, and CUDA
— Examples: Kokkos, Alpaka, Futhark, HIP, C++AMP, …

An Alternative: Kokkos

From Sandia National Laboratories

C++ library for performance portability: data-parallel patterns, architecture-aware memory layouts, …
→ https://github.com/kokkos/kokkos/

Kokkos::View<double*> x("X", length);
Kokkos::View<double*> y("Y", length);
double a = 2.0;

// Fill x, y

Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
    x(i) = a * x(i) + y(i);
});

Programming GPUs

Tools

GPU Tools

The helpful helpers helping the helpless (and others)

NVIDIA:
— cuda-gdb: GDB-like command line utility for debugging
— cuda-memcheck: like Valgrind's memcheck, for checking errors in memory accesses
— Nsight: IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
— nvprof: command line profiler, including detailed performance counters
— Visual Profiler: timeline profiling and annotated performance experiments

OpenCL: CodeXL (open source, GPUOpen/AMD) for debugging and profiling
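Typical invocations of the first two tools follow the same pattern as the nvprof examples on the next slides (illustrative command lines, assuming an executable ./app as above):

cuda-memcheck ./app
cuda-gdb ./app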

nvprof

Command that line

Usage: nvprof ./app

$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 99.19%  262.43ms    301  871.86us  863.88us  882.44us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.58%  1.5428ms      2  771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
  0.23%  599.40us      1  599.40us  599.40us  599.40us  [CUDA memcpy DtoH]
==37064== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 61.26%  258.38ms      1  258.38ms  258.38ms  258.38ms  cudaEventSynchronize
 35.68%  150.49ms      3  50.164ms  914.97us  148.65ms  cudaMalloc
  0.73%  3.0774ms      3  1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
  0.62%  2.6287ms      4  657.17us  655.12us  660.56us  cuDeviceTotalMem
  0.56%  2.3408ms    301  7.7760us  7.3810us  53.103us  cudaLaunch
  0.48%  2.0111ms    364  5.5250us     235ns  201.63us  cuDeviceGetAttribute
  0.21%  872.52us      1  872.52us  872.52us  872.52us  cudaDeviceSynchronize
  0.15%  612.20us   1505     406ns     361ns  1.1970us  cudaSetupArgument
  0.12%  499.01us      3  166.34us  140.45us  216.16us  cudaFree
  0.11%  477.69us      1  477.69us  477.69us  477.69us  cudaGetDeviceProperties
  0.04%  179.27us      4  44.817us  40.744us  53.504us  cuDeviceGetName
  0.03%  136.20us    301     452ns     401ns  2.4000us  cudaConfigureCall
  0.00%  9.0850us      2  4.5420us  3.4760us  5.6090us  cudaEventRecord
  0.00%  8.7210us      1  8.7210us  8.7210us  8.7210us  cudaGetDevice

nvprof

With metrics: nvprof --metrics flop_sp_efficiency ./app

$ nvprof --metrics flop_sp_efficiency ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
[Matrix Multiply Using CUDA] - Starting...
==37122== NVPROF is profiling process 37122, command: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
GPU Device 0: "Tesla P100-SXM2-16GB" with compute capability 6.0
MatrixA(1024,1024), MatrixB(1024,1024)
Computing result using CUDA Kernel...
==37122== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==37122== Replaying kernel "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)" (done)
Performance= 26.61 GFlop/s, Time= 80.697 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==37122== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37122== Profiling result:
==37122== Metric result:
Device "Tesla P100-SXM2-16GB (0)"
Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
Invocations  Metric Name         Metric Description            Min     Max     Avg
        301  flop_sp_efficiency  FLOP Efficiency(Peak Single)  22.96%  23.40%  23.15%

Visual Profiler

Your new favorite tool

Conclusions

Summary of Acceleration Possibilities

— Application libraries: drop-in acceleration
— OpenACC directives: easy acceleration
— Programming languages: flexible acceleration

(OpenACC: this afternoon!)

The Performance Cookbook

For fashionable modern programmers

1. Identify available parallelism
2. Parallelize functions
3. Optimize data locality
4. Optimize function performance

Omitted

There's so much more!

What I did not talk about:
— Atomic operations
— Shared memory
— Pinned memory
— Managed memory
— Debugging
— Overlapping streams
— Multi-GPU programming (intra-node; MPI)
— Cooperative threads
— Half precision (FP16)
— …

Summary & Conclusion

GPUs can improve your performance many-fold, given a fitting, parallelizable application.

— Libraries are easiest
— Direct programming (plain CUDA) is most powerful
— OpenACC is somewhere in between (and portable)
— There are many tools helping the programmer
→ See it in action this afternoon at the OpenACC tutorial

Thank you for your attention!

a.herten@fz-juelich.de

Appendix

— Further Reading & Links
— GPU Performances
— Glossary
— References

Further Reading & Links

More!

— A discussion of SIMD, SIMT, SMT by Y. Kreinin
— NVIDIA's documentation: docs.nvidia.com
— NVIDIA's Parallel For All blog

Volta Performance

Table 1: Comparison of NVIDIA Tesla GPUs

Tesla Product             Tesla K40        Tesla M40         Tesla P100        Tesla V100
GPU                       GK180 (Kepler)   GM200 (Maxwell)   GP100 (Pascal)    GV100 (Volta)
SMs                       15               24                56                80
TPCs                      15               24                28                40
FP32 Cores / SM           192              128               64                64
FP32 Cores / GPU          2880             3072              3584              5120
FP64 Cores / SM           64               4                 32                32
FP64 Cores / GPU          960              96                1792              2560
Tensor Cores / SM         NA               NA                NA                8
Tensor Cores / GPU        NA               NA                NA                640
GPU Boost Clock           810/875 MHz      1114 MHz          1480 MHz          1462 MHz
Peak FP32 TFLOPS¹         5                6.8               10.6              15
Peak FP64 TFLOPS¹         1.7              0.21              5.3               7.5
Peak Tensor TFLOPS¹       NA               NA                NA                120
Texture Units             240              192               224               320
Memory Interface          384-bit GDDR5    384-bit GDDR5     4096-bit HBM2     4096-bit HBM2
Memory Size               up to 12 GB      up to 24 GB       16 GB             16 GB
L2 Cache Size             1536 KB          3072 KB           4096 KB           6144 KB
Shared Memory Size / SM   16/32/48 KB      96 KB             64 KB             configurable up to 96 KB
Register File Size / SM   256 KB           256 KB            256 KB            256 KB
Register File Size / GPU  3840 KB          6144 KB           14336 KB          20480 KB
TDP                       235 W            250 W             300 W             300 W
Transistors               7.1 billion      8 billion         15.3 billion      21.1 billion
GPU Die Size              551 mm²          601 mm²           610 mm²           815 mm²
Manufacturing Process     28 nm            28 nm             16 nm FinFET+     12 nm FFN

¹ Peak TFLOPS rates are based on GPU Boost Clock.

Figure: Tesla V100 performance characteristics in comparison [5]


Glossary I

API — A programmatic interface to software through well-defined functions. Short for application programming interface.

ATI — Canada-based GPU manufacturing company; bought by AMD in 2006.

CPI — Cycles per instruction; a metric to determine the efficiency of an architecture or program.

CUDA — Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.

Glossary II

DSL — A domain-specific language: a specialization of a more general language to a specific domain.

GCC — The GNU Compiler Collection: open source compilers, among others for C and Fortran.

IPC — Instructions per cycle; a metric to determine the efficiency of an architecture or program.

LLVM — An open source compiler infrastructure, providing, among others, Clang for C.

MPI — The Message Passing Interface, an API definition for multi-node computing.

Glossary III

NVIDIA — US technology company creating GPUs.

NVLink — NVIDIA's communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with 80 GB/s (PCI-Express: 16 GB/s).

OpenACC — Directive-based programming, primarily for many-core machines.

OpenCL — The Open Computing Language: a framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.

OpenGL — The Open Graphics Library, an API for rendering graphics across different hardware architectures.

Glossary IV

OpenMP — Directive-based programming, primarily for multi-threaded machines.

P100 — A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory.

PAPI — The Performance API, a C/C++ API for querying performance counters.

Pascal — GPU architecture from NVIDIA (announced 2016).

perf — Part of the Linux kernel which facilitates access to performance counters; comes with command line utilities.

PGI — Compiler creators. Formerly The Portland Group, Inc.; since 2013 part of NVIDIA.

Glossary V

POWER8 — CPU architecture from IBM, available also under the OpenPOWER Foundation.

SAXPY — Single-precision a·x + y: a simple code example of scaling a vector and adding an offset.

Score-P — Collection of tools for instrumenting and subsequently scoring applications to gain insight into a program's performance.

Tesla — NVIDIA's GPU product line for general-purpose computing.

Thrust — A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.

References

[1] Chris McClanahan. "History and Evolution of GPU Architecture". In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf

[2] Karl Rupp. Pictures: CPU/GPU performance comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

[3] Mark Lee. Picture: Kawasaki Ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/

[4] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/

[5] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf

[6] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/