


OpenACC Tutorial

GridKa School 2017: make science && run

Andreas Herten, Forschungszentrum Jülich, 31 August 2017

Outline

— The GPU Platform: Introduction, Threading Model, App Showcase
— Parallel Models: OpenACC History, OpenMP, Modus Operandi, OpenACC's Models
— OpenACC by Example: OpenACC Workflow, Identify Parallelism, Parallelize Loops (parallel, loops, pgprof, Directive: kernels)
— Data Transfers: GPU Memory Spaces, Portability, Clause: copy, Visual Profiler
— Data Locality: Analyse Flow, Directive: data
— Optimize: Levels of Parallelism, Clause: gang, Memory Coalescing, Pinned
— Interoperability: The Keyword
— Tasks: Task 1, Task 2, Task 3, Task 4
— Conclusions: List of Tasks

The GPU Platform

CPU vs. GPU

A matter of specialties

[Images: transporting one vs. transporting many; graphics: Lee [1] and Shearings Holidays [2]]

CPU vs. GPU

[Chip diagrams: ALUs, control, cache, and DRAM on the two chips]


Processing Flow

CPU → GPU → CPU

[Diagram: CPU and CPU memory, connected via interconnect to the GPU's scheduler, SMs, L2, and DRAM]

1 Transfer data from CPU memory to GPU memory, transfer program
2 Load GPU program, execute on SMs, get (cached) data from memory; write back
3 Transfer results back to host memory

Old: Manual data transfer invocations – UVA
New: Driver automatically transfers data – UM


CUDA Threading Model

Warp the kernel, it's a thread!

Methods to exploit parallelism:

— Threads → Block
— Blocks → Grid
— Threads & blocks in 3D

Threads: parallel execution units

— Lightweight → fast switching!
— 1000s of threads execute simultaneously

Parallel execution unit: kernel

Getting GPU-Acquainted

Preparations

Task 0⋆: Setup

Login to JURON:
ssh -i mykey train0XX@juron.fz-juelich.de

Directory of tasks:
cd $HOME/GPU/Tasks/Tasks/

Solutions are always given! You decide when to look. Directory of solutions: $HOME/GPU/Tasks/Solutions/

Load required modules:
module load pgi [cuda]

vim is available as editor (or copy files with scp or rsync)

Getting GPU-Acquainted

Some Applications

Task 0: Getting Started
— Change to GPU/Tasks/Task0/ directory
— Read Instructions.rst
— Applications: Dot Product, GEMM, Mandelbrot, N-Body

Getting GPU-Acquainted

Some Applications

[Benchmark plots, CPU vs. GPU: DDot (MFLOP/s over vector length), DGEMM (GFLOP/s over matrix size), Mandelbrot (MPixel/s over image width), N-Body (GFLOP/s over particle count, for 1/2/4 GPUs in single and double precision)]

Primer on Parallel Scaling

Amdahl's Law

Possible maximum speedup for N parallel processors:

Total time:   t = t_serial + t_parallel
N processors: t(N) = t_s + t_p/N
Speedup:      s(N) = t/t(N) = (t_s + t_p) / (t_s + t_p/N)
Efficiency:   ε = s/N

[Plot: speedup over number of processors (1 to 4096) for parallel portions of 50%, 75%, 90%, 95%, 99%]
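A quick worked example (not on the slide): with a 95 % parallel portion and the total time normalized to t = 1, we have t_s = 0.05 and t_p = 0.95, so s(1024) = 1/(0.05 + 0.95/1024) ≈ 19.6, and no processor count can push the speedup past 1/t_s = 20.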

Primer on Parallel Scaling II

Gustafson-Barsis's Law

"[…] speedup should be measured by scaling the problem to the number of processors, not fixing problem size."
– John Gustafson

[Plot: speedup over number of processors (up to 4096) for serial portions of 1%, 10%, 50%, 75%, 90%, 99%]
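In formula form (the standard statement of the law; the slide itself only shows the plot): with serial fraction f_s and the parallel part scaled to N processors, the scaled speedup is s(N) = f_s + (1 − f_s)·N = N − f_s·(N − 1). For f_s = 0.01 and N = 4096 this gives s ≈ 4055, matching the top curve of the plot.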

Parallelism

Parallel programming is not easy! Things to consider:
— Is my application computationally intensive enough?
— What are the levels of parallelism?
— How much data needs to be transferred?
— Is the gain worth the pain?

Possibilities

Different levels of closeness to GPU when GPU programming, which can ease the pain:
OpenACC, OpenMP, Thrust, PyCUDA, CUDA Fortran, CUDA, OpenCL

Summary of Acceleration Possibilities

Three ways to accelerate an application:
— Libraries: drop-in acceleration
— Directives (OpenACC): easy acceleration
— Programming languages: flexible acceleration

OpenACC History

2011 OpenACC 1.0 specification is released (NVIDIA, Cray, PGI, CAPS)
2013 OpenACC 2.0: more functionality, portability
2015 OpenACC 2.5: enhancements, clarifications
2016 OpenACC 2.6 proposed (deep copy, …)
→ https://www.openacc.org/ (also: best practice guide)

Open{MP↔ACC}

Everything's connected

— OpenACC modeled after OpenMP …
— … but specific for accelerators
— Might eventually be absorbed into OpenMP
— But OpenMP 4.0 now also has an offloading feature

Fork/join model: master thread launches parallel child threads; merged after execution

[Diagrams: fork/join execution in OpenMP and in OpenACC: master forks into parallel threads, which join back into the master]

Modus Operandi

Three-step program

1 Annotate code with directives, indicating parallelism
2 OpenACC-capable compiler generates accelerator-specific code
3 $uccess

1 Directives

pragmatic

Compiler directives state intent to compiler

C/C++:

#pragma acc kernels
for (int i = 0; i < 23; i++)
    // ...

Fortran:

!$acc kernels
do i = 1, 24
    ! ...
end do
!$acc end kernels

— Ignored by a compiler which does not understand OpenACC
— High-level programming model for accelerators; heterogeneous programs
— OpenACC: compiler directives, library routines, environment variables
— Portable across host systems and accelerator architectures

2 Compiler

Simple and abstracted

Compiler support:
— PGI: best performance, great support, free
— GCC: beta, limited coverage, OSS
— Cray: ???

Trust the compiler to generate the intended parallelism; check the status output!
No need to know the ins'n'outs of the accelerator; leave it to expert compiler engineers.
One code can target different accelerators: GPUs, or even multi-core CPUs → portability

3 $uccess

Iteration is key

— Serial to parallel: fast
— Serial to fast parallel: more time needed
— Start simple → refine ⇒ productivity
— Because of its generality: sometimes not the last bit of hardware performance is accessible
— But: use OpenACC together with other accelerator-targeting techniques (CUDA, libraries, …)

[Cycle: Expose Parallelism → Compile → Measure]

OpenACC Execution Model

— Main program executes on host
— Device code is transferred to accelerator
— Execution on accelerator is started
— Host waits until return (except: async)

[Diagram: host starts main program, transfers code to the device and waits; device runs the code and returns control to the host]
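A minimal sketch of the async exception mentioned above (array and function names assumed, not from the tutorial code):

#pragma acc parallel loop async
for (int i = 0; i < N; i++)
    y[i] = 2.0 * x[i];   // launched without the implicit barrier; host continues

do_host_work();          // hypothetical host-side work, overlaps with the GPU

#pragma acc wait         // host blocks here until the asynchronous region is done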

OpenACC Memory Model

[Diagram: host memory and device memory, linked by DMA transfers]

Usually: two separate memory spaces. Data needs to be transferred to the device for computation and transferred back for further evaluation.

— Transfers hidden from programmer (caution: latency, bandwidth, memory size)
— Memories are not coherent
— Compiler helps; GPU runtime helps

OpenACC Programming Model

A binary perspective

OpenACC interpretation needs to be activated with a compile flag:
PGI: pgcc -acc [-ta=tesla]
GCC: gcc -fopenacc

Additional flags possible to improve/modify compilation:
-ta=tesla:cc60       Use compute capability 6.0
-ta=tesla:lineinfo   Add source code correlation into binary
-ta=tesla:managed    Use unified memory
-fopenacc-dim=geom   Use geom configuration for threads
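Putting the flags together, a full compile line might look like this (a sketch; the source file name is hypothetical, the flags match the Makefile output shown on later slides):

pgcc -acc -ta=tesla:cc60,managed -Minfo=accel -fast -o saxpy saxpy.c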

OpenACC Programming Model

A source code perspective

— Compiler directives, ignored by incapable compilers
— Similar to OpenMP
— Support for GPU, multicore CPU, other accelerators (Intel Xeon Phi)

Syntax C/C++:
#pragma acc directive [clause [, clause] ...] newline

Syntax Fortran:
!$acc directive [clause [, clause] ...]
!$acc end directive

A Glimpse of OpenACC

#pragma acc data copy(x[0:N], y[0:N])
{
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        y[i] = i * x[i] + y[i];
    }
}

OpenACC by Example

OpenACC Workflow

1 Identify available parallelism
2 Parallelize loops with OpenACC
3 Optimize data locality
4 Optimize loop performance

Jacobi Solver

Algorithmic description

— Example for acceleration: Jacobi solver
— Iterative solver, converges to correct value
— Each iteration step: compute average of neighboring points
— Example: 2D Poisson equation: ∇²A(x, y) = B(x, y)

[Stencil diagram: data point A(i, j) with neighbors A(i−1, j), A(i+1, j), A(i, j−1), A(i, j+1); boundary points at the edges]

A_{k+1}(i, j) = −1/4 · (B(i, j) − (A_k(i−1, j) + A_k(i, j+1) + A_k(i+1, j) + A_k(i, j−1)))
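The source code on the following slides stores the 2D grid in flat 1D arrays of row length nx. A hypothetical helper macro (not part of the tutorial code) makes the correspondence to the formula explicit:

// A[IDX(iy, ix)] plays the role of A(i, j) in the formula above.
#define IDX(iy, ix) ((iy) * nx + (ix))

// The stencil update then reads:
Anew[IDX(iy, ix)] = -0.25 * (rhs[IDX(iy, ix)]
    - (A[IDX(iy, ix + 1)] + A[IDX(iy, ix - 1)]
     + A[IDX(iy - 1, ix)] + A[IDX(iy + 1, ix)]));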

Jacobi Solver

Source code

while ( error > tol && iter < iter_max ) {   // Iterate until converged
    error = 0.0;
    // Iterate across matrix elements
    for (int ix = ix_start; ix < ix_end; ix++) {
        for (int iy = iy_start; iy < iy_end; iy++) {
            // Calculate new value from neighbors
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            // Accumulate error
            error = fmaxr(error, fabsr(Anew[iy*nx+ix] - A[iy*nx+ix]));
    }}
    // Swap input/output
    for (int iy = iy_start; iy < iy_end; iy++) {
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[iy*nx+ix] = Anew[iy*nx+ix];
    }}
    // Set boundary conditions
    for (int ix = ix_start; ix < ix_end; ix++) {
        A[0*nx+ix]      = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy
    iter++;
}

OpenACC Workflow

1 Identify available parallelism
2 Parallelize loops with OpenACC
3 Optimize data locality
4 Optimize loop performance

Identify Parallelism

Generate Profile

Task 1: Analyze Application
Use pgprof to analyze the unaccelerated version of the Jacobi solver. Investigate!
— Change to Task1/ directory
— Compile: make task1 (usually, compile just with make; this exercise is special)
— Submit profiling run to the batch system: make task1_profile
— Study the bsub call and the pgprof call; try to understand them
— Where is the hotspot? Which parts should be accelerated?

Profile of Application

Info during compilation

$ pgcc -DUSE_DOUBLE -Minfo=all,intensity -fast -Minfo=ccff -Mprof=ccff poisson2d_reference.o poisson2d.c -o poisson2d
poisson2d.c:
main:
     68, Generated vector simd code for the loop
         FMA (fused multiply-add) instruction(s) generated
     98, FMA (fused multiply-add) instruction(s) generated
    105, Loop not vectorized: data dependency
    123, Loop not fused: different loop trip count
         Loop not vectorized: data dependency
         Loop unrolled 8 times

Automated optimization by the compiler, due to -fast: vectorization, FMA, unrolling

Profile of Application

Info during run

======== CPU profiling result (flat):
Time(%)  Time      Name
 77.52%  999.99ms  main (poisson2d.c:148 0x6d8)
  9.30%  120ms     main (0x704)
  7.75%  99.999ms  main (0x718)
  0.78%  9.9999ms  main (poisson2d.c:128 0x348)
  0.78%  9.9999ms  main (poisson2d.c:123 0x398)
  0.78%  9.9999ms  __xlmass_expd2 (0xffcc011c)
  0.78%  9.9999ms  __c_mcopy8 (0xffcc0054)
  0.78%  9.9999ms  __xlmass_expd2 (0xffcc0034)
======== Data collected at 100Hz frequency

78 % in main(). Since everything is in main: limited helpfulness. Let's look into main!

Code Independency Analysis

What is independent?

Same source code as above, with the loops classified:
— while loop: data dependency between iterations
— stencil loop nest (Anew from A): independent loop iterations
— swap loop nest (A from Anew): independent loop iterations
— boundary loops: independent loop iterations

OpenACC Workflow

1 Identify available parallelism
2 Parallelize loops with OpenACC
3 Optimize data locality
4 Optimize loop performance

Parallel Loops: Parallel

Maybe the second most important directive

— Programmer identifies block containing parallelism → compiler generates GPU code (kernel)
— Program launch creates gangs of parallel threads on GPU
— Implicit barrier at end of parallel region
— Each gang executes the same code sequentially

OpenACC: parallel
#pragma acc parallel [clause [, clause] ...] newline
{structured block}

Parallel Loops: Parallel

Clauses

Diverse clauses to augment the parallel region:

private(var)       A copy of variable var is made for each gang
firstprivate(var)  Same as private, except var will be initialized with the value from the host
if(cond)           Parallel region will execute on accelerator only if cond is true
reduction(op:var)  Reduction is performed on variable var with operation op; supported: + * max min …
async[(int)]       No implicit barrier at end of parallel region
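A minimal sketch combining some of these clauses (array names assumed, not from the tutorial code):

double sum = 0.0;
// reduction(+:sum) combines the per-gang partial sums into one value;
// if(n > 1000) keeps small problems on the host instead of the accelerator
#pragma acc parallel loop reduction(+:sum) if(n > 1000)
for (int i = 0; i < n; i++) {
    sum += a[i] * b[i];   // dot product
}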

Parallel Loops: Loops

Maybe the third most important directive

— Programmer identifies loop eligible for parallelization
— Directive must be directly before loop
— Optional: describe type of parallelism

OpenACC: loop
#pragma acc loop [clause [, clause] ...] newline
{structured block}

Parallel Loops: Loops

Clauses

independent      Iterations of loop are data-independent (implied if in parallel region, and no seq or auto given)
collapse(int)    Collapse int tightly-nested loops
seq              This loop is to be executed sequentially (not parallel)
tile(int[,int])  Split loops into loops over tiles of the full size
auto             Compiler decides what to do
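For instance, the two tightly nested swap loops of the Jacobi solver could be merged into one large parallel iteration space (a sketch of one possible use; the tutorial tasks leave this choice to you):

#pragma acc parallel loop collapse(2)
for (int iy = iy_start; iy < iy_end; iy++) {
    for (int ix = ix_start; ix < ix_end; ix++) {
        A[iy*nx+ix] = Anew[iy*nx+ix];  // both loops form one iteration space
    }
}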

Parallel Loops: Parallel Loops

Maybe the most important directive

— Combined directive: a shortcut, because it's used so often
— Any clause that is allowed on parallel or loop is allowed
— Restriction: may not appear in the body of another parallel region

OpenACC: parallel loop
#pragma acc parallel loop [clause [, clause] ...]

Parallel Loops Example

double sum = 0.0;

// Kernel 1
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    x[i] = 1.0;
    y[i] = 2.0;
}

// Kernel 2
#pragma acc parallel loop reduction(+:sum)
for (int i = 0; i < N; i++) {
    y[i] = i * x[i] + y[i];
    sum += y[i];
}

Parallel Jacobi

Add parallelism

Task 2: A First Parallel Loop
Add OpenACC parallelism to the main loop in Jacobi, then profile the code → Congratulations, you are a GPU developer!
— Change to Task2/ directory
— Compile: make
— Submit parallel run to the batch system: make run
— Adapt the bsub call and run with other numbers of iterations, matrix sizes
— Profile: make profile (pgprof or nvprof is prefixed to the call to poisson2d)

Parallel Jacobi

Compilation result

$ make
pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d_reference.c -o poisson2d_reference.o
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    109, Accelerator kernel generated
         Generating Tesla code
        109, Generating reduction(max:error)
        110, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        112, #pragma acc loop seq
    109, Generating implicit copyin(A[:],rhs[:])
         Generating implicit copyout(Anew[:])
    112, Complex loop carried dependence of Anew-> prevents parallelization
         Loop carried dependence of Anew-> prevents parallelization
         Loop carried backward dependence of Anew-> prevents vectorization

Parallel Jacobi

Run result

$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4444> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 60.0827 s, This: 9.5541 s, speedup: 6.29

pgprof / nvprof

NVIDIA's command line profiler

— Profiles applications, mainly for NVIDIA GPUs, but also CPU code
— GPU: CUDA kernels, API calls, OpenACC
— pgprof vs. nvprof: twins with different configurations
— Generates concise performance reports, full timelines; measures events and metrics (hardware counters)
⇒ Powerful tool for GPU application analysis
→ http://docs.nvidia.com/cuda/profiler-users-guide/
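Typical invocations might look like this (a sketch; in this tutorial the Makefile targets make profile and make profile_tofile wrap the calls for you):

$ pgprof ./poisson2d 10                            # print kernel/memcpy/API summary
$ pgprof --export-profile out.prof ./poisson2d 10  # write a timeline for the Visual Profiler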

Profile of Jacobi

With pgprof

$ make profile
==116606== PGPROF is profiling process 116606, command: ./poisson2d 10
==116606== Profiling application: ./poisson2d 10
Jacobi relaxation calculation: max 10 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
2048x2048: Ref: 0.8378 s, This: 0.2716 s, speedup: 3.08
==116606== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
 99.96%  129.82ms     10  12.982ms  11.204ms  20.086ms  main_109_gpu
  0.02%  30.560us     10  3.0560us  2.6240us  3.8720us  main_109_gpu_red
  0.01%  10.304us     10  1.0300us  960ns     1.2480us  [CUDA memcpy HtoD]
  0.00%  6.3680us     10  636ns     608ns     672ns     [CUDA memcpy DtoH]
==116606== Unified Memory profiling result:
Device "Tesla P100-SXM2-16GB (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
 3360  204.80KB  64.000KB  960.00KB  672.0000MB  25.37254ms  Host To Device
 3200  204.80KB  64.000KB  960.00KB  640.0000MB  30.94435ms  Device To Host
 2454                                            66.99111ms  GPU Page fault groups
Total CPU Page faults: 2304
==116606== API calls:
Time(%)  Time      Calls  Avg       Min       Max       Name
 58.17%  639.81ms      5  127.96ms  564ns     189.20ms  cuDevicePrimaryCtxRetain
 26.35%  289.79ms      4  72.449ms  69.684ms  74.126ms  cuDevicePrimaryCtxRelease

Only one function is parallelized! Let's do the rest!

More Parallelism: Kernels

More freedom for compiler

— kernels directive: second way to expose parallelism
— Region may contain parallelism
— Compiler determines parallelization opportunities → more freedom for compiler
— Rest: same as for parallel

OpenACC: kernels
#pragma acc kernels [clause [, clause] ...] newline
{structured block}

Kernels Example

double sum = 0.0;

#pragma acc kernels
{  // kernels created here
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    for (int i = 0; i < N; i++) {
        y[i] = i * x[i] + y[i];
        sum += y[i];
    }
}

kernels vs. parallel

Both approaches equally valid; can perform equally well

kernels
— Compiler performs parallel analysis
— Can cover large area of code with single directive
— Gives compiler additional leeway

parallel
— Requires parallel analysis by programmer
— Will also parallelize what compiler may miss
— Similar to OpenMP

Both
— Regions may not contain other kernels/parallel regions
— No branching into or out
— Program must not depend on order of evaluation of clauses
— At most one if clause

Parallel Jacobi II

Add more parallelism

Task 3: More Parallel Loops
Add OpenACC parallelism to the other loops of while (L:123 – L:141). Use either kernels or parallel. Do they perform equally well?
— Change to Task3/ directory
— Change source code
— Compile: make
— Study the compiler output!
— Submit parallel run to the batch system: make run

Parallel Jacobi II

Compilation result

$ make
pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d_reference.c -o poisson2d_reference.o
poisson2d.c:
main:
    109, Accelerator kernel generated
         Generating Tesla code
        109, Generating reduction(max:error)
        110, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        112, #pragma acc loop seq
    109, ...
    121, Accelerator kernel generated
         Generating Tesla code
        124, #pragma acc loop gang /* blockIdx.x */
        126, #pragma acc loop vector(128) /* threadIdx.x */
    121, Generating implicit copyin(Anew[:])
         Generating implicit copyout(A[:])
    126, Loop is parallelizable
    133, Accelerator kernel genera...

Parallel Jacobi II

Run result

$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4458> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc15>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 64.9401 s, This: 0.4099 s, speedup: 158.45

Done?!

Parallel Jacobi

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for (int ix = ix_start; ix < ix_end; ix++) {
        for (int iy = iy_start; iy < iy_end; iy++) {
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            error = fmaxr(error, fabsr(Anew[iy*nx+ix] - A[iy*nx+ix]));
    }}
    #pragma acc parallel loop
    for (int iy = iy_start; iy < iy_end; iy++) {
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[iy*nx+ix] = Anew[iy*nx+ix];
    }}
    #pragma acc parallel loop
    for (int ix = ix_start; ix < ix_end; ix++) {
        A[0*nx+ix]      = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy
    iter++;
}

Automatic Data Transfers

— Up to now: we did not care about data transfers
— Compiler and runtime took care of them
— Magic keyword: -ta=tesla:managed
— Only a feature of (recent) NVIDIA GPUs!

GPU Memory Spaces

Location, location, location

[Diagram: CPU with its DRAM; GPU with scheduler, L2 and DRAM; Unified Virtual Addressing / Unified Memory spanning both]

— At the beginning: CPU and GPU memory very distinct, own addresses
— CUDA 4.0, Unified Virtual Addressing: pointers from the same address pool, but data copy manual
— CUDA 6.0, Unified Memory*: data copy by driver, but whole data at once (Kepler)
— CUDA 8.0, Unified Memory (truly): data copy by driver, page faults on demand initiate data migrations (Pascal)
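A minimal C sketch of the difference (assuming the CUDA runtime API; illustration only, not part of the tutorial code):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 1024 * sizeof(double);

    /* Manual scheme (pre Unified Memory): two allocations, explicit copy */
    double *h_A = (double*)malloc(bytes);
    double *d_A;
    cudaMalloc((void**)&d_A, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);

    /* Unified Memory: one pointer, the driver migrates pages on demand */
    double *A;
    cudaMallocManaged((void**)&A, bytes, cudaMemAttachGlobal);
    A[0] = 42.0;   /* valid on the host; equally valid inside a kernel */

    cudaFree(d_A); cudaFree(A); free(h_A);
    return 0;
}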

Portability

— Managed memory: an NVIDIA-GPU-only feature
— One of OpenACC's great features: portability
→ Code should also be fast without -ta=tesla:managed! Let's remove it from the compile flags!

$ make
pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d_reference.c -o poisson2d_reference.o
poisson2d.c:
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (poisson2d.c: 110)
...
PGC/power Linux 17.4-0: compilation completed with severe errors

Copy Statements

The compiler implicitly created copy clauses to copy data to the device:

134, Generating implicit copyin(A[:])
     Generating implicit copyout(A[nx*(ny-1)+1:nx-2])

— It couldn't determine the length of the copied data …
— … but before, this was no problem – Unified Memory!
— Now: a problem! We need to give that information (see also later)

OpenACC: copy
#pragma acc parallel copy(A[start:length])
Also: copyin(B[s:l]) copyout(C[s:l]) present(D[s:l]) create(E[s:l])
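Applied to the first Jacobi loop nest, explicit clauses might look like this (a sketch of one possible choice; Task 4 asks you to work this out yourself, and the compiler output on the next slides shows the clauses generated for the solution):

#pragma acc parallel loop reduction(max:error) \
        copy(A[0:nx*ny], Anew[0:nx*ny], rhs[0:nx*ny])
for (int ix = ix_start; ix < ix_end; ix++) {
    for (int iy = iy_start; iy < iy_end; iy++) {
        /* stencil update and error accumulation as before */
    }
}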

Data Copies

Tell compiler which data is needed where

Task 4: Data Copies
Add copy clauses to parallel regions; profile with Visual Profiler
— Change to Task4/ directory
— Work on TODOs
— Compile: make
— Submit parallel run to the batch system: make run (it might take some time)
— Generate profile with make profile_tofile

Data Copies

Compiler Output

$ make
pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d_reference.c -o poisson2d_reference.o
poisson2d.c:
main:
    109, Generating copy(A[:ny*nx],Anew[:ny*nx],rhs[:ny*nx])
    ...
    121, Generating copy(Anew[:ny*nx],A[:ny*nx])
    ...
    131, Generating copy(A[:ny*nx])
         Accelerator kernel generated
         Generating Tesla code
        132, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    137, Generating copy(A[:ny*nx])
         Accelerator kernel generated
         Generating Tesla code
        138, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Data Copies

Run Result

$ make run
<<Starting on juronc13>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 114.7186 s, This: 25.0522 s, speedup: 4.58

Slower?! Why?

PGI/NVIDIA Visual Profiler

GUI tool accompanying pgprof / nvprof

— PGI: start pgprof without parameters; NVIDIA: start nvvp
— Timeline view of all things GPU → study stages and interplay of application
— Interactive, or with input from the command line profilers
— View launch and run configurations
— Guided and unguided analysis
→ https://developer.nvidia.com/nvidia-visual-profiler

[Screenshot: Visual Profiler]

Jacobi in Visual Profiler

Zoom in to kernel calls

[Screenshot: profiler timeline, zoomed in to the kernel calls]

slide-108
SLIDE 108

Member of the Helmholtz Association

OpenACC Workflow

Identify available parallelism Parallelize loops with OpenACC Optimize data locality Optimize loop performance

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 64 111

Analyze Jacobi Data Flow

In code

while (error > tol && iter < iter_max) {
    error = 0.0;
    // A, Anew resident on host
    // ... copy host → device ...
    #pragma acc parallel loop
    // A, Anew resident on device
    for (int ix = ix_start; ix < ix_end; ix++) {
        for (int iy = iy_start; iy < iy_end; iy++) {
            // ...
    }}
    // A, Anew resident on device
    // ... copy device → host ...
    // A, Anew resident on host
    iter++;
}

Copies are done in each iteration!

slide-115
SLIDE 115

Member of the Helmholtz Association

Analyze Jacobi Data Flow

Summary

Meanwhile, the whole algorithm is using the GPU. At the beginning of the while loop, data is copied to the device; at the end of the loop, copied back to the host. Depending on the type of parallel regions in the while loop, data is copied in between regions as well. Slow! Data copies are expensive!

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 66 111

slide-117
SLIDE 117

Member of the Helmholtz Association

Data Regions

To manually specify data locations

Defines region of code in which data remains on device
Data is shared among all kernels in region
Explicit data transfers
→ OpenACC: data

#pragma acc data [clause [, clause] ...] new-line

{structured block}
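A minimal sketch of such a region around the Jacobi iteration, with the arrays and sizes assumed from the tasks (the clauses mirror the compiler output on the Task 5 slides):

#pragma acc data copy(A[0:ny*nx]) copyin(rhs[0:ny*nx]) create(Anew[0:ny*nx])
{
    while (error > tol && iter < iter_max) {
        error = 0.0;
        // ... parallel loops working on A, Anew, rhs; no transfers in here ...
        iter++;
    }
}   // A is copied back to the host once, at the end of the region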

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 67 111

slide-118
SLIDE 118

Member of the Helmholtz Association

Data Regions

Clauses

Clauses to augment the data regions

copy(var): Allocates memory of var on GPU; copies data to GPU at beginning of region, copies data back to host at end of region. Specify size of var: var[lowerBound:size]

copyin(var): Allocates memory of var on GPU; copies data to GPU at beginning of region

copyout(var): Allocates memory of var on GPU; copies data to host at end of region

create(var): Allocates memory of var on GPU

present(var): Data of var is not copied automatically to GPU but considered present

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 68 111

slide-119
SLIDE 119

Member of the Helmholtz Association

Data Region Example

#pragma acc data copyout(y[0:N]) create(x[0:N])
{
    double sum = 0.0;

    #pragma acc parallel loop
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    #pragma acc parallel loop
    for (int i = 0; i < N; i++) { y[i] = i*x[i] + y[i]; }
}

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 69 111

slide-120
SLIDE 120

Member of the Helmholtz Association

Data Region

One data set to rule them all

TASK 5: Data Region
Add data region such that all data resides on device during iterations
Optional: see your success in Visual Profiler
— Change to Task5/ directory
— Work on TODOs
— Compile: make
— Submit to the batch system: make run
— Generate profile with make profile_tofile

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 70 111

slide-121
SLIDE 121

Member of the Helmholtz Association

Data Region

Compiler Output

$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    104, Generating copyin(rhs[:ny*nx])
         Generating create(Anew[:ny*nx])
         Generating copy(A[:ny*nx])
    110, Accelerator kernel generated
         Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        113, #pragma acc loop seq
...

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 71 111

slide-123
SLIDE 123

Member of the Helmholtz Association

Data Region

Run Result

$ make run
<<Starting on juronc12>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 115.0765 s, This: 0.4807 s, speedup: 239.38

Wow! But can we be even better?

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 72 111

slide-124
SLIDE 124

Member of the Helmholtz Association

OpenACC Workflow

Identify available parallelism Parallelize loops with OpenACC Optimize data locality Optimize loop performance

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 73 111

slide-125
SLIDE 125

Member of the Helmholtz Association

Understanding Compiler Output

110, Accelerator kernel generated
     Generating Tesla code
    110, Generating reduction(max:error)
    111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    114, #pragma acc loop seq
    114, Complex loop carried dependence of Anew-> prevents parallelization

110  #pragma acc parallel loop reduction(max:error)
111  for (int ix = ix_start; ix < ix_end; ix++)
112  {
113      // Inner loop
114      for (int iy = iy_start; iy < iy_end; iy++)
115      {
116          Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1] + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
117          error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
118      }
119  }

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 74 111

slide-126
SLIDE 126

Member of the Helmholtz Association

Understanding Compiler Output

110, Accelerator kernel generated
     Generating Tesla code
    110, Generating reduction(max:error)
    111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    114, #pragma acc loop seq
    114, Complex loop carried dependence of Anew-> prevents parallelization

Outer loop: parallelism with gang and vector
Inner loop: runs sequentially per thread (#pragma acc loop seq)
The inner loop was never parallelized!
Rule of thumb: expose as much parallelism as possible

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 74 111

slide-127
SLIDE 127

Member of the Helmholtz Association

OpenACC Parallelism

3 Levels of Parallelism

Gang > Worker > Vector

Vector: vector threads work in lockstep (SIMD/SIMT parallelism)
Worker: has 1 or more vectors; workers share common resources (cache)
Gang: has 1 or more workers; multiple gangs work independently from each other

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 75 111

slide-128
SLIDE 128

Member of the Helmholtz Association

CUDA Parallelism

CUDA Execution Model

Software → Hardware

Thread → Scalar Processor: threads are executed by scalar processors (CUDA cores)
Thread Block → Multiprocessor: thread blocks are executed on multiprocessors (SM); they do not migrate; several concurrent thread blocks can reside on one multiprocessor; limit: multiprocessor resources (register file, shared memory)
Grid → Device: a kernel is launched as a grid of thread blocks; blocks and grids can have multiple dimensions

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 76 111

slide-131
SLIDE 131

Member of the Helmholtz Association

From OpenACC to CUDA

map(acc, <<<>>>)

In general: the compiler is free to do what it thinks is best
Usually:
— gang: mapped to blocks (coarse grain)
— worker: mapped to threads (fine grain)
— vector: mapped to threads (fine SIMD/SIMT)
— seq: no parallelism; sequential
Exact mapping is compiler dependent

Performance tips:
— Use vector size divisible by 32
— Block size: num_workers × vector_length

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 77 111

slide-132
SLIDE 132

Member of the Helmholtz Association

Declaration of Parallelism

Specify configuration of threads

Three clauses of the parallel regions (parallel, kernels) change the distribution/configuration of the group of threads
Presence of keyword: distribute using this level
Optional size: control size of the parallel entity
→ OpenACC: gang, worker, vector

#pragma acc parallel loop gang vector

Also: worker
Sizes: num_gangs(n), num_workers(n), vector_length(n)
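A hedged sketch of explicit sizing; the values are illustrative, not tuned:

// 1024 gangs (blocks), each with a vector length of 256 (threads)
#pragma acc parallel loop gang vector num_gangs(1024) vector_length(256)
for (int i = 0; i < N; i++) {
    y[i] = a*x[i] + y[i];
}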

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 78 111

slide-133
SLIDE 133

Member of the Helmholtz Association

Understanding Compiler Output II

110, Accelerator kernel generated
     Generating Tesla code
    110, Generating reduction(max:error)
    111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    114, #pragma acc loop seq
    114, Complex loop carried dependence of Anew-> prevents parallelization

Compiler reports configuration of parallel entities

— Gang mapped to blockIdx.x
— Vector mapped to threadIdx.x
— Worker not used

Here: 128 threads per block, as many blocks as needed; 128 seems to be the default for Tesla/NVIDIA

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 79 111

slide-134
SLIDE 134

Member of the Helmholtz Association

More Parallelism

Unsequentialize inner loop

TASK 6: More Parallelism
Add vector clause to inner loop; study result with profiler
— Change to Task6/ directory
— Work on TODO
— Compile: make
— Submit to the batch system: make run
— Generate profile with make profile_tofile

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 80 111

slide-135
SLIDE 135

Member of the Helmholtz Association

More Parallelism

Compiler Output

$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    104, Generating create(Anew[:ny*nx])
         Generating copyin(rhs[:ny*nx])
         Generating copy(A[:ny*nx])
    110, Accelerator kernel generated
         Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang /* blockIdx.x */
        114, #pragma acc loop vector(128) /* threadIdx.x */
...

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 81 111

slide-137
SLIDE 137

Member of the Helmholtz Association

More Parallelism

Run Result

$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4490> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 111.7712 s, This: 0.9257 s, speedup: 120.74

Actually slower! Why?

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 82 111

slide-138
SLIDE 138

Member of the Helmholtz Association

Memory Coalescing

Memory in batch

Coalesced access: good
— Threads of a warp (group of 32 contiguous threads) access adjacent words
— Few transactions, high utilization

Uncoalesced access: bad
— Threads of a warp access scattered words
— Many transactions, low utilization

Best performance: threadIdx.x should access contiguously
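Two illustrative CUDA kernels for contrast (hypothetical, not from the tasks):

__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];           // a warp reads 32 adjacent words: few transactions
}

__global__ void copyStrided(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];  // a warp reads scattered words: many transactions
}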


Andreas Herten | OpenACC Tutorial | 31 August 2017 # 83 111

slide-139
SLIDE 139

Member of the Helmholtz Association

Jacobi’s Access Pattern

Coalesced or uncoalesced, that is the question

#pragma acc parallel loop reduction(max:error)
for (int ix = ix_start; ix < ix_end; ix++) {
    #pragma acc loop vector
    for (int iy = iy_start; iy < iy_end; iy++) {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1] + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        //...

Fast-running index: ix; slow-running index: iy
But the vector loop runs over iy!
Consecutive threads access far-apart memory locations!

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 84 111

slide-140
SLIDE 140

Member of the Helmholtz Association

Fixing Access Pattern

Loop change

TASK 7: Loop Ordering
Interchange loop order for the Jacobi loops (see the sketch below)
Also: compare to the loop-fixed CPU reference version
— Change to Task7/ directory
— Work on TODO
— Compile: make
— Submit to the batch system: make run
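A sketch of the interchanged loops, reusing the loop body from the previous slides:

#pragma acc parallel loop reduction(max:error)
for (int iy = iy_start; iy < iy_end; iy++) {
    #pragma acc loop vector
    for (int ix = ix_start; ix < ix_end; ix++) {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1] + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        error = fmaxr(error, fabsr(Anew[iy*nx+ix] - A[iy*nx+ix]));
    }   // vector threads now run over ix, the fast index: coalesced access
}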

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 85 111

slide-141
SLIDE 141

Member of the Helmholtz Association

Fixing Access Pattern

Compiler output (unchanged)

$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    104, Generating create(Anew[:ny*nx])
         Generating copyin(rhs[:ny*nx])
         Generating copy(A[:ny*nx])
    110, Accelerator kernel generated
         Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang /* blockIdx.x */
        114, #pragma acc loop vector(128) /* threadIdx.x */
...

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 86 111

slide-143
SLIDE 143

Member of the Helmholtz Association

Fixing Access Pattern

Run Result

$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4490> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 113.0214 s, This: 0.3284 s, speedup: 344.15

Again with proper CPU version!

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 87 111

slide-145
SLIDE 145

Member of the Helmholtz Association

Fixing Access Pattern

Run Result II

$ make run
bsub -I -R "rusage[ngpus_shared=20]" ./poisson2d
Job <4490> is submitted to default queue <normal.i>.
<<Waiting for dispatch ...>>
<<Starting on juronc11>>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 6.8080 s, This: 0.2609 s, speedup: 26.10

26 × is great!

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 88 111

slide-147
SLIDE 147

Member of the Helmholtz Association

Page-Locked Memory

Pageability

Host memory allocated with malloc() is pageable

— Memory pages can be moved by the OS kernel, e.g. swapped out to disk
— Additional indirection

NVIDIA GPUs can allocate page-locked memory (pinned memory)

+ Faster (safety guards are skipped)
+ Interleaving of execution and copy (asynchronous)
+ Directly mapped into GPU memory∗
− Scarce resource; OS performance could degrade

OpenACC: very easy to use pinned memory
→ -ta=tesla:pinned
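One possible compile line, extending the flags from the earlier make output (exact suboption spelling assumed):

$ pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,pinned poisson2d.c poisson2d_reference.o -o poisson2d

# or, for unified memory:
$ pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d.c poisson2d_reference.o -o poisson2d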

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 89 111

slide-149
SLIDE 149

Member of the Helmholtz Association

Page-Locked Memory

Loop change

TASK 7': Pinned Memory
Compare performance with and without pinned memory; also test unified memory again
— Like in Task 7, but change compilation to include pinned or managed
— Submit to the batch system: make run

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 90 111

slide-150
SLIDE 150

Member of the Helmholtz Association

OpenACC Workflow

Identify available parallelism Parallelize loops with OpenACC Optimize data locality Optimize loop performance

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 91 111

slide-151
SLIDE 151

Member of the Helmholtz Association

Interoperability

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 92 111

slide-152
SLIDE 152

Member of the Helmholtz Association

Interoperability

OpenACC can operate together with

— Applications
— Libraries
— CUDA

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 93 111

slide-154
SLIDE 154

Member of the Helmholtz Association

The Keyword

OpenACC’s Rosetta Stone

host_data use_device

Background

— GPU and CPU are different devices and have different memory → distinct address spaces

OpenACC hides handling of addresses from user

— For every chunk of accelerated data, two addresses exist
— One for CPU data, one for GPU data
— OpenACC uses the appropriate address in the accelerated kernel

But: automatic handling does not work outside of OpenACC (OpenACC will default to the host address) → host_data use_device uses the address of the GPU device data for its scope

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 94 111

slide-155
SLIDE 155

Member of the Helmholtz Association

The host_data Construct

That’s all you need

Usage:

double* foo = new double[N];         // foo on Host

#pragma acc data copyin(foo[0:N])    // foo on Device
{
    ...
    #pragma acc host_data use_device(foo)
    some_lib_func(foo);              // Device: OK!
    ...
}

Directive can be used for structured block as well
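For example (library function names hypothetical):

#pragma acc host_data use_device(foo)
{
    // every call in this block receives the device address of foo
    some_lib_func(foo);
    another_lib_func(foo, N);
}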

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 95 111

slide-156
SLIDE 156

Member of the Helmholtz Association

The Inverse: deviceptr

When CUDA is involved

For the inverse case:

— Data has been copied by CUDA or a CUDA-using library
— Pointer to data residing on the device is returned
→ Use this data in OpenACC context

deviceptr clause declares data to be on device

Usage:

float *x;
int n = 4223;
cudaMalloc((void**)&x, (size_t)n*sizeof(float));
// ...
#pragma acc kernels deviceptr(x)
for (int i = 0; i < n; i++) { x[i] = i; }

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 96 111

slide-158
SLIDE 158

Member of the Helmholtz Association

Interoperability

Tasks

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 97 111

slide-159
SLIDE 159

Member of the Helmholtz Association

Task 8-1

Introduction to BLAS

Use case: Anything linear algebra

BLAS: Basic Linear Algebra Subprograms
— Vector-vector, vector-matrix, matrix-matrix operations
— Specification of routines
— Examples: SAXPY, DGEMV, ZGEMM
→ http://www.netlib.org/blas/

cuBLAS: NVIDIA's linear algebra routines with BLAS interface, readily accelerated
→ http://docs.nvidia.com/cuda/cublas/

Task 8-1: Use cuBLAS for vector addition, everything else with OpenACC

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 98 111

slide-160
SLIDE 160

Member of the Helmholtz Association

Task 8-1

cuBLAS OpenACC Interaction

cuBLAS routine used:

cublasDaxpy(cublasHandle_t handle, int n, const double *alpha, const double *x, int incx, double *y, int incy)

handle encapsulates GPU auxiliary data; needs to be created and destroyed with cublasCreate and cublasDestroy

x and y point to addresses on device!

cuBLAS library needs to be linked with -lcublas
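A sketch of how the pieces could fit together (function name and structure assumed; the task's files differ):

#include <cublas_v2.h>

// y = alpha*x + y, computed on the GPU via cuBLAS
void daxpy_acc(int n, double alpha, const double *x, double *y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc host_data use_device(x, y)
        cublasDaxpy(handle, n, &alpha, x, 1, y, 1);  // device addresses of x, y
    }

    cublasDestroy(handle);
}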

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 99 111

slide-161
SLIDE 161

Member of the Helmholtz Association

Task 8-1

Vector Addition with cuBLAS

Use cuBLAS for vector addition

TASK 8-1: OpenACC + cuBLAS
— Change to Task8-1/ directory
— Work on TODOs in vecAddRed.c
— Use host_data use_device to provide the correct pointer
— Check the cuBLAS documentation for details on cublasDaxpy()
— Compile: make
— Submit to the batch system: make brun

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 100 111

slide-162
SLIDE 162

Member of the Helmholtz Association

Task 8-2

CUDA Need-to-Know

Use case:

— Working on legacy code — Need the raw power (/flexibility) of CUDA

CUDA need-to-knows:

— Thread → Block → Grid: the total number of threads should map to your problem; threads are always given per block
— A kernel is called from every thread on the GPU device; number of kernel threads: triple chevron syntax

kernel<<<nBlocks, nThreads>>>(arg1, arg2, ...)

— Kernel: function with the __global__ prefix; aware of its index via built-in variables, e.g. threadIdx.x
→ http://docs.nvidia.com/cuda/
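A sketch of what such a kernel and its launch might look like (names hypothetical):

__global__ void vecAdd(const double *a, const double *b, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

extern "C" void launchVecAdd(const double *a, const double *b, double *c, int n)
{
    int nThreads = 256;                              // assumed block size
    int nBlocks  = (n + nThreads - 1) / nThreads;    // enough blocks to cover n
    vecAdd<<<nBlocks, nThreads>>>(a, b, c, n);       // pointers are device addresses
}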

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 101 111

slide-163
SLIDE 163

Member of the Helmholtz Association

Task 8-2

Vector Addition with CUDA Kernel

TASK 8-2: CUDA kernel for vector addition, rest OpenACC
Marrying CUDA C and OpenACC:

— All direct CUDA interaction is wrapped in a wrapper file, cudaWrapper.cu, compiled with nvcc to an object file (-c)
— vecAddRed.c calls the external function from cudaWrapper.cu (extern)

Task 8-2: OpenACC + CUDA
— Change to Task8-2/ directory
— Work on TODOs in vecAddRed.c and cudaWrapper.cu
— Use host_data use_device to provide the correct pointer
— Implement computation in kernel, implement call of kernel
— Compile: make
— Submit to the batch system: make brun

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 102 111

slide-164
SLIDE 164

Member of the Helmholtz Association

Task 8-3

Vector Addition with Thrust

Thrust
— Template library for CUDA C/C++ (similar to the STL)
— Offers many pre-made algorithms for popular computing tasks
— Usually works with C++ iterators, but understands C arrays as well
→ http://thrust.github.io/

Use Thrust for the reduction, everything else of the vector addition with OpenACC

Task 8-3: OpenACC + Thrust
— Change to Task8-3/ directory
— Work on TODOs in vecAddRed.c and thrustWrapper.cu
— Use host_data use_device to provide the correct pointer
— Implement call to thrust::reduce using c_ptr (see the sketch below)
— Compile: make
— Submit to the batch system: make brun
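A sketch of what thrustWrapper.cu could contain (function name assumed):

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// c_ptr is a raw device pointer, obtained on the caller's side
// via host_data use_device
extern "C" double reduceSum(const double *c_ptr, int n)
{
    return thrust::reduce(thrust::device, c_ptr, c_ptr + n, 0.0);
}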

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 103 111

slide-165
SLIDE 165

Member of the Helmholtz Association

Task 8-4

Stating the Problem

We want to solve the Poisson equation ΔΦ(x, y) = −ρ(x, y) with periodic boundary conditions in x and y
Needed, e.g., for finding the electrostatic potential Φ for a given charge distribution ρ
Model problem: ρ(x, y) = cos(4πx) · sin(2πy), (x, y) ∈ [0, 1)²
Analytically known: Φ(x, y) = Φ₀ cos(4πx) sin(2πy)
Let's solve the Poisson equation with a Fourier transform!

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 104 111

slide-166
SLIDE 166

Member of the Helmholtz Association

Task 8-4

Introduction to Fourier Transforms

Discrete Fourier transform and re-transform:

\hat{f}_k = \sum_{j=0}^{N-1} f_j \, e^{-2\pi i k j / N} \quad\Leftrightarrow\quad f_j = \sum_{k=0}^{N-1} \hat{f}_k \, e^{2\pi i j k / N}

Time for all \hat{f}_k: O(N²)
Fast Fourier Transform: recursive splitting → O(N log N)
Find derivatives in Fourier space:

f'_j = \sum_{k=0}^{N-1} i k \, \hat{f}_k \, e^{2\pi i j k / N}

It's just multiplying by ik!

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 105 111

slide-167
SLIDE 167

Member of the Helmholtz Association

Task 8-4

Plan for FFT Poisson Solution

Start with charge density ρ

1. Fourier-transform ρ:                   ρ̂ ← F(ρ)                 (cuFFT)
2. Integrate ρ̂ in Fourier space twice:    φ̂ ← −ρ̂ / (k_x² + k_y²)    (OpenACC)
3. Inverse Fourier-transform φ̂:           φ ← F⁻¹(φ̂)                (cuFFT)

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 106 111

slide-169
SLIDE 169

Member of the Helmholtz Association

Task 8-4

cuFFT

cuFFT: NVIDIA's (Fast) Fourier Transform library
— 1D, 2D, 3D transforms; complex and real data types
— Asynchronous execution
— Modeled after the FFTW library (API)
— Part of the CUDA Toolkit
→ https://developer.nvidia.com/cufft

cufftDoubleComplex *src, *tgt;               // Device data!
cufftHandle plan;
// Set up 2D complex-to-complex trafo with dimensions (Nx, Ny)
cufftPlan2d(&plan, Nx, Ny, CUFFT_Z2Z);
cufftExecZ2Z(plan, src, tgt, CUFFT_FORWARD); // FFT
cufftExecZ2Z(plan, tgt, tgt, CUFFT_INVERSE); // iFFT (in-place trafo)
cufftDestroy(plan);                          // Clean-up

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 107 111

slide-170
SLIDE 170

Member of the Helmholtz Association

Task 8-4

Synchronizing cuFFT

CUDA Streams enable interleaving of computational tasks

cuFFT uses streams for asynchronous execution
cuFFT runs in the default CUDA stream; OpenACC does not → trouble
⇒ Force cuFFT onto the OpenACC stream

#include <openacc.h>

// Obtain the OpenACC default stream id
cudaStream_t accStream = (cudaStream_t) acc_get_cuda_stream(acc_async_sync);

// Execute all cuFFT calls on this stream
cufftSetStream(plan, accStream);

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 108 111

slide-171
SLIDE 171

Member of the Helmholtz Association

Task 8-4

OpenACC and cuFFT

Use case: Fourier transforms
Use cuFFT and OpenACC to solve Poisson's equation

TASK 8-4: OpenACC + cuFFT
— Change to Task8-4/ directory
— Work on TODOs in poisson.c
— solveRSpace: force cuFFT onto the correct stream; implement data handling with host_data use_device
— solveKSpace: implement data handling and parallelism (see the sketch below)
— Compile: make
— Submit to the batch system: make brun
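A hedged sketch of the solveKSpace step (array and variable names assumed; .x/.y are the real/imaginary parts of cufftDoubleComplex):

// phi_hat <- -rho_hat / (kx^2 + ky^2), skipping the k = 0 mode
#pragma acc parallel loop
for (int j = 0; j < Ny; j++) {
    #pragma acc loop vector
    for (int i = 0; i < Nx; i++) {
        double k2 = kx[i]*kx[i] + ky[j]*ky[j];
        if (k2 != 0.0) {
            phi_hat[j*Nx+i].x = -rho_hat[j*Nx+i].x / k2;
            phi_hat[j*Nx+i].y = -rho_hat[j*Nx+i].y / k2;
        }
    }
}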

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 109 111

slide-172
SLIDE 172

Member of the Helmholtz Association

Conclusions

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 110 111

slide-174
SLIDE 174

Member of the Helmholtz Association

Conclusions

OpenACC directives and clauses

#pragma acc parallel loop copyin(A[0:N]) reduction(max:err) vector

Start easy, optimize from there
PGI / NVIDIA Visual Profiler help to find bottlenecks
OpenACC is interoperable with other GPU programming models
Don't forget the CPU version!

Thank you for your attention!

a.herten@fz-juelich.de

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 111 111

slide-175
SLIDE 175

Member of the Helmholtz Association

Appendix: List of Tasks, Glossary, References

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 1 8

slide-176
SLIDE 176

Member of the Helmholtz Association

List of Tasks

Task 0⋆: Setup
Task 0: Getting Started
Task 1: Analyze Application
Task 2: A First Parallel Loop
Task 3: More Parallel Loops
Task 4: Data Copies
Task 5: Data Region
Task 6: More Parallelism
Task 7: Loop Ordering
Task 7': Pinned Memory
Task 8-1: OpenACC + cuBLAS
Task 8-2: OpenACC + CUDA
Task 8-3: OpenACC + Thrust
Task 8-4: OpenACC + cuFFT

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 2 8

slide-177
SLIDE 177

Member of the Helmholtz Association

Glossary I

API: A programmatic interface to software through well-defined functions. Short for application programming interface.

CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.

GCC: The GNU Compiler Collection, the collection of open-source compilers, among others for C and Fortran.

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 3 8

slide-178
SLIDE 178

Member of the Helmholtz Association

Glossary II

NVIDIA: US technology company creating GPUs.

OpenACC: Directive-based programming, primarily for many-core machines.

OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 4 8

slide-179
SLIDE 179

Member of the Helmholtz Association

Glossary III

OpenMP: Directive-based programming, primarily for multi-threaded machines.

Pascal: GPU architecture from NVIDIA (announced 2016).

Thrust: A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.

CPU: Central Processing Unit.

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 5 8

slide-180
SLIDE 180

Member of the Helmholtz Association

Glossary IV

GPU: Graphics Processing Unit.

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 6 8

slide-181
SLIDE 181

Member of the Helmholtz Association

References I

[3] Gene M. Amdahl. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities". In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference. AFIPS '67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483–485. DOI: 10.1145/1465482.1465560. URL: http://doi.acm.org/10.1145/1465482.1465560.

[4] John L. Gustafson. "Reevaluating Amdahl's Law". In: Commun. ACM 31.5 (May 1988), pp. 532–533. ISSN: 0001-0782. DOI: 10.1145/42411.42415. URL: http://doi.acm.org/10.1145/42411.42415.

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 7 8

slide-182
SLIDE 182

Member of the Helmholtz Association

References: Images, Graphics

[1] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/ (pages 4, 5).

[2] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/ (pages 4, 5).

Andreas Herten | OpenACC Tutorial | 31 August 2017 # 8 8