Programming Heterogeneous Systems, F. Bodin, June 2013, Uppsala (PowerPoint presentation)

SLIDE 1

Programming Heterogeneous Systems

  • F. Bodin

June 2013 Uppsala

SLIDE 2

Introduction

  • HPC and embedded software are undergoing dramatic changes to adapt to massive parallelism

  • Huge market issue
  • Many codes and users not ready → directive-based approaches
  • Key economic competitiveness topic
  • Performance and energy consumption intimately coupled
  • Looking to minimize both code execution time and energy consumption
  • Specialized solutions based on accelerators and co-processors
  • Exascale driving the next generation of technologies*
  • Embedded systems
  • HPC
  • Big data

*see ETP4HPC Strategic Research Agenda

2 Uppsala 5/06/13

SLIDE 3

Overview of the Presentation

  • PART I
  • Using Accelerators
  • PART II
  • An Overview of OpenACC Directives

SLIDE 4

PART I Using Accelerators

SLIDE 5

PART I Overview

  • Accelerator / Co-processor Technology
  • One OpenCL to Rule Them All?
  • Auto-Tuning Overview
  • CAPS Auto-tuning Approach
  • Making OpenMP Codes Heterogeneous

SLIDE 6

Accelerator/Coprocessor Architectures

  • Many architectures
  • GPU based systems: Nvidia Kepler, AMD APU, ARM Mali, …
  • CPU core based systems: Intel Xeon Phi, Kalray MPPA, …
  • SIMT based architecture
  • Performance from vector accesses and plenty of threads
  • Cache based architecture
  • Performance from caching and vector instructions
  • Different address spaces
  • Distributed or shared (APU and embedded systems)

SLIDE 7
  • Heterogeneity is
  • Different parallel models
  • Different ISAs
  • Different compilers
  • Different memory systems
  • Different libraries
  • Performance and code migration are very dependent on hardware idiosyncrasies

  • Hardware landscape still very chaotic


Heterogeneous Architectures

SLIDE 8

Programming Heterogeneous Model

  • Native programming languages
  • CUDA / OpenCL
  • OpenCL available almost everywhere
  • Directive based API
  • OpenACC, OpenHMPP, PGI Acc, …
  • Intersection of accelerators capabilities
  • OpenMP accelerator extension in two flavors
  • GPU execution model oriented
  • OpenMP execution model oriented

SLIDE 9

(Codes need to move in this space, and new hardware is still to come.)

Code Writing Constraints

  • A code must be written for a set of hardware configurations
  • 6 CPU cores + Intel Xeon Phi
  • 24 CPU cores + AMD GPU / Nvidia GPU / …
  • 12 cores + 2 GPUs
  • AMD APU


[Figure: the hardware space spans x86/ARM multi-cores (fat out-of-order cores), Intel MIC / Kalray MPPA (light cores), and Nvidia/AMD/ARM GPUs (SIMT cores).]

SLIDE 10

Compilers and Heterogeneous Hardware

  • Compilers are heterogeneous themselves
  • Not one technology fits all
  • Want to mix the best compilers to address heterogeneity


CPU compilers

  • Intel compilers
  • IBM compilers
  • ABSoft
  • Pathscale
  • PGI
  • Gcc
  • LLVM
  • Open64

Accelerator compilers

  • Nvidia Cuda compiler
  • Intel OpenCL
  • AMD OpenCL
  • ARM OpenCL
  • Kalray compilers

[Figure: CPU compilers target x86, ARM, MIPS, PowerPC, …; accelerator compilers target x86, PTX, HSA, the Kalray MPPA ISA, …]

SLIDE 11

Limits of Compilers

  • Excellent at transforming codes, poor at understanding semantics and making decisions
  • Lack a lot of data anyway
  • Code execution is more sensitive to optimization on heterogeneous hardware
  • Experts invent strategies, not compilers
  • See "3D Finite Difference Computation on GPUs using CUDA" by Paulius Micikevicius, NVIDIA
  • Known code transformations, but a specific strategy
  • Need to provide extra semantics and optimization strategies
  • Specific to each target system and application

SLIDE 12

One OpenCL to Rule Them All?

SLIDE 13

One OpenCL to Rule Them All?*

  • HydroC mini-app
  • Many kernels
  • Accelerator friendly
  • OpenCL thread work-group tuning
  • Generated from OpenACC
  • Generated code as efficient as the native one

[Figure: efficiency loss of the code variants; lower is better. A value of 0.00% indicates that the variant reaches the best performance.]

*http://www.caps-entreprise.com/wp-content/uploads/2012/08/One-OpenCL-to-rule-them-all.pdf

SLIDE 14

Auto-Tuning Techniques

SLIDE 15

Auto-Tuning Overview

  • Two main issues
  • Discovery of optimizing code transformations
  • Mainly an offline technique
  • Adaptation to the execution context
  • An online technique
  • Need to create an optimization space to explore
  • Auto-tuning capabilities are intrinsically limited by coding APIs
  • Code generation must have a lot of freedom to deal with heterogeneous systems
  • Auto-tuning has to be integrated into parallel programming
  • Separation of the code generation/optimization infrastructure from the exploration infrastructure is important
  • Many different ways to explore the optimization space (e.g. serial versus distributed)
  • Not a compiler infrastructure issue but a system issue
  • Many different metrics (e.g. time, energy, multi-objectives, …)
  • The optimization space has to be focused to limit the runtime search
  • High-level information is difficult to prove from source code and sensitive to coding style
  • e.g. computing a convolution
  • Lack of contextual information
  • e.g. data transformations

SLIDE 16

Auto-Tuning Approach for Heterogeneous HW

  • A directive-based approach is pertinent
  • But directives need to be "high-level", not too abstract
  • Keep the CPU code as simple as possible
  • Some issues are local
  • e.g. kernel optimizations
  • Some issues are global
  • e.g. data movements, libraries
  • The infrastructure needs to be compiler independent
  • The exploration engine can exist in many configurations
  • Parallel exploration of the optimization space
  • Sequential exploration
  • Many strategies (e.g. random, ML, genetic)


[Figure: code generation lowers parallel hardware-independent code (e.g. C, Fortran) into parallel hardware-dependent code (e.g. CUDA, OpenCL), getting closer to the hardware; high-level information cannot be reconstructed at the lower level.]

SLIDE 17
  • Rely on code variants
  • Some created dynamically via dynamic parameters such as OpenACC #gang, #worker, #vector
  • Some static, obtained via program transformations such as the ones provided in OpenHMPP
  • Source-to-source technology is suitable for heterogeneous systems


Auto-tuning Codes

[Figure: the HMPP compiler produces several codelet variants (1, 2, 3, …); a variant is selected dynamically at runtime based on execution feedback.]

SLIDE 18

Example of Exploring Runtime Parameters

  • Auto-tuning implementation of a Blur filter in OpenACC
  • Explore dynamic parameters (e.g. #gangs, #workers)


size_t gangs[]   = { 8, 16, 32, 64, 128, 128, 8, 16, 32, 64, 128, 256 };
size_t workers[] = { 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24 };
…
while (nber_of_iterations < max_iterations) {
  …
  variant = variantSelectorState("kernel.c:21",
                                 (sizeof(gangs)/sizeof(size_t))-1);
  blur(images[(currentImage + 1) % 2], image_caps, width, height,
       blockSize, gangs[variant], workers[variant]);
  …
}

#pragma acc parallel, copyin(dst_caps[0:height*width]), copyout(src_caps[0:height*width]), num_gangs(gangs), num_workers(workers), vector_length(32)
{
  #pragma acc loop, gang
  for (tileY = 0; tileY < tileCountY; tileY++) {
    for (tileX = 0; tileX < tileCountX; tileX++) {
      …

(The parallel region is parameterized: the gangs/workers tables define the parameter space to explore, and the variant selector turns the auto-tuning driver on.)

SLIDE 19


DNADist Auto-tuning (bio info)

[Figures: kernel computation time per call (in sec, lower is better), showing an exploration phase followed by a steady state; data can be collected over multiple executions. Kepler: best config 10 = 256 gangs x 128 workers; CARMA: best config 8 = 14 gangs x 16 workers.]

SLIDE 20


DNADist Auto-tuning

[Figures: best configurations per device. AMD Trinity APU: config 8 = 14 gangs x 8 workers; AMD 7970 GPU: config 16 = 64 gangs x 8 workers; Intel Xeon Phi: config 16 = 64 gangs x 8 workers; Nvidia Fermi: config 10 = 256 gangs x 128 workers.]

SLIDE 21

CAPS Source-to-Source Technology

  • Standard languages are stable APIs as targets
  • e.g. OpenCL, CUDA, C + pthreads
  • Exploit native target machine code generation
  • Exploit hardware vendors' code optimization know-how
  • Use the application's usual host CPU compiler
  • Many (large) codes have compiler dependencies
  • Losing performance on the CPU is not an option
  • Hardware-independent runtime
  • Want to avoid link issues
  • Fallback mode to CPU code
  • One version of the binary code per host system
  • But
  • Source target language may be inconvenient (e.g. Fortran)
  • Generated code may interact badly with the native compiler

[Figure: CAPS compilation flow. C++/C/Fortran frontends feed an extraction module that separates host code from kernels (Fun #1, #2, #3); the instrumentation module and the CPU compiler (gcc, ifort, …) produce the executable (mybin.exe) linked with the CAPS Runtime, while OpenCL/CUDA generation and the native compilers build the HWA code as a dynamic library.]

SLIDE 22
  • PoCC (Polyhedral Compiler Collection)
  • http://www.cse.ohio-state.edu/~pouchet/software/pocc/
  • Rose source-to-source translators
  • http://rosecompiler.org/
  • Cetus, source-to-Source compiler infrastructure
  • http://cetus.ecn.purdue.edu
  • EDG, Edison Design Group technology
  • http://www.edg.com
  • Clang
  • http://clang.llvm.org
  • Insieme Compiler
  • http://www.insieme-compiler.org/


Other Source-to-Source Technologies

SLIDE 23

CAPS Auto-tuning Approach

SLIDE 24

An Embedded DSL Oriented Approach to Auto-Tuning

  • Source-to-Source approach
  • Exploit native compilers
  • Code partitioning to help offline approach
  • CodeletFinder
  • Scripting to implement DSL approaches
  • Generate domain / code specific search / optimization space
  • Static selections of variants
  • Runtime code parameters instantiation
  • Variant parameters fixed at runtime
  • Low-level API for "auto-tuning drivers"
  • Separate objective functions from optimization space generation
  • An engineering issue
  • How to embed code/domain-specific strategies so they are ready to use for programmers
  • Still dealing with legacy code
  • Integration in the compiling process is a key feature
  • Focus on node-level issues

SLIDE 25

Auto-tuning Flow Overview

[Figure: auto-tuning flow. The source code goes through CodeletFinder (codelets CT0, CT2) and the HMPP Compiler to produce an auto-tunable executable; via the CAPS profiling/tuning interface, an auto-tuning driver collects profiling data and explores the variants space, guided by performance tools, an optimizing strategy and optimizing scripts.]

SLIDE 26

Code Partitioning for Auto-Tuning

  • Tuning and analyzing the performance of large/complex applications is usually a challenge
  • Execution time with real data sets is usually too long to be compatible with the trial/experiment cycle
  • Many performance or tuning tools cannot be used at large scale
  • Compute-intensive parts of the code usually represent a small portion of the total application
  • Extract these parts to focus on them
  • Allows the use of many analysis and optimizing tools
  • Faster experiment cycle
  • Similar works: Code Isolator (Y.-J. Lee and M. W. Hall) and Rose Outliner (C. Liao, D. J. Quinlan, R. Vuduc, and T. Panas)

SLIDE 27

CodeletFinder

  • Decomposing applications into hotspots
  • Each hotspot can be efficiently analyzed separately
  • Outlined hotspots
  • Mix of static and dynamic analysis
  • Code and data

SLIDE 28

CodeletFinder Process Overview

Project Capture → Hotspot Finder → Codelet Builder → Micro Bencher

  • Project Capture
  • Captures the build process
  • Captures execution parameters
  • Replays the build on demand
  • Hotspot Finder
  • Finds hotspots in the application using execution profiles
  • Statically extracts potential hotspots
  • Codelet Builder
  • Builds the codelets based on identified hotspots (code outliner)
  • Patterns are given to build the codelets
  • Captures data for the micro-benches
  • Micro Bencher
  • Creates standalone micro-benches
  • Runs the micro-benches

SLIDE 29

Scripting to Generate the Search Space

  • Most tuning strategies are code/domain specific
  • In regard to the code structure and runtime properties
  • Many codes live long, which allows amortizing code-specific approaches
  • Many different high-level approaches can be embedded
  • Stencil code generator (e.g. Patus)
  • Polyhedral model based approach → PoCC
  • Libraries
  • Data structure transformations

SLIDE 30

New CAPS Compiler Features

  • Code transformation scripts (in Lua) can be added as a pre-compilation phase
  • Scripts can read and modify the source code AST


[Figure: the same compilation flow as before, extended with a scripting engine that runs application/domain-specific scripts between the frontends and the extraction module.]

SLIDE 31
  • Directives convey programmer knowledge
  • The code provides low-level information
  • e.g. loop indices, variable names, …
  • Scripts hide low-level code transformation details
  • Many loop transformations can be implemented using hmppcg directives


Tuning Script Implementation

!$capstune scriptName scriptInput
   code region
!$capstune end scriptName

(scriptName selects the script to be activated; scriptInput provides expressions carrying high-level information to the script.)

SLIDE 32


Simple Example-1

…
!$capstune stencil …
!$acc kernels
!$acc loop independent
do i=1,10
  !$acc loop independent
  do j=1,10
    a(i,j) = … b(i,j) …
  end do
end do
!$acc end kernels
!$capstune end stencil
…

Specifies the script to generate an optimized stencil code using various methods:

  • multiple variants
  • external tools
  • using a library

SLIDE 33


Simple Example-2

TYPE foo
  REAL :: w(10,10)
  REAL :: x(10,10)
  REAL :: y(10,10)
  REAL :: z(10,10)
END TYPE foo
…
!$capstune scalarize state_x => state%x , state_z => state%z
!$acc parallel num_gangs(10) num_workers(10) copyout(state_x) copyin(state_z)
!$acc loop gang
do i=1,10
  !$acc loop worker
  do j=1,10
    state%x(i,j) = state%z(i,j) + i + j/1000.0
  end do
end do
!$acc end parallel
!$capstune end scalarize

Transforms a data structure for an accelerator:

  • Takes slices of a derived type
  • The decision usually cannot be made from local code analysis

SLIDE 34

Making OpenMP Codes Heterogeneous

SLIDE 35

Code Generation Process Overview

  • Converts OpenMP code to use the GPU automatically
  • Currently focusing on AMD APUs
  • Incremental process to make the OpenMP code GPU friendly

SLIDE 36

Data Uses Analysis

  • Necessary to allocate data on the accelerator and compute basic data transfer overheads
  • Keep the analysis overhead low
  • Analysis based on an abstract execution of the OpenMP loop nest sequence

SLIDE 37

Preliminary Example of Experiments Display

SLIDE 38

PART I Conclusion

  • OpenMP, a good start for migrating codes
  • Data-use analysis is a key feature
  • Source-to-source technology is well adapted to heterogeneity
  • Avoids a "one compiler fits all" approach
  • Auto-tuning techniques help to simplify code tuning and deployment
  • The DSL approach helps to guide the auto-tuning process

SLIDE 39

PART II OpenACC

Directives for Accelerators

SLIDE 40

Credits

  • http://www.openacc.org/
  • V1.0: November 2011 Specification
  • OpenACC, Directives for Accelerators, Nvidia Slideware
  • CAPS Compilers-3.x OpenACC Reference Manual, CAPS entreprise

SLIDE 41

Agenda

  • OpenACC Overview and Compilers
  • Lab Session 1: Using CAPS Compilers
  • Programming Model
  • Lab Session 2: Offloading Computations
  • Managing Data
  • Lab Session 3: Optimizing Data Transfers
  • Specifying Parallelization
  • Lab Session 4: Optimizing Compute Kernels
  • Asynchronism
  • Lab Session 5: Performing Asynchronous Computations
  • Runtime API
  • OpenACC 2.0 Draft Specification

SLIDE 42

OpenACC Overview and Compilers

SLIDE 43

Directive-based Programming (1)

  • Three ways of programming GPGPU applications:

  • Libraries: ready-to-use acceleration
  • Directives: quickly accelerate existing applications
  • Programming languages: maximum performance

SLIDE 44

Directive-based Programming (2)

SLIDE 45

Advantages of Directive-based Programming

  • Simple and fast development of accelerated applications
  • Non-intrusive
  • Helps to keep a unique version of code
  • To preserve code assets
  • To reduce maintenance cost
  • To be portable on several accelerators
  • Incremental approach
  • Enables "portable" performance

SLIDE 46

OpenACC Initiative

  • A CAPS, CRAY, Nvidia and PGI initiative
  • Open Standard
  • A directive-based approach for programming heterogeneous many-core hardware in C and Fortran applications

  • http://www.openacc-standard.com

SLIDE 47

OpenACC Compilers (1)

CAPS Compilers:

  • Source-to-source compilers
  • Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs

PGI Accelerator:

  • Extension of the x86 PGI compiler
  • Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs

Cray Compiler:

  • Provided with Cray systems only

SLIDE 48

CAPS Compilers (2)

Are source-to-source compilers, composed of 3 parts:

  • The directives (OpenACC or OpenHMPP)
  • Define parts of code to be accelerated
  • Indicate resource allocation and communication
  • Ensure portability
  • The toolchain
  • Helps building manycore applications
  • Includes compilers and target code generators
  • Insulates hardware specific computations
  • Uses hardware vendor SDK
  • The runtime
  • Helps to adapt to platform configuration
  • Manages hardware resource availability

SLIDE 49

CAPS Compilers (3)

  • Take the original application as input and generate another application source code as output
  • Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
  • Compile the entire hybrid application
  • Just prefix the original compilation line with capsmc to produce a hybrid application


$ capsmc gcc myprogram.c
$ capsmc gfortran myprogram.f90

SLIDE 50

CAPS Compilers (4)

  • CAPS Compilers drive all compilation passes
  • Host application compilation
  • Calls traditional CPU compilers
  • The CAPS Runtime is linked to the host part of the application
  • Device code production
  • According to the specified target
  • A dynamic library is built

[Figure: compilation flow with both back ends. The C++/C/Fortran frontends and the extraction module split host code from codelets (Fun #1, #2, #3); the CPU compiler (gcc, ifort, …) builds the executable (mybin.exe) linked with the CAPS Runtime, while CUDA and OpenCL code generation plus the CUDA/OpenCL compilers build the HWA code as a dynamic library.]

SLIDE 51

CAPS Compilers Options

  • Usage:
  • To display the compilation process
  • To specify accelerator-specific code

$ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>

$ capsmc -d -c gcc myprogram.c
$ capsmc --openacc-target CUDA gcc myprogram.c    # (default)
$ capsmc --openacc-target OPENCL gcc myprogram.c  # (AMD and Phi)

SLIDE 52

Lab Session 1: Using CAPS Compilers

SLIDE 53

Lab 1: Using CAPS Compilers

  • Compile and execute a simple "Hello world!" application
  • Use the -d and -c flags to display the compilation process
  • Use ldd on the output executable to print library dependencies

SLIDE 54

Programming Model

SLIDE 55

Programming Model

  • Express data and computations to be executed on an accelerator
  • Using marked code regions
  • Main OpenACC constructs
  • Parallel and kernels regions
  • Parallel loops
  • Data regions
  • Runtime API

[Figure: CPU and HWA linked by a PCIe bus; the HWA exploits data/stream/vector parallelism, e.g. via CUDA / OpenCL.]

SLIDE 56

Execution Model

  • Among the bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators
  • Parallel regions
  • Kernels regions
  • Host is responsible for:
  • Allocating memory space on accelerator
  • Initiating data transfers
  • Launching computations
  • Waiting for completion
  • Deallocating memory space
  • Accelerators execute parallel regions:
  • Use work-sharing directives
  • Specify level of parallelization

SLIDE 57

OpenACC Execution Model

  • Host-controlled execution
  • Based on three parallelism levels
  • Gangs – coarse grain
  • Workers – fine grain
  • Vectors – finest grain

[Figure: a device is composed of gangs; each gang contains workers, and each worker executes vectors.]

SLIDE 58

Gangs, Workers, Vectors

  • In CAPS Compilers, gangs, workers and vectors correspond to the following in a CUDA grid
  • Beware: this implementation is compiler-dependent


gridDim.x  = number of gangs
gridDim.y  = 1
blockDim.y = number of workers
blockDim.x = number of vectors

SLIDE 59

Directive Syntax

C:

  #pragma acc directive-name [clause [, clause] …]
  {
    code to offload
  }

Fortran:

  !$acc directive-name [clause [, clause] …]
    code to offload
  !$acc end directive-name

SLIDE 60

Parallel Construct

  • Starts parallel execution on the accelerator
  • Creates gangs and workers
  • The number of gangs and workers remains constant for the parallel region
  • One worker in each gang begins executing the code in the region


#pragma acc parallel […]
{
  …
  for (i=0; i < n; i++) {
    for (j=0; j < n; j++) {
      …
    }
  }
  …
}

(The region body is executed on the hardware accelerator.)

SLIDE 61

Kernels Construct

  • Defines a region of code to be compiled into a sequence of accelerator kernels
  • Typically, each loop nest will be a distinct kernel
  • The number of gangs and workers can be different for each kernel


#pragma acc kernels […]
{
  for (i=0; i < n; i++) { … }   /* 1st kernel */
  …
  for (j=0; j < n; j++) { … }   /* 2nd kernel */
}

!$acc kernels […]
DO i=1,n          ! 1st kernel
  …
END DO
…
DO j=1,n          ! 2nd kernel
  …
END DO
!$acc end kernels

SLIDE 62

Lab Session 2: Offloading Computations

SLIDE 63

Lab 2: Offloading Computations

  • Offload two SAXPY operations onto the accelerator device:
  • Y = alpha * X + Y, where X and Y are vectors and alpha is a scalar
  • Use the parallel and kernels constructs
  • Pay attention to the compiler's notifications
  • Use the logger to understand the behavior of the accelerator
  • Use CUDA profiling to display CUDA grid properties


$ export HMPPRT_LOG_LEVEL=info

SLIDE 64

Managing Data

SLIDE 65

What is the Problem with Discrete Accelerators?

  • PCIe transfers have huge latencies
  • In kernels and parallel regions, data are implicitly managed
  • Data are automatically transferred to and from the device
  • This implies possibly useless communications
  • Avoiding transfers leads to better performance
  • OpenACC offers a solution to control transfers

SLIDE 66

Device Memory Reuse

  • In this example:
  • A and B are allocated and transferred for the first kernels region
  • A and C are allocated and transferred for the second kernels region
  • How to reuse A between the two kernels regions?
  • And save transfer and allocation time


float A[n];

#pragma acc kernels
{
  for (i=0; i < n; i++) {
    A[i] = B[n - i];
  }
}
…
init(C)
…
#pragma acc kernels
{
  for (i=0; i < n; i++) {
    C[i] += A[i] * alpha;
  }
}

SLIDE 67

Memory Allocations

  • Avoid data reallocation using the create clause
  • It declares variables, arrays or subarrays to be allocated in the device memory
  • No data specified in this clause is copied between host and device
  • The scope of such a clause corresponds to a data region
  • Data regions are used to define such scopes (on their own, they have no other effect)
  • They define scalars, arrays and subarrays to be allocated in the device memory for the duration of the region
  • Kernels and parallel regions implicitly define data regions

SLIDE 68

Data Presence

  • How to tell the compiler that data has already been allocated?
  • The present clause declares data that are already present on the device
  • Thanks to a data region that contains this region of code
  • The CAPS Runtime will find and use the data on the device

SLIDE 69

Data Construct: Create and Present Clause


float A[n];

#pragma acc data create(A)        /* allocation of A (size n) on the device */
{
  #pragma acc kernels present(A)  /* reuse of A already allocated on the device */
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A)  /* reuse of A already allocated on the device */
  {
    for (i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}                                 /* deallocation of A on the device */

SLIDE 70

Data Storage: Mirroring

  • How is the data stored in a data region?
  • A data construct defines a section of code where data are mirrored between host and device
  • Mirroring duplicates a CPU memory block into the HWA memory
  • The mirror identifier is a CPU memory block address
  • Only one mirror per CPU block
  • Users ensure consistency of copies via directives

[Figure: the master copy in host memory is mirrored by a copy in HWA memory; the CAPS Runtime tracks the pair with a descriptor.]

SLIDE 71

Arrays and Subarrays (1)

  • In C and C++, subarrays are specified with a start and a length
  • Allocation of an array a of size n
  • Allocation of a subarray of a of size n/2
  • i.e. elements a[2], a[3], …, a[n/2+1]
  • Static arrays can be allocated automatically
  • The length of dynamically allocated arrays must be explicitly specified


#pragma acc data create(a[0:n])    /* or equivalently: create(a[:n]) */

#pragma acc data create(a[2:n/2])

SLIDE 72

Arrays and Subarrays (2)

  • In Fortran, subarrays are specified with a list of range specifications (lower:upper bounds)
  • Allocation of an array a of size (n+1)*(m+1)
  • Allocation of a subarray of a of size 3*1
  • i.e. elements a(1,5), a(2,5), a(3,5)
  • In any language, any array or subarray must be a contiguous block of memory


!$acc data create(a(0:n,0:m))

!$acc data create(a(1:3,5:5))

SLIDE 73

Arrays and Subarrays Example


#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n])
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A[:n])
  {
    for (i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}

!$acc data create(A(1:n))
!$acc kernels present(A(1:n))
do i=1,n
  A(i) = B(n - i)
end do
!$acc end kernels
…
init(C)
…
!$acc kernels present(A(1:n))
do i=1,n
  C(i) = A(i) * alpha + C(i)
end do
!$acc end kernels
!$acc end data

SLIDE 74

Redundant Transfers

  • In this example:
  • A is allocated for the data section
  • No data transfer of A between host and device
  • B is allocated and transferred for the first kernels region
  • Input transfer
  • Output transfer
  • C is allocated and transferred for the second kernels region
  • Input transfer
  • Output transfer
  • How to avoid useless data transfers for B and C?


#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n])
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  #pragma acc kernels present(A[:n])
  {
    for (i=0; i < n; i++) {
      C[i] = A[i] * alpha;
    }
  }
}

SLIDE 75

Input Transfers: Copyin Clause

  • Declares data that only need to be copied from the host to the device when entering the data section
  • Performs input transfers only
  • Defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region


#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  #pragma acc kernels present(A[:n])
  {
    for (i=0; i < n; i++) {
      C[i] = A[i] * alpha;
    }
  }
}

SLIDE 76

Output Transfers: Copyout Clause

  • Declares data that only need to be copied from the device to the host when exiting the data section
  • Performs output transfers only
  • Defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region


#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  #pragma acc kernels present(A[:n]) copyout(C[:n])
  {
    for (i=0; i < n; i++) {
      C[i] = A[i] * alpha;
    }
  }
}

SLIDE 77

Input/Output Transfers: Copy Clause

  • If we change the example, how do we express that input and output transfers of C are required?
  • Use the copy clause to:
  • Declare data that need to be copied from the host to the device when entering the data section
  • Assign values on the device that need to be copied back to the host when exiting the data section
  • Allocate scalars, arrays and subarrays in the device memory for the duration of the data region
  • It corresponds to the default behavior in our example


#pragma acc data create(A[:n])
{
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels present(A[:n]) copy(C[:n])
  {
    for (i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
}

SLIDE 78

Transfer Example: Summary


#pragma acc data create(A[:n])    /* allocation of A (size n) on the device */
{
  /* reuse of A already on the device; allocation of B (size n) and
     host-to-device transfer of B */
  #pragma acc kernels present(A[:n]) copyin(B[:n])
  {
    for (i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }                               /* deallocation of B on the device */
  …
  init(C)
  …
  /* reuse of A already on the device; allocation of C (size n) and
     host-to-device transfer of C */
  #pragma acc kernels present(A[:n]) copy(C[:n])
  {
    for (i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }                               /* device-to-host transfer of C and
                                     deallocation of C on the device */
}                                 /* deallocation of A on the device */

SLIDE 79

Alternative Behaviors

  • In this example:
  • A is allocated for the data region
  • The first call to subroutine f1 reuses the data of A already allocated
  • What happens for the second call to f1?
  • A is specified as present, but it has been released at the end of the data section
  • This leads to an error when executed


program main
  …
  !$acc data create(X(1:n))
  call f1( n, X, Y )
  …
  !$acc end data
  …
  call f1( n, X, Z )
  …
contains
  subroutine f1( n, A, B )
    …
    !$acc kernels present(A(1:n)) copyin(B(1:n))
    do i=1,n
      A(i) = B(n - i)
    end do
    !$acc end kernels
  end subroutine f1
  …
end program main

5/06/13

slide-80
SLIDE 80

Present_or_create Clause

  • Combines two behaviors
  • Declares data that may be present
  • If data is already present, use the value in the device memory
  • If not, allocate data on the device when entering the region and deallocate it when exiting
  • May be shortened to pcreate

Uppsala 80 5/06/13

slide-81
SLIDE 81

Present_or_copyin/copyout Clauses

  • If data is already present, use the value in the device memory
  • If not:
  • Both present_or_copyin and present_or_copyout allocate memory on the device at region entry
  • present_or_copyin copies the value from the host at region entry
  • present_or_copyout copies the value from the device to the host at region exit
  • Both deallocate memory at region exit
  • May be shortened to pcopyin and pcopyout

Uppsala 81 5/06/13
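As a C illustration of these clauses (the function name is mine, not from the slides): a compiler without OpenACC support ignores the pragma, so the loop simply runs on the host.

```c
/* Hedged sketch of pcopyin/pcopyout: 'b' is copied host-to-device only
 * if it is not already present on the device; 'a' is allocated if
 * absent and copied back to the host at region exit. */
void reverse(int n, const float *b, float *a)
{
    #pragma acc kernels pcopyin(b[0:n]) pcopyout(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[n - 1 - i];    /* reverse b into a */
}
```

Because pcopyin skips the transfer when the data is already on the device, calling such a routine repeatedly inside an enclosing data region avoids redundant copies.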

slide-82
SLIDE 82

Present_or_copy Clause

  • If data is already present, use value in the device memory
  • If not:
  • Allocates data on the device and copies the value from the host at region entry
  • Copies the value from the device to the host and deallocates memory at region exit

  • May be shortened to pcopy

Uppsala 82 5/06/13

slide-83
SLIDE 83

Present_or_* Clauses Example

Uppsala 83

program main
  …
  !$acc data create(A(1:n))
  call f1( n, A, B )
  …
  !$acc end data
  …
  call f1( n, A, C )
  …
contains
  subroutine f1( n, A, B )
    …
    !$acc kernels pcopyout(A(1:n)) copyin(B(1:n))
    do i=1,n
      A(i) = B(n - i + 1)
    end do
    !$acc end kernels
  end subroutine f1
end program main

First call (inside the data region):
  • Allocation of A of size n on the device by the data construct
  • Reuse of A already allocated on the device
  • Allocation of B of size n on the device for the duration of the subroutine, and input transfer of B
  • Deallocation of A on the device at the end of the data region
Second call (outside the data region):
  • Allocation of A and B of size n on the device for the duration of the subroutine
  • Input transfer of B and output transfer of A

Present_or_* clauses are generally safer

5/06/13

slide-84
SLIDE 84

Default Behavior

  • CAPS Compilers are able to detect the variables required on the device for the kernels and parallel constructs
  • According to the specification, variables follow these policies depending on their type:
  • Arrays: present_or_copy behavior
  • Scalars:
  • if neither live-in nor live-out: private behavior
  • copy behavior otherwise

Uppsala 84 5/06/13
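The scalar policy above can be sketched in C (illustrative names, not from the slides): a live-out scalar gets copy behavior, a purely local one is privatized.

```c
/* Sketch of the default scalar policy: 'sum' is live-out because it is
 * returned, so an OpenACC compiler gives it copy behavior; 't' is
 * neither live-in nor live-out, so it can be privatized per iteration.
 * Without OpenACC support the pragma is ignored and the loop runs on
 * the host. */
float sum_squares(int n, const float *a)
{
    float sum = 0.0f;              /* live-out scalar: copy behavior */
    #pragma acc kernels copyin(a[0:n])
    for (int i = 0; i < n; i++) {
        float t = a[i] * a[i];     /* not live-out: private behavior */
        sum += t;
    }
    return sum;
}
```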

slide-85
SLIDE 85

Constructs and Directives

  • OpenACC defines two ways of managing accelerator allocations and transfers:
  • With data constructs followed by allocation or transfer clauses
  • Or with standalone directives for allocations or transfers
  • Data constructs are declarative
  • They define properties for code regions and variables
  • Imperative directives are standalone statements

Uppsala 85 5/06/13

slide-86
SLIDE 86

Declare Directive

  • In Fortran: used in the declaration section of a subroutine
  • In C/C++: follows a variable declaration
  • Specifies variables or arrays to be allocated in the device memory for the duration of the function, subroutine or program
  • Specifies the kind of transfer to perform (create, copy, copyin, etc.)

Uppsala 86

With a data construct:

float A[n];
#pragma acc data create(A)
{
  #pragma acc kernels present(A)
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - 1 - i];
    }
  }
  …
}

With a declare directive:

float A[n];
#pragma acc declare create(A)

#pragma acc kernels present(A)
{
  for(i=0; i < n; i++) {
    A[i] = B[n - 1 - i];
  }
}

5/06/13

slide-87
SLIDE 87

Update Directive

  • Used within an explicit or implicit data region
  • With the host clause: updates all or part of host memory arrays with values from the device
  • With the device clause: updates all or part of device memory arrays with values from the host

Uppsala 87

Without a data region:

!$acc kernels copyout(A(1:n)) copyin(B(1:n))
do i=1,n
  A(i) = B(n - i + 1)
end do
!$acc end kernels

With a data region and update directives:

!$acc data create( A(1:n), B(1:n) )
!$acc update device (B(1:n))
!$acc kernels
do i=1,n
  A(i) = B(n - i + 1)
end do
!$acc end kernels
!$acc update host (A(1:n))
!$acc end data

5/06/13

slide-88
SLIDE 88

Lab session 3: Data Management

slide-89
SLIDE 89

Lab 3: Data Management

  • Offload two SAXPY operations (cf. Lab 2)
  • Where arrays are allocated dynamically
  • Specify data size on kernels and parallel regions and the appropriate transfers
  • Avoid deallocating and reallocating the data on the accelerator by defining a data section
  • Ensure the data displayed between the two compute regions are correct by updating the host mirror
  • Notice the performance evolution and understand why, thanks to the logger

5/06/13 89 Uppsala
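One possible shape for this lab (an illustrative sketch, not the official solution): two SAXPY passes reuse a single device allocation through a data region, and an update directive refreshes the host mirror of y between the two compute regions.

```c
/* Sketch under assumption of dynamically allocated arrays; without an
 * OpenACC compiler the pragmas are ignored and everything runs on the
 * host, which keeps the example testable. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    /* pcopy/pcopyin reuse the device copies created by the caller */
    #pragma acc kernels pcopyin(x[0:n]) pcopy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

void two_saxpy(int n, float *x, float *y)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        saxpy(n, 2.0f, x, y);
        #pragma acc update host(y[0:n])  /* make y valid for host I/O */
        /* ... display y here ... */
        saxpy(n, 3.0f, x, y);
    }
}
```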

slide-90
SLIDE 90

Specifying Parallelization

90

slide-91
SLIDE 91

Parallel and Kernels Constructs Default Behavior

  • By default, CAPS Compilers create 192 gangs and 256 workers containing 1 vector each for parallel and kernels regions
  • The resulting CUDA grid size will be 192 thread blocks
  • Each thread block contains 256*1 CUDA threads
  • CAPS Compilers detect data-independent loops and distribute iterations among gangs and workers
  • How to modify the number of gangs, workers or vectors?

Uppsala 91

 Loop ‘i’ was shared among gangs(192) and workers(256)

5/06/13

slide-92
SLIDE 92

Gangs, Workers, Vectors in Parallel Constructs

  • In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
  • The clauses num_gangs, num_workers and vector_length specify the number of gangs, workers and vectors in the corresponding parallel section

Uppsala 92

#pragma acc parallel num_gangs(128) \
            num_workers(256)
{
  …
  for(i=0; i < n; i++) {
    for(j=0; j < m; j++) {
      …
    }
  }
  …
}

[Diagram: 128 gangs of 256 workers]

5/06/13

slide-93
SLIDE 93

Loop Constructs

  • A loop directive applies to the loop that immediately follows the directive
  • The parallelism to use is described by one of the following clauses:
  • gang for coarse-grain parallelism
  • worker for middle-grain parallelism
  • vector for fine-grain parallelism

Uppsala 93 5/06/13

slide-94
SLIDE 94

Gangs (1)

  • Gang clause:
  • The iterations of the following loop are executed in parallel
  • Iterations are distributed among the available gangs
  • In a parallel construct, no argument is allowed

Uppsala 94

#pragma acc parallel num_gangs(128) \
            num_workers(192)
{
  …
  #pragma acc loop gang
  for(i=0; i < n; i++) {
    for(j=0; j < m; j++) {
      …
    }
  }
  …
}

[Diagram: iterations i=1, i=2, … distributed across 128 gangs of 192 workers]

5/06/13

slide-95
SLIDE 95

Gangs (2)

Uppsala 95

#pragma acc parallel num_gangs(2)
{
  #pragma acc loop gang
  for(i = 0; i < n; i++) {
    A[i] = B[i] * B[i] * 3.14;
  }
}

is equivalent to splitting the iterations between the two gangs:

/* gang 0 */
for(i = 0; i < n/2; i++) {
  A[i] = B[i] * B[i] * 3.14;
}
/* gang 1 */
for(i = n/2; i < n; i++) {
  A[i] = B[i] * B[i] * 3.14;
}

5/06/13

slide-96
SLIDE 96

Workers

  • Worker clause:
  • The iterations of the following loop are executed in parallel
  • Iterations are distributed among the multiple workers within a single gang
  • Loop iterations must be data independent, unless the loop performs a reduction operation
  • In a parallel construct, no argument is allowed

Uppsala 96

#pragma acc parallel num_gangs(128) \
            num_workers(192)
{
  …
  #pragma acc loop gang
  for(i=0; i < n; i++) {
    #pragma acc loop worker
    for(j=0; j < n; j++) {
      …
    }
  }
  …
}

[Diagram: iterations i distributed across 128 gangs; iterations j=0, j=1, j=2, … across the 192 workers of each gang]

5/06/13

slide-97
SLIDE 97

Vector

  • Vector clause:
  • The iterations of the following loop are executed in SIMD mode
  • Iterations are distributed among the vector lanes within a single worker
  • In a parallel construct, no argument is allowed

Uppsala 97

#pragma acc parallel num_gangs(128) \
            num_workers(192)
{
  …
  #pragma acc loop gang
  for(i=0; i < n; i++) {
    #pragma acc loop worker
    for(j=0; j < m; j++) {
      #pragma acc loop vector
      for(k=0; k < l; k++) {
        …
      }
    }
  }
  …
}

[Diagram: gangs run iterations i, workers within each gang run iterations j, vector lanes run iterations k]

5/06/13

slide-98
SLIDE 98

Gang, Worker, Vector in Kernels Constructs

  • The parallelism description is the same as in parallel sections
  • However, these clauses accept an argument to specify the number of gangs, workers or vectors to use
  • Every loop can have a different number of gangs, workers or vectors in the same kernels region

Uppsala 98

#pragma acc kernels
{
  …
  #pragma acc loop gang(128)
  for(i=0; i < n; i++) {
    …
  }
  …
  #pragma acc loop gang(64)
  for(j=0; j < m; j++) {
    …
  }
}

[Diagram: the first loop is spread over 128 gangs, the second over 64 gangs]

5/06/13

slide-99
SLIDE 99

Data Independency

  • In kernels sections, the independent clause specifies that iterations of the loop are data-independent
  • The user does not have to think about gang, worker or vector parameters
  • Allows the compiler to generate code that executes the iterations in parallel with no synchronization

Uppsala 99

A correct use of independent:

A[0] = 0;
#pragma acc loop independent
for(i=1; i<n; i++) {
  A[i] = A[i]-1;
}

A programming error (iterations are not independent):

A(1) = 0
!$acc loop independent
DO i=2,n
  A(i) = A(i-1)
END DO

5/06/13

slide-100
SLIDE 100

Sequential Execution

  • Sequential loops can be specified using the seq clause
  • Useful, for example, to increase the work per thread

Uppsala 100

!$acc loop independent
DO i=0,n
  !$acc loop seq
  DO j=1,4
    A(j)…
  ENDDO
ENDDO

5/06/13

slide-101
SLIDE 101

Loop Collapsing

  • The collapse clause specifies how many tightly nested loops are associated with the loop construct
  • Iterations of the associated loops are scheduled according to the rest of the clauses

Uppsala 101

#pragma acc loop collapse(2)
for(i=0; i<n; i++) {
  for(j=0; j<m; j++) {
    A[i][j]= …
  }
}

is equivalent to:

#pragma acc loop
for(k=0; k<n*m; k++) {
  int i = k/m;
  int j = k%m;
  A[i][j]= …
}

5/06/13

slide-102
SLIDE 102

Privatization

  • The private clause declares a copy of each specified item for each iteration of the associated loop

Uppsala 102

int w;
#pragma acc loop independent private(w)
for(i = 0; i < n; i++) {
  w = i*i;
  b[i] = b[i] + w*a[i];
}
…

5/06/13

slide-103
SLIDE 103

Reduction Operation

  • The reduction clause performs a reduction operation
  • Creates a private copy of the specified variable for each iteration of the associated loop
  • Values for each gang are combined using the reduction operator and stored in the original variable

Uppsala 103

#pragma acc loop worker reduction(+: sum)
for(i=0; i < n; i++) {
  sum += foo(i, tab[i]);
}

[Diagram: per-worker values foo(0, tab[0]) … foo(n-1, tab[n-1]) combined pairwise with + in a tree into sum]

5/06/13

slide-104
SLIDE 104

Reduction Operators

Uppsala 104

C and C++                          Fortran
Operator  Initialization value     Operator  Initialization value
+         0                        +         0
*         1                        *         1
max       least                    max       least
min       largest                  min       largest
&         ~0                       iand      all bits on
|         0                        ior       0
&&        1                        ieor      0
||        0                        .and.     .true.
                                   .or.      .false.
                                   .eqv.     .true.
                                   .neqv.    .false.

5/06/13

slide-105
SLIDE 105

Lab Session 4: Compute Kernels

105

slide-106
SLIDE 106

Lab 4: Compute Kernels

  • Offload two SGEMM operations on the accelerator device:
    C = alpha A . B + beta C, where A, B and C are matrices and alpha, beta are scalars
  • Use parallel and kernels constructs
  • Add loop directives and notice the performance changes
  • Use CUDA profiling to display CUDA grid properties

5/06/13 106 Uppsala
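An illustrative SGEMM sketch for this lab (row-major square matrices; not the official lab code): the loop directives request gang parallelism on rows and worker parallelism on columns, while the dot-product loop stays sequential. Without an OpenACC compiler the pragmas are ignored and the loops run on the host.

```c
/* Sketch, assuming n-by-n row-major matrices. */
void sgemm(int n, float alpha, const float *A, const float *B,
           float beta, float *C)
{
    #pragma acc kernels copyin(A[0:n*n], B[0:n*n]) copy(C[0:n*n])
    {
        #pragma acc loop gang
        for (int i = 0; i < n; i++) {
            #pragma acc loop worker
            for (int j = 0; j < n; j++) {
                float dot = 0.0f;
                #pragma acc loop seq      /* keep the k loop per-thread */
                for (int k = 0; k < n; k++)
                    dot += A[i*n + k] * B[k*n + j];
                C[i*n + j] = alpha * dot + beta * C[i*n + j];
            }
        }
    }
}
```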

slide-107
SLIDE 107

Asynchronism

107

slide-108
SLIDE 108

Asynchronism

  • By default, the code on the accelerator is synchronous
  • The host waits for completion of the parallel or kernels region
  • The async clause enables use of the device while the host process continues with the code following the region
  • Can be used on parallel and kernels regions and on update directives

Uppsala 108

[Diagram: synchronous execution (CPU waits for HWA at each step) vs asynchronous execution (CPU and HWA steps overlap)]

5/06/13

slide-109
SLIDE 109

Wait Directive

  • Causes the program to wait for completion of an asynchronous activity
  • Parallel and kernels regions or update directives
  • An identifier can be added to the async clause and wait directive:
  • The host thread will wait for the asynchronous activities with the same ID
  • Without an identifier, the host process waits for all asynchronous activities

Uppsala 109

#pragma acc kernels async
{
  …
}
#pragma acc kernels async
{
  …
}
#pragma acc wait

!$acc kernels async(1)
…
!$acc end kernels
…
!$acc kernels async(2)
…
!$acc end kernels
…
!$acc wait(1)

5/06/13

slide-110
SLIDE 110

Execute OpenACC Computations on the Host

  • OpenACC sections define the behavior of the accelerator
  • What happens if there is no accelerator?
  • What if the OpenACC code should also be executed on the host?
  • The if clause enables generation of two copies of the OpenACC code:
  • One to be executed on the host
  • One to be executed on the accelerator

Uppsala 110 5/06/13

slide-111
SLIDE 111

If Clause

  • Available on parallel, kernels and data constructs and on the update directive
  • When the clause condition evaluates to:
  • Zero in C/C++ or .false. in Fortran, the host copy is executed
  • Nonzero in C/C++ or .true. in Fortran, the accelerator copy is executed

Uppsala 111

#pragma acc kernels if(cond)
{
  for(i=0; i < n; i++) {
    …
  }
  …
}

!$acc kernels if(cond)
DO i=1,n
  …
END DO
…
!$acc end kernels

5/06/13

slide-112
SLIDE 112

Lab Session 5: Performing Asynchronous Computations

112

slide-113
SLIDE 113

Lab 5: Performing Asynchronous Computations

  • Offload an SGEMM operations on the accelerator device:

C = alpha A . B + beta C – A, B an C are matrices – Alpha, beta are scalars

  • Use kernels constructs, if and async clauses to:
  • Launch the computations of a large part of the matrix on the device
  • While computing a smaller part on the host at the same time

5/06/13 113 Uppsala
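The split pattern above can be sketched on a simpler operation (an illustrative shape, not the official solution): the first elements are processed asynchronously on the device when one is available, while the host processes the rest in parallel; with the pragmas ignored, both parts just run on the host and the result is identical.

```c
/* Sketch: 'have_dev' would typically come from the runtime API. */
void scale_split(int n, int split, float alpha, float *a, int have_dev)
{
    #pragma acc kernels if(have_dev) async copy(a[0:split])
    for (int i = 0; i < split; i++)
        a[i] *= alpha;                /* device part (when available) */

    for (int i = split; i < n; i++)
        a[i] *= alpha;                /* host part, overlapped */

    #pragma acc wait                  /* join before using 'a' */
}
```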

slide-114
SLIDE 114

Runtime API

114

slide-115
SLIDE 115

Runtime API

  • May limit portability of the code
  • Conditional compilation using the _OPENACC preprocessor variable is available
  • Enables to:
  • Initialize the OpenACC runtime
  • Retrieve environment information

5/06/13 115 Uppsala
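The conditional-compilation idea looks like this in C (the function name is illustrative): the _OPENACC macro guards both the header inclusion and the runtime calls, so a plain C compiler, which does not define _OPENACC, still builds the code and takes the host fallback.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Returns the number of attached accelerators, or 0 when the program
 * was built without an OpenACC compiler. */
int accelerator_count(void)
{
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;   /* no OpenACC runtime available */
#endif
}
```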

slide-116
SLIDE 116

Runtime Library Definition

  • For C:
  • Header file: openacc.h
  • For Fortran:
  • Interface declarations in openacc_lib.h and in a Fortran module called openacc
  • acc_device_t: type of accelerator device
  • acc_device_none (OpenACC 1.0)
  • acc_device_default (OpenACC 1.0)
  • acc_device_host (OpenACC 1.0)
  • acc_device_not_host (OpenACC 1.0)
  • acc_device_cuda (CAPS Compilers)
  • acc_device_opencl (CAPS Compilers)

Uppsala 116 5/06/13

slide-117
SLIDE 117

Runtime API

  • Initialize the runtime for the given device type:

    void acc_init( acc_device_t )            (C)
    subroutine acc_init( devicetype )        (Fortran)

  • Disconnect the program from the accelerator device:

    void acc_shutdown( acc_device_t )        (C)
    subroutine acc_shutdown( devicetype )    (Fortran)

Uppsala 117 5/06/13

slide-118
SLIDE 118

Runtime API

  • Return the number of accelerator devices of the given type attached to the host:

    int acc_get_num_devices( acc_device_t )             (C)
    integer function acc_get_num_devices( devicetype )  (Fortran)

  • Tell the runtime which type of device to use:

    void acc_set_device_type( acc_device_t )            (C)
    subroutine acc_set_device_type( devicetype )        (Fortran)

  • Return the type of device that will be used:

    acc_device_t acc_get_device_type( void )            (C)
    function acc_get_device_type()                      (Fortran)

Uppsala 118 5/06/13

slide-119
SLIDE 119

“Fallback” Example

Uppsala 119

int dev;
dev = acc_get_num_devices(acc_device_cuda);
#pragma acc data copy(A[0:N]) if(dev)
{
  #pragma acc kernels if(dev)
  ...
  #pragma acc kernels if(dev)
  for (int i = 0; i < N; ++i) {
    A[i] = A[i] ...;
  }
  ...
}

  • Check the number of CUDA devices available on the system
  • If 0 is returned, no CUDA device is available
  • If no device is available, the host copy of the code is executed

5/06/13

slide-120
SLIDE 120

Runtime API

  • Tell the runtime which device to use:

    void acc_set_device_num( int, acc_device_t )            (C)
    subroutine acc_set_device_num( devicenum, devicetype )  (Fortran)

  • Return the device number of the specified device type that will be used:

    int acc_get_device_num( acc_device_t )                  (C)
    integer function acc_get_device_num( devicetype )       (Fortran)

Uppsala 120 5/06/13

slide-121
SLIDE 121

Multidevice Example

Uppsala 121

#pragma omp parallel for
for (int t = 0; t < 2; ++t) {
  acc_set_device_num(t, acc_device_default);
  #pragma acc kernels copy(A[t*N/2:N/2])
  {
    #pragma acc loop independent
    for (int i = t*N/2; i < (1+t)*N/2; ++i) {
      A[i] = A[i] ...;
    }
    ...
  }
  acc_shutdown(acc_device_default);
}

Two CPU threads are created with OpenMP:

  • thread #0 will manage device #0
  • thread #1 will manage device #1

Data set is split in two: each set will be processed by one device

5/06/13

slide-122
SLIDE 122

Runtime API: Allocations

  • void* acc_malloc( size_t ) (C)
  • Allocates memory on the accelerator device
  • Pointers returned by this function may be used in deviceptr clauses
  • void acc_free( void* ) (C)
  • Deallocates memory on the accelerator device
  • Beware: device memory allocated with the runtime API is not mirrored on the host

Uppsala 122 5/06/13

slide-123
SLIDE 123

Device Pointers

Uppsala 123

float *a = (float *)acc_malloc(sizeof(float)*size);
float *b = (float *)acc_malloc(sizeof(float)*size);
float *c = (float *)malloc(sizeof(float)*size);
#pragma acc kernels deviceptr(a, b) copyout(c[0:size])
{
  // a and b initialisation
  ...
  #pragma acc loop independent
  for (i = 0; i < size; ++i) {
    c[i] += a[i] * b[i];
  }
}
acc_free(a);
acc_free(b);
free(c);

Arrays a and b are only present on the device. They are not mirrored on the host.

5/06/13

slide-124
SLIDE 124

OpenACC 2.0 Draft Specification

Uppsala 124

slide-125
SLIDE 125

OpenACC 2.0 Draft Specification

  • Final specification should be released in April 2013
  • Adds new features:
  • Nested parallelism: a parallel or kernels construct can contain other parallel or kernels constructs
  • Integration of external functions inside parallel or kernels constructs thanks to the routine directive
  • Clarifies some behaviors
  • Extends the OpenACC runtime API

5/06/13 125 Uppsala

slide-126
SLIDE 126

Device_type Clause

  • Can be added to parallel, kernels and loop constructs and to update and routine directives
  • Specifies that the directive applies to one or several kinds of devices:
  • acc_device_cuda
  • acc_device_opencl
  • Without this clause, the directive applies to all device types

Uppsala 126 5/06/13
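A hedged sketch of the clause (OpenACC 2.0 draft syntax as presented here; the device identifiers are the CAPS-specific ones from the slide, the final specification uses different names, and the function name is illustrative): each target gets its own gang count while the source stays unique. A non-OpenACC compiler ignores the pragma entirely.

```c
/* Sketch: tune the gang count per device kind from one source. */
void scale_per_target(int n, float *a)
{
    #pragma acc kernels loop copy(a[0:n]) \
                device_type(acc_device_cuda) gang(256) \
                device_type(acc_device_opencl) gang(128)
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}
```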

slide-127
SLIDE 127

Enter Data / Exit Data

  • May replace data constructs

Uppsala 127

#pragma acc data copy(A)
{
  …
}

may be replaced by:

#pragma acc enter data copyin(A)
…
#pragma acc exit data copyout(A)

5/06/13

slide-128
SLIDE 128

Default(none) clause

  • Optional clause for:
  • Parallel constructs
  • Kernels constructs
  • Tells the compiler not to implicitly determine data attributes for any variable
  • Data attributes must appear explicitly in a data clause or in a data construct

Uppsala 128 5/06/13
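A sketch of default(none) in C (the function name is illustrative): with the clause, every variable used in the region must appear in an explicit data clause, so a forgotten array becomes a compile-time error instead of a silent implicit transfer. With a non-OpenACC compiler the pragma is ignored.

```c
/* All data attributes are spelled out explicitly. */
void scale_explicit(int n, float alpha, float *a)
{
    #pragma acc kernels default(none) copyin(n, alpha) copy(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] *= alpha;
}
```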

slide-129
SLIDE 129

New Loop Clauses

  • The auto clause specifies that the compiler should select whether to apply gang, worker or vector parallelism to the following loop
  • The tile clause specifies that each loop in the loop nest should be split into two loops:

Uppsala 129

#pragma acc loop tile(32,5)
for(i=0; i < n; i++) {
  for(j=0; j < n; j++) {
    …
  }
}

is roughly equivalent to:

#pragma acc loop
for(i_1=0; i_1 < n; i_1=i_1+5) {
  for(i_2=0; i_2 < 5; i_2++) {
    for(j_1=0; j_1 < n; j_1=j_1+32) {
      for(j_2=0; j_2 < 32; j_2++) {
        …
      }
    }
  }
}

5/06/13

slide-130
SLIDE 130

Conclusion

130

slide-131
SLIDE 131

Directives Summary

Clause                                    Parallel  Kernels  Data  Loop  Declare  Update
if                                           x         x      x                     x
async                                        x         x                            x
private                                      x                       x
firstprivate                                 x
reduction                                    x                       x
create/present, copy/pcopy,
copyin/pcopyin, copyout/pcopyout,
deviceptr                                    x         x      x            x
collapse                                                             x
gang / worker / vector                                               x
num_gangs / num_workers / vector_length      x
seq                                                                  x
independent                                                          x
host / device                                                                       x

Uppsala 131 5/06/13

slide-132
SLIDE 132

PART II Conclusion

  • Beware of compiler-dependent behaviors
  • Fast development of high-level heterogeneous applications
  • For C and Fortran code
  • Make the calls to a hardware accelerator explicit in your code
  • Whatever the target
  • CAPS Compilers support:
  • Nvidia Tesla GPUs
  • AMD GPUs and APUs
  • Intel Xeon Phi (x86)

Uppsala 132 5/06/13

slide-133
SLIDE 133

[Word cloud: accelerator programming model, directive-based programming, parallel computing, OpenHMPP, OpenACC, GPGPU, many-core programming, parallelization, HPC, OpenCL, code speedup, NVIDIA CUDA, high performance computing, CAPS Compilers, CAPS Workbench, portability, performance]

Visit CAPS Website: www.caps-entreprise.com

5/06/13 133 Uppsala