OpenACC 2.0 and Beyond: PGI Accelerator Compilers and Tools



SLIDE 1

OpenACC 2.0 and Beyond

PGI Accelerator Compilers and Tools

SLIDE 2

One Slide Intro to OpenACC Directives

Manage Data Movement, Initiate Parallel Execution, Optimize Loop Mappings

    #pragma acc data copyin(x[0:n],y[0:n]) copyout(z[0:n])
    {
        ...
        #pragma acc parallel
        {
            #pragma acc loop gang vector
            for (i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
                ...
            }
        }
        ...
    }

CPU, GPU, MIC
Performance portable
Interoperable
Single source
Incremental
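The slide's fragment can be completed into a whole function. The following is a minimal sketch, assuming a hypothetical function name `vecadd` and plain `float` arrays; only the directives come from the slide. A compiler without OpenACC support ignores the pragmas and runs the loop serially on the host, which is what "single source" refers to.

```cpp
// Element-wise z = x + y over n elements (hypothetical helper name).
void vecadd(const float* x, const float* y, float* z, int n) {
    // Copy x and y to the device on entry; copy z back to the host on exit.
    #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])
    {
        // Launch a parallel region on the accelerator.
        #pragma acc parallel
        {
            // Spread iterations across gangs and vector lanes.
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
            }
        }
    }
}
```

Calling `vecadd(x, y, z, n)` from ordinary host code is enough; the data region handles the transfers around the loop.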

SLIDE 3

OpenACC 2.0

Highlights

acc routine
acc enter / exit data
acc atomic
acc wait async
acc parallel wait()
acc loop tile
default(none)
device_type(...)
new API routines
many clarifications
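Of the highlights, `acc loop tile` is not illustrated on the later slides. A minimal sketch of how it might be used, assuming a hypothetical `transpose` function on a square row-major matrix and an illustrative 32x32 tile size:

```cpp
// Transpose an n x n row-major matrix: out[j][i] = in[i][j].
// Function name and tile sizes are illustrative assumptions.
void transpose(const float* in, float* out, int n) {
    // tile(32, 32) asks the compiler to split the two tightly nested
    // loops into 32x32 blocks, which can improve memory locality.
    #pragma acc parallel loop tile(32, 32) copyin(in[0:n*n]) copyout(out[0:n*n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            out[j * n + i] = in[i * n + j];
        }
    }
}
```

The best tile size depends on the target device, so the values here should be treated as a starting point rather than a recommendation.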

SLIDE 4

OpenACC 2.0

    #pragma acc routine vector
    float dotprod( float* a, float* b, int n ){
        float sum = 0.0f;
        #pragma acc loop vector reduction(+:sum)
        for( int i = 0; i < n; ++i )
            sum += a[i]*b[i];
        return sum;
    }

acc routine
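A routine marked `vector` is meant to be called from a loop that has not yet used vector parallelism, for example a gang loop. The sketch below assumes a hypothetical caller `matvec` (a row-major matrix-vector product) that is not on the slide; it calls `dotprod` once per row.

```cpp
// Device-callable dot product; the loop inside uses vector parallelism.
#pragma acc routine vector
float dotprod(const float* a, const float* b, int n) {
    float sum = 0.0f;
    #pragma acc loop vector reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// y = A * x for a row-major m x n matrix A (hypothetical caller).
void matvec(const float* A, const float* x, float* y, int m, int n) {
    // One gang per row; each gang calls the vector routine for its row.
    #pragma acc parallel loop gang copyin(A[0:m*n], x[0:n]) copyout(y[0:m])
    for (int r = 0; r < m; ++r) {
        y[r] = dotprod(&A[r * n], x, n);
    }
}
```

Keeping the caller at gang level and the callee at vector level avoids asking for the same level of parallelism twice.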

SLIDE 5

OpenACC 2.0

    template<typename T> class v{
        T* _data;
        size_t _size;
        ...
        void move_to_device(){
            #pragma acc enter data copyin(this, \
                                          _data[0:_size])
        }
        void update_host(){
            #pragma acc update self(_data[0:_size])
        }
        ...

acc enter data and acc exit data
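The point of `enter data` and `exit data` is that a device allocation can outlive the function that created it. A minimal sketch with plain pointers, assuming hypothetical helper names (`to_device`, `scale_on_device`, `from_device`); only the directives are from OpenACC 2.0:

```cpp
// Allocate device storage for p[0:n] and copy the host values in.
// The data stays on the device after this function returns.
void to_device(float* p, int n) {
    #pragma acc enter data copyin(p[0:n])
}

// Scale the device copy in place. present() states that p[0:n] is
// already on the device, so no transfer happens here.
void scale_on_device(float* p, int n, float s) {
    #pragma acc parallel loop present(p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] *= s;
}

// Copy the device values back to the host and release device storage.
void from_device(float* p, int n) {
    #pragma acc exit data copyout(p[0:n])
}
```

A caller would invoke the three functions in that order; with a structured `acc data` region the same lifetime would have to fit inside one lexical block.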

SLIDE 6

OpenACC 2.0

    #pragma acc parallel loop
    for( i = 0; i < n; ++i ){
        x = index[i];
        #pragma acc atomic update
        hist[x]++;
    }

acc atomic
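The atomic update is needed because several iterations can hit the same bin at once. A self-contained sketch of the histogram, assuming a hypothetical function name and signature and that every index lies in `[0, nbins)`:

```cpp
// Count how often each bin index in index[0:n] occurs.
// hist must have nbins entries and is accumulated into, not cleared.
// Assumption: every index[i] is in [0, nbins).
void histogram(const int* index, int n, int* hist, int nbins) {
    #pragma acc data copyin(index[0:n]) copy(hist[0:nbins])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            int x = index[i];
            // Without the atomic, concurrent increments of the same
            // bin could be lost.
            #pragma acc atomic update
            hist[x]++;
        }
    }
}
```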

SLIDE 7

OpenACC 2.0

    #pragma acc parallel loop async(1)
    for(...){...}
    #pragma acc parallel loop async(2)
    for(...){...}
    #pragma acc wait(1) async(2)
    #pragma acc parallel loop async(2) wait(1)
    for(...){...}

acc wait async
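The slide elides the loop bodies. A minimal sketch with concrete bodies, assuming a hypothetical function `pipeline` in which two independent loops run on queues 1 and 2 and a third loop depends on both:

```cpp
// a is doubled, b is incremented, then c = a + b (hypothetical example).
void pipeline(float* a, float* b, float* c, int n) {
    #pragma acc data copy(a[0:n], b[0:n]) copyout(c[0:n])
    {
        // Queue 1: scale a.
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;

        // Queue 2: shift b. Independent of queue 1, so the two may overlap.
        #pragma acc parallel loop async(2)
        for (int i = 0; i < n; ++i)
            b[i] += 1.0f;

        // Queue 2 again, but only after all work on queue 1 has finished.
        #pragma acc parallel loop async(2) wait(1)
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        // Block the host until every queue drains, before the data
        // region copies results back.
        #pragma acc wait
    }
}
```

The `wait(1)` clause on the third loop expresses the cross-queue dependency on the device side, so the host does not have to stall between the launches.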

SLIDE 8

PGI 2015 Additions

    template<typename T> class myvect{
        T* _data;
        size_t _size;
    public:
        // ...
        void dev_create(){
            #pragma acc enter data copyin(this)
            #pragma acc enter data copyin(_data[0:_size])
        }
        void host_update(){
            #pragma acc update self(_data[0:_size])
        }
    };

C++ class data member in OpenACC data clauses

SLIDE 9

PGI 2015 Additions

pgc++ -ta=tesla:managed
malloc, calloc, free, new, delete, allocatable: replaced with managed allocate/free
limitations

Managed Memory Support (beta feature)

SLIDE 10

PGI 2015 OpenACC Performance – NIM (NOAA)

All times measured on a K20x not including data transfers from host memory to device memory

[Figure: three bar charts of kernel time in microseconds comparing F2C-ACC, PGI 2014, and PGI 2015. Panels: VDMINTV, VDMINTS, FLUX.]

PGI 2015 OpenACC Performance Enhancements:
!$ACC CACHE directive
Scalar replacement optimizations
Variable length VECTOR support
Short loop optimizations

SLIDE 11

OpenACC 2.5 (in design)

    #pragma acc data present_or_copy(x[0:n]) copy(b[0:n])
    {....}

acc data copy(x) == present_or_copy(x)
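The practical effect of treating `copy` as `present_or_copy` shows up when a routine with its own data clause is called from inside an enclosing data region. A minimal sketch, assuming hypothetical function names `double_in_place` and `quadruple`:

```cpp
// Doubles x in place. Under the semantics described on this slide,
// copy(x[0:n]) transfers data only if x is not already on the device.
void double_in_place(float* x, int n) {
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;
}

// Calls the routine twice inside one data region.
void quadruple(float* x, int n) {
    // The outer region makes x present, so the inner copy clauses
    // find it there and do not move it again.
    #pragma acc data copy(x[0:n])
    {
        double_in_place(x, n);
        double_in_place(x, n);
    }
}
```

Called on its own, `double_in_place` still transfers `x` in both directions, so the same routine works with or without an enclosing data region.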

SLIDE 12

OpenACC 2.5 (in design)

    module m
      real, allocatable :: a(:,:)
      !$acc declare create(a)
    end module

    subroutine init(n)
      use m
      allocate(a(n,n))
      ...

acc declare(allocatable)

SLIDE 13

OpenACC 2.5 (in design)

    #pragma acc parallel loop default(present)
    for( i = 0; i < n; ++i )
        a[i] = expf(b[i]) * cosf(c[i]);

default(present)
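`default(present)` only makes sense when the arrays have been placed on the device beforehand. A minimal sketch that sets this up with `enter data` and tears it down with `exit data`, assuming a hypothetical function name `compute`; since the clause is described here as still in design, compiler support is an assumption:

```cpp
#include <math.h>

// a[i] = exp(b[i]) * cos(c[i]) with all arrays resident on the device.
void compute(float* a, const float* b, const float* c, int n) {
    // Place the arrays on the device first: a is output only, b and c
    // are inputs.
    #pragma acc enter data create(a[0:n]) copyin(b[0:n], c[0:n])

    // default(present): arrays without an explicit data clause are
    // assumed to be on the device already, so no implicit copies are
    // generated for a, b, or c.
    #pragma acc parallel loop default(present)
    for (int i = 0; i < n; ++i)
        a[i] = expf(b[i]) * cosf(c[i]);

    // Bring the result back and release the device copies.
    #pragma acc exit data copyout(a[0:n]) delete(b[0:n], c[0:n])
}
```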

SLIDE 14

OpenACC 3.0 (in design)

    template<typename T> class reactor{
        class magnet* m;
        class laser* l;
        class coolant* c;
        class steampipe* s;
    };
    ...
    class reactor R;
    #pragma acc enter data copyin(R)

Deep Copy – Data Structure Management
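Until a single `copyin(R)` can follow pointers automatically, each level has to be moved by hand, in the style of the class data member example from the PGI 2015 slides. The sketch below uses a hypothetical `particles` struct in place of the slide's `reactor`. Whether the second `enter data` also fixes up the device-side pointers inside the object depends on the compiler, so this should be read as an illustration of the pattern rather than guaranteed portable behavior.

```cpp
// Hypothetical stand-in for the slide's reactor: one object that
// points to two dynamically sized arrays.
struct particles {
    float* pos;
    float* vel;
    int n;

    // Manual deep copy: the object itself first, then each array
    // it points to.
    void to_device() {
        #pragma acc enter data copyin(this)
        #pragma acc enter data copyin(pos[0:n], vel[0:n])
    }

    // Advance positions on the device; the arrays are expected to be
    // present already.
    void step(float dt) {
        #pragma acc parallel loop present(pos[0:n], vel[0:n])
        for (int i = 0; i < n; ++i)
            pos[i] += dt * vel[i];
    }

    // Reverse order on the way out: arrays first, then the object.
    void from_device() {
        #pragma acc exit data copyout(pos[0:n]) delete(vel[0:n])
        #pragma acc exit data delete(this)
    }
};
```

Every additional pointer member adds another pair of enter/exit directives, which is the bookkeeping the deep copy proposal aims to remove.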

SLIDE 15

Future of OpenACC

Descriptive
Performance Portable
Data Management
Parallelism Management
PGI Commitment

On Future Supercomputers

[Figure: block diagram labeled High Capacity Memory, High Bandwidth Memory, High Performance (Shared Cache, $ caches), and Highly Parallel (Shared Cache, $ caches).]