SLIDE 1

CACHE DIRECTIVE OPTIMIZATION IN THE OPENACC PROGRAMMING MODEL

Xiaonan (Daniel) Tian, Brent Leback, and Michael Wolfe, PGI

SLIDE 2

GPU ARCHITECTURE

[Figure: GPU memory hierarchy. Threads access register files, shared memory, the L1 cache, and the read-only data cache; the L2 cache sits in front of off-chip global, texture, and constant memory.]

SLIDE 3

USING SHARED MEMORY WITH CUDA

Creating shared memory:

  Static shared memory:   __shared__ int s[64];
  Dynamic shared memory:  extern __shared__ int s[];

Handling data races:      __syncthreads();
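To make this concrete, here is a minimal sketch (a standard CUDA pattern, not from the slides) that uses the static declaration above and __syncthreads() to avoid a data race while reversing an array in shared memory:

  /* Sketch: reverse an n-element array in place via static shared memory.
     Assumes a single block of exactly n threads, with n <= 64. */
  __global__ void staticReverse(int *d, int n)
  {
      __shared__ int s[64];     /* static shared memory, as declared above */
      int t  = threadIdx.x;
      int tr = n - t - 1;
      s[t] = d[t];              /* cooperative load into shared memory */
      __syncthreads();          /* barrier: all loads must finish first */
      d[t] = s[tr];             /* each thread safely reads another thread's element */
  }

  /* launch, e.g.: staticReverse<<<1, 64>>>(d_array, 64); */

Without the barrier, a thread could read s[tr] before the owning thread had written it.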

SLIDE 4

PROS AND CONS OF CUDA APPROACH

Pros:

- Fine-grained control over the hardware

Cons:

- Requires familiarity with CUDA and the GPU architecture
- Often requires redesigning the algorithm
- Manual thread synchronization
- Shared-memory bank conflicts

SLIDE 5

OPENACC: A DIRECTIVE-BASED APPROACH

- Rich set of data directives
- Two offload region constructs: parallel and kernels
- Three levels of parallelism: gang, worker, and vector

Original CPU code:

  Program myscience
    ... serial code ...
    do k = 1, n1
      do i = 1, n2
        ... parallel code ...
      enddo
    enddo
    ...
  End Program myscience

With OpenACC compiler directives, the loop nest is offloaded to the GPU:

  Program myscience
    ... serial code ...
    !$acc kernels
    do k = 1, n1
      do i = 1, n2
        ... parallel code ...
      enddo
    enddo
    !$acc end kernels
    ...
  End Program myscience
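As a brief sketch of the three levels of parallelism (loop bounds and array names here are illustrative assumptions, not from the slides), the gang, worker, and vector clauses can be mapped explicitly onto a loop nest:

  ! Sketch: one possible explicit mapping of a 3D nest
  ! onto gang, worker, and vector parallelism.
  !$acc parallel loop gang
  do k = 1, n1
    !$acc loop worker
    do j = 1, n2
      !$acc loop vector
      do i = 1, n3
        c(i,j,k) = a(i,j,k) + b(i,j,k)
      enddo
    enddo
  enddo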

SLIDE 6

CACHE DIRECTIVE CONSTRUCT

C/C++:

  #pragma acc cache (a[lower1:length1] [lower2:length2])

Fortran:

  !$acc cache (a(lower1:upper1, lower2:upper2))

Examples:

  #pragma acc cache (a[i-1:3] [j])  // i and j are loop indices
  !$acc cache (a(J, :))             ! cache an entire dimension
  !$acc cache (a)                   ! cache the entire array
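For a concrete use of the C form, here is a minimal sketch of a 3-point moving average; the function and array names are hypothetical, not from the slides:

  /* Sketch: cache directive on a 1D smoothing loop.
     a[i-1:3] uses the C [start:length] form, covering a[i-1..i+1]. */
  void smooth(int n, const float *restrict a, float *restrict c)
  {
      #pragma acc parallel loop gang vector
      for (int i = 1; i < n - 1; i++) {
          #pragma acc cache(a[i-1:3])   /* hint: stage these elements on-chip */
          c[i] = (a[i-1] + a[i] + a[i+1]) / 3.0f;
      }
  }

The cache directive appears at the top of the loop body and names the subarray worth staging in fast memory for that iteration.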

SLIDE 7

CASE STUDIES

- Partial array cached
- Entire array dimension cached
- Entire array cached

SLIDE 8

PARTIAL ARRAY CACHED: 1D CACHE

  !$acc loop gang vector
  DO i = 4, M
    C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + &
            A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0
  ENDDO

[Figure: register files for thread0 and thread1, and global memory. Without caching, every stencil read goes from global memory into each thread's own registers.]

SLIDE 9

PARTIAL ARRAY CACHED: 1D CACHE

  !$acc loop gang vector
  DO i = 4, M
    C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + &
            A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0
  ENDDO

[Figure: nine loads. Each thread issues nine separate loads from global memory, one per stencil point, even though adjacent threads reuse mostly the same elements.]

SLIDE 10

PARTIAL ARRAY CACHED: 1D CACHE

  !$acc loop gang vector
  DO i = 4, M
    !$acc cache (A(i-4:i+4))
    C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + &
            A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0
  ENDDO

[Figure: first load. With the cache directive, threads t0, t1, … cooperatively load a block of A from global memory into shared memory.]

SLIDE 11

PARTIAL ARRAY CACHED: 1D CACHE

  !$acc loop gang vector
  DO i = 4, M
    !$acc cache (A(i-4:i+4))
    C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + &
            A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0
  ENDDO

[Figure: second load. A second cooperative load from global memory completes the cached block in shared memory.]

SLIDE 12

PARTIAL ARRAY CACHED: 1D CACHE

  !$acc loop gang vector
  DO i = 4, M
    !$acc cache (A(i-4:i+4))
    C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + &
            A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0
  ENDDO

[Figure: register files for a thread, shared memory, and global memory. After the cooperative loads, each thread reads its nine stencil values from shared memory rather than global memory.]
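Conceptually, the pattern the figures illustrate resembles the following CUDA lowering. This is an illustrative sketch of the technique, not PGI's actual generated code; the block size, names, and 0-based indexing are assumptions, and bounds checks on the tile loads are omitted:

  #define BLOCK 128

  __global__ void stencil9(const float *A, float *C, int M)
  {
      /* Tile covers the block's A(i-4:i+4) ranges: BLOCK elements
         plus a 4-element halo on each side. */
      __shared__ float tile[BLOCK + 8];
      int t    = threadIdx.x;
      int base = blockIdx.x * BLOCK;           /* first element this block needs */
      int i    = base + t + 4;                 /* global loop index, starting at 4 */

      tile[t] = A[base + t];                   /* first load: one element per thread */
      if (t < 8)
          tile[BLOCK + t] = A[base + BLOCK + t]; /* second load: trailing halo */
      __syncthreads();                         /* barrier before any thread reads the tile */

      if (i <= M)
          C[i] = (tile[t]   + tile[t+1] + tile[t+2] + tile[t+3] + tile[t+4] +
                  tile[t+5] + tile[t+6] + tile[t+7] + tile[t+8]) / 9.0f;
  }

Each element is fetched from global memory once per block instead of up to nine times, and all nine stencil reads then hit shared memory.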


SLIDE 14

PARTIAL ARRAY CACHED: 1D CACHE

  !$acc loop gang vector
  DO i = 4, M
    !$acc cache (A(i-4:i+4))
    C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + &
            A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0
  ENDDO

[Chart: speedup of the 1D 9-point stencil with the cache directive (M = 128*1024*1024) on P100 and K80.]

SLIDE 15

PARTIAL ARRAY CACHED: 1D VS 2D

2D cache version, caching the full neighborhood in both dimensions:

  !$acc loop gang
  DO j = 1, N
    !$acc loop vector
    DO i = 1, M
      !$acc cache (A(i-4:i+4, j-4:j+4))
      C(i,j) = (A(i-4,j) + A(i-3,j) + A(i-2,j) + A(i-1,j) + A(i,j) + &
                A(i+1,j) + A(i+2,j) + A(i+3,j) + A(i+4,j) + &
                A(i,j-4) + A(i,j-3) + A(i,j-2) + A(i,j-1) + &
                A(i,j+1) + A(i,j+2) + A(i,j+3) + A(i,j+4)) * coeff
    ENDDO
  ENDDO

1D cache version: identical loop nest and stencil, but caching only the i-dimension neighborhood:

      !$acc cache (A(i-4:i+4, j))

[Chart: 2D stencil cache performance (N = 16*1024, M = 16*1024), 1D cache vs. 2D cache, on P100 and K80.]

SLIDE 16

PARTIAL ARRAY CACHED: UNCOALESCED

  !$acc loop gang
  DO j = 1, N
    !$acc loop vector
    DO i = 1, M
      !$acc cache (B(j, i-1:i+1))
      C(i,j) = (A(i-1,j) + A(i,j) + A(i+1,j) + &
                B(j,i-1) + B(j,i)  + B(j,i+1)) * coeff
    ENDDO
  ENDDO

B is accessed along its second dimension, so consecutive threads (consecutive i) read non-contiguous memory; staging B(j, i-1:i+1) in shared memory reduces the cost of these uncoalesced loads.

[Chart: speedup from applying the cache directive to uncoalesced data (N = M = 8192) on P100 and K80.]

SLIDE 17

ENTIRE ARRAY DIMENSIONS CACHED

ORNL CAAR ACME

  !$acc parallel loop gang collapse(3)
  do ie = 1, nelemd
    do q = 1, qsize
      do ks = 1, nlev, kchunk
        !$acc cache(s(:,:,ks:ks+kchunk-1,q,ie))
        !$acc loop vector collapse(3)
        do k = 1, kchunk
          do j = 1, np
            do i = 1, np
              do l = 1, np
                dsdx00 = dsdx00 + deriv_dvv(l,i)*s(l,j,ks+k-1,q,ie)
                …

SLIDE 18

ENTIRE ARRAY CACHED

Nonhydrostatic Icosahedral Model: NIM

  !$acc parallel loop gang private(fu0, sumu, …)
  do ipn = IPS, IPE
    !$acc cache(fu0, sumu, …)
    !$acc loop vector
    do k = 1, NZ
      fu0(k) = 0.0
      …
    enddo
    !$acc loop vector
    do k = 1, NZ
      fu0(k) = fu0(k) + sumu(k)
      …
    end do

SLIDE 19

VARIABLE-LENGTH ARRAY

  real :: a(NX)
  …
  !$acc loop gang private(a)
  DO j = 1, N
    !$acc cache (a)
    !$acc loop vector
    DO i = 1, M
      …

For a variable-length private array, the compiler cannot verify at compile time that the cached data fits in shared memory; the safecache option asserts that it does. Compile with:

  pgfortran -acc -ta=tesla:safecache a.f90 -Minfo

SLIDE 20

PERFORMANCE DATA

[Chart: cache directive performance improvement (speedup) for kernels from real-world applications: MPAS, ACME, FORCE, PSTADV, VDMINTV, DIAG, FLUX, on P100 and K80.]

SLIDE 21

DISCUSSION

A recommendation often given: if there is data reuse within a thread block, use shared memory to cache that data so that access latency is reduced.

In practice, better performance is not guaranteed. It depends on several factors: the hardware platform, thread occupancy, memory access latency, and others.

SLIDE 22

CASE STUDY: ORNL DIRAC

  Array   S (double)   Is (int)    Id (int)
  Size    1599*8 B     1599*4 B    1599*4 B

Caching all three arrays takes roughly 1599*8 + 1599*4 + 1599*4 = 25,584 bytes of shared memory per gang, which lowers thread occupancy.

[Charts: occupancy vs. performance on P100 and K80 for four variants: nocache, cache-s, cache-id-is, cache-all.]

SLIDE 23

CONCLUSION

Summary: the cache directive does improve performance in real-world applications.

Pros:

- Helps reduce uncoalesced memory access
- Combined with gang-level private data, avoids fetching data from global memory

Cons:

- No guaranteed performance improvement: if shared memory is overused, occupancy drops and performance can suffer