SLIDE 1

GPU Computing with OpenACC Directives

SLIDE 2

GPUs Reaching Broader Set of Developers

[Timeline chart, 2004 to present: GPU adoption grows from 100,000s of early adopters (research universities, supercomputing centers, oil & gas) to 1,000,000s of developers across CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, and plasma physics.]

SLIDE 3

3 Ways to Accelerate Applications

  • Libraries: "drop-in" acceleration
  • OpenACC Directives: easily accelerate applications
  • Programming Languages: maximum flexibility

SLIDE 4

OpenACC Directives

Your original Fortran or C code, plus simple compiler hints. The compiler parallelizes the code, which works on many-core GPUs and multicore CPUs. Serial code runs on the CPU; the annotated region runs on the GPU.

Program myscience
  ... serial code ...
!$acc kernels          ! OpenACC compiler hint
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

SLIDE 5

Familiar to OpenMP Programmers

OpenMP (CPU):

main() {
  double pi = 0.0; long i;
  #pragma omp parallel for reduction(+:pi)
  for (i=0; i<N; i++) {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi = %f\n", pi/N);
}

OpenACC (CPU or GPU):

main() {
  double pi = 0.0; long i;
  #pragma acc kernels
  for (i=0; i<N; i++) {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi = %f\n", pi/N);
}

SLIDE 6

OpenACC: The Standard for GPU Directives

  • Easy: directives are the easy path to accelerate compute-intensive applications
  • Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
  • Powerful: GPU directives allow complete access to the massive parallel power of a GPU

SLIDE 7

OpenACC Members and Supporters

SLIDE 8

Directives: Easy & Powerful

  • Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
  • Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
  • Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
  – Developer at the Global Manufacturer of Navigation Systems

SLIDE 9

Focus on Exposing Parallelism

With directives, tuning work focuses on exposing parallelism, which makes codes inherently better.

Example: application tuning work using directives for the new Titan system at ORNL.

S3D: research more efficient combustion with next-generation fuels
  • Tuned top 3 kernels (90% of runtime)
  • 3 to 6x faster on CPU+GPU vs. CPU+CPU
  • But also improved all-CPU version by 50%

CAM-SE: answer questions about specific climate change adaptation and mitigation scenarios
  • Tuned top key kernel (50% of runtime)
  • 6.5x faster on CPU+GPU vs. CPU+CPU
  • Improved performance of CPU version by 100%
SLIDE 10

A Very Simple Exercise: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SLIDE 11

Directive Syntax

Fortran:

!$acc directive [clause [, clause] …]

Often paired with a matching end directive surrounding a structured code block:

!$acc end directive

C:

#pragma acc directive [clause [, clause] …]

Often followed by a structured code block.

SLIDE 12

kernels: Your First OpenACC Directive

Each loop is executed as a separate kernel on the GPU. A kernel is a parallel function that runs on the GPU.

!$acc kernels
  do i=1,n
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do            ! kernel 1

  do i=1,n
    a(i) = b(i) + c(i)
  end do            ! kernel 2
!$acc end kernels

SLIDE 13

Kernels Construct

Fortran:

!$acc kernels [clause …]
  structured block
!$acc end kernels

C:

#pragma acc kernels [clause …]
{
  structured block
}

Clauses:

if( condition )
async( expression )

Also, any data clause (more later).
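
As a rough sketch of how these clauses combine (the function name, size threshold, and queue number below are illustrative assumptions, not from the slides): if falls back to host execution when the problem is too small to amortize offload overhead, and async lets the host continue while the kernel runs.

// Hedged sketch: illustrative use of the if and async clauses.
void scale(int n, float a, float *restrict x)
{
    // Offload only when n is large enough (threshold is an assumption),
    // and run asynchronously on queue 1.
    #pragma acc kernels if(n > 100000) async(1)
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i];

    #pragma acc wait(1)   // block until queue 1 finishes before x is reused
}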
SLIDE 14

C tip: the restrict keyword

A declaration of intent given by the programmer to the compiler, applied to a pointer, e.g.

float *restrict ptr

Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points"*

This limits the effects of pointer aliasing. OpenACC compilers often require restrict to determine loop independence; otherwise the compiler can't parallelize loops that access ptr.

Note: if the programmer violates the declaration, behavior is undefined.

http://en.wikipedia.org/wiki/Restrict
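
A minimal sketch of why this matters (the function and array names are illustrative): without restrict the compiler must assume out and in might overlap, so the iterations are not provably independent and the loop cannot be parallelized.

// Hedged sketch: restrict lets the compiler prove loop independence.
void shift(int n, float *restrict out, const float *restrict in)
{
    #pragma acc kernels
    for (int i = 0; i < n - 1; ++i)
        out[i] = in[i + 1];   // safe to parallelize: restrict promises no aliasing
}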

SLIDE 15

Complete SAXPY example code

A trivial first example: apply a loop directive and learn the compiler commands.

#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats
  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);
  return 0;
}

*restrict: "I promise y does not alias x"

SLIDE 16

Compile and run

C:

pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c

Fortran:

pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

Compiler output:

pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
         CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
         CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

SLIDE 17

Example: Jacobi Iteration

A common, useful algorithm that iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of the neighboring points.

Example: solve the Laplace equation in 2D: $\nabla^2 f(x,y) = 0$

[Stencil diagram: A(i,j) and its four neighbors A(i+1,j), A(i-1,j), A(i,j-1), A(i,j+1)]

$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

SLIDE 18

Jacobi Iteration C Code

while ( error > tol && iter < iter_max ) {    // iterate until converged
  error = 0.0;

  for( int j = 1; j < n-1; j++) {             // iterate across matrix elements
    for( int i = 1; i < m-1; i++) {
      // calculate new value from neighbors
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      // compute max error for convergence
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  // swap input/output arrays
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 19

Jacobi Iteration Fortran Code

do while ( err > tol .and. iter < iter_max )   ! iterate until converged
  err = 0._fp_kind

  do j=1,m                                     ! iterate across matrix elements
    do i=1,n
      ! calculate new value from neighbors
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      ! compute max error for convergence
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  ! swap input/output arrays
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do

SLIDE 20

OpenMP C Code

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  // parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 21

OpenMP Fortran Code

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  ! parallelize loop across CPU threads
  !$omp parallel do shared(m,n,Anew,A) reduction(max:err)
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  ! parallelize loop across CPU threads
  !$omp parallel do shared(m,n,Anew,A)
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do

SLIDE 22

Exercises: General Instructions (compiling)

Exercises are in the "exercises/openacc" directory in your home directory; solutions are in the "exercise_solutions/openacc" directory.

module load pgi/13.5

To compile, use one of the provided makefiles:

C:       > make
Fortran: > make -f Makefile_f90

Remember these compiler flags: -acc -ta=nvidia -Minfo=accel

SLIDE 23

Exercises: General Instructions (running)

To run, use qsub with one of the provided job files:

> qsub runit.acc
> qstat -u <username>    # prints job status

Output is placed in slurm.* when finished.

An OpenACC job file looks something like this:

#!/bin/bash
./laplace2d_acc

The OpenMP version specifies the number of cores to use:

#!/bin/bash
export OMP_NUM_THREADS=6    # edit this to control the number of cores to use
./laplace2d_omp
SLIDE 24

Exercise 1: Jacobi Kernels

Task: use acc kernels to parallelize the Jacobi loop nests.

  • Edit laplace2D.c or laplace2D.f90 (your choice), in the 001-laplace2D-kernels directory
  • Add directives where they help
  • Figure out the proper compilation command (similar to the SAXPY example)
  • Compile both with and without OpenACC parallelization
  • Optionally compile with OpenMP (the original code has OpenMP directives)
  • Run the OpenACC version with laplace_acc, the OpenMP version with laplace_omp

Q: can you get a speedup with just kernels directives? Versus 1 CPU core? Versus 6 CPU cores?

SLIDE 25

Exercise 1 Solution: OpenACC C

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // execute GPU kernel for loop nest
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  // execute GPU kernel for loop nest
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 26

Exercise 1 Solution: OpenACC Fortran

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  ! generate GPU kernel for loop nest
  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels

  ! generate GPU kernel for loop nest
  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  !$acc end kernels

  iter = iter + 1
end do

SLIDE 27

Exercise 1 Solution: C Makefile

CC = pgcc
CCFLAGS =
ACCFLAGS = -acc -ta=nvidia -Minfo=accel
OMPFLAGS = -fast -mp -Minfo
BIN = laplace2d_omp laplace2d_acc

all: $(BIN)

laplace2d_acc: laplace2d.c
	$(CC) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

laplace2d_omp: laplace2d.c
	$(CC) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

clean:
	$(RM) $(BIN)

SLIDE 28

Exercise 1 Solution: Fortran Makefile

F90 = pgf90
CCFLAGS =
ACCFLAGS = -acc -ta=nvidia -Minfo=accel
OMPFLAGS = -fast -mp -Minfo
BIN = laplace2d_f90_omp laplace2d_f90_acc

all: $(BIN)

laplace2d_f90_acc: laplace2d.f90
	$(F90) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

laplace2d_f90_omp: laplace2d.f90
	$(F90) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

clean:
	$(RM) $(BIN)

SLIDE 29

Exercise 1: Compiler output (C)

pgcc -tp nehalem -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[0:][0:])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
             Cached references to size [3x(x+2)] block of 'A'
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

SLIDE 30

Exercise 1: Performance

Execution (4096x4096)      Time (s)   Speedup
CPU 1 OpenMP thread        95.1       --
CPU 2 OpenMP threads       50.8       1.9x
CPU 4 OpenMP threads       30.5       3.1x
CPU 6 OpenMP threads       24.7       3.9x
OpenACC GPU                150        0.16x   FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 6 CPU cores.

CPU: Intel E5-2670, 8 cores @ 2.60 GHz. GPU: NVIDIA Tesla K20Xm.

SLIDE 31

What went wrong?

Add PGI_ACC_TIME=1 to your environment:

time(us): 86,029,004
56: compute region reached 1000 times
    56: data copyin reached 1000 times
        device time(us): total=20,538,099 max=20,996 min=20,211 avg=20,538
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,078,568 max=3,185 min=3,072 avg=3,078
        elapsed time(us): total=3,091,327 max=3,810 min=3,081 avg=3,091
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=262,167 max=848 min=255 avg=262
        elapsed time(us): total=279,250 max=860 min=266 avg=279
    68: data copyout reached 1000 times
        device time(us): total=20,048,356 max=20,215 min=20,037 avg=20,048
68: compute region reached 1000 times
    68: data copyin reached 1000 times
        device time(us): total=20,536,979 max=21,117 min=20,219 avg=20,536
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=1,515,375 max=1,526 min=1,503 avg=1,515
        elapsed time(us): total=1,902,546 max=2,883 min=1,526 avg=1,902
    77: data copyout reached 1000 times
        device time(us): total=20,049,460 max=20,360 min=20,038 avg=20,049

Huge data transfer bottleneck! Computation: 5 seconds. Data movement: 80 seconds.

SLIDE 32

Basic Concepts

[Diagram: CPU and CPU memory connected to GPU and GPU memory over the PCI bus. Data is transferred across the bus; computation is offloaded to the GPU.]

For efficiency, decouple data movement and compute off-load.

SLIDE 33

Excessive Data Transfers

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // A, Anew resident on host: copied to the accelerator here
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      // A, Anew resident on accelerator
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }
  // A, Anew copied back to the host here
  ...
}

These copies happen on every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!

SLIDE 34

DATA MANAGEMENT

SLIDE 35

Data Construct

Manages data movement. Data regions may be nested.

Fortran:

!$acc data [clause …]
  structured block
!$acc end data

C:

#pragma acc data [clause …]
{
  structured block
}

General clauses:

if( condition )
async( expression )
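
A minimal sketch of a data region (illustrative names; array shapes use the OpenACC [start:length] section syntax): the array stays resident on the GPU across both compute regions, so it is transferred once instead of twice.

// Hedged sketch: one data region spanning two kernels regions.
void two_passes(int n, float *restrict a)
{
    #pragma acc data copy(a[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;        // first pass: no host round-trip after this

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;        // second pass reuses the device copy
    }                            // a is copied back to the host here
}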

SLIDE 36

Data Clauses

copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.

create( list )   Allocates memory on the GPU but does not copy.

present( list )  Data is already present on the GPU from another containing data region.

Also: present_or_copy[in|out], present_or_create, deviceptr.
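
To make the traffic implied by each clause concrete, here is a hedged sketch (illustrative names and shapes): in is only read on the GPU, out is only written, and tmp never needs to exist on the host at all.

// Hedged sketch: matching data clauses to the actual data flow.
void stages(int n, const float *restrict in,
            float *restrict out, float *restrict tmp)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * in[i];   // tmp lives only on the GPU

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;  // out is copied back on region exit
    }
}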

SLIDE 37

Array Shaping

The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".

C:

#pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])

Fortran:

!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))

Note: data clauses can be used on data, kernels or parallel constructs.
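
One hedged sketch of why shaping is needed (illustrative function and names): a bare pointer parameter carries no extent, and a subrange shape can also restrict how much data is moved. Note that the C section syntax in the OpenACC specification is [start:length]; the slide above uses the older PGI-era notation.

// Hedged sketch: explicit shapes on pointers the compiler cannot size.
void middle_half(int n, const float *restrict a, float *restrict b)
{
    // Copy all of a in; copy out only the middle half of b, which is
    // the only part this routine produces.
    #pragma acc data copyin(a[0:n]) copyout(b[n/4:n/2])
    {
        #pragma acc kernels
        for (int i = n/4; i < 3*n/4; ++i)
            b[i] = a[i];
    }
}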

SLIDE 38

Update Construct

Used to update existing data after it has changed in its corresponding copy (e.g. update the device copy after the host copy changes). Moves data from GPU to host, or host to GPU. Data movement can be conditional and asynchronous.

Fortran:

!$acc update [clause …]

C:

#pragma acc update [clause …]

Clauses:

host( list )
device( list )
if( expression )
async( expression )
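
A hedged sketch of a typical use (illustrative names; the checkpoint interval is an assumption): inside a long-lived data region, update host refreshes the host copy periodically without tearing the region down.

// Hedged sketch: periodic host refresh inside a data region.
void iterate(int n, int steps, int interval, float *restrict a)
{
    #pragma acc data copy(a[0:n])
    for (int s = 1; s <= steps; ++s) {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            a[i] = 0.5f * (a[i] + 1.0f);   // relax values toward 1.0

        if (s % interval == 0) {
            #pragma acc update host(a[0:n])   // pull current device values
            /* ... write a checkpoint from the host copy of a ... */
        }
    }
}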

SLIDE 39

Exercise 2: Jacobi Data Directives

Task: use acc data to minimize transfers in the Jacobi example.

  • Start from the given laplace2D.c or laplace2D.f90 (your choice), in the 002-laplace2d-data directory
  • Add directives where they help (hint: the [do] while loop)

Q: What speedup can you get with data + kernels directives? Versus 6 CPU cores? (OMP_NUM_THREADS=6 ./laplace2d_omp)

SLIDE 40

Exercise 2 Solution: OpenACC C

// copy A in at the beginning of the region, out at the end;
// allocate Anew on the accelerator only
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

SLIDE 41

Exercise 2 Solution: OpenACC Fortran

! copy A in at the beginning of the region, out at the end;
! allocate Anew on the accelerator only
!$acc data copy(A), create(Anew)
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels

  ...

  iter = iter + 1
end do
!$acc end data

SLIDE 42

Exercise 2: Performance

Execution                  Time (s)   Speedup
CPU 6 OpenMP threads       24.7       --
OpenACC GPU K20Xm          150        0.16x
OpenACC GPU K20Xm-opt      16.5       1.49x

Speedups vs. 6 CPU cores.

SLIDE 43

What went right?

Add PGI_ACC_TIME=1 to your environment:

time(us): 4,837,846
28: data region reached 1 time
    50: data copyin reached 1 time
        device time(us): total=20,239 max=20,239 min=20,239 avg=20,239
    82: data copyout reached 1 time
        device time(us): total=20,063 max=20,063 min=20,063 avg=20,063
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,062,986 max=3,180 min=3,060 avg=3,062
        elapsed time(us): total=3,073,489 max=3,520 min=3,070 avg=3,073
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=256,582 max=309 min=255 avg=256
        elapsed time(us): total=267,788 max=657 min=264 avg=267
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=1,477,976 max=1,528 min=1,473 avg=1,477
        elapsed time(us): total=1,489,101 max=1,735 min=1,484 avg=1,489

Transfer bottleneck eliminated! Computation: 5 seconds. Data movement: negligible.

SLIDE 44

Further speedups

OpenACC gives us more detailed control over parallelization, via the gang, worker, and vector clauses.

  • By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code
  • By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance
  • We will tackle these in later exercises; a sketch of the clauses follows below
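
As a preview, here is a hedged sketch of explicit loop mapping on the Jacobi sweep (using C99 variable-length-array parameters for self-containment; the vector width 128 is an assumed, untuned value, not a recommendation from the slides):

// Hedged sketch: explicit gang/vector mapping of the stencil loops.
void sweep(int n, int m, float A[n][m], float Anew[n][m])
{
    #pragma acc kernels
    {
        #pragma acc loop gang            // distribute rows across gangs
        for (int j = 1; j < n-1; j++) {
            #pragma acc loop vector(128) // vectorize within each row
            for (int i = 1; i < m-1; i++)
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] +
                                      A[j-1][i] + A[j+1][i]);
        }
    }
}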

SLIDE 45

Finding Parallelism in your code

  • (Nested) for loops are best for parallelization
  • Large loop counts are needed to offset GPU/memcpy overhead
  • Iterations of loops must be independent of each other
      To help the compiler: the restrict keyword (C), the independent clause (see the sketch after this list)
  • The compiler must be able to figure out the sizes of data regions
      You can use directives to explicitly control sizes
  • Pointer arithmetic should be avoided if possible
      Use subscripted arrays rather than pointer-indexed arrays
  • Function calls within an accelerated region must be inlineable
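
A hedged sketch of the independent clause (an illustrative scatter; whether it is safe is the programmer's claim, not the compiler's): the indirect store defeats the compiler's independence analysis, and independent asserts that no two iterations write the same element.

// Hedged sketch: asserting independence the compiler cannot prove.
void scatter(int n, const int *restrict idx,
             const float *restrict src, float *restrict dst)
{
    #pragma acc kernels loop independent
    for (int i = 0; i < n; ++i)
        dst[idx[i]] = src[i];   // safe only if idx contains no duplicates
}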

SLIDE 46

Tips and Tricks

  • (PGI) Use the time option to learn where time is being spent: -ta=nvidia,time
  • Eliminate pointer arithmetic
  • Inline function calls in directives regions: (PGI) -inline or -inline,levels(<N>)
  • Use contiguous memory for multi-dimensional arrays
  • Use data regions to avoid excessive memory transfers
  • Conditional compilation with the _OPENACC macro (see the sketch below)
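
A hedged sketch of the _OPENACC macro (the printed messages are illustrative): OpenACC compilers predefine _OPENACC when directives are enabled, so host-only fallbacks can be selected at compile time.

// Hedged sketch: compile-time branching on _OPENACC.
#include <stdio.h>

int main(void)
{
#ifdef _OPENACC
    printf("Built with OpenACC (version macro = %d)\n", _OPENACC);
#else
    printf("Built without OpenACC support\n");
#endif
    return 0;
}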

SLIDE 47

OpenACC Learning Resources

OpenACC info, specification, FAQ, samples, and more

http://openacc.org

PGI OpenACC resources

http://www.pgroup.com/resources/accel.htm