

SLIDE 1

GPU COMPUTING WITH OPENACC

SLIDE 2

3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated in three ways:

  • Libraries: "drop-in" acceleration
  • OpenACC Directives: easily accelerate applications
  • Programming Languages: maximum flexibility

SLIDE 3

OPENACC DIRECTIVES

Program myscience
  ... serial code ...
!$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

The serial code runs on the CPU; the compiler offloads the kernels region to the GPU.

This is your original Fortran or C code; the "!$acc kernels" line is the OpenACC compiler hint.

  • Simple compiler hints
  • Compiler parallelizes code
  • Works on many-core GPUs & multicore CPUs

SLIDE 4

FAMILIAR TO OPENMP PROGRAMMERS

CPU (OpenMP):

main() {
    double pi = 0.0;
    long i;
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
}

CPU + GPU (OpenACC):

main() {
    double pi = 0.0;
    long i;
    #pragma acc kernels
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
}

SLIDE 5

OPENACC MEMBERS AND SUPPORTERS

SLIDE 6

DIRECTIVES: EASY & POWERFUL

  • Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
  • Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
  • Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
  - Developer at the Global Manufacturer of Navigation Systems

SLIDE 7

FOCUS ON EXPOSING PARALLELISM

With directives, tuning work focuses on exposing parallelism, which makes the code inherently better.

Example: application tuning work using directives for the new Titan system at ORNL.

S3D: research more efficient combustion with next-generation fuels.
CAM-SE: answer questions about specific climate change adaptation and mitigation scenarios.

  • Tuning top 3 kernels (90% of runtime)
  • 3x to 6x faster on CPU+GPU vs. CPU+CPU
  • But also improved the all-CPU version by 50%

  • Tuning top key kernel (50% of runtime)
  • 6.5x faster on CPU+GPU vs. CPU+CPU
  • Improved performance of the CPU version by 100%
SLIDE 8

A VERY SIMPLE EXERCISE: SAXPY

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i) + y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SLIDE 9

DIRECTIVE SYNTAX

Fortran:
  !$acc directive [clause [,] clause] ...
  Often paired with a matching end directive surrounding a structured code block:
  !$acc end directive

C:
  #pragma acc directive [clause [,] clause] ...
  Often followed by a structured code block

SLIDE 10

KERNELS: YOUR FIRST OPENACC DIRECTIVE

Each loop is executed as a separate kernel on the GPU.

!$acc kernels
  do i=1,n            ! kernel 1
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do

  do i=1,n            ! kernel 2
    a(i) = b(i) + c(i)
  end do
!$acc end kernels

Kernel: a parallel function that runs on the GPU.

SLIDE 11

KERNELS CONSTRUCT

Fortran:
  !$acc kernels [clause ...]
    structured block
  !$acc end kernels

C:
  #pragma acc kernels [clause ...]
  {
    structured block
  }

Clauses:
  if( condition )
  async( expression )
  Also, any data clause (more later)
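As a concrete illustration of the if and async clauses, here is a minimal sketch (not from the deck; the size threshold, queue number, and function name are illustrative assumptions):

void scale(int n, float a, float *restrict x)
{
    /* Offload only when the array is large enough to be worth it,
       and run the kernels region asynchronously on queue 1. */
    #pragma acc kernels if(n > 100000) async(1)
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i];

    /* ...independent host work could overlap here... */

    #pragma acc wait(1)   /* block until queue 1 has finished */
}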

SLIDE 12

COMPLETE SAXPY EXAMPLE CODE

Trivial first example

  • Apply a loop directive
  • Learn compiler commands

#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
    int N = 1<<20; // 1 million floats
    if (argc > 1)
        N = atoi(argv[1]);

    float *x = (float*)malloc(N * sizeof(float));
    float *y = (float*)malloc(N * sizeof(float));

    for (int i = 0; i < N; ++i) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }

    saxpy(N, 3.0f, x, y);
    return 0;
}

*restrict: "I promise y does not alias x"

SLIDE 13

COMPILE AND RUN

C:       pgcc  -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

Compiler output:

pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

SLIDE 14

EXAMPLE: JACOBI ITERATION

A common, useful algorithm: it iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points.

Example: solve the Laplace equation in 2D:

    ∇²f(x,y) = 0

Each point is updated from its four neighbors A(i+1,j), A(i-1,j), A(i,j-1), A(i,j+1):

    Anew(i,j) = ( A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1) ) / 4

SLIDE 15

JACOBI ITERATION C CODE

while ( error > tol && iter < iter_max ) {          // Iterate until converged
    error = 0.0;

    for ( int j = 1; j < n-1; j++ ) {                // Iterate across matrix elements
        for ( int i = 1; i < m-1; i++ ) {
            // Calculate new value from neighbors
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            // Compute max error for convergence
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    for ( int j = 1; j < n-1; j++ ) {                // Swap input/output arrays
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

SLIDE 16

JACOBI ITERATION FORTRAN CODE

do while ( err > tol .and. iter < iter_max )         ! Iterate until converged
  err = 0._fp_kind

  do j = 1, m                                         ! Iterate across matrix elements
    do i = 1, n
      ! Calculate new value from neighbors
      Anew(i,j) = .25_fp_kind * ( A(i+1, j  ) + A(i-1, j  ) + &
                                  A(i  , j-1) + A(i  , j+1) )
      ! Compute max error for convergence
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  do j = 1, m-2                                       ! Swap input/output arrays
    do i = 1, n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do

SLIDE 17

EXERCISES

General instructions (compiling):

  • Exercises are in the "exercises/openacc" directory
  • Solutions are in the "exercise_solutions/openacc" directory
  • module load pgi/14.6
  • To compile, use one of the provided makefiles:
      C:       > make
      Fortran: > make -f Makefile_f90
  • Remember these flags: -acc -ta=nvidia -Minfo=accel

SLIDE 18

EXERCISES

General instructions (running):

To run, use sbatch with one of the provided job files:

  > sbatch runit.acc
  > qstat -u <username>     # check job status

Output is placed in slurm.* when finished.

SLIDE 19

EXERCISE 1

Task: use acc kernels to parallelize the Jacobi loop nests ("Jacobi kernels").

  • Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory
  • Add directives where it helps
  • Figure out the proper compilation flags to use
  • Optionally: run the OpenACC version with laplace_acc

Q: Can you get a speedup with just kernels directives?
  • Versus 1 CPU core?
  • Versus 6 CPU cores?

SLIDE 20

EXERCISE 1 SOLUTION: OPENACC C

while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels                              // Execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels                              // Execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

SLIDE 21

EXERCISE 1 SOLUTION: OPENACC FORTRAN

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

!$acc kernels                        ! Execute GPU kernel for loop nest
  do j = 1, m
    do i = 1, n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                           A(i,j-1) + A(i,j+1) )
      err = max(err, abs(Anew(i,j) - A(i,j)))
    enddo
  enddo
!$acc end kernels

!$acc kernels                        ! Execute GPU kernel for loop nest
  do j = 1, m-2
    do i = 1, n-2
      A(i,j) = Anew(i,j)
    enddo
  enddo
!$acc end kernels

  iter = iter + 1
enddo

SLIDE 22

EXERCISE 1: COMPILER OUTPUT (C)

pgcc -tp sandybridge-64 -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[:][:])
         Generating Tesla code
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating Tesla code
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

SLIDE 23

EXERCISE 1: PERFORMANCE

Execution (4096x4096)       Time (s)   Speedup
CPU, 1 OpenMP thread         108.7       --
CPU, 2 OpenMP threads         58.6       1.85x
CPU, 4 OpenMP threads         31.9       3.41x
CPU, 8 OpenMP threads         20.3       5.35x
OpenACC GPU                  176.1       0.12x   FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 8 CPU cores.

CPU: Intel E5-2680 v2, 10 cores @ 2.80 GHz
GPU: NVIDIA Tesla K20m

SLIDE 24

WHAT WENT WRONG?

Setting PGI_ACC_TIME=1 produces an accelerator profile at program exit:

time(us): 101,104,174
56: data region reached 1000 times
    56: data copyin reached 8000 times
        device time(us): total=22,030,081 max=2,858 min=2,746 avg=2,753
    68: data copyout reached 8000 times
        device time(us): total=23,018,701 max=6,552 min=2,855 avg=2,877
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=6,456,517 max=6,516 min=6,447 avg=6,456
        elapsed time(us): total=6,471,110 max=7,066 min=6,460 avg=6,471
    59: reduction kernel launched 1000 times
        grid: [1]  block: [256]
        device time(us): total=270,280 max=276 min=268 avg=270
        elapsed time(us): total=283,763 max=353 min=282 avg=283
68: data region reached 1000 times
    68: data copyin reached 8000 times
        device time(us): total=23,271,701 max=2,946 min=2,891 avg=2,908
    77: data copyout reached 8000 times
        device time(us): total=23,016,095 max=2,993 min=2,854 avg=2,877
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=3,040,799 max=3,050 min=3,037 avg=3,040
        elapsed time(us): total=3,056,315 max=3,131 min=3,052 avg=3,056

Huge data transfer bottleneck!

Computation: 10 seconds. Data movement: 90 seconds.

SLIDE 25

BASIC CONCEPTS

Diagram: the CPU and its memory are connected to the GPU and its memory over the PCI bus. Data is transferred across the bus; computation is offloaded to the GPU.

For efficiency, decouple data movement from compute offload.

SLIDE 26

EXCESSIVE DATA TRANSFERS

while ( error > tol && iter < iter_max ) {
    error = 0.0;

    // A, Anew resident on host
    // Copy: host -> accelerator
    #pragma acc kernels
    // A, Anew resident on accelerator
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }
    // A, Anew resident on accelerator
    // Copy: accelerator -> host
    // A, Anew resident on host

    ...
}

These copies happen every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!

SLIDE 27

Data Management

SLIDE 28

DATA CONSTRUCT

Fortran:
  !$acc data [clause ...]
    structured block
  !$acc end data

C:
  #pragma acc data [clause ...]
  {
    structured block
  }

General clauses:
  if( condition )
  async( expression )

Manage data movement. Data regions may be nested.
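A minimal sketch of a data region wrapping two kernels regions so the arrays stay resident on the GPU between them (not from the deck; the function name, array names, and clause choices are illustrative):

void two_steps(int n, float *restrict a, float *restrict b)
{
    /* a is copied to the GPU at entry; b is copied in at entry and
       back out at exit; both stay on the GPU for both kernels regions. */
    #pragma acc data copyin(a[0:n]) copy(b[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            b[i] = 2.0f * a[i];

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            b[i] = b[i] + a[i];
    }
}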

SLIDE 29

DATA CLAUSES

copy ( list )      Allocates memory on the GPU, copies data from host to GPU when
                   entering the region, and copies data back to the host when exiting.

copyin ( list )    Allocates memory on the GPU and copies data from host to GPU
                   when entering the region.

copyout ( list )   Allocates memory on the GPU and copies data to the host
                   when exiting the region.

create ( list )    Allocates memory on the GPU but does not copy.

present ( list )   Data is already present on the GPU from another containing
                   data region.

Also: present_or_copy[in|out], present_or_create, deviceptr.
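A short sketch combining two of these clauses (illustrative only; the names solve, A, and Anew are placeholders, not the exercise code):

void solve(int n, float *restrict A, float *restrict Anew)
{
    /* copy:   A is copied host -> GPU at entry and GPU -> host at exit.
       create: Anew is allocated on the GPU but never copied. */
    #pragma acc data copy(A[0:n*n]) create(Anew[0:n*n])
    {
        /* ...kernels regions that read and write A and Anew... */
    }
}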

SLIDE 30

ARRAY SHAPING

The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".

C:
  #pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])

Fortran:
  !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))

Note: data clauses can be used on data, kernels or parallel constructs.
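For reference, in the OpenACC specification a C array section is written [start:length] (this detail is from the spec, not the slides); a hedged sketch with illustrative names:

void copy_half(int n, float *restrict a, float *restrict b)
{
    /* a[0:n]     -> n elements of a starting at index 0
       b[n/2:n/2] -> n/2 elements of b starting at index n/2 (the second half) */
    #pragma acc data copyin(a[0:n]) copyout(b[n/2:n/2])
    {
        #pragma acc kernels
        for (int i = n/2; i < n; ++i)
            b[i] = a[i];
    }
}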

SLIDE 31

EXERCISE 2: JACOBI DATA DIRECTIVES

Task: use acc data to minimize transfers in the Jacobi example.

  • Start from the given laplace2D.c or laplace2D.f90 (your choice) in the 002-laplace2d-data directory
  • Add directives where it helps (hint: the [do] while loop)

Q: What speedup can you get with data + kernels directives?
  • Versus 6 CPU cores?  OMP_NUM_THREADS=6 ./laplace2d_omp

SLIDE 32

Exercise 2 Solution: OpenACC C

#pragma acc data copy(A), create(Anew)   // Copy A in at beginning of region, out at end;
                                         // allocate Anew on the accelerator
while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

SLIDE 33

Exercise 2 Solution: OpenACC Fortran

!$acc data copy(A), create(Anew)      ! Copy A in at beginning of region, out at end;
                                      ! allocate Anew on the accelerator
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

!$acc kernels
  do j = 1, m
    do i = 1, n
      Anew(i,j) = .25_fp_kind * ( A(i+1, j  ) + A(i-1, j  ) + &
                                  A(i  , j-1) + A(i  , j+1) )
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
!$acc end kernels
  ...
  iter = iter + 1
end do
!$acc end data

SLIDE 34

EXERCISE 2: PERFORMANCE

Execution (4096x4096)       Time (s)   Speedup
CPU, 8 OpenMP threads        20.3        --
OpenACC K20m                176.1        0.12x
OpenACC K20m-opt              8.34       2.43x

Speedup vs. 8 CPU cores.

SLIDE 35

WHAT WENT RIGHT?

time(us): 9,901,981
50: data region reached 1 time
    50: data copyin reached 8 times
        device time(us): total=22,039 max=2,763 min=2,751 avg=2,754
    82: data copyout reached 9 times
        device time(us): total=21,851 max=2,734 min=14 avg=2,427
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=6,437,470 max=6,491 min=6,429 avg=6,437
        elapsed time(us): total=6,452,030 max=7,162 min=6,442 avg=6,452
    59: reduction kernel launched 1000 times
        grid: [1]  block: [256]
        device time(us): total=269,570 max=324 min=268 avg=269
        elapsed time(us): total=283,752 max=1,057 min=281 avg=283
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094]  block: [128]
        device time(us): total=3,151,051 max=3,206 min=3,147 avg=3,151
        elapsed time(us): total=3,166,372 max=3,924 min=3,160 avg=3,166

Transfer bottleneck eliminated!

Computation: 10 seconds. Data movement: negligible.

SLIDE 36

FURTHER SPEEDUPS

OpenACC gives us more detailed control over parallelization via the gang, worker, and vector clauses.

  • By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code
  • By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance
  • We will tackle these in later exercises
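As a taste of what those clauses look like, here is a hedged sketch of gang and vector clauses on the Jacobi loop nest (this is not one of the deck's exercise solutions; the specific gang counts and vector lengths are illustrative assumptions that need tuning for a given GPU):

void jacobi_sweep(int n, int m, float A[n][m], float Anew[n][m])
{
    /* Outer loop: 32 gangs, vector length 16.
       Inner loop: 16 gangs, vector length 32. */
    #pragma acc kernels loop gang(32) vector(16)
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16) vector(32)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i] );
        }
    }
}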

SLIDE 37

FINDING PARALLELISM IN YOUR CODE

  • (Nested) for loops are best for parallelization
  • Large loop counts are needed to offset GPU/memcpy overhead
  • Iterations of loops must be independent of each other
      To help the compiler: restrict keyword (C), independent clause (see the sketch after this list)
  • Compiler must be able to figure out sizes of data regions
      Can use directives to explicitly control sizes
  • Pointer arithmetic should be avoided if possible
      Use subscripted arrays rather than pointer-indexed arrays
  • Function calls within an accelerated region must be inlineable
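A hedged sketch of the two compiler hints mentioned above, restrict and the independent clause (the function and its arguments are illustrative, not exercise code):

void scaled_copy(int n, const float *restrict src, float *restrict dst)
{
    /* restrict promises src and dst do not alias;
       independent asserts the iterations can run in any order. */
    #pragma acc kernels loop independent
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}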

SLIDE 38

TIPS AND TRICKS

  • (PGI) Use the time option to learn where time is being spent: -ta=nvidia,time
  • Eliminate pointer arithmetic
  • Inline function calls in directives regions (PGI): -Minline or -Minline=levels:<N>
  • Use contiguous memory for multi-dimensional arrays
  • Use data regions to avoid excessive memory transfers
  • Conditional compilation with the _OPENACC macro (see the sketch below)
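A hedged sketch of conditional compilation with the _OPENACC macro; acc_get_num_devices and acc_device_nvidia are from the OpenACC runtime API shipped with the PGI compilers, and the printed messages are illustrative:

#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main(void)
{
#ifdef _OPENACC
    /* _OPENACC is predefined when the compiler is invoked with OpenACC enabled */
    printf("OpenACC enabled; NVIDIA devices found: %d\n",
           acc_get_num_devices(acc_device_nvidia));
#else
    printf("Compiled without OpenACC support\n");
#endif
    return 0;
}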