GPU COMPUTING WITH OPENACC
2
3 WAYS TO ACCELERATE APPLICATIONS
Applications can be accelerated three ways:
- Libraries: "drop-in" acceleration
- OpenACC Directives: easily accelerate applications
- Programming Languages: maximum flexibility
3
OPENACC DIRECTIVES
Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience
CPU GPU
Your original Fortran or C code
Simple compiler hints (the !$acc lines above are the OpenACC compiler hints)
Compiler parallelizes code
Works on many-core GPUs & multicore CPUs
4
FAMILIAR TO OPENMP PROGRAMMERS
int main() {
    double pi = 0.0;
    long i;

    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
}
CPU OpenMP
int main() {
    double pi = 0.0;
    long i;

    #pragma acc kernels
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
}
CPU GPU OpenACC
OPENACC MEMBERS AND SUPPORTERS
6
DIRECTIVES: EASY & POWERFUL
- Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
- Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
- Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
- Developer at the Global Manufacturer of Navigation Systems
7
FOCUS ON EXPOSING PARALLELISM
With Directives, tuning work focuses on exposing parallelism, which makes codes inherently better
Example: Application tuning work using directives for new Titan system at ORNL
S3D: Research more efficient combustion with next-generation fuels
- Tuning top 3 kernels (90% of runtime)
- 3 to 6x faster on CPU+GPU vs. CPU+CPU
- But also improved the all-CPU version by 50%

CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios
- Tuning top key kernel (50% of runtime)
- 6.5x faster on CPU+GPU vs. CPU+CPU
- Improved performance of the CPU version by 100%
8
A VERY SIMPLE EXERCISE: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
  !$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
  !$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
9
DIRECTIVE SYNTAX
Fortran:
!$acc directive [clause [[,] clause]...]
Often paired with a matching end directive surrounding a structured code block:
!$acc end directive

C:
#pragma acc directive [clause [[,] clause]...]
Often followed by a structured code block
10
KERNELS: YOUR FIRST OPENACC DIRECTIVE
Each loop executed as a separate kernel on the GPU.
!$acc kernels
do i=1,n
  a(i) = 0.0
  b(i) = 1.0
  c(i) = 2.0
end do

do i=1,n
  a(i) = b(i) + c(i)
end do
!$acc end kernels
kernel 1 kernel 2
Kernel:
A parallel function that runs on the GPU
11
KERNELS CONSTRUCT
Fortran:
!$acc kernels [clause ...]
  structured block
!$acc end kernels

C:
#pragma acc kernels [clause ...]
{ structured block }

Clauses:
if( condition )
async( expression )
Also, any data clause (more later)
12
COMPLETE SAXPY EXAMPLE CODE
Trivial first example:
- Apply a loop directive
- Learn compiler commands
#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats

  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);

  return 0;
}
*restrict: "I promise y does not alias x"
13
COMPILE AND RUN
C:       pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

Compiler output:

pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
14
EXAMPLE: JACOBI ITERATION
Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points.

Common, useful algorithm. Example: solve the Laplace equation in 2D: ∇²A(x,y) = 0

Each interior point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1):

A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
15
JACOBI ITERATION C CODE
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Annotations: iterate until converged (while loop); iterate across matrix elements; calculate new value from neighbors; compute max error for convergence; swap input/output arrays.
16
JACOBI ITERATION FORTRAN CODE
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j  ) + A(i-1,j  ) + &
                                 A(i  ,j-1) + A(i  ,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do
Annotations: iterate until converged (do while loop); iterate across matrix elements; calculate new value from neighbors; compute max error for convergence; swap input/output arrays.
17
EXERCISES
Exercises are in “exercises/openacc” directory
Solutions in “exercise_solutions/openacc” directory
module load pgi/14.6

To compile, use one of the provided makefiles:
C:       > make
Fortran: > make -f Makefile_f90

Remember these flags: -acc -ta=nvidia -Minfo=accel
General instructions (compiling)
18
EXERCISES
To run, use sbatch with one of the provided job files
> sbatch runit.acc
> qstat -u <username>   # check job status
Output is placed in slurm.* when finished.
General instructions (running)
19
EXERCISE 1
Task: use acc kernels to parallelize the Jacobi loop nests
- Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory
- Add directives where it helps
- Figure out the proper compilation flags to use
- Optionally: run the OpenACC version with laplace_acc
Q: can you get a speedup with just kernels directives?
Versus 1 CPU core? Versus 6 CPU cores?
Jacobi kernels
20
EXERCISE 1 SOLUTION: OPENACC C
while ( error > tol && iter < iter_max ) {
  error = 0.0;

#pragma acc kernels
  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

#pragma acc kernels
  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Each of the two loop nests executes as a separate GPU kernel.
21
EXERCISE 1 SOLUTION: OPENACC FORTRAN
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                  A(i,j-1) + A(i,j+1))
      err = max(err, abs(Anew(i,j) - A(i,j)))
    enddo
  enddo
  !$acc end kernels

  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    enddo
  enddo
  !$acc end kernels

  iter = iter + 1
enddo
Each of the two loop nests executes as a separate GPU kernel.
22
EXERCISE 1: COMPILER OUTPUT (C)
pgcc -tp sandybridge-64 -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c main: 56, Generating present_or_copyout(Anew[1:4094][1:4094]) Generating present_or_copyin(A[:][:]) Generating Tesla code 57, Loop is parallelizable 59, Loop is parallelizable Accelerator kernel generated 57, #pragma acc loop gang /* blockIdx.y */ 59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 63, Max reduction generated for error 68, Generating present_or_copyin(Anew[1:4094][1:4094]) Generating present_or_copyout(A[1:4094][1:4094]) Generating Tesla code 69, Loop is parallelizable 71, Loop is parallelizable Accelerator kernel generated 69, #pragma acc loop gang /* blockIdx.y */ 71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
23
EXERCISE 1: PERFORMANCE
Execution (4096x4096)      Time (s)   Speedup
CPU 1 OpenMP thread        108.7      --
CPU 2 OpenMP threads       58.6       1.85x
CPU 4 OpenMP threads       31.9       3.41x
CPU 8 OpenMP threads       20.3       5.35x
OpenACC GPU                176.1      0.12x  FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 8 CPU cores.
CPU: Intel E5-2680 v2, 10 cores @ 2.80 GHz
GPU: NVIDIA Tesla K20m
24
WHAT WENT WRONG?
PGI_ACC_TIME=1
time(us): 101,104,174
56: data region reached 1000 times
    56: data copyin reached 8000 times
        device time(us): total=22,030,081 max=2,858 min=2,746 avg=2,753
    68: data copyout reached 8000 times
        device time(us): total=23,018,701 max=6,552 min=2,855 avg=2,877
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=6,456,517 max=6,516 min=6,447 avg=6,456
        elapsed time(us): total=6,471,110 max=7,066 min=6,460 avg=6,471
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=270,280 max=276 min=268 avg=270
        elapsed time(us): total=283,763 max=353 min=282 avg=283
68: data region reached 1000 times
    68: data copyin reached 8000 times
        device time(us): total=23,271,701 max=2,946 min=2,891 avg=2,908
    77: data copyout reached 8000 times
        device time(us): total=23,016,095 max=2,993 min=2,854 avg=2,877
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,040,799 max=3,050 min=3,037 avg=3,040
        elapsed time(us): total=3,056,315 max=3,131 min=3,052 avg=3,056
Huge Data Transfer Bottleneck!
Computation: ~10 seconds. Data movement: ~90 seconds.
25
BASIC CONCEPTS
CPU (CPU Memory)  <-- PCI Bus -->  GPU (GPU Memory)

1. Transfer data
2. Offload computation

For efficiency, decouple data movement and compute off-load.
26
EXCESSIVE DATA TRANSFERS
while ( error > tol && iter < iter_max ) {
  error = 0.0;

#pragma acc kernels
  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }
  ...
}

At entry to each kernels region: A, Anew resident on host --copy--> A, Anew resident on accelerator
At exit from each kernels region: A, Anew resident on accelerator --copy--> A, Anew resident on host

These copies happen every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!
27
Data Management
28
DATA CONSTRUCT
Fortran:
!$acc data [clause ...]
  structured block
!$acc end data

C:
#pragma acc data [clause ...]
{ structured block }

General clauses:
if( condition )
async( expression )

Manage data movement. Data regions may be nested.
29
DATA CLAUSES
copy ( list )
    Allocates memory on GPU, copies data from host to GPU when entering region, and copies data back to the host when exiting region.

copyin ( list )
    Allocates memory on GPU and copies data from host to GPU when entering region.

copyout ( list )
    Allocates memory on GPU and copies data to the host when exiting region.

create ( list )
    Allocates memory on GPU but does not copy.

present ( list )
    Data is already present on GPU from another containing data region.

Also: present_or_copy[in|out], present_or_create, deviceptr.
30
ARRAY SHAPING
Compiler sometimes cannot determine size of arrays
Must specify explicitly using data clauses and array “shape”
C (array shape is [start:length]):
#pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])

Fortran (array shape is (lower:upper)):
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
Note: data clauses can be used on data, kernels or parallel
31
EXERCISE 2: JACOBI DATA DIRECTIVES
Task: use acc data to minimize transfers in the Jacobi example
- Start from the given laplace2D.c or laplace2D.f90 (your choice) in the 002-laplace2d-data directory
- Add directives where it helps (hint: the [do] while loop)
Q: What speedup can you get with data + kernels directives?
Versus 6 CPU cores? OMP_NUM_THREADS=6 ./laplace2d_omp
EXERCISE 2 SOLUTION: OPENACC C
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
  error = 0.0;

#pragma acc kernels
  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

#pragma acc kernels
  for ( int j = 1; j < n-1; j++) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
EXERCISE 2 SOLUTION: OPENACC FORTRAN
!$acc data copy(A), create(Anew)
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j  ) + A(i-1,j  ) + &
                                 A(i  ,j-1) + A(i  ,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels
  ...
  iter = iter + 1
end do
!$acc end data
Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
34
EXERCISE 2: PERFORMANCE
Execution (4096x4096)      Time (s)   Speedup
CPU 8 OpenMP threads       20.3       --
OpenACC K20m               176.1      0.12x
OpenACC K20m-opt           8.34       2.43x
Speedup vs. 8 CPU cores
35
WHAT WENT RIGHT?
time(us): 9,901,981
50: data region reached 1 time
    50: data copyin reached 8 times
        device time(us): total=22,039 max=2,763 min=2,751 avg=2,754
    82: data copyout reached 9 times
        device time(us): total=21,851 max=2,734 min=14 avg=2,427
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=6,437,470 max=6,491 min=6,429 avg=6,437
        elapsed time(us): total=6,452,030 max=7,162 min=6,442 avg=6,452
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,570 max=324 min=268 avg=269
        elapsed time(us): total=283,752 max=1,057 min=281 avg=283
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,151,051 max=3,206 min=3,147 avg=3,151
        elapsed time(us): total=3,166,372 max=3,924 min=3,160 avg=3,166
Transfer Bottleneck Eliminated!
Computation: 10 seconds Data movement: negligible
36
FURTHER SPEEDUPS
OpenACC gives us more detailed control over parallelization
Via gang, worker, and vector clauses
By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code. By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance. We will tackle these in later exercises.
37
FINDING PARALLELISM IN YOUR CODE
- (Nested) for loops are best for parallelization
- Large loop counts are needed to offset GPU/memcpy overhead
- Iterations of loops must be independent of each other
  - To help the compiler: restrict keyword (C), independent clause
- Compiler must be able to figure out sizes of data regions
  - Can use directives to explicitly control sizes
- Pointer arithmetic should be avoided if possible
  - Use subscripted arrays, rather than pointer-indexed arrays
- Function calls within accelerated regions must be inlineable
38
TIPS AND TRICKS
(PGI) Use the time option to learn where time is being spent:
-ta=nvidia,time

Eliminate pointer arithmetic

Inline function calls in directives regions
(PGI): -inline or -inline,levels(<N>)