GPU Computing with OpenACC Directives - PowerPoint PPT Presentation
GPUs Reaching Broader Set of Developers

2004 (early adopters, 100,000s of developers): Research, Universities, Supercomputing Centers, Oil & Gas
Present (1,000,000s of developers): CAE, CFD, Finance, Rendering, Data Analytics, Life Sciences, Defense, Weather, Climate, Plasma Physics
3 Ways to Accelerate Applications

Applications can be accelerated through:
- Libraries: "drop-in" acceleration
- OpenACC Directives: easily accelerate applications
- Programming Languages: maximum flexibility
OpenACC Directives
Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience
CPU GPU
Your original Fortran or C code
Simple compiler hints. The compiler parallelizes the code. Works on many-core GPUs & multicore CPUs.
OpenACC Compiler Hint
Familiar to OpenMP Programmers
main() {
  double pi = 0.0; long i;
  #pragma omp parallel for reduction(+:pi)
  for (i=0; i<N; i++) {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi = %f\n", pi/N);
}
CPU OpenMP
main() {
  double pi = 0.0; long i;
  #pragma acc kernels
  for (i=0; i<N; i++) {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi = %f\n", pi/N);
}
CPU GPU OpenACC
Easy: Directives are the easy path to accelerate compute-intensive applications.
Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors.
Powerful: GPU directives allow complete access to the massive parallel power of a GPU.
OpenACC
The Standard for GPU Directives
OpenACC Members and Supporters
Directives: Easy & Powerful
Real-Time Object Detection
Global Manufacturer of Navigation Systems
Valuation of Stock Portfolios using Monte Carlo
Global Technology Consulting Company
Interaction of Solvents and Biomolecules
University of Texas at San Antonio
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."

- Developer at the Global Manufacturer of Navigation Systems

5x in 40 Hours | 2x in 4 Hours | 5x in 8 Hours
Focus on Exposing Parallelism
With Directives, tuning work focuses on exposing parallelism, which makes codes inherently better
Example: Application tuning work using directives for new Titan system at ORNL
S3D: Research more efficient combustion with next-generation fuels
- Tuning top 3 kernels (90% of runtime)
- 3 to 6x faster on CPU+GPU vs. CPU+CPU
- But also improved all-CPU version by 50%

CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios
- Tuning top key kernel (50% of runtime)
- 6.5x faster on CPU+GPU vs. CPU+CPU
- Improved performance of CPU version by 100%
A Very Simple Exercise: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
  !$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
  !$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
Directive Syntax
Fortran:
  !$acc directive [clause [[,] clause]...]
Often paired with a matching end directive surrounding a structured code block:
  !$acc end directive

C:
  #pragma acc directive [clause [[,] clause]...]
Often followed by a structured code block.
kernels: Your First OpenACC Directive
Each loop executed as a separate kernel on the GPU.
!$acc kernels
do i=1,n
  a(i) = 0.0
  b(i) = 1.0
  c(i) = 2.0
end do

do i=1,n
  a(i) = b(i) + c(i)
end do
!$acc end kernels
kernel 1 kernel 2
Kernel:
A parallel function that runs on the GPU
Kernels Construct
Fortran
!$acc kernels [clause ...]
  structured block
!$acc end kernels

Clauses

if( condition )
async( expression )
Also, any data clause (more later)

C

#pragma acc kernels [clause ...]
{
  structured block
}
C tip: the restrict keyword
Declaration of intent given by the programmer to the compiler
Applied to a pointer, e.g. float *restrict ptr
Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points"*

Limits the effects of pointer aliasing. OpenACC compilers often require restrict to determine independence; otherwise the compiler can't parallelize loops that access ptr.
Note: if the programmer violates the declaration, behavior is undefined.
http://en.wikipedia.org/wiki/Restrict
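To make the aliasing point concrete, here is a small sketch (the function name scale_add is illustrative, not from the deck) of the same SAXPY pattern with a restrict-qualified output pointer:

```c
#include <assert.h>

/* Without restrict, the compiler must assume dst and src might overlap,
   so it cannot safely reorder or parallelize the loop iterations.
   With restrict on dst, iterations are known to be independent. */
void scale_add(int n, float a, const float *src, float *restrict dst)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a * src[i] + dst[i];   /* same pattern as SAXPY */
}
```

Calling scale_add(n, a, p, p) would violate the restrict contract and is undefined behavior, which is exactly the situation the declaration rules out.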
Complete SAXPY example code
Trivial first example
Apply a loop directive Learn compiler commands
#include <stdlib.h>

void saxpy(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats
  if (argc > 1) N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);
  return 0;
}

*restrict: "I promise y does not alias x"
Compile and run
C:
pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
Fortran:
pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90
Compiler output:
pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
     8, Generating copyin(x[:n-1])
        Generating copy(y[:n-1])
        Generating compute capability 1.0 binary
        Generating compute capability 2.0 binary
     9, Loop is parallelizable
        Accelerator kernel generated
         9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
        CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
        CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
Example: Jacobi Iteration
Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points.

Common, useful algorithm. Example: solve the Laplace equation in 2D: ∇²f(x,y) = 0
A(i,j) A(i+1,j) A(i-1,j) A(i,j-1) A(i,j+1)
A_{k+1}(i,j) = (A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)) / 4
Jacobi Iteration C Code
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Iterate until converged Iterate across matrix elements Calculate new value from neighbors Compute max error for convergence Swap input/output arrays
Jacobi Iteration Fortran Code
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do
Iterate until converged Iterate across matrix elements Calculate new value from neighbors Compute max error for convergence Swap input/output arrays
OpenMP C Code
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Parallelize loop across CPU threads Parallelize loop across CPU threads
OpenMP Fortran Code
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$omp parallel do shared(m,n,Anew,A) reduction(max:err)
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do

  !$omp parallel do shared(m,n,Anew,A)
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do

  iter = iter + 1
end do
Parallelize loop across CPU threads Parallelize loop across CPU threads
Exercises: General Instructions (compiling)
Exercises are in “exercises/openacc” directory in your home directory
Solutions are in “exercise_solutions/openacc” directory
module load pgi/13.5

To compile, use one of the provided makefiles:
  C:       > make
  Fortran: > make -f Makefile_f90

Remember these compiler flags: -acc -ta=nvidia -Minfo=accel
Exercises: General Instructions (running)
To run, use qsub with one of the provided job files:

  > qsub runit.acc
  > qstat -u <username>    # prints qsub status

Output is placed in slurm.* when finished. The OpenACC job file looks something like this:

  #!/bin/bash
  ./laplace2d_acc

The OpenMP version specifies the number of cores to use (edit this to control the core count):

  #!/bin/bash
  export OMP_NUM_THREADS=6
  ./laplace2d_omp
Exercise 1: Jacobi Kernels
Task: use acc kernels to parallelize the Jacobi loop nests. Edit laplace2D.c or laplace2D.f90 (your choice)

In the 001-laplace2D-kernels directory. Add directives where it helps. Figure out the proper compilation command (similar to SAXPY example)
Compile both with and without OpenACC parallelization Optionally compile with OpenMP (original code has OpenMP directives)
Run OpenACC version with laplace_acc, OpenMP with laplace_omp
Q: can you get a speedup with just kernels directives?
Versus 1 CPU core? Versus 6 CPU cores?
Exercise 1 Solution: OpenACC C
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Execute GPU kernel for loop nest Execute GPU kernel for loop nest
Exercise 1 Solution: OpenACC Fortran
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels

  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  !$acc end kernels

  iter = iter + 1
end do
Generate GPU kernel for loop nest Generate GPU kernel for loop nest
Exercise 1 Solution: C Makefile
CC = pgcc
CCFLAGS =
ACCFLAGS = -acc -ta=nvidia -Minfo=accel
OMPFLAGS = -fast -mp -Minfo
BIN = laplace2d_omp laplace2d_acc

all: $(BIN)

laplace2d_acc: laplace2d.c
	$(CC) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

laplace2d_omp: laplace2d.c
	$(CC) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

clean:
	$(RM) $(BIN)
Exercise 1 Solution: Fortran Makefile
F90 = pgf90
CCFLAGS =
ACCFLAGS = -acc -ta=nvidia -Minfo=accel
OMPFLAGS = -fast -mp -Minfo
BIN = laplace2d_f90_omp laplace2d_f90_acc

all: $(BIN)

laplace2d_f90_acc: laplace2d.f90
	$(F90) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

laplace2d_f90_omp: laplace2d.f90
	$(F90) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

clean:
	$(RM) $(BIN)
Exercise 1: Compiler output (C)
pgcc -tp nehalem -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[0:][0:])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
             Cached references to size [3x(x+2)] block of 'A'
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Exercise 1: Performance
Execution (4096x4096)     Time (s)   Speedup
CPU 1 OpenMP thread       95.1       --
CPU 2 OpenMP threads      50.8       1.9x  (vs. 1 CPU core)
CPU 4 OpenMP threads      30.5       3.1x  (vs. 1 CPU core)
CPU 6 OpenMP threads      24.7       3.9x  (vs. 1 CPU core)
OpenACC GPU               150        0.16x (vs. 6 CPU cores)  FAIL
CPU: Intel E5-2670 8 Cores @ 2.60 GHz GPU: NVIDIA Tesla K20Xm
What went wrong?
Add PGI_ACC_TIME=1 to your environment
time(us): 86,029,004
56: compute region reached 1000 times
56: data copyin reached 1000 times
    device time(us): total=20,538,099 max=20,996 min=20,211 avg=20,538
59: kernel launched 1000 times
    grid: [32x4094] block: [128]
    device time(us): total=3,078,568 max=3,185 min=3,072 avg=3,078
    elapsed time(us): total=3,091,327 max=3,810 min=3,081 avg=3,091
59: reduction kernel launched 1000 times
    grid: [1] block: [256]
    device time(us): total=262,167 max=848 min=255 avg=262
    elapsed time(us): total=279,250 max=860 min=266 avg=279
68: data copyout reached 1000 times
    device time(us): total=20,048,356 max=20,215 min=20,037 avg=20,048
68: compute region reached 1000 times
68: data copyin reached 1000 times
    device time(us): total=20,536,979 max=21,117 min=20,219 avg=20,536
71: kernel launched 1000 times
    grid: [32x4094] block: [128]
    device time(us): total=1,515,375 max=1,526 min=1,503 avg=1,515
    elapsed time(us): total=1,902,546 max=2,883 min=1,526 avg=1,902
77: data copyout reached 1000 times
    device time(us): total=20,049,460 max=20,360 min=20,038 avg=20,049
Huge Data Transfer Bottleneck!
Computation: 5 seconds Data movement: 80 seconds
Basic Concepts
PCI Bus
Transfer data Offload computation
For efficiency, decouple data movement and compute off-load
GPU GPU Memory CPU CPU Memory
Excessive Data Transfers
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  ...
}

A, Anew resident on host -> copied to the accelerator at region entry; copied back to the host at region exit.

These copies happen every iteration of the outer while loop!

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!
DATA MANAGEMENT
Data Construct
Fortran
!$acc data [clause …] structured block !$acc end data
General Clauses
if( condition ) async( expression )
C
#pragma acc data [clause …] { structured block }
Manage data movement. Data regions may be nested.
Data Clauses
copy ( list )
Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
copyin ( list )
Allocates memory on GPU and copies data from host to GPU when entering region.
copyout ( list ) Allocates memory on GPU and copies data to the
host when exiting region.
create ( list )
Allocates memory on GPU but does not copy.
present ( list )

Data is already present on GPU from another containing data region.

Also available: present_or_copy[in|out], present_or_create, deviceptr.
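A hedged sketch of how these clauses combine (the array and function names are illustrative): a is copied to the GPU on entry and back on exit, while tmp is device-only scratch created without any copy. With a compiler that does not understand OpenACC the pragmas are ignored and the code simply runs serially:

```c
#define NPTS 1000

/* Data region: a is copied in and out exactly once;
   tmp is allocated on the GPU only (create clause). */
void smooth(float *a, float *tmp)
{
    #pragma acc data copy(a[0:NPTS]) create(tmp[0:NPTS])
    {
        #pragma acc kernels
        for (int i = 1; i < NPTS-1; ++i)
            tmp[i] = 0.5f * (a[i-1] + a[i+1]);

        #pragma acc kernels
        for (int i = 1; i < NPTS-1; ++i)
            a[i] = tmp[i];
    }   /* a is copied back to the host here */
}
```

Because tmp is only an intermediate, create avoids two pointless transfers; the intermediate values flow between the two kernels entirely on the device.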
Array Shaping
Compiler sometimes cannot determine size of arrays
Must specify explicitly using data clauses and array “shape”
C
#pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])
Fortran
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
Note: data clauses can be used on data, kernels or parallel constructs
Update Construct
Fortran
!$acc update [clause ...]

C

#pragma acc update [clause ...]

Clauses

host( list )
device( list )
if( expression )
async( expression )
Used to update existing data after it has changed in its corresponding copy (e.g. update device copy after host copy changes) Move data from GPU to host, or host to GPU. Data movement can be conditional, and asynchronous.
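A sketch of the update pattern (names are illustrative; compiled without OpenACC the updates are no-ops, since host and device then share one copy):

```c
#define NPTS 256

/* Each step runs a device loop, then the host inspects and edits
   element 0; update keeps the two copies of a[0] in sync. */
void iterate(float *a, int steps)
{
    #pragma acc data copy(a[0:NPTS])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc kernels
            for (int i = 1; i < NPTS; ++i)
                a[i] = 0.5f * a[i] + 1.0f;

            #pragma acc update host(a[0:1])    /* pull a[0] to the host  */
            a[0] += 1.0f;                      /* host-side modification */
            #pragma acc update device(a[0:1])  /* push the change back   */
        }
    }
}
```

Updating just one element per step is far cheaper than ending and re-entering the data region, which would move the whole array every iteration.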
Exercise 2: Jacobi Data Directives
Task: use acc data to minimize transfers in the Jacobi example. Start from given laplace2D.c or laplace2D.f90 (your choice)

In the 002-laplace2d-data directory. Add directives where it helps (hint: [do] while loop)
Q: What speedup can you get with data + kernels directives?
Versus 6 CPU cores? OMP_NUM_THREADS=6 ./laplace2d_omp
Exercise 2 Solution: OpenACC C
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = max(error, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
Exercise 2 Solution: OpenACC Fortran
!$acc data copy(A), create(Anew)
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                 A(i,j-1) + A(i,j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels
  ...
  iter = iter + 1
end do
!$acc end data
Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
Exercise 2: Performance
Execution                  Time (s)   Speedup
CPU 6 OpenMP threads       24.7       --
OpenACC GPU K20Xm          150        0.16x
OpenACC GPU K20Xm-opt      16.5       1.49x
Speedups vs. 6 CPU cores
What went right?
Add PGI_ACC_TIME=1 to your environment
time(us): 4,837,846
28: data region reached 1 time
50: data copyin reached 1 time
    device time(us): total=20,239 max=20,239 min=20,239 avg=20,239
82: data copyout reached 1 time
    device time(us): total=20,063 max=20,063 min=20,063 avg=20,063
56: compute region reached 1000 times
59: kernel launched 1000 times
    grid: [32x4094] block: [128]
    device time(us): total=3,062,986 max=3,180 min=3,060 avg=3,062
    elapsed time(us): total=3,073,489 max=3,520 min=3,070 avg=3,073
59: reduction kernel launched 1000 times
    grid: [1] block: [256]
    device time(us): total=256,582 max=309 min=255 avg=256
    elapsed time(us): total=267,788 max=657 min=264 avg=267
68: compute region reached 1000 times
71: kernel launched 1000 times
    grid: [32x4094] block: [128]
    device time(us): total=1,477,976 max=1,528 min=1,473 avg=1,477
    elapsed time(us): total=1,489,101 max=1,735 min=1,484 avg=1,489
Transfer Bottleneck Eliminated!
Computation: 5 seconds Data movement: negligible
Further speedups
OpenACC gives us more detailed control over parallelization
Via gang, worker, and vector clauses
By understanding more about OpenACC execution model and GPU hardware organization, we can get higher speedups on this code By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance Will tackle these in later exercises
Finding Parallelism in your code
(Nested) for loops are best for parallelization Large loop counts needed to offset GPU/memcpy overhead Iterations of loops must be independent of each other
To help the compiler: restrict keyword (C), independent clause
Compiler must be able to figure out sizes of data regions
Can use directives to explicitly control sizes
Pointer arithmetic should be avoided if possible
Use subscripted arrays, rather than pointer-indexed arrays.
Function calls within accelerated region must be inlineable.
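The independence requirement is easiest to see by contrast; in this sketch (function names are illustrative) the first loop can become a GPU kernel, while the second has a loop-carried dependence and cannot be parallelized as written:

```c
/* Independent iterations: each i touches only its own element,
   so the loop is safe to run as a GPU kernel. */
void square_all(int n, float *restrict v)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        v[i] = v[i] * v[i];
}

/* Loop-carried dependence: iteration i needs the result of i-1,
   so a compiler must refuse to parallelize this loop as written. */
void running_sum(int n, float *v)
{
    for (int i = 1; i < n; ++i)
        v[i] = v[i] + v[i-1];
}
```

Compilers report the second case with messages like "loop carried dependence prevents parallelization"; restructuring (e.g. a parallel scan) is needed before such a loop can go on the GPU.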
Tips and Tricks
(PGI) Use the time option to learn where time is being spent: -ta=nvidia,time
Eliminate pointer arithmetic Inline function calls in directives regions
(PGI): -inline or -inline,levels(<N>)
Use contiguous memory for multi-dimensional arrays Use data regions to avoid excessive memory transfers Conditional compilation with _OPENACC macro
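The _OPENACC macro mentioned above makes it possible to keep one source tree for accelerated and plain builds; a minimal sketch:

```c
/* _OPENACC is defined by OpenACC-aware compilers (its value encodes
   the supported spec version); other compilers leave it undefined. */
const char *build_mode(void)
{
#ifdef _OPENACC
    return "OpenACC build";
#else
    return "serial build";
#endif
}
```

The same guard is useful around calls into the OpenACC runtime API, which do not exist in a plain build.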
OpenACC Learning Resources
OpenACC info, specification, FAQ, samples, and more
http://openacc.org
PGI OpenACC resources
http://www.pgroup.com/resources/accel.htm