ADVANCED OPENACC PROGRAMMING
Jeff Larkin, NVIDIA Developer Technologies


  1. ADVANCED OPENACC PROGRAMMING JEFF LARKIN, NVIDIA DEVELOPER TECHNOLOGIES

  2. AGENDA
     - OpenACC Review
     - Optimizing OpenACC Loops
     - Routines
     - Update Directive
     - Asynchronous Programming
     - Multi-GPU Programming
     - OpenACC Interoperability
     - Atomic Directive
     - Misc. Advice & Techniques
     - Next Steps

  3. OPENACC REVIEW

  4. WHAT ARE COMPILER DIRECTIVES?
     The programmer inserts compiler hints. Execution begins on the CPU; data
     and execution move to the GPU for the annotated loop (the compiler
     generates the GPU code); then data and execution return to the CPU.

     int main() {
       do_serial_stuff();               // Execution begins on the CPU.

       #pragma acc parallel loop        // Data and execution move to the GPU;
       for(int i=0; i < BIGN; i++) {    // the compiler generates GPU code
         ...compute intensive work...   // for the loop.
       }

       do_more_serial_stuff();          // Data and execution return to the CPU.
     }

  5. OPENACC: THE STANDARD FOR GPU DIRECTIVES
     - Simple: Directives are the easy path to accelerate compute-intensive applications.
     - Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors.
     - Portable: GPU directives represent parallelism at a high level, allowing portability to a wide range of architectures with the same code.

  6. [Development cycle diagram] Identify Available Parallelism -> Parallelize Loops with OpenACC -> Optimize Data Locality -> Optimize Loop Performance -> (repeat)

  7. JACOBI ITERATION: C CODE

     while ( err > tol && iter < iter_max ) {   // Iterate until converged
       err = 0.0;
       // Iterate across matrix elements
       for( int j = 1; j < n-1; j++ ) {
         for( int i = 1; i < m-1; i++ ) {
           // Calculate new value from neighbors
           Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                A[j-1][i] + A[j+1][i]);
           // Compute max error for convergence
           err = fmax(err, fabs(Anew[j][i] - A[j][i]));
         }
       }
       // Swap input/output arrays
       for( int j = 1; j < n-1; j++ ) {
         for( int i = 1; i < m-1; i++ ) {
           A[j][i] = Anew[j][i];
         }
       }
       iter++;
     }

  8. JACOBI: FINAL CODE

     #pragma acc data copy(A) create(Anew)            // Optimized data locality
     while ( err > tol && iter < iter_max ) {
       err = 0.0;
       #pragma acc parallel loop reduction(max:err)   // Parallelized loop
       for( int j = 1; j < n-1; j++ ) {
         for( int i = 1; i < m-1; i++ ) {
           Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                A[j-1][i] + A[j+1][i]);
           err = fmax(err, fabs(Anew[j][i] - A[j][i]));
         }
       }
       #pragma acc parallel loop                      // Parallelized loop
       for( int j = 1; j < n-1; j++ ) {
         for( int i = 1; i < m-1; i++ ) {
           A[j][i] = Anew[j][i];
         }
       }
       iter++;
     }

  9. [Speed-up chart, higher is better]
     Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40; Socket/Socket: 6.24X
     Single thread: 1.00X | 8 threads: 4.38X | OpenACC (step 1): 0.82X | OpenACC (step 2): 27.30X

  10. [Development cycle diagram] Identify Available Parallelism -> Parallelize Loops with OpenACC -> Optimize Data Locality -> Optimize Loop Performance -> (repeat)

  11. SPARSE MATRIX/VECTOR PRODUCT
      - Performs matrix/vector product of a sparse matrix
      - Matrices are stored in a row-compressed format
      - Parallelism per row will vary, but is generally not very large

       99 do i=1,a%num_rows
      100   tmpsum = 0.0d0
      101   row_start = arow_offsets(i)
      102   row_end = arow_offsets(i+1)-1
      103   do j=row_start,row_end
      104     acol = acols(j)
      105     acoef = acoefs(j)
      106     xcoef = x(acol)
      107     tmpsum = tmpsum + acoef*xcoef
      108   enddo
      109   y(i) = tmpsum
      110 enddo

  12. PARALLELIZED SPMV
      - Data is already on the device
      - The compiler has vectorized the loop at line 113 and selected a vector length of 256
      - Total application speed-up (including other accelerated routines): 1.08X

      106 !$acc parallel loop present(arow_offsets,acols,acoefs) &
      107 !$acc& private(row_start,row_end,acol,acoef,xcoef) &
      108 !$acc& reduction(+:tmpsum)
      109 do i=1,a%num_rows
      110   tmpsum = 0.0d0
      111   row_start = arow_offsets(i)
      112   row_end = arow_offsets(i+1)-1
      113   do j=row_start,row_end
      114     acol = acols(j)
      115     acoef = acoefs(j)
      116     xcoef = x(acol)
      117     tmpsum = tmpsum + acoef*xcoef
      118   enddo
      119   y(i) = tmpsum
      120 enddo

  13. OPENACC: 3 LEVELS OF PARALLELISM
      - Vector: vector threads work in lockstep (SIMD/SIMT parallelism)
      - Worker: workers compute a vector
      - Gang: gangs have one or more workers and share resources (such as cache, the streaming multiprocessor, etc.); multiple gangs work independently of each other

  14. OPENACC GANG, WORKER, VECTOR CLAUSES
      - gang, worker, and vector can be added to a loop clause
      - A parallel region can only specify one of each: gang, worker, vector
      - Control the size using the following clauses on the parallel region: num_gangs(n), num_workers(n), vector_length(n)

      #pragma acc kernels loop gang
      for (int i = 0; i < n; ++i) {
        #pragma acc loop vector(128)
        for (int j = 0; j < n; ++j)
          ...
      }

      #pragma acc parallel vector_length(128)
      #pragma acc loop gang
      for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
          ...
      }

  15. OPTIMIZED SPMV VECTOR LENGTH

      106 !$acc parallel loop present(arow_offsets,acols,acoefs) &
      107 !$acc& private(row_start,row_end,acol,acoef,xcoef) &
      108 !$acc& vector_length(32)
      109 do i=1,a%num_rows
      110   tmpsum = 0.0d0
      111   row_start = arow_offsets(i)
      112   row_end = arow_offsets(i+1)-1
      113   !$acc loop vector reduction(+:tmpsum)
      114   do j=row_start,row_end
      115     acol = acols(j)
      116     acoef = acoefs(j)
      117     xcoef = x(acol)
      118     tmpsum = tmpsum + acoef*xcoef
      119   enddo
      120   y(i) = tmpsum
      121 enddo

      [Chart: speed-up vs. OpenACC vector length for SPMV, measured at vector lengths 1024, 512, 256, 128, 64, and 32]

  16. PERFORMANCE LIMITER: OCCUPANCY We need more threads!

  17. INCREASED PARALLELISM WITH WORKERS

      106 !$acc parallel loop present(arow_offsets,acols,acoefs) &
      107 !$acc& private(row_start,row_end,acol,acoef,xcoef) &
      108 !$acc& gang worker vector_length(32) num_workers(32)
      109 do i=1,a%num_rows
      110   tmpsum = 0.0d0
      111   row_start = arow_offsets(i)
      112   row_end = arow_offsets(i+1)-1
      113   !$acc loop vector reduction(+:tmpsum)
      114   do j=row_start,row_end
      115     acol = acols(j)
      116     acoef = acoefs(j)
      117     xcoef = x(acol)
      118     tmpsum = tmpsum + acoef*xcoef
      119   enddo
      120   y(i) = tmpsum
      121 enddo

      [Chart: speed-up vs. number of workers (2, 4, 8, 16, 32); annotation: 6X to original]

  18. PERFORMANCE LIMITER: COMPUTE Now we’re compute bound

  19. PERFORMANCE LIMITER: PARALLELISM Really, we’re limited by parallelism per-row.

  20. SPEED-UP STEP BY STEP
      [Chart: cumulative speed-up after each step: Identify Available Parallelism, Parallelize Loops, Optimize Data Locality, Optimize Loop Performance]

  21. OPENACC COLLAPSE CLAUSE
      collapse(n): Transform the following n tightly nested loops into one, flattened loop.
      - Useful when individual loops lack sufficient parallelism or more than 3 loops are nested (gang/worker/vector)
      - Loops must be tightly nested

      #pragma acc parallel
      #pragma acc loop collapse(2)
      for(int i=0; i<N; i++)
        for(int j=0; j<N; j++)
          ...

      is equivalent to:

      #pragma acc parallel
      #pragma acc loop
      for(int ij=0; ij<N*N; ij++)
        ...

  22. NEW CASE STUDY: MANDELBROT SET
      The application generates the image to the right. Each pixel in the image can be independently calculated.
      Skills used: Parallel Loop, Data Region, Update Directive, Asynchronous Pipelining

  23. MANDELBROT CODE
      The mandelbrot() function calculates the color for each pixel
      (note: the iteration counter must be declared outside the loop so the
      return statement can use it):

      // Calculate value for a pixel
      unsigned char mandelbrot(int Px, int Py) {
        double x0 = xmin + Px*dx;
        double y0 = ymin + Py*dy;
        double x = 0.0;
        double y = 0.0;
        int i;
        for(i = 0; x*x + y*y < 4.0 && i < MAX_ITERS; i++) {
          double xtemp = x*x - y*y + x0;
          y = 2*x*y + y0;
          x = xtemp;
        }
        return (double)MAX_COLOR*i/MAX_ITERS;
      }

      Within main() there is a doubly-nested loop that calculates each pixel
      independently:

      // Used in main()
      for(int y=0; y<HEIGHT; y++) {
        for(int x=0; x<WIDTH; x++) {
          image[y*WIDTH+x] = mandelbrot(x, y);
        }
      }

  24. ROUTINES

  25. OPENACC ROUTINE DIRECTIVE
      Specifies that the compiler should generate a device copy of the function/subroutine and what type of parallelism the routine contains.
      Clauses:
      - gang/worker/vector/seq: specifies the level of parallelism contained in the routine
      - bind: specifies an optional name for the routine, also supplied at the call site
      - no_host: the routine will only be used on the device
      - device_type: specialize this routine for a particular device type

  26. MANDELBROT: ROUTINE DIRECTIVE
      At the function source, the function needs to be built for the GPU; it
      will be called by each thread (sequentially):

      // mandelbrot.h
      #pragma acc routine seq
      unsigned char mandelbrot(int Px, int Py);

      At the call site, the compiler needs to know that the function will be
      available on the GPU and that it is a sequential routine:

      // Used in main()
      #pragma acc parallel loop
      for(int y=0; y<HEIGHT; y++) {
        for(int x=0; x<WIDTH; x++) {
          image[y*WIDTH+x] = mandelbrot(x, y);
        }
      }
