CACHE DIRECTIVE OPTIMIZATION IN THE OPENACC PROGRAMMING MODEL
Xiaonan (Daniel) Tian, Brent Leback, and Michael Wolfe, PGI


  1. CACHE DIRECTIVE OPTIMIZATION IN THE OPENACC PROGRAMMING MODEL Xiaonan (Daniel) Tian, Brent Leback, and Michael Wolfe PGI

  2. GPU ARCHITECTURE
     Per-SM resources: Threads, Register Files, Shared Memory, L1 Cache, Read-Only Data Cache
     Chip-wide resources: L2 Cache, Texture Memory, Constant Memory, Global Memory

  3. USING SHARED MEMORY WITH CUDA
     Creating shared memory:
       Static:  __shared__ int s[64];
       Dynamic: extern __shared__ int s[];
     Handling data races: __syncthreads();

  4. PROS AND CONS OF THE CUDA APPROACH
     Pros: better control over the hardware
     Cons: requires familiarity with CUDA and the GPU; may require redesigning the algorithm; manual thread synchronization; shared-memory bank conflicts

  5. OPENACC: A DIRECTIVE-BASED APPROACH
     Rich set of data directives
     Two offload region constructs: parallel and kernels
     Three levels of parallelism: gang, worker, and vector
     Example (the slide shows the same program twice: serial CPU code, and the identical code with OpenACC compiler directives added):
       Program myscience
         ... serial code ...
         !$acc kernels
         do k = 1,n1
           do i = 1,n2
             ... parallel code ...
           enddo
         enddo
         !$acc end kernels
         ...
       End Program myscience

  6. CACHE DIRECTIVE CONSTRUCT
     C/C++:   #pragma acc cache (a[lower1:length1][lower2:length2])
     Fortran: !$acc cache (a(lower1:upper1, lower2:upper2))
     Examples:
       #pragma acc cache (a[i-1:3][j])  // i and j as loop indices
       !$acc cache (a(J, :))            ! cache an entire array dimension
       !$acc cache (a)                  ! cache the entire array

  7. CASE STUDIES
     Partial array cached
     Entire array dimension cached
     Entire array cached

  8. PARTIAL ARRAY CACHED: 1D CACHE
     !$acc loop gang vector
     DO i=4, M
       C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + A(i+1) + A(i+2) + A(i+3) + A(i+4))/9.0
     (Figure: each thread reads its stencil window from global memory into its own register file.)

  9. PARTIAL ARRAY CACHED: 1D CACHE
     Same loop as above. Without the cache directive, each thread issues nine loads from global memory per output element.

  10. PARTIAL ARRAY CACHED: 1D CACHE
     !$acc loop gang vector
     DO i=4, M
       !$acc cache (A(i-4:i+4))
       C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) + A(i+1) + A(i+2) + A(i+3) + A(i+4))/9.0
     (Figure: first load: threads t0, t1, ... cooperatively fill the shared-memory tile from global memory.)

  11. PARTIAL ARRAY CACHED: 1D CACHE
     (Figure: second load: the remaining halo elements are brought into the shared-memory tile.)

  12. PARTIAL ARRAY CACHED: 1D CACHE
     (Figure: each thread now reads its window from shared memory into registers instead of from global memory.)

  13. PARTIAL ARRAY CACHED: 1D CACHE
     (Figure: subsequent reads of the window are likewise served from shared memory.)

  14. PARTIAL ARRAY CACHED: 1D CACHE
     (Chart: speedup of the 1D 9-point stencil with the cache directive, M = 128*1024*1024, on P100 and K80.)

  15. PARTIAL ARRAY CACHED: 1D VS 2D
     Same 2D cross stencil, shown with two different cache clauses:
       1D cache: !$acc cache (A(i-4:i+4, j))
       2D cache: !$acc cache (A(i-4:i+4, j-4:j+4))
     !$acc loop gang
     DO j=1,N
       !$acc loop vector
       DO i=1, M
         !$acc cache (A(i-4:i+4, j))   ! or the 2D clause above
         C(i, j) = (A(i-4, j) + A(i-3, j) + A(i-2, j) + A(i-1, j) + A(i, j)
                 +  A(i+1, j) + A(i+2, j) + A(i+3, j) + A(i+4, j)
                 +  A(i, j-4) + A(i, j-3) + A(i, j-2) + A(i, j-1)
                 +  A(i, j+1) + A(i, j+2) + A(i, j+3) + A(i, j+4)) * coeff
     (Chart: 2D stencil cache performance, N = 16*1024, M = 16*1024, 1D cache vs. 2D cache on P100 and K80.)

  16. PARTIAL ARRAY CACHED: UNCOALESCED ACCESS
     !$acc loop gang
     DO j=1,N
       !$acc loop vector
       DO i=1, M
         !$acc cache (B(j, i-1:i+1))
         C(i, j) = (A(i-1, j) + A(i, j) + A(i+1, j)
                 +  B(j, i-1) + B(j, i) + B(j, i+1)) * coeff
     (Chart: speedup from applying the cache directive to the uncoalesced B accesses, N = M = 8192, on P100 and K80.)

  17. ENTIRE ARRAY DIMENSIONS CACHED: ORNL CAAR ACME
     !$acc parallel loop gang collapse(3)
     do ie = 1, nelemd
       do q = 1, qsize
         do ks = 1, nlev, kchunk
           !$acc cache(s(:,:,ks:ks+kchunk-1,q,ie))
           !$acc loop vector collapse(3)
           do k = 1, kchunk
             do j = 1, np
               do i = 1, np
                 do l = 1, np
                   dsdx00 = dsdx00 + deriv_dvv(l,i)*s(l,j,ks+k-1,q,ie)
                 …

  18. ENTIRE ARRAY CACHED: Nonhydrostatic Icosahedral Model (NIM)
     !$acc parallel loop gang private(fu0, sumu, …)
     do ipn=IPS,IPE
       !$acc cache(fu0, sumu, …)
       !$acc loop vector
       do k=1,NZ
         fu0(k) = 0.0
         …
       enddo
       !$acc loop vector
       do k=1,NZ
         fu0(k) = fu0(k) + sumu(k)
         …
     end do

  19. VARIABLE-LENGTH ARRAY
     real :: a(NX)
     …
     !$acc loop gang private(a)
     DO j=1,N
       !$acc cache (a)
       !$acc loop vector
       DO i=1, M
         …
     Compile with: pgfortran -acc -ta=tesla:safecache a.f90 -Minfo

  20. PERFORMANCE DATA
     Kernels from real-world applications: MPAS, ACME FORCE, PSTADV, VDMINTV, DIAG, FLUX
     (Chart: cache directive performance improvement per kernel, on P100 and K80.)

  21. DISCUSSION
     Recommendation often given: if there is data reuse within the thread block, use shared memory to cache that data and reduce access latency.
     But following the recommendation ≠ better performance.
     Performance factors: thread occupancy, memory access latency, hardware platform, others.

  22. CASE STUDY: ORNL DIRAC
     Array   s         is        id
     Type    double    int       int
     Size    1599*8    1599*4    1599*4
     (Charts: occupancy (%) vs. performance (ms) on K80 and P100 for four variants: nocache, cache-s, cache-id-is, cache-all.)

  23. CONCLUSION
     Summary: the cache directive does improve performance in real-world applications.
     Pros: helps reduce uncoalesced memory accesses; combined with gang-level private arrays, avoids fetching data from global memory.
     Cons: no guaranteed performance improvement if shared memory is overused.
