CACHE DIRECTIVE OPTIMIZATION IN THE OPENACC PROGRAMMING MODEL
Xiaonan (Daniel) Tian, Brent Leback, and Michael Wolfe, PGI
GPU ARCHITECTURE

[Figure: GPU memory hierarchy: threads with register files; per-SM shared memory, L1 cache, and read-only data cache; a shared L2 cache; and off-chip global, texture, and constant memory]
USING SHARED MEMORY WITH CUDA
Creating shared memory:

    Static:  __shared__ int s[64];
    Dynamic: extern __shared__ int s[];

Handling data races within a thread block:

    __syncthreads();
PROS AND CONS OF CUDA APPROACH
Pros:
  - Fine-grained control over the hardware

Cons:
  - Requires familiarity with CUDA and the GPU
  - Often requires redesigning the algorithm
  - Thread synchronization
  - Shared-memory bank conflicts
OPENACC: A DIRECTIVE-BASED APPROACH
  - Rich set of data directives
  - Two offload region constructs: parallel and kernels
  - Three levels of parallelism: gang, worker, and vector

    Program myscience
      ... serial code ...
      !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
      !$acc end kernels
      ...
    End Program myscience

[Figure: OpenACC compiler directives offload the annotated loop nest from the CPU to the GPU]
CACHE DIRECTIVE CONSTRUCT

C/C++:
    #pragma acc cache (a[lower1:length1][lower2:length2])

Fortran:
    !$acc cache (a(lower1:upper1, lower2:upper2))

Examples:
    #pragma acc cache (a[i-1:3][j])  // i and j are loop indices
    !$acc cache (a(J, :))            ! cache an entire dimension
    !$acc cache (a)                  ! cache the entire array
CASE STUDIES
  - Partial array cached
  - Entire array dimension cached
  - Entire array cached
PARTIAL ARRAY CACHED: 1D CACHE
!$acc loop gang vector
DO i=4, M
  C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) &
        + A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0

[Figure: without the cache directive, each thread loads its stencil window from global memory into its own registers]
PARTIAL ARRAY CACHED: 1D CACHE
!$acc loop gang vector
DO i=4, M
  C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) &
        + A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0

[Figure: nine global-memory loads per thread, one for each stencil point]
PARTIAL ARRAY CACHED: 1D CACHE
!$acc loop gang vector
DO i=4, M
  !$acc cache (A(i-4:i+4))
  C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) &
        + A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0

[Figure: first cooperative load: adjacent threads t0, t1, ... fetch contiguous elements of A from global memory into shared memory]
PARTIAL ARRAY CACHED: 1D CACHE
!$acc loop gang vector
DO i=4, M
  !$acc cache (A(i-4:i+4))
  C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) &
        + A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0

[Figure: second cooperative load fills the remaining halo elements of the shared-memory tile]
PARTIAL ARRAY CACHED: 1D CACHE
!$acc loop gang vector
DO i=4, M
  !$acc cache (A(i-4:i+4))
  C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) &
        + A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0

[Figure: each thread then reads its stencil window from shared memory into registers]
PARTIAL ARRAY CACHED: 1D CACHE
!$acc loop gang vector
DO i=4, M
  !$acc cache (A(i-4:i+4))
  C(i) = (A(i-4) + A(i-3) + A(i-2) + A(i-1) + A(i) &
        + A(i+1) + A(i+2) + A(i+3) + A(i+4)) / 9.0

[Chart: speedup of the 1D 9-point stencil with the cache directive on P100 and K80, M=128*1024*1024]
PARTIAL ARRAY CACHED: 1D VS 2D
2D cache clause:

    !$acc loop gang
    DO j=1,N
      !$acc loop vector
      DO i=1, M
        !$acc cache (A(i-4:i+4, j-4:j+4))
        C(i, j) = (A(i-4, j) + A(i-3, j) + A(i-2, j) + A(i-1, j) + A(i, j) &
                 + A(i+1, j) + A(i+2, j) + A(i+3, j) + A(i+4, j) &
                 + A(i, j-4) + A(i, j-3) + A(i, j-2) + A(i, j-1) &
                 + A(i, j+1) + A(i, j+2) + A(i, j+3) + A(i, j+4)) * coeff

1D cache clause (unit-stride dimension only):

    !$acc loop gang
    DO j=1,N
      !$acc loop vector
      DO i=1, M
        !$acc cache (A(i-4:i+4, j))
        C(i, j) = ... same stencil body as above ...

[Chart: 2D stencil cache performance, 1D cache vs 2D cache, on P100 and K80 (N=16*1024, M=16*1024)]
PARTIAL ARRAY CACHED: UNCOALESCED
!$acc loop gang
DO j=1,N
  !$acc loop vector
  DO i=1, M
    !$acc cache (B(j, i-1:i+1))
    C(i, j) = (A(i-1, j) + A(i, j) + A(i+1, j) &
             + B(j, i-1) + B(j, i) + B(j, i+1)) * coeff

[Chart: speedup of applying the cache directive to the uncoalesced accesses to B, on P100 and K80, N=M=8192]
ENTIRE ARRAY DIMENSIONS CACHED
ORNL CAAR ACME
!$acc parallel loop gang collapse(3)
do ie = 1 , nelemd
  do q = 1 , qsize
    do ks = 1 , nlev, kchunk
      !$acc cache(s(:,:,ks:ks+kchunk-1,q,ie))
      !$acc loop vector collapse(3)
      do k = 1 , kchunk
        do j = 1 , np
          do i = 1 , np
            do l = 1 , np
              dsdx00 = dsdx00 + deriv_dvv(l,i)*s(l,j,ks+k-1,q,ie)
            …
ENTIRE ARRAY CACHED
Nonhydrostatic Icosahedral Model: NIM
!$acc parallel loop gang private(fu0, sumu, …)
do ipn=IPS,IPE
  !$acc cache(fu0, sumu, …)
  !$acc loop vector
  do k=1,NZ
    fu0(k) = 0.0
    …
  enddo
  !$acc loop vector
  do k=1,NZ
    fu0(k) = fu0(k) + sumu(k)
    …
  end do
VARIABLE-LENGTH ARRAY
real :: a(NX)
…
!$acc loop gang private(a)
DO j=1,N
  !$acc cache (a)
  !$acc loop vector
  DO i=1, M
    …

    pgfortran -acc -ta=tesla:safecache a.f90 -Minfo
PERFORMANCE DATA
[Chart: speedup from the cache directive on P100 and K80 for kernels from real-world apps: MPAS, ACME, FORCE, PSTADV, VDMINTV, DIAG, FLUX]
DISCUSSION
A recommendation often given: if there is data reuse within the thread block, use shared memory to cache that data and reduce access latency.

Following the recommendation ≠ better performance.

Performance factors beyond memory access latency: thread occupancy, the hardware platform, and others.
CASE STUDY: ORNL DIRAC
Arrays considered for caching:

    Array   S (double)   Is (int)   Id (int)
    Size    1599*8       1599*4     1599*4

[Charts: occupancy vs. performance on P100 and on K80 for four variants: nocache, cache-s, cache-id-is, cache-all]
CONCLUSION
Summary: the cache directive does improve performance in real-world applications.

Pros:
  - Helps reduce uncoalesced memory accesses
  - Combined with gang-level private arrays, avoids data fetches from global memory

Cons:
  - No guarantee of a performance improvement if shared memory is overused