Targeting GPUs with OpenMP 4.5 Device Directives - James Beyer - PowerPoint PPT Presentation




SLIDE 1

April 4-7, 2016 | Silicon Valley

James Beyer, NVIDIA Jeff Larkin, NVIDIA

Targeting GPUs with OpenMP 4.5 Device Directives

SLIDE 2

AGENDA

OpenMP Background Step by Step Case Study Parallelize on CPU Offload to GPU Team Up Increase Parallelism Improve Scheduling Additional Experiments Conclusions

SLIDE 3

Motivation

Multiple compilers are in development to support OpenMP offloading to NVIDIA GPUs. Articles and blog posts are being written by early adopters trying OpenMP on NVIDIA GPUs, and most of them have gotten it wrong. If you want to try OpenMP offloading to NVIDIA GPUs, we want you to know what to expect and how to get reasonable performance.

4/1/2016

SLIDE 4

A Brief History of OpenMP

1996 - Architecture Review Board (ARB) formed by several vendors implementing their own directives for Shared Memory Parallelism (SMP).
1997 - Version 1.0 released for C/C++ and Fortran with support for parallelizing loops across threads.
2000, 2002 - Version 2.0 of the Fortran and C/C++ specifications released.
2005 - Version 2.5 released, combining both specs into one.
2008 - Version 3.0 released, added support for tasking.
2011 - Version 3.1 released, improved support for tasking.
2013 - Version 4.0 released, added support for offloading (and more).
2015 - Version 4.5 released, improved support for offloading targets (and more).


SLIDE 5

OpenMP In Clang

Multi-vendor effort to implement OpenMP in Clang (including offloading).
Current status: interesting.
How to get it: https://www.ibm.com/developerworks/community/blogs/8e0d7b52-b996-424b-bb33-345205594e0d?lang=en


SLIDE 6

OpenMP In Clang

How to get it, our way

Step one - make sure you have gcc, cmake, python and cuda installed and updated.
Step two - look at:
    http://llvm.org/docs/GettingStarted.html
    https://www.ibm.com/developerworks/community/blogs/8e0d7b52-b996-424b-bb33-345205594e0d?lang=en
Step three -
    git clone https://github.com/clang-ykt/llvm_trunk.git
    cd llvm_trunk/tools
    git clone https://github.com/clang-ykt/clang_trunk.git clang
    cd ../projects
    git clone https://github.com/clang-ykt/openmp.git


SLIDE 7

OpenMP In Clang

How to build it

cd ..
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=DEBUG|RELEASE|MinSizeRel \
    -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
    -DCMAKE_INSTALL_PREFIX="<where you want it>" \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_BACKTRACES=ON \
    -DLLVM_ENABLE_WERROR=OFF \
    -DBUILD_SHARED_LIBS=OFF \
    -DLLVM_ENABLE_RTTI=ON \
    -DCMAKE_C_COMPILER="<GCC you want used>" \
    -DCMAKE_CXX_COMPILER="<G++ you want used>" \
    -G "Unix Makefiles" \  !there are other options, I like this one
    ../llvm_trunk
make [-j#]
make install


SLIDE 8

OpenMP In Clang

How to use it

export LIBOMP_LIB=<llvm-install-lib>
export OMPTARGET_LIBS=$LIBOMP_LIB
export LIBRARY_PATH=$OMPTARGET_LIBS
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OMPTARGET_LIBS
export PATH=$PATH:<llvm_install-bin>

clang -O3 -fopenmp=libomp -omptargets=nvptx64sm_35-nvidia-linux …


SLIDE 9

Case Study: Jacobi Iteration

SLIDE 10

Example: Jacobi Iteration

Iteratively converges to correct value (e.g. temperature) by computing new values at each point from the average of neighboring points. Common, useful algorithm. Example: solve the Laplace equation in 2D: ∇²f(x, y) = 0

Stencil: A(i,j) is updated from its four neighbors A(i+1,j), A(i-1,j), A(i,j-1), A(i,j+1):

A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

SLIDE 11

Jacobi Iteration

while ( err > tol && iter < iter_max ) {
    err = 0.0;
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = max(err, abs(Anew[j][i] - A[j][i]));
        }
    }
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Annotations: the outer while is the convergence loop; the first loop nest calculates the next values; the second exchanges values.

SLIDE 12

Parallelize on the CPU

SLIDE 13

OpenMP Worksharing

PARALLEL Directive Spawns a team of threads Execution continues redundantly on all threads of the team. All threads join at the end and the master thread continues execution.

[Diagram: OMP PARALLEL spawns a thread team from the master thread]

SLIDE 14

OpenMP Worksharing

FOR/DO (Loop) Directive Divides (“workshares”) the iterations of the next loop across the threads in the team How the iterations are divided is determined by a schedule.

[Diagram: OMP FOR workshares loop iterations across the thread team]

SLIDE 15

CPU-Parallelism

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma omp parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }
    #pragma omp parallel for
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

Each pragma creates a team of threads and workshares the following loop nest across those threads.

SLIDE 16

CPU-Parallelism

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma omp parallel
    {
        #pragma omp for reduction(max:error)
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        #pragma omp barrier
        #pragma omp for
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
    }
    if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

Annotations: create a team of threads; the barrier prevents threads from executing the second loop nest until the first completes; each omp for workshares its loop.

SLIDE 17

CPU-Parallelism

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma omp parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
        #pragma omp simd
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }
    #pragma omp parallel for
    for( int j = 1; j < n-1; j++) {
        #pragma omp simd
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

Some compilers want a SIMD directive to vectorize on CPUs.
SLIDE 18

CPU Scaling (Smaller is Better)

[Chart: execution time in seconds vs. OpenMP threads]
1 thread: 1.00X | 2 threads: 1.70X | 4 threads: 2.94X | 8 threads: 3.52X | 16 threads: 3.40X

Intel Xeon E5-2690 v2 @ 3.00GHz

SLIDE 19

Targeting the GPU

SLIDE 20

OpenMP Offloading

TARGET Directive
Offloads execution and associated data from the CPU to the GPU.

  • The target device owns the data; accesses by the CPU during the execution of the target region are forbidden.
  • Data used within the region may be implicitly or explicitly mapped to the device.
  • All of OpenMP is allowed within target regions, but only a subset will run well on GPUs.

SLIDE 21

Target the GPU

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma omp target
    {
        #pragma omp parallel for reduction(max:error)
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        #pragma omp parallel for
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
    }
    if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

Moves this region of code to the GPU and implicitly maps data.

SLIDE 22

Target the GPU

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma omp target map(alloc:Anew[:n+2][:m+2]) map(tofrom:A[:n+2][:m+2])
    {
        #pragma omp parallel for reduction(max:error)
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        #pragma omp parallel for
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
    }
    if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

Moves this region of code to the GPU and explicitly maps data.

SLIDE 23

Execution Time (Smaller is Better)

[Chart: execution time in seconds] Original: 1.00X | CPU Threaded: 5.12X | GPU Threaded: 0.12X (893 seconds)

NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @ 3.00GHz

SLIDE 24

GPU Architecture Basics

GPUs are composed of 1 or more independent parts, known as Streaming Multiprocessors ("SMs"). Threads are organized into threadblocks. Threads within the same threadblock run on an SM and can synchronize. Threads in different threadblocks (even if they're on the same SM) cannot synchronize.

SLIDE 25

Teaming Up

SLIDE 26

OpenMP Teams

TEAMS Directive
To better utilize the GPU resources, use many thread teams via the TEAMS directive.

  • Spawns 1 or more thread teams with the same number of threads
  • Execution continues on the master threads of each team (redundantly)
  • No synchronization between teams

[Diagram: OMP TEAMS]

SLIDE 27

OpenMP Teams

DISTRIBUTE Directive
Distributes the iterations of the next loop to the master threads of the teams.

  • Iterations are distributed statically.
  • There are no guarantees about the order in which teams will execute.
  • No guarantee that all teams will execute simultaneously.
  • Does not generate parallelism/worksharing within the thread teams.

[Diagram: OMP TEAMS + OMP DISTRIBUTE]

SLIDE 28

OpenMP Data Offloading

TARGET DATA Directive
Offloads data from the CPU to the GPU, but not execution.

  • The target device owns the data; accesses by the CPU during the execution of contained target regions are forbidden.
  • Useful for sharing data between TARGET regions.
  • NOTE: A TARGET region is a TARGET DATA region.
SLIDE 29

Teaming Up

#pragma omp target data map(alloc:Anew) map(A)
while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma omp target teams distribute parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }
    #pragma omp target teams distribute parallel for
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
}

Explicitly maps arrays for the entire while loop.

  • Spawns thread teams
  • Distributes iterations to those teams
  • Workshares within those teams

SLIDE 30

Execution Time (Smaller is Better)

[Chart: execution time in seconds] Original: 1.00X | CPU Threaded: 5.12X | GPU Threaded: 0.12X (893 seconds) | GPU Teams: 1.01X

NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @ 3.00GHz

SLIDE 31

Increasing Parallelism

SLIDE 32

Increasing Parallelism

Currently both our distributed and workshared parallelism comes from the same loop.

  • We could move the PARALLEL to the inner loop
  • We could collapse them together

The COLLAPSE(N) clause

  • Turns the next N loops into one, linearized loop.
  • This will give us more parallelism to distribute, if we so choose.


SLIDE 33

Splitting Teams & Parallel

#pragma omp target teams distribute
for( int j = 1; j < n-1; j++) {
    #pragma omp parallel for reduction(max:error)
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++) {
    #pragma omp parallel for
    for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
    }
}

Distribute the "j" loop over teams; workshare the "i" loop over threads.
SLIDE 34

Collapse

#pragma omp target teams distribute parallel for \
        reduction(max:error) collapse(2)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}
#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
    }
}

Collapse the two loops into one.

SLIDE 35

Execution Time (Smaller is Better)

[Chart: execution time in seconds] Original: 1.00X | CPU Threaded: 5.12X | GPU Threaded: 0.12X (893 seconds) | GPU Teams: 1.01X | GPU Split: 2.47X | GPU Collapse: 0.96X

NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @ 3.00GHz

SLIDE 36

Improve Loop Scheduling

SLIDE 37

Improve Loop Scheduling

Most OpenMP compilers will apply a static schedule to workshared loops, assigning iterations in N / num_threads chunks.

  • Each thread will execute contiguous loop iterations, which is very cache & SIMD friendly
  • This is great on CPUs, but bad on GPUs

The SCHEDULE() clause can be used to adjust how loop iterations are scheduled.

SLIDE 38

Effects of Scheduling


!$OMP PARALLEL FOR SCHEDULE(STATIC)
    Thread 0: 0 - (n/2-1)    Thread 1: (n/2) - (n-1)
    Cache and vector friendly

!$OMP PARALLEL FOR SCHEDULE(STATIC,1)*
    Thread 0: 0, 2, 4, …, n-2    Thread 1: 1, 3, 5, …, n-1
    Memory coalescing friendly

*There's no reason a compiler couldn't do this for you.

SLIDE 39

Improved Schedule (Split)

#pragma omp target teams distribute
for( int j = 1; j < n-1; j++) {
    #pragma omp parallel for reduction(max:error) schedule(static,1)
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++) {
    #pragma omp parallel for schedule(static,1)
    for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
    }
}

Assign adjacent threads adjacent loop iterations.

SLIDE 40

Improved Schedule (Collapse)

#pragma omp target teams distribute parallel for \
        reduction(max:error) collapse(2) schedule(static,1)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}
#pragma omp target teams distribute parallel for \
        collapse(2) schedule(static,1)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
    }
}

Assign adjacent threads adjacent loop iterations.

SLIDE 41

Execution Time (Smaller is Better)

[Chart: execution time in seconds] Original: 1.00X | CPU Threaded: 5.12X | GPU Threaded: 0.12X (893 seconds) | GPU Teams: 1.01X | GPU Split: 2.47X | GPU Collapse: 0.96X | GPU Split Sched: 4.00X | GPU Collapse Sched: 17.42X

NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @ 3.00GHz

SLIDE 42

How to Write Portable Code

#pragma omp \
#ifdef GPU
    target teams distribute \
#endif
    parallel for reduction(max:error) \
#ifdef GPU
    collapse(2) schedule(static,1)
#endif
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}

Ifdefs can be used to choose particular directives per device at compile-time

SLIDE 43

How to Write Portable Code

usegpu = 1;
#pragma omp target teams distribute parallel for reduction(max:error) \
#ifdef GPU
    collapse(2) schedule(static,1) \
#endif
    if(target:usegpu)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}

The OpenMP if clause can help some too (4.5 improves this). Note: This example assumes that a compiler will choose to generate 1 team when not in a target, making it the same as a standard “parallel for.”

SLIDE 44

Additional Experiments

SLIDE 45

Increase the Number of Teams

By default, CLANG will poll the number of SMs on your GPU and run that many teams of 1024 threads. This is not always ideal, so we tried increasing the number of teams using the num_teams clause.


Test | SMs   | 2*SMs | 4*SMs | 8*SMs
A    | 1.00X | 1.00X | 1.00X | 1.00X
B    | 1.00X | 1.02X | 1.16X | 1.09X
C    | 1.00X | 0.87X | 0.94X | 0.96X
D    | 1.00X | 1.00X | 1.00X | 0.99X

SLIDE 46

Decreased Threads per Team

CLANG always generates CUDA threadblocks of 1024 threads, even when the num_threads clause is used. This number is frequently not ideal, but setting num_threads does not reduce the threadblock size. Ideally we'd like to use num_threads and num_teams to generate more, smaller threadblocks. We suspect the best performance would come from collapsing, reducing the threads per team, and then using the remaining iterations to generate many teams, but we are unable to do this experiment.


SLIDE 47

Scalar Copy Overhead

In OpenMP 4.0 scalars are implicitly mapped "tofrom", resulting in very high overhead. Application impact: ~10%. OpenMP 4.5 remedied this by making the default behavior of scalars "firstprivate".


Note: In the meantime, some of this overhead can be mitigated by explicitly mapping your scalars "to".

SLIDE 48

Conclusions

SLIDE 49

Conclusions

It is now possible to use OpenMP to program for GPUs, but the software is still very immature. OpenMP for a GPU will not look like OpenMP for a CPU. Performance will vary significantly depending on the exact directives you use. (149X in our example code)