

SLIDE 1

Munich, Oct. 12 2017

Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs

Pau Farré Antonio J. Peña www.bsc.es

SLIDE 2

PROLOGUE

SLIDE 3


Barcelona Supercomputing Center

Marenostrum 4

  • 13.7 PetaFlop/s
  • General Purpose Computing: 3,400 Xeon nodes, 11 PF/s
  • Emerging Technologies:
    • POWER9 + Pascal: 1.5 PF/s
    • Knights Landing and Knights Hill: 0.5 PF/s
    • 64-bit ARMv8: 0.5 PF/s
SLIDE 4


Mission of BSC Scientific Departments

EARTH SCIENCES

To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications

LIFE SCIENCES

To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics)

CASE

To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)

COMPUTER SCIENCES

To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency

SLIDE 5


BSC Training on European Level - PATC

PRACE Advanced Training Centers

PRACE designated six Advanced Training Centers:

  • Barcelona Supercomputing Center (Spain)
  • CINECA Consorzio Interuniversitario (Italy),
  • CSC - IT Center for Science Ltd (Finland),
  • EPCC at the University of Edinburgh (UK),
  • Gauss Centre for Supercomputing (Germany), and
  • Maison de la Simulation (France)

Mission of PATCs: to carry out and coordinate training and education activities that foster the efficient usage of the infrastructure available through PRACE.

SLIDE 6


BSC & The Global IT Industry 2016

BSC-Microsoft Research Centre

IBM-BSC Deep Learning Center

NVIDIA GPU Center of Excellence

Intel-BSC Exascale Lab

SLIDE 7


Projects with the Energy Industry

Repsol-BSC Research Center: research into advanced technologies for the exploration of hydrocarbons, subterranean and subsea reserve modelling, and fluid flows

Iberdrola Renovables

SLIDE 8

NVIDIA Award to BSC/UPC (since 2011)

R&D around GPU Computing (currently ~10 core collaborators)

– Architecture, Programming Models, Libraries, Applications, Porting

Education, Training, Dissemination (free registration)

– PUMPS Summer School (mainly advanced CUDA)
– PRACE Advanced Training Center courses on Introduction to CUDA & OpenACC
– Severo Ochoa Seminars on Deep Learning & Image/Video Processing
– Always open to research collaborations, internships, advising, hiring

BSC/UPC NVIDIA GPU Center of Excellence


antonio.pena@bsc.es

SLIDE 9

Pau Farré, Jr. Engineer

– GCoE Core Team
– GPU porting and optimization specialist
– Did most of the hard work for this lab

Antonio J. Peña, Sr. Researcher

– Manager of the GCoE
– Juan de la Cierva Fellow
– Prospective Marie Curie Fellow
– Activity Leader, “Accelerators and Communications for HPC”
– The one to blame if anything goes wrong

Introductions


antonio.pena@bsc.es pau.farre@bsc.es

SLIDE 10

CUDA (Compute Unified Device Architecture)

– Runtime & Driver APIs (high-level / low-level)
– Specific to NVIDIA GPUs: best performance & control

OpenACC (Open Accelerators)

– Open Standard
– Higher-level, pragma-based
– Aiming at portability across heterogeneous hardware
– For NVIDIA GPUs, implemented on top of CUDA

OpenCL (Open Computing Language)

– Open Standard
– Low-level, similar to the CUDA Driver API
– Multi-target, portable*

(Intentionally leaving out weird stuff like CG, OpenGL, …)

Introduction: Programming Models for GPU Computing


SLIDE 11

Motivation: Coding Productivity & Performance


Approaches compared: CUDA, OpenACC, OpenACC + CUDA, OmpSs + CUDA, OmpSs + OpenACC

OmpSs:

– High-level, task-based, pragma-based, developed @ BSC
– Targets accelerators combined with CUDA or (recently) OpenACC

Coding Productivity vs. Performance. Don’t get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret the colors as relative to each other. OpenACC may well deliver more than the performance you *need*; however, it offers the least control over performance among the discussed alternatives.

SLIDE 12

HANDS-ON

SLIDE 13

LAB CONNECTION INSTRUCTIONS - Part 1

Go to nvlabs.qwiklab.com
Sign in or create an account
Check for Access Codes (each day):

  • Click My Account
  • Click Credits & Subscriptions

If there are no Access Codes, ask a TA for a paper one.

Please tear in half once used

An Access Code is needed to start the lab

WIFI SSID: GTC_Hands_On Password: HandsOnGpu

SLIDE 14

LAB CONNECTION INSTRUCTIONS - Part 2

  • 1. Click Qwiklabs in upper-left
  • 2. Select GTC2017 Class
  • 3. Find lab and click on it
  • 4. Click on Select
  • 5. Click Start Lab


SLIDE 15

1. Identify Parallelism

○ Using a CPU profiling tool (example: nvprof --cpu-profiling on)

2. Express Parallelism

○ Declare parallel regions with directives

3. Express Data Locality

○ Help OpenACC figure out how to manage data

4. Optimize

○ Using nvprof & the NVIDIA Visual Profiler

Steps to Parallelize with OpenACC


SLIDE 16
  • Analyzes physical properties of the subsoil from seismic measures
  • Elastic wave propagator + linearly elastic stress-strain relationships

– Six different stress components
– Finite differences (FD) method with a Fully Staggered Grid (FSG)

FWI – A Full Wave Inversion Oil & Gas (mini-)application


Base code developed by the BSC Repsol Team

SLIDE 17

FWI Parallelization – OpenACC/CUDA #6: Results


  • Our optimized CUDA kernels perform better than the OpenACC ones

[Speedup chart: FWI speedups over the OpenMP baseline on Xeon Platinum 8160 (23c), Tesla K40 (Kepler), Titan X (Maxwell), and Tesla P100 (Pascal); the best version reaches a 19.32x speedup]

SLIDE 18

OmpSs + CUDA / OpenACC

SLIDE 19

Sequential control flow

– Defines a single address space
– Executes sequential code that:

  • Can spawn/instantiate tasks that will be executed sometime in the future
  • Can stall/wait for tasks

Tasks annotated with directionality clauses

– in, out, inout
– Used:

  • To build dependences among tasks
  • For main to wait for data to be produced

– Basis for memory management functionalities (replication, locality, movement, …)

  • Copy clauses

Sequential equivalence (~)

OmpSs Main Program


SLIDE 20

OmpSs: A Sequential Program…


void Cholesky( float *A[NT][NT] ) {
  int i, j, k;
  for (k = 0; k < NT; k++) {
    spotrf (A[k][k]);
    for (i = k+1; i < NT; i++) {
      strsm (A[k][k], A[k][i]);
    }
    for (i = k+1; i < NT; i++) {
      for (j = k+1; j < i; j++) {
        sgemm (A[k][i], A[k][j], A[j][i]);
      }
      ssyrk (A[k][i], A[i][i]);
    }
  }
}


SLIDE 21

OmpSs: … with Directionality Annotations …


void Cholesky( float *A[NT][NT] ) {
  int i, j, k;
  for (k = 0; k < NT; k++) {
    #pragma omp task inout (A[k][k])
    spotrf (A[k][k]);
    for (i = k+1; i < NT; i++) {
      #pragma omp task in (A[k][k]) inout (A[k][i])
      strsm (A[k][k], A[k][i]);
    }
    for (i = k+1; i < NT; i++) {
      for (j = k+1; j < i; j++) {
        #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
        sgemm (A[k][i], A[k][j], A[j][i]);
      }
      #pragma omp task in (A[k][i]) inout (A[i][i])
      ssyrk (A[k][i], A[i][i]);
    }
  }
}


SLIDE 22

OmpSs: … that Happens to Execute in Parallel


void Cholesky( float *A[NT][NT] ) {
  int i, j, k;
  for (k = 0; k < NT; k++) {
    #pragma omp task inout (A[k][k])
    spotrf (A[k][k]);
    for (i = k+1; i < NT; i++) {
      #pragma omp task in (A[k][k]) inout (A[k][i])
      strsm (A[k][k], A[k][i]);
    }
    for (i = k+1; i < NT; i++) {
      for (j = k+1; j < i; j++) {
        #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
        sgemm (A[k][i], A[k][j], A[j][i]);
      }
      #pragma omp task in (A[k][i]) inout (A[i][i])
      ssyrk (A[k][i], A[i][i]);
    }
  }
}


Decouple how we write/think (sequential) from how it is executed

SLIDE 23

OmpSs + CUDA – Example: AXPY Algorithm

main.c:

    #include <kernel.h>

    int main(int argc, char *argv[]) {
        float a = 5, x[N], y[N];
        // Initialize values
        for (int i = 0; i < N; ++i)
            x[i] = y[i] = i;
        // Compute the saxpy algorithm (1 task)
        saxpy(N, a, x, y);
        #pragma omp taskwait
        // Check results
        for (int i = 0; i < N; ++i) {
            if (y[i] != a*i + i) perror("Error\n");
        }
        message("Results are correct\n");
    }

kernel.c:

    void saxpy(int n, float a, float *X, float *Y) {
        for (int i = 0; i < n; ++i)
            Y[i] = X[i] * a + Y[i];
    }

kernel.h:

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy(int n, float a, float* x, float* y);

Porting steps: (1) port the kernel to CUDA, (2) annotate it with device(cuda), (3) complete the device(smp) version.

kernel.cu (step 1):

    __global__ void saxpy(int n, float a, float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

kernel.cuh (step 2):

    #pragma omp target device(cuda) copy_deps ndrange(1,n,128)
    #pragma omp task in([n]x) inout([n]y)
    __global__ void saxpy(int n, float a, float* x, float* y);

SLIDE 24

Taskify your whole application in a data-flow manner

– Compute kernels are just a type of task, executed on a GPU

The OmpSs runtime automatically manages the use of streams & memory transfers. OpenACC directives are used to generate the GPU kernels, which OmpSs treats as CUDA tasks. Greatest coding productivity for accelerators!

– But OpenACC kernels may perform worse than fine-tuned CUDA

OmpSs + OpenACC: General Idea


SLIDE 25

OmpSs + OpenACC: Syntax


    #pragma omp target(openacc)
    #pragma omp task in(rho, sxptr, syptr, szptr) inout(vptr)
    #pragma acc parallel loop deviceptr(rho, sxptr, syptr, szptr, vptr)
    for (int y = ny0; y < nyf; y++) {
        for (int x = nx0; x < nxf; x++) {
            for (int z = nz0; z < nzf; z++) {
                …code…
            }
        }
    }

Not released yet

SLIDE 26

FWI Parallelization – OmpSs/OpenACC - Results


  • OmpSs/OpenACC performance is similar to that of plain OpenACC

[Speedup chart: FWI speedups over the OpenMP baseline on Xeon Platinum 8160 (23c), Tesla K40 (Kepler), Titan X (Maxwell), and Tesla P100 (Pascal); the new OmpSs/OpenACC bars reach up to a 10.58x speedup]

SLIDE 27

Your Turn!


  • Open http://github.com/Hopobcn/FWI
  • Follow the step-by-step instructions in GTC2017eu.md
SLIDE 28

Thank you!

For further information please contact antonio.pena@bsc.es

www.bsc.es