San Jose 2018
MultiGPU Made Easy by OmpSs + CUDA/OpenACC
Antonio J. Peña
- Sr. Researcher & Activity Lead
MultiGPU Made Easy by OmpSs + CUDA/OpenACC — Antonio J. Peña, Sr. — PowerPoint PPT Presentation
www.bsc.es MultiGPU Made Easy by OmpSs + CUDA/OpenACC Antonio J. Peña Sr. Researcher & Activity Lead Manager, BSC/UPC NVIDIA GCoE San Jose 2018 Introduction: Programming Models for GPU Computing CUDA (Compute Unified Device Architecture)
San Jose 2018
– Runtime & Driver APIs (high-level / low-level) – Specific for NVIDIA GPUs: best performance & control
– Open Standard – Higher-level, pragma-based – Aiming at portability – heterogeneous hardware – For NVIDIA GPUs, implemented on top of CUDA
– Open Standard – Low-level – similar to CUDA Driver API – Multi-target, portable*
2
3
– Subtopic: “High productivity programming environments for exascale”
– Develop & deploy a production-ready parallel programming environment – Advance and integrate existing state-of-the-art European technology – High coding productivity, high performance, energy awareness
5
Automatic Code Annotation Satisfactory Performance? Profile Code Low-Level Accelerator Kernels Start Deploy Yes No Satisfactory Performance? Yes No Tune/Insert Directives Manually Directive Optimisation Possible? No Yes Update code patterns Satisfactory Code Patterns? No Yes
– Defines a single address space – Executes sequential code that
– in, out, inout – Used
– Basis for memory management functionalities (replication, locality, movement, …)
7
8
// Tiled Cholesky factorization, sequential version.
// A is an NT x NT array of pointers to TS x TS tiles. Calls the tile
// kernels spotrf (factorize diagonal tile), strsm (triangular solve),
// sgemm (update) and ssyrk (symmetric rank-k update), declared elsewhere.
void Cholesky( float *A[NT][NT] )
{
    int i, j, k;
    for (k = 0; k < NT; k++) {
        // Factorize the diagonal tile.
        // Fixed: the original read A[k*NT+k], which indexes the 2-D array
        // of tile pointers as if it were flat (type-wrong, out of range for
        // k > 0); the tasked versions of this routine use A[k][k].
        spotrf (A[k][k]);
        // Triangular solves against the freshly factorized diagonal tile.
        for (i = k+1; i < NT; i++) {
            strsm (A[k][k], A[k][i]);
        }
        // Trailing-submatrix update.
        for (i = k+1; i < NT; i++) {
            for (j = k+1; j < i; j++) {
                sgemm (A[k][i], A[k][j], A[j][i]);
            }
            ssyrk (A[k][i], A[i][i]);
        }
    }
}
TS TS NT NT TS TS
9
// Tiled Cholesky factorization, OmpSs/OpenMP tasked version.
// Each tile kernel becomes a task; the in/out/inout clauses express the
// tile-level data dependencies from which the runtime builds the task graph.
// (Reconstructed from a flattened slide line: each #pragma must occupy its
// own line, and the final closing brace was lost in extraction.)
void Cholesky( float *A[NT][NT] )
{
    int i, j, k;
    for (k = 0; k < NT; k++) {
        // Factorize the diagonal tile (reads and writes A[k][k]).
        #pragma omp task inout (A[k][k])
        spotrf (A[k][k]);
        // Triangular solves: each depends on the factorized diagonal tile.
        for (i = k+1; i < NT; i++) {
            #pragma omp task in (A[k][k]) inout (A[k][i])
            strsm (A[k][k], A[k][i]);
        }
        // Trailing-submatrix update.
        for (i = k+1; i < NT; i++) {
            for (j = k+1; j < i; j++) {
                #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
                sgemm (A[k][i], A[k][j], A[j][i]);
            }
            #pragma omp task in (A[k][i]) inout (A[i][i])
            ssyrk (A[k][i], A[i][i]);
        }
    }
}
TS TS NT NT TS TS
10
// Tiled Cholesky factorization, OmpSs/OpenMP tasked version.
// Each tile kernel becomes a task; the in/out/inout clauses express the
// tile-level data dependencies from which the runtime builds the task graph.
// (Reconstructed from a flattened slide line: each #pragma must occupy its
// own line, and the final closing brace was lost in extraction.)
void Cholesky( float *A[NT][NT] )
{
    int i, j, k;
    for (k = 0; k < NT; k++) {
        // Factorize the diagonal tile (reads and writes A[k][k]).
        #pragma omp task inout (A[k][k])
        spotrf (A[k][k]);
        // Triangular solves: each depends on the factorized diagonal tile.
        for (i = k+1; i < NT; i++) {
            #pragma omp task in (A[k][k]) inout (A[k][i])
            strsm (A[k][k], A[k][i]);
        }
        // Trailing-submatrix update.
        for (i = k+1; i < NT; i++) {
            for (j = k+1; j < i; j++) {
                #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
                sgemm (A[k][i], A[k][j], A[j][i]);
            }
            #pragma omp task in (A[k][i]) inout (A[i][i])
            ssyrk (A[k][i], A[i][i]);
        }
    }
}
TS TS NT NT TS TS
11
// OmpSs slide example (single line flattened by extraction; `...` marks code
// elided on the slide). Two task implementations of the same scale operation:
// scale_task_cuda (device(cuda), uses blockIdx/blockDim/threadIdx) and
// scale_task_host (device(smp), plain loop). The out([N] b)/in([N] c) clauses
// declare array-region dependencies, giving the task graph T1 -> {T2, T3};
// the runtime performs the host/device copies those dependencies require.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait // can access any of A,B,C ...
Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B T1 No need to copy
Task Dependency Graph
T1 needs a valid copy
B Also it allocates array B in the device (no copy needed), and invalidates other B’s C T1 T2 T3
12
// Same OmpSs example repeated for this slide's animation step (line flattened
// by extraction; `...` marks elided slide code). The surrounding slide text
// discusses T2 reusing the device copies of A and B produced by T1.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait // can access any of A,B,C ...
Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B T2
Task Dependency Graph
T2 can reuse arrays A and B, since they have been used by the previous task (T1) B C Additionally, it also invalidates other copies of A T1 T2 T3
13
// Same OmpSs example repeated for this slide's animation step (line flattened
// by extraction; `...` marks elided slide code). The surrounding slide text
// discusses T3 (an smp task) forcing B to be copied back to the host.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait // can access any of A,B,C ...
Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B T3
Task Dependency Graph
T3 needs to copy array B back to the host B C It does not invalidate the existing copy on the device T1 T2 T3 B
14
// Same OmpSs example repeated for this slide's animation step (line flattened
// by extraction; `...` marks elided slide code). The surrounding slide text
// discusses the plain taskwait requiring full memory consistency in the host.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait // can access any of A,B,C ...
Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B
Task Dependency Graph
B C Taskwait requires full memory consistency in the host A T1 T2 T3
T1 T2 T3
B
TW
15
// OmpSs `taskwait noflush` variant of the scale example (line flattened by
// extraction; `...` marks elided slide code). The noflush taskwait waits for
// T1-T3 but, per its own comment, does not flush device data back to the
// host, so T4 (scale_task_cuda on B, C) can follow; the final plain taskwait
// restores full host consistency.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait noflush // does not flush data dev -> host scale_task_cuda (B, C, 3.00, 1024); //T4 #pragma omp taskwait // can access any of A,B,C ...
T1 T2 T3 T4
Task Dependency Graph
The taskwait waits for task finalization, but does not copy memory back to the host (nor invalidate it)
noflush Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B C
nf T1 T2 T3
B B
16
// Same `taskwait noflush` example repeated for this slide's animation step
// (line flattened by extraction; `...` marks elided slide code). The
// surrounding slide text discusses T4 needing a consistent copy of C.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait noflush // does not flush data dev -> host scale_task_cuda (B, C, 3.00, 1024); //T4 #pragma omp taskwait // can access any of A,B,C ...
T1 T2 T3 T4
Task Dependency Graph
Before executing T4 it will need a consistent copy of C and it will also invalidate all previous versions of B
noflush Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B C
nf T1 T2 T3
B B C B
T4
17
// Same `taskwait noflush` example repeated for this slide's animation step
// (line flattened by extraction; `...` marks elided slide code). The
// surrounding slide text discusses the final taskwait forcing consistency.
// NOTE(review): `void main` would be `int main` in compilable code.
#pragma omp target device (cuda) #pragma omp task out([N] b) in([N] c) void scale_task_cuda(double *b, double *c, double a, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; if (j < N) b[j] = a * c[j]; } #pragma omp target device (smp) #pragma omp task out([N] b) in([N] c) void scale_task_host(double *b, double *c, double a, int N) { for (int j=0; j < N; j++) b[j] = a*c[j]; } void main(int argc, char *argv[]) { ... scale_task_cuda (B, A, 10.0, 1024); //T1 scale_task_cuda (A, B, 0.01, 1024); //T2 scale_task_host (C, B, 2.00, 1024); //T3 #pragma omp taskwait noflush // does not flush data dev -> host scale_task_cuda (B, C, 3.00, 1024); //T4 #pragma omp taskwait // can access any of A,B,C ...
T1 T2 T3 T4
Task Dependency Graph
The taskwait waits for task finalization; it will invalidate all data versions and force memory consistency
noflush Memory Transfers
HOST MEMORY DEVICE MEMORY
A A B C
nf T1 T2 T3
B B C B
T4
A B
TW
// Slide 18 of the saxpy walkthrough (line flattened by extraction; the
// leading "18" is the slide number fused into the text). main.c initializes
// x and y, invokes the saxpy task once, and checks results after the
// taskwait; kernel.c holds the SMP saxpy body and kernel.h declares it as an
// OmpSs task (device(smp) copy_deps, in([n]x) inout([n]y)).
// NOTE(review): as flattened, `perror("Error\n")` is missing its trailing
// semicolon, and the check `y[i]!=a*i+i` is an exact float comparison.
18 #include <kernel.h> int main(int argc, char *argv[]) { float a=5, x[N], y[N]; // Initialize values for (int i=0; i<N; ++i) x[i] = y[i] = i; // Compute saxpy algorithm (1 task) saxpy(N, a, x, y); #pragma omp taskwait //Check results for (int i=0; i<N; ++i){ if (y[i]!=a*i+i) perror("Error\n") } message("Results are correct\n"); } void saxpy(int n, float a, float *X, float *Y) { for (int i=0; i<n; ++i) Y[i] = X[i] * a + Y[i]; } #pragma omp target device(smp) copy_deps #pragma omp task in([n]x) inout([n]y) void saxpy(int n, float a, float* x, float* y);
kernel.c kernel.h main.c
#pragma omp target device(cuda) copy_deps ndrange(1,n,128) #pragma omp task in([n]x) inout([n]y) __global__ void saxpy(int n, float a, float* x, float* y);
kernel.cuh
2
// SAXPY kernel: y[i] = a * x[i] + y[i] for every i in [0, n).
// Expects a 1-D launch; threads past the array tail exit via the guard.
__global__ void saxpy(int n, float a, float* x, float* y)
{
    const int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= n)
        return;
    y[idx] = a * x[idx] + y[idx];
}
kernel.cu
1
3
– Instead of porting & optimizing many CUDA tasks… – …port every GPU-accelerated task using OpenACC… – …and only use CUDA where the OpenACC compiler doesn’t provide the required efficiency
19
// main.c of the saxpy example (line flattened by extraction): initializes
// x and y, launches one saxpy task, and validates y after the taskwait.
// NOTE(review): as flattened, `perror("Error\n")` is missing its trailing
// semicolon, and the check `y[i]!=a*i+i` is an exact float comparison.
#include <kernel.h> int main(int argc, char *argv[]) { float a=5, x[N], y[N]; // Initialize values for (int i=0; i<N; ++i) x[i] = y[i] = i; // Compute saxpy algorithm (1 task) saxpy(N, a, x, y); #pragma omp taskwait //Check results for (int i=0; i<N; ++i){ if (y[i]!=a*i+i) perror("Error\n") } message("Results are correct\n"); }
/* Host SAXPY: accumulates a * x[i] into y[i] for the first n elements.
 * Does nothing when n <= 0. */
void saxpy(int n, float a, float *x, float *y)
{
    int idx = 0;
    while (idx < n) {
        y[idx] = x[idx] * a + y[idx];
        ++idx;
    }
}
#pragma omp target device(smp) copy_deps #pragma omp task in([n]x) inout([n]y) void saxpy (int n, float a, float* x, float* y);
kernel.c kernel.h main.c
3
20 #pragma omp target device(openacc) copy_deps #pragma omp task in([n]x) inout([n]y) void saxpy (int n, float a, float* x, float* y);
kernel.h
// SAXPY offloaded via OpenACC: y[i] = x[i] * a + y[i] for i in [0, n).
// The `kernels` directive asks the compiler to generate device code for
// the loop; data movement is handled by the enclosing OmpSs task
// (copy_deps on the declaration in kernel.h).
// Fixed: extraction had fused the #pragma into the middle of the line;
// a #pragma directive must occupy its own line.
void saxpy(int n, float a, float *x, float *y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = x[i] * a + y[i];
}
kernel.c
2
– Six different stress components – Finite differences (FD) method with a Fully Staggered Grid (FSG)
21
22
– Featuring Wen-mei Hwu & David Kirk – Advanced CUDA + Brand-new “+AI” format!
– Core: Pau Farré, Marc Jordà, Kyunghun Kim, Mohammad Owais – Collaborators: Pedro Valero, Aimar Rodríguez, Jan Ciesko
– Xavier Martorell, Vicenç Beltran, Xavier Teruel, Sergi Mateo, JM Perez, …
– Maurizio Hanzich, Samuel Rodríguez, …
– Wen-mei Hwu & TAs: Simón García de Gonzalo, Abdul Dakkak, Carl Pearson, Mert Hiyadetoglu – David Kirk, Juan Gómez-Luna
24 Pau Farré