

SLIDE 1

MultiGPU Made Easy by OmpSs + CUDA/OpenACC

Antonio J. Peña
Sr. Researcher & Activity Lead Manager, BSC/UPC NVIDIA GCoE

San Jose 2018
www.bsc.es

SLIDE 2

Introduction: Programming Models for GPU Computing

CUDA (Compute Unified Device Architecture)
  • Runtime & Driver APIs (high-level / low-level)
  • Specific to NVIDIA GPUs: best performance & control

OpenACC (Open Accelerators)
  • Open standard
  • Higher-level, pragma-based
  • Aiming at portability across heterogeneous hardware
  • For NVIDIA GPUs, implemented on top of CUDA

OpenCL (Open Computing Language)
  • Open standard
  • Low-level, similar to the CUDA Driver API
  • Multi-target, portable*

(Intentionally leaving out weird stuff like Cg, OpenGL, …)
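To make the comparison concrete, here is a minimal sketch (mine, not taken from the slides) of the same vector-scale loop written in CUDA and in OpenACC; the function names are illustrative:

   // CUDA: explicit kernel plus an explicit launch configuration (sketch)
   __global__ void scale_cuda(int n, float a, const float *x, float *y)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i];
   }
   // ...launched from host code as:
   //   scale_cuda<<<(n + 127) / 128, 128>>>(n, a, d_x, d_y);  // d_x, d_y: device pointers

   // OpenACC: a directive on the loop; the compiler generates and launches the kernel
   void scale_acc(int n, float a, const float *x, float *y)
   {
      #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
      for (int i = 0; i < n; ++i)
         y[i] = a * x[i];
   }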

SLIDE 3

Motivation: Coding Productivity & Performance

Approaches compared (by coding productivity vs. performance):
  • CUDA
  • OpenACC
  • OpenACC + CUDA
  • OmpSs + CUDA
  • OmpSs + OpenACC
  • OmpSs + OpenACC + CUDA

Don’t get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret colors as relative to each other. OpenACC may well deliver more than the performance you *need*; however, it gives the lowest control over performance w.r.t. the discussed alternatives.

SLIDE 4

EPEEC, an EU H2020 Project

European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing
FETHPC, 3 years, ~4M€, starting October 2018
  • Subtopic: “High productivity programming environments for exascale”
  • 10 partners; coordinator: BSC (I’m the Technical Manager)

High-level objectives:
  • Develop & deploy a production-ready parallel programming environment
  • Advance and integrate existing state-of-the-art European technology
  • High coding productivity, high performance, energy awareness

SLIDE 5

Proposed Methodology for Application Developers

[Flowchart of the proposed workflow; stages include: Start, Profile Code, Automatic Code Annotation, “Satisfactory Performance?”, “Satisfactory Code Patterns?”, Update Code Patterns, Tune/Insert Directives Manually, “Directive Optimisation Possible?”, Low-Level Accelerator Kernels, Deploy]

SLIDE 6

OmpSs + CUDA / OpenACC

SLIDE 7

OmpSs Main Program

Sequential control flow
  • Defines a single address space
  • Executes sequential code that
      – Can spawn/instantiate tasks that will be executed sometime in the future
      – Can stall/wait for tasks

Tasks annotated with directionality clauses
  • in, out, inout
  • Used
      – To build dependences among tasks
      – For main to wait for data to be produced
  • Basis for memory management functionalities (replication, locality, movement, …)
      – Copy clauses
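As a minimal sketch of these clauses (mine, not from the slides; producer and consumer are hypothetical helper functions), two tasks chained through a dependence on x, with a taskwait so the main code can safely read the results:

   #pragma omp task out([n] x)              // T1: produces x
   producer(n, x);

   #pragma omp task in([n] x) out([n] y)    // T2: depends on T1 through x, produces y
   consumer(n, x, y);

   #pragma omp taskwait                     // main waits here; x and y are now valid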

SLIDE 8

OmpSs: A Sequential Program…

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k = 0; k < NT; k++) {
      spotrf(A[k][k]);
      for (i = k+1; i < NT; i++)
         strsm(A[k][k], A[k][i]);
      for (i = k+1; i < NT; i++) {
         for (j = k+1; j < i; j++)
            sgemm(A[k][i], A[k][j], A[j][i]);
         ssyrk(A[k][i], A[i][i]);
      }
   }
}

[Figure: matrix blocked into NT x NT tiles of TS x TS elements each]

SLIDE 9

OmpSs: … with Directionality Annotations …

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k = 0; k < NT; k++) {
      #pragma omp task inout (A[k][k])
      spotrf(A[k][k]);
      for (i = k+1; i < NT; i++) {
         #pragma omp task in (A[k][k]) inout (A[k][i])
         strsm(A[k][k], A[k][i]);
      }
      for (i = k+1; i < NT; i++) {
         for (j = k+1; j < i; j++) {
            #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
            sgemm(A[k][i], A[k][j], A[j][i]);
         }
         #pragma omp task in (A[k][i]) inout (A[i][i])
         ssyrk(A[k][i], A[i][i]);
      }
   }
}

[Figure: matrix blocked into NT x NT tiles of TS x TS elements each]

SLIDE 10

OmpSs: … that Happens to Execute in Parallel

(Same annotated code as in Slide 9.)

Decouple how we write/think (sequential) from how it is executed
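As a worked illustration (mine, not on the slides), for NT = 2 the annotations above generate the following small dependence chain:

   // k = 0:
   //   spotrf(A[0][0])            T1: inout A[0][0]
   //   strsm(A[0][0], A[0][1])    T2: in A[0][0], inout A[0][1]  -> depends on T1
   //   ssyrk(A[0][1], A[1][1])    T3: in A[0][1], inout A[1][1]  -> depends on T2
   // k = 1:
   //   spotrf(A[1][1])            T4: inout A[1][1]              -> depends on T3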

SLIDE 11


Memory Consistency (Getting Consistent Copies)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

#pragma omp target device (cuda)
#pragma omp task out([N] b) in([N] c)
void scale_task_cuda(double *b, double *c, double a, int N)
{
   int j = blockIdx.x * blockDim.x + threadIdx.x;
   if (j < N) b[j] = a * c[j];
}

#pragma omp target device (smp)
#pragma omp task out([N] b) in([N] c)
void scale_task_host(double *b, double *c, double a, int N)
{
   for (int j = 0; j < N; j++)
      b[j] = a * c[j];
}

void main(int argc, char *argv[])
{
   ...
   scale_task_cuda (B, A, 10.0, 1024); // T1
   scale_task_cuda (A, B, 0.01, 1024); // T2
   scale_task_host (C, B, 2.00, 1024); // T3
   #pragma omp taskwait // can access any of A, B, C
   ...
}

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3)]

T1 needs a valid copy of array A in the device. It also allocates array B in the device (no copy needed) and invalidates other copies of B.
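For intuition, a rough sketch (mine, not from the slides) of the explicit CUDA calls the runtime replaces for T1; the 128-thread block size and the generated-kernel name are assumptions for illustration:

   // Roughly what the OmpSs runtime automates for T1 on a CUDA device (illustrative):
   double *dA, *dB;
   cudaMalloc(&dA, 1024 * sizeof(double));   // device copy of A
   cudaMalloc(&dB, 1024 * sizeof(double));   // device copy of B (out-only: no copy-in)
   cudaMemcpy(dA, A, 1024 * sizeof(double), cudaMemcpyHostToDevice);  // A is an input of T1
   scale_task_cuda<<<(1024 + 127) / 128, 128>>>(dB, dA, 10.0, 1024);  // launch generated kernel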

SLIDE 12


Memory Consistency (Reusing Data in Place)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

(Same scale_task example code as in Slide 11.)

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3)]

T2 can reuse arrays A and B in the device, since they have already been used by the previous task (T1). Additionally, it invalidates other copies of A.

SLIDE 13


Memory Consistency (On-Demand Data Copy-Back)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

(Same scale_task example code as in Slide 11.)

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3)]

T3 runs on the host, so array B needs to be copied back to the host. This does not invalidate the existing copy in the device.

SLIDE 14


Memory Consistency (Centralized Memory Consistency)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

(Same scale_task example code as in Slide 11.)

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3, taskwait)]

The taskwait requires full memory consistency in the host, so the runtime copies back any data whose only valid copy is in the device.

SLIDE 15


Memory Consistency (Avoid taskwait Consistency)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

(Task definitions scale_task_cuda and scale_task_host as in Slide 11; the main program is extended with a noflush taskwait and a fourth task:)

void main(int argc, char *argv[])
{
   ...
   scale_task_cuda (B, A, 10.0, 1024); // T1
   scale_task_cuda (A, B, 0.01, 1024); // T2
   scale_task_host (C, B, 2.00, 1024); // T3
   #pragma omp taskwait noflush // does not flush data dev -> host
   scale_task_cuda (B, C, 3.00, 1024); // T4
   #pragma omp taskwait // can access any of A, B, C
   ...
}

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3, noflush taskwait, T4)]

The noflush taskwait waits for task finalization, but does not copy memory back to the host (nor invalidate it).

SLIDE 16


Memory Consistency (Avoid taskwait Consistency)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

(Same code as in Slide 15.)

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3, noflush taskwait, T4)]

Before executing T4, the runtime needs a consistent copy of array C in the device, and it also invalidates all previous versions of B.

SLIDE 17


Memory Consistency (Avoid taskwait Consistency)

  • Relaxed-consistency “shared-memory” model (OpenMP-like)

(Same code as in Slide 15.)

[Figure: memory transfers between host and device memory, and the task dependency graph (T1, T2, T3, noflush taskwait, T4, final taskwait)]

The final taskwait waits for task finalization; it invalidates all device data versions and forces memory consistency in the host (arrays A and B are copied back).
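To recap the whole example (my summary of the figures collapsed above):

   // Dependences implied by the directionality clauses:
   //   T1: scale_task_cuda(B, A, ...)  out B, in A
   //   T2: scale_task_cuda(A, B, ...)  out A, in B  -> depends on T1 (B)
   //   T3: scale_task_host(C, B, ...)  out C, in B  -> depends on T1 (B)
   //   T4: scale_task_cuda(B, C, ...)  out B, in C  -> depends on T3 (C), after the noflush taskwait
   // Data movement: A is copied host->device for T1; T2 reuses A and B in place;
   // B is copied back to the host for T3; C is copied host->device for T4;
   // the final taskwait copies A and B back to the host and enforces full consistency.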

SLIDE 18

OmpSs + CUDA – Example: AXPY Algorithm

Porting steps: (1) port the kernel to CUDA, (2) annotate it with device(cuda), (3) complete the device(smp) version.

main.c:

   #include <kernel.h>

   int main(int argc, char *argv[])
   {
      float a = 5, x[N], y[N];

      // Initialize values
      for (int i = 0; i < N; ++i)
         x[i] = y[i] = i;

      // Compute saxpy algorithm (1 task)
      saxpy(N, a, x, y);
      #pragma omp taskwait

      // Check results
      for (int i = 0; i < N; ++i) {
         if (y[i] != a*i + i) perror("Error\n");
      }
      message("Results are correct\n");
   }

kernel.c (original sequential kernel):

   void saxpy(int n, float a, float *X, float *Y)
   {
      for (int i = 0; i < n; ++i)
         Y[i] = X[i] * a + Y[i];
   }

kernel.h (step 3 – complete device(smp)):

   #pragma omp target device(smp) copy_deps
   #pragma omp task in([n]x) inout([n]y)
   void saxpy(int n, float a, float* x, float* y);

kernel.cu (step 1 – port the kernel to CUDA):

   __global__ void saxpy(int n, float a, float* x, float* y)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
         y[i] = a * x[i] + y[i];
   }

kernel.cuh (step 2 – annotate device(cuda)):

   #pragma omp target device(cuda) copy_deps ndrange(1,n,128)
   #pragma omp task in([n]x) inout([n]y)
   __global__ void saxpy(int n, float a, float* x, float* y);

Porting the kernel to CUDA (step 1): difficult for non-experienced programmers! Adding the OmpSs annotations (steps 2 and 3): so easy!
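A side note on the ndrange clause (my reading, not spelled out on the slide): ndrange(1, n, 128) asks the runtime to launch the task as a 1-D grid covering n work items in 128-thread blocks, roughly the manual launch below, where d_x and d_y stand for the runtime-managed device copies:

   // Hand-written equivalent of what ndrange(1, n, 128) requests (sketch):
   dim3 block(128);
   dim3 grid((n + block.x - 1) / block.x);   // enough 128-thread blocks to cover n
   saxpy<<<grid, block>>>(n, a, d_x, d_y);   // transfers themselves handled by copy_deps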

SLIDE 19

OmpSs + OpenACC: Motivation

What if we could use OpenACC directives with OmpSs? OpenACC is way easier than CUDA:
  • Instead of porting & optimizing many CUDA tasks…
  • …port every GPU-accelerated task using OpenACC…
  • …and only use CUDA where the OpenACC compiler doesn’t provide the required efficiency.

SLIDE 20

OmpSs + OpenACC – Example: SAXPY Algorithm

Porting steps: (1) port the kernel, which with OpenACC means just adding a directive, (2) annotate it with device(openacc), (3) complete the device(smp) version.

main.c: same as in Slide 18.

kernel.c (original sequential kernel):

   void saxpy(int n, float a, float *x, float *y)
   {
      for (int i = 0; i < n; ++i)
         y[i] = x[i] * a + y[i];
   }

kernel.h (step 3 – complete device(smp)):

   #pragma omp target device(smp) copy_deps
   #pragma omp task in([n]x) inout([n]y)
   void saxpy(int n, float a, float* x, float* y);

kernel.h (step 2 – annotate device(openacc)):

   #pragma omp target device(openacc) copy_deps
   #pragma omp task in([n]x) inout([n]y)
   void saxpy(int n, float a, float* x, float* y);

kernel.c (step 1 – port the kernel with an OpenACC directive):

   void saxpy(int n, float a, float *x, float *y)
   {
      #pragma acc kernels
      for (int i = 0; i < n; ++i)
         y[i] = x[i] * a + y[i];
   }

So easy! (all three steps this time)
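When the plain kernels directive is not efficient enough, OpenACC still leaves room for tuning before dropping to CUDA; a hedged sketch (not from the slides) of a more explicit variant:

   // More explicit OpenACC variant of the same loop (illustrative only):
   // "parallel loop" with a fixed vector length instead of leaving every
   // decision to the compiler, as "#pragma acc kernels" does.
   void saxpy(int n, float a, float *x, float *y)
   {
      #pragma acc parallel loop vector_length(128)
      for (int i = 0; i < n; ++i)
         y[i] = x[i] * a + y[i];
   }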

SLIDE 21

FWI – A Full Wave Inversion Oil & Gas Miniapplication

Analyzes physical properties of the subsoil from seismic measurements.
Elastic wave propagator + linearly elastic stress-strain relationships:
  • Six different stress components
  • Finite differences (FD) method with a Fully Staggered Grid (FSG)

Base code developed by the BSC Repsol Team.

SLIDE 22

FWI Parallelization – OmpSs/OpenACC - Results

[Bar chart: FWI speedups over the OpenMP baseline on an i7-5930K (6 cores), a Tesla K40 (Kepler), a Titan X (Maxwell), and a Titan X (Pascal); speedups in the chart range from 0.97x to 19.08x.]

SLIDE 23

Some Announcements

  • L8116 “Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs” – Thu. 10:00-12:00
  • S8328 “One More Step Towards the Simulation of the Human Brain on NVIDIA GPUs (HBP)” – Thu. 4:00-4:25pm
  • Join the upcoming EPEEC Users’ Group
  • STARS Open Postdoctoral Fellowships
  • PUMPS+AI Summer School – Barcelona, July (http://pumps.bsc.es)
      – Featuring Wen-mei Hwu and David Kirk
      – Advanced CUDA + brand-new “+AI” format!

antonio.pena@bsc.es

SLIDE 24

Acknowledgements

Guray Ozen – first OmpSs+OpenACC prototype

Accelerators and Communications for HPC Team – my team
  • Core: Pau Farré (Jr. Research Engineer), Marc Jordà, Kyunghun Kim, Mohammad Owais
  • Collaborators: Pedro Valero, Aimar Rodríguez, Jan Ciesko

OmpSs Team – awesome programming model and runtime
  • Xavier Martorell, Vicenç Beltran, Xavier Teruel, Sergi Mateo, JM Perez, …

BSC Repsol Team – provided the original FWI implementation
  • Maurizio Hanzich, Samuel Rodríguez, …

PUMPS Summer School
  • Wen-mei Hwu & TAs: Simón García de Gonzalo, Abdul Dakkak, Carl Pearson, Mert Hidayetoglu
  • David Kirk, Juan Gómez-Luna
SLIDE 25

Thank you!

For further information please contact antonio.pena@bsc.es

www.bsc.es