MultiGPU Made Easy by OmpSs + CUDA/OpenACC, by Antonio J. Peña (presentation transcript)



  1. MultiGPU Made Easy by OmpSs + CUDA/OpenACC. Antonio J. Peña, Sr. Researcher & Activity Lead Manager, BSC/UPC. NVIDIA GCoE, San Jose, 2018. (www.bsc.es)

  2. Introduction: Programming Models for GPU Computing
     – CUDA (Compute Unified Device Architecture): Runtime & Driver APIs (high-level / low-level); specific to NVIDIA GPUs, best performance & control.
     – OpenACC (Open Accelerators): open standard; higher-level, pragma-based; aims at portability across heterogeneous hardware; for NVIDIA GPUs it is implemented on top of CUDA (a minimal sketch follows below).
     – OpenCL (Open Computing Language): open standard; low-level, similar to the CUDA Driver API; multi-target, portable*.
     (Intentionally leaving out things like Cg, OpenGL, …)
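
     As a minimal illustration of the abstraction gap (not from the slides; the function name, array names, and loop are illustrative), here is the same kind of scaling loop written with a single OpenACC directive, where the compiler generates the device code and data transfers that a CUDA programmer would write by hand:

         // Hedged sketch: one OpenACC pragma offloads the loop and handles the
         // data movement that would be explicit cudaMemcpy/kernel code in CUDA.
         void scale_acc(double *b, const double *c, double a, int n)
         {
             #pragma acc parallel loop copyin(c[0:n]) copyout(b[0:n])
             for (int i = 0; i < n; i++)
                 b[i] = a * c[i];
         }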

  3. Motivation: Coding Productivity & Performance
     [Figure: relative coding productivity vs. performance for CUDA, OpenACC, OpenACC + CUDA, OmpSs + CUDA, OmpSs + OpenACC, and OmpSs + OpenACC + CUDA.]
     Don't get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here; please interpret the colors as relative to each other.
     OpenACC may well deliver more than the performance you *need*; however, it gives the lowest control over performance w.r.t. the discussed alternatives.

  4. EPEEC, an EU H2020 Project
     European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing.
     FETHPC, 3 years, ~4 M€, starting October 2018. Subtopic: "High productivity programming environments for exascale".
     10 partners; coordinator: BSC (I'm the Technical Manager).
     High-level objectives:
     – Develop & deploy a production-ready parallel programming environment
     – Advance and integrate existing state-of-the-art European technology
     – High coding productivity, high performance, energy awareness

  5. Proposed Methodology for Application Developers
     [Flowchart with the steps: Start, Profile, Satisfactory Code Patterns?, Update Code Patterns, Directive Annotation, Automatic Optimisation Possible?, Satisfactory Performance?, Tune/Insert Low-Level Directives, Code Accelerator Kernels Manually, Deploy.]

  6. OmpSs + CUDA / OpenACC

  7. OmpSs Main Program
     Sequential control flow:
     – Defines a single address space
     – Executes sequential code that can spawn/instantiate tasks (to be executed sometime in the future) and can stall/wait for tasks
     Tasks are annotated with directionality clauses (in, out, inout), used:
     – To build dependences among tasks
     – For main to wait for data to be produced
     – As the basis for memory management functionalities (replication, locality, movement, …) via copy clauses
     (A minimal sketch of these clauses follows below.)
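
     A minimal sketch (illustrative names, not from the slides) of how the directionality clauses drive execution: the out/inout annotations on v make the second task depend on the first, and the taskwait lets the sequential main flow read the result safely.

         // Hedged OmpSs-style sketch: function-level tasks with array-section
         // directionality clauses, as in the Cholesky example that follows.
         #pragma omp task out([n] v)
         void init_task(double *v, int n)
         {
             for (int i = 0; i < n; i++) v[i] = i;      // produces v
         }

         #pragma omp task inout([n] v)
         void scale_task(double *v, double a, int n)
         {
             for (int i = 0; i < n; i++) v[i] *= a;     // reads and updates v
         }

         int main(void)
         {
             static double v[1024];
             init_task(v, 1024);        // task 1: out(v)
             scale_task(v, 2.0, 1024);  // task 2: inout(v), depends on task 1
             #pragma omp taskwait       // main waits here before touching v
             return 0;
         }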

  8. OmpSs: A Sequential Program …
     [Figure: NT × NT blocked matrix; each tile is TS × TS.]

     void Cholesky(float *A[NT][NT])
     {
         int i, j, k;
         for (k = 0; k < NT; k++) {
             spotrf(A[k][k]);
             for (i = k+1; i < NT; i++)
                 strsm(A[k][k], A[k][i]);
             for (i = k+1; i < NT; i++) {
                 for (j = k+1; j < i; j++)
                     sgemm(A[k][i], A[k][j], A[j][i]);
                 ssyrk(A[k][i], A[i][i]);
             }
         }
     }

  9. OmpSs: … with Directionality Annotations …

     void Cholesky(float *A[NT][NT])
     {
         int i, j, k;
         for (k = 0; k < NT; k++) {
             #pragma omp task inout(A[k][k])
             spotrf(A[k][k]);
             for (i = k+1; i < NT; i++) {
                 #pragma omp task in(A[k][k]) inout(A[k][i])
                 strsm(A[k][k], A[k][i]);
             }
             for (i = k+1; i < NT; i++) {
                 for (j = k+1; j < i; j++) {
                     #pragma omp task in(A[k][i], A[k][j]) inout(A[j][i])
                     sgemm(A[k][i], A[k][j], A[j][i]);
                 }
                 #pragma omp task in(A[k][i]) inout(A[i][i])
                 ssyrk(A[k][i], A[i][i]);
             }
         }
     }

  10. OmpSs: … that Happens to Execute in Parallel
      (Same annotated Cholesky code as on the previous slide.)
      Decouple how we write/think (sequential) from how it is executed.

  11. Memory Consistency (Getting Consistent Copies)
      Relaxed-consistency "shared-memory" model (OpenMP-like).

      #pragma omp target device(cuda)
      #pragma omp task out([N] b) in([N] c)
      void scale_task_cuda(double *b, double *c, double a, int N)
      {
          int j = blockIdx.x * blockDim.x + threadIdx.x;
          if (j < N) b[j] = a * c[j];
      }

      #pragma omp target device(smp)
      #pragma omp task out([N] b) in([N] c)
      void scale_task_host(double *b, double *c, double a, int N)
      {
          for (int j = 0; j < N; j++) b[j] = a * c[j];
      }

      void main(int argc, char *argv[])
      {
          ...
          scale_task_cuda(B, A, 10.0, 1024);  // T1
          scale_task_cuda(A, B, 0.01, 1024);  // T2
          scale_task_host(C, B, 2.00, 1024);  // T3
          #pragma omp taskwait
          // can access any of A, B, C
          ...
      }

      [Figure: task dependency graph (T1 → T2, T1 → T3) and device/host memory transfers.]
      T1 needs a valid copy of array A in the device, so A is copied host → device. It also allocates array B in the device (no copy needed) and invalidates the other copies of B.

  12. Memory Consistency (Reusing Data in Place)
      (Same code as on the previous slide.)
      T2 can reuse arrays A and B in the device, since they have already been placed there by the previous task (T1); no transfer is needed. Writing A in the device additionally invalidates the other copies of A.

  13. Memory Consistency (On-Demand Copy of Data Back)
      (Same code as on the previous slide.)
      T3 runs on the host, so it needs array B copied back device → host; this does not invalidate the existing copy of B in the device.

  14. Memory Consistency (Centralized Memory Consistency)
      (Same code as on the previous slide.)
      [Figure: memory-transfer timeline for T1, T2, T3 and the taskwait; array A is transferred back at the taskwait.]
      The taskwait requires full memory consistency in the host: any data still valid only in the device is flushed back at that point, and afterwards main can access any of A, B, C.

  15. Memory Consistency (Avoiding taskwait Consistency)
      (Same task definitions as on the previous slides.) The noflush clause makes a taskwait wait for task finalization without copying memory back to the host (and without invalidating the device copies):

      void main(int argc, char *argv[])
      {
          ...
          scale_task_cuda(B, A, 10.0, 1024);  // T1
          scale_task_cuda(A, B, 0.01, 1024);  // T2
          scale_task_host(C, B, 2.00, 1024);  // T3

          #pragma omp taskwait noflush
          // does not flush data device -> host

          scale_task_cuda(B, C, 3.00, 1024);  // T4

          #pragma omp taskwait
          // can access any of A, B, C
          ...
      }

      [Figure: task dependency graph (T1 → T2, T1 → T3, T3 → T4) and memory transfers.]
