The OmpSs programming model and its runtime support, Jesús Labarta



SLIDE 1

www.bsc.es

13th Charm++ Workshop, Urbana-Champaign, May 8th, 2015

Jesús Labarta, BSC

The OmpSs programming model and its runtime support

SLIDE 2

VISION

SLIDE 3

Look around …

We are in the middle of a “revolution”

SLIDE 4

Living in the programming revolution

The power wall made us go multicore and made the ISA interface leak: our world is shaking

Then: applications sitting on a stable ISA / API

Now: applications = application logic + platform specificities

Address spaces (hierarchy, transfers), control flows, … complexity!

SLIDE 5

The programming revolution

An age changing revolution

– From the latency age …

  • Specify what to compute, where and when
  • Performance dominated by latency in a broad sense
    – Memory, communication, pipeline depths, fork-join, …
    – "I need something … I need it now!"

– … to the throughput age

  • Ability to instantiate "lots" of work, avoiding stalls on specific requests
    – "I need this and this and that … and as long as it keeps coming I am OK"
    – (A broader interpretation than just GPU computing!)
  • Performance dominated by overall availability/balance of resources
SLIDE 6

From the latency age to the throughput age

It will require a programming effort!

– Must make the transition as easy/smooth as possible
– Must make it as long-lived as possible

Need

– Simple mechanisms at the programming model level to express potential concurrency, leaving exploitation responsibility to the runtime

  • Dynamic task based, asynchrony, look-ahead, malleability, …

– A change in programmers' mentality/attitude

  • Top down programming methodology
  • Think global, of potentials rather than how-to’s
  • Specify local, real needs and outcomes of the functionality being written
SLIDE 7

Vision in the programming revolution

Need to decouple again

ISA / API: general purpose, task based, single address space; "reuse" architectural ideas under new constraints

Application logic

  • Architecture independent

Applications: power to the runtime

PM: high-level, clean, abstract interface

SLIDE 8

Vision in the programming revolution

ISA / API: special purpose; must be easy to develop/maintain; fast prototyping

Applications: power to the runtime

PM: high-level, clean, abstract interface, with DSLs (DSL1, DSL2, DSL3) on top of the general-purpose, task-based, single-address-space layer; "reuse" architectural ideas under new constraints

SLIDE 9

WHAT DO WE DO?

SLIDE 10

BSC technologies

Programming model

– The StarSs concept (*Superscalar):

  • Sequential programming + directionality annotations → out-of-order execution

– The OmpSs implementation → OpenMP standard

Performance tools

– Trace visualization and analysis:

  • extreme flexibility and detail

– Performance analytics

SLIDE 11

PROGRAMMING MODELS

SLIDE 12

Key concept

– Sequential task-based program on a single address/name space + directionality annotations
– Happens to execute in parallel: automatic runtime computation of dependencies between tasks

Differentiation of StarSs

– Dependences: Tasks instantiated but not ready. Order IS defined

  • Lookahead

– Avoid stalling the main control flow when a computation depending on previous tasks is reached
– Possibility to "see" the future, searching for further potential concurrency

  • Dependences built from data access specification

– Locality aware

  • Without defining new concepts

– Homogenizing heterogeneity

  • Device specific tasks but homogeneous program logic

The StarSs family of programming models

SLIDE 13

The StarSs "granularities"

StarSs comprises OmpSs (@ SMP, @ GPU, @ Cluster) and COMPSs/PyCOMPSs (parallel ensembles, workflows):

                                        OmpSs                                COMPSs / PyCOMPSs
Average task granularity:               100 microseconds to 10 milliseconds  1 second to 1 day
Language binding:                       C, C++, Fortran                      Java, Python
Address space to compute dependences:   memory                               files, objects (SCM)

SLIDE 14

OmpSs in one slide

Minimalist set of concepts …

– … "extending" OpenMP
– … relaxing the StarSs functional model

#pragma omp task [in(array_spec...)] [out(...)] [inout(...)] \
    [concurrent(...)] [commutative(...)] [priority(P)] [label(...)] \
    [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
    [untied] [final] [if(expression)] [reduction(identifier : list)]
    {code block or function}

#pragma omp taskwait [on(...)] [noflush]

#pragma omp target device({ smp | opencl | cuda }) \
    [implements(function_name)] \
    [copy_deps | no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
    [ndrange(dim, …)] [shmem(...)]

#pragma omp for [shared(...)] [private(...)] [firstprivate(...)] [schedule_clause]
    {for_loop}

SLIDE 15

Inlined

void Cholesky(int NT, float *A[NT][NT]) {
   for (int k = 0; k < NT; k++) {
      #pragma omp task inout([TS][TS](A[k][k]))
      spotrf(A[k][k], TS);
      for (int i = k+1; i < NT; i++) {
         #pragma omp task in([TS][TS](A[k][k])) inout([TS][TS](A[k][i]))
         strsm(A[k][k], A[k][i], TS);
      }
      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++) {
            #pragma omp task in([TS][TS](A[k][i]), [TS][TS](A[k][j])) \
                             inout([TS][TS](A[j][i]))
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         }
         #pragma omp task in([TS][TS](A[k][i])) inout([TS][TS](A[i][i]))
         ssyrk(A[k][i], A[i][i], TS);
      }
   }
}


SLIDE 16

#pragma omp task inout([TS][TS]A)
void spotrf(float *A, int TS);
#pragma omp task in([TS][TS]T) inout([TS][TS]B)
void strsm(float *T, float *B, int TS);
#pragma omp task in([TS][TS]A, [TS][TS]B) inout([TS][TS]C)
void sgemm(float *A, float *B, float *C, int TS);
#pragma omp task in([TS][TS]A) inout([TS][TS]C)
void ssyrk(float *A, float *C, int TS);

void Cholesky(int NT, float *A[NT][NT]) {
   for (int k = 0; k < NT; k++) {
      spotrf(A[k][k], TS);
      for (int i = k+1; i < NT; i++)
         strsm(A[k][k], A[k][i], TS);
      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++)
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         ssyrk(A[k][i], A[i][i], TS);
      }
   }
}

…or outlined

SLIDE 17

Incomplete directionalities specification: sentinels

void Cholesky(int NT, float *A[NT][NT]) {
   for (int k = 0; k < NT; k++) {
      #pragma omp task inout(A[k][k])
      spotrf(A[k][k], TS);
      for (int i = k+1; i < NT; i++) {
         #pragma omp task in(A[k][k]) inout(A[k][i])
         strsm(A[k][k], A[k][i], TS);
      }
      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++) {
            #pragma omp task in(A[k][i], A[k][j]) inout(A[j][i])
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         }
         #pragma omp task in(A[k][i]) inout(A[i][i])
         ssyrk(A[k][i], A[i][i], TS);
      }
   }
}

#pragma omp task inout(*A)
void spotrf(float *A, int TS);
#pragma omp task in(*T) inout(*B)
void strsm(float *T, float *B, int TS);
#pragma omp task in(*A, *B) inout(*C)
void sgemm(float *A, float *B, float *C, int TS);
#pragma omp task in(*A) inout(*C)
void ssyrk(float *A, float *C, int TS);

void Cholesky(int NT, float *A[NT][NT]) {
   for (int k = 0; k < NT; k++) {
      spotrf(A[k][k], TS);
      for (int i = k+1; i < NT; i++)
         strsm(A[k][k], A[k][i], TS);
      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++)
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         ssyrk(A[k][i], A[i][i], TS);
      }
   }
}

SLIDE 18

Homogenizing Heterogeneity

ISA heterogeneity. A single address space program … executes in several non-coherent address spaces

– Copy clauses:

  • Ensure a sequentially consistent copy is accessible in the address space where the task is going to be executed

  • Requires precise specification of data accessed (e.g. array sections)

– Runtime offloads data and computation

#pragma omp taskwait [on(...)] [noflush]

#pragma omp target device({ smp | opencl | cuda }) \
    [implements(function_name)] \
    [copy_deps | no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
    [ndrange(dim, …)] [shmem(...)]

SLIDE 19

CUDA tasks @ OmpSs

Compiler splits code and sends the codelet to nvcc
Data transfers to/from the device are performed by the runtime
Constraints for the "codelet"

– Cannot access copied data. Pointers are translated when activating the "codelet" task.
– Can access firstprivate data

void Calc_forces_cuda(int npart, Particle *particles, Particle *result, float dtime) {
   const int bs = npart/8;
   int first, last, nblocks;
   for (int i = 0; i < npart; i += bs) {
      first = i;
      last = (i+bs-1 > npart) ? npart : i+bs-1;
      nblocks = (last - first + MAX_THREADS) / MAX_THREADS;
      #pragma omp target device(cuda) copy_deps
      #pragma omp task in(particles[0:npart-1]) out(result[first:(first+bs)-1])
      {
         calculate_forces <<< nblocks, MAX_THREADS >>> (dtime, particles, npart,
                                                        &result[first], first, last);
      }
   }
}

SLIDE 20

MACC (Mercurium ACcelerator Compiler)

“OpenMP 4.0 accelerator directives” compiler

– Generates OmpSs code + CUDA kernels (for Intel & POWER8 + GPUs)
– Proposes clauses that improve kernel performance

Extended semantics

– Change in mentality … minor details make a difference

  • G. Ozen et al, “On the roles of the programmer, the compiler and the runtime system when facing accelerators in OpenMP 4.0” IWOMP 2014

(Clause roles annotated on the slide: type of device; whether to transfer; specific device; ensure availability)

SLIDE 21

Managing separate address spaces

OmpSs @ Cluster runtime

– Directory @ master
– A software cache @ device manages its individual address space:

  • Manages local space at device (logical and physical)
  • Translates addresses @ main address space → device addresses

– Implements transfers

  • Packing if needed
  • Device/network specific transfer APIs (e.g. GASNet, CUDA copies, MPI, …)

– Constraints

  • No pointers in offloaded data, no deep copy, …
  • Same layout at host and device
  • J. Bueno et al, “Implementing OmpSs Support for Regions of Data in Architectures with Multiple Address Spaces”, ICS 2013
  • J. Bueno et al, “Productive Programming of GPU Clusters with OmpSs”, IPDPS2012
SLIDE 22

Multiple implementations

#pragma omp target device(opencl) ndrange(1,size,128) copy_deps implements(calculate_forces)
#pragma omp task out([size] out) in([npart] part)
__kernel void calculate_force_opencl(int size, float time, int npart,
                                     __global Part* part, __global Part* out, int gid);

#pragma omp target device(cuda) ndrange(1,size,128) copy_deps implements(calculate_forces)
#pragma omp task out([size] out) in([npart] part)
__global__ void calculate_force_cuda(int size, float time, int npart,
                                     Part* part, Particle *out, int gid);

#pragma omp target device(smp) copy_deps
#pragma omp task out([size] out) in([npart] part)
void calculate_forces(int size, float time, int npart, Part* part, Particle *out, int gid);

void Particle_array_calculate_forces(Particle* input, Particle *output, int npart, float time) {
   for (int i = 0; i < npart; i += BS)
      calculate_forces(BS, time, npart, input, &output[i], i);
}

SLIDE 23

Nesting

int Y[4] = {1,2,3,4};
int main() {
   int X[4] = {5,6,7,8};
   for (int i = 0; i < 2; i++) {
      #pragma omp task out(Y[i]) firstprivate(i,X)
      {
         for (int j = 0; j < 3; j++) {
            #pragma omp task inout(X[j])
            X[j] = f(X[j], j);
            #pragma omp task in(X[j]) inout(Y[i])
            Y[i] += g(X[j]);
         }
         #pragma omp taskwait
      }
   }
   #pragma omp task inout(Y[0;2])
   for (int i = 0; i < 2; i++)
      Y[i] += h(Y[i]);
   #pragma omp task inout(v, Y[3])
   for (int i = 1; i < N; i++)
      Y[3] = h(Y[3]);
   #pragma omp taskwait
}

SLIDE 24

Hybrid MPI/ompSs: Linpack example

Overlap communication/computation
Extend asynchronous data-flow execution to

  • the outer level

Automatic lookahead

#pragma omp task inout([SIZE]A)
void Factor_panel(float *A);
#pragma omp task in([SIZE]A) inout([SIZE]B)
void update(float *A, float *B);
#pragma omp task in([SIZE]A)
void send(float *A);
#pragma omp task out([SIZE]A)
void receive(float *A);
#pragma omp task in([SIZE]A)
void resend(float *A);

…
for (k = 0; k < N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send(A[k]);
   } else {
      receive(A[k]);
      if (necessary) resend(A[k]);
   }
   for (j = k+1; j < N; j++)
      update(A[k], A[j]);
}
…


SLIDE 25

Fighting Amdahl’s law: A chance for lazy programmers

(Timelines: four loops/routines in sequential program order; OpenMP 2.5 with one loop not parallelized; OmpSs/OpenMP 4.0 with one loop not parallelized)

GROMACS@SMPSs

SLIDE 26

OMPSS EXAMPLES

SLIDE 27

Streamed file processing

A typical pattern: sequential file processing
Automatically achieve asynchronous I/O

typedef struct { int size; char buff[MAXSIZE]; } buf_t;
buf_t *p[NBUF];
int j = 0, total_records = 0;

int main() {
   …
   while (!end_trace) {
      buf_t **pb = &p[j%NBUF]; j++;
      #pragma omp task inout(infile) out(*pb, end_trace) priority(10)
      {
         *pb = malloc(sizeof(buf_t));
         Read(infile, *pb, &end_trace);
      }
      #pragma omp task inout(*pb)
      Process(*pb);
      #pragma omp task inout(outfile, *pb, total_records) priority(10)
      {
         int records;
         Write(outfile, *pb, &records);
         total_records += records;
         free(*pb);
      }
      #pragma omp taskwait on (&end_trace)
   }
}

SLIDE 28

Asynchrony: I/O

Asynchrony

– Decoupling/overlapping I/O and processing
– Serialization of I/O
– Resource constraints

  • Request for specific thread,…

– Task duration variance

  • Dynamic schedule

Duration histogram


SLIDE 29

PARSEC benchmark ported to OmpSs

Improved scalability … and LOC

  • D. Chasapis et al., “Exploring the Impact of Task Parallelism Beyond the HPC Domain”, Submitted
SLIDE 30

NMMB: Weather code + Chemical transport

Eliminating latency sensitivity through nesting

  • G. Markomanolis, "Optimizing an Earth Science Atmospheric Application with the OmpSs Programming Model". PRACE days 2014

SLIDE 31

COMPILER AND RUNTIME

SLIDE 32

The Mercurium Compiler

Mercurium

– Source-to-source compiler (supports OpenMP and OmpSs extensions)
– Recognizes pragmas and transforms the original program into calls to Nanos++
– Supports Fortran, C and C++ (backends: gcc, icc, nvcc, …)
– Supports complex scenarios

  • Ex: Single program with MPI, OpenMP, CUDA and OpenCL kernels

http://pm.bsc.es

SLIDE 33

The NANOS++ Runtime

Nanos++

– Common execution runtime (C, C++ and Fortran)
– Task creation, dependence management, resilience, …
– Task scheduling (FIFO, DF, BF, Cilk, priority, socket, affinity, …)
– Data management: unified directory/cache architecture

  • Transparently manages separate address spaces (host, device, cluster)…
  • … and data transfer between them

– Target specific features

http://pm.bsc.es

SLIDE 34

Support environment for dynamic task based systems

Performance analysis Tools

– Profiles

  • Scalasca @ SMPSs, OmpSs
  • Metrics, first order moments

– Traces

  • Analysis of snapshots
  • Paraver instrumentation in all our developments

Potential concurrency detection

– Tareador

Debugging

– Temanejo

http://www.bsc.es/paraver

SLIDE 35

OmpSs instrumentation  Paraver

(Paraver views: task creation; ready queue length (0..65); dependence graph size (0..8000); duration histograms for task "ComputeForcesMT" and for the other tasks)
SLIDE 36

OmpSs instrumentation  Paraver

SLIDE 37

Criticality-awareness in heterogeneous architectures

Heterogeneous multicores

– ARM big.LITTLE: 4× A15 @ 2 GHz; 4× A7 @ 1.4 GHz
– TaskSim simulator: 16-256 cores; 2-4× big/little performance ratio

Runtime approximation of critical path

– Implementable, with a small overhead that pays off
– An approximation is enough

Benefits grow with more cores, more big cores, and a higher big/little performance ratio

  • K. Chronaki et al, “Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures.” ICS 2015
SLIDE 38

OmpSs + CUDA runtime

Improvements in runtime mechanisms

– Use of multiple streams
– High asynchrony and overlap (transfers and kernels)
– Overlapping kernels
– Taking overheads out of the critical path

Improvement in schedulers

– Late binding of locality-aware decisions
– Propagation of priorities

  • J. Planas et al, “AMA: Asynchronous Management of Accelerators for Task-based Programming Models.” ICCS 2015

(Results shown for the N-body and Cholesky benchmarks)

SLIDE 39

Scheduling

Locality aware scheduling

– Affinity to core/node/device can be computed from the pragmas and knowledge of where data was
– Following dependences reduces data movement
– Interaction between locality and load balance (work stealing)

Some “reasonable” criteria

– Task instantiation order is typically a fair criterion
– Honor previous scheduling decisions when using nesting

  • Ensure a minimum amount of resources
  • Prioritize continuation of a parent task in a taskwait once the synchronization is fulfilled

  • R. Al-Omairy et al, "Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing." Submitted

SLIDE 40

DYNAMIC LOAD BALANCING

SLIDE 41

Dynamic Load Balancing

Automatically achieved by the runtime

– Shifting cores between MPI processes within a node
– Fine grain
– Complementary to application-level load balance
– Leverages OmpSs malleability

DLB Mechanisms

– User-level runtime library (DLB)
– Detection of process needs

  • Intercepting runtime calls

– Blocking
– Detection of thread-level concurrency

  • Request/release API

– Coordinating processes within node

  • Through a shared memory region
  • Explicit pinning of threads and handoff scheduling (fighting the Linux kernel)

– Within and across apps

“LeWI: A Runtime Balancing Algorithm for Nested Parallelism”. M.Garcia et al. ICPP09

SLIDE 42

Dynamic Load Balancing

DLB policies

– LeWI: Lend core When Idle – …

Support for “new” usage patterns

– Interactive
– System throughput
– Response time

“LeWI: A Runtime Balancing Algorithm for Nested Parallelism”. M.Garcia et al. ICPP09

SLIDE 43

DLB @ ECHAM

Alternating parallelized and non-parallelized phases
Use of API calls to release/reclaim cores

SLIDE 44

DLB @ ECHAM

Improved concurrency level
Still some improvement possible

SLIDE 45

DLB @ ECHAM

Tracking core migration

SLIDE 46

Dynamic Load Balancing

Applications: ECHAM, HACC, CESM

Problem size (NPY, MPX)   Mapping (nodes, ppn)   No DLB (s)   DLB (s)   Gain
2 x 2                     1 x 4                  2327.44      1541.47   ~34%
4 x 2                     2 x 4                  1252.27      2915.92   ~35%
4 x 4                     4 x 4                  811.27       1636.87   ~44%

SLIDE 47

OTHER

SLIDE 48

OmpSs programming model

Resilience

NANO-FT: task-level checkpoint/restart-based FT
Algorithmic-based FT
Asynchronous task recovery

DSL and supporting DFL MPI offload

CASE/REPSOL FWI

Support multicores, accelerators and distributed systems

CASE/REPSOL Repsolver

(Diagram: a task instance checkpoints its inputs and inouts before execution; on failure, the backup is restored and the task is re-executed asynchronously)

SLIDE 49

CONCLUSION

SLIDE 50

The parallel programming revolution

Parallel programming in the past

– Where to place data
– What to run where
– How to communicate
– Talk to machines
– Dominated by fears/prides

Parallel programming in the future

– What data do I need to use
– What do I need to compute
– Hints (not necessarily very precise) on potential concurrency, locality, …
– Talk to humans
– Dominated by semantics

Schedule @ programmer's mind (static) → schedule @ system (dynamic)
Complexity: divergence between our mental model and reality
Variability

SLIDE 51

SLIDE 52

THANKS