slide-1
SLIDE 1
outthink limits

Performance Analysis and Optimizations for Lambda-based Applications in OpenMP 4.5

Compiler and Application Teams at IBM
Various People at LLNL
David Truby, Carlo Bertolli, Kevin O’Brien, Kathryn O’Brien
david.truby@ibm.com, {cbertol,caomhin,kmob}@us.ibm.com
IBM T. J. Watson Research Center

slide-2
SLIDE 2

Scope of Work

Compiler optimization perspective on C++11 (and newer) frameworks

  • Lambda-based frameworks make performance portability possible: no other known compiler-free solution

  • State of the art: plotting performance differences when using C++11 features and OpenMP with various compilers

  • Unclear what compilers actually do

§ On host and device!

In this presentation

  • Using special branch of Clang: https://github.com/clang-ykt

§ …and the Lightweight OpenMP Library

  • Experiments on LULESH v2.0 and RAJA
  • Reporting performance, then investigating why by looking at the generated code
  • Porting LULESH presents various alternatives

§ Experiment on many different loops to get a full-application view

slide-3
SLIDE 3

OpenMP and Lambdas on Host

template <typename LOOP_BODY>
inline void forall_omp(int begin, int end, LOOP_BODY loop_body) {
  #pragma omp parallel for proc_bind(spread)
  for (int ii = begin; ii < end; ++ii) {
    loop_body(ii);
  }
}

int main() {
  double *a, *b, *c;
  // init a, b, and c
  forall_omp(0, n, [=] (int i) { a[i] += b[i] + c[i]; });
}


Lowering (pseudocode): capture all variables not defined in the region by copy:

struct anon { double *a, *b, *c; };

int main() {
  struct anon args;
  args.a = a; args.b = b; args.c = c;
  fork_call(outlined_region, .., args);
}

void outlined_region(.., struct anon args) {
  double *a, *b, *c;
  a = args.a; b = args.b; c = args.c;   // captures loaded once, before the loop
  for (int i = 0; i < n; i++) { .. }
}

Capture by copy: captures are retrieved before the loop and reused within it.
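For intuition, the [=] lambda above corresponds roughly to the following compiler-generated closure type (a sketch; the name anon is illustrative):

struct anon {
  double *a, *b, *c;   // pointer values captured by copy
  void operator()(int i) const { a[i] += b[i] + c[i]; }
};

Because the closure holds the pointer values themselves, the outlined region can load them once before the loop.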

slide-4
SLIDE 4

OpenMP and Lambdas on Host

template <typename LOOP_BODY>
inline void forall_omp(int begin, int end, LOOP_BODY loop_body) {
  #pragma omp parallel for proc_bind(spread)
  for (int ii = begin; ii < end; ++ii) {
    loop_body(ii);
  }
}

int main() {
  double *a, *b, *c;
  // init a, b, and c
  forall_omp(0, n, [&] (int i) { a[i] += b[i] + c[i]; });
}


Lowering (pseudocode): capture all variables not defined in the region by reference; captures are now retrieved from within the loop body:

struct anon { double **a, **b, **c; };

int main() {
  double *a, *b, *c;
  struct anon args;
  args.a = &a; args.b = &b; args.c = &c;
  fork_call(outlined_region, .., args);
}

void outlined_region(.., struct anon args) {
  for (int i = 0; i < n; i++) {
    double *a = *args.a;   // captures re-loaded
    double *b = *args.b;   // on every iteration,
    double *c = *args.c;   // inside the loop body
    a_val = load a[i]; b_val = load b[i]; c_val = load c[i];
    // …
  }
}

Capture by reference
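By analogy with the previous slide, the [&] lambda corresponds roughly to a closure holding references to the captured variables (a sketch; the name anon is illustrative):

struct anon {
  double *&a, *&b, *&c;   // references to the captured pointers
  void operator()(int i) const { a[i] += b[i] + c[i]; }
};

Every a[i] access must first load the pointer through the reference, which is why the capture loads appear inside the loop body.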

slide-5
SLIDE 5

OpenMP and Lambdas on Device

template <typename LOOP_BODY>
inline void forall_omp(int begin, int end, LOOP_BODY loop_body) {
  #pragma omp target teams distribute parallel for
  for (int ii = begin; ii < end; ++ii)
    loop_body(ii);
}

int main() {
  double *a, *b, *c;
  // init a, b, and c
  #pragma omp target enter data map(to: a[:n], b[:n], c[:n])
  forall_omp(0, n, [=] (int i) { a[i] += b[i] + c[i]; });
  #pragma omp target exit data map(from: a[:n]) map(release: b[:n], c[:n])
}



a, b, and c will be translated by the runtime:

struct anon { double *a, *b, *c; };

int main() {
  double *a, *b, *c;
  struct anon args;
  args.a = a; args.b = b; args.c = c;
  tgt_target_teams(outlined_region, .., args);
}

What the compiler does for you:

1. Implicit map(tofrom) of the lambda struct (can be optimized to map(to))
2. Instructs the runtime to translate the pointers in struct anon from host to device
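A minimal sketch of the equivalent explicit source, assuming the forall_omp and loop_body names from this slide (in practice the compiler does this in the IR):

// the implicit behavior, written out: the closure object is mapped, and its
// pointer members are translated to device addresses because a, b, and c
// were mapped earlier by "target enter data"
#pragma omp target teams distribute parallel for map(tofrom: loop_body)
for (int ii = begin; ii < end; ++ii)
  loop_body(ii);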

slide-6
SLIDE 6

Very Simple Tests – Vector Add

Compare #pragma omp parallel for with and without a lambda, using different captures

Clang does not vectorize the lambda body with [&] capture:

remark: loop not vectorized: cannot identify array bounds

[Charts: execution time (msec, log scale) vs. problem size (10^4 to 10^8) for the Lambda and Plain versions, Clang [=] SMT=8 and Clang [&] SMT=8; and % difference between the [&] and [=] versions of each]
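For reference, the two variants behind these plots, assuming the vector-add lambda from slide 3; only the capture default differs:

forall_omp(0, n, [&] (int i) { a[i] += b[i] + c[i]; });  // [&]: not vectorized,
                                                         // array bounds unknown
forall_omp(0, n, [=] (int i) { a[i] += b[i] + c[i]; });  // [=]: vectorized, the
                                                         // pointers are loop-invariant copies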

slide-7
SLIDE 7

Very Simple Test – Vector Add with Target


[Chart: execution time (msec) vs. problem size (10^4 to 10^8) for the Lambda and Plain versions; annotated differences: 282.4% and 5.6%]

Difference only at smaller sizes, up to one order of magnitude; it disappears at large iteration space sizes. The generated code is identical, except that the lambda version has to retrieve pointers from the struct.

slide-8
SLIDE 8

LULESH 2.0 – Performance Analysis

Partial study of LULESH 2.0 using RAJA

  • Using the RAJA OpenMP 4.5 backend plus our special compiler branch
  • Four versions of code:

§ Host: plain OpenMP parallel for, RAJA with domain, RAJA with direct array access
§ Device: plain OpenMP target region, RAJA with array capturing

Experiments

  • On Power8 S822LC ("Minsky") server, including Pascal GPU (Tesla P100-SXM2-16GB)
  • Options and env: -O3, -fopenmp-implicit-declare-target, -ffp-contract=fast, explicitly pinning threads to cores


Kernel | Description | Instructions
CalcLagrangeElements | Elements, small kernel with few operations | 4 fadd, 6 fsub, 2 fdiv
CalcSoundSpeedForElems | Variable iteration space, small kernel with switch | 1 fadd, 4 fmul, 1 fdiv, 1 sqrt
CalcMonotonicQGradientsForElems | Elements, large kernel without control flow | 118 fadd, 27 fsub, 64 fmul, 4 fdiv, 2 sqrt
CalcMonotonicQRegionForElems | Variable iteration space, large kernel with switch | 10 fadd, 7 fsub, 35 fmul, 4 fdiv, …

slide-9
SLIDE 9

LULESH – OpenMP Target Implementation

We modified LULESH to access domain arrays from within the capture expression


Original (domain object accessed inside the lambda body):

RAJA::forall<elem_exec_policy>(0, numElem, [=] (int k) {
  // calc strain rate and apply as constraint
  // (only done in FB element)
  Real_t vdov = domain.dxx(k) + domain.dyy(k) + domain.dzz(k);
  Real_t vdovthird = vdov / Real_t(3.0);
  // make the rate of deformation tensor deviatoric
  domain.vdov(k) = vdov;
  domain.dxx(k) -= vdovthird;
  domain.dyy(k) -= vdovthird;
  domain.dzz(k) -= vdovthird;
});

Modified (arrays captured explicitly via init-captures):

RAJA::forall<target_exec_policy>(0, numElem,
    [=, dxx=&domain.dxx(0), dyy=&domain.dyy(0),
        dzz=&domain.dzz(0), vdov_v=&domain.vdov(0)] (int k) {
  // calc strain rate and apply as constraint
  // (only done in FB element)
  Real_t vdov = dxx[k] + dyy[k] + dzz[k];
  Real_t vdovthird = vdov / Real_t(3.0);
  // make the rate of deformation tensor deviatoric
  vdov_v[k] = vdov;
  dxx[k] -= vdovthird;
  dyy[k] -= vdovthird;
  dzz[k] -= vdovthird;
});
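Note that the init-captures in the modified version (e.g. dxx=&domain.dxx(0)) are a C++14 feature; in plain C++11 the same effect needs local pointer variables declared before the lambda and captured by value.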

slide-10
SLIDE 10

Host Performance - Impact of Lambdas

[Charts: percentage difference for kernels K1, K2, K3.1, K3.2, K4.1, K4.2 at Size=12 and Size=30, SMT=1/2/4/8; series: %diff domain, %diff array]

K1 = CalcLagrangeElements
K2 = CalcMonotonicQGradientsForElems
K3.1, K3.2 = CalcMonotonicQRegionForElems
K4.1, K4.2 = CalcSoundSpeedForElems

%diff domain = % difference between the RAJA version with the domain object and the plain version
%diff array = % difference between the RAJA version using arrays and the plain version

slide-11
SLIDE 11

Host Performance - Impact of Lambdas

[Charts: percentage difference for kernels K1, K2, K3.1, K3.2, K4.1, K4.2 at Size=60 and Size=100, SMT=1/2/4/8; series: %diff domain, %diff array]
slide-12
SLIDE 12

LULESH – Host Results

At small iteration sizes, missing vectorization causes significant slowdowns. At large iteration sizes, the difference is within 10% for most kernels.

  • In some cases, using lambdas results in better performance!
  • Missing vectorization becomes irrelevant

The only kernel that performs poorly is CalcLagrangeElements

  • Very small number of instructions and loads/stores
  • Compute-limited: missing vectorization heavily impacts performance
  • Vectorizer report: cannot identify loop bounds
  • Likely because of the use of std:: iteration spaces to represent loop bounds
  • Using or implementing a different RAJA parallel for would fix the issue (see the sketch below)

Improved vectorization in Clang likely to impact multiple architectures

  • Comparison with GCC shows that, in simple examples, it can be done
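A minimal sketch of such an alternative (a hypothetical forall_raw, not an actual RAJA API): exposing raw integer bounds instead of std:: iterator-based iteration spaces gives the vectorizer identifiable loop bounds:

template <typename Body>
inline void forall_raw(int begin, int end, Body body) {
  // raw integer trip count: the vectorizer can identify the bounds
  #pragma omp parallel for simd
  for (int i = begin; i < end; ++i)
    body(i);
}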
slide-13
SLIDE 13

LULESH Device Performance Numbers

[Charts: percentage difference for kernels K1, K2, K3.1, K3.2, K4.1, K4.2 at Size=12; series: %diff original, %diff modified]

slide-14
SLIDE 14

LULESH Device Performance Analysis

The latest version of the compiler fails to elide the OpenMP runtime because:

  • The target region contains a function call to the lambda (the loop body)

The modified compiler version is obtained by forcing runtime elision

  • This also improves the plain OpenMP target version – we are analyzing why

Lambda arguments are retrieved from within the loop body

  • Similar to what happened for the host vector add when capturing by reference [&] (see the sketch below)
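The desired transformation, sketched on the struct anon lowering from the earlier slides (field and variable names illustrative):

// before: the pointer is retrieved from the lambda struct on every iteration
for (int k = 0; k < numElem; ++k)
  args.dxx[k] -= vdovthird;   // member load of args.dxx repeats each iteration
// after: the load is hoisted out of the loop body
double *dxx = args.dxx;       // load once
for (int k = 0; k < numElem; ++k)
  dxx[k] -= vdovthird;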

Similar register allocation figures


Kernel | #regs plain target | #regs RAJA target
CalcLagrangeElements | 30 | 32
CalcMonotonicQRegionForElems | 64 | 112
CalcSoundSpeedForElems | 32 | 32
CalcMonotonicQGradientsForElems | 254 | 238

slide-15
SLIDE 15

Conclusion

Huge potential for improvements with some effort

  • Improve vectorization capability on host

§ Modify RAJA and/or improve optimizer

  • Improve runtime elision detection code for lambda-based target regions

§ Fix the compiler, but this might require a slightly simpler RAJA implementation

  • Move lambda argument loads out of loop body on device

Major focus for next few months

slide-16
SLIDE 16

Thank you!

IBM Systems

ibm.com/systems/hpc

slide-17
SLIDE 17

FALLBACK

slide-18
SLIDE 18

OpenMP and Lambdas on Device

class Domain { double *a, *b, *c; };

int main() {
  Domain domain;
  // init a, b, and c in the domain object
  #pragma omp target enter data map(to: domain, \
      domain.a[:n], domain.b[:n], domain.c[:n])
  forall_omp(0, n, [=] (int i) {
    domain.a[i] += domain.b[i] + domain.c[i];
  });
}



Capture of the Domain object: the compiler would need to map:

1. The domain object – one map entry
2. All pointers within domain, as any could be used in the target region – one map entry per Domain field

Too many maps per target region: 46 in LULESH 2.0

Limitation: the original LULESH 2.0 version even uses a domain pointer (domain->a[i]). The compiler would then need to map:

1. The Domain pointer
2. The Domain pointee – the object itself
3. All pointers within the domain object – a deep copy?

Analysis of LULESH presents challenges
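To make the map explosion concrete, a sketch of the explicit maps that capturing the whole object would require, using the Domain fields from this slide (one entry per pointer field):

#pragma omp target enter data map(to: domain)        // the object itself
#pragma omp target enter data map(to: domain.a[:n])  // plus one entry per
#pragma omp target enter data map(to: domain.b[:n])  // pointer field:
#pragma omp target enter data map(to: domain.c[:n])  // 46 in LULESH 2.0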