

  1. Performance Analysis and Optimizations for Lambda-based Applications in OpenMP 4.5
     Compiler and Application Teams at IBM; various people at LLNL
     David Truby, Carlo Bertolli, Kevin O’Brien, Kathryn O’Brien
     david.truby@ibm.com, {cbertol,caomhin,kmob}@us.ibm.com
     IBM T. J. Watson Research Center

  2. Scope of Work
     A compiler optimization perspective on C++11 (and newer) frameworks
     • Lambda-based frameworks make performance portability possible: no other known solution that avoids writing a compiler
     • State of the art: plotting performance differences when using C++11 features and OpenMP with various compilers
     • Unclear what compilers actually do
       § On host and device!
     In this presentation
     • Using a special branch of Clang: https://github.com/clang-ykt
       § ...and the Lightweight OpenMP Library
     • Experiments on LULESH v2.0 and RAJA
     • Report performance, then figure out why by looking at the generated code
     • Porting LULESH presents various alternatives
       § Experiment on many different loops to get a full-application view
     IBM Systems | 2

  3. OpenMP and Lambdas on Host: capture by copy
     Captures are retrieved before the loop and reused within it.

     Source:

       template <typename LOOP_BODY>
       inline void forall_omp (int begin, int end, LOOP_BODY loop_body) {
         #pragma omp parallel for proc_bind(spread)
         for (int ii = 0 ; ii < end ; ++ii ) {
           loop_body( ii );
         }
       }

       int main() {
         double *a, *b, *c;
         // init a, b, and c
         forall_omp (0, n, [=] (int i) {
           a[i] += b[i] + c[i];
         } );
       }

     Generated code (capture all variables undefined in the region by copy):

       struct anon { double *a, *b, *c; };

       int main() {
         struct anon args;
         args.a = a;
         args.b = b;
         args.c = c;
         fork_call(outlined_region, .., args );
       }

       void outlined_region(.., struct anon args ) {
         double *a, *b, *c;
         a = args.a;
         b = args.b;
         c = args.c;
         for (int i = 0 ; i < n ; i++) {..}
       }
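The copy-capture pattern above can be condensed into a self-contained, compilable sketch. forall_omp mirrors the slide's template; vector_add_by_copy is a hypothetical helper added here only so the result can be checked; the pragma is simply ignored when compiling without -fopenmp:

```cpp
#include <vector>

// Sketch of the slide's forall_omp template: with [=] the lambda's closure
// is a struct of copied pointers (struct anon { double *a, *b, *c; }), so
// the compiler retrieves a, b, c once before the loop and reuses them.
template <typename LOOP_BODY>
inline void forall_omp(int begin, int end, LOOP_BODY loop_body) {
  #pragma omp parallel for
  for (int ii = begin; ii < end; ++ii)
    loop_body(ii);
}

// Hypothetical helper running the slide's vector-add pattern and
// returning the last element so the result can be asserted on.
inline double vector_add_by_copy(int n) {
  std::vector<double> av(n, 1.0), bv(n, 2.0), cv(n, 3.0);
  double *a = av.data(), *b = bv.data(), *c = cv.data();
  // [=] copies the three pointers into the closure object.
  forall_omp(0, n, [=](int i) { a[i] += b[i] + c[i]; });
  return av[n - 1];
}
```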

  4. OpenMP and Lambdas on Host: capture by reference
     Captures are now retrieved from within the loop body.

     Source:

       template <typename LOOP_BODY>
       inline void forall_omp (int begin, int end, LOOP_BODY loop_body) {
         #pragma omp parallel for proc_bind(spread)
         for (int ii = 0 ; ii < end ; ++ii ) {
           loop_body( ii );
         }
       }

       int main() {
         double *a, *b, *c;
         // init a, b, and c
         forall_omp(0, n, [&] (int i) {
           a[i] += b[i] + c[i];
         } );
       }

     Generated code (capture all variables undefined in the region by reference):

       struct anon { double **a, **b, **c; };

       int main() {
         struct anon args;
         args.a = &a;
         args.b = &b;
         args.c = &c;
         fork_call(outlined_region, .., args );
       }

       void outlined_region(.., struct anon args ) {
         double *a, *b, *c;
         for (int i = 0 ; i < n ; i++) {
           a = *args.a;
           b = *args.b;
           c = *args.c;
           a_val = load a[i];
           b_val = load b[i];
           c_val = load c[i];
           // …
         }
       }
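Why the closure members become double** with [&] can be seen in a small compilable sketch (names here are illustrative, not from the presented sources): the lambda stores the address of each pointer variable, so every use inside the loop goes through an extra indirection, and reassigning the pointer after the lambda is created changes what the lambda touches:

```cpp
#include <vector>

// Demonstrates that a [&] capture of a pointer variable stores &a
// (a double**) in the closure: retargeting `a` after the lambda is
// built redirects the lambda's writes to the new array.
inline double reference_capture_indirection() {
  std::vector<double> first(4, 1.0), second(4, 100.0);
  double *a = first.data();
  auto bump = [&](int i) { a[i] += 1.0; };  // closure holds &a, not a copy
  bump(0);            // writes first[0]  -> 2.0
  a = second.data();  // retarget the captured pointer variable
  bump(0);            // writes second[0] -> 101.0
  return first[0] + second[0];
}
```

This per-use indirection is exactly why the generated outlined region must reload the pointers from the struct inside the loop body rather than hoisting them.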

  5. OpenMP and Lambdas on Device
     What the compiler does for you:
     1. Implicit map(tofrom) of the lambda struct (can be optimized to map(to))
     2. Instruct the runtime to translate pointers in struct anon from host to device

     Source:

       template <typename LOOP_BODY>
       inline void forall_omp (int begin, int end, LOOP_BODY loop_body) {
         #pragma omp target teams distribute parallel for
         for (int ii = 0 ; ii < end ; ++ii )
           loop_body( ii );
       }

       int main() {
         double *a, *b, *c;
         // init a, b, and c
         #pragma omp target enter data map(to: a[:n], b[:n], c[:n])
         forall_omp (0, n, [=] (int i) {
           a[i] += b[i] + c[i];
         } );
         #pragma omp target exit data map(from: a[:n]) map(release: b[:n], c[:n])
       }

     Generated code (a, b, and c will be translated by the runtime):

       struct anon { double *a, *b, *c; };

       int main() {
         struct anon args;
         args.a = a;
         args.b = b;
         args.c = c;
         tgt_target_teams(outlined_region, .., args );
       }
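Step 2 above can be modeled on the host. The following is a toy model, not the real libomptarget interface: when the closure struct is mapped to the device, the runtime looks up each host pointer member in its mapping table and rewrites it to the corresponding device address:

```cpp
#include <map>

// Toy host-to-device address table mimicking what the OpenMP runtime
// maintains for mapped allocations (illustrative, not the real API).
struct DeviceMap {
  std::map<const void*, void*> table;  // host base -> device base
  void map_to(const void* host, void* device) { table[host] = device; }
  void* translate(const void* host) const {
    auto it = table.find(host);
    return it == table.end() ? nullptr : it->second;
  }
};

struct anon { double *a, *b, *c; };  // the implicit lambda closure

// Rewrite every pointer member of the mapped closure, as the runtime
// does before launching the outlined target region.
inline anon translate_closure(const anon& host_args, const DeviceMap& dm) {
  anon dev_args;
  dev_args.a = static_cast<double*>(dm.translate(host_args.a));
  dev_args.b = static_cast<double*>(dm.translate(host_args.b));
  dev_args.c = static_cast<double*>(dm.translate(host_args.c));
  return dev_args;
}
```

A pointer that was never mapped translates to nullptr here; the real runtime has richer error handling, but the lookup-and-rewrite shape is the point.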

  6. Very Simple Tests – Vector Add
     Compare #parallel for with and without a lambda, with different captures.
     [Charts: execution time (msec) vs. problem size 10^4–10^8 for Clang [=] SMT=8 and Clang [&] SMT=8, lambda vs. plain, plus %difference plots of [&] vs. [=], lambda vs. [=], and plain vs. [=], ranging roughly -20% to 140%.]
     Clang does not vectorize the lambda body with [&] capture:
       remark: loop not vectorized: cannot identify array bounds

  7. Very Simple Test – Vector Add with Target
     [Chart: execution time (msec) vs. problem size 10^4–10^8, lambda vs. plain; the lambda version is 282.4% slower at the smallest size, 5.6% at larger sizes.]
     • Difference only at smaller sizes, up to one order of magnitude
     • Disappears with a large iteration space
     • Generated code is identical, except the lambda version has to retrieve the pointers from the struct

  8. LULESH 2.0 – Performance Analysis
     Partial study of LULESH 2.0 using RAJA
     • Using the RAJA OpenMP 4.5 backend plus our special compiler branch
     • Four versions of the code:
       § Host: plain OpenMP parallel for, RAJA with domain object, RAJA with direct array access
       § Device: plain OpenMP target region, RAJA with array capturing
     Experiments
     • On a Power8 S822LC ("Minsky") server with a Pascal GPU (Tesla P100-SXM2-16GB)
     • Options and env: -O3, -fopenmp-implicit-declare-target, -ffp-contract=fast, threads explicitly pinned to cores

     Kernel                          | Description                                        | Instructions
     CalcLagrangeElements            | Elements, small kernel with few operations         | 4 fadd, 6 fsub, 2 fdiv
     CalcSoundSpeedForElems          | Variable iteration space, small kernel with switch | 1 fadd, 4 fmul, 1 fdiv, 1 sqrt
     CalcMonotonicQGradientsForElems | Elements, large kernel without control flow        | 118 fadd, 27 fsub, 64 fmul, 4 fdiv, 2 sqrt
     CalcMonotonicQRegionForElems    | Variable iteration space, large kernel with switch | 10 fadd, 7 fsub, 35 fmul, 4 fdiv

  9. LULESH – OpenMP Target Implementation
     We modified LULESH to access the domain arrays from within the capture expression.

     Original:

       RAJA::forall<elem_exec_policy>(0, numElem,
         [=] (int k) {
           // calc strain rate and apply as constraint
           // (only done in FB element)
           Real_t vdov = domain.dxx(k) + domain.dyy(k) + domain.dzz(k) ;
           Real_t vdovthird = vdov/Real_t(3.0) ;
           // make the rate of deformation tensor deviatoric
           domain.vdov(k) = vdov ;
           domain.dxx(k) -= vdovthird ;
           domain.dyy(k) -= vdovthird ;
           domain.dzz(k) -= vdovthird ;
         } );

     Modified:

       RAJA::forall<target_exec_policy>(0, numElem,
         [=, dxx=&domain.dxx(0),
             dyy=&domain.dyy(0),
             dzz=&domain.dzz(0),
             vdov_v=&domain.vdov(0)] (int k) {
           // calc strain rate and apply as constraint
           // (only done in FB element)
           Real_t vdov = dxx[k] + dyy[k] + dzz[k] ;
           Real_t vdovthird = vdov/Real_t(3.0) ;
           // make the rate of deformation tensor deviatoric
           vdov_v[k] = vdov ;
           dxx[k] -= vdovthird ;
           dyy[k] -= vdovthird ;
           dzz[k] -= vdovthird ;
         } );
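The key mechanism in the modified version is the C++14 init-capture, which evaluates an expression on the host and copies the resulting raw pointer into the closure. A minimal stand-in (MiniDomain and scaled_sum are hypothetical, not the LULESH Domain class) shows the pattern:

```cpp
#include <vector>

// Minimal stand-in for LULESH's Domain accessor pattern: dxx(i) returns
// a reference into an internal array, so &domain.dxx(0) yields the raw
// base pointer that an init-capture can copy by value into the closure.
struct MiniDomain {
  std::vector<double> m_dxx;
  explicit MiniDomain(int n) : m_dxx(n, 1.0) {}
  double& dxx(int i) { return m_dxx[i]; }
};

// Sequential stand-in for RAJA::forall with some execution policy.
template <typename Body>
inline void forall(int begin, int end, Body body) {
  for (int i = begin; i < end; ++i) body(i);
}

inline double scaled_sum(int n) {
  MiniDomain domain(n);
  // Init-capture: the loop body indexes a plain double* instead of
  // calling the accessor through the whole domain object, which is what
  // makes the pointer translatable when the closure is mapped to a device.
  forall(0, n, [dxx = &domain.dxx(0)](int k) { dxx[k] *= 2.0; });
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += domain.dxx(i);
  return s;
}
```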

  10. Host Performance – Impact of Lambdas
      K1 = CalcLagrangeElements
      K2 = CalcMonotonicQGradientsForElems
      K3.1, K3.2 = CalcMonotonicQRegionForElems
      K4.1, K4.2 = CalcSoundSpeedForElems
      [Charts: percentage difference per kernel between the RAJA version with the domain object and the plain version, and between the RAJA version using direct arrays and the plain version, for Size=12 and Size=30 at SMT=1, 2, 4, 8. Differences range from roughly -10% to 70%.]

  11. Host Performance – Impact of Lambdas (continued)
      [Charts: the same %difference comparisons for Size=60 and Size=100 at SMT=1, 2, 4, 8. Differences range from roughly -40% to 120%.]
