BOAST: Performance Portability Using Meta-Programming and Auto-Tuning

Frédéric Desprez (1), Brice Videau (1, 3), Kevin Pouget (1), Luigi Genovese (2), Thierry Deutsch (2), Dimitri Komatitsch (3), Jean-François Méhaut (1)
(1) INRIA/LIG - CORSE, (2) CEA - L_Sim, (3) CNRS
Workshop CCDSC, October 6, 2016

Outline: Introduction, A Parametrized Generator, Case Study, Real Applications, Conclusions, Bibliography

Scientific Application Portability

Limited Portability
- Huge codes (more than 100 000 lines), written in FORTRAN or C++
- Collaborative efforts
- Use many different programming paradigms (OpenMP, OpenCL, CUDA, ...)

But Based on Computing Kernels
- Well defined parts of a program
- Compute intensive
- Prime target for optimization

Kernels Should Be Written
- In a portable manner
- In a way that raises developer productivity
- To present good performance

HPC Architecture Evolution

Very Rapid and Diverse, Top500:
- Sunway processor (TaihuLight)
- Intel processor + Xeon Phi (Tianhe-2)
- AMD processor + nVidia GPU (Titan)
- IBM BlueGene/Q (Sequoia)
- Fujitsu SPARC64 (K Computer)
- Intel processor + nVidia GPU (Tianhe-1)
- AMD processor (Jaguar)

Tomorrow?
- ARM + DSP?
- Intel Atom + FPGA?
- Quantum computing?

How to write kernels that could adapt to those architectures? (Well, maybe not quantum computing...)

Related Work

Ad hoc autotuners (usually for libraries):
- Atlas [6] (C macro processing)
- SPIRAL [4] (DSL)
- ...

Generic frameworks using annotation systems:
- POET [7] (external annotation file)
- Orio [3] (source annotation)
- BEAST [1] (Python preprocessor based, embedded DSL for optimization space definition/pruning)

Generic frameworks using embedded DSL:
- Halide [5] (C++, not very generic, 2D stencil targeted)
- Heterogeneous Programming Library [2] (C++)

Classical Tuning of Computing Kernels

[Figure: kernel optimization workflow, a loop through Development (Source Code), Compilation (Binary), Performance Analysis (Performance data), and Optimization back into the source code.]

- Development and optimization: usually performed by a knowledgeable developer.
- Compilation (Gcc, Mercurium, OpenCL): compilers perform optimizations, either architecture specific or generic.
- Performance analysis (MAQAO, HW counters, proprietary tools): performance data hint at source transformations, as architecture specific or generic hints.
- Kernel versions multiply and/or get lost, and it is difficult to benchmark versions against each other.

BOAST Workflow

[Figure: BOAST workflow. Development produces Generative Source Code; a Transformation step (BOAST) produces Source Code; Compilation produces a Binary; Performance Analysis produces Performance data, which feeds Optimization.]

- Meta-programming of optimizations in BOAST, using a high level object oriented language.
- BOAST generates combinations of optimizations; C, OpenCL, FORTRAN and CUDA are supported.
- Compilation (Gcc, Mercurium, OpenCL) and analysis (MAQAO, HW counters, proprietary tools) are automated.
- Selection of the best version can also be automated.
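The automated "measure every version, keep the best" step can be sketched in plain C. This is only an illustration of the idea, not BOAST's interface: the names `kernel_fn`, `time_variant` and `select_best` are hypothetical, and the two hand-written variants of a sum kernel stand in for versions that BOAST would generate.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: sums n floats. */
typedef float (*kernel_fn)(const float *x, size_t n);

/* Variant 0: plain loop. */
static float sum_plain(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Variant 1: manually unrolled by 4, with a scalar remainder loop. */
static float sum_unroll4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

/* Best-of-reps CPU time (seconds) for one variant. */
static double time_variant(kernel_fn k, const float *x, size_t n, int reps) {
    double best = -1.0;
    volatile float sink = 0.0f; /* keeps the call from being optimized away */
    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        sink = k(x, n);
        clock_t t1 = clock();
        double dt = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best) best = dt;
    }
    (void)sink;
    return best;
}

/* Returns the index of the fastest variant on the given input. */
static size_t select_best(const kernel_fn *variants, size_t count,
                          const float *x, size_t n, int reps) {
    assert(count > 0 && reps > 0);
    size_t best = 0;
    double best_time = time_variant(variants[0], x, n, reps);
    for (size_t v = 1; v < count; v++) {
        double t = time_variant(variants[v], x, n, reps);
        if (t < best_time) {
            best_time = t;
            best = v;
        }
    }
    return best;
}
```

Calling `select_best` with an array holding `sum_plain` and `sum_unroll4` returns the index of whichever ran faster on that machine and input, so the winner can differ from one architecture to the next.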

BOAST Architecture

[Figure: BOAST architecture, with five numbered steps.]

Inputs and external tools:
- Application kernel (SPECFEM3D, BigDFT, ...)
- Optimization space pruner: ASK, Collective Mind
- Binary analysis tool, like MAQAO

Elements of the tuning loop:
- Kernel written in BOAST DSL
- User selections: input data, target language, optimizations, compiler and options, performance metrics
- BOAST code generation: C kernel, Fortran kernel, OpenCL kernel, CUDA kernel, C with vector intrinsics kernel
- BOAST runtime (gcc, opencl): binary kernel, performance measurements
- Result: best performing version
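The "select optimizations" and "optimization space pruner" boxes can be pictured as enumerating a Cartesian product of parameters and dropping unwanted points. The sketch below is a minimal C illustration under assumptions: the parameters `vector_length` and `unroll`, the type `config_t`, and the pruning rule (at most `max_elems` elements per iteration) are all hypothetical and are not taken from BOAST, ASK or Collective Mind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical point of the optimization space. */
typedef struct {
    int vector_length;
    int unroll;
} config_t;

/* Hypothetical pruning rule: bound the elements handled per iteration. */
static int keep_config(config_t c, int max_elems) {
    return c.vector_length * c.unroll <= max_elems;
}

/*
 * Enumerates vls x unrolls, keeps the configurations accepted by
 * keep_config, and stores up to cap of them in out.
 * Returns the total number of kept configurations.
 */
static size_t enumerate_space(const int *vls, size_t n_vl,
                              const int *unrolls, size_t n_un,
                              int max_elems, config_t *out, size_t cap) {
    assert(out != NULL || cap == 0);
    size_t count = 0;
    for (size_t i = 0; i < n_vl; i++) {
        for (size_t j = 0; j < n_un; j++) {
            config_t c = { vls[i], unrolls[j] };
            if (!keep_config(c, max_elems)) continue;
            if (count < cap) out[count] = c;
            count++;
        }
    }
    return count;
}
```

With vector lengths {1, 2, 4, 8, 16}, unroll factors {1, 2, 4} and `max_elems` set to 16, three of the fifteen combinations are pruned, leaving twelve versions to generate and benchmark.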

Example: Laplace Kernel from ARM

    void laplace(const int width,
                 const int height,
                 const unsigned char src[height][width][3],
                 unsigned char dst[height][width][3]) {
      for (int j = 1; j < height-1; j++) {
        for (int i = 1; i < width-1; i++) {
          for (int c = 0; c < 3; c++) {
            int tmp = -src[j-1][i-1][c] - src[j-1][i][c] - src[j-1][i+1][c]
                     - src[j  ][i-1][c] + 9*src[j  ][i][c] - src[j  ][i+1][c]
                     - src[j+1][i-1][c] - src[j+1][i][c] - src[j+1][i+1][c];
            dst[j][i][c] = (tmp < 0 ? 0 : (tmp > 255 ? 255 : tmp));
          }
        }
      }
    }

- C reference implementation
- Many opportunities for improvement
- ARM GPU Mali 604 within the Montblanc project

Example: Laplace in OpenCL

    kernel void laplace(const int width,
                        const int height,
                        global const uchar *src,
                        global uchar *dst) {
      /* The two outer loops of the C version become the global IDs. */
      int i = get_global_id(0);
      int j = get_global_id(1);
      for (int c = 0; c < 3; c++) {
        int tmp = -src[3*width*(j-1) + 3*(i-1) + c]
                 - src[3*width*(j-1) + 3*(i  ) + c]
                 - src[3*width*(j-1) + 3*(i+1) + c]
                 - src[3*width*(j  ) + 3*(i-1) + c]
               + 9*src[3*width*(j  ) + 3*(i  ) + c]
                 - src[3*width*(j  ) + 3*(i+1) + c]
                 - src[3*width*(j+1) + 3*(i-1) + c]
                 - src[3*width*(j+1) + 3*(i  ) + c]
                 - src[3*width*(j+1) + 3*(i+1) + c];
        dst[3*width*j + 3*i + c] = clamp(tmp, 0, 255);
      }
    }

- OpenCL reference implementation
- Outer loops mapped to threads
- 1 pixel per thread
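The OpenCL version replaces the three-dimensional indexing `src[j][i][c]` of the C reference with a flat offset `3*width*j + 3*i + c`. The C sketch below shows that the two address the same byte. The helper names `flat_index` and `mapping_matches` are hypothetical, and the check relies on C99 variable length array types, as the reference implementation already does.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of component c of pixel (i, j) in a row-major, 3-channel image. */
static size_t flat_index(size_t width, size_t j, size_t i, size_t c) {
    assert(c < 3);
    return 3 * width * j + 3 * i + c;
}

/*
 * Returns 1 if, for every pixel and component, the flat offset designates
 * the same byte as the 3-D access img[j][i][c] on the same buffer.
 */
static int mapping_matches(size_t height, size_t width,
                           const unsigned char *flat) {
    /* View the flat buffer as height x width x 3 (pointer to a VLA type). */
    const unsigned char (*img)[width][3] =
        (const unsigned char (*)[width][3])flat;
    for (size_t j = 0; j < height; j++)
        for (size_t i = 0; i < width; i++)
            for (size_t c = 0; c < 3; c++)
                if (&img[j][i][c] != &flat[flat_index(width, j, i, c)])
                    return 0;
    return 1;
}
```

For a 5-pixel-wide image, component 1 of pixel (i = 2, j = 1) sits at offset 3*5*1 + 3*2 + 1 = 22.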

Example: Vectorizing

    kernel void laplace(const int width,
                        const int height,
                        global const uchar *src,
                        global uchar *dst) {
      int i = get_global_id(0);
      int j = get_global_id(1);
      /* Load 16 bytes at byte offsets -3, 0 and +3 in each of the three rows. */
      uchar16 v11_ = vload16(0, src + 3*width*(j-1) + 3*5*i - 3);
      uchar16 v12_ = vload16(0, src + 3*width*(j-1) + 3*5*i    );
      uchar16 v13_ = vload16(0, src + 3*width*(j-1) + 3*5*i + 3);
      uchar16 v21_ = vload16(0, src + 3*width*(j  ) + 3*5*i - 3);
      uchar16 v22_ = vload16(0, src + 3*width*(j  ) + 3*5*i    );
      uchar16 v23_ = vload16(0, src + 3*width*(j  ) + 3*5*i + 3);
      uchar16 v31_ = vload16(0, src + 3*width*(j+1) + 3*5*i - 3);
      uchar16 v32_ = vload16(0, src + 3*width*(j+1) + 3*5*i    );
      uchar16 v33_ = vload16(0, src + 3*width*(j+1) + 3*5*i + 3);
      /* Widen to int so the sum cannot overflow. */
      int16 v11 = convert_int16(v11_);
      int16 v12 = convert_int16(v12_);
      int16 v13 = convert_int16(v13_);
      int16 v21 = convert_int16(v21_);
      int16 v22 = convert_int16(v22_);
      int16 v23 = convert_int16(v23_);
      int16 v31 = convert_int16(v31_);
      int16 v32 = convert_int16(v32_);
      int16 v33 = convert_int16(v33_);
      int16 res = v22 * (int)9 - v11 - v12 - v13 - v21 - v23 - v31 - v32 - v33;
      res = clamp(res, (int16)0, (int16)255);
      uchar16 res_ = convert_uchar16(res);
      /* Store 15 of the 16 lanes: 8 + 4 + 2 + 1 components. */
      vstore8(res_.s01234567, 0, dst + 3*width*j + 3*5*i);
      vstore4(res_.s89ab,     0, dst + 3*width*j + 3*5*i + 8);
      vstore2(res_.scd,       0, dst + 3*width*j + 3*5*i + 12);
      dst[3*width*j + 3*5*i + 14] = res_.se;
    }

- Vectorized OpenCL implementation
- 5 pixels instead of one (15 components)
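The lane arithmetic of the vectorized kernel can be emulated in plain C to check it against the scalar definition. The sketch below is an assumption-laden illustration, not generated BOAST code: `clamp_u8`, `laplace_at` and `laplace_5px` are hypothetical names, the 16 lanes are computed in a loop rather than with vector types, and the work-item is assumed to lie in the image interior.

```c
#include <assert.h>
#include <stddef.h>

/* Saturates an int to the 0..255 range of an unsigned char. */
static unsigned char clamp_u8(int v) {
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Scalar filter for the component at flat offset p (row stride = 3*width). */
static unsigned char laplace_at(const unsigned char *src, size_t stride,
                                size_t p) {
    int tmp = 9 * src[p]
            - src[p - stride - 3] - src[p - stride] - src[p - stride + 3]
            - src[p - 3]                            - src[p + 3]
            - src[p + stride - 3] - src[p + stride] - src[p + stride + 3];
    return clamp_u8(tmp);
}

/*
 * Emulates work-item (i, j) of the vectorized kernel: 16 lanes are computed
 * starting at byte offset 3*width*j + 3*5*i, but only 15 bytes (5 pixels)
 * are stored. Assumes an interior work-item, so that every load of all
 * 16 lanes stays within rows j-1..j+1.
 */
static void laplace_5px(size_t width, const unsigned char *src,
                        unsigned char *dst, size_t i, size_t j) {
    assert(i >= 1 && j >= 1 && 5 * i + 6 < width);
    size_t stride = 3 * width;
    size_t base = stride * j + 3 * 5 * i;
    unsigned char res[16];
    for (size_t lane = 0; lane < 16; lane++)
        res[lane] = laplace_at(src, stride, base + lane);
    for (size_t lane = 0; lane < 15; lane++) /* the 16th lane is discarded */
        dst[base + lane] = res[lane];
}
```

Loading 16 components and keeping 15 is what lets a 3-channel image use the 16-wide vector type: the extra lane is computed but never written, so neighboring work-items do not overwrite each other's output.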

