Carnegie Mellon
Automatic Performance Tuning and Machine Learning
Markus Püschel, Computer Science, ETH Zürich
with: Frédéric de Mesmay, PhD, Electrical and Computer Engineering, Carnegie Mellon
Markus Püschel, ETH Zurich, 2011
PhD and Postdoc openings:
High performance computing
Compilers
Theory
Programming languages/Generative programming
Same (mathematical) operation count (2n^3); the compiler underperforms by 160x
[Figure: Matrix-Matrix Multiplication (MMM) on a quadcore Intel platform. Performance [Gflop/s] vs. matrix size: the best implementation is 160x faster than the triple loop.]
[Figure: WiFi receiver (physical layer) on one Intel core. Throughput [Mbit/s] vs. data rate [Mbit/s]: optimized implementations are 30x and 35x faster.]
Autotuning examples
An example use of machine learning
Timeline: time of implementation → time of installation (platform known) → time of use (problem parameters known)
Whaley, Bilmes, Demmel, Dongarra, …
Blocking improves locality [figure: matrices a, b, c partitioned into B x B blocks]
c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b, accumulating into c */
void mmm(double *a, double *b, double *c, int n) {
  int i, j, k, i1, j1, k1;
  for (i = 0; i < n; i += B)
    for (j = 0; j < n; j += B)
      for (k = 0; k < n; k += B)
        /* B x B mini matrix multiplications */
        for (i1 = i; i1 < i + B; i1++)
          for (j1 = j; j1 < j + B; j1++)
            for (k1 = k; k1 < k + B; k1++)
              c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
[Diagram: ATLAS search. Detect Hardware Parameters (NR, MulAdd, L*, L1Size) → ATLAS Search Engine → (xFetch, MulAdd, Latency, NB, MU, NU, KU) → ATLAS MMM Code Generator → MiniMMM Source → Compile, Execute, Measure → Mflop/s, fed back to the search engine]
source: Pingali, Yotov, Cornell U.
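The outer loop of such an empirical search can be sketched in plain C: time a candidate kernel for each parameter value and keep the fastest. This is a toy version (a single parameter, the block size; wall-clock via clock()), not the actual ATLAS engine; the size N and the candidate list are made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define N 192  /* hypothetical problem size for the timing experiment */

/* Blocked MMM with run-time block size nb (simplified kernel to be timed). */
static void mmm_blocked(const double *a, const double *b, double *c,
                        int n, int nb) {
    for (int i = 0; i < n; i += nb)
        for (int j = 0; j < n; j += nb)
            for (int k = 0; k < n; k += nb)
                for (int i1 = i; i1 < i + nb; i1++)
                    for (int j1 = j; j1 < j + nb; j1++)
                        for (int k1 = k; k1 < k + nb; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

/* ATLAS-style search: empirically time candidate block sizes, keep the best.
   The numeric results in c are irrelevant; only the timing matters. */
static int search_block_size(void) {
    int candidates[] = {8, 16, 32, 48, 64, 96};
    int best_nb = 0;
    double best_time = 1e30;
    double *a = calloc(N*N, sizeof *a), *b = calloc(N*N, sizeof *b),
           *c = calloc(N*N, sizeof *c);
    for (size_t t = 0; t < sizeof candidates / sizeof *candidates; t++) {
        int nb = candidates[t];
        if (N % nb) continue;            /* only divisors of N in this sketch */
        clock_t start = clock();
        mmm_blocked(a, b, c, N, nb);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed < best_time) { best_time = elapsed; best_nb = nb; }
    }
    free(a); free(b); free(c);
    return best_nb;
}
```

The real search additionally tunes unrolling (MU, NU, KU), latency scheduling, and fetch strategy, and prunes the space using the detected hardware parameters.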
Timeline: time of implementation → time of installation (platform known: ATLAS MMM generator runs its search) → time of use (problem parameters known)
FFTW (Frigo, Johnson)
Installation: configure/make
Use: d = dft(n) creates a plan (precomputes twiddles, searches for the fastest computation strategy); d(x,y) executes it
[Figure: recursion tree for n = 1024: radix 16 splits 1024 into 16 and 64; radix 8 splits 64 into 8 and 8; the leaves (16, 8, 8) are base cases]
FFTW codelet generator (Frigo)
At implementation time: for each small n, a fixed-size DFT function dft_n(*x, *y, …) is generated as straightline code
Timeline: time of implementation (FFTW codelet generator) → time of installation (platform known: ATLAS MMM generator) → time of use (problem parameters known: FFTW adaptive library)
Sparse MVM: OSKI (Vuduc, Im, Yelick, Demmel)
Blocking for registers: store the sparse matrix in small dense blocks, filling in explicit zeros
Gain by blocking (dense MVM): 1.4
Overhead by blocking (fill ratio): 16/9 ≈ 1.77
Estimated speedup: 1.4/1.77 ≈ 0.79 → no gain, don't block
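This estimate can be written out in a few lines; the function names and the block-count interface below are illustrative, not OSKI's API. The dense-MVM gain is assumed to be measured offline at installation time:

```c
#include <assert.h>

/* Fill ratio of a blocking: entries stored / true nonzeros.
   nnz_true nonzeros end up stored in nblocks dense blocks of r x c each. */
double fill_ratio(int nblocks, int r, int c, int nnz_true) {
    return (double)(nblocks * r * c) / nnz_true;
}

/* OSKI-style heuristic (simplified): block only if the speedup of the
   r x c register-blocked kernel on a dense matrix (gain_dense, measured
   at installation time) outweighs the fill overhead. */
double estimated_speedup(double gain_dense, double fill) {
    return gain_dense / fill;
}
```

For the slide's numbers: 9 nonzeros stored in one 4x4 block give fill 16/9 ≈ 1.77; with a dense gain of 1.4 the estimated speedup is ≈ 0.79 < 1, so blocking is rejected.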
Timeline: time of implementation (FFTW codelet generator) → time of installation (platform known: ATLAS MMM generator, OSKI sparse MVM) → time of use (problem parameters known: FFTW adaptive library, OSKI sparse MVM)
Algorithm knowledge + platform description
→ Optimized implementation, regenerated for every new platform
Transform (user specified) → fast algorithm in SPL (many choices; algorithm rules + search) → Σ-SPL → optimized implementation
Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
Timeline: time of implementation (FFTW codelet generator; Spiral: transforms, fixed input size) → time of installation (platform known: ATLAS MMM generator, OSKI sparse MVM; Spiral: transforms, general input size) → time of use (problem parameters known: FFTW adaptive library, OSKI sparse MVM; Spiral: transforms, general input size)
Machine learning at installation time and at time of use
Autotuning examples
An example use of machine learning
Original FFTW flow: configure/make at installation; at use, d = dft(n) precomputes twiddles and searches for the fastest computation strategy; d(x,y) executes.
With learning: at installation, configure/make and, for a few n, search and learn decision trees; at use, d = dft(n) precomputes twiddles and consults the decision trees instead of searching; d(x,y) executes.
Online tunable library + some platform information (Voronenko 2008)
Autotuning examples
An example use of machine learning
Discrete Fourier transform (DFT); Cooley/Tukey fast Fourier transform (FFT); dataflow (right to left) for 16 = 4 x 4
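In Spiral-style operator notation (standard definitions, restated here; T denotes the twiddle diagonal and L the stride permutation), the DFT and the Cooley/Tukey factorization for n = km read:

```latex
\mathrm{DFT}_n = \left[\omega_n^{k\ell}\right]_{0 \le k,\ell < n},
\qquad \omega_n = e^{-2\pi i/n}

% Cooley/Tukey FFT for n = km:
\mathrm{DFT}_{km} = \left(\mathrm{DFT}_k \otimes I_m\right)\, T^{km}_m\,
\left(I_k \otimes \mathrm{DFT}_m\right)\, L^{km}_k
```

For 16 = 4 x 4 this yields the four-stage right-to-left dataflow shown on the slide.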
/* Spiral-generated library structure (sketch; t is a temporary buffer) */
void dft(int n, cpx *y, cpx *x) {
  if (use_dft_base_case(n))
    dft_bc(n, y, x);
  else {
    int k = choose_dft_radix(n);  /* choice: radix */
    int m = n / k;
    for (int i = 0; i < k; ++i)
      dft_strided(m, k, t + m*i, x + m*i);
    for (int i = 0; i < m; ++i)
      dft_scaled(k, m, precomp_d[i], y + i, t + i);
  }
}

void dft_strided(int n, int istr, cpx *y, cpx *x) { ... }
void dft_scaled(int n, int str, cpx *d, cpx *y, cpx *x) { ... }
Choices used for adaptation
20 mutually recursive functions
10 different choices (occurring recursively)
Choices are heterogeneous (radix, threading, buffering, …)
Library variants: standard, scalar, vectorized, threading, buffering
Example: decide play / don't play from weather features (windy, outlook, humidity)
Features (events)
C4.5
P(play | windy=false) = 6/8, P(don't play | windy=false) = 2/8
P(play | windy=true) = 1/2, P(don't play | windy=true) = 1/2
H(windy=false) = 0.81, H(windy=true) = 1.0
H(windy) = 0.89, H(outlook) = 0.69, H(humidity) = …
Entropy of Features
Features = arguments of the functions (except variable pointers)
At installation time: generate training examples (several for each size)
dft(int n, cpx *y, cpx *x) dft_strided(int n, int istr, cpx *y, cpx *x) dft_scaled(int n, int str, cpx *d, cpx *y, cpx *x)
Correctness of generated decision trees
Prime factor structure
n = 2^i 3^j: 2, 3, 4, 6, 9, 12, 16, 18, 24, 27, 32, …
Compute i and j and add them to the features
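Computing these two extra features is straightforward; the helper below is illustrative, not Spiral code:

```c
#include <assert.h>

/* Extract the exponents i and j of n = 2^i * 3^j * rest; these become
   additional features for the decision-tree learner. */
static void prime_factor_features(int n, int *i2, int *j3) {
    *i2 = 0;
    *j3 = 0;
    while (n % 2 == 0) { n /= 2; (*i2)++; }
    while (n % 3 == 0) { n /= 3; (*j3)++; }
}
```

Without these features the learner would have to infer divisibility structure from n alone, which a threshold-based decision tree cannot express.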
Setup: 3 GHz Intel Xeon 5160 (2 Core 2 Duos = 4 cores), 64-bit Linux, icc 10.1
Libraries:
All sizes n ≤ 2^18 with prime factors ≤ 19
All sizes n ≤ 2^18 with prime factors ≤ 19; higher-order fit of all sizes
Machine learning in Spiral vs. machine learning in compilation
[Diagram: in compilation, code features feed learning to produce optimization heuristics / a model; in Spiral, features and a feature-space description feed learning]
Frédéric de Mesmay, Yevgen Voronenko and Markus Püschel Offline Library Adaptation Using Automatically Generated Heuristics
Frédéric de Mesmay, Arpad Rimmel, Yevgen Voronenko and Markus Püschel Bandit-Based Optimization on Graphs with Application to Library Performance Tuning
2009
Machine learning should be used in autotuning
Timeline: time of implementation → time of installation (platform known) → time of use (problem parameters known)
Machine learning at installation time and at time of use