
Operator Language: A Program Generation Framework for Fast Kernels - PowerPoint PPT Presentation



  1. Carnegie Mellon Operator Language: A Program Generation Framework for Fast Kernels Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel

  2. Carnegie Mellon The Problem: Example MMM. Matrix-matrix multiplication (MMM) on 2x Core2 Duo, 3 GHz, double precision. [Plot: performance in Gflop/s vs. matrix size from 0 to 9,000; the best code (K. Goto) is about 160x faster than the naive triple loop.] Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, ... What's going on? Hardware is becoming increasingly complex.

  3. Carnegie Mellon Automatic Performance Tuning. Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized. Automatic performance tuning efforts: BLAS (ATLAS, PHiPAC); linear algebra (Sparsity/OSKI, FLAME); sorting; Fourier transform (FFTW); linear transforms and beyond (Spiral); and others. How to build an extensible system? For more problem classes? For yet-uninvented platforms? Proceedings of the IEEE special issue, Feb. 2005.

  4. Carnegie Mellon What is Spiral? [Diagram, two columns: Traditionally, a high-performance library is written and optimized for a given platform; with the Spiral approach, Spiral produces a high-performance library optimized for the given platform, with comparable performance.]

  5. Carnegie Mellon Idea: Common Abstraction and Rewriting. Model: common abstraction = spaces of matching formulas = domain-specific language. [Diagram: the architecture space (architectural parameters: vector length, #processors, ...) and the algorithm space (kernel: problem size, algorithm choice) meet in the common abstraction; rewriting defines the optimization, search picks the algorithm.]

  6. Carnegie Mellon Some Kernels as OL Formulas. [Examples expressed as OL formulas: linear transforms; Viterbi decoding (a convolutional encoder followed by a Viterbi decoder recovering the bit stream); matrix-matrix multiplication; synthetic aperture radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT.]

  7. Carnegie Mellon How Spiral Works. Spiral: complete automation of the implementation and optimization task. [Diagram: problem specification (transform) → Algorithm Generation → Algorithm Optimization → algorithm → Implementation → Code Optimization → C code → Compilation → Compiler Optimizations → fast executable; search controls the algorithm and implementation choices using measured performance.] Basic ideas: declarative representation of algorithms; rewriting systems to generate and optimize algorithms at a high level of abstraction. Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

  8. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  9. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  10. Carnegie Mellon Operators Definition. Operator: multiple complex vectors → multiple complex vectors. Higher-dimensional data is linearized. Operators are potentially nonlinear. Example: matrix-matrix multiplication (MMM), C = A · B.
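Written out, a sketch of the MMM operator in OL-style notation (the exact symbols here are my choice, not necessarily the slide's):

  MMM_{m,k,n} : \mathbb{C}^{mk} \times \mathbb{C}^{kn} \to \mathbb{C}^{mn}, \qquad (A, B) \mapsto A \cdot B

with A of size m x k and B of size k x n stored as linearized vectors; the map is bilinear, and hence nonlinear as a function of the combined input.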

  11. Carnegie Mellon Operator Language

  12. Carnegie Mellon OL Tensor Product: Repetitive Structure Kronecker product (structured matrices) OL Tensor product (structured operators) Definition (extension to non-linear)
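As a reminder of the definitions this slide builds on (a hedged sketch; the nonlinear extension is paraphrased, not quoted): for matrices, the Kronecker product is A \otimes B = [a_{ij} B]_{i,j}, so I_n \otimes A_m is block-diagonal with n copies of A_m, i.e.

  (I_n \otimes A_m)\,(x_0, x_1, \ldots, x_{n-1}) = (A_m x_0, A_m x_1, \ldots, A_m x_{n-1})

for an input split into n contiguous blocks of length m. The OL tensor product keeps exactly this repetitive data-access structure while allowing A to be an arbitrary, possibly nonlinear, operator applied independently to each block.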

  13. Carnegie Mellon Translating OL Formulas Into Programs
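A minimal C sketch of how the simplest tensor formula maps to loop code (my illustration; the kernel_fn type and function name are hypothetical, not Spiral output):

  typedef void (*kernel_fn)(double *y, const double *x, int m);

  /* y = (I_n tensor A_m) x : apply the kernel implementing A_m
     to each of the n contiguous blocks of length m */
  void apply_I_tensor_A(int n, int m, kernel_fn A, double *y, const double *x)
  {
      for (int i = 0; i < n; i++)
          A(y + i * m, x + i * m, m);
  }

The transposed construct A_m ⊗ I_n corresponds to the same loop with stride-n accesses, which is why stride permutations and their placement matter for the generated code's performance.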

  14. Carnegie Mellon Example: Matrix Multiplication (MMM) Breakdown rules: capture various forms of blocking
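As an illustration of what one application of a blocking rule corresponds to after translation to C (a sketch; MB, NB, KB are placeholder block sizes the generator would search over, assumed here to divide m, n, k):

  /* C += A * B with A m-by-k, B k-by-n, C m-by-n, all row-major. */
  void mmm_blocked(int m, int k, int n,
                   const double *A, const double *B, double *C,
                   int MB, int NB, int KB)
  {
      for (int i0 = 0; i0 < m; i0 += MB)
          for (int j0 = 0; j0 < n; j0 += NB)
              for (int p0 = 0; p0 < k; p0 += KB)
                  /* micro-MMM on one block; a rule can be applied
                     again to block this kernel recursively */
                  for (int i = i0; i < i0 + MB; i++)
                      for (int j = j0; j < j0 + NB; j++) {
                          double c = C[i*n + j];
                          for (int p = p0; p < p0 + KB; p++)
                              c += A[i*k + p] * B[p*n + j];
                          C[i*n + j] = c;
                      }
  }

Recursive application of the same rule blocks the inner micro-MMM again, down to a small kernel that can then be vectorized.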

  15. Carnegie Mellon Example: SAR Computation as OL Rules. [Diagram: the SAR computation is broken into stages, each captured by OL rules: grid compute, range interpolation, azimuth interpolation, 2D FFT.]
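Read as operators, the stages compose; a hedged sketch of the structure (the operator names are mine, not the slide's):

  SAR \approx \mathrm{DFT}^{-1}_{2D} \circ \mathrm{Interp}_{azimuth} \circ \mathrm{Interp}_{range}

with the grid computation supplying the interpolation geometry, and each stage expanded further by its own OL breakdown rules.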

  16. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  17. Carnegie Mellon Modeling Multicore: Base Cases. Hardware abstraction: shared cache with cache lines. Tensor product: embarrassingly parallel operator (each of processors 0-3 applies A to its own contiguous block of x and y). Permutation: problematic; may produce false sharing.
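A minimal OpenMP sketch of the embarrassingly parallel base case (my illustration; the typedef and function name are hypothetical, and m is assumed to be a multiple of the cache-line size in elements):

  #include <omp.h>

  typedef void (*kernel_fn)(double *y, const double *x, int m);

  /* y = (I_p tensor A_m) x with one block per processor: thread i reads and
     writes only x[i*m .. i*m+m-1] and y[i*m .. i*m+m-1], so no two threads
     touch the same cache line when m*sizeof(double) is a multiple of the
     cache-line size. */
  void apply_parallel(int p, int m, kernel_fn A, double *y, const double *x)
  {
      #pragma omp parallel for num_threads(p)
      for (int i = 0; i < p; i++)
          A(y + i * m, x + i * m, m);
  }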

  18. Carnegie Mellon Parallelization: OL Rewriting Rules  Tags encode hardware constraints  Rules are algorithm-independent  Rules encode program transformations
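For flavor, two classic rewriting identities of this kind from the linear-transform setting (hedged: the tagged OL forms on the slide may differ in notation):

  A_m \otimes B_n \to (A_m \otimes I_n)(I_m \otimes B_n)
  A_m \otimes I_n \to L^{mn}_m (I_n \otimes A_m) L^{mn}_n

where L^{mn}_m denotes the stride permutation. The second rule converts a strided construct into the embarrassingly parallel base case I_n ⊗ A_m at the price of two permutations, which the hardware tags then constrain (e.g., to cache-line granularity) to avoid false sharing.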

  19. Carnegie Mellon The Joint Rule Set: MMM  Algorithm rules: breakdown rules  Hardware constraints: base cases  Program transformations: manipulation rules Combined rule set spans search space for empirical optimization

  20. Carnegie Mellon Parallelization Through Rewriting: MMM. [Derivation: repeated rewriting turns the MMM formula into a parallel form that is load-balanced and exhibits no false sharing.]

  21. Carnegie Mellon Same Approach for Different Paradigms: threading, vectorization, GPUs, Verilog for FPGAs.

  22. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  23. Carnegie Mellon Matrix Multiplication Library. [Plots: rank-k update, k = 4, single precision (left) and double precision (right), on a dual Intel Xeon 5160, 3 GHz; performance in Gflop/s vs. input size from 2 to 512 for the Spiral-generated library, Intel MKL 10.0, and GotoBLAS 1.26.]

  24. Carnegie Mellon Result: Spiral-Generated PFA SAR on Core2 Quad. [Plot: SAR image formation on Intel platforms (3.0 GHz Core 2 65nm, 3.0 GHz Core 2 45nm, 2.66 GHz Core i7, and a 3.0 GHz Core i7 extrapolated virtually); performance in Gflop/s for 16-megapixel and 100-megapixel images, reaching 43-44 Gflop/s on the newer platforms.] Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell. Each implementation: vectorized, threaded, cache tuned, ~13 MB of code.

  25. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  26. Carnegie Mellon Summary. Platforms are powerful yet complicated; optimization will stay a hard problem. OL: a unified mathematical framework that captures platforms and algorithms. Spiral: program generation and autotuning can provide full automation. Performance of supported kernels is competitive with expert tuning.
