
Operator Language: A Program Generation Framework for Fast Kernels - PowerPoint PPT Presentation



  1. Carnegie Mellon Operator Language: A Program Generation Framework for Fast Kernels Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel

  2. Carnegie Mellon The Problem: Example MMM. Matrix-matrix multiplication (MMM) on 2x Core2 Duo, 3 GHz, double precision. [Plot: performance in Gflop/s vs. matrix size from 0 to 9,000; the best code (K. Goto) is about 160x faster than the naive triple loop.] Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, ... What's going on? Hardware is becoming increasingly complex.

  3. Carnegie Mellon Automatic Performance Tuning. Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized. Automatic performance tuning efforts: BLAS (ATLAS, PHiPAC); linear algebra (Sparsity/OSKI, FLAME); sorting; Fourier transform (FFTW); linear transforms and beyond (Spiral); and others. How to build an extensible system? For more problem classes? For yet-uninvented platforms? Proceedings of the IEEE special issue, Feb. 2005.

  4. Carnegie Mellon What is Spiral? [Diagram, two columns: Traditionally, a high-performance library is written and optimized for a given platform; with the Spiral approach, Spiral produces a high-performance library optimized for the given platform, with comparable performance.]

  5. Carnegie Mellon Idea: Common Abstraction and Rewriting. Model: common abstraction = spaces of matching formulas = domain-specific language. [Diagram: the architecture space (architectural parameters: vector length, #processors, ...) and the algorithm space (kernel: problem size, algorithm choice) meet in the common abstraction; rewriting defines the optimization, search picks the algorithm.]

  6. Carnegie Mellon Some Kernels as OL Formulas. [Examples expressed as OL formulas: linear transforms; Viterbi decoding (a convolutional encoder followed by a Viterbi decoder recovering the bit stream); matrix-matrix multiplication; synthetic aperture radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT.]

  7. Carnegie Mellon How Spiral Works. Spiral: complete automation of the implementation and optimization task. [Diagram: problem specification (transform) → Algorithm Generation → Algorithm Optimization → algorithm → Implementation → Code Optimization → C code → Compilation → Compiler Optimizations → fast executable; search controls the algorithm and implementation choices using measured performance.] Basic ideas: declarative representation of algorithms; rewriting systems to generate and optimize algorithms at a high level of abstraction. Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

  8. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  9. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  10. Carnegie Mellon Operators Definition. Operator: multiple complex vectors → multiple complex vectors. Higher-dimensional data is linearized. Operators are potentially nonlinear. Example: matrix-matrix multiplication (MMM), C = A · B.
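Written out, a sketch of the MMM operator in OL-style notation (the exact symbols here are my choice, not necessarily the slide's):

  MMM_{m,k,n} : \mathbb{C}^{mk} \times \mathbb{C}^{kn} \to \mathbb{C}^{mn}, \qquad (A, B) \mapsto A \cdot B

with A of size m x k and B of size k x n stored as linearized vectors; the map is bilinear, and hence nonlinear as a function of the combined input.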

  11. Carnegie Mellon Operator Language

  12. Carnegie Mellon OL Tensor Product: Repetitive Structure Kronecker product (structured matrices) OL Tensor product (structured operators) Definition (extension to non-linear)
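As a reminder of the definitions this slide builds on (a hedged sketch; the nonlinear extension is paraphrased, not quoted): for matrices, the Kronecker product is A \otimes B = [a_{ij} B]_{i,j}, so I_n \otimes A_m is block-diagonal with n copies of A_m, i.e.

  (I_n \otimes A_m)\,(x_0, x_1, \ldots, x_{n-1}) = (A_m x_0, A_m x_1, \ldots, A_m x_{n-1})

for an input split into n contiguous blocks of length m. The OL tensor product keeps exactly this repetitive data-access structure while allowing A to be an arbitrary, possibly nonlinear, operator applied independently to each block.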

  13. Carnegie Mellon Translating OL Formulas Into Programs
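A minimal C sketch of how the simplest tensor formula maps to loop code (my illustration; the kernel_fn type and function name are hypothetical, not Spiral output):

  typedef void (*kernel_fn)(double *y, const double *x, int m);

  /* y = (I_n tensor A_m) x : apply the kernel implementing A_m
     to each of the n contiguous blocks of length m */
  void apply_I_tensor_A(int n, int m, kernel_fn A, double *y, const double *x)
  {
      for (int i = 0; i < n; i++)
          A(y + i * m, x + i * m, m);
  }

The transposed construct A_m ⊗ I_n corresponds to the same loop with stride-n accesses, which is why stride permutations and their placement matter for the generated code's performance.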

  14. Carnegie Mellon Example: Matrix Multiplication (MMM) Breakdown rules: capture various forms of blocking
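As an illustration of what one application of a blocking rule corresponds to after translation to C (a sketch; MB, NB, KB are placeholder block sizes the generator would search over, assumed here to divide m, n, k):

  /* C += A * B with A m-by-k, B k-by-n, C m-by-n, all row-major. */
  void mmm_blocked(int m, int k, int n,
                   const double *A, const double *B, double *C,
                   int MB, int NB, int KB)
  {
      for (int i0 = 0; i0 < m; i0 += MB)
          for (int j0 = 0; j0 < n; j0 += NB)
              for (int p0 = 0; p0 < k; p0 += KB)
                  /* micro-MMM on one block; a rule can be applied
                     again to block this kernel recursively */
                  for (int i = i0; i < i0 + MB; i++)
                      for (int j = j0; j < j0 + NB; j++) {
                          double c = C[i*n + j];
                          for (int p = p0; p < p0 + KB; p++)
                              c += A[i*k + p] * B[p*n + j];
                          C[i*n + j] = c;
                      }
  }

Recursive application of the same rule blocks the inner micro-MMM again, down to a small kernel that can then be vectorized.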

  15. Carnegie Mellon Example: SAR Computation as OL Rules. [Diagram: the SAR computation is broken into stages, each captured by OL rules: grid compute, range interpolation, azimuth interpolation, 2D FFT.]
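Read as operators, the stages compose; a hedged sketch of the structure (the operator names are mine, not the slide's):

  SAR \approx \mathrm{DFT}^{-1}_{2D} \circ \mathrm{Interp}_{azimuth} \circ \mathrm{Interp}_{range}

with the grid computation supplying the interpolation geometry, and each stage expanded further by its own OL breakdown rules.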

  16. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  17. Carnegie Mellon Modeling Multicore: Base Cases. Hardware abstraction: shared cache with cache lines. Tensor product: embarrassingly parallel operator (each of processors 0-3 applies A to its own contiguous block of x and y). Permutation: problematic; may produce false sharing.
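A minimal OpenMP sketch of the embarrassingly parallel base case (my illustration; the typedef and function name are hypothetical, and m is assumed to be a multiple of the cache-line size in elements):

  #include <omp.h>

  typedef void (*kernel_fn)(double *y, const double *x, int m);

  /* y = (I_p tensor A_m) x with one block per processor: thread i reads and
     writes only x[i*m .. i*m+m-1] and y[i*m .. i*m+m-1], so no two threads
     touch the same cache line when m*sizeof(double) is a multiple of the
     cache-line size. */
  void apply_parallel(int p, int m, kernel_fn A, double *y, const double *x)
  {
      #pragma omp parallel for num_threads(p)
      for (int i = 0; i < p; i++)
          A(y + i * m, x + i * m, m);
  }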

  18. Carnegie Mellon Parallelization: OL Rewriting Rules  Tags encode hardware constraints  Rules are algorithm-independent  Rules encode program transformations
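For flavor, two classic rewriting identities of this kind from the linear-transform setting (hedged: the tagged OL forms on the slide may differ in notation):

  A_m \otimes B_n \to (A_m \otimes I_n)(I_m \otimes B_n)
  A_m \otimes I_n \to L^{mn}_m (I_n \otimes A_m) L^{mn}_n

where L^{mn}_m denotes the stride permutation. The second rule converts a strided construct into the embarrassingly parallel base case I_n ⊗ A_m at the price of two permutations, which the hardware tags then constrain (e.g., to cache-line granularity) to avoid false sharing.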

  19. Carnegie Mellon The Joint Rule Set: MMM  Algorithm rules: breakdown rules  Hardware constraints: base cases  Program transformations: manipulation rules Combined rule set spans search space for empirical optimization

  20. Carnegie Mellon Parallelization Through Rewriting: MMM. [Derivation: repeated rewriting turns the MMM formula into a parallel form that is load-balanced and exhibits no false sharing.]

  21. Carnegie Mellon Same Approach for Different Paradigms: threading, vectorization, GPUs, Verilog for FPGAs.

  22. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  23. Carnegie Mellon Matrix Multiplication Library. [Plots: rank-k update, k = 4, single precision (left) and double precision (right), on a dual Intel Xeon 5160, 3 GHz; performance in Gflop/s vs. input size from 2 to 512 for the Spiral-generated library, Intel MKL 10.0, and GotoBLAS 1.26.]

  24. Carnegie Mellon Result: Spiral-Generated PFA SAR on Core2 Quad. [Plot: SAR image formation on Intel platforms (3.0 GHz Core 2 65nm, 3.0 GHz Core 2 45nm, 2.66 GHz Core i7, and a 3.0 GHz Core i7 extrapolated virtually); performance in Gflop/s for 16-megapixel and 100-megapixel images, reaching 43-44 Gflop/s on the newer platforms.] Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell. Each implementation: vectorized, threaded, cache tuned, ~13 MB of code.

  25. Carnegie Mellon Organization  Operator language and algorithms  Optimizing algorithms for platforms  Performance results  Summary

  26. Carnegie Mellon Summary. Platforms are powerful yet complicated; optimization will stay a hard problem. OL: a unified mathematical framework that captures platforms and algorithms. Spiral: program generation and autotuning can provide full automation. Performance of supported kernels is competitive with expert tuning.
