Autotuning (2.5/2): TCE & Empirical compilers


SLIDE 1

Autotuning (2.5/2): TCE & Empirical compilers

  • Prof. Richard Vuduc

Georgia Institute of Technology
CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19]
Tuesday, March 11, 2008

SLIDE 2

Today’s sources

  • CS 267 at UCB (Demmel & Yelick)
  • Papers from various autotuning projects: PHiPAC, ATLAS, FFTW, SPIRAL, TCE
      See: Proc. IEEE 2005 special issue on Program Generation, Optimization, and Platform Adaptation
  • Me (for once!)

SLIDE 3

Review: Autotuners

SLIDE 4

Performance-engineering challenges

[Figure: decision tree of implementation choices. Outer control structure: iterative or recursive. Inner control structure: statement, iterative (mini-kernel), or recursive (micro-kernel). Register allocation/scheduling: none/compiler, scalarized/compiler, Belady/BRILA, coloring/BRILA. Example systems: ATLAS CGw/S, ATLAS Unleashed.]

SLIDE 5

Motivation for performance tuning

[Figure: performance in "pseudo Mflop/s". Source: J. Johnson (2007), CScADS autotuning workshop.]

SLIDE 6

Context for autotuning

Problem: HPC needs detailed low-level machine knowledge

Autotuning methodology:

  • Identify and generate a space of implementations
  • Search (modeling, experiments) to choose the best one

Early idea seedlings:

  • Polyalgorithms
  • Profile- and feedback-directed compilation
  • Domain- and architecture-specific code generators
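To make the generate-and-search methodology concrete, here is a toy sketch. It is not any real autotuner: the kernel, the unroll-factor space, and the timing policy are all invented for illustration. It generates a small space of implementations of a summation (one variant per unroll factor) and empirically times each to pick the best.

```python
import timeit

def make_variant(unroll):
    """Generate one point in the implementation space: summation unrolled by `unroll`."""
    body = " + ".join(f"x[i + {u}]" for u in range(unroll))
    src = ("def kernel(x):\n"
           "    s = 0\n"
           f"    for i in range(0, len(x), {unroll}):\n"
           f"        s += {body}\n"
           "    return s\n")
    ns = {}
    exec(src, ns)  # materialize the generated implementation
    return ns["kernel"]

def autotune(data, unrolls=(1, 2, 4, 8)):
    """Empirical search: time every variant on real data, keep the fastest."""
    best = None
    for u in unrolls:
        kernel = make_variant(u)
        t = min(timeit.repeat(lambda: kernel(data), number=5, repeat=3))
        if best is None or t < best[1]:
            best = (u, t, kernel)
    return best

data = list(range(1 << 12))  # length divisible by every unroll factor
u, t, kernel = autotune(data)
assert kernel(data) == sum(data)
```

Real systems such as PHiPAC search far larger spaces (register tiles, unrolling depths, software pipelining) and time compiled C, but the generate-then-measure loop has the same shape.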

SLIDE 7

Example: What a search space looks like

[Figure: Mflop/s over register-tile sizes (m0, n0, k0); reference point m0 = n0 = k0 = 1. Source: PHiPAC Project at UC Berkeley (1997).]

Platform: Sun Ultra IIi

  • 16 double-precision registers; 667 Mflop/s peak
  • Unrolled, pipelined inner kernel
  • Sun cc v5.0 compiler

SLIDE 8

Cooley-Tukey FFT algorithm: Encoding in FFTW’s codelet generator

[Diagram: N1-point DFTs, twiddle-factor multiplications, then N2-point DFTs.]

y[k] ← DFT_N(x, k) ≡ Σ_{j=0..N−1} x[j] · ω_N^(−kj),    x, y ∈ C^N

y[k1 + k2·N1] ← Σ_{n2=0..N2−1} ( Σ_{n1=0..N1−1} x[n1·N2 + n2] · ω_{N1}^(−k1·n1) ) · ω_N^(−k1·n2) · ω_{N2}^(−k2·n2)

(Functional pseudo-code)

let dftgen(N, x) ≡ fun k → …  # DFT_N(x, k)
let cooley_tukey(N1, N2, x) ≡
  let x̂ ≡ fun (n2, n1) → x(n2 + n1·N2) in
  let G1 ≡ fun n2 → dftgen(N1, x̂(n2, ·)) in
  let W ≡ fun (k1, n2) → G1(n2, k1) · ω_N^(−k1·n2) in
  let G2 ≡ fun k1 → dftgen(N2, W(k1, ·)) in
  fun k → G2(k mod N1, k div N1)
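A direct, runnable transcription of this factorization (a sketch only: it uses a naive O(N²) `dft` in place of FFTW's generated codelets) might look like:

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT: y[k] = sum_j x[j] * w_N^(-kj), w_N = exp(2*pi*i/N)."""
    N = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]

def cooley_tukey(x, N1, N2):
    """One Cooley-Tukey step for N = N1*N2, mirroring the functional pseudo-code."""
    N = N1 * N2
    # x_hat(n2, n1) = x[n1*N2 + n2]; G1(n2) = N1-point DFT of column n2
    G1 = [dft([x[n1 * N2 + n2] for n1 in range(N1)]) for n2 in range(N2)]
    # W(k1, n2) = G1(n2)(k1) * w_N^(-k1*n2)  -- the twiddle factors
    W = [[G1[n2][k1] * cmath.exp(-2j * cmath.pi * k1 * n2 / N)
          for n2 in range(N2)] for k1 in range(N1)]
    # G2(k1) = N2-point DFT of W(k1, .); output reindex y[k] = G2(k mod N1)(k div N1)
    G2 = [dft(W[k1]) for k1 in range(N1)]
    return [G2[k % N1][k // N1] for k in range(N)]
```

Applying the step recursively whenever N factors, and emitting unrolled straight-line code for small fixed sizes, is essentially what the codelet generator automates.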

SLIDE 9

SLIDE 10

SLIDE 11

Tensor Contraction Engine (TCE) for quantum chemistry

SLIDE 12

Tensor Contraction Engine (TCE)

Application domain: Quantum chemistry

  • Electronic structure calculations
  • Dominant computation expressible as a “tensor contraction”

TCE generates a complete parallel program from a high-level spec

  • Automates time-space trade-offs

S. Hirata (2002), and many others

The following presentation is taken from the Proc. IEEE 2005 special issue.

SLIDE 13

Motivation: Simplify program development

Source: Baumgartner, et al. (2005)

SLIDE 14

Rewriting to reduce operation counts

S_{abij} = Σ_{c,d,e,f,k,l} A_{acik} × B_{befl} × C_{dfjk} × D_{cdel}

⇓

S_{abij} = Σ_{c,k} [ Σ_{d,f} ( Σ_{e,l} B_{befl} × D_{cdel} ) × C_{dfjk} ] × A_{acik}

Naïvely, ≈ 4 × N^10 flops. Assuming associativity and distributivity, ≈ 6 × N^6 flops, but this also requires temporary storage.

Source: Baumgartner, et al. (2005)
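The rewrite can be checked numerically. Below is a small pure-Python sketch (tiny mode size N and random dense tensors, invented purely for the check) that evaluates both the naive 10-index sum and the factored three-contraction form and confirms they agree:

```python
import itertools, random

# Tiny mode size so the naive O(N^10) check stays cheap; real N is much larger.
N = 3
rnd = random.Random(0)
def rand4():
    return {idx: rnd.random() for idx in itertools.product(range(N), repeat=4)}
# Index conventions: A[a,c,i,k], B[b,e,f,l], C[d,f,j,k], D[c,d,e,l]
A, B, C, D = rand4(), rand4(), rand4(), rand4()
R = range(N)

# Naive: one 10-index loop nest, ~4*N^10 multiply-adds
S_naive = {}
for a, b, i, j in itertools.product(R, repeat=4):
    S_naive[a, b, i, j] = sum(
        A[a, c, i, k] * B[b, e, f, l] * C[d, f, j, k] * D[c, d, e, l]
        for c, d, e, f, k, l in itertools.product(R, repeat=6))

# Factored: three pairwise contractions, ~2*N^6 multiply-adds each
T1 = {(b, c, d, f): sum(B[b, e, f, l] * D[c, d, e, l] for e in R for l in R)
      for b, c, d, f in itertools.product(R, repeat=4)}
T2 = {(b, c, j, k): sum(T1[b, c, d, f] * C[d, f, j, k] for d in R for f in R)
      for b, c, j, k in itertools.product(R, repeat=4)}
S = {(a, b, i, j): sum(T2[b, c, j, k] * A[a, c, i, k] for c in R for k in R)
     for a, b, i, j in itertools.product(R, repeat=4)}

assert all(abs(S[idx] - S_naive[idx]) < 1e-9 for idx in S_naive)
```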

SLIDE 15

T(1)_{bcdf} = Σ_{e,l} B_{befl} × D_{cdel}
T(2)_{bcjk} = Σ_{d,f} T(1)_{bcdf} × C_{dfjk}
S_{abij} = Σ_{c,k} T(2)_{bcjk} × A_{acik}

T1 = T2 = S = 0
for b, c, d, e, f, l do
  T1[b,c,d,f] += B[b,e,f,l] · D[c,d,e,l]
for b, c, d, f, j, k do
  T2[b,c,j,k] += T1[b,c,d,f] · C[d,f,j,k]
for a, b, c, i, j, k do
  S[a,b,i,j] += T2[b,c,j,k] · A[a,c,i,k]

Operation and storage minimization via loop fusion

SLIDE 16

T(1)_{bcdf} = Σ_{e,l} B_{befl} × D_{cdel}
T(2)_{bcjk} = Σ_{d,f} T(1)_{bcdf} × C_{dfjk}
S_{abij} = Σ_{c,k} T(2)_{bcjk} × A_{acik}

T1 = T2 = S = 0
for b, c, d, e, f, l do
  T1[b,c,d,f] += B[b,e,f,l] · D[c,d,e,l]
for b, c, d, f, j, k do
  T2[b,c,j,k] += T1[b,c,d,f] · C[d,f,j,k]
for a, b, c, i, j, k do
  S[a,b,i,j] += T2[b,c,j,k] · A[a,c,i,k]

Operation and storage minimization via loop fusion:

S = 0
for b, c do
  T1f ← 0; T2f ← 0
  for d, f do
    for e, l do
      T1f += B[b,e,f,l] · D[c,d,e,l]
    for j, k do
      T2f[j,k] += T1f · C[d,f,j,k]
  for a, i, j, k do
    S[a,b,i,j] += T2f[j,k] · A[a,c,i,k]
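Here is a runnable sketch of the fused schedule (tiny N, random tensors, invented for the check). Note that, for correctness, the scalar T1f is re-zeroed for every (d, f) iteration, a detail the slide's pseudo-code compresses into a single reset at the (b, c) level:

```python
import itertools, random

N = 3
rnd = random.Random(0)
def rand4():
    return {idx: rnd.random() for idx in itertools.product(range(N), repeat=4)}
# Index conventions: A[a,c,i,k], B[b,e,f,l], C[d,f,j,k], D[c,d,e,l]
A, B, C, D = rand4(), rand4(), rand4(), rand4()
R = range(N)

# Reference: unfused, with full O(N^4) temporaries T1 and T2
T1 = {(b, c, d, f): sum(B[b, e, f, l] * D[c, d, e, l] for e in R for l in R)
      for b, c, d, f in itertools.product(R, repeat=4)}
T2 = {(b, c, j, k): sum(T1[b, c, d, f] * C[d, f, j, k] for d in R for f in R)
      for b, c, j, k in itertools.product(R, repeat=4)}
S_ref = {(a, b, i, j): sum(T2[b, c, j, k] * A[a, c, i, k] for c in R for k in R)
         for a, b, i, j in itertools.product(R, repeat=4)}

# Fused: T1 collapses to a scalar, T2 to an O(N^2) scratch array per (b, c)
S = {idx: 0.0 for idx in itertools.product(R, repeat=4)}
for b, c in itertools.product(R, repeat=2):
    T2f = [[0.0] * N for _ in R]  # scratch, reused for every (b, c)
    for d, f in itertools.product(R, repeat=2):
        T1f = sum(B[b, e, f, l] * D[c, d, e, l] for e in R for l in R)  # scalar
        for j, k in itertools.product(R, repeat=2):
            T2f[j][k] += T1f * C[d, f, j, k]
    for a, i, j, k in itertools.product(R, repeat=4):
        S[a, b, i, j] += T2f[j][k] * A[a, c, i, k]

assert all(abs(S[idx] - S_ref[idx]) < 1e-9 for idx in S_ref)
```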

SLIDE 17

Time-space trade-offs

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]         (“contraction” of T over i, j)
for c, e, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)                  (integrals, O(1000) flops)
for a, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]   (“contraction” over T(1) and T(2))
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

Max index of a–f: O(1000); of i–k: O(100)

SLIDE 18

Time-space trade-offs

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
for c, e, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)
for a, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

Same indices ⇒ loop fusion candidates. Max index of a–f: O(1000); of i–k: O(100)

SLIDE 19

Time-space trade-offs

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
for c, e, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)
for a, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

⇓ Add extra flops

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
for a, c, e, f, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)
for a, e, c, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

SLIDE 20

Time-space trade-offs

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
for c, e, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)
for a, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

⇓ Fused

for a, e, c, f do
  for i, j do
    x += T[i,j,a,e] · T[i,j,c,f]
  for b, k do
    T(1) ← f1(c, e, b, k)
    T(2) ← f2(a, f, b, k)
    y += T(1) · T(2)
  E += x · y
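The fully fused form can also be checked numerically. In the sketch below, f1 and f2 are cheap stand-in formulas (invented for the check; the real integrals cost O(1000) flops per entry), and the scalars x, y are re-zeroed per (a, e, c, f), which the slide's pseudo-code leaves implicit. Recomputing f1 and f2 inside the fused nest is exactly the extra-flops side of the time-space trade-off:

```python
import itertools, random

N = 3  # tiny; on the slide a-f are O(1000) and i-k are O(100)
rnd = random.Random(0)
T = {idx: rnd.random() for idx in itertools.product(range(N), repeat=4)}  # T[i,j,a,e]
# Stand-in integral formulas (hypothetical); the real f1, f2 are expensive
def f1(c, e, b, k): return 0.1 * (c + 2 * e + 3 * b + 5 * k)
def f2(a, f, b, k): return 0.1 * (a - f + b * k)
R = range(N)

# Unfused reference: materialize full X and Y, O(N^4) storage each
X = {(a, e, c, f): sum(T[i, j, a, e] * T[i, j, c, f] for i in R for j in R)
     for a, e, c, f in itertools.product(R, repeat=4)}
Y = {(c, e, a, f): sum(f1(c, e, b, k) * f2(a, f, b, k) for b in R for k in R)
     for c, e, a, f in itertools.product(R, repeat=4)}
E_ref = sum(X[a, e, c, f] * Y[c, e, a, f]
            for a, e, c, f in itertools.product(R, repeat=4))

# Fully fused: only scalars x, y survive, at the cost of recomputing f1 and f2
E = 0.0
for a, e, c, f in itertools.product(R, repeat=4):
    x = sum(T[i, j, a, e] * T[i, j, c, f] for i in R for j in R)
    y = sum(f1(c, e, b, k) * f2(a, f, b, k) for b in R for k in R)
    E += x * y

assert abs(E - E_ref) < 1e-6
```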

SLIDE 21

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
for c, e, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)
for a, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

⇓ Tiled & partially fused

for aB, eB, cB, fB do            -- tile loops; a, e, c, f below range within the current tile
  for a, e, c, f do
    for i, j do
      X̂[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
  for b, k do
    for c, e do
      T̂(1)[c,e] ← f1(c, e, b, k)
    for a, f do
      T̂(2)[a,f] ← f2(a, f, b, k)
    for c, e, a, f do
      Ŷ[c,e,a,f] += T̂(1)[c,e] · T̂(2)[a,f]
  for c, e, a, f do
    E += X̂[a,e,c,f] · Ŷ[c,e,a,f]

SLIDE 22

  • Transform algebraically, to minimize flops
  • Minimize temporary storage
  • Distribute and partition data for a parallel system
  • Search with respect to the space-time trade-off (feedback)
  • For out-of-core problems, optimize data locality
  • Generate the final program (C/Fortran + MPI/Global Arrays)

SLIDE 23

for a, e, c, f do
  for i, j do
    X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
for c, e, b, k do
  T(1)[c,e,b,k] ← f1(c, e, b, k)
for a, f, b, k do
  T(2)[a,f,b,k] ← f2(a, f, b, k)
for c, e, a, f do
  for b, k do
    Y[c,e,a,f] += T(1)[c,e,b,k] · T(2)[a,f,b,k]
for c, e, a, f do
  E += X[a,e,c,f] · Y[c,e,a,f]

Tensor loop nest ⇒ Expression tree

[Expression tree: root E sums X · Y over c, e, a, f; X sums T · T over i, j; Y sums T(1) · T(2) over b, k, with leaves T(1) ← f1 and T(2) ← f2.]

SLIDE 24

Expression tree ⇒ fusion graph

[Diagram: the expression tree (E; X with +ij over T · T; Y with +bk over T(1) · T(2), where T(1) ← f1 and T(2) ← f2) redrawn as a fusion graph, in which each tensor's indices (a, c, e, f, i, j, b, k) appear as nodes.]

SLIDE 25

Fusion graph

[Fusion graph diagram: index nodes for T, T, X, Y, T(1) ← f1, T(2) ← f2, and E.]

SLIDE 26

Fusion graph

[Fusion graph diagram]

Fuse ⇒ X scalar

SLIDE 27

Fusion graph

[Fusion graph diagram]

Fuse ⇒ Y scalar

SLIDE 28

Fusion graph

[Fusion graph diagram]

SLIDE 29

Fusion graph

[Fusion graph diagram]

SLIDE 30

Fusion graph

[Fusion graph diagram, highlighting indices a, f, e, c]

SLIDE 31

SLIDE 32

Empirical compilers and tools

SLIDE 33

Code generation tools for autotuning

Code generation tools:

  • GNU Superoptimizer -- exhaustive search over schedules of straight-line code
  • Denali -- theorem-proving-based scheduling
  • iFKO (Whaley @ UTSA) -- Iterative Floating-point Kernel Optimizer
  • POET (Yi @ UTSA) -- Parameterized Optimizations for Empirical Tuning

SLIDE 34

Iterative/empirical compilers

Compile-time:

  • “Iterative compilation” -- Kisuki, Knijnenburg, O’Boyle, et al.
  • Hybrid model/search-based compiler -- Hall, et al. (USC)
  • Eigenmann @ Purdue (Polaris)
  • Quinlan, et al. (LLNL / PERI)
  • Qasem (TSU), Kennedy, Mellor-Crummey (Rice) -- whole-program tuning
  • Compilers that learn -- Cavazos (UDel); Stephenson / Amarasinghe (MIT)

Run-time:

  • Voss, et al.: ADAPT

SLIDE 35

Administrivia

SLIDE 36

Upcoming schedule changes

  • Some adjustment of topics (TBD)
  • Today -- Project proposals due
  • Th 3/13 -- SIAM Parallel Processing (attendance encouraged)
  • Tu 4/1 -- No class
  • Th 4/3 -- Attend talk by Doug Post from the DoD HPC Modernization Program

SLIDE 37

Homework 1: Parallel conjugate gradients

Put your name on the write-up! Grading: 100 pts max

  • Correct implementation -- 50 pts
  • Evaluation -- 45 pts
      Tested on two sample matrices -- 5
      Implemented and tested on stencil -- 10
      “Explained” performance (e.g., per processor, load balance, computation vs. communication) -- 15
      Performance model -- 15
  • Write-up “quality” -- 5 pts

SLIDE 38

Spy plot of matrix ‘msdoor--UF1644’

SLIDE 39

Non-zeros per row, ‘msdoor--UF1644’ [plot: nnz vs. row]

SLIDE 40

Non-zeros per row, ‘msdoor--UF1644’ [plot: “active” elements vs. row]

SLIDE 41

Spy plot of matrix ‘audikw_1--UF1252’

SLIDE 42

Non-zeros per row, ‘audikw_1--UF1252’ [plot: nnz vs. row]

SLIDE 43

Active elements for ‘audikw_1--UF1252’ [plot: “active” elements vs. row]

SLIDE 44

Homework 2: Parallel n-body using the particle-mesh method

Acceleration of particle i, due to forces from all other particles:

r̈_i = Σ_{j≠i} F_ij = − Σ_{j≠i} G · m_j · (r_i − r_j) / ||r_i − r_j||³

Not yet decided what exactly I will ask you to do (implementation? pencil-and-paper? thoughts?)
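For reference, the all-pairs sum above can be evaluated directly in O(n²) per step; the particle-mesh method approximates exactly this computation. A minimal sketch (function and variable names are my own):

```python
import math

G = 6.674e-11  # gravitational constant (SI units)

def accelerations(masses, positions):
    """Direct O(n^2) evaluation: a_i = -sum_{j != i} G*m_j*(r_i - r_j)/||r_i - r_j||^3."""
    out = []
    for i, (xi, yi, zi) in enumerate(positions):
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if j == i:
                continue
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            # Combine G*m_j and the inverse-cube distance into one scale factor
            s = G * masses[j] / math.sqrt(dx * dx + dy * dy + dz * dz) ** 3
            ax -= s * dx
            ay -= s * dy
            az -= s * dz
        out.append((ax, ay, az))
    return out
```

A sanity check: two unit masses a unit distance apart attract each other with acceleration of magnitude G, in opposite directions.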

SLIDE 45

Projects

Your goal should be to do something useful, interesting, and/or publishable!

  • Something you’re already working on, suitably adapted for this course
  • Faculty-sponsored/mentored projects
  • Collaborations encouraged

SLIDE 46

“In conclusion…”

SLIDE 47

Backup slides

SLIDE 48

My criteria for “approving” your project

“Relevant to this course:” Many themes, so think (and “do”) broadly

  • Parallelism and architectures
  • Numerical algorithms
  • Programming models
  • Performance modeling/analysis

SLIDE 49

General styles of projects

Theoretical: Prove something hard (high risk)

Experimental:

  • Parallelize something
  • Take an existing parallel program, and improve it using models & experiments
  • Evaluate an algorithm, architecture, or programming model

SLIDE 50

Examples

  • Anything of interest to a faculty member/project outside CoC
  • Parallel sparse triple product (R·A·Rᵀ, used in multigrid)
  • Future FFT
  • Out-of-core or I/O-intensive data analysis and algorithms
  • Block iterative solvers (convergence & performance trade-offs)
  • Sparse LU
  • Data structures and algorithms (trees, graphs)
  • Look at mixed precision
  • Discrete-event approaches to continuous-system simulation
  • Automated performance analysis, modeling, and tuning
  • “Unconventional,” but related:
      Distributed deadlock detection for MPI
      UPC language extensions (dynamic block sizes)
      Exact linear algebra
