Autotuning (2/2): Specialized code generators, Prof. Richard Vuduc

  1. Autotuning (2/2): Specialized code generators
     Prof. Richard Vuduc, Georgia Institute of Technology
     CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.18]
     Thursday, March 6, 2008

  2. Today’s sources
     - CS 267 at UCB (Demmel & Yelick)
     - Papers from various autotuning projects: PHiPAC, ATLAS, FFTW, SPIRAL, TCE
     - See: Proc. IEEE 2005 special issue on Program Generation, Optimization, and Platform Adaptation
     - Me (for once!)

  3. Review: Cache-oblivious algorithms

  4. A recursive algorithm for matrix-multiply
     - Divide all dimensions in half, so each matrix becomes a 2×2 grid of blocks: A = [A11 A12; A21 A22], B = [B11 B12; B21 B22], C = [C11 C12; C21 C22]
     - Bilardi, et al.: Use Gray-code ordering
     - No. of misses, with tall-cache assumption:
       $$Q(n) = \begin{cases} 8 \cdot Q(n/2) & \text{if } n > \sqrt{M/3} \\ 3n^2/L & \text{otherwise} \end{cases} \;\le\; \Theta\!\left(\frac{n^3}{L\sqrt{M}}\right)$$
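
A minimal C sketch of the recursive scheme on slide 4: each call splits A, B, and C into quadrants and makes eight half-size calls, which is where the 8 · Q(n/2) term comes from. The names `matmul_rec` and `BASE`, the row-major layout with leading dimension `ld`, and the small iterative base case are assumptions made for illustration. The quadrant order is the plain one, not the Gray-code ordering attributed to Bilardi, et al.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recursion cutoff (an assumption, not from the slides). */
enum { BASE = 16 };

/* C += A * B for n x n row-major blocks stored with leading dimension ld. */
void matmul_rec(size_t n, size_t ld, const double *A, const double *B, double *C)
{
    assert(ld >= n);
    if (n <= BASE || n % 2 != 0) {
        /* Base case: small or odd-sized block, plain triple loop. */
        for (size_t i = 0; i < n; ++i)
            for (size_t k = 0; k < n; ++k) {
                double a = A[i * ld + k];
                for (size_t j = 0; j < n; ++j)
                    C[i * ld + j] += a * B[k * ld + j];
            }
        return;
    }
    size_t h = n / 2;  /* divide all dimensions in half */
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C + h * ld + h;

    /* Eight half-size multiplies: Cij += Ai1 * B1j + Ai2 * B2j. */
    matmul_rec(h, ld, A11, B11, C11);
    matmul_rec(h, ld, A12, B21, C11);
    matmul_rec(h, ld, A11, B12, C12);
    matmul_rec(h, ld, A12, B22, C12);
    matmul_rec(h, ld, A21, B11, C21);
    matmul_rec(h, ld, A22, B21, C21);
    matmul_rec(h, ld, A21, B12, C22);
    matmul_rec(h, ld, A22, B22, C22);
}
```

Calling `matmul_rec(n, n, A, B, C)` on a zero-initialized `C` computes C = A·B. No cache size or line size appears in the code, which is the sense in which the algorithm is cache-oblivious.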

  5. Performance-engineering challenges
     - [Diagram of implementation choices; labels as recovered]
     - Outer control structure: Iterative, Recursive
     - Inner control structure: Recursive, Iterative, Statement
     - Mini-kernel: ATLAS CGw/S, ATLAS Unleashed
     - Micro-kernel: Scalarized, None, Coloring, Belady; Compiler, BRILA

  6. Cache-oblivious stencil computation
     - Theorem [Frigo & Strumpen (ICS 2005)]: d = dimension ⇒ $Q(n, t; d) = O\!\left(\frac{n^d \cdot t}{M^{1/d}}\right)$
     - [Figure: plot with axes x and t]

  7. Cache-conscious algorithm [Figure] Source: Datta, et al. (2007)

  8. Survey of autotuning

  9. Early idea seedlings
     - Polyalgorithms: John R. Rice
       - (1969) “A polyalgorithm for the automatic solution of nonlinear equations”
       - (1976) “The algorithm selection problem”
     - Profiling and feedback-directed compilation
       - (1971) D. Knuth: “An empirical study of FORTRAN programs”
       - (1982) S. Graham, P. Kessler, M. McKusick: gprof
       - (1991) P. Chang, S. Mahlke, W-m. W. Hwu: “Using profile information to assist classic code optimizations”
     - Code generation from high-level representations
       - (1989) J. Johnson, R.W. Johnson, D. Rodriguez, R. Tolimieri: “A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures.”
       - (1992) M. Covell, C. Myers, A. Oppenheim: “Computer-aided algorithm design and arrangement”

  10. Why doesn’t the compiler do the dirty work?
      - Why doesn’t the compiler do all of this?
      - Analysis
        - Over-specified dependencies
        - Correctness requirements
        - Limited access to relevant run-time information
      - Architecture: Realistic hardware models?
      - Engineering: Hard to modify a production compiler

  11-13. [Figures] Source: Voss, ADAPT compiler project: http://www.eecg.toronto.edu/~voss/AdaptPage/results.html

  14. Automatic performance tuning, or “autotuning”
      - Two-phase methodology for producing automatically tuned code
        - Given: Computational kernel or program; inputs; machine
        - Identify and generate a parameterized space of candidate implementations
        - Select the fastest one using empirical modeling and automated experiments
      - “Autotuner” = System that implements this
        - Usually domain-specific (exception: “autotuning/iterative compilers”)
        - Leverage back-end compiler for performance and portability
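
The two phases can be sketched in a few lines of C. This is an illustrative toy, not any project's actual tuner: the candidate space is a single parameter (a cache block size `bs` for matrix multiply), and selection is one timed run per candidate using `clock()`. The names `matmul_blocked` and `autotune_block_size` are hypothetical, and a realistic tuner would repeat measurements and search far more parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Phase 1: a parameterized family of candidates. Each block size bs gives
   one implementation of C += A * B for n x n row-major matrices. */
void matmul_blocked(size_t n, size_t bs, const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t iend = ii + bs < n ? ii + bs : n;
                size_t kend = kk + bs < n ? kk + bs : n;
                size_t jend = jj + bs < n ? jj + bs : n;
                for (size_t i = ii; i < iend; ++i)
                    for (size_t k = kk; k < kend; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jend; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}

/* Phase 2: run each candidate once on this machine and keep the fastest.
   C is used as scratch output and accumulates across runs. */
size_t autotune_block_size(size_t n, const size_t *cands, size_t ncands,
                           const double *A, const double *B, double *C)
{
    assert(ncands > 0);
    size_t best = cands[0];
    double best_time = -1.0;
    for (size_t c = 0; c < ncands; ++c) {
        clock_t t0 = clock();
        matmul_blocked(n, cands[c], A, B, C);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || t < best_time) {
            best_time = t;
            best = cands[c];
        }
    }
    return best;
}
```

The tuning run is slow relative to one multiply, but its cost is paid once per machine and the chosen block size is reused afterwards.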

  15. How an autotuner differs from a compiler (roughly)

                                 Compiler                        Autotuner
      Input                      General-purpose source code     Specification
      Code generation time       User responsive                 Long, but amortized
      Implementation selection   Static analysis; some run-time  Automated empirical models
                                 profiling/feedback              and experiments

  16. Example: What a search space looks like
      - [Figure: Mflop/s over m0 and n0, with k0 = 1]
      - Platform: Sun Ultra IIi
        - 16 double regs
        - 667 Mflop/s peak
        - Unrolled, pipelined inner-kernel
        - Sun cc v5.0 compiler
      - Source: PHiPAC Project at UC Berkeley (1997)

  18. Dense linear algebra

  19. PHiPAC (1997)
      - Portable High-Performance ANSI C [Bilmes, Asanovic, Chin, Demmel (1997)]
      - Coding guidelines: C as high-level assembly language
      - Code generator for multi-level cache- and register-blocked matrix multiply
      - Exhaustive search over all parameters
      - Began as class project which beat the vendor BLAS
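
To make "register-blocked" concrete, here is a hand-written C sketch of the kind of inner kernel such a generator might emit for a 2×2 register tile with no unrolling in k (m0 = 2, n0 = 2, k0 = 1 in the notation of slide 16). It is an illustration under that assumption, not PHiPAC output, and the name `matmul_reg2x2` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices, n even.
   Each (i, j) step keeps a 2x2 tile of C in four local variables so the
   compiler can hold them in registers across the whole k loop. */
void matmul_reg2x2(size_t n, const double *A, const double *B, double *C)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = C[i * n + j],       c01 = C[i * n + j + 1];
            double c10 = C[(i + 1) * n + j], c11 = C[(i + 1) * n + j + 1];
            for (size_t k = 0; k < n; ++k) {
                /* Two loads from A and two from B feed four multiply-adds. */
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * n + j] = c00;
            C[i * n + j + 1] = c01;
            C[(i + 1) * n + j] = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
}
```

A generator would emit one such kernel per (m0, k0, n0) triple, and the search would then time them all.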

  20. PHiPAC coding guideline example: Removing false dependencies
      - Use local variables to remove false dependencies
      - Before: false read-after-write hazard between a[i] and b[i+1]

            a[i] = b[i] + c;
            a[i+1] = b[i+1] * d;

      - After: load into locals first

            float f1 = b[i];
            float f2 = b[i+1];
            a[i] = f1 + c;
            a[i+1] = f2 * d;

      - In C99, may declare a & b unaliased (“restrict” keyword)
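
Both remedies from slide 20 can be combined in one small C99 function. This is a sketch: the function name `add_scale_pairs` and the loop wrapped around the two statements are assumptions added for illustration. The `restrict` qualifiers promise the compiler that `a` and `b` do not overlap, and the locals `f1` and `f2` make the loads explicit before either store.

```c
#include <assert.h>
#include <stddef.h>

/* For each pair (i, i+1): a[i] = b[i] + c and a[i+1] = b[i+1] * d.
   The caller must pass non-overlapping arrays; that is the restrict contract. */
void add_scale_pairs(size_t n, float *restrict a, const float *restrict b,
                     float c, float d)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        float f1 = b[i];      /* both loads happen before any store */
        float f2 = b[i + 1];
        a[i] = f1 + c;
        a[i + 1] = f2 * d;
    }
}
```

Passing overlapping arrays to a function declared this way is undefined behavior, so `restrict` is a promise made by the programmer rather than something the compiler checks.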

  21. ATLAS (1998)
      - “Automatically Tuned Linear Algebra Software” [R.C. Whaley and J. Dongarra (1998)]
      - Overcame PHiPAC shortcomings on x86 platforms
      - Copy optimization, prefetch, alternative schedulings
      - Extended to full BLAS, some LAPACK support (e.g., LU)
      - Code generator (written in C, output C w/ inline-assembly) with search
      - Copy optimization prunes much of PHiPAC’s search space
      - “Simple” line searches
      - See: iterative floating-point kernel optimizer (iFKO) work
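
A line search, in contrast to exhaustive search, optimizes one parameter at a time while holding the others fixed. The C sketch below shows the idea in generic form and is not ATLAS's actual search code. The names `line_search`, `cost_fn`, and `toy_cost` are hypothetical, and `toy_cost` is a synthetic stand-in for what would really be a timed kernel run.

```c
#include <assert.h>
#include <stddef.h>

/* Cost of one parameter setting; in a real tuner this would time a kernel. */
typedef double (*cost_fn)(const int *params, void *ctx);

/* Synthetic cost (an assumption for illustration): minimum at (3, 5). */
double toy_cost(const int *p, void *ctx)
{
    (void)ctx;
    double d0 = p[0] - 3, d1 = p[1] - 5;
    return d0 * d0 + d1 * d1;
}

/* For each parameter in turn, sweep its range [lo, hi] with the others held
   fixed and keep the best value. Updates params in place; returns best cost. */
double line_search(size_t nparams, const int *lo, const int *hi,
                   int *params, cost_fn cost, void *ctx)
{
    assert(nparams > 0);
    double best = cost(params, ctx);
    for (size_t p = 0; p < nparams; ++p) {
        int best_v = params[p];
        for (int v = lo[p]; v <= hi[p]; ++v) {
            params[p] = v;
            double c = cost(params, ctx);
            if (c < best) {
                best = c;
                best_v = v;
            }
        }
        params[p] = best_v;
    }
    return best;
}
```

With two parameters ranging over 1..8 each, this evaluates on the order of 8 + 8 settings instead of the 8 × 8 an exhaustive search would need, at the risk of missing optima that depend on parameter interactions.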

  22. Search vs. modeling
      - Yotov, et al.: “Is search really necessary to generate high-performance BLAS?”
      - “Think globally, search locally”
        - Small gaps ⇒ local search
        - Large gaps ⇒ refine model
      - “Unleashed” ⇒ hand-optimized plug-in kernels

  23. Signal processing

  24. Motivation for performance tuning [Figure; axis label: pseudo Mflop/s] Source: J. Johnson (2007), CScADS autotuning workshop

  25. FFTW (1997)
      - “Fastest Fourier Transform in the West” [M. Frigo, S. Johnson (1997)]
      - “Codelet” generator (in OCaml)
        - Explicitly represent a small fixed-size transform by its computation DAG
        - Optimize DAG: Algebraic transformations, constant folding, “DAG transposition”
        - Schedule DAG cache-obliviously and output as C source code
      - Planner: At run-time, determine which codelets to apply
      - Executor: Perform FFT of a particular size using plan
      - Efficient “plug-in” assembly kernels
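
To give a feel for what a codelet is, here is a hand-written size-4 DFT in straight-line C, using the sign convention y[k] = Σ x[j] · ω^(−kj). It is a sketch of the style only, not generator output, and the name `dft4_codelet` and the split real/imaginary array layout are assumptions. The twiddle factors for N = 4 are ±1 and ±i, so after constant folding every multiplication disappears.

```c
#include <assert.h>
#include <stddef.h>

/* Size-4 DFT on split real/imaginary arrays, fully unrolled.
   Uses 16 real additions/subtractions and no multiplications. */
void dft4_codelet(const double *xr, const double *xi, double *yr, double *yi)
{
    assert(xr != NULL && xi != NULL && yr != NULL && yi != NULL);

    /* First stage: size-2 butterflies on (x0, x2) and (x1, x3). */
    double t0r = xr[0] + xr[2], t0i = xi[0] + xi[2];
    double t1r = xr[0] - xr[2], t1i = xi[0] - xi[2];
    double t2r = xr[1] + xr[3], t2i = xi[1] + xi[3];
    double t3r = xr[1] - xr[3], t3i = xi[1] - xi[3];

    /* Second stage. Multiplying t3 by -i maps (re, im) to (im, -re),
       so it costs only a swap and a sign change. */
    yr[0] = t0r + t2r;  yi[0] = t0i + t2i;
    yr[2] = t0r - t2r;  yi[2] = t0i - t2i;
    yr[1] = t1r + t3i;  yi[1] = t1i - t3r;   /* t1 - i * t3 */
    yr[3] = t1r - t3i;  yi[3] = t1i + t3r;   /* t1 + i * t3 */
}
```

For the real input (1, 2, 3, 4) this produces (10, −2 + 2i, −2, −2 − 2i).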

  28-29. Cooley-Tukey FFT algorithm
      - Definition, for $x, y \in \mathbb{C}^N$ and $\omega_N \equiv e^{2\pi\sqrt{-1}/N}$:
        $$y[k] \leftarrow \mathrm{DFT}_N(x, k) \equiv \sum_{j=0}^{N-1} x[j] \cdot \omega_N^{-kj}$$
      - With $N \equiv N_1 \cdot N_2$, $0 \le k_1 < N_1$ and $0 \le k_2 < N_2$:
        $$y[k_1 + k_2 \cdot N_1] \leftarrow \sum_{n_2=0}^{N_2-1} \left[ \left( \sum_{n_1=0}^{N_1-1} x[n_1 \cdot N_2 + n_2] \cdot \omega_{N_1}^{-k_1 n_1} \right) \cdot \omega_N^{-k_1 n_2} \right] \cdot \omega_{N_2}^{-k_2 n_2}$$
      - The inner sum is an N1-point DFT, the factor $\omega_N^{-k_1 n_2}$ is the twiddle, and the outer sum is an N2-point DFT
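
The three stages of the decomposition (N1-point DFTs, twiddle multiplication, N2-point DFTs) map directly onto code. The C sketch below performs a single Cooley-Tukey step and uses a direct O(n²) DFT for the sub-transforms, so it illustrates the indexing rather than a fast implementation. The names `cplx`, `dft_direct`, `dft_cooley_tukey`, and `MAX_N`, and the caller-supplied table `roots[e]` = ω_N^(−e), are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } cplx;

static cplx cplx_add(cplx a, cplx b)
{
    cplx r = { a.re + b.re, a.im + b.im };
    return r;
}

static cplx cplx_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

enum { MAX_N = 64 };  /* hypothetical size limit for the scratch buffers */

/* Direct m-point DFT of x[0], x[stride], ..., where m divides n.
   Since omega_m = omega_n^(n/m), omega_m^(-e) = roots[(e * (n/m)) % n]. */
void dft_direct(size_t m, size_t n, const cplx *roots,
                const cplx *x, size_t stride, cplx *y)
{
    for (size_t k = 0; k < m; ++k) {
        cplx acc = { 0.0, 0.0 };
        for (size_t j = 0; j < m; ++j)
            acc = cplx_add(acc, cplx_mul(x[j * stride], roots[(k * j * (n / m)) % n]));
        y[k] = acc;
    }
}

/* One Cooley-Tukey step for n = n1 * n2; roots must hold n entries. */
void dft_cooley_tukey(size_t n1, size_t n2, const cplx *roots,
                      const cplx *x, cplx *y)
{
    size_t n = n1 * n2;
    assert(n >= 1 && n <= MAX_N);
    cplx w[MAX_N], g[MAX_N];

    /* For each n2: N1-point DFT over n1 (input stride N2), then twiddle. */
    for (size_t j2 = 0; j2 < n2; ++j2) {
        dft_direct(n1, n, roots, x + j2, n2, g);
        for (size_t k1 = 0; k1 < n1; ++k1)
            w[k1 * n2 + j2] = cplx_mul(g[k1], roots[(k1 * j2) % n]);
    }
    /* For each k1: N2-point DFT over n2, written to y[k1 + k2 * N1]. */
    for (size_t k1 = 0; k1 < n1; ++k1) {
        dft_direct(n2, n, roots, w + k1 * n2, 1, g);
        for (size_t k2 = 0; k2 < n2; ++k2)
            y[k1 + k2 * n1] = g[k2];
    }
}
```

Replacing the two `dft_direct` calls with recursive calls (or with fixed-size codelets) is what turns this single step into a full FFT.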

  30. Cooley-Tukey FFT algorithm: Encoding in the codelet generator
      - The N = N1 · N2 decomposition (N1-point DFT, twiddle, N2-point DFT) written as functional pseudo-code:

            let dftgen(N, x) ≡ fun k → ...   # DFT_N(x, k)
            let cooley_tukey(N1, N2, x) ≡
              let x̂ ≡ fun n2, n1 → x(n2 + n1 · N2) in
              let G1 ≡ fun n2 → dftgen(N1, x̂(n2, ·)) in
              let W ≡ fun k1, n2 → G1(n2, k1) · ω_N^(-k1·n2) in
              let G2 ≡ fun k1 → dftgen(N2, W(k1, ·)) in
              fun k → G2(k mod N1, k div N1)

  31. Planner phase: Assembles plan using dynamic programming
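
The dynamic-programming idea can be sketched as follows: the best plan for size n is either a direct transform or the cheapest split n = n1 · n2 built from the already-known best plans for n1 and n2. The C code below is a toy under loudly stated assumptions. The cost model (`direct_cost`, n² units for a direct DFT, plus n units for twiddles) is invented for illustration, whereas a real planner would measure run times; the names `plan`, `plan_split`, and `MAX_PLAN_N` are hypothetical too.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_PLAN_N = 1024 };  /* hypothetical upper bound on planned sizes */

static long best_cost[MAX_PLAN_N + 1];    /* cheapest modeled cost per size */
static size_t best_split[MAX_PLAN_N + 1]; /* chosen n1, or 0 for "direct"   */

/* HYPOTHETICAL cost model: a direct size-n DFT costs n * n units. */
static long direct_cost(size_t n)
{
    return (long)(n * n);
}

/* Fill the tables bottom-up for sizes 1..n and return the best cost for n.
   A split n = n1 * n2 costs n2 transforms of size n1, n1 transforms of
   size n2, and n twiddle multiplications. */
long plan(size_t n)
{
    assert(n >= 1 && n <= MAX_PLAN_N);
    for (size_t m = 1; m <= n; ++m) {
        best_cost[m] = direct_cost(m);
        best_split[m] = 0;
        for (size_t n1 = 2; n1 < m; ++n1) {
            if (m % n1 != 0)
                continue;
            size_t n2 = m / n1;
            long c = (long)n2 * best_cost[n1] + (long)n1 * best_cost[n2] + (long)m;
            if (c < best_cost[m]) {
                best_cost[m] = c;
                best_split[m] = n1;
            }
        }
    }
    return best_cost[n];
}

/* Factor n1 chosen for size n by the last call to plan(); 0 means direct. */
size_t plan_split(size_t n)
{
    assert(n >= 1 && n <= MAX_PLAN_N);
    return best_split[n];
}
```

Under this made-up model, size 16 is planned as 4 × 4 with each size-4 piece computed directly. Swapping `direct_cost` for measured timings changes the answers but not the structure of the search.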

  33. [Figure: panels labeled G5 and P4]

  34. SPIRAL (1998)
      - Code generator
      - Represent linear transformations as formulas
      - Symbolic algebra + rewrite engine transforms formulas
      - Search using variety of techniques (more later)

  35-36. [Figures] Source: J. Johnson (2007), CScADS autotuning workshop

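  37. High-level representations and rewrite rules
      - Transforms as matrices:
        $$\mathrm{DFT}_N \equiv \left[ \omega_N^{kl} \right]_{0 \le k,l < N}, \qquad \text{DCT-2}_N \equiv \left[ \cos \frac{(2l+1)k\pi}{2N} \right]_{0 \le k,l < N}, \qquad \ldots$$
      - Rewrite rules:
        $$n = k \cdot m \;\Rightarrow\; \mathrm{DFT}_n \to (\mathrm{DFT}_k \otimes I_m)\, T^n_m\, (I_k \otimes \mathrm{DFT}_m)\, L^n_k$$
        $$n = k \cdot m,\ \gcd(k, m) = 1 \;\Rightarrow\; \mathrm{DFT}_n \to P_n\, (\mathrm{DFT}_k \otimes \mathrm{DFT}_m)\, Q_n$$
        $$p \text{ is prime} \;\Rightarrow\; \mathrm{DFT}_p \to R_p^T\, (I_1 \oplus \mathrm{DFT}_{p-1})\, D_p\, (I_1 \oplus \mathrm{DFT}_{p-1})\, R_p$$
        $$\ldots$$
        $$\mathrm{DFT}_2 \to \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$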

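  38. High-level representations expose parallelism
      $$(I_4 \otimes A) \cdot \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix} = \begin{pmatrix} A & & & \\ & A & & \\ & & A & \\ & & & A \end{pmatrix} \cdot \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix} = \begin{pmatrix} A X_1 \\ A X_2 \\ A X_3 \\ A X_4 \end{pmatrix}$$
      - A applied 4 times independently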

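  39. High-level representations expose parallelism
      $$\left( \begin{pmatrix} a & b \\ c & d \end{pmatrix} \otimes I_2 \right) \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} a \cdot I_2 & b \cdot I_2 \\ c \cdot I_2 & d \cdot I_2 \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} a \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + b \begin{pmatrix} x_3 \\ x_4 \end{pmatrix} \\ c \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + d \begin{pmatrix} x_3 \\ x_4 \end{pmatrix} \end{pmatrix}$$
      - SIMD-vectorizable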
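
The two Kronecker-product forms on slides 38 and 39 correspond to two loop shapes, which is why a formula-level representation tells a generator where the parallelism is. The C sketch below spells both out for a dense k × k matrix A. The function names `kron_I_A` and `kron_A_I` are hypothetical, and the code is a plain reference version without actual threading or SIMD intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* y = (I_m (x) A) x: A (k x k, row-major) is applied to each of the m
   contiguous length-k chunks of x. The iterations of the outer loop are
   independent of one another, so they could run in parallel. */
void kron_I_A(size_t m, size_t k, const double *A, const double *x, double *y)
{
    assert(m > 0 && k > 0);
    for (size_t b = 0; b < m; ++b)
        for (size_t i = 0; i < k; ++i) {
            double s = 0.0;
            for (size_t j = 0; j < k; ++j)
                s += A[i * k + j] * x[b * k + j];
            y[b * k + i] = s;
        }
}

/* y = (A (x) I_m) x: each scalar A[i][j] scales a whole length-m subvector.
   The innermost loop is unit-stride with no cross-iteration dependence,
   which is the pattern a SIMD unit handles well. */
void kron_A_I(size_t k, size_t m, const double *A, const double *x, double *y)
{
    assert(m > 0 && k > 0);
    for (size_t i = 0; i < k; ++i) {
        for (size_t v = 0; v < m; ++v)
            y[i * m + v] = 0.0;
        for (size_t j = 0; j < k; ++j)
            for (size_t v = 0; v < m; ++v)
                y[i * m + v] += A[i * k + j] * x[j * m + v];
    }
}
```

With A = [1 2; 3 4] and x = (1, 2, 3, 4), `kron_I_A(2, 2, ...)` gives (5, 11, 11, 25) and `kron_A_I(2, 2, ...)` gives (7, 10, 15, 22), matching the block forms written out on the two slides.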

  40. Search in SPIRAL
      - Search over ruletrees, i.e., possible formula expansions
      - Empirical search
        - Exhaustive
        - Random
        - Dynamic programming
        - Evolutionary search
        - Hill climbing
        - Machine learning methods

  41. Example: SMP + vectorization results [Figure] Source: F. Franchetti (2007), CScADS autotuning workshop

  42. Administrivia
