  1. Autotuning (2.5/2): TCE & Empirical compilers. Prof. Richard Vuduc, Georgia Institute of Technology. CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19]. Tuesday, March 11, 2008.

  2. Today's sources: CS 267 at UCB (Demmel & Yelick); papers from various autotuning projects (PHiPAC, ATLAS, FFTW, SPIRAL, TCE) -- see the Proc. IEEE 2005 special issue on Program Generation, Optimization, and Platform Adaptation; me (for once!).

  3. Review: Autotuners

  4. Performance-engineering challenges. [Figure: design-space diagram whose axes include outer control structure (iterative, recursive), inner control structure (recursive, iterative, statement), mini-kernel and micro-kernel choices, and register-allocation strategies (scalarized / none / coloring / Belady / compiler), comparing ATLAS CGw/S, ATLAS Unleashed, and the BRILA compiler.]

  5. Motivation for performance tuning. [Chart: performance in pseudo Mflop/s.] Source: J. Johnson (2007), CScADS autotuning workshop.

  6. Context for autotuning. Problem: HPC needs detailed low-level machine knowledge. Autotuning methodology: identify and generate a space of implementations, then search (using models and/or experiments) to choose the best one. Early idea seedlings: polyalgorithms; profile- and feedback-directed compilation; domain- and architecture-specific code generators. (A toy autotuner sketch follows below.)
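
To make the "generate a space, then search" loop concrete, here is a minimal, hypothetical autotuner sketch in Python (not taken from any of the projects above): it enumerates a small space of dot-product variants parameterized by an unrolling depth, times each one empirically, and keeps the fastest. The variant generator, parameter space, and timing harness are all illustrative assumptions.

    import timeit

    def make_dot(unroll):
        # Build one dot-product variant; `unroll` is the (hypothetical) tuning parameter.
        def dot(x, y):
            n = len(x)
            s = 0.0
            i = 0
            while i + unroll <= n:            # process `unroll` elements per trip
                for u in range(unroll):
                    s += x[i + u] * y[i + u]
                i += unroll
            while i < n:                      # cleanup loop for leftover elements
                s += x[i] * y[i]
                i += 1
            return s
        return dot

    def autotune(candidates, x, y, repeats=5):
        # Empirical search: time every candidate and return the fastest.
        best, best_time = None, float("inf")
        for unroll in candidates:
            dot = make_dot(unroll)
            t = min(timeit.repeat(lambda: dot(x, y), number=20, repeat=repeats))
            if t < best_time:
                best, best_time = unroll, t
        return best, best_time

    x = [1.0] * 4096
    y = [2.0] * 4096
    print(autotune([1, 2, 4, 8], x, y))       # prints the best variant and its time

Real autotuners apply the same pattern to far larger spaces (blocking sizes, schedules, instruction mixes) and often prune the search with performance models.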

  7. Example: What a search space looks like. [Plot: measured Mflop/s over register-block sizes m0 × n0, with k0 = 1.] Platform: Sun Ultra IIi, 16 double-precision registers, 667 Mflop/s peak; unrolled, pipelined inner kernel; Sun cc v5.0 compiler. Source: PHiPAC project at UC Berkeley (1997).

  8. Cooley-Tukey FFT algorithm: Encoding in FFTW's codelet generator. The DFT of x, y \in \mathbb{C}^N is

    y[k] \leftarrow \mathrm{DFT}_N(x, k) \equiv \sum_{j=0}^{N-1} x[j] \cdot \omega_N^{-kj}

and for N = N_1 N_2 the Cooley-Tukey factorization is

    y[k_1 + k_2 N_1] \leftarrow \sum_{n_2=0}^{N_2-1} \left( \left( \sum_{n_1=0}^{N_1-1} x[n_1 N_2 + n_2] \cdot \omega_{N_1}^{-k_1 n_1} \right) \cdot \omega_N^{-k_1 n_2} \right) \cdot \omega_{N_2}^{-k_2 n_2}

(inner sum: N_1-point DFT; middle factor: twiddle; outer sum: N_2-point DFT). Functional pseudo-code:

    let dftgen(N, x) ≡ fun k → ...                       # DFT_N(x, k)
    let cooley_tukey(N1, N2, x) ≡
      let x̂  ≡ fun (n2, n1) → x(n2 + n1·N2) in
      let G1 ≡ fun n2 → dftgen(N1, x̂(n2, ·)) in
      let W  ≡ fun (k1, n2) → G1(n2, k1) · ω_N^{-k1·n2} in
      let G2 ≡ fun k1 → dftgen(N2, W(k1, ·)) in
      fun k → G2(k mod N1, k div N1)

A runnable sketch of this decomposition appears below.
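
As a rough illustration of this factorization (a sketch only; FFTW's actual generator, genfft, manipulates symbolic expressions in OCaml and then simplifies them), the Python below evaluates the DFT by the same decomposition: an inner N1-point DFT, a twiddle multiplication, and an outer N2-point DFT. Names mirror the pseudo-code above.

    import cmath

    def dftgen(N, x):
        # Return the function k -> DFT_N(x, k) for an accessor x(j).
        def dft(k):
            w = cmath.exp(-2j * cmath.pi * k / N)     # omega_N^{-k}
            return sum(x(j) * w**j for j in range(N))
        return dft

    def cooley_tukey(N1, N2, x):
        # Decompose an (N1*N2)-point DFT as in the functional pseudo-code above.
        N = N1 * N2
        xhat = lambda n2, n1: x(n2 + n1 * N2)
        G1 = lambda n2: dftgen(N1, lambda n1: xhat(n2, n1))                       # N1-point DFTs
        W = lambda k1, n2: G1(n2)(k1) * cmath.exp(-2j * cmath.pi * k1 * n2 / N)   # twiddle
        G2 = lambda k1: dftgen(N2, lambda n2: W(k1, n2))                          # N2-point DFTs
        return lambda k: G2(k % N1)(k // N1)

    data = [complex(j) for j in range(6)]
    x = lambda j: data[j]
    direct = dftgen(6, x)
    fast = cooley_tukey(2, 3, x)
    print(all(abs(direct(k) - fast(k)) < 1e-9 for k in range(6)))   # True

FFTW's codelet generator composes such recursive rules symbolically and emits the simplified straight-line C code as "codelets."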

  11. Tensor Contraction Engine (TCE) for quantum chemistry

  12. Tensor Contraction Engine (TCE). Application domain: quantum chemistry (electronic structure calculations), where the dominant computation is expressible as a "tensor contraction." TCE generates a complete parallel program as output from a high-level spec and automates time-space trade-offs. S. Hirata (2002), and many others. The following presentation is taken from the Proc. IEEE 2005 special issue.

  13. Motivation: Simplify program development. Source: Baumgartner, et al. (2005)

  14. Rewriting to reduce operation counts. Naïvely,

    S_{abij} = \sum_{c,d,e,f,k,l} A_{acik} \, B_{befl} \, C_{dfjk} \, D_{cdel}

costs \approx 4 \times N^{10} flops. Assuming associativity and distributivity, it can be rewritten as

    S_{abij} = \sum_{c,k} A_{acik} \left( \sum_{d,f} C_{dfjk} \left( \sum_{e,l} B_{befl} \, D_{cdel} \right) \right)

which costs \approx 6 \times N^{6} flops, but also requires temporary storage. Source: Baumgartner, et al. (2005). (A small numerical check follows below.)
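
A quick numerical sanity check of this rewrite, as a hedged sketch using NumPy's einsum (illustrative only, not TCE output): the single four-tensor contraction and the factored three-step version produce the same S, while the factored form only ever contracts two tensors at a time.

    import numpy as np

    N = 5                               # tiny dimension so the naive O(N^10) path stays feasible
    rng = np.random.default_rng(0)
    A = rng.random((N,) * 4)            # A[a,c,i,k]
    B = rng.random((N,) * 4)            # B[b,e,f,l]
    C = rng.random((N,) * 4)            # C[d,f,j,k]
    D = rng.random((N,) * 4)            # D[c,d,e,l]

    # Naive: one 10-index contraction, ~4 N^10 multiply-adds if evaluated literally.
    S_naive = np.einsum("acik,befl,dfjk,cdel->abij", A, B, C, D, optimize=False)

    # Factored: three pairwise contractions, ~6 N^6 multiply-adds plus temporaries T1, T2.
    T1 = np.einsum("befl,cdel->bcdf", B, D)        # T1[b,c,d,f] = sum_{e,l} B * D
    T2 = np.einsum("bcdf,dfjk->bcjk", T1, C)       # T2[b,c,j,k] = sum_{d,f} T1 * C
    S_fact = np.einsum("bcjk,acik->abij", T2, A)   # S[a,b,i,j] = sum_{c,k} T2 * A

    print(np.allclose(S_naive, S_fact))            # True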

  15. Operation and storage minimization via loop fusion. The factored form uses two temporaries:

    T^{(1)}_{bcdf} = \sum_{e,l} B_{befl} \, D_{cdel}
    T^{(2)}_{bcjk} = \sum_{d,f} T^{(1)}_{bcdf} \, C_{dfjk}
    S_{abij} = \sum_{c,k} T^{(2)}_{bcjk} \, A_{acik}

Unfused pseudocode:

    T1 = T2 = S = 0
    for b, c, d, e, f, l do
      T1[b,c,d,f] += B[b,e,f,l] · D[c,d,e,l]
    for b, c, d, f, j, k do
      T2[b,c,j,k] += T1[b,c,d,f] · C[d,f,j,k]
    for a, b, c, i, j, k do
      S[a,b,i,j] += T2[b,c,j,k] · A[a,c,i,k]

  16. Operation and storage minimization via loop fusion (continued). Fusing the b, c loops across all three nests and the d, f loops across the first two reduces T1 to a scalar and T2 to a small (j, k) array:

    S = 0
    for b, c do
      T2f ← 0
      for d, f do
        T1f ← 0
        for e, l do
          T1f += B[b,e,f,l] · D[c,d,e,l]
        for j, k do
          T2f[j,k] += T1f · C[d,f,j,k]
      for a, i, j, k do
        S[a,b,i,j] += T2f[j,k] · A[a,c,i,k]

(A runnable comparison of the fused and unfused versions follows below.)
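
The effect of fusion is easy to see in a direct Python transcription of the two pseudocode versions (toy sizes and hypothetical dimension names; this is a sketch, not TCE-generated code): the fused loops never materialize the full T1[b,c,d,f] and T2[b,c,j,k] arrays, only a scalar t1 and one (j, k) tile per (b, c) iteration.

    import numpy as np

    Na = Nb = Nc = Nd = Ne = Nf = Ni = Nj = Nk = Nl = 3   # toy sizes
    rng = np.random.default_rng(1)
    A = rng.random((Na, Nc, Ni, Nk))
    B = rng.random((Nb, Ne, Nf, Nl))
    C = rng.random((Nd, Nf, Nj, Nk))
    D = rng.random((Nc, Nd, Ne, Nl))

    # Unfused reference: full temporaries T1 and T2.
    T1 = np.einsum("befl,cdel->bcdf", B, D)
    T2 = np.einsum("bcdf,dfjk->bcjk", T1, C)
    S_ref = np.einsum("bcjk,acik->abij", T2, A)

    # Fused over (b, c) and (d, f): T1 collapses to a scalar, T2 to a (j, k) tile.
    S = np.zeros((Na, Nb, Ni, Nj))
    for b in range(Nb):
        for c in range(Nc):
            T2f = np.zeros((Nj, Nk))
            for d in range(Nd):
                for f in range(Nf):
                    t1 = 0.0
                    for e in range(Ne):
                        for l in range(Nl):
                            t1 += B[b, e, f, l] * D[c, d, e, l]
                    for j in range(Nj):
                        for k in range(Nk):
                            T2f[j, k] += t1 * C[d, f, j, k]
            for a in range(Na):
                for i in range(Ni):
                    for j in range(Nj):
                        for k in range(Nk):
                            S[a, b, i, j] += T2f[j, k] * A[a, c, i, k]

    print(np.allclose(S, S_ref))   # True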

  17. Time-space trade-offs. Maximum index ranges: a through f are O(1000); i through k are O(100).

    for a, e, c, f do
      for i, j do
        X[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]        # "contraction" of T over i, j
    for c, e, b, k do
      T1[c,e,b,k] ← f1(c, e, b, k)                   # integrals, O(1000) flops each
    for a, f, b, k do
      T2[a,f,b,k] ← f2(a, f, b, k)
    for c, e, a, f do
      for b, k do
        Y[c,e,a,f] += T1[c,e,b,k] · T2[a,f,b,k]      # "contraction" over T1 and T2
    for c, e, a, f do
      E += X[a,e,c,f] · Y[c,e,a,f]

  18. Time-space trade-offs (continued). In the loop nest above, the producers of T1 and T2 and the consumer loop that forms Y iterate over matching indices, so they are loop-fusion candidates.

  19. Time-space trade-offs (continued). To make that fusion possible, the producers of T1 and T2 are moved into fuller loop nests, which adds extra flops because each value is recomputed many times:

    for a, c, e, f, b, k do
      T1[c,e,b,k] ← f1(c, e, b, k)      # now recomputed for every (a, f)
    for a, e, c, f, b, k do
      T2[a,f,b,k] ← f2(a, f, b, k)      # now recomputed for every (c, e)

The loops that form X, Y, and E are unchanged.

  20. Time-space trade-offs (continued). Fully fused: the a, e, c, f loops of all the nests are merged, so X and Y collapse to scalars x and y, and T1 and T2 need not be stored at all, at the cost of the extra flops above:

    for a, e, c, f do
      x ← 0, y ← 0
      for i, j do
        x += T[i,j,a,e] · T[i,j,c,f]
      for b, k do
        T1 ← f1(c, e, b, k)
        T2 ← f2(a, f, b, k)
        y += T1 · T2
      E += x · y

(A runnable check against the unfused baseline follows below.)
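
Here is a hedged end-to-end check of the fully fused form in Python, with toy sizes and placeholder integral functions f1 and f2 (all names and sizes are illustrative, not TCE output). The unfused baseline materializes X, Y, T1, T2 in full; the fused loop keeps only the scalars x and y per (a, e, c, f) iteration, but re-evaluates f1 and f2 inside that loop.

    import numpy as np

    V, O = 5, 3                 # toy sizes; the slides assume a..f ~ O(1000) and i..k ~ O(100)
    rng = np.random.default_rng(3)
    T = rng.random((O, O, V, V))                    # T[i,j,a,e]

    def f1(c, e, b, k):                             # placeholder "integral" (really ~1000 flops)
        return np.sin(c + 2.0 * e + 3.0 * b + 5.0 * k)

    def f2(a, f, b, k):
        return np.cos(a + 2.0 * f + 3.0 * b + 5.0 * k)

    # Unfused baseline: full intermediates X, Y, T1, T2.
    X = np.einsum("ijae,ijcf->aecf", T, T)          # X[a,e,c,f]
    T1 = np.fromfunction(f1, (V, V, V, O))          # T1[c,e,b,k]
    T2 = np.fromfunction(f2, (V, V, V, O))          # T2[a,f,b,k]
    Y = np.einsum("cebk,afbk->ceaf", T1, T2)        # Y[c,e,a,f]
    E_ref = np.einsum("aecf,ceaf->", X, Y)

    # Fully fused: X and Y collapse to scalars; f1 and f2 are recomputed per (a,e,c,f).
    E = 0.0
    for a in range(V):
        for e in range(V):
            for c in range(V):
                for f in range(V):
                    x = sum(T[i, j, a, e] * T[i, j, c, f]
                            for i in range(O) for j in range(O))
                    y = sum(f1(c, e, b, k) * f2(a, f, b, k)
                            for b in range(V) for k in range(O))
                    E += x * y

    print(np.isclose(E, E_ref))   # True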

  21. Tiled & partially fused. The a, e, c, f loops are tiled (tile loops aB, eB, cB, fB) and fusion is applied within each tile, so the intermediates X̂, Ŷ, T̂1, T̂2 need only tile-sized storage while f1 and f2 are re-evaluated once per tile rather than once per loop iteration:

    for aB, eB, cB, fB do                 # loops over tiles
      for a, e, c, f do                   # loops within a tile
        for i, j do
          X̂[a,e,c,f] += T[i,j,a,e] · T[i,j,c,f]
      for b, k do
        for c, e do
          T̂1[c,e] ← f1(c, e, b, k)
        for a, f do
          T̂2[a,f] ← f2(a, f, b, k)
        for c, e, a, f do
          Ŷ[c,e,a,f] += T̂1[c,e] · T̂2[a,f]
      for c, e, a, f do
        E += X̂[a,e,c,f] · Ŷ[c,e,a,f]

  22. Transform algebraically to minimize flops; minimize temporary storage; distribute and partition data for a parallel system; search with respect to the space-time trade-off (with feedback); for out-of-core problems, optimize data locality; generate the final program (C/Fortran + MPI/Global Arrays).

  23. Tensor loop nest ⇒ expression tree. The loop nest from slide 17 corresponds to an expression tree: the root forms E by summing over c, e, a, f the product X · Y; the X node sums the product T · T over i, j; the Y node sums the product T(1) · T(2) over b, k, with leaves given by the integral functions f1 and f2. (A small sketch of such a tree representation follows below.)
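
One way to picture such an expression-tree intermediate form is sketched below (hypothetical data structures, not TCE's actual internals): each internal node records the indices it sums over and the subexpressions it multiplies, which is exactly the information the fusion analysis on the following slides operates on.

    from dataclasses import dataclass

    @dataclass
    class Tensor:
        name: str
        indices: tuple            # e.g. ("i", "j", "a", "e")

    @dataclass
    class Contract:
        sum_indices: tuple        # indices summed at this node, e.g. ("i", "j")
        operands: list            # child Tensor or Contract nodes (multiplied together)

    # E = sum_{c,e,a,f} X * Y;  X = sum_{i,j} T * T;  Y = sum_{b,k} T1 * T2
    X = Contract(("i", "j"), [Tensor("T", ("i", "j", "a", "e")),
                              Tensor("T", ("i", "j", "c", "f"))])
    Y = Contract(("b", "k"), [Tensor("T1", ("c", "e", "b", "k")),
                              Tensor("T2", ("a", "f", "b", "k"))])
    E = Contract(("c", "e", "a", "f"), [X, Y])

    def show(node, depth=0):
        # Print the tree, one line per node.
        pad = "  " * depth
        if isinstance(node, Tensor):
            print(pad + node.name + "[" + ",".join(node.indices) + "]")
        else:
            print(pad + "+ over " + ",".join(node.sum_indices))
            for child in node.operands:
                show(child, depth + 1)

    show(E)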

  24. Expression tree ⇒ fusion graph. [Figure: the expression tree above redrawn as a fusion graph, with each node (E, X, Y, T(1), T(2), T, f1, f2) annotated by its loop indices; connecting matching indices of a producer and its consumer indicates a candidate loop fusion.]

  25. Fusion graph. [Figure: the fusion graph for the example, nodes annotated with their loop indices.]

  26. Fusion graph. [Figure: a fusion choice under which X reduces to a scalar ("Fuse ⇒ X scalar").]

  27. Fusion graph. [Figure: a fusion choice under which Y reduces to a scalar ("Fuse ⇒ Y scalar").]

  28. Fusion graph. [Figure: another fusion configuration on the same graph.]

  29. Fusion graph. [Figure: another fusion configuration on the same graph.]

  30. Fusion graph. [Figure: another fusion configuration on the same graph.]

  32. Empirical compilers and tools

  33. Code generation tools for autotuning: GNU Superoptimizer -- exhaustive search over schedules of straight-line code; Denali -- theorem-proving-based scheduling; iFKO (Whaley @ UTSA) -- iterative floating-point kernel optimizer; POET (Yi @ UTSA) -- Parameterized Optimizations for Empirical Tuning.

  34. Iterative/empirical compilers. Compile-time: "iterative compilation" -- Kisuki, Knijnenberg, O'Boyle, et al.; hybrid model/search-based compilation -- Hall, et al. (USC); Eigenmann @ Purdue (Polaris); Quinlan, et al. (LLNL / PERI); whole-program tuning -- Qasem (TSU), Kennedy, Mellor-Crummey (Rice); compilers that learn -- Cavazos (UDel), Stephenson/Amarasinghe (MIT). Run-time: Voss, et al. -- ADAPT.

  35. Administrivia

  36. Upcoming schedule changes. Some adjustment of topics (TBD). Today -- project proposals due. Th 3/13 -- SIAM Parallel Processing (attendance encouraged). Tu 4/1 -- no class. Th 4/3 -- attend talk by Doug Post from the DoD HPC Modernization Program.
