Computer Generation of General Size Linear Transform Libraries Yevgen Voronenko, Frédéric de Mesmay, and Markus Püschel Carnegie Mellon University, Pittsburgh, USA
PROBLEM AND MOTIVATION http://4.bp.blogspot.com/_LkcgSIEg0jw/TOi0XygeC4I/AAAAAAAAAH0/K0FyPn28gtc/s1600/Confuse.jpeg 2
Situation Today Numerical Library hand written library e.g. FFTW Codelet auto generated base cases for fixed size Codelet (Advanced Libraries) Codelet Not used in practice http://en.wikipedia.org/wiki/File:DLL_icon_on_Windows_Vista.png 3
Problem automatic new plattform arrives generation optimize & implement reoptimize & reimplement for spezific plattform on new plattform http://en.wikipedia.org/wiki/File:DLL_icon_on_Windows_Vista.png 4 http://en.wikipedia.org/wiki/Intel
Goal Platform Parameters Cacheline size Optimized #Cores , … Library Generator Transform Description DFT, DCT, … http://www.anilgears.com/images1/gear6.jpg 5
BACKGROUND ON LINEAR TRANSFORMS http://www.dimelo.com/img/style/analytics.jpg 6
Example DFT 𝑧 = 𝐸𝐺𝑈 𝑜 · 𝑦 Discrete Fourier Transform (DFT): 𝑜 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 base case: 𝐸𝐺𝑈 2 = 1 1 −1 1 Recursive Cases 7
Example: Implementation (Naïve) 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 void dft( int n, cpx *y, cpx *x) { int k = choose_factor(n); int m = n/k; cpx *t1 = Permute x with L(n,k); // t2 = (I_k tensor DFT_m)*t1 for ( int i=0; i<k; ++i) memory to dft(m, t2 + m*i, t1 + m*i); memory copy // t3 = diag( d(j) )*t2 for ( int i=0; i<n; ++i) SLOW! t3[i] = d(i) * t2[i]; // y = (DFT_k tensor I_m)*t3, for ( int i=0; i<m; ++i) dft_str(k, m, y + i, t3 + i); } // to be implemented void dft_str( int n, int str, cpx *Y, cpx *X); 8
DFT – Matrix Representation 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 Input Permutation Diagonal Matrix Recursive Cases graphics from paper 9
DFT – Dataflow Representation 1.1 2.1 1 3.1 4.1 1.2 2.2 2 3.2 4.2 1.3 2.3 3 3.3 4.3 1.4 2.4 4 3.4 4.4 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 10
DFT – Improvement Idea 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 𝑜 : A scalar multiplier for each entry 𝑈 𝑛 𝑜 : A permutation function 𝑀 𝑙 Merge 𝑜 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 “Scaled DFT” “ Strided DFT” 11
Example – Implementation (FFTW) n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m void dft( int n, cpx *y, cpx *x) { int k = choose_factor(n); // t1 = (I_k tensor DFT_m)L(n,k)*x for ( int i=0; i < k; ++i) dft_str(m, k, 1, t1 + m*i, x + m*i); // y = (DFT_k tensor I_m) diag(d(j)) for ( int i=0; i < m; ++i) dft_scaled(k, m, precomp_d[i], y + i, t1 + i); } // to be implemented void dft_str( int n, int istr, int ostr, cpx *y, cpx*x); void dft_scaled( int n, int str, cpx *d, cpx *y, cpx*x); Challenge: How to find those recursive functions? 12
Example - Observation Not yet parallelized and vectorized Many other recursive functions needed Optimizations are to be identified and implemented by hand 13 http://bookszeus.files.wordpress.com/2011/09/hands-on-keyboard.jpg
LIBRARY GENERATOR http://www.artistsvalley.com/images/icons/Database%20Application%20Icons/Data%20Generator/ 14 256x256/Data%20Generator.jpg
Transform Description Formulated in special language (SPL) New set of breakdown rules for each transform Can be complicated 15
Library Generation for Transforms dft(), dft_str() dft_scaled() Recursion Steps Breakdown Rules Library 𝐄𝐆𝐔 𝐨 = n 𝐄𝐆𝐔 𝐥 ⊗ I m T m n Base Cases I k ⊗ 𝐄𝐆𝐔 𝐧 L k DFT 2 to DFT 16 Generator http://www.anilgears.com/images1/gear6.jpg 16
From SPL to Σ -SPL DFT n Problem A Linear Transform: Breakdown rules in SPL: Breakdown Rule n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m Rewritten breakdown rules into Σ -SPL Σ -SPL 17
Difference of SPL and Σ -SPL Σ -SPL = Extended SPL + Loop constructs Σ : iterative matrix sum + explicit I/O: gather G(f) and scatter S(f) + index mapping functions M(f) for strides (permutations) How to rewrite SPL to Σ -SPL? 𝑔: 𝐵 → 𝐶 Functions: A: Integer interval [1..n] B: Integer interval [1..n] or ℂ, ℝ Example: Permutation 𝑔 𝑜→𝑜 (𝑐𝑗𝑘𝑓𝑑𝑢𝑗𝑤𝑓) 18
Σ -SPL – Gather and Scatter Matrices Parameterized matrices 1 𝑈 = 𝐻 𝑔 ≔ 𝑓 𝑔 0 , … , 𝑓 𝑔 𝑜−1 ⋱ 1 1 ⋱ 1 𝑧 = 𝐻 𝑔 ⋅ 𝑦 ⇔ 𝑧 𝑗 = 𝑦 𝑔 𝑗 𝑇 𝑔 ≔ 𝐻 𝑔 𝑈 = 𝑧 = 𝑇 𝑔 ⋅ 𝑦 ⇔ 𝑧 𝑔 𝑗 = 𝑦 𝑗 19
Inside the Tensor Product n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m Y = 𝐽 𝑙 ⊗ 𝐵 x 𝑍 𝑌 1 𝐵 ⋯ 0 1 ⋮ ⋮ = ⋮ ⋱ ⋮ 𝑍 𝑌 𝑜 0 ⋯ 𝐵 𝑜 𝑍 𝑌 1 𝐵 ⋯ 0 0 ⋯ 0 1 ⋮ ⋮ = + ⋯ + ⋮ ⋱ ⋮ ⋮ ⋱ ⋮ 𝑍 𝑌 𝑜 0 ⋯ 0 0 ⋯ 𝐵 𝑜 Iterative Matrix Sum 𝑍 𝑗 = 𝐵 ⋅ 𝑌 𝑗 Partition of vectors: 20
Inside the Tensor Product n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m j=0 j=0 0 0 A : : loop j=k-1 j=k-1 k-1 k-1 kernel Y X 𝑙−1 𝐽 𝑙 ⊗ 𝐵 = 𝑇 0𝐵 𝐻 0 + ⋯ + 𝑇 𝑙−1 𝐵 𝐻 𝑙−1 = 𝑇 𝑘 𝐵 𝐻 𝑘 𝑘=0 21
Σ -SPL – Rewriting Rules (Extract) SPL to Σ -SPL Loop merging 22
From SPL to Σ -SPL n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m 𝑙−1 𝑜 = m 𝐽 𝑙 ⊗ 𝐸𝐺𝑈 𝑛 𝑀 𝑙 𝑇 ℎ 𝑘𝑛,1 𝐸𝐺𝑈 𝑛 𝐻(ℎ 𝑘𝑛,1 ) perm l k 𝑘=0 𝑙−1 = 𝑇 ℎ 𝑘𝑛,1 𝐸𝐺𝑈 𝑛 𝐻(ℎ 𝑘,𝑙 ) 𝑘=0 𝑛−1 𝑜 = 𝐸𝐺𝑈 𝑙 ⊗ 𝐽 𝑛 𝑈 𝑇 ℎ 𝑘,𝑛 𝐸𝐺𝑈 𝑙 𝑒𝑗𝑏(𝑔 ∘ ℎ 𝑗,𝑙 )𝐻(ℎ 𝑘,𝑛 ) 𝑛 𝑘=0 23
Recursion Step Closure Implement Transforms & Breakdown Rules expand & rewrite Parameterize Parametrize Implement find params according to Yes constraints New (Matrix dimensions ) R-Steps ? No 24
Recursion Steps - Example n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m dft Function Tag to mark recursions 𝑙−1 𝑜 = 𝐽 𝑙 ⊗ 𝐸𝐺𝑈 𝑛 𝑀 𝑙 𝑇 ℎ 𝑘𝑛,1 𝐸𝐺𝑈 𝑛 𝐻(ℎ 𝑘,𝑙 ) 𝑘=0 dft_str 𝑛−1 𝑜 = 𝐸𝐺𝑈 𝑙 ⊗ 𝐽 𝑛 𝑈 𝑇 ℎ 𝑘,𝑛 𝐸𝐺𝑈 𝑙 𝑒𝑗𝑏(𝑔 ∘ ℎ 𝑗,𝑙 )𝐻(ℎ 𝑘,𝑛 ) 𝑛 𝑘=0 dft_scaled Expand tag to scatter / gather 25
Recursion Steps – Graph dft dft_str dft_scale Stride = 1 dft_scale_base 26
Library Generation for Transforms dft(), dft_str() dft_scaled() Recursion Steps Code Generation Breakdown Rules Library Parameter Handling 𝐄𝐆𝐔 𝐨 = n 𝐄𝐆𝐔 𝐥 ⊗ I m T m n Base Cases I k ⊗ 𝐄𝐆𝐔 𝐧 L k DFT 2 to DFT 16 Generator http://www.anilgears.com/images1/gear6.jpg 27
Code Translation (Extract) Σ -SPL Code Rewriting Rules Code ( 𝐻 𝑔 𝑜→𝑂 , 𝑧, 𝑦 ) for(j=0..n-1) y[j] = x[f(j)]; Code ( 𝑒𝑗𝑏 𝑔 𝑜→ℂ , 𝑧, 𝑦 ) for(j=0..n-1) y[j] = f(j)*x[j]; 𝑙−1 Code ( 𝑁 , 𝑧, 𝑦 ) 𝑘 𝑘=0 for(j=0..k-1) Code ( 𝑁 𝑘 , 𝑧, 𝑦 ) http://www.carlosag.net/images/codeTranslatorLogo.gif 28
Parameter Handling Precomputed Function Parameters Parameters (Cold) (Hot) “Initial - Parameter” “Runtime - parameter” Large Parameter Set u1,u2, u3, u4, u5, u6, u7, u8, f1 𝑣1→𝑣2 𝑣1→ℂ 𝐸𝐺𝑈 𝑣1 = 𝑇 ℎ 𝑣3,1 𝐸𝐺𝑈 𝑔1 ⊗ 𝐽 𝑣1/𝑔1 𝑒𝑗𝑏 𝑞𝑠𝑓 𝑒 𝑣1/𝑔1 𝑣1 𝐻(ℎ 𝑣7,𝑣8 𝑣1→𝑣6 ) ⋅ 𝐽 𝑔1 ⊗ 𝐸𝐺𝑈 𝑣1/𝑔1 𝑀 𝑔1 29
Library Generation Generated Functions Actual Code Higher-order descriptor Classes Identified recursion DFT, DFT_STR, steps DFT_SCALED Cold parameters Constructor arguments DFT size n DFT(int n,…) Hot parameters Method arguments Vectors x,y DFT->compute(x, y) Base cases Constants DFT2 [1 1; 1 -1] 30
Library Generation: Platform Parallel & vector code New rewriting rules with # Processors, Cache-line size, vector length Additional recursion steps: DFT scalar: 4 DFT vectorized+parallel: 8 31
RESULTS http://britishmastersaf.files.wordpress.com/2011/05/results.jpg 32
Test Plattform CPU 2x dual core 3GHz Intel Xeon 5160 processors Compiler: Intel C/C++ Compiler 10.1 and SUN JDK 1.6.0 Threading: OpenMP pragmas Library Generation Time: between 10 and 60 minutes 33
Scalar Just the base approx. same algorithms cases up to 64 and optimization 34 Source: Paper
4 cores used 1 core used 2 cores used 35 Source: Paper
Extra passes due to conversion into RDFT 36 Source: Paper
Recommend
More recommend