Linear Transform Libraries Yevgen Voronenko, Frdric de Mesmay, and - PowerPoint PPT Presentation
Computer Generation of General Size Linear Transform Libraries Yevgen Voronenko, Frdric de Mesmay, and Markus Pschel Carnegie Mellon University, Pittsburgh, USA PROBLEM AND MOTIVATION
Computer Generation of General Size Linear Transform Libraries Yevgen Voronenko, Frédéric de Mesmay, and Markus Püschel Carnegie Mellon University, Pittsburgh, USA
PROBLEM AND MOTIVATION http://4.bp.blogspot.com/_LkcgSIEg0jw/TOi0XygeC4I/AAAAAAAAAH0/K0FyPn28gtc/s1600/Confuse.jpeg 2
Situation Today Numerical Library hand written library e.g. FFTW Codelet auto generated base cases for fixed size Codelet (Advanced Libraries) Codelet Not used in practice http://en.wikipedia.org/wiki/File:DLL_icon_on_Windows_Vista.png 3
Problem automatic new plattform arrives generation optimize & implement reoptimize & reimplement for spezific plattform on new plattform http://en.wikipedia.org/wiki/File:DLL_icon_on_Windows_Vista.png 4 http://en.wikipedia.org/wiki/Intel
Goal Platform Parameters Cacheline size Optimized #Cores , … Library Generator Transform Description DFT, DCT, … http://www.anilgears.com/images1/gear6.jpg 5
BACKGROUND ON LINEAR TRANSFORMS http://www.dimelo.com/img/style/analytics.jpg 6
Example DFT 𝑧 = 𝐸𝐺𝑈 𝑜 · 𝑦 Discrete Fourier Transform (DFT): 𝑜 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 base case: 𝐸𝐺𝑈 2 = 1 1 −1 1 Recursive Cases 7
Example: Implementation (Naïve) 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 void dft( int n, cpx *y, cpx *x) { int k = choose_factor(n); int m = n/k; cpx *t1 = Permute x with L(n,k); // t2 = (I_k tensor DFT_m)*t1 for ( int i=0; i<k; ++i) memory to dft(m, t2 + m*i, t1 + m*i); memory copy // t3 = diag( d(j) )*t2 for ( int i=0; i<n; ++i) SLOW! t3[i] = d(i) * t2[i]; // y = (DFT_k tensor I_m)*t3, for ( int i=0; i<m; ++i) dft_str(k, m, y + i, t3 + i); } // to be implemented void dft_str( int n, int str, cpx *Y, cpx *X); 8
DFT – Matrix Representation 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 Input Permutation Diagonal Matrix Recursive Cases graphics from paper 9
DFT – Dataflow Representation 1.1 2.1 1 3.1 4.1 1.2 2.2 2 3.2 4.2 1.3 2.3 3 3.3 4.3 1.4 2.4 4 3.4 4.4 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 10
DFT – Improvement Idea 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑜 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 𝑜 : A scalar multiplier for each entry 𝑈 𝑛 𝑜 : A permutation function 𝑀 𝑙 Merge 𝑜 𝑜 𝐽 𝑙 ⊗ 𝑬𝑮𝑼 𝒏 𝑀 𝑙 𝑬𝑮𝑼 𝒐 = 𝑬𝑮𝑼 𝒍 ⊗ 𝐽 𝑛 𝑈 𝑛 “Scaled DFT” “ Strided DFT” 11
Example – Implementation (FFTW) n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m void dft( int n, cpx *y, cpx *x) { int k = choose_factor(n); // t1 = (I_k tensor DFT_m)L(n,k)*x for ( int i=0; i < k; ++i) dft_str(m, k, 1, t1 + m*i, x + m*i); // y = (DFT_k tensor I_m) diag(d(j)) for ( int i=0; i < m; ++i) dft_scaled(k, m, precomp_d[i], y + i, t1 + i); } // to be implemented void dft_str( int n, int istr, int ostr, cpx *y, cpx*x); void dft_scaled( int n, int str, cpx *d, cpx *y, cpx*x); Challenge: How to find those recursive functions? 12
Example - Observation Not yet parallelized and vectorized Many other recursive functions needed Optimizations are to be identified and implemented by hand 13 http://bookszeus.files.wordpress.com/2011/09/hands-on-keyboard.jpg
LIBRARY GENERATOR http://www.artistsvalley.com/images/icons/Database%20Application%20Icons/Data%20Generator/ 14 256x256/Data%20Generator.jpg
Transform Description Formulated in special language (SPL) New set of breakdown rules for each transform Can be complicated 15
Library Generation for Transforms dft(), dft_str() dft_scaled() Recursion Steps Breakdown Rules Library 𝐄𝐆𝐔 𝐨 = n 𝐄𝐆𝐔 𝐥 ⊗ I m T m n Base Cases I k ⊗ 𝐄𝐆𝐔 𝐧 L k DFT 2 to DFT 16 Generator http://www.anilgears.com/images1/gear6.jpg 16
From SPL to Σ -SPL DFT n Problem A Linear Transform: Breakdown rules in SPL: Breakdown Rule n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m Rewritten breakdown rules into Σ -SPL Σ -SPL 17
Difference of SPL and Σ -SPL Σ -SPL = Extended SPL + Loop constructs Σ : iterative matrix sum + explicit I/O: gather G(f) and scatter S(f) + index mapping functions M(f) for strides (permutations) How to rewrite SPL to Σ -SPL? 𝑔: 𝐵 → 𝐶 Functions: A: Integer interval [1..n] B: Integer interval [1..n] or ℂ, ℝ Example: Permutation 𝑔 𝑜→𝑜 (𝑐𝑗𝑘𝑓𝑑𝑢𝑗𝑤𝑓) 18
Σ -SPL – Gather and Scatter Matrices Parameterized matrices 1 𝑈 = 𝐻 𝑔 ≔ 𝑓 𝑔 0 , … , 𝑓 𝑔 𝑜−1 ⋱ 1 1 ⋱ 1 𝑧 = 𝐻 𝑔 ⋅ 𝑦 ⇔ 𝑧 𝑗 = 𝑦 𝑔 𝑗 𝑇 𝑔 ≔ 𝐻 𝑔 𝑈 = 𝑧 = 𝑇 𝑔 ⋅ 𝑦 ⇔ 𝑧 𝑔 𝑗 = 𝑦 𝑗 19
Inside the Tensor Product n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m Y = 𝐽 𝑙 ⊗ 𝐵 x 𝑍 𝑌 1 𝐵 ⋯ 0 1 ⋮ ⋮ = ⋮ ⋱ ⋮ 𝑍 𝑌 𝑜 0 ⋯ 𝐵 𝑜 𝑍 𝑌 1 𝐵 ⋯ 0 0 ⋯ 0 1 ⋮ ⋮ = + ⋯ + ⋮ ⋱ ⋮ ⋮ ⋱ ⋮ 𝑍 𝑌 𝑜 0 ⋯ 0 0 ⋯ 𝐵 𝑜 Iterative Matrix Sum 𝑍 𝑗 = 𝐵 ⋅ 𝑌 𝑗 Partition of vectors: 20
Inside the Tensor Product n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m j=0 j=0 0 0 A : : loop j=k-1 j=k-1 k-1 k-1 kernel Y X 𝑙−1 𝐽 𝑙 ⊗ 𝐵 = 𝑇 0𝐵 𝐻 0 + ⋯ + 𝑇 𝑙−1 𝐵 𝐻 𝑙−1 = 𝑇 𝑘 𝐵 𝐻 𝑘 𝑘=0 21
Σ -SPL – Rewriting Rules (Extract) SPL to Σ -SPL Loop merging 22
From SPL to Σ -SPL n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m 𝑙−1 𝑜 = m 𝐽 𝑙 ⊗ 𝐸𝐺𝑈 𝑛 𝑀 𝑙 𝑇 ℎ 𝑘𝑛,1 𝐸𝐺𝑈 𝑛 𝐻(ℎ 𝑘𝑛,1 ) perm l k 𝑘=0 𝑙−1 = 𝑇 ℎ 𝑘𝑛,1 𝐸𝐺𝑈 𝑛 𝐻(ℎ 𝑘,𝑙 ) 𝑘=0 𝑛−1 𝑜 = 𝐸𝐺𝑈 𝑙 ⊗ 𝐽 𝑛 𝑈 𝑇 ℎ 𝑘,𝑛 𝐸𝐺𝑈 𝑙 𝑒𝑗𝑏(𝑔 ∘ ℎ 𝑗,𝑙 )𝐻(ℎ 𝑘,𝑛 ) 𝑛 𝑘=0 23
Recursion Step Closure Implement Transforms & Breakdown Rules expand & rewrite Parameterize Parametrize Implement find params according to Yes constraints New (Matrix dimensions ) R-Steps ? No 24
Recursion Steps - Example n n I k ⊗ 𝐄𝐆𝐔 𝐧 L k 𝐄𝐆𝐔 𝐨 = 𝐄𝐆𝐔 𝐥 ⊗ I m T m dft Function Tag to mark recursions 𝑙−1 𝑜 = 𝐽 𝑙 ⊗ 𝐸𝐺𝑈 𝑛 𝑀 𝑙 𝑇 ℎ 𝑘𝑛,1 𝐸𝐺𝑈 𝑛 𝐻(ℎ 𝑘,𝑙 ) 𝑘=0 dft_str 𝑛−1 𝑜 = 𝐸𝐺𝑈 𝑙 ⊗ 𝐽 𝑛 𝑈 𝑇 ℎ 𝑘,𝑛 𝐸𝐺𝑈 𝑙 𝑒𝑗𝑏(𝑔 ∘ ℎ 𝑗,𝑙 )𝐻(ℎ 𝑘,𝑛 ) 𝑛 𝑘=0 dft_scaled Expand tag to scatter / gather 25
Recursion Steps – Graph dft dft_str dft_scale Stride = 1 dft_scale_base 26
Library Generation for Transforms dft(), dft_str() dft_scaled() Recursion Steps Code Generation Breakdown Rules Library Parameter Handling 𝐄𝐆𝐔 𝐨 = n 𝐄𝐆𝐔 𝐥 ⊗ I m T m n Base Cases I k ⊗ 𝐄𝐆𝐔 𝐧 L k DFT 2 to DFT 16 Generator http://www.anilgears.com/images1/gear6.jpg 27
Code Translation (Extract) Σ -SPL Code Rewriting Rules Code ( 𝐻 𝑔 𝑜→𝑂 , 𝑧, 𝑦 ) for(j=0..n-1) y[j] = x[f(j)]; Code ( 𝑒𝑗𝑏 𝑔 𝑜→ℂ , 𝑧, 𝑦 ) for(j=0..n-1) y[j] = f(j)*x[j]; 𝑙−1 Code ( 𝑁 , 𝑧, 𝑦 ) 𝑘 𝑘=0 for(j=0..k-1) Code ( 𝑁 𝑘 , 𝑧, 𝑦 ) http://www.carlosag.net/images/codeTranslatorLogo.gif 28
Parameter Handling Precomputed Function Parameters Parameters (Cold) (Hot) “Initial - Parameter” “Runtime - parameter” Large Parameter Set u1,u2, u3, u4, u5, u6, u7, u8, f1 𝑣1→𝑣2 𝑣1→ℂ 𝐸𝐺𝑈 𝑣1 = 𝑇 ℎ 𝑣3,1 𝐸𝐺𝑈 𝑔1 ⊗ 𝐽 𝑣1/𝑔1 𝑒𝑗𝑏 𝑞𝑠𝑓 𝑒 𝑣1/𝑔1 𝑣1 𝐻(ℎ 𝑣7,𝑣8 𝑣1→𝑣6 ) ⋅ 𝐽 𝑔1 ⊗ 𝐸𝐺𝑈 𝑣1/𝑔1 𝑀 𝑔1 29
Library Generation Generated Functions Actual Code Higher-order descriptor Classes Identified recursion DFT, DFT_STR, steps DFT_SCALED Cold parameters Constructor arguments DFT size n DFT(int n,…) Hot parameters Method arguments Vectors x,y DFT->compute(x, y) Base cases Constants DFT2 [1 1; 1 -1] 30
Library Generation: Platform Parallel & vector code New rewriting rules with # Processors, Cache-line size, vector length Additional recursion steps: DFT scalar: 4 DFT vectorized+parallel: 8 31
RESULTS http://britishmastersaf.files.wordpress.com/2011/05/results.jpg 32
Test Plattform CPU 2x dual core 3GHz Intel Xeon 5160 processors Compiler: Intel C/C++ Compiler 10.1 and SUN JDK 1.6.0 Threading: OpenMP pragmas Library Generation Time: between 10 and 60 minutes 33
Scalar Just the base approx. same algorithms cases up to 64 and optimization 34 Source: Paper
4 cores used 1 core used 2 cores used 35 Source: Paper
Extra passes due to conversion into RDFT 36 Source: Paper
Recommend
More recommend
Explore More Topics
Stay informed with curated content and fresh updates.