Linear Transform Libraries Yevgen Voronenko, Frdric de Mesmay, and - - PowerPoint PPT Presentation

linear transform libraries
SMART_READER_LITE
LIVE PREVIEW

Linear Transform Libraries Yevgen Voronenko, Frdric de Mesmay, and - - PowerPoint PPT Presentation

Computer Generation of General Size Linear Transform Libraries Yevgen Voronenko, Frdric de Mesmay, and Markus Pschel Carnegie Mellon University, Pittsburgh, USA PROBLEM AND MOTIVATION


slide-1
SLIDE 1

Computer Generation of General Size Linear Transform Libraries

Yevgen Voronenko, Frédéric de Mesmay, and Markus Püschel Carnegie Mellon University, Pittsburgh, USA

slide-2
SLIDE 2

PROBLEM AND MOTIVATION

2

http://4.bp.blogspot.com/_LkcgSIEg0jw/TOi0XygeC4I/AAAAAAAAAH0/K0FyPn28gtc/s1600/Confuse.jpeg

slide-3
SLIDE 3

Numerical Library

3

Situation Today

http://en.wikipedia.org/wiki/File:DLL_icon_on_Windows_Vista.png

hand written library e.g. FFTW

Codelet Codelet Codelet

auto generated base cases for fixed size (Advanced Libraries)

Not used in practice

slide-4
SLIDE 4

4

Problem

http://en.wikipedia.org/wiki/File:DLL_icon_on_Windows_Vista.png http://en.wikipedia.org/wiki/Intel

  • ptimize & implement

for spezific plattform reoptimize & reimplement

  • n new plattform

automatic generation new plattform arrives

slide-5
SLIDE 5

5

Goal

http://www.anilgears.com/images1/gear6.jpg

Generator

Optimized Library Transform Description Platform Parameters

DFT, DCT, … Cacheline size #Cores, …

slide-6
SLIDE 6

BACKGROUND ON LINEAR TRANSFORMS

6

http://www.dimelo.com/img/style/analytics.jpg

slide-7
SLIDE 7

Discrete Fourier Transform (DFT): 𝑧 = 𝐸𝐺𝑈𝑜 · 𝑦

7

Example DFT

base case: 𝐸𝐺𝑈2 = 1 1 1 −1 Recursive Cases 𝑬𝑮𝑼𝒐 = 𝑬𝑮𝑼𝒍 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 𝐽𝑙 ⊗ 𝑬𝑮𝑼𝒏 𝑀𝑙 𝑜

slide-8
SLIDE 8

8

Example: Implementation (Naïve)

void dft(int n, cpx *y, cpx *x) { int k = choose_factor(n); int m = n/k; cpx *t1 = Permute x with L(n,k); // t2 = (I_k tensor DFT_m)*t1 for(int i=0; i<k; ++i) dft(m, t2 + m*i, t1 + m*i); // t3 = diag( d(j) )*t2 for(int i=0; i<n; ++i) t3[i] = d(i) * t2[i]; // y = (DFT_k tensor I_m)*t3, for(int i=0; i<m; ++i) dft_str(k, m, y + i, t3 + i); } // to be implemented void dft_str(int n, int str, cpx *Y, cpx *X);

𝑬𝑮𝑼𝒐 = 𝑬𝑮𝑼𝒍 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 𝐽𝑙 ⊗ 𝑬𝑮𝑼𝒏 𝑀𝑙 𝑜

memory to memory copy SLOW!

slide-9
SLIDE 9

9

DFT – Matrix Representation

graphics from paper

Recursive Cases 𝑬𝑮𝑼𝒐 = 𝑬𝑮𝑼𝒍 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 𝐽𝑙 ⊗ 𝑬𝑮𝑼𝒏 𝑀𝑙 𝑜

Diagonal Matrix Input Permutation

slide-10
SLIDE 10

DFT – Dataflow Representation

10

𝑬𝑮𝑼𝒐 = 𝑬𝑮𝑼𝒍 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 𝐽𝑙 ⊗ 𝑬𝑮𝑼𝒏 𝑀𝑙 𝑜 1 2 3

4 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2 1.3 2.3 3.3 4.3 1.4 2.4 3.4 4.4

slide-11
SLIDE 11

DFT – Improvement Idea

11

𝑬𝑮𝑼𝒐 = 𝑬𝑮𝑼𝒍 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 𝐽𝑙 ⊗ 𝑬𝑮𝑼𝒏 𝑀𝑙 𝑜

𝑈

𝑛 𝑜: A scalar multiplier for each entry

𝑀𝑙

𝑜: A permutation function

𝑬𝑮𝑼𝒐 = 𝑬𝑮𝑼𝒍 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 𝐽𝑙 ⊗ 𝑬𝑮𝑼𝒏 𝑀𝑙 𝑜

“Scaled DFT” “Strided DFT” Merge

slide-12
SLIDE 12

Example – Implementation (FFTW)

12

void dft(int n, cpx *y, cpx *x) { int k = choose_factor(n); // t1 = (I_k tensor DFT_m)L(n,k)*x for(int i=0; i < k; ++i) dft_str(m, k, 1, t1 + m*i, x + m*i); // y = (DFT_k tensor I_m) diag(d(j)) for(int i=0; i < m; ++i) dft_scaled(k, m, precomp_d[i], y + i, t1 + i); } // to be implemented void dft_str(int n, int istr, int ostr, cpx *y, cpx*x); void dft_scaled(int n, int str, cpx *d, cpx *y, cpx*x);

𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n Ik ⊗ 𝐄𝐆𝐔𝐧 Lk n

Challenge: How to find those recursive functions?

slide-13
SLIDE 13

Not yet parallelized and vectorized Many other recursive functions needed Optimizations are to be identified and implemented by hand

13

Example - Observation

http://bookszeus.files.wordpress.com/2011/09/hands-on-keyboard.jpg

slide-14
SLIDE 14

LIBRARY GENERATOR

14

http://www.artistsvalley.com/images/icons/Database%20Application%20Icons/Data%20Generator/ 256x256/Data%20Generator.jpg

slide-15
SLIDE 15

Formulated in special language (SPL) New set of breakdown rules for each transform Can be complicated

15

Transform Description

slide-16
SLIDE 16

16

Library Generation for Transforms

http://www.anilgears.com/images1/gear6.jpg

Library Breakdown Rules

Generator

Base Cases Recursion Steps

DFT2 to DFT16 dft(), dft_str() dft_scaled()

𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n

Ik ⊗ 𝐄𝐆𝐔𝐧 Lk

n

slide-17
SLIDE 17

17

From SPL to Σ-SPL

Problem Breakdown Rule Σ-SPL A Linear Transform: DFTn Breakdown rules in SPL: 𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n Ik ⊗ 𝐄𝐆𝐔𝐧 Lk n

Rewritten breakdown rules into Σ-SPL

slide-18
SLIDE 18

Σ-SPL = Extended SPL + Loop constructs Σ: iterative matrix sum + explicit I/O: gather G(f) and scatter S(f) + index mapping functions M(f) for strides (permutations) Functions: 𝑔: 𝐵 → 𝐶 A: Integer interval [1..n] B: Integer interval [1..n] or ℂ, ℝ Example: Permutation 𝑔𝑜→𝑜 (𝑐𝑗𝑘𝑓𝑑𝑢𝑗𝑤𝑓)

18

Difference of SPL and Σ-SPL

How to rewrite SPL to Σ-SPL?

slide-19
SLIDE 19

Parameterized matrices 𝐻 𝑔 ≔ 𝑓𝑔 0 , … , 𝑓𝑔 𝑜−1

𝑈 =

1 ⋱ 1 𝑇 𝑔 ≔ 𝐻 𝑔 𝑈 = 1 ⋱ 1

19

Σ-SPL – Gather and Scatter Matrices

𝑧 = 𝐻 𝑔 ⋅ 𝑦 ⇔ 𝑧𝑗 = 𝑦𝑔 𝑗 𝑧 = 𝑇 𝑔 ⋅ 𝑦 ⇔ 𝑧𝑔 𝑗 = 𝑦𝑗

slide-20
SLIDE 20

20

Inside the Tensor Product

Y = 𝐽𝑙 ⊗ 𝐵 x 𝑍

1

⋮ 𝑍

𝑜

= 𝐵 ⋯ ⋮ ⋱ ⋮ ⋯ 𝐵 𝑌1 ⋮ 𝑌𝑜 𝑍

1

⋮ 𝑍

𝑜

= 𝐵 ⋯ ⋮ ⋱ ⋮ ⋯ + ⋯ + ⋯ ⋮ ⋱ ⋮ ⋯ 𝐵 𝑌1 ⋮ 𝑌𝑜 𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n Ik ⊗ 𝐄𝐆𝐔𝐧 Lk n

Partition of vectors: 𝑍

𝑗 = 𝐵 ⋅ 𝑌𝑗

Iterative Matrix Sum

slide-21
SLIDE 21

21

Inside the Tensor Product

𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n Ik ⊗ 𝐄𝐆𝐔𝐧 Lk n

𝐽𝑙 ⊗ 𝐵 = 𝑇0𝐵𝐻0 + ⋯ + 𝑇𝑙−1𝐵 𝐻𝑙−1 = 𝑇

𝑘𝐵 𝐻 𝑘 𝑙−1 𝑘=0 : k-1

Y

: k-1

X

A

loop kernel j=0 j=k-1 j=k-1 j=0

slide-22
SLIDE 22

22

Σ-SPL – Rewriting Rules (Extract)

SPL to Σ-SPL Loop merging

slide-23
SLIDE 23

23

From SPL to Σ-SPL

𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n Ik ⊗ 𝐄𝐆𝐔𝐧 Lk n

𝐽𝑙 ⊗ 𝐸𝐺𝑈

𝑛 𝑀𝑙 𝑜 =

𝑇 ℎ𝑘𝑛,1 𝐸𝐺𝑈

𝑛𝐻(ℎ𝑘𝑛,1) 𝑙−1 𝑘=0

perm lk

m

= 𝑇 ℎ𝑘𝑛,1 𝐸𝐺𝑈

𝑛𝐻(ℎ𝑘,𝑙) 𝑙−1 𝑘=0

𝐸𝐺𝑈𝑙 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 =

𝑇 ℎ𝑘,𝑛 𝐸𝐺𝑈𝑙𝑒𝑗𝑏𝑕(𝑔 ∘ ℎ𝑗,𝑙)𝐻(ℎ𝑘,𝑛)

𝑛−1 𝑘=0

slide-24
SLIDE 24

24

Recursion Step Closure

New R-Steps ? Parameterize Implement Transforms & Breakdown Rules No Yes

Implement expand & rewrite Parametrize find params according to constraints (Matrix dimensions)

slide-25
SLIDE 25

25

Recursion Steps - Example

𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n Ik ⊗ 𝐄𝐆𝐔𝐧 Lk n

𝐸𝐺𝑈𝑙 ⊗ 𝐽𝑛 𝑈

𝑛 𝑜 =

𝑇 ℎ𝑘,𝑛 𝐸𝐺𝑈𝑙𝑒𝑗𝑏𝑕(𝑔 ∘ ℎ𝑗,𝑙)𝐻(ℎ𝑘,𝑛)

𝑛−1 𝑘=0

𝐽𝑙 ⊗ 𝐸𝐺𝑈

𝑛 𝑀𝑙 𝑜 =

𝑇 ℎ𝑘𝑛,1 𝐸𝐺𝑈

𝑛𝐻(ℎ𝑘,𝑙) 𝑙−1 𝑘=0

Function Tag to mark recursions dft_str dft_scaled Expand tag to scatter / gather dft

slide-26
SLIDE 26

26

Recursion Steps – Graph

dft dft_str dft_scale dft_scale_base Stride = 1

slide-27
SLIDE 27

27

Library Generation for Transforms

http://www.anilgears.com/images1/gear6.jpg

Library Breakdown Rules

Generator

Base Cases Recursion Steps

DFT2 to DFT16 dft(), dft_str() dft_scaled()

𝐄𝐆𝐔𝐨 = 𝐄𝐆𝐔𝐥 ⊗ Im Tm

n

Ik ⊗ 𝐄𝐆𝐔𝐧 Lk

n

Code Generation Parameter Handling

slide-28
SLIDE 28

Code(𝐻 𝑔𝑜→𝑂 , 𝑧, 𝑦) for(j=0..n-1) y[j] = x[f(j)]; Code(𝑒𝑗𝑏𝑕 𝑔𝑜→ℂ , 𝑧, 𝑦) for(j=0..n-1) y[j] = f(j)*x[j]; Code( 𝑁

𝑘 𝑙−1 𝑘=0

, 𝑧, 𝑦) for(j=0..k-1) Code(𝑁

𝑘, 𝑧, 𝑦)

28

Code Translation (Extract)

http://www.carlosag.net/images/codeTranslatorLogo.gif

Σ-SPL Code

Rewriting Rules

slide-29
SLIDE 29

29

Parameter Handling

Function Parameters (Hot) Precomputed Parameters (Cold) Large Parameter Set

𝐸𝐺𝑈

𝑣1 = 𝑇 ℎ𝑣3,1 𝑣1→𝑣2

𝐸𝐺𝑈

𝑔1 ⊗ 𝐽𝑣1/𝑔1 𝑒𝑗𝑏𝑕 𝑞𝑠𝑓 𝑒𝑣1/𝑔1 𝑣1→ℂ

⋅ 𝐽

𝑔1 ⊗ 𝐸𝐺𝑈 𝑣1/𝑔1 𝑀𝑔1 𝑣1𝐻(ℎ𝑣7,𝑣8 𝑣1→𝑣6)

u1,u2, u3, u4, u5, u6, u7, u8, f1 “Runtime-parameter” “Initial-Parameter”

slide-30
SLIDE 30

Library Generation

Generated Functions

Higher-order descriptor Identified recursion steps Cold parameters DFT size n Hot parameters Vectors x,y Base cases DFT2

Actual Code

Classes DFT, DFT_STR, DFT_SCALED Constructor arguments DFT(int n,…) Method arguments DFT->compute(x, y) Constants [1 1; 1 -1]

30

slide-31
SLIDE 31

Library Generation: Platform

Parallel & vector code

31

New rewriting rules with # Processors, Cache-line size, vector length Additional recursion steps: DFT scalar: 4 DFT vectorized+parallel: 8

slide-32
SLIDE 32

RESULTS

32

http://britishmastersaf.files.wordpress.com/2011/05/results.jpg

slide-33
SLIDE 33

CPU 2x dual core 3GHz Intel Xeon 5160 processors Compiler: Intel C/C++ Compiler 10.1 and SUN JDK 1.6.0 Threading: OpenMP pragmas Library Generation Time: between 10 and 60 minutes

Test Plattform

33

slide-34
SLIDE 34

34

Source: Paper

Scalar

  • approx. same algorithms

and optimization Just the base cases up to 64

slide-35
SLIDE 35

35

Source: Paper

4 cores used 2 cores used 1 core used

slide-36
SLIDE 36

36

Source: Paper

Extra passes due to conversion into RDFT

slide-37
SLIDE 37

37

Source: Paper

slide-38
SLIDE 38

+ Complete automation + Description at high level of abstraction + performance competitiveness with best human written code + flexibility: different custom libraries

38

Conclusion

slide-39
SLIDE 39

39

slide-40
SLIDE 40

40

Hot/Cold Partitioning