The FFT Via Matrix Factorizations A Key to Designing High - - PowerPoint PPT Presentation

the fft via matrix factorizations
SMART_READER_LITE
LIVE PREVIEW

The FFT Via Matrix Factorizations A Key to Designing High - - PowerPoint PPT Presentation

The FFT Via Matrix Factorizations A Key to Designing High Performance Implementations Charles Van Loan Department of Computer Science Cornell University A High Level Perspective... Blocking For Performance A 11 A 12 A 1 q


slide-1
SLIDE 1

The FFT Via Matrix Factorizations

A Key to Designing High Performance Implementations Charles Van Loan Department of Computer Science Cornell University

slide-2
SLIDE 2

A High Level Perspective...

slide-3
SLIDE 3

Blocking For Performance

A =      A11 A12 · · · A1q A21 A22 · · · A2q . . . . . . ... . . . Ap1 Ap2 · · · Apq      } n1 } n2 } nq

  • n1

n2 nq A well known strategy for high-performance Ax = b and Ax = λx solvers.

slide-4
SLIDE 4

Factoring for Performance

One way to execute a matrix-vector product y = Fnx when Fn = At · · · A2A1 is as follows: y = x for k = 1:t y = Akx end A different factorization Fn = ˜ A˜

t · · · ˜

A1 would yield a different algorithm.

slide-5
SLIDE 5

The Discrete Fourier Transform (n = 8)

y = F8x =                    ω0

8

ω0

8

ω0

8

ω0

8

ω0

8

ω0

8

ω0

8

ω0

8

ω0

8

ω1

8

ω2

8

ω3

8

ω4

8

ω5

8

ω6

8

ω7

8

ω0

8

ω2

8

ω4

8

ω6

8

ω8

8

ω10

8

ω12

8

ω14

8

ω0

8

ω3

8

ω6

8

ω9

8

ω12

8

ω15

8

ω18

8

ω21

8

ω0

8

ω4

8

ω8

8

ω12

8

ω16

8

ω20

8

ω24

8

ω28

8

ω0

8

ω5

8

ω10

8

ω15

8

ω20

8

ω25

8

ω30

8

ω35

8

ω0

8

ω6

8

ω12

8

ω18

8

ω24

8

ω30

8

ω36

8

ω42

8

ω0

8

ω7

8

ω14

8

ω21

8

ω28

8

ω35

8

ω42

8

ω49

8

                   x ω8 = cos(2π/8) − i · sin(2π/8)

slide-6
SLIDE 6

The DFT Matrix In General...

If ωn = cos(2π/n) − i · sin(2π/n) then [Fn]pq = ωpq

n

= (cos(2π/n) − i · sin(2π/n))pq = cos(2pqπ/n) − i · sin(2pqπ/n) Fact: F H

n Fn = nIn

Thus, Fn/√n is unitary.

slide-7
SLIDE 7

Data Sparse Matrices

An n-by-n matrix A is data sparse if it can be represented with many fewer than n2 numbers. Example 1. A has lots of zeros. (“Traditional Sparse”) Example 2. A is Toeplitz... A =     a b c d e a b c f e a b g f e a    

slide-8
SLIDE 8

More Examples of Data Sparse Matrices

A is a Kronecker Product B ⊗ C, e.g., A =

  • b11C b12C

b21C b22C

  • If B ∈ IRm1×m1 and C ∈ IRm2×m2 then A = B ⊗ C has m2

1m2 2

entries but is parameterized by just m2

1 + m2 2 numbers.

slide-9
SLIDE 9

Extreme Data Sparsity

A =

n

  • i=1

n

  • j=1

n

  • k=1

n

  • ℓ=1

S(i, j, k, ℓ) · (2-by-2) ⊗ · · · ⊗ (2-by-2)

  • d times

A is 2d -by-2d but is parameterized by O(dn4) numbers.

slide-10
SLIDE 10

Factorization of Fn

The DFT matrix can be factored into a short product of sparse matrices, e.g., F1024 = A10 · · · A2A1P1024 where each A-matrix has 2 nonzeros per row and P1024 is a per- mutation.

slide-11
SLIDE 11

From Factorization to Algorithm

If n = 210 and Fn = A10 · · · A2A1Pn then y = Pnx for k = 1:10 y = Akx ← 2n flops. end computes y = Fnx and requires O(n log n) flops.

slide-12
SLIDE 12

Recursive Block Structure

F8(:, [ 0 2 4 6 1 3 5 7 ]) =              1 0 0 1 1 ω8 1 ω2

8

1 ω3

8

1 0 0 −1 1 −ω8 1 −ω2

8

1 −ω3

8

             F4 0 0 F4

  • Fn/2 “shows up” when you permute the columns of Fn so that

the odd-indexed columns come first.

slide-13
SLIDE 13

Recursion...

We build an 8-point DFT from two 4-point DFTs... F8 x =              1 0 0 1 1 ω8 1 ω2

8

1 ω3

8

1 0 0 −1 1 −ω8 1 −ω2

8

1 −ω3

8

            

  • F4x(0:2:7)

F4x(1:2:7)

slide-14
SLIDE 14

Radix-2 FFT: Recursive Implementation

function y =fft(x, n) if n = 1 y = x else m = n/2; ω = exp(−2πi/n) Ω = diag(1, ω, . . . , ωm−1) zT = fft(x(0:2:n − 1), m) zB = Ω· fft(x(1:2:n − 1), m) y =

  • Im

Im Im −Im zT zB

  • Overall: 5n log n flops.

end

slide-15
SLIDE 15

The Divide-and-Conquer Picture

(0:8:15) [0] [8]

❆ ❆ ✁ ✁

(4:8:15) [4] [12]

❆ ❆ ✁ ✁

(2:8:15) [2] [10]

❆ ❆ ✁ ✁

(6:8:15) [6] [14]

❆ ❆ ✁ ✁

(1:8:15) [1] [9]

❆ ❆ ✁ ✁

(5:8:15) [5] [13]

❆ ❆ ✁ ✁

(3:8:15) [3] [11]

❆ ❆ ✁ ✁

(7:8:15) [7] [15]

❆ ❆ ✁ ✁

(0:4:15)

❅ ❅

  • (2:4:15)

❅ ❅

  • (1:4:15)

❅ ❅

  • (3:4:15)

❅ ❅

  • (0:2:15)

◗◗◗◗ ✑ ✑ ✑ ✑

(1:2:15)

◗◗◗◗ ✑ ✑ ✑ ✑

(0:1:15)

❍❍❍❍❍❍❍❍ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟

slide-16
SLIDE 16

Towards a Nonrecursive Implementation

The Radix-2 Factorization... If n = 2m and Ωm = diag(1, ωn, . . . , ωm−1

n

), then FnΠn =

  • Fm

ΩmFm Fm −ΩmFm

  • =
  • Im

Ωm Im −Ωm

  • (I2 ⊗ Fm).

where Πn = In(:, [0:2:n 1:2:n]). Note: I2 ⊗ Fm =

  • Fm

Fm

  • .
slide-17
SLIDE 17

The Cooley-Tukey Factorization

n = 2t Fn = At · · · A1Pn Pn = the n-by-n “bit reversal ” permutation matrix Aq = Ir ⊗

  • IL/2

ΩL/2 IL/2 −ΩL/2

  • L = 2q, r = n/L

ΩL/2 = diag(1, ωL, . . . , ωL/2−1

L

) ωL = exp(−2πi/L)

slide-18
SLIDE 18

The Bit Reversal Permutation

(0:8:15) [0] [8]

❆ ❆ ✁ ✁

(4:8:15) [4] [12]

❆ ❆ ✁ ✁

(2:8:15) [2] [10]

❆ ❆ ✁ ✁

(6:8:15) [6] [14]

❆ ❆ ✁ ✁

(1:8:15) [1] [9]

❆ ❆ ✁ ✁

(5:8:15) [5] [13]

❆ ❆ ✁ ✁

(3:8:15) [3] [11]

❆ ❆ ✁ ✁

(7:8:15) [7] [15]

❆ ❆ ✁ ✁

(0:4:15)

❅ ❅

  • (2:4:15)

❅ ❅

  • (1:4:15)

❅ ❅

  • (3:4:15)

❅ ❅

  • (0:2:15)

◗◗◗◗ ✑ ✑ ✑ ✑

(1:2:15)

◗◗◗◗ ✑ ✑ ✑ ✑

(0:1:15)

❍❍❍❍❍❍❍❍ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟

slide-19
SLIDE 19

Bit Reversal

                            x(0) x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10) x(11) x(12) x(13) x(14) x(15)                             =                             x(0000) x(0001) x(0010) x(0011) x(0100) x(0101) x(0110) x(0111) x(1000) x(1001) x(1010) x(1011) x(1100) x(1101) x(1110) x(1111)                             →                             x(0000) x(1000) x(0100) x(1100) x(0010) x(1010) x(0110) x(1110) x(0001) x(1001) x(0101) x(1101) x(0011) x(1011) x(0111) x(1111)                             =                             x(0) x(8) x(4) x(12) x(2) x(10) x(6) x(14) x(1) x(9) x(5) x(13) x(3) x(11) x(7) x(15)                            

slide-20
SLIDE 20

Butterfly Operations

This matrix is block diagonal... Aq = Ir ⊗

  • IL/2

ΩL/2 IL/2 −ΩL/2

  • L = 2q, r = n/L

r copies of things like this             1 × 1 × 1 × 1 × 1 × 1 × 1 × 1 ×            

slide-21
SLIDE 21

At the Scalar Level...

s ✟✟✟ s ❍❍❍

ω

s ❍ ❍ ❍ s ✟ ✟ ✟ b a a − ωb a + ωb

slide-22
SLIDE 22

Signal Flow Graph (n = 8)

x0 x4 x2 x6 x1 x5 x3 x7 s

s s s s s s s ✟✟✟ ✟✟✟ ✟✟✟ ✟✟✟ ❍❍❍ ❍❍❍ ❍❍❍ ❍❍❍ ❍ ❍ ❍ ❍ ❍ ❍ ❍ ❍ ❍ ❍ ❍ ❍ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟ ✟

ω0

8

ω0

8

ω0

8

ω0

8

s s s s s s s s

❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅

  • ω2

8

ω0

8

ω2

8

ω0

8

s s s s s s s s ✁ ✁ ✁ ✁ ✁ ✁ ✁✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁✁ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁

ω3

8

ω2

8

ω1

8

ω0

8

s s s s s s s s y0

y1 y2 y3 y4 y5 y6 y7

slide-23
SLIDE 23

The Transposed Stockham Factorization

If n = 2t, then Fn = St · · · S2S1, where for q = 1:t the factor Sq = AqΓq−1 is defined by Aq = Ir ⊗ BL, L = 2q, r = n/L, Γq−1 = Πr

∗ ⊗ IL ∗,

L

∗ = L/2, r ∗ = 2r,

BL =

  • IL

ΩL

IL

∗ −ΩL ∗

  • ,

ΩL

= diag(1, ωL, . . . , ωL

∗−1 L

).

slide-24
SLIDE 24

Perfect Shuffle

(Π4 ⊗ I2)             x0 x1 x2 x3 x4 x5 x6 x7             =             x0 x1 x4 x5 x2 x3 x6 x7            

slide-25
SLIDE 25

Cooley-Tukey Array Interpretation

Step q:

  • r=n/L

k

          

L=2q

− →

2k 2k+1

  • r

∗=n/L ∗

L

∗=2q−1

8 > < > :

slide-26
SLIDE 26

Reshaping

x =               × × × × × × × × ×               → x2×4 = × × × × × × × ×

slide-27
SLIDE 27

Transposed Stockham Array Interp

k k+r x(q−1)

L ∗×r ∗ = FL ∗xT

r

∗×L ∗ =

  • r

∗=n/L ∗

9 > = > ; L

∗=2q−1 .

x(q) = Sqx(q−1)

k x(q)

L×r = FLxT

r×L =

  • r=n/L

9 > > > > > > > > = > > > > > > > > ; L=2q

slide-28
SLIDE 28

2 × 2 × 2 Basic Radix-2 Versions

Store intermediate DFTs by row or column Intermediate DFTs adjacent or not. How the two butterfly loops are ordered. x =

  • Ir ⊗
  • IL/2

ΩL/2 IL/2 −ΩL/2

  • x

L = 2q, r = n/L

slide-29
SLIDE 29

The Gentleman-Sande Idea

It can be shown that F T

n = Fn and so if

Fn = At · · · A1P T

n

then Fn = F T

n

= PnAT

1 · · · AT t

and we can compute y = Fnx as follows... y = x for k = t: − 1:1 y = AT

k x

end y = Pny

slide-30
SLIDE 30

Convolution and Other Aps

From “problem space” to “DFT space” via for k = t: − 1:1 x = AT

k x

end x = Pnx Do your thing in DFT space. Then inverse transform back to Problem space via x = P T

n x

for k = 1:t x = Akx end x = x/n Can avoid the Pn ops by working in “scrambled” DFT space.

slide-31
SLIDE 31

Radix-4

Can combine four quarter-length DFTs to produce a single full- length DFT: v =     I I I I I−iI −I iI I −I I −I I iI−I−iI         a b c d     =     (a + c)+ (b + d) (a − c)−i(b − d) (a + c)− (b + d) (a − c)+i(b − d)     , The radix-4 butterfly. Better re-use of data. Fewer flops. Radix-4 FFT is 4.25n log n (instead of 5n log n).

slide-32
SLIDE 32

Mixed Radix

96

✏ ✏ ✏ ✏ ✏ ✏ ✏ ✏ ✏ ★ ★ ★ ★ ❝❝❝ ❝ PPPPPPPPP

24

❅ ❅

  • 8

8 8 24

❅ ❅

  • 8

8 8 24

❅ ❅

  • 8

8 8 24

❅ ❅

  • 8

8 8

slide-33
SLIDE 33

Multiple DFTs

Given: n1-by-n2 matrix X. Multicolumn DFT Problem... X ← Fn1X Multirow DFT Problem... X ← XFn2

slide-34
SLIDE 34

Blocked Multiple DFTs

X ← Fn1X becomes

  • X1 | X2 | · · · | Xp
  • Fn1X1 | Fn1X2 | · · · | Fn1Xp
slide-35
SLIDE 35

The 4-Step Framework

A matrix reshaping of the x ← Fnx operation when n = n1n2: xn1×n2 ← xn1×n2Fn2 Multiple row DFT xn1×n2 ← Fn(0:n1 − 1, 0:n2 − 1).∗ xn1×n2 Pointwise multiply xn2×n1 ← xT

n1×n2

Transpose xn2×n1 ← xn2×n1Fn1 Multiple row DFT . Can be arranged so communication is concentrated in the trans- pose step.

slide-36
SLIDE 36

Distributed Transpose: Example

Initial: X =     X00 X01 X02 X03 X10 X11 X12 X13 X20 X21 X22 X23 X30 X31 X32 X33     . Transpose each block: X ←        XT

00 XT 01 XT 02 XT 03

XT

10 XT 11 XT 12 XT 13

XT

20 XT 21 XT 22 XT 23

XT

30 XT 31 XT 32 XT 33

       .

slide-37
SLIDE 37

Now regard as 2-by-2 and block transpose each block: X ←         XT

00 XT 10 XT 02 XT 12

XT

01 XT 11 XT 03 XT 13

XT

20 XT 30 XT 22 XT 32

XT

21 XT 31 XT 23 XT 33

        . Now do a 2-by-2 block transpose: X ←         XT

00 XT 10 XT 20 XT 30

XT

01 XT 11 XT 21 XT 31

XT

02 XT 12 XT 22 XT 32

XT

03 XT 13 XT 23 XT 33

        .

slide-38
SLIDE 38

Factorization and Transpose

xn×m ← xT

m×n

corresponds to x ← P(m, n)x where P(m, n) is a perfect shuffle permutation, e.g., P(3, 4) = I12(:, [0 3 6 9 1 4 7 10 2 5 8 11]) Different multi-pass transposition algorithms correspond to differ- ent factorizations of P(m, n).

slide-39
SLIDE 39

Two-Dimensional FFTs

If X is an n1-by-n2 matrix then is 2D DFT is X ← Fn1XFn2 Option 1. X ← Fn1X X ← XFn2 Option 2. Assume n1 = n2 and Fn1 = At · · · A1. for q = 1:t X ← AqXAT

q

end Interminlgling the column and row butterfly computations can result in better locality.

slide-40
SLIDE 40

3-Dimensional DFTs

Given X(1:n1, 1:n2, 1:n3), apply DFT in each of the three dimen- sions. If x = reshape(X(1:n1, 1:n2, 1:n3), n1n2n3, 1) then the problem is to compute x ← (Fn3 ⊗ Fn2 ⊗ Fn1)x i.e., x ← (In3 ⊗ In2 ⊗ Fn1)x x ← (In3 ⊗ Fn2 ⊗ In1)x x ← (Fn3 ⊗ In2 ⊗ In1)x

slide-41
SLIDE 41

d-Dimensional DFTs

Sample for d = 5: µ = 1 X(α1, α2, α3, α4, α5) X(α2, α3, α4, α5, α1) Fn1 ΠT

n1,n

µ = 2 X(α2, α3, α4, α5, α1) X(α3, α4, α5, α1, α2) Fn2 ΠT

n2,n

µ = 3 X(α3, α4, α5, α1, α2) X(α4, α5, α1, α2, α3) Fn3 ΠT

n3,n

µ = 4 X(α4, α5, α1, α2, α3) X(α5, α1, α2, α3, α4) Fn4 ΠT

n4,n

µ = 5 X(α5, α1, α2, α3, α4) X(α1, α2, α3, α4, α5) Fn5 ΠT

n5,n

Intemingling of component DFTs and tensor transpositions.

slide-42
SLIDE 42

References

FFTW: http:www.fftw.org

  • C. Van Loan (1992). Computational Frameworks for the Fast

Fourier Transform, SIAM Publications, Philadelphia, PA.