The FFT Via Matrix Factorizations A Key to Designing High Performance Implementations Charles Van Loan Department of Computer Science Cornell University
A High Level Perspective...
Blocking For Performance A 11 A 12 · · · A 1 q } n 1 A 21 A 22 · · · A 2 q } n 2 A = . . . ... . . . . . . A p 1 A p 2 · · · A pq } n q ���� ���� ���� n 1 n 2 n q A well known strategy for high-performance Ax = b and Ax = λx solvers.
Factoring for Performance One way to execute a matrix-vector product y = F n x when F n = A t · · · A 2 A 1 is as follows: y = x for k = 1: t y = A k x end ˜ t · · · ˜ A different factorization F n = A ˜ A 1 would yield a different algorithm.
The Discrete Fourier Transform ( n = 8 ) ω 0 ω 0 ω 0 ω 0 ω 0 ω 0 ω 0 ω 0 8 8 8 8 8 8 8 8 ω 0 ω 1 ω 2 ω 3 ω 4 ω 5 ω 6 ω 7 8 8 8 8 8 8 8 8 ω 0 ω 2 ω 4 ω 6 ω 8 ω 10 ω 12 ω 14 8 8 8 8 8 8 8 8 ω 0 ω 3 ω 6 ω 9 ω 12 ω 15 ω 18 ω 21 8 8 8 8 8 8 8 8 y = F 8 x = x ω 0 ω 4 ω 8 ω 12 ω 16 ω 20 ω 24 ω 28 8 8 8 8 8 8 8 8 ω 0 ω 5 ω 10 ω 15 ω 20 ω 25 ω 30 ω 35 8 8 8 8 8 8 8 8 ω 0 ω 6 ω 12 ω 18 ω 24 ω 30 ω 36 ω 42 8 8 8 8 8 8 8 8 ω 0 ω 7 ω 14 ω 21 ω 28 ω 35 ω 42 ω 49 8 8 8 8 8 8 8 8 ω 8 = cos(2 π/ 8) − i · sin(2 π/ 8)
The DFT Matrix In General... If ω n = cos(2 π/n ) − i · sin(2 π/n ) then [ F n ] pq = ω pq n = (cos(2 π/n ) − i · sin(2 π/n )) pq = cos(2 pqπ/n ) − i · sin(2 pqπ/n ) Fact: F H n F n = nI n Thus, F n / √ n is unitary.
Data Sparse Matrices An n -by- n matrix A is data sparse if it can be represented with many fewer than n 2 numbers. Example 1. A has lots of zeros. (“Traditional Sparse”) Example 2. A is Toeplitz... a b c d e a b c A = f e a b g f e a
More Examples of Data Sparse Matrices A is a Kronecker Product B ⊗ C , e.g., � � b 11 C b 12 C A = b 21 C b 22 C If B ∈ IR m 1 × m 1 and C ∈ IR m 2 × m 2 then A = B ⊗ C has m 2 1 m 2 2 entries but is parameterized by just m 2 1 + m 2 2 numbers.
Extreme Data Sparsity n n n n � � � � S ( i, j, k, ℓ ) · (2-by-2) ⊗ · · · ⊗ (2-by-2) A = � �� � i =1 j =1 k =1 ℓ =1 d times A is 2 d -by-2 d but is parameterized by O ( dn 4 ) numbers.
Factorization of F n The DFT matrix can be factored into a short product of sparse matrices, e.g., F 1024 = A 10 · · · A 2 A 1 P 1024 where each A -matrix has 2 nonzeros per row and P 1024 is a per- mutation.
From Factorization to Algorithm If n = 2 10 and F n = A 10 · · · A 2 A 1 P n then y = P n x for k = 1:10 y = A k x ← 2 n flops. end computes y = F n x and requires O ( n log n ) flops.
Recursive Block Structure F 8 (: , [ 0 2 4 6 1 3 5 7 ]) = 1 0 0 0 1 0 0 0 0 1 0 0 0 ω 8 0 0 ω 2 0 0 1 0 0 0 0 8 � F 4 0 � ω 3 0 0 0 1 0 0 0 8 1 0 0 0 − 1 0 0 0 0 F 4 0 1 0 0 0 − ω 8 0 0 − ω 2 0 0 1 0 0 0 0 8 − ω 3 0 0 0 1 0 0 0 8 F n/ 2 “shows up” when you permute the columns of F n so that the odd-indexed columns come first.
Recursion... We build an 8-point DFT from two 4-point DFTs... 1 0 0 0 1 0 0 0 0 1 0 0 0 ω 8 0 0 ω 2 0 0 1 0 0 0 0 8 � � F 4 x (0:2:7) ω 3 0 0 0 1 0 0 0 8 F 8 x = 1 0 0 0 − 1 0 0 0 F 4 x (1:2:7) 0 1 0 0 0 − ω 8 0 0 − ω 2 0 0 1 0 0 0 0 8 − ω 3 0 0 0 1 0 0 0 8
Radix-2 FFT: Recursive Implementation function y = fft ( x, n ) if n = 1 y = x else m = n/ 2; ω = exp( − 2 πi/n ) Ω = diag(1 , ω, . . . , ω m − 1 ) z T = fft ( x (0:2: n − 1) , m ) z B = Ω · fft ( x (1:2: n − 1) , m ) � � � � I m I m z T y = Overall: 5 n log n flops. I m − I m z B end
The Divide-and-Conquer Picture (0:1:15) ✟ ❍❍❍❍❍❍❍❍ ✟ ✟ ✟ ✟ ✟ ✟ ✟ (0:2:15) (1:2:15) ✑ ◗◗◗◗ ✑ ◗◗◗◗ ✑ ✑ ✑ ✑ ✑ ✑ (0:4:15) (2:4:15) (1:4:15) (3:4:15) � ❅ � ❅ � ❅ � ❅ � ❅ � ❅ � ❅ � ❅ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Towards a Nonrecursive Implementation The Radix-2 Factorization... If n = 2 m and Ω m = diag(1 , ω n , . . . , ω m − 1 ) , n then � � � � F m Ω m F m I m Ω m ( I 2 ⊗ F m ) . F n Π n = = F m − Ω m F m I m − Ω m where Π n = I n (: , [0:2: n 1:2: n ]). � � F m 0 I 2 ⊗ F m = Note: . 0 F m
The Cooley-Tukey Factorization n = 2 t F n = A t · · · A 1 P n P n = the n -by- n “bit reversal ” permutation matrix � � I L/ 2 Ω L/ 2 L = 2 q , r = n/L A q = I r ⊗ I L/ 2 − Ω L/ 2 Ω L/ 2 = diag(1 , ω L , . . . , ω L/ 2 − 1 ) ω L = exp( − 2 πi/L ) L
The Bit Reversal Permutation (0:1:15) ✟ ❍❍❍❍❍❍❍❍ ✟ ✟ ✟ ✟ ✟ ✟ ✟ (0:2:15) (1:2:15) ✑ ◗◗◗◗ ✑ ◗◗◗◗ ✑ ✑ ✑ ✑ ✑ ✑ (0:4:15) (2:4:15) (1:4:15) (3:4:15) � ❅ � ❅ � ❅ � ❅ � ❅ � ❅ � ❅ � ❅ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ ✁ ❆ [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Bit Reversal x (0) x (0000) x (0000) x (0) x (1) x (0001) x (1000) x (8) x (2) x (0010) x (0100) x (4) x (3) x (0011) x (1100) x (12) x (4) x (0100) x (0010) x (2) x (5) x (0101) x (1010) x (10) x (6) x (0110) x (0110) x (6) x (7) x (0111) x (1110) x (14) = → = x (8) x (1000) x (0001) x (1) x (9) x (1001) x (1001) x (9) x (10) x (1010) x (0101) x (5) x (11) x (1011) x (1101) x (13) x (12) x (1100) x (0011) x (3) x (13) x (1101) x (1011) x (11) x (14) x (1110) x (0111) x (7) x (15) x (1111) x (1111) x (15)
Butterfly Operations This matrix is block diagonal... � � I L/ 2 Ω L/ 2 L = 2 q , r = n/L A q = I r ⊗ I L/ 2 − Ω L/ 2 r copies of things like this 1 × 1 × 1 × 1 × 1 × 1 × 1 × 1 ×
At the Scalar Level... ❍❍❍ ✟ a a + ωb s s ✟ ✟ ω ✟✟✟ ❍ ❍ ❍ a − ωb b s s
Recommend
More recommend