Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.1 - - PowerPoint PPT Presentation

parallel numerical algorithms
SMART_READER_LITE
LIVE PREVIEW

Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.1 - - PowerPoint PPT Presentation

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.1 Fast Fourier Transform Michael T. Heath and Edgar Solomonik Department of Computer Science


slide-1
SLIDE 1

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT

Parallel Numerical Algorithms

Chapter 6 – Matrix Models Section 6.1 – Fast Fourier Transform Michael T. Heath and Edgar Solomonik

Department of Computer Science University of Illinois at Urbana-Champaign

CS 554 / CSE 512

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 34

slide-2
SLIDE 2

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT

Outline

1

Discrete Fourier Transform Roots of Unity DFT Inverse DFT

2

Convolution Problem

3

Fast Fourier Transform Computing DFT FFT Algorithm

4

Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 2 / 34

slide-3
SLIDE 3

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Roots of Unity DFT Inverse DFT

Roots of Unity

For given integer n, we use notation ωn = cos(2π/n) − i sin(2π/n) = e−2πi/n for primitive nth root of unity, where i = √−1 nth roots of unity, sometimes called twiddle factors in this context, are then given by ωk

n or by ω−k n , k = 0, . . . , n − 1

For convenience, we will assume that n is power of two, and all logarithms used will be base two We will also index sequences (components of vectors) starting from 0 rather than 1

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 3 / 34

slide-4
SLIDE 4

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Roots of Unity DFT Inverse DFT

Discrete Fourier Transform

Discrete Fourier Transform, or DFT, of sequence x = [x0, . . . , xn−1]T is sequence y = [y0, . . . , yn−1]T given by ym =

n−1

  • k=0

xk ωmk

n ,

m = 0, 1, . . . , n − 1

  • r

y = Fn x where entries of DFT matrix Fn are given by {Fn}mk = ωmk

n

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 4 / 34

slide-5
SLIDE 5

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Roots of Unity DFT Inverse DFT

Inverse DFT

It is easily seen that F −1

n

= (1/n)F H

n

So inverse DFT is given by xk = 1 n

n−1

  • m=0

ym ω−mk

n

k = 0, 1, . . . , n − 1

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 5 / 34

slide-6
SLIDE 6

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Roots of Unity DFT Inverse DFT

Example

F4 =     1 1 1 1 1 ω1 ω2 ω3 1 ω2 ω4 ω6 1 ω3 ω6 ω9     =     1 1 1 1 1 −i −1 i 1 −1 1 −1 1 i −1 −i     4 F −1

4

=     1 1 1 1 1 ω−1 ω−2 ω−3 1 ω−2 ω−4 ω−6 1 ω−3 ω−6 ω−9     =     1 1 1 1 1 i −1 −i 1 −1 1 −1 1 −i −1 i    

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 6 / 34

slide-7
SLIDE 7

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Problem

Convolution

Convolution takes input a and b and computes c ∀k ∈ [0, n − 1] ck =

k

  • j=0

ajbk−j If a and b are coefficients of degree n/2 − 1 polynomials pa(x) =

n/2−1

  • k=0

akxk, pb(x) =

n/2−1

  • k=0

bkxk the convolution computes the coefficients c of the product pc(x) = pa(x)pb(x) =

n−1

  • k=0

ckxk naive evaluation costs O(n2) operations

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 7 / 34

slide-8
SLIDE 8

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Problem

Convolution and Toeplitz Matrices

Convolution can be interpreted as matrix-vector multiplication with a triangular Toeplitz matrix [c0 c1 c2 c3] = [a1 a2 a3 a4]     b0 b1 b2 b3 b0 b1 b2 b0 b1 b0     Toeplitz and Hankel matrices (in the latter, each antidiagonal is defined by a single element) provide a general matrix representation for convolutional operators

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 8 / 34

slide-9
SLIDE 9

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Problem

Convolution via Interpolation by DFT

The DFT, Fna evaluates polynomial pa at each ωj The values of pc at each ωj are then easily obtained pc(ωj) = pa(ωj)pb(ωj) The inverse DFT, F −1

n pc(x) interpolates the values of the

polynomial pc at each ωj producing its coefficients c The overall procedure is described by c = F −1

n [(Fna) ⊙ (Fnb)]

where ⊙ is an elementwise product (a and b are padded with trailing zeros)

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 9 / 34

slide-10
SLIDE 10

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Problem

Convolution via DFT

Lets write out the full expression ck = 1 n

  • s

ω−ks

n j

ωsj

n aj t

ωst

n bt

  • Rearrange the order of the summations to see what

happens to every product of a and b ck = 1 n

  • s
  • j
  • t

ω(j+t−k)s

n

ajbt For any u = j + t − k = 0, we observe

s(ωu n)s = 0

When j + t − k = 0 the products ω(s+t−j)k

n

= 1, so there are n nonzero terms ajbk−j in the summation

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 10 / 34

slide-11
SLIDE 11

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing DFT

To illustrate, consider computing DFT for n = 4, ym =

3

  • k=0

xk ωmk

n ,

m = 0, . . . , 3 Writing out equations in full, y0 = x0ω0

n + x1ω0 n + x2ω0 n + x3ω0 n

y1 = x0ω0

n + x1ω1 n + x2ω2 n + x3ω3 n

y2 = x0ω0

n + x1ω2 n + x2ω4 n + x3ω6 n

y3 = x0ω0

n + x1ω3 n + x2ω6 n + x3ω9 n

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 11 / 34

slide-12
SLIDE 12

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing DFT

Noting that ω0

n = ω4 n = 1,

ω2

n = ω6 n = −1,

ω9

n = ω1 n

and regrouping, we obtain y0 = (x0 + ω0

nx2) + ω0 n(x1 + ω0 nx3)

y1 = (x0 − ω0

nx2) + ω1 n(x1 − ω0 nx3)

y2 = (x0 + ω0

nx2) + ω2 n(x1 + ω0 nx3)

y3 = (x0 − ω0

nx2) + ω3 n(x1 − ω0 nx3)

DFT can now be computed with only 8 additions and 6 multiplications, instead of expected (4 − 1) ∗ 4 = 12 additions and 42 = 16 multiplications

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 12 / 34

slide-13
SLIDE 13

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing DFT

Actually, even fewer multiplications are required for this small case, since ω0

n = 1, but we have tried to illustrate how

algorithm works in general Main point is that computing DFT of original 4-point sequence has been reduced to computing DFT of its two 2-point even and odd subsequences This property holds in general: DFT of n-point sequence can be computed by breaking it into two DFTs of half length, provided n is even

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 13 / 34

slide-14
SLIDE 14

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing DFT

General pattern becomes clearer when viewed in terms of first few Fourier matrices F1 = 1, F2 = 1 1 1 −1

  • ,

F4 =     1 1 1 1 1 −i −1 i 1 −1 1 −1 1 i −1 −i     Let P4 be permutation matrix P4 =     1 1 1 1    

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 14 / 34

slide-15
SLIDE 15

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing DFT

Let D2 be diagonal matrix D2 = diag(1, ω4) = 1 −i

  • Then we have

F4P4 =     1 1 1 1 1 −1 −i i 1 1 −1 −1 1 −1 i −i     = F2 D2F2 F2 −D2F2

  • Thus, F4 can be rearranged so that each block is

diagonally scaled version of F2 Such hierarchical splitting can be carried out at each level, provided number of points is even

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 15 / 34

slide-16
SLIDE 16

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing DFT

In general, Pn is permutation that groups even-numbered columns of Fn before odd-numbered columns, and Dn/2 = diag

  • 1, ωn, . . . , ω(n/2)−1

n

  • To apply Fn to sequence of length n, we need merely

apply Fn/2 to its even and odd subsequences and scale results, where necessary, by ±Dn/2 Resulting recursive divide-and-conquer algorithm for computing DFT is called Fast Fourier Transform, or FFT FFT is particular way of computing DFT efficiently

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 16 / 34

slide-17
SLIDE 17

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Radix-2 Fast Fourier Transform (FFT)

Consider b = Fna, we have ∀j ∈ [0, n − 1] bj =

n−1

  • k=0

ωjk

n ak

Express DFT as two DFTs of dimension n/2, with a different root of unity ωn/2 Separate summands into odds and evens, use ωn/2 = ω2

n

bj =

n/2−1

  • k=0

ωj(2k)

n

a2k +

n/2−1

  • k=0

ωj(2k+1)

n

a2k+1 =

n/2−1

  • k=0

ωjk

n/2a2k + ωj n n/2−1

  • k=0

ωjk

n/2a2k+1

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 17 / 34

slide-18
SLIDE 18

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Radix-2 Fast Fourier Transform (FFT), contd.

bj =

n/2−1

  • k=0

ωjk

n/2a2k

  • uj

+ωj

n n/2−1

  • k=0

ωjk

n/2a2k+1

  • vj

The summations for bj and bj+n/2 are closely related, bj+n/2 =

n/2−1

  • k=0

ω(j+n/2)k

n/2

a2k + ωj+n/2

n n/2−1

  • k=0

ω(j+n/2)k

n/2

a2k+1 Now ω(j+n/2)k

n/2

= ωjk

n/2 since (ωn/2 n/2)k = 1k = 1 and using ωn/2 n

= −1, bj+n/2 =

n/2−1

  • k=0

ωjk

n/2a2k

  • uj

−ωj

n n/2−1

  • k=0

ωjk

n/2a2k+1

  • vj

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 18 / 34

slide-19
SLIDE 19

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Radix-2 Fast Fourier Transform (FFT), contd.

Let vectors u and v be two recursive FFTs, ∀j ∈ [0, n/2 − 1] uj =

n/2−1

  • k=0

ωjk

n/2a2k,

vj =

n/2−1

  • k=0

ωjk

n/2a2k+1

Given u and v scale using "twiddle factors" zj = ωj

n · vj

Then it suffices to combine the vectors as follows b = u + z u − z

  • This recombination is an FFT of dimension 2

b =

  • b1

b2

  • = vec

b1 b2

  • = vec

u z 1 1 1 −1

  • F4[0:2,0:2]
  • Radix-r algorithm for any A ∈ Rr×n/r

Fn vec (A) = vec

  • [Fn[0 : r, 0 : n/r] ⊙ (FrA)]Fn/r)T

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 19 / 34

slide-20
SLIDE 20

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

FFT Algorithm

procedure fft(x, y, n, ω) if n = 1 then y[0] = x[0] else for k = 0 to (n/2) − 1 p[k] = x[2k] s[k] = x[2k + 1] end fft(p, q, n/2, ω2) fft(s, t, n/2, ω2) for k = 0 to n − 1 y[k] = q[k mod (n/2)] + ωkt[k mod (n/2)] end end

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 20 / 34

slide-21
SLIDE 21

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Complexity of FFT Algorithm

There are log n levels of recursion, each of which involves Θ(n) arithmetic operations, so total cost is Θ(n log n) By contrast, straightforward evaluation of matrix-vector product defining DFT requires Θ(n2) arithmetic operations, which is enormously greater for long sequences n n log n n2 64 384 4096 128 896 16384 256 2048 65536 512 4608 262144 1024 10240 1048576

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 21 / 34

slide-22
SLIDE 22

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

FFT Algorithm

For clarity, separate arrays were used for subsequences, but transform can be computed in place using no additional storage Input sequence is assumed complex; if input sequence is real, then additional symmetries in DFT can be exploited to reduce storage and operation count by half Output sequence is not produced in natural order, but either input or output sequence can be rearranged at cost

  • f Θ(n log n), analogous to sorting

FFT algorithm can be formulated using iteration rather than recursion, which is often desirable for greater efficiency or when programming language does not support recursion

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 22 / 34

slide-23
SLIDE 23

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Computing DFT FFT Algorithm

Computing Inverse DFT

Because of similar form of DFT and its inverse, FFT algorithm can also be used to compute inverse DFT efficiently Ability to transform back and forth quickly between time and frequency domains makes it practical to perform any computations or analysis that may be required in whichever domain is more convenient and efficient

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 23 / 34

slide-24
SLIDE 24

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Binary Exchange Parallel FFT

To obtain fine-grain decomposition of FFT, we assign input data xk to task k, which also computes result yk

x0 x1 x2 x3 x4 x5 x6 x7 y0 y1 y2 y3 y4 y5 y6 y7

At stage m of algorithm, tasks k and j exchange data, where k and j differ only in their mth bits

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 24 / 34

slide-25
SLIDE 25

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Binary Exchange Parallel FFT

There are n tasks and log n stages, so parallel time required to compute FFT is Tn = (γ + α + β) log n where γ is cost of multiply-add, and α + β is cost of exchanging one number between pair of tasks at each stage Hypercube is natural network for FFT algorithm

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 25 / 34

slide-26
SLIDE 26

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Binary Exchange Parallel FFT

To obtain smaller number of coarse-grain tasks, agglomerate sets of n/p components of input and output vectors x and y, where we assume p is also power of two

x0 x1 x2 x3 x4 x5 x6 x7 y0 y1 y2 y3 y4 y5 y6 y7

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 26 / 34

slide-27
SLIDE 27

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Binary Exchange Parallel FFT

Components having their log p most significant bits in common are assigned to same task Thus, exchanges are required in binary exchange algorithm only for first log p stages, since data are local for remaining log(n/p) stages

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 27 / 34

slide-28
SLIDE 28

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Binary Exchange Parallel FFT

Each stage involves updating of n/p components by each task, and exchange of n/p components for each of first log p stages Thus, total time required using hypercube network is Tp = α (log p) + β n (log p)/p + γ n (log n)/p To determine isoefficiency function, set γ n log n ≈ E (α p log p + β n log p + γ n log n) which holds if n = Θ(p), so isoefficiency function is Θ(p log p), since T1 = Θ(n log n)

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 28 / 34

slide-29
SLIDE 29

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Transpose Parallel FFT

Binary exchange algorithm has one phase that is communication free and another phase that requires communication at each stage Another approach is to realign data so that both computational phases are communication free, and only communication is for data realignment phase between computational phases To accomplish this, data can be organized in √n × √n array, as illustrated next for n = 16

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 29 / 34

slide-30
SLIDE 30

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Transpose Parallel FFT

2 6 10 14 4 8 1 5 9 13 3 7 11 15 2 6 10 14 4 8 12 1 5 9 13 3 7 11 15 12 initial phase final phase data realignment phase transpose

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 30 / 34

slide-31
SLIDE 31

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Transpose Parallel FFT

If array is partitioned by columns, which are assigned to p ≤ √n tasks, then no communication is required for first log(√n ) stages Data are then transposed using all-to-all personalized collective communication, so that each row of data array is now stored in single task Thus, final log(√n ) stages now require no communication Overall performance of transpose algorithm depends on particular implementation of all-to-all personalized collective communication

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 31 / 34

slide-32
SLIDE 32

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT Binary Exchange Parallel FFT Transpose Parallel FFT

Transpose Parallel FFT

Straightforward approach yields total parallel time Tp = Θ(α log p + β n log p/p + γ n log n/p) Compared with binary exchange algorithm, transpose algorithm has higher cost due to message start-up but lower cost due to per-word transfer time Thus, choice of algorithm depends on relative values of α and β for given parallel system

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 32 / 34

slide-33
SLIDE 33

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT

References

  • A. Averbuch and E. Gabber, Portable parallel FFT for

MIMD multiprocessors, Concurrency: Practice and Experience 10:583-605, 1998

  • C. Calvin, Implementation of parallel FFT algorithms on

distributed memory machines with a minimum overhead of communication, Parallel Computing 22:1255-1279, 1996

  • R. M. Chamberlain, Gray codes, fast Fourier transforms,

and hypercubes, Parallel Computing 6:225-233, 1988

  • E. Chu and A. George, FFT algorithms and their

adaptation to parallel processing, Linear Algebra Appl. 284:95-124, 1998

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 33 / 34

slide-34
SLIDE 34

Discrete Fourier Transform Convolution Fast Fourier Transform Parallel FFT

References

  • A. Edelman, Optimal matrix transposition and bit reversal
  • n hypercubes: all-to-all personalized communication,
  • J. Parallel Computing 11:328-331, 1991
  • A. Gupta and V. Kumar, The Scalability of FFT on Parallel

Computers, IEEE Trans. Parallel Distrib. Sys. 4:922-932, 1993

  • R. B. Pelz, Parallel FFTs, D. E. Keyes, A. Sameh, and
  • V. Venkatakrishnan, eds., Parallel Numerical Algorithms,
  • pp. 245-266, Kluwer, 1997

P . N. Swarztrauber, Multiprocessor FFTs, Parallel Computing 5:197-210, 1987

  • J. W. Demmel, Applied Numerical Linear Algebra, SIAM

Philadelphia, 1997.

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 34 / 34