A Fast Fourier Transform Compiler (paper by Matteo Frigo, MIT)


Paper by: Matteo Frigo, MIT Laboratory for Computer Science, February 16, 1999. Presented by: Marco Poltera, Software Engineering Seminar, November 16, 2011.


SLIDE 1

A Fast Fourier Transform Compiler

Paper by: Matteo Frigo, MIT Laboratory for Computer Science. February 16, 1999
Presented by: Marco Poltera. November 16, 2011, Software Engineering Seminar

SLIDE 2

Introduction and motivation

/ Computation of the discrete Fourier transform (DFT) is required by many real-world applications

Goal
  • Look at the internals of FFTW
  • Argue that a specialized compiler is a valuable tool

[Diagram: a real-world application (e.g. JPEG compression) uses a DFT program (e.g. FFTW) to produce a result (e.g. a compressed image)]

SLIDE 3

Recap: DFT

/ linear transform: $y = Tx$
/ DFT: $y = \mathrm{DFT}_n\, x$ (defined in terms of a primitive $n$-th root of unity)
/ FFT: we can compute $y = Tx = T_1(T_2(\cdots(T_m x)))$

SLIDE 4

Recap: DFT

/ $\mathrm{DFT}_4$ = [matrix not preserved in this transcript]

Example from: How to write fast numerical code. Markus Püschel. Carnegie Mellon University. Course 18-645. Lecture 17.
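The matrix for this example did not survive extraction. As a hedged reconstruction, assuming the convention $\omega_4 = e^{-2\pi i/4} = -i$ (the opposite sign convention would swap $i$ and $-i$), the size-4 DFT matrix and one standard Cooley-Tukey factorization of it are:

```latex
\[
\mathrm{DFT}_4 =
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -i & -1 & i \\
1 & -1 & 1 & -1 \\
1 & i & -1 & -i
\end{pmatrix}
= (\mathrm{DFT}_2 \otimes I_2)\,
  \operatorname{diag}(1, 1, 1, -i)\,
  (I_2 \otimes \mathrm{DFT}_2)\, L^4_2,
\qquad
\mathrm{DFT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\]
```

Here $L^4_2$ is the stride permutation sending $(x_0, x_1, x_2, x_3)$ to $(x_0, x_2, x_1, x_3)$. Whether the original slide showed exactly this factorization is an assumption; it illustrates the idea of writing the dense transform as a product of sparse factors.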

SLIDE 5

FFTW

/ FFTW consists of three parts:

Compiler (genfft)
  • run once
  • output: codelets

Planner
  • run once for every transform size
  • hardware adaptation
  • output: plan
  • reusable

Executor
  • computes the DFT
  • output: transformed vector

SLIDE 6

FFTW

[Figure] Graphic from: How to write fast numerical code. Markus Püschel. Carnegie Mellon University. Course 18-645. Lecture 19.

SLIDE 7

FFTW

[Chart: composition of the FFTW code. Codelets, auto-generated by genfft: 95 %. Code to adapt to hardware: 5 %. Graph from the paper.]

SLIDE 8

The four phases of genfft

creation phase → simplifier → scheduler → unparser

SLIDE 9

Creation phase

/ choose an FFT algorithm:
  • n is a multiple of 4: split-radix algorithm
  • n = n1 n2 and gcd(n1, n2) = 1: prime factor algorithm
  • n = n1 n2 and ni ≠ 1: Cooley-Tukey FFT algorithm
  • n is prime: Rader's algorithm
  • any n: application of the DFT definition
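The selection rules on this slide can be sketched as a small decision function. This is an illustration only, not genfft's actual logic: the names `fft_algorithm` and `choose_algorithm` are hypothetical, and both the order in which the conditions are tested and the use of the plain definition for n ≤ 2 are assumptions, since the slide does not state a priority.

```c
#include <assert.h>

typedef enum {
    ALG_SPLIT_RADIX,   /* n is a multiple of 4 */
    ALG_PRIME_FACTOR,  /* n = n1*n2 with gcd(n1, n2) = 1 */
    ALG_COOLEY_TUKEY,  /* n = n1*n2 with n1, n2 != 1 */
    ALG_RADER,         /* n is prime */
    ALG_DEFINITION     /* apply the DFT definition directly */
} fft_algorithm;

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Assumed priority: split-radix, then Rader for primes, then
 * prime factor, then Cooley-Tukey for the remaining composites. */
static fft_algorithm choose_algorithm(unsigned n)
{
    int composite = 0;
    assert(n >= 1);
    if (n <= 2) return ALG_DEFINITION;      /* assumption: trivial sizes */
    if (n % 4 == 0) return ALG_SPLIT_RADIX;
    for (unsigned n1 = 2; n1 * n1 <= n; ++n1) {
        if (n % n1 != 0) continue;
        composite = 1;
        if (gcd(n1, n / n1) == 1) return ALG_PRIME_FACTOR;
    }
    return composite ? ALG_COOLEY_TUKEY : ALG_RADER;
}
```

Under these assumptions, `choose_algorithm(8)` selects split-radix, 15 = 3·5 the prime factor algorithm, 9 = 3·3 Cooley-Tukey, and 7 Rader's algorithm.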

SLIDE 10

Creation phase

/ generate dag according to the chosen FFT algorithm

SLIDE 11

Creation phase

/ Example: Cooley-Tukey algorithm

  (* n is not bound in this excerpt; presumably n = n1 * n2 in the enclosing scope *)
  let rec cooley_tukey n1 n2 input sign =
    (* tmp1: size-n1 transforms over the inputs j1*n2 + j2 *)
    let tmp1 j2 = fftgen n1 (fun j1 -> input (j1 * n2 + j2)) sign in
    (* tmp2: multiply by the twiddle factors *)
    let tmp2 i1 j2 = exp n (sign * i1 * j2) @* tmp1 j2 i1 in
    (* tmp3: size-n2 transforms *)
    let tmp3 i1 = fftgen n2 (tmp2 i1) sign in
    (fun i -> tmp3 (i mod n1) (i / n1))

SLIDE 12

Creation phase

/ DAG representation

  type node =
    | Num of Number.number
    | Load of Variable.variable
    | Store of Variable.variable * node
    | Plus of node list
    | Times of node * node
    | Uminus of node

/ Example (figure: dag over the nodes v0 ... v4, with edge weights 2 and 3):

  v3 = Plus [ v2; Times (Num 3, v0) ]
  v4 = Plus [ Times (Num 2, v2); v1; v0 ]
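To make the node type concrete, the following sketch mirrors it as a C tagged struct with a small evaluator. It is an analogue for illustration, not part of genfft: the names `node_kind`, `node` and `eval` are hypothetical, `Store` is left out for brevity, and a `Plus` node is limited to four summands, which is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NUM, LOAD, PLUS, TIMES, UMINUS } node_kind;

typedef struct node {
    node_kind kind;
    double value;                /* NUM: the constant */
    size_t var;                  /* LOAD: index into the variable array */
    const struct node *args[4];  /* PLUS: summands; TIMES: 2; UMINUS: 1 */
    size_t nargs;                /* number of entries used in args */
} node;

/* Evaluate a node; shared children are simply reached via several
 * pointers, which is what makes the structure a dag and not a tree. */
static double eval(const node *v, const double *vars)
{
    double acc = 0.0;
    switch (v->kind) {
    case NUM:    return v->value;
    case LOAD:   return vars[v->var];
    case UMINUS: return -eval(v->args[0], vars);
    case TIMES:  return eval(v->args[0], vars) * eval(v->args[1], vars);
    case PLUS:
        for (size_t i = 0; i < v->nargs; ++i)
            acc += eval(v->args[i], vars);
        return acc;
    }
    assert(!"unknown node kind");
    return 0.0;
}
```

With v0, v1 and v2 modelled as loads of the values 1, 10 and 100 (an assumption, since the slide does not say what they are), the slide's v3 evaluates to 100 + 3·1 = 103 and v4 to 2·100 + 10 + 1 = 211. Both v3 and v4 point at the same v0 and v2 nodes.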

SLIDE 13

Simplifier

/ algebraic transformations
  • e.g. apply the distributive property: $kx + ky \to k(x + y)$
/ common-subexpression elimination
/ DFT-specific improvements
  • make numeric constants positive
  • dag transposition

SLIDE 14

Simplifier: DAG transposition

/ three passes:
  • simplify D, yielding E
  • simplify the transposed dag $E^T$, yielding $F^T$
  • simplify F, yielding G

SLIDE 15

Simplifier: DAG transposition

[Figure: a dag with inputs a, b and output c, and its transpose with input c and outputs a, b; edge weights 2, 3 and 4]

/ original dag: $c = 4(2a + 3b)$
/ transposed dag: $a = 2 \cdot 4c$, $b = 3 \cdot 4c$

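In matrix terms, transposing a dag transposes the linear map it computes: every edge is reversed, the weights stay the same, and inputs and outputs swap roles. For the small example $c = 4(2a + 3b)$ this reads as follows (a worked restatement, not taken from the slides):

```latex
\[
c = \begin{pmatrix} 8 & 12 \end{pmatrix}
    \begin{pmatrix} a \\ b \end{pmatrix}
\quad\xrightarrow{\ \text{transpose}\ }\quad
\begin{pmatrix} a \\ b \end{pmatrix}
  = \begin{pmatrix} 8 \\ 12 \end{pmatrix} c
  = \begin{pmatrix} 2 \cdot 4c \\ 3 \cdot 4c \end{pmatrix}
\]
```

Because the DFT matrix is symmetric, $\mathrm{DFT}_n^T = \mathrm{DFT}_n$, the transpose of a dag that computes a DFT again computes a DFT, which is why simplifying the transposed dag is a legitimate optimization step.
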
SLIDE 16

Scheduler

Goal
  • maximize register usage

/ schedule is cache-oblivious

SLIDE 17

Scheduler

SLIDE 18

Scheduler

/ number of register spills = $\Theta(n \log n / \log R)$, where R is the number of registers

SLIDE 19

Unparser

/ Schedule is unparsed to C

SLIDE 20

Conclusion

/ performance
/ rapid turnaround
/ effectiveness
/ derived new algorithms
/ not restricted to a specific language such as C

SLIDE 21

Further information

/ Download FFTW: www.fftw.org
/ Details on FFTW: "FFTW: An Adaptive Software Architecture For The FFT" by M. Frigo and S. Johnson (1998)

SLIDE 23

Usage of FFTW

  #include <fftw3.h>
  ...
  {
      fftw_complex *in, *out;
      fftw_plan p;
      ...
      in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
      out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
      /* planner: create the plan once for this transform size */
      p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
      ...
      /* executor: compute the DFT of in, writing the result to out */
      fftw_execute(p); /* repeat as needed */
      ...
      fftw_destroy_plan(p);
      fftw_free(in); fftw_free(out);
  }

from the tutorial included in the FFTW distribution 3.3

SLIDE 24

DFT

/ FFT refers to
  • any $O(N \log N)$ algorithm, or
  • the specific Cooley-Tukey algorithm
/ computing a DFT of N points takes
  • $O(N^2)$ arithmetical operations in the naive way, using the definition
  • $O(N \log N)$ operations for an FFT
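For a sense of scale, here is the comparison for $N = 1024$, ignoring constant factors and taking the logarithm to base 2 (both are simplifying assumptions, since the O-notation fixes neither):

```latex
\[
N^2 = 1024^2 = 1\,048\,576,
\qquad
N \log_2 N = 1024 \cdot 10 = 10\,240,
\qquad
\frac{N^2}{N \log_2 N} = 102.4
\]
```

So at this size the naive method needs on the order of a hundred times more operations, and the gap widens as $N$ grows.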

SLIDE 25

FFTW and Parallelism

/ Parallel versions are available for
  • Cilk
  • POSIX threads
  • MPI

SLIDE 26

Simplifier

/ Implementation:
  • the simplifier is written as if it operated on an expression tree
  • the mapping from trees to DAGs is accomplished by memoization, which is performed implicitly by a monad
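The idea of obtaining a dag from tree-style construction by memoization can be sketched without monads. The following C analogue is only an illustration under stated assumptions: it is not how genfft does it (genfft hides the memo table inside an OCaml monad), the names `dag`, `node` and `mk_node` are hypothetical, and a linear scan stands in for a real hash table.

```c
#include <assert.h>

enum { MAX_NODES = 256 };

typedef enum { OP_LOAD, OP_PLUS, OP_TIMES } op_kind;

/* a, b: child node ids; for OP_LOAD, a is the variable index, b is unused */
typedef struct { op_kind op; int a; int b; } node;

typedef struct { node nodes[MAX_NODES]; int count; } dag;

/* Memoized constructor: if an identical node already exists, return its
 * id instead of creating a duplicate. Code that builds expressions as if
 * they were trees therefore ends up with shared subexpressions. */
static int mk_node(dag *g, op_kind op, int a, int b)
{
    for (int id = 0; id < g->count; ++id)
        if (g->nodes[id].op == op && g->nodes[id].a == a && g->nodes[id].b == b)
            return id;  /* memo hit: reuse the existing node */
    assert(g->count < MAX_NODES);
    g->nodes[g->count] = (node){ op, a, b };
    return g->count++;
}
```

Building (x + y) * (x + y) by constructing x + y twice yields the same node id both times, so the dag holds four nodes (x, y, the sum, the product) rather than the seven a plain tree would contain.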

SLIDE 27

Pragmatic aspects of genfft

/ running time
/ memory requirements