A Fast Fourier Transform Compiler
Matteo Frigo, Supercomputing Technologies Group, MIT Laboratory for Computer Science


SLIDE 1

A Fast Fourier Transform Compiler

Matteo Frigo Supercomputing Technologies Group MIT Laboratory for Computer Science

SLIDE 2

genfft

genfft is a domain-specific FFT compiler. Generates fast C code for Fourier transforms of arbitrary size. Written in Objective Caml 2.0 [Leroy 1998], a dialect of ML. “Discovered” previously unknown FFT algorithms. From a complex-number FFT algorithm it derives a real-number algorithm [Sorensen et al., 1987] automatically.

SLIDE 3

genfft used in FFTW

genfft produced the “inner loop” of FFTW (95% of the code). FFTW [Frigo and Johnson, 1997–1999] is a library for computing one- and multidimensional real and complex discrete Fourier transforms of arbitrary size.

FFTW adapts itself to the hardware automatically, giving portability and speed at the same time.

Parallel versions are available for Cilk, POSIX threads, and MPI. Code, documentation, and benchmark results are at http://theory.lcs.mit.edu/~fftw. genfft makes FFTW possible.

SLIDE 4

FFTW’s performance

FFTW is faster than all other portable FFT codes, and it is comparable with machine-specific libraries provided by vendors.

[Chart: MFLOPS versus transform size (2 through 4194304) for FFTW, SUNPERF, Ooura, FFTPACK, Green, Sorensen, Singleton, Krukar, Temperton, Bailey, NR, and QFT.]

Benchmark of FFT codes on a 167-MHz Sun UltraSPARC I, double precision.

SLIDE 5

FFTW’s performance

[Chart: MFLOPS versus transform size for FFTW (B) and SUNPERF (E).]

FFTW versus Sun’s Performance Library on a 167-MHz Sun UltraSPARC I, single precision.

SLIDE 6

Why FFTW is fast

FFTW does not implement a single fixed algorithm. Instead, the transform is computed by an executor, composed of highly optimized, composable blocks of C code called codelets. Roughly speaking, a codelet computes a Fourier transform of small size.

At runtime, a planner finds a fast composition of codelets. The planner measures the speed of different plans and chooses the fastest.

FFTW contains 120 codelets for a total of 56215 lines of optimized code.

SLIDE 7

Real FFTs are complex

The data-flow graph of the Fourier transform is not a butterfly; it is a mess!

[Figure: complex butterfly.]

The FFT is even messier when the size is not a power of 2, or when the input consists of real numbers.

[Figure: real and imaginary parts.]

SLIDE 8

genfft’s compilation strategy

  • 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing.
  • 2. Simplify the dag.
  • 3. Schedule the dag, using a register-oblivious schedule.
  • 4. Unparse to C. (Could also unparse to FORTRAN or other languages.)

SLIDE 9

The case for genfft

Why not rely on the C compiler to optimize codelets?

  • FFT algorithms are best expressed recursively, but C compilers are not good at unrolling recursion.
  • The FFT needs very long code sequences to use all registers effectively. (FFTW on Alpha EV56 computes FFT(64) with 2400 lines of straight C code.)
  • Register allocators choke if you feed them long code sequences. In the case of the FFT, however, we know the (asymptotically) optimal register allocation.
  • Certain optimizations only make sense for FFTs.

Since C compilers are good at instruction selection and instruction scheduling, I did not need to implement these steps in genfft.

SLIDE 10

Outline

Dag creation. The simplifier. Register-oblivious scheduling. Related work. Conclusion.

SLIDE 11

Dag representation

A dag is represented as a data type node. (Think of a syntax tree or a symbolic algebra system.)

type node =
  | Num of Number.number
  | Load of Variable.variable
  | Plus of node list
  | Times of node * node

[Figure: a small dag built from Plus, Times, and Num nodes over variables v.]

(All numbers are real.)

An abstraction layer implements complex arithmetic on pairs of dag nodes.

SLIDE 12

Dag creation

The function fftgen performs a complete symbolic evaluation of an FFT algorithm, yielding a dag. No single FFT algorithm is good for all input sizes n. genfft contains many algorithms, and fftgen dispatches the most appropriate:

  • Cooley-Tukey [1965] (works for n = pq).
  • Prime factor [Good, 1958] (works for n = pq when gcd(p, q) = 1).
  • Split-radix [Duhamel and Hollman, 1984] (works for n = 2^k).
  • Rader [1968] (works when n is prime).

SLIDE 13

Cooley-Tukey FFT algorithm

DSP book:

y k
  • n
X j
  • x
j
  • j
k n
  • p
X j
  • q
  • X
j
  • x
pj
  • j
  • j
  • k
  • q
  • A
  • j
  • k
  • n
  • j
  • k
  • p
  • where
n
  • pq and
k
  • k
  • q
k .

OCaml code:

let cooleytukey n p q x
  • let
inner j
  • fftgen
q fun j
  • x
p
  • j
  • j
in let twiddle k j
  • mega
n j
  • k
  • inner
j k in let
  • uter
k
  • fftgen
p twiddle k in fun k
  • uter
k mod q k
  • q

SLIDE 14

Outline

Dag creation. The simplifier. Register-oblivious scheduling. Related work. Conclusion.

SLIDE 15

The simplifier

Well-known optimizations:

  • Algebraic simplification, e.g., a · 1 = a.
  • Constant folding.
  • Common-subexpression elimination. (In the current implementation, CSE is encapsulated into a monad.)

Specific to the FFT:

  • (Elimination of negative constants.)
  • Network transposition.

Why did I implement these well-known optimizations, which are implemented in C compilers anyway? Because the FFT-specific optimizations can only be done after the well-known optimizations.

SLIDE 16

Network transposition

A network is a dag that computes a linear function [Crochiere and Oppenheim, 1975]. (Recall that the FFT is linear.) Reversing all edges yields a transposed network.

[Figure: a two-input, two-output network mapping (x, y) to (s, t), and its transposed network mapping (s, t) to (x, y).]

If a network computes X = AY, the transposed network computes Y = AᵀX.

SLIDE 17

Optimize the transposed dag!

genfft employs this dag optimization algorithm:

OPTIMIZE(G):
    G1 <- SIMPLIFY(G)
    G2 <- SIMPLIFY(G1^T)
    RETURN SIMPLIFY(G2^T)

(Iterating the transposition has no effect.)

SLIDE 18

Benefits of network transposition

                          genfft                            literature
size    adds    muls (original)   muls (transposed)   savings    adds    muls
complex to complex:
  5      32          16                 12               25%       34      10
 10      84          32                 24               25%       88      20
 13     176          88                 68               23%      214      76
 15     156          68                 56               18%      162      50
real to complex:
  5      12           8                  6               25%       17       5
 10      34          16                 12               25%        ?       ?
 13      76          44                 34               23%        ?       ?
 15      64          31                 25               19%       67      25

SLIDE 19

Outline

Dag creation. The simplifier. Register-oblivious scheduling. Related work. Conclusion.

SLIDE 20

How to help register allocators

genfft’s scheduler reorders the dag so that the register allocator of the C compiler can minimize the number of register spills.

genfft’s recursive partitioning heuristic is asymptotically optimal for n = 2^k.

genfft’s schedule is register-oblivious: the schedule does not depend on the number R of registers, and yet it is optimal for all R.

Certain C compilers insist on imposing their own schedule. With egcs on SPARC, FFTW is 50-100% faster if I use -fno-schedule-insns. (How do I persuade the SGI compiler to respect my schedule?)

SLIDE 21

Recursive partitioning

The scheduler partitions the dag by cutting it in the “middle.” This operation creates connected components that are scheduled recursively.

SLIDE 22

Register-oblivious scheduling

Assume that we want to compute the FFT of n points, and a computer has R registers, where n > R.

Lower bound [Hong and Kung, 1981]: if n = 2^k, the n-point FFT requires at least Ω(n lg n / lg R) register spills.

Upper bound: if n = 2^k, genfft’s output program for the n-point FFT incurs at most O(n lg n / lg R) register spills.

A more general theory of cache-oblivious algorithms exists [Frigo, Leiserson, Prokop, and Ramachandran, 1999].

SLIDE 23

Related work

FOURGEN [Maruhn, 1976]. Written in PL/I, generates FORTRAN transforms of size 2^k.

[Johnson and Burrus, 1983] automate the design of DFT programs using dynamic programming.

[Selesnick and Burrus, 1986] generate MATLAB programs for FFTs of certain sizes.

[Perez and Takaoka, 1987] generate prime-factor FFT programs.

[Veldhuizen, 1995] uses a template metaprogramming technique to generate C++.

EXTENT [Gupta et al., 1996] compiles a tensor-product language into FORTRAN.

SLIDE 24

Conclusion

A domain-specific compiler is a valuable tool.

  • For the FFT on modern processors, performance requires straight-line code sequences with hundreds of instructions.
  • Correctness was surprisingly easy.
  • Rapid turnaround. (About 15 minutes to regenerate FFTW.)
  • We can implement problem-specific code improvements.
  • genfft derives new algorithms.
