Pruned FFT Implementations: Franz Franchetti, Markus Püschel




SLIDE 1

Carnegie Mellon

Generating High Performance Pruned FFT Implementations

Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury, and Intel

Franz Franchetti, Markus Püschel Electrical and Computer Engineering Carnegie Mellon University

SLIDE 2

The Idea: Pruned FFT

Pruned DFT: 5% – 30% operations reduction in application settings

 Input pruning

E.g., center ¾ inputs are known to be zero

 Output pruning

E.g., only the low ½ frequencies are used

 Simultaneous input & output pruning

Some inputs known zeros and some outputs discarded

[Figure: FFT vs. pruned FFT]
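To make the three pruning modes concrete, here is a minimal Python sketch that evaluates the DFT directly from its definition while skipping known-zero inputs and unused outputs. It only illustrates what a pruned transform computes; it is not an FFT and not the Spiral-generated code. The names `dft` and `pruned_dft`, the size 16, and the reading of "low ½ frequencies" as output indices 0 to n/2 - 1 are assumptions made for this illustration.

```python
import cmath

def dft(x):
    """Full DFT by definition, O(n^2); used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def pruned_dft(x, nonzero_inputs, wanted_outputs):
    """DFT by definition that reads only the inputs not known to be zero
    and computes only the outputs that are actually used."""
    n = len(x)
    return {k: sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                   for j in nonzero_inputs)
            for k in wanted_outputs}

n = 16
# Input pruning: the center 3/4 of the inputs are known to be zero.
nonzero_inputs = [0, 1, 14, 15]
x = [0.0] * n
for j in nonzero_inputs:
    x[j] = float(j + 1)

# Output pruning: only the first half of the output indices is used
# (an assumed reading of "low 1/2 frequencies").
wanted_outputs = range(n // 2)

full = dft(x)
pruned = pruned_dft(x, nonzero_inputs, wanted_outputs)

# The pruned result agrees with the full DFT on every requested output.
max_err = max(abs(pruned[k] - full[k]) for k in wanted_outputs)

# Multiplications in the direct evaluation, full vs. pruned.
multiplies_full = n * n
multiplies_pruned = len(nonzero_inputs) * len(wanted_outputs)
```

The large ratio between `multiplies_full` and `multiplies_pruned` is specific to this direct O(n²) evaluation. For a fast algorithm the saving is much smaller, which is why the slide quotes 5% – 30%.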

SLIDE 3

The Problem

[Figure: Discrete Fourier Transform (single precision), 2 x Core2 Extreme 3 GHz. Horizontal axis: problem size, 16 to 262,144. Vertical axis ticks: 2 to 26. Series: best code (Spiral generated), Numerical Recipes. Annotations: 30x; same operations count.]

Can we turn 5% – 30% operations savings into speed-up?

SLIDE 4

Organization

 Spiral overview
 Pruned FFT
 Results
 Concluding remarks

SLIDE 5

Spiral

 Library generator for linear transforms

(DFT, DCT, DWT, filters, ….) and recently more …

 Wide range of platforms supported:

scalar, fixed point, vector, parallel, Verilog, GPU

 Research Goal: “Teach” computers to write fast libraries

  • Complete automation of implementation and optimization
  • Conquer the “high” algorithm level for automation

 When a new platform comes out:

Regenerate a retuned library

 When a new platform paradigm comes out (e.g., CPU+GPU):

Update the tool rather than rewriting the library

Intel uses Spiral to generate parts of their MKL and IPP libraries

SLIDE 6

How Spiral Works

[Figure: Spiral tool flow. Problem specification (transform) -> Algorithm Generation / Algorithm Optimization -> algorithm -> Implementation / Code Optimization -> C code -> Compilation / Compiler Optimizations -> fast executable. The measured performance feeds a Search block, which controls the earlier stages.]

Spiral: Complete automation of the implementation and optimization task

Basic idea: Declarative representation of algorithms

Rewriting systems to generate and optimize algorithms

SLIDE 7

Fast Algorithms, Example: 4-point FFT

 Fast algorithms = matrix factorizations
 SPL = mathematical, declarative specification
 SPL formula can be translated into program

[Figure: 4-point FFT written as a matrix factorization. Operation-count annotations: 12 adds, 4 mults; 4 adds; 4 adds; 1 mult. Factor labels: Fourier transform, identity, permutation, Kronecker product.]
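The formula itself did not survive extraction. As an illustration of "fast algorithm = matrix factorization", the sketch below checks the standard Cooley-Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) · T · (I_2 ⊗ DFT_2) · L numerically. Here T = diag(1, 1, 1, -i) is the twiddle matrix and L is the permutation that swaps the two middle inputs. That this is the exact formula shown on the slide is an assumption. The helper names `matmul` and `kron` are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker product: each entry a[i][j] is replaced by the block a[i][j] * b."""
    return [[a[i][j] * b[p][q]
             for j in range(len(a[0])) for q in range(len(b[0]))]
            for i in range(len(a)) for p in range(len(b))]

I2 = [[1, 0], [0, 1]]       # identity
DFT2 = [[1, 1], [1, -1]]    # 2-point Fourier transform (a butterfly)

# 4-point DFT matrix with entries w^(r*c), w = exp(-2*pi*i/4) = -i.
roots = [1, -1j, -1, 1j]
DFT4 = [[roots[(r * c) % 4] for c in range(4)] for r in range(4)]

# Twiddle factors: the only nontrivial multiplication is by -i.
T = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, -1j]]

# Stride permutation: reorders (x0, x1, x2, x3) to (x0, x2, x1, x3).
L = [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]]

product = matmul(matmul(matmul(kron(DFT2, I2), T), kron(I2, DFT2)), L)
max_err = max(abs(product[r][c] - DFT4[r][c])
              for r in range(4) for c in range(4))
```

Each factor is sparse, so applying the four factors in sequence costs fewer operations than multiplying by the dense `DFT4` matrix, which is the point the slide's operation counts make.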

SLIDE 8

Transforms and Breakdown Rules

Base case rules

Goal: Derive Cooley-Tukey Pruned FFT rule

“Teaches” Spiral about existing algorithm knowledge (~200 journal papers)

SLIDE 9

Organization

 Spiral overview
 Pruned FFT
 Results
 Concluding remarks

SLIDE 10

Data Sparseness: Block Sequences

 Sequence
 Block sequence
 Example

[Equations on this slide were not recovered]
SLIDE 11

Zero-Padding: Scatter Matrix

 Definition
 Example

[Equations on this slide were not recovered]
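Since the slide's definition is missing, the sketch below shows one common way to express zero-padding with a scatter matrix. It assumes that the scatter matrix S is the n x k 0/1 matrix that copies a short input of length k into chosen positions of a length-n vector and leaves the rest zero. Under that assumption, the input-pruned DFT is the n x k matrix DFT_n · S. The function names and the size 8 are hypothetical.

```python
import cmath

def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*r*c/n)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]

def scatter_matrix(n, positions):
    """n x k 0/1 matrix (assumed definition) with a 1 at (positions[c], c)."""
    return [[1 if positions[c] == r else 0 for c in range(len(positions))]
            for r in range(n)]

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Matrix-vector product."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]

n = 8
positions = [0, 1]      # only the first 1/4 of the inputs can be nonzero
y = [3.0, 5.0]          # the short, unpadded input

S = scatter_matrix(n, positions)
padded = matvec(S, y)   # zero-padding expressed as a matrix-vector product

# Input-pruned DFT: the 8 x 2 matrix DFT_8 * S keeps only the live columns.
pruned = matmul(dft_matrix(n), S)
direct = matvec(pruned, y)
reference = matvec(dft_matrix(n), padded)
max_err = max(abs(a - b) for a, b in zip(direct, reference))
```

Output pruning can be written the same way with the transpose of a scatter matrix (a gather) applied on the left of the DFT.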

SLIDE 12

Cooley-Tukey Pruned FFT Rule

[Figure: FFT vs. pruned FFT]

 Recursive input pruning rule
 Base case
 Similar rule for output pruning and simultaneous pruning
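The idea of a recursive input pruning rule with a base case can be sketched with a plain radix-2 decimation-in-time FFT that tracks which inputs are known to be zero. This is a simplified illustration, not the Spiral breakdown rule. The function name `pruned_fft`, the `live` flags, and the use of the butterfly count as a stand-in for the operations count are all assumptions.

```python
import cmath

def dft(x):
    """Reference DFT by definition, O(n^2)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def pruned_fft(x, live):
    """Radix-2 decimation-in-time FFT that skips work on known-zero data.

    x    -- input list, length a power of two
    live -- list of booleans; False marks an input known to be zero
    Returns (spectrum, number of butterflies actually executed).
    """
    n = len(x)
    if not any(live):
        return [0j] * n, 0            # base case: all-zero block, no arithmetic
    if n == 1:
        return [complex(x[0])], 0     # base case: 1-point transform
    even, ops_even = pruned_fft(x[0::2], live[0::2])
    odd, ops_odd = pruned_fft(x[1::2], live[1::2])
    ops = ops_even + ops_odd
    half = n // 2
    if not any(live[1::2]):
        # The odd half is entirely zero: the butterflies reduce to copies.
        return even + even, ops
    out = [0j] * n
    for q in range(half):
        t = cmath.exp(-2j * cmath.pi * q / n) * odd[q]   # twiddle multiply
        out[q] = even[q] + t
        out[q + half] = even[q] - t
    return out, ops + half

n = 16
# Second half of the input is known to be zero.
live = [True] * 8 + [False] * 8
x = [float(j + 1) for j in range(8)] + [0.0] * 8

spectrum, pruned_ops = pruned_fft(x, live)
_, full_ops = pruned_fft(x, [True] * n)   # same data, no pruning information
max_err = max(abs(a - b) for a, b in zip(spectrum, dft(x)))
```

In this example the pruned run executes 24 butterflies instead of 32. Only the innermost stage is saved, because after the first butterfly stage the intermediate data is dense. This is consistent with the modest operation savings the talk reports for FFT pruning.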

SLIDE 13

Derivation: Cooley-Tukey Pruned FFT Rule

Cooley-Tukey FFT rule + Kronecker product identities
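The derivation on the slide was not recovered. The following is a sketch of how such a rule can be obtained for one special case, assuming the input is nonzero only in its first k·m' entries, so that the scatter matrix factors as S_{m,m'} ⊗ I_k, with S_{m,m'} the m × m' matrix that pads a length-m' vector with zeros. The symbols S_{m,m'} and m', and the convention that L^{km}_k reads its input at stride k, i.e. (L^{km}_k x)_{jm+i} = x_{ik+j}, are assumptions. The rule in the talk may be stated in a more general form.

```latex
\begin{align*}
% Cooley-Tukey FFT rule
\mathrm{DFT}_{km} &= (\mathrm{DFT}_k \otimes I_m)\, T^{km}_m\,
                     (I_k \otimes \mathrm{DFT}_m)\, L^{km}_k \\
% Kronecker product identity: move the scatter matrix through the stride permutation
L^{km}_k\, (S_{m,m'} \otimes I_k) &= (I_k \otimes S_{m,m'})\, L^{km'}_k \\
% Kronecker product identity: merge the two block-diagonal factors
(I_k \otimes \mathrm{DFT}_m)(I_k \otimes S_{m,m'}) &= I_k \otimes (\mathrm{DFT}_m\, S_{m,m'}) \\
% Resulting recursive input pruning rule
\mathrm{DFT}_{km}\, (S_{m,m'} \otimes I_k) &= (\mathrm{DFT}_k \otimes I_m)\, T^{km}_m\,
   \bigl(I_k \otimes (\mathrm{DFT}_m\, S_{m,m'})\bigr)\, L^{km'}_k
\end{align*}
```

Read from right to left, the last line says that a pruned DFT of size km breaks down into k pruned DFTs of size m, followed by the unchanged twiddle and butterfly stages, so the rule can be applied recursively.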

SLIDE 14

Organization

 Spiral overview
 Pruned FFT
 Results
 Concluding remarks

SLIDE 15

DFT: Spiral vs. FFTW and MKL (2 cores, 4-way SSE)

[Figure: performance [Gflop/s] (axis ticks 2 to 20) vs. input size (256 to 16,384); single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit. Series: Spiral DFT SSE+SMP, Spiral DFT SSE, Intel MKL 9.0, FFTW DFT, Numerical Recipes in C.]

Spiral-generated DFT is a good baseline

SLIDE 16

Spiral: Pruned DFT vs. DFT (4-way SSE)

[Figure: performance [Gflop/s] (axis ticks 5 to 25) vs. input size (256 to 16,384); single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit. Series: Pruned DFT (first 1/16 non-zero), Pruned DFT (center 7/8 zero), Pruned DFT (center 1/4 non-zero), Pruned DFT (second half zero), Spiral DFT.]

FFT input pruning: speed-up for sequential vector DFT

SLIDE 17

Spiral: Pruned DFT vs. DFT (2 cores, 4-way SSE)

[Figure: performance [Gflop/s] (axis ticks 5 to 25) vs. input size (256 to 16,384); single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit. Series: Pruned DFT (first 1/16 non-zero), Pruned DFT (center 7/8 zero), Pruned DFT (center 1/4 non-zero), Pruned DFT (second half zero), Spiral DFT.]

FFT input pruning: speed-up for parallel vector DFT

SLIDE 18

Spiral: I/O Pruned DFT vs. DFT (4-way SSE)

[Figure: performance [Gflop/s] (axis ticks 5 to 25) vs. input size (256 to 16,384); single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit. Series: I/O Pruned DFT (output: first 1/16 non-zero, input: center 3/4 zero), I/O Pruned DFT (output: center 7/8 zero, input: first 1/4 non-zero), Spiral DFT.]

I/O pruning: better speed-up than input pruning only

SLIDE 19

Organization

 Spiral overview
 Pruned FFT
 Results
 Concluding remarks

SLIDE 20

Summary

 Spiral’s goal: “Teach” computers to write fast libraries

From problem specification to very fast code, automatically (click button)

 Optimization at a high level of abstraction

Memory hierarchy, vector SIMD, multicore,…

 The generated programs are very fast

Often better than human-written code

 Pruned FFT: lower operations count translates into speed-up

up to 30% over best vector SIMD and multicore code for input pruning