

SLIDE 1

Large Multicore FFTs: Approaches to Optimization

Sharon Sacco and James Geraci
MIT Lincoln Laboratory

24 September 2008

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

SLIDE 2

Outline

  • Introduction
      – 1D Fourier Transform
      – Mapping 1D FFTs onto Cell
      – 1D as 2D Traditional Approach
  • Technical Challenges
  • Design
  • Performance
  • Summary
SLIDE 3

1D Fourier Transform

g_j = Σ_{k=0}^{N−1} f_k e^{−2πi jk/N}

  • This is a simple equation
  • A few people spend a lot of their careers trying to make it run fast
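A minimal C sketch of the sum above, evaluated directly in O(N²) time (an illustration only, not from the talk; the whole point of an FFT is to compute the same result in O(N log N)):

    #include <complex.h>
    #include <math.h>

    /* Direct O(N^2) evaluation of g_j = sum_k f_k e^(-2*pi*i*j*k/N). */
    void dft(const double complex *f, double complex *g, int n)
    {
        const double pi = 3.14159265358979323846;
        for (int j = 0; j < n; j++) {
            double complex sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += f[k] * cexp(-2.0 * pi * I * (double)j * (double)k / n);
            g[j] = sum;
        }
    }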

SLIDE 4

Mapping 1D FFT onto Cell

  • Cell FFTs can be classified by memory requirements
      – Small FFTs fit into a single SPE's local store (LS); 4096 points is the largest size
      – Medium FFTs fit into multiple local stores; 65536 points is the largest size
      – Large FFTs must use XDR memory as well as LS memory
  • Medium and large FFTs require careful memory transfers

[Figure: FFT data placement for the three size classes]
SLIDE 5

1D as 2D Traditional Approach

[Figure: 16 points viewed as a 4 × 4 matrix, with the matrix of central twiddle factors w^0 … w^9]

  1. Corner turn to compact columns
  2. FFT on columns
  3. Corner turn to original orientation
  4. Multiply (elementwise) by central twiddles
  5. FFT on rows
  6. Corner turn to correct data order

  • 1D as 2D FFT reorganizes data a lot
      – Timing jumps when it is used
  • Can reduce memory for twiddle tables
  • Only one FFT needed (see the sketch of the six steps below)
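A compact sketch of these six steps, assuming a hypothetical in-place 1D routine fft1d() and data stored row-major as rows × cols (the corner turns are written as plain transposes; a real Cell implementation would block them into DMA-sized tiles):

    #include <complex.h>

    void fft1d(double complex *x, int n);   /* assumed: in-place 1D FFT */

    static void transpose(const double complex *in, double complex *out,
                          int rows, int cols)
    {
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                out[c * rows + r] = in[r * cols + c];
    }

    /* x holds rows*cols points row-major; tmp is scratch of the same
     * size. The final corner turn leaves the result in tmp. */
    void fft_1d_as_2d(double complex *x, double complex *tmp,
                      int rows, int cols)
    {
        const double pi = 3.14159265358979323846;
        long n = (long)rows * cols;

        transpose(x, tmp, rows, cols);             /* 1. corner turn      */
        for (int c = 0; c < cols; c++)
            fft1d(&tmp[(long)c * rows], rows);     /* 2. FFT on columns   */
        transpose(tmp, x, cols, rows);             /* 3. corner turn back */
        for (int r = 0; r < rows; r++)             /* 4. central twiddles */
            for (int c = 0; c < cols; c++)
                x[(long)r * cols + c] *=
                    cexp(-2.0 * pi * I * (double)r * c / (double)n);
        for (int r = 0; r < rows; r++)
            fft1d(&x[(long)r * cols], cols);       /* 5. FFT on rows      */
        transpose(x, tmp, rows, cols);             /* 6. corner turn      */
    }

With this index convention, element (r, c) is scaled by w^(r·c) = e^(−2πi·rc/N), matching the twiddle matrix in the figure above.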
SLIDE 6

Outline

  • Introduction
  • Technical Challenges
      – Communications
      – Memory
      – Cell Rounding
  • Design
  • Performance
  • Summary
SLIDE 7

Communications

Bandwidth to XDR memory is 25.3 GB/s; each SPE's connection to the EIB is 50 GB/s, and total EIB bandwidth is 96 bytes / cycle.

  • Minimizing XDR memory accesses is critical
  • Leverage the EIB
  • Coordinating SPE communication is desirable
      – Need to know SPE relative geometry

SLIDE 8

Memory

XDR memory is much larger than a 1M-point FFT requires. Each SPE has 256 KB of local store; each Cell has 2 MB of local store in total.

  • Need to rethink algorithms to leverage the memory
      – Consider local store both from the individual and the collective SPE point of view

SLIDE 9

Cell Rounding

  • The cost to correct the basic binary operations (add, subtract, multiply) is prohibitive
  • Accuracy should instead be improved by minimizing the number of steps the algorithm takes to produce a result

[Figure: rounding of the low bits b00, b01, b10 under IEEE 754 round-to-nearest vs. Cell truncation; the average values produced by the two modes differ by 0.5 bit.]
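A toy C program (an illustration, not Cell-specific) showing the bias the figure describes: truncation loses half a unit in the last place on average, while round-to-nearest is unbiased:

    #include <stdio.h>

    int main(void)
    {
        double trunc_err = 0.0, nearest_err = 0.0;
        int n = 1 << 16;
        for (int i = 0; i < n; i++) {
            double x = i / (double)n;                 /* fraction rounded away */
            trunc_err   += 0.0 - x;                   /* truncation drops x    */
            nearest_err += (x < 0.5 ? -x : 1.0 - x);  /* nearest: |err| <= 0.5 */
        }
        printf("mean truncation error: %f ulp\n", trunc_err / n);   /* ~ -0.5 */
        printf("mean nearest error:    %f ulp\n", nearest_err / n); /* ~  0.0 */
        return 0;
    }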
SLIDE 10

Outline

  • Introduction
  • Technical Challenges
  • Design
      – Using Memory Well
      – Reducing Memory Accesses
      – Distributing on SPEs
      – Bit Reversal
      – Complex Format
      – Computational Considerations
  • Performance
  • Summary
SLIDE 11

FFT Signal Flow Diagram and Terminology

[Figure: 16-point decimation-in-frequency signal flow diagram, with a butterfly, a block, and a radix-2 stage labeled]

  • Size 16 can illustrate concepts for large FFTs
      – Ideas scale well and it is “drawable”
  • This is the “decimation in frequency” data flow
  • Where the weights are applied determines the algorithm

SLIDE 12

Reducing Memory Accesses

  • Columns are loaded in strips that fit in the total Cell local store
  • The FFT algorithm processes 4 columns at a time to leverage the SIMD registers
      – Requires code separate from the row FFTs
  • Data reorganization requires SPE-to-SPE DMAs
  • No bit reversal is needed

[Figure: 1M points viewed as a 1024 × 1024 matrix; a strip of 64 columns is loaded and processed 4 columns at a time]

SLIDE 13

1D FFT Distribution with Single Reorganization

[Figure: 16-point signal flow with a single reorganization partway through the stages]

  • One approach is to load everything onto a single SPE to do the first part of the computation
  • After a single reorganization, each SPE owns an entire block and can complete the computations on its points

SLIDE 14

1D FFT Distribution with Multiple Reorganizations

[Figure: 16-point signal flow with a reorganization after each of the early stages]

  • A second approach is to divide groups of contiguous butterflies among SPEs and reorganize after each stage until the SPEs own a full block

SLIDE 15

Selecting the Preferred Reorganization

                   Single Reorganization       Multiple Reorganizations
  Number of SPEs   Exchanges   Data in 1 DMA   Exchanges   Data in 1 DMA
        2              2           N / 4           2           N / 4
        4             12           N / 16          8           N / 8
        8             56           N / 64         24           N / 16

  • Evaluation favors multiple reorganizations
      – Fewer DMAs mean less bus contention (single reorganization exceeds the number of busses)
      – DMA overhead (~0.3 μs) is minimized
      – Programming is simpler for multiple reorganizations

N is the number of elements in SPE memory; P is the number of SPEs. Typical N is 32k complex elements.

Single reorganization:
  • Number of exchanges: P (P − 1)
  • Number of elements exchanged: N (P − 1) / P

Multiple reorganizations:
  • Number of exchanges: P log2(P)
  • Number of elements exchanged: (N / 2) log2(P)
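The table entries follow from these formulas; a quick C check (the per-DMA sizes N/P² for the single scheme and N/(2P) for the multiple scheme are inferred from the table, not stated on the slide):

    #include <stdio.h>

    /* Exchange counts and per-DMA sizes for P = 2, 4, 8 SPEs. */
    int main(void)
    {
        for (long p = 2; p <= 8; p *= 2) {
            int log2p = 0;
            for (long t = p; t > 1; t >>= 1) log2p++;
            printf("P=%ld  single: %2ld exchanges of N/%-2ld   "
                   "multiple: %2ld exchanges of N/%ld\n",
                   p, p * (p - 1), p * p,          /* P(P-1) of N/P^2      */
                   p * (long)log2p, 2 * p);        /* P log2(P) of N/(2P)  */
        }
        return 0;
    }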

SLIDE 16

Column Bit Reversal

  • Bit reversal of columns can be implemented through the order in which rows are processed, together with double buffering
  • Reversal row pairs are both read into local store and then written to each other's memory location

[Figure: binary row numbers 000000001 and 100000000 form a bit-reversal pair]

  • Exchanging rows for bit reversal has a low cost
  • DMA addresses are table driven
  • Bit reversal table can be very small
  • Row FFTs are conventional 1D FFTs
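A sketch of the table-driven row exchange, assuming hypothetical DMA helpers dma_read_row() and dma_write_row() and a power-of-two row count; only pairs with j > i are touched, so each pair is exchanged exactly once and self-reversing rows stay in place:

    #include <stdint.h>

    /* Hypothetical DMA helpers; on Cell these would be double-buffered
     * local-store <-> XDR transfers. */
    extern void dma_read_row(uint32_t row);
    extern void dma_write_row(uint32_t dst_row, uint32_t src_row);

    static uint32_t bit_reverse(uint32_t x, int bits)
    {
        uint32_t r = 0;
        for (int i = 0; i < bits; i++) {
            r = (r << 1) | (x & 1u);
            x >>= 1;
        }
        return r;
    }

    /* rows == 1 << bits; swap each row with its bit-reversed partner */
    void bit_reverse_rows(int bits)
    {
        uint32_t rows = 1u << bits;
        for (uint32_t i = 0; i < rows; i++) {
            uint32_t j = bit_reverse(i, bits);
            if (j > i) {
                dma_read_row(i);         /* both rows into local store */
                dma_read_row(j);
                dma_write_row(j, i);     /* write each row to the      */
                dma_write_row(i, j);     /* other's XDR location       */
            }
        }
    }

In a real implementation the bit_reverse() loop would be replaced by the small lookup table the slide mentions.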
SLIDE 17

Complex Format

  • Two common formats for complex data
      – interleaved: real 0, imag 0, real 1, imag 1, …
      – split: real 0, real 1, …, imag 0, imag 1, …
  • Interleaved complex format reduces the number of DMAs
  • SIMD units need split format for complex arithmetic
  • The complex format seen by the user should be standard
      – The internal format is opaque to the user
  • The internal format should benefit the algorithm
  • Internal format conversion is lightweight (see the sketch below)
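A minimal sketch of the conversion (scalar C rather than the SIMD intrinsics an SPE version would use):

    #include <stddef.h>

    /* interleaved (re0, im0, re1, im1, ...) -> split (all re, then all im) */
    void interleaved_to_split(const float *in, float *re, float *im, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            re[k] = in[2 * k];
            im[k] = in[2 * k + 1];
        }
    }

    /* split -> interleaved, for returning results in the standard format */
    void split_to_interleaved(const float *re, const float *im,
                              float *out, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            out[2 * k]     = re[k];
            out[2 * k + 1] = im[k];
        }
    }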

SLIDE 18

Outline

  • Introduction
  • Technical Challenges
  • Design
      – Using Memory Well
      – Computational Considerations
      – Central Twiddles
      – Algorithm Choice
  • Performance
  • Summary
SLIDE 19

Central Twiddles

  • Central twiddles are a significant part of the design
  • Central twiddles can take as much memory as the input data
  • Reading them from memory could increase FFT time by up to 20%
  • For 32-bit FFTs, central twiddles can be computed as needed
      – Trigonometric identity methods require double precision (the next-generation Cell should make this the method of choice)
      – Direct sine and cosine algorithms are long

[Figure: central twiddles for a 1M FFT form a 1024 × 1024 matrix of powers w^0 … w^(1023 · 1023)]
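One on-the-fly option, sketched with double-precision library sin/cos (the trig-identity method the slide prefers would replace these calls):

    #include <complex.h>
    #include <math.h>

    /* Central twiddle w^(r*c) = e^(-2*pi*i*r*c/N) for an FFT split as
     * rows x cols, computed on demand instead of stored as a 1K x 1K table. */
    static double complex central_twiddle(long r, long c, long n)
    {
        double angle = -2.0 * 3.14159265358979323846 *
                       (double)(r * c) / (double)n;
        return cos(angle) + I * sin(angle);
    }

Doing the angle computation in double precision keeps the twiddles accurate to full single precision even for exponents near 1023 · 1023.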

SLIDE 20

Algorithm Choice

  • Cooley-Tukey has a constant operation count
      – 2 multiply-adds to compute each result in each stage
  • Gentleman-Sande varies widely in operation count
      – 1 – 3 operations for each result
  • The DC term has the same accuracy in both
  • The worst Gentleman-Sande term has 50% more roundoff error when fused multiply-add is available

Computational butterflies:

  Cooley-Tukey:       a, b  →  a + b · w^k,   a − b · w^k
  Gentleman-Sande:    a, b  →  a + b,         (a − b) · w^k
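The two butterflies in C, in complex scalar form (on the SPE each line would be SIMD multiply-adds over split-format data):

    #include <complex.h>

    /* Cooley-Tukey: twiddle applied before the add/subtract,
     * so each output maps onto multiply-adds. */
    void ct_butterfly(float complex *a, float complex *b, float complex wk)
    {
        float complex t = *b * wk;
        *b = *a - t;                  /* a - b*w^k */
        *a = *a + t;                  /* a + b*w^k */
    }

    /* Gentleman-Sande: twiddle applied after the subtract. */
    void gs_butterfly(float complex *a, float complex *b, float complex wk)
    {
        float complex t = *a - *b;
        *a = *a + *b;                 /* a + b         */
        *b = t * wk;                  /* (a - b) * w^k */
    }

In the Gentleman-Sande form the (a − b) · w^k path is an add followed by a multiply, which is where the extra roundoff on the worst term comes from.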

SLIDE 21

Radix Choice

  • What counts for accuracy is how many operations lie between the input and a particular result

Cooley-Tukey radix 4:

  t1 = xl · w^z        t2 = xk · w^(z/2)     t3 = xm · w^(3z/2)
  s0 = xj + t1         a0 = xj − t1
  s1 = t2 + t3         a1 = t2 − t3
  yj = s0 + s1         yk = s0 − s1
  yl = a0 − i · a1     ym = a0 + i · a1

  • Number of operations for 1 radix-4 stage (real or imaginary): 9 (3 multiplies, 3 multiply-adds, 3 adds)
  • Number of operations for 2 radix-2 stages (real or imaginary): 6 (6 multiply-adds)
  • Higher radices reuse computations but do not reduce the amount of arithmetic needed for a result
  • Fused multiply-add instructions are more accurate than a multiply followed by an add (see the example below)
  • Radix 2 will give the best accuracy
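A small demonstration of the fused multiply-add claim, with values chosen by hand to expose the difference (compile with contraction disabled, e.g. -ffp-contract=off, so the first expression really rounds twice):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f + 0x1.0p-12f;       /* exactly representable        */
        float c = -(1.0f + 0x1.0p-11f);    /* -(rounded value of a*a)      */
        float two_rounds = a * a + c;      /* product rounds away the
                                              2^-24 term, so this is 0.0   */
        float one_round  = fmaf(a, a, c);  /* single rounding keeps it:
                                              result is 2^-24              */
        printf("mul+add: %a   fma: %a\n", two_rounds, one_round);
        return 0;
    }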
SLIDE 22

Outline

  • Introduction
  • Technical Challenges
  • Design
  • Performance
  • Summary
SLIDE 23

Estimating Performance

  • Timing estimates typically cannot include all factors
      – Computation estimates are based on the minimum number of instructions
      – I/O timings are based on bus speed and the amount of data
      – Experience is a guide for the efficiency of I/O and computation

I/O estimate:

  • Bytes transferred to/from XDR memory: 33,560,192 (minimum)
  • Bus speed to XDR: 25.3 GB/s
  • Estimated efficiency: 80%
  • Minimum I/O time: 1.7 ms

Computation estimate:

  • Number of operations: 104,857,600
  • Maximum rate: 205 GFLOPS
  • Estimated efficiency: 85%
  • Minimum computation time: 0.6 ms (8 SPEs)

1M FFT estimate (without full ordering): 2 ms
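The arithmetic behind these numbers, as a check (104,857,600 is 5 N log2 N for N = 2^20, and 205 GFLOPS is the peak rate of 8 SPEs at 3.2 GHz):

    #include <stdio.h>

    int main(void)
    {
        double bytes  = 33560192.0;     /* minimum XDR traffic, bytes */
        double bw     = 25.3e9;         /* XDR bandwidth, bytes/s     */
        double io_eff = 0.80;

        double ops    = 104857600.0;    /* 5 * 2^20 * 20 operations   */
        double peak   = 205e9;          /* peak FLOPS, 8 SPEs         */
        double c_eff  = 0.85;

        printf("I/O time:     %.2f ms\n", 1e3 * bytes / (bw * io_eff));
        printf("compute time: %.2f ms\n", 1e3 * ops / (peak * c_eff));
        return 0;   /* prints ~1.7 ms and ~0.6 ms, matching the slide */
    }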

SLIDE 24

Preliminary Timing Results

  • Timing results are close to predictions
      – 4 SPEs: about a factor of 4 from prediction
      – 8 and 16 SPEs: closer to prediction
  • Timings were performed on Mercury CTES
      – QS21 dual-Cell blades @ 3.2 GHz

[Plot: 1M FFT timings; time (0.001 – 0.009 seconds) vs. number of SPEs (2 – 18)]

SLIDE 25

Summary

  • A good FFT design must consider the hardware features
      – Optimize memory accesses
      – Understand how different algorithms map to the hardware
  • The design needs to be flexible in its approach
      – “One size fits all” isn't always the best choice
      – Size will be a factor
  • Estimates of minimum time should be based on the hardware characteristics
  • A 1M-point FFT is difficult to write, but possible