Large Multicore FFTs: Approaches to Optimization


  1. Large Multicore FFTs: Approaches to Optimization Sharon Sacco and James Geraci 24 September 2008 This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government MIT Lincoln Laboratory HPEC 2008-1 SMHS 9/24/2008

  2. Outline
     • Introduction
     • 1D Fourier Transform
     • Mapping 1D FFTs onto Cell
     • 1D as 2D Traditional Approach
     • Technical Challenges
     • Design
     • Performance
     • Summary

  3. 1D Fourier Transform

     g_j = Σ_{k=0}^{N-1} f_k e^(-2πi jk/N)

     • This is a simple equation
     • A few people spend a lot of their careers trying to make it run fast
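The sum above can be evaluated directly; a minimal, unoptimized Python sketch (the function name is illustrative, not from the presentation):

```python
import cmath

def dft(f):
    # Direct evaluation of g_j = sum_{k=0}^{N-1} f_k * exp(-2*pi*i*j*k/N).
    # O(N^2) work; an FFT computes the same result in O(N log N).
    n = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]
```

Making this sum run fast on real hardware is exactly what the rest of the presentation is about.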

  4. Mapping 1D FFT onto Cell
     • Cell FFTs can be classified by memory requirements
       – Small FFTs can fit into a single LS memory; 4096 is the largest size
       – Medium FFTs can fit into multiple LS memories; 65536 is the largest size
       – Large FFTs must use XDR memory as well as LS memory
     • Medium and large FFTs require careful memory transfers

  5. 1D as 2D Traditional Approach
     [Figure: 4 × 4 example on data 0-15 with the central twiddle matrix w^(jk)]
     1. Corner turn to compact columns
     2. FFT on columns
     3. Corner turn to original orientation
     4. Multiply (elementwise) by central twiddles
     5. FFT on rows
     6. Corner turn to correct data order
     • 1D as 2D FFT reorganizes data a lot
       – Timing jumps when used
     • Can reduce memory for twiddle tables
     • Only one FFT needed
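The six-step flow can be sketched in a few lines of Python; this is an illustration of the technique, not the presenters' Cell code. A direct O(N²) DFT stands in for the optimized row/column FFTs, and the corner turns are folded into the indexing:

```python
import cmath

def dft(x):
    # Direct DFT; stands in for an optimized row/column FFT.
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_1d_as_2d(x, rows, cols):
    # 1D-as-2D FFT of length rows*cols: view the input as a rows x cols
    # matrix, FFT the columns, multiply by central twiddles w^(r*c),
    # FFT the rows, and read the result out transposed.
    n = rows * cols
    a = [[x[r * cols + c] for c in range(cols)] for r in range(rows)]
    for c in range(cols):                      # FFT on columns
        col = dft([a[r][c] for r in range(rows)])
        for r in range(rows):
            a[r][c] = col[r]
    for r in range(rows):                      # central twiddles, w = e^(-2*pi*i/n)
        for c in range(cols):
            a[r][c] *= cmath.exp(-2j * cmath.pi * r * c / n)
    for r in range(rows):                      # FFT on rows
        a[r] = dft(a[r])
    # final corner turn: output index j = c * rows + r
    return [a[r][c] for c in range(cols) for r in range(rows)]
```

Only one (small) FFT routine is needed, as the slide notes, at the cost of the data reorganizations.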

  6. Outline
     • Introduction
     • Technical Challenges
       – Communications
       – Memory
       – Cell Rounding
     • Design
     • Performance
     • Summary

  7. Communications
     • SPE connection to EIB is 50 GB/s
     • Bandwidth to XDR memory is 25.3 GB/s
     • EIB bandwidth is 96 bytes / cycle
     • Minimizing XDR memory accesses is critical
     • Leverage EIB
     • Coordinating SPE communication is desirable
       – Need to know SPE relative geometry

  8. Memory
     • Each SPE has 256 KB local store memory
     • Each Cell has 2 MB local store memory
     • XDR memory is much larger than 1M pt FFT requirements
     • Need to rethink algorithms to leverage the total memory
       – Consider local store both from individual and collective SPE point of view

  9. Cell Rounding
     [Figure: values between b00 and b10 under IEEE 754 round to nearest vs. Cell truncation]
     • IEEE 754 round to nearest: average value is x01 + 0 bits
     • Cell (truncation): average value is x01 + .5 bit
     • The cost to correct basic binary operations (add, multiply, and subtract) is prohibitive
     • Accuracy should be improved by minimizing steps to produce a result in the algorithm
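The bias difference is easy to see numerically. A small illustration (not Cell-exact arithmetic): quantize dyadic fractions to 3 bits by truncation vs. round-to-nearest and compare the average error:

```python
def truncate(x, bits):
    # Drop low-order bits (round toward zero for x >= 0), like Cell single precision.
    scale = 1 << bits
    return int(x * scale) / scale

def round_nearest(x, bits):
    # IEEE 754 style round to nearest.
    scale = 1 << bits
    return round(x * scale) / scale

samples = [i / 256 for i in range(256)]
trunc_bias = sum(truncate(x, 3) - x for x in samples) / len(samples)
nearest_bias = sum(round_nearest(x, 3) - x for x in samples) / len(samples)
# Truncation is biased low by about half a unit in the last place;
# round to nearest is essentially unbiased.
```

This is why the slide recommends minimizing the number of operations rather than correcting each one.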

  10. Outline
     • Introduction
     • Technical Challenges
     • Design
       – Using Memory Well
         · Reducing Memory Accesses
         · Distributing on SPEs
         · Bit Reversal
         · Complex Format
       – Computational Considerations
     • Performance
     • Summary

  11. FFT Signal Flow Diagram and Terminology
     [Figure: size-16 signal flow labeling a radix 2 stage, a butterfly, and a block;
      inputs in natural order 0-15, outputs in bit-reversed order
      0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
     • Size 16 can illustrate concepts for large FFTs
       – Ideas scale well and it is “drawable”
     • This is the “decimation in frequency” data flow
     • Where the weights are applied determines the algorithm
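A decimation-in-frequency radix-2 FFT matching this signal flow can be sketched as follows (a generic textbook DIF kernel, not the presenters' SPE code); input is in natural order and the output is left in bit-reversed order, as in the diagram:

```python
import cmath

def fft_dif(x):
    # In-place radix-2 decimation-in-frequency FFT.
    # Input in natural order; output left in bit-reversed order.
    n = len(x)
    span = n // 2
    while span >= 1:                          # one radix-2 stage per iteration
        for start in range(0, n, 2 * span):   # each block in the stage
            for k in range(span):             # each butterfly in the block
                i, j = start + k, start + k + span
                a, b = x[i], x[j]
                x[i] = a + b
                # DIF applies the weight after the subtraction
                x[j] = (a - b) * cmath.exp(-2j * cmath.pi * k / (2 * span))
        span //= 2
    return x
```

Moving where the weight multiplication happens (before the butterfly, after it, or deferred) is what distinguishes the algorithm variants the slide alludes to.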

  12. Reducing Memory Accesses
     [Figure: 1024 × 1024 data array; column strips of width 64, processed 4 columns at a time]
     • Columns will be loaded in strips that fit in the total Cell local store
     • FFT algorithm processes 4 columns at a time to leverage SIMD registers
     • Requires separate code from row FFTs
     • Data reorganization requires SPE to SPE DMAs
     • No bit reversal

  13. 1D FFT Distribution with Single Reorganization
     [Figure: size-16 signal flow with a single reorganization between the early stages
      and the per-SPE blocks]
     • One approach is to load everything onto a single SPE to do the first part of the computation
     • After a single reorganization each SPE owns an entire block and can complete the
       computations on its points

  14. 1D FFT Distribution with Multiple Reorganizations
     [Figure: size-16 signal flow with a reorganization after each early stage]
     • A second approach is to divide groups of contiguous butterflies among SPEs and
       reorganize after each stage until the SPEs own a full block

  15. Selecting the Preferred Reorganization
     N = the number of elements in SPE memory, P = number of SPEs; typical N is 32k
     complex elements

                                   Single Reorganization   Multiple Reorganizations
     Number of exchanges           P * (P - 1)             P * log2(P)
     Number of elements exchanged  N * (P - 1) / P         (N / 2) * log2(P)

     SPEs   Exchanges   Data Moved in 1 DMA   Exchanges   Data Moved in 1 DMA
     2      2           N / 4                 2           N / 4
     4      12          N / 16                8           N / 8
     8      56          N / 64                24          N / 16

     • Evaluation favors multiple reorganizations
       – Fewer DMAs have less bus contention (single reorganization exceeds the
         number of busses)
       – DMA overhead (~ .3 μs) is minimized
       – Programming is simpler for multiple reorganizations

  16. Column Bit Reversal
     [Figure: binary row numbers; row 000000001 exchanges with row 100000000]
     • Bit reversal of columns can be implemented by the order of processing rows
       and double buffering
     • Reversal row pairs are both read into local store and then written to each
       other's memory location
     • Exchanging rows for bit reversal has a low cost
     • DMA addresses are table driven
     • Bit reversal table can be very small
     • Row FFTs are conventional 1D FFTs
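The driving table can be built in a few lines; a sketch of the idea (illustrative names, with the pair list standing in for the DMA address table):

```python
def bit_reverse(i, bits):
    # Reverse the low `bits` bits of i, e.g. 000000001 <-> 100000000.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def reversal_pairs(bits):
    # Row pairs to exchange; rows equal to their own reversal stay in place,
    # so the table is much smaller than the row count.
    pairs = []
    for i in range(1 << bits):
        j = bit_reverse(i, bits)
        if i < j:                 # list each pair once
            pairs.append((i, j))
    return pairs
```

Each listed pair is read into local store and written back to the other row's location, so the bit reversal costs little beyond the row transfers already being made.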

  17. Complex Format
     • Two common formats for complex
       – interleaved: real0 imag0 real1 imag1 ...
       – split: real0 real1 ... | imag0 imag1 ...
     • Complex format for user should be standard
       – Interleaved complex format reduces number of DMAs
     • Internal format should benefit the algorithm
       – Internal format is opaque to user
       – Internal format conversion is lightweight
       – SIMD units need split format for complex arithmetic
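The conversion between the two layouts is indeed lightweight; a minimal sketch (illustrative names, plain lists standing in for SIMD buffers):

```python
def interleaved_to_split(buf):
    # [r0, i0, r1, i1, ...] -> ([r0, r1, ...], [i0, i1, ...])
    return buf[0::2], buf[1::2]

def split_to_interleaved(re, im):
    # ([r0, r1, ...], [i0, i1, ...]) -> [r0, i0, r1, i1, ...]
    out = []
    for r, i in zip(re, im):
        out.extend((r, i))
    return out
```

The user-facing buffers stay interleaved (fewer, larger DMAs), while the split form used internally keeps all real parts in one SIMD register and all imaginary parts in another.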

  18. Outline
     • Introduction
     • Technical Challenges
     • Design
       – Using Memory Well
       – Computational Considerations
         · Central Twiddles
         · Algorithm Choice
     • Performance
     • Summary

  19. Central Twiddles
     [Figure: central twiddle matrix for a 1M FFT, from w^0 to w^(1023 * 1023)]
     • Central twiddles can take as much memory as the input data
     • Reading from memory could increase FFT time up to 20%
     • For 32-bit FFTs central twiddles can be computed as needed
       – Trigonometric identity methods require double precision; next generation
         Cell should make this the method of choice
       – Direct sine and cosine algorithms are long
     • Central twiddles are a significant part of the design
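Computing a central twiddle on demand is a one-liner in principle; a sketch of the idea (using the library sine/cosine, not the presenters' SPE-resident algorithms):

```python
import cmath

def central_twiddle(r, c, n):
    # w^(r*c) with w = exp(-2*pi*i/n); for a 1M-point FFT, n = 1024 * 1024.
    # Reducing r*c mod n keeps the argument small before the complex exponential.
    return cmath.exp(-2j * cmath.pi * ((r * c) % n) / n)
```

The storage trade-off is what the slide is weighing: a full 1024 × 1024 table of these values is as large as the input data, while on-the-fly computation costs a sine/cosine evaluation per element.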
