

SLIDE 1

Large Multicore FFTs: Approaches to Optimization

Sharon Sacco and James Geraci
MIT Lincoln Laboratory

24 September 2008

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

SLIDE 2

Outline

  • Introduction
      – 1D Fourier Transform
      – Mapping 1D FFTs onto Cell
      – 1D as 2D Traditional Approach
  • Technical Challenges
  • Design
  • Performance
  • Summary
SLIDE 3

1D Fourier Transform

g_j = Σ_{k=0}^{N−1} f_k e^{−2πi jk/N}

  • This is a simple equation
  • A few people spend a lot of their careers trying to make it run fast
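A minimal C sketch of the sum above, evaluated directly in O(N²) time (an illustration only, not from the talk; the whole point of an FFT is to compute the same result in O(N log N)):

    #include <complex.h>
    #include <math.h>

    /* Direct O(N^2) evaluation of g_j = sum_k f_k e^(-2*pi*i*j*k/N). */
    void dft(const double complex *f, double complex *g, int n)
    {
        const double pi = 3.14159265358979323846;
        for (int j = 0; j < n; j++) {
            double complex sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += f[k] * cexp(-2.0 * pi * I * (double)j * (double)k / n);
            g[j] = sum;
        }
    }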

SLIDE 4

Mapping 1D FFT onto Cell

  • Cell FFTs can be classified by memory requirements
      – Small FFTs fit into a single SPE's local store (LS); 4096 points is the largest size
      – Medium FFTs fit into multiple local stores; 65536 points is the largest size
      – Large FFTs must use XDR memory as well as LS memory
  • Medium and large FFTs require careful memory transfers

[Figure: FFT data placement for the three size classes]
SLIDE 5

1D as 2D Traditional Approach

[Figure: 16 points viewed as a 4 × 4 matrix, with the matrix of central twiddle factors w^0 … w^9]

  1. Corner turn to compact columns
  2. FFT on columns
  3. Corner turn to original orientation
  4. Multiply (elementwise) by central twiddles
  5. FFT on rows
  6. Corner turn to correct data order

  • 1D as 2D FFT reorganizes data a lot
      – Timing jumps when it is used
  • Can reduce memory for twiddle tables
  • Only one FFT needed (see the sketch of the six steps below)
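A compact sketch of these six steps, assuming a hypothetical in-place 1D routine fft1d() and data stored row-major as rows × cols (the corner turns are written as plain transposes; a real Cell implementation would block them into DMA-sized tiles):

    #include <complex.h>

    void fft1d(double complex *x, int n);   /* assumed: in-place 1D FFT */

    static void transpose(const double complex *in, double complex *out,
                          int rows, int cols)
    {
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                out[c * rows + r] = in[r * cols + c];
    }

    /* x holds rows*cols points row-major; tmp is scratch of the same
     * size. The final corner turn leaves the result in tmp. */
    void fft_1d_as_2d(double complex *x, double complex *tmp,
                      int rows, int cols)
    {
        const double pi = 3.14159265358979323846;
        long n = (long)rows * cols;

        transpose(x, tmp, rows, cols);             /* 1. corner turn      */
        for (int c = 0; c < cols; c++)
            fft1d(&tmp[(long)c * rows], rows);     /* 2. FFT on columns   */
        transpose(tmp, x, cols, rows);             /* 3. corner turn back */
        for (int r = 0; r < rows; r++)             /* 4. central twiddles */
            for (int c = 0; c < cols; c++)
                x[(long)r * cols + c] *=
                    cexp(-2.0 * pi * I * (double)r * c / (double)n);
        for (int r = 0; r < rows; r++)
            fft1d(&x[(long)r * cols], cols);       /* 5. FFT on rows      */
        transpose(x, tmp, rows, cols);             /* 6. corner turn      */
    }

With this index convention, element (r, c) is scaled by w^(r·c) = e^(−2πi·rc/N), matching the twiddle matrix in the figure above.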
SLIDE 6

Outline

  • Introduction
  • Technical Challenges
      – Communications
      – Memory
      – Cell Rounding
  • Design
  • Performance
  • Summary
SLIDE 7

Communications

Bandwidth to XDR memory is 25.3 GB/s; each SPE's connection to the EIB is 50 GB/s, and total EIB bandwidth is 96 bytes / cycle.

  • Minimizing XDR memory accesses is critical
  • Leverage the EIB
  • Coordinating SPE communication is desirable
      – Need to know SPE relative geometry

SLIDE 8

Memory

XDR memory is much larger than a 1M-point FFT requires. Each SPE has 256 KB of local store; each Cell has 2 MB of local store in total.

  • Need to rethink algorithms to leverage the memory
      – Consider local store both from the individual and the collective SPE point of view

SLIDE 9

Cell Rounding

  • The cost to correct the basic binary operations (add, subtract, multiply) is prohibitive
  • Accuracy should instead be improved by minimizing the number of steps the algorithm takes to produce a result

[Figure: rounding of the low bits b00, b01, b10 under IEEE 754 round-to-nearest vs. Cell truncation; the average values produced by the two modes differ by 0.5 bit.]
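A toy C program (an illustration, not Cell-specific) showing the bias the figure describes: truncation loses half a unit in the last place on average, while round-to-nearest is unbiased:

    #include <stdio.h>

    int main(void)
    {
        double trunc_err = 0.0, nearest_err = 0.0;
        int n = 1 << 16;
        for (int i = 0; i < n; i++) {
            double x = i / (double)n;                 /* fraction rounded away */
            trunc_err   += 0.0 - x;                   /* truncation drops x    */
            nearest_err += (x < 0.5 ? -x : 1.0 - x);  /* nearest: |err| <= 0.5 */
        }
        printf("mean truncation error: %f ulp\n", trunc_err / n);   /* ~ -0.5 */
        printf("mean nearest error:    %f ulp\n", nearest_err / n); /* ~  0.0 */
        return 0;
    }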
SLIDE 10

Outline

  • Introduction
  • Technical Challenges
  • Design
      – Using Memory Well
      – Reducing Memory Accesses
      – Distributing on SPEs
      – Bit Reversal
      – Complex Format
      – Computational Considerations
  • Performance
  • Summary
SLIDE 11

FFT Signal Flow Diagram and Terminology

[Figure: 16-point decimation-in-frequency signal flow diagram, with a butterfly, a block, and a radix-2 stage labeled]

  • Size 16 can illustrate concepts for large FFTs
      – Ideas scale well and it is “drawable”
  • This is the “decimation in frequency” data flow
  • Where the weights are applied determines the algorithm

SLIDE 12

Reducing Memory Accesses

  • Columns are loaded in strips that fit in the total Cell local store
  • The FFT algorithm processes 4 columns at a time to leverage the SIMD registers
      – Requires code separate from the row FFTs
  • Data reorganization requires SPE-to-SPE DMAs
  • No bit reversal is needed

[Figure: 1M points viewed as a 1024 × 1024 matrix; a strip of 64 columns is loaded and processed 4 columns at a time]

SLIDE 13

1D FFT Distribution with Single Reorganization

[Figure: 16-point signal flow with a single reorganization partway through the stages]

  • One approach is to load everything onto a single SPE to do the first part of the computation
  • After a single reorganization, each SPE owns an entire block and can complete the computations on its points

SLIDE 14

1D FFT Distribution with Multiple Reorganizations

[Figure: 16-point signal flow with a reorganization after each of the early stages]

  • A second approach is to divide groups of contiguous butterflies among SPEs and reorganize after each stage until the SPEs own a full block

SLIDE 15

Selecting the Preferred Reorganization

                   Single Reorganization       Multiple Reorganizations
  Number of SPEs   Exchanges   Data in 1 DMA   Exchanges   Data in 1 DMA
        2              2           N / 4           2           N / 4
        4             12           N / 16          8           N / 8
        8             56           N / 64         24           N / 16

  • Evaluation favors multiple reorganizations
      – Fewer DMAs mean less bus contention (single reorganization exceeds the number of busses)
      – DMA overhead (~0.3 μs) is minimized
      – Programming is simpler for multiple reorganizations

N is the number of elements in SPE memory; P is the number of SPEs. Typical N is 32k complex elements.

Single reorganization:
  • Number of exchanges: P (P − 1)
  • Number of elements exchanged: N (P − 1) / P

Multiple reorganizations:
  • Number of exchanges: P log2(P)
  • Number of elements exchanged: (N / 2) log2(P)
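The table entries follow from these formulas; a quick C check (the per-DMA sizes N/P² for the single scheme and N/(2P) for the multiple scheme are inferred from the table, not stated on the slide):

    #include <stdio.h>

    /* Exchange counts and per-DMA sizes for P = 2, 4, 8 SPEs. */
    int main(void)
    {
        for (long p = 2; p <= 8; p *= 2) {
            int log2p = 0;
            for (long t = p; t > 1; t >>= 1) log2p++;
            printf("P=%ld  single: %2ld exchanges of N/%-2ld   "
                   "multiple: %2ld exchanges of N/%ld\n",
                   p, p * (p - 1), p * p,          /* P(P-1) of N/P^2      */
                   p * (long)log2p, 2 * p);        /* P log2(P) of N/(2P)  */
        }
        return 0;
    }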

SLIDE 16

Column Bit Reversal

  • Bit reversal of columns can be implemented through the order in which rows are processed, together with double buffering
  • Reversal row pairs are both read into local store and then written to each other's memory location

[Figure: binary row numbers 000000001 and 100000000 form a bit-reversal pair]

  • Exchanging rows for bit reversal has a low cost
  • DMA addresses are table driven
  • Bit reversal table can be very small
  • Row FFTs are conventional 1D FFTs
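A sketch of the table-driven row exchange, assuming hypothetical DMA helpers dma_read_row() and dma_write_row() and a power-of-two row count; only pairs with j > i are touched, so each pair is exchanged exactly once and self-reversing rows stay in place:

    #include <stdint.h>

    /* Hypothetical DMA helpers; on Cell these would be double-buffered
     * local-store <-> XDR transfers. */
    extern void dma_read_row(uint32_t row);
    extern void dma_write_row(uint32_t dst_row, uint32_t src_row);

    static uint32_t bit_reverse(uint32_t x, int bits)
    {
        uint32_t r = 0;
        for (int i = 0; i < bits; i++) {
            r = (r << 1) | (x & 1u);
            x >>= 1;
        }
        return r;
    }

    /* rows == 1 << bits; swap each row with its bit-reversed partner */
    void bit_reverse_rows(int bits)
    {
        uint32_t rows = 1u << bits;
        for (uint32_t i = 0; i < rows; i++) {
            uint32_t j = bit_reverse(i, bits);
            if (j > i) {
                dma_read_row(i);         /* both rows into local store */
                dma_read_row(j);
                dma_write_row(j, i);     /* write each row to the      */
                dma_write_row(i, j);     /* other's XDR location       */
            }
        }
    }

In a real implementation the bit_reverse() loop would be replaced by the small lookup table the slide mentions.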
SLIDE 17

Complex Format

  • Two common formats for complex data
      – interleaved: real 0, imag 0, real 1, imag 1, …
      – split: real 0, real 1, …, imag 0, imag 1, …
  • Interleaved complex format reduces the number of DMAs
  • SIMD units need split format for complex arithmetic
  • The complex format seen by the user should be standard
      – The internal format is opaque to the user
  • The internal format should benefit the algorithm
  • Internal format conversion is lightweight (see the sketch below)
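A minimal sketch of the conversion (scalar C rather than the SIMD intrinsics an SPE version would use):

    #include <stddef.h>

    /* interleaved (re0, im0, re1, im1, ...) -> split (all re, then all im) */
    void interleaved_to_split(const float *in, float *re, float *im, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            re[k] = in[2 * k];
            im[k] = in[2 * k + 1];
        }
    }

    /* split -> interleaved, for returning results in the standard format */
    void split_to_interleaved(const float *re, const float *im,
                              float *out, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            out[2 * k]     = re[k];
            out[2 * k + 1] = im[k];
        }
    }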

SLIDE 18

Outline

  • Introduction
  • Technical Challenges
  • Design
      – Using Memory Well
      – Computational Considerations
      – Central Twiddles
      – Algorithm Choice
  • Performance
  • Summary
SLIDE 19

Central Twiddles

  • Central twiddles are a significant part of the design
  • Central twiddles can take as much memory as the input data
  • Reading them from memory could increase FFT time by up to 20%
  • For 32-bit FFTs, central twiddles can be computed as needed
      – Trigonometric identity methods require double precision (the next-generation Cell should make this the method of choice)
      – Direct sine and cosine algorithms are long

[Figure: central twiddles for a 1M FFT form a 1024 × 1024 matrix of powers w^0 … w^(1023 · 1023)]
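One on-the-fly option, sketched with double-precision library sin/cos (the trig-identity method the slide prefers would replace these calls):

    #include <complex.h>
    #include <math.h>

    /* Central twiddle w^(r*c) = e^(-2*pi*i*r*c/N) for an FFT split as
     * rows x cols, computed on demand instead of stored as a 1K x 1K table. */
    static double complex central_twiddle(long r, long c, long n)
    {
        double angle = -2.0 * 3.14159265358979323846 *
                       (double)(r * c) / (double)n;
        return cos(angle) + I * sin(angle);
    }

Doing the angle computation in double precision keeps the twiddles accurate to full single precision even for exponents near 1023 · 1023.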

SLIDE 20

Algorithm Choice

  • Cooley-Tukey has a constant operation count
      – 2 multiply-adds to compute each result in each stage
  • Gentleman-Sande varies widely in operation count
      – 1 – 3 operations for each result
  • The DC term has the same accuracy in both
  • The worst Gentleman-Sande term has 50% more roundoff error when fused multiply-add is available

Computational butterflies:

  Cooley-Tukey:       a, b  →  a + b · w^k,   a − b · w^k
  Gentleman-Sande:    a, b  →  a + b,         (a − b) · w^k
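The two butterflies in C, in complex scalar form (on the SPE each line would be SIMD multiply-adds over split-format data):

    #include <complex.h>

    /* Cooley-Tukey: twiddle applied before the add/subtract,
     * so each output maps onto multiply-adds. */
    void ct_butterfly(float complex *a, float complex *b, float complex wk)
    {
        float complex t = *b * wk;
        *b = *a - t;                  /* a - b*w^k */
        *a = *a + t;                  /* a + b*w^k */
    }

    /* Gentleman-Sande: twiddle applied after the subtract. */
    void gs_butterfly(float complex *a, float complex *b, float complex wk)
    {
        float complex t = *a - *b;
        *a = *a + *b;                 /* a + b         */
        *b = t * wk;                  /* (a - b) * w^k */
    }

In the Gentleman-Sande form the (a − b) · w^k path is an add followed by a multiply, which is where the extra roundoff on the worst term comes from.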

SLIDE 21

Radix Choice

  • What counts for accuracy is how many operations lie between the input and a particular result

Cooley-Tukey radix 4:

  t1 = xl · w^z        t2 = xk · w^(z/2)     t3 = xm · w^(3z/2)
  s0 = xj + t1         a0 = xj − t1
  s1 = t2 + t3         a1 = t2 − t3
  yj = s0 + s1         yk = s0 − s1
  yl = a0 − i · a1     ym = a0 + i · a1

  • Number of operations for 1 radix-4 stage (real or imaginary): 9 (3 multiplies, 3 multiply-adds, 3 adds)
  • Number of operations for 2 radix-2 stages (real or imaginary): 6 (6 multiply-adds)
  • Higher radices reuse computations but do not reduce the amount of arithmetic needed for a result
  • Fused multiply-add instructions are more accurate than a multiply followed by an add (see the example below)
  • Radix 2 will give the best accuracy
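A small demonstration of the fused multiply-add claim, with values chosen by hand to expose the difference (compile with contraction disabled, e.g. -ffp-contract=off, so the first expression really rounds twice):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f + 0x1.0p-12f;       /* exactly representable        */
        float c = -(1.0f + 0x1.0p-11f);    /* -(rounded value of a*a)      */
        float two_rounds = a * a + c;      /* product rounds away the
                                              2^-24 term, so this is 0.0   */
        float one_round  = fmaf(a, a, c);  /* single rounding keeps it:
                                              result is 2^-24              */
        printf("mul+add: %a   fma: %a\n", two_rounds, one_round);
        return 0;
    }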
SLIDE 22

Outline

  • Introduction
  • Technical Challenges
  • Design
  • Performance
  • Summary
SLIDE 23

Estimating Performance

  • Timing estimates typically cannot include all factors
      – Computation estimates are based on the minimum number of instructions
      – I/O timings are based on bus speed and the amount of data
      – Experience is a guide for the efficiency of I/O and computation

I/O estimate:

  • Bytes transferred to/from XDR memory: 33,560,192 (minimum)
  • Bus speed to XDR: 25.3 GB/s
  • Estimated efficiency: 80%
  • Minimum I/O time: 1.7 ms

Computation estimate:

  • Number of operations: 104,857,600
  • Maximum rate: 205 GFLOPS
  • Estimated efficiency: 85%
  • Minimum computation time: 0.6 ms (8 SPEs)

1M FFT estimate (without full ordering): 2 ms
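The arithmetic behind these numbers, as a check (104,857,600 is 5 N log2 N for N = 2^20, and 205 GFLOPS is the peak rate of 8 SPEs at 3.2 GHz):

    #include <stdio.h>

    int main(void)
    {
        double bytes  = 33560192.0;     /* minimum XDR traffic, bytes */
        double bw     = 25.3e9;         /* XDR bandwidth, bytes/s     */
        double io_eff = 0.80;

        double ops    = 104857600.0;    /* 5 * 2^20 * 20 operations   */
        double peak   = 205e9;          /* peak FLOPS, 8 SPEs         */
        double c_eff  = 0.85;

        printf("I/O time:     %.2f ms\n", 1e3 * bytes / (bw * io_eff));
        printf("compute time: %.2f ms\n", 1e3 * ops / (peak * c_eff));
        return 0;   /* prints ~1.7 ms and ~0.6 ms, matching the slide */
    }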

SLIDE 24

Preliminary Timing Results

  • Timing results are close to predictions
      – 4 SPEs: about a factor of 4 from prediction
      – 8 and 16 SPEs: closer to prediction
  • Timings were performed on Mercury CTES
      – QS21 dual-Cell blades @ 3.2 GHz

[Plot: 1M FFT timings; time (0.001 – 0.009 seconds) vs. number of SPEs (2 – 18)]

SLIDE 25

Summary

  • A good FFT design must consider the hardware features
      – Optimize memory accesses
      – Understand how different algorithms map to the hardware
  • The design needs to be flexible in its approach
      – “One size fits all” isn't always the best choice
      – Size will be a factor
  • Estimates of minimum time should be based on the hardware characteristics
  • A 1M-point FFT is difficult to write, but possible