
Fast Transforms using the Cell/B.E. Processor - PowerPoint PPT Presentation

Fast Transforms using the Cell/B.E. Processor. David A. Bader, joint work with Seunghwa Kang and Virat Agarwal. Sony-Toshiba-IBM Center of


  1. Fast Transforms using the Cell/B.E. Processor. David A. Bader, joint work with Seunghwa Kang and Virat Agarwal

  2. Sony-Toshiba-IBM Center of Competence for the Cell/B.E. at Georgia Tech
     Mission: grow the community of Cell Broadband Engine users and developers
     • Fall 2006: Georgia Tech wins competition for hosting the STI Center
     • First publicly-available IBM QS20 Cluster
     • 200 attendees at 2007 STI Workshop
     • Multicore curriculum and training
     • Demonstrated performance on
       – Multimedia and gaming
       – Scientific computing
       – Medical applications
       – Financial services
     David A. Bader, Director http://sti.cc.gatech.edu

  3. Applications
     • CellBuzz: freely available, open-source libraries optimized for the Cell/B.E. http://sourceforge.net/projects/cellbuzz/
       – ZLIB & GZIP: data compression
       – FFT: fast Fourier transform
       – RC5: encryption
       – MPEG-2: video encoding and decoding
       – JPEG2000: digital content processing
     • Financial Modeling

  4. Cell/B.E. Libraries: FFT and JPEG2000
     • FFTC: Fastest Fourier Transform on the Cell/B.E.
       – 1-Dimensional single precision DIF-FFT optimized for 1K-16K complex input samples
       – Parallelize and optimize the computation of a single FFT
       – Design high performance synchronization barrier using inter-SPE communication
       – Demonstrated superior performance of 18.6 GFlop/s for 8K complex input samples.
     • JPEG2000 on the Cell/B.E.
       – Optimize coding/decoding by data decomposition / data alignment / vectorization
       – Demonstrated average speedup of 3.1 over Intel 3.2 GHz Pentium-4
     [Figure: Butterflies of ordered DIF FFT]
     [Chart: GigaFlop/s (axis 0 to 25) versus input size (1024, 2048, 4096, 8192, 16384). Legend labels: IBM Power5, AMD Opteron, Intel Pentium 4, FFTW on Cell, Our implementation (8 SPEs), Intel Core Duo; annotation: FFTC]
     The source code is freely available from our CellBuzz project in SourceForge: http://sourceforge.net/projects/cellbuzz/

  5. Cell/B.E. Libraries: ZLIB and MPEG-2
     • ZLIB data compression & decompression library
       – Vectorize compute-intensive kernels and parallelize to run on multiple SPEs
       – Extend the gzip header format while maintaining compatibility with legacy gzip decompressors
       – Demonstrated speedup of 2.9 over high-end Intel Pentium-4 system
     • MPEG-2 video decoding
       – First parallelization of a multimedia application on Cell/B.E.
       – Demonstrated a speedup of 3 over Intel 3.2GHz Xeon.
     The source code is freely available from our CellBuzz project in SourceForge: http://sourceforge.net/projects/cellbuzz/

  6. Using the Cell/B.E. in Aircraft Health Monitoring
     “Retired Marine Lt. Gen. Bernard Trainor said the issue of aging aircraft is a constant complaint of all branches of service.” Atlanta Journal Constitution, April 27, 2002
     • Fault Diagnosis
       – Estimate the crack length without disassembly based on vibration data collected from multiple sensors.
     • Failure Prognosis
       – Estimate the expected time before crash

  7. System-of-Systems Decomposition
     [Diagram labels, in extraction order: Power and Cooling Turbo Machine Life, Oil; Actuator Leakage; Engine Condition, Oil Servicing and Wear and Filter Condition; Generator Oil Level; Hydraulic Filters, Pump, and Hydraulic Fluid Level; Battery; Oxygen Generator; Nitrogen Generator and Filter; Landing Gear and Arresting Hook Structure fatigue life; Landing Gear Strut; Rotary Actuator Wear; Pressure and Fluid Level]

  8. Overview of the Diagnosis and Prognosis Process
     [Block diagram. Online Modules: DAQ, Sensor Data, De-Noising, Preprocessed Data, Feature Extraction & Mapping, Features, Diagnosis, Prognosis, Particle Filter, Loading, Fault Growth, RUL. Offline Modules: Experimental Data, Simulated Data, De-Noising Techniques, Feature Extraction & Mapping Techniques, Flight Regime Data, Particle Filter & Noise Models, Model Parameter Tuning, Data Driven Methods, System Model for Diagnosis, System Model for Prognosis, Stress Table (Crack Length, K min, K max). Inset plots: Feature versus crack length.]
     Involves multiple computationally expensive modules!!!

  9. Fast Transforms on the Cell/B.E.
     • Fast Fourier Transform
     • Discrete Wavelet Transform

  10. FFTC: Fastest Fourier Transform for Cell/B.E.
     • Focus on medium size FFT computations
       – Complex single-precision 1-Dimensional FFT
     • Input samples and output results reside in main memory.
     • Radix 2, 3 and 5.
     • Optimized for 1K-16K input samples.
     • Focus on achieving high performance for the computation of a single FFT, rather than increasing throughput.

  11. Existing FFT Research on Cell/B.E.
     • [Williams et al., 2006] analyzed peak performance.
     • [Cico, Cooper and Greene, 2006] estimated 22.1 GFlop/s for an 8K complex 1D FFT that resides in the Local Store of one SPE.
       – 8 independent FFTs in the local stores of 8 SPEs give 176.8 GFlop/s.
     • [Chow, Fossum and Brokenshire, 2005] achieved 46.8 GFlop/s for a 16M complex FFT.
       – Highly specialized for this particular input size.
     • FFTW is a highly portable FFT library supporting various types, precisions and input sizes.
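The aggregate figure on this slide is simply eight times the single-SPE estimate. The sketch below reproduces that arithmetic and, as an assumption not stated on the slide, uses the conventional 5 N log2 N nominal flop count for a complex FFT to convert a GFlop/s rate into time per transform. The function names are illustrative only.

```python
def fft_flops(n):
    """Nominal flop count for an n-point complex FFT (n a power of two).

    Assumes the common 5 * n * log2(n) benchmarking convention; the slides
    do not say which flop count they use.
    """
    log2_n = n.bit_length() - 1
    return 5 * n * log2_n


def seconds_per_fft(n, gflops):
    """Time for one n-point FFT at the given sustained GFlop/s rate."""
    return fft_flops(n) / (gflops * 1e9)


def aggregate_gflops(per_spe_gflops, n_spes):
    """Throughput of n_spes independent FFTs, one per SPE."""
    return per_spe_gflops * n_spes


if __name__ == "__main__":
    print(aggregate_gflops(22.1, 8))      # eight independent 8K FFTs
    print(seconds_per_fft(8192, 22.1))    # seconds for one 8K FFT on one SPE
```

Under that convention an 8K FFT is 532,480 nominal flops, so 22.1 GFlop/s corresponds to roughly 24 microseconds per transform.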

  12. Our FFTC is based on Cooley-Tukey
     • Input is a one-dimensional vector of complex values.
     • Algorithm is iterative, no recursion.
     • Out-of-place approach is used.
     • Requires two arrays A and B for computation, one input and one output, that are swapped at every stage.
     • Out-of-place approach avoids data reordering after the last stage.
     • Algorithm requires log N stages. Each stage requires O(N) computation.
       – Complexity O(N log N)
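The properties listed on this slide (iterative, out of place, two arrays swapped every stage, no reordering pass at the end) are shared by the Stockham-style radix-2 FFT. The following is a minimal sketch of that textbook variant, not the FFTC source; the function name `fft_out_of_place` is hypothetical and only power-of-two sizes are handled.

```python
import cmath


def fft_out_of_place(x):
    """Radix-2 out-of-place FFT (Stockham-style); len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(x)          # input array for the current stage
    b = [0j] * n         # output array for the current stage
    half = n // 2        # half the length of each sub-transform
    stride = 1           # number of interleaved sub-transforms
    while half >= 1:     # log2(n) stages, O(n) work per stage
        for j in range(half):
            w = cmath.exp(-2j * cmath.pi * j / (2 * half))  # twiddle factor
            for k in range(stride):
                c0 = a[k + j * stride]
                c1 = a[k + j * stride + half * stride]
                b[k + 2 * j * stride] = c0 + c1
                b[k + 2 * j * stride + stride] = (c0 - c1) * w
        a, b = b, a      # swap roles of the two arrays
        half //= 2
        stride *= 2
    return a             # result is already in natural order


if __name__ == "__main__":
    print(fft_out_of_place([1, 0, 0, 0]))  # FFT of a unit impulse
```

Because each stage writes its results to their final relative positions in the other array, the output comes out in natural order and no bit-reversal pass is needed.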

  13. [Figure: one FFT stage, with labels "Stage begin", "Twiddle factors" and "Stage end"]

  14. Illustration of the Algorithm
     • Illustration of the algorithm for n=16 complex values.
     • Distance between pairs of output values doubles at every subsequent stage.
     • Shows how the output of one stage serves as the input to another.
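The doubling pattern can be listed explicitly. The sketch below assumes the output indexing of a Stockham-style out-of-place radix-2 FFT, since the slide gives no index formulas; `output_pairs` is an illustrative name.

```python
def output_pairs(n):
    """For each stage, return (distance, list of output index pairs).

    Assumes Stockham-style out-of-place indexing for a power-of-two n.
    """
    stages = []
    stride = 1
    while stride < n:
        half = n // (2 * stride)
        pairs = [(k + 2 * j * stride, k + 2 * j * stride + stride)
                 for j in range(half) for k in range(stride)]
        stages.append((stride, pairs))
        stride *= 2
    return stages


if __name__ == "__main__":
    for distance, pairs in output_pairs(16):
        print(distance, pairs[:4])  # first few output pairs of each stage
```

For n=16 there are four stages, and every stage writes each of the 16 output positions exactly once.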

  15. FFTC design on Cell/B.E.: Challenges
     • Synchronization step after every stage leads to significant overhead.
     • Reduce synchronization stages.
     • Design efficient barrier synchronization routine.
     • We will later describe an efficient tree-based synchronization algorithm based on inter-SPE communication.
     [Figure label: Insert synchronization barrier]
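The slide does not detail the tree-based barrier at this point, so the following is only a generic sketch of the idea: each worker waits for its children in a binary tree, notifies its parent, then waits to be released from the root downward. Python threads and semaphores stand in for SPEs and inter-SPE signalling, which is an assumption made for illustration; `TreeBarrier` and `run_stages` are hypothetical names.

```python
import threading


class TreeBarrier:
    """Reusable binary-tree barrier for a fixed number of workers."""

    def __init__(self, n):
        self.n = n
        self.arrive = [threading.Semaphore(0) for _ in range(n)]   # child -> parent
        self.release = [threading.Semaphore(0) for _ in range(n)]  # parent -> child

    def wait(self, rank):
        children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < self.n]
        for _ in children:
            self.arrive[rank].acquire()             # gather: wait for each child subtree
        if rank != 0:
            self.arrive[(rank - 1) // 2].release()  # report this subtree to the parent
            self.release[rank].acquire()            # wait for the release wave
        for c in children:
            self.release[c].release()               # propagate release downward


def run_stages(n_workers, n_stages):
    """Run workers through several barrier-separated stages; return observed counts."""
    barrier = TreeBarrier(n_workers)
    finished = [0] * n_stages
    lock = threading.Lock()
    observed = []

    def worker(rank):
        for s in range(n_stages):
            with lock:
                finished[s] += 1              # stand-in for this worker's stage work
            barrier.wait(rank)
            with lock:
                observed.append(finished[s])  # should equal n_workers after the barrier

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed
```

With a tree, the critical path grows with the logarithm of the worker count rather than linearly, which is the usual motivation for this layout.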

  16. FFTC design on Cell/B.E.: Challenges (cont’d)
     • Load balancing to achieve better SPU utilization
       – No SPE should wait at the synchronization barrier.
       – Require efficient parallelization technique to allocate data to SPEs.
       – Strategy should be scalable across multiple chips (large number of SPEs).
     • Vectorization difficult for every stage
       – Stages 1 & 2 do not have a regular data access pattern.
       – Require data reorganization to fully utilize the SPE computational power.
       – Optimizing the first 2 stages becomes important for medium size inputs, as they may constitute 20-25% of the total running time.
     [Figure label: First 2 stages]
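One simple way to keep workers from idling at the barrier is to give each the same number of butterflies per stage. The sketch below shows a generic block partition; it is an assumption for illustration, since the slide does not say how FFTC actually assigns data to SPEs, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(total, workers):
    """Split range(total) into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(total, workers)
    bounds = []
    start = 0
    for rank in range(workers):
        size = base + (1 if rank < extra else 0)  # first `extra` workers take one more
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    # An 8K-point radix-2 stage has 4096 butterflies; divide them among 8 workers.
    print(chunk_bounds(4096, 8))
```

For power-of-two sizes and a power-of-two worker count the chunks are exactly equal, so every worker reaches the barrier after the same amount of nominal work.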

  17. FFTC design on Cell/B.E.: Challenges (cont’d)
     • Limited local store
       – Require space for N/2 twiddle factors and input data.
       – Loop unrolling and duplication increase the size of the code.
       – Effectively manage code and data within 256KB.
     • Algorithm is branchy
       – Doubly nested for loop within the outer while loop
       – Lack of branch predictor compromises performance.
       – Provide branch hints and restructure the algorithm to eliminate branches.
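A rough budget shows why the 256KB limit is tight at the upper end of the supported range. The sketch assumes 8 bytes per single-precision complex value (two 4-byte floats) and considers whole arrays only; how FFTC actually lays out its buffers is not stated on the slide, and the names below are illustrative.

```python
BYTES_PER_COMPLEX = 8            # two 4-byte single-precision floats
LOCAL_STORE_BYTES = 256 * 1024   # local store size quoted on the slide


def local_store_budget(n):
    """Bytes needed for twiddles and data of an n-point FFT, and what is left over."""
    twiddle_bytes = (n // 2) * BYTES_PER_COMPLEX   # N/2 twiddle factors
    array_bytes = n * BYTES_PER_COMPLEX            # one full array of N samples
    return {
        "twiddle_bytes": twiddle_bytes,
        "one_array_bytes": array_bytes,
        "left_after_twiddles_and_one_array":
            LOCAL_STORE_BYTES - twiddle_bytes - array_bytes,
        "left_after_twiddles_and_two_arrays":
            LOCAL_STORE_BYTES - twiddle_bytes - 2 * array_bytes,
    }


if __name__ == "__main__":
    for name, value in local_store_budget(16 * 1024).items():
        print(name, value)
```

For N = 16K the twiddle factors take 64KB and one full array takes 128KB, leaving 64KB for code, stack and buffers; two full arrays plus twiddles would not fit at all.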
