Fast Transforms using the Cell/B.E. Processor

David A. Bader, joint work with Seunghwa Kang and Virat Agarwal

Sony-Toshiba-IBM Center of Competence for the Cell/B.E. at Georgia Tech
Mission: grow the community of Cell Broadband Engine users and developers
• Fall 2006: Georgia Tech wins competition for hosting the STI Center
• First publicly-available IBM QS20 Cluster
• 200 attendees at 2007 STI Workshop
• Multicore curriculum and training
• Demonstrated performance on
– Multimedia and gaming
– Scientific computing
– Medical applications
– Financial services
David A. Bader, Director
http://sti.cc.gatech.edu

Applications
• CellBuzz: Freely-available, open source libraries optimized for the Cell/B.E.
http://sourceforge.net/projects/cellbuzz/
– ZLIB & GZIP: data compression
– FFT: fast Fourier transform
– RC5: encryption
– MPEG-2: video encoding and decoding
– JPEG2000: digital content processing
• Financial Modeling

Cell/B.E. Libraries: FFT and JPEG2000
• FFTC: Fastest Fourier Transform on the Cell/B.E.
– 1-Dimensional single precision DIF-FFT optimized for 1K-16K complex input samples
– Parallelize & optimize the computation of a single FFT
– Design high performance synchronization barrier using inter-SPE communication
– Demonstrated superior performance of 18.6 GFlop/s for 8K complex input samples.
• JPEG2000 on the Cell/B.E.
– Optimize coding/decoding by data decomposition / data alignment / vectorization
– Demonstrated average speedup of 3.1 over Intel 3.2 GHz Pentium-4
[Figure: Butterflies of ordered DIF FFT]
[Figure: GigaFlop/s (0 to 25) versus input size (1024 to 16384); series: IBM Power5, AMD Opteron, Intel Pentium 4, FFTW on Cell, our implementation (8 SPEs), Intel Core Duo]
The source code is freely available from our CellBuzz project in SourceForge
http://sourceforge.net/projects/cellbuzz/

Cell/B.E. Libraries: ZLIB and MPEG-2
• ZLIB: Data compression & decompression library
– Vectorize compute intensive kernels and parallelize to run on multiple SPEs
– Extend the gzip header format while maintaining compatibility with legacy gzip decompressors
– Demonstrated speedup of 2.9 over high-end Intel Pentium-4 system
• MPEG-2 Video Decoding
– First parallelization of a multimedia application on Cell/B.E.
– Demonstrated a speedup of 3 over Intel 3.2GHz Xeon.
The source code is freely available from our CellBuzz project in SourceForge
http://sourceforge.net/projects/cellbuzz/

Using the Cell/B.E. in Aircraft Health Monitoring
“Retired Marine Lt. Gen. Bernard Trainor said the issue of aging aircraft is a constant complaint of all branches of service.”
Atlanta Journal Constitution, April 27, 2002
• Fault Diagnosis
– Estimate the crack length without disassembly based on vibration data collected from multiple sensors.
• Failure Prognosis
– Estimate the expected time before crash

System-of-Systems Decomposition
[Figure: aircraft subsystem diagram. Labels as extracted, in original order: Power and Cooling Turbo Machine Life, Oil; Actuator Leakage; Engine Condition, Oil Servicing and Wear and Filter Condition; Generator Oil Level; Hydraulic Filters, Pump, and Hydraulic Fluid Level; Battery; Oxygen Generator; Nitrogen Generator and Filter; Landing Gear and Arresting Hook Structure fatigue life; Landing Gear Strut; Rotary Actuator Wear; Pressure and Fluid Level]

Overview of the Diagnosis and Prognosis Process
[Figure: block diagram. Online modules: sensor data, de-noising, preprocessed data, feature extraction and mapping, features, diagnosis, prognosis, RUL. Offline modules: DAQ, experimental data, simulated data, data driven methods, system models for diagnosis and prognosis, particle filter and noise models, model parameter tuning, flight regime and loading data, fault growth, stress table. Inset plot: feature versus crack length.]
Involves multiple computationally expensive modules!!!

Fast Transforms on the Cell/B.E.
• Fast Fourier Transform
• Discrete Wavelet Transform

FFTC: Fastest Fourier Transform for Cell/B.E.
• Focus on medium size FFT computations
– Complex single-precision 1-Dimensional FFT
• Input samples and output results reside in main memory.
• Radix 2, 3 and 5.
• Optimized for 1K-16K input samples.
• Focus on achieving high performance for the computation of a single FFT, rather than increasing throughput.

Existing FFT Research on Cell/B.E.
• [Williams et al., 2006] analyzed peak performance.
• [Cico, Cooper and Greene, 2006] estimated 22.1 GFlop/s for an 8K complex 1D FFT that resides in the Local Store of one SPE.
– 8 independent FFTs in the local stores of 8 SPEs give 176.8 GFlop/s.
• [Chow, Fossum and Brokenshire, 2005] achieved 46.8 GFlop/s for a 16M complex FFT.
– Highly specialized for this particular input size.
• FFTW is a highly portable FFT library supporting various types, precisions and input sizes.
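The aggregate figure on this slide is linear scaling of the single-SPE estimate, and a per-FFT rate can be turned into a rough execution time. The following is a minimal sketch, assuming the common 5 N log2 N operation count used by FFT benchmarks; the slides do not say which convention underlies their GFlop/s numbers, and the function names are illustrative.

```python
import math

def fft_flops(n):
    """Nominal floating-point operation count for a complex FFT of size n.

    Assumption: the 5 * n * log2(n) benchmarking convention. The slides do
    not state which operation count their GFlop/s figures are based on.
    """
    return 5 * n * math.log2(n)

def implied_time_us(n, gflops):
    """Execution time in microseconds implied by a GFlop/s rate under that convention."""
    return fft_flops(n) / (gflops * 1e9) * 1e6

# Eight independent FFTs, one per SPE, scale the single-SPE estimate linearly.
aggregate_gflops = 8 * 22.1
print(round(aggregate_gflops, 1))   # 176.8

# Rough time for one 8K complex FFT at 22.1 GFlop/s (about 24 microseconds).
print(implied_time_us(8192, 22.1))
```

The same helper shows why a single medium-size FFT is hard to parallelize: at these rates the whole transform lasts only tens of microseconds, so any per-stage synchronization cost is a visible fraction of the total.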

Our FFTC is based on Cooley-Tukey
• Input is a one-dimensional vector of complex values.
• Algorithm is iterative, no recursion.
• Out-of-place approach is used.
• Requires two arrays A & B for computation, one input and one output, that are swapped at every stage.
• Out-of-place approach prevents data reordering after the last stage.
• Algorithm requires log N stages. Each stage requires O(N) computation.
– Complexity O(N log N)
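The properties listed on this slide (iterative, out of place, two arrays swapped every stage, no reordering after the last stage) match the self-sorting Stockham form of the radix-2 FFT. The sketch below is an illustration under that assumption, not the FFTC source: it handles power-of-two lengths only, whereas the slides also mention radix 3 and 5, and all names are hypothetical.

```python
import cmath

def fft_out_of_place(x):
    """Iterative radix-2 DIF FFT with two ping-pong buffers (Stockham form).

    Assumption: len(x) is a power of two.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = [complex(v) for v in x]    # input of the current stage
    b = [0j] * n                   # output of the current stage
    length, stride = n, 1          # sub-transform size, output pair distance
    while length > 1:              # log2(n) stages, O(n) work each
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)   # twiddle factor
            for q in range(stride):
                u = a[q + stride * p]
                v = a[q + stride * (p + half)]
                b[q + stride * 2 * p] = u + v
                b[q + stride * (2 * p + 1)] = (u - v) * w
        a, b = b, a                # swap the roles of the two arrays
        length, stride = half, stride * 2
    return a                       # natural order, no final permutation

def naive_dft(x):
    """O(n^2) reference transform used only to check the sketch."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

samples = [complex(i, -0.5 * i) for i in range(16)]
fast, slow = fft_out_of_place(samples), naive_dft(samples)
print(max(abs(f - s) for f, s in zip(fast, slow)) < 1e-9)   # True
```

Because each stage writes its results directly to their final relative positions, the output of the last stage is already in natural order, which is the benefit the slide attributes to the out-of-place approach.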

[Figure labels: stage begin, twiddle factors, stage end]

Illustration of the Algorithm
• Illustration of the algorithm for n=16 complex values.
• Distance between pairs of output values doubles at every subsequent stage.
• Shows how the output of one stage serves as the input to another.
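The doubling of the output pair distance can be reproduced with a few lines of index arithmetic. This is a sketch that assumes the self-sorting radix-2 index scheme (outputs of a butterfly written `stride` apart, with `stride` doubling per stage); the slide's figure itself is not recoverable here, and the function name is illustrative.

```python
def output_pair_distances(n):
    """Distance between the two outputs of each butterfly, stage by stage.

    Assumption: n is a power of two and outputs are indexed as in a
    self-sorting (Stockham-style) radix-2 FFT.
    """
    distances = []
    length, stride = n, 1
    while length > 1:
        half = length // 2
        seen = set()
        for p in range(half):
            for q in range(stride):
                first = q + stride * 2 * p
                second = q + stride * (2 * p + 1)
                seen.add(second - first)
        assert len(seen) == 1      # every butterfly in a stage uses one distance
        distances.append(seen.pop())
        length, stride = half, stride * 2
    return distances

print(output_pair_distances(16))   # [1, 2, 4, 8]
```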

FFTC design on Cell/B.E.: Challenges
• A synchronization step after every stage leads to significant overhead.
• Reduce synchronization stages.
• Design efficient barrier synchronization routine.
• We will later describe an efficient tree-based synchronization algorithm based on inter-SPE communication.
[Figure: stage diagram marked "Insert synchronization barrier"]
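The idea of a tree-based barrier can be illustrated on the host with ordinary threads. This is a sketch only: it uses Python semaphores as a stand-in for inter-SPE communication, the binary tree shape and the `TreeBarrier` and `run_stages` names are assumptions, and it is not the synchronization routine used in FFTC.

```python
import threading

class TreeBarrier:
    """Binary-tree barrier: arrivals flow up to worker 0, releases flow back down."""

    def __init__(self, n_workers):
        self.n = n_workers
        # arrive[i] is signalled by the children of worker i,
        # release[i] is signalled by the parent of worker i.
        self.arrive = [threading.Semaphore(0) for _ in range(n_workers)]
        self.release = [threading.Semaphore(0) for _ in range(n_workers)]

    def _children(self, i):
        return [c for c in (2 * i + 1, 2 * i + 2) if c < self.n]

    def wait(self, i):
        kids = self._children(i)
        for _ in kids:
            self.arrive[i].acquire()              # gather from each child subtree
        if i != 0:
            self.arrive[(i - 1) // 2].release()   # report arrival to the parent
            self.release[i].acquire()             # block until the parent releases
        for c in kids:
            self.release[c].release()             # pass the release down the tree

def run_stages(n_workers=8, n_stages=3):
    """Run n_stages of dummy work with a barrier after every stage."""
    barrier = TreeBarrier(n_workers)
    log = []

    def worker(i):
        for stage in range(n_stages):
            log.append(stage)      # stand-in for one FFT stage of work
            barrier.wait(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_stages()
print(log == sorted(log))          # True: no worker starts a stage early
```

With 8 workers each participant talks to at most three others (one parent, two children), and the critical path is the depth of the tree rather than the number of workers, which is the usual motivation for a tree over a central counter.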

FFTC design on Cell/B.E.: Challenges (cont'd)
• Load balancing to achieve better SPU utilization
– No SPE should wait at the synchronization barrier.
– Require efficient parallelization technique to allocate data to SPEs.
– Strategy should be scalable across multiple chips (large number of SPEs).
• Vectorization is difficult for every stage
– Stages 1 & 2 do not have a regular data access pattern.
– Require data reorganization to fully utilize the SPE computational power.
– Optimizing the first 2 stages becomes important for medium size inputs, as they may constitute 20-25% of the total running time.
[Figure: first 2 stages]
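One simple way to keep every worker equally busy is to give each a contiguous, near-equal share of the butterflies in a stage. The slides do not describe the allocation FFTC actually uses, so the following is a hypothetical sketch; `partition` and its parameters are illustrative names.

```python
def partition(n_items, n_workers):
    """Split range(n_items) into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)   # first `extra` workers get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks

# An 8K-point radix-2 stage has 4096 butterflies; 8 workers get 512 each.
print([len(c) for c in partition(8192 // 2, 8)])   # [512, 512, 512, 512, 512, 512, 512, 512]
print([len(c) for c in partition(10, 4)])          # [3, 3, 2, 2]
```

Equal chunk sizes mean all workers reach the barrier at roughly the same time, and the scheme extends to any worker count, which is the scalability property the slide asks for.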

FFTC design on Cell/B.E.: Challenges (cont'd)
• Limited local store
– Require space for N/2 twiddle factors and input data.
– Loop unrolling and duplication increase the size of the code.
– Effectively manage code and data within 256KB.
• Algorithm is branchy:
– Doubly nested for loop within the outer while loop
– Lack of a branch predictor compromises performance.
– Provide branch hints and restructure the algorithm to eliminate branches.
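One possible restructuring that removes the inner loop branch is to flatten the doubly nested loop into a single loop of fixed trip count and recover both indices with a shift and a mask. This is a sketch of that general technique, assuming radix-2 stages with power-of-two strides; it is not taken from the FFTC source, and the function names are illustrative.

```python
def nested_indices(length, stride):
    """(p, q) pairs in the order a doubly nested loop visits them."""
    half = length // 2
    return [(p, q) for p in range(half) for q in range(stride)]

def flat_indices(length, stride):
    """Same (p, q) pairs from one loop; stride must be a power of two."""
    shift = stride.bit_length() - 1      # log2(stride)
    mask = stride - 1
    count = (length // 2) * stride       # always n/2 butterflies per stage
    return [(k >> shift, k & mask) for k in range(count)]

# Check every stage of a 16-point transform.
length, stride = 16, 1
while length > 1:
    assert nested_indices(length, stride) == flat_indices(length, stride)
    length, stride = length // 2, stride * 2
print("flat and nested loops visit the same butterflies")
```

Because the flat loop always runs n/2 iterations regardless of the stage, it can be unrolled by a fixed factor, at the cost of the larger code size that the local store budget above has to absorb.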
