Low Power DSP Architectures

Overview
• AnySP
    - SODA++
    - Increase the Application Domain of the Wireless baseband architecture
• Diet-SODA
    - SODA--
    - Taking the sugar out of SODA

University of Michigan – ARM June 09

Low Power DSP Architectures
Trevor Mudge, Bredt Professor of Engineering, The University of Michigan, Ann Arbor
1st tubs.CITY Symposium, July 1 – 3, 2009, Braunschweig
Mark Woh [1], Sangwon Seo [1], Ron Dreslinski, Geoff Blake, Scott Mahlke [1], Chaitali Chakrabarti [2], Krisztian Flautner [3]
[1] University of Michigan – ACAL
[2] Arizona State University
[3] ARM, Ltd.

The Old Mobile Phone / The Modern Mobile Phone
[Figure: photos of an old and a modern phone, labeled Video Recording, Video Editing, Higher Data Rates, 3D Rendering, Advanced Image Processing]
• Future phones are becoming more complex
• Richer applications impose much greater requirements
• How do phones handle this now?
Photos from:
http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/
http://www.apple.com/iphone

Inside Today’s Smart Phones
• Modern phones are looking like Frankenchips!
• Some cores are unused and functionality is duplicated
[Figure: Inside the OMAP3430 Application Processor]
[Figure: Inside the X-Gold 608 (Representation of QCOM)]

Cost for Multi-System Support
• Supporting multiple systems is reserved for the most expensive phones
• Cost is in supporting all the systems that may or may not be used at once

Programmable Unified Architectures Provide
• Lower Cost
• Faster Time to Market
• Support for Multiple Applications (Current and Future)
• Bug Fixes After Manufacturing
So where do we start?

Data gathered from:
- Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007)
- Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.

View of the Unified Architecture World
[Figure: three panels, each combining WiFi, 3G, GSM, Video, 2D/3D, and ISP blocks]

Power/Performance Requirements for Multiple Systems
• Different applications have different power/performance characteristics!
• We need to design keeping each application in mind! (Not a GPP but a Domain Specific Processor)

The Applications
Is there anything we can learn from the applications themselves?

H.264 Basics
[Figure from: T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp. 119-126, Aug. 2006.]

4G Wireless Basics
• Three kernels make up the majority of the work
    - FFT – Extract Data from Signals
    - STBC – Combine Data into More Reliable Stream
    - LDPC – Error Correction on Data Stream

Mobile Signal Processing Algorithm Characteristics

System  Algorithm            SIMD          Scalar        Overhead      SIMD Width  Amount
                             Workload (%)  Workload (%)  Workload (%)  (Elements)  of TLP
4G      FFT                  75            5             20            1024        Low
4G      STBC                 81            5             14            4           High
4G      LDPC                 49            18            33            96          Low
H.264   Deblocking Filter    72            13            15            8           Medium
H.264   Intra-Prediction     85            5             10            16          Medium
H.264   Inverse Transform    80            5             15            8           High
H.264   Motion Compensation  75            5             10            8           High

• Algorithms have different SIMD widths
    - From very large to very small
• Though SIMD width varies, all algorithms can exploit it
    - A large percentage of the work can be SIMDized
• Larger SIMD widths tend to have less TLP

SIMD comes at a cost!
• Register File Power
• Data Movement/Alignment Cost
SIMD architectures have to deal with this!

Traditional SIMD Power Breakdown
• The register file consumes a large share of the power in a traditional 32-wide SIMD architecture
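As a rough reading of the algorithm characteristics table, the SIMD fraction of each kernel limits how much a wider datapath can help. The sketch below applies Amdahl's law under an assumption the slides do not state: only the SIMD portion speeds up with width, while the scalar and overhead portions stay fixed. The function name `simd_speedup_bound` is illustrative, and the result is an upper bound, not a measured number.

```python
def simd_speedup_bound(simd_pct, width):
    """Amdahl-style upper bound on speedup over a purely scalar execution.

    simd_pct: percentage of the work that can be SIMDized (0..100)
    width:    number of SIMD elements processed per operation
    """
    f = simd_pct / 100.0
    return 1.0 / ((1.0 - f) + f / width)


# (SIMD workload %, natural SIMD width) as listed in the table.
kernels = {
    "FFT": (75, 1024),
    "STBC": (81, 4),
    "LDPC": (49, 96),
    "Deblocking Filter": (72, 8),
    "Intra-Prediction": (85, 16),
    "Inverse Transform": (80, 8),
    "Motion Compensation": (75, 8),
}

for name, (pct, width) in kernels.items():
    bound = simd_speedup_bound(pct, width)
    print(f"{name:20s} bound = {bound:.2f}x")
```

Under this simplification, even the 1024-wide FFT cannot exceed roughly 4x, because a quarter of its work is not SIMD; this is one way to see why the non-SIMD portions matter.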

Register File Access
[Figure: stacked bar chart (0% to 100%) of Register Access, Bypass Read, and Bypass Write for FFT, STBC, LDPC, Deblocking Filter, Intra-Prediction, and Inverse Transform]
• Lots of power is wasted on unneeded register file accesses!
• Many of the register file accesses do not have to go back to the main register file

Instruction Pair Frequency
• Like the Multiply-Accumulate (MAC) instruction, there is opportunity to fuse other instructions
• A few instruction pairs (3-5) make up the majority of all instruction pairs!
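One way such a pair-frequency profile could be gathered is to count adjacent opcode pairs in an instruction trace. The sketch below is only an illustration: the trace is made up, the helper names are hypothetical, and plain adjacency is a simplification (a real fusion study would also check that the second instruction consumes the first one's result).

```python
from collections import Counter


def pair_frequencies(trace):
    """Count adjacent opcode pairs in a trace (a list of opcode strings)."""
    return Counter(zip(trace, trace[1:]))


def coverage_of_top(trace, k):
    """Fraction of all adjacent pairs covered by the k most frequent pairs."""
    counts = pair_frequencies(trace)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Hypothetical trace, not taken from the benchmarks in the slides.
trace = ["mul", "add", "mul", "add", "shift", "add",
         "mul", "add", "sub", "shift"]

for pair, n in pair_frequencies(trace).most_common(3):
    print(pair, n)
print("top-2 coverage:", coverage_of_top(trace, 2))
```

A high coverage value for a small k would support fusing just those few pairs, in the same spirit as a MAC instruction.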

Data Alignment Problem!
[Figure: Intra-Prediction]
• Traditional SIMD machines take too long or cost too much to do this
• Good news: a small, fixed number of patterns per kernel
    - H.264 Intra-prediction has 9 different prediction modes
    - Each prediction mode requires a specific permutation

More Data Alignment Problems!
[Figure: Inverse Transform]
• Adder tree can accelerate not only matrix operations!
• Many different video kernels can be accelerated too!
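To make the "specific permutation per prediction mode" point concrete, the sketch below models a shuffle as a fixed index table applied to a source vector. It covers only two of the nine H.264 4x4 intra modes (vertical and horizontal), and the source layout (four top neighbors followed by four left neighbors) is an assumption chosen for illustration, not the layout used by AnySP's swizzle network.

```python
def make_pattern(mode):
    """Return a 16-entry index table for one 4x4 intra-prediction mode.

    Assumed source layout: indices 0..3 hold the top neighbors T0..T3,
    indices 4..7 hold the left neighbors L0..L3.
    Output is the 4x4 predicted block in row-major order.
    """
    if mode == "vertical":
        # Every row copies the pixel directly above its column.
        return [c for r in range(4) for c in range(4)]
    if mode == "horizontal":
        # Every column copies the pixel directly left of its row.
        return [4 + r for r in range(4) for c in range(4)]
    raise ValueError(f"mode not modeled in this sketch: {mode}")


def swizzle(src, pattern):
    """Apply a fixed permutation (with replication) to a source vector."""
    return [src[i] for i in pattern]


top = [10, 20, 30, 40]
left = [1, 2, 3, 4]
src = top + left

print(swizzle(src, make_pattern("vertical")))
print(swizzle(src, make_pattern("horizontal")))
```

Because each mode needs only one such table, a kernel can preload a handful of patterns rather than compute data movement at run time.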

Even More Data Alignment!
• Block-based decoding requires us to access different locations of memory for each task
• We cannot just rely on fetching contiguous sets of data
• Techniques like 2D-Wave and 3D-Wave decoding for H.264 help increase the amount of parallelism, but we have to be able to access different macroblocks for each parallel computation
Source: C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, Journal of Signal Processing Systems, August 2008

Summary
• Conclusions about 4G and H.264
    - Lots of different sized parallelism
        - From 4 wide to 96 wide to 1024 wide SIMD
        - Which means many different SIMD widths need to be supported
    - Very short lived values
    - Lots of potential for instruction fusing
    - Limited set of shuffle patterns required for each kernel
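The 2D-Wave idea mentioned for H.264 can be sketched as a wavefront schedule over the macroblock grid. The code below assumes each macroblock depends on its left, top-left, top, and top-right neighbors, in which case macroblock (x, y) can be scheduled at step x + 2y. The function names are illustrative, and this is a toy model of the schedule rather than a decoder.

```python
from collections import defaultdict


def wavefronts(width, height):
    """Group macroblocks (x, y) into steps that can run in parallel."""
    waves = defaultdict(list)
    for y in range(height):
        for x in range(width):
            waves[x + 2 * y].append((x, y))
    return [waves[t] for t in sorted(waves)]


def respects_dependencies(width, height):
    """Check that every neighbor dependency is scheduled in an earlier step."""
    step = lambda x, y: x + 2 * y
    for y in range(height):
        for x in range(width):
            for dx, dy in [(-1, 0), (-1, -1), (0, -1), (1, -1)]:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    if step(nx, ny) >= step(x, y):
                        return False
    return True


for t, wave in enumerate(wavefronts(4, 3)):
    print(t, wave)
```

Each step touches macroblocks in different rows, which is why the memory system must serve several non-contiguous locations at once.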

AnySP Design

SODA SIMD Architecture
[Figure: 32-Wide SIMD with Simple Shuffle Network]

AnySP Architecture – High Level
[Figure: block diagram with the following labeled components]
• 16 Banked Memory with SRAM-based Crossbar
• 8 Groups of 8-Wide Flexible Function Units
• Multiple Output Adder Tree
• 128x128 16bit Swizzle Network
• Temporary Buffer and Bypass Network
• Datapath
• AGU and Scalar Pipeline

Multi-Width Support
• Normal 64-Wide SIMD mode – all lanes share one AGU
• Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
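The difference between the two modes can be pictured with a toy address model. The slides do not give the addressing details, so everything below is an assumption for illustration: unit-stride element addresses, one base address in wide mode, and one offset per 8-wide group in grouped mode. The function names are hypothetical.

```python
GROUPS = 8            # 8 SIMD groups, per the high-level architecture slide
LANES_PER_GROUP = 8   # each group is 8 lanes wide


def wide_mode_addresses(base):
    """64-wide mode: all lanes share one address stream from a single base."""
    return [base + lane for lane in range(GROUPS * LANES_PER_GROUP)]


def grouped_mode_addresses(base, offsets):
    """Grouped mode: each 8-wide group adds its own offset to the base."""
    if len(offsets) != GROUPS:
        raise ValueError("need exactly one offset per group")
    return [[base + off + lane for lane in range(LANES_PER_GROUP)]
            for off in offsets]


# Hypothetical example: eight 8-element blocks spaced 64 elements apart.
offsets = [g * 64 for g in range(GROUPS)]
print(wide_mode_addresses(0)[:8])
for group in grouped_mode_addresses(0, offsets)[:2]:
    print(group)
```

In this model the same 8-wide code runs in every group, and only the per-group offset changes which data it sees.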

AnySP FFU Datapath
The Flexible Functional Unit allows us to:
1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs

AnySP Results

Simulation Environment
• Traditional SIMD architecture comparison
    - SODA at 90nm technology
• AnySP
    - Synthesized at 90nm TSMC
    - Power, timing, area numbers were extracted
• Kernels were hand written and optimized
    - 4G – based on an NTT DoCoMo 4G test setup
    - H.264 – 4CIF@30fps

AnySP Speedup vs SIMD-based Architecture
• For all benchmarks we perform more than 2x better than a SIMD-based architecture

AnySP Energy-Delay vs SIMD-based Architecture
• More importantly, energy efficiency is much better!

AnySP Power Breakdown
• We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

Conclusion & Future Work
• Conclusion
    - We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform
    - It stays under the power budget while meeting the performance target at 45nm
• Future and Ongoing Work
    - Application-specific language
    - Larger class of algorithms for AnySP
    - Better utilization of resources for non-parallel kernels
    - Speed up sequential parts

Diet-SODA

Diet SODA
• SODA, Ardbeg, and AnySP may be too powerful for the application
    - Simple image processing for cameras
    - Audio processing for voice
• Trade the flexibility and generality of Ardbeg and AnySP for performance with fewer gates
• Build a modular design to which people can add SIMD groups and special function blocks to increase performance, at the cost of area but allowing voltage scaling

Histogram Equalization
• Spreads out an unevenly distributed histogram
• Increases contrast
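For readers unfamiliar with the kernel, here is a minimal sketch of textbook histogram equalization on a flat list of gray levels. It is the standard CDF-based remapping, not the Diet-SODA implementation, and the function name and sample pixel values are made up for illustration.

```python
def equalize(pixels, levels=256):
    """Remap gray levels so the cumulative histogram becomes roughly linear."""
    # Build the histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: number of pixels at or below each level.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    denom = total - cdf_min
    if denom == 0:
        # Only one gray level present; nothing to spread out.
        return list(pixels)

    # Lookup table with round-half-up integer arithmetic.
    lut = [max(0, ((c - cdf_min) * (levels - 1) + denom // 2) // denom)
           for c in cdf]
    return [lut[p] for p in pixels]


# Hypothetical low-contrast sample: all values between 52 and 58.
sample = [52, 52, 53, 54, 55, 55, 55, 58]
print(equalize(sample))
```

The narrow input range is stretched across the full 0 to 255 range, which is the contrast increase the slide refers to. The histogram and lookup steps are simple per-pixel operations, consistent with the slide's point that a lighter-weight design may suffice for such kernels.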
