Low Power DSP Architectures



University of Michigan – ARM June 09

Overview

  • AnySP
  • SODA++
  • Increase the application domain of the wireless baseband architecture
  • Diet-SODA
  • SODA--
  • Taking the sugar out of SODA


Low Power DSP Architectures

Trevor Mudge, Bredt Professor of Engineering, The University of Michigan, Ann Arbor

1st tubs.CITY Symposium, July 1 – 3, 2009, Braunschweig

Mark Woh¹, Sangwon Seo¹, Ron Dreslinski, Geoff Blake, Scott Mahlke¹, Chaitali Chakrabarti², Krisztian Flautner³

¹University of Michigan – ACAL, ²Arizona State University, ³ARM, Ltd.


The Old Mobile Phone vs. The Modern Mobile Phone

Modern phone features: Video Recording, Video Editing, Higher Data Rates, 3D Rendering, Advanced Image Processing

Photos from: http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ and http://www.apple.com/iphone

  • Future phones are becoming more complex
  • Richer applications impose much more demanding requirements
  • How do phones handle this now?

Inside Today’s Smart Phones

Inside the OMAP3430 Application Processor

Inside the X-Gold 608 (Representation of QCOM)

  • Modern phones are looking like Frankenchips!
  • Some cores unused and functionality duplicated

Cost for Multi-System Support

  • Supporting multiple systems is reserved for the most expensive phones
  • Cost is in supporting all the systems that may or may not be used at once


Data gathered from

  • Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007)
  • Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.

Programmable Unified Architectures Provide

  • Lower Cost
  • Faster Time to Market
  • Support for Multiple Applications (Current and Future)
  • Bug Fixes After Manufacturing

So where do we start?

View of the Unified Architecture World

[Figure: application blocks shown in the diagram: GSM, ISP, WiFi, 3G, 2D/3D, Video]

Power/Performance Requirements for Multiple Systems

Different applications have different power/performance characteristics! We need to design with each application in mind (a domain-specific processor, not a general-purpose processor).

The Applications

Is there anything we can learn from the applications themselves?

H.264 Basics

T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp. 119-126, Aug. 2006.

4G Wireless Basics

  • Three kernels make up the majority of the work
  • FFT – Extract Data from Signals
  • STBC – Combine Data into More Reliable Stream
  • LDPC – Error Correction on Data Stream

Mobile Signal Processing Algorithm Characteristics

  • Algorithms have different SIMD widths
  • From very large to very small
  • Though the SIMD width varies, all algorithms can exploit it
  • A large percentage of the work can be SIMDized
  • Algorithms with larger SIMD widths tend to have less TLP

Algorithm              SIMD Workload (%)  Scalar Workload (%)  Overhead Workload (%)  SIMD Width (Elements)  Amount of TLP
4G
  FFT                  75                 5                    20                     1024                   Low
  STBC                 81                 5                    14                     4                      High
  LDPC                 49                 18                   33                     96                     Low
H.264
  Deblocking Filter    72                 13                   15                     8                      Medium
  Intra-Prediction     85                 5                    10                     16                     Medium
  Inverse Transform    80                 5                    15                     8                      High
  Motion Compensation  75                 5                    10                     8                      High

SIMD comes at a cost!

  • Register File Power
  • Data Movement/Alignment Cost

SIMD architectures have to deal with this!

Traditional SIMD Power Breakdown

  • The register file consumes a large share of the power in a traditional 32-wide SIMD architecture

Register File Access

  • Many register file accesses do not have to go back to the main register file

[Figure: share of register accesses (0% to 100%) for FFT, STBC, LDPC, Deblocking Filter, Intra-Prediction, and Inverse Transform, broken down into Register Access, Bypass Read, and Bypass Write]


Lots of power wasted on unneeded register file access!
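
One way such a breakdown could be estimated is sketched below: walk an instruction trace and count the source reads whose value was produced only a few instructions earlier, since those could be served by a bypass path or a small buffer rather than the main register file. The `struct instr` layout, the `window` parameter, and the function name are all hypothetical; the slides do not describe how the measurement was actually made.

```c
#include <stddef.h>

#define NUM_REGS 32

/* Hypothetical trace record: one destination and up to two sources. */
struct instr {
    int dst;      /* destination register, or -1 if none */
    int src[2];   /* source registers, or -1 if unused */
};

/* Count source reads whose value was produced at most `window`
 * instructions earlier. Returns the number of such reads and stores
 * the total number of reads in *total_reads. */
size_t count_bypassable_reads(const struct instr *trace, size_t n,
                              size_t window, size_t *total_reads)
{
    size_t last_write[NUM_REGS] = {0};
    int written[NUM_REGS] = {0};
    size_t bypassable = 0;
    *total_reads = 0;

    for (size_t i = 0; i < n; i++) {
        for (int s = 0; s < 2; s++) {
            int r = trace[i].src[s];
            if (r < 0)
                continue;
            (*total_reads)++;
            /* Only registers written earlier in the trace qualify. */
            if (written[r] && i - last_write[r] <= window)
                bypassable++;
        }
        if (trace[i].dst >= 0) {
            last_write[trace[i].dst] = i;
            written[trace[i].dst] = 1;
        }
    }
    return bypassable;
}
```

A larger `window` models a deeper temporary buffer; a window of 1 models a plain bypass from the previous instruction.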

Instruction Pair Frequency

As with the Multiply-Accumulate (MAC) instruction, there is an opportunity to fuse other instructions.

A few instruction pairs (3-5) make up the majority of all instruction pairs!
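
As a rough illustration of how pair frequency might be measured, the sketch below counts back-to-back opcode pairs in a trace and shows a MAC as the familiar fused case. The opcode set and function names are hypothetical and are not AnySP's instruction set; a real study would more likely follow producer-consumer dependences than simple adjacency.

```c
#include <stddef.h>

/* Hypothetical opcode set for illustration only. */
enum opcode { OP_MUL, OP_ADD, OP_SUB, OP_SHIFT, OP_COUNT };

/* Count how often each (first, second) opcode pair appears back to back
 * in a trace. counts must be an OP_COUNT x OP_COUNT array. */
void count_pairs(const enum opcode *trace, size_t n,
                 unsigned counts[OP_COUNT][OP_COUNT])
{
    for (int a = 0; a < OP_COUNT; a++)
        for (int b = 0; b < OP_COUNT; b++)
            counts[a][b] = 0;
    for (size_t i = 0; i + 1 < n; i++)
        counts[trace[i]][trace[i + 1]]++;
}

/* A fused multiply-accumulate: one operation instead of a MUL/ADD pair,
 * so the intermediate product never needs a register file write. */
int mac(int acc, int x, int y)
{
    return acc + x * y;
}
```

Sorting the resulting counts would reveal whether a handful of pairs dominate, which is the property that makes fusion worthwhile.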

Data Alignment Problem!

  • H.264 Intra-prediction has 9 different prediction modes
  • Each prediction mode requires a specific permutation


Traditional SIMD machines take too long or cost too much to do this.

Good news: each kernel needs only a small, fixed number of patterns.

[Figure: Intra-Prediction]
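
Because the set of patterns is small and fixed, each one can be stored as a table of source lane indices and applied in a single step. The sketch below assumes a 16-lane vector of 16-bit values; the two pattern tables are hypothetical examples, not the actual H.264 prediction-mode permutations.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 16

/* Apply a fixed shuffle pattern: dst[i] = src[pattern[i]].
 * Each pattern entry must be in the range 0..LANES-1. */
void shuffle16(int16_t dst[LANES], const int16_t src[LANES],
               const uint8_t pattern[LANES])
{
    for (size_t i = 0; i < LANES; i++)
        dst[i] = src[pattern[i]];
}

/* Example pattern: copy the first four lanes into every row of a
 * 4x4 block stored row by row. */
const uint8_t PATTERN_ROW_BROADCAST[LANES] = {
    0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3
};

/* Example pattern: rotate all lanes left by one position. */
const uint8_t PATTERN_ROTATE_LEFT[LANES] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0
};
```

In hardware, selecting a pattern would amount to loading a precomputed configuration into the shuffle network rather than running a loop.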


More Data Alignment Problems!

An adder tree accelerates more than matrix operations: many different video kernels can be accelerated too!

[Figure: Inverse Transform]
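
A minimal sketch of a multiple-output adder tree follows, assuming 16 inputs and outputs tapped at the 4-, 8-, and 16-input levels. These widths and the function name are assumptions for illustration; the slides do not give the tree's actual configuration.

```c
#include <stdint.h>

#define WIDTH 16

/* Pairwise reduction over 16 inputs that also exposes the
 * intermediate levels:
 *   sums4[g] = sum of inputs 4g .. 4g+3
 *   sums8[g] = sum of inputs 8g .. 8g+7
 *   *sum16   = sum of all 16 inputs */
void adder_tree16(const int16_t in[WIDTH],
                  int32_t sums4[4], int32_t sums8[2], int32_t *sum16)
{
    int32_t level1[8];
    for (int i = 0; i < 8; i++)
        level1[i] = (int32_t)in[2 * i] + in[2 * i + 1];
    for (int i = 0; i < 4; i++)
        sums4[i] = level1[2 * i] + level1[2 * i + 1];
    for (int i = 0; i < 2; i++)
        sums8[i] = sums4[2 * i] + sums4[2 * i + 1];
    *sum16 = sums8[0] + sums8[1];
}
```

Tapping the intermediate levels lets one pass produce several short dot products (as in a small transform) instead of a single wide sum.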


Even More Data Alignment!

  • Techniques like 2D-Wave and 3D-Wave decoding for H.264 help increase the amount of parallelism, but we have to be able to access different macroblocks for each parallel computation

C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, Journal of Signal Processing Systems, August 2008

Block-based decoding requires us to access different locations of memory for each task. We cannot just rely on fetching contiguous sets of data.
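
The sketch below illustrates why the accesses are scattered, assuming the usual 2D-Wave dependence pattern in which a macroblock waits for its left, top-left, top, and top-right neighbors. Under that assumption macroblock (x, y) can start at step x + 2y, and the macroblocks sharing a step sit on different rows of the frame. The function names and the row-major 8-bit luma layout are assumptions for illustration.

```c
#include <stddef.h>

#define MB_SIZE 16  /* H.264 luma macroblocks are 16x16 pixels */

/* Earliest wavefront step at which macroblock (mb_x, mb_y) can start,
 * assuming it depends on its left, top-left, top, and top-right
 * neighbors. */
int wave_step(int mb_x, int mb_y)
{
    return mb_x + 2 * mb_y;
}

/* Byte offset of a macroblock's top-left pixel in a row-major 8-bit
 * luma plane whose rows are `stride` bytes apart. */
size_t mb_offset(int mb_x, int mb_y, size_t stride)
{
    return (size_t)mb_y * MB_SIZE * stride + (size_t)mb_x * MB_SIZE;
}

/* Collect the macroblocks that share one wavefront step. Returns the
 * count; xs and ys must have room for at least mb_rows entries. */
int wave_members(int step, int mb_cols, int mb_rows, int *xs, int *ys)
{
    int n = 0;
    for (int y = 0; y < mb_rows; y++) {
        int x = step - 2 * y;
        if (x >= 0 && x < mb_cols) {
            xs[n] = x;
            ys[n] = y;
            n++;
        }
    }
    return n;
}
```

For a plane 128 bytes wide, step 4 contains macroblocks (4,0), (2,1), and (0,2), whose offsets are 64, 2080, and 4096: three parallel tasks, three unrelated memory regions.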


Summary

  • Conclusions about 4G and H.264
  • Parallelism comes in many different sizes
  • From 4-wide to 96-wide to 1024-wide SIMD
  • Which means many different SIMD widths need to be supported
  • Very short-lived values
  • Lots of potential for instruction fusion
  • Limited set of shuffle patterns required for each kernel

AnySP Design

SODA SIMD Architecture

32-Wide SIMD with Simple Shuffle Network

AnySP Architecture – High Level

  • 8 Groups of 8-Wide Flexible Function Units
  • 128x128 16-bit Swizzle Network
  • 16-Banked Memory with SRAM-based Crossbar
  • Multiple-Output Adder Tree
  • Temporary Buffer and Bypass Network
  • Datapath AGU and Scalar Pipeline

Multi-Width Support

  • Normal 64-wide SIMD mode: all lanes share one AGU
  • AGU offsets: each 8-wide SIMD group works on different memory locations while running the same 8-wide code
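
The two addressing modes can be sketched as follows. This is a software model under stated assumptions (8 groups of 8 lanes, 16-bit elements, element-granular offsets); the function names and the way offsets are supplied are hypothetical, and the real AGU interface is not described in the slides.

```c
#include <stdint.h>
#include <stddef.h>

#define GROUPS 8
#define GROUP_WIDTH 8

/* Wide mode: one base address. Lane l of group g reads element
 * base + g*GROUP_WIDTH + l, i.e. one contiguous 64-element vector. */
void load_wide(int16_t out[GROUPS][GROUP_WIDTH],
               const int16_t *mem, size_t base)
{
    for (size_t g = 0; g < GROUPS; g++)
        for (size_t l = 0; l < GROUP_WIDTH; l++)
            out[g][l] = mem[base + g * GROUP_WIDTH + l];
}

/* Offset mode: each group adds its own offset to the shared base, so
 * the same 8-wide code touches eight separate regions of memory. */
void load_with_offsets(int16_t out[GROUPS][GROUP_WIDTH],
                       const int16_t *mem, size_t base,
                       const size_t group_offset[GROUPS])
{
    for (size_t g = 0; g < GROUPS; g++)
        for (size_t l = 0; l < GROUP_WIDTH; l++)
            out[g][l] = mem[base + group_offset[g] + l];
}
```

With offsets of 0, 32, 64, and so on, each group would process its own 8-wide block, which is how a narrow kernel could still keep all 64 lanes busy.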

AnySP FFU Datapath

The Flexible Functional Unit (FFU) allows us to

  1. Exploit pipeline parallelism by joining two lanes together
  2. Handle register bypass and the temporary buffer
  3. Join multiple pipelines to process deeper subgraphs
  4. Fuse instruction pairs

AnySP Results

Simulation Environment

  • Traditional SIMD architecture comparison
  • SODA at 90nm technology
  • AnySP
  • Synthesized at 90nm TSMC
  • Power, timing, area numbers were extracted
  • Kernels were hand written and optimized
  • 4G – based on an NTT DoCoMo 4G test setup
  • H.264 – 4CIF@30fps

AnySP Speedup vs SIMD-based Architecture

  • On all benchmarks, AnySP performs more than 2x better than a SIMD-based architecture


AnySP Energy-Delay vs SIMD-based Architecture

  • More importantly energy efficiency is much better!


AnySP Power Breakdown

  • We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm


Conclusion & Future Work

  • Conclusion
  • We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform
  • Within the power budget and meeting the performance target at 45nm
  • Future and Ongoing Work
  • Application-specific language
  • Larger class of algorithms for AnySP
  • Better utilization of resources for non-parallel kernels
  • Speed up sequential parts

Diet-SODA

  • SODA, Ardbeg, and AnySP may be too powerful for the application
  • Simple image processing for cameras
  • Audio processing for voice
  • Give up the flexibility and generality of Ardbeg and AnySP in exchange for performance with fewer gates
  • Build a modular design to which people can add SIMD groups and special function blocks to increase performance, at the cost of area, while allowing voltage scaling


Histogram Equalization

  • Spreads out an unevenly distributed histogram
  • Increases contrast

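/* Pass 1: build the histogram of pixel values */
for (i = 0; i < length; i++) {
    for (j = 0; j < width; j++) {
        k = image[i][j];
        histogram[k] = histogram[k] + 1;
    }
}

/* Pass 2: remap each pixel through sum_of_h (presumably the running sum of the histogram) */
for (i = 0; i < length; i++) {
    for (j = 0; j < width; j++) {
        k = image[i][j];
        image[i][j] = sum_of_h[k] * constant;
    }
}

Hard to parallelize because of histogram[k]: the bin being updated depends on the pixel value, so several lanes may hit the same bin at once.

We can parallelize the load, but we may suffer on the SIMD-to-scalar transfer.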
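
One common workaround, which the slides do not claim to use, is to give each lane a private histogram and merge the copies afterwards. The sketch below models that idea in plain C; `LANES`, the flat pixel array, and both function names are assumptions for illustration, and `cumulative_sum` assumes `sum_of_h` is the running sum of the histogram.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BINS 256
#define LANES 4   /* hypothetical number of parallel lanes */

/* Each lane owns a private histogram, so two lanes never update the
 * same counter. Lane l processes pixels l, l+LANES, l+2*LANES, ... */
void histogram_privatized(const uint8_t *pixels, size_t n,
                          uint32_t histogram[BINS])
{
    uint32_t local[LANES][BINS];
    memset(local, 0, sizeof local);

    /* The outer loop models lanes that would run side by side. */
    for (size_t lane = 0; lane < LANES; lane++)
        for (size_t i = lane; i < n; i += LANES)
            local[lane][pixels[i]]++;

    /* Merge step: a reduction across lanes for every bin. */
    for (size_t k = 0; k < BINS; k++) {
        histogram[k] = 0;
        for (size_t lane = 0; lane < LANES; lane++)
            histogram[k] += local[lane][k];
    }
}

/* Running sum of the histogram, assumed to correspond to sum_of_h. */
void cumulative_sum(const uint32_t histogram[BINS],
                    uint32_t sum_of_h[BINS])
{
    uint32_t running = 0;
    for (size_t k = 0; k < BINS; k++) {
        running += histogram[k];
        sum_of_h[k] = running;
    }
}
```

The trade-off is memory: the private copies cost LANES times the histogram storage, and the merge adds a reduction pass, which is cheap only when the image is much larger than the number of bins.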


Basic Edge Detection

  • detect_edges : 8 directional masks ( Sobel ) + thresholding

Can be easily parallelized across multiple masks or across multiple images


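/* Apply one 3x3 directional mask centered on pixel (i, j) */
for (a = -1; a < 2; a++) {
    for (b = -1; b < 2; b++) {
        sum = sum + image[i+a][j+b] * mask_0[a+1][b+1];
    }
}

/* Threshold the result into a two-level image */
for (i = 0; i < rows; i++) {
    for (j = 0; j < cols; j++) {
        if (out_image[i][j] > high) {
            out_image[i][j] = new_hi;
        } else {
            out_image[i][j] = new_low;
        }
    }
}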


Operations are applied on 3x3 matrices

  1) Reduction tree (need a function of sum of 9 values)
  2) Select max function
  3) Threshold operation

if (out_image > threshold)
    out_image = hi
else
    out_image = low
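
Putting the three steps together, a self-contained version might look like the sketch below. The flat row-major image layout, the border handling, and the function names are assumptions; the mask coefficients are supplied by the caller because the slides do not list them.

```c
#include <stdint.h>

#define NUM_MASKS 8

/* Step 1: sum of the nine products of a 3x3 neighborhood and one mask
 * (the reduction-tree step). */
static int apply_mask(const uint8_t *image, int cols, int i, int j,
                      const int mask[3][3])
{
    int sum = 0;
    for (int a = -1; a <= 1; a++)
        for (int b = -1; b <= 1; b++)
            sum += image[(i + a) * cols + (j + b)] * mask[a + 1][b + 1];
    return sum;
}

/* Steps 2 and 3: for every interior pixel take the largest response
 * across all masks, then threshold it to hi or low. Border pixels have
 * no full 3x3 neighborhood and are set to low. */
void detect_edges(const uint8_t *image, uint8_t *out, int rows, int cols,
                  const int masks[NUM_MASKS][3][3],
                  int threshold, uint8_t hi, uint8_t low)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            if (i == 0 || j == 0 || i == rows - 1 || j == cols - 1) {
                out[i * cols + j] = low;
                continue;
            }
            int max = apply_mask(image, cols, i, j, masks[0]);
            for (int m = 1; m < NUM_MASKS; m++) {
                int s = apply_mask(image, cols, i, j, masks[m]);
                if (s > max)
                    max = s;
            }
            out[i * cols + j] = (max > threshold) ? hi : low;
        }
    }
}
```

Each of the eight mask evaluations is independent, which is what makes the kernel easy to spread across SIMD groups.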

Performance Estimate

  • Estimate of how many cycles we need on 2048x1536 (3.1 Megapixel) images
  • Unoptimized SIMDized kernels assuming 16-wide SIMD
  • Most work is done on 9-wide (3x3) data elements
  • Adding multiple SIMD groups can increase performance in most kernels

Kernel                      Sequential (Mcycle)  SIMDized (Mcycle)
Histogram Equalization      75                   75
Edge Detection              3267                 530
Homogeneity Edge Detection  556                  50
Filter                      411                  81


Modular Design Consideration

  • Build a highly application-specific modular core
  • Depending on which applications are supported, the user can add or remove specialized hardware
  • Design requirement tradeoff
  • Smallest area
  • Use the minimum number of SIMD groups, or run the SIMD groups faster
  • Faster performance
  • Add more SIMD groups
  • Lower power
  • Add more SIMD groups, then lower the frequency and VDD for the same throughput
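
The lower-power option can be checked with the standard first-order model of dynamic power, P ~ C * V^2 * f. The sketch below is a back-of-the-envelope calculator, not a result from the slides: it ignores leakage, assumes switched capacitance and throughput scale linearly with the number of SIMD groups, and the function names are hypothetical.

```c
/* Dynamic power relative to a baseline design, using P ~ C * V^2 * f.
 * group_ratio scales switched capacitance (more SIMD groups);
 * freq_ratio and vdd_ratio scale clock frequency and supply voltage. */
double relative_dynamic_power(double group_ratio, double freq_ratio,
                              double vdd_ratio)
{
    return group_ratio * freq_ratio * vdd_ratio * vdd_ratio;
}

/* Throughput relative to baseline, assuming it scales linearly with
 * both the number of SIMD groups and the clock frequency. */
double relative_throughput(double group_ratio, double freq_ratio)
{
    return group_ratio * freq_ratio;
}
```

For example, doubling the groups and halving the frequency keeps throughput at 1.0; if the slower clock allowed VDD to drop to 0.8 of its original value (an illustrative figure), dynamic power would fall to 2 * 0.5 * 0.8^2 = 0.64 of the baseline, paid for in area.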


The End