SLIDE 1
RICE UNIVERSITY

High performance, power-efficient DSPs based on the TI C64x

Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu

SLIDE 2

Recent (2003) Research Results

• Stream-based programmable processors meet real-time requirements for a set of base-station physical (PHY) layer algorithms+,*
• Mapped algorithms onto stream processors and studied trade-offs between subword packing, ALU utilization, and memory operations
• Improved power efficiency in stream processors by adapting compute resources to workload variations and by varying voltage and clock frequency to match real-time requirements*
• Design exploration between the number of ALUs and the clock frequency to minimize processor power consumption

+ S. Rajagopal, S. Rixner, and J. R. Cavallaro, 'A programmable baseband processor design for software defined radios', 2002
* Paper draft sent previously; the remaining contributions are in the thesis

SLIDE 3

Recent (2003) Research Results

• Peak computation rate available: ~200 billion arithmetic operations per second at 1.2 GHz
• Estimated peak power (0.13 micron): 12.38 W at 1.2 GHz
• Power: 12.38 W for 32 users, constraint-length-9 decoding, at 128 Kbps/user (at 1.2 GHz, 1.4 V)
• 300 mW for 4 users, constraint-length-7 decoding, at 128 Kbps/user (at 433 MHz, 0.875 V)
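As a rough sanity check (my reading, not stated on the slide), scaling the 12.38 W peak by the usual $P \propto V^2 f$ model accounts for most, but not all, of the drop to 300 mW; the remainder must come from gating resources idled by the lighter 4-user workload:

\[
P_{\text{DVS}} \approx 12.38\,\text{W} \times \left(\frac{0.875}{1.4}\right)^{2} \times \frac{433}{1200} \approx 1.75\,\text{W}
\]

so DVS alone gives roughly a 7x reduction, and resource gating supplies the remaining ~6x down to 300 mW.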

SLIDE 4

Motivation

This research could be applied to DSP design! Designing DSPs that are:
• High performance
• Power-efficient
• Able to adapt computing resources to workload changes
such that only gradual changes are needed in the C64x architecture and in the compilers and tools.

SLIDE 5

Levels of changes

To allow changes in TI DSPs and tools to be made gradually, the proposed changes are classified into three levels:
• Level 1: simple, minimal changes (next silicon)
• Level 2: intermediate, handover changes (1-2 years)
• Level 3: actual proposed changes (2-3 years)
We want to get to Level 3, but in steps!

SLIDE 6

Level 1 changes: Power-efficiency

SLIDE 7

Level 1 changes: Power saving features

(1) Use dynamic voltage and frequency scaling (DVS)
• When the workload changes: users, data rates, modulation, coding rates, …
• Already in industry: Crusoe, XScale, …
(2) Use voltage gating to turn off unused resources
• When units are idle for a 'sufficiently' long time
• Saves both static and dynamic power dissipation
• See example on next page

SLIDE 8

Turning off ALUs

[Figure: instruction schedules across adders and multipliers — the default schedule vs. the schedule after exploration; a 'sleep' instruction turns off 2 multipliers to save power.]

Turned off using voltage gating to eliminate static and dynamic power dissipation

SLIDE 9

Level 1: Architecture tradeoffs

DVS:
• Requires an advanced voltage-regulation scheme
• Cannot use NMOS pass gates; cannot use tri-state buffers
• Use at a coarse time scale (once in a million cycles); 100-1000 cycles settling time
Voltage gating:
• Gating-device design is important: it must be able to supply current to the gated circuit
• Use at a coarser time scale (once in 100-1000 cycles); 1-10 cycles settling time

SLIDE 10

Level 1: Tools/Programming impact

• Need a DSP/BIOS "task" running continuously that monitors workload changes and changes the voltage/frequency using a look-up table in memory
• The compiler should be made 're-targetable': target a subset of the ALUs and explore static performance with different adder-multiplier schedules
• Voltage gating uses a 'sleep' instruction that the compiler generates for unused ALUs; ALUs should be idle for > 100 cycles for this to occur
• Other resources can be gated off similarly to save static power dissipation
• The programmer is not aware of these changes
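The look-up-table task described above might look like the following sketch. The table layout, its entries, and `dvs_lookup()` are hypothetical illustrations (the frequency/voltage pairs echo the settings used elsewhere in the deck), not TI DSP/BIOS APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-up-table entry for the DVS monitor task. */
typedef struct {
    int max_users;   /* largest workload this entry covers */
    int freq_mhz;    /* clock frequency to select          */
    int voltage_mv;  /* supply voltage to select           */
} dvs_entry;

/* Look-up table in memory, ordered by increasing workload. */
static const dvs_entry dvs_table[] = {
    {  4,  433,  875 },
    {  8,  533,  950 },
    { 16,  667, 1050 },
    { 32, 1200, 1400 },
};

/* Pick the lowest frequency/voltage pair that still meets real time;
 * the continuously running task would call this on workload changes. */
const dvs_entry *dvs_lookup(int active_users)
{
    for (size_t i = 0; i < sizeof dvs_table / sizeof dvs_table[0]; ++i)
        if (active_users <= dvs_table[i].max_users)
            return &dvs_table[i];
    return &dvs_table[3];  /* saturate at the worst-case setting */
}
```

The programmer never calls this directly; the monitor task would apply the selected pair through the platform's clock/voltage control registers.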

SLIDE 11

Level 2 changes: Performance

SLIDE 12

Solutions to increase DSP performance

(1) Increasing clock frequency
• C64x: 600 → 720 → 1000 → ? MHz
• Easiest solution, but limited benefits
• Not good for power, given the cubic dependence of power on frequency
(2) Increasing ALUs
• Limited instruction-level parallelism (ILP)
• Register-file area and port explosion
• Compiler issues in extracting more ILP
(3) Multiprocessors (MIMD)
• Usually from 3rd-party vendors (except C40-types)

SLIDE 13

DSP multiprocessors

Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80

[Figure: multiple DSPs, ASSPs, and co-processors connected through a network interface and an interconnection.]

SLIDE 14

Multiprocessing tradeoffs

Advantages: performance, and the tools don't have to change!
Disadvantages:
• Load-balancing algorithms on multiple DSPs is not straightforward+; the burden is pushed onto the programmer
• Not scalable with the number of processors; difficult to adapt to workload changes
• Traditional DSPs are not built for multiprocessing* (except C40-types): I/O impacts throughput, power, and area; (E)DMA use minimizes the throughput problem, but the power and area problems remain

* R. Baines, 'The DSP bottleneck', IEEE Communications Magazine, May 1995, pp. 46-54 (outdated?)
+ S. Rajagopal, B. Jones, and J. R. Cavallaro, 'Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs', ICSPAT 2001

SLIDE 15

Options

Chip multiprocessors with SIMD parallelism (Level 3):
• SIMD parallelism can alleviate load balancing (shown in Level 3)
• Scalable with the number of processors
• SIMD parallelism can be extracted automatically by the compiler
• A single chip alleviates the I/O bottlenecks
• Tools will need changes
To get to Level 3, an intermediate (Level 2) investigation: run SPMD on a DSP multiprocessor.

SLIDE 16

Texas Instruments C64x DSP

Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)

C64x Datapath

SLIDE 17

A possible, plausible solution

Exploit data parallelism (DP)*: available in many wireless algorithms — this is what ASICs do!

    int i, a[N], b[N], sum[N];       // 32 bits
    short int c[N], d[N], diff[N];   // 16 bits, packed
    for (i = 0; i < N; ++i) {
        sum[i]  = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }

[Figure: the loop exposes ILP, DP, and subword parallelism.]

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

SLIDE 18

SPMD multiprocessor DSP

[Figure: four C64x datapaths side by side.]

Same Program running on all DSPs

SLIDE 19

Level 2: Architecture tradeoffs

The C64x interconnection could be similar to those used by 3rd-party vendors:
• FPGA-based C40 comm ports (Sundance): ~400 MB/s
• VIM modules (Pentek): ~300 MB/s
• Others developed by TI, BlueWave Systems

SLIDE 20

Level 2: Tools/Programming impact

• All DSPs run the same program; the programmer thinks in terms of only one DSP program
• The burden is now on the tools; can use C8x compiler and tool-support expertise (integration of the C8x and C6x compilers)
• Data parallelism is used for SPMD
• DMA data movement can be left to the programmer at this stage, to keep data fed to all the processors
• MPI (message passing) can alternatively be applied
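A minimal sketch of the "one program, many DSPs" model above: every DSP runs the same function, and only its processor id differs. `NUM_DSPS`, `my_id`, and the slice-summing kernel are illustrative, and the combining step over the interconnect/DMA is elided.

```c
#include <assert.h>

/* Number of DSPs in the SPMD system (illustrative). */
#define NUM_DSPS 4

/* Sample input, assumed to have a length divisible by NUM_DSPS so the
 * per-DSP slices are equal (load-balanced by construction). */
static const int demo_data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Each DSP sums its own contiguous slice of the input. */
int spmd_partial_sum(const int *data, int n, int my_id)
{
    int chunk = n / NUM_DSPS;   /* equal share per DSP          */
    int start = my_id * chunk;  /* this DSP's slice begins here */
    int sum = 0;
    for (int i = start; i < start + chunk; ++i)
        sum += data[i];
    return sum;  /* partial result; combined over the interconnect */
}
```

Because every processor executes the identical program, load balancing falls out of the equal slicing rather than out of per-processor code, which is what makes this model attractive over hand-partitioned MIMD.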

SLIDE 21

Level 3 changes: Performance and Power

SLIDE 22

A chip multiprocessor (CMP) DSP

[Figure: a single C64x DSP core (1 cluster: 3 adders, 3 multipliers) with L2 internal memory exploits ILP and subword parallelism. The C64x-based CMP DSP core replicates identical clusters behind a shared instruction decoder, adding DP across clusters: the number of clusters adapts to the DP, identical clusters perform the same operations, and unused ALUs and clusters are powered down.]

SLIDE 23

A 4 cluster CMP using TI C64x

[Figure: four C64x datapaths sharing one instruction decoder.]

Significant savings possible in area and power; increasing benefits with larger numbers of clusters (8, 16, 32 clusters).

SLIDE 24

Alternate view of the CMP DSP

[Figure: alternate view — a DMA controller and banked L2 internal memory (banks 1…C) feed prefetch buffers and clusters of C64x cores (0…C) over an inter-cluster communication network, under a single instruction decoder.]

SLIDE 25

Adapting #clusters to Data Parallelism

Adaptive Multiplexer Network

[Figure: an adaptive multiplexer network connects memory banks to clusters, supporting no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, or all clusters off.]

Turned off using voltage gating to eliminate static and dynamic power dissipation

SLIDE 26

Level 3: Architecture tradeoffs

• Single processor → SPMD → SIMD
• Single chip: maximum die size limited to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
• Number of memory banks = number of clusters
• Instruction addition to turn off clusters when data parallelism is insufficient

SLIDE 27

Level 3: Tools/Programming impact

• The Level 2 compiler provides support for data parallelism
• Adapt the number of clusters to the data parallelism for power savings: check the loop count after loop unrolling, and if it is less than the number of clusters, emit an instruction to turn off the surplus clusters
• Design of parallel algorithms and their mapping is important
• The programmer still writes regular C code; the changes are transparent to the programmer, and the burden is on the compiler
• Automatic DMA data movement keeps data feeding into the arithmetic units
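The trip-count check described above amounts to the following. `active_clusters()` is an illustrative helper, not a real compiler or tool interface.

```c
#include <assert.h>

/* After loop unrolling, the exploitable data parallelism is bounded by
 * the remaining trip count, so any cluster beyond it would sit idle.
 * Keep only as many clusters as there are parallel iterations; the
 * rest are gated off via the proposed cluster-off instruction. */
int active_clusters(int trip_count, int num_clusters)
{
    return trip_count < num_clusters ? trip_count : num_clusters;
}
```

A compiler would emit this decision statically per loop nest, so no run-time cost is paid on the datapath itself.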

SLIDE 28

Verification of potential benefits

Level 3 potential verified using the Imagine stream-processor simulator, replacing the C64x DSP with a cluster containing 3 adders, 3 multipliers, and a distributed register file.

SLIDE 29

Need for adapting to flexibility

• Base-stations are designed for the worst-case workload
• Base-stations rarely operate at the worst-case workload
• Adapting the resources to the workload can save power!

SLIDE 30

Example of flexibility needed in workloads

[Figure: operation count (in GOPs, 5-25) for workloads (4,7) through (32,9), i.e. (users, constraint length), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user).]

Billions of computations per second are needed: the workload varies from ~1 GOPs for 4 users with constraint-length-7 Viterbi to ~23 GOPs for 32 users with constraint-length-9 Viterbi.

Note: GOPs refer only to arithmetic computations.

SLIDE 31

Flexibility affects Data Parallelism*

Workload (U,K) | Estimation f(U,N) | Detection f(U,N) | Decoding f(U,K,R)
(4,7)          | 32                | 4                | 16
(4,9)          | 32                | 4                | 64
(8,7)          | 32                | 8                | 16
(8,9)          | 32                | 8                | 64
(16,7)         | 32                | 16               | 16
(16,9)         | 32                | 16               | 64
(32,7)         | 32                | 32               | 16
(32,9)         | 32                | 32               | 64

U - Users, K - constraint length, N - spreading gain, R - decoding rate

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

SLIDE 32

Cluster utilization variation with workload

[Figure: cluster utilization (50-100%) vs. cluster index (5-30) on a 32-cluster processor, for workloads (4,7) through (32,9); (32,9) = 32 users, constraint-length-9 Viterbi.]

SLIDE 33

Frequency variation with workload

[Figure: real-time frequency (in MHz, 200-1200) required for workloads (4,7) through (32,9), broken down into busy time, L2 stalls, and memory stalls.]

SLIDE 34

Operation

• DVS when the system changes significantly (users, data rates, …): coarse time scale (every few seconds)
• Turn off clusters when the parallelism changes significantly; parallelism can change within the same algorithm (e.g. the spreading gain changes during matched filtering): finer time scale (100s of microseconds)
• Turn off ALUs when the algorithms change significantly (estimation, detection, decoding): finer time scale (100s of microseconds)

SLIDE 35

Power savings: Voltage Gating & Scaling

Workload | Freq needed (MHz) | Freq used (MHz) | Voltage (V) | Savings: clocking (W) | Savings: memory (W) | Savings: clusters (W) | New power (W) | Base power (W) | Savings
(4,7)    | 345.09            | 433             | 0.875       | 0.325                 | 1.05                | 0.366                 | 0.30          | 2.05           | 85.14 %
(4,9)    | 380.69            | 433             | 0.875       | 0.193                 | 0.56                | 0.604                 | 0.69          | 2.05           | 66.41 %
(8,7)    | 408.89            | 433             | 0.875       | 0.089                 | 0.54                | 0.649                 | 0.77          | 2.05           | 62.44 %
(8,9)    | 463.29            | 533             | 0.95        | 0.304                 | 0.71                | 0.643                 | 1.33          | 2.98           | 55.46 %
(16,7)   | 528.41            | 533             | 0.95        | 0.02                  | 0.44                | 0.808                 | 1.71          | 2.98           | 42.54 %
(16,9)   | 637.21            | 667             | 1.05        | 0.156                 | 0.58                | 0.603                 | 3.21          | 4.55           | 29.46 %
(32,7)   | 902.89            | 1000            | 1.3         | 0.792                 | 1.18                | 1.375                 | 7.11          | 10.46          | 32.03 %
(32,9)   | 1118.3            | 1200            | 1.4         | 0.774                 | 1.41                | —                     | 12.38         | 14.56          | 14.98 %

Estimated cluster power consumption: 78 %
Estimated L2 memory power consumption: 11.5 %
Estimated instruction decoder power consumption: 10.5 %
Estimated chip area (0.13 micron process): 45.7 mm²

Power can change from 12.38 W to 300 mW depending on workload changes
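The savings column in the table is consistent (to rounding; the published figures were presumably computed from unrounded power numbers) with savings = (base − new) / base:

```c
#include <assert.h>

/* Relative power savings, in percent, from reducing base_w to new_w.
 * This only cross-checks the table's last column; small discrepancies
 * arise because the published rows are rounded. */
double savings_percent(double base_w, double new_w)
{
    return 100.0 * (base_w - new_w) / base_w;
}
```

For example, the (4,7) row gives roughly 85 % (2.05 W down to 0.30 W) and the (32,9) row roughly 15 % (14.56 W down to 12.38 W), matching the table.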

SLIDE 36

How to decide ALUs vs. clock frequency

There are no independent variables: clusters, ALUs, frequency, and voltage all trade off against one another. How do we find the right combination that meets real time at the lowest power?

P ∝ C V² f; with V ∝ f, this gives P ∝ f³

[Figure: design points meeting the same real-time target — (A) 1 cluster at 100 GHz, (B) 'c' clusters with 'a' adders and 'm' multipliers each at 'f' MHz, (C) 100 clusters at 10 MHz.]

SLIDE 37

Setting clusters, adders, multipliers

• If there is sufficient DP, frequency decreases linearly with the number of clusters: set the cluster count from the DP and an execution-time estimate
• To find the numbers of adders and multipliers, let the compiler schedule the algorithm workloads across different adder and multiplier counts and report the execution times
• Put all the numbers into the previous equation: compare the increase in capacitance due to added ALUs and clusters against the benefit in execution time
• Choose the solution that minimizes power

Details available in Sridhar’s thesis
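The selection step above can be sketched as a search over candidate configurations. The struct layout and any capacitance/frequency numbers a caller supplies are illustrative, and the P ∝ C·f³ objective assumes voltage is scaled linearly with frequency, as in the earlier slide.

```c
#include <assert.h>

/* Candidate machine configuration: relative switched capacitance
 * (more ALUs/clusters => larger C) paired with the clock frequency
 * the compiler's schedule says is needed for real time. */
typedef struct {
    double cap;       /* relative switched capacitance  */
    double freq_mhz;  /* frequency needed for real time */
} config;

/* Return the index of the configuration minimizing C * f^3. */
int min_power_config(const config *c, int n)
{
    int best = 0;
    for (int i = 1; i < n; ++i) {
        double p_i = c[i].cap * c[i].freq_mhz * c[i].freq_mhz
                              * c[i].freq_mhz;
        double p_b = c[best].cap * c[best].freq_mhz * c[best].freq_mhz
                                 * c[best].freq_mhz;
        if (p_i < p_b)
            best = i;
    }
    return best;
}
```

Because frequency enters cubically while capacitance enters linearly, a configuration with more ALUs usually wins whenever the added hardware buys a proportionate drop in the real-time frequency.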

SLIDE 38

Conclusions

• We propose a step-by-step methodology to design high-performance, power-efficient DSPs based on the TI C64x architecture
• Initial results show benefits in power/performance of greater than an order of magnitude over a conventional C64x
• We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
• We are interested in exploring opportunities with TI for the design and actual fabrication of a chip and the associated tool development
• We are interested in feedback: limitations that we have not accounted for, and unreasonable assumptions that we have made

Recommended reading:

• S. Rixner et al., 'A register organization for media processing', HPCA 2000
• B. Khailany et al., 'Exploring the VLSI scalability of stream processors', HPCA 2003
• U. J. Kapasi et al., 'Programmable Stream Processors', IEEE Computer, August 2003