SLIDE 1
RICE UNIVERSITY

High performance, power-efficient DSPs based on the TI C64x

Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu

SLIDE 2

Recent (2003) Research Results

• Stream-based programmable processors meet real-time requirements for a set of base-station physical (PHY) layer algorithms+,*
• Mapped algorithms onto stream processors and studied trade-offs between subword packing, ALU utilization, and memory operations
• Improved power efficiency in stream processors by adapting compute resources to workload variations and by varying voltage and clock frequency to match real-time requirements*
• Design exploration between the number of ALUs and the clock frequency to minimize processor power consumption

+ S. Rajagopal, S. Rixner, and J. R. Cavallaro, 'A programmable baseband processor design for software defined radios', 2002
* Paper draft sent previously; the remaining contributions are in the thesis

SLIDE 3

Recent (2003) Research Results

• Peak computation rate available: ~200 billion arithmetic operations per second at 1.2 GHz
• Estimated peak power (0.13 micron): 12.38 W at 1.2 GHz
• Power: 12.38 W for 32 users, constraint-length-9 decoding, at 128 Kbps/user (at 1.2 GHz, 1.4 V)
• 300 mW for 4 users, constraint-length-7 decoding, at 128 Kbps/user (at 433 MHz, 0.875 V)
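As a rough sanity check (my reading, not stated on the slide), scaling the 12.38 W peak by the usual $P \propto V^2 f$ model accounts for most, but not all, of the drop to 300 mW; the remainder must come from gating resources idled by the lighter 4-user workload:

\[
P_{\text{DVS}} \approx 12.38\,\text{W} \times \left(\frac{0.875}{1.4}\right)^{2} \times \frac{433}{1200} \approx 1.75\,\text{W}
\]

so DVS alone gives roughly a 7x reduction, and resource gating supplies the remaining ~6x down to 300 mW.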

SLIDE 4

Motivation

This research could be applied to DSP design! Designing DSPs that are:
• High performance
• Power-efficient
• Able to adapt computing resources to workload changes
such that only gradual changes are needed in the C64x architecture and in the compilers and tools.

SLIDE 5

Levels of changes

To allow changes in TI DSPs and tools to be made gradually, the proposed changes are classified into three levels:
• Level 1: simple, minimal changes (next silicon)
• Level 2: intermediate, handover changes (1-2 years)
• Level 3: actual proposed changes (2-3 years)
We want to get to Level 3, but in steps!

SLIDE 6

Level 1 changes: Power-efficiency

SLIDE 7

Level 1 changes: Power saving features

(1) Use dynamic voltage and frequency scaling (DVS)
• When the workload changes: users, data rates, modulation, coding rates, …
• Already in industry: Crusoe, XScale, …
(2) Use voltage gating to turn off unused resources
• When units are idle for a 'sufficiently' long time
• Saves both static and dynamic power dissipation
• See example on next page

SLIDE 8

Turning off ALUs

[Figure: instruction schedules across adders and multipliers — the default schedule vs. the schedule after exploration; a 'sleep' instruction turns off 2 multipliers to save power.]

Turned off using voltage gating to eliminate static and dynamic power dissipation

SLIDE 9

Level 1: Architecture tradeoffs

DVS:
• Requires an advanced voltage-regulation scheme
• Cannot use NMOS pass gates; cannot use tri-state buffers
• Use at a coarse time scale (once in a million cycles); 100-1000 cycles settling time
Voltage gating:
• Gating-device design is important: it must be able to supply current to the gated circuit
• Use at a coarser time scale (once in 100-1000 cycles); 1-10 cycles settling time

SLIDE 10

Level 1: Tools/Programming impact

• Need a DSP/BIOS "task" running continuously that monitors workload changes and changes the voltage/frequency using a look-up table in memory
• The compiler should be made 're-targetable': target a subset of the ALUs and explore static performance with different adder-multiplier schedules
• Voltage gating uses a 'sleep' instruction that the compiler generates for unused ALUs; ALUs should be idle for > 100 cycles for this to occur
• Other resources can be gated off similarly to save static power dissipation
• The programmer is not aware of these changes
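The look-up-table task described above might look like the following sketch. The table layout, its entries, and `dvs_lookup()` are hypothetical illustrations (the frequency/voltage pairs echo the settings used elsewhere in the deck), not TI DSP/BIOS APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-up-table entry for the DVS monitor task. */
typedef struct {
    int max_users;   /* largest workload this entry covers */
    int freq_mhz;    /* clock frequency to select          */
    int voltage_mv;  /* supply voltage to select           */
} dvs_entry;

/* Look-up table in memory, ordered by increasing workload. */
static const dvs_entry dvs_table[] = {
    {  4,  433,  875 },
    {  8,  533,  950 },
    { 16,  667, 1050 },
    { 32, 1200, 1400 },
};

/* Pick the lowest frequency/voltage pair that still meets real time;
 * the continuously running task would call this on workload changes. */
const dvs_entry *dvs_lookup(int active_users)
{
    for (size_t i = 0; i < sizeof dvs_table / sizeof dvs_table[0]; ++i)
        if (active_users <= dvs_table[i].max_users)
            return &dvs_table[i];
    return &dvs_table[3];  /* saturate at the worst-case setting */
}
```

The programmer never calls this directly; the monitor task would apply the selected pair through the platform's clock/voltage control registers.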

SLIDE 11

Level 2 changes: Performance

SLIDE 12

Solutions to increase DSP performance

(1) Increasing clock frequency
• C64x: 600 → 720 → 1000 → ? MHz
• Easiest solution, but limited benefits
• Not good for power, given the cubic dependence of power on frequency
(2) Increasing ALUs
• Limited instruction-level parallelism (ILP)
• Register-file area and port explosion
• Compiler issues in extracting more ILP
(3) Multiprocessors (MIMD)
• Usually from 3rd-party vendors (except C40-types)

SLIDE 13

DSP multiprocessors

Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80

[Figure: multiple DSPs, ASSPs, and co-processors connected through a network interface and an interconnection.]

SLIDE 14

Multiprocessing tradeoffs

Advantages: performance, and the tools don't have to change!
Disadvantages:
• Load-balancing algorithms on multiple DSPs is not straightforward+; the burden is pushed onto the programmer
• Not scalable with the number of processors; difficult to adapt to workload changes
• Traditional DSPs are not built for multiprocessing* (except C40-types): I/O impacts throughput, power, and area; (E)DMA use minimizes the throughput problem, but the power and area problems remain

* R. Baines, 'The DSP bottleneck', IEEE Communications Magazine, May 1995, pp. 46-54 (outdated?)
+ S. Rajagopal, B. Jones, and J. R. Cavallaro, 'Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs', ICSPAT 2001

SLIDE 15

Options

Chip multiprocessors with SIMD parallelism (Level 3):
• SIMD parallelism can alleviate load balancing (shown in Level 3)
• Scalable with the number of processors
• SIMD parallelism can be extracted automatically by the compiler
• A single chip alleviates the I/O bottlenecks
• Tools will need changes
To get to Level 3, an intermediate (Level 2) investigation: run SPMD on a DSP multiprocessor.

SLIDE 16

Texas Instruments C64x DSP

Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)

C64x Datapath

SLIDE 17

A possible, plausible solution

Exploit data parallelism (DP)*: available in many wireless algorithms — this is what ASICs do!

    int i, a[N], b[N], sum[N];       // 32 bits
    short int c[N], d[N], diff[N];   // 16 bits, packed
    for (i = 0; i < N; ++i) {
        sum[i]  = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }

[Figure: the loop exposes ILP, DP, and subword parallelism.]

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

SLIDE 18

SPMD multiprocessor DSP

[Figure: four C64x datapaths side by side.]

Same Program running on all DSPs

SLIDE 19

Level 2: Architecture tradeoffs

The C64x interconnection could be similar to those used by 3rd-party vendors:
• FPGA-based C40 comm ports (Sundance): ~400 MB/s
• VIM modules (Pentek): ~300 MB/s
• Others developed by TI, BlueWave Systems

SLIDE 20

Level 2: Tools/Programming impact

• All DSPs run the same program; the programmer thinks in terms of only one DSP program
• The burden is now on the tools; can use C8x compiler and tool-support expertise (integration of the C8x and C6x compilers)
• Data parallelism is used for SPMD
• DMA data movement can be left to the programmer at this stage, to keep data fed to all the processors
• MPI (message passing) can alternatively be applied
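A minimal sketch of the "one program, many DSPs" model above: every DSP runs the same function, and only its processor id differs. `NUM_DSPS`, `my_id`, and the slice-summing kernel are illustrative, and the combining step over the interconnect/DMA is elided.

```c
#include <assert.h>

/* Number of DSPs in the SPMD system (illustrative). */
#define NUM_DSPS 4

/* Sample input, assumed to have a length divisible by NUM_DSPS so the
 * per-DSP slices are equal (load-balanced by construction). */
static const int demo_data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Each DSP sums its own contiguous slice of the input. */
int spmd_partial_sum(const int *data, int n, int my_id)
{
    int chunk = n / NUM_DSPS;   /* equal share per DSP          */
    int start = my_id * chunk;  /* this DSP's slice begins here */
    int sum = 0;
    for (int i = start; i < start + chunk; ++i)
        sum += data[i];
    return sum;  /* partial result; combined over the interconnect */
}
```

Because every processor executes the identical program, load balancing falls out of the equal slicing rather than out of per-processor code, which is what makes this model attractive over hand-partitioned MIMD.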

SLIDE 21

Level 3 changes: Performance and Power

SLIDE 22

A chip multiprocessor (CMP) DSP

[Figure: a single C64x DSP core (1 cluster: 3 adders, 3 multipliers) with L2 internal memory exploits ILP and subword parallelism. The C64x-based CMP DSP core replicates identical clusters behind a shared instruction decoder, adding DP across clusters: the number of clusters adapts to the DP, identical clusters perform the same operations, and unused ALUs and clusters are powered down.]

SLIDE 23

A 4 cluster CMP using TI C64x

[Figure: four C64x datapaths sharing one instruction decoder.]

Significant savings possible in area and power; increasing benefits with larger numbers of clusters (8, 16, 32 clusters).

SLIDE 24

Alternate view of the CMP DSP

[Figure: alternate view — a DMA controller and banked L2 internal memory (banks 1…C) feed prefetch buffers and clusters of C64x cores (0…C) over an inter-cluster communication network, under a single instruction decoder.]

SLIDE 25

Adapting #clusters to Data Parallelism

Adaptive Multiplexer Network

[Figure: an adaptive multiplexer network connects memory banks to clusters, supporting no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, or all clusters off.]

Turned off using voltage gating to eliminate static and dynamic power dissipation

SLIDE 26

Level 3: Architecture tradeoffs

• Single processor → SPMD → SIMD
• Single chip: maximum die size limited to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
• Number of memory banks = number of clusters
• Instruction addition to turn off clusters when data parallelism is insufficient

SLIDE 27

Level 3: Tools/Programming impact

• The Level 2 compiler provides support for data parallelism
• Adapt the number of clusters to the data parallelism for power savings: check the loop count after loop unrolling, and if it is less than the number of clusters, emit an instruction to turn off the surplus clusters
• Design of parallel algorithms and their mapping is important
• The programmer still writes regular C code; the changes are transparent to the programmer, and the burden is on the compiler
• Automatic DMA data movement keeps data feeding into the arithmetic units
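The trip-count check described above amounts to the following. `active_clusters()` is an illustrative helper, not a real compiler or tool interface.

```c
#include <assert.h>

/* After loop unrolling, the exploitable data parallelism is bounded by
 * the remaining trip count, so any cluster beyond it would sit idle.
 * Keep only as many clusters as there are parallel iterations; the
 * rest are gated off via the proposed cluster-off instruction. */
int active_clusters(int trip_count, int num_clusters)
{
    return trip_count < num_clusters ? trip_count : num_clusters;
}
```

A compiler would emit this decision statically per loop nest, so no run-time cost is paid on the datapath itself.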

SLIDE 28

Verification of potential benefits

Level 3 potential verified using the Imagine stream-processor simulator, replacing the C64x DSP with a cluster containing 3 adders, 3 multipliers, and a distributed register file.

SLIDE 29

Need for adapting to flexibility

• Base-stations are designed for the worst-case workload
• Base-stations rarely operate at the worst-case workload
• Adapting the resources to the workload can save power!

SLIDE 30

Example of flexibility needed in workloads

[Figure: operation count (in GOPs, 5-25) for workloads (4,7) through (32,9), i.e. (users, constraint length), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user).]

Billions of computations per second are needed: the workload varies from ~1 GOPs for 4 users with constraint-length-7 Viterbi to ~23 GOPs for 32 users with constraint-length-9 Viterbi.

Note: GOPs refer only to arithmetic computations.

SLIDE 31

Flexibility affects Data Parallelism*

Workload (U,K) | Estimation f(U,N) | Detection f(U,N) | Decoding f(U,K,R)
(4,7)          | 32                | 4                | 16
(4,9)          | 32                | 4                | 64
(8,7)          | 32                | 8                | 16
(8,9)          | 32                | 8                | 64
(16,7)         | 32                | 16               | 16
(16,9)         | 32                | 16               | 64
(32,7)         | 32                | 32               | 16
(32,9)         | 32                | 32               | 64

U - Users, K - constraint length, N - spreading gain, R - decoding rate

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

SLIDE 32

Cluster utilization variation with workload

[Figure: cluster utilization (50-100%) vs. cluster index (5-30) on a 32-cluster processor, for workloads (4,7) through (32,9); (32,9) = 32 users, constraint-length-9 Viterbi.]

SLIDE 33

Frequency variation with workload

[Figure: real-time frequency (in MHz, 200-1200) required for workloads (4,7) through (32,9), broken down into busy time, L2 stalls, and memory stalls.]

SLIDE 34

Operation

• DVS when the system changes significantly (users, data rates, …): coarse time scale (every few seconds)
• Turn off clusters when the parallelism changes significantly; parallelism can change within the same algorithm (e.g. the spreading gain changes during matched filtering): finer time scale (100s of microseconds)
• Turn off ALUs when the algorithms change significantly (estimation, detection, decoding): finer time scale (100s of microseconds)

SLIDE 35

Power savings: Voltage Gating & Scaling

Workload | Freq needed (MHz) | Freq used (MHz) | Voltage (V) | Savings: clocking (W) | Savings: memory (W) | Savings: clusters (W) | New power (W) | Base power (W) | Savings
(4,7)    | 345.09            | 433             | 0.875       | 0.325                 | 1.05                | 0.366                 | 0.30          | 2.05           | 85.14 %
(4,9)    | 380.69            | 433             | 0.875       | 0.193                 | 0.56                | 0.604                 | 0.69          | 2.05           | 66.41 %
(8,7)    | 408.89            | 433             | 0.875       | 0.089                 | 0.54                | 0.649                 | 0.77          | 2.05           | 62.44 %
(8,9)    | 463.29            | 533             | 0.95        | 0.304                 | 0.71                | 0.643                 | 1.33          | 2.98           | 55.46 %
(16,7)   | 528.41            | 533             | 0.95        | 0.02                  | 0.44                | 0.808                 | 1.71          | 2.98           | 42.54 %
(16,9)   | 637.21            | 667             | 1.05        | 0.156                 | 0.58                | 0.603                 | 3.21          | 4.55           | 29.46 %
(32,7)   | 902.89            | 1000            | 1.3         | 0.792                 | 1.18                | 1.375                 | 7.11          | 10.46          | 32.03 %
(32,9)   | 1118.3            | 1200            | 1.4         | 0.774                 | 1.41                | —                     | 12.38         | 14.56          | 14.98 %

Estimated cluster power consumption: 78 %
Estimated L2 memory power consumption: 11.5 %
Estimated instruction decoder power consumption: 10.5 %
Estimated chip area (0.13 micron process): 45.7 mm²

Power can change from 12.38 W to 300 mW depending on workload changes
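The savings column in the table is consistent (to rounding; the published figures were presumably computed from unrounded power numbers) with savings = (base − new) / base:

```c
#include <assert.h>

/* Relative power savings, in percent, from reducing base_w to new_w.
 * This only cross-checks the table's last column; small discrepancies
 * arise because the published rows are rounded. */
double savings_percent(double base_w, double new_w)
{
    return 100.0 * (base_w - new_w) / base_w;
}
```

For example, the (4,7) row gives roughly 85 % (2.05 W down to 0.30 W) and the (32,9) row roughly 15 % (14.56 W down to 12.38 W), matching the table.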

SLIDE 36

How to decide ALUs vs. clock frequency

There are no independent variables: clusters, ALUs, frequency, and voltage all trade off against one another. How do we find the right combination that meets real time at the lowest power?

P ∝ C V² f; with V ∝ f, this gives P ∝ f³

[Figure: design points meeting the same real-time target — (A) 1 cluster at 100 GHz, (B) 'c' clusters with 'a' adders and 'm' multipliers each at 'f' MHz, (C) 100 clusters at 10 MHz.]

SLIDE 37

Setting clusters, adders, multipliers

• If there is sufficient DP, frequency decreases linearly with the number of clusters: set the cluster count from the DP and an execution-time estimate
• To find the numbers of adders and multipliers, let the compiler schedule the algorithm workloads across different adder and multiplier counts and report the execution times
• Put all the numbers into the previous equation: compare the increase in capacitance due to added ALUs and clusters against the benefit in execution time
• Choose the solution that minimizes power

Details available in Sridhar’s thesis
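The selection step above can be sketched as a search over candidate configurations. The struct layout and any capacitance/frequency numbers a caller supplies are illustrative, and the P ∝ C·f³ objective assumes voltage is scaled linearly with frequency, as in the earlier slide.

```c
#include <assert.h>

/* Candidate machine configuration: relative switched capacitance
 * (more ALUs/clusters => larger C) paired with the clock frequency
 * the compiler's schedule says is needed for real time. */
typedef struct {
    double cap;       /* relative switched capacitance  */
    double freq_mhz;  /* frequency needed for real time */
} config;

/* Return the index of the configuration minimizing C * f^3. */
int min_power_config(const config *c, int n)
{
    int best = 0;
    for (int i = 1; i < n; ++i) {
        double p_i = c[i].cap * c[i].freq_mhz * c[i].freq_mhz
                              * c[i].freq_mhz;
        double p_b = c[best].cap * c[best].freq_mhz * c[best].freq_mhz
                                 * c[best].freq_mhz;
        if (p_i < p_b)
            best = i;
    }
    return best;
}
```

Because frequency enters cubically while capacitance enters linearly, a configuration with more ALUs usually wins whenever the added hardware buys a proportionate drop in the real-time frequency.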

SLIDE 38

Conclusions

• We propose a step-by-step methodology to design high-performance, power-efficient DSPs based on the TI C64x architecture
• Initial results show benefits in power/performance of greater than an order of magnitude over a conventional C64x
• We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
• We are interested in exploring opportunities with TI for the design and actual fabrication of a chip and the associated tool development
• We are interested in feedback: limitations that we have not accounted for, and unreasonable assumptions that we have made

Recommended reading:

• S. Rixner et al., 'A register organization for media processing', HPCA 2000
• B. Khailany et al., 'Exploring the VLSI scalability of stream processors', HPCA 2003
• U. J. Kapasi et al., 'Programmable Stream Processors', IEEE Computer, August 2003