Bespoke Processors for Applications with Ultra-low Area and Power Constraints




slide-1
SLIDE 1

Bespoke Processors for Applications with Ultra-low Area and Power Constraints

by Cherupalli et al. ISCA ’17

Jielun Tan, Tim Wesley

slide-2
SLIDE 2

Overview

  • Motivation
  • Intro to Bespoke
  • Benchmarks and Results
  • Discussion

slide-3
SLIDE 3

General Purpose CPUs in ULP

Ultra-low-power (ULP) applications (IoT, wearables, implantables) typically use small, general-purpose microprocessors

  • Amortized cost of development
  • Most capabilities of these processors are never used by the application
    ○ Unused gates still drain power and take up area

slide-4
SLIDE 4

What about ASICs and FPGAs?

  • Both are expensive to develop
  • ASICs
    ○ IPs required for different applications
    ○ Expensive at small scales
  • FPGAs
    ○ Often larger than needed, to accommodate programmability
    ○ May still use too much power

slide-5
SLIDE 5

Algorithm Usage Examples

slide-6
SLIDE 6

Bespoke Processors--Tuning Process

  • Bespoke processor design flow:
    ○ First, use traditional module-level removal
    ○ Next, use Input-Independent Gate Activity Analysis
    ○ Finally, cut-and-stitch the netlist to form the final design

slide-7
SLIDE 7

Input-Independent Gate Activity Analysis

1. Load the binary into memory
2. Set application inputs to Xs
3. After each cycle is simulated, the toggled gates are marked “keep”
4. If an X propagates to the PC, we have a possible branch
   a. Explore all possible branch paths, depth-first
   b. Remember the most conservative state (most Xs)
      i. Take the union of the branches' gates if the most conservative state is missing a few
   c. If a branch is re-encountered
      i. Skip the check if this state is a substate of the most conservative state
      ii. Otherwise, merge the lists of activated gates and make the result the new conservative state
5. Lists of all gates that are never toggled, along with their constant values, are passed to the cut-and-stitch function
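The core of steps 2, 3 and 5 can be sketched in a few lines of Python. This is a toy illustration, not the authors' tool. The three-gate `NETLIST`, the signal names, and the helpers `evaluate`, `activity_analysis` and `is_substate` are all hypothetical. It assumes a purely combinational netlist listed in topological order, and it treats a gate whose output is X as potentially toggling. Branch exploration (step 4) is reduced to the substate test alone.

```python
X = "X"  # unknown (input-dependent) logic value


def and3(a, b):
    """Three-valued AND: a controlling 0 wins even if the other input is X."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def or3(a, b):
    """Three-valued OR: a controlling 1 wins even if the other input is X."""
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return X


OPS = {"AND": and3, "OR": or3}

# Hypothetical toy netlist: gate -> (operation, input signals), topological order.
NETLIST = {
    "g1": ("AND", ("en", "data")),
    "g2": ("OR", ("g1", "mode")),
    "g3": ("AND", ("g2", "zero")),
}


def evaluate(netlist, inputs):
    """Compute every gate output for one cycle of primary-input values."""
    values = dict(inputs)
    for gate, (op, (a, b)) in netlist.items():
        values[gate] = OPS[op](values[a], values[b])
    return values


def activity_analysis(netlist, input_trace):
    """Return (gates to keep, constant values of never-toggled gates)."""
    keep, previous = set(), None
    for inputs in input_trace:
        current = evaluate(netlist, inputs)
        for gate in netlist:
            changed = previous is not None and current[gate] != previous[gate]
            # Assumption: an X output means the gate may toggle for some input.
            if current[gate] == X or changed:
                keep.add(gate)
        previous = current
    if previous is None:  # empty trace: nothing is known
        return keep, {}
    constants = {g: previous[g] for g in netlist if g not in keep}
    return keep, constants


def is_substate(state, conservative):
    """True if `state` is covered by `conservative` (X covers 0 and 1)."""
    return all(c == X or c == state[k] for k, c in conservative.items())
```

Driving the toy netlist with `data` set to X and the other inputs held constant keeps `g1` and `g2`, and reports `g3` as a constant 0 that can be handed to cut-and-stitch.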

slide-8
SLIDE 8

Cutting and Stitching

1. After X propagation, untoggled gates are removed from the netlist and replaced by a constant voltage
2. Rerun logic synthesis for further optimizations
   a. Typically, gates that have constant inputs can be reduced to even simpler logic
3. Place and route (this is not optimized any further)
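A minimal sketch of steps 1 and 2, assuming a toy netlist of two-input AND/OR gates in topological order. The real flow reruns a synthesis tool; the function `cut_and_stitch` and the gate names here are hypothetical and only show why constant inputs let downstream logic shrink.

```python
def cut_and_stitch(netlist, constants):
    """Cut never-toggled gates, then fold the constants into their fan-out.

    netlist:   gate -> (op, (input_a, input_b)), op is "AND" or "OR"
    constants: gate -> 0 or 1 for gates that never toggled
    Returns (stitched netlist, map of removed signals to constant or alias).
    """
    resolved = dict(constants)  # signal -> constant value or replacement signal
    stitched = {}

    def lookup(signal):
        # Follow aliases until we reach a constant or a surviving signal.
        while signal in resolved:
            signal = resolved[signal]
        return signal

    for gate, (op, inputs) in netlist.items():
        if gate in constants:
            continue  # cut: the gate is replaced by its constant value
        a, b = (lookup(s) for s in inputs)
        dominant, identity = (0, 1) if op == "AND" else (1, 0)
        if a == dominant or b == dominant:
            resolved[gate] = dominant  # output is forced to a constant
        elif a == identity:
            resolved[gate] = b  # gate degenerates to a wire
        elif b == identity:
            resolved[gate] = a
        else:
            stitched[gate] = (op, (a, b))
    return stitched, resolved


# Hypothetical example: g1 never toggled and sat at 0 during the analysis.
netlist = {
    "g1": ("AND", ("a", "b")),
    "g2": ("OR", ("g1", "c")),
    "g3": ("AND", ("g2", "d")),
    "g4": ("OR", ("g3", "g1")),
}
stitched, resolved = cut_and_stitch(netlist, {"g1": 0})
```

In this example only `g3` survives, rewired to read `c` directly, while `g2` and `g4` collapse into plain wires.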

slide-9
SLIDE 9

Input-independent Gate Activity Analysis Example

slide-10
SLIDE 10

Benchmarks

  • Baseline
    ○ openMSP430 with TSMC 65nm
    ○ Operating at 1 V and 100 MHz
    ○ Bare-metal simulation or FreeRTOS
    ○ Either completely general purpose, or traditionally optimized for an application by removing modules
  • Each benchmark is then run on a Bespoke processor optimized for that benchmark
    ○ All unused modules are removed
    ○ X propagation and cut-and-stitch are performed

slide-11
SLIDE 11

Used Gates per Benchmark

slide-12
SLIDE 12

Results

  • Reduction in gate count, area and power for a bespoke design vs. unmodified baseline
slide-13
SLIDE 13

Results

  • Reduction in gate count, area, and power for a bespoke design vs. the module-optimized baseline
slide-14
SLIDE 14

Results

slide-15
SLIDE 15

Multiple Programs

  • Multiple programs?
    ○ Run the bespoke tuning process on each and take the union of the results
  • Ceiling at 80%: the test suite does not activate all gates
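Taking the union can be sketched as plain set arithmetic. The program names, gate names, and the helpers `gates_for_all` and `unused_fraction` below are hypothetical. The sketch assumes each program's analysis has already produced its set of "keep" gates.

```python
def gates_for_all(per_program_gates):
    """Union of the gates each program needs; the design must keep all of them."""
    return set().union(*per_program_gates.values())


def unused_fraction(all_gates, needed):
    """Fraction of the original design's gates that can still be removed."""
    return (len(all_gates) - len(needed & all_gates)) / len(all_gates)


# Hypothetical analysis results for two programs on a five-gate design.
all_gates = {"g1", "g2", "g3", "g4", "g5"}
per_program = {
    "binsearch": {"g1", "g2"},
    "fft": {"g2", "g3"},
}
needed = gates_for_all(per_program)
```

Here the combined design needs `g1`, `g2` and `g3`, so two of the five gates remain removable. Each added program can only grow the union, which is why savings shrink as more programs are supported.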
slide-16
SLIDE 16

In-Field Updates

  • Bug fixes may need to be deployed, which may change the toggled gates
  • The Milu mutation testing tool is used to emulate the program changes that future updates might make
    ○ Type I: conditional operator changes (AND -> OR)
    ○ Type II: computation operator mutants (add -> multiply)
    ○ Type III: loop conditional operator mutants (less than -> less than or equal to)

slide-17
SLIDE 17

Coverage for In-Field Updates

  • Between 25% and 100% of mutants for each type are covered
  • 70% of all mutants of all types are covered
  • If mutants are significantly different, then they can be considered as independent programs
  • Overhead of between 1% and 40%
  • Total area reductions between 23% and 66%, total power reductions between 13% and 53%
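One way to read "covered" is that every gate a mutant needs is already present in the bespoke design. Under that assumption, coverage reduces to a subset test. The helper names and gate sets below are hypothetical and not taken from the paper's tooling.

```python
def is_covered(design_gates, mutant_gates):
    """A mutant runs on the design if it needs no gate that was removed."""
    return mutant_gates <= design_gates


def coverage(design_gates, mutants):
    """Fraction of mutants whose required gates all exist in the design."""
    mutants = list(mutants)
    if not mutants:
        return 0.0
    return sum(is_covered(design_gates, m) for m in mutants) / len(mutants)


# Hypothetical bespoke design and the gate sets four mutants would need.
design = {"g1", "g2", "g3"}
mutants = [{"g1"}, {"g1", "g3"}, {"g2", "g9"}, {"g1", "g2", "g3"}]
```

In this example three of the four mutants are covered. The third needs `g9`, which the design no longer contains, so supporting it would mean adding that gate back at some overhead.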
slide-18
SLIDE 18

Coverage for in-Field Updates cont.

  • An instruction that can be executed in one program is not necessarily executable in another program
    ○ A particular ADD instruction may only use 16 bits out of a 32 bit ALU
  • A tailored bespoke processor can support arbitrary software updates by supporting a Turing-complete instruction (e.g. subneg) or a set of them
    ○ A program written using a Turing-complete instruction can consist solely of that instruction
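To make the subneg point concrete, here is a small interpreter sketch. It assumes the usual definition of subneg a, b, c: subtract mem[a] from mem[b], and jump to c if the result is negative. The function `run_subneg`, the named memory cells, and keeping the program separate from data memory are simplifications chosen for illustration.

```python
def run_subneg(program, memory, max_steps=1000):
    """Run a list of (a, b, target) subneg instructions until the PC leaves it.

    Each step does memory[b] -= memory[a], then jumps to `target` if the
    result is negative, otherwise falls through to the next instruction.
    """
    pc, steps = 0, 0
    while 0 <= pc < len(program):
        if steps >= max_steps:
            raise RuntimeError("step limit reached; program may not terminate")
        a, b, target = program[pc]
        memory[b] -= memory[a]
        pc = target if memory[b] < 0 else pc + 1
        steps += 1
    return memory


# ADD built from subneg alone: B = B + A, using a scratch cell Z that starts at 0.
# Every jump target is simply the next instruction, so control flow is linear.
add_program = [
    ("A", "Z", 1),  # Z = 0 - A
    ("Z", "B", 2),  # B = B - (-A) = B + A
    ("Z", "Z", 3),  # Z = Z - Z = 0, restoring the scratch cell
]
result = run_subneg(add_program, {"A": 7, "B": 5, "Z": 0})
```

After the run, `B` holds 12 and `Z` is back to 0. A design that keeps the gates for this one instruction can execute any later update recompiled to it, though such code would likely be much slower than native instructions.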

slide-19
SLIDE 19

System Code

  • Application analysis of system code for FreeRTOS shows 57% of the gates are never used by the OS
  • When benchmarks are evaluated individually with FreeRTOS
    ○ 37% unused in the worst case
    ○ 49% unused on average
  • Running 15 benchmarks on top of FreeRTOS still shows 27% of gates unused
slide-20
SLIDE 20

Generality and Limitations

  • Hardware with non-deterministic behaviors needs additional techniques to be Bespoke tuned
    ○ Branch predictors
    ○ Caches
    ○ Speculative operations
    ○ Out-of-order cores
  • Xs need to be injected as the results of
    ○ ...branch predictions
    ○ ...tag checks
    ○ ...values where speculation may be used
  • Extending the X-prop process to explore data flow graphs may allow analysis of OoO cores to work
slide-21
SLIDE 21

Discussion Points

1. All of the examples they tested are just algorithms such as binary search or FFT. But actual applications, even in IoT and smaller, typically do more than just, e.g., binary search. Do Bespoke tuned processors have any value for real-world programs?
2. Is using Milu and adding mutations representative of what in-field updates would actually change?
3. Can the Bespoke tuning process be used for lowering power consumption of high-performance accelerators?
4. Is Bespoke tuning better or worse for certain cases than technologies such as HLS, Simulate-and-Eliminate, or just making an ASIC design?

slide-22
SLIDE 22

Related Works

  • High-Level Synthesis
    ○ Additional development costs
      ■ New high-level specs of application behavior need to be defined
      ■ The high-level spec also needs to be verified
    ○ C to ASICs is very difficult to do, especially to do efficiently
    ○ Unlikely to support multiple applications or in-field updates
  • Simulate-and-Eliminate
    ○ Simulates the target application with a user-provided set of inputs on multiple base designs
      ■ Requires significant user input
      ■ Only considers high-level, manually-identified components
      ■ Relies on user inputs to determine unused components--user may forget a test case!