Bespoke Processors for Applications with Ultra-low Area and Power Constraints
by Cherupalli et al. ISCA ‘17
Jielun Tan, Tim Wesley
Overview
○ Motivation
○ Intro to Bespoke
○ Benchmarks and Results
○ Discussion
Ultra-Low Power applications (IoT, wearables, implantables) typically use small, general purpose microprocessors
○ Unused gates still drain power and take up area
○ IPs required for different applications
○ Expensive at small scales
○ Often larger than needed, to accommodate programmability
○ May still use too much power
○ First use traditional module-level removal
○ Next use Input-Independent Gate Activity Analysis
○ Finally, cut-and-stitch the netlist to form the final design
1. Load binary into memory
2. Set application inputs to Xs
3. After each cycle is simulated, the toggled gates are marked “keep”
4. If an X propagates to the PC, we have a possible branch
   a. Explore all possible branch paths, depth-first
   b. Remember the most conservative state (most Xs)
      i. Take union of gates of branches if most conservative is missing a few
   c. If branch is re-encountered
      i. Skip check if this state is a substate of that most conservative state
      ii. Merge lists of activated gates and make the result the new conservative state
5. Lists of all gates that are never toggled, along with their constant values, are passed to the cut-and-stitch function
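Steps 2, 3, and 5 can be sketched with three-valued (0/1/X) logic simulation. This is a minimal illustration, not the authors' tool: the four-gate netlist, the net names, and the `analyze` helper are all hypothetical, the circuit is purely combinational, and the branch exploration of step 4 is left out. A gate counts as possibly toggled if its output is X or differs from the previous cycle.

```python
X = "X"  # unknown value, used for application inputs

def and3(a, b):
    if a == 0 or b == 0:
        return 0          # a controlling 0 hides any X on the other input
    if a == 1 and b == 1:
        return 1
    return X

def or3(a, b):
    if a == 1 or b == 1:
        return 1          # a controlling 1 hides any X on the other input
    if a == 0 and b == 0:
        return 0
    return X

def not3(a):
    return X if a == X else 1 - a

OPS = {"AND": and3, "OR": or3, "NOT": not3}

# Hypothetical netlist in topological order: gate -> (op, input nets)
NETLIST = {
    "g1": ("AND", ("in_a", "mode")),
    "g2": ("NOT", ("mode",)),
    "g3": ("OR", ("g1", "in_b")),
    "g4": ("AND", ("g2", "in_b")),
}

def analyze(netlist, cycles):
    """Return (possibly toggled gates, constant value of every other gate)."""
    last = {}
    toggled = set()
    for inputs in cycles:
        values = dict(inputs)
        for gate, (op, ins) in netlist.items():
            out = OPS[op](*(values[i] for i in ins))
            values[gate] = out
            # Conservative rule: an X output or a changed output means "keep".
            if out == X or (gate in last and last[gate] != out):
                toggled.add(gate)
            last[gate] = out
    constants = {g: last[g] for g in netlist if g not in toggled}
    return toggled, constants

# The application fixes "mode" to 1; its data inputs are unknown (X).
cycles = [{"in_a": X, "in_b": X, "mode": 1} for _ in range(3)]
toggled, constants = analyze(NETLIST, cycles)
```

Here `g2` and `g4` never toggle regardless of the data inputs, so they would be handed to cut-and-stitch together with their constant value 0.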
1. After X propagation, untoggled gates are removed from the netlist and replaced by a constant voltage
2. Rerun logical synthesis for further optimizations
   a. Typically, gates that have constant inputs can be reduced to even simpler logic
3. Place and route (this is not any further optimized)
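Steps 1 and 2 amount to tying off the untoggled gates and then folding constants through the remaining logic. The sketch below shows the idea on a hypothetical five-gate netlist; `cut_and_stitch` and the net names are illustrative assumptions, and a real flow would hand this work to a synthesis tool rather than a script.

```python
def cut_and_stitch(netlist, constants):
    """Tie off constant gates, then simplify gates that see constant inputs.

    netlist:   gate -> (op, inputs) in topological order; op is AND, OR, or NOT
    constants: gate -> 0 or 1 for gates that never toggled
    Returns (kept gates, known constant nets, gates reduced to plain wires).
    """
    known = dict(constants)  # net -> constant value
    alias = {}               # gate -> net it is equivalent to
    kept = {}
    for gate, (op, ins) in netlist.items():
        if gate in known:
            continue  # cut: the gate is replaced by a constant tie-off
        ins = [alias.get(i, i) for i in ins]
        consts = [known[i] for i in ins if i in known]
        live = [i for i in ins if i not in known]
        if op == "NOT":
            if consts:
                known[gate] = 1 - consts[0]
            else:
                kept[gate] = (op, tuple(live))
            continue
        controlling = 0 if op == "AND" else 1
        if controlling in consts:
            known[gate] = controlling        # output is forced
        elif not live:
            known[gate] = 1 - controlling    # all inputs constant, none controlling
        elif len(live) == 1:
            alias[gate] = live[0]            # gate degenerates to a wire
        else:
            kept[gate] = (op, tuple(live))
    return kept, known, alias

# Hypothetical netlist; g2 and g4 were reported as never toggling.
NETLIST = {
    "g1": ("AND", ("in_a", "in_b")),
    "g2": ("NOT", ("mode",)),
    "g4": ("AND", ("g2", "in_b")),
    "g5": ("OR", ("g4", "g1")),
    "g6": ("AND", ("g5", "in_c")),
}
kept, known, alias = cut_and_stitch(NETLIST, {"g2": 0, "g4": 0})
```

In this example `g5` has one input stuck at 0, so it collapses to a wire from `g1`, and `g6` is rewired to read `g1` directly. Five gates become two.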
○ Operating at 1 V and 100 MHz
○ Bare metal simulation or FreeRTOS
○ Either completely general purpose, or traditionally optimized for an application by removing modules
○ All unused modules are removed
○ X propagation and cut-and-stitch are performed
○ Run bespoke tuning process on each and take the union of the results
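Taking the union can be pictured as a set operation over the per-application "keep" lists. The gate names and the two application labels below are made up for illustration; only gates that no application ever toggles remain removable.

```python
# Hypothetical results of running the tuning process on two applications:
# each entry is the set of gates that application may toggle.
keep_per_app = {
    "binsearch": {"g1", "g3"},
    "fft": {"g1", "g4"},
}
all_gates = {"g1", "g2", "g3", "g4", "g5"}

# A gate must stay if any supported application can toggle it.
keep = set().union(*keep_per_app.values())

# Only gates unused by every application can be cut.
removable = all_gates - keep
```

The more applications a single design supports, the larger the union grows and the smaller the savings.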
○ Type I: conditional operator changes (AND -> OR)
○ Type II: computation operator mutants (add -> multiply)
○ Type III: loop conditional operator mutants (less than -> less than or equal to)
program
○ A particular ADD instruction may only use 16 bits of a 32-bit ALU
complete instruction (e.g. subneg) or a set of them
○ A program written with a Turing-complete instruction can consist solely of that instruction
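To make the subneg point concrete, here is a small interpreter sketch. It assumes one common operand convention, `subneg a, b, c`: compute `mem[b] -= mem[a]` and jump to `c` if the result is negative, otherwise fall through. The memory layout, the `run_subneg` helper, and the step limit are illustrative choices, not something taken from the paper.

```python
def run_subneg(program, mem, max_steps=1000):
    """Execute (a, b, c) triples until the PC leaves the program."""
    pc = 0
    steps = 0
    while 0 <= pc < len(program) and steps < max_steps:
        a, b, c = program[pc]
        mem[b] -= mem[a]
        pc = c if mem[b] < 0 else pc + 1
        steps += 1
    return mem

# Data memory addresses (hypothetical layout)
A, B, Z = 0, 1, 2
mem = [7, 5, 0]  # mem[A] = 7, mem[B] = 5, mem[Z] = scratch zero

# B += A, written only with subneg; every jump target is the next instruction.
program = [
    (A, Z, 1),  # Z = 0 - A
    (Z, B, 2),  # B = B - (-A)
    (Z, Z, 3),  # Z = 0 again, restoring the scratch cell
]
run_subneg(program, mem)
```

After the run, `mem[B]` holds 12. An addition needs three subneg instructions here, which hints at the code-size cost of restricting a program to a single instruction.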
OS
○ 37% unused in the worst case
○ 49% unused on average
○ Branch predictors
○ Caches
○ Speculative operations
○ Out-of-order cores
○ ...branch predictions
○ ...tag checks
○ ...values where speculation may be used
1. All of the examples they tested are just algorithms such as binary search or FFT. But actual applications, even in IoT and smaller, typically do more than just, e.g., binary search. Do Bespoke-tuned processors have any value for real-world programs?
2. Is using Milu and adding mutations representative of what in-field updates would actually change?
3. Can the Bespoke tuning process be used for lowering power consumption of high-performance accelerators?
4. Is Bespoke tuning better or worse for certain cases than technologies such as HLS, Simulate-and-Eliminate, or just making an ASIC design?
○ Additional development costs
  ■ New high-level specs of application behavior need to be defined
  ■ The high-level spec also needs to be verified
○ C to ASICs is very difficult to do, especially to do efficiently
○ Unlikely to support multiple applications or in-field updates
○ Simulates the target application with a user-provided set of inputs on multiple base designs
  ■ Requires significant user input
  ■ Only considers high-level, manually identified components
  ■ Relies on user inputs to determine unused components; the user may forget a test case!