SLIDE 1

HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation

Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3, Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6

1 University of Edinburgh, UK · 2 Intel · 3 11pets · 4 Pharmacelera · 5 ARM · 6 Universitat Politècnica de Catalunya, Spain

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Santa Rosa, California, USA - April 24-25, 2017

SLIDE 2

Outline

  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO
  • Evaluation
  • Conclusions

SLIDE 3

HW/SW co-designed processors

[Figure: a conventional processor exposes a single ISA between the software stack (application programs, libraries, operating system) and the execution hardware; a HW/SW co-designed processor inserts a software Translation Optimization Layer between the guest ISA seen by software and the host ISA of the execution hardware, targeting both energy efficiency and performance]

  • Simple Host ISA

– In-order cores; move complexity to software layer

  • Dynamic Binary Optimizations in software (TOL)

– Aggressive and speculative – Exploit application behavior at runtime
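Why pushing complexity into a software layer is viable: a single guest CISC instruction often expands into a few simple host RISC operations that the TOL can then optimize. Below is an illustrative sketch; the read-modify-write x86 add and the host opcodes are hypothetical, not DARCO's real ISAs.

```python
def translate_rmw_add(mem_base: str, offset: int, src_reg: str) -> list:
    """Expand `add [mem_base+offset], src_reg` into load/add/store host ops."""
    return [
        f"load  t0, [{mem_base}+{offset}]",   # read the guest memory operand
        f"add   t0, t0, {src_reg}",           # perform the ALU operation
        f"store t0, [{mem_base}+{offset}]",   # write the result back
    ]

host_ops = translate_rmw_add("ebx", 8, "eax")
# one x86 instruction expands into three host instructions
```

The 3:1 expansion in this toy case is in the same ballpark as the roughly three host instructions per x86 instruction reported later in the evaluation.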

SLIDE 4
  • IBM DAISY (1997)

– Targets binary compatibility from PowerPC to VLIW architectures

  • IBM BOA (1999)

– Targets high frequency PowerPC through simple hardware design

  • Transmeta Crusoe (2000) and Efficeon (2003)

– Execute x86 binaries on proprietary VLIW with low power consumption – Better energy efficiency than Intel Pentium III

  • Nvidia Denver (2014)

– Executes ARMv8 binaries on a proprietary in-order core – With dynamic optimizations applied, performance matches out-of-order Intel Haswell

HW/SW co-designed processors: History

[Timeline: DAISY 1997 · BOA 1999 · Crusoe 2000 · Efficeon 2003 · Denver 2014 · ???]

SLIDE 5

Anything missing? No major academic project! Can lack of simulation infrastructure be the reason?

SLIDE 6

Outline

  • Introduction
  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO
  • Evaluation
  • Conclusions

SLIDE 7

What will a simulation infrastructure enable?

  • Where to implement (HW or SW?) microarchitectural features like

– Instruction decoding/reordering, memory disambiguation, register renaming, …


SLIDE 8

What will a simulation infrastructure enable?

  • Where to implement (HW or SW?) microarchitectural features like

– Instruction decoding/reordering, memory disambiguation, register renaming, …

  • How to reduce “startup delay”

– One of the major problems of Transmeta products

  • When and where to translate/optimize the guest binaries

– As soon as code becomes “hot”? Wait for a core to become idle? …

  • How to address speculative execution (memory, control)

– Checkpointing granularity? Finding points susceptible to speculation failure?

  • When and how to profile the execution

– Overhead vs opportunity for improvement

SLIDE 9


A simulation infrastructure can help evaluate trade-offs and design choices

SLIDE 10

Simulation infrastructure: Complexity


  • Compilation framework

– Code analysis/translation – Optimizations – Code generation

  • Runtime system

– Profiling and instrumentation – Profile-guided optimizations

  • Microarchitectural simulator

– Model components like pipeline, caches, … – Allow sampling

SLIDE 11


Simulation infrastructure complexity = Compilation framework + Runtime system + Microarchitectural simulator

SLIDE 12
Simulation infrastructure: Requirements

  • Correctness

– It should not change program behavior

  • Minimum software layer (TOL) overhead

– TOL execution time must be small

  • Minimum emulation cost

– The host-to-guest instruction ratio must be low

  • Support for multiple guest ISAs (front-ends)

– Enables wider applicability

  • Plug and play support

– Easy to include/evaluate new features

  • Debugging

– Strong debug toolchain

[Figure: the TOL sits on top of the hardware, with multiple guest ISA front-ends – x86, ARM, Power]

SLIDE 13

Outline

  • Introduction
  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO: Infrastructure for Research on HW/SW Co-designed Processors
  • Evaluation
  • Conclusions

SLIDE 14

DARCO: The big picture

[Figure: DARCO overview. The x86 component (full-system x86 functional emulator plus x86 OS) holds the authoritative x86 register and memory state; a Process Tracker filters its instruction stream. The co-designed component (Translation Optimization Layer plus RISC functional emulator) holds the emulated x86 register and memory state. The Controller connects the two over command, data, and instruction paths; a State Checker compares the two states; a Timing Simulator models the host core]

  • Models a processor that executes x86 code on a RISC host architecture
  • Four main components: Co-designed, x86, Timing Simulator, Controller
SLIDE 15

DARCO: Co-designed Component

  • Models the functionality of a HW/SW co-designed processor
  • Composed of TOL and host ISA functional emulator (user code)
  • Maintains emulated x86 architectural and memory states


SLIDE 16

DARCO: x86 Component

  • Provides a full-system functional emulator for the guest x86 ISA
  • Maintains authoritative x86 architectural and memory states
  • Filters instruction stream and passes user code to co-designed component


SLIDE 17

DARCO: Timing Simulator

  • Models a parameterized in-order core
  • Can distinguish application and TOL code
  • Includes power and energy modelling (McPAT)


SLIDE 18

DARCO: Controller

  • Provides full control over the app execution and debugging utilities
  • Compares authoritative and emulated x86 states to ensure correctness
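The state comparison can be sketched as follows; the dictionary-based state and the register names are illustrative assumptions, not DARCO's actual representation.

```python
def check_states(authoritative: dict, emulated: dict) -> list:
    """Return the names of registers whose values diverge (empty = correct)."""
    return [reg for reg, val in authoritative.items()
            if emulated.get(reg) != val]

auth    = {"eax": 42, "ebx": 7, "eip": 0x8048000}
emu_ok  = dict(auth)             # co-designed component agrees with x86 component
emu_bad = dict(auth, ebx=8)      # a divergence the checker should catch
```

An empty mismatch list means the co-designed component reproduced the authoritative execution exactly; a non-empty one pinpoints where emulation diverged, which is what makes the debug toolchain effective.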


SLIDE 19

DARCO: Starting execution

  • User: execute application XYZ
  • Controller: sends the corresponding command to the x86 component
  • x86 OS: starts application XYZ
  • Tracker: identifies application XYZ and passes its user-level code to the co-designed component
  • Controller: requests the first code page from the x86 component and sends it to the TOL along with the initial state
  • TOL: loads the code page and starts emulating

SLIDE 20

DARCO: Handling system calls

  • TOL: the execution sequence reaches a system call
  • TOL: flushes its state and sends a request to the x86 component
  • x86 component: reaches the same execution point and makes the system call
  • Controller: sends the new state to the TOL
  • TOL: continues emulation

The system call is faked: it is NOT emulated in the co-designed component.
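A minimal sketch of that handshake, with all class and method names as illustrative assumptions:

```python
class X86Component:
    """Stand-in for the full-system emulator that actually runs the call."""
    def run_syscall(self, state: dict) -> dict:
        new_state = dict(state)
        new_state["eax"] = 0      # pretend the kernel returned success in eax
        return new_state

def handle_syscall(tol_state: dict, x86: X86Component) -> dict:
    flushed = dict(tol_state)               # 1. TOL flushes its emulated state
    new_state = x86.run_syscall(flushed)    # 2. x86 component makes the real call
    return new_state                        # 3. Controller returns it; TOL resumes

resumed = handle_syscall({"eax": 4, "ebx": 1, "eip": 0x8048100}, X86Component())
```

The key property: the co-designed component needs no OS or device models of its own; it only resynchronizes its emulated state with the authoritative one around each call.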

SLIDE 21


DARCO: Translation Optimization Layer (TOL)

SLIDE 22
DARCO: TOL execution modes

  • Interpretation (IM): cold code

– x86 instructions are interpreted sequentially – Profiles the execution frequency of basic blocks

  • Basic block translation (BBM): warm code

– x86 basic blocks are translated to an intermediate representation – Lightweight optimizations (dead code, constants) – Profiles branch directions and the execution frequency of basic blocks

  • Superblock optimization (SBM): hot code

– Bigger optimization regions across multiple x86 basic blocks – Aggressive and speculative optimizations – No profiling (reduces overhead)

SLIDE 23

DARCO: TOL execution flow

[Flowchart: for each x86 eip, check the code cache ($). On a hit, execute from the code cache; chained translations keep execution there. On a miss, interpret; once a basic block's execution counter exceeds BBth, translate it, store it in the code cache, and chain it. When a translated block's counter exceeds SBth and it is not already optimized, create a superblock and optimize it.]
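The flow condenses into a small dispatch loop. A minimal sketch with toy thresholds (BBth = 2, SBth = 3; DARCO's real values differ):

```python
BB_TH, SB_TH = 2, 3     # promotion thresholds (toy values, not DARCO's)

code_cache = {}         # eip -> "bb" (translated) or "sb" (superblock)
exec_count = {}         # eip -> execution counter (the profiling data)

def step(eip):
    """Execute the block at eip once, promoting it across IM -> BBM -> SBM."""
    exec_count[eip] = exec_count.get(eip, 0) + 1
    mode = code_cache.get(eip)
    if mode is None:                        # not in code $: interpret (IM)
        if exec_count[eip] > BB_TH:         # block got warm: translate (BBM)
            code_cache[eip] = "bb"
            return "translate + execute"
        return "interpret"
    if mode == "bb" and exec_count[eip] > SB_TH:
        code_cache[eip] = "sb"              # block got hot: optimize (SBM)
        return "create/optimize SB + execute"
    return "execute from code $"

history = [step(0x1000) for _ in range(6)]
```

With these toy thresholds, a repeatedly executed block is interpreted twice, translated on its third execution, and promoted to a superblock on its fourth.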

SLIDE 24

DARCO: TOL execution flow

Superblock optimization (SBM) applies the following passes:

  • x86 to Intermediate Representation (IR)
  • Control Speculation
  • Loop Unrolling
  • Static Single Assignment (SSA)
  • Forward Pass

– Constant Folding – Constant/Copy Propagation – Common Subexpression Elimination

  • Backward Pass

– Dead Code Elimination

  • Data Dependence Graph (DDG)

– Memory Alias Analysis – Redundant Load Removal – Store Forwarding

  • Instruction Scheduling

– Data Speculation

  • Register Allocation
  • Code Generation (optimized IR to Host ISA)
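To make a couple of these passes concrete, here is a toy forward pass (constant folding and constant propagation) and backward pass (dead code elimination) over a hypothetical three-address IR; DARCO's real IR and passes are far more elaborate.

```python
def forward_pass(ir):
    """Constant folding + constant propagation over (dst, op, a, b) tuples."""
    consts, out = {}, []
    for dst, op, a, b in ir:
        a = consts.get(a, a)                      # propagate known constants
        b = consts.get(b, b)
        if op == "mov" and isinstance(a, int):
            consts[dst] = a
        elif op == "add" and isinstance(a, int) and isinstance(b, int):
            consts[dst] = a + b                   # fold the addition
            out.append((dst, "mov", a + b, None))
            continue
        out.append((dst, op, a, b))
    return out

def backward_pass(ir, live):
    """Dead code elimination: drop writes to values that are never used."""
    out = []
    for dst, op, a, b in reversed(ir):
        if dst in live:
            live.discard(dst)
            live.update(x for x in (a, b) if isinstance(x, str))
            out.append((dst, op, a, b))
    return list(reversed(out))

ir = [("x", "mov", 1, None), ("y", "mov", 2, None),
      ("z", "add", "x", "y"), ("w", "add", "z", "x")]
optimized = backward_pass(forward_pass(ir), live={"w"})
```

Four input instructions collapse to a single `mov` of the folded constant once the dead definitions are eliminated.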

SLIDE 25

DARCO: Building a superblock

[Figure: control-flow graph with basic blocks A–E. Branch biases: A goes to B 5% / C 95%; C goes to D 98% / elsewhere 2%; D's branch splits 70% / 30%. Once the execution counter exceeds SBth, the biased branches in A and C are converted into asserts and A, C, D form a superblock; D's unbiased branch remains a branch.]
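The superblock construction above can be sketched as a walk along the biased direction of each branch; the 90% bias cutoff and the CFG encoding are assumptions for illustration.

```python
BIAS_TH = 0.90   # hypothetical bias threshold for speculating a direction

def build_superblock(cfg, start):
    """cfg maps block -> (most likely successor, probability of taking it)."""
    blocks, asserts, cur = [], [], start
    while cur is not None:
        blocks.append(cur)
        nxt, prob = cfg.get(cur, (None, 0.0))
        if nxt is None or prob < BIAS_TH:
            break                    # unbiased branch (or region exit): stop
        asserts.append(cur)          # biased branch is speculated: emit assert
        cur = nxt
    return blocks, asserts

# Branch biases from the figure: A->C 95%, C->D 98%, D is 70/30 (unbiased)
cfg = {"A": ("C", 0.95), "C": ("D", 0.98), "D": ("E", 0.70)}
blocks, asserts = build_superblock(cfg, "A")
```

As in the figure, the walk collects A, C, and D, converts the two highly biased branches into asserts, and stops at D's unbiased 70/30 branch.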

SLIDE 26

DARCO: Building a superblock

Impact: bigger optimization regions

SLIDE 27

DARCO: Timing Simulator


SLIDE 28

DARCO: Timing Simulator

  • Models a configurable RISC superscalar in-order processor
  • Pipeline decoupled into: Front-End, Instruction Queue, Back-End

[Figure: pipeline with a Front End (AC, IF, DEC), an Instruction Queue, and a Back End (ISSUE, RR, EXE, WB); branch predictor (BP) and BTB; L1 and L2 TLBs; L1-I$, L1-D$, and L2$ backed by Main Memory with a Prefetcher]
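A deliberately tiny caricature of that decoupled structure: a front-end that feeds an instruction queue, and an in-order, single-issue back-end that drains it. Per-instruction latencies and single-wide stages are simplifying assumptions, nothing like the simulator's configurable model.

```python
from collections import deque

def simulate(latencies, queue_size=4):
    """Count cycles to run a program given each instruction's execute latency."""
    iq = deque()                 # instruction queue between front and back end
    fetched = issued = cycle = 0
    busy_until = 0               # back-end is free again at this cycle
    while issued < len(latencies):
        cycle += 1
        # back end: in-order single issue once the previous op has drained
        if iq and cycle >= busy_until:
            busy_until = cycle + iq.popleft()
            issued += 1
        # front end: fetch/decode one instruction per cycle into the queue
        if fetched < len(latencies) and len(iq) < queue_size:
            iq.append(latencies[fetched])
            fetched += 1
    return cycle
```

Even this toy shows the decoupling at work: the front-end keeps filling the queue while the back-end is stalled on a long-latency instruction.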

SLIDE 29

DARCO: Meeting the requirements

  • Correctness

– Architectural/memory states compared periodically

  • Minimum TOL overhead

– Three-stage translation/optimization (IM, BBM, SBM); chaining of translated blocks

  • Minimum emulation cost

– Aggressive/speculative optimizations

  • Support for multiple guest ISAs (front-ends)

– Incorporating additional front-ends is straightforward

  • Plug and play support

– Modular design

  • Debugging

– Powerful debug toolchain

[Figure: the DARCO components and TOL execution flow shown earlier, annotated with the requirement each one addresses; additional guest front-ends (ARM? Power?) plug into the TOL alongside x86]

SLIDE 30

Outline

  • Introduction
  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO
  • Evaluation
  • Conclusions

SLIDE 31

Evaluation methodology

  • Benchmarks analyzed

– SPEC CPU2006: Integer, Floating point – Physicsbench

  • We simulate 4 billion x86 instructions

– Some benchmarks have fewer instructions and run to completion

  • DARCO speed (x86 instructions)

– 3.4 MIPS functional-only, and ~0.4 MIPS with the Timing Simulator enabled – Host machines: Intel Xeon E5-2630L (2.40 GHz) and dual L5630 (2.13 GHz) with 24–128 GB RAM
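For a sense of scale (my arithmetic, from the speeds quoted above): a 4-billion-instruction run takes roughly 20 minutes functionally, but close to 3 hours with timing enabled.

```python
instructions = 4e9        # simulated x86 instructions (from this slide)
functional_ips = 3.4e6    # x86 instructions/second, functional emulation only
timing_ips = 0.4e6        # with the timing simulator enabled

functional_hours = instructions / functional_ips / 3600   # ~0.33 h (~20 min)
timing_hours = instructions / timing_ips / 3600           # ~2.78 h
```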

  • Highlights

– x86 dynamic code distribution – Execution time distribution – Emulation cost

SLIDE 32

Evaluation: x86 dynamic code distribution

[Chart: percentage of x86 dynamic instructions executed in IM, BBM, and SBM for each SPECINT2006, SPECFP2006, and Physicsbench benchmark, plus per-suite averages]

  • ~90% of the dynamic instruction stream comes from SBM (optimized code)
  • This hot code represents only ~14% of the static code (not shown)
  • Benchmarks below the 90% mark have a lower repetition factor in their superblocks

SLIDE 33

Evaluation: Execution time distribution

  • Application: cycles executing RISC instructions that emulate the x86 code
  • TOL (overhead): cycles spent translating/optimizing the guest code, profiling, etc.

[Chart: percentage of execution time spent in the TOL vs. the application for each benchmark; TOL overhead is around 20% on average]

SLIDE 34

Evaluation: Emulation cost

[Chart: host instructions generated per x86 instruction (scale 1–6) for each benchmark; the average is ~3.2]

  • ~3.2 host instructions are generated per x86 instruction on average
  • Reasonable for translating from CISC to RISC

SLIDE 35

Conclusions

  • HW/SW co-designed processors

– Huge potential to improve energy efficiency and performance – Several industrial projects – No major project in academia (lack of simulation infrastructures)

  • Challenges

– To become mainstream (e.g. startup delay) – In building a simulation infrastructure (e.g. software layer overhead)

  • DARCO

– May enable academic research in the HW/SW co-designed domain – Provides a modular simulation infrastructure – Easy to add new components/optimizations

SLIDE 36

HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation

Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3, Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6

1 University of Edinburgh, UK · 2 Intel · 3 11pets · 4 Pharmacelera · 5 ARM · 6 Universitat Politècnica de Catalunya, Spain

THANK YOU !!!