SLIDE 1

HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation

Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3, Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6

1 University of Edinburgh, UK · 2 Intel · 3 11pets · 4 Pharmacelera · 5 ARM · 6 Universitat Politècnica de Catalunya, Spain

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Santa Rosa, California, USA - April 24-25, 2017

SLIDE 2

Outline

  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO
  • Evaluation
  • Conclusions

SLIDE 3

HW/SW co-designed processors

[Figure: a conventional processor exposes a single ISA between the software stack (application programs, libraries, operating system) and the execution hardware; a HW/SW co-designed processor inserts a software Translation Optimization Layer between the guest ISA seen by software and the host ISA of the execution hardware, targeting both energy efficiency and performance]

  • Simple Host ISA

– In-order cores; move complexity to software layer

  • Dynamic Binary Optimizations in software (TOL)

– Aggressive and speculative – Exploit application behavior at runtime
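Why pushing complexity into a software layer is viable: a single guest CISC instruction often expands into a few simple host RISC operations that the TOL can then optimize. Below is an illustrative sketch; the read-modify-write x86 add and the host opcodes are hypothetical, not DARCO's real ISAs.

```python
def translate_rmw_add(mem_base: str, offset: int, src_reg: str) -> list:
    """Expand `add [mem_base+offset], src_reg` into load/add/store host ops."""
    return [
        f"load  t0, [{mem_base}+{offset}]",   # read the guest memory operand
        f"add   t0, t0, {src_reg}",           # perform the ALU operation
        f"store t0, [{mem_base}+{offset}]",   # write the result back
    ]

host_ops = translate_rmw_add("ebx", 8, "eax")
# one x86 instruction expands into three host instructions
```

The 3:1 expansion in this toy case is in the same ballpark as the roughly three host instructions per x86 instruction reported later in the evaluation.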

SLIDE 4
  • IBM DAISY (1997)

– Targets binary compatibility from PowerPC to VLIW architectures

  • IBM BOA (1999)

– Targets high frequency PowerPC through simple hardware design

  • Transmeta Crusoe (2000) and Efficeon (2003)

– Execute x86 binaries on proprietary VLIW with low power consumption – Better energy efficiency than Intel Pentium III

  • Nvidia Denver (2014)

– Executes ARMv8 binaries on a proprietary in-order core – With dynamic optimizations applied, performance matches out-of-order Intel Haswell

HW/SW co-designed processors: History

[Timeline: DAISY 1997 · BOA 1999 · Crusoe 2000 · Efficeon 2003 · Denver 2014 · ???]

SLIDE 5

Anything missing? No major academic project! Can lack of simulation infrastructure be the reason?

SLIDE 6

Outline

  • Introduction
  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO
  • Evaluation
  • Conclusions

SLIDE 7

What will a simulation infrastructure enable?

  • Where to implement (HW or SW?) microarchitectural features like

– Instruction decoding/reordering, memory disambiguation, register renaming, …


SLIDE 8

What will a simulation infrastructure enable?

  • Where to implement (HW or SW?) microarchitectural features like

– Instruction decoding/reordering, memory disambiguation, register renaming, …

  • How to reduce “startup delay”

– One of the major problems of Transmeta products

  • When and where to translate/optimize the guest binaries

– As soon as code becomes “hot”? Wait for a core to become idle? …

  • How to address speculative execution (memory, control)

– Checkpointing granularity? Finding points susceptible to speculation failure?

  • When and how to profile the execution

– Overhead vs opportunity for improvement

SLIDE 9


A simulation infrastructure can help evaluate trade-offs and design choices

SLIDE 10

Simulation infrastructure: Complexity


  • Compilation framework

– Code analysis/translation – Optimizations – Code generation

  • Runtime system

– Profiling and instrumentation – Profile-guided optimizations

  • Microarchitectural simulator

– Model components like pipeline, caches, … – Allow sampling

SLIDE 11


Simulation infrastructure complexity = Compilation framework + Runtime system + Microarchitectural simulator

SLIDE 12
Simulation infrastructure: Requirements

  • Correctness

– It should not change program behavior

  • Minimum software layer (TOL) overhead

– TOL execution time must be small

  • Minimum emulation cost

– The host-to-guest instruction ratio must be low

  • Support for multiple guest ISAs (front-ends)

– Enables wider applicability

  • Plug and play support

– Easy to include/evaluate new features

  • Debugging

– Strong debug toolchain

[Figure: the TOL sits on top of the hardware, with multiple guest ISA front-ends – x86, ARM, Power]

SLIDE 13

Outline

  • Introduction
  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO: Infrastructure for Research on HW/SW Co-designed Processors
  • Evaluation
  • Conclusions

SLIDE 14

DARCO: The big picture

[Figure: DARCO overview. The x86 component (full-system x86 functional emulator plus x86 OS) holds the authoritative x86 register and memory state; a Process Tracker filters its instruction stream. The co-designed component (Translation Optimization Layer plus RISC functional emulator) holds the emulated x86 register and memory state. The Controller connects the two over command, data, and instruction paths; a State Checker compares the two states; a Timing Simulator models the host core]

  • Models a processor that executes x86 code on a RISC host architecture
  • Four main components: Co-designed, x86, Timing Simulator, Controller
SLIDE 15

DARCO: Co-designed Component

  • Models the functionality of a HW/SW co-designed processor
  • Composed of TOL and host ISA functional emulator (user code)
  • Maintains emulated x86 architectural and memory states


SLIDE 16

DARCO: x86 Component

  • Provides a full-system functional emulator for the guest x86 ISA
  • Maintains authoritative x86 architectural and memory states
  • Filters instruction stream and passes user code to co-designed component


SLIDE 17

DARCO: Timing Simulator

  • Models a parameterized in-order core
  • Can distinguish application and TOL code
  • Includes power and energy modelling (McPAT)


SLIDE 18

DARCO: Controller

  • Provides full control over the app execution and debugging utilities
  • Compares authoritative and emulated x86 states to ensure correctness
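The state comparison can be sketched as follows; the dictionary-based state and the register names are illustrative assumptions, not DARCO's actual representation.

```python
def check_states(authoritative: dict, emulated: dict) -> list:
    """Return the names of registers whose values diverge (empty = correct)."""
    return [reg for reg, val in authoritative.items()
            if emulated.get(reg) != val]

auth    = {"eax": 42, "ebx": 7, "eip": 0x8048000}
emu_ok  = dict(auth)             # co-designed component agrees with x86 component
emu_bad = dict(auth, ebx=8)      # a divergence the checker should catch
```

An empty mismatch list means the co-designed component reproduced the authoritative execution exactly; a non-empty one pinpoints where emulation diverged, which is what makes the debug toolchain effective.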


SLIDE 19

DARCO: Starting execution

  • User: execute application XYZ
  • Controller: sends the corresponding command to the x86 component
  • x86 OS: starts application XYZ
  • Tracker: identifies application XYZ and passes its user-level code to the co-designed component
  • Controller: requests the first code page from the x86 component and sends it to the TOL along with the initial state
  • TOL: loads the code page and starts emulating

SLIDE 20

DARCO: Handling system calls

  • TOL: the execution sequence reaches a system call
  • TOL: flushes its state and sends a request to the x86 component
  • x86 component: reaches the same execution point and makes the system call
  • Controller: sends the new state to the TOL
  • TOL: continues emulation

The system call is faked: it is NOT emulated in the co-designed component.
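A minimal sketch of that handshake, with all class and method names as illustrative assumptions:

```python
class X86Component:
    """Stand-in for the full-system emulator that actually runs the call."""
    def run_syscall(self, state: dict) -> dict:
        new_state = dict(state)
        new_state["eax"] = 0      # pretend the kernel returned success in eax
        return new_state

def handle_syscall(tol_state: dict, x86: X86Component) -> dict:
    flushed = dict(tol_state)               # 1. TOL flushes its emulated state
    new_state = x86.run_syscall(flushed)    # 2. x86 component makes the real call
    return new_state                        # 3. Controller returns it; TOL resumes

resumed = handle_syscall({"eax": 4, "ebx": 1, "eip": 0x8048100}, X86Component())
```

The key property: the co-designed component needs no OS or device models of its own; it only resynchronizes its emulated state with the authoritative one around each call.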

SLIDE 21


DARCO: Translation Optimization Layer (TOL)

SLIDE 22
DARCO: TOL execution modes

  • Interpretation (IM): cold code

– x86 instructions are interpreted sequentially – Profiles the execution frequency of basic blocks

  • Basic block translation (BBM): warm code

– x86 basic blocks are translated to an intermediate representation – Lightweight optimizations (dead code, constants) – Profiles branch directions and the execution frequency of basic blocks

  • Superblock optimization (SBM): hot code

– Bigger optimization regions across multiple x86 basic blocks – Aggressive and speculative optimizations – No profiling (reduces overhead)

SLIDE 23

DARCO: TOL execution flow

[Flowchart: for each x86 eip, check the code cache ($). On a hit, execute from the code cache; chained translations keep execution there. On a miss, interpret; once a basic block's execution counter exceeds BBth, translate it, store it in the code cache, and chain it. When a translated block's counter exceeds SBth and it is not already optimized, create a superblock and optimize it.]
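The flow condenses into a small dispatch loop. A minimal sketch with toy thresholds (BBth = 2, SBth = 3; DARCO's real values differ):

```python
BB_TH, SB_TH = 2, 3     # promotion thresholds (toy values, not DARCO's)

code_cache = {}         # eip -> "bb" (translated) or "sb" (superblock)
exec_count = {}         # eip -> execution counter (the profiling data)

def step(eip):
    """Execute the block at eip once, promoting it across IM -> BBM -> SBM."""
    exec_count[eip] = exec_count.get(eip, 0) + 1
    mode = code_cache.get(eip)
    if mode is None:                        # not in code $: interpret (IM)
        if exec_count[eip] > BB_TH:         # block got warm: translate (BBM)
            code_cache[eip] = "bb"
            return "translate + execute"
        return "interpret"
    if mode == "bb" and exec_count[eip] > SB_TH:
        code_cache[eip] = "sb"              # block got hot: optimize (SBM)
        return "create/optimize SB + execute"
    return "execute from code $"

history = [step(0x1000) for _ in range(6)]
```

With these toy thresholds, a repeatedly executed block is interpreted twice, translated on its third execution, and promoted to a superblock on its fourth.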

SLIDE 24

DARCO: TOL execution flow

Superblock optimization (SBM) applies the following passes:

  • x86 to Intermediate Representation (IR)
  • Control Speculation
  • Loop Unrolling
  • Static Single Assignment (SSA)
  • Forward Pass

– Constant Folding – Constant/Copy Propagation – Common Subexpression Elimination

  • Backward Pass

– Dead Code Elimination

  • Data Dependence Graph (DDG)

– Memory Alias Analysis – Redundant Load Removal – Store Forwarding

  • Instruction Scheduling

– Data Speculation

  • Register Allocation
  • Code Generation (optimized IR to Host ISA)
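To make a couple of these passes concrete, here is a toy forward pass (constant folding and constant propagation) and backward pass (dead code elimination) over a hypothetical three-address IR; DARCO's real IR and passes are far more elaborate.

```python
def forward_pass(ir):
    """Constant folding + constant propagation over (dst, op, a, b) tuples."""
    consts, out = {}, []
    for dst, op, a, b in ir:
        a = consts.get(a, a)                      # propagate known constants
        b = consts.get(b, b)
        if op == "mov" and isinstance(a, int):
            consts[dst] = a
        elif op == "add" and isinstance(a, int) and isinstance(b, int):
            consts[dst] = a + b                   # fold the addition
            out.append((dst, "mov", a + b, None))
            continue
        out.append((dst, op, a, b))
    return out

def backward_pass(ir, live):
    """Dead code elimination: drop writes to values that are never used."""
    out = []
    for dst, op, a, b in reversed(ir):
        if dst in live:
            live.discard(dst)
            live.update(x for x in (a, b) if isinstance(x, str))
            out.append((dst, op, a, b))
    return list(reversed(out))

ir = [("x", "mov", 1, None), ("y", "mov", 2, None),
      ("z", "add", "x", "y"), ("w", "add", "z", "x")]
optimized = backward_pass(forward_pass(ir), live={"w"})
```

Four input instructions collapse to a single `mov` of the folded constant once the dead definitions are eliminated.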

SLIDE 25

DARCO: Building a superblock

[Figure: control-flow graph with basic blocks A–E. Branch biases: A goes to B 5% / C 95%; C goes to D 98% / elsewhere 2%; D's branch splits 70% / 30%. Once the execution counter exceeds SBth, the biased branches in A and C are converted into asserts and A, C, D form a superblock; D's unbiased branch remains a branch.]
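The superblock construction above can be sketched as a walk along the biased direction of each branch; the 90% bias cutoff and the CFG encoding are assumptions for illustration.

```python
BIAS_TH = 0.90   # hypothetical bias threshold for speculating a direction

def build_superblock(cfg, start):
    """cfg maps block -> (most likely successor, probability of taking it)."""
    blocks, asserts, cur = [], [], start
    while cur is not None:
        blocks.append(cur)
        nxt, prob = cfg.get(cur, (None, 0.0))
        if nxt is None or prob < BIAS_TH:
            break                    # unbiased branch (or region exit): stop
        asserts.append(cur)          # biased branch is speculated: emit assert
        cur = nxt
    return blocks, asserts

# Branch biases from the figure: A->C 95%, C->D 98%, D is 70/30 (unbiased)
cfg = {"A": ("C", 0.95), "C": ("D", 0.98), "D": ("E", 0.70)}
blocks, asserts = build_superblock(cfg, "A")
```

As in the figure, the walk collects A, C, and D, converts the two highly biased branches into asserts, and stops at D's unbiased 70/30 branch.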

SLIDE 26

DARCO: Building a superblock

Impact: bigger optimization regions

SLIDE 27

DARCO: Timing Simulator


SLIDE 28

DARCO: Timing Simulator

  • Models a configurable RISC superscalar in-order processor
  • Pipeline decoupled into: Front-End, Instruction Queue, Back-End

[Figure: pipeline with a Front End (AC, IF, DEC), an Instruction Queue, and a Back End (ISSUE, RR, EXE, WB); branch predictor (BP) and BTB; L1 and L2 TLBs; L1-I$, L1-D$, and L2$ backed by Main Memory with a Prefetcher]
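A deliberately tiny caricature of that decoupled structure: a front-end that feeds an instruction queue, and an in-order, single-issue back-end that drains it. Per-instruction latencies and single-wide stages are simplifying assumptions, nothing like the simulator's configurable model.

```python
from collections import deque

def simulate(latencies, queue_size=4):
    """Count cycles to run a program given each instruction's execute latency."""
    iq = deque()                 # instruction queue between front and back end
    fetched = issued = cycle = 0
    busy_until = 0               # back-end is free again at this cycle
    while issued < len(latencies):
        cycle += 1
        # back end: in-order single issue once the previous op has drained
        if iq and cycle >= busy_until:
            busy_until = cycle + iq.popleft()
            issued += 1
        # front end: fetch/decode one instruction per cycle into the queue
        if fetched < len(latencies) and len(iq) < queue_size:
            iq.append(latencies[fetched])
            fetched += 1
    return cycle
```

Even this toy shows the decoupling at work: the front-end keeps filling the queue while the back-end is stalled on a long-latency instruction.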

SLIDE 29

DARCO: Meeting the requirements

  • Correctness

– Architectural/memory states compared periodically

  • Minimum TOL overhead

– Three-stage translation/optimization (IM, BBM, SBM); chaining of translated blocks

  • Minimum emulation cost

– Aggressive/speculative optimizations

  • Support for multiple guest ISAs (front-ends)

– Incorporating additional front-ends is straightforward

  • Plug and play support

– Modular design

  • Debugging

– Powerful debug toolchain

[Figure: the DARCO components and TOL execution flow shown earlier, annotated with the requirement each one addresses; additional guest front-ends (ARM? Power?) plug into the TOL alongside x86]

SLIDE 30

Outline

  • Introduction
  • HW/SW co-designed processors
  • Building a simulation infrastructure
  • DARCO
  • Evaluation
  • Conclusions

SLIDE 31

Evaluation methodology

  • Benchmarks analyzed

– SPEC CPU2006: Integer, Floating point – Physicsbench

  • We simulate 4 billion x86 instructions

– Some benchmarks have fewer instructions and run to completion

  • DARCO speed (x86 instructions)

– 3.4 MIPS functional-only, and ~0.4 MIPS with the Timing Simulator enabled – Host machines: Intel Xeon E5-2630L (2.40 GHz) and dual L5630 (2.13 GHz) with 24–128 GB RAM
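For a sense of scale (my arithmetic, from the speeds quoted above): a 4-billion-instruction run takes roughly 20 minutes functionally, but close to 3 hours with timing enabled.

```python
instructions = 4e9        # simulated x86 instructions (from this slide)
functional_ips = 3.4e6    # x86 instructions/second, functional emulation only
timing_ips = 0.4e6        # with the timing simulator enabled

functional_hours = instructions / functional_ips / 3600   # ~0.33 h (~20 min)
timing_hours = instructions / timing_ips / 3600           # ~2.78 h
```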

  • Highlights

– x86 dynamic code distribution – Execution time distribution – Emulation cost

SLIDE 32

Evaluation: x86 dynamic code distribution

[Chart: percentage of x86 dynamic instructions executed in IM, BBM, and SBM for each SPECINT2006, SPECFP2006, and Physicsbench benchmark, plus per-suite averages]

  • ~90% of the dynamic instruction stream comes from SBM (optimized code)
  • This hot code represents only ~14% of the static code (not shown)
  • Benchmarks below the 90% mark have a lower repetition factor in their superblocks

SLIDE 33

Evaluation: Execution time distribution

  • Application: cycles executing RISC instructions that emulate the x86 code
  • TOL (overhead): cycles spent translating/optimizing the guest code, profiling, etc.

[Chart: percentage of execution time spent in the TOL vs. the application for each benchmark; TOL overhead is around 20% on average]

SLIDE 34

Evaluation: Emulation cost

[Chart: host instructions generated per x86 instruction (scale 1–6) for each benchmark; the average is ~3.2]

  • ~3.2 host instructions are generated per x86 instruction on average
  • Reasonable for translating from CISC to RISC

SLIDE 35

Conclusions

  • HW/SW co-designed processors

– Huge potential to improve energy efficiency and performance – Several industrial projects – No major project in academia (lack of simulation infrastructures)

  • Challenges

– To become mainstream (e.g. startup delay) – In building a simulation infrastructure (e.g. software layer overhead)

  • DARCO

– May enable academic research in the HW/SW co-designed domain – Provides a modular simulation infrastructure – Easy to add new components/optimizations

SLIDE 36

HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation

Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3, Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6

1 University of Edinburgh, UK · 2 Intel · 3 11pets · 4 Pharmacelera · 5 ARM · 6 Universitat Politècnica de Catalunya, Spain

THANK YOU !!!