SLIDE 1

Filtered Runahead Execution with a Runahead Buffer

Milad Hashemi Yale N. Patt December 8, 2015

SLIDE 2

Runahead Execution Overview

  • Runahead dynamically expands the instruction window

when the pipeline is stalled [Mutlu et al., 2003]

  • The core checkpoints architectural state
  • The result of the memory operation that caused the stall is

marked as poisoned in the physical register file

  • The core continues to fetch and execute instructions
  • Operations are discarded instead of retired
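The checkpoint/poison mechanism above can be sketched in a few lines of simulator-style Python. This is a minimal illustration, not the talk's implementation: the op format, register names, and function names are all invented for the example.

```python
# Minimal sketch of runahead's poison-bit mechanism. An op is a dict:
# {"kind": "load"/"alu", "srcs": [...], "dest": ..., "fn": ..., "miss": bool}.

def runahead(instructions, regs):
    poisoned = set()          # registers whose values are invalid in runahead
    checkpoint = dict(regs)   # architectural state saved at runahead entry

    for op in instructions:
        if any(src in poisoned for src in op["srcs"]):
            # Result depends on missing data: mark the destination poisoned
            # instead of executing.
            poisoned.add(op["dest"])
        elif op["kind"] == "load" and op.get("miss"):
            poisoned.add(op["dest"])   # the miss itself poisons its destination
        else:
            regs[op["dest"]] = op["fn"](*(regs[s] for s in op["srcs"]))

    # Runahead results are discarded, not retired; the checkpoint is restored.
    return checkpoint, poisoned
```

Note how poison propagates: any op that reads a poisoned register poisons its own destination, while independent ops still execute and can expose further cache misses.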
SLIDE 3

Core Stall Cycles

[Bar chart: % Total Core Cycles spent stalled per SPEC CPU2006 benchmark, from calculix through mcf, plus the MI-Average. Y-axis: 10-100.]
SLIDE 4

Core Stall Cycles

[Same stall-cycle chart as the previous slide, with a numeric annotation per benchmark ranging from 3.0 (calculix) down to 0.3.]
SLIDE 5

Runahead Buffer Overview

  • Overview of Memory Dependence Chains
  • Traditional Runahead Observations
  • Runahead Buffer Proposal and Pipeline Modifications
  • Runahead Buffer System Configuration and Evaluation
  • Runahead Buffer Conclusions
SLIDE 6

Background

  • Every load has a chain of operations that must be completed

to generate the address of the memory access

SLIDE 7

Example Dependence Chain

  LD  [R3] -> R5
  ADD R4, R5 -> R9
  ADD R9, R1 -> R6
  LD  [R6] -> R8   ← Cache Miss

SLIDE 8

Example Dependence Chain

  LD  [R3] -> R5
  ADD R4, R5 -> R9
  ADD R9, R1 -> R6
  LD  [R6] -> R8   ← Cache Miss

These are the only operations that need to be completed before the cache miss can be executed.
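This claim can be checked with a tiny sketch: executing just these four operations, given only their live-in registers, is enough to compute the address of the miss load. The register values and memory contents below are invented purely for illustration.

```python
# Executing only the filtered chain (live-ins R1, R3, R4; values invented)
# is enough to compute the address of the miss load LD [R6].

memory = {0x30: 7}                          # contents at the address R3 holds
regs = {"R1": 0x10, "R3": 0x30, "R4": 5}    # chain live-ins only

regs["R5"] = memory[regs["R3"]]        # LD  [R3] -> R5
regs["R9"] = regs["R4"] + regs["R5"]   # ADD R4, R5 -> R9
regs["R6"] = regs["R9"] + regs["R1"]   # ADD R9, R1 -> R6
miss_address = regs["R6"]              # LD  [R6] -> R8 issues to this address

print(hex(miss_address))               # prints 0x1c
```

No other program state is touched: any op outside this backward slice can be skipped without affecting the miss address.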

SLIDE 9

[Stacked bar chart: total operations executed during runahead, split into Dependence Chain vs. Other Operation, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Runahead Observations 1

SLIDE 10

[Stacked bar chart: total operations executed during runahead, split into Dependence Chain vs. Other Operation, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Traditional runahead executes many operations that are irrelevant to the dependence chain of a cache miss

Runahead Observations 1

SLIDE 11

[Stacked bar chart: total cache-miss dependence chains, split into Unique Chain vs. Repeated Chain, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Runahead Observations 2

SLIDE 12

[Stacked bar chart: total cache-miss dependence chains, split into Unique Chain vs. Repeated Chain, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Most dependence chains are repeated in traditional runahead

Runahead Observations 2

SLIDE 13

[Bar chart: average dependence chain length per benchmark from calculix through mcf, plus the overall Average. Y-axis: 10-80.]

Runahead Observations 3

SLIDE 14

Runahead Observations 3

[Bar chart: average dependence chain length per benchmark from calculix through mcf, plus the overall Average. Y-axis: 10-80.]

Most dependence chains are short

SLIDE 15

Runahead Buffer

  • At a full-window stall, dynamically identify the dependence chain to use during runahead from the reorder buffer
  • Once the chain is identified, place it in a runahead buffer
  • The front-end is then clock-gated and the runahead buffer directly feeds decoded micro-ops into the back-end for runahead execution
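The replay loop above can be sketched in simulator-style Python. This is an illustrative sketch, not the hardware mechanism: the function names, the op format, and the callback interfaces are all invented for the example.

```python
# Sketch of runahead-buffer mode: replay a filtered dependence chain in a
# loop while the blocking miss is outstanding. The front-end is assumed
# clock-gated; `issue_op` models the back-end consuming one decoded micro-op.

def runahead_buffer_mode(chain, issue_op, miss_outstanding):
    """Replay `chain` (a short filtered dependence chain) repeatedly and
    return how many load micro-ops were issued, i.e. how many chances the
    chain had to expose a new cache miss."""
    loads_issued = 0
    while miss_outstanding():
        for uop in chain:
            issued = issue_op(uop)        # speculative execution in the back-end
            if issued and uop.get("is_load"):
                loads_issued += 1         # each iteration can prefetch a new miss
    return loads_issued
```

Because the chain contains only the ops needed to generate the miss address, each trip around the loop can issue the next instance of the miss load, which is how the buffer generates memory-level parallelism.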

SLIDE 16

Runahead Buffer Pipeline Modifications

[Pipeline diagram: Fetch → Decode → Rename → Select/Wakeup → Register Read → Execute → Commit. Added hardware: architectural checkpoint, RA-Cache, pseudo-wakeup logic, poison bits, and the RA-Buffer.]

SLIDE 17

Runahead Buffer Chain Generation

Reorder buffer contents (physical registers; PCs 0xD, 0xE, 0x7, 0x8, 0xA, 0xA):

  LD  [P3]  -> P5
  LD  [P15] -> P2
  ADD P4, P5 -> P9
  ADD P9, P1 -> P6
  MOV P6 -> P7
  LD  [P7] -> P8   ← blocking cache miss

Backward ROB walk, one entry per cycle; the source-register search list evolves as:

  {P7} → {P6} → {P9, P1} → {P1, P4, P5} → {P4, P5} → {P5} → {P3}

Generated dependence chain (architectural registers):

  LD  [R0] -> R2
  LD  [R3] -> R5
  ADD R4, R5 -> R7
  ADD R7, R1 -> R6
  MOV R6 -> R0
  LD  [R0] -> R2
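The backward search this slide illustrates can be sketched as a short Python function. This is a simplified model of the slide's walk (one pass over a list standing in for the ROB), with an invented op format; the real hardware checks one ROB entry per cycle.

```python
# Sketch of the backward ROB search that generates a filtered dependence
# chain. Each op is {"srcs": [...], "dest": ...}; `miss_index` points at
# the blocking load.

def extract_chain(rob, miss_index):
    """Walk the ROB backward from the blocking load, collecting every
    operation that produces a register still on the search list."""
    need = set(rob[miss_index]["srcs"])   # source registers to resolve
    chain = [rob[miss_index]]
    for op in reversed(rob[:miss_index]):
        if op["dest"] in need:
            need.discard(op["dest"])
            need.update(op["srcs"])       # now this op's sources are needed too
            chain.append(op)
    chain.reverse()                       # oldest-first, ready to replay
    return chain, need                    # leftover `need` = chain live-ins
```

Run on the slide's example, the unrelated LD [P15] -> P2 is excluded, the chain ends up in oldest-first order, and the live-ins (P1, P3, P4) are the registers the chain reads from the checkpointed state.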

SLIDE 18

Runahead Buffer Optimizations

  • A small dependence chain cache (2 entries) improves performance
  • Hybrid Policy
  • The core begins traditional runahead execution instead of using the runahead buffer if:
  • An operation with the same PC as the operation that is blocking the ROB is not found in the ROB
  • The generated dependence chain is too long (more than 32 operations)
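The hybrid policy's mode decision can be summarized in a few lines. The fallback conditions and the 32-op cap are taken from the slide; the function name, argument types, and return values are invented for this sketch.

```python
# Sketch of the hybrid policy: decide between runahead-buffer mode and
# traditional runahead at a full-window stall.

MAX_CHAIN_LEN = 32   # chain-length cap stated on the slide

def choose_runahead_mode(blocking_pc, rob_pcs, chain):
    """Return 'buffer' to use the runahead buffer, or 'traditional' to
    fall back to full runahead execution."""
    if blocking_pc not in rob_pcs:
        # No other operation with the blocking load's PC is in the ROB,
        # so there is no chain to extract: run traditional runahead.
        return "traditional"
    if chain is None or len(chain) > MAX_CHAIN_LEN:
        # Chain too long to fit the runahead buffer.
        return "traditional"
    return "buffer"
```

The policy is conservative: the buffer is only used when a short, extractable chain exists, and traditional runahead remains the safety net otherwise.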
SLIDE 19

System Configuration

  • Single Core
  • 4-wide Issue
  • 192 Entry Reorder Buffer
  • Runahead Buffer: 32 Entries
  • Runahead Buffer Chain Cache: 2 Entries
  • Caches
  • 32 KB L1 I/D-Cache, 3-Cycle
  • 1 MB Last Level Cache, 18-Cycle
  • Stream Prefetcher
  • Non-Uniform Access Latency DRAM

5 Evaluated Configurations

  • Traditional Runahead
  • Runahead Buffer
  • Runahead Buffer + Chain Cache
  • Hybrid Policy
  • Traditional Runahead + Energy Optimizations

SLIDE 20

Runahead Buffer Performance

[Bar chart: % IPC difference over the no-prefetching baseline per benchmark (calculix through mcf) plus GMean, for Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid Policy. Y-axis: -5 to 40.]

SLIDE 21

Runahead Buffer Performance

[Bar chart: % IPC difference over the no-prefetching baseline for Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid Policy.]

SLIDE 22

Runahead Buffer MLP

[Bar chart: cache misses generated per runahead interval, Runahead vs. Runahead Buffer. Y-axis: 2-18.]

SLIDE 23

Energy Analysis

[Bar chart: % energy difference over the no-prefetching baseline for Runahead, Runahead Enhancements, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid.]

SLIDE 24

Stall Cycles in Runahead Buffer Mode

[Bar chart: stall cycles while in runahead buffer mode as a percentage of total cycles. Y-axis: 0-100%.]

SLIDE 25

Stream Prefetching

[Bar chart: % IPC difference over the no-prefetching baseline for Stream, Runahead + Stream, Runahead Buffer + Stream, Runahead Buffer + Chain Cache + Stream, and Hybrid + Stream. Y-axis: -20 to 160.]

SLIDE 26

Bandwidth Consumption

[Bar chart: normalized bandwidth consumption for Stream, Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid. Y-axis: 0.00-2.00.]

SLIDE 27

Energy Analysis

[Bar chart: % energy difference over the no-prefetching baseline for Baseline + Stream, Runahead + Stream, Runahead Enhancements + Stream, Runahead Buffer + Stream, Runahead Buffer + Chain Cache + Stream, and Hybrid + Stream.]

SLIDE 28

Runahead Buffer Conclusions

  • Many of the operations executed in traditional runahead execution are unnecessary for generating cache misses
  • The runahead buffer uses filtered dependence chains that contain only the operations required to produce a cache miss
  • These chains are generally short
  • Each chain is read into a buffer and speculatively executed in a loop while the core would otherwise be idle

SLIDE 29

Runahead Buffer Conclusions

  • The runahead buffer enables the front-end to be idle for 47% of the total execution cycles of the medium and high memory intensity SPEC CPU2006 benchmarks
  • The runahead buffer generates over twice as much MLP on average as traditional runahead execution
  • The runahead buffer results in a 17.2% performance increase and a 6.7% decrease in energy consumption over a system with no prefetching; traditional runahead execution results in a 12.3% performance increase and a 9.5% energy increase

SLIDE 30

Filtered Runahead Execution with a Runahead Buffer

Milad Hashemi Yale N. Patt December 8, 2015