SLIDE 1

Filtered Runahead Execution with a Runahead Buffer

Milad Hashemi Yale N. Patt December 8, 2015

SLIDE 2

Runahead Execution Overview

  • Runahead dynamically expands the instruction window

when the pipeline is stalled [Mutlu et al., 2003]

  • The core checkpoints architectural state
  • The result of the memory operation that caused the stall is

marked as poisoned in the physical register file

  • The core continues to fetch and execute instructions
  • Operations are discarded instead of retired
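The checkpoint/poison mechanism above can be sketched in a few lines of simulator-style Python. This is a minimal illustration, not the talk's implementation: the op format, register names, and function names are all invented for the example.

```python
# Minimal sketch of runahead's poison-bit mechanism. An op is a dict:
# {"kind": "load"/"alu", "srcs": [...], "dest": ..., "fn": ..., "miss": bool}.

def runahead(instructions, regs):
    poisoned = set()          # registers whose values are invalid in runahead
    checkpoint = dict(regs)   # architectural state saved at runahead entry

    for op in instructions:
        if any(src in poisoned for src in op["srcs"]):
            # Result depends on missing data: mark the destination poisoned
            # instead of executing.
            poisoned.add(op["dest"])
        elif op["kind"] == "load" and op.get("miss"):
            poisoned.add(op["dest"])   # the miss itself poisons its destination
        else:
            regs[op["dest"]] = op["fn"](*(regs[s] for s in op["srcs"]))

    # Runahead results are discarded, not retired; the checkpoint is restored.
    return checkpoint, poisoned
```

Note how poison propagates: any op that reads a poisoned register poisons its own destination, while independent ops still execute and can expose further cache misses.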
SLIDE 3

Core Stall Cycles

[Bar chart: % Total Core Cycles spent stalled per SPEC CPU2006 benchmark, from calculix through mcf, plus the MI-Average. Y-axis: 10-100.]
SLIDE 4

Core Stall Cycles

[Same stall-cycle chart as the previous slide, with a numeric annotation per benchmark ranging from 3.0 (calculix) down to 0.3.]
SLIDE 5

Runahead Buffer Overview

  • Overview of Memory Dependence Chains
  • Traditional Runahead Observations
  • Runahead Buffer Proposal and Pipeline Modifications
  • Runahead Buffer System Configuration and Evaluation
  • Runahead Buffer Conclusions
SLIDE 6

Background

  • Every load has a chain of operations that must be completed

to generate the address of the memory access

SLIDE 7

Example Dependence Chain

  LD  [R3] -> R5
  ADD R4, R5 -> R9
  ADD R9, R1 -> R6
  LD  [R6] -> R8   ← Cache Miss

SLIDE 8

Example Dependence Chain

  LD  [R3] -> R5
  ADD R4, R5 -> R9
  ADD R9, R1 -> R6
  LD  [R6] -> R8   ← Cache Miss

These are the only operations that need to be completed before the cache miss can be executed.
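This claim can be checked with a tiny sketch: executing just these four operations, given only their live-in registers, is enough to compute the address of the miss load. The register values and memory contents below are invented purely for illustration.

```python
# Executing only the filtered chain (live-ins R1, R3, R4; values invented)
# is enough to compute the address of the miss load LD [R6].

memory = {0x30: 7}                          # contents at the address R3 holds
regs = {"R1": 0x10, "R3": 0x30, "R4": 5}    # chain live-ins only

regs["R5"] = memory[regs["R3"]]        # LD  [R3] -> R5
regs["R9"] = regs["R4"] + regs["R5"]   # ADD R4, R5 -> R9
regs["R6"] = regs["R9"] + regs["R1"]   # ADD R9, R1 -> R6
miss_address = regs["R6"]              # LD  [R6] -> R8 issues to this address

print(hex(miss_address))               # prints 0x1c
```

No other program state is touched: any op outside this backward slice can be skipped without affecting the miss address.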

SLIDE 9

[Stacked bar chart: total operations executed during runahead, split into Dependence Chain vs. Other Operation, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Runahead Observations 1

SLIDE 10

[Stacked bar chart: total operations executed during runahead, split into Dependence Chain vs. Other Operation, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Traditional runahead executes many operations that are irrelevant to the dependence chain of a cache miss

Runahead Observations 1

SLIDE 11

[Stacked bar chart: total cache-miss dependence chains, split into Unique Chain vs. Repeated Chain, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Runahead Observations 2

SLIDE 12

[Stacked bar chart: total cache-miss dependence chains, split into Unique Chain vs. Repeated Chain, per benchmark from calculix through mcf. Y-axis: 0-100%.]

Most dependence chains are repeated in traditional runahead

Runahead Observations 2

SLIDE 13

[Bar chart: average dependence chain length per benchmark from calculix through mcf, plus the overall Average. Y-axis: 10-80.]

Runahead Observations 3

SLIDE 14

Runahead Observations 3

[Bar chart: average dependence chain length per benchmark from calculix through mcf, plus the overall Average. Y-axis: 10-80.]

Most dependence chains are short

SLIDE 15

Runahead Buffer

  • At a full-window stall, dynamically identify the dependence chain to use during runahead from the reorder buffer
  • Once the chain is identified, place it in a runahead buffer
  • The front-end is then clock-gated and the runahead buffer directly feeds decoded micro-ops into the back-end for runahead execution
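The replay loop above can be sketched in simulator-style Python. This is an illustrative sketch, not the hardware mechanism: the function names, the op format, and the callback interfaces are all invented for the example.

```python
# Sketch of runahead-buffer mode: replay a filtered dependence chain in a
# loop while the blocking miss is outstanding. The front-end is assumed
# clock-gated; `issue_op` models the back-end consuming one decoded micro-op.

def runahead_buffer_mode(chain, issue_op, miss_outstanding):
    """Replay `chain` (a short filtered dependence chain) repeatedly and
    return how many load micro-ops were issued, i.e. how many chances the
    chain had to expose a new cache miss."""
    loads_issued = 0
    while miss_outstanding():
        for uop in chain:
            issued = issue_op(uop)        # speculative execution in the back-end
            if issued and uop.get("is_load"):
                loads_issued += 1         # each iteration can prefetch a new miss
    return loads_issued
```

Because the chain contains only the ops needed to generate the miss address, each trip around the loop can issue the next instance of the miss load, which is how the buffer generates memory-level parallelism.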

SLIDE 16

Runahead Buffer Pipeline Modifications

[Pipeline diagram: Fetch → Decode → Rename → Select/Wakeup → Register Read → Execute → Commit. Added hardware: architectural checkpoint, RA-Cache, pseudo-wakeup logic, poison bits, and the RA-Buffer.]

SLIDE 17

Runahead Buffer Chain Generation

Reorder buffer contents (physical registers; PCs 0xD, 0xE, 0x7, 0x8, 0xA, 0xA):

  LD  [P3]  -> P5
  LD  [P15] -> P2
  ADD P4, P5 -> P9
  ADD P9, P1 -> P6
  MOV P6 -> P7
  LD  [P7] -> P8   ← blocking cache miss

Backward ROB walk, one entry per cycle; the source-register search list evolves as:

  {P7} → {P6} → {P9, P1} → {P1, P4, P5} → {P4, P5} → {P5} → {P3}

Generated dependence chain (architectural registers):

  LD  [R0] -> R2
  LD  [R3] -> R5
  ADD R4, R5 -> R7
  ADD R7, R1 -> R6
  MOV R6 -> R0
  LD  [R0] -> R2
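The backward search this slide illustrates can be sketched as a short Python function. This is a simplified model of the slide's walk (one pass over a list standing in for the ROB), with an invented op format; the real hardware checks one ROB entry per cycle.

```python
# Sketch of the backward ROB search that generates a filtered dependence
# chain. Each op is {"srcs": [...], "dest": ...}; `miss_index` points at
# the blocking load.

def extract_chain(rob, miss_index):
    """Walk the ROB backward from the blocking load, collecting every
    operation that produces a register still on the search list."""
    need = set(rob[miss_index]["srcs"])   # source registers to resolve
    chain = [rob[miss_index]]
    for op in reversed(rob[:miss_index]):
        if op["dest"] in need:
            need.discard(op["dest"])
            need.update(op["srcs"])       # now this op's sources are needed too
            chain.append(op)
    chain.reverse()                       # oldest-first, ready to replay
    return chain, need                    # leftover `need` = chain live-ins
```

Run on the slide's example, the unrelated LD [P15] -> P2 is excluded, the chain ends up in oldest-first order, and the live-ins (P1, P3, P4) are the registers the chain reads from the checkpointed state.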

SLIDE 18

Runahead Buffer Optimizations

  • A small dependence chain cache (2 entries) improves performance
  • Hybrid Policy
  • The core begins traditional runahead execution instead of using the runahead buffer if:
  • An operation with the same PC as the operation that is blocking the ROB is not found in the ROB
  • The generated dependence chain is too long (more than 32 operations)
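The hybrid policy's mode decision can be summarized in a few lines. The fallback conditions and the 32-op cap are taken from the slide; the function name, argument types, and return values are invented for this sketch.

```python
# Sketch of the hybrid policy: decide between runahead-buffer mode and
# traditional runahead at a full-window stall.

MAX_CHAIN_LEN = 32   # chain-length cap stated on the slide

def choose_runahead_mode(blocking_pc, rob_pcs, chain):
    """Return 'buffer' to use the runahead buffer, or 'traditional' to
    fall back to full runahead execution."""
    if blocking_pc not in rob_pcs:
        # No other operation with the blocking load's PC is in the ROB,
        # so there is no chain to extract: run traditional runahead.
        return "traditional"
    if chain is None or len(chain) > MAX_CHAIN_LEN:
        # Chain too long to fit the runahead buffer.
        return "traditional"
    return "buffer"
```

The policy is conservative: the buffer is only used when a short, extractable chain exists, and traditional runahead remains the safety net otherwise.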
SLIDE 19

System Configuration

  • Single Core
  • 4-wide Issue
  • 192 Entry Reorder Buffer
  • Runahead Buffer: 32 Entries
  • Runahead Buffer Chain Cache: 2 Entries
  • Caches
  • 32 KB L1 I/D-Cache, 3-Cycle
  • 1 MB Last Level Cache, 18-Cycle
  • Stream Prefetcher
  • Non-Uniform Access Latency DRAM

5 Evaluated Configurations

  • Traditional Runahead
  • Runahead Buffer
  • Runahead Buffer + Chain Cache
  • Hybrid Policy
  • Traditional Runahead + Energy Optimizations

SLIDE 20

Runahead Buffer Performance

[Bar chart: % IPC difference over the no-prefetching baseline per benchmark (calculix through mcf) plus GMean, for Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid Policy. Y-axis: -5 to 40.]

SLIDE 21

Runahead Buffer Performance

[Bar chart: % IPC difference over the no-prefetching baseline for Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid Policy.]

SLIDE 22

Runahead Buffer MLP

[Bar chart: cache misses generated per runahead interval, Runahead vs. Runahead Buffer. Y-axis: 2-18.]

SLIDE 23

Energy Analysis

[Bar chart: % energy difference over the no-prefetching baseline for Runahead, Runahead Enhancements, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid.]

SLIDE 24

Stall Cycles in Runahead Buffer Mode

[Bar chart: stall cycles while in runahead buffer mode as a percentage of total cycles. Y-axis: 0-100%.]

SLIDE 25

Stream Prefetching

[Bar chart: % IPC difference over the no-prefetching baseline for Stream, Runahead + Stream, Runahead Buffer + Stream, Runahead Buffer + Chain Cache + Stream, and Hybrid + Stream. Y-axis: -20 to 160.]

SLIDE 26

Bandwidth Consumption

[Bar chart: normalized bandwidth consumption for Stream, Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, and Hybrid. Y-axis: 0.00-2.00.]

SLIDE 27

Energy Analysis

[Bar chart: % energy difference over the no-prefetching baseline for Baseline + Stream, Runahead + Stream, Runahead Enhancements + Stream, Runahead Buffer + Stream, Runahead Buffer + Chain Cache + Stream, and Hybrid + Stream.]

SLIDE 28

Runahead Buffer Conclusions

  • Many of the operations executed in traditional runahead execution are unnecessary for generating cache misses
  • The runahead buffer uses filtered dependence chains that contain only the operations required to produce a cache miss
  • These chains are generally short
  • Each chain is read into a buffer and speculatively executed in a loop while the core would otherwise be idle

SLIDE 29

Runahead Buffer Conclusions

  • The runahead buffer enables the front-end to be idle for 47% of the total execution cycles of the medium and high memory intensity SPEC CPU2006 benchmarks
  • The runahead buffer generates over twice as much MLP on average as traditional runahead execution
  • The runahead buffer results in a 17.2% performance increase and a 6.7% decrease in energy consumption over a system with no prefetching; traditional runahead execution results in a 12.3% performance increase and a 9.5% energy increase

SLIDE 30

Filtered Runahead Execution with a Runahead Buffer

Milad Hashemi Yale N. Patt December 8, 2015