Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer
Samira Mirbagher Ajorpaz, Elba Garza, Sangam Jindal, Daniel A. Jiménez
LRU Rand SRRIP
21% 36% 24%
223 training + 439 evaluation workloads
Suite C Suite A Suite B
Thousands of workloads from popular benchmark suites
Maximilien Breughe Presentation ISCA 2016
Part of the fifth Championship Branch Prediction (CBP5), provided by Samsung
Many applications have significant I-cache and BTB misses
[Figure: top-down breakdown of pipeline slots. Labels: pipeline flush; retire (base, msrom); wrong speculation (branch mispredict: direction miss, target miss; machine clear); stalled, front-end bound (fetch latency: ITLB miss, I-cache miss, branch resteer; fetch bandwidth: src1, src2); back-end bound (core bound: divider, exec port; memory bound: external memory, L3, L2, L1, store).]
No previous work on Predictive Replacement Policies for I-cache and BTB
If A becomes dead, B and C are likely to become dead too.
[Figure: blocks A, B, and C, each labeled with the same PC.]
Sampling Dead Block Prediction learns from a small number of sets
[Figure: sampled entries tagged by PC, with an Update Table and a Prediction Table.]
Sampling Dead Block Prediction (SDBP) removes many dead blocks in the last-level (LL) cache
Photo Credit: Sampling Dead Block Predictor by Khan et al.
SDBP increases I-cache MPKI by 4% on average
[Figure: I-cache MPKI per benchmark, SDBP vs. LRU.]
[Figure: signature formation for the D-cache and for the I-cache/BTB. Labels: PC, XOR, Signature.]
GHRP (Global History Reuse Prediction) makes predictions by tracking each block's behavior (eviction or reuse) using the signature
[Figure: per-block metadata next to the LRU stack, with fields of 1 bit, 1 bit, 3 bits, and 16 bits.]
Voting is required for GHRP decisions
[Figure: three hashes (Hash1, Hash2, Hash3) each select a counter; each counter is compared against a threshold, and the three results are voted to produce the prediction.]
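The voting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three tables of 4,096 two-bit counters come from the storage slide, but the hash functions, the threshold value (`DEAD_THRESHOLD`), and the training rule are assumptions made only so the example runs.

```python
from typing import List

NUM_TABLES = 3        # three prediction tables (from the storage slide)
TABLE_ENTRIES = 4096  # entries per table (from the storage slide)
COUNTER_MAX = 3       # 2-bit saturating counters
DEAD_THRESHOLD = 2    # hypothetical threshold; the slides give no value


def table_indices(signature: int) -> List[int]:
    """Map a 16-bit signature to one index per table.

    These three hashes are illustrative placeholders only.
    """
    sig = signature & 0xFFFF
    return [
        (sig * 0x9E37 + 1) % TABLE_ENTRIES,
        ((sig >> 3) ^ (sig * 0x85EB)) % TABLE_ENTRIES,
        ((sig << 2) ^ (sig >> 7) ^ 0x5A5) % TABLE_ENTRIES,
    ]


def predict_dead(tables: List[List[int]], signature: int) -> bool:
    """Each table votes 'dead' if its counter meets the threshold; majority wins."""
    votes = sum(
        tables[t][i] >= DEAD_THRESHOLD
        for t, i in enumerate(table_indices(signature))
    )
    return votes * 2 > NUM_TABLES


def train(tables: List[List[int]], signature: int, was_dead: bool) -> None:
    """Assumed update rule: count up on eviction without reuse, down on reuse."""
    for t, i in enumerate(table_indices(signature)):
        if was_dead:
            tables[t][i] = min(COUNTER_MAX, tables[t][i] + 1)
        else:
            tables[t][i] = max(0, tables[t][i] - 1)


def new_tables() -> List[List[int]]:
    """Create three zero-initialised counter tables."""
    return [[0] * TABLE_ENTRIES for _ in range(NUM_TABLES)]
```

Using several differently hashed tables means a single aliased counter cannot flip the decision on its own, which is one plausible reason for voting.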
[Figure: GHRP operation. On a miss, a new signature and a new prediction are computed and the access may bypass the cache. On a miss that is not bypassed, a victim block is selected and the new block is inserted. On a hit, the hit block is updated.]
Global history update: shift left and append the current PC
Old global history: PC(t-4) PC(t-3) PC(t-2) PC(t-1)
New global history: PC(t-3) PC(t-2) PC(t-1) PC(t)
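A minimal sketch of the shift-and-append update, together with one way a signature might be formed from the history. The 16-bit history register and 16-bit signature come from the storage slide, and the four-PC depth comes from the history figure; the split into 4 bits per PC (`BITS_PER_PC`) and the exact XOR used in `make_signature` are assumptions for illustration.

```python
HISTORY_BITS = 16  # history register size (2B on the storage slide)
BITS_PER_PC = 4    # assumption: each PC contributes its 4 low-order bits
HISTORY_MASK = (1 << HISTORY_BITS) - 1
PC_MASK = (1 << BITS_PER_PC) - 1


def update_history(history: int, pc: int) -> int:
    """Shift the history left and append bits of the current PC.

    The oldest PC's bits fall off the top of the register.
    """
    return ((history << BITS_PER_PC) | (pc & PC_MASK)) & HISTORY_MASK


def make_signature(history: int, pc: int) -> int:
    """Assumed signature: XOR of the global history with the low 16 PC bits."""
    return (history ^ pc) & 0xFFFF
```

With these assumptions, feeding PCs whose low nibbles are 1, 2, 3, 4 yields history `0x1234`; appending a PC ending in 5 drops the oldest nibble and gives `0x2345`.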
If A becomes dead in I-cache, B is likely to become dead in BTB too
[Figure: the same branch (Br) is associated with block A in the I-cache and entry B in the BTB.]
BTB and I-cache can share prediction resources
BTB and I-cache joint design
[Figure: the I-cache and the BTB, each with its own LRU stack, alongside a history of recent branches br(t-4) br(t-3) br(t-2) br(t-1).]
Simulator: CBP5 (trace-driven, MPKI)
Workloads: 662 traces (Short-Mobile, Long-Mobile, Short-Server, Long-Server) from CBP5, provided by Samsung
Branch Predictor: Hashed Perceptron
I-cache: 64KB, 8-way, 64B blocks
BTB: 4K entries, 8-way
Comparison: LRU (baseline), Random, SRRIP, SDBP
For a 64KB, 8-way I-cache with 64B blocks, the total storage overhead of GHRP is 5.13KB
Prediction Tables: three tables x 4,096 entries x 2-bit counters = 3KB
Prediction bits: 1,024 blocks x 1 bit = 128B
Signature bits: 1,024 blocks x 16 bits = 2KB
History Register: one register x 16 bits = 2B
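The storage total can be checked with a short calculation. The sketch below only re-derives the numbers listed on the slide; the function and constant names are made up for the example.

```python
CACHE_BYTES = 64 * 1024                  # 64KB I-cache
BLOCK_BYTES = 64                         # 64B blocks
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES  # 1,024 blocks


def ghrp_storage_bits() -> dict:
    """Bits used by each GHRP structure, using the sizes from the slide."""
    return {
        "prediction_tables": 3 * 4096 * 2,  # three tables of 2-bit counters
        "prediction_bits": NUM_BLOCKS * 1,  # one prediction bit per block
        "signature_bits": NUM_BLOCKS * 16,  # one 16-bit signature per block
        "history_register": 16,             # a single 16-bit register
    }


def total_kib() -> float:
    """Total overhead in KB (1KB = 1,024 bytes)."""
    return sum(ghrp_storage_bits().values()) / 8 / 1024


if __name__ == "__main__":
    for name, bits in ghrp_storage_bits().items():
        print(f"{name}: {bits // 8} bytes")
    print(f"total: {total_kib():.2f} KB")
```

The components sum to 3,072 + 128 + 2,048 + 2 = 5,250 bytes, which rounds to the 5.13KB quoted on the slide.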
With 95% certainty GHRP reduces I-cache MPKI by 33% compared to LRU
I-cache MPKI reduction relative to LRU
With 95% certainty GHRP reduces BTB MPKI by 41% compared to LRU
BTB MPKI reduction relative to LRU
LRU Rand SRRIP SDBP GHRP
21% 36% 24% 48% 28%