Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer

slide-1
SLIDE 1

Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer

Samira Mirbagher Ajorpaz, Elba Garza, Sangam Jindal, Daniel A. Jiménez

slide-5
SLIDE 5

LRU 21%, Rand 36%, SRRIP 24%

slide-8
SLIDE 8

223 training + 439 evaluation workloads

Suite C Suite A Suite B

Thousands of workloads from popular benchmark suites

Maximilien Breughe Presentation ISCA 2016

Part of the fifth Championship Branch Prediction (CBP5), provided by Samsung

slide-9
SLIDE 9

Many applications have significant I-cache and BTB misses

slide-10
SLIDE 10

[Figure: top-down pipeline breakdown (pipeline flush, stalled) with the following categories]

  • Retire: base, MSROM

  • Wrong speculation: branch misprediction (direction miss, target miss), machine clear

  • Front-end bound: fetch latency (ITLB miss, I-cache miss, branch resteer), fetch bandwidth (src1, src2)

  • Back-end bound: core bound (divider, exec port), memory bound (external memory, L3, L2, L1, store)

slide-17
SLIDE 17

No previous work on Predictive Replacement Policies for I-cache and BTB

slide-18
SLIDE 18

If A becomes dead, B and C are likely to become dead too.

[Figure: blocks A, B, and C, each labeled with the same PC]

slide-19
SLIDE 19

Sampling Dead Block Prediction learns from a small number of sets

[Figure: PCs observed in the sampled sets, an update table, and a prediction table]

slide-20
SLIDE 20

Sampling Dead Block Prediction eliminates many dead blocks in the last-level (LL) cache

Photo Credit: Sampling Dead Block Predictor by Khan et al.

slide-21
SLIDE 21

SDBP increases I-cache MPKI by 4% on average

[Figure: MPKI per benchmark for SDBP and LRU]

slide-22
SLIDE 22

In the I-cache or BTB, one PC accesses only one set

[Figure: PC-to-set mapping in the I-cache/BTB compared with the D-cache]

slide-23
SLIDE 23

GHRP: Global History Reuse Prediction

slide-24
SLIDE 24

GHRP correlates reuse behavior with control-flow history

[Figure: Signature = PCt XOR Global History (PCt-4, PCt-3, PCt-2, PCt-1)]
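The signature on this slide can be sketched in a few lines. This is only an illustration: the function name `make_signature` is hypothetical, and the 16-bit width is taken from the signature size listed on the storage slide. How the real design selects PC bits is not stated in the deck.

```python
SIGNATURE_BITS = 16  # signature width listed on the storage-overhead slide


def make_signature(pc, global_history):
    """XOR the current PC with the global history and keep 16 bits.

    Hypothetical helper: the deck only says the signature is the PC
    XORed with a history of recent PCs.
    """
    mask = (1 << SIGNATURE_BITS) - 1
    return (pc ^ global_history) & mask


# The same PC reached through two different control-flow paths
# produces two different signatures.
sig_a = make_signature(0xABCD, 0x1234)  # 0xb9f9
sig_b = make_signature(0xABCD, 0x4321)
print(hex(sig_a), hex(sig_b))
```

Because the history is folded into the signature, one instruction block can be predicted dead on one path and live on another.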

slide-26
SLIDE 26

Extra information kept in I-cache block

Field       Size
prediction  1 bit
valid       1 bit
LRU stack   3 bits
signature   16 bits

slide-27
SLIDE 27

GHRP makes its prediction by tracking, for each signature, whether blocks are evicted or reused

[Figure: up and down arrows for eviction and reuse]
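A minimal sketch of this tracking, assuming one table of saturating counters. The 2-bit counters and 4,096 entries come from the storage slide; the update direction (count up on eviction without reuse, down on reuse), the class name `CounterTable`, and the modulo index are assumptions made for illustration.

```python
COUNTER_BITS = 2                       # 2-bit counters, per the storage slide
COUNTER_MAX = (1 << COUNTER_BITS) - 1  # saturates at 3
TABLE_ENTRIES = 4096                   # entries per table, per the storage slide


class CounterTable:
    """One table of saturating counters indexed by signature."""

    def __init__(self, entries=TABLE_ENTRIES):
        self.counters = [0] * entries

    def train(self, signature, evicted_dead):
        # Hypothetical index function: plain modulo on the signature.
        index = signature % len(self.counters)
        value = self.counters[index]
        if evicted_dead:
            # Assumed direction: evicted without reuse -> more "dead".
            self.counters[index] = min(COUNTER_MAX, value + 1)
        else:
            # Assumed direction: reused -> less "dead".
            self.counters[index] = max(0, value - 1)
        return self.counters[index]


table = CounterTable()
for _ in range(5):
    table.train(0x1234, evicted_dead=True)   # saturates at 3
print(table.train(0x1234, evicted_dead=False))  # 2
```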

slide-31
SLIDE 31

Voting is required for GHRP decisions

[Figure: Hash1, Hash2, and Hash3 each select a counter; each counter is compared with a threshold, and a majority vote gives the prediction]
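The voting step can be sketched as follows. The three hash functions and the threshold value here are placeholders chosen for illustration; the deck does not specify them. Only the structure (three tables, one threshold comparison each, majority vote) is taken from the slide.

```python
TABLE_ENTRIES = 4096


# Hypothetical hash functions; the real ones are not given in the deck.
def hash1(sig):
    return sig % TABLE_ENTRIES


def hash2(sig):
    return (sig ^ (sig >> 4)) % TABLE_ENTRIES


def hash3(sig):
    return (sig * 31 + 7) % TABLE_ENTRIES


HASHES = (hash1, hash2, hash3)


def predict_dead(signature, tables, threshold=2):
    """Return True if a majority of tables vote 'dead' for this signature.

    A table votes 'dead' when its counter reaches the (assumed) threshold.
    """
    votes = 0
    for table, hash_fn in zip(tables, HASHES):
        if table[hash_fn(signature)] >= threshold:
            votes += 1
    return votes * 2 > len(HASHES)


tables = [[0] * TABLE_ENTRIES for _ in HASHES]
sig = 0x1234
tables[0][hash1(sig)] = 3
print(predict_dead(sig, tables))  # one vote out of three: False
tables[1][hash2(sig)] = 2
print(predict_dead(sig, tables))  # two votes out of three: True
```

Using several differently hashed tables limits the damage when two unrelated signatures collide in one of them.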

slide-34
SLIDE 34

[Figure: the new signature yields a new prediction, which drives the bypass decision]

slide-37
SLIDE 37

[Figure: on a miss that is not bypassed, the victim block is replaced by the new block]
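One plausible way to pick the victim block from the per-block fields listed earlier (valid, prediction, LRU stack) is sketched below. The priority order (invalid way first, then a predicted-dead block, then plain LRU) is an assumption based on common dead-block replacement schemes, not something the slide states; `Block` and `choose_victim` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    valid: bool
    predicted_dead: bool  # the 1-bit prediction kept per block
    lru_position: int     # 0 = most recently used, 7 = least (8-way set)


def choose_victim(ways):
    """Return the index of the way to replace in one set."""
    # 1. Fill an invalid way if there is one.
    for i, block in enumerate(ways):
        if not block.valid:
            return i
    # 2. Assumed policy: prefer a block that is predicted dead.
    for i, block in enumerate(ways):
        if block.predicted_dead:
            return i
    # 3. Otherwise fall back to the least recently used block.
    return max(range(len(ways)), key=lambda i: ways[i].lru_position)


ways = [Block(valid=True, predicted_dead=False, lru_position=i) for i in range(8)]
print(choose_victim(ways))  # 7, the LRU block
ways[3].predicted_dead = True
print(choose_victim(ways))  # 3, the predicted-dead block
```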

slide-40
SLIDE 40

[Figure: hit path, showing the hit block]

slide-41
SLIDE 41

The global history is shifted left and the current PC is appended

Old global history: PCt-4 PCt-3 PCt-2 PCt-1

New global history: PCt-3 PCt-2 PCt-1 PCt
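The shift can be sketched as a small shift-register update. The 16-bit register width comes from the storage slide; giving each of the four PCs 4 bits, and taking the low PC bits, are assumptions made here so that four entries fit in 16 bits. `update_history` is a hypothetical name.

```python
HISTORY_BITS = 16  # one 16-bit history register, per the storage slide
BITS_PER_PC = 4    # assumption: four PCs share the 16 bits equally


def update_history(history, pc):
    """Shift the history left and append bits of the current PC.

    The oldest entry (PCt-4) falls off the top; the current PC
    becomes the newest entry.
    """
    history_mask = (1 << HISTORY_BITS) - 1
    pc_bits = pc & ((1 << BITS_PER_PC) - 1)  # assumed: low bits of the PC
    return ((history << BITS_PER_PC) | pc_bits) & history_mask


# History holding entries 1, 2, 3, 4; appending 5 drops the oldest entry.
print(hex(update_history(0x1234, 0x5)))  # 0x2345
```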

slide-43
SLIDE 43

If A becomes dead in I-cache, B is likely to become dead in BTB too

[Figure: the same branch appears with block A in the I-cache and entry B in the BTB]

slide-45
SLIDE 45

BTB and I-cache can share prediction resources

[Figure: I-cache and BTB entries each keep signature, valid, LRU stack, and prediction fields]

slide-46
SLIDE 46

BTB and I-cache joint design

[Figure: shared history of recent branches brt-4, brt-3, brt-2, brt-1]

slide-47
SLIDE 47

Workloads          662 traces (Short-Mobile, Long-Mobile, Short-Server, Long-Server) from CBP5, Samsung
Branch predictor   Hashed Perceptron
I-cache            64KB, 8-way, 64B blocks
BTB                4K entries, 8-way
Simulator          CBP5, trace driven, MPKI
Comparison         LRU (baseline), Random, SRRIP, SDBP

slide-52
SLIDE 52

64KB, 8-way I-cache with 64B blocks

Total storage overhead of GHRP is 5.13KB, or 8% of the capacity of the I-cache

Component          Entry size       Count                         Storage
Prediction tables  2-bit counters   4,096 entries, three tables   3KB
Prediction bits    1 bit            1,024 blocks                  128B
Signature bits     16 bits          1,024 blocks                  2KB
History register   16 bits          one register                  2B
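The totals on this slide can be checked with a few lines of arithmetic. The numbers are those on the slide; the variable names are illustrative only.

```python
CACHE_BYTES = 64 * 1024                 # 64KB I-cache
BLOCK_BYTES = 64                        # 64B blocks
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES  # 1,024 blocks

# Storage per component, in bits, as listed on the slide.
overhead_bits = {
    "prediction tables": 3 * 4096 * 2,  # three tables of 4,096 2-bit counters
    "prediction bits": NUM_BLOCKS * 1,  # 1 bit per block
    "signature bits": NUM_BLOCKS * 16,  # 16 bits per block
    "history register": 16,             # one 16-bit register
}

total_bytes = sum(overhead_bits.values()) // 8
total_kb = total_bytes / 1024
percent_of_cache = 100 * total_bytes / CACHE_BYTES

for name, bits in overhead_bits.items():
    print(f"{name}: {bits // 8} bytes")
print(f"total: {total_bytes} bytes = {total_kb:.2f}KB "
      f"({percent_of_cache:.0f}% of the I-cache)")
```

The components sum to 5,250 bytes, which rounds to the 5.13KB and 8% quoted on the slide.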

slide-53
SLIDE 53

With 95% confidence, GHRP reduces I-cache MPKI by 33% compared to LRU

[Figure: I-cache MPKI reduction relative to LRU]

slide-54
SLIDE 54

With 95% confidence, GHRP reduces BTB MPKI by 41% compared to LRU

[Figure: BTB MPKI reduction relative to LRU]

slide-55
SLIDE 55

LRU 21%, Rand 36%, SRRIP 24%, SDBP 48%, GHRP 28%

slide-56
SLIDE 56

Questions?