Sangam: A Multi-component Core Cache Prefetcher



slide-1
SLIDE 1

Sangam: A Multi-component Core Cache Prefetcher

Mainak Chaudhuri, Nayan Deshmukh

slide-2
SLIDE 2

Introduction

  • The word ‘Sangam’ refers to a confluence of 3 rivers, which corresponds to the 3 core components in our prefetcher
  • We achieve a 40.3% speedup over no prefetching for 46 single-core workloads
  • For 4 cores we achieve a 19.5% speedup over no prefetching for 100 multiprogrammed workloads (45 homogeneous, 55 heterogeneous)

slide-3
SLIDE 3

Sangam

[Flowchart boxes: IP-Delta-based sequence predictor; IP-based stride prefetcher; Adaptive-degree next-line prefetcher; Recent access filter; decision "Last PQ Entry?"; Encode residual prefetches as metadata; Inject prefetch; decision "Sequence Complete?"; Stop]

slide-4
SLIDE 4

Sangam

All the components have a common base degree d

slide-5
SLIDE 5

Where?

  • Where to place the prefetcher?
  • L1 allows for better learning, whereas L2 and L3 allow for more hardware resources

[Figure: Speedup at different levels of cache. Bars compare an L1 prefetcher and an L2 prefetcher for IP-stride and IP-delta; y-axis: Speedup, ticks from 1.2 to 1.34.]

slide-8
SLIDE 8
IP-Delta-based Sequence predictor

  • Uses both control-flow and data-flow information to predict a sequence of accesses

[Diagram: the IP indexes the IP Table, whose entry holds the IP, the Last offset, and the Last d+1 deltas. The hash h(IP, Delta) indexes the IP-Delta Table, whose entry holds the Next d deltas, each shown as a Delta with a Confidence.]

slide-18
SLIDE 18
IP-Delta-based Sequence predictor

  • Learning

[Diagram: learning step. The new Delta is derived from the current offset and the Last offset in the IP Table entry for this IP. The kth and the (d+1)th of the Last d+1 deltas are marked; h(IP, Delta) selects an IP-Delta Table entry, and the (d+1-k)th of its Next d deltas is updated (Delta, Confidence).]
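
A minimal sketch of how the two tables could interact, written in Python for readability. It assumes one reading of the learning diagram: the newest delta is recorded as the (d+1-k)th successor of the kth delta in the history. The class name, the tuple key standing in for h(IP, Delta), and the confidence rule (saturating counter, threshold of 2) are illustrative assumptions, not details taken from the slides.

```python
from collections import deque


class IPDeltaSequencePredictor:
    """Toy model of an IP table plus an IP-delta table (names are illustrative)."""

    def __init__(self, d=2, threshold=2, max_conf=3):
        self.d = d                    # base degree: number of deltas predicted
        self.threshold = threshold    # assumed confidence needed to prefetch
        self.max_conf = max_conf      # assumed saturation value of a counter
        self.ip_table = {}            # ip -> [last_offset, last d+1 deltas]
        self.delta_table = {}         # (ip, delta) -> d slots of [delta, conf]

    def _slots(self, ip, delta):
        # The tuple key stands in for the hash h(IP, Delta).
        return self.delta_table.setdefault(
            (ip, delta), [[None, 0] for _ in range(self.d)])

    def _learn(self, ip, deltas):
        newest = deltas[-1]
        for i in range(len(deltas) - 1):
            # The newest delta follows deltas[i] by this many steps.
            distance = len(deltas) - 1 - i
            slot = self._slots(ip, deltas[i])[distance - 1]
            if slot[0] == newest:
                slot[1] = min(slot[1] + 1, self.max_conf)
            elif slot[1] > 0:
                slot[1] -= 1
            else:
                slot[0], slot[1] = newest, 1

    def access(self, ip, offset):
        """Train on one access and return the predicted offsets (may be empty)."""
        entry = self.ip_table.get(ip)
        if entry is None:
            self.ip_table[ip] = [offset, deque(maxlen=self.d + 1)]
            return []
        entry[1].append(offset - entry[0])
        entry[0] = offset
        self._learn(ip, entry[1])
        predictions, target = [], offset
        for delta, conf in self.delta_table.get((ip, entry[1][-1]), []):
            if delta is None or conf < self.threshold:
                break  # sequence ends; another component would take over
            target += delta
            predictions.append(target)
        return predictions
```

With a repeating delta pattern of +1, +2 and d = 2, this sketch stays silent until both slots of an entry reach the assumed threshold, and then predicts the next two offsets on each access.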
slide-31
SLIDE 31
IP-based stride prefetcher

  • We use the IP-based stride predictor when the IP-delta predictor can no longer offer predictions
  • This covers both cases: either the entry is missing from the IP-delta table, or the sequence is below the confidence threshold
  • We use the IP stride predictor when the last two deltas seen for the IP are equal

[Diagram: the IP indexes the IP Table, whose entry holds the IP, the Last offset, and the Last d+1 deltas.]
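
The stride condition above can be sketched as a small Python function. The function name and the rule that a zero stride yields no prefetch are assumptions made for illustration; the slides only state that the last two deltas for the IP must be equal.

```python
def stride_candidates(offset, recent_deltas, degree):
    """Return up to `degree` stride prefetch offsets, or [] if no stride is seen.

    recent_deltas is the per-IP delta history, newest last.
    """
    if len(recent_deltas) < 2:
        return []
    stride = recent_deltas[-1]
    # Assumption: a zero stride (same line again) is not worth prefetching.
    if stride == 0 or stride != recent_deltas[-2]:
        return []
    return [offset + stride * i for i in range(1, degree + 1)]
```

For example, an IP whose last two deltas are both 3 and whose current offset is 10 would yield offsets 13, 16, 19, 22 at degree 4.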

slide-33
SLIDE 33
Next-line prefetcher

  • Maintaining coverage at the cost of accuracy leads to overall better performance
  • Used when both the IP-delta and IP-stride prefetchers cannot offer a prediction
  • Feedback-directed degree selection

[Diagram, built up over several animation steps:
  • Next-line buffer: the lines X+1, X+2, ..., X+d are inserted, each tagged with its degree (1, 2, ..., d)
  • Per-degree counters: Hits and Insertions for degrees 1, 2, ..., d; in the example the Insertions counters go from 4 to 5 as the lines are inserted
  • A Demand Access to X+1 looks up the buffer and hits; the corresponding Hits counter is incremented (2 to 3 in the example) and the X+1 entry is evicted]
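
A minimal Python sketch of feedback-directed degree selection with a next-line buffer and per-degree Hits and Insertions counters. The selection rule (use the largest degree whose hit ratio, and that of every smaller degree, meets an assumed threshold), the FIFO eviction, and all names are assumptions for illustration; the slides show the counters but not the exact policy. The 64-entry default mirrors the NL buffer size in the storage table.

```python
from collections import OrderedDict


class NextLineDegreeSelector:
    def __init__(self, max_degree=4, buffer_entries=64, min_accuracy=0.5):
        self.max_degree = max_degree
        self.buffer_entries = buffer_entries
        self.min_accuracy = min_accuracy        # assumed accuracy threshold
        self.buffer = OrderedDict()             # line -> degree tag
        self.hits = [0] * (max_degree + 1)      # index k holds degree k
        self.insertions = [0] * (max_degree + 1)

    def degree(self):
        """Largest degree k such that degrees 1..k all look accurate enough."""
        chosen = 1
        for k in range(1, self.max_degree + 1):
            if self.hits[k] < self.min_accuracy * self.insertions[k]:
                break
            chosen = k
        return chosen

    def train(self, line):
        """Track X+1..X+d in the buffer; return the lines to actually prefetch."""
        issue = [line + k for k in range(1, self.degree() + 1)]
        for k in range(1, self.max_degree + 1):
            target = line + k
            if target in self.buffer:
                continue
            self.buffer[target] = k
            self.insertions[k] += 1
            if len(self.buffer) > self.buffer_entries:
                self.buffer.popitem(last=False)  # assumed FIFO eviction
        return issue

    def demand_access(self, line):
        """On a buffer hit, credit the degree that covered the line and evict it."""
        k = self.buffer.pop(line, None)
        if k is None:
            return False
        self.hits[k] += 1
        return True
```

Tracking all d candidate lines in the buffer, even when fewer are issued, lets the higher degrees keep earning hits, so the degree can grow back after it has been reduced.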

slide-55
SLIDE 55
Recent access filter

  • Lots of overlapping prefetches due to the 3 different components
  • Overlap can also occur within a single component (e.g. next-line)
  • The prefetch queue should be used efficiently
  • Store the recent demand and prefetch accesses in a small fully associative buffer
  • Only issue a prefetch request if it misses in the recent access filter
  • Small in size to avoid missing genuine requests
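
The filter can be sketched in a few lines of Python. The LRU-style replacement and the method names are assumptions; the slides only say the buffer is small and fully associative. The 40-entry default mirrors the size listed in the storage table.

```python
from collections import OrderedDict


class RecentAccessFilter:
    def __init__(self, entries=40):
        self.entries = entries
        self.lines = OrderedDict()  # insertion order doubles as recency order

    def record(self, line):
        """Remember a demand or prefetch access to this cache line."""
        self.lines.pop(line, None)
        self.lines[line] = True
        if len(self.lines) > self.entries:
            self.lines.popitem(last=False)  # assumed: drop the oldest entry

    def should_issue(self, line):
        """True if a prefetch to this line misses in the filter."""
        if line in self.lines:
            return False
        self.record(line)
        return True
```

Keeping the filter small means an old entry ages out quickly, so a line that genuinely needs to be prefetched again is not suppressed for long.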
slide-56
SLIDE 56

Handling resource shortage

  • The small size of the L1 prefetch queue (PQ) restricts aggressive prefetching
  • Leverage the communication between the L1 and L2 prefetchers
  • When the PQ has only one entry left, we piggyback the remaining prefetch info on the last prefetch
  • The L2 cache uses this info to complete the prefetching

[Diagram: 32-bit metadata formats. IP-delta: fields d1, d2, ..., dk. IP-stride: fields 1 and stride.]
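
A hedged Python sketch of packing residual deltas into a 32-bit metadata word. The slides give only the 32-bit width and the field names; the 7-bit two's-complement field width, the zero terminator, and the function names are all assumptions chosen to make the example concrete.

```python
def encode_deltas(deltas, width=7, word_bits=32):
    """Pack signed deltas into one word, lowest field first (assumed layout)."""
    mask = (1 << width) - 1
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    word = 0
    for i, delta in enumerate(deltas[: word_bits // width]):
        # Assumption: a zero field marks the end, so zero deltas stop packing.
        if delta == 0 or not lo <= delta <= hi:
            break
        word |= (delta & mask) << (i * width)
    return word


def decode_deltas(word, width=7, word_bits=32):
    """Recover the deltas on the L2 side, stopping at the first zero field."""
    mask = (1 << width) - 1
    deltas = []
    for i in range(word_bits // width):
        field = (word >> (i * width)) & mask
        if field == 0:
            break
        # Sign-extend a field whose top bit is set.
        deltas.append(field - (1 << width) if field >> (width - 1) else field)
    return deltas
```

With these assumed widths at most four deltas fit in the word; anything beyond that would simply not be prefetched by the L2 side.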

slide-66
SLIDE 66

Sangam

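[Flowchart: the IP-Delta-based sequence predictor, the IP-based stride prefetcher, and the adaptive-degree next-line prefetcher feed the Recent access filter. Decision "Last PQ Entry?": Y leads to "Encode residual prefetches as metadata", N to "Inject prefetch". Decision "Sequence Complete?": Y leads to "Stop", N continues with the next prefetch in the sequence.]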
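
The overall flow can be sketched in Python as two small helpers. This assumes the components are consulted in priority order (IP-delta, then IP-stride, then next-line), as the earlier slides describe, and that the residual prefetches ride as metadata on the last prefetch that fits in the PQ. Function and parameter names are hypothetical.

```python
def first_available(component_outputs):
    """Pick the candidates of the first component that offers a prediction."""
    for candidates in component_outputs:
        if candidates:
            return candidates
    return []


def issue_prefetches(candidates, recently_seen, pq_free_slots):
    """Return (injected, residual).

    injected: lines placed in the prefetch queue.
    residual: lines that did not fit; they would be encoded as metadata on
              the last injected prefetch for the L2 side to complete.
    """
    # Drop anything the recent access filter has already seen.
    pending = [line for line in candidates if line not in recently_seen]
    if pq_free_slots <= 0:
        return [], []
    return pending[:pq_free_slots], pending[pq_free_slots:]
```

For instance, with candidates 101 to 104, line 102 already in the filter, and two free PQ slots, lines 101 and 103 are injected and line 104 becomes the residual.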

slide-67
SLIDE 67
Storage Overhead

  • L1 prefetcher overhead

Structure              Configuration       Storage (bits)
IP Table               128 sets, 15 ways   120960
IP-Delta Table         256 sets, 8 ways    131072
NL buffer              64 entries          4672
Recent Access Filter   40 entries          2840
Auxiliary Counters                         316
TOTAL                                      259870 bits = 31.72 KB

  • L2 prefetcher overhead = 31.36 KB
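
As a quick check on the figures in the storage table, the following Python snippet derives the entry width implied by each structure and converts the stated total to KB. It uses only the numbers from the slide; the helper names are illustrative, and KB is taken as 1024 bytes.

```python
STRUCTURES = {
    # name: (storage in bits, number of entries)
    "IP Table": (120960, 128 * 15),
    "IP-Delta Table": (131072, 256 * 8),
    "NL buffer": (4672, 64),
    "Recent Access Filter": (2840, 40),
}


def bits_per_entry(total_bits, entries):
    """Entry width implied by a structure's total storage."""
    return total_bits // entries


def bits_to_kb(bits):
    """Convert bits to kilobytes, taking 1 KB = 1024 bytes."""
    return bits / 8 / 1024


for name, (bits, entries) in STRUCTURES.items():
    print(f"{name}: {bits_per_entry(bits, entries)} bits per entry")
print(f"Stated total: {bits_to_kb(259870):.2f} KB")
```

The implied widths are 63 bits per IP Table entry, 64 per IP-Delta Table entry, 73 per NL buffer entry, and 71 per Recent Access Filter entry.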

slide-69
SLIDE 69

Performance distribution

[Figure: Performance distribution. Bar labels: 1.323, 1.335, 1.348, 1.355, 1.387, 1.399, Sangam. Legend: IP-delta-based sequence; IP-based stride; Recent access filter; Resource shortage handling; Next-line prefetcher with static degree; Adaptive degree next-line; L2 cache. y-axis: Single-core Speedup, ticks from 1.25 to 1.45.]

Sangam provides a performance improvement of 40.3% for single core and 19.5% for multi-core (homogeneous: 10.2%, heterogeneous: 27.7%)

slide-71
SLIDE 71

Thank You

slide-72
SLIDE 72

Questions?