Sangam: A Multi-component Core Cache Prefetcher
Mainak Chaudhuri, Nayan Deshmukh
Introduction
- The word ‘Sangam’ refers to a confluence of 3 rivers, which corresponds to the 3 core components of our prefetcher
- We achieve a 40.3% speedup over no prefetching for 46 single-core workloads
- For 4 cores, we achieve a 19.5% speedup over no prefetching for 100 multiprogrammed workloads (45 homogeneous, 55 heterogeneous)
Sangam
- Prediction components: IP-delta-based sequence predictor, IP-based stride prefetcher, adaptive-degree next-line prefetcher
- Candidate prefetches pass through the recent access filter
- Last PQ entry? If so, encode the residual prefetches as metadata
- Inject the prefetch
- Sequence complete? If so, stop
- All the components have a common base degree d
Where?
- Where should the prefetcher be placed?
- L1 allows for better learning, whereas L2 and L3 allow for more hardware resources
[Figure: Speedup at different levels of cache. Bars: IP-stride and IP-delta, each as an L1 prefetcher and as an L2 prefetcher. Y-axis: speedup, ticks from 1.2 to 1.34.]
IP-Delta-based Sequence predictor
- Uses both control-flow and data-flow information to predict a sequence of accesses
- IP Table: looked up by IP; each entry holds the last offset and the last d+1 deltas
- IP-Delta Table: looked up by h(IP, Delta); each entry holds the next d deltas, each stored as a (Delta, Confidence) pair
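The lookup path described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the hash `h`, the table size, the value of `D`, and `CONF_THRESHOLD` are all assumptions made for the example, and the table is modeled as a plain dictionary rather than a set-associative structure.

```python
from dataclasses import dataclass, field

D = 4                # base degree d (value chosen only for illustration)
CONF_THRESHOLD = 2   # hypothetical confidence threshold


@dataclass
class DeltaSlot:
    delta: int = 0
    confidence: int = 0


@dataclass
class IPDeltaEntry:
    # "Next d deltas", each paired with a confidence counter.
    slots: list = field(default_factory=lambda: [DeltaSlot() for _ in range(D)])


def h(ip: int, delta: int, table_size: int = 2048) -> int:
    """Hypothetical stand-in for h(IP, Delta); the real hash is not given."""
    return (ip * 31 + (delta & 0x7F)) % table_size


def predict_offsets(table: dict, ip: int, last_offset: int, last_delta: int) -> list:
    """Return the offsets predicted to follow last_offset for this IP."""
    entry = table.get(h(ip, last_delta))
    if entry is None:
        return []                  # entry missing: no prediction
    offsets, offset = [], last_offset
    for slot in entry.slots:
        if slot.confidence < CONF_THRESHOLD:
            break                  # sequence fell below the threshold
        offset += slot.delta       # deltas accumulate along the sequence
        offsets.append(offset)
    return offsets


table = {h(0x400123, 2): IPDeltaEntry(
    [DeltaSlot(2, 3), DeltaSlot(3, 3), DeltaSlot(2, 1), DeltaSlot(3, 3)])}
print(predict_offsets(table, 0x400123, 10, 2))  # [12, 15]
```

In the usage lines, the third slot has a confidence below the assumed threshold, so the predicted sequence stops after two offsets.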
IP-Delta-based Sequence predictor (learning)
- Learning uses the last d+1 deltas held in the IP Table entry for the IP
- For the kth delta in that history, h(IP, kth delta) selects an IP-Delta Table entry
- The (d+1)th delta trains the (d+1-k)th of that entry's next d deltas, updating its Delta and Confidence fields
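The learning step can be sketched as below, under stated assumptions. The kth / (d+1)th / (d+1-k)th indexing follows the diagram labels; everything else is hypothetical: the hash `h`, the value of `D`, the saturating ceiling `CONF_MAX`, and in particular the confidence update rule (increment on a match, decrement on a mismatch, replace the delta once confidence reaches zero), which the slides do not spell out.

```python
D = 4          # base degree d (illustrative value)
CONF_MAX = 3   # hypothetical saturating-counter ceiling


def h(ip, delta, table_size=2048):
    """Hypothetical stand-in for h(IP, Delta)."""
    return (ip * 31 + (delta & 0x7F)) % table_size


def train(ip_delta_table, ip, history):
    """history: the last d+1 deltas seen for ip, oldest first."""
    if len(history) < D + 1:
        return
    newest = history[D]                        # the (d+1)th delta
    for k in range(1, D + 1):                  # k = 1 .. d
        # Each entry is a list of d [delta, confidence] pairs.
        entry = ip_delta_table.setdefault(
            h(ip, history[k - 1]), [[0, 0] for _ in range(D)])
        slot = entry[D - k]                    # the (d+1-k)th next delta, 1-indexed
        if slot[0] == newest:
            slot[1] = min(slot[1] + 1, CONF_MAX)   # assumed: reinforce on match
        elif slot[1] > 0:
            slot[1] -= 1                           # assumed: decay on mismatch
        else:
            slot[0] = newest                       # assumed: replace at zero confidence


table = {}
train(table, 0x400123, [1, 2, 3, 4, 5])
train(table, 0x400123, [1, 2, 3, 4, 5])
print(table[h(0x400123, 4)][0])  # [5, 1]
```

The indexing reflects that the (d+1)th delta arrives d+1-k steps after the kth one, so it belongs in that position of the entry selected by the kth delta.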
IP-based stride prefetcher
- We use the IP-based stride predictor when the IP-delta predictor can no longer offer predictions
- This covers both cases: the entry is missing from the IP-Delta Table, or the sequence is below the confidence threshold
- The stride predictor reads the same IP Table (per IP: last offset and last d+1 deltas)
- We use the IP stride predictor when the last two deltas seen for the IP are equal
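A minimal sketch of the stride fallback, assuming the per-IP delta history is available as a list ordered oldest first. The function name, the `degree` parameter, and the decision to ignore a zero stride are illustrative choices, not details taken from the slides.

```python
def stride_prefetch_offsets(last_offset, deltas, degree):
    """Return `degree` offsets along the stride, or [] if no stride is detected.

    deltas: recent deltas for one IP, oldest first.
    """
    if len(deltas) < 2 or deltas[-1] != deltas[-2]:
        return []                     # last two deltas differ: no stride
    stride = deltas[-1]
    if stride == 0:
        return []                     # assumption: a zero stride is not useful
    return [last_offset + i * stride for i in range(1, degree + 1)]


print(stride_prefetch_offsets(10, [5, 3, 3], 4))  # [13, 16, 19, 22]
print(stride_prefetch_offsets(10, [3, 5], 4))     # []
```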
Next-line prefetcher
- Maintaining coverage at the cost of accuracy leads to overall better performance
- Used when both the IP-delta and IP-stride prefetchers cannot offer a prediction
- Feedback-directed degree selection
- Next-line buffer: each entry pairs a candidate line with the degree that produced it (X+1 with degree 1, X+2 with degree 2, ..., X+d with degree d)
- Per-degree counters for degrees 1, 2, ..., d: Hits and Insertions
- Example from the animation: inserting X+1, X+2, ..., X+d raises the Insertions counter of each degree shown (4 to 5); a later demand access to X+1 looks up the buffer, hits, raises Hits for degree 1 (2 to 3), and the X+1 entry is evicted
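The bookkeeping in the animation can be sketched as follows. The buffer, the per-degree Hits and Insertions counters, and the evict-on-hit behavior follow the slides, and the 64-entry default comes from the storage table. The FIFO replacement, the `choose_degree` policy, and its `min_accuracy` threshold are assumptions, since the slides say only that degree selection is feedback directed.

```python
from collections import OrderedDict


class NextLineFeedback:
    def __init__(self, max_degree=4, capacity=64):
        self.max_degree = max_degree
        self.capacity = capacity
        self.buffer = OrderedDict()                 # line address -> degree (1..d)
        self.hits = [0] * (max_degree + 1)          # index 0 unused
        self.insertions = [0] * (max_degree + 1)    # index 0 unused

    def record_candidates(self, line):
        """On an access to line X, track X+1 .. X+d with their degrees."""
        for degree in range(1, self.max_degree + 1):
            target = line + degree
            if target in self.buffer:
                continue
            if len(self.buffer) >= self.capacity:
                self.buffer.popitem(last=False)     # assumed FIFO replacement
            self.buffer[target] = degree
            self.insertions[degree] += 1

    def on_demand_access(self, line):
        """A buffer hit credits the degree that produced it and evicts the entry."""
        degree = self.buffer.pop(line, None)
        if degree is not None:
            self.hits[degree] += 1

    def choose_degree(self, min_accuracy=0.25):
        """Hypothetical policy: largest degree whose hit ratio clears the bar."""
        best = 1
        for degree in range(1, self.max_degree + 1):
            inserted = self.insertions[degree]
            if inserted and self.hits[degree] / inserted >= min_accuracy:
                best = degree
        return best


nl = NextLineFeedback(max_degree=3)
nl.record_candidates(100)      # tracks 101, 102, 103
nl.on_demand_access(101)       # hit for degree 1, entry evicted
print(nl.hits, nl.insertions)  # [0, 1, 0, 0] [0, 1, 1, 1]
```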
Recent access filter
- The 3 different components generate many overlapping prefetches
- Overlap can also occur within a single component (e.g. next-line)
- The prefetch queue should be used efficiently
- Store the recent demand and prefetch accesses in a small fully associative buffer
- Issue a prefetch request only if it misses in the recent access filter
- The filter is kept small so that genuine requests are not filtered out
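A minimal sketch of such a filter, assuming least-recently-inserted replacement; the slides do not state the replacement policy. The 40-entry default comes from the storage table, while the class and method names are invented for the example.

```python
from collections import OrderedDict


class RecentAccessFilter:
    def __init__(self, capacity=40):
        self.capacity = capacity
        self.lines = OrderedDict()     # models a small fully associative buffer

    def record(self, line):
        """Remember a demand or prefetch access to this line."""
        self.lines.pop(line, None)
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # assumed: drop the oldest entry

    def should_issue(self, line):
        """Issue a prefetch only if the line misses in the filter."""
        if line in self.lines:
            return False
        self.record(line)
        return True


raf = RecentAccessFilter(capacity=2)
raf.record(1)                  # demand access to line 1
print(raf.should_issue(1))     # False: recently accessed
print(raf.should_issue(2))     # True: first request for line 2
print(raf.should_issue(2))     # False: duplicate prefetch filtered
```

A small capacity means old entries age out quickly, so a line that genuinely needs to be fetched again is not suppressed for long.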
Handling resource shortage
- The small size of the L1 prefetch queue (PQ) restricts aggressive prefetching
- Leverage the communication between the L1 and L2 prefetchers
- When the PQ has only one entry left, we piggyback the remaining prefetch information on the last prefetch
- The L2 cache uses this information to complete the prefetching
- The metadata is 32 bits wide
- IP-delta format: the remaining deltas d1, d2, ..., dk
- IP-stride format: a field holding 1, followed by the stride
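One way such a 32-bit word could be packed is sketched below. Only the 32-bit width and the two payload kinds (a delta list or a stride) come from the slides. The field widths, the position of the tag bit, the `remaining` count in the stride format, and the use of a zero delta as a terminator are all assumptions made for illustration.

```python
DELTA_BITS = 7                  # hypothetical width of one signed delta
MAX_DELTAS = 4                  # 4 * 7 = 28 payload bits
STRIDE_TAG = 1 << 31            # assumed position of the "1" marking IP-stride
FIELD_MASK = (1 << DELTA_BITS) - 1
SIGN_BIT = 1 << (DELTA_BITS - 1)


def _to_field(value):
    """Two's-complement encode a small signed value into DELTA_BITS bits."""
    if not -SIGN_BIT <= value < SIGN_BIT:
        raise ValueError("value does not fit in the assumed field width")
    return value & FIELD_MASK


def _from_field(bits):
    return bits - (1 << DELTA_BITS) if bits & SIGN_BIT else bits


def encode_deltas(deltas):
    """Pack the residual deltas d1..dk (k <= MAX_DELTAS, all non-zero)."""
    if len(deltas) > MAX_DELTAS or 0 in deltas:
        raise ValueError("unsupported delta list for this sketch")
    word = 0
    for i, delta in enumerate(deltas):
        word |= _to_field(delta) << (i * DELTA_BITS)
    return word


def encode_stride(stride, remaining):
    """Pack a stride plus how many strided prefetches are still owed."""
    return STRIDE_TAG | ((remaining & 0xFF) << DELTA_BITS) | _to_field(stride)


def decode(word):
    """Return ("stride" or "delta", list of residual deltas to apply)."""
    if word & STRIDE_TAG:
        stride = _from_field(word & FIELD_MASK)
        remaining = (word >> DELTA_BITS) & 0xFF
        return "stride", [stride] * remaining
    deltas = []
    for i in range(MAX_DELTAS):
        value = _from_field((word >> (i * DELTA_BITS)) & FIELD_MASK)
        if value == 0:
            break                # assumed terminator: zero means "no more deltas"
        deltas.append(value)
    return "delta", deltas


print(decode(encode_deltas([2, -3, 5])))   # ('delta', [2, -3, 5])
print(decode(encode_stride(-2, 3)))        # ('stride', [-2, -2, -2])
```

The receiving side (the L2 prefetcher in the slides) would apply the decoded deltas cumulatively from the address of the prefetch that carried the metadata.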
Storage Overhead
- L1 prefetcher overhead

Structure              Configuration      Storage (bits)
IP Table               128 sets, 15 ways  120960
IP-Delta Table         256 sets, 8 ways   131072
NL buffer              64 entries         4672
Recent Access Filter   40 entries         2840
Auxiliary Counters                        316
TOTAL                                     259870 bits = 31.72 KB

- L2 prefetcher overhead = 31.36 KB
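The arithmetic behind the L1 storage figures can be checked with a short script. The numbers are copied from the table; the variable names are only for illustration. Summing the listed rows gives 259860 bits, 10 bits less than the stated total of 259870, and either value rounds to 31.72 KB.

```python
# (number of entries, storage in bits) for each listed structure
rows = {
    "IP Table": (128 * 15, 120960),
    "IP-Delta Table": (256 * 8, 131072),
    "NL buffer": (64, 4672),
    "Recent Access Filter": (40, 2840),
}
aux_counter_bits = 316

for name, (entries, bits) in rows.items():
    print(f"{name}: {bits // entries} bits per entry")

total_bits = sum(bits for _, bits in rows.values()) + aux_counter_bits
total_kb = total_bits / 8 / 1024
print(total_bits, round(total_kb, 2))  # 259860 31.72
```

The per-entry widths work out to 63, 64, 73, and 71 bits respectively, which divide evenly in every case.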
Performance distribution
[Figure: single-core speedup by component. Legend: IP-delta-based sequence, IP-based stride, Recent access filter, Resource shortage handling, Next-line prefetcher with static degree, Adaptive degree next-line, L2 cache. Bar values as extracted: 1.323, 1.335, 1.348, 1.355, 1.387, 1.399, and a final bar labeled Sangam. Y-axis: speedup.]
- Sangam provides a performance improvement of 40.3% for single core and 19.5% for multi-core (homogeneous: 10.2%, heterogeneous: