ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction
ISCA 2018
Authors: Vinson Young (GT), Chiachen Chou (GT), Aamer Jaleel (NVIDIA), Moinuddin K. Qureshi (GT)
3D-DRAM MITIGATES BANDWIDTH WALL
Modern systems pack many cores → Bandwidth Wall
3D-Stacked DRAM: Hybrid Memory Cube (HMC) from Micron, High Bandwidth Memory (HBM) from Samsung
4-8x bandwidth (of traditional memory), but limited capacity
3D-DRAM + High-Capacity Memory = Hybrid Memory
USE 3D-DRAM AS A CACHE
[Figure: CPUs with L1$ and L2$ caches and an L3$, backed by a fast DRAM-Cache (3D-DRAM, e.g. MCDRAM from Intel) in front of slow System Memory (NVM / DRAM), which forms the OS-visible space.]
Using 3D-DRAM as a DRAM cache can improve memory bandwidth (and avoids OS/software changes).
ARCHITECTING LARGE DRAM CACHES
Organize at line granularity (64B) for high cache capacity/BW utilization.
A gigascale cache needs a large tag-store (tens of MBs): 4GB of data needs 128MB of tags, which is too large for SRAM.
Practical designs must store tags in DRAM.
How to architect the tag-store for low-latency tag access?
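The tag-store size quoted above can be checked with a little arithmetic. The sketch below assumes 2 bytes of tag state per line; that figure is not stated on the slide and is only inferred from 128MB of tags for 4GB of 64B lines.

```python
# Back-of-the-envelope tag-store sizing for a line-granularity DRAM cache.
CACHE_BYTES = 4 * 2**30      # 4GB of data (from the slide)
LINE_BYTES = 64              # 64B lines (from the slide)
TAG_BYTES_PER_LINE = 2       # ASSUMPTION: inferred from 128MB / 64M lines

num_lines = CACHE_BYTES // LINE_BYTES
tag_store_bytes = num_lines * TAG_BYTES_PER_LINE

print(f"lines: {num_lines // 2**20}M")               # lines: 64M
print(f"tag store: {tag_store_bytes // 2**20}MB")    # tag store: 128MB
```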
EFFICIENT TAG ORGANIZATION (KNL CACHE)
Practical designs use a 64B line size, store Tag-With-Data, and are direct-mapped, to optimize for hit latency.
[Figure: Tag-With-Data layout with consecutive Tag+Data pairs (Alloy Cache, Intel Knights Landing).]
Single Tag+Data lookup (1x hit latency), but direct-mapped.
The Intel Knights Landing product (MCDRAM) uses this DRAM-cache organization.
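A minimal sketch of a direct-mapped Tag-With-Data lookup follows. It is an illustration only, not the Knights Landing implementation: the class name, the tiny set count, and the address split are all assumptions made for the example.

```python
LINE_BYTES = 64
NUM_SETS = 4  # ASSUMPTION: tiny cache, for illustration only


class DirectMappedTWD:
    """Direct-mapped cache where each entry stores the tag next to the data."""

    def __init__(self):
        self.sets = [None] * NUM_SETS   # each entry is (tag, data) or None
        self.dram_accesses = 0

    def _split(self, addr):
        line = addr // LINE_BYTES
        return line % NUM_SETS, line // NUM_SETS   # (set index, tag)

    def install(self, addr, data):
        index, tag = self._split(addr)
        self.sets[index] = (tag, data)

    def read(self, addr):
        index, tag = self._split(addr)
        self.dram_accesses += 1          # one access returns tag and data together
        entry = self.sets[index]
        if entry is not None and entry[0] == tag:
            return entry[1]              # hit: data is already in hand
        return None                      # miss: known after the same single access


cache = DirectMappedTWD()
cache.install(0, b"A")
hit = cache.read(0)                      # same line: hit
conflict = cache.read(LINE_BYTES * NUM_SETS)  # same set, different tag: miss
```

Both the hit and the miss are resolved with one DRAM access each, which is the 1x hit latency the slide refers to.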
POTENTIAL OF ASSOCIATIVITY
[Figure: Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way DRAM caches; y-axis 60 to 90; annotation: "Reduce 25%".]
Assumes a 16-core system with a 4GB DRAM-Cache in front of PCM memory.
How can we make DRAM caches associative?
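ASSOCIATIVITY OPTION 1: SERIAL TAG LOOKUP
[Figure: 2-way set holding A (Way 0) and B (Way 1); the address is looked up in one way first and, if that misses, in the other way.]
Serial Tag Lookup enables associativity, but it incurs a serialization delay.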
ASSOCIATIVITY OPTION 2: PARALLEL TAG LOOKUP
[Figure: 2-way set holding A (Way 0) and B (Way 1); the address is looked up in both ways at once.]
Parallel Lookup avoids serialization latency, but it introduces a 2x bandwidth cost.
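The trade-off between the two options can be expressed as a simple cost model. This is a sketch under stated assumptions: serial lookup is assumed to probe ways in order starting from way 0, and each probe is assumed to read one Tag+Data line. The function names are hypothetical.

```python
def serial_lookup(num_ways, hit_way):
    """Return (sequential_steps, lines_read) when probing one way at a time.

    hit_way is the way holding the line, or None on a miss.
    """
    probes = num_ways if hit_way is None else hit_way + 1
    return probes, probes


def parallel_lookup(num_ways, hit_way):
    """Return (sequential_steps, lines_read) when probing all ways at once."""
    return 1, num_ways


# 2-way cache, line resident in way 1:
serial_cost = serial_lookup(2, 1)      # two steps in a row, two lines read
parallel_cost = parallel_lookup(2, 1)  # one step, but always two lines read
```

Serial lookup pays in latency only when the first probe is wrong, while parallel lookup pays the full bandwidth on every access, including hits in the first way.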
ASSOCIATIVITY FOR DRAM CACHE (PARALLEL)
[Figure: (a) Hit Rate (%) for 1-way, 2-way, 4-way, 8-way; y-axis 60 to 90; annotation: "Reduce 25%". (b) Speedup (Parallel) for 2-way, 4-way, 8-way; y-axis 0.5 to 1.5.]
Naively increasing associativity actually degrades performance because of the increased BW cost.
ASSOCIATIVITY FOR DRAM CACHE (IDEAL)
[Figure: (a) Hit Rate (%) for 1-way, 2-way, 4-way, 8-way; y-axis 60 to 90; annotation: "Reduce 25%". (b) Speedup (Parallel) for 2-way, 4-way, 8-way; y-axis 0.5 to 1.5. (c) Speedup (Idealized) for 2-way, 4-way, 8-way; y-axis 0.5 to 1.5; annotations: "21%", "With latency / BW".]
Associativity must still maintain the latency/BW of a direct-mapped cache.
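OPTION 3: WAY-PREDICTED TAG LOOKUP
[Figure: 2-way set holding A (Way 0) and B (Way 1); Way Prediction selects the way to look up first for the address, and the other way is checked only if that lookup misses.]
Way-Predicted Tag Lookup can obtain an improved hit-rate, with the BW / latency of a direct-mapped cache.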
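How close way-prediction gets to direct-mapped cost depends on prediction accuracy. A minimal sketch for the 2-way case, assuming one probe when the prediction is right and two when it is wrong (the function name and the sample accuracies are illustrative, not from the slides):

```python
def expected_probes_per_hit_2way(accuracy):
    """Average Tag+Data probes per hit in a 2-way way-predicted cache.

    ASSUMPTION: 1 probe on a correct prediction, 2 probes on a wrong one.
    """
    return accuracy * 1 + (1 - accuracy) * 2


for acc in (0.5, 0.9, 1.0):
    print(f"accuracy {acc:.0%}: {expected_probes_per_hit_2way(acc):.2f} probes per hit")
```

At 100% accuracy the hit cost equals that of a direct-mapped cache; at 50% accuracy it is halfway to the cost of always probing both ways.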
WAY-PREDICTION ACCURACY & COST
                            MRU Pred (1bit/set)   Partial-Tag (4bit/line)
Way-Pred Accuracy (2-way)   85.7%                 97.3%
Way-Pred Accuracy (4-way)   74.3%                 91.6%
Way-Pred Accuracy (8-way)   63.2%                 81.2%
SRAM Storage                4MB                   32MB
Prior methods for way-prediction have low accuracy and/or high storage overhead.
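The SRAM storage figures for the two prior predictors follow from the cache geometry. The sketch below assumes the 4GB, 64B-line cache used elsewhere in the deck and, for the MRU predictor, a 2-way organization (one bit per 2-line set); the 2-way assumption is mine and is chosen because it reproduces the 4MB figure.

```python
CACHE_BYTES = 4 * 2**30
LINE_BYTES = 64
num_lines = CACHE_BYTES // LINE_BYTES          # 64M lines

# MRU predictor: 1 bit per set. ASSUMPTION: 2-way sets, so num_lines / 2 sets.
mru_bytes = (num_lines // 2) * 1 // 8

# Partial-tag predictor: 4 bits per line.
partial_tag_bytes = num_lines * 4 // 8

print(f"MRU: {mru_bytes // 2**20}MB")                  # MRU: 4MB
print(f"Partial-Tag: {partial_tag_bytes // 2**20}MB")  # Partial-Tag: 32MB
```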
TOWARDS ASSOCIATIVITY W/ WAY-PREDICTION
[Figure: Way-Predicted Tag Lookup on a 2-way set (Way 0, Way 1), as in Option 3.]
Goal: low-storage-overhead, high-accuracy way-prediction, to enable an associative DRAM cache.
ACCORD OVERVIEW
– Probabilistic Way-Steering (PWS)
– Ganged Way-Steering (GWS)
– Skewed Way-Steering (SWS)
INSIGHT: WAY-PREDICTABILITY AT LOW STORAGE?
[Figure: 2-way cache (Way 0, Way 1) holding lines with EVEN and ODD tags under two install policies.]
Base Install Policy (Rand): EVEN and ODD lines land in either way. Hard to predict (~50%).
Tag-based Install Policy: EVEN lines go to one way and ODD lines to the other. Predict 100%! But, direct-mapped.
Insight: Modifying the install policy can make way-prediction much simpler!
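The contrast between the two install policies can be reproduced with a small simulation. This is a sketch under assumptions: the predictor guesses the way from tag parity, tags are random 16-bit values, and every installed line is equally likely to be looked up. None of these details come from the slides.

```python
import random

rng = random.Random(0)
tags = [rng.randrange(1 << 16) for _ in range(10_000)]


def random_install(tag):
    """Base policy: the line may land in either way."""
    return rng.randrange(2)


def tag_based_install(tag):
    """Tag-based policy: even tags go to way 0, odd tags to way 1."""
    return tag % 2


def static_prediction_accuracy(install_way):
    """Fraction of lines found in the way that a tag-parity predictor guesses."""
    correct = sum(1 for t in tags if install_way(t) == t % 2)
    return correct / len(tags)


acc_rand = static_prediction_accuracy(random_install)     # close to 0.5
acc_tag = static_prediction_accuracy(tag_based_install)   # exactly 1.0
```

The tag-based policy is perfectly predictable only because each line has exactly one legal location, which is what makes it behave like a direct-mapped cache.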
PROPOSAL: ACCORD
AssoCiativity by CoORDinating way-install and prediction.
[Figure: the Way Install Policy and the Way Predictor coordinate over a 2-way cache (Way 0, Way 1) holding lines A2, B3, A3, B5, B7.]
ACCORD achieves a way-predictable cache at low cost.
PROBABILISTIC WAY-STEERING
[Figure: lines A0 to A7 of page A and B0 to B7 of page B installed using PWS into a 2-way cache (Way 0, Way 1). Each line is installed in its preferred way with Bias=90% and in the other way with 10% probability. Static prediction: ~90%.]
The cache will still use both ways over time, which improves hit-rate.
PWS enables way-predictability at the cost of learning more slowly to use both ways (some hit-rate).
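A minimal sketch of a PWS-style install decision for a 2-way cache follows. Choosing the preferred way from tag parity is an assumption made for illustration, as are the function names; the 85% default bias is the value the deck later reports as the best trade-off.

```python
import random


def preferred_way(tag):
    """ASSUMPTION: the preferred way is derived from tag parity."""
    return tag % 2


def pws_install_way(tag, rng, bias=0.85):
    """Install in the preferred way with probability `bias`, else in the other way."""
    pref = preferred_way(tag)
    return pref if rng.random() < bias else 1 - pref


rng = random.Random(1)
tags = [rng.randrange(1 << 16) for _ in range(20_000)]

# A static predictor always guesses the preferred way, so (assuming every
# installed line is equally likely to be hit) its accuracy is roughly the
# fraction of installs that went to the preferred way.
in_preferred = sum(pws_install_way(t, rng) == preferred_way(t) for t in tags)
accuracy = in_preferred / len(tags)
print(f"static way-prediction accuracy: {accuracy:.1%}")
```

Setting `bias=1.0` degenerates to the tag-based (direct-mapped) policy, while `bias=0.5` is an unbiased random install.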
SENSITIVITY TO PWS PROBABILITY
Preferred-way Install Probability = x% bias to install in the preferred way (endpoint labels: 2-way design at 50%, Direct-mapped at 100%).
[Figure: Way-Pred Accuracy (%), Miss Reduction (%), and Speedup (%) versus Preferred-way Install Probability.]
Preferred-way Install Probability   50%    60%    70%    80%    85%    90%    100%
Speedup                             2.6%   3.7%   4.7%   5.5%   5.6%   5.3%   0.0%
A Preferred-way Install Probability of 85% provides the best trade-off of hit-rate for WP accuracy, for a 5.6% speedup.
GANGED WAY-STEERING
[Figure: lines A0 to A7 and B0 to B7 installed into a 2-way cache (Way 0, Way 1) under two policies.]
Probabilistic Way-Steering: per-line randomized decision. Pred ~50%.
Ganged Way-Steering: per-page randomized decision. Pred >90%.
Ganged Way-Steering makes the install decision at a large granularity, to improve predictability for workloads with high spatial locality.
GANGED WAY-STEERING IMPLEMENTATION
[Figure: on Install, the RegionID indexes the Recent Install Table (RIT), whose Way entry guides the install. On Access, the RegionID indexes the Recent Lookup Table (RLT), whose Way entry predicts the way. Example values shown: 0x001, 0x101. The 2-way cache (Way 0, Way 1) holds lines A2, B3, A3, B5, B7.]
GWS: per-region last-way install + last-way prediction.
A 64-entry RIT and a 64-entry RLT need only 320 Bytes.
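The two tables can be sketched as small RegionID-to-way maps. This is an illustration under assumptions, not the authors' hardware design: a region is taken to be a 4KB page, the tables evict their oldest entry when full, and all names are hypothetical. (For reference, 320 Bytes over 128 entries works out to 20 bits per entry.)

```python
from collections import OrderedDict

REGION_BYTES = 4096   # ASSUMPTION: one region = one 4KB page
TABLE_ENTRIES = 64    # from the slide


class LastWayTable:
    """Small RegionID -> way table. ASSUMPTION: oldest entry is evicted when full."""

    def __init__(self, entries=TABLE_ENTRIES):
        self.entries = entries
        self.table = OrderedDict()

    def get(self, region):
        return self.table.get(region)

    def put(self, region, way):
        self.table[region] = way
        self.table.move_to_end(region)
        if len(self.table) > self.entries:
            self.table.popitem(last=False)


rit = LastWayTable()  # Recent Install Table
rlt = LastWayTable()  # Recent Lookup Table


def region_of(addr):
    return addr // REGION_BYTES


def choose_install_way(addr, fallback_way):
    """Reuse the way of the last install in this region, else the fallback (e.g. PWS)."""
    region = region_of(addr)
    way = rit.get(region)
    if way is None:
        way = fallback_way
    rit.put(region, way)
    return way


def predict_way(addr, static_way):
    """Predict the way last seen for this region, else the static prediction."""
    way = rlt.get(region_of(addr))
    return static_way if way is None else way


def record_access(addr, actual_way):
    """Remember which way the most recent access to this region was found in."""
    rlt.put(region_of(addr), actual_way)
```

With this arrangement, lines of the same page are steered to the same way, so a single remembered way per region is enough to predict later accesses to that page.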
PWS+GWS WAY-PREDICTION ACCURACY
[Figure: Way-Pred Acc (%) for PWS and PWS+GWS, shown for Libquantum and for the average of 21 workloads; y-axis 70% to 100%.]
PWS has ~85% base accuracy.
GWS enables spatial workloads to have near-100% accuracy.
The combination of PWS+GWS achieves 90% accuracy, at the cost of 320B of storage.
PWS+GWS (ACCORD 2-WAY) RESULTS
[Figure: Speedup for PWS, PWS+GWS, and Perfect; y-axis 0% to 12%.]
PWS+GWS achieves a 7.3% speedup, out of the 10% speedup of a perfectly-predicted 2-way cache.
System assumes a 4GB DRAM Cache and PCM-based main memory.
– ACCORD 4-way has 3% speedup
– ACCORD 8-way has 6% slowdown…
[Figure: 4-way set (Way 0 to Way 3) holding lines A, B, C, D; an access to E must check the ways to confirm the line is not resident. Miss!]
We need solutions to reduce miss-confirmation.
SOLUTION: SKEWED WAY-STEERING
[Figure: 4-way with 2-skew (Way 0 to Way 3). Each line has one preferred + one alternate way, so only 2 lookups are needed to determine a miss. Example accesses: A, B, C, E.]
Restricting placement reduces miss-confirmation → hit-rate benefits without any storage overhead.
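One way to picture the 2-skew restriction is as a function that maps each line to exactly two candidate ways. The hash below is a hypothetical choice made for illustration; the slides do not specify how the preferred and alternate ways are derived.

```python
def candidate_ways(tag, num_ways):
    """Return (preferred, alternate) ways for a line.

    ASSUMPTION: both ways are simple functions of the tag; the offset is
    always in 1..num_ways-1, so the alternate never equals the preferred way.
    """
    preferred = tag % num_ways
    offset = 1 + (tag // num_ways) % (num_ways - 1)
    alternate = (preferred + offset) % num_ways
    return preferred, alternate


def lookups_to_confirm_miss(num_ways, skewed):
    """Probes needed before a miss is certain."""
    return 2 if skewed else num_ways


print(candidate_ways(0x2B, 4))
print(lookups_to_confirm_miss(8, skewed=False), lookups_to_confirm_miss(8, skewed=True))
```

Different lines in the same set get different way pairs, so the set as a whole can still use all of its ways while each individual line needs at most two probes.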
SPEEDUP FROM ACCORD (WITH SWS)
[Figure: Speedup for 2-Way, 4-Way, and 8-Way; y-axis 0% to 12%.]
SWS 8-way achieves 11% speedup.
– ACCORD: associative DRAM caches by coordinating way-install and way-prediction.
– Probabilistic Way-Steering
   – Biased install enables accurate static way-prediction
– Ganged Way-Steering
   – Region-based install enables accurate region-based way-prediction
– Skewed Way-Steering
   – Skew enables flexibility in line placement, while maintaining miss cost
– ACCORD enables associativity at negligible storage cost (320B), to achieve 11% speedup.
ACCORD BACKUP SLIDES
REPLACEMENT POLICY?
– State in SRAM
– State in DRAM
COMPARISON TO OTHER WAY PREDICTORS
[Figure: Speedup of ACCORD and other way predictors; y-axis 0% to 12%.]
ACCORD outperforms other predictors while needing negligible storage overhead (320 B).
COLUMN-ASSOCIATIVE CACHE
– Install lines in preferred way (way-0)
– On eviction, move line to alternate way (way-1)
– On hit to alternate way, move to preferred way
– In general, way-prediction accuracy similar to MRU
– But, requires significant bandwidth to swap lines on hit to alternate way. CA-cache thus causes 4% slowdown.
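The three rules above can be sketched for a single 2-way set, counting the swaps that cost extra bandwidth. This is a simplified illustration with hypothetical names, not a model of any specific column-associative design.

```python
class ColumnAssociativeSet:
    """One 2-way set: way 0 is the preferred way, way 1 the alternate way."""

    def __init__(self):
        self.ways = [None, None]
        self.swaps = 0   # each swap moves two lines, which costs bandwidth

    def access(self, line):
        if self.ways[0] == line:
            return "hit-preferred"
        if self.ways[1] == line:
            # Hit in the alternate way: move the line to the preferred way.
            self.ways[0], self.ways[1] = self.ways[1], self.ways[0]
            self.swaps += 1
            return "hit-alternate"
        # Miss: the line evicted from way 0 moves to way 1, new line goes to way 0.
        self.ways[1] = self.ways[0]
        self.ways[0] = line
        return "miss"


s = ColumnAssociativeSet()
trace = [s.access(x) for x in ["A", "B", "A", "B"]]
```

An access pattern that alternates between two lines hits in the alternate way every time, so every hit triggers a swap.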