ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction



SLIDE 1

ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction

ISCA 2018

Authors: Vinson Young (GT), Chiachen Chou (GT), Aamer Jaleel (NVIDIA), Moinuddin K. Qureshi (GT)

SLIDE 2

3D-DRAM MITIGATES BANDWIDTH WALL

Modern systems packing many cores → Bandwidth Wall.

3D-Stacked DRAM (Hybrid Memory Cube (HMC) from Micron, High Bandwidth Memory (HBM) from Samsung) offers 4-8x the bandwidth of traditional memory, but limited capacity.

3D-DRAM + High-Capacity Memory = Hybrid Memory

SLIDE 3

USE 3D-DRAM AS A CACHE

Memory hierarchy (fast to slow): CPU, L1$, L2$, L3$, DRAM-Cache (3D-DRAM), System Memory (NVM / DRAM). The system memory remains the OS-visible space.

Using 3D-DRAM as a DRAM cache can improve memory bandwidth (and avoid OS/software changes). Example: MCDRAM from Intel.

SLIDE 4

ARCHITECTING LARGE DRAM CACHES

Organize at line granularity (64B) for capacity/BW utilization. A gigascale cache needs a large tag-store (tens of MBs): a 4GB data array requires 128 MB of tags. Too large for SRAM!

SLIDE 5

ARCHITECTING LARGE DRAM CACHES

Organize at line granularity (64B) for high cache utilization. A gigascale cache needs a large tag-store (tens of MBs): 128 MB of tags for 4GB of data. Practical designs must store tags in the 3D-DRAM itself.

How to architect the tag-store for low-latency tag access?

SLIDE 6

EFFICIENT TAG ORGANIZATION (KNL CACHE)

Practical designs use a 64B line size, store Tag-With-Data, and are direct-mapped, to optimize for hit latency.

Tag-With-Data layout [Alloy Cache, Intel Knights Landing]: each entry stores the tag next to its data, so a single Tag+Data lookup gives 1x hit latency, but the cache is direct-mapped.

The Intel Knights Landing product (MCDRAM) uses this DRAM-cache organization.
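The Tag-With-Data organization above can be sketched in a few lines; this is a minimal illustrative model (not Intel's implementation), with sizes taken from the 4GB/64B figures on the slide.

```python
# A minimal sketch of a direct-mapped Tag-With-Data lookup: the tag
# sits beside its data in DRAM, so one burst returns both, giving a
# single-access (1x latency) hit check.

LINE_SIZE = 64                       # bytes per cache line
NUM_SETS = (4 * 2**30) // LINE_SIZE  # 4GB cache, one line per set

def split_address(addr):
    """Split a byte address into (set index, tag)."""
    line = addr // LINE_SIZE
    return line % NUM_SETS, line // NUM_SETS

def lookup(cache, addr):
    """One combined tag+data fetch; hit iff the stored tag matches."""
    set_idx, tag = split_address(addr)
    entry = cache.get(set_idx)       # models a single DRAM burst
    if entry is not None and entry[0] == tag:
        return entry[1]              # hit: data arrived with the tag
    return None                      # miss: fall through to memory

def install(cache, addr, data):
    """Direct-mapped: the new line evicts whatever shares its set."""
    set_idx, tag = split_address(addr)
    cache[set_idx] = (tag, data)
```

The direct-mapped conflict behavior is the cost this design pays for its 1x hit latency: two hot lines that map to the same set keep evicting each other.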

SLIDE 7

POTENTIAL OF ASSOCIATIVITY

[Chart: Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way caches; associativity reduces misses by 25%.]

How can we make DRAM caches associative?

Assumes a 16-core system, with a 4GB DRAM-Cache, in front of PCM memory.

SLIDE 8

ASSOCIATIVITY OPTION 1: SERIAL TAG LOOKUP

Serial Tag Lookup enables associativity, but it has serialization delay: probe Way 0 first, and only on a miss probe Way 1.

SLIDE 9

ASSOCIATIVITY OPTION 2: PARALLEL TAG LOOKUP

Parallel Lookup avoids the serialization latency, but it introduces a 2x bandwidth cost: both Way 0 and Way 1 are fetched on every access.
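The trade-off between the two options can be captured with a back-of-envelope probe-count model (illustrative numbers and functions, not from the talk):

```python
# Illustrative cost model for a 2-way tag-with-data DRAM cache:
# serial lookup probes way 0 first and pays a second probe only on a
# way-1 hit or a miss; parallel lookup always fetches both ways.

def avg_probes_serial(frac_hit_way0):
    """Expected DRAM accesses per lookup with serial tag lookup."""
    frac_rest = 1.0 - frac_hit_way0          # way-1 hits plus misses
    return 1.0 * frac_hit_way0 + 2.0 * frac_rest

def avg_probes_parallel():
    """Parallel lookup: both ways fetched unconditionally (2x BW)."""
    return 2.0
```

For example, if 70% of accesses hit in way 0, serial lookup averages 1.3 probes versus a flat 2.0 for parallel; the catch is that each second serial probe also adds latency to the access that needs it.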

SLIDE 10

ASSOCIATIVITY FOR DRAM CACHE (PARALLEL)

[Charts: (a) Hit Rate (%) for 1/2/4/8-way, showing a 25% miss reduction; (b) Speedup (Parallel), where the 8-way parallel design shows a 46% slowdown.]

Increasing associativity naively actually degrades performance, due to the increased bandwidth cost.

SLIDE 11

ASSOCIATIVITY FOR DRAM CACHE (IDEAL)

[Charts: (a) Hit Rate (%) for 1/2/4/8-way, showing a 25% miss reduction; (b) Speedup (Parallel), where the 8-way parallel design shows a 46% slowdown; (c) Speedup (Idealized), where 8-way with the latency/BW of direct-mapped gives a 21% speedup.]

Associativity must still maintain the latency/BW of direct-mapped caches. How?
SLIDE 12

OPTION 3: WAY-PREDICTED TAG LOOKUP

Way-Predicted Tag Lookup can obtain improved hit-rate with the BW/latency of a direct-mapped cache: a way predictor selects a single way to probe, and only that way's tag+data is fetched. Only if the prediction misses is the other way checked.

SLIDE 13

WAY-PREDICTION ACCURACY & COST

Prior methods for way-prediction have low accuracy and/or high storage overhead.

                     MRU Pred (1 bit/set)   Partial-Tag (4 bits/line)
SRAM Storage         4 MB                   32 MB
Accuracy (2-way)     85.7%                  97.3%
Accuracy (4-way)     74.3%                  91.6%
Accuracy (8-way)     63.2%                  81.2%

SLIDE 14

TOWARDS ASSOCIATIVITY W/ WAY-PREDICTION

Way-Predicted Tag Lookup: the predictor selects one way to probe; on a miss in the predicted way, the remaining way is checked.

Goal: low storage-overhead and high-accuracy way-prediction, to enable an associative DRAM cache.

SLIDE 15

ACCORD OVERVIEW

  • Background
  • ACCORD
    – Probabilistic Way-Steering (PWS)
    – Ganged Way-Steering (GWS)
    – Skewed Way-Steering (SWS)
  • Summary

SLIDE 16

INSIGHT: WAY-PREDICTABILITY AT LOW STORAGE?

Insight: Modifying the install policy can make way-prediction much simpler!

Base install policy (random): EVEN and ODD tags end up scattered across Way 0 and Way 1, so prediction is hard (~50%).

Tag-based install policy: EVEN tags go to Way 0, ODD tags go to Way 1. Prediction is 100% accurate, but the cache becomes effectively direct-mapped.
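The even/odd insight can be sketched in a few lines (hypothetical code illustrating the idea, not the paper's final mechanism):

```python
# If the install way is a fixed function of the tag, the "prediction"
# is exact by construction -- but every tag then has exactly one legal
# way, so placement degenerates to direct-mapped.
import random

def install_way_random(tag, num_ways=2, rng=random):
    """Base policy: random way; a static guess is right ~50% of the time."""
    return rng.randrange(num_ways)

def install_way_tag_based(tag, num_ways=2):
    """EVEN tags to way 0, ODD tags to way 1 (for num_ways=2)."""
    return tag % num_ways

def predict_way_tag_based(tag, num_ways=2):
    """Same function as the install policy, so it never mispredicts."""
    return tag % num_ways
```

The tension ACCORD resolves is exactly this: the more deterministic the install function, the better the prediction, but the less freedom the cache has to use both ways.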

SLIDE 17

PROPOSAL: ACCORD

ACCORD (AssoCiativity by CoORDinating way-install and prediction) coordinates the way-install policy with the way predictor, achieving a way-predictable cache at low cost.

SLIDE 18

ACCORD OVERVIEW

  • Background
  • ACCORD
    – Probabilistic Way-Steering (PWS)
    – Ganged Way-Steering (GWS)
    – Skewed Way-Steering (SWS)
  • Summary

SLIDE 19

PROBABILISTIC WAY-STEERING

PWS enables way-predictability by trading off the speed of learning to use both ways (hit-rate). Lines are installed with a bias (e.g., 90%) toward the preferred way, so a static prediction of the preferred way is ~90% accurate, while the cache still uses both ways and improves hit-rate.
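A minimal sketch of the PWS install and prediction policy for a 2-way set, assuming the bias parameter shown on the slide:

```python
# Probabilistic Way-Steering (PWS): installs go to the preferred way
# with probability `bias`, and the static predictor always guesses
# the preferred way, so its accuracy tracks the bias.
import random

PREFERRED_WAY, ALTERNATE_WAY = 0, 1

def pws_install_way(bias=0.90, rng=random):
    """Biased coin flip: preferred way with probability `bias`."""
    return PREFERRED_WAY if rng.random() < bias else ALTERNATE_WAY

def pws_predict_way():
    """Static prediction: zero per-line state."""
    return PREFERRED_WAY
```

With bias = 1.0 this collapses to the direct-mapped tag-based policy (perfect prediction, no associativity benefit); with bias = 0.5 it is the unpredictable random policy. The bias knob trades prediction accuracy against hit-rate.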

SLIDE 20

SENSITIVITY TO PWS PROBABILITY

2-way design vs. direct-mapped. Preferred-way Install Probability = x% bias to install in the preferred way.

[Chart: Way-Prediction Accuracy (%) vs. bias for selecting the "preferred way" (50% to 100%).]
slide-21
SLIDE 21

SENSITIVITY TO PWS PROBABILITY

21

0% 20% 40% 60% 80% 100% 0% 2% 4% 6% 8% 10% 12% 14% 50% 60% 70% 80% 85% 90% 100% Way-Pred Accuracy (%) Miss Reduction (%) Preferred-way Install Probability Miss Reduction (%) Way-Pred Accuracy

2-way design Direct-mapped

slide-22
SLIDE 22

SENSITIVITY TO PWS PROBABILITY

22

2.6% 3.7% 4.7% 5.5% 5.6% 5.3% 0.0%

0% 20% 40% 60% 80% 100% 0% 2% 4% 6% 8% 10% 12% 14% 50% 60% 70% 80% 85% 90% 100% Way-Pred Accuracy (%) Miss Reduction (%) Speedup (%) Preferred-way Install Probability Speedup Miss Reduction (%) Way-Pred Accuracy

Preferred-way Install Probability (85%) provides best trade-off of hit-rate for WP accuracy, for 5.6% speedup.

5.6% speedup

SLIDE 23

ACCORD OVERVIEW

  • Background
  • ACCORD
    – Probabilistic Way-Steering (PWS)
    – Ganged Way-Steering (GWS)
    – Skewed Way-Steering (SWS)
  • Summary

SLIDE 24

GANGED WAY-STEERING

Probabilistic Way-Steering makes a per-line randomized decision, so the lines of a page scatter across ways (prediction ~50% for the lines steered away from the preferred way). Ganged Way-Steering makes the install decision at a large granularity (a per-page randomized decision), so an entire page lands in one way (prediction >90%). This improves predictability for workloads with high spatial locality.

SLIDE 25

GANGED WAY-STEERING IMPLEMENTATION

On install, a Recent Install Table (RIT) maps the RegionID (e.g., 0x001) to the way that guides the install. On access, a Recent Lookup Table (RLT) maps the RegionID (e.g., 0x101) to the predicted way.

GWS = per-region last-way install + last-way prediction. A 64-entry RIT and a 64-entry RLT need only 320 bytes.
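A sketch of the GWS table bookkeeping follows; the 64-entry table size comes from the slide, while the oldest-entry eviction policy and the class/field names are assumptions for illustration.

```python
# Hypothetical sketch of Ganged Way-Steering state: the Recent Install
# Table (RIT) remembers which way a region's lines are being installed
# into, and the Recent Lookup Table (RLT) predicts that way on access.
from collections import OrderedDict

class RegionWayTable:
    """Tiny region-ID -> way table with oldest-entry eviction."""
    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.table = OrderedDict()

    def record(self, region_id, way):
        if region_id in self.table:
            self.table.move_to_end(region_id)
        elif len(self.table) >= self.num_entries:
            self.table.popitem(last=False)   # evict the oldest region
        self.table[region_id] = way

    def lookup(self, region_id, default_way=0):
        """Last way seen for this region, or a static default."""
        return self.table.get(region_id, default_way)

rit = RegionWayTable()   # consulted on installs
rlt = RegionWayTable()   # consulted on lookups
```

Because each table only needs 64 small entries, the state fits in a few hundred bytes of SRAM, in contrast with the megabytes required by per-set or per-line predictors on slide 13.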

SLIDE 26

PWS+GWS WAY-PREDICTION ACCURACY

[Charts: Way-Prediction Accuracy (%), 70% to 100%, for PWS vs. PWS+GWS, on Libquantum and on the average of 21 workloads.]

GWS enables spatial workloads to reach near-100% accuracy; PWS alone has ~85% base accuracy. The combination of PWS+GWS achieves 90% accuracy, at the cost of 320B of storage.

SLIDE 27

PWS+GWS (ACCORD 2-WAY) RESULTS

[Chart: Speedup for PWS, PWS+GWS, and Perfect way-prediction.]

PWS+GWS achieves a 7.3% speedup, out of the 10% speedup of a perfectly-predicted 2-way cache.

The system assumes a 4GB DRAM Cache and PCM-based main memory.

SLIDE 28

ACCORD OVERVIEW

  • Background
  • ACCORD
    – Probabilistic Way-Steering (PWS)
    – Ganged Way-Steering (GWS)
    – Skewed Way-Steering (SWS)
  • Summary

SLIDE 29

DIFFICULTY IN SCALING TO N-WAYS

  • Scaling ACCORD to N ways:
    – ACCORD 4-way has 3% speedup
    – ACCORD 8-way has 6% slowdown…
  • Miss confirmation: an N-way cache needs N accesses to confirm a line is not resident (e.g., an access to E must probe Ways 0-3 before declaring a miss).

We need solutions to reduce the miss-confirmation cost.

SLIDE 30

SOLUTION: SKEWED WAY-STEERING

Restricting placement reduces miss-confirmation → hit-rate benefits without any storage overhead.

4-way with 2-skew: each line has one preferred way and one alternate way, so only 2 lookups are needed to determine a miss.
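The 2-skew scheme can be sketched as follows; the way-selection hash here is a placeholder for illustration, not the function used in the paper.

```python
# Skewed Way-Steering on a 4-way set with 2 skews: each tag may live
# in only two of the four ways (one preferred, one alternate), so two
# probes confirm a miss instead of four.

NUM_WAYS = 4
NUM_SKEWS = 2

def candidate_ways(tag):
    """The (preferred, alternate) ways this tag is allowed to occupy."""
    preferred = tag % NUM_SKEWS         # placeholder skewing hash
    alternate = preferred + NUM_SKEWS   # disjoint second bank of ways
    return preferred, alternate

def probes_to_confirm(cache_set, tag):
    """Probe only the candidate ways; return (probe count, hit way)."""
    probes = 0
    for way in candidate_ways(tag):
        probes += 1
        if cache_set[way] == tag:
            return probes, way
    return probes, None                 # miss confirmed after 2 probes
```

This keeps the miss-confirmation cost of a 2-way cache while recovering some of the placement flexibility of a 4-way cache.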

SLIDE 31

SPEEDUP FROM ACCORD (WITH SWS)

[Chart: Speedup for ACCORD 2-Way, 4-Way, and 8-Way.]

SWS 8-way achieves 11% speedup.

SLIDE 32

ACCORD OVERVIEW

  • Background
  • ACCORD
    – Probabilistic Way-Steering (PWS)
    – Ganged Way-Steering (GWS)
    – Skewed Way-Steering (SWS)
  • Summary

SLIDE 33

SUMMARY OF ACCORD

  • ACCORD: associative DRAM caches by coordinating way-install and way-prediction.
  • Probabilistic Way-Steering: biased install enables accurate static way-prediction.
  • Ganged Way-Steering: region-based install enables accurate region-based way-prediction.
  • Skewed Way-Steering: skew enables flexibility in line placement, while maintaining miss cost.
  • ACCORD enables associativity at negligible storage cost (320B), to achieve 11% speedup.

SLIDE 34

ACCORD BACKUP SLIDES

SLIDE 35

REPLACEMENT POLICY?

  • LRU with state in SRAM: 1 bit per line needs 8MB (the size of the last-level cache).
  • LRU with state in DRAM: 9% slowdown due to the state-update cost (on a hit to the alternate way).

SLIDE 36

COMPARISON TO OTHER WAY PREDICTORS

[Chart: Speedup of ACCORD vs. other way predictors; several alternatives show slowdowns of 6%, 4%, and 2%.]

ACCORD outperforms other predictors while needing negligible storage overhead (320 B).

SLIDE 37

COLUMN-ASSOCIATIVE CACHE

  • Column-associative / Hash-Rehash cache:
    – Install lines in the preferred way (way-0)
    – On eviction, move the line to the alternate way (way-1)
    – On a hit to the alternate way, move the line back to the preferred way
  • Effectiveness:
    – In general, way-prediction accuracy is similar to MRU
    – But it requires significant bandwidth to swap lines on a hit to the alternate way; the CA-cache thus causes a 4% slowdown.