ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction


  1. ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction. ISCA 2018. Authors: Vinson Young (GT), Chiachen Chou (GT), Aamer Jaleel (NVIDIA), Moinuddin K. Qureshi (GT).

  2. 3D-DRAM MITIGATES BANDWIDTH WALL. Modern systems pack many cores, which leads to the bandwidth wall. 3D-stacked DRAM, such as Hybrid Memory Cube (HMC) from Micron and High Bandwidth Memory (HBM) from Samsung, offers 4-8x the bandwidth of traditional memory (✔) but has limited capacity (✘). 3D-DRAM + high-capacity memory = hybrid memory.

  3. USE 3D-DRAM AS A CACHE. [Figure: memory hierarchy from fast to slow: CPU, L1$, L2$, L3$, DRAM cache (3D-DRAM, e.g. MCDRAM from Intel), system memory (NVM / DRAM, the OS-visible space).] Using 3D-DRAM as a DRAM cache can improve memory bandwidth and avoids OS/software changes.

  4. ARCHITECTING LARGE DRAM CACHES. Organize at line granularity (64B) for capacity/BW utilization. A gigascale cache needs a large tag store (tens of MBs). [Figure: 4GB of data in 3D-DRAM next to a 128 MB tag store.] Where should the tags go? They are too large for SRAM.

  5. ARCHITECTING LARGE DRAM CACHES. Organize at line granularity (64B) for high cache utilization. A gigascale cache needs a large tag store (tens of MBs). [Figure: 4GB of data and a 128 MB tag store, both in 3D-DRAM.] Practical designs must store tags in DRAM. How can the tag store be architected for low-latency tag access?
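The tag-store sizing argument on slides 4 and 5 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the 2-byte tag entry are assumptions (the slides do not state the per-line tag size), chosen because they reproduce the 128 MB figure shown for a 4GB cache.

```python
GIB = 1 << 30
MIB = 1 << 20


def tag_store_bytes(cache_bytes: int, line_bytes: int, tag_entry_bytes: int) -> int:
    """Total tag storage for a cache organized at line granularity."""
    num_lines = cache_bytes // line_bytes
    return num_lines * tag_entry_bytes


# 4GB cache with 64B lines, as on the slides.
num_lines = (4 * GIB) // 64

# ASSUMPTION: 2 bytes of tag + metadata per line (not stated in the deck).
size = tag_store_bytes(4 * GIB, 64, 2)

print(f"lines: {num_lines}, tag store: {size // MIB} MiB")
```

Even at a couple of bytes per line, the tag store is far beyond a practical on-chip SRAM budget, which is the motivation for keeping tags in DRAM.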

  6. EFFICIENT TAG ORGANIZATION (KNL CACHE). Tag-With-Data [Alloy Cache, Intel Knights Landing] stores each tag next to its data line, which gives a single Tag+Data lookup (1x hit latency), but the cache is direct-mapped. Practical designs use a 64B line size, store Tag-With-Data, and are direct-mapped, to optimize for hit latency. The Intel Knights Landing product (MCDRAM) uses this DRAM-cache organization.

  7. POTENTIAL OF ASSOCIATIVITY. [Figure: hit rate (%) for 1-way, 2-way, 4-way, and 8-way caches; associativity reduces 25% of misses.] How can we make DRAM caches associative? Assumes a 16-core system with a 4GB DRAM cache in front of PCM memory.

  8. ASSOCIATIVITY OPTION 1: SERIAL TAG LOOKUP. [Figure: a 2-way set holding lines A and B; the lookup probes Way 0 first and, on a miss, Way 1.] Serial tag lookup enables associativity, but it has a serialization delay.

  9. ASSOCIATIVITY OPTION 2: PARALLEL TAG LOOKUP. [Figure: a 2-way set holding lines A and B; the lookup probes Way 0 and Way 1 together.] Parallel lookup avoids the serialization latency, but it introduces a 2x bandwidth cost.

  10. ASSOCIATIVITY FOR DRAM CACHE (PARALLEL). [Figure: hit rate (%) for 1-way to 8-way caches, reducing 25% of misses; (b) Speedup (Parallel) for 2-way, 4-way, and 8-way, dropping as far as -46%.] Naively increasing associativity actually degrades performance because of the increased BW cost.

  11. ASSOCIATIVITY FOR DRAM CACHE (IDEAL). [Figure: hit rate (%) for 1-way to 8-way caches, reducing 25% of misses; (b) Speedup (Parallel) for 2-way to 8-way, dropping as far as -46%; (c) Speedup (Idealized), with the latency / BW of direct-mapped, reaching 21%.] Associativity must still maintain the latency/BW of direct-mapped caches. How?

  12. OPTION 3: WAY-PREDICTED TAG LOOKUP. [Figure: a 2-way set holding lines A and B; way prediction picks the way to probe first, and the other way is probed only on a miss.] Way-predicted tag lookup can obtain the improved hit rate with the BW / latency of a direct-mapped cache.
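The three lookup options on slides 8, 9, and 12 differ mainly in how many DRAM accesses a hit costs. The following is a rough cost model rather than anything from the deck: the function names are hypothetical, and the way-predicted case is restricted to a 2-way cache where a misprediction costs exactly one extra probe.

```python
def serial_accesses(hit_position: int) -> int:
    """Serial lookup probes ways one by one; hit_position is 0-based."""
    return hit_position + 1


def parallel_accesses(ways: int) -> int:
    """Parallel lookup reads every way of the set on each access."""
    return ways


def way_predicted_accesses_2way(accuracy: float) -> float:
    """Expected accesses per hit in a 2-way cache.

    One access when the prediction is right, two when it is wrong.
    """
    return accuracy * 1 + (1.0 - accuracy) * 2


for accuracy in (0.5, 0.9, 1.0):
    print(accuracy, way_predicted_accesses_2way(accuracy))
```

Under this model, a way-predicted cache approaches the one-access cost of a direct-mapped cache only as prediction accuracy approaches 100%, which is why predictor accuracy matters so much in the slides that follow.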

  13. WAY-PREDICTION ACCURACY & COST.

                                  MRU Pred (1 bit/set)   Partial-Tag (4 bits/line)
      SRAM storage                4MB                    32MB
      Way-pred accuracy (2-way)   85.7%                  97.3%
      Way-pred accuracy (4-way)   74.3%                  91.6%
      Way-pred accuracy (8-way)   63.2%                  81.2%

      Prior methods for way prediction have low accuracy and/or high storage overhead.

  14. TOWARDS ASSOCIATIVITY W/ WAY-PREDICTION. [Figure: way-predicted tag lookup on a 2-way set holding lines A and B; the other way is probed only on a miss.] Goal: way prediction with low storage overhead and high accuracy, to enable an associative DRAM cache.

  15. ACCORD OVERVIEW. • Background • ACCORD: Probabilistic Way-Steering (PWS), Ganged Way-Steering (GWS), Skewed Way-Steering (SWS) • Summary

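  16. INSIGHT: WAY-PREDICTABILITY AT LOW STORAGE? [Figure: two 2-way caches holding lines with even and odd tags. Under the base install policy (random), even and odd lines land in either way, so the way is hard to predict (~50%). Under a tag-based install policy, even lines go to one way and odd lines to the other, so the way is predicted 100% of the time, but the cache behaves as direct-mapped.] Insight: modifying the install policy can make way prediction much simpler!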

  17. PROPOSAL: ACCORD. [Figure: the way-install policy and the way predictor are coordinated over a 2-way cache holding lines A2, A3, B3, B5, B7.] ACCORD: AssoCiativity by CoORDinating way-install and prediction. ACCORD achieves a way-predictable cache at low cost.

  18. ACCORD OVERVIEW. • Background • ACCORD: Probabilistic Way-Steering (PWS), Ganged Way-Steering (GWS), Skewed Way-Steering (SWS) • Summary

  19. PROBABILISTIC WAY-STEERING. [Figure: lines A0-A7 and B0-B7 of pages A and B installed into a 2-way cache using PWS; each line goes to its preferred way with bias = 90% and to the other way with 10%, so static prediction is ~90% accurate.] Installing with PWS still uses both ways, which improves hit rate. PWS enables way predictability by trading off the speed of learning to use both ways (hit rate).
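A minimal sketch of biased install with a static predictor, for a 2-way cache. Everything named here is hypothetical: in particular, deriving the preferred way from the tag's parity is an assumption suggested by the even/odd example on slide 16, and the accuracy measured below only checks install-time agreement, ignoring evictions and replacement.

```python
import random


def preferred_way(tag: int, ways: int = 2) -> int:
    """Static preferred way for a line (ASSUMPTION: tag modulo ways)."""
    return tag % ways


def pws_install_way(tag: int, bias: float = 0.85, ways: int = 2, rng=random) -> int:
    """Install in the preferred way with probability `bias`, else in another way."""
    pref = preferred_way(tag, ways)
    if rng.random() < bias:
        return pref
    others = [w for w in range(ways) if w != pref]
    return rng.choice(others)


def static_prediction_accuracy(num_lines: int, bias: float, seed: int = 0) -> float:
    """Fraction of installs for which 'predict the preferred way' is correct."""
    rng = random.Random(seed)
    correct = 0
    for tag in range(num_lines):
        if pws_install_way(tag, bias, rng=rng) == preferred_way(tag):
            correct += 1
    return correct / num_lines


for bias in (0.5, 0.85, 1.0):
    print(bias, round(static_prediction_accuracy(100_000, bias), 3))
```

With a bias of 100% the predictor is always right but the cache degenerates to direct-mapped; with 50% both ways are used freely but the predictor is no better than a coin flip. Intermediate values trade one against the other.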

  20. SENSITIVITY TO PWS PROBABILITY. Preferred-way install probability = x% bias to install in the preferred way. [Figure: way-prediction accuracy (%) versus bias for selecting the preferred way (50%, 60%, 70%, 80%, 85%, 90%, 100%), ranging from a 2-way design at 50% to direct-mapped at 100%.]

  21. SENSITIVITY TO PWS PROBABILITY. [Figure: miss reduction (%) and way-prediction accuracy (%) versus preferred-way install probability (50%, 60%, 70%, 80%, 85%, 90%, 100%), ranging from a 2-way design at 50% to direct-mapped at 100%.]

  22. SENSITIVITY TO PWS PROBABILITY. [Figure: speedup (%), miss reduction (%), and way-prediction accuracy (%) versus preferred-way install probability (50%, 60%, 70%, 80%, 85%, 90%, 100%); speedup data labels are 0.0%, 2.6%, 3.7%, 4.7%, 5.3%, 5.5%, and 5.6%.] A preferred-way install probability of 85% provides the best trade-off of hit rate for way-prediction accuracy, for a 5.6% speedup.

  23. ACCORD OVERVIEW. • Background • ACCORD: Probabilistic Way-Steering (PWS), Ganged Way-Steering (GWS), Skewed Way-Steering (SWS) • Summary

  24. GANGED WAY-STEERING. [Figure: lines A0-A7 and B0-B7 installed into a 2-way cache under two policies. Probabilistic Way-Steering makes a per-line randomized decision (pred ~50%); Ganged Way-Steering makes a per-page randomized decision (pred >90%).] Ganged Way-Steering makes the install decision at a large granularity, to improve predictability for workloads with high spatial locality.

  25. GANGED WAY-STEERING IMPLEMENTATION. [Figure: the Recent Install Table (RIT) guides installs and the Recent Lookup Table (RLT) predicts the way on an access; each table maps a RegionID to a way, e.g. 0x001 → 0 and 0x101 → 1.] GWS: per-region last-way install + last-way prediction. A 64-entry RIT and a 64-entry RLT need only 320 bytes.
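A sketch of how the two tables could work together. This is an illustration under stated assumptions, not the authors' implementation: the class names, LRU replacement within each table, the 4 KB region size, and the fallback callbacks (standing in for PWS install and static prediction) are all hypothetical. As a sanity check on the storage figure, 320 bytes over 128 entries works out to 20 bits per entry.

```python
from collections import OrderedDict

REGION_BITS = 12  # ASSUMPTION: 4 KB regions


def region_id(addr: int) -> int:
    return addr >> REGION_BITS


class RegionWayTable:
    """Small table mapping region ID -> last way (LRU replacement assumed)."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, rid: int):
        if rid not in self.entries:
            return None
        self.entries.move_to_end(rid)  # mark as most recently used
        return self.entries[rid]

    def put(self, rid: int, way: int) -> None:
        self.entries[rid] = way
        self.entries.move_to_end(rid)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class GangedWaySteering:
    def __init__(self, fallback_install, fallback_predict):
        self.rit = RegionWayTable()  # Recent Install Table
        self.rlt = RegionWayTable()  # Recent Lookup Table
        self.fallback_install = fallback_install
        self.fallback_predict = fallback_predict

    def install_way(self, addr: int) -> int:
        """Reuse the way of the last install to this region, if tracked."""
        rid = region_id(addr)
        way = self.rit.get(rid)
        if way is None:
            way = self.fallback_install(addr)
        self.rit.put(rid, way)
        return way

    def predict_way(self, addr: int) -> int:
        """Predict the way of the last hit in this region, if tracked."""
        way = self.rlt.get(region_id(addr))
        return way if way is not None else self.fallback_predict(addr)

    def record_hit(self, addr: int, way: int) -> None:
        self.rlt.put(region_id(addr), way)


gws = GangedWaySteering(fallback_install=lambda a: (a >> 6) & 1,
                        fallback_predict=lambda a: 1)
print(gws.install_way(0x1000), gws.install_way(0x1040))
```

In the usage lines, the fallback alone would alternate ways between neighboring lines; with the RIT in place, the second line of the region follows the first.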

  26. PWS+GWS WAY-PREDICTION ACCURACY. [Figure: way-prediction accuracy (%) for PWS and PWS+GWS, shown for the average of 21 workloads and for libquantum.] PWS has ~85% base accuracy; GWS enables spatial workloads to reach near-100% accuracy. The combination of PWS+GWS achieves 90% accuracy at the cost of 320B of storage.

  27. PWS+GWS (ACCORD 2-WAY) RESULTS. [Figure: speedup for PWS, PWS+GWS, and perfect prediction.] PWS+GWS gets a 7.3% speedup, out of the 10% speedup of a perfectly predicted 2-way cache. The system assumes a 4GB DRAM cache and PCM-based main memory.

  28. ACCORD OVERVIEW. • Background • ACCORD: Probabilistic Way-Steering (PWS), Ganged Way-Steering (GWS), Skewed Way-Steering (SWS) • Summary

  29. DIFFICULTY IN SCALING TO N-WAYS. Scaling ACCORD to N ways: ACCORD 4-way has a 3% speedup, and ACCORD 8-way has a 6% slowdown. [Figure: an access to address E misses in a 4-way set holding A, B, C, and D, after probing all four ways.] Miss confirmation: an N-way cache needs N accesses to confirm that a line is not resident. We need solutions to reduce the miss-confirmation cost.

  30. SOLUTION: SKEWED WAY-STEERING. 4-way with 2-skew: each line has one preferred + one alternate way. [Figure: lines A, B, and C placed across Way 0 to Way 3; an access to E needs only 2 lookups to determine a miss.] Restricting placement reduces the miss-confirmation cost, giving hit-rate benefits without any storage overhead.
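The effect of restricting each line to two candidate ways can be sketched as follows. The hash used to pick the preferred and alternate ways is an assumption made for illustration (the slide does not say how the two ways are chosen), and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def candidate_ways(tag: int, ways: int) -> Tuple[int, int]:
    """Return (preferred, alternate) ways for a line.

    ASSUMPTION: both are simple functions of the tag; the offset is in
    1..ways-1, so the alternate always differs from the preferred way.
    """
    preferred = tag % ways
    offset = 1 + (tag // ways) % (ways - 1)
    alternate = (preferred + offset) % ways
    return preferred, alternate


def lookup(cache_set: List[Optional[int]], tag: int) -> Tuple[Optional[int], int]:
    """Probe only the two candidate ways.

    cache_set[w] holds the tag stored in way w, or None if empty.
    Returns (hit way or None, number of probes).
    """
    probes = 0
    for way in candidate_ways(tag, len(cache_set)):
        probes += 1
        if cache_set[way] == tag:
            return way, probes
    return None, probes


eight_way_set: List[Optional[int]] = [None] * 8
print(lookup(eight_way_set, 13))  # a miss is confirmed after two probes
```

Because a line can only ever live in its two candidate ways, a miss is confirmed after at most two probes regardless of the total associativity, and no extra state is stored.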

  31. SPEEDUP FROM ACCORD (WITH SWS). [Figure: speedup for 2-way, 4-way, and 8-way designs.] SWS 8-way achieves an 11% speedup.

  32. ACCORD OVERVIEW. • Background • ACCORD: Probabilistic Way-Steering (PWS), Ganged Way-Steering (GWS), Skewed Way-Steering (SWS) • Summary

  33. SUMMARY OF ACCORD.
      • ACCORD: associative DRAM caches by coordinating way-install and way-prediction.
      • Probabilistic Way-Steering: biased install enables accurate static way-prediction.
      • Ganged Way-Steering: region-based install enables accurate region-based way-prediction.
      • Skewed Way-Steering: skew enables flexibility in line placement while maintaining miss cost.
      • ACCORD enables associativity at negligible storage cost (320B), to achieve an 11% speedup.

  34. ACCORD BACKUP SLIDES.
