
Jenga: Software-Defined Cache Hierarchies
Po-An Tsai, Nathan Beckmann, and Daniel Sanchez (presentation transcript)

Executive summary: Heterogeneous caches are traditionally organized as a rigid hierarchy. A rigid hierarchy is easy to program, but it introduces expensive overheads when applications do not benefit from every level: each access must still check all levels, wasting energy and bandwidth.


Prior work to mitigate the cost of rigid hierarchies
 Bypass levels to avoid cache pollution: do not install lines at specific levels, or give lines low priority in the replacement policy.
 Hide latency with speculative accesses up the hierarchy: hit/miss predictors, prefetchers.
 But these techniques must still check all levels for correctness, wasting energy and bandwidth. It is better to build the right hierarchy and avoid the root cause: unnecessary accesses to unwanted cache levels.

Jenga = flexible hardware + smart software
 Periodically (every 100ms), software reads hardware monitors, optimizes the virtual hierarchies, and updates them; the hardware then runs with the new configuration until the next reconfiguration.
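The loop above can be sketched in Python. Only the 100ms interval comes from the slides; the monitor, optimizer, and VHT-update interfaces are hypothetical stand-ins.

```python
import time

RECONFIG_INTERVAL = 0.1  # 100 ms, per the slides

def jenga_loop(read_monitors, optimize, update_vhts, steps):
    """Run the software reconfiguration loop for a fixed number of intervals."""
    for _ in range(steps):
        miss_curves = read_monitors()        # read hardware monitors
        hierarchies = optimize(miss_curves)  # optimize hierarchies
        update_vhts(hierarchies)             # update hierarchies via the VHTs
        time.sleep(RECONFIG_INTERVAL)        # hardware runs until next interval
```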

Jenga hardware: supporting virtual hierarchies (VHs)
 Cores consult a virtual hierarchy table (VHT) to find the access path.
 Similar to Jigsaw [PACT'13, HPCA'15], but Jenga supports two levels, e.g., a two-level hierarchy using both SRAM and DRAM.
(Diagram: each tile has a core, private caches, and a TLB that supplies a VH id; the VHT maps the VH id and address to a bank; tiles connect through NoC routers to SRAM banks and DRAM cache banks.)

Accessing a two-level virtual hierarchy
Access path: SRAM bank  DRAM bank  Mem
 1. The core on tile 10 misses in its private caches; the VHT directs the access to its virtual L1 (VL1), an SRAM bank (bank 10).
 2. On a VL1 miss, the access goes to the virtual L2 (VL2), a DRAM cache bank (bank 38).
 3. On a VL2 hit, the line is served back to the core.
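A minimal model of this lookup, with the two-level path (SRAM bank 10, then DRAM cache bank 38) taken from the example above; the table layout and function names are assumptions:

```python
# vh_id -> ordered levels; each level maps an address to (bank type, bank id).
# Bank numbers follow the example above; the mapping functions are placeholders.
VHT = {
    1: [lambda addr: ("sram", 10),   # VL1: SRAM bank 10
        lambda addr: ("dram", 38)],  # VL2: DRAM cache bank 38
}

def access_path(vh_id, addr):
    """Banks checked in order after a private-cache miss, then main memory."""
    return [level(addr) for level in VHT[vh_id]] + [("mem", None)]
```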

Accessing a single-level VH using SRAM + DRAM
 With the VHT, software can group any combination of banks to form a VH, e.g., a single-level VH using both SRAM and DRAM: address X maps to an SRAM bank, address Y to a DRAM bank, and misses go straight to main memory.
 This is logically equivalent to a single cache level, composed of SRAM and DRAM banks, behind the core's private caches.
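One way to model such a grouping: interleave addresses across the member banks, regardless of whether a bank is SRAM or DRAM. The interleaving scheme is an assumption; the slides only state that the VHT can group arbitrary banks.

```python
def make_single_level_vh(banks):
    """Group arbitrary banks into one VH; addresses interleave across them."""
    def lookup(addr):
        return banks[addr % len(banks)]  # simple interleaving (assumed)
    return lookup

# A VH mixing two SRAM banks and one DRAM bank, as in the slide's example.
vh = make_single_level_vh(["sram0", "sram1", "dram0"])
```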

Jenga software: finding near-optimal hierarchies
 Periodically, Jenga reconfigures VHs to minimize data movement:
 Hardware monitors produce application miss curves.
 Virtual hierarchy allocation turns the miss curves into VH sizes and levels (VL1, VL2).
 Bandwidth-aware placement maps the final allocation onto banks.
 Software then sets the VHTs to install the new virtual hierarchies.

Modeling performance of heterogeneous caches
 Jenga treats SRAM and DRAM as different "flavors" of cache banks with different latencies (in the figure, color encodes latency, starting from a DRAM bank).
 For each virtual cache size, Jenga combines the access latency (which grows with total capacity as farther banks are used) with the miss latency (derived from the miss curve collected by hardware monitors) to get the total latency.
 The result is a latency curve for a single-level, heterogeneous cache.
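A toy version of this model (all latency numbers are made up, and the real model also accounts for network distance across banks):

```python
ACCESS_LAT = {"sram": 20, "dram": 120}  # cycles per access (assumed values)
MISS_PENALTY = 200                      # cycles to main memory (assumed)

def latency_curve(miss_curve, flavor):
    """Total latency per access at each size: access latency + miss latency.

    miss_curve[size] = misses per access at that cache size.
    """
    return {size: ACCESS_LAT[flavor] + miss_rate * MISS_PENALTY
            for size, miss_rate in miss_curve.items()}
```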

Optimizing hierarchies by minimizing system latency
 Our prior work proposed algorithms that take latency curves, allocate capacity among applications, and place data on chip to minimize system latency.
 But that work only builds single-level VHs.
(Figure: latency curves for App1, App2, and App3 vs. capacity, and the resulting division of total capacity among the three apps.)
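As a rough sketch of capacity allocation over such curves (the actual algorithms are Jigsaw's, which also handle non-convex curves; this greedy marginal-gain version is only optimal when the curves are convex):

```python
def allocate(curves, total_units):
    """curves[app][u] = latency per access with u capacity units allocated."""
    alloc = {app: 0 for app in curves}

    def gain(app):
        # Latency saved by giving this app one more capacity unit.
        u = alloc[app]
        if u + 1 >= len(curves[app]):
            return 0.0
        return curves[app][u] - curves[app][u + 1]

    for _ in range(total_units):
        alloc[max(curves, key=gain)] += 1  # give the unit to the biggest gain
    return alloc
```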

Multi-level hierarchies are much more complex
 Many intertwined factors: the best VL1 size depends on the VL2 size, the best VL2 size depends on the VL1 size, and whether to have a VL2 at all depends on the total size.
 Jenga encodes these tradeoffs in a single curve, so it can reuse prior allocation algorithms.

How to get a latency curve for a multi-level VH
 Two-level hierarchies form a latency surface over (VL1 size, VL2 size).
 Projecting the surface gives the best 1- and 2-level hierarchy at every total size; taking the minimum of these yields the best overall hierarchy at every size.
 This single curve lets Jenga optimize multi-level hierarchies.
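The projection step can be sketched as follows, with a made-up surface indexed by (VL1 size, VL2 size); a VL2 size of 0 represents a single-level hierarchy:

```python
def project(surface, max_size):
    """Best latency at each total size, over all (VL1, VL2) splits."""
    curve = {}
    for total in range(1, max_size + 1):
        # Every way to split 'total' between VL1 and VL2 (VL2 may be 0).
        splits = [(s1, total - s1) for s1 in range(1, total + 1)]
        curve[total] = min(surface[s] for s in splits if s in surface)
    return curve
```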

Allocating virtual hierarchies
 The cache allocation algorithm takes the latency curves of VH1, VH2, and VH3, decides the best hierarchy for each, and outputs the total capacity and levels of each VH (e.g., VH1 and VH2 each get a single VL1, while VH3 gets a VL1 and a VL2).

Bandwidth-aware virtual hierarchy placement
 Place data close to cores without saturating DRAM bandwidth.
 Every iteration, Jenga:
 Chooses a VH (via an opportunity-cost metric, see paper).
 Greedily places a chunk of its data in its closest bank.
 Updates DRAM bank latency (e.g., from 1.0× to 1.1× to 1.3× as banks fill), steering later placements toward less-loaded banks.
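The iteration above might look like this loose model; the real selection metric is the opportunity cost from the paper, and the latency and distance numbers here are invented:

```python
def place(num_chunks, banks):
    """Greedy placement: banks[name] = {"lat": multiplier, "dist": hops, "dram": bool}."""
    placement = []
    for _ in range(num_chunks):
        # Effective cost of a bank: latency multiplier scaled by distance.
        best = min(banks, key=lambda b: banks[b]["lat"] * banks[b]["dist"])
        placement.append(best)
        if banks[best]["dram"]:
            banks[best]["lat"] += 0.1  # model added queueing delay under load
    return placement
```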

Jenga adds small overheads
