
SLIDE 1

Jenga: Software-Defined Cache Hierarchies

Po-An Tsai, Nathan Beckmann, and Daniel Sanchez

SLIDE 2

Executive summary

- Heterogeneous caches are traditionally organized as a rigid hierarchy
  - Easy to program, but introduces expensive overheads when the hierarchy is not helpful
- Jenga builds application-specific cache hierarchies on the fly
  - Key contribution: new algorithms to find near-optimal hierarchies
  - Handles arbitrary application behaviors & changing resource constraints
  - Optimizes the full system at 36 cores in <1 ms
- Jenga improves EDP by up to 85% vs. the state of the art

SLIDE 3-9

Deep, rigid hierarchies are running out of steam

[Figure: cache hierarchies, past vs. now]
- Past: systems had few cache levels with widely different sizes and latencies
  L1 (~1ns) -> L2 (~10ns) -> Main Memory (~100ns)
- Now: per-tile private L1 & L2 (~1ns, ~5ns) -> distributed SRAM L3 banks (~25ns) -> distributed DRAM L4 banks (~50ns) -> Main Memory (~100ns)
- Higher overheads due to closer sizes and latencies across hierarchy levels

SLIDE 10-20

Rigid hierarchies must cater to the conflicting needs of many applications

App 1: Scan through a 256MB array repeatedly

[Figure: App 1 on the rigid hierarchy: Private L1 & L2 -> SRAM L3 -> DRAM L4 -> Main Memory. The array data sits in the DRAM L4: the private caches and the SRAM L3 see 0% hit rates, the DRAM L4 a 100% hit rate.]
- Rigid hierarchy: hit latency = ~5ns + ~25ns + ~50ns = ~80ns
- Skipping the useless SRAM L3: hit latency = ~5ns + 0ns + ~50ns = ~55ns (30% lower)
- Also serving the data from closer DRAM banks: hit latency = ~5ns + 0ns + ~40ns = ~45ns (45% lower)

Even the best rigid hierarchy is a bad compromise! (See paper for details)
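
To make the arithmetic concrete: a minimal Python sketch of the hit-latency sums above, using the slide's illustrative latencies and hit rates (the AMAT-style model is a standard one, not code from the paper):

    # Expected latency of a hit: sum the latency of every level an access
    # visits, weighted by the probability it gets that far.
    def hit_latency(levels):
        """levels: (access_latency_ns, hit_rate) pairs, closest level first."""
        total, reach = 0.0, 1.0
        for latency, hit_rate in levels:
            total += reach * latency
            reach *= 1.0 - hit_rate
        return total

    rigid   = [(5, 0.0), (25, 0.0), (50, 1.0)]  # L2 -> SRAM L3 -> DRAM L4
    skip_l3 = [(5, 0.0), (50, 1.0)]             # bypass the useless L3
    near_l4 = [(5, 0.0), (40, 1.0)]             # data in closer DRAM banks

    for name, cfg in [("rigid", rigid), ("skip L3", skip_l3), ("near L4", near_l4)]:
        print(f"{name}: ~{hit_latency(cfg):.0f}ns")  # ~80ns, ~55ns, ~45ns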

SLIDE 21-26

Jenga: Software-defined cache hierarchies

Jenga manages distributed and heterogeneous banks as a single resource pool and builds virtual hierarchies tailored to each application in the system.

[Figure: pool of SRAM and DRAM banks across the chip, carved into per-app caches]
- App 1: Scan through a 256MB array. Ideal hierarchy: Private L1 & L2 -> 256MB cache -> Main Memory
- App 2: Look up a 5MB hashmap. Ideal hierarchy: Private L1 & L2 -> 5MB cache

SLIDE 27-29

Jenga: Software-defined cache hierarchies (continued)

- App 3: Scan through two arrays (1MB and 256MB). Ideal hierarchy: Private L1 & L2 -> 1MB cache -> 256MB cache

SLIDE 30-34

Prior work to mitigate the cost of rigid hierarchies

- Bypass levels to avoid cache pollution
  - Do not install lines at specific levels
  - Give lines low priority in the replacement policy
- Speculatively access up the hierarchy
  - Hit/miss predictors, prefetchers
  - Hide latency with speculative accesses
- These techniques must still check all levels for correctness!
  - They waste energy and bandwidth

It's better to build the right hierarchy and avoid the root cause: unnecessary accesses to unwanted cache levels.

SLIDE 35-41

Jenga = flexible hardware + smart software

[Figure: hardware/software timeline]
Every ~100ms, Jenga's software reads the hardware monitors, optimizes the hierarchies, and updates the hierarchies in hardware, then repeats.
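
A minimal sketch of that control loop, assuming hypothetical monitors/allocator/placer/vhts objects (the slide fixes only the structure and the ~100ms period; every name below is illustrative):

    import time

    RECONFIG_INTERVAL_S = 0.1  # software reconfigures every ~100ms

    def jenga_runtime(monitors, allocator, placer, vhts):
        while True:
            miss_curves = monitors.read()                  # read hardware monitors
            hierarchies = allocator.allocate(miss_curves)  # VH sizes & levels
            placement = placer.place(hierarchies)          # banks for each VH level
            vhts.update(placement)                         # update hierarchies in HW
            time.sleep(RECONFIG_INTERVAL_S)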

SLIDE 42-45

Jenga hardware: supporting virtual hierarchies (VHs)

- Cores consult the virtual hierarchy table (VHT) to find the access path
- Similar to Jigsaw [PACT'13, HPCA'15], but supports two-level hierarchies using both SRAM and DRAM

[Figure: tile layout: core, private $, TLB feeding the VHT (Addr -> VH id), NoC router, SRAM bank; DRAM banks alongside]
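
A sketch of what a VHT lookup computes, under the assumption of a software model of the table (the real VHT is a small hardware structure indexed by the VH id from the TLB; the descriptor fields and hashing below are illustrative):

    from dataclasses import dataclass

    @dataclass
    class VHDescriptor:
        vl1_banks: list   # banks forming this VH's first level
        vl2_banks: list   # banks forming its second level ([] if one-level)

    def access_path(vht, vh_id, addr):
        """Return the banks an address may visit, in order (then main memory)."""
        vh = vht[vh_id]                     # VH id comes with the TLB entry
        line = addr >> 6                    # assume 64B lines
        path = [vh.vl1_banks[line % len(vh.vl1_banks)]]
        if vh.vl2_banks:
            path.append(vh.vl2_banks[line % len(vh.vl2_banks)])
        return path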

SLIDE 46-49

Accessing a two-level virtual hierarchy

[Figure: a core on tile 1 accesses its VH: VL1 in the SRAM bank of tile 10, VL2 in DRAM bank 38]
1. Core miss -> VL1 bank (SRAM, bank 10)
2. VL1 miss -> VL2 bank (DRAM, bank 38)
3. VL2 hit -> serve the line

Access path: SRAM bank -> DRAM bank -> Mem
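
The same three-step walk as a small Python sketch; filling the VL1 on a VL2 hit is an assumption for illustration, not a statement of Jenga's actual insertion policy:

    def vh_access(addr, vl1, vl2, memory):
        """vl1/vl2: dicts modeling the VL1 (SRAM) and VL2 (DRAM) banks."""
        if addr in vl1:                       # 1. core miss -> VL1 bank: hit
            return "VL1 hit", vl1[addr]
        if addr in vl2:                       # 2. VL1 miss -> VL2 bank
            vl1[addr] = vl2[addr]             # 3. VL2 hit: serve (and fill VL1)
            return "VL2 hit", vl1[addr]
        vl1[addr] = vl2[addr] = memory[addr]  # VL2 miss -> main memory
        return "memory", vl1[addr]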

SLIDE 50-54

Accessing a single-level VH using SRAM + DRAM

- With the VHT, software can group any combination of banks to form a VH

[Figure: addresses X and Y map to different banks (one DRAM, one SRAM) of the same single-level VH. Logically equivalent to: Core -> Private Caches -> one cache built from DRAM and SRAM banks -> Main Memory]
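
A sketch of how one logical level spreads lines over heterogeneous banks, so each line needs exactly one lookup (the bank list and hash below are assumptions for illustration):

    banks = [
        {"id": 10, "kind": "SRAM", "latency_ns": 25},
        {"id": 11, "kind": "SRAM", "latency_ns": 27},
        {"id": 38, "kind": "DRAM", "latency_ns": 50},
    ]

    def home_bank(addr):
        line = addr >> 6                  # 64B cache lines
        return banks[line % len(banks)]   # exactly one home bank per line

    for addr in (0x1000, 0x1040, 0x1080):
        b = home_bank(addr)
        print(hex(addr), "->", b["kind"], "bank", b["id"], f"~{b['latency_ns']}ns")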

SLIDE 55-60

Jenga software: finding near-optimal hierarchies

- Periodically, Jenga reconfigures VHs to minimize data movement

[Figure: reconfiguration pipeline]
Hardware monitors -> application miss curves -> Virtual Hierarchy Allocation (VH sizes & levels: VL1/VL2) -> Bandwidth-Aware Placement (final allocation) -> set VHTs

SLIDE 61-67

Modeling performance of heterogeneous caches

- Treat SRAM and DRAM as different "flavors" of banks with different latencies

[Figure: banks around the requesting core, colored by latency; growing a virtual cache out from the core adds progressively slower banks]
- Cache access latency therefore rises with total capacity
- Combining the access-latency curve with the miss latency (weighted by the miss curve from the hardware monitors) yields the total-latency curve for a single-level, heterogeneous cache
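
A toy sketch of that composition, with made-up inputs (the real miss curve comes from the hardware monitors, and the access-latency curve from the bank layout):

    # total(s) = accesses * access_latency(s) + misses(s) * miss_penalty
    def total_latency_curve(miss_curve, access_latency, miss_penalty_ns, accesses):
        return [accesses * access_latency[s] + miss_curve[s] * miss_penalty_ns
                for s in range(len(miss_curve))]

    miss_curve     = [1000, 400, 150, 60, 50]  # misses at capacity s (toy data)
    access_latency = [0, 25, 28, 40, 50]       # grows as slower banks are added
    curve = total_latency_curve(miss_curve, access_latency, 100, accesses=1000)
    print(curve)  # the minimum marks the best single-level size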

SLIDE 68-72

Optimizing hierarchies by minimizing system latency

- Our prior work proposed algorithms that take latency curves, allocate capacity, and place it on chip to minimize total system latency
- But that work only builds single-level VHs

[Figure: per-app latency-vs-capacity curves for App1-App3, and the resulting division of cache capacity among them]

SLIDE 73-75

Multi-level hierarchies are much more complex

- Many intertwined factors:
  - The best VL1 size depends on the VL2 size
  - The best VL2 size depends on the VL1 size
  - Should there be a VL2 at all? (Depends on total size)
- Jenga encodes these tradeoffs in a single curve
  - So it can reuse prior allocation algorithms

SLIDE 76-82

How to get a latency curve for a multi-level VH

- Two-level hierarchies form a latency surface: latency as a function of VL1 and VL2 sizes
- Projecting the surface gives the best 1- or 2-level hierarchy at every total size
- The projected curve captures the best overall hierarchy at every size, and lets us optimize multi-level hierarchies!
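
A sketch of the projection, assuming surface[s1][s2] holds the latency of a two-level hierarchy with VL1 size s1 and VL2 size s2, and one_level[s] is the single-level curve (this data layout is an assumption; the paper derives the curves analytically):

    def project(surface, one_level):
        """best[s]: lowest latency at total capacity s; split[s]: the
        (VL1, VL2) sizes that achieve it, with VL2 = 0 meaning one level."""
        best, split = [], []
        for s in range(len(one_level)):
            cand = [(one_level[s], (s, 0))]               # 1-level option
            cand += [(surface[s1][s - s1], (s1, s - s1))  # every 2-level split
                     for s1 in range(1, s)]
            lat, sizes = min(cand)
            best.append(lat)
            split.append(sizes)
        return best, split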

SLIDE 83-87

Allocating virtual hierarchies

[Figure: allocation step]
- Inputs: the latency curve of each VH (VH1, VH2, VH3) and the total cache capacity
- The cache allocation algorithm divides the total capacity among the VHs
- Each VH's allocation then decides its best hierarchy: its size and levels (VL1 only, or VL1 + VL2)
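
A greedy marginal-benefit sketch of the allocation step; this stand-in assumes convex latency curves (the actual allocator in Jigsaw/Jenga is the faster Peekahead algorithm):

    import heapq

    def allocate(curves, total_chunks):
        """curves[i][s]: VH i's projected latency with s capacity chunks.
        Give each chunk to the VH whose latency drops the most."""
        alloc = [0] * len(curves)
        heap = [(-(c[0] - c[1]), i) for i, c in enumerate(curves) if len(c) > 1]
        heapq.heapify(heap)
        for _ in range(total_chunks):
            if not heap:
                break
            gain, i = heapq.heappop(heap)
            if -gain <= 0:
                break                      # no VH benefits from more capacity
            alloc[i] += 1
            s = alloc[i]
            if s + 1 < len(curves[i]):
                heapq.heappush(heap, (-(curves[i][s] - curves[i][s + 1]), i))
        return alloc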

SLIDE 88-99

Bandwidth-aware virtual hierarchy placement

- Place data close to where it is used without saturating DRAM bandwidth
- Every iteration, Jenga:
  - Chooses a VH (via an opportunity-cost metric, see paper)
  - Greedily places a chunk of its data in its closest bank
  - Updates the DRAM bank's latency

[Figure: chunks of each VH's VL1/VL2 allocation placed over SRAM and DRAM banks; as load accumulates, a DRAM bank's effective latency climbs from 1.0x to 1.1x to 1.3x]
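
A sketch of the placement loop; picking VHs round-robin and growing a bank's latency linearly with load are simplifying assumptions (the paper's opportunity-cost metric and latency model are more precise):

    from itertools import cycle

    def place(vh_chunks, candidate_banks, base_latency, load_penalty=5):
        """vh_chunks: {vh: #chunks to place}; candidate_banks[vh]: bank ids,
        closest first; base_latency[bank]: unloaded latency in ns."""
        load, placement = {}, []
        order = cycle(list(vh_chunks))   # stand-in for the opportunity-cost pick
        remaining = sum(vh_chunks.values())
        while remaining:
            vh = next(order)
            if vh_chunks[vh] == 0:
                continue
            # cheapest bank = base latency + contention from chunks placed so far
            bank = min(candidate_banks[vh],
                       key=lambda b: base_latency[b] + load_penalty * load.get(b, 0))
            placement.append((vh, bank))
            load[bank] = load.get(bank, 0) + 1
            vh_chunks[vh] -= 1
            remaining -= 1
        return placement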

SLIDE 100-102

Jenga adds small overheads

- Hardware overheads
  - The VHT requires ~2.4 KB per tile
  - Monitors are 8 KB x 2 per tile
  - In total, Jenga adds ~20 KB per tile, 4% of the SRAM banks
  - Similar to Jigsaw
- Software overheads
  - 0.4% of system cycles at 36 tiles
  - Runs concurrently with applications; only needs to pause cores to update VHTs
  - Trivial to parallelize
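
A quick consistency check on those figures, assuming the evaluated system's 18MB of SRAM is spread evenly over the 36 tiles:

    sram_per_tile_kb = 18 * 1024 / 36   # = 512 KB SRAM bank per tile
    overhead_kb = 2.4 + 2 * 8           # VHT + two monitors; ~20 KB with other state
    print(f"{overhead_kb / sram_per_tile_kb:.1%}")  # ~3.6%, roughly the quoted 4%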

SLIDE 103

See paper for...

- Hardware support for
  - Fast reconfiguration
  - Page reclassification
- Efficient implementation of hierarchy allocation
- OS integration

SLIDE 104-107

Evaluation

- Modeled system
  - 36 cores on a 6x6 mesh
  - 18MB SRAM
  - 1GB stacked DRAM
- Workloads
  - 36 copies of the same app (SPECrate)
  - Random mixes of 36 SPEC CPU apps
  - 36-threaded SPEC OMP apps
- Compared 5 schemes:

  Scheme   | SRAM            | DRAM
  ---------|-----------------|----------
  S-NUCA   | Rigid L3        | -
  Alloy    | Rigid L3        | Rigid L4
  Jigsaw   | App-specific L3 | -
  JigAlloy | App-specific L3 | Rigid L4
  Jenga    | App-specific virtual hierarchies (both)

SLIDE 108-114

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> rigid SRAM L3 (holds partial array data, ~100% miss rate) -> Memory]
- Wasteful accesses to the L3: they should have gone to memory directly

SLIDE 115-120

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> rigid SRAM L3 (~100% miss rate) -> rigid DRAM L4 (~0% miss rate) -> Memory]
- The rigid DRAM L4 caches the working sets, but every access still traverses the useless SRAM L3

SLIDE 121-125

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> app-specific SRAM L3 (~90% miss rate) -> Memory]
- An app-specific SRAM L3 removes 10% of the misses, but the rest still go to memory

SLIDE 126-132

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> app-specific SRAM L3 (~90% miss rate) -> rigid DRAM L4 (~0% miss rate) -> Memory]
- Combines Jigsaw's and Alloy's benefits, but is still a rigid hierarchy

SLIDE 133-138

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> per-app 6MB SRAM+DRAM, VL1-only hierarchy (~0% miss rate) -> Memory. Result annotations: 60% better, 20% better.]
- A single lookup reaches the working set: no wasteful lookups!

Jenga improves performance and energy efficiency by creating the right hierarchy using the best available resources!

SLIDE 139-141

Jenga works across a wide range of behaviors

  Working set              | Jenga VHs
  -------------------------|--------------------------
  Two-level: 0.5MB + 16MB  | SRAM VL1 + DRAM VL2
  Two-level: 1MB + 8MB     | SRAM+DRAM VL1 + DRAM VL2
  Flat: 2.5MB              | SRAM+DRAM VL1
  Flat: 8MB                | SRAM+DRAM VL1
  Flat: >50MB              | DRAM VL1, or no caching

SLIDE 142-145

Jenga works for random multi-program mixes

[Figure: results over random mixes: 2.6X over S-NUCA and 20% over JigAlloy; 1.7X over S-NUCA and 10% over JigAlloy]

Jenga consistently outperforms the other schemes for multi-program mixes.

SLIDE 146

See paper for more results

- Full results for SPEC CPU rate
- Multithreaded apps
- Sensitivity studies of Jenga's software techniques
- 2.5D DRAM architectures
- Jigsaw SRAM L3 + Jigsaw DRAM L4
- And more

SLIDE 147-150

Conclusion

- Rigid, multi-level cache hierarchies are ill-suited to many applications
  - They cause significant overhead when they are not helpful
- We propose Jenga, a software-defined, reconfigurable cache hierarchy
  - Adopts application-specific organizations on the fly
  - Uses new software algorithms to find near-optimal hierarchies efficiently
- Jenga improves both performance and energy efficiency, by up to 85% in EDP, over a combination of state-of-the-art techniques

SLIDE 151-152

Thank you for your attention! Questions?

Jenga: Software-Defined Cache Hierarchies