Mar 15, 2023

Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang

SLIDE 1

Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache

Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang

SLIDE 2

Next generation NVM has potential for a bigger and more energy-efficient LLC
Potential ~3x capacity gain over state-of-the-art SRAM with a logic-compatible process, non-volatility
Write error rate (WER) target for industry LLC adoption increases write latency in practice
Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC
Novel Non-volatile Memory based Last Level Cache (NVMLLC) Architecture

SLIDE 3

Agenda

Motivation & problem
Current solutions
Our proposals
Results

SLIDE 4

NVMs offer capacity advantages over SRAMs for LLC

NVMs promise high density
Spin Torque Transfer (STT) RAM, Spin Hall Effect (SHE) MRAM, etc.
Can build large LLCs
Significant power/density benefits over SRAM LLC

SLIDE 5

Advantage of increasing LLC capacity

[Figure: bar chart, performance normalized to 4MB SRAM LLC]
SRAM 4MB: 1.00
SRAM 8MB: 1.15
SRAM 16MB: 1.23

SLIDE 6

But, high write latency negates the capacity gains

[Figure: bar chart, performance normalized to 4MB SRAM LLC]
SRAM 4MB: 1.00
STTRAM 8MB WR +0ns: 1.15
STTRAM 8MB WR +5ns: 1.13
STTRAM 8MB WR +10ns: 1.05
STTRAM 8MB WR +20ns: 0.85

SLIDE 7

None of the current techniques reduce the write latency enough

Architectural Techniques
Dead block predictor for bypassing

LAP
Hybrid Cache
Circuit and Device Techniques
Increase bit-cell transistor size, trade-off latency with retention/higher WER, new devices, etc.

SLIDE 8

Our Proposal:

1. Reduce Write Interference
2. Eliminate Redundant Writes

SLIDE 9

Reduce Write Interference

Many programs exhibit long phases of high read or write traffic
Usual Dead Block Predictor based bypassing not sufficient
Need more aggressive write bypassing to reduce write interference

[Figure: number of requests arrived (num_writes, num_reads) per interval of 10k cycles, workload gcc.200]

SLIDE 10

Write Congestion Aware Bypassing (WCAB)

Flowchart:
Check: request queue is full && pending writes > write_th
If not: don't bypass
Otherwise: get average write occupancy calculated in intervals (int_write_occ)
Refer to the Lookup Table to find the bypass score threshold (byp_score_th) for int_write_occ
Find the pending write with the lowest live score (min_score)
If min_score <= byp_score_th: bypass the write with min_score; otherwise don't bypass
Scheduling: if any read is ready, send read; otherwise send write

Lookup Table (Tuned)
Interval write occupancy (int_write_occ): Bypass score threshold (byp_score_th)
1/4th of request queue: 20%
Half of request queue: 50%
3/4th of request queue: 70%
Equal to request queue: 100%

write_th: 75% of request queue
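
The WCAB bypass decision on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The names `PendingWrite`, `bypass_score_threshold` and `select_bypass` are hypothetical, the live score is assumed to be normalized to [0, 1] so it can be compared with the percentage thresholds, and each lookup-table row is assumed to be the upper bound of an occupancy range (the slide lists only the four points).

```python
from dataclasses import dataclass
from typing import List, Optional

# Tuned lookup table from the slide: (interval write occupancy as a fraction
# of the request queue, bypass score threshold). Treating each row as the
# upper bound of a range is an assumption.
LOOKUP_TABLE = [(0.25, 0.20), (0.50, 0.50), (0.75, 0.70), (1.00, 1.00)]
WRITE_TH = 0.75  # write_th from the slide: 75% of the request queue


@dataclass
class PendingWrite:
    addr: int
    live_score: float  # assumed normalized to [0, 1]


def bypass_score_threshold(int_write_occ: float, queue_size: int) -> float:
    """Map the interval write occupancy to byp_score_th via the lookup table."""
    frac = int_write_occ / queue_size
    for upper_bound, threshold in LOOKUP_TABLE:
        if frac <= upper_bound:
            return threshold
    return LOOKUP_TABLE[-1][1]


def select_bypass(queue_len: int, queue_size: int,
                  pending_writes: List[PendingWrite],
                  int_write_occ: float) -> Optional[PendingWrite]:
    """Return the write to bypass, or None for "don't bypass"."""
    # Bypass is only considered when the queue is full and writes dominate it.
    congested = (queue_len >= queue_size
                 and len(pending_writes) > WRITE_TH * queue_size)
    if not congested:
        return None
    threshold = bypass_score_threshold(int_write_occ, queue_size)
    victim = min(pending_writes, key=lambda w: w.live_score)  # min_score
    return victim if victim.live_score <= threshold else None


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.95, 0.6, 0.7, 0.85]
    writes = [PendingWrite(addr, s) for addr, s in enumerate(scores)]
    print(select_bypass(8, 8, writes, int_write_occ=6.0))
```

With a full 8-entry queue holding 7 pending writes and an interval occupancy of 6 entries, the threshold is 70%, so the write with live score 0.3 is selected. With an occupancy of 2 entries the threshold drops to 20% and nothing is bypassed, which is how the table makes bypassing more aggressive only under sustained write pressure.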

SLIDE 11

Eliminate Redundant Writes

Significant percentage of frequent clean and dirty fills in LLC
Dirty fills generate writes in both Exclusive and Inclusive LLC
Clean fills create writes in Exclusive LLC

[Figure: percentage of frequent clean and dirty fills in LLC, 0% to 100%; series: frequent clean fills, frequent dirty fills, one time fills]

SLIDE 12

Virtual Hybrid Cache (VHC)

Write Merging in L2
Frequent dirty lines stay in L2 for longer
Used existing technique to classify frequent dirty lines
Many writes merge in L2 reducing fills in LLC
Relaxed Exclusivity (duplicate lines between L2 and LLC)
Enhancement over LAP for Exclusive Cache
Retain the duplicate lines near LRU to reduce hit rate loss
Dirty lines (whenever found) not duplicated in LLC
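
The relaxed-exclusivity bullets can be illustrated with a toy model of a single LLC set. This is a sketch under stated assumptions rather than the VHC design itself: the class `LLCSet` and its method names are hypothetical, "near LRU" is taken to mean one slot above the LRU position, and updating replacement state is assumed not to cost an NVM data-array write.

```python
class LLCSet:
    """Toy model of one LLC set; self.lines is ordered MRU first, LRU last."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = []
        self.data_writes = 0  # writes to the NVM data array

    def hit_from_l2(self, tag: str) -> bool:
        """L2 miss served by the LLC: keep a duplicate, demoted near LRU."""
        if tag not in self.lines:
            return False
        self.lines.remove(tag)
        # Re-insert one slot above the LRU position (assumed meaning of
        # "near LRU") so the duplicate costs little effective capacity.
        self.lines.insert(max(len(self.lines) - 1, 0), tag)
        return True

    def on_l2_dirty(self, tag: str) -> None:
        """The line became dirty in L2: drop the stale duplicate."""
        if tag in self.lines:
            self.lines.remove(tag)

    def on_l2_evict(self, tag: str, dirty: bool) -> bool:
        """L2 evicts a line; returns True if an NVM data write was needed."""
        if not dirty and tag in self.lines:
            return False  # clean duplicate already present: redundant write
        if tag in self.lines:
            self.lines.remove(tag)
        if len(self.lines) == self.ways:
            self.lines.pop()  # evict the LRU line
        self.lines.insert(0, tag)
        self.data_writes += 1
        return True


if __name__ == "__main__":
    s = LLCSet(ways=4)
    s.on_l2_evict("A", dirty=False)  # first clean fill writes the line
    s.hit_from_l2("A")               # L2 re-fetches A, duplicate stays
    s.on_l2_evict("A", dirty=False)  # second clean eviction needs no write
    print(s.data_writes)
```

In a strictly exclusive LLC the second clean eviction of "A" would write the line again; keeping the duplicate is what removes that write in this model.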

SLIDE 13

Simulation Methodology & Results

SLIDE 14

Simulation Methodology

Used a modified version of Multi2Sim simulating 4 x86 cores
Core parameters similar to Intel Skylake
SRAM baseline: 4MB, 4 banks, 16 ways with round trip delay of 20 cycles
STTRAM baseline: 8MB, 8 banks, additional write latency of 20ns
Workloads:
Selected 20 workloads from SPEC 2006 and HPCG
With High L2 MPKI and a range of LLC MPKIs (Table 1 in the paper)
20 homogeneous and 44 heterogeneous (by randomly mixing the 20 workloads)

SLIDE 15

Performance vs STTRAM LLC Baseline

[Figure: performance normalized to STTRAM 8MB baseline; series: WCAB, WCAB+VHC]

Our proposals provide 26% performance gain over the baseline

SLIDE 16

Performance vs Similar Area SRAM LLC

Performance normalized to SRAM 4MB

Configuration    STT - baseline    STT - Proposed Architecture
5 ns, 7 MB       1.10              1.12
10 ns, 12 MB     1.07              1.18
20 ns, 16 MB     0.87              1.12
30 ns, 20 MB     0.71              1.03

Our proposals provide up to 18% performance gain over an SRAM LLC of the same area

SLIDE 17

Performance vs Prior Art

Performance normalized to 8MB STTRAM baseline

Configuration                         Homogeneous    Heterogeneous    Geomean
Hybrid LLC - 2MB SRAM, 4MB STTRAM     1.10           1.09             1.09
Hybrid LLC - 1MB SRAM, 6MB STTRAM     1.05           1.03             1.04
STTRAM LLC - LAP                      1.07           1.13             1.11
STTRAM LLC - Proposed Architecture    1.18           1.30             1.26

Our proposals perform significantly better than the prior art
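
As a worked check on the chart values, the Geomean figures appear consistent with a geometric mean over all 64 workload mixes from the Simulation Methodology slide (20 homogeneous, 44 heterogeneous). The sketch below assumes that weighting and that the chart values are grouped per configuration as (Homogeneous, Heterogeneous); the slides do not state either explicitly, and the function name `combined_geomean` is hypothetical.

```python
import math

N_HOMOGENEOUS = 20    # homogeneous mixes, from the methodology slide
N_HETEROGENEOUS = 44  # heterogeneous mixes, from the methodology slide


def combined_geomean(homogeneous: float, heterogeneous: float) -> float:
    """Geomean over all mixes, given the geomean of each group.

    Assumes the overall figure weights each group by its workload count.
    """
    log_sum = (N_HOMOGENEOUS * math.log(homogeneous)
               + N_HETEROGENEOUS * math.log(heterogeneous))
    return math.exp(log_sum / (N_HOMOGENEOUS + N_HETEROGENEOUS))


# (Homogeneous, Heterogeneous) values read off the chart.
chart = {
    "Hybrid LLC - 2MB SRAM, 4MB STTRAM": (1.10, 1.09),
    "Hybrid LLC - 1MB SRAM, 6MB STTRAM": (1.05, 1.03),
    "STTRAM LLC - LAP": (1.07, 1.13),
    "STTRAM LLC - Proposed Architecture": (1.18, 1.30),
}

if __name__ == "__main__":
    for name, (homo, hetero) in chart.items():
        print(f"{name}: {combined_geomean(homo, hetero):.2f}")
```

Under these assumptions the proposed architecture works out to about 1.26, in line with the 26% gain over the STTRAM baseline reported earlier in the deck.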

SLIDE 18

Conclusions

Next generation NVM has potential for a bigger and more energy-efficient LLC
Architectural solutions are required to absorb the high write latency and obtain the capacity benefits
Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC

THANK YOU!!