Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang


slide-1
SLIDE 1

Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache

Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang

slide-2
SLIDE 2
Novel Non-volatile Memory based Last Level Cache (NVMLLC) Architecture

  • Next-generation NVM has the potential to enable a bigger and more energy-efficient LLC
    • Potential ~3x capacity gain over state-of-the-art SRAM, with a logic-compatible process and non-volatility
  • The write error rate (WER) target for industry LLC adoption increases write latency in practice
  • Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC

slide-3
SLIDE 3


Agenda

  • Motivation & problem
  • Current solutions
  • Our proposals
  • Results
slide-4
SLIDE 4

NVMs offer capacity advantages over SRAMs for LLC

  • NVMs promise high density
    • Spin Torque Transfer (STT) RAM, Spin Hall Effect (SHE) MRAM, etc.
  • Can build large LLCs
  • Significant power/density benefits over SRAM LLC

slide-5
SLIDE 5

Advantage of increasing LLC capacity

[Figure: performance normalized to 4MB SRAM LLC]

  • SRAM 4MB: 1.00
  • SRAM 8MB: 1.15
  • SRAM 16MB: 1.23

slide-6
SLIDE 6

But, high write latency negates the capacity gains

[Figure: performance normalized to 4MB SRAM LLC]

  • SRAM 4MB: 1.00
  • STTRAM 8MB, WR +0ns: 1.15
  • STTRAM 8MB, WR +5ns: 1.13
  • STTRAM 8MB, WR +10ns: 1.05
  • STTRAM 8MB, WR +20ns: 0.85

slide-7
SLIDE 7

None of the current techniques reduce the write latency enough

  • Architectural techniques
    • Dead block predictor for bypassing
    • LAP
    • Hybrid cache
  • Circuit and device techniques
    • Increase bit-cell transistor size, trade off latency against retention/higher WER, new devices, etc.

slide-8
SLIDE 8

Our Proposal:

  1. Reduce Write Interference
  2. Eliminate Redundant Writes

slide-9
SLIDE 9

Proposal 1: Reduce Write Interference

  • Many programs exhibit long phases of heavy read/write traffic
  • The usual dead block predictor based bypassing is not sufficient
  • More aggressive write bypassing is needed to reduce write interference

[Figure: number of requests arrived (num_writes, num_reads) per interval of 10k cycles, for gcc.200]

slide-10
SLIDE 10

Proposal 1: Write Congestion Aware Bypassing (WCAB)

Decision flow:

  1. Is the request queue full && pending writes > write_th?
     NO: don't bypass.
  2. Get the average write occupancy calculated in intervals (int_write_occ).
  3. Refer to the lookup table to find the bypass score threshold (byp_score_th) for int_write_occ.
  4. Find the pending write with the lowest live score (min_score).
  5. Is min_score <= byp_score_th?
     NO: don't bypass.
     YES: bypass the write with min_score.

Request scheduling: if any read is ready, send the read; if NO, send a write.

Lookup table (tuned):

  Interval write occupancy (int_write_occ)   Bypass score threshold (byp_score_th)
  1/4th of request queue                     20%
  Half of request queue                      50%
  3/4th of request queue                     70%
  Equal to request queue                     100%

  write_th: 75% of request queue
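The WCAB decision on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `PendingWrite`, `bypass_score_threshold` and `select_write_to_bypass` are hypothetical, live scores are assumed to be normalized to the range 0 to 1, and occupancies that fall between two lookup-table rows are assumed to round up to the next row. Only the thresholds and write_th come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lookup table from the slide: (interval write occupancy as a fraction of the
# request queue, bypass score threshold as a fraction).
LOOKUP_TABLE = [(0.25, 0.20), (0.50, 0.50), (0.75, 0.70), (1.00, 1.00)]
WRITE_TH = 0.75  # write_th from the slide: 75% of the request queue


@dataclass
class PendingWrite:
    addr: int
    live_score: float  # assumption: normalized to [0, 1]


def bypass_score_threshold(int_write_occ: float) -> float:
    """Map interval write occupancy to byp_score_th.

    Assumption: an occupancy between two table rows uses the next row up.
    """
    for occupancy, threshold in LOOKUP_TABLE:
        if int_write_occ <= occupancy:
            return threshold
    return LOOKUP_TABLE[-1][1]


def select_write_to_bypass(
    queue_len: int,
    queue_capacity: int,
    pending_writes: List[PendingWrite],
    int_write_occ: float,
) -> Optional[PendingWrite]:
    """Return the pending write to bypass, or None for "don't bypass"."""
    queue_full = queue_len >= queue_capacity
    too_many_writes = len(pending_writes) > WRITE_TH * queue_capacity
    if not (queue_full and too_many_writes):
        return None  # first NO branch: no write congestion

    byp_score_th = bypass_score_threshold(int_write_occ)
    candidate = min(pending_writes, key=lambda w: w.live_score)  # min_score
    if candidate.live_score <= byp_score_th:
        return candidate  # bypass the write with min_score
    return None  # second NO branch: the lowest-scoring write is still too live


# Usage: a full 16-entry queue holding 13 writes, at half-queue write occupancy.
writes = [PendingWrite(addr=i, live_score=0.3 + 0.05 * i) for i in range(13)]
chosen = select_write_to_bypass(16, 16, writes, int_write_occ=0.5)
```

In this sketch the threshold rises with write occupancy, so a more congested queue bypasses writes that would otherwise have been kept.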

slide-11
SLIDE 11

Proposal 2: Eliminate Redundant Writes

  • A significant percentage of LLC fills are frequent clean and dirty fills
  • Dirty fills generate writes in both exclusive and inclusive LLCs
  • Clean fills create writes in an exclusive LLC

[Figure: percentage of frequent clean and dirty fills in LLC, 0% to 100%; series: frequent clean fills, frequent dirty fills, one time fills]
slide-12
SLIDE 12

Proposal 2: Virtual Hybrid Cache (VHC)

  • Write merging in L2
    • Frequent dirty lines stay in L2 for longer
    • Used existing technique to classify frequent dirty lines
    • Many writes merge in L2, reducing fills in LLC
  • Relaxed exclusivity (duplicate lines between L2 and LLC)
    • Enhancement over LAP for exclusive cache
    • Retain the duplicate lines near LRU to reduce hit rate loss
    • Dirty lines (whenever found) not duplicated in LLC
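The write-merging idea can be illustrated with a toy model. The sketch below is not the VHC mechanism itself: it assumes a tiny fully associative write-back L2 with LRU replacement, a trace made only of stores, and a `frequent_dirty` set supplied by the caller (the slide says an existing technique does this classification). The function name `count_llc_fills` is hypothetical. Keeping frequent dirty lines in L2 is modeled by never choosing them as the victim while another line is available.

```python
from collections import OrderedDict


def count_llc_fills(trace, l2_lines, frequent_dirty=frozenset()):
    """Count dirty evictions from a toy L2; each one is a fill (write) in the LLC."""
    l2 = OrderedDict()  # line address -> dirty flag, ordered from LRU to MRU
    fills = 0
    for addr in trace:  # every access in this toy trace is a store
        if addr in l2:
            l2.move_to_end(addr)  # the store merges into the line already in L2
            continue
        if len(l2) >= l2_lines:
            # Evict the least recently used line that is not frequent-dirty;
            # fall back to plain LRU if every resident line is frequent-dirty.
            victim = next((a for a in l2 if a not in frequent_dirty), None)
            if victim is None:
                victim = next(iter(l2))
            del l2[victim]
            fills += 1  # the dirty victim is written into the LLC
        l2[addr] = True
    return fills


# Line "A" is written repeatedly between one-time writes to other lines.
trace = ["A", "B", "C", "A", "D", "A", "E", "A"]
plain_lru = count_llc_fills(trace, l2_lines=2)
with_retention = count_llc_fills(trace, l2_lines=2, frequent_dirty={"A"})
```

With a two-line L2, plain LRU evicts "A" once before it is rewritten, while the retention variant keeps "A" resident so all later stores to it merge in L2. The fill count for this trace drops from 4 to 3.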

slide-13
SLIDE 13

Simulation Methodology & Results


slide-14
SLIDE 14

Simulation Methodology

  • Used a modified version of Multi2Sim, simulating 4 x86 cores
  • Core parameters similar to Intel Skylake
  • SRAM baseline: 4MB, 4 banks, 16 ways, with a round-trip delay of 20 cycles
  • STTRAM baseline: 8MB, 8 banks, additional write latency of 20ns
  • Workloads:
    • Selected 20 workloads from SPEC 2006 and HPCG
    • With high L2 MPKI and a range of LLC MPKIs (Table 1 in the paper)
    • 20 homogeneous and 44 heterogeneous mixes (heterogeneous mixes built by randomly mixing the 20 workloads)
slide-15
SLIDE 15

Performance vs STTRAM LLC Baseline

[Figure: performance normalized to STTRAM 8MB baseline; series: WCAB, WCAB+VHC]

Our proposals provide a 26% performance gain over the baseline

slide-16
SLIDE 16

Performance vs Similar Area SRAM LLC

[Figure: performance normalized to SRAM 4MB]

  Configuration    STT - baseline    STT - Proposed Architecture
  5 ns, 7 MB       1.10              1.12
  10 ns, 12 MB     1.07              1.18
  20 ns, 16 MB     0.87              1.12
  30 ns, 20 MB     0.71              1.03

Our proposals provide up to 18% performance gain over the SRAM of same area
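
The "up to 18%" figure follows directly from the normalized values reported on this slide. The short sketch below recomputes it, along with the gain of the proposed architecture over the STT baseline at each configuration. The pairing of values to configurations is read off the chart in order, and the helper name `percent_gain` is an assumption for illustration.

```python
# Values read off the slide, all normalized to the 4MB SRAM LLC (= 1.00).
CONFIGS = ["5 ns, 7 MB", "10 ns, 12 MB", "20 ns, 16 MB", "30 ns, 20 MB"]
STT_BASELINE = [1.10, 1.07, 0.87, 0.71]
STT_PROPOSED = [1.12, 1.18, 1.12, 1.03]


def percent_gain(value, reference=1.0):
    """Relative gain of `value` over `reference`, in percent."""
    return (value / reference - 1.0) * 100.0


# Gain of the proposed architecture over the SRAM reference (1.00).
gain_vs_sram = [round(percent_gain(p)) for p in STT_PROPOSED]

# Gain of the proposed architecture over the STT baseline at the same point.
gain_vs_stt = [
    round(percent_gain(p, b)) for p, b in zip(STT_PROPOSED, STT_BASELINE)
]

best_gain, best_config = max(zip(gain_vs_sram, CONFIGS))
print(f"Best gain over SRAM: {best_gain}% at {best_config}")
for config, gain in zip(CONFIGS, gain_vs_stt):
    print(f"{config}: {gain}% over STT baseline")
```

The gap between baseline and proposed widens as the write latency label grows, which is consistent with the deck's claim that the proposals absorb write latency.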

slide-17
SLIDE 17

Performance vs Prior Art

[Figure: performance normalized to 8MB STTRAM baseline]

  Configuration                          Homogeneous    Heterogeneous    Geomean
  Hybrid LLC - 2MB SRAM, 4MB STTRAM      1.10           1.09             1.09
  Hybrid LLC - 1MB SRAM, 6MB STTRAM      1.05           1.03             1.04
  STTRAM LLC - LAP                       1.07           1.13             1.11
  STTRAM LLC - Proposed Architecture     1.18           1.30             1.26

Our proposals perform significantly better than the prior art
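
As a consistency check on the numbers reported on this slide, the Geomean values can be reproduced from the Homogeneous and Heterogeneous values. The sketch assumes the Geomean pools all 64 workload mixes (20 homogeneous, 44 heterogeneous, as stated in the methodology) and that each group value is itself a geometric mean. The slides do not state how the Geomean was computed, so this is an inference, and `pooled_geomean` is a hypothetical helper.

```python
import math

HOMOGENEOUS_MIXES = 20    # from the methodology slide
HETEROGENEOUS_MIXES = 44  # from the methodology slide

# (homogeneous, heterogeneous, reported geomean) as read off the slide.
REPORTED = {
    "Hybrid LLC - 2MB SRAM, 4MB STTRAM": (1.10, 1.09, 1.09),
    "Hybrid LLC - 1MB SRAM, 6MB STTRAM": (1.05, 1.03, 1.04),
    "STTRAM LLC - LAP": (1.07, 1.13, 1.11),
    "STTRAM LLC - Proposed Architecture": (1.18, 1.30, 1.26),
}


def pooled_geomean(homogeneous, heterogeneous):
    """Geometric mean over all mixes, given the geometric mean of each group."""
    total = HOMOGENEOUS_MIXES + HETEROGENEOUS_MIXES
    log_sum = (
        HOMOGENEOUS_MIXES * math.log(homogeneous)
        + HETEROGENEOUS_MIXES * math.log(heterogeneous)
    )
    return math.exp(log_sum / total)


pooled = {
    name: round(pooled_geomean(homo, hetero), 2)
    for name, (homo, hetero, _) in REPORTED.items()
}
for name, (_, _, reported) in REPORTED.items():
    print(f"{name}: pooled {pooled[name]:.2f}, reported {reported:.2f}")
```

Under these assumptions the pooled value matches the reported Geomean for all four configurations, including the 1.26 that corresponds to the 26% gain quoted earlier in the deck.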

slide-18
SLIDE 18

Conclusions

  • Next-generation NVM has the potential to enable a bigger and more energy-efficient LLC
  • Architectural solutions are required to absorb the high write latency and obtain the capacity benefits
  • Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC

THANK YOU!!