RELIABILITY OF RESISTIVE MEMORIES
Mahdi Nazm Bojnordi
Assistant Professor
School of Computing
University of Utah
CS/ECE 7810: Advanced Computer Architecture
Overview
¨ Upcoming deadlines
¤ April 6th: student paper presentation
¨ This lecture
¤ Hard errors in resistive memories
¤ Increasing reliability by replication, ECP, SAFER, FREE-p
¤ Resistive computing
Recall: Resistive vs. Dynamic RAM
¨ Phase-Change RAM
¤ Nonvolatile
¤ Projected to be more scalable
¤ Cells may be written individually
¤ Slower, with more energy-intensive writes
¤ Susceptible to hard errors
¨ DRAM
¤ Volatile, charge-based
¤ Difficult to scale the capacitor down further
¤ All accesses go through the row buffer
¤ Faster, with acceptable energy consumption
¤ Vulnerable to soft errors
Solutions to Memory Hard Errors
¨ Accept failure of some fraction of pages
¤ Map failed pages out of logical memory
¨ Wear-level data pages/blocks, and within blocks
¤ Shift/rotate data randomly (intervals/locations)
¨ Differential writes
¤ Write only cells with values that change
¨ Correct errors when possible
¤ Error correction techniques
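The differential-write bullet above can be illustrated with a minimal sketch. It assumes a word is modeled as a Python integer and that the old value has already been read back; the function name `differential_write` and the 8-bit example words are hypothetical, chosen only for illustration.

```python
def differential_write(old: int, new: int, width: int = 64) -> list[int]:
    """Return the bit positions that must be programmed to turn `old` into `new`.

    Only cells whose value changes are written; all other cells are left
    untouched, which is what saves wear.
    """
    changed = (old ^ new) & ((1 << width) - 1)  # 1 wherever the two words differ
    return [i for i in range(width) if (changed >> i) & 1]


# Hypothetical 8-bit example: only two of the eight cells differ.
old_word = 0b10110010
new_word = 0b10010110
cells = differential_write(old_word, new_word, width=8)
print(cells)  # positions of the cells that actually get written
```

In this sketch the write cost is proportional to the number of changed cells rather than to the word width, at the price of one extra read before each write.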
Error Correction Techniques
¨ No correction (detection only)
¤ Inefficient
¤ A page must be retired when the first cell fails
¨ SECDED ECC
¤ With a 12.5% memory overhead
[Figure: 8 chips × 8 bits/chip = 64 data bits; SEC/SECDED adds 7/8 check bits, a 10.9%/12.5% overhead]
¤ A page must be retired when a block within the page suffers a second error
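The 7/8 check bits and the 10.9%/12.5% overheads quoted for a 64-bit word follow from the standard Hamming bound: single-error correction needs the smallest r with 2^r >= m + r + 1 for m data bits, and SECDED adds one overall parity bit. A small sketch of that arithmetic (the function names are hypothetical):

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (Hamming SEC bound)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SECDED = SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


sec = sec_check_bits(64)
secded = secded_check_bits(64)
print(f"SEC:    {sec} bits, {100 * sec / 64:.1f}% overhead")
print(f"SECDED: {secded} bits, {100 * secded / 64:.1f}% overhead")
```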
Error Correction Codes
¨ Good for soft errors
¤ Transient errors
¨ Not good for hard errors
¤ ECC has high entropy and can hasten wear-out
¤ Flipping just one data bit changes about half of ECC bits
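The wear-out claim above can be checked empirically. The sketch below assumes a textbook (72,64) Hamming SECDED layout, with check bits at power-of-two positions plus one overall parity bit; real memory controllers may use a different code, so the result is only indicative. It flips each data bit in turn and averages how many of the 8 check bits change. All names are hypothetical.

```python
def data_positions(data_bits: int) -> list[int]:
    """Codeword positions (1-based) that hold data: every non-power-of-two."""
    positions, p = [], 1
    while len(positions) < data_bits:
        if p & (p - 1):  # powers of two are reserved for check bits
            positions.append(p)
        p += 1
    return positions


def secded_encode_check(data: int, data_bits: int = 64) -> int:
    """Return the check bits (Hamming bits plus overall parity) for `data`."""
    positions = data_positions(data_bits)
    width = positions[-1].bit_length()  # number of Hamming check bits
    hamming, ones = 0, 0
    for i, p in enumerate(positions):
        if (data >> i) & 1:
            hamming ^= p  # check bit k is the parity of positions with bit k set
            ones += 1
    overall = (ones + bin(hamming).count("1")) & 1  # parity over data + check bits
    return (overall << width) | hamming


def average_check_bit_flips(data_bits: int = 64, data: int = 0) -> float:
    """Average number of check bits that change when one data bit flips."""
    base = secded_encode_check(data, data_bits)
    flips = [
        bin(base ^ secded_encode_check(data ^ (1 << i), data_bits)).count("1")
        for i in range(data_bits)
    ]
    return sum(flips) / data_bits


avg = average_check_bit_flips()
print(f"average check bits changed per data-bit flip: {avg:.2f} of 8")
```

Because the code is linear, the result does not depend on the starting data word; each single-bit data update rewrites several check cells, so the check cells wear faster than the data cells they protect.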
Dynamically Replicated Memory
¨ Goal: handle hard errors by pairing two pages that have faults in different locations; replicate data across the two pages
¨ How: errors are detected with parity bits; replica reads are issued if the initial read is faulty
[ASPLOS’10]
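The pairing and read-fallback ideas above can be sketched as follows. This is an illustrative model, not the algorithm from the cited paper: fault locations are plain integer offsets, pairing is a simple greedy first-fit, and the names `compatible`, `pair_pages`, and `read_with_replica` are hypothetical.

```python
def compatible(faults_a: set[int], faults_b: set[int]) -> bool:
    """Two faulty pages can back each other up if no location fails in both."""
    return faults_a.isdisjoint(faults_b)


def pair_pages(faulty_pages: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Greedy first-fit pairing of faulty pages (an assumed policy)."""
    unpaired = list(faulty_pages)
    pairs = []
    while unpaired:
        page = unpaired.pop(0)
        for other in unpaired:
            if compatible(faulty_pages[page], faulty_pages[other]):
                pairs.append((page, other))
                unpaired.remove(other)
                break  # stop iterating right after mutating the list
    return pairs


def parity(byte: int) -> int:
    return bin(byte).count("1") & 1


def read_with_replica(primary: tuple[int, int], replica: tuple[int, int]) -> int:
    """Each argument is a (byte, stored_parity) pair read from one page.

    The replica is consulted only when the primary fails its parity check.
    """
    byte, stored = primary
    if parity(byte) == stored:
        return byte
    return replica[0]


# Hypothetical fault maps: pages 10 and 11 both fail at offset 3, so they
# cannot be paired with each other; page 12 fails elsewhere.
pairs = pair_pages({10: {3, 17}, 11: {3}, 12: {40}})
print(pairs)
```

In the common case a read touches only the primary page; the second access is paid only when parity flags a fault.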
Dynamically Replicated Memory
¨ Improves the lifetime of PCM by up to 40x over conventional error-detection techniques
[ASPLOS’10]
Error Correction Pointers
¨ Key idea: instead of using ECC to handle a few transient faults in DRAM, use error-correcting pointers to handle hard errors in specific locations
¨ For a 512-bit line with 1 failed bit, maintain a 9-bit field to track the failed location and another bit to store the value in that location
¨ Can store multiple such pointers and can recover from faults in the pointers too
[ISCA’10]
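A minimal sketch of the pointer mechanism described above, assuming a line is modeled as a Python integer and each correction entry is a (failed position, replacement value) pair. Applying entries in order, so that a later entry overrides an earlier one aimed at the same position, is an assumption of this sketch about how a faulty entry could be superseded. The overhead formula counts one extra "Full?" bit per line; the function names are hypothetical.

```python
def ecp_overhead_bits(line_bits: int, entries: int) -> int:
    """Bits needed for `entries` correction entries plus one 'Full?' bit."""
    pointer_bits = (line_bits - 1).bit_length()  # 9 bits for a 512-bit line
    return entries * (pointer_bits + 1) + 1      # pointer + replacement cell each


def ecp_read(raw_line: int, entries: list[tuple[int, int]]) -> int:
    """Patch the raw line with the replacement value of each correction entry."""
    line = raw_line
    for position, value in entries:  # later entries override earlier ones
        line = (line & ~(1 << position)) | (value << position)
    return line


# Hypothetical example: bit 5 is stuck at 0 but should hold 1.
raw = 0b10000000
fixed = ecp_read(raw, [(5, 1)])
print(bin(fixed))
print(ecp_overhead_bits(512, 1), "bits for one entry on a 512-bit line")
```

Unlike ECC check bits, a correction entry is written only when a cell fails, so the metadata itself sees very little wear.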
Error Correction Pointers
[Figure: a row of data cells (labeled 511, 510, 509, 508, …) alongside one correction entry, made of a correction pointer and a replacement cell (R), plus a "Full?" bit]
[ISCA’10]
Error Correction Pointers
[Figure: the same row of data cells with multiple correction entries, each holding a correction pointer and a replacement cell (R), plus a "Full?" field]
[ISCA’10]
Error Correction Pointers
[Figure: data cells and correction entries, with a fault inside one of the correction entries]
¨ What if a correction entry fails?
[ISCA’10]
Stuck-At-Fault Error Recovery
¨ Observation: a failed cell with a stuck-at value is still readable
¨ Goal: either write the word or its flipped version so that the failed bit is made to store the stuck-at value
¨ For multi-bit errors, the line can be partitioned such that each partition has a single error
¨ Errors are detected by verifying a write; recently failed bit locations are cached so multiple writes can be avoided
[MICRO’10]
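The write-or-flip idea above can be sketched for a single group containing one stuck-at cell. The sketch assumes the stuck position and value are already known (for example from a verify-after-write), and that one extra flag bit per group records whether the stored word is inverted. `separating_bit` shows one simple way to split two faults into different partitions, by picking an index bit in which their positions differ; all names are hypothetical and this is not the exact scheme from the cited paper.

```python
def safer_write(data: int, stuck_pos: int, stuck_val: int, width: int = 8) -> tuple[int, int]:
    """Return (stored_bits, flip_flag) so the stuck cell holds its stuck value."""
    mask = (1 << width) - 1
    flip = ((data >> stuck_pos) & 1) != stuck_val  # would the write hit the fault?
    stored = (~data & mask) if flip else (data & mask)
    return stored, int(flip)


def safer_read(stored: int, flip_flag: int, width: int = 8) -> int:
    """Undo the inversion recorded in the flip flag."""
    mask = (1 << width) - 1
    return (~stored & mask) if flip_flag else stored


def separating_bit(pos_a: int, pos_b: int) -> int:
    """Lowest index bit in which two distinct fault positions differ."""
    diff = pos_a ^ pos_b
    return (diff & -diff).bit_length() - 1  # -1 if the positions are equal


# Hypothetical example: bit 2 is stuck at 0, but the data wants a 1 there.
stored, flag = safer_write(0b10100110, stuck_pos=2, stuck_val=0)
print(bin(stored), flag, bin(safer_read(stored, flag)))
```

Partitioning the line on the bit returned by `separating_bit` places the two faulty cells in different groups, each of which can then be handled with its own flip flag.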
Stuck-At-Fault Error Recovery
¨ Three partition candidates in SAFER
¨ How are two failures detected? (Read the paper.)
[MICRO’10]
Stuck-At-Fault Error Recovery
¨ Failure recovery
[MICRO’10]
Multi-tiered ECC for Hard/Soft Errors
¨ FREE-p: fine-grained remapping with ECC and embedded pointer
¤ Re-use a “dead” 64B block for storing a remap pointer
¤ Architectural techniques to accelerate address remapping
¨ Detection/correction at the memory controller
¤ Allow simple NVRAM devices
¤ Tolerate hard/soft errors in the cell array, periphery, etc.
[HPCA’11]
FREE-p
¨ Embed a 64-bit pointer within a faulty block
¤ There are still-functional bits in a faulty block
¤ 1-bit D/P flag per 64B block
n Identifies whether a block is remapped or not
¤ Avoid chained remapping
n Always embed the FINAL pointer
[HPCA’11]
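The D/P flag and the "final pointer" rule can be sketched with a toy memory model. It assumes each block is a (is_pointer, payload) pair, where the boolean plays the role of the D/P flag, and that a failed write is detected by the caller, which then supplies a spare block. The class and method names are hypothetical.

```python
class FreePMemory:
    """Toy model: each block is (is_pointer, payload); the flag is the D/P bit."""

    def __init__(self) -> None:
        self.blocks: dict[int, tuple[bool, object]] = {}

    def resolve(self, addr: int) -> int:
        """Follow at most one pointer: the embedded pointer is always final."""
        is_pointer, payload = self.blocks.get(addr, (False, None))
        return payload if is_pointer else addr

    def write(self, addr: int, data: object) -> None:
        self.blocks[self.resolve(addr)] = (False, data)

    def read(self, addr: int) -> object:
        return self.blocks[self.resolve(addr)][1]

    def remap(self, addr: int, spare: int, data: object) -> None:
        """Called when the block currently backing `addr` has failed."""
        self.blocks[spare] = (False, data)
        # Overwrite the pointer in the ORIGINAL block so it names the final
        # location directly, instead of chaining through the old spare.
        self.blocks[addr] = (True, spare)


mem = FreePMemory()
mem.write(0x40, "A")
mem.remap(0x40, 0x1000, "A")   # block 0x40 wore out
mem.remap(0x40, 0x2000, "B")   # later, spare 0x1000 wore out too
print(mem.read(0x40), mem.blocks[0x40])
```

Because the original block always holds the final pointer, a read costs at most one extra access regardless of how many times the data has been moved.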
Capacity vs. Lifetime
[HPCA’11]
Resistive Computation
¨ Leverage STT-MRAM for energy efficiency
¤ Near-zero leakage power
¤ Low-energy read operation
¨ Goal: selectively migrate on-chip storage and combinational logic to STT-MRAM to reduce power
¤ On-chip storage: caches, TLBs, register files, queues
¤ Combinational logic: lookup-table (LUT) based computing
[ISCA’10]
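LUT-based computing replaces a block of gates with a stored truth table that is indexed by the input bits. A minimal sketch, assuming a Python list stands in for the memory array (STT-MRAM in the lecture's setting) and using a 1-bit full adder as a hypothetical example function:

```python
def build_lut(func, num_inputs: int) -> list[int]:
    """Precompute the truth table of `func` over all input combinations."""
    return [
        func(*(((index >> i) & 1) for i in range(num_inputs)))
        for index in range(1 << num_inputs)
    ]


def lut_eval(lut: list[int], *inputs: int) -> int:
    """Evaluate by lookup: the input bits form the table address."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return lut[index]


# Two 3-input LUTs implement a full adder (sum and carry-out).
sum_lut = build_lut(lambda a, b, c: a ^ b ^ c, 3)
carry_lut = build_lut(lambda a, b, c: (a & b) | (c & (a ^ b)), 3)

print(lut_eval(sum_lut, 1, 1, 0), lut_eval(carry_lut, 1, 1, 0))
```

Evaluating the function then costs one read of the table, which is why low read energy and near-zero leakage matter for this style of logic.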
Hybrid CMT Pipeline
¨ Small arrays and simple logic in CMOS
¨ Large arrays and complex logic in STT-MRAM
[Figure: hybrid CMT pipeline with fetch logic, instruction buffers (x8), thread select, decode logic, register files (x8), functional units (ALU, FPU), store buffers (x8), I$, I-TLB, D$, D-TLB, shared L2$ banks (x8), and four memory controllers (MC0 to MC3, each with a queue and logic); legend: STT-MRAM LUTs, STT-MRAM arrays, pure CMOS]
[ISCA’10]
System Power
[Figure: two bar charts comparing CMOS and STT-MRAM on a 0% to 100% axis. First chart: total power normalized to CMOS total power, split into leakage power and dynamic power. Second chart: leakage power normalized to CMOS leakage power, split into L2, L1s and TLBs, and cores.]
[ISCA’10]
System Performance
[Figure: system throughput normalized to CMOS (axis 0.2 to 1) for BLAST, BSOM, CG, CHOLESKY, EQUAKE, FFT, KMEANS, LU, MG, OCEAN, RADIX, SWIM, WATER-N, and GEOMEAN]
[ISCA’10]