DRAM RELIABILITY
CS/ECE 7810: Advanced Computer Architecture
Mahdi Nazm Bojnordi
Assistant Professor School of Computing University of Utah
¨ Upcoming deadline
¤ March 4th (11:59PM): homework
¤ Late submission = NO submission
¤ March 25th: sign up for your student paper presentation
¨ This lecture
¤ Memory errors
¤ Error detection vs. correction
¤ Memory scrubbing
¤ Disturbance errors
¨ Any unwanted data change (bit flip)
¤ Storage cell
¤ Sensing circuits
¤ Wires
¨ Soft errors are mainly caused by
¤ Slight manufacturing defects
¤ Gamma rays and alpha particles
¤ Electrical interference
¤ …
¨ Main memory stores a huge number of bits
¤ Nontrivial bit flip probability
¤ Even worse as the technology scales down
¨ Reliable systems must be protected against errors
¨ Techniques
¤ Error detection
▪ Parity is a rudimentary method of checking the data to see if errors exist
¤ Error correction code (ECC)
▪ Additional bits used for error detection and correction
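As a minimal sketch of the parity idea above (the function names and the 8-bit example word are illustrative assumptions, not from the lecture), one even-parity bit detects any single bit flip but misses a double flip:

```python
def parity_bit(bits):
    """Even-parity bit: 1 when the number of 1s in `bits` is odd."""
    return sum(bits) % 2

def has_error(bits, stored_parity):
    """True when the recomputed parity disagrees with the stored one."""
    return parity_bit(bits) != stored_parity

data = [1, 0, 1, 1, 0, 0, 1, 0]      # hypothetical 8-bit word
stored = parity_bit(data)

one_flip = data.copy()
one_flip[3] ^= 1                      # single bit flip: detected

two_flips = one_flip.copy()
two_flips[5] ^= 1                     # second flip: parity matches again

print(has_error(one_flip, stored), has_error(two_flips, stored))
```

Parity only detects; it cannot tell which bit flipped, which is why correction needs the additional ECC bits.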
¨ Example: add redundant bits to the original data
¤ SECDED: Hamming distance (0000, 1111) = 4
Power | Correct (valid code words) | #bits | Comments
Nothing | 0, 1 | 1 |
Single error detection (SED) | 00, 11 | 2 | 01, 10 => errors
Single error correction (SEC) | 000, 111 | 3 | 001, 010, 100 => 0; 110, 101, 011 => 1
Single error correction double error detection (SECDED) | 0000, 1111 | 4 | One 1 => 0; Two 1’s => error; Three 1’s => 1
[adopted from Lipasti]
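The SECDED row of the table can be sketched as a decoder for the 4-bit repetition code. This is only an illustration of the table; `decode_repetition4` and `hamming_distance` are hypothetical helper names.

```python
def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def decode_repetition4(word):
    """Decode a 4-bit word whose only valid code words are 0000 and 1111.

    Returns 0 or 1 when at most one bit flipped, or None when two bits
    flipped (error detected but not correctable).
    """
    ones = sum(word)
    if ones <= 1:
        return 0
    if ones >= 3:
        return 1
    return None

print(hamming_distance([0, 0, 0, 0], [1, 1, 1, 1]))
print(decode_repetition4([0, 1, 0, 0]),
      decode_repetition4([1, 1, 0, 1]),
      decode_repetition4([1, 0, 1, 0]))
```

A word with two 1s is equally far from both code words, so the decoder can only report the error.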
¨ Reduce the overhead by applying the codes to words instead of individual bits

# bits | SED overhead | SECDED overhead
1 | 1 (100%) | 3 (300%)
32 | 1 (3%) | 7 (22%)
64 | 1 (1.6%) | 8 (13%)
n | 1 (1/n) | 1 + log2 n + a little
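The SECDED column can be reproduced with a short calculation, assuming a standard Hamming code extended with one overall parity bit (the lecture does not name the construction, so this is an assumption). The function name `secded_check_bits` is illustrative.

```python
def secded_check_bits(n):
    """Check bits for SECDED over n data bits.

    Assumes a Hamming code (smallest r with 2**r >= n + r + 1)
    plus one extra overall parity bit for double error detection.
    """
    r = 0
    while 2 ** r < n + r + 1:
        r += 1
    return r + 1

for n in (1, 32, 64):
    k = secded_check_bits(n)
    print(f"{n} data bits: {k} check bits ({100 * k / n:.1f}% overhead)")
```

The check-bit count grows roughly with log2 n, so the relative overhead shrinks as the protected word gets wider.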
[Figure: ECC DIMM (9 x8 chips), one of which is the ECC chip]
¨ ECC allows the memory controller to correct cell
[Figure: memory scrubber operating on data and ECC bits. On an error, the ECC logic performs (1) read, (2) recover, (3) write. Labels: Uncorrectable, Data Bits, ECC Bits]
¨ ECC can correct a fixed number of errors
¨ Data can become uncorrectable if ECC is used without scrubbing
¨ Scrubbing prevents errors from accumulating over time
¤ Periodically read all of the memory locations
¤ Check ECC, correct errors, and write corrected data back to memory
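The scrubbing loop above can be sketched on a toy memory. As an assumption for illustration, each word is protected by a Hamming(7,4) single-error-correcting code rather than the SECDED code a real ECC DIMM would use; `encode`, `syndrome`, and `scrub` are hypothetical names.

```python
def syndrome(code):
    """XOR of the 1-based positions that hold a 1; zero for a valid word."""
    s = 0
    for pos in range(1, 8):
        if code[pos]:
            s ^= pos
    return s

def encode(nibble):
    """Hamming(7,4): data bits at positions 3, 5, 6, 7; parity at 1, 2, 4.

    Index 0 of the returned list is unused so that list indices match
    the 1-based code positions.
    """
    code = [0] * 8
    for pos, bit in zip((3, 5, 6, 7), nibble):
        code[pos] = bit
    s = syndrome(code)
    for p in (1, 2, 4):
        code[p] = 1 if s & p else 0
    return code

def scrub(memory):
    """One scrub pass: read every word, fix a single-bit error, write back."""
    corrected = 0
    for word in memory:
        s = syndrome(word)
        if s:
            word[s] ^= 1      # the syndrome is the position of the flipped bit
            corrected += 1
    return corrected

clean = encode([1, 0, 1, 1])

scrubbed = [encode([1, 0, 1, 1])]
scrubbed[0][3] ^= 1           # first soft error
scrub(scrubbed)               # repaired before the next error arrives
scrubbed[0][6] ^= 1           # second soft error
scrub(scrubbed)

unscrubbed = [encode([1, 0, 1, 1])]
unscrubbed[0][3] ^= 1         # both errors accumulate in the same word
unscrubbed[0][6] ^= 1
scrub(unscrubbed)             # too late: the code miscorrects

print(scrubbed[0] == clean, unscrubbed[0] == clean)
```

With a pass between the two flips each error is corrected on its own; when both accumulate first, the single-error-correcting code flips a third bit and the word is lost.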
¨ SECDED Support
¤ 64-bit data word, 8-bit ECC
¨ Chipkill Support
¤ At most one bit from each DRAM chip
¤ 64-bit data word, 8-bit ECC
¤ 8-bit data word, 5-bit ECC
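A small sketch of why chipkill wants at most one bit of a code word on each chip. It assumes a simple round-robin placement of code word bits over chips; the chip counts of 72 and 13 are assumptions chosen to match the 64+8 and 8+5 bit code words, and `max_bits_per_chip` is a hypothetical helper.

```python
from collections import Counter

def max_bits_per_chip(codeword_bits, num_chips):
    """Spread one code word round-robin over the chips and return the
    largest number of its bits that land on a single chip."""
    chip_of_bit = [i % num_chips for i in range(codeword_bits)]
    return max(Counter(chip_of_bit).values())

# 72-bit code word (64 data + 8 ECC) on nine x8 chips:
# one dead chip corrupts up to 8 bits, more than SECDED can handle.
print(max_bits_per_chip(72, 9))

# Same code word spread over 72 chips (assumed), or a 13-bit code word
# (8 data + 5 ECC) spread over 13 chips (assumed): one dead chip
# corrupts a single bit per code word, which SECDED can correct.
print(max_bits_per_chip(72, 72), max_bits_per_chip(13, 13))
```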
¨ Repeated row activations can cause bit flips in adjacent rows
[Apple]
¨ Cause 1: Electromagnetic coupling
¤ Toggling the wordline voltage briefly increases the voltage of adjacent wordlines
¤ Slightly opens adjacent rows → Charge leakage
¨ Cause 2: Conductive bridges
¨ Cause 3: Hot-carrier injection
[slide source:Mutlu]
¨ Throttle accesses to same row
¤ Limit access-interval: ≥500ns
¤ Limit number of accesses: ≤128K (=64ms/500ns)
¨ Refresh more frequently
¤ Shorten refresh-interval by ~7x
[Kim’2014]
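The throttling numbers above can be checked with a short sketch. The `RowThrottle` class is a hypothetical illustration of enforcing a minimum interval between activations of the same row, not a description of any real memory controller.

```python
REFRESH_WINDOW_NS = 64_000_000     # 64 ms refresh window
MIN_ACCESS_INTERVAL_NS = 500       # minimum gap between activations of a row

# Upper bound on activations of one row within a refresh window.
max_activations = REFRESH_WINDOW_NS // MIN_ACCESS_INTERVAL_NS

class RowThrottle:
    """Rejects an activation that comes too soon after the previous
    activation of the same row (illustrative model only)."""

    def __init__(self, min_interval_ns=MIN_ACCESS_INTERVAL_NS):
        self.min_interval_ns = min_interval_ns
        self.last_activation = {}      # row -> time of last accepted activation

    def try_activate(self, row, now_ns):
        last = self.last_activation.get(row)
        if last is not None and now_ns - last < self.min_interval_ns:
            return False               # too soon: the access must wait
        self.last_activation[row] = now_ns
        return True

throttle = RowThrottle()
print(max_activations)
print(throttle.try_activate(7, 0), throttle.try_activate(7, 100),
      throttle.try_activate(7, 500))
```

Tracking a timestamp per row is the simplest model; it only illustrates where the 128K bound comes from.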
¨ Key Idea
¤ After closing a row, we activate (i.e., refresh) one of its neighbors with a low probability p
¨ Reliability Guarantee
¤ When p=0.005, errors in one year: 9.4×10^-14
¤ By adjusting the value of p, we can vary the strength of protection against errors
[Kim’2014]
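A minimal simulation sketch of the key idea, assuming that each row closure refreshes one of the two physical neighbors, chosen uniformly, with probability p. The function names and the simplified escape-probability model are illustrative assumptions; the sketch does not reproduce the error-rate figure quoted above.

```python
import random

def para_neighbor_refreshes(row, num_activations, p=0.005, rng=None):
    """Count how often each neighbor of `row` is refreshed while `row`
    is activated and closed `num_activations` times."""
    if rng is None:
        rng = random.Random(0)         # fixed seed for repeatable runs
    refreshed = {row - 1: 0, row + 1: 0}
    for _ in range(num_activations):
        if rng.random() < p:           # with probability p ...
            victim = rng.choice((row - 1, row + 1))
            refreshed[victim] += 1     # ... refresh one neighbor
    return refreshed

def prob_neighbor_never_refreshed(num_activations, p=0.005):
    """Chance that one given neighbor is never refreshed: each closure
    refreshes it with probability p/2 under the assumption above."""
    return (1 - p / 2) ** num_activations

print(para_neighbor_refreshes(100, 100_000))
print(prob_neighbor_never_refreshed(100_000))
```

Raising p makes the escape probability fall faster at the cost of more extra activations.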