DRAM RELIABILITY Mahdi Nazm Bojnordi Assistant Professor School of - - PowerPoint PPT Presentation



SLIDE 1

DRAM RELIABILITY

CS/ECE 7810: Advanced Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor School of Computing University of Utah

SLIDE 2

Overview

¨ Upcoming deadline

¤ March 4th (11:59 PM): homework
¤ Late submission = NO submission
¤ March 25th: sign up for your student paper presentation

¨ This lecture

¤ Memory errors
¤ Error detection vs. correction
¤ Memory scrubbing
¤ Disturbance errors

SLIDE 3

Memory Errors

¨ Any unwanted data change (bit flip), which can originate in

¤ storage cells
¤ sensing circuits
¤ wires

¨ Soft errors are mainly caused by

¤ Slight manufacturing defects
¤ Gamma rays and alpha particles
¤ Electrical interference
¤ …

SLIDE 4

Error Detection and Correction

¨ Main memory stores a huge number of bits

¤ Nontrivial bit flip probability
¤ Even worse as the technology scales down

¨ Reliable systems must be protected against errors

¨ Techniques

¤ Error detection
▪ Parity is a rudimentary method of checking the data to see if errors exist
¤ Error correction code (ECC)
▪ Additional bits used for error detection and correction
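The parity technique above can be illustrated with a minimal sketch. It assumes even parity over a 4-bit word; the function names `parity_bit` and `has_error` are illustrative and not from the lecture.

```python
def parity_bit(bits):
    """Even-parity bit: 1 if the number of 1s in `bits` is odd, else 0."""
    return sum(bits) % 2


def has_error(codeword):
    """A codeword (data bits + parity bit) is valid when its count of 1s is even."""
    return sum(codeword) % 2 == 1


data = [1, 0, 1, 1]
codeword = data + [parity_bit(data)]

single_flip = codeword.copy()
single_flip[2] ^= 1          # one bit flip: parity check fails, error detected

double_flip = single_flip.copy()
double_flip[0] ^= 1          # a second flip makes the parity look valid again

print(has_error(codeword), has_error(single_flip), has_error(double_flip))
```

A single parity bit detects any odd number of flips but misses an even number, and it cannot say which bit flipped, which is why correction needs the additional bits of an ECC.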

SLIDE 5

Error Correction Codes

¨ Example: add redundant bits to the original data bits

¤ SECDED: Hamming distance (0000, 1111) = 4

Power                                                     Correct codewords  #bits  Comments
Nothing                                                   0, 1               1
Single error detection (SED)                              00, 11             2      01, 10 => error
Single error correction (SEC)                             000, 111           3      001, 010, 100 => 0; 110, 101, 011 => 1
Single error correction, double error detection (SECDED)  0000, 1111         4      One 1 => 0; two 1's => error; three 1's => 1

[adapted from Lipasti]
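The SECDED row of the table can be read as a decoding rule. The sketch below applies that rule to the 4-bit repetition code from the example; `hamming_distance` and `decode_secded` are illustrative names, and codewords are represented as strings purely for readability.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def decode_secded(word):
    """Decode a 4-bit repetition codeword (valid codewords: '0000', '1111').

    Returns 0 or 1 when the word is valid or has a single error,
    and None when a double error is detected (not correctable).
    """
    ones = word.count("1")
    if ones <= 1:
        return 0
    if ones >= 3:
        return 1
    return None  # exactly two 1s: equally far from both codewords


print(hamming_distance("0000", "1111"))   # 4
for received in ("0000", "0100", "1011", "0110"):
    print(received, "->", decode_secded(received))
```

A word with two 1s is at distance 2 from both valid codewords, so the decoder can flag it but cannot choose between them.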

SLIDE 6

Error Correction Codes

¨ Reduce the overhead by applying the codes to words instead of bits

# data bits  SED overhead  SECDED overhead
1            1 (100%)      3 (300%)
32           1 (3%)        7 (22%)
64           1 (1.6%)      8 (13%)
n            1 (1/n)       1 + log2 n + a little

[Figure: ECC DIMM built from nine x8 chips, one of which is the ECC chip]
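The SECDED column can be reproduced with a short calculation. This sketch assumes a Hamming code extended by one overall parity bit, which is a common SECDED construction; the slide does not name the construction, and the function names are illustrative.

```python
def sec_check_bits(n):
    """Smallest r such that 2**r >= n + r + 1 (enough syndromes for
    'no error' plus every single-bit error position)."""
    r = 1
    while 2 ** r < n + r + 1:
        r += 1
    return r


def secded_check_bits(n):
    """SEC check bits plus one overall parity bit for double error detection."""
    return sec_check_bits(n) + 1


for n in (1, 32, 64):
    r = secded_check_bits(n)
    print(f"{n:>2} data bits: SED 1 ({100 / n:.1f}%), SECDED {r} ({100 * r / n:.1f}%)")
```

For 64 data bits the exact SECDED overhead is 8/64 = 12.5%, which the table rounds to 13%.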

SLIDE 7

Memory Error Correction

¨ ECC allows the memory controller to correct cell retention errors and relax memory cell retention requirements.

[Figure: a memory scrubber and ECC logic operating on stored data and ECC bits. On an error: (1) read, (2) recover, (3) write.]

SLIDE 8

Memory Scrubbing

¨ ECC can correct a fixed number of errors

¨ Data can become uncorrectable if ECC is used without scrubbing

¨ Scrubbing prevents errors from accumulating over time

¤ Periodically read all of the memory locations
¤ Check ECC, correct errors, and write corrected data back to memory

[Figure: data bits and ECC bits passing through the ECC logic; accumulated errors leave the data uncorrectable]
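A toy model shows why scrubbing matters. As a simplifying assumption, the sketch uses the 3-bit repetition code (SEC) from the earlier example in place of a real memory ECC, and all names (`encode`, `correct`, `scrub`, `read`) are illustrative.

```python
def encode(bit):
    """Store one data bit as three copies (SEC repetition code)."""
    return [bit] * 3


def correct(cell):
    """Majority vote: recovers the bit if at most one copy has flipped."""
    return 1 if sum(cell) >= 2 else 0


def scrub(memory):
    """Read every location, correct it, and write the clean codeword back."""
    for i, cell in enumerate(memory):
        memory[i] = encode(correct(cell))


def read(memory):
    return [correct(cell) for cell in memory]


data = [1, 0, 1]

# With scrubbing: the first flip is repaired before the second one occurs.
mem = [encode(b) for b in data]
mem[0][0] ^= 1
scrub(mem)
mem[0][1] ^= 1
with_scrub = read(mem)

# Without scrubbing: two flips accumulate in the same codeword.
mem = [encode(b) for b in data]
mem[0][0] ^= 1
mem[0][1] ^= 1
without_scrub = read(mem)

print(with_scrub, without_scrub)
```

In the second run the two flips land in the same codeword before any correction happens, so the majority vote returns the wrong bit.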

SLIDE 9

Stronger Error Corrections

¨ SECDED Support

[Figure: 64-bit data word with 8-bit ECC]

  • One extra x8 chip per rank
  • Storage and energy overhead of 12.5%
  • Cannot handle complete failure in one chip

SLIDE 10

Stronger Error Corrections

¨ SECDED Support

¨ Chipkill Support

  • Use 72 DRAM chips to read out 72 bits
  • Dramatic increase in activation energy and overfetch
  • Storage overhead is still 12.5%

[Figure: 64-bit data word with 8-bit ECC; at most one bit from each DRAM chip]

SLIDE 11

Stronger Error Corrections

¨ SECDED Support

¨ Chipkill Support

  • Use 13 DRAM chips to read out 13 bits
  • Storage and energy overhead: 62.5%
  • Other options exist; trade-off between energy and storage

[Figure: 8-bit data word with 5-bit ECC; at most one bit from each DRAM chip]
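The chip counts and overheads of the two chipkill layouts follow from the ECC width. The sketch below assumes a Hamming-plus-parity SECDED code (the slides give only the resulting bit counts) and one codeword bit per DRAM chip; `chipkill_layout` is an illustrative name.

```python
def secded_check_bits(n):
    """Check bits for SECDED over n data bits, assuming a Hamming code
    (smallest r with 2**r >= n + r + 1) plus one overall parity bit."""
    r = 1
    while 2 ** r < n + r + 1:
        r += 1
    return r + 1


def chipkill_layout(data_bits):
    """Return (chips, ecc_bits, storage_overhead) when every bit of the
    codeword comes from a different DRAM chip."""
    ecc_bits = secded_check_bits(data_bits)
    chips = data_bits + ecc_bits
    return chips, ecc_bits, ecc_bits / data_bits


for width in (64, 8):
    chips, ecc, overhead = chipkill_layout(width)
    print(f"{width}-bit word: {chips} chips, {ecc} ECC bits, {overhead:.1%} overhead")
```

Because each chip contributes at most one bit to a codeword, a complete chip failure appears as a single-bit error, which SECDED can correct. Narrower words need fewer chips per access but pay more storage.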

SLIDE 12

Row Hammer Problem

¨ Repeated row activations can cause bit flips in adjacent rows

[Figure: a row of cells and its wordline. Driving the wordline to V_HIGH opens the row; returning it to V_LOW closes it. The hammered row and its adjacent victim rows are labeled.]

[Apple]

SLIDE 13

Modern DRAM is Vulnerable

First Appearance

SLIDE 14

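How Does a Program Induce RH Errors?

[Figure: a CPU repeatedly accessing two addresses, X and Y, in a DRAM module]

loop:
  mov (X), %eax     # read X
  mov (Y), %ebx     # read Y
  clflush (X)       # evict X from the cache so the next read goes to DRAM
  clflush (Y)       # evict Y from the cache so the next read goes to DRAM
  mfence            # wait for the flushes to complete
  jmp loop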

SLIDE 15

Sources of Disturbance Errors

¨ Cause 1: Electromagnetic coupling

¤ Toggling the wordline voltage briefly increases the voltage of adjacent wordlines
¤ Slightly opens adjacent rows → Charge leakage

¨ Cause 2: Conductive bridges

¨ Cause 3: Hot-carrier injection

Confirmed by at least one manufacturer

[slide source: Mutlu]

SLIDE 16

Basic Solutions

¨ Throttle accesses to the same row

¤ Limit access interval: ≥ 500 ns
¤ Limit number of accesses: ≤ 128K (= 64 ms / 500 ns)

¨ Refresh more frequently

¤ Shorten refresh interval by ~7x

Both naive solutions introduce significant overhead in performance and power

[Kim’2014]
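The numbers on this slide are linked by simple arithmetic, sketched below. The 64 ms refresh window and the 500 ns minimum interval come from the slide; the function names and default arguments are illustrative.

```python
def max_activations(refresh_window_ns=64_000_000, min_interval_ns=500):
    """Upper bound on activations of one row between two refreshes when
    accesses to that row are spaced at least `min_interval_ns` apart."""
    return refresh_window_ns // min_interval_ns


def shortened_refresh_ms(refresh_window_ms=64, factor=7):
    """Refresh interval after shortening it by the given factor."""
    return refresh_window_ms / factor


print(max_activations())                  # 128000, i.e. the 128K on the slide
print(round(shortened_refresh_ms(), 1))   # about 9.1 ms
```

Either knob bounds how many times a row can be hammered before its neighbors are refreshed; the cost is paid on every access or every refresh, not only during an attack.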

SLIDE 17

Probabilistic Adjacent Row Activation

¨ Key Idea

¤ After closing a row, we activate (i.e., refresh) one of its neighbors with a low probability: p = 0.005

¨ Reliability Guarantee

¤ When p = 0.005, errors in one year: 9.4×10⁻¹⁴
¤ By adjusting the value of p, we can vary the strength of protection against errors

[Kim’2014]
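The key idea can be sketched as a small simulation. Several things here are assumptions made for illustration and are not taken from the slide: each of the two neighbors is chosen with equal chance, a bit flips once a victim row sees `threshold` activations without a refresh, and the threshold values used below are hypothetical. The sketch does not attempt to reproduce the one-year error figure quoted above.

```python
import random


def survival_probability(p, activations):
    """Chance that one given neighbor is never refreshed across
    `activations` row closes, assuming it is picked with probability p/2."""
    return (1 - p / 2) ** activations


def hammer(p, activations, threshold, rng):
    """Return True if either neighbor accumulates `threshold` activations
    of the hammered row without being refreshed in between."""
    since_refresh = [0, 0]                 # one counter per neighbor
    for _ in range(activations):
        since_refresh[0] += 1              # each activation disturbs both neighbors
        since_refresh[1] += 1
        if max(since_refresh) >= threshold:
            return True
        if rng.random() < p:               # on row close, maybe refresh one neighbor
            since_refresh[rng.randrange(2)] = 0
    return False


rng = random.Random(0)
print(hammer(0.0, 1000, 500, rng))         # no protection: threshold is reached
print(survival_probability(0.005, 10_000))
```

Under these assumptions the survival probability decays geometrically with the number of activations, so even a small p leaves a hammered row very little chance of reaching a large threshold, while adding almost no extra activations in normal operation.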