Flash Memory: Characterization, Optimization, and Recovery Yu Cai, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery

Yu Cai, Yixin Luo, Erich F. Haratsch*, Ken Mai, Onur Mutlu Carnegie Mellon University, *LSI Corporation

1

slide-2
SLIDE 2

You Probably Know

  • Many use cases:

+ High performance, low energy consumption

2

slide-3
SLIDE 3

NAND Flash Memory Challenges

– Requires erase before program (write)
– High raw bit error rate

3

CPU Flash Controller

ECC Controller

Raw Flash Memory Chips

slide-4
SLIDE 4

Limited Flash Memory Lifetime

4

Program/Erase (P/E) Cycles (or Writes Per Cell) Raw bit error rate (RBER) ECC-correctable RBER ~3000 ~2000

Goal: Extend flash memory lifetime at low cost

P/E Cycle Lifetime

slide-5
SLIDE 5

Retention Loss

5

Charge leakage over time

One dominant source of flash memory errors [DATE ‘12, ICCD ‘12]

1

Retention error

Flash cell

slide-6
SLIDE 6

NAND Flash 101

6

Before I show you how we extend flash lifetime …

slide-7
SLIDE 7

Threshold Voltage (Vth)

7

Normalized Vth

1

Flash cell Flash cell

slide-8
SLIDE 8

Threshold Voltage (Vth) Distribution

8

Normalized Vth

1

Probability Density Function (PDF)

slide-9
SLIDE 9

Read Reference Voltage (Vref)

9

Normalized Vth PDF

1

Vref

slide-10
SLIDE 10

Multi-Level Cell (MLC)

10

Normalized Vth Erased (11) P1 (10) P2 (00) P3 (01) PDF ER-P1 Vref P1-P2 Vref P2-P3 Vref
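A minimal sketch of how the three read reference voltages map a cell's threshold voltage to one of the four MLC states. The state order and bit values come from the slide; the numeric voltages in `VREFS` and the function name `read_cell` are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

# States in order of increasing threshold voltage, with the 2-bit values from the slide.
STATES = [("ER", "11"), ("P1", "10"), ("P2", "00"), ("P3", "01")]

# Hypothetical normalized read reference voltages: ER-P1, P1-P2, P2-P3.
VREFS = [1.0, 2.0, 3.0]


def read_cell(vth, vrefs=VREFS):
    """Return (state, bits) for a cell, given three ascending read reference voltages."""
    # The number of reference voltages at or below vth selects the state.
    index = bisect_right(vrefs, vth)
    return STATES[index]


# A P3 cell programmed at 3.1 that leaks down to 2.9 is read back as P2.
fresh = read_cell(3.1)
aged = read_cell(2.9)
```

With fixed reference voltages, the downward shift in the last two lines is exactly what turns into a raw bit error.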

slide-11
SLIDE 11

11

Normalized Vth PDF P1 (10) P2 (00) P3 (01) Before retention loss: After some retention loss:

Threshold Voltage Reduces Over Time

slide-12
SLIDE 12

Fixed Read Reference Voltage Becomes Suboptimal

12

Normalized Vth P1-P2 Vref P2-P3 Vref Normalized Vth PDF P1 (10) P2 (00) P3 (01) Raw bit errors Before retention loss: After some retention loss:

slide-13
SLIDE 13

Optimal Read Reference Voltage (OPT)

13

Normalized Vth PDF P1 (10) P2 (00) P3 (01) P1-P2 Vref P2-P3 Vref P1-P2 OPT P2-P3 OPT Minimal raw bit errors After some retention loss:
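The idea of an optimal read reference voltage can be sketched numerically. The example below assumes each state's threshold voltage follows a Gaussian distribution and that both states are equally likely; the means, standard deviations, and function names are hypothetical, not measured values from the talk.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def misread_fraction(vref, low, high):
    """Fraction of cells misread at vref; low and high are (mean, std) pairs."""
    low_above = 1.0 - normal_cdf(vref, *low)   # lower-state cells read as higher state
    high_below = normal_cdf(vref, *high)       # higher-state cells read as lower state
    return 0.5 * (low_above + high_below)


def find_opt(low, high, step=0.01):
    """Scan between the two means and return the vref with the fewest misreads."""
    n = int(round((high[0] - low[0]) / step))
    candidates = [low[0] + i * step for i in range(n + 1)]
    return min(candidates, key=lambda v: misread_fraction(v, low, high))


# Hypothetical distributions: P3 shifts down more than P2 after retention loss.
P2_FRESH, P3_FRESH = (2.0, 0.2), (3.0, 0.2)
P2_AGED, P3_AGED = (1.9, 0.2), (2.7, 0.2)

FIXED_VREF = find_opt(P2_FRESH, P3_FRESH)  # chosen before retention loss
OPT_AGED = find_opt(P2_AGED, P3_AGED)      # optimum after retention loss

errors_fixed = misread_fraction(FIXED_VREF, P2_AGED, P3_AGED)
errors_opt = misread_fraction(OPT_AGED, P2_AGED, P3_AGED)
```

Under these assumed numbers, reading aged data at the old fixed voltage misreads several times more cells than reading at the shifted optimum.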

slide-14
SLIDE 14

14

Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage

slide-15
SLIDE 15

Correctable errors

Retention Failure

15

Normalized Vth PDF P1 (10) P2 (00) P3 (01) P1-P2 Vref P2-P3 Vref Uncorrectable errors After some retention loss: After significant retention loss:

slide-16
SLIDE 16

16

Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors

slide-17
SLIDE 17

17

To understand the effects of retention loss:

  • Characterize retention loss using real chips
slide-18
SLIDE 18

18

Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors To understand the effects of retention loss:

  • Characterize retention loss using real chips

Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage

slide-19
SLIDE 19

Characterization Methodology

19

FPGA-based flash memory testing platform [Cai+,FCCM ‘11]

slide-20
SLIDE 20

Characterization Methodology

  • FPGA-based flash memory testing platform
  • Real 20- to 24-nm MLC NAND flash chips
  • 0 to 40 days' worth of retention loss
  • Room temperature (20°C)
  • 0 to 50k P/E Cycles

20

slide-21
SLIDE 21

21

Characterize the effects of retention loss

  • 1. Threshold Voltage Distribution
  • 2. Optimal Read Reference Voltage
  • 3. RBER and P/E Cycle Lifetime
slide-22
SLIDE 22
  • 1. Threshold Voltage (Vth) Distribution

22

Normalized Vth PDF P1 P2 P3

slide-23
SLIDE 23
  • 1. Threshold Voltage (Vth) Distribution

23

Finding: Cell’s threshold voltage decreases over time P1 P2 P3 0-day 40-day 0-day 40-day

slide-24
SLIDE 24
  • 2. Optimal Read Reference Voltage (OPT)

24

P1 P2 P3 Finding: OPT decreases over time 0-day OPT 40-day OPT 0-day OPT 40-day OPT

slide-25
SLIDE 25
  • 3. RBER and P/E Cycle Lifetime

25

P/E Cycles RBER

slide-26
SLIDE 26

Actual OPT Reading data with 7-day worth of retention loss.

  • 3. RBER and P/E Cycle Lifetime

26

ECC-correctable RBER

Finding: Using actual OPT achieves the longest lifetime

Vref closer to actual OPT Nominal Lifetime Extended Lifetime

slide-27
SLIDE 27

Characterization Summary

Due to retention loss

‐Cell’s threshold voltage (Vth) decreases over time
‐Optimal read reference voltage (OPT) decreases over time

Using the actual OPT for reading

‐Achieves the longest lifetime

27

slide-28
SLIDE 28

28

Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors To understand the effects of retention loss:

  • Characterize retention loss using real chips

Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage

slide-29
SLIDE 29

Naïve Solution: Sweeping Vref

Key idea: Read the data multiple times with different read reference voltages until the raw bit errors are correctable by ECC

Finds the optimal read reference voltage

Requires many read-retries → higher read latency
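The naive sweep can be sketched as a simple retry loop. Everything here other than the loop itself is a hypothetical stand-in: `toy_read_page` returns a raw error count instead of page data, `toy_ecc_decode` models a 40-error ECC limit, and voltages are expressed as integer multiples of the smallest step.

```python
def sweep_read(read_page, ecc_decode, vref_candidates):
    """Try each read reference voltage in turn until ECC succeeds.

    Returns (data, number_of_reads); data is None if every attempt fails.
    """
    for attempt, vref in enumerate(vref_candidates, start=1):
        ok, data = ecc_decode(read_page(vref))
        if ok:
            return data, attempt
    return None, len(vref_candidates)


def toy_read_page(vref, opt=90, errors_per_step=15):
    """Toy model: the raw error count grows with the distance from OPT."""
    return abs(vref - opt) * errors_per_step


def toy_ecc_decode(raw_errors, limit=40):
    """Toy model: decoding succeeds only within the ECC limit."""
    if raw_errors <= limit:
        return True, "page data"
    return False, None


# Sweep downward from the default voltage (100) toward lower voltages.
data, reads = sweep_read(toy_read_page, toy_ecc_decode, range(100, 79, -1))
```

In this toy setup the sweep needs nine reads before ECC succeeds, which illustrates the latency cost noted on the slide.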

29

slide-30
SLIDE 30

Comparison of Flash Read Techniques

30

Flash Read Techniques | Lifetime (P/E Cycle) | Performance (Read Latency)

Fixed Vref

 

Sweeping Vref

 

Our Goal

 

slide-31
SLIDE 31
Observations

  • 1. The optimal read reference voltage gradually decreases over time

Key idea: Record the old OPT as a prediction (Vpred) of the actual OPT
Benefit: Close to actual OPT → Fewer read retries

  • 2. The amount of retention loss is similar across pages within a flash block

Key idea: Record only one Vpred for each block
Benefit: Small storage overhead (768KB out of 512GB)
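To put the quoted storage overhead in perspective, the ratio can be computed directly. This sketch assumes the slide's KB and GB are binary units (KiB and GiB); with decimal units the ratio is slightly different but still under two parts per million.

```python
KIB = 1024
GIB = 1024 ** 3

vpred_table_bytes = 768 * KIB       # per-block Vpred storage quoted on the slide
flash_capacity_bytes = 512 * GIB    # flash capacity quoted on the slide

overhead_fraction = vpred_table_bytes / flash_capacity_bytes
print(f"Vpred storage overhead: {overhead_fraction:.2e} of capacity")
```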

31

slide-32
SLIDE 32

Retention Optimized Reading (ROR)

Components:

  • 1. Online pre-optimization algorithm

‐Periodically records a Vpred for each block

  • 2. Improved read-retry technique

‐Utilizes the recorded Vpred to minimize read-retry count

32

slide-33
SLIDE 33
  • 1. Online Pre-Optimization Algorithm
  • Triggered periodically (e.g., per day)
  • Find and record an OPT as per-block Vpred
  • Performed in background
  • Small storage overhead

33

Normalized Vth PDF

New Vpred Old Vpred

slide-34
SLIDE 34
  • 2. Improved Read-Retry Technique
  • Performed as normal read
  • Vpred already close to actual OPT
  • Decrease Vref if Vpred fails, and retry

34

Normalized Vth PDF

OPT Vpred

Very close
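A minimal sketch of the improved read-retry: start at the recorded Vpred and only step downward, since retention loss lowers OPT. The toy page and ECC models, the `max_retries` cap, and integer voltage steps are assumptions for illustration, not details from the talk.

```python
def ror_read(read_page, ecc_decode, vpred, delta_v=1, max_retries=16):
    """Read starting at vpred; on ECC failure, lower vref by delta_v and retry.

    Returns (data, number_of_reads); data is None if all retries fail.
    """
    vref = vpred
    for attempt in range(1, max_retries + 1):
        ok, data = ecc_decode(read_page(vref))
        if ok:
            return data, attempt
        vref -= delta_v  # OPT only moves down over time, so search downward
    return None, max_retries


def toy_read_page(vref, opt=90, errors_per_step=15):
    """Toy model: the raw error count grows with the distance from OPT."""
    return abs(vref - opt) * errors_per_step


def toy_ecc_decode(raw_errors, limit=40):
    """Toy model: decoding succeeds only within the ECC limit."""
    if raw_errors <= limit:
        return True, "page data"
    return False, None


# Vpred was recorded recently, so it is only a few steps above the actual OPT (90).
data, reads = ror_read(toy_read_page, toy_ecc_decode, vpred=93)
```

Because the starting point is already close to OPT, this toy read succeeds on the second attempt rather than after a long sweep.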

slide-35
SLIDE 35

Retention Optimized Reading: Summary

35

Flash Read Techniques | Lifetime (P/E Cycle) | Performance (Read Latency)

Fixed Vref

 

Sweeping Vref

 64% ↑ 

ROR

 64% ↑  _____

  • Nom. Life: 2.4% ↓
  • Ext. Life: 70.4% ↓
slide-36
SLIDE 36

36

Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors To understand the effects of retention loss:

  • Characterize retention loss using real chips

Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage

slide-37
SLIDE 37

Correctable errors

Retention Failure

37

Normalized Vth PDF P1 (10) P2 (00) P3 (01) P1-P2 Vref P2-P3 Vref Uncorrectable errors After some retention loss: After significant retention loss:

slide-38
SLIDE 38

Leakage Speed Variation

38

Normalized Vth PDF Slow-leaking cell (S) Fast-leaking cell (F)

slide-39
SLIDE 39

Initially, Right After Programming

39

Normalized Vth PDF S F S F S F S F P2 P3

slide-40
SLIDE 40

P2 P3 F F F F

After Some Retention Loss

40

Normalized Vth PDF S F S F S F S F

Fast-leaking cells have lower Vth Slow-leaking cells have higher Vth

slide-41
SLIDE 41

Eventually: Retention Failure

41

Normalized Vth PDF S F S F S F S F

OPT

P2 P3

slide-42
SLIDE 42

Retention Failure Recovery (RFR)

Key idea: Guess original state of the cell from its leakage speed property

Three steps

  • 1. Identify risky cells
  • 2. Identify fast-/slow-leaking cells
  • 3. Guess original states

42

slide-43
SLIDE 43
  • 1. Identify Risky Cells

43

Normalized Vth PDF S S F F OPT+σ OPT OPT–σ Risky cells P2 P3 + S = + F = Key Formula

slide-44
SLIDE 44
  • 2. Identifying Fast- vs. Slow-Leaking Cells

44

Normalized Vth PDF OPT+σ OPT OPT–σ Risky cells P2 P3 + S = + F = Key Formula ? ? ? ? ? ?

slide-45
SLIDE 45
  • 2. Identifying Fast- vs. Slow-Leaking Cells

45

Normalized Vth PDF OPT+σ OPT OPT–σ Risky cells P2 P3 + S = + F = Key Formula ? ? ? ? S F F S ? ?

slide-46
SLIDE 46
  • 3. Guess Original States

46

Normalized Vth PDF S F F S Risky cells P2 P3 + S = + F = Key Formula
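The three RFR steps can be sketched as follows. The slide's key formula did not survive extraction, so the rule used here is a reading of the preceding slides: near the P2-P3 boundary, a fast-leaking risky cell most likely started in P3 and a slow-leaking one in P2. Classifying leak speed by comparing each cell's additional Vth drop to the median drop is an assumption of this sketch, as are all names and numbers.

```python
import statistics


def rfr_guess_states(vth_at_failure, vth_later, opt, sigma):
    """Return {cell_index: guessed_state} for the risky cells only."""
    # Step 1: risky cells have a threshold voltage within one sigma of OPT.
    risky = [i for i, v in enumerate(vth_at_failure) if abs(v - opt) <= sigma]
    if not risky:
        return {}

    # Step 2: measure how far each risky cell drops after additional retention time.
    drops = {i: vth_at_failure[i] - vth_later[i] for i in risky}
    cutoff = statistics.median(drops.values())  # assumed fast/slow boundary

    # Step 3: fast-leaking risky cells are guessed as P3, slow-leaking ones as P2.
    return {i: ("P3" if drops[i] > cutoff else "P2") for i in risky}


# Hypothetical normalized Vth readings at failure detection and some days later.
at_failure = [2.45, 2.55, 2.48, 2.52, 3.00, 2.00]
later = [2.30, 2.52, 2.32, 2.50, 2.90, 1.98]
guesses = rfr_guess_states(at_failure, later, opt=2.5, sigma=0.1)
```

Cells far from OPT (the last two in this example) are left untouched, since their state can be read directly.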

slide-47
SLIDE 47

RFR Evaluation

  • Expect to eliminate 50% of raw bit errors
  • ECC can correct remaining errors

47

Program with random data Detect failure, backup data Recover data 28 days 12 addt’l. days

slide-48
SLIDE 48

48

Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors To understand the effects of retention loss:

  • Characterize retention loss using real chips

Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage

slide-49
SLIDE 49

Conclusion

Problem: Retention loss reduces flash lifetime
Overall Goal: Extend flash lifetime at low cost
Flash Characterization: Developed an understanding of the effects of retention loss in real chips

Retention Optimized Reading: A low-cost mechanism that dynamically finds the optimal read reference voltage

‐ 64% lifetime ↑, 70.4% read latency ↓

Retention Failure Recovery: An offline mechanism that recovers data after detecting uncorrectable errors

‐ Raw bit error rate 50% ↓, reduces data loss

49

slide-50
SLIDE 50

Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery

Yu Cai, Yixin Luo, Erich F. Haratsch*, Ken Mai, Onur Mutlu Carnegie Mellon University, *LSI Corporation

50

slide-51
SLIDE 51

Backup Slides

51

slide-52
SLIDE 52

RFR Motivation

Data loss can happen in many ways

  • 1. High P/E cycle
  • 2. High temperature → accelerates retention loss
  • 3. High retention age (lost power for a long time)

52

slide-53
SLIDE 53

What if there are other errors?

Key: RFR does not have to correct all errors

Example:

  • ECC can correct 40 errors in a page
  • Corrupted page has 20 retention errors, 25 other errors (45 total errors)
  • After RFR: 10 retention errors, 30 other errors (40 total errors → ECC correctable)
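The arithmetic in this example can be checked in a few lines. The error counts and the 40-error ECC limit are the ones from the slide; the function name is hypothetical.

```python
ECC_LIMIT = 40  # correctable errors per page, from the slide's example


def correctable(retention_errors, other_errors, limit=ECC_LIMIT):
    """A page is recoverable when its total raw errors fit within the ECC limit."""
    return retention_errors + other_errors <= limit


before_rfr = correctable(retention_errors=20, other_errors=25)  # 45 total
after_rfr = correctable(retention_errors=10, other_errors=30)   # 40 total
```

RFR only has to bring the total down to the ECC limit; ECC then handles whatever remains.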

53

slide-54
SLIDE 54

Threshold Voltage (Vth) Mean

54

Threshold voltage mean

P1 P2 P3 Finding: Vth shifts faster in higher voltage states

Quickly decrease Slowly decrease Relatively constant

slide-55
SLIDE 55

Raw Bit Error Rate (RBER)

55

Actual OPT

Reading data with 7-day retention age.

Finding: The actual OPT achieves the lowest RBER

RBER gradually decreases as read reference voltage approaches the actual OPT

slide-56
SLIDE 56

Online Pre-Optimization Algorithm

56

Normalized Vth PDF Normalized Vth PDF OPT OPT V0 V0 Case: V0 < OPT Case: V0 > OPT

slide-57
SLIDE 57

Online Pre-Optimization Algorithm

  • Periodically learn and record OPT for page 255 as per-block starting read reference voltage (V0)

‐ Page 255 has the shortest retention age
‐ Other pages within the block have longer retention age and retention age will increase over time

  • Step 1: Read with Vref = old V0, record RBER
  • Step 2: Decrease Vref = Vref – ΔV*, compare RBER
  • Step 3: Increase Vref = Vref + ΔV, compare RBER
  • Step 4: Record new V0 = the Vref with minimal RBER

57

*ΔV is the smallest step size for changing read reference voltage.
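The four steps can be sketched as a local search around the old V0. This assumes RBER has a single minimum as a function of Vref, which is what the characterization slides suggest; `measure_rber`, `max_steps`, and the toy RBER curve are hypothetical stand-ins.

```python
def pre_optimize(measure_rber, old_v0, delta_v=1, max_steps=64):
    """Return the new V0: the Vref near old_v0 with the lowest measured RBER."""
    # Step 1: read with the old V0 and record its RBER.
    best_v, best_rber = old_v0, measure_rber(old_v0)

    # Step 2 (decrease) and Step 3 (increase): keep moving while RBER improves.
    for step in (-delta_v, delta_v):
        for _ in range(max_steps):
            candidate = best_v + step
            rber = measure_rber(candidate)
            if rber >= best_rber:
                break
            best_v, best_rber = candidate, rber

    # Step 4: the caller records this value as the block's new V0.
    return best_v


def toy_rber(vref, opt=87):
    """Toy RBER curve with its minimum at opt (voltages in units of delta V)."""
    return 1e-4 + 1e-6 * (vref - opt) ** 2


new_v0_from_above = pre_optimize(toy_rber, old_v0=93)  # case V0 > OPT
new_v0_from_below = pre_optimize(toy_rber, old_v0=80)  # case V0 < OPT
```

Both starting cases converge to the same voltage in the toy model, matching the two cases drawn on the previous slide.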

slide-58
SLIDE 58

Naive Read-Retry Latency Diagnosis II

58

Observation: Average ECC latency ∝ RBER

Attempt 1

time Read page A (Stage-0):

Flash Read Latency ECC Latency

= Constant ∝ Raw bit error*

*We provide detailed analysis of ECC latency in the paper.

slide-59
SLIDE 59

Arrhenius Law

59

1 year 32 hours Room temperature (20°C) High temperature (70°C)

High temperature accelerates retention loss
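The slide equates 1 year at 20°C with 32 hours at 70°C but does not state an activation energy. Assuming the standard Arrhenius acceleration factor, AF = exp((Ea / k) · (1/T_use − 1/T_stress)) with temperatures in kelvin, the sketch below back-computes the Ea implied by those two numbers. The function names are hypothetical, and the result is an inference from the slide, not a value given by the authors.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between a use and a stress temperature (°C)."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use - 1.0 / t_stress))


def implied_activation_energy(af, t_use_c, t_stress_c):
    """Invert the Arrhenius relation to get the activation energy (eV) for a given AF."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.log(af) * BOLTZMANN_EV_PER_K / (1.0 / t_use - 1.0 / t_stress)


slide_af = (365 * 24) / 32  # 1 year at 20°C compressed into 32 hours at 70°C
implied_ea = implied_activation_energy(slide_af, 20, 70)
print(f"Acceleration factor: {slide_af:.0f}x, implied Ea: {implied_ea:.2f} eV")
```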

slide-60
SLIDE 60

Fast- and Slow-Leaking Cells

60

Slow-leaking cells Fast-leaking cells

(-1σ,μ) (1σ,2σ) (2σ,3σ) (3σ,+∞) (-∞,-3σ) (-3σ,-2σ) (-2σ,-1σ) (μ,1σ) Retention age (days)

*Similar trends are found in P2 state, as shown in the paper.

Average Vth shift

Ends up in higher Vth Ends up in lower Vth

slide-61
SLIDE 61

Fast- and Slow-Leaking Cells

61

Normalized Vth PDF -3σ -2σ -1σ μ 1σ 2σ 3σ

Threshold voltage marks after 28 days:

slide-62
SLIDE 62

Fast- and Slow-Leaking Cells

62

Slow-leaking cells Fast-leaking cells

(-1σ,μ) (1σ,2σ) (2σ,3σ) (3σ,+∞) (-∞,-3σ) (-3σ,-2σ) (-2σ,-1σ) (μ,1σ) Retention age (days)

*Similar trends are found in P2 state, as shown in the paper.

Average Vth shift

Ends up in higher Vth Ends up in lower Vth

slide-63
SLIDE 63

63

Substrate Floating gate (FG) Control gate (CG) Drain Source Inter-poly oxide Tunnel oxide

Substrate FG CG D S