SSD Failures in Datacenters: What? When? And Why?



slide-1
SLIDE 1

SSD Failures in Datacenters: What? When? And Why?

Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid


The 9th ACM Systems And Storage Conference (SYSTOR 2016)

slide-3
SLIDE 3

Why SSD Reliability?

  • SSDs’ popularity: 46.5% annual growth*
  • Limited field data
  • Datacenter decision support
  • Data reliability

Large-scale field data

*Source: IDC, Dec 2015

slide-4
SLIDE 4

SSD Failures

Flash failures

  • Media wear-out
  • Data retention
  • Program disturb
  • Erase disturb

FTL mechanisms

  • Wear levelling
  • Error detection
  • Error correction
  • Flash correct and refresh, etc.

slide-9
SLIDE 9

SSD Failures

Flash failures

  • Media wear-out
  • Data retention
  • Program disturb
  • Erase disturb

FTL mechanisms

  • Wear levelling
  • Error detection
  • Error correction
  • Flash correct and refresh, etc.

Fail-stop failures

slide-10
SLIDE 10

SSD Reliability

[Figure: annualized failure rate (AFR, %) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A), with models grouped as Consumer and Enterprise; annotations: AFR = 0.61, AFR = 0.73]
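The slides report annualized failure rate (AFR) without defining it. A minimal sketch, assuming the common definition (failures per device-year of operation, expressed as a percentage); the function name and the fleet numbers are hypothetical and are not taken from the talk:

```python
def annualized_failure_rate(failures: int, device_days: float) -> float:
    """Return AFR in percent: failures per device-year, times 100.

    Assumed definition; the slides do not spell out how AFR was computed.
    """
    if device_days <= 0:
        raise ValueError("device_days must be positive")
    device_years = device_days / 365.0
    return 100.0 * failures / device_years


# Hypothetical fleet: 10,000 drives observed for 365 days each, 61 failures.
afr = annualized_failure_rate(failures=61, device_days=10_000 * 365)
print(f"AFR = {afr:.2f}%")
```

Normalizing by device-days rather than by device count keeps the rate comparable across models that were deployed for different lengths of time.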

slide-12
SLIDE 12

SSD Reliability

[Figure: annualized failure rate (AFR, %) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); annotations: AFR = 0.61, AFR = 0.73]

5 large datacenters

slide-13
SLIDE 13

SSD Reliability

[Figure: annualized failure rate (AFR, %) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); annotations: AFR = 0.61, AFR = 0.73]

4 major workloads

slide-14
SLIDE 14

SSD Reliability

[Figure: annualized failure rate (AFR, %) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); annotations: AFR = 0.61, AFR = 0.73]

6 different rack SKUs

slide-15
SLIDE 15

SSD Reliability

[Figure: annualized failure rate (AFR, %) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); annotations: AFR = 0.61, AFR = 0.73]

Various factors in a production environment could affect SSD failure trends very differently from lab test conditions.

Can we understand SSD failures in the presence of various factors?

slide-16
SLIDE 16

Understanding SSD Failures – An analogy

SSD: reactive vs. proactive

slide-17
SLIDE 17

What are the symptoms?

  • Human: fever, unexpected weight loss, low blood pressure
  • SSD: data errors, reallocated sectors, program and erase failure, SATA downshift

slide-18
SLIDE 18

SSD Failure Symptoms

[Figure: AFR (%) of devices with vs. without each symptom: Reallocated Sector Count, Program and Erase Fail Count, CRC and Uncorrectable Error Count, SATA Downshift Count. Ratio annotations as extracted, in order: 3.95X, 2.76X, 18X, 3.91X]
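The "X" annotations appear to be ratios between the AFR of devices that show a symptom and the AFR of devices that do not. A small sketch under that assumption; the `Group` class and all counts are made up for illustration and do not reproduce the study's data:

```python
from dataclasses import dataclass


@dataclass
class Group:
    """A set of devices, summarized by failures and observed device-years."""
    failures: int
    device_years: float

    @property
    def afr(self) -> float:
        # AFR in percent, assuming failures per device-year.
        return 100.0 * self.failures / self.device_years


def afr_ratio(with_symptom: Group, without_symptom: Group) -> float:
    """How many times higher the AFR is when the symptom is present."""
    return with_symptom.afr / without_symptom.afr


# Hypothetical counts for one symptom.
with_symptom = Group(failures=30, device_years=1000.0)     # AFR 3.00%
without_symptom = Group(failures=60, device_years=8000.0)  # AFR 0.75%
ratio = afr_ratio(with_symptom, without_symptom)
print(f"{ratio:.2f}X")
```

A ratio like this measures association only: it says that failing devices tend to show the symptom more often, not that the symptom causes the failure.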

slide-19
SLIDE 19

Insufficiency of symptom-only diagnosis

[Figure: % of devices showing each symptom (Reallocations, Program and Erase Fail, Data Errors, SATA Downshift, Any), for failed vs. healthy devices]

Symptoms seen only in 62% of failed devices

slide-20
SLIDE 20

What are the factors?

  • Human: lifestyle, genetics, environmental agents
  • SSD: workload, design decisions, production environment

slide-21
SLIDE 21

Device level correlating factors

  • Average write rate of a device
  • Average read rate of a device
  • Total read and/or write usage
  • Write amplification
  • Read-write ratio

[Figure: AFR (%) vs. avg. host writes per day (x-axis bins from 10 to >50)]

Increasing failure trend at higher write rates

More results in the paper

slide-22
SLIDE 22

Server level correlating factors

  • SSD space utilization
  • Disk space utilization
  • Memory utilization
  • Processor utilization

[Figure: AFR (%) vs. avg. disk space utilization (x-axis from 10 to 70)]

Decreasing failure trend at high disk space usage

More results in the paper

slide-23
SLIDE 23

Datacenter factors

  • Rack SKU
  • Datacenter facility

[Figure: AFR (%) by rack SKU and SSD model; labels as extracted: 1-D, 2-A, 1-D, 2-A, S1-3a, S1-3b]

Same model, different behavior

More results in the paper

slide-24
SLIDE 24

Understanding SSD Failures – An analogy

Symptoms and factors together: multi-feature analysis

slide-25
SLIDE 25

Understanding SSD Failures – An analogy

Symptoms and factors are analyzed together using:

  • Random forest based binary classification
  • Permutation feature ranking
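Permutation feature ranking can be sketched without any libraries: train a classifier, shuffle one feature column at a time, and record how much accuracy drops. In the sketch below a small bagged ensemble of decision stumps stands in for a real random forest, and the data is synthetic (the label depends only on `data_errors`, while `noise` is irrelevant). All names and numbers are assumptions for illustration, not the authors' pipeline.

```python
import random

FEATURES = ["data_errors", "noise"]  # hypothetical feature names


def fit_stump(rows, labels):
    """Return the (feature, threshold, label_above) split with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):
                errors = sum((above if r[f] > t else 1 - above) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    return best[1:]


def predict_stump(stump, row):
    f, t, above = stump
    return above if row[f] > t else 1 - above


def fit_forest(rows, labels, n_trees, rng):
    """Bagging: each stump is trained on a bootstrap resample of the data."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    votes = sum(predict_stump(s, row) for s in forest)
    return 1 if 2 * votes > len(forest) else 0  # majority vote


def accuracy(forest, rows, labels):
    hits = sum(predict_forest(forest, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(forest, rows, labels, rng, repeats=5):
    """Mean accuracy drop when one feature column is shuffled."""
    base = accuracy(forest, rows, labels)
    drops = []
    for f in range(len(rows[0])):
        total = 0.0
        for _ in range(repeats):
            column = [r[f] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
            total += base - accuracy(forest, shuffled, labels)
        drops.append(total / repeats)
    return drops


rng = random.Random(0)
rows = [[rng.randint(0, 10), rng.randint(0, 10)] for _ in range(200)]
labels = [1 if r[0] > 5 else 0 for r in rows]  # 1 = failed (synthetic rule)
forest = fit_forest(rows, labels, n_trees=15, rng=rng)
importances = permutation_importance(forest, rows, labels, rng)
ranking = sorted(zip(FEATURES, importances), key=lambda p: p[1], reverse=True)
print(ranking)
```

Shuffling a feature the model relies on breaks its link to the label and lowers accuracy, while shuffling an unused feature changes nothing, which is what makes the accuracy drop usable as a ranking score.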

slide-26
SLIDE 26

Understanding What?

  • What are the important factors?
  • What is their order of importance?
  • What are the important combinations?

slide-27
SLIDE 27

Understanding What?

[Figure: feature importance (0 to 1) for DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay; highlighted group: SYMPTOMS]

slide-28
SLIDE 28

Understanding What?

[Figure: feature importance (0 to 1) for DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay; highlighted group: DEVICE WORKLOAD]

slide-29
SLIDE 29

Understanding What?

[Figure: feature importance (0 to 1) for DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay; highlighted group: SERVER WORKLOAD]

slide-30
SLIDE 30

Understanding What?

Frequent combinations of the top 8 important features (highlighted: SYMPTOMS)

Condition -> Class

  • Data Errors <= 1 & Reallocated Sectors <= 5 -> H
  • Data Errors <= 1 & WAF <= 1 -> H
  • Media Wear-out = 100 & WAF <= 1 -> H
  • Avg. SSD space >= 10 -> F
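Condition/class rules of this kind can be checked mechanically against a device's counters. The sketch below encodes the three H rules, reading H as "healthy" (an assumption; the slide does not expand the class labels). The F row is left out because its extracted condition may be incomplete. Field names, units and the sample device are hypothetical.

```python
# Each rule: (condition as written on the slide, predicate, class label).
RULES = [
    ("Data Errors <= 1 & Reallocated Sectors <= 5",
     lambda d: d["data_errors"] <= 1 and d["reallocated_sectors"] <= 5,
     "H"),
    ("Data Errors <= 1 & WAF <= 1",
     lambda d: d["data_errors"] <= 1 and d["waf"] <= 1,
     "H"),
    ("Media Wear-out = 100 & WAF <= 1",
     lambda d: d["media_wearout"] == 100 and d["waf"] <= 1,
     "H"),
]


def matching_rules(device):
    """Return (condition, class) for every rule the device satisfies."""
    return [(text, cls) for text, predicate, cls in RULES if predicate(device)]


# Hypothetical device counters.
device = {
    "data_errors": 0,
    "reallocated_sectors": 2,
    "waf": 3.0,            # write amplification factor
    "media_wearout": 90,
}
matches = matching_rules(device)
print(matches)
```

Returning every matching rule, rather than only the first, makes it visible when a device satisfies several combinations at once.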

slide-31
SLIDE 31

Understanding What?

Frequent combinations of the top 8 important features (highlighted: SYMPTOMS + WORKLOAD)

Condition -> Class

  • Data Errors <= 1 & Reallocated Sectors <= 5 -> H
  • Data Errors <= 1 & WAF <= 1 -> H
  • Media Wear-out = 100 & WAF <= 1 -> H
  • Avg. SSD space >= 10 -> F

slide-32
SLIDE 32

Understanding What?

Frequent combinations of the top 8 important features (highlighted: WORKLOAD)

Condition -> Class

  • Data Errors <= 1 & Reallocated Sectors <= 5 -> H
  • Data Errors <= 1 & WAF <= 1 -> H
  • Media Wear-out = 100 & WAF <= 1 -> H
  • Avg. SSD space >= 10 -> F

slide-33
SLIDE 33

Understanding When?

  • What is the duration between detection and failure?
  • What signatures characterize SSD survivability?

slide-34
SLIDE 34

Understanding When?

[Figure: CDF of time to fail (months, up to 12)]

  • For 50% of failures, time to fail exceeds 4 months
  • Sufficient time to intervene
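The "more than 4 months" figure is a reading off an empirical CDF. A minimal sketch, taking time to fail as the interval between first symptom detection and failure (an interpretation of the preceding slide); the sample durations are invented:

```python
def empirical_cdf(samples, x):
    """Fraction of samples that are less than or equal to x."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s <= x for s in samples) / len(samples)


# Hypothetical time-to-fail values in months, one per failed device.
time_to_fail = [0.5, 1, 2, 3, 5, 6, 8, 9, 10, 12]

cdf_at_4 = empirical_cdf(time_to_fail, 4)
fraction_beyond_4 = 1 - cdf_at_4
print(f"CDF(4) = {cdf_at_4:.2f}, fraction failing after 4 months = {fraction_beyond_4:.2f}")
```

The complement of the CDF at a given horizon is the share of failures that leave at least that much time for intervention.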

slide-35
SLIDE 35

Understanding When?

[Figure: CDF of time to fail (months, up to 12)]

  • For 50% of failures, time to fail exceeds 4 months
  • Early failures (< 1 month): rules include symptoms and their thresholds
  • Late failures: rules contain only workload factors

slide-36
SLIDE 36

Understanding SSD Failures – An analogy

Symptoms and factors: observation-based causal estimate

  • Probabilistic causal models and Pearl’s do-calculus
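One standard result in Pearl's framework is the backdoor adjustment, P(fail | do(x)) = sum over z of P(fail | x, z) P(z), which turns observational counts into a causal estimate when z blocks all backdoor paths. The sketch below applies it with SSD model as the only confounder of write intensity; that causal structure and every number are assumptions for illustration, and the slides do not say which adjustment the authors used.

```python
# counts[(model, high_write)] = (devices, failures); all numbers are made up.
counts = {
    ("A", 0): (800, 8),
    ("A", 1): (200, 6),
    ("B", 0): (200, 6),
    ("B", 1): (800, 48),
}


def naive_rate(x):
    """Observed P(fail | high_write = x), ignoring the confounder."""
    devices = sum(n for (_, hw), (n, _f) in counts.items() if hw == x)
    failures = sum(f for (_, hw), (_n, f) in counts.items() if hw == x)
    return failures / devices


def adjusted_rate(x):
    """Backdoor adjustment: sum_z P(fail | x, z) * P(z)."""
    total = sum(n for n, _ in counts.values())
    rate = 0.0
    for model in sorted({m for m, _ in counts}):
        model_devices = sum(n for (m, _), (n, _f) in counts.items() if m == model)
        n, f = counts[(model, x)]
        rate += (f / n) * (model_devices / total)
    return rate


naive_effect = naive_rate(1) - naive_rate(0)
causal_effect = adjusted_rate(1) - adjusted_rate(0)
print(f"naive difference:    {naive_effect:.3f}")
print(f"adjusted difference: {causal_effect:.3f}")
```

In this made-up fleet the less reliable model also carries most of the write-heavy load, so the naive comparison overstates the effect of writes; the adjustment removes that share.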

slide-37
SLIDE 37

Understanding Why?

  • What factors impact SSD reliability?
  • What is their magnitude of impact?

slide-38
SLIDE 38

Understanding Why?

  • SSD model and symptoms have direct impact
  • Workload impacts failures through media wear-out

slide-39
SLIDE 39

Concluding Remarks

  • SSD Failures in the field
  • Factors -> Symptoms -> Failures
  • Important Symptoms: Data Errors and Reallocated Sectors
  • High intensity and rapid progression fails early
  • Important factors: NAND Writes, Total Reads and Writes, etc.
  • Direct impact: SSD Model and Symptoms
  • Indirect impact: Workload through wear-out
  • Future direction: prediction and control
