SLIDE 1

A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility

Xin Li¹, Michael Huang¹, Kai Shen², Lingkun Chu³

¹ Department of Electrical and Computer Engineering, University of Rochester

² Department of Computer Science, University of Rochester

³ Ask.com

2010 USENIX Annual Technical Conference

Li, Huang, Shen, Chu (Rochester, Ask.com) Realistic Eval of Mem Errors USENIX ATC’10 1 / 29

SLIDE 2

Motivation

Memory Hardware Errors: Transient vs Non-transient

Transient:

  Completely due to environmental factors
  Don’t cause permanent hardware damage

Non-transient:

  Hardware fault plays a role
  May recur over time

SLIDE 3

Motivation

Asymmetrical Understanding of Memory Errors

Transient analysis:

  Baumann 2004
  Normand 1996
  Ziegler et al. 1996
  O’Gorman et al. 1996
  Li et al. 2007

Non-transient error studies:

  Schroeder et al. 2009
  Constantinescu 2003
  No specifics regarding error locations

SLIDE 4

Motivation

Importance of Understanding Non-transient Memory Errors

Non-transient errors

  Intermittent errors may not be easy to detect
  System maintenance is not perfect
  May combine with transient errors to have an impact

The lack of a comprehensive understanding of memory errors

  High-level studies assume transient errors or resort to synthetic non-transient errors
  Non-transient errors do happen in practice

SLIDE 5

Motivation

A Realistic Evaluation from All Angles

Collect non-accelerated errors on production computers

  Detailed per-error address and syndrome

Simulate how they would manifest with different hardware correction mechanisms

Observe the end results of software running with these errors

SLIDE 6

Motivation

Outline

1 Data Collection
    Results

2 Error Manifestation Analysis
    Overview
    Methodology
    Base Results
    Statistical Rate Bounds

3 Software Susceptibility
    Overview
    Methodology
    Results

4 Conclusions

SLIDE 7

Realistic Raw Error Data

SLIDE 8

Realistic Raw Error Data

Methodology

Data primarily from 212 production servers with ECC

  Monitored for about 9 months
  Total of 800 GB memory
  Read error info from ECC registers
  Enabled hardware scrubbing to help expose errors

Two other environments are examined

  70 geographically distributed PlanetLab testbed machines
  20 U of Rochester desktops
  Results reported for transient errors only in USENIX’07

SLIDE 9

Realistic Raw Error Data Results

Results – Time-line

11 machines with errors in the first 2 months
A new faulty machine after 6 months

SLIDE 10

Realistic Raw Error Data Results

Results – Selected Patterns

[Figure: six panels, each titled "Error Pattern"; axes: Column Address, Row Address]

SLIDE 11

Realistic Raw Error Data Results

Results – Patterns

Summary:

  5 cells
  3 rows
  1 column
  1 row-column
  2 chip

Raw data available on our project website http://www.cs.rochester.edu/research/os/memerror

SLIDE 12

Manifestation

SLIDE 13

Manifestation Overview

Manifestation Overview

Countermeasures confine errors inside the memory system

  ECC correction
  Preventive maintenance

Countermeasures come at a cost

  ECC demands extra bits and extra logic
  Chipkill ECC even requires lock-stepping between channels

Efficacy is in question

SLIDE 14

Manifestation Methodology

Methodology

Event-driven Monte Carlo simulation

Calculate manifestation rates given:

  Error model (patterns and rates)
  Countermeasures
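
To make the idea of an event-driven Monte Carlo estimate concrete, here is a minimal sketch in Python. It is not the authors' simulator: the function names, the single-fault-per-machine simplification, and every numeric parameter (rates, Weibull shape and scale, multi-bit probability) are assumptions chosen only for illustration. It draws error events from an error model, orders them in time, and asks whether a given countermeasure lets the first one through.

```python
import heapq
import random

HOURS_PER_YEAR = 24 * 365


def simulate_once(rng, horizon_h, transient_rate_per_h,
                  nt_shape, nt_scale_h, nt_multibit_prob, ecc):
    """Return the time (hours) of the first manifested error, or None."""
    if ecc not in ("none", "secded"):
        raise ValueError("this toy model only knows 'none' and 'secded'")
    events = []  # min-heap of (time_h, kind, bad_bits_in_word)

    # Transient errors: single-bit, exponential inter-arrival times.
    t = rng.expovariate(transient_rate_per_h)
    while t < horizon_h:
        heapq.heappush(events, (t, "transient", 1))
        t += rng.expovariate(transient_rate_per_h)

    # One non-transient fault whose onset time is Weibull-distributed.
    onset = rng.weibullvariate(nt_scale_h, nt_shape)
    multibit = rng.random() < nt_multibit_prob
    if onset < horizon_h:
        heapq.heappush(events, (onset, "non-transient", 2 if multibit else 1))

    # Process events in time order; stop at the first one that gets through.
    while events:
        time_h, _kind, bad_bits = heapq.heappop(events)
        if ecc == "none" or bad_bits >= 2:  # SECDED corrects 1-bit errors
            return time_h
    return None


def manifest_fraction(ecc, runs=10_000, seed=1, years=3.0):
    """Fraction of simulated machines with at least one manifested error."""
    rng = random.Random(seed)
    horizon_h = years * HOURS_PER_YEAR
    hits = sum(
        # All numeric parameters below are hypothetical placeholders.
        simulate_once(rng, horizon_h, 2e-6, 0.5, 5e6, 0.5, ecc) is not None
        for _ in range(runs)
    )
    return hits / runs


if __name__ == "__main__":
    for ecc in ("none", "secded"):
        print(ecc, manifest_fraction(ecc))
```

The sketch ignores interactions between errors in the same word, which a faithful simulation would need to model.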

SLIDE 15

Manifestation Methodology

Assumptions

Transient errors

  Single bit patterns
  Constant error rates
  Exponential distribution

Non-transient errors

  Patterns based on templates
  Common belief: bathtub curve
  Wear-out neglected
  Weibull distribution (shape parameter < 1)
  Parameters derived from the raw data
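
The two distributional assumptions can be compared through their hazard (instantaneous failure) rates. The sketch below uses the standard Weibull hazard h(t) = (k/λ)(t/λ)^(k−1); the shape and scale values are hypothetical, not the parameters derived from the raw data. With shape k = 1 the Weibull reduces to the exponential distribution and the hazard is constant; with k < 1 the hazard falls over time, which corresponds to the early-failure side of the bathtub curve with wear-out neglected.

```python
import random


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull(shape, scale) at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def sample_onset_time(shape, scale, rng):
    """Draw one fault onset time; random.weibullvariate takes (scale, shape)."""
    return rng.weibullvariate(scale, shape)


if __name__ == "__main__":
    scale_years = 10.0  # hypothetical scale parameter
    for years in (0.5, 1.0, 2.0, 3.0):
        decreasing = weibull_hazard(years, 0.5, scale_years)  # shape < 1
        constant = weibull_hazard(years, 1.0, scale_years)    # exponential
        print(years, decreasing, constant)
    print(sample_onset_time(0.5, scale_years, random.Random(0)))
```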

SLIDE 16

Manifestation Methodology

Assumptions Cont’

ECC

  SECDED: single bit correction, double bit detection (in a word)
  Chipkill: correct a whole chip

Preventive maintenance

  Not effective in our model
  Excluded from the results
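
The ECC assumptions above can be written down as two small classification rules. This is a sketch only: the outcome labels, the `bits_per_chip=4` default, and the assumption that consecutive bit positions of a word belong to the same chip are all hypothetical and not taken from the slides.

```python
def secded_outcome(bad_bits_in_word):
    """Outcome for one ECC word under SECDED, given its number of bad bits."""
    if bad_bits_in_word == 0:
        return "clean"
    if bad_bits_in_word == 1:
        return "corrected"
    if bad_bits_in_word == 2:
        return "detected, uncorrectable"
    return "no guarantee"  # 3 or more bad bits are outside SECDED's guarantees


def chipkill_outcome(bad_bit_positions, bits_per_chip=4):
    """Outcome under chipkill, assuming each chip supplies bits_per_chip
    consecutive bit positions of the word (a hypothetical layout)."""
    chips = {pos // bits_per_chip for pos in bad_bit_positions}
    if not chips:
        return "clean"
    if len(chips) == 1:
        return "corrected"  # all bad bits fall within a single chip
    return "beyond single-chip correction"


if __name__ == "__main__":
    print(secded_outcome(1))             # corrected
    print(secded_outcome(2))             # detected, uncorrectable
    print(chipkill_outcome([8, 9, 11]))  # corrected
    print(chipkill_outcome([3, 4]))      # beyond single-chip correction
```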

SLIDE 17

Manifestation Base Results

Base Results

No ECC

  Transient and non-transient both significant
  Transient: 2000 FIT
    FIT: Failures In Time (114 FIT corresponds to a 1000-year MTTF)
  Non-transient: 5000 to 2000 FIT

[Figure (A) No ECC: Cumulative Failure Rate (in FIT) vs. Operational Duration (years); series: Transient, Cell, Row, Column, Row-column, Chip]
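
The FIT figures can be related to mean time to failure with a short conversion, using the standard definition of one FIT as one failure per 10^9 hours of operation. The helper names are illustrative, and a year is taken as 8760 hours.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mttf_years_to_fit(years):
    """FIT rate equivalent to a given mean time to failure in years."""
    return 1e9 / (years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """Mean time to failure in years for a given FIT rate."""
    return 1e9 / fit / HOURS_PER_YEAR


if __name__ == "__main__":
    print(round(mttf_years_to_fit(1000)))     # 114
    print(round(fit_to_mttf_years(2000), 1))  # 57.1
```

The first line reproduces the 114 FIT for a 1000-year MTTF quoted on the slide; the second shows that 2000 FIT corresponds to an MTTF of roughly 57 years.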

SLIDE 18

Manifestation Base Results

Base Results (cont’)

SECDED

  Single-bit errors corrected
  Eliminated transient / majority of non-transient

Chipkill

  No uncorrectable error observed

[Figure (B) SECDED ECC: Cumulative Failure Rate (in FIT) vs. Operational Duration (years); series: Row, Row-column, Chip]

SLIDE 19

Manifestation Statistical Bounds

Bound Estimation and Results

Estimate rate bounds using statistical methods

No-ECC and SECDED

  Non-transient: about 2X difference

Chipkill

  Small number of uncorrected errors showing up
  All caused by transient errors coinciding with a chip error

SLIDE 20

Susceptibility

SLIDE 21

Susceptibility Overview

Overview

Software may not be affected by the exposed memory errors
An investigation of software susceptibility to memory errors
Rooted in the realism of the data
Validate/question conclusions of prior studies

SLIDE 22

Susceptibility Methodology

Infrastructure of Injection

Virtual machine based injection

Goals

  Reads from faulty locations are supplied with erroneous values
  Writes to faulty locations don’t overwrite erroneous bits
  Bookkeeping of accesses to faulty locations

Key challenge: tracking memory accesses
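
The three injection goals can be illustrated with a toy byte-addressed memory model. This is not the virtual machine based tool itself: the `FaultyMemory` class, its method names, and the representation of a fault as a mask of stuck bits are hypothetical. Reads overlay the stuck bits on whatever was stored, so writes never appear to change them, and every access to a faulty location is logged.

```python
class FaultyMemory:
    """Toy memory in which selected bits are stuck at fixed values."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.faults = {}      # addr -> (mask of stuck bits, their values)
        self.access_log = []  # bookkeeping: (operation, addr)

    def inject(self, addr, mask, stuck_value):
        """Mark the bits in mask at addr as stuck at stuck_value."""
        self.faults[addr] = (mask & 0xFF, stuck_value & mask & 0xFF)

    def read(self, addr):
        value = self.data[addr]
        if addr in self.faults:
            mask, stuck = self.faults[addr]
            self.access_log.append(("read", addr))
            # Erroneous bits override whatever the program stored.
            value = (value & ~mask & 0xFF) | stuck
        return value

    def write(self, addr, value):
        if addr in self.faults:
            self.access_log.append(("write", addr))
        # Healthy bits are stored; stuck bits are re-imposed on every read.
        self.data[addr] = value & 0xFF


if __name__ == "__main__":
    mem = FaultyMemory(16)
    mem.inject(3, mask=0b0000_0001, stuck_value=0b0000_0001)
    mem.write(3, 0b0000_0010)
    print(bin(mem.read(3)))  # 0b11: bit 0 stays stuck at 1
    print(mem.access_log)
```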

SLIDE 23

Susceptibility Methodology

Conventional Tracking Methods

Hardware watchpoint
Code instrumentation
Page access control

SLIDE 24

Susceptibility Methodology

Novel Tracking Method

Observations

  Error bits spread into different pages
  Spurious page faults

Hotspot Watchpoint

  On access to an error, unprotect the page
  Set up hardware watchpoint on the error
  Successive accesses to the error tracked by hardware watchpoints
  Protect this page again when errors on other pages are accessed
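
The four Hotspot Watchpoint steps can be simulated over an access trace to see why they cut down on spurious page faults. The sketch below makes several simplifying assumptions that are not in the slides: a 4096-byte page, at most one error per page, and a single hardware watchpoint. It compares the number of traps against plain page access control, where every access to a page containing an error faults.

```python
PAGE_SIZE = 4096  # assumed page size


def page_of(addr):
    return addr // PAGE_SIZE


def traps_page_access_control(trace, error_addrs):
    """Every access to a page holding an error causes a page fault."""
    protected = {page_of(a) for a in error_addrs}
    return sum(1 for addr in trace if page_of(addr) in protected)


def traps_hotspot_watchpoint(trace, error_addrs):
    """Return (page_faults, watchpoint_hits) for the hotspot scheme."""
    errors = set(error_addrs)
    protected = {page_of(a) for a in errors}
    watched = None  # error address currently covered by the watchpoint
    page_faults = watch_hits = 0
    for addr in trace:
        if addr == watched:
            watch_hits += 1  # tracked by the hardware watchpoint
        elif page_of(addr) in protected:
            page_faults += 1
            if addr in errors:
                if watched is not None:
                    protected.add(page_of(watched))  # re-protect the old page
                protected.discard(page_of(addr))     # unprotect the hot page
                watched = addr                       # move the watchpoint
    return page_faults, watch_hits


if __name__ == "__main__":
    errors = [100, 5000]
    trace = [100, 100, 200, 100, 5000, 100, 300]
    print(traps_page_access_control(trace, errors))  # 7
    print(traps_hotspot_watchpoint(trace, errors))   # (3, 2)
```

In this small trace, accesses to addresses 200 and 300 share a page with the error at address 100 but cause no trap once that page is the unprotected hotspot.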

SLIDE 25

Susceptibility Results

Hotspot Watchpoint Speedup

[Figure: Overhead of memory access tracking; normalized execution time for Web server, MCF, and Kernel build; series: Original (no tracking), Hotspot watchpoints, Page access control]

SLIDE 26

Susceptibility Results

Evaluation – Non-transient Error Susceptibility

Application: Web server, MCF, Kernel build

No ECC
  M1 (row-col error)  WO AC AC
  M2 (row error)      OK
  M3 (bit error)      OK
  M4 (chip error)     KC WO AC
  M5 (row error)      WO WO
  M6 (row error)      OK
  M7 (bit error)      OK
  M8 (bit error)
  M9 (col error)      WO

SECDED ECC
  M1 (row-col error)  WO WO AC
  M5 (row error)      WO WO

Table: KC—kernel crash; AC—application crash; WO—wrong output; OK—program runs correctly; blank—not accessed.

SLIDE 27

Susceptibility Results

Non-transient made transient

Application: Web server, MCF, Kernel build

No ECC
  M1 (row-col error)  WO AC OK
  M2 (row error)      OK
  M3 (bit error)      OK
  M4 (chip error)     KC OK OK
  M5 (row error)      WO OK
  M6 (row error)      OK
  M7 (bit error)      OK
  M8 (bit error)
  M9 (col error)      WO

SECDED ECC
  M1 (row-col error)  WO OK OK
  M5 (row error)      WO OK

Table: KC—kernel crash; AC—application crash; WO—wrong output; OK—program runs correctly; blank—not accessed.

SLIDE 28

Susceptibility Results

Additional Discussions

Miscellaneous validations of prior research in the paper

SLIDE 29

Conclusions

Contributions

Memory error data from production systems

  212 servers, 800 GB memory, 9 months
  Detailed information on error addresses and syndromes
  Substantial non-transient errors (row/column mostly)

Monte Carlo simulation on error manifestation

  Simulation on realistic data
  Significant non-transient errors among manifested
  Chipkill ECC very effective

Software susceptibility study

  A non-transient error injection tool
  A novel memory tracking approach: Hotspot Watchpoint
  Software is much more susceptible to non-transient errors

http://www.cs.rochester.edu/research/os/memerror
