Understanding latent sector errors and how to protect against them Bianca Schroeder, Sotirios Damouras, Phillipa Gill

Motivation  What is a latent sector error (LSE)?  Individual sectors on a drive become inaccessible (media error)  Prevalence?  3.5% of drives experience LSE(s) [Bairavasundaram2007]  7-9% for some disk models!  Consequence of an LSE?  In a system without redundancy: data loss  In RAID-5, if discovered during reconstruction: data loss  One of the main motivations for RAID-6  Growing concern with growing disk capacities

How to protect against them?  Periodic scrubbing  Proactively detect LSEs and correct them.  Intra-disk redundancy  Replicate selected metadata [e.g. FFS]  Add parity block per file [e.g. Iron file systems]  Add (XOR) parity block per group of sectors [Dholakia08]

Our goal  Understand potential of different protection schemes  Understand characteristics of LSEs  From point of view of protection  How?  Using real data from production machines  Subset of data in Bairavasundaram et al. (Sigmetrics’07)  Thanks for sharing!

The data  The systems: NetApp storage systems in the field  1.5 million drives  SATA & SCSI  Focus on - 4 SATA models - 4 SCSI models  The data:  Covers 32 months  LSEs detected by - application access - scrubber (bi-weekly)  For each LSE: - Time of detection - LBN

How effective are protection schemes?  Scrubbing  Intra-disk redundancy

Scrubbing  Why?  Detect and correct errors early  Reduces probability to encounter LSE during RAID reconstruction

Scrubbing  Standard sequential scrubbing

Scrubbing  Standard sequential scrubbing  Localized scrubbing

Scrubbing  Standard sequential scrubbing  Localized scrubbing  Accelerated scrubbing

Scrubbing  Standard sequential scrubbing  Localized scrubbing  Accelerated scrubbing  Staggered scrubbing [Oprea et al. ‘10]

Scrubbing  Standard sequential scrubbing  Localized scrubbing  Accelerated scrubbing  Staggered scrubbing [Oprea et al. ‘10]  Accelerated staggered scrubbing  How do those approaches perform in practice, i.e. on real-world data?
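
The slides name staggered scrubbing without spelling out its visiting order. The sketch below assumes the usual description of the scheme: the drive is split into regions, each region into segments, and the scrubber reads the first segment of every region, then the second segment of every region, and so on. The function names and the toy drive layout are illustrative assumptions, not taken from the talk or from Oprea et al.

```python
from typing import Iterable, List, Optional, Set


def sequential_order(num_segments: int) -> List[int]:
    """Standard sequential scrub: visit segments in LBN order."""
    return list(range(num_segments))


def staggered_order(num_segments: int, num_regions: int) -> List[int]:
    """Assumed staggered order: segment 0 of each region, then segment 1 of each region, ..."""
    if num_regions <= 0 or num_segments % num_regions != 0:
        raise ValueError("num_segments must be a positive multiple of num_regions")
    per_region = num_segments // num_regions
    return [
        region * per_region + offset
        for offset in range(per_region)
        for region in range(num_regions)
    ]


def segments_until_first_hit(order: Iterable[int], bad_segments: Set[int]) -> Optional[int]:
    """Number of segments read before the scrub first touches a segment with errors."""
    for step, segment in enumerate(order, start=1):
        if segment in bad_segments:
            return step
    return None


if __name__ == "__main__":
    # Toy drive: 12 segments in 4 regions, with a burst of errors in segments 9-11.
    burst = {9, 10, 11}
    print(segments_until_first_hit(sequential_order(12), burst))
    print(segments_until_first_hit(staggered_order(12, 4), burst))
```

In this toy layout the staggered order probes every region early, so a spatially clustered burst is touched after fewer reads than in the sequential order, while the total number of segments read per full pass is unchanged. Other burst positions can favor the sequential order; the example only illustrates the mechanism.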

Scrubbing: Evaluation on NetApp data [Figure: results for local scrub, accelerated scrub, staggered scrub, and staggered accelerated scrub]  No significant improvement from local & accelerated scrubs  They don’t reduce the time to detect whether there are any errors  Errors are close in space, so even standard scrub finds them soon

Scrubbing: Evaluation on NetApp data [Figure: results for local scrub, accelerated scrub, staggered scrub, and staggered accelerated scrub]  10-35% improvement with staggered scrubs!  Even better than the original paper claims!  Without introducing any overheads or additional reads  Relatively insensitive to choice of parameters

Intra-disk redundancy  Why?  Recover LSEs in systems without redundancy  Recover LSEs during reconstruction in RAID-5  Goal:  Evaluate potential protection  What fraction of errors could be recovered  Qualitative discussion of overheads

Intra-disk redundancy  Simplest scheme: Single Parity Check (SPC)  Can recover up to one LSE per parity group [Figure: parity group of k data sectors and 1 parity sector]  Results from evaluation on NetApp data:  25-50% of drives have errors that SPC cannot recover  Consider stronger schemes?
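
A single parity check can be illustrated with plain XOR over a parity group. The following is a minimal sketch, assuming sectors are equal-length byte strings and an unreadable sector is represented as None; the helper names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_sectors(sectors: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length sectors."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))


def recover_group(group: List[Optional[bytes]]) -> List[bytes]:
    """Repair a parity group (k data sectors + 1 parity sector).

    None marks a sector lost to an LSE. SPC can rebuild at most one
    lost sector per group; more than one loss is unrecoverable.
    """
    missing = [i for i, sector in enumerate(group) if sector is None]
    if len(missing) > 1:
        raise ValueError("SPC cannot recover more than one lost sector per group")
    repaired = list(group)
    if missing:
        survivors = [sector for sector in group if sector is not None]
        # XOR of all surviving sectors (data and parity) equals the lost sector.
        repaired[missing[0]] = xor_sectors(survivors)
    return repaired


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = xor_sectors(data)
    damaged = [data[0], None, data[2], parity]
    print(recover_group(damaged)[1] == data[1])
```

Because neighboring sectors often fail together, two errors can easily fall into the same parity group, which is the case this scheme cannot handle.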

Stronger schemes?  Additional parity => additional overhead in updating parity  When would that be interesting?  In environments  … like archival systems, that don’t have updates and don’t like scrubs since they require powering up the system  … with read-mostly workloads, i.e. parity updates are rare  … for selected critical data on a drive, such as meta-data

Inter-leaved Parity Check (IPC) [Dholakia08] [Figure: k data sectors followed by m redundant (parity) sectors]  Requires only 1 parity update per data update  Can tolerate up to m consecutive errors

Inter-leaved Parity Check (IPC) [Dholakia08] [Figure: k data sectors followed by m redundant (parity) sectors]  Claim: Achieves protection as good as MDS codes [Dholakia08] • MDS=Maximum distance separable, e.g. Reed-Solomon • Expensive, but can tolerate loss of any m sectors  Results: (from evaluation on NetApp data)  Far weaker than MDS!  Not significantly better than SPC  Results differ from [Dholakia08]  Importance of real-world data.  Paper provides models & parameters  Implications  Need different ideas for improving on SPC  Maybe reuse ideas from RAID-6? (see paper for details & results)
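
The gap between IPC and an MDS code can be seen by comparing which error patterns each can repair. The sketch below assumes that data sector i of a group belongs to interleave i mod m and that each interleave is protected by its own XOR parity sector; it ignores errors on the parity sectors themselves. These modeling choices and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def ipc_recoverable(error_positions: Iterable[int], m: int) -> bool:
    """IPC with m interleaves: recoverable if no interleave has more than one error."""
    per_interleave = Counter(position % m for position in set(error_positions))
    return all(count <= 1 for count in per_interleave.values())


def mds_recoverable(error_positions: Iterable[int], m: int) -> bool:
    """MDS code with m redundant sectors: recoverable if at most m sectors are lost."""
    return len(set(error_positions)) <= m


if __name__ == "__main__":
    m = 3
    for errors in ([4, 5, 6], [0, 3], [0, 1, 2, 3]):
        print(errors, ipc_recoverable(errors, m), mds_recoverable(errors, m))
```

Under these assumptions a run of m consecutive errors is repaired by both schemes, while two errors that happen to land in the same interleave defeat IPC even though an MDS code with the same m would still recover them.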

Questions unanswered …  What level of protection to use when?  E.g. what is the right scrub frequency?  Depends on error probability at a given time

Do previous errors predict future? [Figures: number of future errors; probability of future errors]  Many previous errors => higher chance of future errors => higher number of future errors  Big differences between models  Adapt protection based on previous errors  Know your patient ..

Does first error interval predict future? [Figures: x-axis in both panels is #errors in first scrub with errors]  Number of errors in first error interval: - Do increase expected number of future errors - Don’t significantly increase probability of future occurrence

For how long are probabilities increased? [Figure: x-axis is #weeks since first error]  Exponential drop-off, but still significant after tens of weeks  Independent of number of errors in first interval  Taper off added protection over time, e.g. reduce scrub rate
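
One way to picture "taper off added protection" is a scrub interval that is shortened right after the first error and relaxes back to the baseline as the elevated risk decays. This is a hypothetical sketch: the 14-day baseline mirrors the bi-weekly scrubber mentioned earlier, but the boost factor, the decay constant, and the exponential form of the policy are illustrative assumptions, not values from the talk or the paper.

```python
import math


def scrub_interval_days(
    weeks_since_first_error: float,
    base_interval_days: float = 14.0,  # bi-weekly baseline scrub
    boost: float = 4.0,                # hypothetical: scrub 4x as often right after an error
    decay_weeks: float = 10.0,         # hypothetical decay constant of the added protection
) -> float:
    """Scrub interval that starts short after the first error and tapers back to the baseline."""
    if weeks_since_first_error < 0:
        raise ValueError("weeks_since_first_error must be non-negative")
    extra_rate = (boost - 1.0) * math.exp(-weeks_since_first_error / decay_weeks)
    return base_interval_days / (1.0 + extra_rate)


if __name__ == "__main__":
    for weeks in (0, 10, 20, 40):
        print(weeks, round(scrub_interval_days(weeks), 2))
```

Since the slide reports that the drop-off does not depend on the number of errors in the first interval, the sketch deliberately takes only the elapsed time as input.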

Questions unanswered …  What level of protection to use when?  What is the error probability at a given time?  What level of protection to use where?  Are all areas of the drive equally likely to develop errors?

Where on the drive are errors located?  Up to 50% of errors concentrated in top/bottom 10% of drive  Also increased probability in some other parts of the drive  Stronger protection for those areas  Don’t use for important data
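
A concentration figure like this can be computed from the logged LBNs of each error. The sketch below assumes that "top/bottom 10% of drive" refers to the lowest and highest 10% of the logical block number space; the function name and the sample values are made up for illustration.

```python
from typing import Sequence


def edge_fraction(error_lbns: Sequence[int], drive_sectors: int, edge: float = 0.10) -> float:
    """Fraction of errors whose LBN lies in the first or last `edge` share of the drive."""
    if not error_lbns:
        return 0.0
    low = edge * drive_sectors
    high = (1.0 - edge) * drive_sectors
    hits = sum(1 for lbn in error_lbns if lbn < low or lbn >= high)
    return hits / len(error_lbns)


if __name__ == "__main__":
    # Made-up error locations on a toy drive with 1000 sectors.
    print(edge_fraction([5, 50, 500, 950], drive_sectors=1000))
```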

Questions unanswered …  What level of protection to use when?  What is the error probability at a given time?  Same protection scheme across entire drive?  Are all parts equally likely to develop errors?  Scrubbing potentially harmful?  Do additional read operations increase error rate?

Does utilization affect LSEs?  Collected data in Google data center (>10,000 drives) on  Number of LSEs  Number of reads & number of writes  Results:  No correlation between #reads and #LSEs  No correlation between #writes and #LSEs  Maybe need not worry about scrubs introducing new errors?  Needs further investigation (future work).

Questions unanswered …  What level of protection to use when?  What is the error probability at a given time?  Same protection scheme across entire drive?  Are all parts equally likely to develop errors?  Scrubbing potentially harmful?  Do additional read operations increase error rate?  What is the common distance between errors …  Important for example for replica placement

How far are errors spaced apart?  20-60% of errors have a neighbor within < 10 sectors  Probability concentration (bumps) at certain distances  Explains why single parity scheme not always helpful  Avoid placing replicas at certain distances
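
The neighbor statistic can be reproduced on any list of error LBNs from a single drive. This is a minimal sketch, assuming "within < 10 sectors" means an LBN distance strictly below 10 to the nearest other error; the function name and sample data are illustrative.

```python
from typing import Iterable


def fraction_with_close_neighbor(error_lbns: Iterable[int], max_distance: int = 10) -> float:
    """Fraction of distinct error LBNs that have another error less than max_distance sectors away."""
    lbns = sorted(set(error_lbns))
    if not lbns:
        return 0.0
    close = 0
    for i, lbn in enumerate(lbns):
        # In a sorted list the nearest other error is always an adjacent entry.
        left = i > 0 and lbn - lbns[i - 1] < max_distance
        right = i + 1 < len(lbns) and lbns[i + 1] - lbn < max_distance
        if left or right:
            close += 1
    return close / len(lbns)


if __name__ == "__main__":
    # Made-up error locations: two tight clusters and one isolated error.
    print(fraction_with_close_neighbor([100, 103, 500, 9000, 9009]))
```

Errors that sit this close together are likely to share a parity group, which is consistent with the slide's remark that a single parity scheme is not always helpful.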

Questions unanswered …  What level of protection to use when?  What is the error probability at a given time?  Different protection for different parts of the drive?  Are all parts equally likely to develop errors?  Scrubbing potentially harmful?  Do additional read operations increase error rate?  What is the common distance between errors …  Important for replica placement  Are errors that are close in space also close in time?  Yes!

Questions unanswered …  What level of protection to use when?  What is the error probability at a given time?  Different protection for different parts of the drive?  Are all parts equally likely to develop errors?  Scrubbing potentially harmful?  Do additional read operations increase error rate?  What is the common distance between errors …  Important for replica placement  Are errors that are close in space also close in time?  Yes!  And many other questions – see paper!

Conclusion  Evaluated potential of different protection schemes  Scrubbing  Simple new scheme (staggered scrubbing) performs very well!  Intra-disk redundancy  Single parity can recover LSEs in 50-75% of the drives  Need to look at more complex schemes for coverage beyond that  Looked at characteristics of LSEs  And how to exploit them for reliability  Many characteristics not captured well by simple models  Provided parameters for models

 Thanks!  To NetApp for sharing the data  To you for listening  Questions?
