
Exascale Failure Modeling with CoFaCTOR (Correlated Failure Consultation Tool for Operational Reliability)



  1. Exascale Failure Modeling with CoFaCTOR: Correlated Failure Consultation Tool for Operational Reliability
     PIs: Dave Bonnie, Dominic Manno, Wendy Poole, Brad Settlemyer
     May 21, 2019
     Managed by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
     LA-UR-19-24681

  2. BLUF (Bottom line up front)
     • We need a tool for decision support
       – Why?
         • We need to be able to do predictive analytics for system failures as well as potential catastrophic disk or filesystem failures.
         • Within the exascale environment, the filesystems and network will be so complex that determining failure causes without the assistance of a tool will be nearly impossible.
     • We don’t believe the failures within the storage environment have been well identified
       – Traditional assumptions on drive mix and failures no longer hold
       – Multi-factor positional considerations (shelf position, rack, row) (vibration…)
     • LANL’s filesystem environment is very different from the cloud environment
       – The loss of a single 1MB stripe may invalidate an entire 1PB file
     • We save memory state, not cat pictures and memes
     Los Alamos National Laboratory, 5/21/2019

  3. Problem: Correlated disk failures

  4. Problem: Why is LANL hitting this now?
     • Large differences between hyperscalers and LANL
       – 98.3% of YouTube videos are less than 25MB*
     • LANL has very large files; tens of TB is not uncommon
       – The data lost in a failure event differs by orders of magnitude
     * “Statistics and Social Network of YouTube Videos”, Cheng, Dale, Liu
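
The file-size contrast on this slide can be made concrete with a little arithmetic. The sketch below is not from the deck: it assumes a hypothetical 10 PB file system with 1 MB stripes (decimal units), assumes lost stripes fall uniformly at random, and uses made-up helper names (`stripes_in_file`, `p_file_hit`). Under those assumptions it compares how likely a 25 MB file and a 1 PB file are to be touched by a single lost stripe.

```python
MB = 10**6   # decimal megabyte (assumption; the deck does not say MB vs MiB)
PB = 10**15  # decimal petabyte


def stripes_in_file(file_bytes, stripe_bytes=MB):
    """Number of stripes a file occupies (ceiling division)."""
    return -(-file_bytes // stripe_bytes)


def p_file_hit(file_bytes, system_bytes, lost_stripes=1, stripe_bytes=MB):
    """Probability that at least one of the file's stripes is among the lost.

    Assumes the lost stripes are drawn uniformly at random, without
    replacement, from all stripes in the system (a simplification).
    """
    total = system_bytes // stripe_bytes
    mine = stripes_in_file(file_bytes, stripe_bytes)
    p_miss = 1.0
    for i in range(lost_stripes):
        # Chance that the i-th lost stripe also misses this file.
        p_miss *= max(total - mine - i, 0) / (total - i)
    return 1.0 - p_miss


if __name__ == "__main__":
    system = 10 * PB  # hypothetical system size, for illustration only
    for label, size in (("25 MB file", 25 * MB), ("1 PB file", PB)):
        print(label, stripes_in_file(size), "stripes,",
              "P(hit by one lost stripe) =", p_file_hit(size, system))
```

In this toy setup the large file is both far more likely to be hit and far more costly when it is, since the whole file is invalidated rather than 25 MB of it.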

  5. Components Available Today
     • Data already collected
       – LANL file distributions
       – LANL failure events
       – Industry failure events also published (e.g. Backblaze)
     • Industry-standard data protection schemes
       – Parity-based data protection (GridRAID, ISA-L, RAIDZ3)
     • Existing statistical techniques are appropriate
       – Monte Carlo simulation
       – Failure modes well studied for disks
     • Existing simulation toolkits are sufficient
       – PySIM, LANL’s Simian, OMNeT++

  6. Components Available Today (list repeated from slide 5, with callout)
     These techniques are well understood in the operations research (OR), statistics, and reliability communities. We simply need to apply them correctly to our data!
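
As a baseline for the parity-based schemes listed above, the textbook calculation treats disk failures as independent. The sketch below is an illustration under that assumption, not the deck's method: `p_group_loss` is a hypothetical helper, and the 10-disk group and 1% per-window failure probability are made-up numbers. It gives the probability that more disks fail in one window than the parity can absorb.

```python
from math import comb


def p_group_loss(n_disks, parity, p_fail):
    """P(more than `parity` of `n_disks` fail), assuming independent failures.

    Binomial tail: sum over j = parity+1 .. n_disks of
    C(n, j) * p^j * (1 - p)^(n - j).
    """
    return sum(
        comb(n_disks, j) * p_fail**j * (1 - p_fail) ** (n_disks - j)
        for j in range(parity + 1, n_disks + 1)
    )


if __name__ == "__main__":
    # Hypothetical 10-disk group, 1% chance each disk fails in the window.
    for parity in (2, 3):
        print("parity", parity, "->", p_group_loss(10, parity, 0.01))
```

This is exactly the kind of traditional assumption slide 2 says no longer holds: when failures are correlated by shelf, rack, or row, the independent-failure tail can understate the real risk.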

  7. CoFaCTOR Overview
     • Model inputs:
       o LANL’s empirical file distribution
       o Storage system characteristics
       o Protection strategies
       o Do we need more error protection and at what levels/complexity?
     • Evaluate failures via Monte Carlo
       o Generate realistic failure traces
       o Identify probabilities of loss
       o Analyze data loss scenarios
       o Evaluate and analyze distributions of affected files
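
The Monte Carlo steps above (generate failure traces, then estimate the probability of loss) can be sketched in a few lines. This is a toy illustration, not CoFaCTOR itself: every function name and parameter is hypothetical, the probabilities are invented rather than taken from LANL data, and the correlation model (a per-enclosure event that raises the failure odds of every disk in that enclosure) is only one possible assumption. It compares two parity-group layouts over an n-by-n grid of disks: "local" keeps each group inside one enclosure, "spread" places one disk of each group in every enclosure.

```python
import random


def failure_trace(rng, n, p_disk, p_event, p_disk_given_event):
    """One simulated window: return the set of failed (enclosure, slot) disks."""
    failed = set()
    for enclosure in range(n):
        # A correlated event (e.g. vibration) raises the failure odds of
        # every disk in this enclosure for this window.
        event = rng.random() < p_event
        p = p_disk_given_event if event else p_disk
        for slot in range(n):
            if rng.random() < p:
                failed.add((enclosure, slot))
    return failed


def parity_groups(n, layout):
    """Partition the n x n disks into n parity groups of width n."""
    if layout == "local":   # each group lives entirely in one enclosure
        return [[(g, s) for s in range(n)] for g in range(n)]
    if layout == "spread":  # each group takes one disk from every enclosure
        return [[(e, g) for e in range(n)] for g in range(n)]
    raise ValueError(f"unknown layout: {layout!r}")


def loss_probability(layout, trials=2000, seed=0, n=10, parity=2,
                     p_disk=0.01, p_event=0.02, p_disk_given_event=0.3):
    """Estimate P(at least one group loses more disks than its parity)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    groups = parity_groups(n, layout)
    losses = 0
    for _ in range(trials):
        failed = failure_trace(rng, n, p_disk, p_event, p_disk_given_event)
        if any(sum(d in failed for d in group) > parity for group in groups):
            losses += 1
    return losses / trials


if __name__ == "__main__":
    for layout in ("local", "spread"):
        print(layout, loss_probability(layout))
```

Because the random draws do not depend on the layout, running both layouts with the same seed evaluates them against identical failure traces, which isolates the effect of placement. A fuller model would replace the invented probabilities with the empirical failure events and file distributions named on the slide.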

  8. CoFaCTOR Usage
     • Using existing methods will enable:
       – Understanding catastrophic data loss scenarios
         • LANL HPC fields approximately 11 different file systems, each with different data protection schemes and data retention times
       – Evaluating new technologies (with respect to data protection capabilities)
       – Improving future storage procurements
     • Additional opportunities:
       – Provide further collaboration for environments with similar predictive requirements (anyone here want to share?) :)
       – Provide further collaboration in this space across other federal agencies (DOD)
