
INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION



  1. INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan

  2. INTERMITTENT FAULTS-DEFINITION • Hardware errors that appear non-deterministically at the same microarchitectural location. • 40% of the real-world failures in processors are caused by intermittent faults [1]. [Figure: timelines, measured from the error start time, for a transient fault, a permanent fault, and an intermittent fault]

  3. CONTRIBUTIONS • Build a model of a chip multiprocessor running a parallel application using Stochastic Activity Networks. • Propose intermittent fault models that abstract real intermittent faults at the system level. • Evaluate the performance of a processor after applying different recovery options.

  4. RECOVERY-MOTIVATION [Figure: timeline of program execution segments separated by checkpoints (CHKPT)]

  5. RECOVERY-MOTIVATION [Figure: a hardware error during program execution after a checkpoint (CHKPT), marked "Problem!"]

  6. RECOVERY-MOTIVATION [Figure: a transient hardware error during program execution, marked "Problem!". Recovery: restore to checkpoint (CHKPT)]

  7. RECOVERY-MOTIVATION [Figure: a permanent hardware error during program execution, marked "Problem!". Recovery: restore to checkpoint (CHKPT) and core reconfiguration]

  8. RECOVERY-MOTIVATION [Figure: an intermittent hardware error during program execution, marked "Problem!". Recovery options shown: restore to checkpoint (CHKPT), core reconfiguration, and a "?" marking the choice as open]

  9. MODEL OVERVIEW System Model: • Processor Model: Rollback-Only, Permanent Reconfiguration, Temporary Reconfiguration. • Fault Model: Base, Exponential, Weibull.

  10. KEY FINDINGS • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates. • Permanent shutdown of the defective unit yields slightly better performance than temporary shutdown.

  11. PROCESSOR MODEL [Flowchart, permanent reconfiguration. Node labels: Program Execution; Error; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Permanent; Perf. Degradation]

  12. PROCESSOR MODEL [Flowchart, temporary reconfiguration. Node labels: Program Execution; Error; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Temporary; Perf. Degradation; Enable Unit? (No); Full Throughput]

  13. FAULT MODEL-BASE FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram labels: rate λ; branch probabilities p and 1-p; Pulse; Error]
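One way to read the base fault model is as a thinned Poisson process. The sketch below is a minimal illustration under assumptions the slide does not state: pulses are assumed to arrive with exponentially distributed gaps at rate λ, and each pulse is assumed to become an error with probability p (the slide does not say which branch, p or 1-p, leads to the error). The function name `count_errors` and all numeric values are hypothetical.

```python
import random

def count_errors(rate, p_error, horizon, rng):
    """Count errors over `horizon` time units.

    Assumption: pulses arrive as a Poisson process with the given rate,
    and each pulse independently turns into an error with probability p_error.
    """
    t = 0.0
    errors = 0
    while True:
        t += rng.expovariate(rate)      # waiting time until the next pulse
        if t > horizon:
            return errors
        if rng.random() < p_error:      # the pulse manifests as an error
            errors += 1

# Hypothetical parameters: 1 pulse per hour, 30% of pulses cause an error.
errors = count_errors(rate=1.0, p_error=0.3, horizon=10_000.0,
                      rng=random.Random(42))
expected = 1.0 * 0.3 * 10_000.0        # mean of the thinned process: rate * p * horizon
print(errors, expected)
```

Under these assumptions the error stream is itself a Poisson process with rate λ·p, so the simulated count should stay close to `expected`.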

  14. FAULT MODEL-EXPONENTIAL FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram labels: states Inactive and Active; rates λ1 and λ2; Pulse of duration d; branch probabilities p and 1-p; Error]

  15. FAULT MODEL-WEIBULL FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram labels: states Inactive and Active; rate λ2; parameters λ1, σ; Pulse of duration d; branch probabilities p and 1-p; Error]
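The two-state (Inactive/Active) fault models on the slides above can be illustrated as an alternating renewal process. The sketch below is not the authors' Stochastic Activity Network; it only simulates how long a fault spends in each state. Which rate governs which transition, the use of σ as a Weibull shape parameter, and every numeric value are assumptions made for illustration. The pulse duration d and the error probability p are left out.

```python
import math
import random

def active_fraction(sample_inactive, sample_active, horizon):
    """Simulate alternating Inactive/Active periods up to `horizon`
    and return the fraction of time spent in the Active state."""
    t = 0.0
    active_time = 0.0
    while t < horizon:
        t += sample_inactive()                     # sojourn in Inactive
        if t >= horizon:
            break
        burst = min(sample_active(), horizon - t)  # sojourn in Active, clipped at the horizon
        active_time += burst
        t += burst
    return active_time / horizon

rng = random.Random(1)
lam1, lam2 = 0.5, 2.0   # hypothetical rates (per hour): leave Inactive, leave Active
shape = 1.5             # hypothetical Weibull shape parameter
horizon = 200_000.0

# Exponential variant: both sojourn times are exponentially distributed.
exp_frac = active_fraction(lambda: rng.expovariate(lam1),
                           lambda: rng.expovariate(lam2), horizon)

# Weibull variant: the Inactive sojourn is Weibull with scale 1/lam1.
wb_frac = active_fraction(lambda: rng.weibullvariate(1.0 / lam1, shape),
                          lambda: rng.expovariate(lam2), horizon)

# Long-run Active fraction of an alternating renewal process:
# mean_active / (mean_inactive + mean_active).
exp_expected = (1 / lam2) / (1 / lam1 + 1 / lam2)
wb_mean_inactive = (1.0 / lam1) * math.gamma(1 + 1 / shape)
wb_expected = (1 / lam2) / (wb_mean_inactive + 1 / lam2)
print(exp_frac, exp_expected, wb_frac, wb_expected)
```

Passing the sojourn samplers in as callables keeps the simulation loop identical for both variants; only the distribution of the Inactive period changes.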

  16. EXPERIMENT SETUP • Used Mobius [2] to simulate the system for 48 hours with a 95% confidence interval. • Used the useful work measure [3] to model processor throughput over a given amount of time. • Analyzed a model of a multiprocessor running coordinated checkpointing.

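  17. SYSTEM PARAMETERS [Flowchart of the permanent-reconfiguration processor model annotated with parameter values. Node labels: Program Execution; Error; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Permanent; Perf. Degradation. Annotations as extracted: Checkpoint 30 sec / 5-60 min; 70% accuracy; Perf. Degradation 0-35%; 0-60 sec; 2 sec]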

  18. RESEARCH QUESTIONS • When should we recover from an intermittent fault by shutting down the defective component? • For errors that are tolerated by shutting down the defective component, should the shutdown be permanent or temporary?

  19. RESULTS-DIFFERENT FAULT MODELS • Permanent/temporary reconfiguration leads to 27% more useful work than rollback-only for the exponential and Weibull fault models.

  20. RESEARCH QUESTION What is the granularity of the disabled component that maximizes the processor’s performance?

  21. COMPONENT RANK • Component rank: the maximum percentage of useful work that is lost when the component is disabled. • Example: a 4-core processor in which each core has two load-store units (LSUs), running a program that uses all 8 LSUs for 60% of the time. • Using Amdahl’s law, the LSU rank is 19%, or 1/(0.4 + (0.6/0.125)).
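The arithmetic on this slide can be checked directly. The sketch below only evaluates the expression exactly as written on the slide, 1/(0.4 + (0.6/0.125)); the function name and the reading of 0.125 as one LSU out of eight are assumptions, since the slide does not explain the terms.

```python
def amdahl_expression(fraction, factor):
    """Evaluate the Amdahl-style expression 1 / ((1 - fraction) + fraction / factor)."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Values from the slide: the LSUs are used 60% of the time;
# 0.125 is assumed to stand for one LSU out of the 8 in the processor.
lsu_rank = amdahl_expression(fraction=0.6, factor=0.125)
print(f"{lsu_rank:.1%}")  # 19.2%
```

The result, 1/5.2 ≈ 0.192, matches the 19% quoted on the slide.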

  22. RESULT-EFFECT OF COMPONENT RANK • For this experiment, components with a rank of 35% or more should be disabled if diagnosed with intermittent errors.

  23. SENSITIVITY TO FAULT RATE

  24. RESULTS-SENSITIVITY TO FAULT RATE • If lost useful work outweighs the rank of the defective component, then the defective component should be disabled.
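The rule on this slide amounts to a single comparison. The sketch below is a minimal illustration, assuming that both quantities are expressed as fractions of useful work and that "lost useful work" refers to the loss incurred when the component stays enabled; the function name and the example numbers are hypothetical.

```python
def should_disable(lost_useful_work, component_rank):
    """Return True when the useful work lost to errors exceeds the
    component's rank (the work lost by disabling the component)."""
    return lost_useful_work > component_rank

# Hypothetical example: errors cost 25% of useful work,
# while disabling the component would cost at most 19%.
print(should_disable(lost_useful_work=0.25, component_rank=0.19))  # True
print(should_disable(lost_useful_work=0.10, component_rank=0.19))  # False
```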

  25. KEY FINDINGS • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates. • Permanent shutdown of the defective unit yields slightly better performance than temporary shutdown. References: [1] EuroSys, 2011. [2] Tools of MME, 2003. [3] DSN, 2005.
