INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION
Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan

INTERMITTENT FAULTS-DEFINITION
• Hardware errors that appear non-deterministically at the same microarchitectural location.
• 40% of the real-world failures in processors are caused by intermittent faults [1].
[Figure: timelines from the error start time for a transient fault, a permanent fault and an intermittent fault]

CONTRIBUTIONS
• Build a model of a chip multiprocessor running a parallel application using Stochastic Activity Networks.
• Propose intermittent fault models that abstract real intermittent faults at the system level.
• Evaluate the performance of a processor after applying different recovery options.

RECOVERY-MOTIVATION
[Figure sequence: program execution with periodic checkpoints (CHKPT), interrupted by a hardware error]
• Transient hardware error: recovery by restoring to the checkpoint.
• Permanent hardware error: recovery by restoring to the checkpoint and core reconfiguration.
• Intermittent hardware error: recovery by restoring to the checkpoint, core reconfiguration, or ?

MODEL OVERVIEW
System Model
• Processor Model: Rollback-Only, Permanent Reconfiguration, Temporary Reconfiguration
• Fault Model: Base, Exponential, Weibull

KEY FINDINGS
• Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates.
• Permanent shutdown of the defective unit results in a slight improvement of the performance compared to the temporary shutdown.

PROCESSOR MODEL
[Figure: flowchart for permanent reconfiguration. Nodes: Program Execution, Error, Error Detection, Rollback to Checkpoint, Reconfigure? (Yes/No), Fine-Grained Diagnosis, Unit Shutdown-Permanent, Perf. Degradation]

PROCESSOR MODEL
[Figure: flowchart for temporary reconfiguration. Nodes: Program Execution, Error, Error Detection, Rollback to Checkpoint, Reconfigure? (Yes/No), Fine-Grained Diagnosis, Unit Shutdown-Temporary, Perf. Degradation, Enable Unit? (No), Full Throughput]

FAULT MODEL-BASE FAULT MODEL
• Abstract physical fault models.
• Prune down the space of system configurations.
[Figure: state diagram with states Pulse and Error; labels λ, p, 1-p]

FAULT MODEL-EXPONENTIAL FAULT MODEL
• Abstract physical fault models.
• Prune down the space of system configurations.
[Figure: state diagram with states Inactive, Active, Pulse and Error; labels λ1, λ2, p, 1-p, d]

FAULT MODEL-WEIBULL FAULT MODEL
• Abstract physical fault models.
• Prune down the space of system configurations.
[Figure: state diagram with states Inactive, Active, Pulse and Error; labels λ1, σ, λ2, p, 1-p, d]
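
The exponential and Weibull fault model slides show an on/off process whose diagram did not survive extraction. The sketch below is one possible reading of the surviving labels, not the authors' Stochastic Activity Network. It assumes that inactive periods are exponentially distributed, that active periods follow a Weibull distribution (shape 1.0 giving the exponential case), and that each pulse of fixed length during an active period causes an error with probability p. The function name `count_errors` and all parameter names and values are hypothetical.

```python
import random


def count_errors(rng, horizon, mean_inactive, active_scale, active_shape,
                 pulse_len, p_error):
    """Count errors over `horizon` time units for an assumed on/off fault process."""
    if pulse_len <= 0 or mean_inactive <= 0 or active_scale <= 0:
        raise ValueError("durations must be positive")
    t, errors = 0.0, 0
    while t < horizon:
        # Inactive period: exponentially distributed (assumption).
        t += rng.expovariate(1.0 / mean_inactive)
        if t >= horizon:
            break
        # Active period: Weibull; a shape of 1.0 reduces to the exponential case.
        active_end = min(t + rng.weibullvariate(active_scale, active_shape), horizon)
        while t < active_end:
            # Each pulse causes an error with probability p_error (assumption).
            if rng.random() < p_error:
                errors += 1
            t += pulse_len
    return errors


# Illustrative parameters only; none of these values come from the slides.
exp_errors = count_errors(random.Random(1), 3600.0, 60.0, 5.0, 1.0, 0.5, 0.3)
wbl_errors = count_errors(random.Random(1), 3600.0, 60.0, 5.0, 2.0, 0.5, 0.3)
print(exp_errors, wbl_errors)
```

Passing a seeded `random.Random` instance keeps runs repeatable, which makes it easier to compare the two distributions under otherwise identical settings.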

EXPERIMENT SETUP
• Used Möbius [2] to simulate the system for 48 hours at a 95% confidence level.
• Used the useful work [3] measure to model processor throughput in a given amount of time.
• Analyzed a model of a multiprocessor running coordinated checkpointing.

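SYSTEM PARAMETERS
[Figure: permanent-reconfiguration flowchart (Program Execution, Error, Error Detection, Rollback to Checkpoint, Reconfigure?, Fine-Grained Diagnosis, Unit Shutdown-Permanent, Perf. Degradation) annotated with parameter values. Annotations as extracted: Checkpoint 30sec/5-60min; 70% Accuracy; Perf. Degradation 0-35%; 0-60sec; 2sec]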

RESEARCH QUESTIONS
• When should we recover from an intermittent fault by shutting down the defective component?
• For errors that are tolerated by shutting down the defective component, should the shutdown be permanent or temporary?

RESULTS-DIFFERENT FAULT MODELS
• Permanent/temporary reconfiguration leads to 27% more useful work than rollback-only for exponential and Weibull fault models.

RESEARCH QUESTION
• What is the granularity of the disabled component that maximizes the processor’s performance?

COMPONENT RANK
• The maximum percentage of useful work that is lost when the component is disabled.
• Example: a 4-core processor in which each core has two LSUs, running a program that uses all 8 LSUs for 60% of the time.
• Using Amdahl’s law, the LSU rank is 19%, or 1/(0.4 + (0.6/0.125)).
[Figure: four cores, each containing two LSUs]
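
The arithmetic on the component rank slide can be checked with a short sketch. It only evaluates the slide's expression 1/((1 - f) + f/s) with f = 0.6 and s = 0.125; the function name `amdahl_expression` and its parameter names are illustrative and do not come from the presentation.

```python
def amdahl_expression(fraction: float, factor: float) -> float:
    """Evaluate 1 / ((1 - fraction) + fraction / factor)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Values taken from the slide: LSUs in use 60% of the time, factor 0.125.
rank = amdahl_expression(0.6, 0.125)
print(f"LSU rank: {rank:.1%}")  # prints "LSU rank: 19.2%"
```

The result, 1/5.2, rounds to the 19% quoted on the slide.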

RESULT-EFFECT OF COMPONENT RANK
• For this experiment, components with a rank of 35% or more should be disabled if diagnosed with intermittent errors.

SENSITIVITY TO FAULT RATE

RESULTS-SENSITIVITY TO FAULT RATE
• If lost useful work outweighs the rank of the defective component, then the defective component should be disabled.
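
The rule on this slide can be written as a one-line comparison. This is a minimal sketch: the function name `should_disable` is hypothetical, both quantities are assumed to be expressed as fractions of useful work, and "lost useful work" is assumed to mean the work lost to the intermittent fault while the component stays enabled. The numeric inputs are illustrative and are not results from the presentation.

```python
def should_disable(lost_useful_work: float, component_rank: float) -> bool:
    """Return True when the lost useful work exceeds the component's rank."""
    for value in (lost_useful_work, component_rank):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions in [0, 1]")
    return lost_useful_work > component_rank


# Illustrative inputs: a component with rank 0.19 under two loss levels.
print(should_disable(0.25, 0.19))  # True: the loss outweighs the rank
print(should_disable(0.10, 0.19))  # False: keeping the component costs less
```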

KEY FINDINGS
• Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates.
• Permanent shutdown of the defective unit results in a slight improvement of the performance compared to the temporary shutdown.

[1] EuroSys, 2011.
[2] Tools of MME, 2003.
[3] DSN, 2005.
