INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION



slide-1
SLIDE 1

LAYALI RASHID, KARTHIK PATTABIRAMAN AND SATHISH GOPALAKRISHNAN

INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION

slide-2
SLIDE 2

INTERMITTENT FAULTS-DEFINITION

  • Hardware errors that appear non-deterministically at the same microarchitectural location.
  • 40% of the real-world failures in processors are caused by intermittent faults [1].

[Figure: timelines for a transient fault, a permanent fault, and an intermittent fault, each marked with the error start time]

slide-3
SLIDE 3

CONTRIBUTIONS

  • Build a model of a chip multiprocessor running a parallel application using Stochastic Activity Networks.
  • Propose intermittent fault models that abstract real intermittent faults at the system level.
  • Evaluate the performance of a processor after applying different recovery options.

slide-4
SLIDE 4

RECOVERY-MOTIVATION

[Figure: timeline alternating program execution and checkpoints (CHKPT)]

slide-5
SLIDE 5

RECOVERY-MOTIVATION

[Figure: program execution after a checkpoint (CHKPT) hits a problem: a hardware error]

slide-6
SLIDE 6

RECOVERY-MOTIVATION

[Figure: program execution after a checkpoint (CHKPT) hits a problem: a transient hardware error. Recovery: restore to checkpoint.]

slide-7
SLIDE 7

RECOVERY-MOTIVATION

[Figure: program execution after a checkpoint (CHKPT) hits a problem: a permanent hardware error. Recovery: restore to checkpoint, then core reconfiguration.]

slide-8
SLIDE 8

RECOVERY-MOTIVATION

[Figure: program execution after a checkpoint (CHKPT) hits a problem: an intermittent hardware error. Recovery options shown: restore to checkpoint, core reconfiguration, and a question mark.]

slide-9
SLIDE 9

MODEL OVERVIEW

System Model

Processor Model

  • Rollback-Only
  • Permanent Reconfiguration
  • Temporary Reconfiguration

Fault Model

  • Base
  • Exponential
  • Weibull

slide-10
SLIDE 10

KEY FINDINGS

  • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates.
  • Permanent shutdown of the defective unit results in a slight improvement in performance compared to temporary shutdown.

slide-11
SLIDE 11

PROCESSOR MODEL

[Flowchart, node labels: Program Execution; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Permanent, with Perf. Degradation. Edge labels: Error, Yes, No.]

slide-12
SLIDE 12

PROCESSOR MODEL

[Flowchart, node labels: Program Execution; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Temporary, with Perf. Degradation; Enable Unit? (No); Program Execution at Full Throughput. Edge labels: Error, Yes, No.]

slide-13
SLIDE 13

FAULT MODEL-BASE FAULT MODEL

  • Abstract physical fault models.
  • Prune down the space of system configurations.

[Diagram: states Pulse and Error; rate λ; branch probabilities p and 1-p]
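The slide only gives the diagram labels, so the following is a minimal sketch of one plausible reading of the base fault model, not the authors' SAN model. It assumes that pulses arrive with exponentially distributed gaps at rate λ and that each pulse becomes an error with probability p (and has no effect with probability 1-p). The function name, parameter names, and example values are hypothetical.

```python
import random

def simulate_base_fault_model(rate, p, horizon, seed=0):
    """Return the times at which pulses turn into errors within [0, horizon).

    Assumed semantics (not stated on the slide): pulses form a Poisson
    stream with the given rate, and each pulse independently causes an
    error with probability p.
    """
    rng = random.Random(seed)
    t = 0.0
    error_times = []
    while True:
        # Exponential inter-arrival times give a Poisson stream of pulses.
        t += rng.expovariate(rate)
        if t >= horizon:
            break
        # With probability p the pulse becomes an error; otherwise it is ignored.
        if rng.random() < p:
            error_times.append(t)
    return error_times

# Illustrative values only; they are not taken from the presentation.
errors = simulate_base_fault_model(rate=0.5, p=0.3, horizon=1000.0)
print(len(errors))
```

Under these assumptions the long-run error rate is λ·p, which is what a recovery policy would actually observe.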

slide-14
SLIDE 14

FAULT MODEL-EXPONENTIAL FAULT MODEL

  • Abstract physical fault models.
  • Prune down the space of system configurations.

[Diagram: states Inactive, Active, Pulse and Error; parameters λ1, d, λ2; branch probabilities p and 1-p]

slide-15
SLIDE 15

FAULT MODEL-WEIBULL FAULT MODEL

  • Abstract physical fault models.
  • Prune down the space of system configurations.

[Diagram: states Inactive, Active, Pulse and Error; parameters λ1, σ, d, λ2; branch probabilities p and 1-p]
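The exponential and Weibull fault models differ from the base model by adding Inactive and Active states. The sketch below shows one way such a bursty fault could be simulated. Everything beyond the state names is an assumption: the inactive period is drawn from a Weibull distribution (shape 1.0 reduces it to an exponential), the active period lasts a fixed duration `d`, and pulses inside an active period arrive at rate `pulse_rate` and become errors with probability `p`. How λ1, σ, d, and λ2 map onto these roles is not stated on the slides, so the parameter names here are hypothetical.

```python
import random

def simulate_burst_fault_model(scale, shape, d, pulse_rate, p, horizon, seed=0):
    """Return error times for an assumed inactive/active intermittent fault.

    scale, shape: Weibull parameters of the inactive period (assumption);
                  shape == 1.0 gives an exponential period with mean `scale`.
    d:            length of each active period (assumption).
    pulse_rate:   pulse arrival rate while the fault is active (assumption).
    p:            probability that a pulse becomes an error.
    """
    rng = random.Random(seed)
    t = 0.0
    error_times = []
    while t < horizon:
        # Inactive period: the fault produces no pulses.
        t += rng.weibullvariate(scale, shape)
        burst_end = min(t + d, horizon)
        # Active period: pulses arrive until the burst ends.
        while True:
            t += rng.expovariate(pulse_rate)
            if t >= burst_end:
                break
            if rng.random() < p:
                error_times.append(t)
        t = burst_end
    return error_times

# Illustrative values only; shape=1.0 mimics the exponential variant.
exp_errors = simulate_burst_fault_model(50.0, 1.0, 5.0, 2.0, 0.3, 10_000.0)
weibull_errors = simulate_burst_fault_model(50.0, 1.5, 5.0, 2.0, 0.3, 10_000.0)
print(len(exp_errors), len(weibull_errors))
```

Using a single function with a `shape` argument keeps the two variants comparable: only the distribution of the inactive period changes.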

slide-16
SLIDE 16

EXPERIMENT SETUP

  • Used Mobius [2] to simulate the system for 48 hours with a 95% confidence interval.
  • Used the useful work [3] measure to model processor throughput over a given amount of time.
  • Analyzed a model of a multiprocessor running coordinated checkpointing.
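The slide does not say how the 95% confidence interval is obtained. As a generic illustration, and not a description of the simulator's own procedure, the sketch below computes a normal-approximation interval for the mean of a measure across independent simulation replications. The function name and the sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(samples, z=1.96):
    """Return (low, high) bounds of a normal-approximation interval for the mean.

    z = 1.96 corresponds to a 95% confidence level under the normal
    approximation; with few replications a Student-t quantile would be
    more appropriate.
    """
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width

# Hypothetical useful-work fractions from five replications.
low, high = mean_confidence_interval([0.81, 0.79, 0.83, 0.80, 0.82])
print(f"{low:.3f} to {high:.3f}")
```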

slide-17
SLIDE 17

SYSTEM PARAMETERS

[Flowchart: the permanent-shutdown processor model (Program Execution, Error Detection, Rollback to Checkpoint, Reconfigure?, Fine-Grained Diagnosis, Unit Shutdown-Permanent, Perf. Degradation, Checkpoint), annotated with parameter values: 30 sec / 5-60 min, 70% accuracy, 0-60 sec, 2 sec, 0-35%]

slide-18
SLIDE 18

RESEARCH QUESTIONS

  • When should we recover from an intermittent fault by shutting down the defective component?
  • For errors that are tolerated by shutting down the defective component, should the shutdown be permanent or temporary?

slide-19
SLIDE 19

RESULTS-DIFFERENT FAULT MODELS

  • Permanent/temporary reconfiguration leads to 27% more useful work than rollback-only for exponential and Weibull fault models.

slide-20
SLIDE 20

RESEARCH QUESTION

What is the granularity of the disabled component that maximizes the processor’s performance?

slide-21
SLIDE 21

COMPONENT RANK

  • The maximum percentage of useful work that is lost when the component is disabled.
  • 4-core processor: each core has two LSUs and runs a program that uses all 8 LSUs for 60% of the time.
  • Using Amdahl’s law, the LSU rank is 1/(0.4 + (0.6/0.125)) ≈ 19%.

[Figure: processor diagram with LSU blocks]
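The arithmetic in the LSU example can be checked directly. The sketch below simply evaluates the expression as it appears on the slide. The function and argument names are hypothetical, and reading 0.125 as 1/8 (one of the eight LSUs) is an assumption, since the slide does not explain the factor.

```python
def slide_rank_expression(unaffected_fraction, affected_fraction, factor):
    """Evaluate 1 / (unaffected + affected / factor), the Amdahl-style form on the slide."""
    return 1.0 / (unaffected_fraction + affected_fraction / factor)

# 40% of the time is unaffected, 60% uses the LSUs, factor 0.125 as on the slide.
rank = slide_rank_expression(0.4, 0.6, 0.125)
print(f"{rank:.1%}")  # prints 19.2%
```

The denominator is 0.4 + 4.8 = 5.2, so the expression evaluates to about 0.192, matching the 19% quoted on the slide.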

slide-22
SLIDE 22

RESULT-EFFECT OF COMPONENT RANK

  • For this experiment, components with a rank of 35% or more should be disabled if diagnosed with intermittent errors.

slide-23
SLIDE 23

Sensitivity to Fault Rate


slide-24
SLIDE 24

RESULTS- SENSITIVITY TO FAULT RATE

  • If lost useful work outweighs the rank of the defective component, then the defective component should be disabled.
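The rule above is a simple comparison, sketched below. It assumes that both quantities are expressed as fractions of useful work on the same scale. The function name and the example values are hypothetical and are not results from the presentation.

```python
def should_disable(lost_useful_work, component_rank):
    """Return True when the useful work lost to errors exceeds the component's rank.

    lost_useful_work: fraction of useful work lost while the defective
                      component stays enabled (rollback-only recovery).
    component_rank:   maximum fraction of useful work lost when the
                      component is disabled.
    """
    return lost_useful_work > component_rank

# Illustrative values only.
print(should_disable(lost_useful_work=0.30, component_rank=0.19))  # True
print(should_disable(lost_useful_work=0.10, component_rank=0.19))  # False
```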

slide-25
SLIDE 25

KEY FINDINGS

  • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates.
  • Permanent shutdown of the defective unit results in a slight improvement in performance compared to temporary shutdown.

[1] Eurosys, 2011.
[2] Tools of MME, 2003.
[3] DSN, 2005.