
4/1/2014

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

High-Level Fault Tolerance: Checkpointing and Recovery
Introductory material

Overview

  • Introduction and basic concept
  • Fault model and fault coverage
  • Checkpointing and backward error recovery (rollback)
    – General principles
    – Uniprocessor systems
  • Distributed systems
  • Cost, overhead, latency issues
  • Summary

Introduction

  • References
    – Text Chapter 6
    – [Prad:96] Chapter 3 – sections on rollback and reconfiguration

Introduction (contd.)

  • Somewhat higher level than ECC and watchdog; uses re-execution as the basic recovery strategy
  • It is a hardware-assisted software method in practice
  • Basic concept: save a fault-free state of the system and, if and when an error is detected, reload the fault-free state and re-execute
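
The basic concept above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state is a plain dict, error detection is modelled by a hypothetical TransientFault exception, and the names run_with_rollback and make_flaky_increment are invented for this example rather than taken from the course material.

```python
import copy


class TransientFault(Exception):
    """Stands in for a (hypothetical) error detection mechanism."""


def run_with_rollback(state, steps, max_retries=3):
    """Run each step; on a detected error, reload the checkpoint and re-execute."""
    for step in steps:
        checkpoint = copy.deepcopy(state)        # save the fault-free state
        for _ in range(max_retries + 1):
            try:
                step(state)                      # may corrupt state, then raise
                break                            # no error detected: move on
            except TransientFault:
                state.clear()                    # discard the damaged state
                state.update(copy.deepcopy(checkpoint))  # roll back
        else:
            # every attempt failed, so re-execution alone cannot remove the error
            raise RuntimeError("error persists after re-execution")
    return state


def make_flaky_increment(fail_times):
    """Build a step that corrupts the state and faults `fail_times` times."""
    remaining = [fail_times]

    def step(state):
        state["x"] += 1
        if remaining[0] > 0:
            remaining[0] -= 1
            state["x"] = -999                    # simulated corruption
            raise TransientFault()

    return step


if __name__ == "__main__":
    s = {"x": 0}
    run_with_rollback(s, [make_flaky_increment(1), make_flaky_increment(0)])
    print(s)
```

The retry limit reflects the assumption that the fault is transient: if the error keeps reappearing, rollback is abandoned and the caller has to use another method.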

Introduction - Basic Concept (contd.)

  • Three phases of recovery
    – Error detection
    – Damage assessment
    – Recovery – error elimination and arrival at the point where error was detected
      • Often entails re-starting fresh on a system presumably fault free
  • Backward error recovery
    – Current process is rolled back to some error-free point and re-executes
    – Trivial solution – start afresh from the beginning of the program

Fault model and fault coverage

  • Possible scenarios
    – Hardware is faulty, software is fault-free
    – Fault detection mechanism exists – in hardware or in software form
    – Hardware fault-free, software is faulty
    – Both hardware and software faulty
  • Assumptions for backward error recovery
    – Reliable error detection mechanism exists
    – Error can be removed by re-execution
    – Process state can be restored to a previous error-free state


Fault model and fault coverage (contd.)

  • Based on the assumptions stated:
    – The method is normally applicable when an error detection mechanism exists, the hardware faults are transient, and there are no software faults
  • Methods to address other fault scenarios are
    – Re-configuration
    – Software fault-tolerance: e.g. recovery block and n-version programming

Checkpointing and Rollback

  • General principles
    – Time redundancy is permissible
    – Transient hardware errors
    – If software errors (design or otherwise): alternative modules exist, or there are timing errors that may be solved during re-execution
    – Reliable error detection mechanism
    – It is feasible to determine checkpoints (system states that need to be saved) in an application
    – Method can apply to redundant as well as nonredundant systems

Checkpointing and Rollback (contd.)

  • General issues: checkpointing & rollback
    – Save system state at regular intervals
      • How often to save - checkpoint interval
      • How much to save - can be as little as PC and status flags (just one instruction) or as much as a log of all messages, the complete program, and associated data values at a given time
      • How long between fault occurrence and its detection (error latency) is tolerable – a large error latency often makes this method less than ideal
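
One way to see why error latency matters: a checkpoint taken after the fault occurred but before it was detected may already be corrupted, so rollback has to skip it. The sketch below assumes a known upper bound on error latency and keeps a short history of timestamped checkpoints. The class CheckpointHistory and its parameters are hypothetical names chosen for this illustration.

```python
import bisect
import copy


class CheckpointHistory:
    """Keep timestamped checkpoints so rollback can skip suspect ones."""

    def __init__(self, keep=4):
        self.keep = keep        # number of checkpoints retained (assumed >= 1)
        self.times = []         # checkpoint times, in ascending order
        self.states = []        # saved states, in the same order

    def save(self, now, state):
        self.times.append(now)
        self.states.append(copy.deepcopy(state))
        # drop the oldest entries beyond the retention limit
        del self.times[:-self.keep]
        del self.states[:-self.keep]

    def rollback_target(self, detected_at, max_latency):
        """Newest checkpoint taken no later than detected_at - max_latency."""
        safe_before = detected_at - max_latency
        i = bisect.bisect_right(self.times, safe_before)
        if i == 0:
            return None         # nothing old enough: restart from the beginning
        return self.times[i - 1], copy.deepcopy(self.states[i - 1])
```

With checkpoints every 10 time units and a latency bound of 5, an error detected at time 34 rolls back to the checkpoint at time 20, not the one at time 30. The larger the latency bound, the more checkpoints must be retained and the more work is redone.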

Checkpointing and Rollback (contd.)

  • General issues: checkpointing & rollback
    – Rollback recovery
      • Where do we go back to: damage assessment
      • Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
      • Restart the computation

Checkpointing and Rollback (contd.)

  • What do we need
    – Error detection mechanism
      • Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
    – Storage for state/data saving
      • Large enough storage – PC, stack, data segments (static and dynamic), information about user and system files that may be open
      • Access time – issue during storing and retrieval
      • Volatility and stability of the storage

Checkpointing and Rollback (contd.)

  • What do we need (contd.)
    – Events
      • Messages and transactions that should be logged and replayed
    – Procedures to handle errors and restart computation
    – What if errors continue to exist? – mechanism to handle this
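
The event logging and restart procedure listed above might look roughly like this. It is a sketch, not a prescribed design: LoggedProcess, DetectedError, and the handler signature are assumptions made for the example, and the bounded retry count is one simple answer to the question of what to do if errors continue to exist.

```python
import copy


class DetectedError(Exception):
    """Stands in for whatever error detection mechanism is available."""


class LoggedProcess:
    """Checkpoint the state and log messages so they can be replayed."""

    def __init__(self, handler, initial_state):
        self.handler = handler                  # handler(state, message) mutates state
        self.state = initial_state
        self.checkpoint = copy.deepcopy(initial_state)
        self.log = []                           # messages received since the checkpoint

    def deliver(self, message):
        self.log.append(message)                # log first, then process
        self.handler(self.state, message)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()                        # earlier messages are no longer needed

    def recover(self, max_attempts=3):
        """Reload the checkpoint and replay the log; give up after max_attempts."""
        for _ in range(max_attempts):
            self.state = copy.deepcopy(self.checkpoint)
            try:
                for message in self.log:
                    self.handler(self.state, message)
                return True
            except DetectedError:
                continue                        # error reappeared during replay
        return False                            # persistent error: escalate
```

A return value of False signals that re-execution did not remove the error, so the caller would need a different method such as reconfiguration.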


Checkpointing: Uniprocessor systems

  • Uniprocess and uniprocessor systems equivalence
  • Simplest scheme
    – Instruction re-execution
      • Hardware (parity, self-checking, duplication) reports error
      • Instruction is re-executed using previous data and state
    – Issues
      • Register file update (commit)
      • Latency, especially in pipeline systems
    – Key is to determine the state to be saved
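
A software model of instruction re-execution can make the commit issue concrete. In this sketch the register file is a dict, error detection is modelled by duplication (the instruction is evaluated twice and the results compared), and the register file is updated only after the check passes. The function names and the bit-flip fault model are assumptions made for illustration and do not describe any particular processor.

```python
def execute_with_retry(regs, instr, max_retries=2):
    """Run `instr` with duplication checking; commit only a checked result.

    `instr(regs_copy)` returns a dict of register updates.
    """
    for _ in range(max_retries + 1):
        first = instr(dict(regs))       # both copies read the same previous state
        second = instr(dict(regs))
        if first == second:             # results agree: no error reported
            regs.update(first)          # commit point: register file is updated
            return True
        # mismatch: nothing was committed, so the instruction can simply be re-executed
    return False


def make_flaky_add(bad_calls):
    """r3 = r1 + r2, with one bit flipped on the call numbers in `bad_calls`."""
    calls = [0]

    def add(r):
        calls[0] += 1
        result = r["r1"] + r["r2"]
        if calls[0] in bad_calls:
            result ^= 1                 # simulated transient fault
        return {"r3": result}

    return add
```

Because the update is deferred until the check passes, the state that must be saved for retry is just the instruction's inputs. Duplication of this kind cannot detect a fault that affects both copies identically.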

Checkpointing: Uniprocessor systems (contd.)

  • Process control systems
    – Program that monitors a process behaves in a predetermined manner – known control flow and typically periodic
    – Define checkpoints statically

Checkpointing: Uniprocessor systems (contd.)

  • Process control systems (contd.)
    – Typical objectives
      • Recovery possible in a given time
      • Minimize the total number of checkpoints
      • Methods of this nature were studied in the 1960s

Checkpointing: Uniprocessor systems (contd.)

  • General purpose systems
    – How much information to save
      • System state consisting of register file, PC, stack, etc.
      • Data?
        – All of it? Can be prohibitive (space and time)
        – So? – Only that data which is modified after the last checkpoint
        – How do we do this efficiently? – Caches provide a nice boundary to achieve this
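
Saving only the data modified since the last checkpoint can be illustrated with an undo log. This is a software analogy and not the cache-based scheme itself: roughly, the undo entries play the role of the old values behind modified cache lines. IncrementalStore is a hypothetical name, and values are assumed to be replaced wholesale rather than mutated in place.

```python
_MISSING = object()   # marks keys that did not exist at the last checkpoint


class IncrementalStore:
    """Key-value store that records only what changed since the last checkpoint."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.undo = {}                  # key -> value it had at the last checkpoint

    def write(self, key, value):
        if key not in self.undo:        # first write to this key since the checkpoint
            self.undo[key] = self.data.get(key, _MISSING)
        self.data[key] = value

    def checkpoint(self):
        self.undo.clear()               # current contents become the recovery point

    def rollback(self):
        for key, old in self.undo.items():
            if old is _MISSING:
                del self.data[key]      # key was created after the checkpoint
            else:
                self.data[key] = old    # restore the checkpointed value
        self.undo.clear()
```

The cost of a checkpoint and of a rollback is proportional to the number of items written since the last checkpoint, not to the total size of the data.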

Summary

  • Discussed checkpointing – classical studies