
ECE 753: Fault-Tolerant Computing - PDF document

4/1/2014. ECE 753: Fault-Tolerant Computing, Kewal K. Saluja. Overview: introduction and basic concept; fault model and fault coverage; checkpointing and backward error recovery (rollback), general principles.


  1. 4/1/2014

     ECE 753: FAULT-TOLERANT COMPUTING
     Kewal K. Saluja
     Department of Electrical and Computer Engineering
     HIGH Level Fault-Tolerance: Checkpointing and recovery
     Introductory material

     Overview
     • Introduction and basic concept
     • Fault model and fault coverage
     • Checkpointing and backward error recovery (rollback)
       – General principles
       – Uniprocessor systems
     • Summary
     • Cost, Overhead, Latency issues
     • Distributed Systems

     Introduction
     • References
       – Text Chapter 6
       – [Prad:96] Chapter 3 – sections on rollback and reconfiguration

     Introduction (contd.)
     • Somewhat higher level than ECC and watchdog, uses re-execution as basic recovery strategy
     • It is a hardware assisted software method in practice
     • Basic concept: save fault-free state of the system and if and when an error is detected, reload the fault-free state and re-execute

     Introduction - Basic Concept (contd.)
     • Three phases of recovery
       – Error detection
       – Damage assessment
       – Recovery – error elimination and arrival at the point where error was detected
         • often entails re-starting fresh on a system presumably fault free
     • Backward error recovery
       – Current process is rolled back to some error-free point and re-executes
       – Trivial solution – start afresh from the beginning of the program

     Fault model and fault coverage
     • Possible scenarios
       – Hardware is faulty, software is fault-free
       – Fault detection mechanism exists – in hardware or in software form
       – Hardware fault-free, software is faulty
       – Both hardware and software faulty
     • Assumptions for backward error recovery
       – Reliable error detection mechanism exists
       – Error can be removed by re-execution
       – Process state can be restored to a previous error-free state
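The basic concept stated on these slides (save a fault-free state, reload it when an error is detected, re-execute) can be sketched in a few lines of Python. This is only an illustration under the slides' assumptions of reliable error detection and transient faults. The names `TransientFault`, `run_with_rollback`, `make_flaky_add` and the `max_retries` limit are hypothetical and do not come from the lecture.

```python
import copy


class TransientFault(Exception):
    """Raised by the (assumed reliable) error detection mechanism."""


def run_with_rollback(state, steps, max_retries=3):
    """Run each step; on a detected error, reload the checkpoint and re-execute."""
    for step in steps:
        checkpoint = copy.deepcopy(state)      # save the fault-free state
        for _attempt in range(max_retries + 1):
            try:
                step(state)
                break                          # no error detected, move on
            except TransientFault:
                # Roll back: discard possibly corrupted data, restore the saved state.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
        else:
            # Re-execution did not remove the error (assumption violated).
            raise RuntimeError("error persists after re-execution")
    return state


def make_flaky_add(amount, failures):
    """Hypothetical step that corrupts the state and faults `failures` times."""
    remaining = [failures]

    def step(state):
        if remaining[0] > 0:
            remaining[0] -= 1
            state["total"] = -999              # corruption left behind by the fault
            raise TransientFault()
        state["total"] += amount

    return step


result = run_with_rollback({"total": 0}, [make_flaky_add(5, 1), make_flaky_add(7, 2)])
```

Here a checkpoint is taken before every step, which is the most conservative choice. The retry limit stands in for the case where the "error can be removed by re-execution" assumption does not hold.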

  2. Fault model and fault coverage (contd.)
     • Based on the assumptions stated:
       – Time redundancy is permissible
       – Transient hardware errors
       – If software errors (design or otherwise) alternative modules exist or there are timing errors that may be solved during re-execution
     • Methods to address other fault scenario
       – Reliable error detection mechanism are
       – Re-configuration
       – Software fault-tolerance: e.g. recovery block and n-version programming

     Checkpointing and Rollback
     • General principles
       – The method is normally applicable when: error detection mechanism exists, transient hardware faults, and no-software faults
       – It is feasible to determine checkpoints (system states that need to be saved) in an application
       – Method can apply to redundant as well as nonredundant systems

     Checkpointing and Rollback (contd.)
     • General issues: checkpointing & rollback
       – Save system state at regular interval
         • How often to save - checkpoint interval
         • How much to save - can be as little as PC and status flags, just one instruction or as much as log of all messages, the complete program and associated data values at a given time
         • How long between fault occurrence and its detection (error latency) is tolerable – often large error latency may make this method less than an ideal method

     Checkpointing and Rollback (contd.)
     • General issues: checkpointing & rollback
       – Rollback recovery
         • Where do we go back to: damage assessment
         • Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
         • Restart the computation

     Checkpointing and Rollback (contd.)
     • What do we need
       – Error detection mechanism
         • Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests.
       – Storage for state/data saving
         • Large enough storage – PC, stack, data segments (static and dynamic), information about user and system files that may be open
         • Access time – issue during storing and retrieval
         • Volatility and stability of the storage

     Checkpointing and Rollback (contd.)
     • What do we need (contd.)
       – Events
         • Messages and transactions that should be logged and replayed
       – Procedures to handle errors and restart computation
       – What if errors continue to exist? – mechanism to handle this
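The "what do we need" items (storage for the saved state, a log of events to replay, and a restart procedure) fit together as in the following sketch. It assumes a process whose state changes only in response to messages. The class `LoggedProcess`, its `checkpoint_interval` parameter, and the in-memory `saved` copy standing in for stable storage are all illustrative assumptions, not part of the lecture.

```python
import copy


class LoggedProcess:
    """Hypothetical process whose state changes only by handling messages."""

    def __init__(self, checkpoint_interval):
        self.state = {"balance": 0}
        self.interval = checkpoint_interval     # messages between checkpoints
        self.saved = copy.deepcopy(self.state)  # stands in for stable storage
        self.log = []                           # messages received since the checkpoint

    def handle(self, amount):
        self.state["balance"] += amount

    def receive(self, amount):
        self.log.append(amount)                 # log the event before acting on it
        self.handle(amount)
        if len(self.log) >= self.interval:
            self.checkpoint()

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)
        self.log.clear()                        # older messages are no longer needed

    def recover(self):
        """Roll back to the checkpoint, then replay the logged messages."""
        self.state = copy.deepcopy(self.saved)
        for amount in self.log:
            self.handle(amount)
        return len(self.log)                    # amount of re-execution performed


proc = LoggedProcess(checkpoint_interval=3)
for amount in [10, 20, 30, 40, 50]:
    proc.receive(amount)
proc.state["balance"] = -1                      # simulate corruption by a transient fault
replayed = proc.recover()
```

In this sketch the checkpoint interval bounds the replay work: recovery never re-executes more than `checkpoint_interval - 1` messages, at the price of more frequent state saves. That is the "how often to save" trade-off in miniature.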

  3. Checkpointing: Uniprocessor systems
     • Uniprocess and uniprocessor systems equivalence
     • Simplest scheme
       – Instruction re-execution
         • Hardware (parity, self-checking, duplication) reports error
         • Instruction is re-executed using previous data and state
       – Issues
         • Register file update (commit)
         • Latency, especially in pipeline systems

     Checkpointing: Uniprocessor systems (contd.)
     • Process control systems
       – Program that monitors a process behaves in a predetermined manner – known control flow and typically periodic
       – Define checkpoints statically
       – Key is to determine the state to be saved

     Checkpointing: Uniprocessor systems (contd.)
     • Process control systems (contd.)
       – Typical objectives
         • Recovery possible in a given time
         • Minimize the total number of checkpoints
         • Methods of this nature studied in 60’s

     Checkpointing: Uniprocessor systems (contd.)
     • General purpose systems
       – How much information to save
         • System state consisting of register file, PC, stack, etc.
         • Data?
           – All of it? Can be prohibitive (space and time)
           – So? – Only that data which is modified after the last checkpoint
           – How do we do this efficiently? – Caches provide a nice boundary to achieve this

     Summary
     • Discussed checkpointing classical studies
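The idea of saving "only that data which is modified after the last checkpoint" can be illustrated in software with an undo log that records a location's old value the first time it is written in an interval. This is a software analogy only, not the cache-based mechanism the slide alludes to. The class `IncrementalStore` and its methods are hypothetical, and the sketch assumes stored values are replaced rather than mutated in place.

```python
class IncrementalStore:
    """Hypothetical key-value memory that saves only data modified since the last checkpoint."""

    _ABSENT = object()                          # marks keys that did not exist at the checkpoint

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.undo = {}                          # key -> value it had at the last checkpoint

    def write(self, key, value):
        if key not in self.undo:                # first write since the checkpoint
            self.undo[key] = self.data.get(key, self._ABSENT)
        self.data[key] = value

    def checkpoint(self):
        saved = len(self.undo)                  # number of entries this interval had to save
        self.undo.clear()                       # current contents become the recovery point
        return saved

    def rollback(self):
        for key, old in self.undo.items():
            if old is self._ABSENT:
                del self.data[key]              # key was created after the checkpoint
            else:
                self.data[key] = old
        self.undo.clear()


store = IncrementalStore({"a": 1, "b": 2, "c": 3})
store.write("a", 10)
store.write("a", 11)                            # second write to "a" saves nothing new
store.write("d", 4)
store.rollback()
after_rollback = dict(store.data)
store.write("b", 20)
saved = store.checkpoint()
```

The cost of a checkpoint interval here scales with the number of distinct locations written, not with the total size of the data, which is the point of the slide's question "How do we do this efficiently?".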
