
4/1/2014

ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
High Level Fault-Tolerance: Checkpointing and Recovery
Introductory material

Overview
• Introduction and basic concept
• Fault model and fault coverage
• Checkpointing and backward error recovery (rollback)
  – General principles
  – Uniprocessor systems
• Summary
• Cost, Overhead, Latency issues
• Distributed Systems

Introduction
• References
  – Text Chapter 6
  – [Prad:96] Chapter 3 – sections on rollback and reconfiguration

Introduction (contd.)
• Somewhat higher level than ECC and watchdog, uses re-execution as basic recovery strategy
• It is a hardware assisted software method in practice
• Basic concept: save fault-free state of the system and if and when an error is detected, reload the fault-free state and re-execute

Introduction - Basic Concept (contd.)
• Three phases of recovery
  – Error detection
  – Damage assessment
  – Recovery – error elimination and arrival at the point where error was detected
    • often entails re-starting fresh on a system presumably fault free
• Backward error recovery
  – Current process is rolled back to some error-free point and re-executes
  – Trivial solution – start afresh from the beginning of the program

Fault model and fault coverage
• Possible scenarios
  – Hardware is faulty, software is fault-free
  – Fault detection mechanism exists – in hardware or in software form
  – Hardware fault-free, software is faulty
  – Both hardware and software faulty
• Assumptions for backward error recovery
  – Reliable error detection mechanism exists
  – Error can be removed by re-execution
  – Process state can be restored to a previous error-free state
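The basic concept and the three assumptions for backward error recovery can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the names TransientFault, run_with_rollback, and make_flaky_add are hypothetical, the system state is a plain dict, and a raised exception stands in for the reliable error detection mechanism the slides assume.

```python
import copy


class TransientFault(Exception):
    """Stands in for an error reported by a reliable detection mechanism."""


def run_with_rollback(state, steps, max_retries=3):
    """Run each step; on a detected error, reload the saved state and re-execute."""
    for step in steps:
        checkpoint = copy.deepcopy(state)          # save the fault-free state
        for _attempt in range(max_retries + 1):
            try:
                step(state)                        # may corrupt state before failing
                break                              # no error detected: go to next step
            except TransientFault:
                state.clear()                      # rollback: discard damaged state
                state.update(copy.deepcopy(checkpoint))
        else:
            # Re-execution did not remove the error, so the fault is not transient.
            raise RuntimeError("error persists after re-execution")
    return state


def make_flaky_add(amount, failures):
    """Hypothetical step: corrupts the total and raises `failures` times, then succeeds."""
    remaining = [failures]

    def step(state):
        if remaining[0] > 0:
            remaining[0] -= 1
            state["total"] = -999                  # damage done before detection
            raise TransientFault("simulated parity error")
        state["total"] += amount

    return step


result = run_with_rollback({"total": 0}, [make_flaky_add(5, 1), make_flaky_add(7, 0)])
print(result)  # {'total': 12}
```

The retry limit reflects the assumption that the error can be removed by re-execution: when it cannot, the sketch gives up instead of looping, which is where methods for other fault scenarios would have to take over.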

Fault model and fault coverage (contd.)
• Based on the assumptions stated:
  – The method is normally applicable when: error detection mechanism exists, transient hardware faults, and no software faults
• Methods to address other fault scenarios
  – Reliable error detection mechanisms
  – Re-configuration
  – Software fault-tolerance: e.g. recovery block and n-version programming

Checkpointing and Rollback
• General principles
  – Time redundancy is permissible
  – Transient hardware errors
  – If software errors (design or otherwise), alternative modules exist or there are timing errors that may be solved during re-execution
  – It is feasible to determine checkpoints (system states that need to be saved) in an application
  – Method can apply to redundant as well as nonredundant systems

Checkpointing and Rollback (contd.)
• General issues: checkpointing & rollback
  – Save system state at regular interval
    • How often to save - checkpoint interval
    • How much to save - can be as little as PC and status flags, just one instruction, or as much as log of all messages, the complete program and associated data values at a given time
    • How long between fault occurrence and its detection (error latency) is tolerable – often large error latency may make this method less than an ideal method

Checkpointing and Rollback (contd.)
• General issues: checkpointing & rollback
  – Rollback recovery
    • Where do we go back to: damage assessment
    • Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
    • Restart the computation

Checkpointing and Rollback (contd.)
• What do we need
  – Error detection mechanism
    • Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
  – Storage for state/data saving
    • Large enough storage – PC, stack, data segments (static and dynamic), information about user and system files that may be open
    • Access time – issue during storing and retrieval
    • Volatility and stability of the storage

Checkpointing and Rollback (contd.)
• What do we need (contd.)
  – Events
    • Messages and transactions that should be logged and replayed
  – Procedures to handle errors and restart computation
  – What if errors continue to exist? – mechanism to handle this
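The interplay between the checkpoint interval, the saved state, and the events that must be logged and replayed can be sketched as follows. This is an illustrative toy under stated assumptions: LoggedProcess and its methods are hypothetical names, in-memory copies stand in for stable storage, message handling is deterministic, and the log itself is assumed to be uncorrupted.

```python
import copy


class LoggedProcess:
    """Toy process: state is checkpointed periodically; later messages are logged."""

    def __init__(self, checkpoint_interval=3):
        self.state = {"balance": 0}
        self.checkpoint_interval = checkpoint_interval
        self._checkpoint = copy.deepcopy(self.state)   # stand-in for stable storage
        self._log = []                                 # messages since last checkpoint

    def handle(self, amount):
        """Apply one message (a deposit or withdrawal) and log it for replay."""
        self._log.append(amount)
        self.state["balance"] += amount
        if len(self._log) >= self.checkpoint_interval:
            self.take_checkpoint()

    def take_checkpoint(self):
        """Save the current state; messages before this point need no replay."""
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        """Rollback: reload the saved state, then replay the logged messages."""
        self.state = copy.deepcopy(self._checkpoint)
        for amount in self._log:
            self.state["balance"] += amount


p = LoggedProcess(checkpoint_interval=3)
for msg in [10, 20, 30, 5, -2]:
    p.handle(msg)
p.state["balance"] = 123456      # simulate corruption that is then detected
p.recover()
print(p.state)  # {'balance': 63}
```

A shorter interval means more time spent saving state but fewer messages to replay after an error; a longer interval shifts the cost the other way.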

Checkpointing: Uniprocessor systems
• Uniprocess and uniprocessor systems equivalence
• Simplest scheme
  – Instruction re-execution
    • Hardware (parity, self-checking, duplication) reports error
    • Instruction is re-executed using previous data and state
  – Issues
    • Register file update (commit)
    • Latency, especially in pipeline systems
  – Key is to determine the state to be saved

Checkpointing: Uniprocessor systems (contd.)
• Process control systems
  – Program that monitors a process behaves in a predetermined manner – known control flow and typically periodic
  – Define checkpoints statically

Checkpointing: Uniprocessor systems (contd.)
• Process control systems (contd.)
  – Typical objectives
    • Recovery possible in a given time
    • Minimize the total number of checkpoints
• Methods of this nature studied in the 1960s

Checkpointing: Uniprocessor systems (contd.)
• General purpose systems
  – How much information to save
    • System state consisting of register file, PC, stack, etc.
    • Data?
  – All of it? Can be prohibitive (space and time)
  – So?
  – Only that data which is modified after the last checkpoint
  – How do we do this efficiently?
  – Caches provide a nice boundary to achieve this

Summary
• Discussed checkpointing classical studies
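The general purpose case, where only the data modified after the last checkpoint is tracked, can be illustrated with a software analogy of a write-back buffer. This is one possible reading of the remark that caches provide a boundary, not the hardware mechanism itself: WriteBackStore and its methods are hypothetical names, and the flush at checkpoint time is assumed to complete without failure.

```python
class WriteBackStore:
    """Key-value memory where writes stay in a dirty buffer until a checkpoint."""

    def __init__(self, data=None):
        self._memory = dict(data or {})   # always holds the last checkpoint state
        self._dirty = {}                  # entries modified since the last checkpoint

    def read(self, key):
        """Return the newest value: the dirty copy if present, else the saved one."""
        return self._dirty[key] if key in self._dirty else self._memory[key]

    def write(self, key, value):
        """Modify an entry without touching the checkpoint state."""
        self._dirty[key] = value

    def checkpoint(self):
        """Flush only the modified entries; return how many were written back."""
        flushed = len(self._dirty)
        self._memory.update(self._dirty)
        self._dirty.clear()
        return flushed

    def rollback(self):
        """Discard modified entries; the checkpoint state is already intact."""
        self._dirty.clear()


store = WriteBackStore({"a": 1, "b": 2, "c": 3})
store.write("a", 10)
flushed = store.checkpoint()      # only "a" is written back, not all three entries
store.write("b", 99)              # modification made after the checkpoint
store.rollback()                  # error detected: drop the uncommitted write
print(flushed, store.read("a"), store.read("b"))  # 1 10 2
```

In this sketch the cost of a checkpoint is proportional to the number of modified entries rather than to the total amount of data, and rollback needs no copying at all.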
