Fault-tolerant techniques


EDA421/DIT171 - Parallel and Distributed Real-Time Systems, Chalmers/GU, 2011/2012, Lecture #14, updated May 2, 2012


  1. Fault-tolerant techniques: What causes component faults?
     • Specification or design faults:
       – Incomplete or erroneous models
       – Lack of techniques for formal checking
     • Component defects:
       – Manufacturing effects (in hardware or software)
       – Wear and tear due to component use
     • Environmental effects:
       – High stress (temperature, G-forces, vibrations)
       – Electromagnetic or elementary-particle radiation

     Fault-tolerant techniques: What are the effects if the hardware or software is not fault-free in a real-time system?

     Fault-tolerant techniques: What types of (hardware) faults are there?
     • Permanent faults:
       – Total failure of a component
       – Caused by, for example, short-circuits or melt-down
       – Remains until component is repaired or replaced
     • Transient faults:
       – Temporary malfunctions of a component
       – Caused by magnetic or ionizing radiation, or power fluctuation
     • Intermittent faults:
       – Repeated occurrences of transient faults
       – Caused by, for example, loose wires

     Fault-tolerant techniques: What types of (software) faults are there?
     • Permanent faults:
       – Total failure of a component
       – Caused by, for example, corrupted data structures
       – Remains until component is repaired or replaced
     • Transient faults:
       – Temporary malfunctions of a component
       – Caused by data-dependent bugs in the program code
     • Intermittent faults:
       – Repeated occurrences of transient faults
       – Caused by, for example, dangling-pointer problems

  2. Fault-tolerant techniques: How are faults handled at run-time?
     • Error detection:
       – Erroneous data or program behavior is detected
       – Watchdog mechanism, comparisons, diagnostic tests
     • Error correction:
       – The originally-intended data/behavior is restored
       – Intelligent codes used for restoring corrupt data
       – Check-pointing used for restoring corrupt program flow
     • Fault masking:
       – Effects of erroneous data or program behavior are "hidden"
       – Voting mechanism

     Fault-tolerant techniques: How are errors detected?
     • Watchdog mechanism:
       – A monitor looks for signs that hardware or software is faulty
       – For example: time-outs, signature checking, or checksums
     • Comparisons:
       – The outputs of redundant components are compared
       – A "golden run" of intended behavior can be available
     • Diagnostic tests:
       – Tests on hardware or software are (transparently) executed as part of the schedule

     Fault-tolerant techniques: How is fault-tolerance obtained?
     • Hardware redundancy:
       – Additional hardware components are used
     • Software redundancy:
       – Different application software versions are used
     • Time redundancy:
       – Schedule contains ample slack so tasks can be re-executed
     • Information redundancy:
       – Data is coded so that errors can be detected and/or corrected

     Fault-tolerant techniques: Hardware redundancy
     • Voting mechanism:
       – Majority voter (largest group must have majority of values)
       – k-plurality voter (largest group must have at least k values)
       – Median voter
     • N-modular redundancy (NMR):
       – 2m+1 units are needed to mask the effects of m faults
       – One or more voters can be used in parallel
     This technique is very expensive, which means that it is only justified in the most critical applications.
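The three voter types and the 2m+1 rule for N-modular redundancy can be illustrated with a minimal Python sketch. The function names (`majority_vote`, `k_plurality_vote`, `median_vote`, `units_needed`) are hypothetical and not taken from the lecture, and the treatment of ties in the k-plurality voter (no decision) is an assumption.

```python
from collections import Counter
from statistics import median_low


def majority_vote(values):
    """Return the value held by a strict majority of the replicas, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def k_plurality_vote(values, k):
    """Return the value of the largest group if it has at least k members.

    Assumption: if two groups tie for largest, no decision is made (None).
    """
    ranked = Counter(values).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no unique largest group
    value, count = ranked[0]
    return value if count >= k else None


def median_vote(values):
    """Return the (low) median of the replica outputs."""
    return median_low(values)


def units_needed(m):
    """Number of units an NMR system needs to mask m faulty units."""
    return 2 * m + 1


# Triple modular redundancy: one faulty replica (9) is outvoted.
print(majority_vote([7, 7, 9]))      # 7
print(median_vote([10, 11, 250]))    # 11
print(units_needed(1))               # 3
```

The median voter suits numeric sensor values, where an outlier such as 250 is discarded without requiring two replicas to agree exactly.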

  3. Fault-tolerant techniques: Software redundancy
     • N-version programming:
       – Different versions of the program are run in parallel
       – Voting is used for fault masking
       – Software development is diversified using different languages and even different software development teams
     • Recovery-block approach:
       – Different versions of the program are used, but only one version is run at a time
       – Acceptance test is used for determining validity of results
     This technique is also very expensive, because of the development of independent program versions.

     Fault-tolerant techniques: Time redundancy (backward error recovery)
     • Retry:
       – The failed instruction is repeated
     • Rollback:
       – Execution is re-started from the beginning of the program
       – Execution is re-started from a checkpoint where sufficient program state has been saved
     This technique does not require additional hardware, which significantly reduces the weight, size, power-consumption and cost of the system.

     Fault-tolerant techniques: Information redundancy (forward error recovery)
     • Duplication:
       – Errors are detected by duplicating each data word
     • Parity encoding:
       – Errors are detected/corrected by keeping the number of ones in the data word odd or even
     • Checksum codes:
       – Errors are detected by adding the data words into sums
     • Cyclic codes:
       – Errors are detected/corrected by interpreting the data bits as coefficients in a polynomial and deriving redundant bits through division by a generator polynomial

     Fault-tolerant scheduling
     To extend real-time computing towards fault-tolerance, the following issues must be considered:
     1. What is the fault model used?
       – What type of fault is assumed?
       – How and when are faults detected?
     2. How should fault-tolerance be implemented?
       – Using temporal redundancy (re-execution)?
       – Using spatial redundancy (replicated tasks/processors)?
     3. What scheduling policy should be used?
       – Extend existing policies (for example, RM or EDF)?
       – Suggest new policies?
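To make the parity and cyclic-code bullets under information redundancy more concrete, here is a minimal Python sketch for error detection only. The function names, the bit-string representation and the generator `1011` (x^3 + x + 1) are assumptions chosen for illustration; they are not taken from the lecture.

```python
def even_parity_bit(data_bits: str) -> str:
    """Return the bit that makes the total number of ones even."""
    return str(data_bits.count("1") % 2)


def _divide(bits, gen, steps):
    """Long division over GF(2): XOR the generator in wherever a 1 leads."""
    for i in range(steps):
        if bits[i]:
            for j, g in enumerate(gen):
                bits[i + j] ^= g
    return bits


def crc_remainder(data_bits: str, generator: str) -> str:
    """Redundant bits: remainder of data * x^r divided by the generator."""
    r = len(generator) - 1
    bits = [int(b) for b in data_bits] + [0] * r  # append r zero bits
    gen = [int(b) for b in generator]
    bits = _divide(bits, gen, len(data_bits))
    return "".join(str(b) for b in bits[-r:])


def crc_check(codeword: str, generator: str) -> bool:
    """A received codeword is accepted if it divides with zero remainder."""
    r = len(generator) - 1
    bits = [int(b) for b in codeword]
    gen = [int(b) for b in generator]
    bits = _divide(bits, gen, len(codeword) - r)
    return not any(bits[-r:])


data = "11010011101100"
generator = "1011"  # assumed example generator, x^3 + x + 1
check = crc_remainder(data, generator)
codeword = data + check
print(check)                           # 100
print(crc_check(codeword, generator))  # True

# Flip one bit to emulate a transient fault; the check now fails.
corrupted = codeword[:5] + ("1" if codeword[5] == "0" else "0") + codeword[6:]
print(crc_check(corrupted, generator))  # False
```

The sender appends the remainder to the data word; the receiver repeats the division and treats any non-zero remainder as a detected error.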

  4. Fault-tolerant scheduling: What fault model is used?
     Type of fault:
       – Transient, intermittent and/or permanent faults
       – For transient/intermittent faults: is there a minimum interarrival time between two subsequent faults?
     Error detection:
       – Voting (after task execution)
       – Checksums or signature checking (during task execution)
       – Watchdogs or diagnostic testing (during task execution)
     Note: the fault model assumed is a key part of the method used for validating the system. If the true system behavior differs from the assumed behavior, any guarantees we have made may not be correct!

     Fault-tolerant scheduling: How is fault-tolerance implemented?
     Temporal redundancy:
       – Tasks are re-executed to provide replicas for voting decisions
       – Tasks are re-executed to recover from a fault
       – Re-execution may be from beginning or from check-point
       – Re-executed task may be original or simplified version
     Spatial redundancy:
       – Replicas of tasks are distributed on multiple processors
       – Identical or different implementations of tasks
       – Voting decisions are made to detect errors or mask faults
     Note: the choice of fault-tolerance mechanism should be made in conjunction with the choice of scheduling policy.

     Fault-tolerant scheduling: What do existing scheduling policies offer?
     Static scheduling:
       – Simple to implement (unfortunately, supported by very few commercial real-time operating systems)
       – High observability (facilitates monitoring, testing & debugging)
       – Natural points in time for self-check & synchronization (facilitates implementation of task redundancy)
     Dynamic scheduling:
       – RM is simple to implement (supported by most commercial real-time operating systems)
       – RM and EDF are optimal scheduling policies (RM among static-priority policies, EDF among dynamic-priority policies, on a uniprocessor)
       – RM and EDF come with a solid analysis framework

     Fault-tolerant scheduling: How do we extend existing techniques to FT?
     Uniprocessor scheduling:
       – Use RM, DM or EDF and use any surplus capacity (slack) to re-execute tasks that experience errors during their execution.
       – The slack is reserved a priori and can be accounted for in a schedulability test. This allows for performance guarantees (under the assumed fault model).
       – Or: re-executions can be modeled as aperiodic tasks. The slack is then extracted dynamically at run-time by dedicated aperiodic servers. This allows for statistical guarantees.
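One way slack for re-execution might be accounted for in a schedulability test is a response-time analysis with an added fault term. The sketch below is an illustration under an assumed fault model, not the lecture's own test: faults are at least `fault_gap` time units apart, every fault forces a full re-execution of the affected task, tasks are listed in priority order (for example RM order), deadlines equal periods, and all times are integers. All names are hypothetical.

```python
from math import ceil


def response_time(tasks, i, fault_gap):
    """Worst-case response time of task i, or None if it misses its deadline.

    tasks: list of (wcet, period) tuples, highest priority first.
    Assumed model: each fault costs one re-execution of the longest task
    at priority i or higher, and faults are at least fault_gap apart.
    """
    wcet, period = tasks[i]
    recovery = max(c for c, _ in tasks[: i + 1])  # worst re-execution cost
    r = wcet
    while True:
        # Interference from higher-priority tasks released within r.
        interference = sum(ceil(r / t) * c for c, t in tasks[:i])
        # Re-executions caused by faults that can occur within r.
        fault_cost = ceil(r / fault_gap) * recovery
        nxt = wcet + interference + fault_cost
        if nxt > period:
            return None  # deadline (= period) would be missed
        if nxt == r:
            return r     # fixed point reached
        r = nxt


def schedulable(tasks, fault_gap):
    """True if every task meets its deadline under the assumed fault model."""
    return all(response_time(tasks, i, fault_gap) is not None
               for i in range(len(tasks)))


tasks = [(1, 5), (2, 10), (3, 20)]  # (wcet, period), RM priority order
print([response_time(tasks, i, 20) for i in range(3)])  # [2, 5, 10]
print(schedulable(tasks, 20))  # True
print(schedulable(tasks, 4))   # False: faults too frequent for the slack
```

The guarantee holds only if the real fault interarrival time is no shorter than `fault_gap`, which reflects the earlier note that guarantees depend on the assumed fault model.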
