EPFL, Spring 2017
10 Dependable Architectures
The material of this course was initially created by Prof. Dr. H. Kirrmann and adapted by Dr. Y-A. Pignolet & J-C. Tournier

Fault - Error - Failure
Fault: defect in the system (bug)
Error: difference between intended and actual behavior
Failure: not satisfying the specification, i.e. the system doesn't perform its required function
Chain: a fault may cause an error (internal), and an error may cause a failure (external).
Fault examples: SW bug, stuck bit, loose connector, …
Error examples: missing values, measured value ≠ real value, …
Industrial Automation | 2017

Fault Tolerance Mechanisms
Error 1: Identify and record the cause(s) of error(s), location/type; concurrent or pre-emptive detection
Error 2: Fault isolation, reconfiguration (online repair), passivation
Error 3: Recovery: transform from a state with errors into a state without errors (forward, backward recovery)
Error 4: Fault masking, compensation, error correction: deliver the required service in the presence of faults

Main dependable computer architectures
a) Integer: "rather nothing than wrong" (fail-silent, fail-stop, "fail-safe"). A processor with diagnostics (D) between inputs and outputs, with an off-switch on the outputs.
b) Persistent: "rather wrong than nothing" ("fail-operate"). An on-line and a workby processor, each with diagnostics (D), and fail-over logic selecting the output.
c) Integer & persistent: error masking, massive redundancy. Three processors feeding a 2/3 voter.
Structure designations: 1oo1d (1oo2d) for a) and b), 2oo3v for c).
Exercise: Compute the reliability and availability of all architectures, without and with repairs.
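As a starting point for the exercise, here is a minimal sketch of the no-repair reliability only. It assumes independent, identical units with a constant failure rate (exponential lifetime), and a perfect voter, comparator, fail-over logic and diagnostics; none of these assumptions is stated on the slide, and all function names are hypothetical. Availability with repair is not covered.

```python
import math

def unit_reliability(failure_rate: float, hours: float) -> float:
    """R(t) = exp(-lambda * t) for one unit (assumed constant failure rate)."""
    return math.exp(-failure_rate * hours)

def r_1oo1(r: float) -> float:
    """Single unit: works as long as that unit works."""
    return r

def r_2oo2(r: float) -> float:
    """Duplication and comparison: both units must work to deliver output."""
    return r * r

def r_1oo2(r: float) -> float:
    """Fail-over pair: fails only if both units fail (assumes perfect coverage)."""
    return 1.0 - (1.0 - r) ** 2

def r_2oo3(r: float) -> float:
    """2-out-of-3 voting: all three work, or exactly two work."""
    return r ** 3 + 3.0 * r ** 2 * (1.0 - r)

if __name__ == "__main__":
    r = unit_reliability(failure_rate=1e-4, hours=1000.0)  # illustrative numbers
    for name, f in [("1oo1", r_1oo1), ("2oo2", r_2oo2),
                    ("1oo2", r_1oo2), ("2oo3", r_2oo3)]:
        print(name, round(f(r), 6))
```

Under these assumptions the 2oo3 structure is more reliable than a single unit only while the unit reliability is above 0.5; below that, the extra units hurt more than they help.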

10.1 Error Detection and Fail-Silent
10.1 Error detection and fail-silent computers
 - check redundancy
 - duplication and comparison
10.2 Fault-Tolerant Structures
10.3 Issues in Workby operation
 - Input Processing
 - Synchronization
 - Output Processing
10.4 Issues in Standby Implementation
 - Standby Redundancy Structures
 - Checkpointing
 - Recovery
10.5 Examples of Dependable Architectures
 - ABB dual controller
 - Boeing 777 Primary Flight Control
 - Space Shuttle PASS Computer

Error Detection: Classification
• Error detection is the base of "safe" computing ("fail-silent") -> disable outputs if error detected
• Error detection is the base of fault-tolerant computing ("fail-operate") -> switchover if error detected, passivate faulty unit.
Key factors:
• "Hamming distance": how many simultaneous errors can be detected
• coverage (recouvrement, Deckungsgrad): probability that an error is discovered within useful time (definition of "useful time": before any damage occurs, before automatic shutdown, …)
• latency (latence, Latenz): time between occurrence and detection of an error

Error Detection: Classification
Errors can be detected (in order of increasing latency):
- on-line (while the specified function is performed), by continuous monitoring/supervision
- off-line (in a time period when the unit is not used for its specified function), by periodic testing
- during periodic maintenance (when the unit is tested and calibrated), by thorough testing, uncovering lurking errors

Error detection
The correctness of a result can be checked by:
- relative tests (comparison tests): by comparing several results of redundant units or computations (not necessarily identical). Pessimistic, i.e. differences due to (allowed) indeterminism count as errors. High coverage, high cost.
- absolute tests (acceptance tests): by checking the result against an a priori consistency condition (plausibility check). Optimistic, i.e. even if the result is consistent it may not be correct (but can catch some design errors).

Error Detection: Possibilities
On-line, relative test: duplication and comparison (either hardware duplication or time redundancy); triplication and voting
On-line, absolute test: watchdog (time-out); control flow checking; error-detecting code (CRC, etc.); illegal address checking
Off-line, relative test: comparison with precomputed test result (fixed inputs), e.g. memory test
Off-line, absolute test: check of program version; check of watchdog function; check code for program code
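Triplication and voting can be sketched as a majority vote over three redundant results. This is only an illustration under the assumption that the three results are exactly comparable values; the function name and the return convention are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple, Optional

def vote_2oo3(results: Sequence[Hashable]) -> Tuple[bool, Optional[Hashable]]:
    """Return (True, value) if at least two of three results agree,
    otherwise (False, None) to signal that no majority exists."""
    if len(results) != 3:
        raise ValueError("a 2oo3 voter needs exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return True, value
    return False, None

if __name__ == "__main__":
    print(vote_2oo3([42, 42, 41]))  # one faulty unit is masked
    print(vote_2oo3([1, 2, 3]))     # no majority
```

A single deviating unit is masked; the voter itself remains a single point of failure in this sketch.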

Detection of Errors Caused by Physical Faults
Depends on the type of component, its error rate and its complexity.
Component | Error characteristics | Typical error detection
Data transmission lines | medium to high error rate, memoryless | parity, CRC, watchdog
Regular memory elements | medium error rate, large storage | parity, Hamming codes, EDC, CRC on disk
Processors and controllers | low error rate, high complexity | duplication and comparison, coded logic
Auxiliary elements (hard disk, ventilation) | high error rate, high diversity | mechanical integrity, voltage supervision, watchdogs, ...

Watchdog Processor (absolute test)
[Diagram: the application processor runs a cyclic application that resets the watchdog processor every k ms; if the time since the last reset exceeds k ms, the watchdog inhibits a trusted switch on the supply voltage.]
The application processor periodically resets the watchdog timer. If it fails to do so, the watchdog processor will shut down and restart the processor.
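The time-out logic can be sketched as follows. This is a simplified software model with simulated time passed in as a parameter, not a description of real watchdog hardware; the class and method names are hypothetical.

```python
class Watchdog:
    """Trips when the application fails to reset it within timeout_ms."""

    def __init__(self, timeout_ms: int) -> None:
        self.timeout_ms = timeout_ms
        self.last_reset_ms = 0
        self.tripped = False  # once tripped, the output stays inhibited

    def reset(self, now_ms: int) -> None:
        """Called cyclically by the application processor."""
        if not self.tripped:
            self.last_reset_ms = now_ms

    def check(self, now_ms: int) -> bool:
        """Called by the watchdog side; returns True if the output is inhibited."""
        if now_ms - self.last_reset_ms > self.timeout_ms:
            self.tripped = True
        return self.tripped

if __name__ == "__main__":
    wd = Watchdog(timeout_ms=10)
    wd.reset(now_ms=5)
    print(wd.check(now_ms=14))  # within the time-out
    print(wd.check(now_ms=16))  # reset missed, watchdog trips
```

The trip is latched in this sketch: a late reset does not re-enable the output, which mirrors the idea that recovery needs an explicit restart.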

Duplication and Comparison (relative test)
[Diagram: a safe input goes through a spreader to a worker and a checker, which share a clock and are synchronized; a comparator controls a switch that gates the fail-silent output.]
Advantage: high coverage, short latency
Problem: non-determinism. Digital computers are made of analogue elements with variable delays, thresholds, asynchronous clocks...
The safety-relevant parts (comparator and switch) are useless if not regularly checked.
Conditions:
- worker and checker are identical and deterministic
- inputs are (made) identical and synchronized (interrupts!)
- output must be synchronized to allow comparison
Variant: the checker only checks the plausibility of the results (requires definition of what is forbidden)
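A minimal sketch of the worker/checker comparison, assuming both units are deterministic functions of the same input and that the comparator and switch are themselves fault-free. The function names and the example control law are hypothetical.

```python
from typing import Callable, Optional, Tuple

def duplicate_and_compare(
    worker: Callable[[int], int],
    checker: Callable[[int], int],
    value: int,
) -> Tuple[bool, Optional[int]]:
    """Feed the same input to both units; release the output only if they agree."""
    worker_result = worker(value)
    checker_result = checker(value)
    if worker_result == checker_result:
        return True, worker_result
    return False, None  # fail-silent: no output is delivered

def control_law(x: int) -> int:
    """Hypothetical computation performed by a healthy unit."""
    return 2 * x + 1

def faulty_control_law(x: int) -> int:
    """Same computation with bit 2 of the result flipped (simulated fault)."""
    return control_law(x) ^ 0b100

if __name__ == "__main__":
    print(duplicate_and_compare(control_law, control_law, 3))
    print(duplicate_and_compare(control_law, faulty_control_law, 3))
```

The comparison cannot tell which unit is wrong, which is why this structure yields a fail-silent rather than a fail-operate behavior.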

Error detection method by coding (absolute test)
This method is used in network and storage, where error patterns are simple. It consists in adding a code (parity, checksum, cyclic redundancy check, …) to the useful data that guarantees its integrity.
[Diagram: k data bits + r check bits = n-bit code word]
Coding is more efficient than duplication and comparison.
Coding has also been applied to processing elements, but complexity can be large. For each operation, a corresponding operation on the check bits has to be done.
[Diagram: operands A, B and result C (value), each with its check bits A', B', C' (code)]
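As an illustration of appending check bits to data, the sketch below uses the CRC-32 from Python's standard `zlib` module. The frame layout (data followed by a 4-byte big-endian CRC) and the function names are assumptions made for this example.

```python
import zlib

CRC_BYTES = 4  # CRC-32 yields r = 32 check bits

def encode(data: bytes) -> bytes:
    """Append the CRC-32 of the data as check bits."""
    return data + zlib.crc32(data).to_bytes(CRC_BYTES, "big")

def check(frame: bytes) -> bool:
    """Recompute the CRC over the data part and compare with the received one."""
    if len(frame) < CRC_BYTES:
        return False
    data, received = frame[:-CRC_BYTES], frame[-CRC_BYTES:]
    return zlib.crc32(data).to_bytes(CRC_BYTES, "big") == received

if __name__ == "__main__":
    frame = encode(b"valve=open")
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit
    print(check(frame), check(corrupted))
```

The overhead is 4 bytes regardless of the data length, which illustrates why coding is cheaper than duplicating the data.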

Error detection by predicates (absolute check)
Results of computation are checked against predicates that must be fulfilled, e.g. the sum of two positive integers is a positive integer.
Plausibility checks require knowledge of the specification:
• e.g. not all traffic lights may be green at the same time
Plausibility may involve different information sources:
• e.g. compare wheel speed with GPS speed
Danger is:
- detection of wrong errors (legal situations not foreseen by the application, e.g. flight altitude below sea level)
- not detecting real errors (the result is wrong, but plausible)
Error coverage is not 100%!
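The two plausibility examples can be sketched as predicates. The function names, the dictionary representation of the lights and the 5 km/h tolerance are hypothetical choices for illustration; a real tolerance would come from the specification.

```python
from typing import Mapping

def lights_plausible(green: Mapping[str, bool]) -> bool:
    """Predicate from the slide: not all traffic lights may be green at once."""
    return not all(green.values())

def speed_plausible(wheel_kmh: float, gps_kmh: float,
                    tolerance_kmh: float = 5.0) -> bool:
    """Cross-check two information sources; tolerance is an assumed value."""
    return abs(wheel_kmh - gps_kmh) <= tolerance_kmh

if __name__ == "__main__":
    print(lights_plausible({"north-south": True, "east-west": False}))
    print(lights_plausible({"north-south": True, "east-west": True}))
    print(speed_plausible(wheel_kmh=80.0, gps_kmh=78.5))
    print(speed_plausible(wheel_kmh=80.0, gps_kmh=20.0))
```

Both dangers named above show up here: a tolerance that is too tight flags legal situations, and a wrong but consistent pair of speeds passes unnoticed.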

Integer processors
Integer processors are capable of detecting all single errors and switching their outputs to a safe state in case of error ("fail-silent" processors). They are often called "fail-safe" processors, but they are only safe when used in plants where a safe state can be reached by passive means.
This requires a high coverage, which is usually achieved by duplication and comparison.
For operation, both computers must be operational: this is a 2oo2 structure (2 out of 2).

Integer Computers: Self-Testing System
Computers increasingly include means to detect their own errors.
[Diagram: self-testing parallel processors (e.g. duplication & comparison), a backplane bus (self-test by parity), stable storage MEM (with error detection and correction), I/O, a serial bus (CRC), and changeover logic that switches the output to a safe state (safe value Vs); E and D mark error detection blocks.]
What happens if the safe switch fails?

Integer outputs: selection by the plant
The dual channel should be extended as far as possible into the plant.
[Diagram: three output arrangements toward the plant]
- worker and checker: act if both agree (workby)
- worker and checker: act if any does (workby)
- controller with error detection: act if error detection agrees (error detector controls power)

10.2 Fault-tolerant structures
10.1 Error detection and fail-silent computers
 - check redundancy
 - duplication and comparison
10.2 Fault-Tolerant Structures
10.3 Issues in Workby operation
 - Input Processing
 - Synchronization
 - Output Processing
10.4 Issues in Standby operation
 - Standby Redundancy Structures
 - Checkpointing
 - Recovery
10.5 Examples of Dependable Architectures
 - ABB dual controller
 - Boeing 777 Primary Flight Control
 - Space Shuttle PASS Computer
