
3/25/2014

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Low Level Fault-Tolerance: Watchdog and Re-execution

Overview

  • Introduction
  • Watchdog techniques

– Timers, watchdog processors, error model, control flow checking, memory access and assertion checking

ECE 753 Fault Tolerant Computing 2

  • Re-execution for fault-tolerance

– Basic techniques: RESO concept, program re-execution, instruction re-execution
– Case studies: fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
  • Summary

Introduction

  • References
  • Watchdog - [mahm:88]
  • Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
  • Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.

Introduction (contd.)

  • Somewhat higher level than ECC and masking at circuit level
  • Bordering between hardware and software (hardware often assisted by software)
  • These are some of the very first fault-tolerance methods

Watchdog techniques

  • Key concept

– A process or processor has its actions checked by another unit, normally hardware. Actions checked include whether the process is still active and alive, is not executing incorrect paths during execution, etc.

[Figure: Processor observed by watchdog]

Watchdog: Timers

  • Check for aliveness

– Processor resets the timer at certain intervals or on certain conditions
– Timer raises error flag if not reset before it overruns

[Figure: Processor, timer, Error flag]
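The aliveness check can be sketched in a few lines. This is a minimal simulation under assumptions: time advances in discrete ticks, and the names `WatchdogTimer`, `reset`, `tick`, and `run` are hypothetical, not taken from the lecture.

```python
class WatchdogTimer:
    """Counts down on every tick; the monitored processor must reset it in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.error = False

    def reset(self):
        # Called by the processor to signal that it is still alive.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Called once per clock tick by the (hypothetical) timer hardware.
        if self.error:
            return  # the error flag stays latched once raised
        self.remaining -= 1
        if self.remaining <= 0:
            self.error = True  # overrun: raise the error flag


def run(kick_schedule, timeout_ticks):
    """Simulate one tick per entry; True means the processor resets the timer on that tick."""
    wd = WatchdogTimer(timeout_ticks)
    for kicks in kick_schedule:
        if kicks:
            wd.reset()
        wd.tick()
    return wd.error
```

With a timeout of 4 ticks, a processor that resets the timer every 3 ticks never trips the flag, while one that stops resetting it is flagged after 4 ticks.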


Watchdog: Timers (contd.)

  • Check for timeout

– Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)

[Figure: Processor A, Timer, Processor B]
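A software version of the timeout check might look like the sketch below. It assumes Processor B can be modeled as a function running in a separate thread; `request_with_timeout`, `healthy`, and `hung` are hypothetical names introduced only for illustration.

```python
import queue
import threading
import time


def request_with_timeout(responder, message, timeout_s):
    """Processor A: send `message`, start a timer, and wait for B's reply.

    Returns the reply, or None if B did not answer within `timeout_s` seconds.
    """
    reply_box = queue.Queue(maxsize=1)

    def processor_b():
        reply_box.put(responder(message))

    threading.Thread(target=processor_b, daemon=True).start()
    try:
        return reply_box.get(timeout=timeout_s)
    except queue.Empty:
        return None  # timer expired before the reply arrived


def healthy(message):
    return "ack:" + message


def hung(message):
    time.sleep(0.5)  # replies far too late for a short timeout
    return "ack:" + message
```

Calling `request_with_timeout(hung, "ping", 0.05)` returns `None`, which Processor A can treat as an error indication for Processor B.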

Watchdog: Timers (contd.)

  • Applications

– Processor control systems (chemical, mechanical and other control systems)
– Switching systems – messages sent or received often wait a certain length of time before they are repeated
– Networks – email messages often have timeouts associated with them

Watchdog: Processors

  • Architecture – can be complex but let us consider the following simple architecture

[Figure: Processor and Memory connected by a BUS (data, address, control), with the Watchdog (observer) attached to the bus]

Watchdog: Processors (contd.)

  • What can it achieve?

– Observe the address bus
  • Can observe the data
  • Can observe instructions
  • Can check the flow of program control
– Need to know what kind of errors can occur to determine the capability of this method

Watchdog: Error models

  • Experimental setup to develop error models applicable at this level
– Processor-memory architecture
– Inject faults (random errors) - in I/O processor, within processor (register file, states), within memory
– Simulate
– Also hardware was designed to inject such faults and study the impact/behavior

Watchdog: Error models (contd.)

  • Conclusions of the studies

– Program flow could change (branch to no branch, or vice versa)
– Instruction fetched from data space
– Access to nonexistent memory space
– Data fetched from instruction space
– Illegal instruction
– Writing in protected area (ROM)
  • 60% of all faults could be detected by monitoring control flow
– Thus we need to develop methods that are good at monitoring control flow


Watchdog: Control flow checking

  • Basic principle

– Analyze the program and extract control information
  • Branch free intervals
  • Subroutine calls
– Assign signatures to branch free intervals and provide these signatures to the watchdog processor to check these values

Watchdog: Control flow checking (contd.)

  • A simple example

Program                        Watchdog
start              -------->   receive start
branch free code               observe bus cont. to form signature
check sig X        -------->   check X against collected sig

Watchdog: Control flow checking (contd.)

  • Details and variations

– Structural integrity checking
  • Analyze the program control flow – create a program control flow graph
  • Assign unique identifier to the nodes of the graph
  • Provide control flow graph to the watchdog along with the identifiers
  • In case of branches, watchdog expects one of the many possible identifiers
  • Limitations
– Performance impact – insertion of special instructions
– Inability to detect data processing variations (e.g., an add changed to a sub)
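Structural integrity checking can be illustrated with a small graph walk. The four-node graph and the names `CFG` and `check_trace` are assumptions made for the example; the identifiers stand for the special instructions the program sends to the watchdog.

```python
# Hypothetical control flow graph: node identifier -> allowed successor identifiers.
CFG = {
    "A": {"B", "C"},  # A ends in a branch, so either B or C may follow
    "B": {"D"},
    "C": {"D"},
    "D": set(),       # exit node
}


def check_trace(cfg, entry, trace):
    """Return the index of the first illegal identifier in a non-empty trace, or None."""
    if trace[0] != entry:
        return 0
    for i in range(1, len(trace)):
        allowed = cfg.get(trace[i - 1], set())
        if trace[i] not in allowed:
            return i  # the watchdog expected one of `allowed` here
    return None
```

Both `["A", "B", "D"]` and `["A", "C", "D"]` are accepted, while `["A", "D"]` is rejected at index 1. A fault that turns an add into a sub inside node B leaves the identifier sequence unchanged, which is the second limitation listed above.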

Watchdog: Control flow checking (contd.)

  • Details and variations (contd.)

– Derived signature checking
  • Compiler identifies branch free intervals and generates signatures (such as check sum) for these intervals
  • At run time these signatures are provided to the watchdog using tag bits to differentiate between regular instructions and watchdog messages
  • Watchdog monitors the bus, generates the signatures, and compares them with the signatures captured from the bus (compiled signatures)
  • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures – when a tag for signature appears on the bus, the watchdog captures the tagged word and forces a NOP on the bus for the regular processor
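The tag-bit scheme can be sketched as below. This is a simplified model under assumptions: a single tag value per word instead of two tag bits, a 16-bit additive checksum as the signature, and hypothetical names (`INSTR`, `SIG`, `NOP`, `embed_signatures`, `run_with_watchdog`).

```python
INSTR, SIG = 0, 1  # simplified tag values (the slide uses two tag bits per word)
NOP = 0x0000       # hypothetical NOP encoding


def checksum_step(sig, word):
    return (sig + word) & 0xFFFF


def embed_signatures(intervals):
    """Compiler side: append a tagged signature word after each branch free interval."""
    memory = []
    for interval in intervals:
        sig = 0
        for word in interval:
            memory.append((INSTR, word))
            sig = checksum_step(sig, word)
        memory.append((SIG, sig))
    return memory


def run_with_watchdog(memory):
    """Return (words seen by the processor, indices of intervals whose signature mismatched)."""
    executed, errors = [], []
    sig, interval = 0, 0
    for tag, word in memory:
        if tag == SIG:
            if word != sig:
                errors.append(interval)
            executed.append(NOP)  # watchdog captures the word; processor sees a NOP
            sig = 0
            interval += 1
        else:
            executed.append(word)
            sig = checksum_step(sig, word)
    return executed, errors
```

The NOPs in `executed` correspond to the performance overhead of this method, and the extra `(SIG, ...)` words to its memory overhead.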

Watchdog: Control flow checking (contd.)

  • Details and variations (contd.)

– Derived signature checking (contd.)
  • Coverage
– Can detect random errors in instructions in branch free intervals (but aliasing can occur)
  • Overheads
– Memory width increase due to tag bits
– Memory increase due to signatures insertions
– Performance impact due to NOPs
  • Solutions
– Using path signature method – reduces the number of signatures needed
– Branch address hashing – merge signature and branch address
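Aliasing is easy to demonstrate with a simple check sum. The sketch assumes a 16-bit additive checksum as the signature; the instruction words are made-up values.

```python
def checksum(words):
    """16-bit additive checksum used as a toy signature."""
    return sum(words) & 0xFFFF


original = [0x0010, 0x0020, 0x0030]
single_error = [0x0010, 0x0021, 0x0030]  # one corrupted word
compensating = [0x0011, 0x001F, 0x0030]  # two errors that cancel in the sum

detected = checksum(single_error) != checksum(original)
aliased = checksum(compensating) == checksum(original)
```

Here `detected` and `aliased` are both `True`: the single error changes the signature, but the two compensating errors produce the same signature as the fault-free interval and go unnoticed.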

Watchdog: Mem access and assertion checks

  • What to do about memory/data errors
– Use ECC
– Few other methods using watchdog
  • Check for nonexistent memory addresses
  • Check for out of range addresses
  • Capability based checking for objects is also possible
  • Assertion based checking and sanity checks using watchdog (independent hardware) are also possible
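A memory access check of this kind might be sketched as follows. The memory map, the address ranges, and the names `Region`, `MEMORY_MAP`, and `check_access` are all hypothetical; a real watchdog would perform the comparison in hardware on the observed bus cycles.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int          # exclusive upper bound
    perms: frozenset  # allowed access kinds: "fetch", "read", "write"


# Hypothetical memory map: code in ROM, data in RAM, nothing above 0x8000.
MEMORY_MAP = [
    Region(0x0000, 0x4000, frozenset({"fetch"})),
    Region(0x4000, 0x8000, frozenset({"read", "write"})),
]


def check_access(address, kind, memory_map=MEMORY_MAP):
    """Return None for a legal access, otherwise a short description of the violation."""
    for region in memory_map:
        if region.start <= address < region.end:
            return None if kind in region.perms else "illegal " + kind
    return "nonexistent address"
```

With this map, a write to ROM, an instruction fetch from data space, and any access above 0x8000 are all reported, matching error types from the error model discussed earlier.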


Re-execution for fault-tolerance

  • Key concept

– Execute a program/instruction twice (or more times) and then compare the results.
– A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique
– Can detect transient faults. But it can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used.

Re-execution: Basic Techniques

  • RESO concept

– Re-execution of an instruction with shifted operands
  • Already discussed early in the course
  • Can detect transient faults
  • Can also detect many permanent faults
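To see why shifted operands expose a permanent fault, consider the sketch below. It assumes a 16-bit adder with an optional stuck-at-0 output bit and operands small enough that the left shift loses no bits; `make_adder` and `reso_add` are hypothetical names.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def make_adder(stuck_bit=None):
    """Model a 16-bit adder; optionally one output bit is permanently stuck at 0."""
    def add(a, b):
        result = (a + b) & MASK
        if stuck_bit is not None:
            result &= ~(1 << stuck_bit)
        return result
    return add


def reso_add(add, a, b):
    """Compute a + b, recompute with operands shifted left by one, and compare.

    Assumes a, b, and a + b fit in WIDTH - 1 bits so the shift loses nothing.
    Returns (first result, True if the two computations agree).
    """
    first = add(a, b)
    second = add((a << 1) & MASK, (b << 1) & MASK) >> 1
    return first, first == second
```

For 5 + 6 on an adder with output bit 1 stuck at 0, the first pass yields 9 and the shifted pass yields 10, so the mismatch reveals the fault even though the same hardware was used twice. The shift moves each result bit to a different adder position, so the stuck bit corrupts a different part of the answer in each pass.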

Re-execution: Basic Techniques (contd.)

  • Program Re-execution

– Make two copies of the program
  • Execute them serially
– Can use RESO if the hardware platform is same for both executions
  • Execute them in parallel if sufficient hardware redundancy is available
– May take twice as long or twice the hardware
– When/how to compare: impacts the system complexity
– Performance impact
  • Serial computation: High latency
  • Parallel computation: Complex implementation, and hence possible loss of performance
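Serial program re-execution with a single comparison at the end can be sketched as below. The fault injector is an assumption used only to make a transient fault visible; `make_program` and `execute_twice` are hypothetical names.

```python
def make_program(fault_on_call=None):
    """Return a toy program (a sum) with an optional injected transient fault.

    The fault flips the low bit of the result on one specific invocation only.
    """
    calls = 0

    def program(values):
        nonlocal calls
        calls += 1
        total = sum(values)
        if calls == fault_on_call:
            total ^= 1  # injected transient bit flip
        return total

    return program


def execute_twice(program, inputs):
    """Run the program twice serially; return (first result, True if both runs agree)."""
    first = program(inputs)
    second = program(inputs)
    return first, first == second
```

A disagreement signals an error but does not say which run was wrong; a third execution or a retry would be needed to recover. Comparing only at the end keeps the implementation simple at the cost of detection latency.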

Re-execution: Basic Techniques (contd.)

  • Instruction Re-execution – fine grain parallelism
– Re-execute every instruction on same or different hardware, depending upon the redundancy available
  • May use RESO if same hardware is used for instruction re-execution
– If sufficient resources are available, this method may have little impact on the performance

Re-execution: Case studies

  • Introduction to case studies

– CRAY
  • Instruction re-execution
– SMT architecture
  • Two copies of the program are interleaved as two threads for simultaneous execution
– Multiscalar architecture
  • Two copies of the program are executed on many processing elements simultaneously
– Chip multiprocessor
  • With critical value forwarding (DSN-2010)

Re-execution: Case studies (contd.)

  • CRAY
  • Instruction re-execution
  • Duplication of instruction in hardware
  • Sufficient resources and pipelining available for re-execution without doubling the execution time
  • Consider a generic fine grain parallel architecture (OH)
  • Consider executing a code segment (OH)
  • Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
  • Some experimental results

Re-execution: Case studies (contd.)

  • AR-SMT

– High level view of the technique (OH)
  • Concept of execution (Active) streams
  • Re-execution of the instruction stream – Redundant stream
– Issue of delay buffer length and latency
– Implementation issues and coverage
– Performance impact

Re-execution: Case studies (contd.)

  • Multiscalar

– Concept of control flow graph (OH)
– Basic architecture (OH)
– Static division of PUs and performance impact (OH)
– Dynamic division of PUs and performance impact (OH)

Re-execution: Case studies (contd.)

  • Chip Multiprocessor (See slide set)

– Intro
– Design Overview and concept
– Evaluation
– Conclusion

Watchdog and Re-execution: Comments

  • Concepts discussed here can be used to design high performance processors
– Performance improvement via speculation
  • Have a very high performance speculative processor
  • Verify the control flow using watchdog or use a second processor to fully verify the executed stream by the speculative processor.
  • This will lead to a processor with high performance (throughput) albeit high latency

Summary

  • Watchdog

– Timer
– Processor
– Control flow checking
  • Re-execution
– Basic techniques
– Case studies: CRAY, AR-SMT, Multiscalar