SLIDE 1

RELIABILITY and RELIABLE DESIGN

Giovanni De Micheli
Centre Systèmes Intégrés

SLIDE 2

Outline

  • Introduction to reliable design
  • Design for reliability

    – Component redundancy
    – Communication redundancy
    – Data encoding and error correction
    – Dealing with variability

  • Summary and conclusions
SLIDE 3

Reliable design: where do we need it?

  • Traditional applications

    – Long-life applications (space missions)
    – Life-critical, short-term applications (aircraft engine control, fly-by-wire)
    – Defense applications (aircraft, guidance & control)
    – Nuclear industry
    – Telecommunications

  • New computation-critical applications

    – Health industry
    – Automotive industry
    – Industrial control systems and production lines
    – Banking, reservations, commerce

SLIDE 4

The economic perspective

  • Availability is a critical business metric for commercial systems and services

    – Nearly 100% availability (“five nines+”) is almost mandatory

  • Service outages are frequent

    – 65% of website managers report outages over a 6-month period
    – 25% report three or more outages [Internet Week 2000]

  • High cost of downtime of systems providing vital services

    – Lost opportunities and revenues, non-compliance penalties, potential loss of lives
    – Cost per hour of downtime ranges from $89K for cellular services to $6.5M for stock brokerage [Gartner Group 1998]

  • Revenue for high-availability products in the data/telecom/computer server market is over $100B (≈ $15B for servers alone) [IMEX Research 2003]

SLIDE 5

Reliability is a system issue

  • Hardware (system network, processing elements, memory, storage system)

    – Error correcting codes, M-out-of-N and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)

  • Reliable communication

    – CRC on messages, acknowledgment, watchdogs, heartbeats, consistency protocols

  • Operating system

    – Memory management and exception handling, detection of process failures, checkpoint and rollback

  • Applications, with software-implemented fault tolerance (application program interface (API), middleware)

    – Checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming

[Iyer]

SLIDE 6

Malfunctions

  • Manufacturing imperfections

– More likely to happen as lithography scales down

  • Approximations during design

– Uncertainty about details of design

  • Aging

– Oxide breakdown, electromigration

  • Environment-induced

– Soft-errors, electro-magnetic interference

  • Operating-mode induced

– Extremely-low voltage supply

SLIDE 7

Process variability

  • Effects of downscaling

    – Smaller mean values
    – Larger variances

  • Worst-case design paradigm fails
SLIDE 8

Sources of process variations

  • Critical dimension (CD) variation

    – Systematic and random
    – Inter- and intra-die

  • Width variation

– Impact on narrow transistors

  • Threshold voltage fluctuation

– Largest impact on short and narrow devices

  • Interconnect

– Dishing and erosion

SLIDE 9

Circuit-level mitigation techniques

  • For sizing:

    – Guardbanding, layout design rules
    – Device matching design rules
    – Regular fabric

  • For threshold variation:

    – Graded wells
    – Upsizing devices

  • For voltage variations:

    – Dynamic voltage control
    – Thermal management

SLIDE 10

Malfunctions and faults

  • Malfunctions can be:

    – Permanent, transient, intermittent

  • Malfunctions are captured by:

    – Faults: abstractions of the malfunctions
    – Failure modes: ways in which the malfunction manifests itself
    – Failure rates: related to the failure probability
SLIDE 11

Aging of materials (Permanent malfunctions)

  • Failure mechanisms

    – Electromigration
    – Oxide breakdown
    – Thermo-mechanical stress

  • Temperature dependence

– Arrhenius law
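The slide only names the Arrhenius law. A minimal sketch of how it is commonly applied, assuming the usual form λ(T) ∝ exp(–Ea / (k·T)) with T in kelvin: the ratio of failure rates at two temperatures is an acceleration factor. The function name and the activation energy of 0.7 eV are illustrative assumptions, not values from the presentation.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Failure-rate ratio lambda(t_stress) / lambda(t_use); temperatures in Celsius."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Hypothetical mechanism with Ea = 0.7 eV: how much faster does it age at 85 C than at 55 C?
factor = arrhenius_acceleration(ea_ev=0.7, t_use_c=55.0, t_stress_c=85.0)
print(f"acceleration factor: {factor:.1f}")
```

With these assumed numbers the failure rate grows by roughly a factor of eight for a 30 °C increase, which is why thermal management appears among the mitigation techniques.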

SLIDE 12

Sources of transient malfunctions

  • Soft errors

    – Data corruption due to external radiation exposure

  • Crosstalk

    – Data corruption due to internal field exposure

  • Both malfunctions manifest themselves as timing errors

    – Error containment

SLIDE 13

Defining the problems…

  • Failure rate:

    – Assuming a unit works correctly in [0, t], the failure rate λ(t) is the conditional probability per unit time that the unit fails in [t, t + Δt]

  • Typically the failure rate λ depends on:

    – Temperature
    – Time (burn-in and aging)
    – Environmental exposure (soft errors, EMI)

  • Often the component failure rate is assumed to be constant for simplicity

SLIDE 14

Failure rate: the bathtub curve

[Figure: bathtub curve, failure rate plotted against time]

SLIDE 15

Reliability

  • The probability function R(t) that a system works correctly in [0, t] without repairs
  • Reliability is a function of time

    – If the system consists of a single component with constant failure rate λ, then R(t) = exp(–λt)
    – The mean time to failure is MTTF = 1/λ

  • In general, the MTTF is E[t] = ∫ R(t) dt, integrated over [0, ∞)
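The two MTTF statements on this slide can be checked against each other numerically. The sketch below integrates R(t) = exp(–λt) with a plain trapezoidal rule and compares the result with 1/λ. The failure rate of 1e-4 per hour and the truncation of the integral at 50/λ are assumptions chosen for illustration.

```python
import math


def reliability(t: float, lam: float) -> float:
    """R(t) for a single component with constant failure rate lam."""
    return math.exp(-lam * t)


def mttf_numeric(lam: float, horizon_factor: float = 50.0, steps: int = 200_000) -> float:
    """Trapezoidal estimate of the integral of R(t) from 0 to horizon_factor / lam."""
    t_end = horizon_factor / lam
    dt = t_end / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_end, lam))
    for i in range(1, steps):
        total += reliability(i * dt, lam)
    return total * dt


LAM = 1e-4  # assumed failure rate, in failures per hour
print("closed form MTTF:", 1.0 / LAM)
print("numeric MTTF:    ", mttf_numeric(LAM))
print("R(MTTF):         ", reliability(1.0 / LAM, LAM))
```

Note that R(MTTF) = exp(–1), about 0.37: with a constant failure rate, a unit is more likely than not to have failed by the time its MTTF is reached.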
SLIDE 16

Dependability Concepts

[Figure: timeline from the previous repair to the next fault. Events in order: previous repair; fault occurs; error, when the fault becomes active (e.g. memory has write 0); error detection (read memory, parity error); repair memory; next fault occurs. Marked intervals: fault latency, error latency, repair time, MTTF, MTTR, MTBF]

Reliability:

a measure of the continuous delivery of service. R(t) is the probability that the system survives (does not fail) throughout [0, t]. Expected value: MTTF (Mean Time To Failure)

Availability:

a measure of the service delivery with respect to the alternation of delivery and interruptions. A(t) is the probability that the system delivers a proper (conforming to specification) service at a given time t. Expected value: EA = MTTF / (MTTF + MTTR)

Maintainability:

a measure of the service interruption. M(t) is the probability that the system will be repaired within a time less than t. Expected value: MTTR (Mean Time To Repair)

Safety:

a measure of the time to catastrophic failure. S(t) is the probability that no catastrophic failures occur during [0, t]. Expected value: MTTCF (Mean Time To Catastrophic Failure)
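The availability formula EA = MTTF / (MTTF + MTTR) is easy to turn into downtime figures. The sketch below is illustrative only: the function names, the MTTF and MTTR values, and the 365.25-day year are assumptions, not data from the presentation.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed average year length


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Expected availability EA = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of outage per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical server: fails on average every 2000 hours, repaired in 2 hours.
ea = steady_state_availability(mttf_hours=2000.0, mttr_hours=2.0)
print(f"EA = {ea:.5f}, downtime = {downtime_minutes_per_year(ea):.0f} min/year")

# "Five nines" availability
print(f"five nines: {downtime_minutes_per_year(0.99999):.2f} min/year")
```

Five nines leaves a little over five minutes of downtime per year, so for a fixed MTTF the repair time has to shrink accordingly.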

SLIDE 17

Reliability of complex systems

  • A system is a connection of components
  • System reliability depends on the topology

    – Series/parallel configurations
    – N out of K configurations
    – General topologies

  • Common mode failures

    – Failure mode that affects all components
    – Examples:

      • Failure of voltage regulator for SoC
      • Failure of scheduler to process exception routines
SLIDE 18

Very simple example

  • For reliability analysis, a system consists of three components:

    – Processor, memory, bus

  • All components have to be up at the same time to accomplish the mission
  • The three components form a series configuration
  • The system reliability is the product of the component reliabilities (if the failures are independent)
  • Assume constant failure rates:

    – The system failure rate is the sum of the component failure rates
    – The MTTF is its inverse
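The series rule can be sketched in a few lines. The three failure rates below are made-up values for the processor, memory and bus, chosen only to show that the product of the component reliabilities equals exp(–(λ1 + λ2 + λ3)·t), and that the system MTTF is shorter than that of any single component.

```python
import math


def series_reliability(t: float, rates: list[float]) -> float:
    """Product of component reliabilities, assuming independent failures."""
    return math.prod(math.exp(-lam * t) for lam in rates)


def series_mttf(rates: list[float]) -> float:
    """Inverse of the summed (constant) failure rates."""
    return 1.0 / sum(rates)


# Hypothetical failure rates in failures per hour: processor, memory, bus
RATES = [1e-5, 2e-5, 5e-6]

print("system failure rate:", sum(RATES))
print("system MTTF (h):    ", series_mttf(RATES))
print("R(10000 h):         ", series_reliability(10_000.0, RATES))
```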

SLIDE 19

Example (2)

  • For reliability analysis, a system consists of two processors:

    – A working processor suffices to accomplish the mission

  • The two components form a parallel configuration
  • The system unreliability is the product of the component unreliabilities (if the failures are independent)

    – R(t) = 1 – [1 – R1(t)] [1 – R2(t)]
    – Assume constant failure rates
    – The MTTF is 1/λ1 + 1/λ2 – 1/(λ1 + λ2)

  • Other relevant configurations:

    – Standby
    – Triple modular redundancy
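For the parallel pair with constant failure rates, expanding R(t) = 1 – [1 – R1(t)][1 – R2(t)] gives exp(–λ1·t) + exp(–λ2·t) – exp(–(λ1 + λ2)·t). Integrating term by term yields MTTF = 1/λ1 + 1/λ2 – 1/(λ1 + λ2). The sketch below checks this closed form against a numerical integration; the two failure rates and the integration horizon are assumed values.

```python
import math


def parallel_reliability(t: float, lam1: float, lam2: float) -> float:
    """The system works as long as at least one of two independent units works."""
    r1 = math.exp(-lam1 * t)
    r2 = math.exp(-lam2 * t)
    return 1.0 - (1.0 - r1) * (1.0 - r2)


def parallel_mttf(lam1: float, lam2: float) -> float:
    """Closed-form integral of parallel_reliability over [0, infinity)."""
    return 1.0 / lam1 + 1.0 / lam2 - 1.0 / (lam1 + lam2)


def integrate(f, t_end: float, steps: int = 100_000) -> float:
    """Plain trapezoidal rule on [0, t_end]."""
    dt = t_end / steps
    total = 0.5 * (f(0.0) + f(t_end))
    for i in range(1, steps):
        total += f(i * dt)
    return total * dt


LAM1, LAM2 = 1e-3, 2e-3  # assumed failure rates, in failures per hour
closed_form = parallel_mttf(LAM1, LAM2)
numeric = integrate(lambda t: parallel_reliability(t, LAM1, LAM2),
                    t_end=50.0 / min(LAM1, LAM2))
print("closed form:", closed_form)
print("numeric:    ", numeric)
```

With two identical units (λ1 = λ2 = λ) the formula reduces to 1.5/λ: duplication raises the MTTF by 50%, not by 100%.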

SLIDE 20

TMR vs simplex reliability
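The plot for this slide is not preserved in the transcript. As a hedged reconstruction of the comparison it names, the sketch below assumes three identical, independent modules and a perfect majority voter, so the TMR system works when at least two of the three modules work: R_TMR = 3R² – 2R³. The general M-out-of-N helper is included only as a cross-check.

```python
from math import comb


def m_out_of_n_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n independent units (each reliability r) work."""
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))


def tmr_reliability(r: float) -> float:
    """2-out-of-3 majority with a perfect voter (assumption): 3r^2 - 2r^3."""
    return 3.0 * r**2 - 2.0 * r**3


for r in (0.99, 0.9, 0.5, 0.3):
    print(f"simplex R = {r:.2f}   TMR R = {tmr_reliability(r):.4f}")
```

Under these assumptions TMR beats a single module only while the module reliability is above 0.5. With a constant failure rate λ that means t < ln(2)/λ, and the TMR MTTF of 5/(6λ) is actually shorter than the simplex MTTF of 1/λ. TMR pays off for short missions, not for long unattended ones.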

SLIDE 21

Outline

  • Introduction to reliable design
  • Design for reliability

    – Component redundancy
    – Communication redundancy
    – Data encoding and error correction
    – Dealing with variability

  • Summary and conclusions
SLIDE 22

Design for reliability

  • Hard failures

    – Exploit redundancy:

      • Components
      • Interconnect

  • Soft failures

    – Encoding
    – Containment and rollback

  • Variability

    – Timing-error tolerant circuits
    – Self-calibrating circuits

SLIDE 23

Providing component redundancy

  • Component redundancy for enhanced reliability

    – Energy consumption penalty may be severe

  • Power-managed standby components

    – Provide for temporary/permanent back-up
    – Provide for load and stress sharing

  • Power management (PM) and reliability are intertwined:

    – PM allows reasonable use of redundancy on chip
    – Failure rates depend on effect of PM on components

  • A programmable and flexible interconnection means is required

SLIDE 24

Example

[Figure: multi-core system; labels include standby, faulty and memory]

  • When a core operates, its failure rate is higher than that of a standby unit
  • When a core fails, it is replaced by a standby core
  • System management may alternate cores at high frequency, voltage and failure rate, to optimize long-term reliability
SLIDE 25

Issues

  • Analyze system-level reliability

    – as a function of a power management policy

  • Determine a system management policy

    – to maximize reliability (over a time interval) and minimize energy consumption

  • Determine a system management policy and system topology

    – to maximize reliability (over a time interval) and minimize energy consumption

SLIDE 26

Outline

  • Introduction to dependable design
  • Design for reliability

    – Component redundancy
    – Communication redundancy
    – Data encoding and error correction
    – Dealing with variability

  • Summary and conclusions
SLIDE 27

Why on-chip networking?

  • Provide a structured methodology for realizing on-chip communication schemes

    – Modularity
    – Flexibility

  • Cope with inherent limitations of busses

    – Performance and power of busses do not scale up

  • Support reliable operation

    – Layered approach to error detection and correction

SLIDE 28

Interconnect design in a multi-processing environment

  • Most SoCs are multi-processors

    – Homogeneous: high performance computation
    – Heterogeneous: application specific solutions

  • Classic and ad hoc topologies
  • Different QoS requirements

    – Best-effort services
    – Guaranteed performance

[Figure: network-on-chip diagram; labels: network interface, packets, routes, PE]

SLIDE 29

Providing communication reliability

  • Some network topologies support multiple source/destination paths

    – Tolerate transient congestion, transient and permanent link malfunctions

  • Error detection and correction

    – Physical links: timing-error detection by shadow latches
    – Switches and routers: flit-level error detection and correction with CRCs
    – Network interface: packet integrity check
    – Processor cores: software data correctness check
SLIDE 30

Outline

  • Introduction to dependable design
  • Design for reliability

    – Component redundancy
    – Communication redundancy
    – Data encoding and error correction
    – Dealing with variability

  • Summary and conclusions
SLIDE 31

Encoding

  • At logic level, codes provide means of masking and detecting errors
  • Formally, a code is a subset S of the universe U of possible vectors
  • A noncode word is a vector in the set U – S

Example: S = even parity, U = 2^8 vectors

    – X1 = <10010011> is a codeword. Due to a multiple-bit error it becomes X3 = <10011100>, which is also a codeword: the error is not detectable
    – X2 is a codeword. It becomes X4, a noncode word: the error is detectable
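The even-parity example can be replayed directly. The sketch below uses the two vectors from the slide and treats bit strings as Python `str` values; the helper names are illustrative. Flipping the last four bits of X1 produces X3, which still has even parity and therefore passes the check, while a single flipped bit is caught.

```python
from itertools import product


def is_even_parity_codeword(bits: str) -> bool:
    """A vector belongs to the code S if it contains an even number of 1s."""
    return bits.count("1") % 2 == 0


def flip_bits(bits: str, positions) -> str:
    """Return a copy of bits with the given 0-based positions inverted."""
    flipped = list(bits)
    for p in positions:
        flipped[p] = "1" if flipped[p] == "0" else "0"
    return "".join(flipped)


x1 = "10010011"
x3 = flip_bits(x1, [4, 5, 6, 7])   # four-bit error: lands on another codeword
single = flip_bits(x1, [0])        # one-bit error: lands outside the code

print(x3, "detected" if not is_even_parity_codeword(x3) else "NOT detected")
print(single, "detected" if not is_even_parity_codeword(single) else "NOT detected")

# Size of S inside the universe U of 2^8 vectors
codewords = sum(is_even_parity_codeword("".join(v)) for v in product("01", repeat=8))
print("codewords:", codewords, "of", 2**8)
```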

SLIDE 32

Basic Concepts

  • Consider 2^k messages (i.e. k bits)
  • Encode messages with 2^k codewords using n-bit vectors

    – (n, k) code
    – Fraction k/n is called the rate of the code

  • Hamming distance properties:

    – The Hamming distance between two vectors x and y, d(x, y), is the number of bits in which they differ
    – The distance of a code is the minimum of the Hamming distances between all pairs of code words
    – Example: x = (1011), y = (0110): w(x) = 3, w(y) = 2, d(x, y) = 3
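A short sketch of the three quantities defined above: weight, distance between two vectors, and distance of a code. It reproduces the slide's example and then applies the definitions to a small (3, 2) even-parity code, which is an assumed example and not taken from the presentation.

```python
from itertools import combinations


def weight(x: str) -> int:
    """Hamming weight w(x): number of 1 bits."""
    return x.count("1")


def distance(x: str, y: str) -> int:
    """Hamming distance d(x, y): number of positions in which x and y differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))


def code_distance(code: list[str]) -> int:
    """Minimum Hamming distance over all pairs of distinct codewords."""
    return min(distance(a, b) for a, b in combinations(code, 2))


x, y = "1011", "0110"
print(weight(x), weight(y), distance(x, y))

# Assumed example: (3, 2) even-parity code, k = 2 message bits in n = 3 bit vectors
parity_code = ["000", "011", "101", "110"]
print("rate:", 2 / 3, "code distance:", code_distance(parity_code))
```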

SLIDE 33

Distance Properties

  • To detect all error patterns of Hamming distance ≤ d, code distance must be ≥ d + 1

    – e.g., a code with distance 2 can detect single-bit errors

  • To correct all error patterns of Hamming distance ≤ c, code distance must be ≥ 2c + 1

    – e.g., a code with distance 3 can correct single-bit errors

  • To correct all patterns of up to c errors and also detect up to d additional errors, code distance must be ≥ 2c + d + 1

    – e.g., a code with distance 5 can correct single errors and detect two additional errors (up to triple errors); used for correction only it corrects double errors, and used for detection only it detects quadruple errors
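The three bounds can be tabulated for a given code distance. This sketch assumes the reading of the combined bound in which d counts errors detected in addition to the c corrected ones, so a decoder that corrects up to c errors flags every pattern of up to c + d errors. The function names are illustrative.

```python
def max_detectable(code_distance: int) -> int:
    """Detection only: the largest d with code_distance >= d + 1."""
    return code_distance - 1


def max_correctable(code_distance: int) -> int:
    """Correction only: the largest c with code_distance >= 2c + 1."""
    return (code_distance - 1) // 2


def correct_and_detect_options(code_distance: int) -> list[tuple[int, int]]:
    """All pairs (c, d_extra) that satisfy code_distance >= 2c + d_extra + 1 with equality."""
    return [(c, code_distance - 1 - 2 * c)
            for c in range(max_correctable(code_distance) + 1)]


for dist in (2, 3, 4, 5):
    print(dist, max_detectable(dist), max_correctable(dist),
          correct_and_detect_options(dist))
```

For distance 4 the option (1, 1) is the familiar SEC-DED operating point: correct single errors, detect double errors.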

SLIDE 34

Codes for Storage and Communication

Cyclic Codes

  • Cyclic codes are parity check codes with the additional property that a cyclic shift of a codeword is also a codeword

    – if (C_{n-1}, C_{n-2}, ..., C_1, C_0) is a codeword, then (C_{n-2}, C_{n-3}, ..., C_0, C_{n-1}) is also a codeword

  • Cyclic codes are used in

    – sequential storage devices, e.g. tapes, disks, and data links
    – communication applications

  • An (n,k) cyclic code can detect single-bit errors, multiple adjacent bit errors affecting fewer than (n-k) bits, and burst transient errors
  • Cyclic codes require less hardware

    – They use linear feedback shift registers (LFSR)
    – General parity check codes require complex encoding and decoding circuits using arrays of EX-OR gates, AND gates, etc.
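A small software model can make the cyclic property concrete. The sketch below uses the (7, 4) cyclic code generated by g(x) = x³ + x + 1, a standard textbook choice assumed here for illustration and not named in the presentation. The shift-and-XOR loop in `poly_mod` does, one bit per iteration, what an LFSR does in hardware. The checks confirm that every cyclic shift of a codeword is again a codeword, and that single-bit errors and short bursts leave a nonzero remainder.

```python
N, K = 7, 4          # assumed (n, k) code parameters
GEN = 0b1011         # assumed generator polynomial x^3 + x + 1
DEG = N - K          # degree of the generator = number of check bits


def poly_mod(value: int, nbits: int = N) -> int:
    """Remainder of value (a polynomial over GF(2), MSB first) divided by GEN."""
    for shift in range(nbits - 1, DEG - 1, -1):
        if value & (1 << shift):
            value ^= GEN << (shift - DEG)   # subtract (XOR) the aligned generator
    return value


def encode(message: int) -> int:
    """Systematic encoding: message bits followed by the DEG check bits."""
    shifted = message << DEG
    return shifted | poly_mod(shifted)


def is_codeword(word: int) -> bool:
    return poly_mod(word) == 0


def rotate_left(word: int) -> int:
    """Cyclic shift of an N-bit word by one position."""
    return ((word << 1) | (word >> (N - 1))) & ((1 << N) - 1)


codewords = [encode(m) for m in range(2**K)]
print("all shifts are codewords:", all(is_codeword(rotate_left(c)) for c in codewords))

# Non-wrapping error patterns confined to DEG adjacent bits (includes single-bit errors)
bursts = [b << i for b in range(1, 2**DEG) for i in range(N) if (b << i) < 2**N]
print("short bursts detected:", all(not is_codeword(codewords[5] ^ e) for e in bursts))
```

Because the code is linear, an error pattern goes undetected only if the pattern itself is a codeword, so checking the patterns against one codeword is enough.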

SLIDE 35

Error-resilient coding

[Figure: block diagram and plot; labels: ICACHE, MEM. CTRL., AMBA bus interface, from ext. memory, HRDATA, AMBA bus, H encoder, H decoder, MTTF]

  • Compare original AMBA bus to extended bus with error detection and correction or retransmission

    – SEC coding
    – SEC-DED coding
    – ED coding

  • Explore energy efficiency [Bertozzi]

SLIDE 36

Error-resilient coding

[Figure: block diagram and plot; labels: ICACHE, MEM. CTRL., AMBA bus interface, from ext. memory, HRDATA, AMBA bus, H encoder, H decoder, MTTF]

  • Compare original AMBA bus to extended bus with error detection and correction or retransmission

    – SEC, SEC-DED, ED coding
    – CRC4 and CRC8 coding

  • On shorter links, CRCs become competitive when ENC/DEC power is accounted for [Bertozzi]

SLIDE 37

Outline

  • Introduction to reliable design
  • Design for reliability

    – Component redundancy
    – Communication redundancy
    – Data encoding and error correction
    – Dealing with variability

  • Summary and conclusions
SLIDE 38

Dealing with variability

  • Most variability problems induce timing errors

    – Power supply variation
    – Wire length estimation
    – Crosstalk
    – Soft errors

  • Timing errors can be contained while using an aggressive operating frequency

    – Timing errors are rare
    – Micro rollback
    – Delayed clocks

SLIDE 39

Propagation of soft error

SLIDE 40

Radiation-hardened registers

  • Protection against soft errors

– Timing errors

  • Each latch is duplicated

– Shadow latch has delayed clock

  • Comparison between original and shadow latch detects error

– Error correction is possible

[IROC Technologies]

SLIDE 41

The razor approach

  • Applicable to processor design
  • Try to shave off power consumption

– Reduce voltage margins with in situ error detection and correction for delay faults

  • Compare two samples of data

[Austin 03]

SLIDE 42

The t-error approach

  • Applicable to NoC communication
  • Use aggressive clocking frequency

    – Address data-dependent wire propagation delay
    – Compare two samples of data
    – Correct data and propagate with one cycle delay penalty

[Murali 04]
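The double-sampling idea can be illustrated with a toy behavioural model. This is not the circuit from [Murali 04]: the function name, the delay values and the timing model are assumptions. A main register samples the wire at the aggressive clock period, a shadow register samples slightly later, and a mismatch between the two triggers a one-cycle correction using the shadow value.

```python
def receive(words, delays, period, shadow_delay):
    """Toy model of a double-sampled link.

    words        -- values sent in successive cycles
    delays       -- assumed wire propagation delay of each word (same unit as period)
    period       -- aggressive clock period at which the main register samples
    shadow_delay -- extra time after which the shadow register samples
    Returns (received_words, cycles_used).
    """
    received = []
    cycles = 0
    on_wire = 0  # value still visible on the wire until the new word settles
    for word, delay in zip(words, delays):
        if delay > period + shadow_delay:
            raise ValueError("delay exceeds the shadow sampling window")
        main_sample = word if delay <= period else on_wire  # may capture stale data
        shadow_sample = word                                # always settled by now
        cycles += 1
        if main_sample != shadow_sample:
            cycles += 1              # timing error detected: one cycle delay penalty
        received.append(shadow_sample)
        on_wire = word
    return received, cycles


sent = [5, 9, 9, 3]
data, cycles = receive(sent, delays=[0.8, 1.1, 1.2, 1.05], period=1.0, shadow_delay=0.3)
print(data, cycles)
```

In this run the second and fourth words arrive late and each costs one extra cycle. The third word is also late, but it repeats the previous value, so the main sample happens to be correct and no penalty is paid. This mirrors the data-dependent nature of the delay faults the slide refers to.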

SLIDE 43

Adaptive low-power transmission scheme

[Figure: block diagram with FIFO, encoder, channel, decoder, FIFO and controller; signal labels: v_dd, v_ch, F, n, Ack, errors]

[Ienne02]

SLIDE 44

Outline

  • Introduction to reliable design
  • Design for reliability

    – Component redundancy
    – Communication redundancy
    – Data encoding and error correction
    – Dealing with variability

  • Summary and conclusions
SLIDE 45

Achieving reliable SoCs: Summary

  • Exploit redundancy

    – Component-level redundancy

      • Supported by modularity of micro-networks
      • Requires energy management

    – Communication link redundancy

      • Supported by path diversity of micro-networks

  • Error detection and correction

    – Encoding, CRCs, self-checking circuits

  • Dealing with variability

    – Detect and correct timing errors

SLIDE 46

Conclusions

  • Reliable design is important in many application domains
  • Reliable MPSoC design can be achieved with system-level techniques to obviate the limitations of the materials and environment
  • Structured design methodologies and structured interconnect design support reliable design
