
SLIDE 1

EMT 368 Reliability and Testability in Integrated Circuit Design

School of Microelectronic Engineering UniMAP


A. Harun

SLIDE 2

Course content

Reliability and availability concept
Robust design principle
Time and failure dependent reliability
Estimation methods of the parameters of failure time distribution

Parametric reliability model
Overview of testing
Ad-hoc techniques
Scan-path design
Boundary scan testing
Built-in self test (BIST)


SLIDE 3

CHAPTER 2 Robust Design Principle


SLIDE 4

Chapter 2 – Robust Design Principle

Unit of design
Failure recovery groups
Redundancy
Robust design principles
Robust protocols
Robust concurrency controls
Overload control
Process, resource and throughput monitoring
Data auditing
Fault correlation
Failed error detection, isolation or recovery
Geographic redundancy
Security, availability and system robustness
Error detection


SLIDE 5

Robust design principle

2.1 Unit of design

– HW and SW are organized into small components or modules.
– The system architecture or design defines how the components come together to form the system.
– Each unit can be thought of as a logical container.

Accepts logical input → on success, produces correct output; on failure or inconsistent input, provides an error/exception
If a major fault occurs → the unit may hang or become unresponsive
Units are organized into a hierarchical design.


SLIDE 6

Robust design principle


SLIDE 7

Robust design principle

2.1 Unit of design

– Logical container from biggest to smallest (network application)

Application
User session
Message request

– Protocol message or request
– Any error found in a request is contained within this container

Transactions
Robust exception handling
Subroutines

– Natural fault container
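A minimal sketch of fault containment between two of the containers above, assuming a hypothetical request handler (`handle_request`) and subroutine (`parse_amount`) that are not part of the course material. The subroutine reports its failure as an exception, and the message request contains it so the user session survives.

```python
class RequestError(Exception):
    """Raised by a subroutine when it cannot process its input."""


def parse_amount(text: str) -> int:
    # Subroutine: the smallest container. It reports failure by raising.
    if not text.isdigit():
        raise RequestError(f"invalid amount: {text!r}")
    return int(text)


def handle_request(session: dict, text: str) -> str:
    # Message request: the error is contained here and never reaches the session.
    try:
        amount = parse_amount(text)
    except RequestError as exc:
        return f"ERROR {exc}"
    session["total"] += amount
    return f"OK {session['total']}"


# Hypothetical user session that receives one bad request between two good ones.
session = {"total": 0}
replies = [handle_request(session, text) for text in ["10", "abc", "5"]]
print(replies)  # ['OK 10', "ERROR invalid amount: 'abc'", 'OK 15']
```

The bad request produces an error reply, while the session state stays consistent and later requests still succeed.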


SLIDE 8

Robust design principle

2.1 Unit of design

– Logical container from biggest to smallest (HW and platform SW)

System

– A failure at this level makes it necessary to restart the entire system

FRU (field-replaceable unit)

– Modular, e.g. blade server

Processor
Process
Thread


SLIDE 9

Robust design principle

2.2 Failure recovery group

– A unit reports a failure to its containing unit, or the containing unit implicitly detects the failure from errant behavior.
– What to do?

One can restart the errant application
One can restart the entire operating system
Highly available SW supports smaller recovery groups, e.g. session termination, process, etc.

– Failure recovery groups are suites of logical entities that are designed and tested to be recoverable while the remainder of the system remains operational.
– The most common failure recovery group is the SW process, e.g. a browser, a word processor, etc.


SLIDE 10

Robust design principle

2.2 Redundancy

– Systems deploy redundancy to increase throughput or capacity, e.g. multiple processor cores, or memory modules added to a processor board.
– Redundancy also increases service availability, e.g. multiple engines on an airplane.
– Redundancy in computer-based systems is implemented at three levels:

Process – prepare multiple processes in advance
FRU – e.g. compute blade FRUs in a blade server
Network element – e.g. more DNS servers


SLIDE 11

Robust design principle

2.2 Redundancy

– Redundant units are typically organized into one of two common arrangements

Active standby – one serving and one on standby.

– Hot, warm and cold are terms that characterize standby readiness
» Cold standby – application SW or OS needs to be started
» Warm standby – application SW is running and volatile data is periodically synchronized; time is needed to rebuild the system state
» Hot standby – application SW is running and volatile data is current

Load shared

– All operational units actively serve users
– N = number of units required, K = number of redundant units configured
– N + K load sharing
– E.g. commercial airplane N + 1
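The N + K arithmetic can be sketched as a small check. The function name `can_carry_load` and the example counts are illustrative assumptions, and the sketch assumes every unit has the same capacity.

```python
def can_carry_load(n_required: int, k_redundant: int, failed: int) -> bool:
    """Return True if the surviving units can still carry the engineered load.

    Assumes identical units: N are required and K extra units are configured.
    """
    if min(n_required, k_redundant, failed) < 0:
        raise ValueError("counts must be non-negative")
    surviving = n_required + k_redundant - failed
    return surviving >= n_required


# N + 1 arrangement with N = 3: one failure is tolerated, two are not.
print(can_carry_load(3, 1, failed=1))  # True
print(can_carry_load(3, 1, failed=2))  # False
```

In other words, an N + K arrangement tolerates up to K simultaneous unit failures before capacity drops below what is required.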


SLIDE 12

Robust design principle

2.2 High availability middleware

– Recover service onto a redundant unit
– Failure recovery should be fast, with no impact to the user
– High availability mechanisms/middleware
– Practical systems may use some of these:

IP networking mechanisms – balance network load across a cluster of servers
Clustering – two or more computers arranged into a pool
High-availability middleware – infrastructure to support synchronization, data sharing, monitoring and management of applications
Application checkpoint mechanisms – system restore
Virtual machines
Redundant array of inexpensive disks (RAID) – arranges multiple HDDs, with redundant copies called mirrors
Database redundancy and replication
File system replication


SLIDE 13

Robust design principle

2.3 Robust design principle

– Robust design principle to consider:

Redundant, fault-tolerance design
No single point of failure
No single point of repair
Hot swappable FRUs

– Systems that must have no downtime for planned activities should consider the following principles:

No service impact for SW patch, update and upgrade
No service impact for HW growth or degrowth
Minimal impact for system reconfiguration


SLIDE 14

Robust design principle

2.3 Robust protocol

– Application protocols can be made robust by:

Use reliable protocols
Use confirmations or acknowledgements
Support atomic requests or transactions
Support timeouts and message retries
Use heartbeat or keep-alive mechanisms
Use stateless or minimal-shared state protocols
Support automatic reconnection
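Two of the items above, acknowledgements and timeouts with message retries, can be sketched together. This is only an illustration: `send_with_retry` and `FlakyChannel` are hypothetical names, and the channel simulates timeouts rather than using a real network.

```python
def send_with_retry(send, message, max_attempts=3):
    """Call send(message) until it returns an acknowledgement.

    Returns (attempts_used, acknowledgement). Raises ConnectionError
    if every attempt times out.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, send(message)
        except TimeoutError as exc:
            last_error = exc  # no acknowledgement in time, so retry
    raise ConnectionError(
        f"no acknowledgement after {max_attempts} attempts"
    ) from last_error


class FlakyChannel:
    """Hypothetical channel that times out a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("no reply within timeout")
        return f"ACK:{message}"


attempts, ack = send_with_retry(FlakyChannel(failures=2), "hello")
print(attempts, ack)  # 3 ACK:hello
```

Bounding the number of attempts matters: unbounded retries from many clients can themselves become a source of overload.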


SLIDE 15

Robust design principle

2.4 Robust concurrency controls

– Concurrency controls enable applications to efficiently share resources across many users simultaneously.
– Systems may share processor time, buffers, etc.
– Access to critical sections controlling shared resources has to be serialized.
– Two applications cannot access the same portion of shared memory or a resource pool at the same time.
– Platform mechanisms such as semaphores and mutual exclusion locks are needed for control.
– The application should also make sure that a failed process can be restarted without restarting the entire system.
– The system needs to scan for concurrency controls held by a failed process to avoid standing deadlocks.
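A minimal sketch of serializing access to a shared resource with a mutual exclusion lock, using Python's standard `threading.Lock`. The shared counter and the thread counts are illustrative assumptions.

```python
import threading

counter = 0              # shared resource
lock = threading.Lock()  # mutual exclusion lock guarding the counter


def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        # Critical section: only one thread at a time may update the counter.
        # The "with" block releases the lock even if the body raises.
        with lock:
            counter += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Releasing the lock on every exit path is the in-process version of the last point above: a lock left held by a failed holder blocks everyone else.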


SLIDE 16

Robust design principle

2.5 Overload control

– An implemented system has physical HW constraints

E.g. processing power, storage, I/O bandwidth

– These translate into capacity limits under acceptable QoS
– When demand for service increases beyond these limits, the system is unable to serve all requests
– Overload control is needed to gracefully manage traffic exceeding engineered capacity


SLIDE 17

Robust design principle

2.5 Overload control

– Sys overload causes

Unexpected popularity
Under engineered system
Incorrectly configured system
External events

– Promotion, New Year's Eve, etc.

Power outage and restoration

– Spike in reconnections if reconnection is automated

Network equipment failure and restoration
System failure

– Service is distributed across multiple systems; when one fails, its workload shifts to the others

Denial-of-service attack

– Cyber vandalism, ransom


SLIDE 18

Robust design principle

2.5 Overload control

– Two features of overload control

Control mechanism

– Shed load or traffic

Control triggers

– Activate control mechanism when congestion occurs
– Deactivate after congestion has ended
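A control trigger can be sketched as a small state machine. This sketch assumes queue length as the congestion signal and uses two separate thresholds so the control does not flap on and off; the class name, the thresholds and the use of hysteresis are assumptions, not taken from the slides.

```python
class OverloadControl:
    """Trigger that switches load shedding on and off (hypothetical sketch)."""

    def __init__(self, activate_at: int, deactivate_at: int):
        if deactivate_at > activate_at:
            raise ValueError("deactivate_at must not exceed activate_at")
        self.activate_at = activate_at
        self.deactivate_at = deactivate_at
        self.shedding = False

    def update(self, queue_length: int) -> bool:
        # Activate the control mechanism when congestion occurs.
        if not self.shedding and queue_length >= self.activate_at:
            self.shedding = True
        # Deactivate only after congestion has clearly ended.
        elif self.shedding and queue_length <= self.deactivate_at:
            self.shedding = False
        return self.shedding


control = OverloadControl(activate_at=100, deactivate_at=60)
states = [control.update(q) for q in [50, 100, 80, 60, 70]]
print(states)  # [False, True, True, False, False]
```

While `shedding` is true, the control mechanism would shed load, for example by rejecting new sessions.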


SLIDE 19

Robust design principle

2.5 Overload control

– Congestion detection techniques

Slower sys response times
Longer work queues
Higher CPU utilization

– Indicates high system stress, but the system may not be overloaded

– Congestion control mechanism

Rejecting new sessions – ‘too busy’ error
Rejecting new message requests – reject all traffic from certain users, certain types of message, etc.
Disconnecting active sessions – lower priority users
Disabling servers or services – close some or all IP ports that are overloaded.


SLIDE 20

Robust design principle

2.5 Overload control

– Architectural considerations

System tasks should fall into three broad classes

– Low priority
» Tasks that do not directly impact users: maintenance and background tasks (e.g. backup, audit, etc.)
– Medium priority
» Tasks that directly or indirectly interact with end users
– High priority
» Management visibility and control tasks, e.g. overload control

As the system saturates, low priority tasks are deferred in favor of higher priority tasks.
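The deferral rule can be sketched with a toy scheduler. It assumes each task has a fixed cost and the system has a fixed capacity per scheduling round; the task names, costs and the `schedule` function are hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0      # maintenance and background tasks
    MEDIUM = 1   # tasks that interact with end users
    HIGH = 2     # management visibility and control tasks


def schedule(tasks, capacity):
    """Split tasks into (run, deferred) lists of names.

    tasks is a list of (name, priority, cost) tuples. Higher priority
    tasks are admitted first; tasks that no longer fit are deferred.
    """
    run, deferred = [], []
    remaining = capacity
    # sorted() is stable, so equal priorities keep their submission order.
    for name, _priority, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        if cost <= remaining:
            run.append(name)
            remaining -= cost
        else:
            deferred.append(name)
    return run, deferred


tasks = [
    ("backup", Priority.LOW, 4),
    ("user_request", Priority.MEDIUM, 3),
    ("overload_control", Priority.HIGH, 1),
    ("audit", Priority.LOW, 2),
]
run, deferred = schedule(tasks, capacity=5)
print(run)       # ['overload_control', 'user_request']
print(deferred)  # ['backup', 'audit']
```

With the system saturated at a capacity of 5, the control task and the user-facing task run, and both background tasks wait.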


SLIDE 21

Robust design principle

2.6 Process, resource and throughput monitoring

– Some errors may not be immediately seen in normal system operation, so they need to be detected before they become critical failures.
– Mechanisms to proactively monitor system health:

Heartbeat checks of critical processes – ensure each is sane enough to respond within a reasonable time
Resource usage checks – process size, free space, CPU usage
Data audits
Monitoring of system throughput, performance and alarm behavior
Health checks of critical supporting systems – hello, keep-alive, status queries

– These normally run as low priority processes, but the master control needs to run as a higher priority process.
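A heartbeat check can be sketched as a comparison of last-seen timestamps against a timeout. The process names, timestamps and the 5 second timeout are illustrative assumptions.

```python
def find_unresponsive(last_heartbeat, now, timeout):
    """Return names of processes whose last heartbeat is older than timeout.

    last_heartbeat maps a process name to the time (in seconds) of its
    most recent heartbeat reply.
    """
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )


# Hypothetical snapshot taken by the monitor at t = 101.0 s.
last_heartbeat = {"db": 100.0, "web": 97.5, "billing": 91.0}
unresponsive = find_unresponsive(last_heartbeat, now=101.0, timeout=5.0)
print(unresponsive)  # ['billing']
```

A real monitor would run this periodically and hand the result to the recovery logic, for example to restart the affected process.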


SLIDE 22

Robust design principle

2.7 Fault correlation

– Failures trace back to a single fault that was activated.
– The initial fault may have cascaded to cause detectable errors in several parts of the system.
– In ordinary consumer and commercial systems, a human must go through the failure, debug it, and then perform corrective actions.
– In highly available systems, debugging and recovery must be automatic and fast to minimize downtime.
– A robust system must do the following:

Assess fault persistence

– Transient faults – IP packet dropped, spike in system workload, etc. E.g. a retry may fix the error.
– Persistent faults – a false negative decision creates a silent or sleeping failure. A secondary fault or a human maintenance engineer may finally reveal the fault.


SLIDE 23

Robust design principle

2.7 Fault correlation

– A robust system must do the following (continued):

Isolate primary fault to appropriate recoverable unit

– The fault management system collects and correlates enough indirectly detected failures to isolate the true failure if direct failure detection does not occur quickly

Activate correct recovery mechanism

– Automatically trigger the appropriate recovery action after the failure is isolated

Assure successful recovery
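One simple way to correlate indirectly detected failures is to intersect the sets of units that each failed operation depended on. This is only a sketch of the idea under that assumption; the unit names and the `isolate_fault` function are hypothetical, and a real fault management system would use richer evidence.

```python
def isolate_fault(reports):
    """Return the units implicated by every failure report.

    reports is a list of sets; each set names the units that one failed
    operation depended on. Units common to all reports are the candidates
    for the primary fault.
    """
    if not reports:
        return set()
    suspects = set(reports[0])
    for units in reports[1:]:
        suspects &= set(units)
    return suspects


# Three indirectly detected failures that share only one dependency.
reports = [
    {"web1", "db", "cache"},
    {"web2", "db"},
    {"batch", "db", "cache"},
]
suspects = isolate_fault(reports)
print(suspects)  # {'db'}
```

Once a single recoverable unit is isolated, the matching recovery action can be triggered automatically and its success verified.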


SLIDE 24

Robust design principle

2.8 Failed error detection, isolation or recovery

– Secondary failures

Detection (or silent) failure

– The system's primary failure detectors do not notice the failure, so robustness mechanisms are not activated

Isolation failure

– The system indicts the wrong recoverable module and thus initiates automatic recovery action for a module that has not failed

Recovery failure

– The recovery action is not successful in recovering service

– A robust system must include layers of failure detection so that secondary or tertiary detectors notice failures that escape the primary detectors.


SLIDE 25

Robust design principle

2.10 Security, availability and system robustness

– Modern systems must also withstand deliberate attack by criminals and malevolent parties.
– Robust design and a comprehensive system security architecture assure that the system can withstand security attacks.
– 2.10.1 X.805 security architecture

International Telecommunication Union (ITU) Recommendation X.805
8 security dimensions

– Access control
– Authentication
– Non-repudiation – confirm that an operation has been performed
– Data confidentiality
– Communications security
– Data integrity
– Availability – service not compromised by malicious acts
– Privacy


SLIDE 26

Robust design principle

2.11 Procedural consideration

– Procedure related errors arise due to one or more of the following:

Documented or undocumented procedure was wrong, ambiguous or misleading
User interface was ambiguous, misleading or wrong
Human erroneously entered wrong input
Human executed wrong action, neglected to execute correct action, or executed action out of sequence
System failed to check input or system state prior to executing requested operation


SLIDE 27

Robust design principle

2.11 Procedural consideration

– Best practice for designing highly reliable procedures is to focus on three broad principles:

Minimize human interaction
Help humans to do the right thing
Minimize the impact of human error


SLIDE 28

Robust design principle

2.11 Procedural consideration

– 2.11.1 Minimize human interaction

Systems experience maintenance procedures in two ways:

– 1. Via command interface
» Commands issued by maintenance engineers with highly privileged authorization
– 2. Via direct manipulation of system HW
» E.g. insertion or removal of an FRU or network connection

Manual emergency recoveries have a higher risk of human error than nonemergency procedures because:

– Uncertainty about the state of the system – which action is most appropriate
– Time pressure to rapidly restore service – leads to poor human judgment
– Limited preparation or practice time – nonemergency procedures can be carefully planned

Minimize human maintenance interaction through automation. Benefits:

– Automated mechanisms that eliminate manual procedures also eliminate the risk of human failure
– Automated procedures have fewer steps to execute, and hence fewer opportunities for human mistakes


SLIDE 29

Robust design principle

2.11 Procedural consideration

– 2.11.2 Help human do the right thing

Well designed procedures

– Simple – no more than 7 steps; otherwise split into sub-procedures
– Clear – clear language, step-by-step instructions
– Intuitive – SW, keyed connectors, labeling and marking, color
– Similar to each other – common power up/down, terminology
– Require confirmation before service impacting operations
– Clearly report system operational state – alarms should be collected and displayed together, and must be clear, simple and consistent
– Include “safe stop points” – identify points within the procedure where it is safe to interrupt or stop the procedure


SLIDE 30

Robust design principle

2.11 Procedural consideration

– 2.11.2 Help human do the right thing

Well written and tested documentation

– System planning and installation guide
– Capacity planning guide
» Dimensioning data – how the system should be configured to support the traffic load
» Performance and throughput metrics – to easily determine the actual traffic level and service mix on the system
» System growth and reconfiguration procedures – procedures to grow the system or migrate traffic to other systems


SLIDE 31

Robust design principle

2.11 Procedural consideration

– 2.11.2 Help human do the right thing

Well written and tested documentation

– Troubleshooting tools and technical support
» Appropriate alarms
» Performance monitors
» Visual status indicators on FRUs
» Online diagnostics
» Documentation of required SW processes
» Documented emergency recovery procedures
» Technical support

Training


SLIDE 32

Robust design principle

2.11 Procedural consideration

– 2.11.3 Minimize the impact of human error

Check input parameters
Confirm service impacting and profound requests
Provide documented back out, undo or rollback mechanisms
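The three measures above can be sketched on a single hypothetical configuration setting. The `Config` class, the `max_sessions` parameter, its allowed range and the rule that capacity reductions count as service impacting are all assumptions made for illustration.

```python
class Config:
    """Hypothetical configuration object with checked, reversible changes."""

    def __init__(self, max_sessions: int):
        self.max_sessions = max_sessions
        self._history = []  # previous values, kept for rollback

    def set_max_sessions(self, value: int, confirmed: bool = False) -> None:
        # Check input parameters before touching system state.
        if not isinstance(value, int) or not 1 <= value <= 10_000:
            raise ValueError("max_sessions must be an integer from 1 to 10000")
        # Confirm service impacting requests: here, any capacity reduction.
        if value < self.max_sessions and not confirmed:
            raise PermissionError("reducing capacity requires confirmation")
        self._history.append(self.max_sessions)
        self.max_sessions = value

    def rollback(self) -> None:
        # Back out the most recent change.
        if not self._history:
            raise RuntimeError("nothing to roll back")
        self.max_sessions = self._history.pop()


config = Config(max_sessions=500)
config.set_max_sessions(800)
try:
    config.set_max_sessions(100)  # rejected: not confirmed
except PermissionError as exc:
    print(exc)
config.set_max_sessions(100, confirmed=True)
config.rollback()
print(config.max_sessions)  # 800
```

The unconfirmed reduction leaves the system state untouched, and the confirmed one can still be undone with a single rollback.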
