EMT 368 Reliability and Testability in Integrated Circuit Design

SLIDE 1

EMT 368 Reliability and Testability in Integrated Circuit Design

School of Microelectronic Engineering UniMAP

A. Harun
SLIDE 2

Course content

  • Reliability and availability concept
  • Robust design principle
  • Time and failure dependent reliability
  • Estimation methods for the parameters of the failure-time distribution

  • Parametric reliability model
  • Overview of testing
  • Ad-hoc techniques
  • Scan-path design
  • Boundary scan testing
  • Built-in self test (BIST)

SLIDE 3

CHAPTER 2 Robust Design Principle

SLIDE 4

Chapter 2 – Robust Design Principle

  • Unit of design
  • Failure recovery groups
  • Redundancy
  • Robust design principles
  • Robust protocols
  • Robust concurrency controls
  • Overload control
  • Process, resource and throughput monitoring
  • Data auditing
  • Fault correlation
  • Failed error detection, isolation or recovery
  • Geographic redundancy
  • Security, availability and system robustness
  • Error detection

SLIDE 5

Robust design principle

  • 2.1 Unit of design

    – Hardware (HW) and software (SW) are organized into small components or modules.
    – The system architecture or design defines how the components come together to form the system.
    – Each unit can be thought of as a logical container:
      • It accepts logical input. On success it returns correct output; on failure or inconsistent input it reports an error or exception.
      • After a major fault it may hang or become unresponsive.
      • Units are organized into a hierarchical design.

SLIDE 6

Robust design principle

SLIDE 7

Robust design principle

  • 2.1 Unit of design

    – Logical containers from biggest to smallest (network application):
      • Application
      • User session
      • Message request
        – A protocol message or request
        – Any error found in a request is contained within this container
      • Transactions
      • Robust exception handling
      • Subroutines
        – A natural fault container

SLIDE 8

Robust design principle

  • 2.1 Unit of design

    – Logical containers from biggest to smallest (hardware and platform software):
      • System
        – Recovery at this level requires restarting the entire system
      • Field-replaceable unit (FRU)
        – Modular, e.g. blade server
      • Processor
      • Process
      • Thread

SLIDE 9

Robust design principle

  • 2.2 Failure recovery group

    – A unit reports its failure to the containing unit, or the containing unit implicitly detects the failure from errant behavior.
    – What can be done?
      • Restart the errant application
      • Restart the entire operating system
      • Highly available software supports smaller recovery groups, e.g. session termination, process, etc.
    – Failure recovery groups are suites of logical entities that are designed and tested to be recoverable while the remainder of the system remains operational.
    – The most common failure recovery group is the software process, e.g. a browser or word processor.
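
The idea of recovering one group while the rest of the system keeps running can be sketched in a few lines. This is only an illustration: `Worker` and `Supervisor` are hypothetical names, not part of the course material, and a real system would restart operating-system processes rather than Python objects.

```python
from typing import Callable, Dict


class Worker:
    """Stand-in for one recoverable unit, e.g. a software process."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.handled = 0

    def handle(self) -> None:
        self.handled += 1


class Supervisor:
    """Containing unit that restarts only the member that failed."""

    def __init__(self, factories: Dict[str, Callable[[], Worker]]) -> None:
        self._factories = dict(factories)
        self.units = {name: make() for name, make in self._factories.items()}
        self.restarts = {name: 0 for name in self._factories}

    def report_failure(self, name: str) -> None:
        # Recreate just the failed unit; every other unit object is untouched.
        self.units[name] = self._factories[name]()
        self.restarts[name] += 1
```

Calling `report_failure("browser")` replaces only the browser worker; the state held by the other workers is preserved.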

SLIDE 10

Robust design principle

  • 2.2 Redundancy

    – Systems deploy redundancy to increase throughput or capacity, e.g. multiple processor cores, or memory modules on a processor board.
    – Redundancy also increases service availability, e.g. multiple engines on an airplane.
    – Redundancy in computer-based systems is implemented at three levels:
      • Process: prepare multiple processes in advance
      • FRU: e.g. compute blade FRUs in a blade server
      • Network element: e.g. more DNS servers

SLIDE 11

Robust design principle

  • 2.2 Redundancy

    – Redundant units are typically organized into one of two common arrangements:
      • Active-standby: one unit serving and one on standby
        – Hot, warm and cold are the terms used to characterize standby readiness:
          » Cold standby: the application software or operating system needs to be started or restarted before taking over
          » Warm standby: the application software is running and volatile data is periodically synchronized; time is needed to rebuild the system state
          » Hot standby: the application is running and volatile data is current
      • Load shared
        – All operational units actively serve users
        – N = number of units required, K = number of redundant units configured
        – This is called N + K load sharing
        – E.g. a commercial airplane is N + 1
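
The benefit of N + K load sharing can be illustrated numerically. The sketch below rests on an assumption that the slides do not state: every unit is up independently with the same probability `p_unit`. Under that assumption, service survives as long as at least N of the N + K units are up. The function name and the sample numbers are hypothetical.

```python
from math import comb


def service_availability(n_required: int, k_spare: int, p_unit: float) -> float:
    """Probability that at least n_required of n_required + k_spare units are up.

    Assumes units fail independently and share the same availability p_unit.
    """
    total = n_required + k_spare
    return sum(
        comb(total, up) * p_unit**up * (1 - p_unit) ** (total - up)
        for up in range(n_required, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers: 4 units needed, each 99% available.
    print(f"N + 0: {service_availability(4, 0, 0.99):.6f}")
    print(f"N + 1: {service_availability(4, 1, 0.99):.6f}")
```

With K = 0 every unit is a single point of failure; adding even one spare raises the computed availability, which is the motivation for the N + 1 arrangement.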

SLIDE 12

Robust design principle

  • 2.2 High-availability middleware
    – Recovering service onto a redundant unit: failure recovery should be fast, with no impact on the user.
    – This requires high-availability mechanisms or middleware.
    – Practical systems may use some of these:
      • IP networking mechanisms: balance network load across a cluster of servers
      • Clustering: two or more computers arranged into a pool
      • High-availability middleware: infrastructure to support synchronization, data sharing, monitoring and management of applications
      • Application checkpoint mechanisms: system restore
      • Virtual machines
      • Redundant array of inexpensive disks (RAID): arranges multiple hard disk drives, or so-called mirrors
      • Database redundancy and replication
      • File system replication

SLIDE 13

Robust design principle

  • 2.3 Robust design principle

    – Robust design principles to consider:
      • Redundant, fault-tolerant design
      • No single point of failure
      • No single point of repair
      • Hot-swappable FRUs
    – Systems with no downtime for planned activities should consider the following principles:
      • No service impact for software patch, update and upgrade
      • No service impact for hardware growth or degrowth
      • Minimal impact for system reconfiguration

SLIDE 14

Robust design principle

  • 2.3 Robust protocol

– Application protocols can be made robust by:

  • Use reliable protocols
  • Use confirmations or acknowledgements
  • Support atomic requests or transactions
  • Support timeouts and message retries
  • Use heartbeat or keep-alive mechanisms
  • Use stateless or minimal-shared state protocols
  • Support automatic reconnection
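
Several items in the list above (acknowledgements, timeouts and retries) combine naturally. A minimal sketch, assuming a hypothetical transport hook `send_once` that returns `True` when the peer acknowledges and raises `TimeoutError` when no reply arrives in time; none of these names come from the slides:

```python
import time
from typing import Callable


def send_with_retries(
    send_once: Callable[[], bool],
    max_attempts: int = 3,
    backoff_s: float = 0.0,
) -> int:
    """Call send_once until it reports an acknowledgement.

    Returns the number of attempts used; raises ConnectionError if every
    attempt times out or goes unacknowledged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if send_once():
                return attempt
        except TimeoutError:
            pass  # treat a timeout like a missing acknowledgement
        if attempt < max_attempts:
            time.sleep(backoff_s * attempt)  # simple linear backoff before retrying
    raise ConnectionError(f"no acknowledgement after {max_attempts} attempts")
```

Retrying is only safe when repeating the request cannot cause harm, which is one reason the list pairs retries with atomic requests or transactions.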

SLIDE 15

Robust design principle

  • 2.4 Robust concurrency controls

    – Concurrency controls enable applications to efficiently share resources across many users simultaneously.
    – Systems may share processor time, buffers, etc.
    – Access to critical sections controlling shared resources has to be serialized.
    – Two applications cannot access the same portion of shared memory or of a resource pool at the same time.
    – Platform mechanisms such as semaphores and mutual exclusion locks are needed for control.
    – The application should also make sure that a failed process can be restarted without restarting the entire system.
    – Concurrency controls held by a failed process need to be scanned for, to avoid standing deadlocks.
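
Serialized access to a critical section can be sketched with a mutual exclusion lock from Python's standard `threading` module. `ResourcePool` is a hypothetical example class. The sketch covers threads within one process only; locks held by a crashed operating-system process need the scanning described above and are not handled here.

```python
import threading


class ResourcePool:
    """Hypothetical shared pool guarded by a mutual exclusion lock."""

    def __init__(self, size: int) -> None:
        self._free = size
        self._lock = threading.Lock()

    def acquire(self) -> bool:
        # The 'with' block is the critical section; the lock is released
        # even if the body raises, so a failing caller cannot leave it held.
        with self._lock:
            if self._free == 0:
                return False
            self._free -= 1
            return True

    def release(self) -> None:
        with self._lock:
            self._free += 1

    @property
    def free(self) -> int:
        with self._lock:
            return self._free
```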

SLIDE 16

Robust design principle

  • 2.5 Overload control

    – An implemented system has physical hardware constraints
      • E.g. processing power, storage, I/O bandwidth
    – These translate into capacity limits under acceptable quality of service (QoS).
    – When demand for service increases beyond these limits, the system is unable to serve all requests.
    – Overload control is needed to gracefully manage traffic exceeding engineered capacity.

SLIDE 17

Robust design principle

  • 2.5 Overload control

    – Causes of system overload:
      • Unexpected popularity
      • Under-engineered system
      • Incorrectly configured system
      • External events
        – Promotions, New Year's Eve, etc.
      • Power outage and restoration
        – Spike in reconnections if reconnection is automated
      • Network equipment failure and restoration
      • System failure
        – Service is distributed across multiple systems; when one fails, its workload shifts to the others
      • Denial-of-service attack
        – Cyber vandalism, ransom

SLIDE 18

Robust design principle

  • 2.5 Overload control

    – Two features of overload control:
      • Control mechanism
        – Sheds load or traffic
      • Control triggers
        – Activate the control mechanism when congestion occurs
        – Deactivate it after congestion has ended

SLIDE 19

Robust design principle

  • 2.5 Overload control

    – Congestion detection techniques
      • Slower system response times
      • Longer work queues
      • Higher CPU utilization
        – Indicates high system stress, but the system may not be overloaded
    – Congestion control mechanisms
      • Rejecting new sessions: 'too busy' error
      • Rejecting new message requests: reject all traffic from certain users, certain types of message, etc.
      • Disconnecting live sessions: lower-priority users
      • Disabling servers or services: close some or all of the overloaded IP ports
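
A control trigger and a control mechanism can be combined in a small sketch that uses work-queue length as the congestion signal and rejection of new sessions as the load-shedding action. The class name and threshold values are hypothetical. Using a lower "off" threshold than the "on" threshold (hysteresis) is an added design assumption, not something the slides prescribe; it keeps the control from switching on and off rapidly.

```python
class OverloadController:
    """Sheds new sessions while the work queue is congested."""

    def __init__(self, on_threshold: int = 100, off_threshold: int = 60) -> None:
        if off_threshold >= on_threshold:
            raise ValueError("off_threshold must be below on_threshold")
        self.on_threshold = on_threshold
        self.off_threshold = off_threshold
        self.shedding = False

    def update(self, queue_length: int) -> None:
        """Control trigger: activate on congestion, deactivate once it ends."""
        if not self.shedding and queue_length >= self.on_threshold:
            self.shedding = True
        elif self.shedding and queue_length <= self.off_threshold:
            self.shedding = False

    def admit_new_session(self) -> bool:
        """Control mechanism: reject new sessions while shedding load."""
        return not self.shedding
```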

SLIDE 20

Robust design principle

  • 2.5 Overload control

    – Architectural considerations
      • The system's tasks should fall into three broad priority classes:
        – Low priority
          » Tasks that do not directly impact users: maintenance and background tasks (e.g. backup, audit)
        – Medium priority
          » Tasks that directly or indirectly interact with end users
        – High priority
          » Management visibility and control tasks, e.g. overload control
      • As the system saturates, low-priority tasks are deferred in favor of higher-priority tasks.
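
The three priority classes can be sketched with a priority queue from Python's standard `heapq` module. When capacity per scheduling tick is limited, high-priority tasks run first and low-priority tasks wait. `Scheduler`, `run_tick` and the task names are hypothetical illustrations.

```python
import heapq
from enum import IntEnum
from itertools import count
from typing import List


class Priority(IntEnum):
    HIGH = 0    # management visibility and control, e.g. overload control
    MEDIUM = 1  # tasks interacting with end users
    LOW = 2     # maintenance and background tasks


class Scheduler:
    def __init__(self) -> None:
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within a class

    def submit(self, priority: Priority, name: str) -> None:
        heapq.heappush(self._heap, (int(priority), next(self._seq), name))

    def run_tick(self, capacity: int) -> List[str]:
        """Run up to 'capacity' tasks; anything left is deferred to a later tick."""
        done = []
        while self._heap and len(done) < capacity:
            _, _, name = heapq.heappop(self._heap)
            done.append(name)
        return done
```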

SLIDE 21

Robust design principle

  • 2.6 Process, resource and throughput monitoring

    – Some errors may not be immediately visible in normal system operation, so they need to be detected before they become critical failures.
    – Mechanisms to proactively monitor system health:
      • Heartbeat checks of critical processes: ensure each is sane enough to respond within a reasonable time
      • Resource usage checks: process size, free space, CPU usage
      • Data audits
      • Monitoring of system throughput, performance and alarm behavior
      • Health checks of critical supporting systems: hello, keep-alive, status queries
    – These normally run as low-priority processes, but the master control needs to run as a higher-priority process.
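
A heartbeat check can be sketched as a table of last-seen timestamps compared against a timeout. `HeartbeatMonitor` and its parameters are hypothetical. The clock is injectable so the behaviour can be exercised without waiting in real time.

```python
import time


class HeartbeatMonitor:
    """Flags processes that have not sent a heartbeat within timeout_s seconds."""

    def __init__(self, timeout_s: float, clock=time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = {}

    def beat(self, process_name: str) -> None:
        # Called by (or on behalf of) a monitored process to prove it is alive.
        self._last_seen[process_name] = self._clock()

    def unresponsive(self) -> list:
        # Names of processes whose last heartbeat is older than the timeout.
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```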

SLIDE 22

Robust design principle

  • 2.7 Fault correlation

    – Failures trace back to a single fault that was activated.
    – The initial fault may have cascaded to cause detectable errors in several parts of the system.
    – In ordinary consumer and commercial systems, a human must go through the failure, debug it, then perform corrective actions.
    – In highly available systems, debugging and recovery must be automatic and fast to minimize downtime.
    – A robust system must do the following:
      • Assess fault persistence
        – Transient faults: a dropped IP packet, a spike in system workload, etc. A retry may fix the error.
        – Persistent faults: a false-negative decision creates a silent or sleeping failure, which may eventually be exposed by a secondary fault or by a human maintenance engineer.
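
One simple way to assess persistence is to retry the failed operation and see whether the error clears. The sketch below is a simplification under that assumption; the function name and the three labels are hypothetical, and a real fault manager would also weigh error type and history.

```python
from typing import Callable


def assess_persistence(operation: Callable[[], None], retries: int = 2) -> str:
    """Classify a failure by retrying the operation.

    Returns "ok" if the first attempt succeeds, "transient" if a retry
    succeeds, and "persistent" if every attempt raises.
    """
    for attempt in range(retries + 1):
        try:
            operation()
        except Exception:
            continue  # failed this time; try again if attempts remain
        return "ok" if attempt == 0 else "transient"
    return "persistent"
```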

SLIDE 23

Robust design principle

  • 2.7 Fault correlation

    – A robust system must do the following (continued):
      • Isolate the primary fault to the appropriate recoverable unit
        – The fault management system collects and correlates enough indirectly detected failures to isolate the true failure if direct failure detection does not occur quickly
      • Activate the correct recovery mechanism
        – Automatically trigger the appropriate recovery action once the failure is isolated
      • Assure successful recovery

SLIDE 24

Robust design principle

  • 2.8 Failed error detection, isolation or recovery

    – Secondary failures
      • Detection (or silent) failure
        – The system's primary failure detectors do not notice the failure, so robustness mechanisms are not activated
      • Isolation failure
        – The system indicts the wrong recoverable module and initiates automatic recovery for a module that has not failed
      • Recovery failure
        – The recovery action does not succeed in recovering service
    – A robust system must include layers of failure detection so that secondary or tertiary detectors notice failures that escape the primary detectors.

SLIDE 25

Robust design principle

  • 2.10 Security, availability and system robustness

    – Modern systems must also withstand deliberate attacks by criminals and malevolent parties.
    – Robust design and a comprehensive system security architecture assure that the system can withstand security attacks.
    – 2.10.1 X.805 security architecture
      • International Telecommunication Union (ITU) Recommendation X.805
      • 8 security dimensions
        – Access control
        – Authentication
        – Non-repudiation: confirms that an operation has been performed
        – Data confidentiality
        – Communications security
        – Data integrity
        – Availability: service is not compromised by malicious acts
        – Privacy

SLIDE 26

Robust design principle

  • 2.11 Procedural consideration

    – Procedure-related errors arise from one or more of the following:
      • Documented or undocumented procedure was wrong, ambiguous or misleading
      • User interface was ambiguous, misleading or wrong
      • Human erroneously entered wrong input
      • Human executed the wrong action, neglected to execute the correct action, or executed actions out of sequence
      • System failed to check input or system state prior to executing the requested operation

SLIDE 27

Robust design principle

  • 2.11 Procedural consideration

– Best practice for designing highly reliable procedures is to focus on three broad principles:

  • Minimize human interaction
  • Help humans to do the right thing
  • Minimize the impact of human error

SLIDE 28

Robust design principle

  • 2.11 Procedural consideration

    – 2.11.1 Minimize human interaction
      • Systems experience maintenance procedures in two ways:
        – 1. Via a command interface
          » Commands issued by maintenance engineers with highly privileged authorization
        – 2. Via direct manipulation of system hardware
          » E.g. insertion or removal of an FRU or network connection
      • Manual emergency recoveries have a higher risk of human error than nonemergency procedures because of:
        – Uncertainty about the state of the system: which action is most appropriate?
        – Time pressure to rapidly restore service: leads to poor human judgment
        – Limited preparation or practice time: nonemergency procedures can be carefully planned
      • Minimize human maintenance interaction through automation. Benefits:
        – Automated mechanisms that eliminate manual procedures also eliminate the risk of human failure
        – Automated procedures have fewer steps to execute, and hence fewer opportunities for human mistakes

SLIDE 29

Robust design principle

  • 2.11 Procedural consideration

    – 2.11.2 Help humans do the right thing
      • Well-designed procedures
        – Simple: no more than 7 steps; otherwise split into sub-procedures
        – Clear: clear language, step-by-step instructions
        – Intuitive: software, keyed connectors, labeling and marking, color
        – Similar to each other: common power-up/power-down, terminology
        – Require confirmation before service-impacting operations
        – Clearly report system operational state: alarms should be collected and displayed together, and must be clear, simple and consistent
        – Include "safe stop points": identify points within the procedure where it is safe to interrupt or stop the procedure

SLIDE 30

Robust design principle

  • 2.11 Procedural consideration

    – 2.11.2 Help humans do the right thing
      • Well-written and tested documentation
        – System planning and installation guide
        – Capacity planning guide
          » Dimensioning data: how the system should be configured to support the traffic load
          » Performance and throughput metrics: make it easy to determine the actual traffic level and service mix on the system
          » System growth and reconfiguration procedures: how to grow the system or migrate traffic to other systems

SLIDE 31

Robust design principle

  • 2.11 Procedural consideration

    – 2.11.2 Help humans do the right thing
      • Well-written and tested documentation
        – Troubleshooting tools and technical support
          » Appropriate alarms
          » Performance monitors
          » Visual status indicators on FRUs
          » Online diagnostics
          » Documentation of required software processes
          » Documented emergency recovery procedures
          » Technical support
      • Training

SLIDE 32

Robust design principle

  • 2.11 Procedural consideration

    – 2.11.3 Minimize the impact of human error
      • Check input parameters
      • Confirm service-impacting and profound requests
      • Provide documented back-out, undo or rollback mechanisms
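
The three safeguards can be combined in one small sketch: validate the input, ask for confirmation, and roll back if the change does not verify. Everything named here (`change_setting`, `ALLOWED_RANGE`, the `max_sessions` parameter and its limits, the `confirm` and `verify` hooks) is a hypothetical illustration rather than a prescribed interface.

```python
from typing import Callable, Dict

ALLOWED_RANGE = {"max_sessions": (1, 10_000)}  # hypothetical parameter and limits


def change_setting(
    config: Dict[str, int],
    key: str,
    value: int,
    confirm: Callable[[str], bool],
    verify: Callable[[Dict[str, int]], bool],
) -> bool:
    """Apply one configuration change; return True only if it took effect."""
    # 1. Check input parameters before touching the system.
    if key not in ALLOWED_RANGE:
        raise ValueError(f"unknown setting: {key}")
    low, high = ALLOWED_RANGE[key]
    if not low <= value <= high:
        raise ValueError(f"{key} must be between {low} and {high}")

    # 2. Confirm the service-impacting request.
    if not confirm(f"Change {key} from {config.get(key)} to {value}?"):
        return False

    # 3. Apply, and roll back to the saved state if the change does not verify.
    backup = dict(config)
    config[key] = value
    if not verify(config):
        config.clear()
        config.update(backup)
        return False
    return True
```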
