EMT 368 Reliability and Testability in Integrated Circuit Design
School of Microelectronic Engineering UniMAP
- A. Harun
Course content
– Reliability and availability concept
– Robust design principle
– Time and failure dependent
– HW and SW are organized into small components or modules
– The system architecture or design defines how the components come together to form the system
– Each module can be thought of as a logical container
error/exception
– Protocol message or request
– Any error found while processing it is contained within this container
– Natural fault container
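The idea of a request as a natural fault container can be sketched in a few lines. This is only an illustration under assumed names (`handle_request`, `serve` and `parse_positive_int` are hypothetical and not from the course material): an exception raised while processing one request is caught at the request boundary, so the other requests are still served.

```python
def handle_request(request, handler):
    """Process one request inside its own fault container."""
    try:
        return ("ok", handler(request))
    except Exception as exc:
        # Contain the failure: report it for this request only.
        return ("error", f"{type(exc).__name__}: {exc}")


def serve(requests, handler):
    """Serve every request; one bad request must not stop the others."""
    return [handle_request(r, handler) for r in requests]


def parse_positive_int(text):
    """Example handler: int() raises ValueError on malformed input."""
    value = int(text)
    if value <= 0:
        raise ValueError("value must be positive")
    return value


results = serve(["7", "oops", "3"], parse_positive_int)
for status, payload in results:
    print(status, payload)
```

The malformed middle request produces an error result, while the requests before and after it complete normally.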
– Necessary to restart entire system
– Modular, e.g. blade server
– A unit reports its failure to the containing unit, or the containing unit implicitly detects the failure from errant behavior
– What to do?
termination, process, etc.
– Failure recovery groups are suites of logical entities that are designed and tested to be recoverable while the remainder of the system remains operational
– The most common failure recovery group is a SW process, e.g. a browser, word processor, etc.
– Hot, warm and cold are terms that characterize readiness
» Cold standby – application SW or OS needs to be restarted
» Warm standby – application SW is running, volatile data is periodically
» Hot standby – application SW is running, volatile data is current
– All operational units actively serve users
– N = number of units required, K = number of redundant units configured
– N + K load sharing
– E.g. commercial airplane: N + 1
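The N + K arithmetic can be made concrete with a small sketch. The function names and the even split of load across in-service units are assumptions made for illustration, not something stated in the slides.

```python
def service_capacity_ok(n_required, k_redundant, failed_units):
    """True if enough units remain in service to carry the full load."""
    configured = n_required + k_redundant
    if not 0 <= failed_units <= configured:
        raise ValueError("failed_units must be between 0 and N + K")
    return configured - failed_units >= n_required


def load_per_unit(total_load, n_required, k_redundant, failed_units=0):
    """Load carried by each in-service unit, assuming an even split."""
    in_service = n_required + k_redundant - failed_units
    if in_service <= 0:
        raise ValueError("no units in service")
    return total_load / in_service


# N + 1 example: 2 units required, 1 redundant unit configured.
print(service_capacity_ok(2, 1, failed_units=1))  # one failure is tolerated
print(service_capacity_ok(2, 1, failed_units=2))  # two failures are not
print(load_per_unit(90, 2, 1), load_per_unit(90, 2, 1, failed_units=1))
```

With all three units healthy, each carries a third of the load; after one failure the two survivors carry half each, which is exactly the load the N required units were dimensioned for.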
servers
monitoring, management of apps
– Promotions, New Year's Eve, etc.
– Spike in reconnections if automated
– Service is distributed across multiple systems; when one fails, its workload shifts to the others
– Cyber vandalism, ransom
– Shed load or traffic
– Activate the control mechanism when congestion occurs
– Deactivate it after congestion has ended
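One common way to activate and deactivate an overload control is with two thresholds (hysteresis), so the control does not flap on and off around a single level. The sketch below assumes utilization is reported as a fraction between 0 and 1; the class name and the 0.9 / 0.7 thresholds are illustrative choices, not values from the course.

```python
class OverloadControl:
    """Turns load shedding on at a high watermark and off at a lower one."""

    def __init__(self, activate_at=0.9, deactivate_at=0.7):
        if not 0 <= deactivate_at < activate_at <= 1:
            raise ValueError("need 0 <= deactivate_at < activate_at <= 1")
        self.activate_at = activate_at
        self.deactivate_at = deactivate_at
        self.active = False

    def update(self, utilization):
        """Feed the latest utilization sample; returns True while shedding."""
        if not self.active and utilization >= self.activate_at:
            self.active = True       # congestion detected: start shedding
        elif self.active and utilization <= self.deactivate_at:
            self.active = False      # congestion ended: stop shedding
        return self.active


control = OverloadControl()
states = [control.update(u) for u in (0.5, 0.95, 0.8, 0.6)]
print(states)
```

Note that the 0.8 sample keeps the control active: it is below the activation level but has not yet dropped to the deactivation level.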
– High system stress, though the system may not be overloaded
certain users, certain type of message, etc
having overload.
– Low priority
» Tasks that do not directly impact users: maintenance and background tasks (e.g. backup, audit, etc.)
– Medium priority
» Tasks that directly or indirectly interact with end users
– High priority
» Management visibility and control tasks, e.g. overload control
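The three priority classes above suggest a simple shedding rule: when capacity is short, drop low priority work first and keep management and control tasks running. The following is a minimal sketch under that assumption; the task names and the `admit` function are hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0      # maintenance and background tasks
    MEDIUM = 1   # tasks that interact with end users
    HIGH = 2     # management visibility and control tasks


def admit(tasks, capacity):
    """Keep at most `capacity` tasks, shedding lower priorities first.

    sorted() is stable, so tasks of equal priority keep arrival order.
    """
    ranked = sorted(tasks, key=lambda task: task[1], reverse=True)
    return ranked[:capacity]


tasks = [
    ("backup", Priority.LOW),
    ("user_call", Priority.MEDIUM),
    ("overload_ctl", Priority.HIGH),
    ("audit", Priority.LOW),
]
admitted = [name for name, _ in admit(tasks, capacity=2)]
print(admitted)
```

With room for only two tasks, the backup and audit jobs are shed while the overload control and the user-facing task are kept.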
to respond within reasonable time
status queries
– Transient faults – IP packet dropped, spike in system workload, etc. A retry may fix the error.
– Persistent faults – a false negative decision creates a silent or sleeping failure. The fault may finally be revealed by a secondary fault or by a human maintenance engineer.
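The difference between transient and persistent faults shows up clearly in a bounded retry loop. In this sketch `TransientError`, `call_with_retry` and the simulated `make_flaky` operation are all hypothetical names introduced for illustration: a transient fault clears within the retry budget, while a persistent one exhausts it and is reported upward.

```python
import time


class TransientError(Exception):
    """A failure that may clear on its own, e.g. a dropped packet."""


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Run `operation`, retrying on TransientError up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_s)  # brief pause before the next try
    # Retries exhausted: treat the fault as persistent and report it.
    raise last_error


def make_flaky(failures):
    """Simulated operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("packet dropped")
        return "delivered"

    return operation


print(call_with_retry(make_flaky(2)))
```

Bounding the number of attempts matters: an unbounded retry would hide a persistent fault instead of escalating it.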
– The fault management system collects and correlates enough indirectly detected failures to isolate the true failure if direct failure detection does not occur quickly
– Automatically trigger the appropriate recovery action after the failure is isolated
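A very simple form of correlation is to count indirect failure reports per suspected module and only isolate a module once enough reports agree. The sketch below assumes each report names one suspected module; the threshold of 3, the module names and the recovery actions are illustrative assumptions.

```python
from collections import Counter


def isolate_failure(reports, threshold=3):
    """Return the most-suspected module once it has `threshold` reports."""
    counts = Counter(reports)
    if not counts:
        return None
    module, count = counts.most_common(1)[0]
    return module if count >= threshold else None


def recover(reports, recovery_actions, threshold=3):
    """Trigger the recovery action for the isolated module, if any."""
    module = isolate_failure(reports, threshold)
    if module is None:
        return None  # not enough evidence yet: take no action
    return recovery_actions[module]()


# Hypothetical recovery actions keyed by module name.
actions = {
    "db": lambda: "restarted db",
    "cache": lambda: "restarted cache",
}
reports = ["db", "cache", "db", "db"]
print(isolate_failure(reports), recover(reports, actions))
```

The threshold is a trade-off: a low value reacts faster but raises the chance of indicting a module that has not failed, while a high value delays recovery.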
– The system's primary failure detectors do not notice the failure, so robustness mechanisms are not activated
– The system indicts the wrong recoverable module and thus initiates automatic recovery action for a module that has not failed
– The recovery action is not successful in recovering service
– Access control
– Authentication
– Non-repudiation – confirm that an operation has been performed
– Data confidentiality
– Communications security
– Data integrity
– Availability – service not compromised by malicious acts
– Privacy
– 2.11.1 Minimize human interaction
– 1. Via command interface
» Commands issued by a maintenance engineer with highly privileged authorization
– 2. Via direct manipulation of system HW
» E.g. insertion or removal of an FRU or network connection
nonemergency procedures because:
– Uncertainty about the state of the system – which action is most appropriate
– Time pressure to rapidly restore service – leads to poor human judgment
– Limited preparation or practice time – nonemergency procedures can be carefully planned
– Automated mechanisms that eliminate manual procedures also eliminate the risk of human failure
– Automated procedures have fewer steps to execute, and hence fewer
– Simple – no more than 7 steps, otherwise split into sub-procedures
– Clear – clear language, step-by-step instructions
– Intuitive – SW, keyed connectors, labeling and marking, color
– Similar to each other – common power up/down, terminology
– Require confirmation before service-impacting operations
– Clearly report system operational state – alarms should be collected and displayed together; alarms must be clear, simple and consistent
– Include “safe stop points” – identify points within the procedure where it is safe to interrupt or stop the procedure
– System planning and installation guide
– Capacity planning guide
» Dimensioning data – how the system is to be configured to support the traffic load
» Performance and throughput metrics – so the actual traffic level and service mix on the system can easily be determined
» System growth and reconfiguration procedure – procedure to grow the system or migrate traffic to other systems
– Troubleshooting tools and technical support
» Appropriate alarms
» Performance monitors
» Visual status indicators on FRUs
» Online diagnostics
» Documentation of required SW processes
» Documented emergency recovery procedures
» Technical support