
EMT 368 Reliability and Testability in Integrated Circuit Design



  1. EMT 368 Reliability and Testability in Integrated Circuit Design
     School of Microelectronic Engineering, UniMAP
     A. Harun

  2. Course content
     • Reliability and availability concept
     • Robust design principle
     • Time and failure dependent reliability
     • Estimation methods of the parameters of failure time distribution
     • Parametric reliability model
     • Overview of testing
     • Ad-hoc techniques
     • Scan-path design
     • Boundary scan testing
     • Built-in self test (BIST)

  3. CHAPTER 2: Robust Design Principle

  4. Chapter 2 – Robust Design Principle
     • Unit of design
     • Failure recovery groups
     • Redundancy
     • Robust design principles
     • Robust protocols
     • Robust concurrency controls
     • Overload control
     • Process, resource and throughput monitoring
     • Data auditing
     • Fault correlation
     • Failed error detection, isolation or recovery
     • Geographic redundancy
     • Security, availability and system robustness
     • Error detection

  5. Robust design principle
     • 2.1 Unit of design
       – Hardware and software are organized into small components or modules.
       – The system architecture or design defines how the components come together to form the system.
       – Each unit can be thought of as a logical container:
         » It accepts logical input. On success it produces correct output; on failure or inconsistent input it returns an error or exception.
         » On a major fault it may hang or become unresponsive.
       – Units are organized into a hierarchical design.

  6. Robust design principle

  7. Robust design principle
     • 2.1 Unit of design
       – Logical containers from biggest to smallest (network application):
         » Application
         » User session
         » Message request: a protocol message or request. Any error found while handling it is contained in this container.
         » Transaction
         » Robust exception handling
         » Subroutine: a natural fault container
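
The message-request container described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `handle_request`, `serve_session` and the `"div a b"` request format are hypothetical names invented for the example, not part of the course material. The point is that a fault raised while handling one request is reported as an error for that request only, and the session keeps serving later requests.

```python
def handle_request(request):
    """Hypothetical handler: parse one text request and return its result."""
    op, a, b = request.split()
    if op == "div":
        return int(a) / int(b)
    raise ValueError(f"unknown operation: {op}")


def serve_session(requests):
    """Process each request inside its own fault container."""
    responses = []
    for request in requests:
        try:
            responses.append(("ok", handle_request(request)))
        except Exception as exc:
            # The fault stays inside this request; the session continues.
            responses.append(("error", type(exc).__name__))
    return responses


responses = serve_session(["div 6 3", "div 1 0", "mul 2 2", "div 8 2"])
```

Catching the exception at the request boundary, rather than around the whole session loop, is what keeps the smaller container from taking the larger one down with it.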

  8. Robust design principle
     • 2.1 Unit of design
       – Logical containers from biggest to smallest (hardware and platform software):
         » System: a failure here makes it necessary to restart the entire system
         » FRU (field-replaceable unit): modular, e.g. a blade in a blade server
         » Processor
         » Process
         » Thread

  9. Robust design principle
     • 2.2 Failure recovery group
       – A unit reports its failure to the containing unit, or the containing unit implicitly detects the failure from errant behavior.
       – What to do?
         » Restart the errant application.
         » Restart the entire operating system.
         » Highly available software supports smaller recovery groups, e.g. terminating a session or restarting a process.
       – Failure recovery groups are suites of logical entities that are designed and tested to be recoverable while the remainder of the system remains operational.
       – The most common failure recovery group is the software process, e.g. a browser or a word processor.

  10. Robust design principle
     • 2.2 Redundancy
       – Systems deploy redundancy to increase throughput or capacity, e.g. multiple processor cores, or memory modules added to a processor board.
       – Redundancy also increases service availability, e.g. multiple engines on an airplane.
       – Redundancy in computer-based systems is implemented at three levels:
         » Process: multiple processes are prepared in advance
         » FRU: e.g. compute blade FRUs in a blade server
         » Network element: e.g. multiple DNS servers

  11. Robust design principle
     • 2.2 Redundancy
       – Redundant units are typically organized into one of two common arrangements:
         » Active-standby: one unit is serving and one is on standby. Hot, warm and cold are the terms used to characterize the readiness of the standby unit.
           · Cold standby: the application software or operating system needs to be started first.
           · Warm standby: the application software is running and volatile data is periodically synchronized, so time is needed to rebuild the system state.
           · Hot standby: the application is running and volatile data is current.
         » Load-shared: all operational units actively serve users.
           · N = number of units required, K = number of redundant units configured
           · This is called N + K load sharing.
           · E.g. a commercial airplane is N + 1.
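
The N + K arrangement can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the even split of load across working units are assumptions made for the example. It encodes the rule that service stays available as long as at least N of the N + K configured units are still working.

```python
def service_available(required_n, redundant_k, failed_units):
    """Return True while at least N of the N + K units are still working."""
    working = required_n + redundant_k - failed_units
    return working >= required_n


def load_per_unit(total_load, required_n, redundant_k, failed_units):
    """Share the load evenly across the working units (an assumed policy)."""
    working = required_n + redundant_k - failed_units
    if working <= 0:
        raise RuntimeError("no working units left")
    return total_load / working


# N + 1 example: three units are required and one spare is configured.
one_failure_ok = service_available(3, 1, failed_units=1)
two_failures_ok = service_available(3, 1, failed_units=2)
```

With all four units working, each carries a quarter of the load; after one failure the remaining three carry a third each, which is exactly the load they were engineered for.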

  12. Robust design principle
     • 2.2 High-availability middleware
       – Service is recovered onto a redundant unit; failure recovery should be fast, with no impact on the user.
       – This requires a high-availability mechanism or middleware.
       – Practical systems may use some of these:
         » IP networking mechanisms: balance network load across a cluster of servers
         » Clustering: two or more computers arranged into a pool
         » High-availability middleware: infrastructure to support synchronization, data sharing, monitoring and management of applications
         » Application checkpoint mechanisms: system restore
         » Virtual machines
         » Redundant array of inexpensive disks (RAID): arranges multiple hard disk drives, for example as mirrors
         » Database redundancy and replication
         » File system replication

  13. Robust design principle
     • 2.3 Robust design principle
       – Robust design principles to consider:
         » Redundant, fault-tolerant design
         » No single point of failure
         » No single point of repair
         » Hot-swappable FRUs
       – Systems that need no downtime for planned activities should consider the following principles:
         » No service impact for software patch, update and upgrade
         » No service impact for hardware growth or degrowth
         » Minimal impact for system reconfiguration

  14. Robust design principle
     • 2.3 Robust protocol
       – Application protocols can be made robust by:
         » Using reliable protocols
         » Using confirmations or acknowledgements
         » Supporting atomic requests or transactions
         » Supporting timeouts and message retries
         » Using heartbeat or keep-alive mechanisms
         » Using stateless or minimal-shared-state protocols
         » Supporting automatic reconnection
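
Two items from this list, acknowledgements and timeouts with message retries, can be sketched together. The example is a minimal illustration, not a real network client: `send_with_retries` and the `FlakyLink` test double are hypothetical names, and `send` is assumed to be any callable that returns an acknowledgement or raises `TimeoutError` when no reply arrives in time.

```python
def send_with_retries(send, message, max_attempts=3):
    """Call send(message) until it is acknowledged or attempts run out.

    Returns the acknowledgement together with the attempt number that
    succeeded.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message), attempt
        except TimeoutError as exc:
            last_error = exc  # no acknowledgement in time: try again
    raise ConnectionError(
        f"no acknowledgement after {max_attempts} attempts"
    ) from last_error


class FlakyLink:
    """Test double that times out a fixed number of times, then replies."""

    def __init__(self, failures):
        self.failures = failures

    def send(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("no reply")
        return f"ACK:{message}"


ack, attempts = send_with_retries(FlakyLink(2).send, "hello")
```

Retrying is only safe when the request is atomic or idempotent, because a timed-out request may have been processed even though its acknowledgement was lost.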

  15. Robust design principle
     • 2.4 Robust concurrency controls
       – Concurrency controls enable applications to efficiently share resources across many users simultaneously.
       – A system may share processor time, buffers, etc.
       – Access to critical sections that control shared resources has to be serialized.
       – Two applications cannot access the same portion of shared memory or a resource pool at the same time.
       – Platform mechanisms such as semaphores and mutual exclusion locks are needed for this control.
       – The application should also make sure that a failed process can be restarted without restarting the entire system.
       – Concurrency controls held by a failed process need to be scanned for and released so that no deadlock is left standing.
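
Serializing access to a critical section with a mutual exclusion lock looks like this in Python. It is a minimal sketch using the standard `threading` module; `SharedCounter` and `run_workers` are names chosen for the example, and threads stand in for the processes or applications discussed above.

```python
import threading


class SharedCounter:
    """A shared resource whose update is protected by a mutex."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Critical section: only one thread at a time may update the value.
        with self._lock:
            self.value += 1


def run_workers(counter, workers=4, increments=10_000):
    """Start several threads that all update the same counter."""

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = run_workers(SharedCounter())
```

The `with` statement releases the lock even if the critical section raises an exception, which is a small-scale version of making sure a failed holder does not leave a lock standing.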

  16. Robust design principle
     • 2.5 Overload control
       – Every implemented system has physical hardware constraints, e.g. processing power, storage, I/O bandwidth.
       – These translate into capacity limits under an acceptable quality of service (QoS).
       – When demand for service rises beyond that capacity, the system is unable to serve all requests.
       – Overload control is needed to gracefully manage traffic that exceeds the engineered capacity.

  17. Robust design principle
     • 2.5 Overload control
       – Causes of system overload:
         » Unexpected popularity
         » Under-engineered system
         » Incorrectly configured system
         » External events: promotions, New Year's Eve, etc.
         » Power outage and restoration: a spike in reconnections if reconnection is automated
         » Network equipment failure and restoration
         » System failure: when service is distributed across multiple systems, one failure shifts the workload to the others
         » Denial-of-service attack: cyber vandalism, ransom

  18. Robust design principle
     • 2.5 Overload control
       – Two features of overload control:
         » Control mechanism: sheds load or traffic
         » Control triggers: activate the control mechanism when congestion occurs and deactivate it after the congestion has ended

  19. Robust design principle
     • 2.5 Overload control
       – Congestion detection techniques:
         » Slower system response times
         » Longer work queues
         » Higher CPU utilization: indicates high system stress, although the system may not actually be overloaded
       – Congestion control mechanisms:
         » Rejecting new sessions: return a "too busy" error
         » Rejecting new message requests: reject all traffic from certain users, certain types of message, etc.
         » Disconnecting live sessions: lower-priority users first
         » Disabling servers or services: close some or all of the overloaded IP ports
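
A control trigger and a control mechanism can be combined in a short sketch. The example below uses work-queue length as the congestion signal and rejecting new sessions as the mechanism. The class name, the threshold values and the use of two separate thresholds (so the control does not flap on and off around a single value) are assumptions made for illustration, not something the slides prescribe.

```python
class OverloadControl:
    """Shed new sessions while the work queue shows congestion."""

    def __init__(self, activate_at=100, deactivate_at=60):
        self.activate_at = activate_at      # trigger: congestion detected
        self.deactivate_at = deactivate_at  # trigger: congestion has ended
        self.shedding = False

    def update(self, queue_length):
        """Apply the control triggers to the latest queue measurement."""
        if queue_length >= self.activate_at:
            self.shedding = True
        elif queue_length <= self.deactivate_at:
            self.shedding = False
        return self.shedding

    def admit_new_session(self, queue_length):
        """Control mechanism: reject new sessions while shedding load."""
        if self.update(queue_length):
            return "too busy"
        return "accepted"


control = OverloadControl()
decisions = [control.admit_new_session(q) for q in (50, 120, 80, 60)]
```

In this run the queue length of 80 is still rejected because the control stays active until the queue has drained to the lower threshold.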

  20. Robust design principle
     • 2.5 Overload control
       – Architectural considerations:
         » System tasks should fall into three broad priority classes:
           · Low priority: tasks that do not directly impact users, such as maintenance and background tasks (e.g. backup, audit)
           · Medium priority: tasks that directly or indirectly interact with end users
           · High priority: management visibility and control tasks, e.g. overload control
         » As the system saturates, low-priority tasks are deferred in favor of higher-priority ones.
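
The three priority classes can be illustrated with a small scheduler built on the standard `heapq` module. This is a sketch under assumptions: `PriorityScheduler`, the numeric priority constants and the idea of a fixed task budget per cycle are invented for the example. When the budget is smaller than the number of pending tasks, the low-priority work is what gets deferred.

```python
import heapq
import itertools

HIGH, MEDIUM, LOW = 0, 1, 2  # smaller number = served first


class PriorityScheduler:
    """Run pending tasks in priority order within a limited budget."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # keeps FIFO order within a class

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def run(self, budget):
        """Run at most `budget` tasks; whatever remains is deferred."""
        done = []
        while self._heap and len(done) < budget:
            _, _, name = heapq.heappop(self._heap)
            done.append(name)
        deferred = [name for _, _, name in sorted(self._heap)]
        return done, deferred


scheduler = PriorityScheduler()
scheduler.submit(LOW, "backup")
scheduler.submit(MEDIUM, "user request")
scheduler.submit(HIGH, "overload control")
scheduler.submit(LOW, "audit")
done, deferred = scheduler.run(budget=2)
```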

  21. Robust design principle
     • 2.6 Process, resource and throughput monitoring
       – Some errors may not be immediately visible in normal system operation, so they need to be detected before they become critical failures.
       – Mechanisms to proactively monitor system health:
         » Heartbeat checks of critical processes: ensure each is sane enough to respond within a reasonable time
         » Resource usage checks: process size, free space, CPU usage
         » Data audits
         » Monitoring of system throughput, performance and alarm behavior
         » Health checks of critical supporting systems: hello, keep-alive, status queries
       – These checks normally run as low-priority processes, but the master control needs to run as a higher-priority process.
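
A heartbeat check of critical processes can be sketched as follows. The `HeartbeatMonitor` class and the process names are hypothetical, and timestamps are passed in explicitly (rather than read from a clock) so the example stays deterministic. A process is flagged once it has been silent for longer than the allowed timeout.

```python
class HeartbeatMonitor:
    """Track the last heartbeat of each critical process."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}

    def beat(self, process_name, now):
        """Record a heartbeat received from a process at time `now`."""
        self._last_seen[process_name] = now

    def unresponsive(self, now):
        """Return the processes that have been silent for too long."""
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5)
monitor.beat("database", now=0)
monitor.beat("web", now=0)
monitor.beat("web", now=8)
late = monitor.unresponsive(now=10)
```

In a real system the monitor itself would run at a higher priority than the processes it watches, so that an overloaded system can still detect which of them has stopped responding.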
