Reliability Advanced Topics in Computer Architecture Timothy Jones

Historic reliability

Silicon trends

Why we care now • Microprocessors are increasingly used in situations where we want to be sure of their correctness • Self-driving cars, nuclear power stations, medical devices, etc • Many industrial sectors mandate the use of error-detection strategies • For example, ASIL standards in automotive • With increased susceptibility to faults, even non-safety-critical computing starts to require fault tolerance https://perspectives.mvdirona.com/2009/10/you-really-do-need-ecc-memory/

Hard errors • Permanent errors that affect operation • Caused by device wearout in the field • Can also arise from manufacturing variability

Soft errors • Transient errors that can affect operation • They are transient because their effects don’t last • They are not repeatable • Caused by • Alpha particle strikes • Cosmic rays!

Error manifestation • Bit not read: benign fault; no error • Bit read, protected with detection and correction: no error • Bit read, protected with detection only: detected unrecoverable error (DUE) • Bit read, no protection, affects program output: silent data corruption (SDC) • Bit read, no protection, does not affect program output: benign fault; no error
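
The decision tree on this slide can be sketched as a small classifier. This is an illustrative sketch only: the names `Outcome`, `Protection` and `classify_fault` are hypothetical and do not come from the slides.

```python
from enum import Enum


class Protection(Enum):
    """Kind of protection covering the faulty bit (hypothetical names)."""
    NONE = 0
    DETECT_ONLY = 1
    DETECT_AND_CORRECT = 2


class Outcome(Enum):
    """Possible outcomes of a bit fault, following the slide's flowchart."""
    BENIGN = "benign fault; no error"
    CORRECTED = "no error (detected and corrected)"
    DUE = "detected unrecoverable error (DUE)"
    SDC = "silent data corruption (SDC)"


def classify_fault(bit_read: bool, protection: Protection,
                   affects_output: bool) -> Outcome:
    """Classify a single bit fault by walking the decision tree."""
    if not bit_read:
        # A faulty bit that is never read cannot influence execution.
        return Outcome.BENIGN
    if protection is Protection.DETECT_AND_CORRECT:
        return Outcome.CORRECTED
    if protection is Protection.DETECT_ONLY:
        # The error is noticed but cannot be repaired.
        return Outcome.DUE
    # Unprotected bit: the outcome depends on whether output changes.
    return Outcome.SDC if affects_output else Outcome.BENIGN
```

For example, `classify_fault(True, Protection.NONE, True)` yields `Outcome.SDC`, the case that error-detection schemes aim to eliminate.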

Identifying vulnerabilities • We can perform an analysis of processor structures to identify vulnerable state • We identify the bits that are required for architecturally correct execution (ACE) • These bits could result in incorrect output if they were flipped • The architectural vulnerability factor (AVF) is a useful metric • $\mathrm{AVF} = \frac{\sum_{i=0}^{n_{\mathrm{cycles}}} \mathrm{ACEbits}_i}{n_{\mathrm{cycles}} \times n_{\mathrm{bits}}}$
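
As a rough illustration of the AVF definition, the following sketch averages per-cycle ACE bit counts over a structure's total bit-cycles. The function name `avf` and the example trace are assumptions made for illustration, not part of the slides.

```python
def avf(ace_bits_per_cycle, n_bits):
    """Architectural vulnerability factor of one hardware structure.

    ace_bits_per_cycle: number of ACE bits in the structure for each
                        observed cycle (one entry per cycle).
    n_bits:             total number of bits in the structure.
    """
    n_cycles = len(ace_bits_per_cycle)
    if n_cycles == 0 or n_bits <= 0:
        raise ValueError("need at least one cycle and one bit")
    if any(a < 0 or a > n_bits for a in ace_bits_per_cycle):
        raise ValueError("ACE bit count must lie in [0, n_bits]")
    # Sum of ACE bits over all cycles, divided by total bit-cycles.
    return sum(ace_bits_per_cycle) / (n_cycles * n_bits)


# Hypothetical 64-bit register observed for four cycles:
# fully ACE, then half ACE for two cycles, then not ACE at all.
print(avf([64, 32, 32, 0], 64))  # 0.5
```

An AVF of 0.5 here means that, on average, half of the register's bits would corrupt the output if flipped in a given cycle.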

Identifying vulnerabilities • Bits can be ACE in some cycles, not ACE in others • Registers, for example • r0 = 0x00000000feedcafe: most significant bits unACE if used as a 32-bit number • r1 = 0x1234567890123456: all ACE if read again, or all unACE if the last read has occurred • Third register = 0x????????????????: all unACE until the next cycle where this will be written to and represent r2
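
The register example suggests a simple lifetime analysis: a register's contents are ACE from a write up to the last read of that value, and unACE from the last read until the next write. The sketch below applies this rule to a trace of accesses. It is a simplification under stated assumptions: `ace_cycles` and the trace format are hypothetical, and all bits of the register are treated alike, so the 32-bit case on the slide is not modelled.

```python
def ace_cycles(events):
    """Count cycles during which one register holds ACE state.

    events: list of (cycle, op) tuples sorted by cycle, where op is
            "W" for a write to the register or "R" for a read of it.
    """
    ace = 0
    write_cycle = None  # cycle of the most recent write
    last_read = None    # cycle of the latest read of that value
    for cycle, op in events:
        if op == "W":
            # Close the previous value's lifetime: it was ACE only
            # between its write and its last read.
            if write_cycle is not None and last_read is not None:
                ace += last_read - write_cycle
            write_cycle, last_read = cycle, None
        elif op == "R":
            if write_cycle is not None:
                last_read = cycle
        else:
            raise ValueError(f"unknown operation: {op!r}")
    # Account for the final value in the trace.
    if write_cycle is not None and last_read is not None:
        ace += last_read - write_cycle
    return ace


# Hypothetical trace: the value written at cycle 12 is never read,
# so it contributes no ACE cycles.
trace = [(2, "W"), (5, "R"), (9, "R"), (12, "W"), (20, "W"), (22, "R")]
print(ace_cycles(trace))  # 9
```

Dividing the result by the number of observed cycles gives the fraction of time the register is vulnerable, for instance 9 / 30 = 0.3 over a 30-cycle window.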

Metrics • Two related metrics are often used to define reliability • The FIT rate (failures in time) • Defined as the total number of errors per billion device hours • MTTF (mean time to failure) • Represents the mean time between two errors • $\mathrm{MTTF} \sim \frac{1}{\mathrm{FIT}}$
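
Because FIT counts errors per billion device hours, the inverse relationship between the two metrics can be made concrete as MTTF in hours = 10^9 / FIT. The sketch below shows the conversion. The helper names are hypothetical, and `system_fit` assumes independent components where any single failure fails the whole system.

```python
HOURS_PER_FIT_WINDOW = 1e9  # FIT is defined per billion device hours


def mttf_hours(fit):
    """Mean time to failure in hours for a given FIT rate."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_WINDOW / fit


def fit_from_mttf(mttf):
    """FIT rate for a given mean time to failure in hours."""
    if mttf <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_FIT_WINDOW / mttf


def system_fit(component_fits):
    """Total FIT of a system, assuming independent components in series
    (an assumption of this sketch): the individual rates add."""
    return sum(component_fits)


# Hypothetical component FIT rates for illustration only.
total = system_fit([100, 250, 650])
print(total, mttf_hours(total))  # 1000 1000000.0
```

A FIT rate of 1000 therefore corresponds to an MTTF of one million hours, roughly 114 years for a single device, though a fleet of many devices will see errors far more often.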

Dual-core lockstep • In a system with dual-core lockstep, a program is run twice on different cores • Results compared at each cycle • Introduces temporal and spatial redundancy into the system • Diagram: the application and data feed both Core 0 and Core 1, and a checker compares the two cores' outputs to decide whether execution is correct
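
A toy software model can illustrate the per-cycle comparison. The sketch below is not how lockstep hardware is built: each core is reduced to an integer state advanced by a pure step function, and `run_lockstep`, `example_step` and the fault tuple are hypothetical names chosen for illustration.

```python
def example_step(state):
    """Stand-in for one cycle of execution on a core (32-bit state)."""
    return (state * 3 + 1) & 0xFFFFFFFF


def run_lockstep(step, initial_state, n_cycles, fault=None):
    """Run two copies of the same computation and compare every cycle.

    fault: optional (cycle, core, bitmask) tuple that flips bits in the
           state of core 0 or core 1 at the given cycle.
    Returns the cycle at which the checker sees a mismatch, or None if
    both cores agree for all n_cycles.
    """
    core0 = core1 = initial_state
    for cycle in range(n_cycles):
        core0 = step(core0)
        core1 = step(core1)
        if fault is not None and fault[0] == cycle:
            _, core, mask = fault
            if core == 0:
                core0 ^= mask  # inject a bit flip into core 0
            else:
                core1 ^= mask  # inject a bit flip into core 1
        if core0 != core1:     # the checker compares at each cycle
            return cycle
    return None


print(run_lockstep(example_step, 1, 10))                      # None
print(run_lockstep(example_step, 1, 10, fault=(4, 1, 0x10)))  # 4
```

The checker can only report that the cores disagree; with two copies it cannot tell which one is faulty, so this scheme provides detection rather than correction.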

Redundant multithreading • Run two versions of code and compare results • Can be a software scheme, perhaps with some hardware support • Or a purely hardware approach • Can run on different cores with one passing the other data • Or the same core, within a different SMT context
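
A purely software flavour of this idea can be sketched by running the same function in two threads and comparing the results before they are used. This is only an analogy: Python threads are not hardware SMT contexts, and `run_redundantly` and `MismatchError` are hypothetical names introduced for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


class MismatchError(RuntimeError):
    """Raised when the two redundant executions disagree."""


def run_redundantly(fn, *args):
    """Execute fn(*args) in a leading and a trailing thread, then compare.

    Returns the common result, or raises MismatchError if the two
    executions produced different values.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        leading = pool.submit(fn, *args)
        trailing = pool.submit(fn, *args)
        first, second = leading.result(), trailing.result()
    if first != second:
        raise MismatchError(f"results differ: {first!r} != {second!r}")
    return first


print(run_redundantly(sum, [1, 2, 3]))  # 6
```

Comparing only the final result keeps the check cheap but delays detection until the function returns, whereas hardware schemes typically compare at finer granularity, such as at each store.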

Taking advantage of faulty hardware • Some systems use the faulty core to provide hints to others • For example, "Necromancer: Enhancing System Throughput by Animating Dead Cores" (Ansari, Feng, Gupta and Mahlke, ISCA 2010)

Approximate computing • In certain situations we can embrace errors

Summary • Reliability is a problem that has come back to haunt us • Required for safety-critical systems • Increasingly needed / desired in others too • A variety of techniques have been developed to • Identify which parts of the core are vulnerable • Reduce vulnerability to errors by re-executing parts of the code • Embrace the unreliability for performance
