Reliability: Advanced Topics in Computer Architecture (Timothy Jones)




slide-1
SLIDE 1

Reliability

Advanced Topics in Computer Architecture Timothy Jones

slide-2
SLIDE 2

Historic reliability

slide-3
SLIDE 3

Silicon trends

slide-4
SLIDE 4

Why we care now

  • Microprocessors are increasingly used in situations where we want to be sure of their correctness
  • Self-driving cars, nuclear power stations, medical devices, etc.
  • Many industrial sectors mandate the use of error-detection strategies
  • For example, ASIL standards in automotive
  • With increased susceptibility to faults, even non-safety-critical computing starts to require fault tolerance

https://perspectives.mvdirona.com/2009/10/you-really-do-need-ecc-memory/

slide-5
SLIDE 5

Hard errors

  • Permanent errors that affect operation
  • Caused by device wearout in-the-field
  • Can also arise from manufacturing variability
slide-6
SLIDE 6

Soft errors

  • Transient errors that can affect operation
  • They are transient because their effects don't last
  • They are not repeatable
  • Caused by
  • Alpha particle strikes
  • Cosmic rays!
slide-7
SLIDE 7

Error manifestation

  • Bit read?
    • No: benign fault; no error
    • Yes: bit has error protection?
      • Yes, detection and correction: no error
      • Yes, detection only: detected unrecoverable error (DUE)
      • No: affects program output?
        • Yes: silent data corruption (SDC)
        • No: benign fault; no error
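The outcomes on this slide can be read as a small decision function. The sketch below is one possible encoding, assuming the flow runs from "bit read?" to "error protection?" to "affects program output?"; the names `Protection` and `classify_fault` are hypothetical and not from the slides.

```python
from enum import Enum


class Protection(Enum):
    """Hypothetical labels for the protection on the faulty bit."""
    NONE = "none"
    DETECT_ONLY = "detection only"
    DETECT_AND_CORRECT = "detection and correction"


def classify_fault(bit_read: bool, protection: Protection, affects_output: bool) -> str:
    """Classify a single bit fault into one of the slide's outcomes."""
    if not bit_read:
        # The flipped bit is never consumed, so nothing goes wrong.
        return "benign fault; no error"
    if protection is Protection.DETECT_AND_CORRECT:
        # The fault is detected and repaired before use.
        return "no error"
    if protection is Protection.DETECT_ONLY:
        # The fault is caught but cannot be repaired.
        return "detected unrecoverable error (DUE)"
    # Unprotected bit: the outcome depends on whether the output changes.
    if affects_output:
        return "silent data corruption (SDC)"
    return "benign fault; no error"


print(classify_fault(True, Protection.NONE, True))
```

An unprotected bit that is read and changes the output is the dangerous case, because nothing signals that the result is wrong.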

slide-8
SLIDE 8

Identifying vulnerabilities

  • We can perform an analysis of processor structures to identify vulnerable

state

  • We identify the bits that are required for architecturally correct execution

(ACE)

  • These bits could result in incorrect output if they were flipped
  • The architectural vulnerability factor (AVF) is a useful metric

π΅π‘ŠπΊ = βˆ‘π‘œπ‘‘π‘§π‘‘π‘šπ‘“π‘‘

𝑗=0

𝐡𝐷𝐹𝑐𝑗𝑒𝑑 π‘œπ‘‘π‘§π‘‘π‘šπ‘“π‘‘ βˆ— π‘œπ‘π‘—π‘’π‘‘
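As a rough illustration of the AVF metric, the sketch below averages the number of ACE bits per cycle over the size of a structure. It assumes one ACE-bit count is recorded for each simulated cycle; the function name `avf` and the sample numbers are made up for illustration.

```python
def avf(ace_bits_per_cycle, n_bits):
    """Architectural vulnerability factor of one structure.

    ace_bits_per_cycle: number of ACE bits observed in each cycle.
    n_bits: total number of bits in the structure.
    """
    n_cycles = len(ace_bits_per_cycle)
    if n_cycles == 0 or n_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    if any(not 0 <= ace <= n_bits for ace in ace_bits_per_cycle):
        raise ValueError("ACE bits per cycle must lie between 0 and n_bits")
    # Sum of ACE bits over all cycles, divided by (cycles * bits).
    return sum(ace_bits_per_cycle) / (n_cycles * n_bits)


# Hypothetical 64-bit register observed for four cycles: fully ACE,
# then only the low 32 bits ACE for two cycles, then dead.
print(avf([64, 32, 32, 0], n_bits=64))  # 0.5
```

An AVF of 0.5 here means that, on average, half of the register's bits could corrupt the output if flipped.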

slide-12
SLIDE 12

Identifying vulnerabilities

  • Bits can be ACE in some cycles, not ACE in others
  • Registers, for example

Register   Value                Vulnerability
r0         0x00000000feedcafe   Most significant bits unACE if used as a 32-bit number
r1         0x1234567890123456   All ACE if read again, or all unACE if the last read has occurred
...        0x????????????????   All unACE until the next cycle in which this register is written and comes to represent r2

slide-13
SLIDE 13

Metrics

  • Two related metrics are often used to define reliability
  • The FIT rate (failures in time)
  • Defined as the total number of errors per billion device hours
  • MTTF (mean time to failure)
  • Represents the mean time between two consecutive errors

π‘π‘ˆπ‘ˆπΊ~ 1 πΊπ½π‘ˆ

slide-14
SLIDE 14

Dual-core lockstep

  • In a system with dual-core lockstep, a program is run twice on different cores
  • Results compared at each cycle
  • Introduces temporal and spatial redundancy into the system

[Figure: the application and data feed Core 0 and Core 1; a checker compares their results ("Correct?")]
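The checker's role can be sketched in software. The toy model below is only an illustration, not how lockstep hardware is built: `run_core` is a hypothetical stand-in for a core (a running 32-bit sum), `fault_at` is an assumed fault-injection knob, and `lockstep` plays the part of the checker.

```python
def run_core(inputs, fault_at=None):
    """Toy core: yields a running 32-bit sum, one result per cycle.

    If fault_at is given, a single bit is flipped in that cycle to
    mimic a soft error.
    """
    acc = 0
    for cycle, value in enumerate(inputs):
        acc = (acc + value) & 0xFFFFFFFF
        if cycle == fault_at:
            acc ^= 1 << 3  # injected single-bit flip
        yield acc


def lockstep(inputs, fault_at=None):
    """Run the same program on two toy cores and compare every cycle.

    Returns the first cycle with a mismatch, or None if the cores agree.
    """
    inputs = list(inputs)  # both cores must see identical data
    core0 = run_core(inputs)
    core1 = run_core(inputs, fault_at)
    for cycle, (out0, out1) in enumerate(zip(core0, core1)):
        if out0 != out1:
            return cycle
    return None


print(lockstep([1, 2, 3, 4]))              # None
print(lockstep([1, 2, 3, 4], fault_at=2))  # 2
```

With only two copies, the checker can tell that the cores disagree but not which one is wrong, so this sketch detects errors without correcting them.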

slide-15
SLIDE 15

Redundant multithreading

  • Run two versions of code and compare results
  • Can be a software scheme, perhaps with some hardware support
  • Or a purely hardware approach
  • Can run on different cores with one passing the other data
  • Or the same core, within a different SMT context
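A software flavour of this idea can be sketched with two threads, where a leading thread passes its results to a trailing thread that re-executes and compares. This is a minimal sketch under stated assumptions: `redundant_run` and `leading_fault_at` are hypothetical names, the fault is injected artificially, and `compute` is assumed to return an integer.

```python
import queue
import threading


def redundant_run(compute, inputs, leading_fault_at=None):
    """Run compute twice per input and return indices where results differ."""
    inputs = list(inputs)
    results = queue.Queue()  # leading thread passes its results here
    mismatches = []

    def leading():
        for index, value in enumerate(inputs):
            result = compute(value)
            if index == leading_fault_at:
                result ^= 1  # injected bit flip in the leading copy
            results.put((index, result))
        results.put(None)  # signal that the leading thread is done

    def trailing():
        while (item := results.get()) is not None:
            index, leading_result = item
            # Re-execute the work and compare with the leading result.
            if compute(inputs[index]) != leading_result:
                mismatches.append(index)

    threads = [threading.Thread(target=leading), threading.Thread(target=trailing)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return mismatches


print(redundant_run(lambda x: x * x, [1, 2, 3]))                      # []
print(redundant_run(lambda x: x * x, [1, 2, 3], leading_fault_at=1))  # [1]
```

The queue lets the trailing thread run slightly behind the leading one, which is one way of separating the two executions in time.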
slide-16
SLIDE 16

Taking advantage of faulty hardware

  • Some systems use the faulty core to provide hints to others
  • For example, Necromancer: Enhancing System Throughput by Animating Dead Cores (Ansari, Feng, Gupta and Mahlke, ISCA 2010)

slide-17
SLIDE 17

Approximate computing

  • In certain situations we can embrace errors
slide-19
SLIDE 19

Summary

  • Reliability is a problem that has come back to haunt us
  • Required for safety-critical systems
  • Increasingly needed or desired in others too
  • A variety of techniques have been developed to
  • Identify which parts of the core are vulnerable
  • Reduce vulnerability to errors by re-executing parts of the code
  • Embrace the unreliability for performance