Reliability
Advanced Topics in Computer Architecture Timothy Jones
Historic reliability
Silicon trends
Why we care now
Microprocessors are increasingly used in situations where we want to be sure of their correctness
Self-driving cars,
computing starts to require fault tolerance
https://perspectives.mvdirona.com/2009/10/you-really-do-need-ecc-memory/
Bit read?
  No: Benign fault; no error
  Yes: Bit has error protection?
    Yes, detection and correction: No error
    Yes, detection: Detected unrecoverable error (DUE)
    No: Affects program
      Yes: Silent data corruption (SDC)
      No: Benign fault; no error
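The fault classification above can be read as a small decision procedure. The following is a minimal sketch in Python, assuming that a fault caught by detection-only protection is always reported as a DUE whether or not it would have changed the program's result. The names `Protection`, `Outcome` and `classify_fault` are illustrative and do not come from the slides.

```python
from enum import Enum


class Protection(Enum):
    # Hypothetical labels for the three protection cases in the flowchart.
    NONE = "none"
    DETECTION = "detection"
    DETECTION_AND_CORRECTION = "detection and correction"


class Outcome(Enum):
    BENIGN = "benign fault; no error"
    NO_ERROR = "no error"
    DUE = "detected unrecoverable error (DUE)"
    SDC = "silent data corruption (SDC)"


def classify_fault(bit_read: bool, protection: Protection,
                   affects_program: bool) -> Outcome:
    """Classify a single-bit fault by walking the decision tree."""
    if not bit_read:
        # The faulty bit is never consumed, so nothing can go wrong.
        return Outcome.BENIGN
    if protection is Protection.DETECTION_AND_CORRECTION:
        # The fault is corrected before it is used.
        return Outcome.NO_ERROR
    if protection is Protection.DETECTION:
        # Assumption: detection without correction always raises a DUE.
        return Outcome.DUE
    # Unprotected bit: the result depends on whether the program is affected.
    return Outcome.SDC if affects_program else Outcome.BENIGN
```

For example, `classify_fault(True, Protection.NONE, True)` returns `Outcome.SDC`, the case that matters most because nothing signals that the result is wrong.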
state
(ACE)
$$ AVF = \frac{\sum_{i=0}^{n_{cycles}} ACEbits_i}{n_{cycles} \times n_{bits}} $$
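A structure's architectural vulnerability factor (AVF) is the fraction of its bit-cycles that hold ACE state. The sketch below works through that arithmetic in Python, assuming we already know how many bits of the structure are ACE in each cycle. The function name `avf` and its parameters are illustrative, and the per-cycle counts in the usage note are made-up numbers.

```python
def avf(ace_bits_per_cycle, n_bits):
    """Return the AVF of a structure with n_bits bits.

    ace_bits_per_cycle holds one entry per simulated cycle: the number of
    bits in the structure that were ACE during that cycle.
    """
    n_cycles = len(ace_bits_per_cycle)
    if n_cycles == 0 or n_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    if any(b < 0 or b > n_bits for b in ace_bits_per_cycle):
        raise ValueError("ACE bit counts must lie between 0 and n_bits")
    # Total ACE bit-cycles divided by total bit-cycles.
    return sum(ace_bits_per_cycle) / (n_cycles * n_bits)
```

As a usage example, a 64-bit register observed for four cycles with 64, 32, 32 and 0 ACE bits gives `avf([64, 32, 32, 0], 64)`, which is 128 / 256 = 0.5.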
r0 = 0x00000000feedcafe: Most significant bits unACE if used as a 32-bit number
r1 = 0x1234567890123456: All ACE if read again, or all unACE if last read has
0x????????????????: All unACE until next cycle where this will be written to and represent r2
. . .
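The register annotations above suggest a lifetime-style analysis: a value is ACE from the cycle it is written until its last read, and unACE from then until the register is overwritten. The following is a rough sketch in Python under two simplifying assumptions that are not from the slides: every read is assumed to matter to the program, and only `used_bits` of the register are treated as ACE (for example 32 when a 64-bit register holds a 32-bit number). The function name and the event format are hypothetical.

```python
def ace_bit_cycles(events, used_bits):
    """Count ACE bit-cycles for one register.

    events is a list of (cycle, op) pairs sorted by cycle, where op is
    "W" for a write and "R" for a read.
    """
    ace_cycles = 0
    last_write = None  # cycle of the most recent write
    last_read = None   # cycle of the most recent read of that value
    for cycle, op in events:
        if op == "W":
            # Close the previous value's lifetime: ACE only up to its last read.
            if last_write is not None and last_read is not None:
                ace_cycles += last_read - last_write
            last_write, last_read = cycle, None
        elif op == "R":
            if last_write is not None:
                last_read = cycle
        else:
            raise ValueError(f"unknown operation {op!r}")
    # Account for the final value held in the register.
    if last_write is not None and last_read is not None:
        ace_cycles += last_read - last_write
    return ace_cycles * used_bits
```

With events `[(0, "W"), (3, "R"), (5, "R"), (9, "W"), (10, "W"), (12, "R")]` the register is ACE for cycles 0 to 5 and 10 to 12, seven cycles in total; the value written at cycle 9 is never read, so it contributes nothing.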
$$ MTTF \sim \frac{1}{FIT} $$
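The inverse relationship between MTTF and FIT can be made concrete with a short calculation. This sketch assumes the usual definition of FIT as failures per 10^9 device-hours, and assumes that the FIT rates of independent components simply add. The function names and the example rates are illustrative.

```python
HOURS_PER_FIT_PERIOD = 1e9  # FIT counts failures per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365


def mttf_hours(fit):
    """Convert a FIT rate into a mean time to failure in hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_PERIOD / fit


def system_fit(component_fits):
    """Assumption: component failures are independent, so FIT rates add."""
    return sum(component_fits)


def mttf_years(fit):
    """Convert a FIT rate into a mean time to failure in years."""
    return mttf_hours(fit) / HOURS_PER_YEAR
```

For instance, two components with made-up rates of 400 and 600 FIT combine to 1000 FIT, so `mttf_hours(system_fit([400, 600]))` is one million hours, a little over a century. Adding components raises the total FIT and so shortens the MTTF.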
different cores
Core 0 and Core 1: Application and data
Checker: Correct?
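The two-core arrangement with a checker can be sketched as redundant execution followed by a comparison. The Python below is a toy model, not the scheme from the slides: cores are modelled as plain functions, and the faulty core that flips one result bit is a hypothetical way to inject an error.

```python
def healthy_core(app, data):
    """A core that runs the application correctly."""
    return app(data)


def make_faulty_core(bit):
    """Build a core whose result has one bit flipped (injected fault)."""
    def core(app, data):
        return app(data) ^ (1 << bit)
    return core


def checked_run(app, data, core0, core1):
    """Run the same application and data on both cores and compare.

    Returns (result from core0, whether the two results agree).
    """
    result0 = core0(app, data)
    result1 = core1(app, data)
    return result0, result0 == result1
```

A run such as `checked_run(lambda x: x * 2, 21, healthy_core, make_faulty_core(0))` returns `(42, False)`. The mismatch shows that something went wrong, but with only two copies the checker cannot tell which core produced the faulty result.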
to provide hints to others
Enhancing System Throughput by Animating Dead Cores. Ansari, Feng, Gupta and Mahlke. ISCA 2010.