Hardware Reliability of Embedded Systems: Are We There Yet?
Bashir M. Al-Hashimi, FREng, FIEEE
March 19th 2014
PAnDA - Programmable Digital and Analogue Array
York, 18-19 March 2014
Overview
Where are we?
– academic and industrial research highlights
– personal perspectives
– Computers designed with reliability to protect data integrity and stay available for long periods of time without failure
– Logic faults
– Timing faults
* Wikipedia
Exacerbated by: low power design, technology scaling, process variation
Critical charge of flip-flops for 45nm node*
Voltage scaling and process variation degrades reliability
Processors Through Monitoring and Recovery,” IEEE TCAD, vol. 30, no. 12, pp. 1773–1785, Dec. 2011.
Source: ARM
[Figure: multiprocessor embedded system. Processors #1 to #n (each with control logic, data path, register files and cache) connect through an interconnect to memories #1 to #n and peripherals.]
[Figure: number of publications per year, 2000 to 2011]
Publications from both academia and industry
Reliability conference publications in 2011
DATE DAC ICCAD ASPDAC DSN
9000+ publications over the past 12 years
the atmospheric neutron soft error rate, IEEE Trans. Nuclear Science, 2000 (citations > 330)
DSN’04 (citations > 350)
and microarchitecture, DAC’03 (citations > 1000)
electronics (1978–1994)," IBM Journal of Research and Development, vol. 40, no. 1, pp. 3–18, Jan. 1996 (citations > 400)
DAC’06 (citations > 330)
Hardware approach Software approach
Redundancy (DMR, TMR, ECC, Parity, etc.)
(scheduling, mapping)
[Figure: modular redundancy. Modules 1, 2 and 3 feed a voting MUX.]
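The three-module voting arrangement can be illustrated in software. The following is a minimal sketch, assuming three replicas that compute the same function and a strict-majority voter; the function names and the injected bit flip are hypothetical and are not taken from the slides.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: the fault cannot be masked by voting.
        raise RuntimeError("no majority among replica outputs")
    return value


def tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the outputs."""
    return majority_vote([module(x) for module in modules])


def good(x: int) -> int:
    return x + 1


def faulty(x: int) -> int:
    # Hypothetical replica whose output suffers a single-bit upset.
    return (x + 1) ^ 0b100


# One faulty replica out of three is outvoted by the two correct ones.
result = tmr([good, faulty, good], 41)
```

With one faulty replica the voter still returns the correct value; if all three outputs disagree, the sketch raises an error instead of guessing.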
BISER
* Ernst et al, “Razor: a low-power pipeline based on circuit-level timing speculation”, MICRO-36, pp. 7–18, 2003.
* Mitra et al, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43–52, 2005.
* Memik et al, “Increasing Register File Immunity to Transient Errors,” in DATE05, pp. 586–591.
Register files
RAZOR
Scaled Retention Under PVT Variation,” IEEE TCAS-I: Regular Papers, vol. 1, pp. 1–9, 2013.
The hardware approach emphasizes detection and correction; the software approach emphasizes software failure prevention
Analysis of vulnerable periods of processor register variables
Estimation of program reliability
Reliability-optimised instruction scheduling
Compilers
― Improves software program reliability by quantifying the vulnerability of instructions
― Instruction scheduling affects the vulnerable periods of each instruction's variables
― Reduces the pipeline occupancy of critical instructions and the vulnerable periods of their operands
― Schedules the instruction with the highest vulnerability first
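The "highest vulnerability first" rule can be sketched as a greedy list scheduler. This is an illustrative sketch only, assuming each instruction already has a numeric vulnerability score and a set of data dependencies; the instruction names and scores are hypothetical, and the compiler's actual cost model is not given in the slides.

```python
from typing import Dict, List, Set


def schedule_by_vulnerability(
    vulnerability: Dict[str, float],
    deps: Dict[str, Set[str]],
) -> List[str]:
    """Greedy list scheduling: among the instructions whose dependencies
    have been issued, always issue the most vulnerable one next."""
    issued: List[str] = []
    issued_set: Set[str] = set()
    pending = set(vulnerability)
    while pending:
        ready = [i for i in pending if deps.get(i, set()) <= issued_set]
        if not ready:
            raise ValueError("unsatisfiable dependency: no instruction is ready")
        # Highest vulnerability first; ties are broken by name for determinism.
        nxt = min(ready, key=lambda i: (-vulnerability[i], i))
        issued.append(nxt)
        issued_set.add(nxt)
        pending.remove(nxt)
    return issued


# Hypothetical vulnerability scores and data dependencies.
scores = {"load_a": 0.9, "load_b": 0.2, "mul": 0.7, "add": 0.6, "store": 0.4}
dependencies = {
    "mul": {"load_a"},
    "add": {"load_a", "load_b"},
    "store": {"add"},
}
order = schedule_by_vulnerability(scores, dependencies)
```

Dependencies are always respected; vulnerability only decides which of the currently ready instructions goes first.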
Source code Reliability-aware binary
input
Compiler flow
http://www.cl.cam.ac.uk/~tmj32/
under PV, Tutorial, CODES-ISSS 2013
Operating Systems
Cost-Constrained Fault-Tolerant Distributed Embedded Systems,” DATE05, pp. 864–869.
systems”, pp. 1462–1467, DATE10
to processors, scheduling and FT policies to meet reliability requirement
Mapping (Duplication) Scheduling (re-execution) Reliability analysis Task reliability profile Reliability Requirement
Reliable? Pass Fail Input
Hardware platform execution Tasks
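The design loop in this flow (analyse reliability, compare against the requirement, then adjust the fault-tolerance policy) can be sketched as follows. This is a simplified illustration, assuming independent task failures, perfect error detection and re-execution as the only policy; the task names, success probabilities and the greedy strategy are hypothetical and are not the method of the cited papers.

```python
from math import prod
from typing import Dict


def task_reliability(p_success: float, re_executions: int) -> float:
    """Probability that at least one of (1 + re_executions) attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** (1 + re_executions)


def system_reliability(profile: Dict[str, float], extra: Dict[str, int]) -> float:
    """All tasks must succeed, so the per-task reliabilities multiply."""
    return prod(task_reliability(profile[t], extra[t]) for t in profile)


def assign_re_executions(
    profile: Dict[str, float], requirement: float, max_extra: int = 5
) -> Dict[str, int]:
    """Add re-executions to the weakest task until the requirement passes."""
    extra = {task: 0 for task in profile}
    while system_reliability(profile, extra) < requirement:
        weakest = min(profile, key=lambda t: task_reliability(profile[t], extra[t]))
        if extra[weakest] >= max_extra:
            raise ValueError("requirement not met within the re-execution budget")
        extra[weakest] += 1
    return extra


# Hypothetical task reliability profile and reliability requirement.
profile = {"sense": 0.999, "filter": 0.99, "actuate": 0.9995}
plan = assign_re_executions(profile, requirement=0.999)
```

Each pass through the loop corresponds to one "Fail" branch of the flow: the policy is strengthened and the reliability analysis is repeated.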
(every bit matters; users are willing to pay)
rolled back to the last correct operation
* http://www.arm.com/products/processors/cortex-r/cortex-r4.php
* Ando et al, “A 1.3-GHz fifth-generation SPARC64 microprocessor”, JSSC, 38(11), 1896–1905, 2003.
parity*
— Harden latches — Spare cores — Re-execution, task migration
— Tag un-correctable errors — Dynamic sparing
— ECC-protected interconnect between cluster nodes — Redundant paths
* Kalla et al. "Power7: IBM's next-generation server processor." Micro, IEEE, 2010.
Personal Perspectives (Automation, Cross-layer)
IEC 60601 (medical equipment)
IEC 61508 (meta-standard)
IEC 61511 (process industry)
IEC 62061 (machinery)
IEC 50156 (furnaces)
IEC 60880 (nuclear power stations)
ISO 26262 (automotive)
RTCA DO-178B/DO-254 (aerospace)
EN 50128 (railway)
Source: YOGITECH
safety of electronic systems in vehicles
– Focuses on risks arising from random hardware faults and systematic faults in HW/SW development
initiative to develop HW reliability modeling language
– EDA tools to analyze reliability models to compute failure rates
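As an illustration of what computing failure rates from a reliability model involves, the sketch below sums per-component failure rates for a series system and converts the total to a mean time to failure. It assumes constant failure rates expressed in FIT (failures per 10^9 device-hours); the component names and values are hypothetical, and this is not RIIF syntax.

```python
from typing import Dict

FIT_HOURS = 1e9  # one FIT is one failure per 10^9 device-hours


def system_fit(component_fit: Dict[str, float]) -> float:
    """Series system: any component failure fails the system,
    so constant failure rates add up."""
    return sum(component_fit.values())


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate."""
    return FIT_HOURS / fit


# Hypothetical per-component failure rates in FIT.
components = {"register_file": 120.0, "cache": 300.0, "control_logic": 80.0}
total_fit = system_fit(components)
lifetime_hours = mttf_hours(total_fit)
```

Redundant structures would need a different combination rule than a simple sum; the series case is only the simplest starting point.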
* Standards for specifying and modeling the reliability of complex electronic systems, 1st RIIF Workshop, DATE2013 * Evans et al, RIIF- Reliability Information interchange format, On-Line Testing Symposium, 2012
Synthesis Power description
Placement and Route Design (RTL)
iso1 pg_ctrl/nclamp DIN Dout
Retention enabled F/F Master F/F Slave Retention latch Vdd sw_Vdd D clock RETAIN Q Gnd
intent in power optimization in EDA
Where are we heading to?
Specification Performance and reliability
RTL Synthesis
Unified Reliability Format (URF)
Reliability analysis
Reliable Hardware Failure mechanism (RIIF) (eg. SEU, NBTI, HCI,….)
Reliability map
(failure rates..) Razor Duplication Hardening ECC Fault tolerance policy
Where are we heading to?
Application threads
Operating System Reliability management Interconnect
Application System Software Hardware …..………
Application and runtime requirements
Monitors Controls
Processor 1 Processor N
Performance counters and temperature DVFS, duplication, CPU affinity
– Delay transistor wear-out and improve fault mitigation
– Doing the right thing with existing resources
cores guided by counters and sensors
voltages
lead to more energy being available
Thermal profile varies with applications
core 0
System
CPU affinity CPU V/F Monitors Temperature performance counter
mpeg2_dec
Application System Software Hardware
Proposed Approach
State
Algorithm
Reward
face_rec
Application requirement: deadline per frame/task Runtime requirement: improve lifetime reliability
Temperature samples
Das, A.K., Shafik, R.A., Merrett, G.V., Al-Hashimi, B.M., Kumar, A. and Veeravalli, B., “Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems”, DAC’14
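The State / Algorithm / Reward loop can be illustrated with a generic tabular Q-learning agent. This is a sketch under stated assumptions, not the algorithm of the cited paper: the state labels, the three V/F actions, the reward shape and all constants are hypothetical.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# Hypothetical action set: three voltage/frequency settings.
ACTIONS: List[str] = ["low_vf", "mid_vf", "high_vf"]


class QAgent:
    """Generic tabular Q-learning agent with epsilon-greedy exploration."""

    def __init__(self, alpha: float = 0.5, gamma: float = 0.9,
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        self.q: DefaultDict[Tuple[str, str], float] = defaultdict(float)

    def choose(self, state: str) -> str:
        """Explore with probability epsilon, otherwise pick the best-known action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state: str, action: str, reward: float, next_state: str) -> None:
        """Standard Q-learning update towards reward + gamma * max Q(next_state, .)."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def reward(temp_c: float, met_deadline: bool) -> float:
    """Hypothetical reward: a missed deadline is penalised heavily,
    otherwise cooler operation earns a higher reward."""
    if not met_deadline:
        return -10.0
    return (80.0 - temp_c) / 10.0


agent = QAgent(epsilon=0.0)
agent.update("hot", "low_vf", reward(60.0, True), "cool")
best_action = agent.choose("hot")
```

In a real system the state would come from the temperature and performance-counter monitors, and the chosen action would be applied through the CPU V/F and affinity controls shown on the slide.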
Blue: Proposed Red: Linux on-demand
Power-efficient, Reliable, Many-core Embedded systems
5-year project, £5.6M, started March 2013
continue
hardware
design of future many-core embedded systems
Entering a new stage where innovation in design automation and cross-layer run-time design will provide effective solutions for future reliable systems
At the beginning
At the end
clearly not
absolutely not
Thank you
Data Gathering Monitor Control Action Data analysis Adaptation & decision making Logic and timing faults; temperature
Unreliability levels
Operating modes
DVFS, duplication, core affinity
Interconnect
…..………
Processor 1 Processor N
Hardware Platform Application and runtime requirements from application layer System software From Application Layer
Hardware approach Software approach
Redundancy (DMR, TMR, ECC, Parity, etc.)
(scheduling, mapping)
Cross-layer