
SLIDE 1

Hardware Reliability of Embedded Systems: Are We There Yet?

Bashir M. Al-Hashimi, FREng, FIEEE

March 19th 2014

PAnDA - Programmable Digital and Analogue Array

York, 18-19 March 2014

SLIDE 2

Overview

Where are we?

– academic and industrial research highlights

Where are we heading to?

– personal perspectives

SLIDE 3

Hardware Reliability

Reliability* as described by IBM

– Computers designed with reliability to protect data integrity and stay available for long periods of time without failure

Unreliability sources

– Logic faults: radiation
– Timing faults: transistor wear-out

Exacerbated by

– Low power design
– Technology scaling
– Process variation

* Wikipedia

SLIDE 4

Hardware Reliability Trends

Critical charge of flip-flops for 45nm node*

Voltage scaling and process variation degrade reliability

* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and S. Idgunji, “Reliable State Retention-Based Embedded Processors Through Monitoring and Recovery,” IEEE TCAD, vol. 30, no. 12, pp. 1773–1785, Dec. 2011.

SLIDE 5

Where Does Reliability Matter?


Source: ARM

SLIDE 6

Embedded Systems Reliability

[Figure: block diagram of a multiprocessor embedded system. Processor #1 … Processor #n (each with control logic, data path, register files and cache) connect through an interconnect to Memory #1 … Memory #n and peripherals.]

SLIDE 7

Where are we in dealing with hardware reliability?


SLIDE 8

Reliability Publications

[Figure: number of reliability publications per year, 2000–2011, from both academia and industry]

9000+ publications over the past 12 years

[Figure: reliability conference publications in 2011 by venue: DATE, DAC, ICCAD, ASPDAC, DSN]

SLIDE 9

Academic & Industrial Research Examples

Hazucha and Svensson, Impact of CMOS technology scaling on the atmospheric neutron soft error rate, IEEE Trans. Nuclear Science, 2000 (citations > 330)

Srinivasan, The impact of technology scaling on lifetime reliability, DSN’04 (citations > 350)

Intel: Borkar et al., Parameter variations and impact on circuits and microarchitecture, DAC’03 (citations > 1000)

IBM: Ziegler et al., “IBM experiments in soft fails in computer electronics (1978–1994),” IBM Journal of Research and Development, vol. 40, no. 1, pp. 3–18, Jan. 1996 (citations > 400)

TI: McPherson, Reliability challenges for 45nm and beyond, DAC’06 (citations > 330)

SLIDE 10

Reliability Research Approaches

Hardware approach
– Redundancy (DMR, TMR, ECC, Parity, etc.)

Software approach
– Compilers
– Operating System (scheduling, mapping)
– Runtime Management

SLIDE 11

Tried and Tested Method

Triple modular redundancy
High cost rules out this method

[Figure: Module 1, Module 2 and Module 3 feeding a voting MUX]
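
As an illustration of the triple modular redundancy scheme named on this slide, the following is a minimal Python sketch of three module copies feeding a bitwise majority voter. The function names, the fault-injection parameters (`faulty_copy`, `flip_mask`) and the example module are hypothetical and are not taken from the talk.

```python
from typing import Callable, Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of the three module outputs."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(module: Callable[[int], int], x: int,
            faulty_copy: Optional[int] = None, flip_mask: int = 0) -> int:
    """Run three copies of `module` on x and vote on the result.

    faulty_copy / flip_mask are illustrative only: they flip bits in one
    copy's output to emulate a fault in a single module.
    """
    outputs = [module(x) for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] ^= flip_mask  # corrupt exactly one copy
    return majority_vote(*outputs)


# A fault in one copy is masked by the other two.
result = run_tmr(lambda v: v + 1, 4, faulty_copy=1, flip_mask=0b100)
```

The sketch also makes the cost argument visible: every result needs three executions plus a voter, which is the overhead the slide refers to.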

SLIDE 12

Low-Cost Hardware Methods: Examples

Selective duplication (timing faults)
– only insert RAZOR flip-flops in critical paths
Re-use existing circuitry (logic faults)
– scan flip-flops in BISER
– idle register files for redundancy

[Figure panels: RAZOR, BISER, Register files]

* Ernst et al., “Razor: a low-power pipeline based on circuit-level timing speculation,” MICRO-36, pp. 7–18, 2003.
* Mitra et al., “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43–52, 2005.
* Memik et al., “Increasing Register File Immunity to Transient Errors,” in DATE05, pp. 586–591.

SLIDE 13

Low-Cost HW-SW Method: Example

Hardware detection
– Parity through scan-chains
Software correction
– Interrupt service routine as firmware

S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and G. V. Merrett, “Improved State Integrity of Flip-Flops for Voltage Scaled Retention Under PVT Variation,” IEEE TCAS-I: Regular Papers, vol. 1, pp. 1–9, 2013.
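
The split between hardware detection and software correction can be sketched roughly as below. This is an assumption-laden toy model, not the method of the cited paper: the class name, the single parity bit over the whole state, and the software-kept checkpoint used by the recovery routine are all hypothetical.

```python
def parity(bits):
    """Even-parity bit over a list of 0/1 flip-flop values."""
    return sum(bits) % 2


class RetentionRegisters:
    """Toy model: parity is the 'hardware' check, recovery_isr the 'firmware'."""

    def __init__(self, state):
        self.state = list(state)
        self.saved_parity = None
        self.checkpoint = None

    def enter_retention(self):
        # Assumption: parity is computed while shifting state through the
        # scan chain, and software keeps a copy of the state to restore from.
        self.saved_parity = parity(self.state)
        self.checkpoint = list(self.state)

    def wake_up(self):
        """Return True if corruption was detected and recovery was run."""
        if parity(self.state) != self.saved_parity:
            self.recovery_isr()
            return True
        return False

    def recovery_isr(self):
        # Software correction: restore the last known-good state.
        self.state = list(self.checkpoint)


regs = RetentionRegisters([1, 0, 1, 1])
regs.enter_retention()
regs.state[2] ^= 1          # emulate a single bit flip during retention
recovered = regs.wake_up()
```

A single parity bit only detects an odd number of flipped bits, so this sketch would miss double-bit upsets.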

SLIDE 14

Software Approach

The hardware approach emphasizes detection and correction; the software approach emphasizes software failure prevention

SLIDE 15

Unreliable Hardware: Software Approach

Compilers

– Improves software program reliability by quantifying vulnerability of instructions
– Instruction scheduling impacts vulnerable periods of instruction’s variables
– Reduce critical instructions’ occupancy in pipeline and their operands’ vulnerable periods
– Schedule instruction with highest vulnerability first

[Figure: compiler flow. Input: source code. Stages: analysis of vulnerable periods of processor register variables, estimation of program reliability, reliability-optimised instruction scheduling. Output: reliability-aware binary.]

J. Henkel et al., “RAISE: Reliability-Aware Instruction Scheduling”
T. Jones, Energy-aware compilers, Cambridge University, http://www.cl.cam.ac.uk/~tmj32/
S. Garg et al., Cross-layer reliability modelling and optimisation for embedded systems under PV, Tutorial, CODES-ISSS 2013
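
The "highest vulnerability first" rule can be illustrated with a small list scheduler. This is a sketch under stated assumptions, not the RAISE algorithm: the instruction names, the dependency sets and the vulnerability scores are made up, and a real compiler would derive the scores from its own vulnerability analysis.

```python
def schedule(instrs, deps, vulnerability):
    """Greedy list scheduling: among ready instructions, pick the most vulnerable.

    instrs: instruction names in program order
    deps: name -> set of instructions that must be scheduled first
    vulnerability: name -> score (hypothetical; higher means more vulnerable)
    """
    remaining = set(instrs)
    scheduled = []
    scheduled_set = set()
    while remaining:
        ready = [i for i in instrs
                 if i in remaining and deps.get(i, set()) <= scheduled_set]
        if not ready:
            raise ValueError("cyclic dependencies")
        pick = max(ready, key=lambda i: vulnerability[i])
        scheduled.append(pick)
        scheduled_set.add(pick)
        remaining.remove(pick)
    return scheduled


order = schedule(
    ["ld", "add", "mul", "st"],
    {"add": {"ld"}, "mul": {"ld"}, "st": {"add", "mul"}},
    {"ld": 0.2, "add": 0.5, "mul": 0.9, "st": 0.1},
)
```

Here `mul` is issued before `add` once both are ready, because it carries the higher score; data dependencies are never violated.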

SLIDE 16

Unreliable Hardware: Software Approach

Operating Systems

Heuristics decide on mapping of application tasks to processors, scheduling and FT policies to meet the reliability requirement

Many heuristics have been proposed, for example:

V. Izosimov, P. Pop, and P. Eles, “Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems,” DATE05, pp. 864–869.
R. Shafik, B.M. Al-Hashimi, K. Chakrabarty, “Soft error-aware design optimisation of low power and time-constrained embedded systems”, DATE10, pp. 1462–1467.

[Figure: design flow. Inputs: tasks, task reliability profile, reliability requirement. Mapping (duplication) and scheduling (re-execution) feed reliability analysis; if the "Reliable?" check fails the flow iterates, and if it passes the tasks go to hardware platform execution.]
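
One way to picture the analyse-and-iterate loop on this slide is a greedy heuristic that adds re-executions until the reliability requirement is met or the deadline is exhausted. The sketch below is not taken from the cited papers. It assumes a single processor, independent faults that are always detected, and task reliability `1 - p**(k + 1)` for a task with per-run failure probability `p` and `k` re-executions; all numbers are hypothetical.

```python
def system_reliability(p_fail, reexec):
    """Probability that every task succeeds within its allowed attempts."""
    r = 1.0
    for p, k in zip(p_fail, reexec):
        r *= 1.0 - p ** (k + 1)
    return r


def add_reexecutions(p_fail, exec_time, deadline, target):
    """Return re-executions per task, or None if the requirement cannot be met."""
    reexec = [0] * len(p_fail)
    while system_reliability(p_fail, reexec) < target:   # "Reliable?" check
        best = None
        for i in range(len(p_fail)):
            trial = reexec.copy()
            trial[i] += 1
            # Worst-case schedule length if every re-execution is used.
            length = sum(t * (k + 1) for t, k in zip(exec_time, trial))
            if length > deadline:
                continue
            r = system_reliability(p_fail, trial)
            if best is None or r > best[0]:
                best = (r, i)
        if best is None:
            return None   # fail: no feasible improvement left
        reexec[best[1]] += 1
    return reexec


plan = add_reexecutions([0.1, 0.01], [2, 3], deadline=10, target=0.97)
```

In this example the least reliable task receives the single re-execution, which lifts the system reliability from 0.891 to about 0.98 within the deadline.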

SLIDE 17

Industry Pragmatic Approach to Reliable Processors

(every bit matters; users are willing to pay)


SLIDE 18

ARM Cortex-R Series

Dual core lock-step configuration*: two identical cores run the same set of operations and their outputs are compared. If a difference is detected, the cores are rolled back to the last correct operation

Pipelines, caches and memories are protected with ECC

* http://www.arm.com/products/processors/cortex-r/cortex-r4.php
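
The compare-and-roll-back behaviour described above can be mimicked in a few lines. This is a conceptual sketch only and does not model the Cortex-R hardware: the program representation, the `fault_at` injection parameter and the retry counter are hypothetical.

```python
def lockstep_run(program, state, fault_at=None):
    """Execute each operation on two 'cores' and compare the results.

    program: list of functions int -> int
    fault_at: step index at which a one-off bit flip is injected into core B
              (illustrative only)
    Returns (final_state, number_of_rollbacks).
    """
    checkpoint = state          # last state both cores agreed on
    rollbacks = 0
    step = 0
    while step < len(program):
        op = program[step]
        out_a = op(checkpoint)
        out_b = op(checkpoint)
        if fault_at == step:
            out_b ^= 1          # transient fault in core B
            fault_at = None     # the fault does not repeat on retry
        if out_a != out_b:
            rollbacks += 1      # mismatch: retry from the checkpoint
            continue
        checkpoint = out_a
        step += 1
    return checkpoint, rollbacks


program = [lambda s: s + 3, lambda s: s * 2]
final_state, rollbacks = lockstep_run(program, 1, fault_at=1)
```

Comparison alone tells the system that the two cores disagree, not which one is wrong, which is why the sketch re-executes rather than picking a winner.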

SLIDE 19

Oracle/Fujitsu: SPARC64

Error detection in execution units and interconnect using data and address parity*

Recovery via instruction re-execution
ECC in L1D and L2 caches

* Ando et al., “A 1.3-GHz fifth-generation SPARC64 microprocessor”, JSSC, 38(11), 1896–1905, 2003.
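
To make the ECC idea on this slide concrete, here is a toy Hamming(7,4) encoder and decoder that corrects any single flipped bit. It is an illustration of the principle only; the source does not say which code SPARC64 uses, and production caches typically use wider codes.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 bits: positions 1, 2 and 4 hold parity."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Return (data bits, syndrome); a non-zero syndrome is the corrected position."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                        # single-bit upset at position 6
data, position = hamming74_decode(word)
```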

SLIDE 20

IBM Power7

Core
– Hardened latches
– Spare cores
– Re-execution, task migration

Memory
– Tag un-correctable errors
– Dynamic sparing

Interconnects
– ECC-protected interconnect between cluster nodes
– Redundant paths

* Kalla et al., “Power7: IBM's next-generation server processor,” IEEE Micro, 2010.

SLIDE 21

Where are we heading to?

Personal Perspectives (Automation, Cross-layer)


SLIDE 22

Reliability/Safety Standards

IEC 61508 (meta-standard)
IEC 60601 (medical equipment)
IEC 61511 (process industry)
IEC 62061 (machinery)
IEC 50156 (furnaces)
IEC 60880 (nuclear power stations)
ISO 26262 (automotive)
RTCA DO-178B / DO-254 (aerospace)
EN 50128 (railway)

Source: YOGITECH

SLIDE 23

ISO 26262 and RIIF

ISO 26262: automotive safety standard for functional safety of electronic systems in vehicles

– Focuses on risks arising from random hardware faults and systematic faults in HW/SW development

Reliability Information Interchange Format (RIIF): IEEE initiative to develop HW reliability modeling language

– EDA tools to analyze reliability models to compute failure rates

* Standards for specifying and modeling the reliability of complex electronic systems, 1st RIIF Workshop, DATE2013
* Evans et al., RIIF: Reliability Information Interchange Format, On-Line Testing Symposium, 2012
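
As a rough idea of what "computing failure rates from a reliability model" can mean, the sketch below sums per-component rates in FIT (failures per 10^9 device-hours) for a system that fails when any component fails, and converts the total to a mean time to failure. This is not RIIF syntax; the component names and FIT values are hypothetical, and constant failure rates are assumed.

```python
def total_fit(component_fit):
    """Series system with constant failure rates: component FIT rates add."""
    return sum(component_fit.values())


def mttf_hours(fit):
    """Mean time to failure in hours for a constant rate given in FIT."""
    return 1e9 / fit


# Hypothetical per-component failure rates in FIT.
model = {"core": 50.0, "sram": 120.0, "interconnect": 30.0}
system_fit = total_fit(model)
system_mttf = mttf_hours(system_fit)
```

A tool working from a richer model would also account for derating, redundancy and masking, which this additive sketch ignores.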

SLIDE 24

Low-Power EDA: Example

Tools and standards made low-power design mainstream

UPF (Unified Power Format): IEEE standard for describing power intent in power optimization in EDA

Example of automatic insertion of power gating in RTL description

[Figure: tool flow from Design (RTL) plus a power description (e.g. UPF*) through Synthesis to Placement and Route, with three annotated steps:
1. Create power switches (labels: pg_switch, Vdd, Sw_Vdd, pg_ctrl/power)
2. Create state retention (labels: retention-enabled F/F with master F/F, slave retention latch, Vdd, sw_Vdd, D, clock, RETAIN, Q, Gnd)
3. Create output isolation (labels: iso1, pg_ctrl/nclamp, DIN, Dout)]

SLIDE 25

Where are we heading to?

Reliable Hardware EDA

[Figure: proposed reliable-hardware EDA flow. Diagram labels: Specification (performance and reliability); RTL; Synthesis; Unified Reliability Format (URF); failure mechanism (RIIF), e.g. SEU, NBTI, HCI, …; reliability analysis; reliability map (failure rates, …); fault tolerance policy (Razor, duplication, hardening, ECC); Reliable Hardware.]

SLIDE 26

Where are we heading to?

Cross-Layer: Run-Time

[Figure: cross-layer run-time view. Application layer: application threads, with application and runtime requirements. System software layer: operating system with reliability management. Hardware layer: Processor 1 … Processor N and interconnect. Monitors: performance counters and temperature. Controls: DVFS, duplication, CPU affinity.]

SLIDE 27

Run Smarter, Live Longer

Cross-layer approach enables

– Delaying transistor wear-out and improving fault mitigation
– Doing the right thing with existing resources
  – appropriate mapping of application threads to cores guided by counters and sensors
  – appropriate selection of core frequencies and voltages

Use of existing resources and power management leads to more energy being available

S. Bischoff, H. Andreas, B. Al-Hashimi, “Applying quality of experience to system optimisation”, PATMOS 2013.

SLIDE 28

Cross-layer Approach: Motivational Example


Thermal profile varies with applications

SLIDE 29

Runtime Thermal Optimization*

Application requirement: deadline per frame/task
Runtime requirement: improve lifetime reliability

[Figure: proposed approach across three layers. Application: mpeg2_dec, face_rec. System software: operating system running a Q-learning algorithm that, at each decision epoch, determines the state from temperature samples, computes the reward and selects an action. Hardware: core 0 … core n, each with a sensor, joined by an interconnect. Monitors: temperature, performance counter. Control: CPU affinity, CPU V/F.]

* Das, A.K., Shafik, R.A., Merrett, G.V., Al-Hashimi, B.M., Kumar, A. and Veeravalli, B., Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems, DAC’14
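
The determine-state, compute-reward, select-action cycle on this slide follows the shape of tabular Q-learning, which can be sketched as below. This is a generic textbook formulation, not the algorithm of the DAC’14 paper: the state and action names, the reward value and the hyperparameters are hypothetical.

```python
import random
from collections import defaultdict


class QAgent:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.q = defaultdict(float)
        self.rng = random.Random(seed)

    def select_action(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)          # explore
        return max(self.actions, key=lambda a: self.q[(state, a)])  # exploit

    def update(self, state, action, reward, next_state):
        """One Q-learning step, applied at each decision epoch."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


# Hypothetical usage: states are temperature bins, actions are V/F settings.
agent = QAgent(["low_vf", "high_vf"], alpha=0.5, gamma=0.9, epsilon=0.0)
agent.update("hot", "low_vf", reward=1.0, next_state="cool")
choice = agent.select_action("hot")
```

In a thermal controller the reward would have to combine the deadline requirement with a temperature or lifetime term; how those are weighted is a design decision the slide does not spell out.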

SLIDE 30

Runtime Thermal Optimization - results


Blue: Proposed Red: Linux on-demand

SLIDE 31

PRiME

Power-efficient, Reliable, Many-core Embedded systems

5-year project, £5.6M, started March 2013

SLIDE 32

Summary

Significant academic and industrial research to date; likely to continue

Standards and tools will speed up the design automation of reliable hardware

Runtime cross-layer approach will enable reliable, energy-efficient design of future many-core embedded systems

Are we there yet in hardware reliability?

– At the beginning: clearly not
– At the end: absolutely not
– Entering a new stage where innovation in design automation and cross-layer run-time design will provide effective solutions for future reliable systems

SLIDE 33

Thank you

اركش

谢谢

ধন্যবাদ

danke

merci

gracias grazie

ありがとう

dank u σας ευχαριστώ

धन्यवाद

감사합니다

tack takk

SLIDE 34

Runtime Reliability Management

[Figure: runtime reliability management loop in system software, between the application layer (application and runtime requirements) and the hardware platform (Processor 1 … Processor N, interconnect).]

Monitor (data gathering): logic and timing faults; temperature

Data analysis: unreliability levels
– Low: unlikely faults
– Moderate: possible faults
– Critical: imminent faults

Adaptation & decision making: operating modes
– Fault recovery
– Graceful degradation
– Low energy

Control (action): DVFS, duplication, core affinity

SLIDE 35

Reliability Research Approaches Together

Hardware approach
– Redundancy (DMR, TMR, ECC, Parity, etc.)

Software approach
– Compilers
– Operating System (scheduling, mapping)
– Runtime Management

Cross-layer