Hardware Reliability of Embedded Systems: Are We There Yet? Bashir - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Hardware Reliability of Embedded Systems: Are We There Yet?

Bashir M. Al-Hashimi, FREng, FIEEE

March 19th 2014

PAnDA - Programmable Digital and Analogue Array

York, 18-19 March 2014

slide-2
SLIDE 2

Overview

  • Where are we?

– academic and industrial research highlights

  • Where are we heading?

– personal perspectives

slide-3
SLIDE 3

Hardware Reliability

  • Reliability* as described by IBM

– Computers designed for reliability protect data integrity and stay available for long periods of time without failure

  • Unreliability sources

– Logic faults (radiation)

– Timing faults (transistor wear-out)

  • Exacerbated by low-power design, technology scaling, and process variation

* Wikipedia

slide-4
SLIDE 4

Hardware Reliability Trends

[Figure: critical charge of flip-flops for the 45nm node*]

Voltage scaling and process variation degrade reliability

* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and S. Idgunji, “Reliable State Retention-Based Embedded Processors Through Monitoring and Recovery,” IEEE TCAD, vol. 30, no. 12, pp. 1773–1785, Dec. 2011.

slide-5
SLIDE 5

Where Does Reliability Matter?


Source: ARM

slide-6
SLIDE 6

Embedded Systems Reliability

[Figure: multiprocessor embedded system. Processors #1 to #n (each with control logic, data path, register files, and cache), memories #1 to #n, and peripherals, all connected through an interconnect]

slide-7
SLIDE 7

Where are we in dealing with hardware reliability?


slide-8
SLIDE 8

Reliability Publications

[Figure: number of reliability publications per year, 2000 to 2011 (vertical axis from 200 to 1600)]

Publications from both academia and industry

9000+ publications over the past 12 years

[Figure: reliability conference publications in 2011, by venue: DATE, DAC, ICCAD, ASPDAC, DSN]

slide-9
SLIDE 9

Academic & Industrial Research Examples

  • Hazucha and Svensson, Impact of CMOS technology scaling on the atmospheric neutron soft error rate, IEEE Trans. Nuclear Science, 2000 (citations > 330)
  • Srinivasan, The impact of technology scaling on lifetime reliability, DSN’04 (citations > 350)
  • Intel: Borkar et al., Parameter variations and impact on circuits and microarchitecture, DAC’03 (citations > 1000)
  • IBM: Ziegler et al., "IBM experiments in soft fails in computer electronics (1978–1994)," IBM Journal of Research and Development, vol. 40, no. 1, pp. 3–18, Jan. 1996 (citations > 400)
  • TI: McPherson, Reliability challenges for 45nm and beyond, DAC’06 (citations > 330)

slide-10
SLIDE 10

Reliability Research Approaches

Hardware approach

  • Redundancy (DMR, TMR, ECC, Parity, etc.)

Software approach

  • Compilers
  • Operating System (scheduling, mapping)
  • Runtime Management
slide-11
SLIDE 11

Tried and Tested Method

  • Triple modular redundancy (TMR)
  • High cost (three copies of each module plus a voter) rules out this method

[Figure: Modules 1, 2, and 3 feeding a voting MUX]
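The voting step of TMR can be sketched in a few lines. This is a minimal software model, not a hardware description: the function names `majority_vote`, `tmr`, `good`, and `faulty` are hypothetical, and the fault is modelled as an assumed single bit flip in one replica's output.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # All replicas disagree: the fault cannot be masked.
        raise RuntimeError("no majority among module outputs")
    return value


def tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


def good(x: int) -> int:
    """Fault-free module (hypothetical function under protection)."""
    return x * x


def faulty(x: int) -> int:
    """Same module with an assumed bit flip in the least significant bit."""
    return (x * x) ^ 1


if __name__ == "__main__":
    # One faulty replica out of three is outvoted by the other two.
    print(tmr([good, faulty, good], 7))
```

The sketch also makes the cost visible: every call runs the protected function three times, which mirrors the area and power overhead the slide refers to.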

slide-12
SLIDE 12

Low-Cost Hardware Methods: Examples

  • Selective duplication (timing faults)

– only insert RAZOR flip-flops in critical paths

  • Re-use existing circuitry (logic faults)

– scan flip-flops in BISER
– idle register files for redundancy

[Figures: RAZOR, BISER, register files]

* Ernst et al., “Razor: a low-power pipeline based on circuit-level timing speculation,” MICRO-36, pp. 7–18, 2003.
* Mitra et al., “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43–52, 2005.
* Memik et al., “Increasing Register File Immunity to Transient Errors,” in DATE05, pp. 586–591.

slide-13
SLIDE 13

Low-Cost HW-SW Method: Example

  • Hardware detection

– Parity through scan-chains

  • Software correction

– Interrupt service routine as firmware

* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and G. V. Merrett, “Improved State Integrity of Flip-Flops for Voltage Scaled Retention Under PVT Variation,” IEEE TCAS-I: Regular Papers, vol. 1, pp. 1–9, 2013.
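The split between hardware detection and software correction can be illustrated with a small behavioural model. Everything below is an assumption made for illustration: the class `RetentionMonitor`, its method names, and in particular the recovery strategy (restoring a clean copy of the state), which the slide does not specify.

```python
from typing import List


def parity(bits: List[int]) -> int:
    """Even-parity bit over a scan chain: XOR of all flip-flop values."""
    p = 0
    for bit in bits:
        p ^= bit & 1
    return p


class RetentionMonitor:
    """Behavioural model of one scan chain checked around a retention period."""

    def __init__(self, chain: List[int]) -> None:
        self.chain = list(chain)
        self.stored_parity = 0
        self.backup: List[int] = []

    def enter_retention(self) -> None:
        """Record the parity before the supply voltage is scaled down."""
        self.stored_parity = parity(self.chain)
        # Assumption: a clean copy of the state is kept in always-on memory.
        self.backup = list(self.chain)

    def wake_up(self) -> bool:
        """Hardware check on wake-up; returns True if recovery was triggered."""
        if parity(self.chain) != self.stored_parity:
            self.recovery_isr()
            return True
        return False

    def recovery_isr(self) -> None:
        """Stand-in for the firmware interrupt service routine."""
        self.chain = list(self.backup)


if __name__ == "__main__":
    monitor = RetentionMonitor([1, 0, 1, 1])
    monitor.enter_retention()
    monitor.chain[2] ^= 1  # assumed single upset during retention
    print(monitor.wake_up(), monitor.chain)
```

A single parity bit only detects an odd number of flipped bits per chain, which is one reason to split the state across several short chains.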

slide-14
SLIDE 14

Software Approach

The hardware approach emphasizes fault detection and correction; the software approach emphasizes software failure prevention

slide-15
SLIDE 15

Unreliable Hardware: Software Approach

Compilers

– Improve program reliability by quantifying the vulnerability of instructions
– Instruction scheduling affects the vulnerable periods of an instruction’s variables
– Reduce the pipeline occupancy of critical instructions and the vulnerable periods of their operands
– Schedule the instruction with the highest vulnerability first

[Figure: compiler flow. Input: source code. Stages: analysis of vulnerable periods of processor register variables, estimation of program reliability, reliability-optimised instruction scheduling. Output: reliability-aware binary]

  • J. Henkel et al., “RAISE: Reliability-Aware Instruction Scheduling”
  • T. Jones, Energy-aware compilers, Cambridge University, http://www.cl.cam.ac.uk/~tmj32/
  • S. Garg et al., Cross-layer reliability modelling and optimisation for embedded systems under PV, Tutorial, CODES-ISSS 2013
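The "highest vulnerability first" rule can be sketched as a list scheduler over a dependency graph. This is an illustrative sketch, not the algorithm from the cited work: the function `schedule`, the instruction names, and the vulnerability scores are all hypothetical, and a real scheduler would also model latencies and pipeline resources.

```python
import heapq
from typing import Dict, List


def schedule(vulnerability: Dict[str, float],
             deps: Dict[str, List[str]]) -> List[str]:
    """Order instructions so that, among those whose operands are ready,
    the one with the highest vulnerability score is issued first."""
    # Number of unscheduled predecessors per instruction.
    remaining = {instr: len(deps.get(instr, [])) for instr in vulnerability}
    # Reverse edges: which instructions consume the result of each one.
    users: Dict[str, List[str]] = {instr: [] for instr in vulnerability}
    for instr, preds in deps.items():
        for pred in preds:
            users[pred].append(instr)

    # Max-heap on vulnerability (negated because heapq is a min-heap).
    ready = [(-vulnerability[i], i) for i, n in remaining.items() if n == 0]
    heapq.heapify(ready)

    order: List[str] = []
    while ready:
        _, instr = heapq.heappop(ready)
        order.append(instr)
        for user in users[instr]:
            remaining[user] -= 1
            if remaining[user] == 0:
                heapq.heappush(ready, (-vulnerability[user], user))

    if len(order) != len(vulnerability):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    scores = {"load_a": 0.2, "load_b": 0.9, "add": 0.5, "store": 0.7}
    dependencies = {"add": ["load_a", "load_b"], "store": ["add"]}
    print(schedule(scores, dependencies))
```

Issuing the most vulnerable ready instruction first shortens the time its result waits in the pipeline, which is the intuition behind the last bullet of the slide.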

slide-16
SLIDE 16

Unreliable Hardware: Software Approach

Operating Systems

  • Heuristics decide on the mapping of application tasks to processors, scheduling, and fault-tolerance (FT) policies to meet the reliability requirement
  • Many heuristics have been proposed, for example:

– V. Izosimov, P. Pop, and P. Eles, “Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems,” DATE05, pp. 864–869.
– R. Shafik, B.M. Al-Hashimi, K. Chakrabarty, “Soft error-aware design optimisation of low power and time-constrained embedded systems”, DATE10, pp. 1462–1467.

[Figure: design flow. Inputs: tasks, hardware platform, task reliability profile, reliability requirement. Steps: mapping (duplication), scheduling (re-execution), reliability analysis. Decision "Reliable?" with pass and fail branches; the pass branch leads to execution]
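One simple heuristic of this kind can be sketched as follows. It is not taken from the cited papers: it assumes independent transient faults, a per-task failure probability from a reliability profile, and re-execution as the only fault-tolerance policy. The names `system_reliability` and `assign_reexecutions` and all numbers are hypothetical. A task that fails with probability p per attempt and may be re-executed k times succeeds with probability 1 - p^(k+1).

```python
from typing import Dict


def system_reliability(fail_prob: Dict[str, float],
                       reexec: Dict[str, int]) -> float:
    """Probability that every task succeeds within its allowed attempts."""
    reliability = 1.0
    for task, p in fail_prob.items():
        reliability *= 1.0 - p ** (reexec[task] + 1)
    return reliability


def assign_reexecutions(fail_prob: Dict[str, float],
                        requirement: float,
                        max_reexec: int = 3) -> Dict[str, int]:
    """Greedy heuristic: keep granting one more re-execution to the task
    with the largest residual failure probability until the reliability
    requirement is met."""
    reexec = {task: 0 for task in fail_prob}
    while system_reliability(fail_prob, reexec) < requirement:
        candidates = [t for t in reexec if reexec[t] < max_reexec]
        if not candidates:
            raise ValueError("requirement cannot be met with re-execution")
        worst = max(candidates,
                    key=lambda t: fail_prob[t] ** (reexec[t] + 1))
        reexec[worst] += 1
    return reexec


if __name__ == "__main__":
    profile = {"sense": 0.01, "filter": 0.05, "actuate": 0.02}
    plan = assign_reexecutions(profile, requirement=0.99)
    print(plan, system_reliability(profile, plan))
```

A production heuristic would also check that the added re-executions still fit the deadlines, which is where the scheduling step of the flow comes in.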

slide-17
SLIDE 17

Industry Pragmatic Approach to Reliable Processors

(every bit matters; users are willing to pay)


slide-18
SLIDE 18

ARM Cortex-R Series

  • Dual core lock-step configuration*: two identical cores run the same set of operations and their outputs are compared. If a difference is detected, the cores are rolled back to the last correct operation
  • Pipelines, caches and memories are protected with ECC

* http://www.arm.com/products/processors/cortex-r/cortex-r4.php
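The compare-and-roll-back idea can be modelled in software. This is a behavioural sketch only and does not describe how the Cortex-R hardware implements lock-step: the functions `run_lockstep`, `good_core`, and `make_flaky_core`, the retry limit, and the injected bit flip are all assumptions.

```python
from typing import Callable, List, Tuple

Core = Callable[[int, int], int]  # (operation, state) -> new state


def run_lockstep(program: List[int], core_a: Core, core_b: Core,
                 state: int = 0, max_retries: int = 3) -> Tuple[int, int]:
    """Execute each operation on both cores and commit only when they agree.
    Returns the final state and the number of mismatches observed."""
    mismatches = 0
    for op in program:
        checkpoint = state  # last state both cores agreed on
        for _ in range(max_retries + 1):
            out_a = core_a(op, checkpoint)
            out_b = core_b(op, checkpoint)
            if out_a == out_b:
                state = out_a
                break
            mismatches += 1  # roll back: retry from the checkpoint
        else:
            raise RuntimeError("persistent mismatch between cores")
    return state, mismatches


def good_core(op: int, state: int) -> int:
    """Hypothetical operation: add the operand to the state."""
    return state + op


def make_flaky_core(fail_on_call: int) -> Core:
    """Core that corrupts its result on exactly one call (assumed transient)."""
    calls = {"n": 0}

    def core(op: int, state: int) -> int:
        calls["n"] += 1
        result = state + op
        if calls["n"] == fail_on_call:
            result ^= 1  # injected single-bit error
        return result

    return core


if __name__ == "__main__":
    print(run_lockstep([1, 2, 3], good_core, make_flaky_core(2)))
```

Comparison alone cannot tell which core was wrong, so both are restarted from the checkpoint; a transient fault then disappears on the retry.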

slide-19
SLIDE 19

Oracle/Fujitsu: SPARC64

  • Error detection in execution units and interconnect using data and address parity*
  • Recovery via instruction re-execution
  • ECC in L1D and L2 caches

* Ando et al., “A 1.3-GHz fifth-generation SPARC64 microprocessor,” JSSC, 38(11), 1896–1905, 2003.
slide-20
SLIDE 20

IBM Power7

  • Core

– Hardened latches
– Spare cores
– Re-execution, task migration

  • Memory

– Tag uncorrectable errors
– Dynamic sparing

  • Interconnects

– ECC-protected interconnect between cluster nodes
– Redundant paths

* Kalla et al., "Power7: IBM's next-generation server processor," IEEE Micro, 2010.

slide-21
SLIDE 21

Where are we heading to?

Personal Perspectives (Automation, Cross-layer)


slide-22
SLIDE 22

Reliability/Safety Standards

  • IEC 61508 (meta-standard)
  • IEC 60601 (medical equipment)
  • IEC 61511 (process industry)
  • IEC 62061 (machinery)
  • IEC 50156 (furnaces)
  • IEC 60880 (nuclear power stations)
  • ISO 26262 (automotive)
  • RTCA/DO-178B and DO-254 (aerospace)
  • EN 50128 (railway)

Source: YOGITECH

slide-23
SLIDE 23

ISO 26262 and RIIF

  • ISO 26262: automotive standard for functional safety of electronic systems in vehicles

– Focuses on risks arising from random hardware faults and systematic faults in HW/SW development

  • Reliability Information Interchange Format (RIIF): IEEE initiative to develop a HW reliability modeling language

– EDA tools analyze reliability models to compute failure rates

* Standards for specifying and modeling the reliability of complex electronic systems, 1st RIIF Workshop, DATE 2013
* Evans et al., RIIF: Reliability Information Interchange Format, On-Line Testing Symposium, 2012
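As a small worked example of what "computing failure rates" from a reliability model involves, the sketch below adds up per-component FIT rates (failures per 10^9 device-hours) and converts the total to a mean time to failure. It assumes constant, independent failure rates and a system that fails when any component fails. The component names and FIT values are hypothetical, and this is not RIIF syntax.

```python
from typing import Dict

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def total_fit(component_fit: Dict[str, float]) -> float:
    """Failure rate of a series system: the component rates add up."""
    return sum(component_fit.values())


def mttf_hours(fit: float) -> float:
    """Mean time to failure for a constant failure rate given in FIT."""
    return HOURS_PER_FIT_UNIT / fit


if __name__ == "__main__":
    # Hypothetical per-block failure rates for one chip.
    chip = {"sram": 500.0, "flip_flops": 300.0, "logic": 200.0}
    rate = total_fit(chip)
    print(rate, "FIT,", mttf_hours(rate), "hours MTTF")
```

With these assumed numbers the SRAM contributes half of the total rate, which is the kind of breakdown that tells a designer where protection such as ECC pays off first.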

slide-24
SLIDE 24

Low-Power EDA: Example

  • Tools and standards made low-power design mainstream
  • UPF (Unified Power Format): IEEE standard for describing power intent for power optimization in EDA
  • Example of automatic insertion of power gating in an RTL description

[Figure: low-power EDA flow. Design (RTL) and a power description (e.g. UPF) feed synthesis, followed by placement and route. Inserted automatically:]

  1. Create power switches
  2. Create state retention
  3. Create output isolation

[Figure: inserted circuits. Power switch pg_switch (Vdd, Sw_Vdd, control pg_ctrl/power); isolation cell iso1 (pg_ctrl/nclamp, DIN, Dout); retention-enabled flip-flop with master F/F, slave retention latch, and pins Vdd, sw_Vdd, D, clock, RETAIN, Q, Gnd]
slide-25
SLIDE 25

Where are we heading to?

Reliable Hardware EDA

[Figure: proposed reliable-hardware EDA flow. Main blocks: specification (performance and reliability), RTL, synthesis, reliable hardware. Supporting blocks: Unified Reliability Format (URF); failure mechanisms described in RIIF (e.g. SEU, NBTI, HCI, ...); reliability analysis producing a reliability map (failure rates, ...); fault tolerance policy (Razor, duplication, hardening, ECC)]

slide-26
SLIDE 26

Where are we heading to?

Cross-Layer: Run-Time

[Figure: cross-layer run-time reliability management. Layers: application (application threads), system software (operating system with reliability management), hardware (processors 1 to N and interconnect). The application layer supplies application and runtime requirements. Monitors: performance counters and temperature. Controls: DVFS, duplication, CPU affinity]

slide-27
SLIDE 27

Run Smarter, Live Longer

  • Cross-layer approach enables

– Delaying transistor wear-out and improving fault mitigation
– Doing the right thing with existing resources

  • Appropriate mapping of application threads to cores, guided by counters and sensors
  • Appropriate selection of core frequencies and voltages
  • Use of existing resources and power management leads to more energy being available

  • S. Bischoff, H. Andreas, B. Al-Hashimi, “Applying quality of experience to system optimisation,” PATMOS 2013
slide-28
SLIDE 28

Cross-layer Approach: Motivational Example


Thermal profile varies with applications

slide-29
SLIDE 29

Runtime Thermal Optimization*

Application requirement: deadline per frame/task

Runtime requirement: improve lifetime reliability

Proposed Approach

[Figure: three layers. Application: mpeg2_dec, face_rec. System software: operating system running a Q-learning algorithm that, at each decision epoch, determines the state, computes the reward, and selects an action. Hardware: cores 0 to n, each with a sensor, and the interconnect. Monitors: temperature samples, performance counter. Control: CPU affinity, CPU V/F]

* Das, A.K., Shafik, R.A., Merrett, G.V., Al-Hashimi, B.M., Kumar, A. and Veeravalli, B., Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems, DAC’14
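The determine-state, compute-reward, select-action loop can be sketched with textbook tabular Q-learning. This is a generic sketch under stated assumptions, not the formulation of the cited DAC’14 paper: the class `QLearningGovernor`, the state and action labels, the reward function, and the 80 °C reference are all hypothetical.

```python
import random
from typing import Dict, Hashable, Sequence, Tuple


class QLearningGovernor:
    """Tabular Q-learning over discrete thermal states and V/F actions."""

    def __init__(self, states: Sequence[Hashable], actions: Sequence[Hashable],
                 alpha: float = 0.5, gamma: float = 0.9,
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q: Dict[Tuple[Hashable, Hashable], float] = {
            (s, a): 0.0 for s in states for a in self.actions}
        self.rng = random.Random(seed)

    def select_action(self, state: Hashable) -> Hashable:
        """Epsilon-greedy choice at a decision epoch."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state: Hashable, action: Hashable,
               reward: float, next_state: Hashable) -> None:
        """Standard update: Q += alpha * (r + gamma * max Q' - Q)."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def compute_reward(temperature_c: float, met_deadline: bool) -> float:
    """Assumed reward: penalise missed deadlines, favour cooler operation."""
    if not met_deadline:
        return -10.0
    return (80.0 - temperature_c) / 10.0


if __name__ == "__main__":
    governor = QLearningGovernor(["cool", "hot"], ["low_vf", "high_vf"],
                                 epsilon=0.0)
    governor.update("hot", "low_vf", compute_reward(60.0, True), "cool")
    print(governor.select_action("hot"))
```

In a real governor the state would be derived from the temperature samples and performance counters, and the action would be applied through the operating system's CPU affinity and V/F controls.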

slide-30
SLIDE 30

Runtime Thermal Optimization - results

[Figure legend: blue = proposed approach, red = Linux on-demand]

slide-31
SLIDE 31

PRiME

Power-efficient, Reliable, Many-core Embedded systems

5-year project, £5.6M, started March 2013

slide-32
SLIDE 32

Summary

  • Significant academic and industrial research to date; likely to continue
  • Standards and tools will speed up the design automation of reliable hardware
  • Runtime cross-layer approach will enable reliable, energy-efficient design of future many-core embedded systems
  • Are we there yet in hardware reliability?

– At the beginning? Clearly not
– At the end? Absolutely not
– Entering a new stage where innovation in design automation and cross-layer run-time design will provide effective solutions for future reliable systems

slide-33
SLIDE 33

Thank you

شكرا

谢谢

ধন্যবাদ

danke

merci

gracias grazie

ありがとう

dank u σας ευχαριστώ

धन्यवाद

감사합니다

tack takk

slide-34
SLIDE 34

Runtime Reliability Management

[Figure: runtime reliability management loop in the system software, between the application layer and the hardware platform (processors 1 to N, interconnect). Input: application and runtime requirements from the application layer. Loop: monitor and data gathering (logic and timing faults; temperature), data analysis, adaptation and decision making, control action (DVFS, duplication, core affinity)]

Unreliability levels

  • Low: unlikely faults
  • Moderate: possible faults
  • Critical: imminent faults

Operating modes

  • Fault recovery
  • Graceful degradation
  • Low energy
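The analysis and decision steps of such a loop can be sketched as below. The slide lists the unreliability levels and the operating modes but not how they pair up, so the pairing in `POLICY` is an assumption, as are the thresholds, the function names, and the choice of monitored quantities.

```python
from enum import Enum


class Level(Enum):
    LOW = "unlikely faults"
    MODERATE = "possible faults"
    CRITICAL = "imminent faults"


class Mode(Enum):
    LOW_ENERGY = "low energy"
    GRACEFUL_DEGRADATION = "graceful degradation"
    FAULT_RECOVERY = "fault recovery"


# Hypothetical thresholds; real values would come from characterisation.
FAULTS_MODERATE = 1
FAULTS_CRITICAL = 5
TEMP_MODERATE_C = 70.0
TEMP_CRITICAL_C = 85.0

# Assumed pairing of unreliability levels with operating modes.
POLICY = {
    Level.LOW: Mode.LOW_ENERGY,
    Level.MODERATE: Mode.GRACEFUL_DEGRADATION,
    Level.CRITICAL: Mode.FAULT_RECOVERY,
}


def analyse(faults_per_epoch: int, temperature_c: float) -> Level:
    """Data analysis: classify monitored data into an unreliability level."""
    if faults_per_epoch >= FAULTS_CRITICAL or temperature_c >= TEMP_CRITICAL_C:
        return Level.CRITICAL
    if faults_per_epoch >= FAULTS_MODERATE or temperature_c >= TEMP_MODERATE_C:
        return Level.MODERATE
    return Level.LOW


def decide(faults_per_epoch: int, temperature_c: float) -> Mode:
    """Adaptation and decision making: pick the operating mode."""
    return POLICY[analyse(faults_per_epoch, temperature_c)]


if __name__ == "__main__":
    for sample in [(0, 55.0), (2, 55.0), (0, 90.0)]:
        print(sample, decide(*sample).value)
```

The selected mode would then be turned into control actions such as DVFS settings, task duplication, or core affinity changes.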

slide-35
SLIDE 35

Reliability Research Approaches Together

Hardware approach

  • Redundancy (DMR, TMR, ECC, Parity, etc.)

Software approach

  • Compilers
  • Operating System (scheduling, mapping)
  • Runtime Management

Cross-layer: hardware and software approaches together