
Trading Off Lifetime, Fault-tolerance, and Power Consumption in Real-time MPSoC

Jacopo Panerati∗, Samar Abdi†, and Giovanni Beltrame∗

∗École Polytechnique de Montréal, †Concordia University

MPSoC 2015 - Ventura Beach, CA, USA

POLYTECHNIQUE MONTRÉAL Introduction System Model Methodology Case Study Conclusions

Table of contents

1 Introduction
2 System Model
3 Methodology
4 Case Study
5 Conclusions

J. Panerati et al. – Lifetime, Fault-tolerance, Power – mistlab.ca


Motivation

  • Aerospace: high frequency of Single Event Upsets (SEUs)
  • Usually critical systems, requiring high availability
  • Classical countermeasures:
      • Modular redundancy
      • Shielding
  • Issues:
      • Cost
      • Extra hardware ⟹ more power ⟹ higher temperature ⟹ shorter lifetime
  • What is a good trade-off?

Research Goal

  • Reliability and fault-tolerance are essential for critical, autonomous systems
  • We propose a methodology to quantify, and maximize, reliability in the presence of transient errors for MPSoC
  • Fault-tolerance is traded off against power consumption
  • We target homogeneous multi-processor systems
  • Goal: keep a certain level of reliability/lifetime with varying fault rates


System Model

  • Multiprocessor System-on-Chip (we’re in the right place!)
  • Identical processing elements (PEs) w/ private caches
  • Voltage scaling: a set of operating points for each PE

Fault models

  • Transient faults (SEUs) w/ data scrubbing
  • Permanent Faults
  • Total Ionizing Dose (TID) effects

[Figure: grid of processing elements PE1,1, PE1,2, PE2,1, PE2,2, ...]


Real-Time Application Model

  • A set of tasks τ1, τ2, ..., τm is executed
  • Each task has a WCET associated with the slowest operating point of a PE
  • The speedup is proportional to the frequency increase:

    $WCET_{OP(f_i,-)} = WCET_{OP(f_0,-)} \cdot \frac{f_0}{f_i}$

  • Precedences via a Directed Acyclic Graph (DAG)

[Figure: example DAG with WCET_OPk(A) = 2, WCET_OPk(B) = 4, WCET_OPk(C) = 7, WCET_OPk(D) = 5 and precedences A ≺ B, B ≺ D, C ≺ D]
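A minimal sketch of the application model above, assuming the WCET values of the example DAG are taken at a reference operating point and that WCET scales with f0/fi as stated. The function names, the use of the critical path as a summary metric, and the 600 MHz and 1.2 GHz frequencies in the demo are illustrative assumptions, not part of the slides.

```python
from functools import lru_cache

# WCETs and precedences of the example DAG (A ≺ B, B ≺ D, C ≺ D).
WCET_REF = {"A": 2.0, "B": 4.0, "C": 7.0, "D": 5.0}
PREDECESSORS = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"]}


def scaled_wcet(wcet_ref: float, f_ref: float, f_i: float) -> float:
    """WCET at frequency f_i, assuming speedup proportional to frequency."""
    return wcet_ref * f_ref / f_i


def critical_path(wcet: dict, preds: dict) -> float:
    """Length of the longest precedence chain (assumes enough PEs for full parallelism)."""

    @lru_cache(maxsize=None)
    def finish(task: str) -> float:
        # A task starts once all of its predecessors have finished.
        start = max((finish(p) for p in preds[task]), default=0.0)
        return start + wcet[task]

    return max(finish(t) for t in wcet)


if __name__ == "__main__":
    # Hypothetical frequencies: doubling the clock halves every WCET.
    faster = {t: scaled_wcet(w, 600e6, 1.2e9) for t, w in WCET_REF.items()}
    print("critical path at reference point:", critical_path(WCET_REF, PREDECESSORS))
    print("critical path at doubled frequency:", critical_path(faster, PREDECESSORS))
```

In this example the chain C ≺ D (7 + 5) is longer than A ≺ B ≺ D (2 + 4 + 5), so it determines the critical path.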


Single Event Upsets

We use probability theory to model the occurrence of faults. SEUs are caused by high-energy particles:

  • Whose impacts are independent.
  • Which happen at a constant average rate.
  • The rate is mission phase-dependent.

The number of impacts in a scrubbing period of length T is a Poisson random variable.

[Figure: P_SEU versus average SEUs/day (0 to 100) for scrubbing periods T = 1 h, T = 30', T = 10']
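The Poisson model can be sketched as follows. This assumes that P_SEU denotes the probability of at least one impact within a scrubbing period, which the slides do not state explicitly; the function names are illustrative. The demo uses the LEO and HEO fault rates quoted in the case study.

```python
import math


def expected_impacts(rate_per_day: float, scrub_period_hours: float) -> float:
    """Mean number of impacts in one scrubbing period of length T."""
    return rate_per_day * scrub_period_hours / 24.0


def poisson_pmf(k: int, rate_per_day: float, scrub_period_hours: float) -> float:
    """P(N = k) for N ~ Poisson(rate * T)."""
    mu = expected_impacts(rate_per_day, scrub_period_hours)
    return math.exp(-mu) * mu**k / math.factorial(k)


def p_at_least_one_seu(rate_per_day: float, scrub_period_hours: float) -> float:
    """P(N >= 1) = 1 - P(N = 0); assumed here to be the plotted P_SEU."""
    return 1.0 - math.exp(-expected_impacts(rate_per_day, scrub_period_hours))


if __name__ == "__main__":
    for label, rate in (("LEO", 16.5), ("HEO", 62.0)):
        for period_h in (1.0, 0.5, 1.0 / 6.0):  # T = 1 h, 30', 10'
            p = p_at_least_one_seu(rate, period_h)
            print(f"{label}: T = {period_h * 60:.0f} min -> P(N >= 1) = {p:.3f}")
```

Under this reading, shortening the scrubbing period lowers the chance that an upset is present when the memory is scrubbed, at the cost of scrubbing more often.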


Permanent Faults

  • We consider the most common wear-out phenomena: hot carriers, negative bias temperature instability (NBTI), time dependent dielectric breakdown (TDDB), electromigration, and self-heating
  • Hypothesize that Mean Time To Fail (MTTF) has an exponential relationship with PE load (utilization U):

    $MTTF_U \propto (MTTF_{100\%})^{U^{-1}}$

[Figure: failure-time pmf and CDF over 0 to 50 years for MTTF = 1, 5, and 10 years]


Power Model

  • Total power = sum of the power of each PE
  • Standard model with capacitance, frequency, activation factor

    $P = \alpha \cdot C \cdot V^2 \cdot f$

[Figure: dynamic power (W) and voltage (V) versus frequency (MHz), from 600 to 1,600 MHz]
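As a rough illustration of the power model, the sketch below evaluates P = α · C · V² · f per PE and sums over PEs. The activation factor, capacitance, and voltage/frequency pairs are hypothetical placeholders chosen only to make the example run; they are not values from the slides.

```python
def dynamic_power(alpha: float, capacitance: float, voltage: float, frequency: float) -> float:
    """Dynamic power of one PE: P = alpha * C * V^2 * f (watts for SI inputs)."""
    return alpha * capacitance * voltage**2 * frequency


def total_power(operating_points, alpha: float, capacitance: float) -> float:
    """Total power as the sum over PEs, each given as a (voltage, frequency) pair."""
    return sum(dynamic_power(alpha, capacitance, v, f) for v, f in operating_points)


if __name__ == "__main__":
    ALPHA = 0.5         # hypothetical activation factor
    CAPACITANCE = 1e-8  # hypothetical switched capacitance in farads
    # Hypothetical operating points for two PEs: (volts, hertz).
    pes = [(0.8, 600e6), (1.2, 1.2e9)]
    print(f"total dynamic power: {total_power(pes, ALPHA, CAPACITANCE):.2f} W")
```

Because voltage enters quadratically, an operating point that raises both voltage and frequency increases power faster than it reduces WCET, which is what makes the choice of operating point a trade-off.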


Methodology

Task Mapping

  • Enumerate all possible mappings
  • Prune the design space according to WCET and slowest operating point
  • Compute the utilization for each mapping
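The three mapping steps above can be sketched as a brute-force enumeration. This is one possible reading, not the authors' implementation: it uses the slowest-operating-point values from the case-study table as WCETs, a hypothetical common period of 20 time units (the slides give no deadline), and it ignores precedence constraints for brevity.

```python
from itertools import product

# Slowest-operating-point (OP1) values from the case-study table.
WCET_SLOWEST = {"A": 8.0, "B": 4.0, "C": 8.0, "D": 12.0}
N_PES = 2
PERIOD = 20.0  # hypothetical period/deadline, not given in the slides


def enumerate_mappings(wcet: dict, n_pes: int, period: float):
    """Yield (mapping, per-PE utilizations) for every mapping that fits the period."""
    tasks = sorted(wcet)
    # Every task can go on every PE: n_pes ** len(tasks) candidate mappings.
    for assignment in product(range(n_pes), repeat=len(tasks)):
        load = [0.0] * n_pes
        for task, pe in zip(tasks, assignment):
            load[pe] += wcet[task]
        # Prune mappings in which some PE cannot finish its tasks within the period.
        if max(load) <= period:
            yield dict(zip(tasks, assignment)), [busy / period for busy in load]


if __name__ == "__main__":
    for mapping, utilization in enumerate_mappings(WCET_SLOWEST, N_PES, PERIOD):
        print(mapping, [f"{u:.2f}" for u in utilization])
```

Exhaustive enumeration grows as n_pes to the power of the task count, so it is only practical for small task sets such as this one.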

Power, Fault-tolerance, and Lifetime Optimization

  • Compute the total energy according to utilization and operating points
  • Utilization has an exponential effect on the probability of a system-wide error
  • Slack provides fault-tolerance
  • We consider the effect of utilization on lifetime and the failure of multiple resources for lifetime optimization


Case Study (actually a toy example)

  • Dual core, four tasks, each PE has four operating points
  • Implementation on a Virtex 4 board
  • 16.5 faults/day in Low Earth Orbit (LEO)
  • 62 faults/day in Highly Elliptical Orbit (HEO)

Operating Point   OP1             OP2             OP3
                  f1 = 600 MHz    f2 = 1.2 GHz    f3 = 1.6 GHz
Task A            8.0             4.0             3.0
Task B            4.0             2.0             1.5
Task C            8.0             4.0             3.0
Task D            12.0            6.0             4.5


Results

  • Overall 29 acceptable points, 15 different points shown here
  • Trade-offs for utilization (lifetime), power efficiency, or fault-tolerance

Average        Best Power      System Errors
Utilization    Consumption     LEO     HEO
0.600          30.00 W         12      42
0.650          27.70 W         13      45
0.675          26.55 W         14      47
0.700          25.40 W         15      49
0.725          24.25 W         15      50
0.800          20.80 W         16      56
0.850          27.30 W         17      59
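One way to read the trade-offs in this table is as a Pareto filter. The sketch below assumes that lower utilization (as a proxy for longer lifetime), lower power, and fewer system errors are all preferable; the dominance criterion and all names are illustrative assumptions rather than the selection method used in the slides.

```python
from typing import NamedTuple


class DesignPoint(NamedTuple):
    utilization: float
    power_w: float
    errors_leo: int
    errors_heo: int


# Rows copied from the results table.
POINTS = [
    DesignPoint(0.600, 30.00, 12, 42),
    DesignPoint(0.650, 27.70, 13, 45),
    DesignPoint(0.675, 26.55, 14, 47),
    DesignPoint(0.700, 25.40, 15, 49),
    DesignPoint(0.725, 24.25, 15, 50),
    DesignPoint(0.800, 20.80, 16, 56),
    DesignPoint(0.850, 27.30, 17, 59),
]


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    for point in pareto_front(POINTS):
        print(point)
```

Under these assumed objectives, the 0.850 row is dominated by the 0.800 row, while the remaining rows each trade one metric for another.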


Results

  • Design space as an n-dimensional space of utilization levels, with reliability and power consumption design points

[Figure: design points in the (U_PE1, U_PE2) utilization plane, highlighting the best-reliability and best-power-efficiency points]


Conclusions

  • Methodology for scheduling real-time tasks in homogeneous MPSoCs
  • Energy, fault-tolerance, and lifetime-aware

Future Work

  • Use a detailed temperature model instead of the utilization proxy
  • Extend to the effects of interconnects
  • More detailed modelling of permanent faults

The End

Questions?

http://mistlab.ca
