
Trading Off Lifetime, Fault-tolerance, and Power Consumption in Real-time MPSoC

Jacopo Panerati (École Polytechnique de Montréal), Samar Abdi (Concordia University), and Giovanni Beltrame (École Polytechnique de Montréal). MPSoC 2015 - Ventura Beach, CA, USA.


  1. Trading Off Lifetime, Fault-tolerance, and Power Consumption in Real-time MPSoC
     Jacopo Panerati∗, Samar Abdi†, and Giovanni Beltrame∗
     ∗ École Polytechnique de Montréal, † Concordia University
     MPSoC 2015 - Ventura Beach, CA, USA

     Table of contents
     1 Introduction
     2 System Model
     3 Methodology
     4 Case Study
     5 Conclusions

     J. Panerati et al. - Lifetime, Fault-tolerance, Power - mistlab.ca

  2. Motivation
     • Aerospace: high frequency of Single Event Upsets (SEUs)
     • Usually critical systems, requiring high availability
     • Classical countermeasures:
       • Modular redundancy
       • Shielding
     • Issues:
       • Cost
       • Extra hardware ⇒ more power ⇒ higher temperature ⇒ shorter lifetime
     • What is a good trade-off?

  3. Research Goal
     • Reliability and fault-tolerance are essential for critical, autonomous systems
     • We propose a methodology to quantify, and maximize, reliability in the presence of transient errors for MPSoC
     • Fault-tolerance is traded off against power consumption
     • We target homogeneous multi-processor systems
     • Goal: maintain a given level of reliability/lifetime under varying fault rates

  4. System Model
     • Multiprocessor System-on-Chip (we're in the right place!)
     • Identical processing elements (PEs) with private caches
     • Voltage scaling: a set of operating points for each PE
     [Figure: grid of processing elements PE 1,1, PE 1,2, PE 2,1, PE 2,2, ...]

     Fault models
     • Transient faults (SEUs), with data scrubbing
     • Permanent faults
     • Total Ionizing Dose (TID) effects

     Real-Time Application Model
     • A set of tasks $\tau_1, \tau_2, \ldots, \tau_m$ is executed
     • Each task has a WCET associated with the slowest operating point of a PE
     • The speedup is proportional to the frequency increase:
       $\mathrm{WCET}_{OP}(f_i, -) = \mathrm{WCET}_{OP}(f_0, -) \cdot \frac{f_0}{f_i}$
     • Precedences are expressed via a Directed Acyclic Graph (DAG)
     [Figure: example task DAG with precedences A ≺ B, B ≺ D, C ≺ D and WCET_OPk(A) = 2, WCET_OPk(B) = 4, WCET_OPk(C) = 7, WCET_OPk(D) = 5]
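     The frequency-scaling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `scaled_wcet` is hypothetical, and the example values (a task with a WCET of 8.0 at a 600 MHz base operating point) are taken from the operating-point table in the case study.

```python
def scaled_wcet(wcet_base: float, f_base_mhz: float, f_mhz: float) -> float:
    """WCET at frequency f_mhz, assuming speedup proportional to frequency.

    Implements WCET(f_i) = WCET(f_0) * f_0 / f_i from the slide.
    """
    if f_base_mhz <= 0 or f_mhz <= 0:
        raise ValueError("frequencies must be positive")
    return wcet_base * f_base_mhz / f_mhz


# Task A of the case study: WCET 8.0 at the slowest operating point (600 MHz).
wcets_at_ops = [scaled_wcet(8.0, 600.0, f) for f in (600.0, 1200.0, 1600.0)]
```

     Under this rule the three values come out as 8.0, 4.0, and 3.0, which agrees with the row for task A in the case-study table.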

  5. Single Event Upsets
     We use probability theory to model the occurrence of faults. SEUs are caused by high-energy particles:
     • Whose impacts are independent.
     • Which happen at a constant average rate.
     • The rate is mission phase-dependent.
     The number of impacts in a scrubbing period of length T is a Poisson random variable.
     [Figure: P_SEU (0 to 1) versus average SEUs/day (0 to 100), for scrubbing periods T = 1 h, T = 30 min, and T = 10 min]

     Permanent Faults
     • We consider the most common wear-out phenomena: hot carriers, negative bias temperature instability (NBTI), time-dependent dielectric breakdown (TDDB), electromigration, and self-heating
     • We hypothesize that the Mean Time To Failure (MTTF) has an exponential relationship with PE load (utilization U):
       $\mathrm{MTTF}_U \propto (\mathrm{MTTF}_{100\%})^{U^{-1}}$
     [Figure: pmf and CDF of the time to failure over 0 to 50 years, for MTTF = 1, 5, and 10 years]
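     The Poisson model for SEUs can be illustrated as follows. This is a sketch under stated assumptions: the slide does not define P_SEU precisely, so the code assumes it is the probability of at least one impact within a scrubbing period, and the function names are hypothetical. The rates of 16.5 and 62 faults/day are the LEO and HEO figures quoted in the case study.

```python
import math


def poisson_pmf(k: int, expected: float) -> float:
    """Probability of exactly k impacts when `expected` impacts occur on average."""
    return math.exp(-expected) * expected ** k / math.factorial(k)


def prob_at_least_one_seu(rate_per_day: float, scrub_period_hours: float) -> float:
    """Probability of one or more SEUs in a scrubbing period (assumed meaning of P_SEU)."""
    expected = rate_per_day * scrub_period_hours / 24.0  # mean impacts per period
    return 1.0 - math.exp(-expected)


# Case-study fault rates with a one-hour scrubbing period.
p_leo = prob_at_least_one_seu(16.5, 1.0)
p_heo = prob_at_least_one_seu(62.0, 1.0)
# Scrubbing more often (every 10 minutes) lowers the per-period probability.
p_leo_10min = prob_at_least_one_seu(16.5, 10.0 / 60.0)
```

     Shortening the scrubbing period reduces the expected number of impacts per period, which is the effect the three curves in the figure compare.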

  6. Power Model
     • Total power = sum of the power of each PE
     • Standard model with activation factor, capacitance, voltage, and frequency:
       $P = \alpha \cdot C \cdot V^2 \cdot f$
     [Figure: dynamic power (W) and voltage (V) versus frequency, 600 to 1,600 MHz]
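     A minimal sketch of this power model in Python follows. The activation factor, capacitance, and voltage values are hypothetical placeholders chosen only to make the example run; they are not taken from the slides.

```python
def dynamic_power(alpha: float, capacitance_f: float, voltage_v: float, freq_hz: float) -> float:
    """Dynamic power of one PE in watts: P = alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * freq_hz


def total_power(pes: list[tuple[float, float, float, float]]) -> float:
    """Total power as the sum over all PEs, each given as (alpha, C, V, f)."""
    return sum(dynamic_power(*pe) for pe in pes)


# Two PEs at different (hypothetical) operating points.
pes = [
    (0.5, 1e-8, 1.0, 600e6),   # alpha, C [F], V [V], f [Hz]
    (0.5, 1e-8, 1.2, 1200e6),
]
system_power = total_power(pes)
```

     Because voltage enters quadratically and usually has to rise with frequency, running a PE at a faster operating point costs more than a linear increase in power.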

  7. Methodology
     Task Mapping
     • Enumerate all possible mappings
     • Prune the design space according to WCET and slowest operating point
     • Compute the utilization for each mapping

     Power, Fault-tolerance, and Lifetime Optimization
     • Compute the total energy according to utilization and operating points
     • Utilization has an exponential effect on the probability of a system-wide error
     • Slack provides fault-tolerance
     • For lifetime optimization, we consider the effect of utilization on lifetime and the failure of multiple resources
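     The task-mapping step can be sketched as a brute-force enumeration. This is one possible reading of the slide, not the authors' implementation: it assumes utilization is the summed WCET on a PE divided by a common period, it ignores precedence constraints, and the `PERIOD` value and all function names are hypothetical. The WCETs are the slowest-operating-point values from the case study.

```python
from itertools import product

# WCET of each task at the slowest operating point (case-study table, OP 1).
WCET_SLOWEST = {"A": 8.0, "B": 4.0, "C": 8.0, "D": 12.0}
PERIOD = 20.0   # hypothetical common period/deadline, not given in the slides
NUM_PES = 2


def enumerate_mappings(tasks, num_pes):
    """Yield every assignment of tasks to PEs as a {task: pe_index} dict."""
    for assignment in product(range(num_pes), repeat=len(tasks)):
        yield dict(zip(tasks, assignment))


def utilizations(mapping, wcet, period, num_pes):
    """Per-PE utilization: summed WCET of the tasks on that PE over the period."""
    load = [0.0] * num_pes
    for task, pe in mapping.items():
        load[pe] += wcet[task]
    return [pe_load / period for pe_load in load]


def feasible_mappings(wcet, period, num_pes):
    """Keep only mappings where no PE is over-utilized at the slowest point."""
    kept = []
    for mapping in enumerate_mappings(sorted(wcet), num_pes):
        util = utilizations(mapping, wcet, period, num_pes)
        if max(util) <= 1.0:
            kept.append((mapping, util))
    return kept


feasible = feasible_mappings(WCET_SLOWEST, PERIOD, NUM_PES)
```

     With four tasks and two PEs there are only 16 candidate mappings, so exhaustive enumeration is cheap here; the count grows exponentially with the number of tasks, which is why the pruning step matters for larger systems.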

  8. Case Study (actually a toy example)
     • Dual core, four tasks, each PE has four operating points
     • Implementation on a Virtex 4 board
     • 16.5 faults/day in Low Earth Orbit (LEO)
     • 62 faults/day in Highly Elliptical Orbit (HEO)

     WCET per task and operating point:

     Task   OP 1 (f1 = 600 MHz)   OP 2 (f2 = 1.2 GHz)   OP 3 (f3 = 1.6 GHz)
     A       8.0                   4.0                   3.0
     B       4.0                   2.0                   1.5
     C       8.0                   4.0                   3.0
     D      12.0                   6.0                   4.5

     Results
     • Overall 29 acceptable points, 15 different points shown here
     • Trade-offs for utilization (lifetime), power efficiency, or fault-tolerance

     Average       Best Power     System Errors
     Utilization   Consumption    LEO    HEO
     0.600         30.00 W        12     42
     0.650         27.70 W        13     45
     0.675         26.55 W        14     47
     0.700         25.40 W        15     49
     0.725         24.25 W        15     50
     0.800         20.80 W        16     56
     0.850         27.30 W        17     59
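     One way to read the trade-off in the results table is to filter it for non-dominated points. The slides do not say the authors did this; Pareto filtering is used here only as an illustration, and the choice of objectives (minimize power, LEO errors, and HEO errors) is an assumption. The data rows are copied from the table above.

```python
# (average utilization, best power in W, system errors LEO, system errors HEO)
RESULTS = [
    (0.600, 30.00, 12, 42),
    (0.650, 27.70, 13, 45),
    (0.675, 26.55, 14, 47),
    (0.700, 25.40, 15, 49),
    (0.725, 24.25, 15, 50),
    (0.800, 20.80, 16, 56),
    (0.850, 27.30, 17, 59),
]


def dominates(a, b):
    """True if a is no worse than b in power and errors, and strictly better in one."""
    objs_a, objs_b = a[1:], b[1:]  # skip utilization; all objectives are minimized
    no_worse = all(x <= y for x, y in zip(objs_a, objs_b))
    strictly_better = any(x < y for x, y in zip(objs_a, objs_b))
    return no_worse and strictly_better


def pareto_front(points):
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = pareto_front(RESULTS)
```

     Under these assumed objectives, the 0.850 row drops out because the 0.675 row uses less power and has fewer errors in both orbits; the remaining rows each trade lower power for more errors.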

  9. Results
     • Design space as an n-dimensional space of utilization levels, with reliability and power consumption design points
     [Figure: design points in the (U_PE1, U_PE2) utilization plane, with the best-reliability and best-power-efficiency points marked]

  10. Conclusions
      • Methodology for scheduling real-time tasks in homogeneous MPSoCs
      • Energy, fault-tolerance, and lifetime-aware

      Future Work
      • Use a detailed temperature model instead of the utilization proxy
      • Extend to the effects of interconnects
      • More detailed modelling of permanent faults

      The End
      Questions? http://mistlab.ca
