Pair & Swap: An Approach to Graceful Degradation for Dependable Chip Multiprocessors

Masashi Imai (miyabi@hal.rcast.u-tokyo.ac.jp)
Tomohide Nagai (nagai@hal.rcast.u-tokyo.ac.jp)
Takashi Nanya (takashi.nanya@canon.co.jp)

2010.06.28, WDSN10

Agenda
- Introduction
- Related works
- Pair & Swap
  - Concept
  - Hardware model
  - Execution steps
  - Comparison mechanism
  - Task management mechanism
- Evaluation
- Conclusion

Background & Motivation
- VLSI technology scaling
  - The performance improvement of a single processor is limited by clock skew, power dissipation, ILP, and complexity
- CMP (Chip Multi-Processor)
  - Integrates multiple processor cores on a single chip
  - CMP is a promising VLSI architecture, not only for high performance but also for reducing power dissipation
  - Even if a processor core becomes faulty, the remaining cores can continue to operate
  - It is not efficient to replace the entire CMP chip immediately when a permanent fault occurs

Background & Motivation
- We consider CMP systems as non-repairable systems and present an approach to graceful degradation for dependable CMPs
- Dual modular redundancy (DMR)
  - Can detect faults by comparing the results of tasks
  - The number of tasks in an N-core CMP: N/2
- Triple modular redundancy (TMR)
  - Can mask faults
  - Can identify a faulty core
  - The number of tasks in an N-core CMP: N/3
- We adopt a pair-based scheme for dependable CMPs in order to achieve high performance
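To make the N/2 versus N/3 comparison concrete, here is a minimal Python sketch. The function name `redundant_task_capacity` and the use of floor division when the core count is not a multiple of the redundancy degree are assumptions for illustration; the slides state only N/2 and N/3.

```python
def redundant_task_capacity(num_cores: int, copies: int) -> int:
    """Number of distinct tasks that can run at once when every task
    is replicated on `copies` cores (2 for DMR, 3 for TMR).

    Assumption: leftover cores that cannot form a full group stay idle,
    hence the floor division.
    """
    if num_cores < 0 or copies < 1:
        raise ValueError("num_cores must be >= 0 and copies must be >= 1")
    return num_cores // copies


# Compare pair-based (DMR) and triple-based (TMR) capacity for a few chip sizes.
for n in (4, 8, 12):
    dmr = redundant_task_capacity(n, 2)
    tmr = redundant_task_capacity(n, 3)
    print(f"{n} cores: DMR runs {dmr} tasks, TMR runs {tmr} tasks")
```

Under these assumptions a pair-based scheme always runs at least as many tasks as a triple-based one on the same chip, which is the performance argument for pairing.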

Related works
- Single-processor SMT devices
  - RMT (Redundant MultiThreading) [Nirmal98]
  - AR-SMT (Active-stream/Redundant-stream Simultaneous MultiThreading) [Eric99]
    - A time-redundancy technique that compares the results of a leading thread, called the A-thread, with the results of a trailing thread, called the R-thread
  - SRT (Simultaneous and Redundantly Threaded) [Reinhardt00]
    - Executes two identical copies of the same program as independent threads and compares their results
  - SRTR (SRT with Recovery) [Vijaykumar02]

Related works
- Dual-processor devices (either a dual-core CMP chip or two processors on different dies)
  - Lockstep techniques [Nicholas93, Timothy99, Reorda09]
    - Assume that an error in either processor will cause a difference between the states of the two processors
  - Watchdog processors [Mahmod88]
  - DIVA (Dynamic Implementation Verification Architecture) [Austin99]
    - Employs a high-performance processor core as a leading core and a low-performance core as a trailing checker core

Related works
- CMP devices
  - CRT (Chip-level Redundant Threading) [Mukherjee02]
    - Applies SRT's detection techniques to CMPs
  - CRTR (CRT with Recovery) [Mohamed03]
    - Extends CRT for transient-fault detection
  - DCR (Dual Core Redundancy) [Gong08]
    - Extends CRT by adding hardware-implemented context saving and recovery
  - TCR (Triple Core Redundancy) [Gong08]
    - Executes three copies of a given program on a leading thread, a middle thread, and a trailing thread
  - DCC (Dynamic Core Coupling) [Christopher07]
    - Allows arbitrary CMP cores to verify each other's execution while requiring no dedicated cross-core communication channels or buffers
    - The basic concept of our method is similar to DCC, but DCC employs TMR with hot spares in order to isolate a faulty core and recover its task

Fault model
- Single-core fault
  - A fault can occur only in a single core at a time
- Permanent fault
  - We must identify the faulty core and stop using it
- Transient fault
  - The core in which a transient fault occurs can be recovered by re-executing from the latest checkpoint
  - We do not have to stop using it immediately
  - Generally, transient faults tend to occur much more frequently than permanent faults

Pair & Swap
- A processor-level fault tolerance technique for CMPs that consists of two phases
- Pair phase: replication and comparison
  - Two identical copies of a given task are executed on a pair of processor cores and the results are compared
  - If no fault is detected, each core repeats a period of execution and comparison
- Swap phase: swap and retry
  - The partners of the mismatched pair are swapped with those of another pair, and the mismatched task is re-executed from the latest checkpoint
  - At the end of the swap phase, it is decided whether the fault is transient or permanent
    - Permanent fault: the faulty core is identified and isolated, and the entire CMP system is reconfigured for continuous operation in a degraded mode
    - Transient fault: the swapped pairs continue their tasks in the next pair phase without any reconfiguration
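The two phases can be sketched as a small Python simulation of one compare period on four cores. This is an illustrative model under stated assumptions, not the authors' implementation: the `Core` class, its fault-injection fields, and the function `run_round` are hypothetical names, results are modeled as plain tuples, and the checkpoint rollback is represented simply by re-running the same step.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    permanent_fault: bool = False  # hypothetical: corrupts every result
    transient_hits: int = 0        # hypothetical: corrupts only the next N results

    def run(self, task: str, step: int) -> tuple:
        result = (task, step)  # fault-free execution is deterministic
        if self.permanent_fault:
            return result + ("corrupt", self.name)
        if self.transient_hits > 0:
            self.transient_hits -= 1
            return result + ("corrupt", self.name)
        return result


def run_round(pairs: dict, step: int) -> tuple:
    """Run one pair phase and, if needed, one swap phase.

    `pairs` maps a task name to the two cores executing it.
    Returns (verdict, faulty_core_name_or_None).
    Assumes the single-core fault model, so at most one pair mismatches.
    """
    # Pair phase: both cores of each pair run the same task; compare results.
    mismatched = [t for t, (x, y) in pairs.items()
                  if x.run(t, step) != y.run(t, step)]
    if not mismatched:
        return "no fault", None

    task = mismatched[0]
    other = next(t for t in pairs if t != task)
    (x0, x1), (y0, y1) = pairs[task], pairs[other]

    # Swap phase: exchange partners. One new pair re-executes the mismatched
    # task from the checkpoint; the other continues with the other task.
    retry_ok = x1.run(task, step) == y0.run(task, step)
    moved_ok = x0.run(other, step + 1) == y1.run(other, step + 1)

    if retry_ok and moved_ok:
        return "transient", None
    # The faulty core is the suspect that mismatches again after the swap.
    faulty = x1 if not retry_ok else x0
    return "permanent", faulty.name


cores = [Core(f"Core{i}") for i in range(1, 5)]
pairs = {"A": (cores[0], cores[1]), "B": (cores[2], cores[3])}
cores[1].transient_hits = 1  # inject a one-off fault into Core2
print(run_round(pairs, step=0))
cores[1].permanent_fault = True  # now make Core2 permanently faulty
print(run_round(pairs, step=1))
```

The swap assignment mirrors the four-core example in the slides: one member of the mismatched pair stays on the re-executed task with a new partner, and the other member joins the remaining core on the other task.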

Target model
1. At least four cores (two pairs), so that partners can be swapped
2. A stable storage, so that the mismatched task can be retried from the latest correct checkpoint
   - A shared memory is used as the stable storage, and the correct checkpoint data is stored in it
3. A non-faulty decision unit that decides the comparison results of all the pairs, so that the comparison results are consistent
   - It is needed because a pair of cores in which a fault may occur cannot generate a consistent comparison result by themselves

Pair & Swap: Pair phase
[Figure: timing diagram of the pair phase. Core1 and Core2 both execute Task A(i), A(i+1), A(i+2), A(i+3); Core3 and Core4 both execute Task B(i), B(i+1), B(i+2), B(i+3). At the end of each compare & checkpoint (CP) period, the two results of each task are compared and a checkpoint of Task A and Task B is taken.]

Pair & Swap: Swap phase
[Figure: timing diagram of the transition from the pair phase to the swap phase. In the pair phase, Core1 and Core2 execute Task A(i) and Core3 and Core4 execute Task B(i). A fault is detected at the comparison. In the swap phase, the checkpoint is loaded (rollback) and tasks migrate: Core2 and Core3 re-execute Task A(i), while Core1 and Core4 execute Task B(i+1).]

Pair & Swap: Fault location (1)
- Transient fault case
  - At the end of the swap phase, both comparison results match
  - It can be decided that the fault was transient
  - The two pairs continue executing the same tasks by starting a new pair phase
[Figure: timing diagram of the pair phase, swap phase, and following pair phase. After the swap phase, Core2 and Core3 continue with Task A(i+1), A(i+2), and Core1 and Core4 continue with Task B(i+2), B(i+3), with a comparison after each period.]

Pair & Swap: Fault location (2)
- Permanent fault case
  - At the end of the swap phase, the comparison result of Task A(i) mismatches
  - The faulty core is identified as the one that executed the mismatched tasks in both the pair phase and the swap phase
  - In this example, that core is Core2, so the system stops using Core2
[Figure: timing diagram of the pair phase and swap phase. Core1 and Core2 execute Task A(i) in the pair phase; after rollback (load CP), Core2 and Core3 re-execute Task A(i) while Core1 and Core4 execute Task B(i+1).]
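The identification rule above amounts to a set intersection: the faulty core is the one that belongs to a mismatching pair in both phases. A minimal Python sketch follows. The function name `locate_faulty_core` and the representation of pairs as sets of core names are assumptions for illustration, and the sketch relies on the single-core fault model stated earlier.

```python
from typing import Iterable, Optional, Set


def locate_faulty_core(pair_phase_mismatch: Set[str],
                       swap_phase_mismatches: Iterable[Set[str]]) -> Optional[str]:
    """Return the name of the permanently faulty core, or None if the
    fault was transient.

    pair_phase_mismatch:   the two cores whose results differed in the pair phase.
    swap_phase_mismatches: the pairs whose results differed in the swap phase
                           (empty when every comparison matched).
    """
    swap_phase_mismatches = list(swap_phase_mismatches)
    if not swap_phase_mismatches:
        return None  # all swap-phase comparisons matched: transient fault

    # Keep only the cores that appear in a mismatching pair in both phases.
    suspects = set(pair_phase_mismatch)
    for pair in swap_phase_mismatches:
        suspects &= set(pair)

    if len(suspects) != 1:
        raise ValueError("observations do not fit the single-core fault model")
    return next(iter(suspects))


# The example from the slides: Core1/Core2 mismatch in the pair phase,
# then Core2/Core3 mismatch again when Task A(i) is re-executed.
print(locate_faulty_core({"Core1", "Core2"}, [{"Core2", "Core3"}]))
```

Raising an error when the intersection is not a single core is a design choice of this sketch: such an observation would mean more than one core misbehaved, which lies outside the fault model the slides assume.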
