
Pair & Swap: An Approach to Graceful Degradation for Dependable Chip Multiprocessors

Masashi Imai (miyabi@hal.rcast.u-tokyo.ac.jp), Tomohide Nagai (nagai@hal.rcast.u-tokyo.ac.jp), Takashi Nanya (takashi.nanya@canon.co.jp)
2010.06.28


  1. Pair & Swap: An Approach to Graceful Degradation for Dependable Chip Multiprocessors
     - Masashi Imai (miyabi@hal.rcast.u-tokyo.ac.jp)
     - Tomohide Nagai (nagai@hal.rcast.u-tokyo.ac.jp)
     - Takashi Nanya (takashi.nanya@canon.co.jp)
     - 2010.06.28, WDSN10

  2. Agenda
     - Introduction
     - Related works
     - Pair & Swap
       - Concept
       - Hardware model
       - Execution steps
       - Comparison mechanism
       - Task management mechanism
     - Evaluation
     - Conclusion

  3. Background & Motivation
     - VLSI technology scaling
       - The performance improvement of a single processor is limited by clock skew, power dissipation, ILP, and complexity
     - CMP (Chip Multi-Processor)
       - Integrates multiple processor cores in a single chip
       - CMP is a promising VLSI architecture, not only for high performance but also for reducing power dissipation
       - Even if a processor core becomes faulty, the remaining cores can continue to operate
       - It is not efficient to replace the entire CMP chip immediately when a permanent fault occurs

  4. Background & Motivation
     - We consider CMP systems as non-repairable systems and present an approach to graceful degradation for dependable CMPs
     - Dual modular redundancy (DMR)
       - Can detect faults by comparing the results of tasks
       - The number of tasks in an N-core CMP: N/2
     - Triple modular redundancy (TMR)
       - Can mask faults
       - Can identify a failed core
       - The number of tasks in an N-core CMP: N/3
     - We adopt a pair-based scheme for dependable CMPs in order to achieve high performance
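The task counts on this slide can be checked with a short calculation. The following is a minimal sketch: the function names are hypothetical, and the use of floor division for core counts that are not a multiple of 2 or 3 is an assumption that the slides do not state.

```python
def dmr_tasks(n_cores: int) -> int:
    """Concurrent tasks when each task occupies a pair of cores (DMR)."""
    # Assumption: leftover cores that cannot form a full pair stay idle.
    return n_cores // 2


def tmr_tasks(n_cores: int) -> int:
    """Concurrent tasks when each task occupies a triple of cores (TMR)."""
    # Assumption: leftover cores that cannot form a full triple stay idle.
    return n_cores // 3


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(f"{n} cores: DMR runs {dmr_tasks(n)} tasks, TMR runs {tmr_tasks(n)} tasks")
```

For any core count of two or more, the pair-based arrangement runs at least as many tasks as the triple-based one, which is the throughput argument for a pair-based scheme.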

  5. Related works
     - Single-processor SMT devices
       - RMT (Redundant MultiThreading) [Nirmal98]
       - AR-SMT (Active-stream/Redundant-stream Simultaneous MultiThreading) [Eric99]
         - A time redundancy technique that compares the results of a leading thread, called the A-thread, with the results of a trailing thread, called the R-thread
       - SRT (Simultaneous and Redundantly Threaded) [Reinhardt00]
         - Executes two identical copies of the same program as independent threads and compares their results
       - SRTR (SRT with Recovery) [Vijaykumar02]

  6. Related works
     - Dual-processor devices, covering both dual-core CMP chips and processors on different dies
       - Lockstep techniques [Nicholas93, Timothy99, Reorda09]
         - Assume that an error in either processor will cause a difference between the states of the two processors
       - Watchdog processors [Mahmod88]
       - DIVA (Dynamic Implementation Verification Architecture) [Austin99]
         - Employs a high-performance processor core as a leading core and a low-performance core as a trailing checker core

  7. Related works
     - CMP devices
       - CRT (Chip-level Redundant Threading) [Mukherjee02]
         - Applies SRT's detection techniques to CMPs
       - CRTR (CRT with Recovery) [Mohamed03]
         - Extends the CRT for transient-fault detection
       - DCR (Dual Core Redundancy) [Gong08]
         - Extends the CRT by adding hardware-implemented context saving and recovery
       - TCR (Triple Core Redundancy) [Gong08]
         - Executes three copies of a given program on a leading thread, a middle thread, and a trailing thread
       - DCC (Dynamic Core Coupling) [Christopher07]
         - Allows arbitrary CMP cores to verify each other's execution while requiring no dedicated cross-core communication channels or buffers
     - The basic concept of our method is similar to DCC, but DCC employs TMR with hot spares in order to isolate a failed core and recover its task

  8. Agenda
     - Introduction
     - Related works
     - Pair & Swap
       - Concept
       - Hardware model
       - Execution steps
       - Comparison mechanism
       - Task management mechanism
     - Evaluation
     - Conclusion

  9. Fault model
     - Single-core fault
       - A fault can occur in only a single core at a time
     - Permanent fault
       - We must identify the failed core and stop using it
     - Transient fault
       - The core in which a transient fault occurs can be recovered by re-executing from the latest checkpoint
       - We do not have to stop using it immediately
     - Generally, transient faults tend to occur much more frequently than permanent faults

  10. Pair & Swap
      - Processor-level fault tolerance technique for CMPs, consisting of two phases
      - Pair phase: replication and comparison
        - Two identical copies of a given task are executed on a pair of processor cores and the results are compared
        - If no fault is detected, each core repeats a period of execution and comparison
      - Swap phase: swap and retry
        - The partners of the mismatched pair are swapped with those of another pair, and the mismatched task is re-executed from the latest checkpoint
        - At the end of the swap phase, it is decided whether the fault is transient or permanent
          - Permanent fault: the failed core is identified and isolated, and the entire CMP system is reconfigured for continuous operation in a degraded mode
          - Transient fault: the swapped pairs continue their tasks in the next pair phase without any reconfiguration
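The two phases above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Core` class, its bit-flip fault model, the `matches` helper, and the `pair_and_swap` signature are all hypothetical, tasks are reduced to integer inputs, and rolling back to the checkpoint is modeled simply by re-running the same input.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical core model: a fault flips the low bit of a result."""
    name: str
    permanent: bool = False  # every execution is corrupted
    transient: int = 0       # number of upcoming executions to corrupt

    def run(self, task_input: int) -> int:
        result = task_input * 2 + 1  # stand-in for real work
        if self.permanent:
            return result ^ 1
        if self.transient > 0:
            self.transient -= 1
            return result ^ 1
        return result


def matches(pair, task_input):
    """Run the same task on both cores of a pair and compare the results."""
    first, second = pair
    return first.run(task_input) == second.run(task_input)


def pair_and_swap(pairs, tasks, next_tasks):
    """Run one pair phase and, on a mismatch, one swap phase.

    pairs: two (core, core) tuples; tasks: current input of each pair;
    next_tasks: the following input of each pair.
    Returns "no fault", "transient", or "permanent".
    """
    # Pair phase: both pairs execute and compare.
    ok = [matches(pairs[k], tasks[k]) for k in (0, 1)]
    if all(ok):
        return "no fault"
    bad = ok.index(False)  # single-fault assumption: one mismatched pair
    good = 1 - bad
    (x, y), (p, q) = pairs[bad], pairs[good]
    # Swap phase: exchange partners, retry the mismatched task from its
    # checkpoint (modeled as the same input), and let the other task proceed.
    retry_ok = matches((y, p), tasks[bad])
    other_ok = matches((x, q), next_tasks[good])
    return "transient" if retry_ok and other_ok else "permanent"


if __name__ == "__main__":
    cores = [Core("Core1"), Core("Core2", permanent=True), Core("Core3"), Core("Core4")]
    verdict = pair_and_swap([(cores[0], cores[1]), (cores[2], cores[3])], [10, 20], [11, 21])
    print(verdict)
```

With pairs (Core1, Core2) and (Core3, Core4), a mismatch in the first pair leads to the swapped pairs (Core2, Core3) and (Core1, Core4), which is the arrangement used in the four-core example of this deck.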

  11. Target model
      1. Four or more cores, so that partners can be swapped
      2. A stable storage, so that the mismatched task can be retried from the latest correct checkpoint
         - A shared memory is used as the stable storage, and the correct checkpoint data is stored in it
      3. A non-faulty decision unit that decides the comparison results of all the pairs in order to generate consistent comparison results
         - It is needed because a pair of cores in which a fault may occur cannot generate a consistent comparison result by themselves
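Requirements 2 and 3 can be illustrated together with a small sketch. The class and method names below are hypothetical, and having the decision unit itself commit the checkpoint on a match is an assumption made for brevity; the slides only state that correct checkpoint data is kept in shared memory and that a fault-free unit decides the comparison results.

```python
class StableStorage:
    """Hypothetical stand-in for checkpoint data held in shared memory."""

    def __init__(self):
        self._checkpoints = {}

    def commit(self, task, state):
        """Record the latest correct state of a task."""
        self._checkpoints[task] = state

    def load(self, task):
        """Return the latest correct state, used for rollback."""
        return self._checkpoints[task]


class DecisionUnit:
    """Assumed fault-free comparator that judges the results of each pair."""

    def __init__(self, storage):
        self.storage = storage

    def judge(self, task, result_1, result_2):
        """Commit a checkpoint on a match; leave storage untouched otherwise."""
        if result_1 == result_2:
            self.storage.commit(task, result_1)
            return True
        return False


if __name__ == "__main__":
    storage = StableStorage()
    unit = DecisionUnit(storage)
    unit.judge("A", 5, 5)         # match: checkpoint for task A is updated
    if not unit.judge("A", 7, 6): # mismatch: roll back to the last good state
        print("rollback to", storage.load("A"))
```

Because only matching results are ever committed, a rollback always restarts from a state that both cores of a pair agreed on.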

  12. Pair & Swap: Pair phase
      [Timing diagram. Core1 and Core2 both execute Task A(i); Core3 and Core4 both execute Task B(i). At the end of the compare & checkpoint (CP) period, the results of each pair are compared and checkpoints of Task A(i) and Task B(i) are taken.]

  13. Pair & Swap: Pair phase
      [Timing diagram, continued. Each pair proceeds to Task A(i+1) and Task B(i+1), with a comparison and a checkpoint after each period.]

  14. Pair & Swap: Pair phase
      [Timing diagram, continued. Each pair proceeds through Task A(i+3) and Task B(i+3), with a comparison and a checkpoint after each period.]

  15. Pair & Swap: Swap phase
      [Timing diagram. A fault is detected by the comparison at the end of a pair-phase period. The cores roll back by loading the checkpoint (CP), and the swap phase begins.]

  16. Pair & Swap: Swap phase
      [Timing diagram. After the rollback, tasks are migrated so that partners are swapped: Core2 and Core3 re-execute Task A(i), while Core1 and Core4 execute Task B(i+1).]

  17. Pair & Swap: Fault location (1)
      - Transient fault case
        - At the end of the swap phase, both comparison results match
        - It can be decided that the fault was transient, so the two pairs continue executing the same tasks by starting a new pair phase
      [Timing diagram. Pair phase: Core1 and Core2 run Task A(i); Core3 and Core4 run Task B(i). Swap phase, after rollback: Core2 and Core3 run Task A(i); Core1 and Core4 run Task B(i+1).]

  18. Pair & Swap: Fault location (1)
      - Transient fault case
      [Timing diagram. In the new pair phase the swapped pairs are kept: Core2 and Core3 continue with Task A(i+1) and Task A(i+2), while Core1 and Core4 continue with Task B(i+2) and Task B(i+3).]

  19. Pair & Swap: Fault location (2)
      - Permanent fault case
        - At the end of the swap phase, the comparison result of Task A(i) mismatches
        - The failed core is identified as the one that executed the mismatched tasks in both the pair phase and the swap phase, so Core2 is no longer used
      [Timing diagram. Pair phase: Core1 and Core2 run Task A(i); Core3 and Core4 run Task B(i). Swap phase, after rollback: Core2 and Core3 run Task A(i); Core1 and Core4 run Task B(i+1).]

  20. Pair & Swap: Fault location (2)
      - Permanent fault case
      [Timing diagram, same layout as the previous slide.]
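The fault-location rule for the permanent case amounts to intersecting the mismatched pair of the pair phase with the mismatched pair of the swap phase. The sketch below shows that rule, followed by one possible way to re-pair the remaining cores. Both function names are hypothetical, and the re-pairing policy (pair the healthy cores in order and leave an odd one as a spare) is an assumption; the slides only say that the system is reconfigured for degraded operation.

```python
def locate_failed_core(pair_phase_mismatch, swap_phase_mismatch):
    """Return the core found in the mismatched pair of both phases.

    Returns None when the swap phase shows no mismatch (transient fault).
    """
    if swap_phase_mismatch is None:
        return None
    common = set(pair_phase_mismatch) & set(swap_phase_mismatch)
    if len(common) != 1:
        raise ValueError("single-core fault assumption violated")
    return common.pop()


def reconfigure(cores, failed):
    """Drop the failed core and re-pair the rest (assumed policy)."""
    healthy = [c for c in cores if c != failed]
    pairs = [(healthy[k], healthy[k + 1]) for k in range(0, len(healthy) - 1, 2)]
    spare = healthy[-1] if len(healthy) % 2 else None
    return pairs, spare


if __name__ == "__main__":
    cores = ["Core1", "Core2", "Core3", "Core4"]
    # Task A(i) mismatched on (Core1, Core2), then again on (Core2, Core3).
    failed = locate_failed_core(("Core1", "Core2"), ("Core2", "Core3"))
    print(failed, reconfigure(cores, failed))
```

If the swap-phase mismatch had instead appeared on the pair (Core1, Core4), the same intersection would point to Core1.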
