Pair & Swap : An Approach to Graceful Degradation for - - PowerPoint PPT Presentation


Pair & Swap : An Approach to Graceful Degradation for Dependable Chip Multiprocessors Masashi Imai (miyabi@hal.rcast.u-tokyo.ac.jp) Tomohide Nagai (nagai@hal.rcast.u-tokyo.ac.jp) Takashi Nanya (takashi.nanya@canon.co.jp) 2010.06.28 1


slide-1
SLIDE 1

2010.06.28 1

Pair & Swap : An Approach to Graceful Degradation for Dependable Chip Multiprocessors

Masashi Imai (miyabi@hal.rcast.u-tokyo.ac.jp) Tomohide Nagai (nagai@hal.rcast.u-tokyo.ac.jp) Takashi Nanya (takashi.nanya@canon.co.jp)

WDSN10

slide-2
SLIDE 2

Agenda

Introduction

Related works

Pair & Swap

  • Concept
  • Hardware model
  • Execution steps
  • Comparison mechanism
  • Task management mechanism

Evaluation

Conclusion

slide-3
SLIDE 3

Background & Motivation

VLSI technology scaling

  • The performance improvement of a single processor is limited by clock skew, power dissipation, ILP (instruction-level parallelism), and complexity

CMP (Chip Multi-Processor)

  • Integrates multiple processor cores in a single chip
  • CMP is a promising VLSI architecture, not only for high performance but also for reducing power dissipation
  • Even if a processor core becomes faulty, the remaining cores can continue to operate
  • It is not efficient to replace the entire CMP chip immediately when a permanent fault occurs

slide-4
SLIDE 4

Background & Motivation

We consider CMP systems as non-repairable systems and present an approach to graceful degradation for dependable CMPs

Dual modular redundancy (DMR)

  • Can detect faults by comparing the results of tasks
  • The number of tasks in an N-core CMP: N/2

Triple modular redundancy (TMR)

  • Can mask faults
  • Can identify a faulty core
  • The number of tasks in an N-core CMP: N/3

We adopt a pair-based scheme for dependable CMPs in order to achieve high performance

slide-5
SLIDE 5

Related works

Single-processor SMT devices

  • RMT (Redundant MultiThreading) [Nirmal98]
  • AR-SMT (Active-stream/Redundant-stream Simultaneous MultiThreading) [Eric99]
      A time-redundancy technique which compares the results of a leading thread, called the A-thread, with the results of a trailing thread, called the R-thread
  • SRT (Simultaneous and Redundantly Threaded) [Reinhardt00]
      Executes two identical copies of the same program as independent threads and compares their results
  • SRTR (SRT with Recovery) [Vijaykumar02]

slide-6
SLIDE 6

Related works

Dual-processor devices (this covers both dual-core CMP chips and processors on different dies)

  • Lockstep techniques [Nicholas93, Timothy99, Reorda09]
      Assume that an error in either processor will cause a difference between the states of the two processors
  • Watchdog processors [Mahmod88]
  • DIVA (Dynamic Implementation Verification Architecture) [Austin99]
      Employs a high-performance processor core as a leading core and a low-performance core as a trailing checker core

slide-7
SLIDE 7

Related works

CMP devices

  • CRT (Chip-level Redundant Threading) [Mukherjee02]
      Applies SRT's detection techniques to CMPs
  • CRTR (CRT with Recovery) [Mohamed03]
      Extends the CRT for transient-fault detection
  • DCR (Dual Core Redundancy) [Gong08]
      Extends the CRT by adding hardware-implemented context saving and recovery
  • TCR (Triple Core Redundancy) [Gong08]
      Executes three copies of a given program on a leading thread, a middle thread, and a trailing thread
  • DCC (Dynamic Core Coupling) [Christopher07]
      Allows arbitrary CMP cores to verify each other's execution while requiring no dedicated cross-core communication channels or buffers
      The basic concept of our method is similar to DCC, but DCC employs TMR with hot spares in order to isolate a faulty core and recover its task


slide-9
SLIDE 9

Fault model

Single-core fault

  • A fault can occur only in a single core at a time

Permanent fault

  • We must identify the faulty core and stop using it

Transient fault

  • The core in which a transient fault occurs can be recovered by re-executing from the latest checkpoint
  • We do not have to stop using it immediately
  • Generally, transient faults tend to occur much more frequently than permanent faults

slide-10
SLIDE 10

Pair & Swap

Processor-level fault tolerance technique for CMPs which consists of two phases

Pair phase: replication and comparison

  • Two identical copies of a given task are executed on a pair of two processor cores and the results are compared
  • If no fault is detected, each core repeats a period of execution and comparison

Swap phase: swap and retry

  • Partners of the mismatched pair are swapped with another pair and the mismatched task is re-executed from the latest checkpoint
  • At the end of the swap phase, it is decided whether the fault is transient or permanent
      Permanent fault: the faulty core is identified and isolated, and the entire CMP system is reconfigured for continuous operation in a degraded mode
      Transient fault: the swapped pairs continue their tasks without any reconfiguration in the next pair phase
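
The swap-and-retry decision can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the choice of which partners are exchanged (the second core of each pair) is an assumption, since the slides do not fix that detail.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def swap_partners(mismatched: Pair, helper: Pair) -> Tuple[Pair, Pair]:
    """Exchange partners between the mismatched pair and a helper pair.

    Assumption: the second cores of the two pairs are exchanged, so each
    new pair contains exactly one core of the mismatched pair.
    """
    a, b = mismatched
    c, d = helper
    return (a, d), (c, b)


def classify_fault(swap_phase_matches: List[bool]) -> str:
    """Decide the fault type at the end of the swap phase.

    If every comparison in the swap phase matches, the earlier mismatch
    is treated as transient; otherwise it is treated as permanent.
    """
    return "transient" if all(swap_phase_matches) else "permanent"


new_pairs = swap_partners((1, 2), (3, 4))
print(new_pairs)                       # the two swapped pairs
print(classify_fault([True, True]))    # both comparisons match
print(classify_fault([False, True]))   # one comparison still mismatches
```

Because each swapped pair holds exactly one core of the original mismatched pair, a second mismatch points at a single suspect core.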

slide-11
SLIDE 11

Target model

1. Four or more cores, in order to swap partners

2. A stable storage, in order to retry the mismatched task from the latest correct checkpoint

  • A shared memory is used as the stable storage and the correct checkpoint data is stored in the shared memory

3. A non-faulty decision unit which decides the comparison results of all the pairs in order to generate consistent comparison results

  • It is needed because a pair of two cores in which a fault may occur cannot generate a consistent comparison result by themselves

slide-12
SLIDE 12

Pair & Swap: Pair phase

[Figure: Core1 and Core2 execute Task A(i); Core3 and Core4 execute Task B(i). At the end of the period (Compare & CP period), each pair compares its results and takes a checkpoint.]

slide-13
SLIDE 13

Pair & Swap: Pair phase

[Figure: after the comparison and checkpoint of Task A(i) and Task B(i), Core1 and Core2 execute Task A(i+1) while Core3 and Core4 execute Task B(i+1), followed by another comparison.]

slide-14
SLIDE 14

Pair & Swap: Pair phase

[Figure: the pair phase repeats. Core1 and Core2 execute Task A(i) through Task A(i+3); Core3 and Core4 execute Task B(i) through Task B(i+3). Each period ends with a comparison and a checkpoint.]

slide-15
SLIDE 15

Pair & Swap: Swap phase

[Figure: in the pair phase, Core1 and Core2 execute Task A(i) and Core3 and Core4 execute Task B(i). The comparison detects a fault, and the swap phase starts with a rollback (load CP).]

slide-16
SLIDE 16

Pair & Swap: Swap phase

[Figure: after the mismatch, partners are swapped between the two pairs (task migration). Task A(i) is re-executed from the checkpoint by one swapped pair while the other swapped pair executes Task B(i+1).]

slide-17
SLIDE 17

Pair & Swap: Fault location (1)

Transient fault case

[Figure: the swapped pairs execute Task A(i) and Task B(i+1), and both tasks are compared at the end of the swap phase.]

  • At the end of the Swap phase, both comparison results match
  • It can be decided that the fault was transient
  • The two pairs continue executing the same tasks by starting a new Pair phase

slide-18
SLIDE 18

Pair & Swap: Fault location (1)

Transient fault case

[Figure: after the swap phase, a new pair phase starts with the swapped pairs. One pair executes Task A(i+1) and then Task A(i+2), the other executes Task B(i+2) and then Task B(i+3), with a comparison after each period.]

slide-19
SLIDE 19

Pair & Swap: Fault location (2)

Permanent fault case

[Figure: the swapped pairs execute Task A(i) and Task B(i+1), and both tasks are compared at the end of the swap phase.]

  • At the end of the Swap phase, the comparison result of Task A(i) mismatches again
  • The faulty core is identified as the one that executed the mismatched task in both the Pair phase and the Swap phase
  • Stop using Core2
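
The location rule above amounts to intersecting the mismatched pair of the Pair phase with the mismatched pair of the Swap phase. A minimal Python sketch follows, assuming the single-core fault model of this talk; the function name and data layout are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def locate_faulty_core(
    pair_phase_mismatch: Tuple[int, int],
    swap_phase_mismatches: Iterable[Tuple[int, int]],
) -> Optional[int]:
    """Return the core that appears in a mismatched pair in both phases.

    Returns None when no swap-phase comparison mismatches (transient
    fault) or when the intersection does not single out one core.
    Assumes at most one faulty core at a time.
    """
    suspects = set(pair_phase_mismatch)
    for pair in swap_phase_mismatches:
        common = suspects & set(pair)
        if len(common) == 1:
            return common.pop()
    return None


# Pair phase: (Core1, Core2) mismatched. Swap phase: (Core3, Core2) mismatched.
print(locate_faulty_core((1, 2), [(3, 2)]))
# Swap phase with no mismatch: the fault was transient, nothing to isolate.
print(locate_faulty_core((1, 2), []))
```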


slide-21
SLIDE 21

Pair & Swap: Fault location (2)

Permanent fault case

[Figure: after the faulty core is isolated at the end of the swap phase, the three remaining cores start a new pair phase and all execute Task A(i+1).]

slide-22
SLIDE 22

“Trio” configuration

  • When a permanent fault occurs, the number of processor cores may become odd (2m+1)
  • The system then composes m-1 pairs using 2(m-1) cores, and 3 processor cores remain
  • The 3 remaining cores execute the same task and compare their results with each other: the “Trio” configuration

[Figure: Core1, Core2, and Core3 form a Trio; in each period of the pair phase all three cores execute the same task, from Task A(i) through Task A(i+4).]
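
The grouping rule can be sketched as follows. The function name is hypothetical, and which three cores form the Trio is an assumption (here simply the last three in the list); the slides only fix the counts.

```python
from typing import List, Optional, Sequence, Tuple


def form_groups(
    cores: Sequence[int],
) -> Tuple[List[Tuple[int, int]], Optional[Tuple[int, ...]]]:
    """Split the active cores into pairs, plus one Trio when the count is odd.

    With 2m+1 cores this yields m-1 pairs and one Trio; with 2m cores it
    yields m pairs and no Trio. Fewer than two cores cannot form a group.
    """
    remaining = list(cores)
    if len(remaining) < 2:
        return [], None
    trio = None
    if len(remaining) % 2 == 1:
        trio = tuple(remaining[-3:])   # assumption: last three cores form the Trio
        remaining = remaining[:-3]
    pairs = [(remaining[i], remaining[i + 1]) for i in range(0, len(remaining), 2)]
    return pairs, trio


print(form_groups(range(8)))   # eight cores: four pairs, no Trio
print(form_groups(range(7)))   # seven cores (m = 3): two pairs and one Trio
```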

slide-23
SLIDE 23

Comparison mechanism

What should be compared for replicated task execution depends on the application

  • All register files, status registers, and memory updates
  • Only the output values of the system may be required

Two processor cores in each pair exchange the compressed data over the system bus

  • MPI can be used

Each core compares its own data with its partner's data

[Figure: two cores each execute (Exec) and then compare (Compare), exchanging the results of tasks.]

                                         Core1        Core2
No fault occurs
  Task execution                         No error     No error
  Comparison result                      MATCH        MATCH
A fault occurs during task execution
  Task execution                         No error     Error
  Comparison result                      MISMATCH     MATCH or MISMATCH
A fault occurs during comparison
  Task execution                         No error     No error
  Comparison result                      MATCH        MISMATCH

slide-24
SLIDE 24

Comparison mechanism

The decision unit

  • Gathers the comparison results from all the cores
  • Decides whether the results match or mismatch for all the pairs, as in the following table
      The comparison result of each core can be represented by one bit
  • Broadcasts its results to all the cores

Core1 comparison result    Core2 comparison result    Decision unit output
Match                      Match                      Match
Match                      Mismatch                   Mismatch
Mismatch                   Match                      Mismatch
Mismatch                   Mismatch                   Mismatch

[Figure: each core executes and compares; the cores exchange the results of tasks and send the results of comparison to the decision unit, which broadcasts its decision to all the cores.]
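
The decision table is a logical OR of the two one-bit comparison results. A minimal Python sketch, assuming a bit value of True means "mismatch"; the function names and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple


def decide(core1_mismatch: bool, core2_mismatch: bool) -> bool:
    """Return True (mismatch) if either core of the pair reports a mismatch."""
    return core1_mismatch or core2_mismatch


def decision_unit(
    reports: Dict[str, Tuple[bool, bool]],
) -> Dict[str, bool]:
    """Map each task to the decision for its pair.

    `reports` holds, per task, the one-bit results sent by the two cores.
    The returned dictionary is what would be broadcast to all the cores.
    """
    return {task: decide(*bits) for task, bits in reports.items()}


print(decision_unit({"A": (False, True), "B": (False, False)}))
```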

slide-25
SLIDE 25

Task assignment table

Each processor core manages the core pairing and the task assignment table

  • There is no special core which controls the entire system

Tasks have priorities and the list is ordered by priority

Priority    Task    Assigned cores
High        A       0, 1
            B       2, 3
            C       4, 5
Low         D       6, 7

When a mismatched task is detected in the comparison results which are broadcast by the decision unit in the Pair phase, each core updates the table for the following Swap phase

The swapping pair is selected as follows:

1. If there is a Trio in the table, the Trio is selected

2. Otherwise, the pair which executes the lowest-priority task, excluding the mismatched pair itself, is selected

Permanent fault: the lowest-priority task in the table cannot be executed in the next Pair phase
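
The selection rule can be sketched in Python as below. The table layout (a list ordered from high to low priority) and the function name are assumptions made for illustration. Because every core holds the same table and receives the same broadcast, each core can run this selection locally and reach the same answer.

```python
from typing import List, Optional, Tuple

# (task name, assigned cores), ordered from high to low priority
Entry = Tuple[str, Tuple[int, ...]]


def select_swapping_group(table: List[Entry], mismatched_task: str) -> Optional[Entry]:
    """Pick the group that swaps partners with the mismatched pair.

    Rule 1: prefer a Trio (an entry with three cores) if one exists.
    Rule 2: otherwise take the lowest-priority entry that is not the
    mismatched pair itself. Returns None if no other group exists.
    """
    candidates = [entry for entry in table if entry[0] != mismatched_task]
    for entry in candidates:
        if len(entry[1]) == 3:
            return entry
    return candidates[-1] if candidates else None


table = [("A", (0, 1)), ("B", (2, 3)), ("C", (4, 5)), ("D", (6, 7))]
print(select_swapping_group(table, "A"))   # lowest-priority other pair
print(select_swapping_group(table, "D"))   # D itself is excluded
```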


slide-27
SLIDE 27

Evaluation

We evaluate the expected amount of computation until failure, called “MCTF (Mean Computation To Failure)”, using Markov chains in order to compare the performance

MCTF = \sum_{i \neq \mathrm{failure}} \left( \mathrm{Performance}(i) \times \int_{0}^{\infty} P_i(t)\,dt \right)

Comparison targets

  • 1. Proposed Pair & Swap
  • 2. Dynamic TMR
  • 3. Static TMR

Failure rate

  • Permanent: λ = 1.0×10^-9
  • Transient: ε = 1.0×10^-7

Fault detection, fault location, and reconfiguration are successfully executed with a probability of 1
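
To make the MCTF definition concrete, here is a deliberately simplified Python sketch. It is not the Markov chain used in this evaluation: it assumes a pure-death chain in which each permanent fault (rate λ per core) removes one core, ignores transient faults and the swap-phase re-execution, and takes the performance of a state with n cores to be the number of pairs (counting a Trio as one task). Under those assumptions, the integral of P_i(t) over time is the expected time spent in state i, which is 1/(nλ).

```python
from fractions import Fraction


def tasks_pair_and_swap(active_cores: int) -> int:
    """Number of tasks that can run with the given number of active cores
    (pairs, with a Trio counted as one task). Assumed performance metric."""
    return active_cores // 2 if active_cores >= 2 else 0


def mctf_pure_death(initial_cores: int, failure_rate: Fraction) -> Fraction:
    """Sum Performance(i) times the expected time in state i over all
    non-failure states of a pure-death chain (illustrative only)."""
    total = Fraction(0)
    for n in range(initial_cores, 1, -1):
        expected_time = 1 / (n * failure_rate)   # mean sojourn time with n cores
        total += tasks_pair_and_swap(n) * expected_time
    return total


# Four cores with a unit failure rate: 2/4 + 1/3 + 1/2 = 4/3
print(mctf_pure_death(4, Fraction(1)))
```

Scaling the failure rate only rescales the result, so the comparison between schemes depends on how many tasks each state can run and how long the system stays in it.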

slide-28
SLIDE 28

Markov chain of Pair & Swap

The performance is defined as the mean number of tasks which can be executed at the state

  • If a fault is detected in any pair, the mismatched task must be re-executed
      The mean number of tasks decreases in the Swap phase
  • If a fault is detected in a Trio, the task can be executed continuously
      The mean number of tasks does not change


slide-30
SLIDE 30

Markov chain of dynamic TMR

Dynamic TMR, in which three processor cores are dynamically coupled as the number of active cores decreases

  • When the number of active cores is 3m+2, the remaining 2 cores compose a pair
  • When the number of active cores is 3m+1, one TMR group and the remaining core are regrouped into 2 pairs

If the initial number of cores is 2n = 3m, the mean number of tasks of the proposed P&S (n) is 1.5 times larger than that of dynamic TMR (m = 2n/3)
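
The task counts of the two schemes can be compared with a short Python sketch. The function names are hypothetical, and the counts follow the grouping rules stated on the slides (a Trio or a TMR group counts as one task).

```python
def tasks_pair_and_swap(active_cores: int) -> int:
    """Pairs, with a Trio counted as one task, among the active cores."""
    return active_cores // 2 if active_cores >= 2 else 0


def tasks_dynamic_tmr(active_cores: int) -> int:
    """Tasks under dynamic TMR as described above.

    3m cores   -> m TMR groups
    3m+2 cores -> m TMR groups plus 1 pair
    3m+1 cores -> m-1 TMR groups plus 2 pairs
    """
    if active_cores < 2:
        return 0
    m, remainder = divmod(active_cores, 3)
    return m if remainder == 0 else m + 1


for cores in (6, 12, 18):
    ratio = tasks_pair_and_swap(cores) / tasks_dynamic_tmr(cores)
    print(cores, tasks_pair_and_swap(cores), tasks_dynamic_tmr(cores), ratio)
```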

slide-31
SLIDE 31

Markov chain of static TMR

  • When a permanent fault occurs in any TMR
      The remaining two cores compose a pair and compare their results with each other
  • When a permanent fault occurs in any pair
      Neither of the two processor cores can be used

slide-32
SLIDE 32

MCTF

[Figure: MCTF versus the initial number of cores (2 to 62) for Pair&Swap, Dynamic TMR, and Static TMR. Permanent: λ = 1.0×10^-9, Transient: ε = 1.0×10^-7]

slide-33
SLIDE 33

MCTF ratio

[Figure: MCTF ratio (vertical axis from 1 to 1.8) versus the initial number of cores (2 to 62) for Pair&Swap / Dynamic TMR and Pair&Swap / Static TMR]

slide-34
SLIDE 34

Evaluation summary

Advantage

  • Achieves an MCTF (Mean Computation To Failure) about 1.4 times larger than that of dynamic TMR as the initial number of cores increases

Overhead

  • Comparison, checkpointing, and task swapping use the system bus
      It might become a serious bottleneck with an increasing number of processors

slide-35
SLIDE 35

Conclusion

Pair & Swap

  • Enables graceful degradation
  • Tolerates both transient faults and permanent faults
  • Requires only one extra task execution for the swap phase to decide whether the fault is transient or permanent and to identify the faulty core
  • Achieves an MCTF (Mean Computation To Failure) about 1.4 times larger than that of dynamic TMR as the initial number of cores increases

slide-36
SLIDE 36

Ongoing issues

Evaluate the overhead

  • Task migration
      Data size, overhead time
      Waiting time for synchronization
  • Checkpoint

Implementation in real hardware

  • Use the V850E* processor core and implement the proposed scheme based on the NoC (Network-on-Chip) architecture

slide-37
SLIDE 37

Killer application

Sensor - controller - actuator system

  • Each program size is small
  • What should be compared for replicated task execution is only the output value
  • Home electronics, …

The proposed scheme has to execute a task twice when a fault is detected

  • It is not suitable for hard-deadline-based applications, since the required throughput is twice that of normal execution

slide-38
SLIDE 38

Reliability of 6-cores CMP

[Figure: reliability (0 to 1) versus time (×10^9, from 10 to 40) for TMR*2 and pair & swap]

slide-39
SLIDE 39

Execution steps

Each core:

1. Executes a given task based on the task assignment table, which contains a list of all the tasks to be executed and the corresponding list of cores assigned to each task
2. Exchanges execution results with its partner core in the pair
3. Compares its execution results with the partner's results
4. Sends the comparison result to the decision unit
5. Receives the comparison results of all the pairs, which are broadcast by the decision unit
6. Updates the task assignment table
7. Makes checkpoint data and stores it in the shared memory when its comparison result matches
8. Loads the corresponding checkpoint data from the shared memory when its comparison result mismatches or it belongs to the swapping pair

[Figure: timeline of Task A(i), Task A(i+1), and Task A(i+2), each followed by Compare and Checkpoint]
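
Steps 7 and 8 reduce to a single decision that every core makes after it receives the broadcast. A minimal Python sketch, with a hypothetical function name and string labels chosen only for illustration:

```python
def end_of_period_action(own_pair_mismatch: bool, in_swapping_pair: bool) -> str:
    """Choose what a core does with checkpoint data at the end of a period.

    A core loads a checkpoint if its own comparison mismatched or if it
    was selected as the swapping pair (steps 8); otherwise it stores a
    new checkpoint in the shared memory (step 7).
    """
    if own_pair_mismatch or in_swapping_pair:
        return "load_checkpoint"
    return "store_checkpoint"


print(end_of_period_action(False, False))  # normal period
print(end_of_period_action(True, False))   # member of the mismatched pair
print(end_of_period_action(False, True))   # member of the swapping pair
```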

slide-40
SLIDE 40

Fault analysis

Digital relays in the power distribution network

  • The number of processor unit failures is high


slide-42
SLIDE 42

Task assignment table


slide-43
SLIDE 43

Trio: Swap phase

[Figure: Core1, Core2, and Core3 execute Task A(i) and compare in the pair phase, then execute Task A(i+1) and compare in the swap phase, followed by the next pair phase.]

  • Detect a fault: the faulty core can be identified as the one that is included in both of the mismatched pairs
  • Continue execution: there is no need to swap pairs in the Trio

slide-44
SLIDE 44

Trio: Fault location (1)

Transient fault case

[Figure: Core1, Core2, and Core3 execute Task A(i) and compare in the pair phase, then execute Task A(i+1) and compare in the swap phase, followed by the next pair phase.]

  • At the end of the Swap phase, all comparison results match
  • It can be decided that the fault was transient
  • The “Trio” continues executing the same tasks by starting a new Pair phase

slide-45
SLIDE 45

Trio: Fault location (2)

Permanent fault case

[Figure: Core1, Core2, and Core3 execute Task A(i) and compare in the pair phase, then execute Task A(i+1) and compare in the swap phase. In the next pair phase the two remaining cores execute Task A(i+2).]

  • At the end of the Swap phase, two of the three comparison results mismatch, as they did in the comparison of the pair phase
  • It can be decided that the fault was permanent
  • Stop using the faulty core: Core1