2010.06.28 1
Pair & Swap : An Approach to Graceful Degradation for Dependable Chip Multiprocessors
Masashi Imai (miyabi@hal.rcast.u-tokyo.ac.jp) Tomohide Nagai (nagai@hal.rcast.u-tokyo.ac.jp) Takashi Nanya (takashi.nanya@canon.co.jp)
WDSN10
CRT (Chip-level Redundant Threading) [Mukherjee02]
Applies SRT's detection techniques to CMPs
CRTR (CRT with Recovery) [Mohamed03]
Extends the CRT for transient-fault detection
DCR (Dual Core Redundancy) [Gong08]
Extends the CRT by adding HW implemented context saving and recovery
TCR (Triple Core Redundancy) [Gong08]
Executes three copies of a given program as a leading thread, a middle thread, and a trailing thread
DCC (Dynamic Core Coupling) [Christopher07]
Allows arbitrary CMP cores to verify each other's execution while requiring no dedicated cross-core communication channels or buffers
The basic concept of our method is similar to DCC, whereas DCC employs TMR with hot spares in order to isolate a failed core and recover its task
Two identical copies of a given task are executed on a pair
If no fault is detected, each core repeats a period of execution and comparison
Partners of the mismatched pair are swapped with another pair, and the mismatched task is re-executed from the latest checkpoint
At the end of the swap phase, it is decided whether the fault is transient or permanent
Permanent fault: the failed core is identified and isolated, and the entire CMP system is reconfigured for continuous operation in a degraded mode
Transient fault: the swapped pairs continue their tasks in the next pair phase without any reconfiguration
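The decision made at the end of the swap phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `swap_partners` and `classify_fault` are hypothetical, and the particular way partners are exchanged (first core with first core, second with second) is an assumption, since the slides only say that partners are swapped with another pair.

```python
def swap_partners(mismatched, other):
    """Exchange partners between the mismatched pair and another pair.

    Assumption: (a, b) and (c, d) become (a, c) and (b, d).
    """
    (a, b), (c, d) = mismatched, other
    return (a, c), (b, d)


def classify_fault(pair_phase_mismatched, swap_phase_mismatched):
    """Classify the fault at the end of the swap phase.

    pair_phase_mismatched: the pair (two core ids) that mismatched in the pair phase.
    swap_phase_mismatched: list of swapped pairs whose comparison mismatched.
    """
    if not swap_phase_mismatched:
        # Every comparison matched after re-execution: the fault was transient.
        return ("transient", None)
    # The failed core took part in a mismatch in both phases.
    suspects = set(pair_phase_mismatched)
    for pair in swap_phase_mismatched:
        suspects &= set(pair)
    if len(suspects) == 1:
        return ("permanent", suspects.pop())
    # Not covered by the slides (e.g. multiple simultaneous faults).
    return ("undetermined", None)
```

For example, if the pair (1, 2) mismatches and is swapped with (3, 4), a second mismatch in the swapped pair containing core 2 points to core 2 as the core to isolate.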
checkpoint data is stored in the shared memory
themselves
[Figure: timeline of Core1 to Core4 in the pair phase. Core1 and Core2 execute Task A(i) while Core3 and Core4 execute Task B(i); each "Compare & CP period" ends with a comparison and a checkpoint, and the pairs proceed to Task A(i+1), A(i+2), A(i+3) and Task B(i+1), B(i+2), B(i+3).]
[Figure: when the comparison of Task A(i) mismatches, the checkpoint is loaded (rollback), partners are swapped by task migration, and Task A(i) is re-executed while Task B(i+1) is executed; the swap phase ends with a comparison of Task A(i) and Task B(i+1).]
Transient fault: at the end of the swap phase, both comparison results match, so it can be decided that the fault was transient
The two swapped pairs continue executing the same tasks by starting a new pair phase
[Figure: after a transient fault, the swapped pairs continue in a new pair phase, executing Task A(i+1), A(i+2) and Task B(i+2), B(i+3), each followed by a comparison.]
Permanent fault: at the end of the swap phase, the comparison result of Task A(i) mismatches again
The failed core is identified as the one that executed the mismatched task in both the pair phase and the swap phase (Core2 in this example), and the system stops using it
[Figure: after the swap phase identifies the failed core, the remaining cores execute Task A(i+1) in a new pair phase.]
[Figure: timeline of Core1, Core2, and Core3 executing Task A(i) through Task A(i+4) over successive pair phases.]
All register files, status registers, and memory updates
Only the output value of the system may be required
MPI can be used
Case                                    Task execution (Core1 / Core2)    Comparison result (Core1 / Core2)
No fault occurs                         No error / No error               MATCH / MATCH
A fault occurs during task execution    No error / Error                  MISMATCH / MATCH or MISMATCH
A fault occurs during comparison        No error / No error               MATCH / MISMATCH
The comparison result of each core can be represented by one bit.
Core1 comparison result    Core2 comparison result    Decision unit output
Match                      Match                      Match
Match                      Mismatch                   Mismatch
Mismatch                   Match                      Mismatch
Mismatch                   Mismatch                   Mismatch
The result of comparison is broadcast from the decision unit to all the cores
There is no special core which controls the entire system
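The decision rule (the output is Match only when both cores of a pair report Match) amounts to a logical AND of the two one-bit comparison results. A minimal sketch follows; the names `decision_unit` and `decide_all` are hypothetical, and the dictionary stands in for the broadcast to all cores.

```python
def decision_unit(core1_match: bool, core2_match: bool) -> bool:
    """Return True (Match) only if both cores of the pair report a match."""
    return core1_match and core2_match


def decide_all(reports: dict) -> dict:
    """Apply the decision rule to every pair.

    reports maps a pair name to the two one-bit results sent by its cores.
    The returned dict models the result that is broadcast to all cores.
    """
    return {pair: decision_unit(*bits) for pair, bits in reports.items()}
```

Because a faulty core may wrongly report Match, treating any single Mismatch as a mismatch for the pair is the conservative choice.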
Task assignment table (tasks listed from high to low priority):
Task    Assigned cores
A       0, 1
B       2, 3
C       4, 5
D       6, 7
1. If there is a Trio in the table, the Trio is selected
2. Otherwise, the pair which executes the lowest-priority task, other than the mismatched pair itself, is selected
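The two selection rules can be sketched against a task assignment table. This sketch assumes that the rules choose the partner group for a mismatched pair, which the slide does not state explicitly; the function name `select_partner` and the list-of-tuples table layout are hypothetical.

```python
def select_partner(table, mismatched_task):
    """Pick the group to swap with, following the two rules above.

    table: list of (task, cores) tuples ordered from high to low priority.
    mismatched_task: the task whose comparison mismatched.
    Returns the selected task name, or None if no other group exists.
    """
    candidates = [(task, cores) for task, cores in table if task != mismatched_task]
    # Rule 1: a Trio (three assigned cores) is selected if one exists.
    for task, cores in candidates:
        if len(cores) == 3:
            return task
    # Rule 2: otherwise take the lowest-priority remaining group.
    return candidates[-1][0] if candidates else None


table = [("A", (0, 1)), ("B", (2, 3)), ("C", (4, 5)), ("D", (6, 7))]
partner = select_partner(table, "B")  # the lowest-priority pair other than B
```

Choosing the lowest-priority pair as the partner keeps the disturbance away from higher-priority tasks.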
Fault detection, fault location, and reconfiguration are successfully executed with a probability of 1
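[Equation: only fragments survive in the extraction. It contains an integral over t with upper limit ∞ of a term indexed by i, taken over i ≠ failure.]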
If a fault is detected in any pair, the mismatched task must be re-executed
The mean number of tasks decreases in the swap phase
If a fault is detected in a Trio, the task can be executed continuously
The mean number of tasks does not change
When the number of active cores is 3m+2, the remaining 2 cores compose a pair
When the number of active cores is 3m+1, a TMR and the remaining 1 core compose 2 pairs
If the initial number of cores is 2n = 3m, the mean number of tasks of the proposed P&S is 1.5 times that of TMR (n tasks versus m = 2n/3 tasks)
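The 1.5 ratio follows from simple counting, which the sketch below makes explicit. It assumes that P&S runs one task per pair, with a leftover core joining a pair as a Trio, and that the TMR scheme regroups leftover cores into pairs as described above. The function names are hypothetical, and the sketch counts only the tasks that can run at one instant, not the mean over the system lifetime.

```python
def pair_and_swap_tasks(active_cores: int) -> int:
    """Tasks runnable under P&S: one per pair (an odd core forms a Trio)."""
    return active_cores // 2


def dynamic_tmr_tasks(active_cores: int) -> int:
    """Tasks runnable under TMR with regrouping of leftover cores."""
    m, rest = divmod(active_cores, 3)
    if rest == 0:
        return m          # m TMR groups
    if rest == 2:
        return m + 1      # m TMR groups plus one pair
    # rest == 1: one TMR group and the leftover core regroup into two pairs.
    return m + 1 if m >= 1 else 0


# With 2n = 3m = 6 cores: three pairs versus two TMR groups.
ratio = pair_and_swap_tasks(6) / dynamic_tmr_tasks(6)
```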
[Figure: Pair&Swap, Dynamic TMR, and Static TMR plotted against the initial number of cores (2 to 62). Fault rates: permanent λ = 1.0×10^-9, transient ε = 1.0×10^-7.]
[Figure: ratios Pair&Swap / Dynamic TMR and Pair&Swap / Static TMR plotted against the initial number of cores (2 to 62); the vertical axis runs from 1 to 1.8.]
It might become a serious bottleneck as the number of processors increases
Data size, overhead time
Waiting time for synchronization
Each program size is small
What should be compared for replicated task execution is
1. Executes a given task based on the task assignment table, which contains a list of all the tasks to be executed and the corresponding list of cores assigned to each task
2. Exchanges execution results between cores in each pair
3. Compares its execution results with the partner's results
4. Sends the comparison result to the decision unit
5. Receives the comparison results of all the pairs, which are broadcast by the decision unit
6. Updates the task assignment table
7. Makes checkpoint data and stores it in the shared memory when its comparison result matches
8. Loads the corresponding checkpoint data from the shared memory when its comparison result mismatches or it belongs to the swapping pair
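The steps above can be sketched as one execute, compare, and checkpoint period for a single pair. This is a simplified simulation under stated assumptions, not the authors' implementation: a Python dict stands in for the shared memory, the task is a function on an integer state, faults are injected by corrupting a result, and the table update of step 6 is omitted. The name `run_pair_period` is hypothetical.

```python
def run_pair_period(task, checkpoint, cores, faulty=frozenset()):
    """Run one period for one pair and return True if the results match.

    task: function mapping the current integer state to a new state.
    checkpoint: dict acting as shared memory, holding the key "state".
    cores: tuple of the two core ids in the pair.
    faulty: core ids whose result is corrupted (fault injection).
    """
    state = checkpoint["state"]                  # start from the latest checkpoint
    results = {c: task(state) for c in cores}    # step 1: both cores execute the task
    for c in faulty & set(cores):
        results[c] += 1                          # simulated erroneous result
    a, b = cores
    # Steps 2-3: exchange results and compare on each core.
    bits = (results[a] == results[b], results[b] == results[a])
    # Steps 4-5: the decision unit combines the two bits and broadcasts the outcome.
    matched = all(bits)
    if matched:
        checkpoint["state"] = results[a]         # step 7: store a new checkpoint
    # Step 8: on a mismatch the checkpoint is left untouched, so the next
    # period reloads the last good state.
    return matched
```

Because the checkpoint is updated only after a match, a mismatched period can always be re-executed from a state that both cores agreed on.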
[Figure: a core executes Task A(i), Task A(i+1), and Task A(i+2) in sequence, each followed by Compare and Checkpoint.]
[Figure: Trio timeline. Core1, Core2, and Core3 all execute Task A(i) in the pair phase and Task A(i+1) in the swap phase, each followed by a comparison.]
The failed core can be identified as the one that is included in both mismatched comparisons
Continue execution. There is no need to swap pairs in the trio
If all comparison results match, it can be decided that the fault was transient
The Trio continues executing the same tasks by starting a new pair phase
If two of the three comparison results mismatch, similarly to the comparison in the pair phase, it can be decided that the fault was permanent
Stop using the failed core (Core1 in this example); the remaining two cores execute Task A(i+2)
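The Trio case can be sketched as follows. The sketch assumes that the three cores' results are compared pairwise, so that a single faulty core shows up in exactly two of the three comparisons; the function names are hypothetical and the code expects exactly three cores.

```python
from itertools import combinations


def mismatched_comparisons(results):
    """Return the pairwise comparisons (as frozensets of core ids) that mismatch."""
    return [frozenset(p) for p in combinations(sorted(results), 2)
            if results[p[0]] != results[p[1]]]


def suspect(results):
    """Return the core included in both mismatched comparisons, or None."""
    bad = mismatched_comparisons(results)
    if len(bad) != 2:
        return None
    (core,) = bad[0] & bad[1]   # two distinct pairs out of three cores share one core
    return core


def classify_trio(pair_phase, swap_phase):
    """Classify a Trio fault from the results of two consecutive phases."""
    first = suspect(pair_phase)
    if not mismatched_comparisons(swap_phase):
        return ("transient", None)
    if first is not None and suspect(swap_phase) == first:
        return ("permanent", first)
    # Not covered by the slides (e.g. different cores mismatching in each phase).
    return ("undetermined", None)
```

Since the two agreeing cores already provide a trusted result, the Trio does not need to roll back or swap before continuing, which matches the statement that execution continues.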