FlipBack: Automatic Target Protection Against Soft Errors
Xiang Ni
Parallel Programming Lab
Soft Errors
- Common sources of soft errors
  - Electrical noise
  - External radiation
  - Manufacturing faults
- Data corruption: we may or may not know that it happened
Shrinking chip size
- More energy efficient
- Higher soft error rate
Motivation Example
- Flowchart: ghostMsg arrives, msgsRecvd++, then check msgsRecvd == expectedMsg (Yes / No)
- expectedMsg = 00000111 (7)
- Bit flip to 00001111 (7 -> 15): HANG
- Bit flip to 00000011 (7 -> 3): stop accepting messages much earlier, giving an incorrect result
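The two failure modes above can be reproduced with a small sketch. It assumes an 8-bit expectedMsg and a receiver that counts messages until the counter matches; `flipBit`, `receiveAll`, and `Outcome` are hypothetical names introduced for illustration, not part of the application.

```cpp
#include <cassert>
#include <cstdint>

// Flip one bit of a value, as a soft error might (bit 0 is the least significant).
inline std::uint8_t flipBit(std::uint8_t value, unsigned bit) {
    assert(bit < 8);
    return static_cast<std::uint8_t>(value ^ (1u << bit));
}

enum class Outcome { Correct, Hang, StoppedEarly };

// Hypothetical receiver: counts arriving ghost messages until the counter
// equals expectedMsg. `sent` is how many messages the neighbors really send.
inline Outcome receiveAll(std::uint8_t expectedMsg, unsigned sent) {
    std::uint8_t msgsRecvd = 0;
    for (unsigned n = 0; n < sent; ++n) {
        ++msgsRecvd;                      // msgsRecvd++ on each ghostMsg
        if (msgsRecvd == expectedMsg) {   // the check from the flowchart
            return (n + 1 == sent) ? Outcome::Correct : Outcome::StoppedEarly;
        }
    }
    return Outcome::Hang;                 // the counter can never reach expectedMsg
}
```

With 7 messages sent, `receiveAll(flipBit(7, 3), 7)` models the 7 -> 15 case and `receiveAll(flipBit(7, 2), 7)` models the 7 -> 3 case.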
Runtime Guided Replication
- Control variables
  - msgsRecvd, expectedMsg
  - Affect program flow
- How do we ensure the program control flow is correct?
  - Full duplication is expensive: less than 50% resource utilization or at least twice the running time
- What about duplicating only the computation that affects program flow?
  - Leverage a compiler slicing pass
  - Reduce computation time
  - Avoid doubling the memory
Compiler Slicing Pass
void Stencil::beginNextIter() {
  iterCount++;
  if (iterCount >= totalIter) {
    mainProxy.done(); // program exits
  } else {
    for (int i = 0; i < totalDirections; i++) {
      ghostMsg *m = createGhostMsg(dirs[i]);
      copy(m->data, boundary[i]);
      int sendTo = myIdx + dirs[i];
      stencilProxy(sendTo).receiveMessage(m);
    }
  }
}
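To illustrate what a slice of this method might look like, here is a minimal plain-C++ sketch. It is not the output of the actual pass: it assumes the slice is taken on iterCount and totalIter and drops the payload copy, and it replaces the Charm++ proxies with ordinary fields. `StencilSlice`, `sliceAgrees`, `ghost`, and `done` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical plain-C++ stand-in for the Stencil chare (no Charm++ proxies).
struct Stencil {
    int iterCount = 0;
    int totalIter = 0;
    std::vector<double> boundary;   // non-control data
    std::vector<double> ghost;      // stands in for the ghost message payload
    bool done = false;              // stands in for mainProxy.done()

    // Original method: control-flow statements plus data movement.
    void beginNextIter() {
        iterCount++;
        if (iterCount >= totalIter) {
            done = true;
        } else {
            ghost = boundary;       // stands in for copy(m->data, boundary[i])
        }
    }
};

// Assumed result of slicing on iterCount and totalIter: only the statements
// that the branch condition depends on survive; the payload copy is dropped.
struct StencilSlice {
    int iterCount = 0;
    int totalIter = 0;
    bool done = false;

    void beginNextIter() {
        iterCount++;
        if (iterCount >= totalIter) {
            done = true;
        }
    }
};

// Run both versions side by side and report whether they always agree on
// the control variables.
inline bool sliceAgrees(int totalIter, int steps) {
    assert(totalIter > 0 && steps >= 0);
    Stencil full;
    full.totalIter = totalIter;
    full.boundary.assign(4, 1.0);
    StencilSlice slice;
    slice.totalIter = totalIter;
    for (int s = 0; s < steps; ++s) {
        full.beginNextIter();
        slice.beginNextIter();
        if (full.iterCount != slice.iterCount || full.done != slice.done) return false;
    }
    return true;
}
```

The sliced version touches no field data, which is where the savings in computation time and memory would come from under these assumptions.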
The Role of Runtime System
- Creation of shadow chares
- Initialize with the same control variables as the original chare
- Share the same pointers to the non-control variables
- Compare the values of control variables and outgoing messages at the end of each entry method
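The steps above can be sketched in plain C++. This is only an illustration under stated assumptions, not the Charm++ runtime's implementation: `Chare`, `ControlVars`, `makeShadow`, and `runAndCompare` are hypothetical names, and only control variables are compared (outgoing messages are left out).

```cpp
#include <cassert>
#include <vector>

// Control variables that are replicated in the shadow.
struct ControlVars {
    int msgsRecvd = 0;
    int expectedMsg = 0;
    bool operator==(const ControlVars& o) const {
        return msgsRecvd == o.msgsRecvd && expectedMsg == o.expectedMsg;
    }
};

struct Chare {
    ControlVars ctl;                 // private copy per replica
    std::vector<double>* field;      // non-control data, shared by pointer

    // Entry method body: only the control-variable update is shown.
    bool receiveMessage() {
        ctl.msgsRecvd++;
        return ctl.msgsRecvd == ctl.expectedMsg;   // true once all ghosts arrived
    }
};

// Create a shadow: copy the control variables, share the field pointer.
inline Chare makeShadow(const Chare& original) {
    return Chare{original.ctl, original.field};
}

// Run the entry method on both replicas, then compare at its end.
// Returns false when the replicas disagree, i.e. a soft error is detected.
inline bool runAndCompare(Chare& original, Chare& shadow) {
    bool a = original.receiveMessage();
    bool b = shadow.receiveMessage();
    return a == b && original.ctl == shadow.ctl;
}
```

Because the shadow holds only a pointer to the field data, the memory cost of replication in this sketch is limited to the control variables.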
Another Example
void Stencil::invokeComputation() { // computation routine
  for (int i = 0; i < size; ++i) {
    temperature[i] = ...
  }
}
- The previous method fails to protect the loop index i
  - Its lifetime ends before the end of the entry method
  - However, if a bit flip occurs in i, incorrect data is used or the program crashes
Selective Instruction Duplication
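A minimal source-level sketch of the idea, applied to a loop index: the index is kept twice and the copies are compared before each use. A real implementation would more likely work on the compiler's intermediate representation, since an optimizer may merge source-level duplicates. `computeProtected` and `FaultHook` are hypothetical names, and the element update is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fault hook: lets a test corrupt the primary index mid-loop.
using FaultHook = void (*)(std::size_t& index);

// Update every element while keeping a duplicate of the loop index.
// Returns false if the two copies ever disagree (a soft error was detected).
inline bool computeProtected(std::vector<double>& temperature,
                             FaultHook inject = nullptr) {
    std::size_t i = 0;      // primary index
    std::size_t iDup = 0;   // duplicated index, advanced by duplicated instructions
    while (i < temperature.size()) {
        if (inject) inject(i);
        if (i != iDup) return false;   // check before the index is used
        temperature[i] += 1.0;         // placeholder for the real update
        ++i;
        ++iDup;
    }
    return true;
}
```

Detection happens before the corrupted index reaches `temperature[i]`, so the out-of-range access that would otherwise crash the program or read incorrect data never occurs.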
Protection for Field Data
- Rules that hold in nature should also hold in scientific programs
[Figure: field data from Stencil2d and OpenAtom]
Protection for Field Data
- Spatial similarity
- Temporal similarity
  - data at time steps t-2k, t-k, t
- Spatial-temporal similarity
  - spatial similarity of temporal updates
  - temporal similarity of spatial differences
[Figure: d(i,j) and its eight neighbors]
d(i-1,j-1) d(i-1,j) d(i-1,j+1)
d(i,j-1) d(i,j) d(i,j+1)
d(i+1,j-1) d(i+1,j) d(i+1,j+1)
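One possible way to turn these similarities into a check is to predict each value and flag large deviations. The sketch below is an assumption, not the method from the talk: the mean-of-neighbors predictor, the linear extrapolation from t-2k and t-k, and the fixed tolerance are all illustrative choices, and `neighborMean`, `extrapolate`, and `suspicious` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Spatial predictor: mean of the eight neighbors of interior point (i, j).
// Assumes a rectangular grid.
inline double neighborMean(const Grid& d, std::size_t i, std::size_t j) {
    assert(i > 0 && j > 0 && i + 1 < d.size() && j + 1 < d[i].size());
    double sum = 0.0;
    for (std::size_t r = i - 1; r <= i + 1; ++r)
        for (std::size_t c = j - 1; c <= j + 1; ++c)
            if (r != i || c != j) sum += d[r][c];
    return sum / 8.0;
}

// Temporal predictor: linear extrapolation from time steps t-2k and t-k.
inline double extrapolate(double atTminus2k, double atTminusK) {
    return 2.0 * atTminusK - atTminus2k;
}

// A value is suspicious when it is far from its prediction.
inline bool suspicious(double value, double predicted, double tolerance) {
    return std::fabs(value - predicted) > tolerance;
}
```

A flip in a high-order bit moves a value far from both predictions, while a flip in a low-order mantissa bit may stay within the tolerance and go unflagged.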
Evaluation
- Miniaero
  - Mantevo mini-applications suite
  - Compressible Navier-Stokes equations using an explicit RK4 method
- Particle-in-cell (PIC)
  - Intel PRK benchmark suite
  - Charm++ implementation
  - Particles are distributed within a fixed grid of charges. At each time step, PIC calculates the impact of the Coulomb potential of particles with related grid points.
- Stencil3d
  - 7-point stencil-based computation on a 3D structured mesh
- Fault injection with LLFI
  - random time
  - random processor
Evaluation
[Figures: fault-injection results for Miniaero, Particle-in-cell, and Stencil3d. Each figure plots Failure Type (%) against Corrupted Bit in eight panels: (a) Original: control, (b) Protected: control, (c) Original: communication, (d) Protected: communication, (e) Original: computation (integer), (f) Protected: computation (integer), (g) Original: computation (floating point), (h) Protected: computation (floating point). Legend: Hang, Crash, Masked, SOC, Detected, Detected & Masked, Detected & Corrected.]
Performance
[Figure: Time (s) for base, +rts-dup, +inst-dup, and +field-protection. Legend: 16 cores, 32 cores, 64 cores, 128 cores, 256 cores. Overhead labels are listed in the order they appear in each panel.]
- (a) Miniaero: 2.6%, 5.3%, 5.7%, 3.9%, 6.8%, 6.9%, 4.1%, 9.0%, 9.7%, 2.6%, 5.3%, 5.6%, 3.6%, 5.5%, 6.0%
- (b) PIC: 10.7%, 16.2%, 18.4%, 12.0%, 14.9%, 17.3%, 11.0%, 13.8%, 15.6%, 12.5%, 15.2%, 17.0%, 5.5%, 7.1%, 8.9%
- (c) Stencil3d: 17.6%, 32.6%, 41.5%, 29.4%, 45.5%, 52.0%, 16.1%, 29.7%, 34.4%, 12.6%, 23.8%, 29.1%, 25.0%, 30.1%, 32.5%
Discussion
- Comparison with the traditional checkpoint/restart strategy
  - For crashes/hangs induced by bit flips, rolling back to the previous checkpoint is another solution
    - At the cost of a global restart
  - With FlipBack, the overhead of local restart is minimal
[Figure: Overhead (%) versus FIT (number of crashes/hangs in 1 billion seconds). Legend: 60s, 300s, 600s, 1800s, 3600s.]
Conclusion
- Leverage compiler and runtime techniques for a cheaper way to
protect applications against silent data corruptions
- Almost 100% coverage
- 6-20% overhead