FlipBack: Automatic Target Protection Against Soft Errors (Xiang Ni)

slide-1
SLIDE 1

FlipBack: Automatic Target Protection Against Soft Errors

Xiang Ni Parallel Programming Lab

slide-2
SLIDE 2

Soft Errors

  • Common sources of soft errors
      • Electrical noise
      • External radiation
      • Manufacturing fault
  • Data corruption: we may or may not know
  • Shrinking chip size
      • More energy efficient
      • Higher soft error rate

2
slide-11
SLIDE 11

Motivation Example

Flowchart: each incoming ghostMsg triggers msgsRecvd++, followed by the test msgsRecvd == expectedMsg (Yes/No).

  • expectedMsg = 00000111 (7)
  • Bit flip to 00001111 (7 -> 15): the test never succeeds: HANG
  • Bit flip to 00000011 (7 -> 3): stop accepting messages much earlier: incorrect result

3
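
The two failure modes can be reproduced with a few lines of C++. This is a minimal sketch, not code from the talk: `flipBit`, `receiveAll`, `Outcome`, and the assumption that the neighbors send exactly 7 messages are hypothetical names and values chosen to match the numbers on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Flip one bit of an 8-bit counter, as a soft error would (bit 0 is the lowest bit).
std::uint8_t flipBit(std::uint8_t value, unsigned bit) {
    assert(bit < 8);
    return static_cast<std::uint8_t>(value ^ (1u << bit));
}

enum class Outcome { Correct, Hang, EarlyStop };

// Hypothetical receiver: counts ghost messages until msgsRecvd == expectedMsg.
// `sent` is the number of messages the neighbors really send.
Outcome receiveAll(std::uint8_t expectedMsg, unsigned sent) {
    std::uint8_t msgsRecvd = 0;
    for (unsigned m = 0; m < sent; ++m) {
        ++msgsRecvd;                       // msgsRecvd++
        if (msgsRecvd == expectedMsg) {    // the test from the flowchart
            // Done: any message still in flight is never accepted.
            return (m + 1 == sent) ? Outcome::Correct : Outcome::EarlyStop;
        }
    }
    return Outcome::Hang;                  // still waiting, but nothing more arrives
}

// Walks through the three cases: no fault, 7 -> 15, and 7 -> 3.
bool motivationDemo() {
    return receiveAll(7, 7) == Outcome::Correct &&
           receiveAll(flipBit(7, 3), 7) == Outcome::Hang &&
           receiveAll(flipBit(7, 2), 7) == Outcome::EarlyStop;
}
```

Flipping bit 3 turns 7 into 15, so the equality test is never reached; flipping bit 2 turns 7 into 3, so the receiver declares itself done after three messages.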

slide-15
SLIDE 15

Runtime Guided Replication

  • Control variables
      • msgsRecvd, expectedMsg
      • Affect program flow
  • How do we ensure the program control flow is correct?
      • Full duplication is expensive: less than 50% resource utilization or at least twice the running time
  • What about duplicating only the computation that affects program flow?
      • Leverage a compiler slicing pass
      • Reduce computation time
      • Avoid doubling the memory

4

slide-17
SLIDE 17

Compiler Slicing Pass

5

    void Stencil::beginNextIter() {
      iterCount++;
      if (iterCount >= totalIter) {
        mainProxy.done();  // program exits
      } else {
        // send one ghost message to each neighbor direction
        for (int i = 0; i < totalDirections; i++) {
          ghostMsg* m = createGhostMsg(dirs[i]);
          copy(m->data, boundary[i]);
          int sendTo = myIdx + dirs[i];
          stencilProxy(sendTo).receiveMessage(m);
        }
      }
    }
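
To make the idea of a slice concrete, the sketch below hand-writes what a control-flow slice of `beginNextIter` might look like in plain C++, without Charm++. Which statements the real compiler pass keeps is an assumption here: this version keeps the iteration counter, the exit test, and the message destinations, and drops the payload copy. `StencilState`, `ControlOutcome`, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the chare state touched by beginNextIter().
struct StencilState {
    int iterCount = 0;
    int totalIter = 0;
    int myIdx = 0;
    std::vector<int> dirs;                      // neighbor offsets
    std::vector<std::vector<double>> boundary;  // field data, one strip per direction
};

// What one invocation decided: exit, or send to these destinations.
struct ControlOutcome {
    bool done = false;
    std::vector<int> sendTo;
    bool operator==(const ControlOutcome& o) const {
        return done == o.done && sendTo == o.sendTo;
    }
};

// Full method: control flow plus the payload copies.
ControlOutcome beginNextIterFull(StencilState& s,
                                 std::vector<std::vector<double>>& outbox) {
    assert(s.boundary.size() == s.dirs.size());
    ControlOutcome out;
    s.iterCount++;
    if (s.iterCount >= s.totalIter) { out.done = true; return out; }
    for (std::size_t i = 0; i < s.dirs.size(); i++) {
        outbox.push_back(s.boundary[i]);            // payload copy (the expensive part)
        out.sendTo.push_back(s.myIdx + s.dirs[i]);
    }
    return out;
}

// Assumed slice: same control decisions, no field data touched.
ControlOutcome beginNextIterSlice(int& iterCount, int totalIter, int myIdx,
                                  const std::vector<int>& dirs) {
    ControlOutcome out;
    iterCount++;
    if (iterCount >= totalIter) { out.done = true; return out; }
    for (std::size_t i = 0; i < dirs.size(); i++) {
        out.sendTo.push_back(myIdx + dirs[i]);
    }
    return out;
}

// Runs both versions for two iterations and checks that they agree.
bool sliceAgreesWithFull() {
    StencilState s;
    s.totalIter = 2; s.myIdx = 10; s.dirs = {-1, 1}; s.boundary = {{1.0}, {2.0}};
    int shadowIter = s.iterCount;
    std::vector<std::vector<double>> outbox;
    for (int step = 0; step < 2; ++step) {
        ControlOutcome full = beginNextIterFull(s, outbox);
        ControlOutcome slice = beginNextIterSlice(shadowIter, s.totalIter, s.myIdx, s.dirs);
        if (!(full == slice) || s.iterCount != shadowIter) return false;
    }
    return true;
}
```

The slice reads only integers, which is why duplicating it costs far less time and memory than duplicating the whole method.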


slide-25
SLIDE 25

The Role of Runtime System

  • Creation of shadow chares
      • Initialized with the same control variables as the original chare
      • Share the same pointers to the non-control variables
  • Compare the values of control variables and outgoing messages at the end of each entry method

9
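
The compare-at-end-of-entry-method step can be sketched in plain C++ as follows. This is an illustration under assumptions, not the Charm++ runtime code: `ControlVars`, `GuardedChare`, `receiveMessage`, and `deliver` are hypothetical names, and only detection is shown (how the runtime recovers after a mismatch is not described on the slide).

```cpp
#include <cassert>

// Control variables of the running example; field data is deliberately absent.
struct ControlVars {
    int msgsRecvd = 0;
    int expectedMsg = 0;
    bool operator==(const ControlVars& o) const {
        return msgsRecvd == o.msgsRecvd && expectedMsg == o.expectedMsg;
    }
};

// Control-flow part of a hypothetical receiveMessage entry method.
// Returns true once all expected messages have arrived.
bool receiveMessage(ControlVars& c) {
    c.msgsRecvd++;
    return c.msgsRecvd == c.expectedMsg;
}

// An original chare paired with its shadow.
struct GuardedChare {
    ControlVars original;
    ControlVars shadow;

    explicit GuardedChare(int expected) {
        assert(expected > 0);
        original.expectedMsg = expected;
        shadow = original;  // shadow starts from the same control variables
    }

    // Runs the entry method on both copies and compares them at the end.
    // Returns false when the copies disagree, i.e. a soft error was detected.
    bool deliver() {
        bool doneOriginal = receiveMessage(original);
        bool doneShadow = receiveMessage(shadow);
        return doneOriginal == doneShadow && original == shadow;
    }
};

// One clean delivery, then a delivery after a bit flip in the original only.
bool shadowDetectsFlip() {
    GuardedChare chare(7);
    if (!chare.deliver()) return false;      // clean run: copies agree
    chare.original.expectedMsg ^= (1 << 3);  // inject 7 -> 15
    return !chare.deliver();                 // the mismatch must be reported
}
```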

slide-26
SLIDE 26

Runtime Guided Replication

10

slide-27
SLIDE 27

Another Example

11

    void Stencil::invokeCompution() {
      // computation routine
      for (int i = 0; i < size; ++i) {
        temperature[i] = ...
      }
    }

  • The previous method fails to protect the loop index i
      • Its lifetime ends before the end of the entry method
  • However, if a bit flip hits i, incorrect data may be used or the program may crash
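
One way to protect a short-lived variable such as `i` is to keep a duplicate and compare the two before each use. The sketch below is a source-level analogue written by hand; it is an assumption about the general technique, not the instruction-level transformation used in the talk. `guardedUpdate` and its fault-injection parameters `flipAt` and `flipBitPos` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Adds 1.0 to every element while keeping a duplicate of the loop index.
// flipAt < 0 means no fault; otherwise bit flipBitPos of i is flipped when
// i reaches flipAt (demo-only fault injection).
// Returns false if the index and its duplicate ever disagree.
bool guardedUpdate(std::vector<double>& temperature, int flipAt, int flipBitPos) {
    assert(flipBitPos >= 0 && flipBitPos < 31);
    const int size = static_cast<int>(temperature.size());
    int i = 0;
    int iDup = 0;                      // duplicated index
    while (i < size) {
        if (i == flipAt) i ^= (1 << flipBitPos);
        if (i != iDup) return false;   // checked before the index is used
        temperature[static_cast<std::size_t>(i)] += 1.0;
        ++i;
        ++iDup;                        // duplicated increment
    }
    return iDup == size;
}

// A clean pass succeeds; a flipped index is caught before any bad access.
bool indexGuardDemo() {
    std::vector<double> t(8, 0.0);
    bool clean = guardedUpdate(t, -1, 0);    // every element becomes 1.0
    bool faulty = guardedUpdate(t, 3, 4);    // i: 3 -> 19, out of range
    return clean && !faulty && t[2] == 2.0 && t[3] == 1.0;
}
```

Because the check runs before `temperature[i]` is touched, the corrupted index 19 never reaches the array access.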
slide-28
SLIDE 28

Selective Instruction Duplication

12

slide-29
SLIDE 29

Protection for Field Data

  • Rules that hold in nature also hold in scientific programs

[Figure: two panels, Stencil2d and OpenAtom; axis tick values omitted]

13

slide-37
SLIDE 37

Protection for Field Data

  • Spatial similarity
  • Temporal similarity
      • data at time steps t-2k, t-k, t
  • Spatial-temporal similarity
      • spatial similarity of temporal updates
      • temporal similarity of spatial differences

Neighborhood of d(i,j):

    d(i-1,j-1)  d(i-1,j)  d(i-1,j+1)
    d(i,j-1)    d(i,j)    d(i,j+1)
    d(i+1,j-1)  d(i+1,j)  d(i+1,j+1)

14
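
The slides name the similarity properties but not the exact tests, so the following is only a sketch under stated assumptions. The spatial check compares d(i,j) with the mean of its eight neighbors; the temporal check compares the value at step t with a linear extrapolation from steps t-2k and t-k. Both formulas, the tolerance `tol`, and the function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Assumed spatial check: does interior point (i,j) deviate from the mean of
// its 8 neighbors by more than tol?
bool spatialOutlier(const Grid& d, int i, int j, double tol) {
    assert(i > 0 && j > 0);
    assert(static_cast<std::size_t>(i) + 1 < d.size());
    assert(static_cast<std::size_t>(j) + 1 < d[0].size());
    double sum = 0.0;
    for (int di = -1; di <= 1; ++di) {
        for (int dj = -1; dj <= 1; ++dj) {
            if (di == 0 && dj == 0) continue;  // skip the point itself
            sum += d[static_cast<std::size_t>(i + di)][static_cast<std::size_t>(j + dj)];
        }
    }
    double center = d[static_cast<std::size_t>(i)][static_cast<std::size_t>(j)];
    return std::fabs(center - sum / 8.0) > tol;
}

// Assumed temporal check: linear extrapolation from t-2k and t-k predicts t.
bool temporalOutlier(double atTminus2k, double atTminusK, double atT, double tol) {
    double predicted = 2.0 * atTminusK - atTminus2k;
    return std::fabs(atT - predicted) > tol;
}

// A smooth field passes both checks; a corrupted value fails them.
bool fieldCheckDemo() {
    Grid d(3, std::vector<double>(3, 10.0));
    bool smooth = !spatialOutlier(d, 1, 1, 1.0);
    d[1][1] = 1.0e6;  // stand-in for a flipped high-order bit
    bool corrupted = spatialOutlier(d, 1, 1, 1.0);
    return smooth && corrupted &&
           !temporalOutlier(1.0, 2.0, 3.0, 0.5) &&
           temporalOutlier(1.0, 2.0, 300.0, 0.5);
}
```

The spatial-temporal variants on the slide would apply the same kind of comparison to updates or differences instead of raw values; they are not sketched here.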

slide-38
SLIDE 38

Evaluation

  • Miniaero
      • Mantevo mini-applications suite
      • Compressible Navier-Stokes equations solved with an explicit RK4 method
  • Particle-in-cell (PIC)
      • Intel PRK benchmark suite
      • Charm++ implementation
      • Particles are distributed within a fixed grid of charges. At each time step, PIC calculates the impact of the Coulomb potential of particles with related grid points.
  • Stencil3d
      • 7-point stencil-based computation on a 3D structured mesh
  • Fault injection with LLFI
      • random time
      • random processor

15
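
For intuition about what a single injected bit flip does to floating-point data, the sketch below flips one bit of a `double`. It is a hand-written stand-in, not LLFI and not LLFI's interface; `flipDoubleBit` is a hypothetical helper, and the code assumes IEEE 754 binary64 doubles.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch assumes IEEE 754 binary64 doubles");

// Flip one bit of a double (bit 0 = lowest mantissa bit, bit 63 = sign).
double flipDoubleBit(double value, unsigned bit) {
    assert(bit < 64);
    std::uint64_t raw;
    std::memcpy(&raw, &value, sizeof raw);   // reinterpret without aliasing issues
    raw ^= (std::uint64_t{1} << bit);
    std::memcpy(&value, &raw, sizeof value);
    return value;
}

// Effect of four different single-bit flips on the value 1.0.
bool flipDemo() {
    const double eps = std::numeric_limits<double>::epsilon();
    return flipDoubleBit(1.0, 0) == 1.0 + eps &&    // lowest mantissa bit: tiny change
           flipDoubleBit(1.0, 52) == 0.5 &&         // lowest exponent bit: halves the value
           std::isinf(flipDoubleBit(1.0, 62)) &&    // highest exponent bit: infinity
           flipDoubleBit(1.0, 63) == -1.0;          // sign bit
}
```

Low mantissa bits perturb the value only slightly, while exponent and sign bits change it drastically, which is why the position of the corrupted bit matters when classifying outcomes.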

slide-39
SLIDE 39

Evaluation

16

[Figure: Miniaero. Failure type (%) versus corrupted bit. Panels: (a) Original: control; (b) Protected: control; (c) Original: communication; (d) Protected: communication; (e) Original: computation (integer); (f) Protected: computation (integer); (g) Original: computation (floating point); (h) Protected: computation (floating point). Legend: Hang, Crash, Masked, SOC, Detected, Detected & Masked, Detected & Corrected.]

slide-40
SLIDE 40

Evaluation

17

[Figure: Particle-in-cell. Failure type (%) versus corrupted bit. Panels: (a) Original: control; (b) Protected: control; (c) Original: communication; (d) Protected: communication; (e) Original: computation (integer); (f) Protected: computation (integer); (g) Original: computation (floating point); (h) Protected: computation (floating point). Legend: Hang, Crash, Masked, SOC, Detected, Detected & Masked, Detected & Corrected.]

slide-41
SLIDE 41

Evaluation

18

[Figure: Stencil3d. Failure type (%) versus corrupted bit. Panels: (a) Original: control; (b) Protected: control; (c) Original: communication; (d) Protected: communication; (e) Original: computation (integer); (f) Protected: computation (integer); (g) Original: computation (floating point); (h) Protected: computation (floating point). Legend: Hang, Crash, Masked, SOC, Detected, Detected & Masked, Detected & Corrected.]

slide-42
SLIDE 42

Performance

19

[Figure: Time (s) for the configurations base, +rts-dupe, +inst-dupe, and +field-protection. Panels: (a) Miniaero, (b) PIC, (c) Stencil3d. Legend: 16, 32, 64, 128, and 256 cores.]

Overhead labels, in extraction order:

  • Miniaero: 2.6%, 5.3%, 5.7%, 3.9%, 6.8%, 6.9%, 4.1%, 9.0%, 9.7%, 2.6%, 5.3%, 5.6%, 3.6%, 5.5%, 6.0%
  • PIC: 10.7%, 16.2%, 18.4%, 12.0%, 14.9%, 17.3%, 11.0%, 13.8%, 15.6%, 12.5%, 15.2%, 17.0%, 5.5%, 7.1%, 8.9%
  • Stencil3d: 17.6%, 32.6%, 41.5%, 29.4%, 45.5%, 52.0%, 16.1%, 29.7%, 34.4%, 12.6%, 23.8%, 29.1%, 25.0%, 30.1%, 32.5%

slide-49
SLIDE 49

Discussion

  • Compare with the traditional checkpoint/restart strategy
      • For crashes/hangs induced by bit flips, rolling back to the previous checkpoint is another solution
      • At the cost of a global restart
  • With FlipBack, the overhead of local restart is minimal

[Figure: Overhead (%) versus FIT (number of crashes/hangs in 1 billion seconds), with series labeled 60s, 300s, 600s, 1800s, and 3600s]

20

slide-50
SLIDE 50

Conclusion

  • Leverage compiler and runtime techniques for a cheaper way to protect applications against silent data corruptions
      • Almost 100% coverage
      • 6-20% overhead

21