SLIDE 1

A Self-correcting Graph Connected Component Algorithm

Piyush Sao, Oded Green, Chirag Jain, Richard Vuduc
The 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), 2016
http://hpcgarage.org/ftxs16/

Piyush Sao Fault tolerant graph computing FTXS16 1 / 26

SLIDE 2

Summary

Summary of Contributions

Self-correcting Algorithms
  • We introduce a new fault-tolerant algorithm design principle that we call self-correction.
  • A self-correcting algorithm remains in a valid state despite the faulty execution of an iteration, under the assumption that its previous state was valid.

Self-Correcting Connected Components Algorithm
  • We apply the idea of self-correction to the label propagation algorithm for the graph connected components problem.
  • Assumes the availability of a selective reliability mode.
  • Requires $O(|V|)$ additional storage and computation per iteration, compared to the $O(|V| + |E|)$ cost of the baseline algorithm.
  • 10-35% increase in execution time at a rate of one error per 64 memory operations.

SLIDE 4

Self-correcting Algorithms

Iterative Algorithms

[Figure: flowchart. Input problem (I), initialize $S_0$, update $S_{k+1} = F(S_k)$ with $k = k + 1$, then test for convergence: if not converged, repeat the update; if converged, report the solution.]

A typical iterative algorithm has the following components:

  1. The input problem;
  2. Intermediate variables;
  3. Update relation;
  4. Convergence check; and
  5. Solution.
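The five components above can be sketched as a generic driver. This is a minimal illustration, not code from the talk: the names `iterate`, `update`, and `converged`, the `max_iters` guard, and the cosine fixed-point example are all assumptions chosen for brevity.

```python
import math
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def iterate(initial_state: S,
            update: Callable[[S], S],
            converged: Callable[[S, S], bool],
            max_iters: int = 1000) -> Tuple[S, int]:
    """Apply S_{k+1} = F(S_k) until the convergence check passes."""
    state = initial_state
    for k in range(1, max_iters + 1):
        next_state = update(state)          # update relation F
        if converged(state, next_state):    # convergence check
            return next_state, k            # report the solution
        state = next_state
    raise RuntimeError("no convergence within max_iters")


# Hypothetical input problem: find the fixed point of x = cos(x).
solution, steps = iterate(1.0, math.cos, lambda a, b: abs(a - b) < 1e-12)
```

Here the state is a single float; for label propagation the state would be the whole label array.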

SLIDE 5

Self-correcting Algorithms

Iterative Algorithm as State Machine

[Figure: chain of states $S_0 \to S_1 \to S_2 \to \dots \to S_n$.]

Iterative Algorithms as State Machines
  • An iterative algorithm can be viewed as a state machine.
  • State of the algorithm: the subset of intermediate variables required for continued execution of the algorithm.
  • Starts with an initial state $S_0$.
  • Uses the update relation to transition from one state to the next: $S_{k+1} \leftarrow F(S_k)$.

SLIDE 9

Self-correcting Algorithms

Single Fault in Iterative Algorithm

[Figure: chain of states $S_0 \to S_1 \to S_2 \to \dots \to S_n$, with a fault diverting execution to a state $S_f$. Successive builds label $S_f$ first as a valid state, then as an invalid state that leads to a final state $S_{fn}$.]

Valid and Invalid States
  • Valid state: a state from which a fault-free execution of the algorithm converges to the correct solution; any other state is invalid.
  • In a fault-free execution, the algorithm always remains in a valid state.
  • A hardware fault can cause the algorithm to reach an invalid state.
  • In general, determining whether a given state is valid is non-trivial.

SLIDE 12

Self-correcting Algorithms

Self-stabilizing Algorithms

[Figure: from an arbitrary state, the algorithm reaches a valid state and then the solution state.]

Self-stabilizing Algorithms
  • Starting from any arbitrary state, valid or invalid, a self-stabilizing algorithm will reach a valid state in a finite number of steps.
  • Natural fault-tolerance mechanism.
  • Examples: stationary iterations, Newton iteration.

Self-stabilization is a strong property.
  • ScalA’13: self-stabilizing steepest descent and conjugate gradient, using periodic state correction.
  • May not generalize to all iterative algorithms.
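As a small illustration of self-stabilization, the sketch below runs Newton's iteration for $\sqrt{a}$ and overwrites the state with an arbitrary positive value partway through. The function name, the fault model (a single overwritten iterate), and the step counts are assumptions for illustration; the claim shown holds for positive states only.

```python
def newton_sqrt(a: float, x0: float, steps: int,
                fault_at: int = -1, faulty_value: float = 0.0) -> float:
    """Newton's iteration for f(x) = x^2 - a, with an optional simulated fault."""
    x = x0
    for k in range(steps):
        if k == fault_at:
            x = faulty_value          # simulated fault: the state is overwritten
        x = 0.5 * (x + a / x)         # Newton update
    return x


clean = newton_sqrt(2.0, 1.0, 60)
faulty = newton_sqrt(2.0, 1.0, 60, fault_at=5, faulty_value=1.0e6)
```

Both runs end at the same value: the iteration needs no knowledge of the fault and no stored earlier state, only enough remaining steps.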

SLIDE 13

Self-correcting Algorithms

Self-Correcting Algorithms

[Figure: chain of states $S_0 \to S_1 \to S_2 \to \dots \to S_n$; a fault leads to an arbitrary state, from which the algorithm returns to a valid state.]

Self-correcting Algorithms
  • A self-correcting algorithm is an iterative algorithm that, starting in some valid state, remains in a valid state or returns to a valid state in a finite number of steps even if a fault occurs.
  • Requires that the algorithm start from a valid state.
  • Uses information from a previously known valid state.
  • Examples: checkpoint-restart, FT-GMRES.

SLIDE 14

Self-correcting Algorithms

Checkpoint-restart as a Self-correcting algorithm

[Figure: chain of states $S_0 \to S_1 \to S_2 \to \dots \to S_n$; a fault leads to an arbitrary state, and the algorithm restarts from the checkpointed state $S_1$.]

Checkpoint-restart Based Fault Tolerance
  • Brings the algorithm back to a valid state by restoring a checkpointed valid state.
  • At a high fault rate, the algorithm will not make any progress.
  • The broader idea of self-correction is to use $S_1$ to construct a state $\tilde{S}_2 \approx S_2$.

SLIDE 18

Self-correcting Label Correction Algorithm

Label Propagation Algorithm for Graph Connected Component Algorithm

[Figure: example graph on vertices 0 to 8 with two connected components. Final labels: CC[0]=CC[1]=CC[5]=CC[6]=0 and CC[2]=CC[3]=CC[4]=CC[7]=CC[8]=2.]

Graph Connected-component Problem
  • We seek the number of connected components in the graph and the connected component to which each vertex belongs.
  • Used for community detection, centrality analytics, and streaming graph analytics.
  • Label propagation is highly suited to parallel computing.

SLIDE 20

Self-correcting Label Correction Algorithm

Label Propagation Algorithm

[Figure: vertex 7 and its neighbours 3, 4, and 8, with $CC^{i}[3]=2$, $CC^{i}[4]=3$, $CC^{i}[7]=7$, $CC^{i}[8]=2$, so that $CC^{i+1}[7]=\min\{7,2,2,3\}$.]

  • Propagates the minimum label in the connected component.
  • We maintain a label array CC with one entry per vertex.
  • CC is initialized to the vertex id for each vertex.
  • In each iteration, each vertex computes the minimum label among all its near-neighbours $N(v) = \{v\} \cup \mathrm{adj}(v)$:

    $CC^{i}[v] = \min_{u \in N(v)} CC^{i-1}[u].$

  • The iteration terminates when no label changes during an iteration.
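The update rule above translates almost directly into code. The following is a minimal sequential sketch, assuming an adjacency-list dictionary; the function name and the 5-vertex test graph are hypothetical and not taken from the talk, which targets a parallel implementation.

```python
from typing import Dict, List


def label_propagation(adj: Dict[int, List[int]]) -> Dict[int, int]:
    """Synchronous min-label propagation; returns the component label of each vertex."""
    cc = {v: v for v in adj}                      # CC initialized to the vertex id
    while True:
        # CC^i[v] = min over N(v) = {v} U adj(v) of CC^{i-1}[u]
        new_cc = {v: min([cc[v]] + [cc[u] for u in adj[v]]) for v in adj}
        if new_cc == cc:                          # no change: terminate
            return cc
        cc = new_cc


# Hypothetical graph: path 0-1-2 plus the separate edge 3-4.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = label_propagation(adj)
num_components = len(set(labels.values()))
```

Each sweep reads only the previous label array, so all vertices can be updated independently, which is what makes the method attractive for parallel execution.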

SLIDE 21

Self-correcting Label Correction Algorithm

Label Propagation Algorithm: Example

[Figure: example graph on vertices 0 to 8, shown over successive builds, with the two connected components coloured green and orange.]

Initial labels:      CC[0]=0 CC[1]=1 CC[2]=2 CC[3]=3 CC[4]=4 CC[5]=5 CC[6]=6 CC[7]=7 CC[8]=8
Intermediate labels: CC[0]=0 CC[1]=0 CC[2]=2 CC[3]=2 CC[4]=3 CC[5]=1 CC[6]=0 CC[7]=3 CC[8]=2
Final labels:        CC[0]=0 CC[1]=0 CC[2]=2 CC[3]=2 CC[4]=2 CC[5]=0 CC[6]=0 CC[7]=2 CC[8]=2

The minimum label in each connected component (shown in green and orange) propagates to all the vertices in that component.
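The label sequence in this example can be reproduced in a few lines. The slides do not list the edges of the example graph, so the edge list below is an assumption, chosen to be consistent with the labels shown and with adj(7) = {3, 4, 8} from a later slide.

```python
# Assumed edge list (not given in the slides); vertices are 0..8.
EDGES = [(0, 1), (0, 6), (1, 5), (2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
N = 9

adj = [[] for _ in range(N)]
for a, b in EDGES:
    adj[a].append(b)
    adj[b].append(a)


def lp_step(cc):
    """One synchronous sweep: every vertex takes the minimum label in N(v)."""
    return [min([cc[v]] + [cc[u] for u in adj[v]]) for v in range(len(cc))]


trace = [list(range(N))]            # initial labels CC[v] = v
while True:
    nxt = lp_step(trace[-1])
    if nxt == trace[-1]:            # no label changed: converged
        break
    trace.append(nxt)
```

With this edge list, `trace` holds exactly the three label arrays listed in the example.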

SLIDE 25

Self-correcting Label Correction Algorithm

State of Label Propagation Algorithm

[Figure: the component containing vertices 2, 3, 4, 7, and 8, with CC[2]=2, CC[3]=2, CC[4]=3, CC[7]=5, CC[8]=2.]

The CC vector defines the state of the LP algorithm.

SLIDE 26

Self-correcting Label Correction Algorithm

Single Fault In Label Propagation Algorithm- Fault Correction

[Figure: the component containing vertices 2, 3, 4, 7, and 8, shown first with CC[2]=2, CC[3]=2, CC[4]=3, CC[7]=5, CC[8]=2 and then with CC[2]=CC[3]=CC[4]=CC[7]=CC[8]=2.]

  • A CC value can be corrupted by a hardware fault.
  • Depending on the error caused by the fault, some errors can be corrected by the algorithm itself.
  • Example: the corrupted CC[v] value is arbitrarily high.
  • Such faults do not cause the algorithm to transition to an invalid state, but they can still delay convergence.
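A short experiment illustrates both halves of this claim. The edges of the component are an assumption (the slides do not list them), and the corrupted vertex and value are arbitrary choices: an arbitrarily high label is overwritten by the next minimum, at the price of one extra sweep in this instance.

```python
# Assumed edges for the component {2, 3, 4, 7, 8}; the slides do not list them.
EDGES = [(2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)


def lp_step(cc):
    """One synchronous sweep: each vertex takes the minimum label in N(v)."""
    return {v: min([cc[v]] + [cc[u] for u in adj[v]]) for v in cc}


def run(cc):
    """Iterate to a fixed point; return final labels and the number of changing sweeps."""
    sweeps = 0
    while True:
        nxt = lp_step(cc)
        if nxt == cc:
            return cc, sweeps
        cc, sweeps = nxt, sweeps + 1


state = {2: 2, 3: 2, 4: 3, 7: 3, 8: 2}   # a valid intermediate state
corrupted = dict(state)
corrupted[3] = 10 ** 9                    # simulated fault: arbitrarily high label

clean_final, clean_sweeps = run(state)
faulty_final, faulty_sweeps = run(corrupted)
```

Both runs end with every label equal to 2; the corrupted run needs two changing sweeps instead of one.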

SLIDE 28

Self-correcting Label Correction Algorithm

Single Fault In Label Propagation Algorithm- Fault Propagation

[Figure: the component containing vertices 2, 3, 4, 7, and 8, shown in three successive builds. First: CC[2]=2, CC[3]=2, CC[4]=3, CC[7]=0, CC[8]=2. Then: CC[2]=2, CC[3]=0, CC[4]=2, CC[7]=0, CC[8]=0. Finally: CC[2]=CC[3]=CC[4]=CC[7]=CC[8]=0.]

  • If a fault changes CC[v] to a value lower than $CC^{\infty}[v]$, the error propagates to all the other vertices in the connected component.
  • Such faults cause the algorithm to transition to an invalid state.
  • It is not computationally easy to determine whether a given snapshot of CC is a valid state.
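The propagation of a too-low label can be reproduced with the plain update rule. As before, the component's edges are an assumption, and the corrupted vertex (7) and value (0, the label of the other component in the running example) are chosen to mirror the figure.

```python
# Assumed edges for the component {2, 3, 4, 7, 8}; the slides do not list them.
EDGES = [(2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)


def lp_step(cc):
    """One synchronous sweep: each vertex takes the minimum label in N(v)."""
    return {v: min([cc[v]] + [cc[u] for u in adj[v]]) for v in cc}


def run(cc):
    """Iterate the fault-free update to a fixed point."""
    while True:
        nxt = lp_step(cc)
        if nxt == cc:
            return cc
        cc = nxt


correct = {v: 2 for v in adj}    # converged labels of this component
corrupted = dict(correct)
corrupted[7] = 0                 # simulated fault: label below the true minimum

faulty_final = run(corrupted)
```

Fault-free execution from the corrupted state converges, but to label 0, which no vertex of this component owns; the minimum operation can never raise a label again.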

SLIDE 31

Self-correcting Label Correction Algorithm

Valid States

Classification of Corruptions

Consider the following three cases:

  1. $CC[v] > v$: easy to detect, and automatically corrected in most cases.
  2. $CC^{\infty}[v] \le CC[v] \le v$: ??
  3. $CC[v] < CC^{\infty}[v]$: will definitely cause the algorithm to fail.

Theorem
A connected component array CC is a valid state (i.e., a fault-free execution of the algorithm starting from CC will converge to the correct solution) if, for all vertices v, $CC^{\infty}[v] \le CC[v] \le v$.

The only problem is that $CC^{\infty}[v]$, being the solution, is not known.
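When the solution is known in advance, for instance in a test harness, the theorem's condition can be checked directly. The sketch below assumes the converged labels of the running 9-vertex example as the oracle; the function name and the corrupted values are hypothetical.

```python
def satisfies_bounds(cc, cc_inf):
    """True if CC_inf[v] <= CC[v] <= v for every vertex v (sufficient for validity)."""
    return all(cc_inf[v] <= cc[v] <= v for v in range(len(cc)))


cc_inf = [0, 0, 2, 2, 2, 0, 0, 2, 2]         # known solution, used as a test oracle
intermediate = [0, 0, 2, 2, 3, 1, 0, 3, 2]   # a state reached by fault-free execution

too_high = list(intermediate)
too_high[7] = 100                            # case 1: CC[v] > v

too_low = list(intermediate)
too_low[7] = 0                               # case 3: CC[v] < CC_inf[v]

results = (satisfies_bounds(intermediate, cc_inf),
           satisfies_bounds(too_high, cc_inf),
           satisfies_bounds(too_low, cc_inf))
```

The check rejects the case 1 state even though such a value is usually repaired by later iterations, because the theorem gives a sufficient condition only. Outside of testing the oracle `cc_inf` is unavailable, which is the difficulty the following slides address.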

SLIDE 36

Self-correcting Label Correction Algorithm

Self-correcting Label Propagation Algorithm- 1

  • We apply the principle of self-correction to resolve the apparent difficulty in verifying state validity.
  • We assume the previous state $CC^{i-1}$ is a valid one.
  • Checking

    $CC^{i}[v] = \min_{u \in N(v)} CC^{i-1}[u],$

    where $N(v) \equiv \{v\} \cup \mathrm{adj}(v)$ is the immediate neighborhood of v, would require recomputing the entire iteration.
  • We show that $CC^{i}[v]$ is still a valid value even if we relax the minimization criterion to

    $CC^{i}[v] \in \{CC^{i-1}[u] \mid u \in N(v)\}.$

SLIDE 40

Self-correcting Label Correction Algorithm

Self-correcting Label Propagation Algorithm- 2

Theorem
Given a valid state for the previous iteration, $CC^{i-1}$, the current connected component array $CC^{i}$ is a valid state if, for all vertices v, $CC^{i}$ satisfies these conditions:

  1. $CC^{i}[v] \le v$; and
  2. there exists a vertex $u \in N(v)$ such that $CC^{i}[v] = CC^{i-1}[u]$.

Cost of Direct Verification
  • Verifying $CC^{i}[v] \le v$ for all vertices requires $O(|V|)$ operations.
  • Verifying the second condition requires traversing the adjacency list of each vertex v, which requires $O(|V| + |E|)$ operations, as costly as an LP iteration.
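The direct verification described above might look as follows. This is a sketch under assumptions: the function name is hypothetical, and the 9-vertex edge list is the assumed example graph, not one given in the slides. The inner set construction is the adjacency traversal that makes the check as costly as an LP iteration.

```python
# Assumed example graph (edge list not given in the slides); vertices are 0..8.
EDGES = [(0, 1), (0, 6), (1, 5), (2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
N = 9
adj = [[] for _ in range(N)]
for a, b in EDGES:
    adj[a].append(b)
    adj[b].append(a)


def verify_direct(adj, cc_prev, cc_new):
    """Check both theorem conditions for every vertex; O(|V| + |E|) work."""
    for v in range(len(adj)):
        if cc_new[v] > v:                                    # condition 1
            return False
        candidates = {cc_prev[v]} | {cc_prev[u] for u in adj[v]}
        if cc_new[v] not in candidates:                      # condition 2 (relaxed)
            return False
    return True


cc_prev = list(range(N))
cc_new = [0, 0, 2, 2, 3, 1, 0, 3, 2]     # result of one fault-free sweep

too_low = list(cc_new)
too_low[7] = 0                           # 0 is not a label in N(7)

not_minimal = list(cc_new)
not_minimal[7] = 4                       # a neighbour's label, but not the minimum
```

Note that `not_minimal` passes: the relaxed criterion accepts any label drawn from the neighbourhood, not only the minimum.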

SLIDE 43

Self-correcting Label Correction Algorithm

Validity Checking: Auxiliary Data Structure- 1

[Figure: vertex 7 and its neighbours 3, 4, and 8, with $CC^{i}[3]=2$, $CC^{i}[4]=3$, $CC^{i}[7]=7$, $CC^{i}[8]=2$; $CC^{i+1}[7]=\min\{7,2,3,2\}$ and $P[7]=3$.]

Parent Array
  • Parent array P: we may store the vertex u that caused the last change in CC[v], so that $P[v] = u$.
  • With $u = P[v]$, the condition $CC^{i}[v] = CC^{i-1}[P[v]]$ can be verified in $O(|V|)$ operations for all vertices.
  • Storing P requires $O(|V|)$ memory.

Corruption of P
  • P can also be corrupted.
  • P is valid if $P[v] \in N(v)$ for all vertices v.
  • Checking that P is valid again requires $O(|V| + |E|)$ operations.
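A sketch of the parent-array idea follows, assuming the same 9-vertex example graph as before (its edge list is not given in the slides) and hypothetical function names. The label check touches one array entry per vertex, while the validity check on P still scans adjacency lists.

```python
# Assumed example graph (edge list not given in the slides); vertices are 0..8.
EDGES = [(0, 1), (0, 6), (1, 5), (2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
N = 9
adj = [[] for _ in range(N)]
for a, b in EDGES:
    adj[a].append(b)
    adj[b].append(a)


def lp_step_with_parent(adj, cc_prev, parent_prev):
    """One sweep that also records which vertex caused the last label change."""
    cc_new, parent = list(cc_prev), list(parent_prev)
    for v in range(len(adj)):
        for u in adj[v]:
            if cc_prev[u] < cc_new[v]:
                cc_new[v], parent[v] = cc_prev[u], u
    return cc_new, parent


def labels_match_parent(cc_prev, cc_new, parent):
    """O(|V|): CC^i[v] == CC^{i-1}[P[v]] for every v (assumes P itself is valid)."""
    return all(cc_new[v] == cc_prev[parent[v]] for v in range(len(cc_new)))


def parent_is_valid(adj, parent):
    """O(|V| + |E|): P[v] must lie in N(v) = {v} U adj(v)."""
    return all(parent[v] == v or parent[v] in adj[v] for v in range(len(adj)))


cc0, p0 = list(range(N)), list(range(N))    # initially P[v] = v
cc1, p1 = lp_step_with_parent(adj, cc0, p0)

bad_cc = list(cc1)
bad_cc[7] = 0                               # corrupted label
bad_p = list(p1)
bad_p[7] = 5                                # corrupted parent: 5 is not in N(7)
```

The membership test `parent[v] in adj[v]` is what brings the cost back to $O(|V| + |E|)$.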

SLIDE 48

Self-correcting Label Correction Algorithm

Validity Checking: Auxiliary Data Structure- 2

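[Figure: vertex 7 and its neighbours 3, 4, and 8, with $CC^{i}[3]=2$, $CC^{i}[4]=3$, $CC^{i}[7]=7$, $CC^{i}[8]=2$ and $P[7]=3$; $\mathrm{adj}(7)=\{3,4,8\}$ and $\mathrm{adj}(7)[0]=3$, so $P^{*}[7]=0$.]

Index Based Parent Array
  • Instead of storing u, we store the index of u in adj(v): with $E \leftarrow \mathrm{adj}(v)$ and $u \leftarrow E[k]$, set $P^{*}[v] \leftarrow k$.
  • When $P[v] = v$, then $P^{*}[v] = -1$.
  • $CC^{i}[v] = CC^{i-1}[P[v]]$ reduces to $CC^{i}[v] = CC^{i-1}[\mathrm{adj}(v)[P^{*}[v]]]$; $O(|V|)$ operations.
  • Validity of $P^{*}$: $-1 \le P^{*}[v] < |\mathrm{adj}(v)|$; $O(|V|)$ operations.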
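The index-based variant can be sketched the same way. The graph is again the assumed 9-vertex example, and the function names are hypothetical; the point is that both checks now cost a constant amount of work per vertex.

```python
# Assumed example graph (edge list not given in the slides); vertices are 0..8.
EDGES = [(0, 1), (0, 6), (1, 5), (2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
N = 9
adj = [[] for _ in range(N)]
for a, b in EDGES:
    adj[a].append(b)
    adj[b].append(a)


def lp_step_indexed(adj, cc_prev, pstar_prev):
    """One sweep that records the adjacency index k of the vertex that changed CC[v]."""
    cc_new, pstar = list(cc_prev), list(pstar_prev)
    for v in range(len(adj)):
        for k, u in enumerate(adj[v]):
            if cc_prev[u] < cc_new[v]:
                cc_new[v], pstar[v] = cc_prev[u], k
    return cc_new, pstar


def pstar_in_range(adj, pstar):
    """O(|V|): -1 <= P*[v] < |adj(v)| for every v."""
    return all(-1 <= pstar[v] < len(adj[v]) for v in range(len(adj)))


def labels_match_pstar(adj, cc_prev, cc_new, pstar):
    """O(|V|): CC^i[v] == CC^{i-1}[adj(v)[P*[v]]]; call only after the range check."""
    for v in range(len(adj)):
        k = pstar[v]
        source = v if k == -1 else adj[v][k]     # k == -1 encodes P[v] = v
        if cc_new[v] != cc_prev[source]:
            return False
    return True


cc0, ps0 = list(range(N)), [-1] * N
cc1, ps1 = lp_step_indexed(adj, cc0, ps0)

bad_ps = list(ps1)
bad_ps[7] = 3                                    # out of range: |adj(7)| = 3
bad_cc = list(cc1)
bad_cc[7] = 0                                    # corrupted label
```

Because any in-range index points at a true neighbour by construction, a bounds check replaces the adjacency scan needed to validate P.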

SLIDE 52

Self-correcting Label Correction Algorithm

Fault Detection and Correction

Invalid State Detection
In summary, the conditions to check for each vertex v are:

  • $CC^{i}[v] \le v$;
  • $-1 \le P^{*}[v] < |\mathrm{adj}(v)|$; and
  • $CC^{i}[v] = \begin{cases} v & \text{if } P^{*}[v] = -1, \\ CC^{i-1}[\mathrm{adj}(v)[P^{*}[v]]] & \text{if } P^{*}[v] \ne -1. \end{cases}$

State Correction
For any vertex v, if the state validity check fails, we recompute $CC^{i}[v]$.
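Detection and correction can be combined into one iteration, as in the sketch below. It is an illustration under several assumptions rather than the authors' implementation: the graph is the assumed 9-vertex example, the fault is a single hand-placed corruption of the new label of vertex 7, and the check and recompute steps are simply assumed to run fault-free, standing in for the reliable mode.

```python
# Assumed example graph (edge list not given in the slides); vertices are 0..8.
EDGES = [(0, 1), (0, 6), (1, 5), (2, 3), (2, 8), (3, 4), (3, 7), (4, 7), (7, 8)]
N = 9
adj = [[] for _ in range(N)]
for a, b in EDGES:
    adj[a].append(b)
    adj[b].append(a)


def lp_vertex(adj, cc_prev, pstar_prev, v):
    """Fault-free update of one vertex: returns (label, parent index)."""
    best, k_best = cc_prev[v], pstar_prev[v]
    for k, u in enumerate(adj[v]):
        if cc_prev[u] < best:
            best, k_best = cc_prev[u], k
    return best, k_best


def is_vertex_valid(adj, cc_prev, cc_new, pstar, v):
    """The three per-vertex conditions; constant work per vertex."""
    k = pstar[v]
    if cc_new[v] > v or not (-1 <= k < len(adj[v])):
        return False
    expected = v if k == -1 else cc_prev[adj[v][k]]   # k == -1: label never changed
    return cc_new[v] == expected


def self_correcting_step(adj, cc_prev, pstar_prev, inject=None):
    """Sweep (possibly faulty), then detect and recompute invalid vertices."""
    cc_new, pstar = list(cc_prev), list(pstar_prev)
    for v in range(len(adj)):
        cc_new[v], pstar[v] = lp_vertex(adj, cc_prev, pstar_prev, v)
    if inject is not None:
        inject(cc_new, pstar)                         # simulated fault
    corrected = 0
    for v in range(len(adj)):                         # assumed reliable mode
        if not is_vertex_valid(adj, cc_prev, cc_new, pstar, v):
            cc_new[v], pstar[v] = lp_vertex(adj, cc_prev, pstar_prev, v)
            corrected += 1
    return cc_new, pstar, corrected


def corrupt_vertex_7(cc, pstar):
    cc[7] = 0                                         # too-low label that would spread


cc, pstar = list(range(N)), [-1] * N
total_corrections, iteration = 0, 0
while True:
    fault = corrupt_vertex_7 if iteration == 0 else None
    new_cc, new_pstar, fixed = self_correcting_step(adj, cc, pstar, fault)
    total_corrections += fixed
    iteration += 1
    if new_cc == cc:
        break
    cc, pstar = new_cc, new_pstar
```

The corrupted label is caught in the iteration where it appears, because it no longer equals the previous label of the recorded parent, and the run converges to the correct labels after a single recomputation.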

SLIDE 54

Self-correcting Label Correction Algorithm

Overhead of Self-correcting Label-propagation Algorithm

Overhead                    Asymptotic complexity
Fault detection             $O(|V|)$
Fault correction            $O(f\,|E|/|V|)$
Auxiliary data structure    $O(|V|)$

  • The number of state corrections, f, can be significantly smaller than the number of faults that occurred.
  • Fault detection and correction need to be done in a guaranteed reliable mode.

SLIDE 55

Experiments and Results

Experimental Setup

Machine Parameters

Property              SNB16c
Micro-architecture    Sandy Bridge
Sockets × Cores       2 × 8
Clock rate            2.4 GHz
DRAM capacity         128 GB
DRAM bandwidth        72 GB/s
Compiler              Intel C compiler 15.0.0

Fault Injection
  • Faults are injected when reading the graph data structure and the CC array.
  • Each fault-injected read is independent.
  • Fault rates are normalized by the number of edges in the network.

Test Network: 14th DIMACS graph challenge
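The slides do not show how faults were injected; one plausible software model of independent bit flips on reads is sketched below. The function name, the 32-bit word width, the seed, and the choice of flipping exactly one uniformly chosen bit are all assumptions made for illustration.

```python
import random


def faulty_read(value: int, rate: float, rng: random.Random, bits: int = 32) -> int:
    """Return `value`, flipping one uniformly chosen bit with probability `rate`.

    Each call draws independently, so faults on different reads are independent.
    """
    if rng.random() < rate:
        return value ^ (1 << rng.randrange(bits))
    return value


rng = random.Random(0)                       # fixed seed for repeatability
reads = [faulty_read(5, 2 ** -6, rng) for _ in range(64_000)]
num_faults = sum(1 for r in reads if r != 5)

never = faulty_read(5, 0.0, rng)             # rate 0: value is returned unchanged
always = faulty_read(5, 1.0, rng)            # rate 1: exactly one bit is flipped
```

Wrapping every read of the label array and adjacency lists in such a function would give a per-read fault rate comparable in spirit to the $2^{-6}$ setting reported later, though the actual experimental harness may differ.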

SLIDE 56

Experiments and Results

Fault Free Execution Overhead

[Figure: stacked bar chart, "Fault-Tolerant Algorithm Overhead: Error-Free Execution". Vertical axis: relative overhead (%). Horizontal axis: test graphs, including astro-ph, audikw1, caidaRouterLevel, citationCiteseer, cnr-2000, coAuthorsDBLP, cond-mat-2005, coPapersDBLP, delaunay_n18, G_n_pin_pout, kron_g500-simple-logn18, ldoor, preferentialAttachment, and rgg_n_2_18_s0. Series: auxiliary data structure overhead; fault detection overhead (in reliable mode).]

On average, maintaining the additional data structure costs 1.3% overhead and fault detection costs 14%.

SLIDE 57

Experiments and Results

Overhead of Fault Tolerant Algorithm in the Presence of Faults

[Figure: line plot, "Overhead vs. Fault Rate". Vertical axis: relative overhead (%), with ticks up to 50. Horizontal axis: fault rate, from $2^{-18}$ to $2^{-6}$. Series: astro-ph, coPapersDBLP, kron_g500-simple, cnr-2000, rgg_2_18.]

SLIDE 58

Experiments and Results

Overhead of Fault Tolerant Algorithm in the Presence of Faults

[Figure: stacked bar chart, "Fault Tolerant Algorithm Overhead at $f = 2^{-6}|E|$ bit flips/iteration". Vertical axis: relative overhead (%). Horizontal axis: test graphs, including astro-ph, audikw1, caidaRouterLevel, citationCiteseer, cnr-2000, coAuthorsDBLP, cond-mat-2005, coPapersDBLP, delaunay_n18, G_n_pin_pout, kron_g500-simple-logn18, ldoor, preferentialAttachment, and rgg_n_2_18_s0. Series: auxiliary data structure overhead; fault detection overhead (in reliable mode); fault correction overhead (in reliable mode).]

Fault correction adds a further 9% overhead at a rate of $2^{-6}$ bit flips per memory access.

SLIDE 59

Conclusion

Conclusion

  • We introduced the idea of self-correcting algorithms as a way to build fault-tolerant algorithms.
  • We presented a self-correcting label propagation algorithm for the graph connected components problem.
  • Key steps involved:
      1. Analyze valid and invalid states;
      2. Use the self-correction hypothesis to simplify invalid state detection;
      3. Use previous valid states to recover from an invalid state.
  • Asymptotically lower overhead for fault detection and correction.
  • 10-35% increase in execution time at a rate of one error per 64 memory operations.
