Self-stabilizing Iterative Solvers Piyush Sao, Richard Vuduc School - - PowerPoint PPT Presentation


Self-stabilizing Iterative Solvers Piyush Sao, Richard Vuduc School of Computational Science & Engineering Georgia Institute of Technology SIAM PP-14 P. Sao, R. Vuduc (Georgia Tech) Self-stabilizing Iterative Solvers SIAM PP-14 1 / 21


SLIDE 1

Self-stabilizing Iterative Solvers

Piyush Sao, Richard Vuduc

School of Computational Science & Engineering Georgia Institute of Technology

SIAM PP-14

  • P. Sao, R. Vuduc (Georgia Tech)

Self-stabilizing Iterative Solvers SIAM PP-14 1 / 21

SLIDE 2

Introduction

Self-stabilization

Informally, self-stabilization (Dijkstra 1974) is a property of a system that guarantees it will enter a valid state no matter what its initial state is. We describe a self-stabilizing version of the conjugate gradient method, which is resilient to transient soft faults.

SLIDE 3

Introduction

Fault tolerant Iterative Algorithms

Can an iterative algorithm still converge if a fault has occurred?

SLIDE 4

Introduction

Iterative Algorithms

[Figure: flowchart of a generic iterative algorithm. Start with the initial intermediate variables <x0, y0, z0, ...>; Update maps <xk, yk, zk, ...> to <xk+1, yk+1, zk+1, ...>; Check for convergence; Return <xs, ys, zs, ...>. The resulting execution is the state sequence <x0, y0, z0, ...>, <x1, y1, z1, ...>, <x2, y2, z2, ...>, ...]

SLIDE 5

Introduction

Self-stabilizing Algorithms

An algorithm is self-stabilizing if, starting from any state (valid or invalid), it returns to a valid state within a finite number of "steps"; otherwise it is not.

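[Figure: state-space diagram with three regions labeled Solution States, Valid States, and Invalid States. An execution runs from Start through <X1, Y1, Z1, ...>, ..., <Xk, Yk, Zk, ...>, ... to <Xs, Ys, Zs, ...>. A path labeled "faulty execution" passes through <Xi, Yi, Zi, ...> in the invalid states, with further states <Xf, Yf, Zf, ...> and <Xs', Ys', Zs', ...>. Caption: Self-stabilizing Algorithms.]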

SLIDE 6

Introduction

Making an Algorithm Self-stabilizing

  • Naturally self-stabilizing (e.g., Newton, SOR, Jacobi)
  • Restart from a checkpoint
  • Restart (such as restarted GMRES)
  • Our strategy: correction step
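As a hedged illustration of the "naturally self-stabilizing" case, the sketch below runs Jacobi iteration on a small strictly diagonally dominant system. The matrix, the function name `jacobi`, and the way the iterate is corrupted are assumptions chosen for illustration, not taken from the talk. Corrupting the iterate mid-run plays the role of a transient fault; because Jacobi converges here from any starting point, the iteration recovers on its own.

```python
import numpy as np

def jacobi(A, b, x0, iters, fault_at=None):
    """Jacobi iteration; optionally corrupt the iterate once to mimic a transient fault."""
    D = np.diag(A)                 # diagonal entries of A
    R = A - np.diag(D)             # off-diagonal part of A
    x = x0.astype(float).copy()
    for k in range(iters):
        if k == fault_at:
            x[0] += 1e6            # hypothetical fault: one entry is badly corrupted
        x = (b - R @ x) / D        # standard Jacobi update
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])    # strictly diagonally dominant, so Jacobi converges
b = np.array([1.0, 2.0, 3.0])
x_exact = np.linalg.solve(A, b)

x_clean = jacobi(A, b, np.zeros(3), iters=200)
x_faulty = jacobi(A, b, np.zeros(3), iters=200, fault_at=20)
print(np.linalg.norm(x_clean - x_exact), np.linalg.norm(x_faulty - x_exact))
```

Both runs end at the same solution: the corrupted state is simply a new starting point for a method whose convergence does not depend on the start.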

SLIDE 7

Introduction

Periodic correction step

  • Restores sufficient conditions for convergence
  • Mathematically "equivalent" to the original in a fault-free execution
  • Eliminates the need for detecting faults
  • Executing the correction step periodically ensures that correct behavior resumes in a finite number of steps

SLIDE 8

Introduction Self-stabilizing Conjugate Gradient

Conjugate Gradient Algorithm

  • Solve $Ax = b$ for $x$, where $A$ is symmetric positive definite (SPD)
  • Quadratic optimization problem: $F(x) = \frac{1}{2} x^T A x - x^T b$
  • $F(x)$ represents an $N$-dimensional paraboloid
  • CG finds the optimum by taking appropriately constructed steps
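The link between the linear system and the quadratic can be made explicit with a short derivation (standard calculus, using only the symmetry of $A$):

```latex
\nabla F(x) = \tfrac{1}{2}\,(A + A^T)\,x - b = Ax - b
\quad\Longrightarrow\quad
\nabla F(x^*) = 0 \iff A x^* = b .
```

Since $A$ is positive definite, this stationary point is the unique minimizer, and $b - Ax$ is the negative gradient of $F$ at $x$, i.e. the direction of steepest descent.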

SLIDE 9

Introduction Self-stabilizing Conjugate Gradient

Conjugate Gradient (CG) Algorithm

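State variables
  • $x_k$ = present estimate
  • $p_k$ = search direction
  • $r_k = b - A x_k$ = residual (direction of steepest descent)

Transition function
1. $q_k \leftarrow A p_k$
2. $\alpha_k \leftarrow \dfrac{r_k^T r_k}{p_k^T q_k}$
3. $x_{k+1} \leftarrow x_k + \alpha_k p_k$
4. $r_{k+1} \leftarrow r_k - \alpha_k q_k$
5. $\beta_k \leftarrow \dfrac{\|r_{k+1}\|^2}{\|r_k\|^2}$
6. $p_{k+1} \leftarrow r_{k+1} + \beta_k p_k$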
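The six-step transition function can be sketched in a few lines of NumPy. This is a minimal textbook version under stated assumptions: dense matrices, a stopping test on the recursively updated residual, and a 1D Laplacian as the test problem. The function name `conjugate_gradient` and its parameters are illustrative, not from the talk.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Textbook CG for SPD A; returns the estimate and the number of iterations used."""
    x = x0.astype(float).copy()
    r = b - A @ x            # r_0 = b - A x_0
    p = r.copy()             # p_0 = r_0
    iterations = 0
    while iterations < max_iter and np.linalg.norm(r) > tol:
        q = A @ p                            # 1. q_k = A p_k
        alpha = (r @ r) / (p @ q)            # 2. step length
        x = x + alpha * p                    # 3. move along the search direction p_k
        r_new = r - alpha * q                # 4. update the residual
        beta = (r_new @ r_new) / (r @ r)     # 5. ratio of squared residual norms
        p = r_new + beta * p                 # 6. new search direction
        r = r_new
        iterations += 1
    return x, iterations

n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian, SPD (test problem)
b = np.ones(n)
x, its = conjugate_gradient(A, b, np.zeros(n))
print(its, np.linalg.norm(b - A @ x))
```

Note that the state carried between iterations is exactly the triple (x, r, p); a fault in any of the three is carried forward by the recurrences.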

SLIDE 10

Introduction Self-stabilizing Conjugate Gradient

Self-stabilizing Conjugate Gradient

  • CG is a Krylov subspace method: $\{p_k\}$ and $\{r_k\}$ span the Krylov subspace $\mathcal{K}(A, r_0, m) = \mathrm{span}\{r_0, A r_0, \ldots, A^{m-1} r_0\}$
  • Global orthogonality properties: $p_i^T A p_j = 0$ if $i \neq j$; $r_i^T r_j = 0$ if $i \neq j$; and $r_i^T p_j = 0$ if $i > j$
  • Finite termination in exact arithmetic
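The global orthogonality properties can be checked numerically. The sketch below runs a few fault-free CG iterations on a small, well-conditioned SPD matrix (an assumption chosen so that rounding errors stay negligible), stores the residuals and search directions, and measures how far the three relations are from holding exactly.

```python
import numpy as np

n, steps = 8, 5
A = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # small SPD test matrix (assumption)
b = np.arange(1.0, n + 1)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
R, P = [r.copy()], [p.copy()]
for _ in range(steps - 1):
    q = A @ p
    alpha = (r @ r) / (p @ q)
    x = x + alpha * p
    r_new = r - alpha * q
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    R.append(r.copy())
    P.append(p.copy())

R, P = np.array(R).T, np.array(P).T      # columns are r_0..r_4 and p_0..p_4

def offdiag_max(M):
    """Largest absolute off-diagonal entry of a square matrix."""
    return np.abs(M - np.diag(np.diag(M))).max()

conj = offdiag_max(P.T @ A @ P)                  # p_i^T A p_j for i != j
orth = offdiag_max(R.T @ R)                      # r_i^T r_j for i != j
cross = np.abs(np.tril(R.T @ P, k=-1)).max()     # r_i^T p_j for i > j
print(conj, orth, cross)
```

All three printed values should be at rounding-error level; a fault in the state vectors would be expected to break these relations.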

SLIDE 11

Introduction Self-stabilizing Conjugate Gradient

Effects of faults on Conjugate Gradient

  • In general, most Krylov subspace properties are lost
  • Faults can lead to multiple outcomes:
      1. Error in $r_k$ ⇒ convergence to an incorrect value
      2. Error in $p_k$ ⇒ divergence, stagnation, or slow convergence
  • It is difficult to detect whether a state is valid

SLIDE 12

Introduction Self-stabilizing Conjugate Gradient

Self-stabilizing Conjugate Gradient

We identify the following relations that are sufficient to guarantee convergence (a corollary of the Zoutendijk condition):

  • Residual condition: $r_k = b - A x_k$
  • Optimal step length: $\alpha_k = \dfrac{r_k^T p_k}{p_k^T A p_k}$
  • Correct search direction: $\dfrac{p_k^T r_k}{\|p_k\| \, \|r_k\|} > c_1$
  • Local orthogonality relation: $p_{k+1}^T A p_k = 0$
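One way to restore these relations periodically is sketched below. It is an illustration built directly from the four conditions, not necessarily the authors' exact algorithm: every `period` iterations the residual is recomputed from $x_k$, the step length is taken as $r_k^T p_k / p_k^T A p_k$, and $\beta_k$ is chosen so that $p_{k+1}^T A p_k = 0$. The function name, the `fault` hook, the test matrix, and the injected corruption are all hypothetical, and every operation here runs reliably; in the setting of the talk only the correction step would need to.

```python
import numpy as np

def cg_self_stabilizing(A, b, x0, iters, period=10, tol=1e-10, fault=None):
    """CG with a periodic correction step (sketch).

    fault: optional callable fault(k, x, r, p) that corrupts the state in place.
    """
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    for k in range(iters):
        if fault is not None:
            fault(k, x, r, p)
        q = A @ p
        if k % period == 0:
            # Correction step (assumed to be executed reliably).
            r = b - A @ x                  # residual condition
            if np.linalg.norm(r) <= tol:   # convergence is tested on a trustworthy residual
                break
            alpha = (r @ p) / (p @ q)      # optimal step length along p
            x = x + alpha * p
            r = r - alpha * q
            beta = -(r @ q) / (p @ q)      # enforces p_{k+1}^T A p_k = 0
            p = r + beta * p
        else:
            # Ordinary CG step.
            alpha = (r @ r) / (p @ q)
            x = x + alpha * p
            r_new = r - alpha * q
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p
            r = r_new
    return x

def corrupt(k, x, r, p):
    """Hypothetical transient fault: damage all three state vectors at iteration 13."""
    if k == 13:
        x[3] += 100.0
        r[5] = -50.0
        p[7] += 1000.0

n = 50
A = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # well-conditioned SPD test matrix
b = np.ones(n)
x_exact = np.linalg.solve(A, b)

x_clean = cg_self_stabilizing(A, b, np.zeros(n), iters=300)
x_faulty = cg_self_stabilizing(A, b, np.zeros(n), iters=300, fault=corrupt)
print(np.linalg.norm(x_clean - x_exact), np.linalg.norm(x_faulty - x_exact))
```

In a fault-free run the correction step computes the same quantities as an ordinary CG step in exact arithmetic, so no fault detection is needed: the correction is simply applied on schedule.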

SLIDE 13

Experiments

Experiments

  • Assume a selective reliability mode, i.e., the correction step can be done reliably
  • Inject faults into the sparse matrix-vector (SpMV) product by flipping bits in matrix entries at a specified rate
      1. Bit flips in mantissa and sign bits only: 40 bit flips in every unreliable SpMV
      2. Bit flips anywhere (including the exponent): 4 bit flips in every unreliable SpMV
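The fault model can be mimicked in a few lines. The sketch below is an assumption-laden illustration, not the authors' injection harness: it uses a dense matrix instead of a sparse format, picks the corrupted entries uniformly at random among the nonzeros, and relies on the IEEE 754 double layout (bits 0 to 51 mantissa, 52 to 62 exponent, 63 sign). The names `flip_bit` and `unreliable_matvec` are hypothetical.

```python
import numpy as np

def flip_bit(value, bit):
    """Flip one bit of a float64 (0 = lowest mantissa bit, 63 = sign bit)."""
    bits = np.array([value], dtype=np.float64).view(np.uint64)
    bits ^= np.uint64(1) << np.uint64(bit)
    return float(bits.view(np.float64)[0])

def unreliable_matvec(A, x, n_flips, rng, bounded=True):
    """Matrix-vector product with n_flips bit flips injected into nonzero entries of A."""
    A_faulty = A.copy()
    rows, cols = np.nonzero(A)
    for _ in range(n_flips):
        i = rng.integers(len(rows))
        # bounded: mantissa (0-51) or sign (63) bits only; unbounded: any of the 64 bits
        bit = rng.choice(np.r_[0:52, 63]) if bounded else rng.integers(64)
        A_faulty[rows[i], cols[i]] = flip_bit(A_faulty[rows[i], cols[i]], int(bit))
    return A_faulty @ x

n = 10
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.ones(n)
rng = np.random.default_rng(0)

y_bounded = unreliable_matvec(A, x, n_flips=40, rng=rng, bounded=True)
y_unbounded = unreliable_matvec(A, x, n_flips=4, rng=rng, bounded=False)
print(y_bounded)
print(y_unbounded)
```

The two modes differ sharply in severity: `flip_bit(1.0, 63)` gives `-1.0` and any mantissa flip changes a value by less than a factor of two, whereas `flip_bit(1.0, 62)` hits the top exponent bit and gives `inf`. This is why the first mode is described as bounded and the second as unbounded.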

SLIDE 14

Experiments

Problems

We test self-stabilizing CG (CG-SS) on three problems with different convergence profiles and conditioning

Name       N       NNZ      κ(A)     Convergence profile
K3D        27000   183600   646      Quadratic
DIAG       10000   10000    990100   Linear
THERMAL1   82654   574458   496250   Sub-linear

Table: Different problems used for experimentation

SLIDE 15

Experiments

Solvers

We compare the performance of the following solvers:

1. Reliable-CG: all computations are done reliably
2. CG-SS: self-stabilizing CG with a correction step every 10th iteration
3. CG-RES: restarted CG, restarting every 10th iteration
4. FT-GMRES: inner-outer iteration based fault-tolerant GMRES, where the outer iteration is done reliably

SLIDE 16

Experiments

K3D : Quadratic Convergence

In the presence of faults, only linear convergence is observed.

[Figure: "Convergence History" for K3D, log-scale error versus number of iterations (up to 300), comparing Error Free CG, CG-SS, CG-RES, and FT-GMRES. Panel (a): bounded errors (mantissa and sign bit flips only). Panel (b): unbounded errors (exponent bit flips included).]

SLIDE 17

Experiments

THERMAL1 : Sub-linear Convergence

The convergence rate of CG-SS and CG-RES changes little; FT-GMRES shows better convergence due to preconditioning.

[Figure: "Convergence History" for THERMAL1, log-scale error versus number of iterations, comparing Error Free CG, CG-SS, CG-RES, and FT-GMRES. Panel (a): bounded errors (mantissa and sign bit flips only), up to 300 iterations. Panel (b): unbounded errors (exponent bit flips included), up to 500 iterations.]

SLIDE 18

Experiments

DIAG : Linear Convergence

Linear convergence is maintained, although a slight slowdown in convergence is observed.

[Figure: "Convergence History" for DIAG, log-scale error versus number of iterations (up to 300), comparing Error Free CG, CG-SS, CG-RES, and FT-GMRES. Panel (a): bounded errors (mantissa and sign bit flips only). Panel (b): unbounded errors (exponent bit flips included).]

SLIDE 19

Experiments

Amount of Reliable Computation Required

Compared to Reliable-CG, CG-SS requires less than 30% of the reliable SpMVs to reach the same error tolerance.

[Figure: "Error Tolerance versus # Reliable SpMV", fraction of reliable SpMV (0.1 to 1) plotted against error tolerance (log scale), comparing Error Free CG, CG-SS, CG-RES, and FT-GMRES. Panel (a): bounded errors (mantissa and sign bit flips only). Panel (b): unbounded errors (exponent bit flips included).]

SLIDE 20

Analysis

Analysis

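It can be shown that, if $\kappa(A)$ is the condition number and $\eta$ is the rate of bit flips per SpMV, then

[Equation not recovered from the slide; its two annotated parts are "observed convergence in the presence of faults" and "ideal convergence in finite precision".]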

SLIDE 21

Conclusion

Conclusion & Future Work

Conclusion
  • Two mathematically equivalent algorithms can have different resilience. The concept of self-stabilization may give hints.
  • Self-stabilized CG tolerates high rates of faults
      1. Trade-off between reliable and unreliable computation
      2. Reliable computation varies logarithmically with fault rates ⇒ scalable

Future Work
  • The balance between reliable and unreliable computation: to optimize for energy / power / time
  • Hybridized fault-tolerance techniques
