Checkpointing for the RESTART Problem in Markov Networks


SLIDE 1

Overview of ME distributions 2 Failure Recover Scenarios 7 A Taboo Process - Two Absorbing States 14 RESTART and Checkpoints for Markov Models 18 Example 31

Checkpointing for the RESTART Problem in Markov Networks

Lester Lipsky Derek Doran Swapna Gokhale (With lots of help from Steve Thompson)

Department of Computer Science & Engineering University of Connecticut

New Frontiers in Applied Probability at Sandbjerg Estate, Sønderborg, 1-5 August 2011

Conference in Honour of Søren Asmussen on the occasion of his 65th Birthday

Lipsky, Doran, Gokhale Checkpointing for the RESTART Problem in Markov Networks

SLIDE 2

Overview

◮ Overview of ME distributions
◮ Failure Recovery Scenarios
◮ A Taboo Process - Two Absorbing States
◮ RESTART and Checkpoints for Markov Models
◮ Example

SLIDE 3

Matrix Exponential (ME) Distributions - I

[Figure: subsystem with $M$ nodes (phases).]

SLIDE 4

Matrix Exponential (ME) Distributions - II

◮ Let $\mathbf{P}$ be an $M$-dimensional transition matrix such that $\mathbf{I} - \mathbf{P}$ has an inverse;
◮ Let $\boldsymbol{\varepsilon}'$ be an $M$-dimensional column vector of all 1's;
◮ Let $\mathbf{p}$ be an $M$-dimensional row vector, where $(\mathbf{p})_i$ is the probability that the process will start at node $i$, and $\mathbf{p}\boldsymbol{\varepsilon}' = 1$;
◮ Let each of the $M$ nodes have an exponential service time distribution with rate $\mu_i = (\mathbf{M})_{ii} > 0$ ($\mathbf{M}$ is a diagonal matrix);
◮ Let $T$ be the time from entry to departure.

SLIDE 5

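Matrix Exponential (ME) Distributions - III

◮ Define $\mathbf{B} = \mathbf{M}(\mathbf{I} - \mathbf{P})$ and $\mathbf{V} = \mathbf{B}^{-1}$;
◮ Then the probability distribution function (PDF), reliability function, and probability density function (pdf) for $T$ are

$$F(t) := \Pr[T \le t] = 1 - \mathbf{p}\exp(-t\mathbf{B})\boldsymbol{\varepsilon}', \qquad \bar F(t) = 1 - F(t), \qquad f(t) = \frac{dF}{dt} = \mathbf{p}\exp(-t\mathbf{B})\mathbf{B}\boldsymbol{\varepsilon}'.$$

◮ Also

$$\mathbb{E}[T^{\ell}] = \ell!\,\mathbf{p}\mathbf{V}^{\ell}\boldsymbol{\varepsilon}'.$$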
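A minimal sketch of the moment formula, assuming a hypothetical two-node tandem network (node 1 always feeds node 2, with illustrative rates 2 and 3 that do not come from the talk). The helper names `identity`, `mat_mul`, `inverse` and `me_moment` are invented for this example.

```python
from math import factorial

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    eye = identity(n)
    aug = [[float(x) for x in a[i]] + eye[i] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def me_moment(p, P, rates, ell):
    """E[T^ell] = ell! * p V^ell eps', with B = M(I - P) and V = B^{-1}."""
    n = len(P)
    eye = identity(n)
    B = [[rates[i] * (eye[i][j] - P[i][j]) for j in range(n)] for i in range(n)]
    V = inverse(B)
    V_pow = identity(n)
    for _ in range(ell):
        V_pow = mat_mul(V_pow, V)
    # Row vector p times V^ell times a column of ones.
    return factorial(ell) * sum(p[i] * sum(V_pow[i]) for i in range(n))

# Assumed example: two exponential nodes in series, always entered at node 1.
P = [[0.0, 1.0],
     [0.0, 0.0]]
rates = [2.0, 3.0]
p = [1.0, 0.0]

mean_T = me_moment(p, P, rates, 1)
var_T = me_moment(p, P, rates, 2) - mean_T ** 2
print(mean_T, var_T)
```

For a tandem of two exponentials the mean and variance should equal the sums of the per-node means and variances, which gives a quick sanity check on the matrix formula.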

SLIDE 6

ME Representation of the Uniform Distribution

[Figure: density functions $U_n(t)$ of ME representations of the uniform distribution versus $t$, for $n$ = 2, 3, 4, 5, 6, 7, 8, 10, 20, 40, 80, 120, 200.]

SLIDE 7

Truncated Power-tail (TPT) Distributions

[Figure: log-log plot of the reliability function $R_T(x) = \Pr(\text{BurstLength} > x)$ versus $x$, for $T$ = 1, 10, 20, 30, 40; in the limit, $R_\infty(x) \to c\,x^{-\alpha}$.]

SLIDE 8

Recovery Scenarios

There are three general scenarios for recovering after a system crashes during execution:

◮ preemptive resume (prs) - RESUME
◮ preemptive repeat different (prd) - REPLACE
◮ preemptive repeat identical (pri) - RESTART

RESUME and REPLACE can be analyzed by Markov models. RESTART, however, is more difficult to treat.

SLIDE 9

The Performance of Systems Under RESTART - I

◮ Let $T$ be the time for a job to complete without failures.
◮ Let $F(t)$, $f(t)$ and $\bar F(t) = 1 - F(t)$ be the PDF, pdf, and reliability functions for $T$.
◮ Assume that the failure distribution is exponential with failure rate $\beta$. Then for $T = t$, let $X(t, \beta)$ be the completion time with failures, under RESTART, with PDF $H(x|t)$. Its Laplace transform was shown to be

$$H^*(s|t) = \frac{(s+\beta)\,e^{-(s+\beta)t}}{s + \beta\,e^{-(s+\beta)t}}.$$

◮ Since this is the moment generating function of $H(x|t)$, we have in general

$$\mathbb{E}[X(t,\beta)^{\ell}] = (-1)^{\ell} \left.\frac{d^{\ell} H^*(s|t)}{ds^{\ell}}\right|_{s=0}.$$
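For $\ell = 1$ the differentiation can be carried out by hand. The following is a sketch of that step; $D(s)$ is notation introduced only for this derivation and denotes the denominator after multiplying numerator and denominator by $e^{(s+\beta)t}$.

```latex
\begin{align*}
H^*(s|t) &= \frac{s+\beta}{D(s)}, \qquad D(s) := s\,e^{(s+\beta)t} + \beta,\\
D(0) &= \beta, \qquad D'(0) = e^{\beta t},\\
\left.\frac{dH^*}{ds}\right|_{s=0}
  &= \frac{D(0) - \beta\,D'(0)}{D(0)^2}
   = \frac{1 - e^{\beta t}}{\beta},\\
\mathbb{E}[X(t,\beta)] &= -\left.\frac{dH^*}{ds}\right|_{s=0}
   = \frac{e^{\beta t} - 1}{\beta}.
\end{align*}
```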

SLIDE 10

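The Performance of Systems Under RESTART - II

◮ Since $T = t$ throughout a RESTART process, it follows that

$$\mathbb{E}[X(\beta)^{\ell}] = \int_0^{\infty} \mathbb{E}[X(t,\beta)^{\ell}]\, f(t)\, dt.$$

◮ In particular, for $\ell = 1$ we have

$$\mathbb{E}[X(t,\beta)] = \frac{e^{\beta t} - 1}{\beta} \qquad\text{and}\qquad \mathbb{E}[X(\beta)] = \int_0^{\infty} \frac{e^{\beta t} - 1}{\beta}\, f(t)\, dt.$$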
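As a worked illustration of the $\ell = 1$ integral, the sketch below assumes an exponential job time with rate $\mu$ (so $f(t) = \mu e^{-\mu t}$), for which the integral has the closed form $1/(\mu - \beta)$ when $\beta < \mu$. The values $\mu = 1$ and $\beta = 0.25$ and the function names are assumptions chosen for the example; the integration is a plain trapezoidal rule on a truncated range.

```python
from math import exp

def mean_restart_fixed(t, beta):
    """E[X(t, beta)] = (exp(beta t) - 1) / beta for a job of fixed length t."""
    return (exp(beta * t) - 1.0) / beta

def mean_restart_exponential(mu, beta, upper=200.0, steps=200_000):
    """Trapezoidal estimate of the integral of E[X(t, beta)] * mu * exp(-mu t)
    over [0, upper]; the tail beyond `upper` is neglected."""
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * mean_restart_fixed(t, beta) * mu * exp(-mu * t)
    return total * h

mu, beta = 1.0, 0.25            # assumed values, with beta < mu
numeric = mean_restart_exponential(mu, beta)
closed_form = 1.0 / (mu - beta)
print(numeric, closed_form)
```

As $\beta$ approaches $\mu$ the closed form diverges, which is the first sign of the power-tail behaviour discussed on the following slides.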

SLIDE 11

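The Performance of Systems Under RESTART - III

Define:

$$\lambda_s := \sup\left\{ \lambda \;\middle|\; \int_0^{\infty} \exp(\lambda t)\, f(t)\, dt < \infty \right\}.$$

Also define

$$\alpha := \sup\left\{ \ell \;\middle|\; \int_0^{\infty} x^{\ell}\, h(x)\, dx < \infty \right\},$$

where $h(x)$ is the pdf for $X(\beta)$ (total completion time under RESTART). Then $X(\beta)$ is power-tailed (PT) with index $\alpha$ if $0 < \alpha < \infty$.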

SLIDE 12

The Performance of Systems Under RESTART - IV

From these definitions we have the following.

◮ If $T$ has infinite support, $X(\beta)$ is sub-exponential.
◮ $f(t)$ has an exponential tail with parameter $\lambda_s$ if $0 < \lambda_s < \infty$. If $\lambda_s = 0$ then $f(t)$ is sub-exponential.
◮ If $T$ has an exponential tail with parameter $\lambda_s$, then $X(\beta)$ will be PT with index $\alpha = \lambda_s/\beta$. Thus as $\beta$ becomes bigger, $\alpha$ becomes smaller, fewer moments of $X(\beta)$ remain finite, and the system behavior becomes more unstable.
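A small sketch of how the index $\alpha = \lambda_s/\beta$ limits the moments of $X(\beta)$. It assumes, consistently with the definition of $\alpha$ as a supremum, that only integer moments of order strictly below $\alpha$ are counted as finite. The function names and sample values are hypothetical.

```python
import math

def power_tail_index(lambda_s, beta):
    """alpha = lambda_s / beta for an exponential-tailed T under RESTART."""
    if lambda_s <= 0 or beta <= 0:
        raise ValueError("lambda_s and beta must be positive")
    return lambda_s / beta

def highest_finite_moment(alpha):
    """Largest integer ell with ell < alpha (0 means even the mean diverges)."""
    return math.ceil(alpha) - 1

lambda_s = 1.0                      # assumed tail parameter of T
for beta in (0.1, 0.4, 0.5, 2.0):   # assumed failure rates
    alpha = power_tail_index(lambda_s, beta)
    print(beta, alpha, highest_finite_moment(alpha))
```

With these assumed values, the variance is already infinite at $\beta = 0.5$ ($\alpha = 2$) and the mean is infinite at $\beta = 2$ ($\alpha = 0.5$).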

SLIDE 13

Markov Models of Software (MMS model)

◮ Software systems (among others) are highly modular, with system control passed among independent components.
◮ The passing of control between the $M$ components (nodes) maps to an $M$-dimensional Markov matrix, $\mathbf{P}$.
◮ Assume that:
  ◮ the service time at each node is exponentially distributed with rate $\mu_i := [\mathbf{M}]_{ii} > 0$;
  ◮ there is a path to exit the system from each node.

Then, as previously described, the total execution time $T$ is ME distributed (in fact, phase-type).

SLIDE 14

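The MMS Model Under RESTART

For ME distributions, $\lambda_s := \min_i |\lambda_i|$, where $\{\lambda_i \mid 1 \le i \le M\}$ is the set of eigenvalues of $\mathbf{B}$ whose eigenvectors are not orthogonal to $\mathbf{p}$ or $\boldsymbol{\varepsilon}'$.

◮ If the MMS model is subject to exponential failures, and must RESTART, $X(\beta)$ will be PT distributed with $\alpha = \lambda_s/\beta$.
◮ The first two moments of $X(\beta)$ are given by:

$$\mathbb{E}[X(\beta)] = \mathbf{p}\left[\mathbf{V}(\mathbf{I} - \beta\mathbf{V})^{-1}\right]\boldsymbol{\varepsilon}' \qquad (\beta < \lambda_s)$$

$$\mathbb{E}[X(\beta)^2] = 2\,\mathbf{p}\left[\mathbf{V}^2(\mathbf{I} - 2\beta\mathbf{V})^{-1}(\mathbf{I} - \beta\mathbf{V})^{-2}\right]\boldsymbol{\varepsilon}' \qquad (\beta < \lambda_s/2)$$

even though $X(\beta > 0)$ is not ME.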
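The first-moment formula can be checked numerically on a small case. The sketch below assumes a hypothetical two-node tandem network (rates 2 and 3, entered at node 1) with failure rate $\beta = 0.5$; none of these values come from the talk. It compares the matrix expression with $(\mathbb{E}[e^{\beta T}] - 1)/\beta$, which for two exponentials in series is available in closed form.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    eye = identity(n)
    aug = [[float(x) for x in a[i]] + eye[i] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mean_restart_me(p, P, rates, beta):
    """E[X(beta)] = p V (I - beta V)^{-1} eps', meaningful only for beta < lambda_s."""
    n = len(P)
    eye = identity(n)
    B = [[rates[i] * (eye[i][j] - P[i][j]) for j in range(n)] for i in range(n)]
    V = inverse(B)
    W = inverse([[eye[i][j] - beta * V[i][j] for j in range(n)] for i in range(n)])
    VW = mat_mul(V, W)
    return sum(p[i] * sum(VW[i]) for i in range(n))

# Assumed example: feed-forward tandem, so lambda_s = min(mu1, mu2) = 2.
mu1, mu2, beta = 2.0, 3.0, 0.5
P = [[0.0, 1.0],
     [0.0, 0.0]]
mean_x = mean_restart_me([1.0, 0.0], P, [mu1, mu2], beta)

# Independent check via E[exp(beta T)] for two exponentials in series.
mgf = (mu1 / (mu1 - beta)) * (mu2 / (mu2 - beta))
check = (mgf - 1.0) / beta
print(mean_x, check)
```

With failures switched off the mean would be $1/2 + 1/3 \approx 0.83$; the value printed here is larger because each failure discards all work done so far.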

SLIDE 15

Markov Chains with Two Absorbing States - I

◮ Consider an $(M+2)$-dimensional Markov matrix $\bar{\mathbf{P}}$ with two absorbing states, $a$ and $b$. That is, $\bar{\mathbf{P}}\bar{\boldsymbol{\varepsilon}}' = \bar{\boldsymbol{\varepsilon}}'$ and $(\bar{\mathbf{P}})_{aa} = (\bar{\mathbf{P}})_{bb} = 1$.
◮ Deleting the rows and columns of $a$ and $b$ gives $\mathbf{P}$.
◮ Then $[\mathbf{Z}]_{ij} := [(\mathbf{I} - \mathbf{P})^{-1}]_{ij}$ is the expected number of visits to $j$ before absorption, given that the chain started at $i$.

SLIDE 16

Markov Chains with Two Absorbing States - II

◮ Now define the $M$-dimensional column vectors $(\mathbf{q}'_a)_i := \bar{P}_{ia}$ and $(\mathbf{q}'_b)_i := \bar{P}_{ib}$, where $i \ne a, b$. These are the vectors of probabilities of being absorbed in a single step by $a$ and $b$, respectively.
◮ It follows that the $i$th components of

$$\boldsymbol{\varepsilon}'_a := \mathbf{Z}\,\mathbf{q}'_a \qquad\text{and}\qquad \boldsymbol{\varepsilon}'_b := \mathbf{Z}\,\mathbf{q}'_b$$

are the probabilities that the process will end at $a$ or $b$, respectively, given that the process started at $i$. Note that $\boldsymbol{\varepsilon}'_a + \boldsymbol{\varepsilon}'_b = \boldsymbol{\varepsilon}'$.
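A minimal sketch of the absorption probabilities, assuming a hypothetical chain with two transient nodes; the matrix `P` and the exit vectors `q_a`, `q_b` are made-up values whose rows sum to 1 together. It forms $\mathbf{Z} = (\mathbf{I} - \mathbf{P})^{-1}$ and checks that the two absorption vectors add up to the vector of ones.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    eye = identity(n)
    aug = [[float(x) for x in a[i]] + eye[i] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_vec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]

# Assumed transient part P and one-step absorption vectors q_a, q_b.
P = [[0.0, 0.5],
     [0.2, 0.0]]
q_a = [0.3, 0.1]
q_b = [0.2, 0.7]

n = len(P)
eye = identity(n)
Z = inverse([[eye[i][j] - P[i][j] for j in range(n)] for i in range(n)])

eps_a = mat_vec(Z, q_a)   # probability of ending at a, per starting node
eps_b = mat_vec(Z, q_b)   # probability of ending at b, per starting node
print(eps_a, eps_b)
```

The entries of `Z` are also the expected visit counts, so `Z[0][1]` is the mean number of visits to the second node when starting at the first.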

SLIDE 17

Markov Chains with Two Absorbing States - III

◮ Let $\mathbf{p}_o$ be the entrance vector. Then

$$p_a = \mathbf{p}_o\boldsymbol{\varepsilon}'_a \qquad\text{and}\qquad p_b = \mathbf{p}_o\boldsymbol{\varepsilon}'_b,$$

where $p_a + p_b = 1$, are the probabilities that the process will be absorbed by $a$ or $b$.
◮ It is well known that $[\mathbf{p}_o \exp(-\mathbf{B}t)]_i$ is the probability that absorption has not occurred by time $t$, and the system is in state $i$. This all leads to the following:

SLIDE 18

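Markov Chains with Two Absorbing States - IV

◮ Theorem: Let $\mathbf{q}'_u$, $\boldsymbol{\varepsilon}'_u$, $\mathbf{p}_o$, $\mathbf{B}$ and $\mathbf{V}$, where $u \in \{a, b\}$, be defined as above. Then $T_u$, the time to absorption given that the process ends at $u$, has distribution

$$\bar F_u(t) := \Pr[T_u > t] = \mathbf{p}_o \exp(-\mathbf{B}t)\,\boldsymbol{\varepsilon}'_u / p_u, \qquad u = a, b.$$

The moments of these distributions come from above:

$$\mathbb{E}[T_u^{\ell}] = \ell!\,\mathbf{p}_o\,[\mathbf{V}^{\ell}]\,\boldsymbol{\varepsilon}'_u / p_u.$$

We then say that $\bar F_u(t)$ is generated by the triplet $\langle \mathbf{p}_o, \mathbf{B}, \boldsymbol{\varepsilon}'_u \rangle$.

◮ (Note that $\mathbb{E}[T^{\ell}] = p_a\,\mathbb{E}[T_a^{\ell}] + p_b\,\mathbb{E}[T_b^{\ell}] = \ell!\,\mathbf{p}_o\,[\mathbf{V}^{\ell}]\,\boldsymbol{\varepsilon}'$.)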

SLIDE 19

Applying Checkpointing to the MMS

◮ Checkpointing can easily be applied to the model to combat the PT service times under RESTART.
◮ After execution of a selected node $m$, a system checkpoint operation can be applied, saving the system state.
◮ Ideally, the designer will try checkpointing after each node in turn, and select the one that yields the best performance.
◮ To analyze this system we need the conditional distributions for the time to absorption at each of two absorbing states.

SLIDE 20

Checkpointing in Markov Systems (MMSC model)

◮ For the original MMSC model, select node $m$ as the one that is followed by a system checkpoint. Then $q_m = [\mathbf{q}']_m := [(\mathbf{I} - \mathbf{P})\boldsymbol{\varepsilon}']_m$ is the probability that execution will end after finishing at $m$.
◮ Add one row and one column to $\mathbf{P}$ at index $M + 1$, representing the system checkpoint state, to produce the matrix $\mathbf{P}_c$.

SLIDE 21

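The MMSC Model - I

◮ $\mathbf{P}_c$ has the following properties: for $i \ne m, M+1$ and $j \ne M+1$,

$$[\mathbf{P}_c]_{ij} = P_{ij}, \quad [\mathbf{P}_c]_{i,M+1} = 0, \quad [\mathbf{P}_c]_{mi} = 0, \quad [\mathbf{P}_c]_{m,M+1} = 1 - q_m, \quad [\mathbf{P}_c]_{M+1,k} = 0 \;\; \forall\, k.$$

◮ This defines a Markov chain with two absorbing states, $e$ (for end) and $c$ (for checkpoint).
◮ To use the established theorem we need $\mathbf{q}'_e$ and $\mathbf{q}'_c$.
◮ $\mathbf{q}'_e$ is the same exit vector as that for the original model, with additional component $[\mathbf{q}'_e]_{M+1} = 0$, so

$$[\mathbf{q}'_e]_i = [(\mathbf{I} - \mathbf{P})\boldsymbol{\varepsilon}']_i \;\; (i \le M), \qquad\text{but}\qquad [\mathbf{q}'_e]_{M+1} = 0.$$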

SLIDE 22

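The MMSC Model - II

◮ $\mathbf{q}'_c$ is given as: $[\mathbf{q}'_c]_i = 0$ for $i \le M$, and $[\mathbf{q}'_c]_{M+1} = 1$.
◮ We define the $(M+1)$-dimensional matrix $\mathbf{Z}_c = (\mathbf{I} - \mathbf{P}_c)^{-1}$ to get

$$\boldsymbol{\varepsilon}'_e = \mathbf{Z}_c\,\mathbf{q}'_e \qquad\text{and}\qquad \boldsymbol{\varepsilon}'_c = \mathbf{Z}_c\,\mathbf{q}'_c.$$

◮ The probability of finishing the process without checkpointing is $p_{oe} := \mathbf{p}_o\boldsymbol{\varepsilon}'_e$.
◮ We can also get the probability of reaching the checkpoint before finishing: $p_{oc} := \mathbf{p}_o\boldsymbol{\varepsilon}'_c$.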
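The construction of $\mathbf{P}_c$ and the two probabilities can be sketched as follows. The three-node matrix `P`, the choice of checkpoint node and the entrance vector are assumptions made up for the example, indices are 0-based, and the function names are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    eye = identity(n)
    aug = [[float(x) for x in a[i]] + eye[i] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def build_checkpoint_chain(P, m):
    """Return (Pc, q_e, q_c) for a checkpoint placed after node m (0-based)."""
    M = len(P)
    q = [1.0 - sum(row) for row in P]          # q' = (I - P) eps'
    Pc = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(M):
        if i == m:
            Pc[i][M] = 1.0 - q[m]              # after m: checkpoint unless the job exits
        else:
            for j in range(M):
                Pc[i][j] = P[i][j]             # other rows are copied unchanged
    q_e = q + [0.0]                            # exit vector, no exit from checkpoint state
    q_c = [0.0] * M + [1.0]                    # only the checkpoint state leads to c
    return Pc, q_e, q_c

# Assumed example: node 0 -> node 1 (0.6) or node 2 (0.4); node 1 loops back
# to node 0 with probability 0.3, otherwise exits; node 2 always exits.
P = [[0.0, 0.6, 0.4],
     [0.3, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
Pc, q_e, q_c = build_checkpoint_chain(P, m=1)

n = len(Pc)
eye = identity(n)
Zc = inverse([[eye[i][j] - Pc[i][j] for j in range(n)] for i in range(n)])
p_o = [1.0, 0.0, 0.0, 0.0]                     # entrance vector padded with a 0
p_oe = sum(p_o[i] * sum(Zc[i][k] * q_e[k] for k in range(n)) for i in range(n))
p_oc = sum(p_o[i] * sum(Zc[i][k] * q_c[k] for k in range(n)) for i in range(n))
print(p_oe, p_oc)
```

In this assumed network the job reaches the checkpoint only by passing through node 1 and then not exiting, so $p_{oc} = 0.6 \times 0.3$ and $p_{oe}$ is the remainder.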

SLIDE 23

The MMSC Model - III

◮ Now we apply the theorem to get the conditional distributions for the time to finish given no checkpoint ($T_{oe}$) and the time to reach and execute the checkpoint ($T_{oc}$).
◮ Define the diagonal matrix $\mathbf{M}_c$ by $[\mathbf{M}_c]_{ii} = [\mathbf{M}]_{ii}$ for $i \le M$ and $[\mathbf{M}_c]_{M+1,M+1} = \mu_c$, where $t_c = 1/\mu_c$ is the mean time to process a checkpoint.
◮ The conditional distributions are then, with $\mathbf{B}_c := \mathbf{M}_c(\mathbf{I} - \mathbf{P}_c)$,

$$\bar F_{ou}(t) := \Pr[T_{ou} > t] = \mathbf{p}_o \exp(-t\mathbf{B}_c)\,\boldsymbol{\varepsilon}'_u / p_{ou}, \qquad u \in \{e, c\}.$$

SLIDE 24

The MMSC Model - IV

◮ If the system execution takes the path described by $oe$, the process ends. But if the path leads to $m$, the system checkpoints after its execution.
◮ We must define a restart vector $\mathbf{p}_c$ as an entrance vector into the system corresponding to where the execution of the system begins again after checkpointing.
◮ $\mathbf{p}_c$ is composed of the transition probabilities out of state $m$:

$$\mathbf{p}_c := [P_{m1}, P_{m2}, \dots, P_{mM}, 0\,]/(1 - q_m)$$
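A short sketch of the restart vector, again on a made-up three-node matrix with 0-based indices. Since $1 - q_m$ equals the sum of row $m$ of $\mathbf{P}$, the vector is just that row renormalized and padded with a zero for the checkpoint state. The function name is hypothetical.

```python
def restart_vector(P, m):
    """p_c = [P_m1, ..., P_mM, 0] / (1 - q_m) for a checkpoint after node m (0-based)."""
    continue_prob = sum(P[m])            # equals 1 - q_m
    if continue_prob <= 0.0:
        raise ValueError("node m always exits, so no restart vector is defined")
    return [x / continue_prob for x in P[m]] + [0.0]

# Assumed example: node 1 returns to node 0 (0.2), moves to node 2 (0.1),
# or exits (0.7).
P = [[0.0, 0.6, 0.4],
     [0.2, 0.0, 0.1],
     [0.0, 0.0, 0.0]]
p_c = restart_vector(P, m=1)
print(p_c)
```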

SLIDE 25

The MMSC Model - V

◮ So the probability of the system finishing after checkpointing without returning to $m$ is $p_{ce} := \mathbf{p}_c\boldsymbol{\varepsilon}'_e$.
◮ The probability of the system returning to $m$ after already checkpointing (to save a more recent state of the system) is $p_{cc} := \mathbf{p}_c\boldsymbol{\varepsilon}'_c$.
◮ The time distributions for these two events are (for $u \in \{c, e\}$):

$$\bar F_{cu}(t) := \Pr[T_{cu} > t] = \mathbf{p}_c \exp(-t\mathbf{B}_c)\,\boldsymbol{\varepsilon}'_u / p_{cu}.$$

SLIDE 26

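The MMSC Model - VI

◮ What has been described can be thought of as an embedded Markov chain with four nodes ($oe$, $oc$, $ce$, $cc$) whose service time distributions are given by each of the $\bar F_{ab}$.
◮ The transition matrix for this process is:

$$\hat{\mathbf{P}}_c := \begin{array}{c|cccc} & oe & oc & ce & cc \\ \hline oe & 0 & 0 & 0 & 0 \\ oc & 0 & 0 & p_{ce} & p_{cc} \\ ce & 0 & 0 & 0 & 0 \\ cc & 0 & 0 & p_{ce} & p_{cc} \end{array} \qquad\text{with}\qquad \hat{\mathbf{p}}_c := [\,p_{oe},\; p_{oc},\; 0,\; 0\,]$$

◮ Expected number of visits to $c$:

$$\mathbb{E}[N_c] = p_{oc} + p_{oc}\,p_{cc}/p_{ce} = p_{oc}/p_{ce}$$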
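One way to see where the expected number of checkpoints comes from is to count them directly in the embedded chain. This is a sketch assuming that every return to $m$ triggers a new checkpoint and that $p_{cc} < 1$: the first checkpoint is reached with probability $p_{oc}$, and after each checkpoint another one follows with probability $p_{cc}$, which gives a geometric sum.

```latex
\begin{align*}
\mathbb{E}[N_c] &= p_{oc} \sum_{k=0}^{\infty} p_{cc}^{\,k}
   = \frac{p_{oc}}{1 - p_{cc}}
   = \frac{p_{oc}}{p_{ce}},
\qquad \text{since } p_{ce} + p_{cc} = 1 .
\end{align*}
```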

SLIDE 27

Diagrams of the Markov Chain

SLIDE 28

Applying RESTART to the MMSC

◮ $\hat{\mathbf{P}}_c$, together with the ME service time distributions of each node, is an ME representation (but only for $\beta = 0$).
◮ If there is a failure, the system only has to redo whatever work had been accomplished within the node that had failed.
◮ Thus we can get $\mathbb{E}[X_u(\beta)]$ and $\mathbb{E}[X_u^2(\beta)]$ for $u \in \{oe, oc, cc, ce\}$.
◮ With the first two moments of the distribution for each node, we can get $\mathbb{E}[X_c^{\ell}(\beta)]$ ($\ell = 1, 2$) as follows:

SLIDE 29

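Mean and Variance of $X_c(\beta)$

◮ Define the 4-dimensional matrices

$$[\hat{\mathbf{T}}_c]_{uu} := \mathbb{E}[X_u(\beta)], \qquad \hat{\mathbf{V}}_c := [\hat{\mathbf{I}} - \hat{\mathbf{P}}_c]^{-1}\,\hat{\mathbf{T}}_c, \qquad\text{and}\qquad [\hat{\boldsymbol{\Gamma}}]_{uu} := C_u^2 - 1,$$

where $C_u^2 = \sigma_u^2(\beta)/(\mathbb{E}[X_u(\beta)])^2$ is the squared coefficient of variation of $X_u(\beta)$.

◮ Then

$$\mathbb{E}[X_c(\beta)] = \hat{\mathbf{p}}_c\hat{\mathbf{V}}_c\hat{\boldsymbol{\varepsilon}}' \qquad\text{and}\qquad \sigma_c^2(\beta) = \sigma_{\exp}^2 + \hat{\mathbf{p}}_c\hat{\mathbf{V}}_c\hat{\mathbf{T}}_c\hat{\boldsymbol{\Gamma}}\hat{\boldsymbol{\varepsilon}}',$$

where

$$\sigma_{\exp}^2 = 2\,(\hat{\mathbf{p}}_c\hat{\mathbf{V}}_c^{2}\hat{\boldsymbol{\varepsilon}}') - (\hat{\mathbf{p}}_c\hat{\mathbf{V}}_c\hat{\boldsymbol{\varepsilon}}')^2$$

is the variance of the similar exponential network.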
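The mean formula can be illustrated on the four-node embedded chain. The sketch below orders the nodes as oe, oc, ce, cc, with oe and ce leading to exit and oc and cc followed by ce or cc. The branching probabilities and the per-node means $\mathbb{E}[X_u(\beta)]$ are assumed numbers chosen for the example, not results from the talk, and the function names are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    eye = identity(n)
    aug = [[float(x) for x in a[i]] + eye[i] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

ORDER = ["oe", "oc", "ce", "cc"]

def mean_with_checkpoint(p_oe, p_oc, p_ce, p_cc, node_means):
    """E[X_c] = p_hat (I - P_hat)^{-1} T_hat eps' for the embedded chain."""
    P_hat = [[0.0, 0.0, 0.0, 0.0],      # oe: job finishes
             [0.0, 0.0, p_ce, p_cc],    # oc: after the first checkpoint
             [0.0, 0.0, 0.0, 0.0],      # ce: job finishes
             [0.0, 0.0, p_ce, p_cc]]    # cc: after a repeated checkpoint
    p_hat = [p_oe, p_oc, 0.0, 0.0]
    eye = identity(4)
    N = inverse([[eye[i][j] - P_hat[i][j] for j in range(4)] for i in range(4)])
    # Expected number of visits to each embedded node.
    visits = [sum(p_hat[i] * N[i][j] for i in range(4)) for j in range(4)]
    return sum(visits[j] * node_means[ORDER[j]] for j in range(4)), visits

# Assumed inputs for illustration only.
node_means = {"oe": 1.5, "oc": 1.0, "ce": 2.0, "cc": 2.5}
mean_xc, visits = mean_with_checkpoint(0.82, 0.18, 0.6, 0.4, node_means)
print(mean_xc, visits)
```

The visit counts to oc and cc together give the expected number of checkpoints taken, so they offer a second check on the embedded chain.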

SLIDE 30

Asymptotic Properties of $T$, $X(\beta)$ and $X_c(\beta)$ - I

◮ The exponential tail for $T$ is determined by $\lambda_s = \lambda_{\min}$, where $\lambda_{\min}$ is the smallest eigenvalue of $\mathbf{B}$.
◮ If $\mathbf{P}$ is a feed-forward matrix, then the eigenvalues of $\mathbf{B}$ are the service rates, $\mu_i$, of the nodes (assuming $P_{ii} = 0$), so $\lambda_s = \min\{\mu_i\}$.
◮ If there are some feed-back loops, then $\lambda_s$ may be smaller. In any case, $\lambda_s \le \min\{\mu_i\}$.

SLIDE 31

Asymptotic Properties of $T$, $X(\beta)$ and $X_c(\beta)$ - II

◮ The PT index for $X(\beta)$ is $\alpha = \lambda_s/\beta$.
◮ Let $\lambda_{us}$ ($u \in \{oc, oe, cc, ce\}$) be the exponential parameter for $F_u(t)$. Then $\lambda_{cs} := \min\{\lambda_{us}\}$ determines $\alpha_c = \lambda_{cs}/\beta$.
◮ If $\mathbf{P}$ is feed-forward, then the index for $X_c(\beta)$ is the same as for $X(\beta)$ (although $\mathbb{E}[X_c(\beta)] < \mathbb{E}[X(\beta)]$).
◮ If $\mathbf{P}$ has loops, and the checkpoint is inserted within a loop, then $\alpha_c$ can be much larger.
◮ Even if $\hat{\mathbf{P}}$ has feedback ($p_{cc} > 0$), $\alpha_c$ does NOT change.

SLIDE 32

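$\mathbf{P}$ is an $8 \times 8$ matrix of which only the nonzero entries are legible here, in reading order (their column positions are not recoverable): .7, 0.30, 1.00, 0.75, .25, .4, .6, .3, .3, .1, .3, .8, .2, .75, .1.

$$\mathbf{q}' = [\,.00,\; .00,\; .00,\; .00,\; .00,\; .00,\; .25,\; .90\,]^{\mathsf{T}}, \qquad \mathbf{q}' = (\mathbf{I} - \mathbf{P})\boldsymbol{\varepsilon}',$$

with entrance vector $\mathbf{p} = [\,0.60,\; 0.20,\; 0.20,\; 0,\; 0,\; 0,\; 0,\; 0\,]$, and $\mathbf{M} = \mathrm{Diag}[\,1.2,\; 2.3,\; 3.4,\; 0.8,\; 2.0,\; 2.4,\; 6.5,\; \mu_c\,]$.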

SLIDE 33

Diagrams of the Markov Chain With Node Service Rates

SLIDE 34

Reliability Functions, $\bar F$, $\bar F_u$ ($u \in \{oc, oe, cc, ce\}$)

[Figure: semi-log plot of the reliability functions $\bar F_{oc}$, $\bar F_{cc}$, $\bar F_{oe}$, $\bar F_{ce}$ and $\bar F$ versus $t$ (1 to 10), with $1/\mu_c = 0$.]

SLIDE 35

Asymptotic Tail Parameter

[Figure: smallest eigenvalue, $\lambda_{sc} = \min\{\mu_4, \mu_c\}$, versus mean checkpoint time $t_c = 1/\mu_c$ (5 to 30).]

SLIDE 36

The Checkpointing Effect - I ($\mathbb{E}[X_c(\beta)]$)

[Figure: $\mathbb{E}[X_c(\beta)]$ versus failure rate $\beta$, with curves for no checkpointing, $\mu_c = 0.6$, $\mu_c = 0.8$, $\mu_c = \mu_a/2 = 1.25$, $\mu_c = \mu_a = 2.5$, and $\mu_c = \infty$.]

SLIDE 37

The Checkpointing Effect - II ($\mathbb{E}[X_c(\beta)]\,(1 - \beta/\lambda_{sc})$)

[Figure: $\mathbb{E}[X_c(\beta)]\,(1 - \beta/\lambda_s)$ versus failure rate $\beta$, with curves for no checkpointing, $\mu_c = 0.6$, $\mu_c = 0.8$, $\mu_c = \mu_a/2 = 1.25$, $\mu_c = \mu_a = 2.5$, and $\mu_c = \infty$.]

SLIDE 38

Squared Coefficient of Variation ($C_v^2 := \sigma^2/\mathbb{E}[X_c(\beta)]^2$)

[Figure: $C_v^2(\beta)$ versus failure rate $\beta$ (0.05 to 0.35), with curves for no checkpointing, $\mu_c = 0.6$, $\mu_c = 0.8$, $\mu_c = \mu_a/2 = 1.25$, $\mu_c = \mu_a = 2.5$, and $\mu_c = \infty$.]

SLIDE 39

Blowup of Squared Coefficient of Variation ($C_v^2 := \sigma^2/\mathbb{E}[X_c(\beta)]^2$)

[Figure: enlarged view of $C_v^2(\beta)$ for small failure rates $\beta$ (0.005 to 0.05), with curves for no checkpointing, $\mu_c = 0.6$, $\mu_c = 0.8$, $\mu_c = \mu_a/2 = 1.25$, $\mu_c = \mu_a = 2.5$, and $\mu_c = \infty$.]

SLIDE 40

Some Unresolved Questions

◮ How large must $x$ be before the asymptotic formula is a "good" approximation to $\bar H(x)$?
◮ How robust is the method if the nodes have non-exponential service times?
◮ What is to be done if the failure distribution is not exponential?

SLIDE 41

Simulation of $X(\beta)$ and Asymptotic Formulas for Exponential, Hyperexponential, and Erlangian Functions

SLIDE 42

Relative Difference Between Simulation and Analytic Asymptotic Formula [Abs(Sim − Asymp)/Asymp]

SLIDE 43

Conclusion

◮ We can compute the moments of $\bar H(x)$;
◮ We can get the asymptotic index, $\alpha_c$;
◮ We can't get $\bar H(x)$.
