SLIDE 1

Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca1, Aurélien Bouteiller1, Elisabeth Brunet2, Franck Cappello3, Jack Dongarra1, Amina Guermouche4, Thomas Hérault1, Yves Robert1,4, Frédéric Vivien4, and Dounia Zaidouni4

  • 1. University of Tennessee Knoxville, USA
  • 2. Telecom SudParis, France
  • 3. INRIA & University of Illinois at Urbana-Champaign, USA
  • 4. École Normale Supérieure de Lyon & INRIA, France

Pittsburgh, June 28, 2012

SLIDE 2

Protocol Overhead · Accounting for message logging · Instantiating the model · Experimental results

Motivation

  • Very large number of processing elements (e.g., 2^20)
    ⇒ probability of failure dramatically increases
  • Large application to be executed on the whole platform
    ⇒ failure(s) will most likely occur before completion!
  • Resilience provided through checkpointing:
    1. Coordinated protocols
    2. Hierarchical protocols

SLIDE 3

Which checkpointing protocol to use?

Coordinated checkpointing
  • No risk of cascading rollbacks
  • No need to log messages
  • All processors need to roll back
  • Rumor: may not scale to very large platforms

Hierarchical checkpointing
  • Need to log inter-group messages
    • Slows down failure-free execution
    • Increases checkpoint size/time
  • Only processors from the failed group need to roll back
  • Faster re-execution with logged messages
  • Rumor: should scale to very large platforms

SLIDE 4

Outline

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 5

Outline

1. Protocol Overhead  2. Accounting for message logging  3. Instantiating the model  4. Experimental results

SLIDE 6

Framework

  • Periodic checkpointing policy of period T
  • Independent and identically distributed failures
  • Platform failure inter-arrival time: µ
  • Tightly-coupled application: progress ⇔ all processors available
  • First-order approximation: at most one failure within a period

Waste: fraction of time not spent on useful computations

SLIDE 7

Checkpointing cost

[Figure: timeline of two chunks being processed and checkpointed; legend: time spent working, time spent checkpointing]

SLIDE 8

Checkpointing cost

[Figure: timeline of two chunks being processed and checkpointed; legend: time spent working, time spent checkpointing]

Blocking model: while a checkpoint is taken, no computation can be performed

SLIDE 9

Checkpointing cost

[Figure: timeline where the second chunk is processed while the first is being checkpointed; legend: time spent working, time spent checkpointing]

Non-blocking model: while a checkpoint is taken, computations are not impacted (e.g., first copy state to RAM, then copy RAM to disk)

SLIDE 10

Checkpointing cost

[Figure: timeline where computation continues at reduced speed during checkpointing; legend: working, checkpointing, working with slowdown]

General model: while a checkpoint is taken, computations are slowed down: during a checkpoint of duration C, the same amount of computation is done as during a time αC without checkpointing (0 ≤ α ≤ 1).

SLIDE 11

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 12

Waste in absence of failures

[Figure: one period on processors P0–P3: work for T − C, then a checkpoint of length C; legend: working, checkpointing, working with slowdown]

Time elapsed since last checkpoint: T
Amount of computation saved: (T − C) + αC

Waste_coord-nofailure = (T − ((T − C) + αC)) / T = (1 − α)C / T
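The slide's algebra can be checked numerically; the values below are illustrative, not taken from any platform:

```python
# Failure-free waste of coordinated checkpointing (illustrative values).
T = 3600.0      # checkpointing period (s)
C = 600.0       # checkpoint duration (s)
alpha = 0.3     # during C, work equivalent to alpha*C gets done anyway

saved = (T - C) + alpha * C   # computation saved per period
waste = (T - saved) / T       # fraction of time not spent on useful work
assert abs(waste - (1 - alpha) * C / T) < 1e-12
```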

SLIDE 13

Waste due to failures

[Figure: a failure strikes one processor during the period; legend: working, checkpointing, working with slowdown]

A failure can happen:
1. during the computation phase
2. during the checkpointing phase


SLIDE 16

Waste due to failures

[Figure: T_lost of work since the last checkpoint is destroyed on all processors]

Coordinated checkpointing protocol: when one processor is the victim of a failure, all processors lose their work and must roll back to the last checkpoint.

SLIDE 17

Waste due to failures in computation phase

[Figure: downtime D follows the failure]

SLIDE 18

Waste due to failures in computation phase

[Figure: all processors recover from the last checkpoint in time R]

Coordinated checkpointing protocol: All processors must recover from last checkpoint

SLIDE 19

Waste due to failures in computation phase

[Figure: re-executing, in time αC, the slowed-down work done during the previous checkpoint]

Redo the work destroyed by the failure that was done during the checkpointing phase preceding the computation phase. But no checkpoint is taken in parallel, so this re-computation is faster than the original computation.

SLIDE 20

Waste due to failures in computation phase

[Figure: re-executing the computation phase of length T − C]

Re-execute the computation phase

SLIDE 21

Waste due to failures in computation phase

[Figure: final checkpoint of length C]

Finally, the checkpointing phase is executed. First-order approximation: we assume that no other failure occurs during the re-execution.

SLIDE 22

Waste due to failures in computation phase

[Figure: complete failure timeline over a period of total length ∆: T_lost, D, R, αC, T − C, C]

Re-Exec: ∆ − T = T_lost + αC
Expectation: T_lost = (T − C)/2

Re-Exec_coord-fail-in-work = (T − C)/2 + αC

SLIDE 23

Waste due to failures

  • Failure in the computation phase (probability (T − C)/T):
    Re-Exec_coord-fail-in-work = (T − C)/2 + αC
  • Failure in the checkpointing phase (probability C/T):
    Re-Exec_coord-fail-in-checkpoint = T − C/2 + αC

Re-Exec_coord = ((T − C)/T)·((T − C)/2 + αC) + (C/T)·(T − C/2 + αC) = αC + T/2
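A quick numeric sanity check that the two conditional re-execution costs, weighted by their phase probabilities, do collapse to αC + T/2 (illustrative values):

```python
# Check: weighting the two re-execution costs by their phase probabilities
# gives alpha*C + T/2 (symbols as in the model; values illustrative).
T, C, alpha = 3600.0, 600.0, 0.3

re_work = (T - C) / 2 + alpha * C   # failure during computation phase
re_ckpt = T - C / 2 + alpha * C     # failure during checkpointing phase
avg = ((T - C) / T) * re_work + (C / T) * re_ckpt
assert abs(avg - (alpha * C + T / 2)) < 1e-9
```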

SLIDE 24

Overall waste

Waste_coord = Waste_coord-nofailure + (1/µ)·(D + R + Re-Exec_coord)
            = (1 − α)C/T + (1/µ)·(D + R + αC + T/2)

Minimize Waste_coord subject to:
  • C ≤ T (by construction)
  • T ≤ 0.1µ (⇒ Prob(Poisson(T/µ) ≥ 2) ≤ 0.005)
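Minimizing Waste_coord over T in closed form (a derivation from the slide's expression, not stated on the slide): dWaste/dT = −(1 − α)C/T² + 1/(2µ) = 0 gives a Young/Daly-style optimum T_opt = sqrt(2(1 − α)Cµ). A sketch that checks this numerically, with all parameter values illustrative:

```python
import math

# Waste_coord(T) = (1-alpha)*C/T + (D + R + alpha*C + T/2)/mu.
# Setting dWaste/dT = 0 yields T_opt = sqrt(2*(1-alpha)*C*mu),
# a Young/Daly-style optimum (derived here, not on the slide).
C, D, R, alpha, mu = 600.0, 60.0, 600.0, 0.3, 86400.0

def waste(T):
    return (1 - alpha) * C / T + (D + R + alpha * C + T / 2) / mu

t_opt = math.sqrt(2 * (1 - alpha) * C * mu)
# Numeric check: waste at t_opt beats a fine grid around it.
grid = [t_opt * (0.5 + 0.01 * i) for i in range(101)]
assert all(waste(t_opt) <= waste(t) + 1e-12 for t in grid)
```

Note that for these values t_opt ≈ 8519 s, which still satisfies the constraint T ≤ 0.1µ = 8640 s.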

SLIDE 25

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 26

Hierarchical checkpointing

  • Processors partitioned into G groups
  • Each group includes q processors
  • Inside each group: coordinated checkpointing in time C(q)
  • Inter-group messages are logged

SLIDE 27

Impact of checkpointing

[Figure: the G groups G1–G5 checkpoint in sequence, for a total of G·C; legend: working, checkpointing, working with slowdown]

When a group checkpoints, its own computation speed is slowed down. This holds for all groups because of the tightly-coupled assumption.

Waste = (T − Work)/T, where Work = T − (1 − α)·G·C(q)
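As a sanity check, this failure-free waste reduces to (1 − α)·G·C(q)/T (illustrative values):

```python
# Failure-free waste of hierarchical checkpointing (illustrative values):
# G groups checkpoint in sequence, each slowing everyone down for C(q).
T, Cq, alpha, G = 3600.0, 60.0, 0.3, 10

work = T - (1 - alpha) * G * Cq   # useful work per period
waste = (T - work) / T
assert abs(waste - (1 - alpha) * G * Cq / T) < 1e-12
```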

SLIDE 30

Impact of checkpointing

[Figure: a failure strikes one group; while it is in downtime, all groups are idle]

Tightly-coupled model: while one group is in downtime, none can work

SLIDE 31

Failure during computation phase

[Figure: the failed group recovers; the other groups wait]

Tightly-coupled model: while one group is in recovery, none can work

SLIDE 32

Failure during computation phase

[Figure: the failed group rolls back; no checkpointing during the rollback]

Groups must have completed the same amount of work between two consecutive checkpoints, whether or not a failure happened on the platform in between. Hence, no checkpointing is possible during the rollback.

SLIDE 33

Failure during computation phase

[Figure: failed group Gg re-executes the (G − g + 1)·C of checkpoint-phase work destroyed by the failure]

Redo the work done during the previous checkpointing phase that was destroyed by the failure.

SLIDE 34

Failure during computation phase

[Figure: the re-execution takes only α(G − g + 1)·C, since no checkpoint is taken in parallel]

Redo the work done during the previous checkpointing phase that was destroyed by the failure. But no checkpoint is taken in parallel, so this re-computation is faster than the original computation.

SLIDE 35

Failure during computation phase

[Figure: the failed group re-executes the T_lost of computation-phase work]

Redo the work done in the computation phase that was destroyed by the failure.

SLIDE 36

Failure during computation phase

[Figure: all groups execute the remaining T − G·C − T_lost of the computation phase]

Once the failing group has reached the point where it previously failed, all groups resume execution in parallel and complete the computation phase.

SLIDE 37

Failure during computation phase

[Figure: final checkpointing phase of length G·C]

Finally, perform the checkpointing phase.

SLIDE 38

Failure during computation phase

[Figure: complete hierarchical failure timeline: T_lost, D, R, α(G − g + 1)·C, T − G·C − T_lost, G·C over period T]

Re-Exec: T_lost + α(G − g + 1)C
Expectation: T_lost = (T − G·C)/2

Approximated Re-Exec: (T − G·C)/2 + α(G − g + 1)C

SLIDE 39

Failure during computation phase

[Figure: hierarchical failure timeline, as on the previous slide]

Approximated Re-Exec: (T − G·C)/2 + α(G − g + 1)C

Average approximated Re-Exec:
(1/G) · Σ_{g=1}^{G} [ (T − G·C(q))/2 + α(G − g + 1)C(q) ] = (T − G·C(q))/2 + α·((G + 1)/2)·C(q)
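The averaging step can be verified directly, since the mean of (G − g + 1) over g = 1..G is (G + 1)/2 (illustrative values):

```python
# Check the average of the approximated Re-Exec over the failing group g:
# (1/G) * sum_g [ (T - G*C)/2 + alpha*(G - g + 1)*C ]
#     == (T - G*C)/2 + alpha*(G + 1)/2 * C
T, C, alpha, G = 3600.0, 60.0, 0.3, 10

avg = sum((T - G * C) / 2 + alpha * (G - g + 1) * C
          for g in range(1, G + 1)) / G
closed = (T - G * C) / 2 + alpha * (G + 1) / 2 * C
assert abs(avg - closed) < 1e-9
```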

SLIDE 40

Failure during checkpointing phase

[Figure: a failure strikes while the groups are checkpointing]

SLIDE 41

Failure during checkpointing phase

[Figure: a failure strikes while the groups are checkpointing]

When does the failing group fail?

1. before starting its own checkpoint
2. while taking its own checkpoint
3. after completing its own checkpoint

SLIDE 42

Average waste for failures during checkpointing phase

Average Re-Exec when the failing group is g, overall:
Re-Exec_ckpt = (1/G)·((g − 1)·Re-Exec_before-ckpt + 1·Re-Exec_during-ckpt + (G − g)·Re-Exec_after-ckpt)

Average over all groups:
avg Re-Exec_ckpt = ((G + 1)/(2G))·T + αC(q)·(G + 3)/2 + C(q)(1 − 2α)/(2G) − C(q)(G + 1)/2

SLIDE 43

Average waste

Waste_hierarch = (T − Work)/T + (1/µ)·(D(q) + R(q) + Re-Exec)

= (1/(2µT)) × [ T² + G·C(q)·((1 − α)(2µ − T) + (2α − 1)C(q)) + T·(2(D(q) + R(q)) + (α + 1)C(q)) + (1 − 2α)C(q)² ]

Minimize Waste_hierarch subject to:
  • G·C(q) ≤ T (by construction)
  • T ≤ 0.1µ (⇒ Prob(Poisson(T/µ) ≥ 2) ≤ 0.005)

SLIDE 44

Outline

1. Protocol Overhead  2. Accounting for message logging  3. Instantiating the model  4. Experimental results

SLIDE 45

Impact on work

  • Logging messages slows down execution:
    ⇒ Work becomes λ·Work, where 0 < λ < 1. Typical value: λ ≈ 0.98
  • Re-execution after a failure is faster:
    ⇒ Re-Exec becomes Re-Exec/ρ, where ρ ∈ [1, 2]. Typical value: ρ ≈ 1.5

Waste_hierarch = (T − λ·Work)/T + (1/µ)·(D(q) + R(q) + Re-Exec/ρ)
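A small numeric illustration of the two opposing effects (all values below are illustrative, not measurements): λ < 1 inflates the failure-free term, while ρ > 1 deflates the re-execution term:

```python
# Effect of message logging on the hierarchical waste (illustrative values):
# logging shrinks useful work to lambda*Work but speeds re-execution by rho.
T, mu = 3600.0, 86400.0
work, d, r, re_exec = 3200.0, 60.0, 600.0, 1900.0
lam, rho = 0.98, 1.5   # typical values quoted on the slide

waste_logged = (T - lam * work) / T + (d + r + re_exec / rho) / mu
waste_plain = (T - work) / T + (d + r + re_exec) / mu
assert 0.151 < waste_logged < 0.152
assert 0.140 < waste_plain < 0.141
```

For this particular (short) MTBF the logging overhead dominates; which effect wins depends on the failure rate.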
SLIDE 46

Impact on checkpoint size

  • Inter-group messages are logged continuously
  • Checkpoint size increases with the amount of work executed before a checkpoint
  • C0(q): checkpoint size of a group without message logging

C(q) = C0(q)(1 + β·Work) ⇔ β = (C(q) − C0(q)) / (C0(q)·Work)
Work = λ(T − (1 − α)·G·C(q))
C(q) = C0(q)(1 + βλT) / (1 + G·C0(q)βλ(1 − α))

  • The constraint G·C(q) ≤ T translates into
    G·C0(q)βλα < 1 and T ≥ G·C0(q) / (1 − G·C0(q)βλα)
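The closed form for C(q) is the fixed point of the two relations above; iterating them converges to it. The parameter values loosely echo the Titan Hierarch-IO row later in the talk, but are illustrative:

```python
# The closed form for C(q) is the fixed point of
#   C = C0 * (1 + beta * Work),  Work = lambda * (T - (1 - alpha) * G * C).
# Iterating the recurrence converges to the slide's closed form.
C0, beta, lam, alpha, G, T = 15.0, 1.1e-4, 0.98, 0.3, 136, 3600.0

c = C0
for _ in range(200):
    work = lam * (T - (1 - alpha) * G * c)
    c = C0 * (1 + beta * work)

closed = C0 * (1 + beta * lam * T) / (1 + G * C0 * beta * lam * (1 - alpha))
assert abs(c - closed) < 1e-9
```

Convergence holds because the iteration's contraction factor G·C0·βλ(1 − α) ≈ 0.15 is below 1 for these values.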

SLIDE 47

Outline

1. Protocol Overhead  2. Accounting for message logging  3. Instantiating the model  4. Experimental results

SLIDE 48

Three case studies

Coord-IO: coordinated approach: C = C_Mem = Mem/b_io, where Mem is the memory footprint of the application

Hierarch-IO: several (large) groups, I/O-saturated ⇒ groups checkpoint sequentially: C0(q) = C_Mem/G = Mem/(G·b_io)

Hierarch-Port: very large number of smaller groups, port-saturated ⇒ some groups checkpoint in parallel. Groups of q_min processors, where q_min·b_port ≥ b_io
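The Coord-IO formula can be cross-checked against the platform tables that follow; with the Exascale-Slim numbers (10^6 processors × 64 GB, b_io = 1 TB/s) it reproduces the table's C = 64,000 s:

```python
# Coord-IO checkpoint time C = Mem / b_io, where Mem is the total memory
# footprint. Exascale-Slim figures from the platform table.
processors = 1_000_000
mem_per_proc_gb = 64
b_io_gb_per_s = 1000          # 1 TB/s

c_mem = processors * mem_per_proc_gb / b_io_gb_per_s
assert c_mem == 64_000.0       # matches the Coord-IO row

# Hierarch-IO with G groups checkpointing sequentially: C0(q) = C_Mem / G.
G = 1000
assert c_mem / G == 64.0       # matches the Hierarch-IO row
```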

SLIDE 49

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 50

Three applications

1. 2D-Stencil
2. 3D-Stencil
   • Plane
   • Line
3. Matrix product

SLIDE 51

Computing β for Stencil-2D

C(q) = C0(q) + Logged_Msg = C0(q)(1 + β·Work)
Real n × n matrix and p × p grid
Work = 9b²/s_p, with b = n/p
Each process sends a block to its 4 neighbors

SLIDE 52

Computing β for Stencil-2D

C(q) = C0(q) + Logged_Msg = C0(q)(1 + β·Work)
Real n × n matrix and p × p grid
Work = 9b²/s_p, with b = n/p
Each process sends a block to its 4 neighbors

Hierarch-IO:
  • 1 group = 1 grid row
  • 2 out of the 4 messages are logged
  • β = 2s_p / (9b³)
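Where β = 2s_p/(9b³) comes from, under one reading of the slide's units (assumptions: each process checkpoints its b × b block, so C0 ∝ b², and logs 2 boundary messages of b elements per iteration of duration 9b²/s_p):

```python
# Sanity check of beta = 2*s_p/(9*b^3) for Stencil-2D, assuming each process
# checkpoints its b x b block (C0 ~ b^2 data) and logs 2 of its 4 boundary
# messages of b elements per iteration (one iteration takes 9*b^2/s_p time).
b, s_p = 100.0, 1e9

t_iter = 9 * b**2 / s_p            # time per iteration
logged_per_time = 2 * b / t_iter   # logged data per unit of time
# C = C0 * (1 + beta * Work) with C0 = b^2 means beta = logging rate / C0:
beta = logged_per_time / b**2
assert abs(beta - 2 * s_p / (9 * b**3)) < 1e-6 * beta
```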

SLIDE 53

Computing β for Stencil-2D

C(q) = C0(q) + Logged_Msg = C0(q)(1 + β·Work)
Real n × n matrix and p × p grid
Work = 9b²/s_p, with b = n/p
Each process sends a block to its 4 neighbors

Hierarch-IO:
  • 1 group = 1 grid row
  • 2 out of the 4 messages are logged
  • β = 2s_p / (9b³)

Hierarch-Port:
  • β doubles: β = 4s_p / (9b³)

SLIDE 54

Three applications: 2) 3D-stencil

  • Real matrix of size n × n × n partitioned across a p × p × p processor grid
  • Each processor holds a cube of size b = n/p
  • At each iteration:
    • average each matrix element with its 27 closest neighbors
    • exchange the six faces of its cube
  • (Parallel) work for one iteration: Work = 27b³/s_p

Three hierarchical variants:
1. Hierarch-IO-Plane: group = horizontal plane of size p²: β = 2s_p / (27b³)
2. Hierarch-IO-Line: group = horizontal line of size p: β = 4s_p / (27b³)
3. Hierarch-Port: groups of size q_min: β = 6s_p / (27b³)

SLIDE 55

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 56

Four platforms: basic characteristics

Name           Number of cores  Processors p_total  Cores/processor  Memory/processor  b_io read  b_io write
Titan          299,008          16,688              16               32 GB             300 GB/s   300 GB/s
K-Computer     705,024          88,128              8                16 GB             150 GB/s   96 GB/s
Exascale-Slim  1,000,000,000    1,000,000           1,000            64 GB             1 TB/s     1 TB/s
Exascale-Fat   1,000,000,000    100,000             10,000           640 GB            1 TB/s     1 TB/s

SLIDE 57

Four platforms: 2D-Stencil and Matrix-Product

Name           Scenario       G (C(q))          β for 2D-Stencil  β for Matrix-Product
Titan          Coord-IO       1 (2,048 s)       /                 /
Titan          Hierarch-IO    136 (15 s)        0.0001098         0.0004280
Titan          Hierarch-Port  1,246 (1.6 s)     0.0002196         0.0008561
K-Computer     Coord-IO       1 (14,688 s)      /                 /
K-Computer     Hierarch-IO    296 (50 s)        0.0002858         0.001113
K-Computer     Hierarch-Port  17,626 (0.83 s)   0.0005716         0.002227
Exascale-Slim  Coord-IO       1 (64,000 s)      /                 /
Exascale-Slim  Hierarch-IO    1,000 (64 s)      0.0002599         0.001013
Exascale-Slim  Hierarch-Port  200,000 (0.32 s)  0.0005199         0.002026
Exascale-Fat   Coord-IO       1 (64,000 s)      /                 /
Exascale-Fat   Hierarch-IO    316 (217 s)       0.00008220        0.0003203
Exascale-Fat   Hierarch-Port  33,333 (1.92 s)   0.00016440        0.0006407

SLIDE 58

Four platforms: 3D-Stencil

Name           Scenario           G        β for 3D-Stencil
Titan          Coord-IO           1        /
Titan          Hierarch-IO-Plane  26       0.001476
Titan          Hierarch-IO-Line   675      0.002952
Titan          Hierarch-Port      1,246    0.004428
K-Computer     Coord-IO           1        /
K-Computer     Hierarch-IO-Plane  44       0.003422
K-Computer     Hierarch-IO-Line   1,936    0.006844
K-Computer     Hierarch-Port      17,626   0.010266
Exascale-Slim  Coord-IO           1        /
Exascale-Slim  Hierarch-IO-Plane  100      0.003952
Exascale-Slim  Hierarch-IO-Line   10,000   0.007904
Exascale-Slim  Hierarch-Port      200,000  0.011856
Exascale-Fat   Coord-IO           1        /
Exascale-Fat   Hierarch-IO-Plane  46       0.001834
Exascale-Fat   Hierarch-IO-Line   2,116    0.003668
Exascale-Fat   Hierarch-Port      33,333   0.005502

SLIDE 59

Outline

1. Protocol Overhead  2. Accounting for message logging  3. Instantiating the model  4. Experimental results

SLIDE 60

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 61

Platform: Titan

Waste as a function of processor MTBF µ

SLIDE 62

Platform: K-Computer

Waste as a function of processor MTBF µ

SLIDE 63

Platform: Exascale

Waste = 1 for all scenarios!!!

SLIDE 64

Platform: Exascale

Waste = 1 for all scenarios!!!

Goodbye Exascale?!

SLIDE 65

Platform: Exascale with C = 1,000

[Figures: Exascale-Slim and Exascale-Fat; waste as a function of processor MTBF µ, C = 1,000]

SLIDE 66

Platform: Exascale with C = 100

[Figures: Exascale-Slim and Exascale-Fat; waste as a function of processor MTBF µ, C = 100]

SLIDE 67

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations

SLIDE 68

Platform: Titan

[Plots: makespan (in days) as a function of processor MTBF µ; curves: Coordinated, Coordinated BestPer, Hierarchical, Hierarchical BestPer, Hierarchical Port, Hierarchical Port BestPer]

Makespan (in days) as a function of processor MTBF µ

SLIDE 69

Platform: Exascale with C = 1,000

[Plots for Exascale-Slim and Exascale-Fat: makespan (in days) as a function of processor MTBF µ, C = 1,000; curves: Coordinated, Coordinated BestPer, Hierarchical, Hierarchical BestPer, Hierarchical Port, Hierarchical Port BestPer]

Makespan (in days) as a function of processor MTBF µ, C = 1,000

SLIDE 70

Platform: Exascale with C = 100

[Plots for Exascale-Slim and Exascale-Fat: makespan (in days) as a function of processor MTBF µ, C = 100; curves: Coordinated (Daly), Coordinated BestPer, Hierarchical, Hierarchical BestPer, Hierarchical Port, Hierarchical Port BestPer]

Makespan (in days) as a function of processor MTBF µ, C = 100

SLIDE 71

Conclusion

  • First attempt at an analytical comparison of coordinated and hierarchical checkpointing protocols
    • Classical models (Young, Daly) extended
    • Several new parameters (α, λ, ρ)
    • Message logging impact (β)
  • Instantiation
    • Scenarios: Coord-IO, Hierarch-IO, Hierarch-Port
    • Realistic application/platform combinations
  • Future work: (partial) replication, prediction, energy