
SLIDE 1

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Jonathan Lifflander*, Esteban Meneses†, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy‡, Laxmikant V. Kale*

jliffl2@illinois.edu, emeneses@pitt.edu, {gplkrsh2,mille121}@illinois.edu, sriram@pnnl.gov, kale@illinois.edu

*University of Illinois Urbana-Champaign (UIUC)
†University of Pittsburgh
‡Pacific Northwest National Laboratory (PNNL)

September 23, 2014


SLIDE 5

Deterministic Replay & Fault Tolerance → Our Focus

Fault tolerance often crosses over into replay territory!

Popular uses
◮ Online fault tolerance
◮ Parallel debugging
◮ Reproducing results

Types of replay
◮ Data-driven replay
⋆ Application/system data is recorded
⋆ Content of messages sent/received, etc.
◮ Control-driven replay
⋆ The ordering of events is recorded


SLIDE 10

Online Fault Tolerance → Hard failures

Researchers have predicted that hard faults will increase
◮ Exascale!
◮ Machines are getting larger
◮ Projected to house more than 200,000 sockets
◮ Hard failures may be frequent and only affect a small percentage of nodes


SLIDE 13

Online Fault Tolerance → Approaches

Checkpoint/restart (C/R)
◮ Well-established method
◮ Save a snapshot of system state
◮ Roll back to the previous snapshot in case of failure

Motivation beyond C/R
◮ If a single node experiences a hard fault, why must all the nodes roll back?
◮ Recovering from C/R is expensive at large machine scales
⋆ Complicated because it depends on many factors (e.g., checkpointing frequency)

Solutions
◮ Application-specific fault tolerance
◮ Other system-level approaches
◮ Message-logging!


SLIDE 17

Hard Failure System Model

P processes that communicate via message passing

Communication is across non-FIFO channels
◮ Sent asynchronously
◮ Possibly out of order

Messages are guaranteed to arrive sometime in the future if the recipient process has not failed

Fail-stop model for all failures
◮ Failed processes do not recover from failures
◮ They do not behave maliciously (non-Byzantine failures)


SLIDE 21

Sender-Based Causal Message Logging (SB-ML)

Combination of data-driven and control-driven replay
◮ Data-driven
⋆ Messages sent are recorded
◮ Control-driven
⋆ Determinants are recorded to store the order of events

Incurs costs in the form of time and storage overhead during forward execution

Periodic checkpoints reduce storage overhead
◮ Recovery effort is limited to work executed after the latest checkpoint
◮ Data stored before the checkpoint can be discarded

Scalable implementation in Charm++
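The sender-side half of SB-ML can be sketched as follows. This is a minimal illustration, not the Charm++ implementation: the class name and method names are hypothetical, and real message logging works on in-flight message buffers rather than Python tuples.

```python
# Minimal sketch of sender-based message logging: the sender keeps a
# reference to every outgoing message until a checkpoint commits, so
# logged messages can be replayed to a failed receiver.

class SenderLog:
    def __init__(self):
        self.log = []   # (ssn, dest, payload) retained in sender memory
        self.ssn = 0    # sender sequence number

    def send(self, dest, payload):
        self.ssn += 1
        # Logging only requires keeping a reference: the message is
        # simply not deallocated, which increases memory pressure.
        self.log.append((self.ssn, dest, payload))
        return self.ssn

    def on_checkpoint_commit(self):
        # Work before the latest checkpoint is never re-executed,
        # so messages logged before it can be discarded.
        self.log.clear()

    def replay_to(self, dest):
        # On failure of `dest`, re-send the messages it had received.
        return [m for m in self.log if m[1] == dest]
```

This captures the two storage-overhead claims above: logging is cheap per message (a retained pointer), and checkpoints are what bound the log's growth.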

SLIDE 22

Example Execution with SB-ML

[Figure: timeline of Tasks A–E exchanging messages m1–m7. A checkpoint precedes a failure; after restart, the logged messages m1–m5 are replayed during recovery, and the forward path then resumes with m6 and m7.]

SLIDE 23

Motivation → Overheads with SB-ML

[Figure: progress toward 100% over time for runs with and without fault tolerance (No FT vs. FT), showing the slowdown due to performance overhead, checkpointing, a failure, and recovery.]


SLIDE 27

Forward Execution Overhead with SB-ML

Logging the messages
◮ Only requires a pointer to be saved; the message is simply not deallocated!
◮ Increases memory pressure

Determinants, a 4-tuple of the form <SPE,SSN,RPE,RSN>
◮ Components:
⋆ Sender processor (SPE)
⋆ Sender sequence number (SSN)
⋆ Receiver processor (RPE)
⋆ Receiver sequence number (RSN)
◮ Must be stored stably based on the reliability requirements
⋆ Propagated to n processors
⋆ Unacknowledged determinants are piggybacked onto new messages (to avoid frequent synchronizations)
◮ Recovery
⋆ Messages must be replayed in a total order
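The receiver-side determinant bookkeeping described above can be sketched like this. The class and method names are hypothetical; only the 4-tuple layout and the piggybacking idea come from the slide.

```python
# Sketch of SB-ML determinants: a <SPE, SSN, RPE, RSN> tuple is created
# at the receiver for every delivery, then carried on outgoing messages
# until n other processors have acknowledged storing it.
from collections import namedtuple

Determinant = namedtuple("Determinant", "spe ssn rpe rsn")

class Receiver:
    def __init__(self, pe):
        self.pe = pe
        self.rsn = 0
        self.unacked = []   # determinants not yet replicated on n processors

    def on_receive(self, spe, ssn):
        # The RSN fixes this message's position in the receiver's total order.
        self.rsn += 1
        det = Determinant(spe, ssn, self.pe, self.rsn)
        self.unacked.append(det)
        return det

    def piggyback(self):
        # Unacknowledged determinants ride on the next outgoing message,
        # avoiding a separate synchronization round.
        return list(self.unacked)

    def on_ack(self, dets):
        acked = set(dets)
        self.unacked = [d for d in self.unacked if d not in acked]
```

Creating, storing, and sending these tuples is exactly the cost that dominates the microbenchmark on the next slide.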

SLIDE 28

Forward Execution Microbenchmark (SB-ML)

Component                        Overhead (%)
Determinants                     84.75%
Bookkeeping                      11.65%
Message-envelope size increase    3.10%
Message storage                   0.50%

Using the LeanMD (molecular dynamics) benchmark
Measured on 256 cores of Ranger
The largest source of overhead is determinants
◮ Creating, storing, sending, etc.


SLIDE 30

Benchmarks → Runtime System: Charm++

Decompose parallel computation into objects that communicate
◮ More objects than the number of processors
◮ Objects communicate by sending messages
◮ The computation is oblivious to the processors

Benefits
◮ Load balancing, message-driven execution, fault tolerance, etc.

SLIDE 31

Benchmarks → Configuration & Experimental Setup

Benchmark                       Configuration
STENCIL3D                       matrix: 4096³, chunk: 64³
LEANMD (mini-app for NAMD)      600K atoms, 2-away XY, 75 atoms/cell
LULESH (shock hydrodynamics)    matrix: 1024×512², chunk: 16×8²

All experiments on IBM Blue Gene/P (BG/P), 'Intrepid', a 40960-node system
◮ Each node consists of one quad-core 850MHz PowerPC 450
◮ 2GB DDR2 memory

Compiler: IBM XL C/C++ Advanced Edition for Blue Gene/P, V9.0
Runtime: Charm++ 6.5.1

SLIDE 32

Forward Execution Overhead with SB-ML

[Figure: percent overhead (roughly 5–20%) for Stencil3D, LeanMD, and LULESH at 8k, 16k, 32k, 64k, and 132k cores.]

The finer-grained benchmarks, LeanMD and LULESH, suffer significant overhead.


SLIDE 35

Reducing the Overhead of Determinants

Design Criteria
◮ We must maintain full determinism
◮ We must degrade gracefully in all cases (even for highly non-deterministic programs)
◮ We need to consider tasks or lightweight objects


SLIDE 43

Reducing the Overhead of Determinants

'Intrinsic' determinism
◮ Many researchers have noticed that programs have internal determinism
⋆ Causality tracking (1988: Fidge, Partial orders for parallel debugging)
⋆ Racing messages (1992: Netzer et al., Optimal tracing and replay for debugging message-passing parallel programs)
⋆ Theoretical races (1993: Damodaran-Kamal, Nondeterminacy: testing and debugging in message passing parallel programs)
⋆ Block races (1995: Clemencon, An implementation of race detection and deterministic replay with MPI)
⋆ MPI and non-determinism (2000: Kranzlmuller, Event graph analysis for debugging massively parallel programs)
⋆ . . .
⋆ Send-determinism (2011: Guermouche et al., Uncoordinated checkpointing without domino effect for send-deterministic MPI applications)


SLIDE 47

Our Approach

In many cases, only a partial order must be stored for full determinism
◮ Program = internal determinism + non-determinism + commutative events
◮ Internal determinism requires no determinants!
◮ Commutative events require no determinants!
◮ Approach: use determinants to store a partial order for the non-deterministic events that are not commutative


SLIDE 52

Ordering Algebra → Ordered Sets, O

O(n, d)
◮ Set of n events and d dependencies
◮ Can be accurately replayed from a given starting point
◮ The dependencies d can be among the events in the set, or on preceding events
◮ Intuitively, these are ordered sets of events

Define the sequencing operation ⊞:

O(1, d1) ⊞ O(1, d2) = O(2, d1 + d2 + 1)

◮ Intuitively, if we have two atomic events, we need a single dependency to tell us which one comes first

Generalization: O(n1, d1) ⊞ O(n2, d2) = O(n1 + n2, d1 + d2 + 1)


SLIDE 56

Ordering Algebra → Unordered Sets, U

U(n, d)
◮ Unordered set of n events and d dependencies
◮ Example: several messages are sent to a single endpoint
◮ Depending on the order of arrival, the eventual state will be different

We decompose this into atomic events with an additional dependency between each successive pair:

U(n, d) = O(1, d1) ⊞ O(1, d2) ⊞ · · · ⊞ O(1, dn) = O(n, d + n − 1), where d = Σi di

◮ Result: an additional n − 1 dependencies are required to fully order n events
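The decomposition above, U(n, d) = O(n, d + n − 1), can be stated as a one-line executable check. The per-event dependency-count representation is an assumption made for illustration.

```python
# Fully ordering n unordered events costs n - 1 extra dependencies:
# U(n, d) = O(n, d + n - 1), with d the sum of per-event dependencies.

def order_unordered(n, deps):
    """deps: per-event dependency counts d_1..d_n (hypothetical encoding)."""
    assert len(deps) == n
    d = sum(deps)
    return (n, d + n - 1)   # the resulting ordered set O(n, d + n - 1)
```

For example, three unordered events with dependency counts [1, 0, 2] become an ordered set of 3 events with 3 + 2 = 5 dependencies.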

SLIDE 57

Ordering Algebra → Interleaving Multiple Independent Sets, ⊠ operator

Lemma. Any possible interleaving of two ordered sets of events A = O(m, d) and B = O(n, e), where A ∩ B = ∅, is given by:

O(m, d) ⊠ O(n, e) = O(m + n, d + e + min(m, n))

Lemma. Any possible ordering of n ordered sets of events O(m1, d1), O(m2, d2), . . . , O(mn, dn), where ∩i O(mi, di) = ∅, can be represented as:

⊠i=1..n O(mi, di) = O(m, d + m − maxi mi), where m = Σi=1..n mi ∧ d = Σi=1..n di
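The sequencing operation and the two interleaving lemmas above reduce to simple arithmetic on (event count, dependency count) pairs; a sketch (events abstracted to counts, set disjointness assumed):

```python
# Ordering-algebra operators over (n_events, n_dependencies) pairs.

def seq(a, b):
    # Sequencing: O(n1, d1) ⊞ O(n2, d2) = O(n1 + n2, d1 + d2 + 1)
    return (a[0] + b[0], a[1] + b[1] + 1)

def interleave(a, b):
    # Pairwise interleaving: O(m, d) ⊠ O(n, e) = O(m + n, d + e + min(m, n))
    return (a[0] + b[0], a[1] + b[1] + min(a[0], b[0]))

def interleave_many(sets):
    # n-way generalization: O(m, d + m - max_i m_i)
    m = sum(s[0] for s in sets)
    d = sum(s[1] for s in sets)
    return (m, d + m - max(s[0] for s in sets))
```

As a sanity check, for two sets the n-way formula collapses to the pairwise one, since m + n − max(m, n) = min(m, n).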


SLIDE 60

Internal Determinism → D

D(n) = O(n, 0)

n deterministically ordered events are structurally equivalent to an ordered set of n events with no associated explicit dependencies!

What happens if we interleave internal determinism with something else?
◮ k interruption points ⇒ O(k, k − 1)


SLIDE 65

Commutative Events → C

Some events in programs are commutative
◮ Regardless of the execution order, the resulting state will be identical

All existing message-logging protocols record a total order on them

However, we can reduce a commutative set to:
◮ C(n) = O(2, 1)
◮ A beginning and an end event sequenced together
◮ Sequencing other sets of events around the region just puts them before and after
◮ Interleaving other events puts them in three buckets:
⋆ (1) before the begin event
⋆ (2) during the commutative region
⋆ (3) after the end event
◮ This corresponds exactly to an ordered set of two events!
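The commutative-region reduction above, C(n) = O(2, 1), can be sketched directly; the bucket helper is an illustrative assumption using event positions as simple integers.

```python
# However many events a commutative region contains, it replays as just
# a begin/end pair with one dependency between them: C(n) = O(2, 1).

def commutative(n):
    return (2, 1)   # independent of n

def bucket(event_pos, begin, end):
    # An interleaved external event lands in one of three buckets.
    if event_pos < begin:
        return "before"
    if event_pos <= end:
        return "during"   # order within the region is irrelevant
    return "after"
```

This is why commutative events need no per-event determinants: only the region boundaries must be ordered with respect to everything else.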


SLIDE 67

Applying the Theory → PO-REPLAY: Partial-Order Message Identification Scheme

Properties
◮ It tracks causality with Lamport clocks
◮ It uniquely identifies a sent message, whether or not its order is transposed
◮ It requires exactly the number of determinants and dependencies produced by the ordering algebra

Determinant composition (3-tuple): <SRN,SPE,CPI>
◮ SRN: sender region number, incremented for every send outside a commutative region and incremented once when a commutative region starts
◮ SPE: sender processor endpoint
◮ CPI: commutative path identifier, a sequence of bits that represents the path to the root of the commutative region
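A rough sketch of how a sender might produce PO-REPLAY-style <SRN,SPE,CPI> determinants, following only the field descriptions above. The class, the bit-string CPI encoding, and the update rules are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical generator of <SRN, SPE, CPI> determinants: SRN bumps on
# every send outside a commutative region, and once when a region starts;
# CPI is a bit path identifying the enclosing commutative region.
from collections import namedtuple

PODet = namedtuple("PODet", "srn spe cpi")

class Sender:
    def __init__(self, pe):
        self.pe = pe    # SPE: sender processor endpoint
        self.srn = 0    # SRN: sender region number
        self.cpi = ""   # CPI: bit path into nested commutative regions

    def begin_commutative(self, bit):
        self.srn += 1          # incremented once when a region starts
        self.cpi += bit

    def end_commutative(self):
        self.cpi = self.cpi[:-1]

    def send(self):
        if not self.cpi:
            self.srn += 1      # incremented for every send outside a region
        return PODet(self.srn, self.pe, self.cpi)
```

Note how sends inside a region share an SRN, consistent with the region replaying as a single O(2, 1) unit.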

SLIDE 68

Experimental Results → Forward Execution Overhead: Stencil3D

[Figure: percent overhead (roughly 5–20%) for Stencil3D-PartialDetFT vs. Stencil3D-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores.]

Coarse-grained; shows a small improvement over SB-ML

slide-69
SLIDE 69

Experimental Results

→ Forward Execution Overhead: LeanMD

5 10 15 20 LeanMD- PartialDetFT LeanMD- FullDetFT Percent Overhead (%) 8k Cores 16k Cores 32k Cores 64k Cores 132k Cores

Fine-grained, reduction from 11-19% overhead to <5% Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance


slide-70
SLIDE 70

Experimental Results

→ Forward Execution Overhead: LULESH

[Bar chart: percent overhead (%) for LULESH-PartialDetFT and LULESH-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores]

Medium-grained with many messages; reduction from 17% overhead to <4%



slide-74
SLIDE 74

Experimental Results

→ Fault Injection

Measure the recovery time for the different protocols
◮ We inject a simulated fault on a random node
◮ Approximately in the middle of the checkpoint period
◮ We calculate the optimal checkpoint period duration using Daly's formula
⋆ Assuming a 64K–1M socket count
⋆ Assuming an MTBF of 10 years
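The first-order form of Daly's formula gives τ_opt ≈ √(2δM) − δ, where δ is the checkpoint cost and M the system MTBF. A sketch of the computation, assuming a hypothetical 60 s checkpoint cost (δ is not given on the slide) and treating the 10-year MTBF as per-socket:

```python
from math import sqrt

def system_mtbf_seconds(per_socket_mtbf_years, sockets):
    # System MTBF shrinks linearly with the number of sockets.
    return per_socket_mtbf_years * 365 * 24 * 3600 / sockets

def daly_optimal_period(delta, mtbf):
    # First-order form of Daly's formula: tau_opt = sqrt(2*delta*M) - delta,
    # valid when delta << M.
    return sqrt(2 * delta * mtbf) - delta

# 10-year per-socket MTBF at 64K sockets (the lower end of the slide's range);
# the 60 s checkpoint cost is an illustrative assumption.
M = system_mtbf_seconds(10, 65536)       # ~4812 s of system MTBF
tau = daly_optimal_period(60.0, M)       # ~700 s between checkpoints
```

Note how quickly the optimal period shrinks with scale: at 1M sockets the system MTBF drops to roughly 300 s, forcing very frequent checkpoints and making fast recovery correspondingly more valuable.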


slide-75
SLIDE 75

Experimental Results

→ Recovery Time Speedup C/R

[Bar chart: recovery-time speedup over checkpoint/restart (C/R) for LeanMD, Stencil3D, and LULESH at 8192, 16384, 32768, 65536, and 131072 cores]

LeanMD has the most speedup due to its fine-grained, overdecomposed nature

We achieve a speedup in recovery time in all cases


slide-76
SLIDE 76

Experimental Results

→ Recovery Time Speedup SB-ML

[Bar chart: recovery-time speedup over SB-ML for LeanMD, Stencil3D, and LULESH at 8192, 16384, 32768, 65536, and 131072 cores]

Speedup increases with scale, due to the expense of coordinating determinants and ordering



slide-79
SLIDE 79

Experimental Results

→ Summary

Our new message-logging protocol has <5% overhead for the benchmarks tested

Recovery is significantly faster than with C/R or causal message logging

Depending on the frequency of faults, it may perform better than C/R



slide-84
SLIDE 84

Future Work

More benchmarks

Study a broader range of programming models

The memory overhead of message logging makes it infeasible for some applications

Automated extraction of ordering and interleaving properties

Programming language support?



slide-87
SLIDE 87

Conclusion

A comprehensive approach for reasoning about execution orderings and interleavings

We observe that the information stored can be reduced in proportion to the knowledge of order flexibility

Programming paradigms should make this cost model clearer!


slide-88
SLIDE 88

Questions?