
SLIDE 1

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Jonathan Lifflander*, Esteban Meneses†, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy‡, Laxmikant V. Kale*

jliffl2@illinois.edu, emeneses@pitt.edu, {gplkrsh2,mille121}@illinois.edu, sriram@pnnl.gov, kale@illinois.edu

*University of Illinois Urbana-Champaign (UIUC)
†University of Pittsburgh
‡Pacific Northwest National Laboratory (PNNL)

September 23, 2014


SLIDE 5

Deterministic Replay & Fault Tolerance → Our Focus

Fault tolerance often crosses over into replay territory!

Popular uses
◮ Online fault tolerance
◮ Parallel debugging
◮ Reproducing results

Types of replay
◮ Data-driven replay
⋆ Application/system data is recorded
⋆ Content of messages sent/received, etc.
◮ Control-driven replay
⋆ The ordering of events is recorded


SLIDE 10

Online Fault Tolerance → Hard failures

Researchers have predicted that hard faults will increase
◮ Exascale!
◮ Machines are getting larger
◮ Projected to house more than 200,000 sockets
◮ Hard failures may be frequent and only affect a small percentage of nodes


SLIDE 13

Online Fault Tolerance → Approaches

Checkpoint/restart (C/R)
◮ Well-established method
◮ Save a snapshot of system state
◮ Roll back to the previous snapshot in case of failure

Motivation beyond C/R
◮ If a single node experiences a hard fault, why must all the nodes roll back?
◮ Recovering from C/R is expensive at large machine scales
⋆ Complicated because it depends on many factors (e.g., checkpointing frequency)

Solutions
◮ Application-specific fault tolerance
◮ Other system-level approaches
◮ Message-logging!


SLIDE 17

Hard Failure System Model

P processes that communicate via message passing

Communication is across non-FIFO channels
◮ Sent asynchronously
◮ Possibly out of order

Messages are guaranteed to arrive sometime in the future if the recipient process has not failed

Fail-stop model for all failures
◮ Failed processes do not recover from failures
◮ They do not behave maliciously (non-Byzantine failures)


SLIDE 21

Sender-Based Causal Message Logging (SB-ML)

Combination of data-driven and control-driven replay
◮ Data-driven
⋆ Messages sent are recorded
◮ Control-driven
⋆ Determinants are recorded to store the order of events

Incurs costs in the form of time and storage overhead during forward execution

Periodic checkpoints reduce storage overhead
◮ Recovery effort is limited to work executed after the latest checkpoint
◮ Data stored before the checkpoint can be discarded

Scalable implementation in Charm++
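The sender-side half of SB-ML can be sketched as follows. This is a minimal illustration, not the Charm++ implementation: the class name and method names are hypothetical, and real message logging works on in-flight message buffers rather than Python tuples.

```python
# Minimal sketch of sender-based message logging: the sender keeps a
# reference to every outgoing message until a checkpoint commits, so
# logged messages can be replayed to a failed receiver.

class SenderLog:
    def __init__(self):
        self.log = []   # (ssn, dest, payload) retained in sender memory
        self.ssn = 0    # sender sequence number

    def send(self, dest, payload):
        self.ssn += 1
        # Logging only requires keeping a reference: the message is
        # simply not deallocated, which increases memory pressure.
        self.log.append((self.ssn, dest, payload))
        return self.ssn

    def on_checkpoint_commit(self):
        # Work before the latest checkpoint is never re-executed,
        # so messages logged before it can be discarded.
        self.log.clear()

    def replay_to(self, dest):
        # On failure of `dest`, re-send the messages it had received.
        return [m for m in self.log if m[1] == dest]
```

This captures the two storage-overhead claims above: logging is cheap per message (a retained pointer), and checkpoints are what bound the log's growth.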

SLIDE 22

Example Execution with SB-ML

[Figure: timeline of Tasks A–E exchanging messages m1–m7. A checkpoint precedes a failure; after restart, the logged messages m1–m5 are replayed during recovery, and the forward path then resumes with m6 and m7.]

SLIDE 23

Motivation → Overheads with SB-ML

[Figure: progress toward 100% over time for runs with and without fault tolerance (No FT vs. FT), showing the slowdown due to performance overhead, checkpointing, a failure, and recovery.]


SLIDE 27

Forward Execution Overhead with SB-ML

Logging the messages
◮ Only requires a pointer to be saved; the message is simply not deallocated!
◮ Increases memory pressure

Determinants, a 4-tuple of the form <SPE,SSN,RPE,RSN>
◮ Components:
⋆ Sender processor (SPE)
⋆ Sender sequence number (SSN)
⋆ Receiver processor (RPE)
⋆ Receiver sequence number (RSN)
◮ Must be stored stably based on the reliability requirements
⋆ Propagated to n processors
⋆ Unacknowledged determinants are piggybacked onto new messages (to avoid frequent synchronizations)
◮ Recovery
⋆ Messages must be replayed in a total order
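The receiver-side determinant bookkeeping described above can be sketched like this. The class and method names are hypothetical; only the 4-tuple layout and the piggybacking idea come from the slide.

```python
# Sketch of SB-ML determinants: a <SPE, SSN, RPE, RSN> tuple is created
# at the receiver for every delivery, then carried on outgoing messages
# until n other processors have acknowledged storing it.
from collections import namedtuple

Determinant = namedtuple("Determinant", "spe ssn rpe rsn")

class Receiver:
    def __init__(self, pe):
        self.pe = pe
        self.rsn = 0
        self.unacked = []   # determinants not yet replicated on n processors

    def on_receive(self, spe, ssn):
        # The RSN fixes this message's position in the receiver's total order.
        self.rsn += 1
        det = Determinant(spe, ssn, self.pe, self.rsn)
        self.unacked.append(det)
        return det

    def piggyback(self):
        # Unacknowledged determinants ride on the next outgoing message,
        # avoiding a separate synchronization round.
        return list(self.unacked)

    def on_ack(self, dets):
        acked = set(dets)
        self.unacked = [d for d in self.unacked if d not in acked]
```

Creating, storing, and sending these tuples is exactly the cost that dominates the microbenchmark on the next slide.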

SLIDE 28

Forward Execution Microbenchmark (SB-ML)

Component                        Overhead (%)
Determinants                     84.75%
Bookkeeping                      11.65%
Message-envelope size increase    3.10%
Message storage                   0.50%

Using the LeanMD (molecular dynamics) benchmark
Measured on 256 cores of Ranger
The largest source of overhead is determinants
◮ Creating, storing, sending, etc.


SLIDE 30

Benchmarks → Runtime System: Charm++

Decompose parallel computation into objects that communicate
◮ More objects than the number of processors
◮ Objects communicate by sending messages
◮ The computation is oblivious to the processors

Benefits
◮ Load balancing, message-driven execution, fault tolerance, etc.

SLIDE 31

Benchmarks → Configuration & Experimental Setup

Benchmark                       Configuration
STENCIL3D                       matrix: 4096³, chunk: 64³
LEANMD (mini-app for NAMD)      600K atoms, 2-away XY, 75 atoms/cell
LULESH (shock hydrodynamics)    matrix: 1024×512², chunk: 16×8²

All experiments on IBM Blue Gene/P (BG/P), 'Intrepid', a 40960-node system
◮ Each node consists of one quad-core 850MHz PowerPC 450
◮ 2GB DDR2 memory

Compiler: IBM XL C/C++ Advanced Edition for Blue Gene/P, V9.0
Runtime: Charm++ 6.5.1

SLIDE 32

Forward Execution Overhead with SB-ML

[Figure: percent overhead (roughly 5–20%) for Stencil3D, LeanMD, and LULESH at 8k, 16k, 32k, 64k, and 132k cores.]

The finer-grained benchmarks, LeanMD and LULESH, suffer significant overhead.


SLIDE 35

Reducing the Overhead of Determinants

Design Criteria
◮ We must maintain full determinism
◮ We must degrade gracefully in all cases (even for highly non-deterministic programs)
◮ We need to consider tasks or lightweight objects


SLIDE 43

Reducing the Overhead of Determinants

'Intrinsic' determinism
◮ Many researchers have noticed that programs have internal determinism
⋆ Causality tracking (1988: Fidge, Partial orders for parallel debugging)
⋆ Racing messages (1992: Netzer et al., Optimal tracing and replay for debugging message-passing parallel programs)
⋆ Theoretical races (1993: Damodaran-Kamal, Nondeterminacy: testing and debugging in message passing parallel programs)
⋆ Block races (1995: Clemencon, An implementation of race detection and deterministic replay with MPI)
⋆ MPI and non-determinism (2000: Kranzlmuller, Event graph analysis for debugging massively parallel programs)
⋆ . . .
⋆ Send-determinism (2011: Guermouche et al., Uncoordinated checkpointing without domino effect for send-deterministic MPI applications)


SLIDE 47

Our Approach

In many cases, only a partial order must be stored for full determinism
◮ Program = internal determinism + non-determinism + commutative events
◮ Internal determinism requires no determinants!
◮ Commutative events require no determinants!
◮ Approach: use determinants to store a partial order for the non-deterministic events that are not commutative


SLIDE 52

Ordering Algebra → Ordered Sets, O

O(n, d)
◮ Set of n events and d dependencies
◮ Can be accurately replayed from a given starting point
◮ The dependencies d can be among the events in the set, or on preceding events
◮ Intuitively, these are ordered sets of events

Define the sequencing operation ⊞:

O(1, d1) ⊞ O(1, d2) = O(2, d1 + d2 + 1)

◮ Intuitively, if we have two atomic events, we need a single dependency to tell us which one comes first

Generalization: O(n1, d1) ⊞ O(n2, d2) = O(n1 + n2, d1 + d2 + 1)


SLIDE 56

Ordering Algebra → Unordered Sets, U

U(n, d)
◮ Unordered set of n events and d dependencies
◮ Example: several messages are sent to a single endpoint
◮ Depending on the order of arrival, the eventual state will be different

We decompose this into atomic events with an additional dependency between each successive pair:

U(n, d) = O(1, d1) ⊞ O(1, d2) ⊞ · · · ⊞ O(1, dn) = O(n, d + n − 1), where d = Σi di

◮ Result: an additional n − 1 dependencies are required to fully order n events
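The decomposition above, U(n, d) = O(n, d + n − 1), can be stated as a one-line executable check. The per-event dependency-count representation is an assumption made for illustration.

```python
# Fully ordering n unordered events costs n - 1 extra dependencies:
# U(n, d) = O(n, d + n - 1), with d the sum of per-event dependencies.

def order_unordered(n, deps):
    """deps: per-event dependency counts d_1..d_n (hypothetical encoding)."""
    assert len(deps) == n
    d = sum(deps)
    return (n, d + n - 1)   # the resulting ordered set O(n, d + n - 1)
```

For example, three unordered events with dependency counts [1, 0, 2] become an ordered set of 3 events with 3 + 2 = 5 dependencies.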

SLIDE 57

Ordering Algebra → Interleaving Multiple Independent Sets, ⊠ operator

Lemma. Any possible interleaving of two ordered sets of events A = O(m, d) and B = O(n, e), where A ∩ B = ∅, is given by:

O(m, d) ⊠ O(n, e) = O(m + n, d + e + min(m, n))

Lemma. Any possible ordering of n ordered sets of events O(m1, d1), O(m2, d2), . . . , O(mn, dn), where ∩i O(mi, di) = ∅, can be represented as:

⊠i=1..n O(mi, di) = O(m, d + m − maxi mi), where m = Σi=1..n mi ∧ d = Σi=1..n di
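The sequencing operation and the two interleaving lemmas above reduce to simple arithmetic on (event count, dependency count) pairs; a sketch (events abstracted to counts, set disjointness assumed):

```python
# Ordering-algebra operators over (n_events, n_dependencies) pairs.

def seq(a, b):
    # Sequencing: O(n1, d1) ⊞ O(n2, d2) = O(n1 + n2, d1 + d2 + 1)
    return (a[0] + b[0], a[1] + b[1] + 1)

def interleave(a, b):
    # Pairwise interleaving: O(m, d) ⊠ O(n, e) = O(m + n, d + e + min(m, n))
    return (a[0] + b[0], a[1] + b[1] + min(a[0], b[0]))

def interleave_many(sets):
    # n-way generalization: O(m, d + m - max_i m_i)
    m = sum(s[0] for s in sets)
    d = sum(s[1] for s in sets)
    return (m, d + m - max(s[0] for s in sets))
```

As a sanity check, for two sets the n-way formula collapses to the pairwise one, since m + n − max(m, n) = min(m, n).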


SLIDE 60

Internal Determinism → D

D(n) = O(n, 0)

n deterministically ordered events are structurally equivalent to an ordered set of n events with no associated explicit dependencies!

What happens if we interleave internal determinism with something else?
◮ k interruption points ⇒ O(k, k − 1)


SLIDE 65

Commutative Events → C

Some events in programs are commutative
◮ Regardless of the execution order, the resulting state will be identical

All existing message-logging protocols record a total order on them

However, we can reduce a commutative set to:
◮ C(n) = O(2, 1)
◮ A beginning and an end event sequenced together
◮ Sequencing other sets of events around the region just puts them before and after
◮ Interleaving other events puts them in three buckets:
⋆ (1) before the begin event
⋆ (2) during the commutative region
⋆ (3) after the end event
◮ This corresponds exactly to an ordered set of two events!
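The commutative-region reduction above, C(n) = O(2, 1), can be sketched directly; the bucket helper is an illustrative assumption using event positions as simple integers.

```python
# However many events a commutative region contains, it replays as just
# a begin/end pair with one dependency between them: C(n) = O(2, 1).

def commutative(n):
    return (2, 1)   # independent of n

def bucket(event_pos, begin, end):
    # An interleaved external event lands in one of three buckets.
    if event_pos < begin:
        return "before"
    if event_pos <= end:
        return "during"   # order within the region is irrelevant
    return "after"
```

This is why commutative events need no per-event determinants: only the region boundaries must be ordered with respect to everything else.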


SLIDE 67

Applying the Theory → PO-REPLAY: Partial-Order Message Identification Scheme

Properties
◮ It tracks causality with Lamport clocks
◮ It uniquely identifies a sent message, whether or not its order is transposed
◮ It requires exactly the number of determinants and dependencies produced by the ordering algebra

Determinant composition (3-tuple): <SRN,SPE,CPI>
◮ SRN: sender region number, incremented for every send outside a commutative region and incremented once when a commutative region starts
◮ SPE: sender processor endpoint
◮ CPI: commutative path identifier, a sequence of bits that represents the path to the root of the commutative region
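A rough sketch of how a sender might produce PO-REPLAY-style <SRN,SPE,CPI> determinants, following only the field descriptions above. The class, the bit-string CPI encoding, and the update rules are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical generator of <SRN, SPE, CPI> determinants: SRN bumps on
# every send outside a commutative region, and once when a region starts;
# CPI is a bit path identifying the enclosing commutative region.
from collections import namedtuple

PODet = namedtuple("PODet", "srn spe cpi")

class Sender:
    def __init__(self, pe):
        self.pe = pe    # SPE: sender processor endpoint
        self.srn = 0    # SRN: sender region number
        self.cpi = ""   # CPI: bit path into nested commutative regions

    def begin_commutative(self, bit):
        self.srn += 1          # incremented once when a region starts
        self.cpi += bit

    def end_commutative(self):
        self.cpi = self.cpi[:-1]

    def send(self):
        if not self.cpi:
            self.srn += 1      # incremented for every send outside a region
        return PODet(self.srn, self.pe, self.cpi)
```

Note how sends inside a region share an SRN, consistent with the region replaying as a single O(2, 1) unit.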

SLIDE 68

Experimental Results → Forward Execution Overhead: Stencil3D

[Figure: percent overhead (roughly 5–20%) for Stencil3D-PartialDetFT vs. Stencil3D-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores.]

Coarse-grained; shows a small improvement over SB-ML

slide-69
SLIDE 69

Experimental Results

→ Forward Execution Overhead: LeanMD

5 10 15 20 LeanMD- PartialDetFT LeanMD- FullDetFT Percent Overhead (%) 8k Cores 16k Cores 32k Cores 64k Cores 132k Cores

Fine-grained, reduction from 11-19% overhead to <5% Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance


slide-70
SLIDE 70

Experimental Results

→ Forward Execution Overhead: LULESH

[Bar chart: percent overhead (%) for LULESH-PartialDetFT and LULESH-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores]

Medium-grained with many messages; reduction from 17% overhead to <4%



slide-74
SLIDE 74

Experimental Results

→ Fault Injection

Measure the recovery time for the different protocols
◮ We inject a simulated fault on a random node
◮ Approximately in the middle of the checkpoint period
◮ We calculate the optimal checkpoint period duration using Daly's formula
⋆ Assuming a 64K–1M socket count
⋆ Assuming an MTBF of 10 years
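The first-order form of Daly's formula gives τ_opt ≈ √(2δM) − δ, where δ is the checkpoint cost and M the system MTBF. A sketch of the computation, assuming a hypothetical 60 s checkpoint cost (δ is not given on the slide) and treating the 10-year MTBF as per-socket:

```python
from math import sqrt

def system_mtbf_seconds(per_socket_mtbf_years, sockets):
    # System MTBF shrinks linearly with the number of sockets.
    return per_socket_mtbf_years * 365 * 24 * 3600 / sockets

def daly_optimal_period(delta, mtbf):
    # First-order form of Daly's formula: tau_opt = sqrt(2*delta*M) - delta,
    # valid when delta << M.
    return sqrt(2 * delta * mtbf) - delta

# 10-year per-socket MTBF at 64K sockets (the lower end of the slide's range);
# the 60 s checkpoint cost is an illustrative assumption.
M = system_mtbf_seconds(10, 65536)       # ~4812 s of system MTBF
tau = daly_optimal_period(60.0, M)       # ~700 s between checkpoints
```

Note how quickly the optimal period shrinks with scale: at 1M sockets the system MTBF drops to roughly 300 s, forcing very frequent checkpoints and making fast recovery correspondingly more valuable.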


slide-75
SLIDE 75

Experimental Results

→ Recovery Time Speedup C/R

[Bar chart: recovery-time speedup over checkpoint/restart (C/R) for LeanMD, Stencil3D, and LULESH at 8192, 16384, 32768, 65536, and 131072 cores]

LeanMD has the most speedup due to its fine-grained, overdecomposed nature

We achieve a speedup in recovery time in all cases


slide-76
SLIDE 76

Experimental Results

→ Recovery Time Speedup SB-ML

[Bar chart: recovery-time speedup over SB-ML for LeanMD, Stencil3D, and LULESH at 8192, 16384, 32768, 65536, and 131072 cores]

Speedup increases with scale, due to the expense of coordinating determinants and ordering



slide-79
SLIDE 79

Experimental Results

→ Summary

Our new message-logging protocol has <5% overhead for the benchmarks tested

Recovery is significantly faster than with C/R or causal message logging

Depending on the frequency of faults, it may perform better than C/R



slide-84
SLIDE 84

Future Work

More benchmarks

Study a broader range of programming models

The memory overhead of message logging makes it infeasible for some applications

Automated extraction of ordering and interleaving properties

Programming language support?



slide-87
SLIDE 87

Conclusion

A comprehensive approach for reasoning about execution orderings and interleavings

We observe that the information stored can be reduced in proportion to the knowledge of order flexibility

Programming paradigms should make this cost model clearer!


slide-88
SLIDE 88

Questions?