Checkpoint/Recovery 18-849b Dependable Embedded Systems John DeVale

Checkpoint/Recovery
18-849b Dependable Embedded Systems
John DeVale
February 4, 1999
Required Reading: Application-Transparent Checkpointing in Mach 3.0/UX - Russinovich and Segall
Best Tutorial: Libckpt: Transparent Checkpointing under Unix, Usenix Winter 1995 Technical Conference
Authoritative Books: Software Fault Tolerance, Michael R. Lyu (ed)

Overview: Checkpointing - Recovery
■ Introduction
  • Method of creating fault tolerant software systems
■ Key concepts
  • Periodically saves process/system state
  • In the event of a fault, state is restored via a rollback
  • Scales to distributed/parallelized applications
■ Tools / techniques / metrics
  • Stratus VOS and Tandem NonStop Kernel
  • libFt, libckpt
■ Relationship to other topics
  • Fault Tolerant Computing technique
■ Conclusions & future work

Checkpoint - Recovery… The basic picture
[Figure, before the failure: Memory holds Process A; Non-Volatile storage holds a Copy of Process A.]
Checkpoint mechanism copies the state of process A into non-volatile storage.
System Failure...
[Figure, after the failure: Memory holds a Restored Copy of Process A; Non-Volatile storage holds the Copy of Process A.]
Restore mechanism copies the last known checkpointed state of process A back into memory and continues processing.
This mechanism is especially useful for applications that may run for long periods of time before reaching a solution.
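The picture above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mechanism of any tool named in these slides: the function names (`save_checkpoint`, `load_checkpoint`, `long_computation`), the use of `pickle`, and the `fail_at` parameter that simulates a system failure are all hypothetical choices made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Copy `state` to non-volatile storage, replacing the old checkpoint atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # force the bytes to disk before the rename
    os.replace(tmp_path, path)  # readers see the old or the new checkpoint, never half of one


def load_checkpoint(path, default):
    """Return the last checkpointed state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def long_computation(path, limit, checkpoint_every=1000, fail_at=None):
    """Sum 0..limit-1, checkpointing periodically; `fail_at` simulates a fault."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < limit:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated system failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

Calling `long_computation` again with the same `path` after a failure resumes from the last checkpoint rather than from zero, so only the work done since that checkpoint is repeated.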

Where we are
[Figure not preserved]

Description of Topic
■ Checkpoint-Recovery gives an application or system the ability to save its state, and tolerate failures by enabling a failed executive to recover to an earlier safe state.
■ Key ideas
  • Saves executive state
  • Provides recovery mechanism in the presence of a fault
  • Can allow tolerance of any non-apocalyptic failure
  • Provides mechanism for process migration in distributed systems for fault tolerance reasons or load balancing

Saves Executive State
■ When a checkpoint is executed, a snapshot of all program state is saved into some non-volatile, machine-accessible medium.
  • Time
  • Space
  • Memory Exclusion - new idea
    – Ref: Memory Exclusion: Optimizing the Performance of Checkpointing Systems, James Plank, submitted for publication, SP&E
    – Allows the checkpointing mechanism to be told, and/or to determine dynamically, which memory structures are an important part of program state, and to save only those structures.
    – Saves time AND space
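The "told" half of memory exclusion can be illustrated with a small Python sketch. It is not libckpt's interface: the `checkpoint` and `restore` functions, the `exclude` parameter, and the `rebuild` callback are hypothetical names introduced here, and the dynamic (automatic) form of exclusion is not shown.

```python
import pickle


def checkpoint(state, exclude=()):
    """Serialize the dict `state`, skipping any keys listed in `exclude`."""
    kept = {key: value for key, value in state.items() if key not in exclude}
    return pickle.dumps(kept)


def restore(blob, rebuild=None):
    """Deserialize a checkpoint; `rebuild` recreates the excluded structures."""
    state = pickle.loads(blob)
    if rebuild is not None:
        state.update(rebuild(state))
    return state


# Toy program state: a small essential part and a large scratch buffer.
state = {
    "step": 42,
    "values": list(range(1000)),   # essential: cannot be recomputed
    "scratch": [0] * 100_000,      # work buffer: contents are dead at checkpoint time
}

full = checkpoint(state)                          # no exclusion
small = checkpoint(state, exclude=("scratch",))   # scratch buffer excluded
restored = restore(small, lambda s: {"scratch": [0] * 100_000})

print(len(full), len(small))   # the checkpoint with exclusion is much smaller
```

The trade-off is that the program (or its author) must know which structures are safe to drop; excluding something that is actually live would make the restored state wrong.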

Provides Recovery Mechanism
■ Once a fault has occurred, the recovery mechanism restores the program to the last checkpointed state
  • Current automatic Unix-based tools wait for the process to abort, and restore it after abort.
    – Time constraint may not allow for this length of recovery
    – In the presence of a software design fault, rollback mechanism needs more complexity to allow rollback to a previous state, yet retain knowledge of faulted path of execution.
  • Stratus, Tandem seem to handle this, but details are sketchy
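One way to picture "roll back, yet retain knowledge of the faulted path" is the toy Python sketch below. It is an assumption-laden illustration, not a description of how Stratus, Tandem, or any Unix tool works: `run_with_rollback`, the `steps` and `alternatives` dictionaries, and the example step functions are all hypothetical. The key point is that the record of failed steps lives outside the state that gets rolled back.

```python
import copy


def run_with_rollback(state, steps, alternatives):
    """Run `steps` in order; on a fault, roll back and avoid the path that failed.

    `steps` maps step name -> primary function and `alternatives` maps
    step name -> fallback function (both hypothetical application code).
    """
    checkpoint = copy.deepcopy(state)   # last known good state
    faulted = set()                     # kept outside `state`, so it survives rollback
    names = list(steps)
    i = 0
    while i < len(names):
        name = names[i]
        func = alternatives[name] if name in faulted else steps[name]
        try:
            func(state)
        except Exception:
            if name in faulted or name not in alternatives:
                raise                   # no untried path left for this step
            faulted.add(name)           # remember the faulted path
            state.clear()
            state.update(copy.deepcopy(checkpoint))   # roll back to the checkpoint
            continue                    # retry the same step via the alternative
        checkpoint = copy.deepcopy(state)   # step succeeded: take a new checkpoint
        i += 1
    return state, faulted


def load(s):
    s["data"] = [3, 1, 2]


def buggy_sort(s):
    s["data"].append(None)   # corrupts state...
    s["data"].sort()         # ...then faults: None cannot be ordered against int


def safe_sort(s):
    s["data"] = sorted(s["data"])


final_state, faulted_steps = run_with_rollback(
    {}, {"load": load, "sort": buggy_sort}, {"sort": safe_sort}
)
```

Without the `faulted` record, a deterministic design fault would simply recur after every rollback, which is the problem the slide points at.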

Failure Tolerance
■ Faults can be tolerated, even those which may physically destroy the processing site
  • Geographically distant sites with a synchronized distributed system can perform coordinated checkpoints and process migration.
  • Transient faults and glitches are tolerated as a matter of course through the normal checkpoint-recovery system

Tools / Techniques
■ libFT - AT&T research labs
  • Provides checkpoint recovery and watchdog daemons
  • http://www.research.att.com/sw/tools/reuse/packages/ft.html
■ libCKPT - University of Tennessee, Knoxville
  • Provides incremental checkpoint recovery library, with memory exclusion
  • http://www.cs.utk.edu/~plank/plank/www/libckpt.html
■ PMCKPT
  • The Poor man’s checkpoint
  • http://warp.dcs.st-andrews.ac.uk/warp/systems/checkpoint/source.html
■ CONDOR
  • Process migration for load balancing
  • http://www.cs.wisc.edu/condor
■ General Links
  • http://warp.dcs.st-andrews.ac.uk/warp/systems/checkpoint/

Metrics
■ Key Metrics in Checkpoint - Recovery
  • Snapshot Time
    – How long it takes to identify and copy (to intermediate storage) all required program state
  • Commit Time
    – How long it takes to copy snapshot into non-volatile storage
  • Recovery Time
    – How long it takes to restore state to a failed process
■ Dependent on state size and system performance
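The three metrics can be measured separately, as in the hedged Python sketch below. The mapping is an assumption made for illustration: serializing to an in-memory buffer stands in for the snapshot, writing and syncing that buffer to a file stands in for the commit, and reading the file back stands in for recovery. `measure_checkpoint` is a hypothetical helper, not part of any tool listed in these slides.

```python
import os
import pickle
import tempfile
import time


def measure_checkpoint(state, path):
    """Return (snapshot_s, commit_s, recovery_s, restored_state) for one cycle."""
    t0 = time.perf_counter()
    snapshot = pickle.dumps(state)      # snapshot: copy state to an in-memory buffer
    t1 = time.perf_counter()
    with open(path, "wb") as f:         # commit: push the snapshot to non-volatile storage
        f.write(snapshot)
        f.flush()
        os.fsync(f.fileno())
    t2 = time.perf_counter()
    with open(path, "rb") as f:         # recovery: read the checkpoint back and rebuild state
        restored = pickle.load(f)
    t3 = time.perf_counter()
    return t1 - t0, t2 - t1, t3 - t2, restored


# Show how the three times grow with state size on the local machine.
with tempfile.TemporaryDirectory() as tmp:
    for n in (10_000, 100_000, 1_000_000):
        state = {"data": list(range(n))}
        snap, commit, rec, _ = measure_checkpoint(state, os.path.join(tmp, "state.ckpt"))
        print(f"{n:>9} items  snapshot={snap:.4f}s  commit={commit:.4f}s  recovery={rec:.4f}s")
```

Absolute numbers will vary with the machine and storage device; the point of the sweep is only to show the dependence on state size.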

Relationship To Other Topic Areas
■ Fault Tolerant Computing
  • Checkpoint - Rollback is a technique which can be used to build fault tolerance into a computing system
  • In its current form it very capably saves process state, and after a process failure it can create a new process and restore the old state to it
■ SW fault tolerance
  • Related to SW fault tolerance by sharing a common goal
  • The scope of the solutions is on a much different scale
    – SW fault tolerance focuses more on making software not crash
    – Traditional checkpointing focuses on recovering from the crash in a graceful manner while preserving computational state and critical data.

Conclusions & Future Work
■ Checkpoint-Recovery provides
  • Ability to save and restore state for critical applications
  • Useful for single computer systems and large distributed or parallel systems
  • Can incur large time penalties during checkpointing
■ Future Work
  • Design for Checkpoint-Recovery
    – Design critical systems to have as small a critical state as possible
      » Break down tasks into smaller subsystems which can be checkpointed separately
      » Self-recovering state
  • Task restart may not be possible in small RT/Embedded systems
    – Support at the OS level to allow micro checkpoints and rollbacks at a task level

Application-Transparent Checkpointing in Mach
■ Paper presents methodology for checkpoint-recovery
■ Performance varied with memory footprint
  • Typically <5 sec cost for the first checkpoint, less for subsequent ones
  • Larger commit delays - 10 to 30 sec of degraded performance
  • Recovery times 5 to 10 sec for reasonably sized applications
■ Major Contributions
  • Provides roadmap on how one might build in transparent checkpointing
  • Can checkpoint and restore entire system state in X
■ Limitations
  • Time costly
  • Requires custom pager in OS
  • Does not address memory exclusion (trade-off for transparency)

