Instant OS Updates via Userspace Checkpoint-and-Restart (PowerPoint PPT Presentation)


SLIDE 1

Instant OS Updates via Userspace Checkpoint-and-Restart

Sanidhya Kashyap, Changwoo Min, Byoungyoung Lee, Taesoo Kim, Pavel Emelyanov

SLIDE 2

OS updates are prevalent

SLIDE 3

And OS updates are unavoidable

  • Prevent known, state-of-the-art attacks
    – Security patches
  • Adopt new features
    – New I/O scheduler features
  • Improve performance
    – Performance patches

SLIDE 4
SLIDE 6

Unfortunately, system updates come at a cost

  • Unavoidable downtime ($109k per minute, plus hidden costs such as losing customers)
  • Potential risk of system failure

SLIDE 8

Example: memcached

  • Facebook's memcached servers incur a downtime of 2-3 hours per machine
    – Warming the cache (e.g., 120 GB) over the network

Our approach updates the OS in 3 seconds with 32 GB of data (v3.18 to v3.19, Ubuntu/Fedora releases)

SLIDE 11

Existing practices for OS updates

  • Dynamic kernel patching (e.g., kpatch, ksplice)
    – Problem: only supports minor patches
  • Rolling update (e.g., Google, Facebook, etc.)
    – Problem: inevitable downtime, and requires careful planning

Losing application state is inevitable
→ Restoring memcached takes 2-3 hours

Goals of this work:

  • Support all types of patches
  • Minimal downtime to update to the new OS
  • No kernel source modification
SLIDE 17

Problems of typical OS update

[Diagram: Memcached runs on the OS; the update stops the service, soft-reboots into the new OS, then starts the service again]

  • Stop/start service: 2-3 hours of downtime (cache warm-up)
  • Soft reboot: 2-10 minutes of downtime

Is it possible to keep the application state?

SLIDE 24

OS updates lose application state

KUP: kernel update with application checkpoint-and-restore (C/R)

[Diagram: KUP's life cycle: stop service → checkpoint → in-kernel switch to the new OS → restore → start service]

1-10 minutes of downtime

Challenge: how to further decrease the potential downtime?

SLIDE 29

Techniques to decrease the downtime (checkpoint, restore, in-kernel switch):

  1) Incremental checkpoint
  2) On-demand restore
  3) FOAM: a snapshot abstraction
  4) PPP: reuse memory without an explicit dump

SLIDE 34

Incremental checkpoint

  • Reduces downtime (by up to 83.5%)
  • Problem: multiple snapshots increase the restore time

[Timeline: a naive checkpoint takes one snapshot S1 with the whole dump as downtime; an incremental checkpoint takes snapshots S1, S2, S3 while the application keeps running, so only the final pass S4 counts as downtime. Si = snapshot instance]
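The idea above can be sketched as a toy simulation (not KUP's implementation): only the pages dirtied since the previous iteration are dumped, so the final stop-the-world pass copies little data, while restore must merge every snapshot — which is exactly the problem the later slides address.

```python
# Illustrative simulation of incremental checkpointing: each iteration
# dumps only the pages dirtied since the previous one, so the final
# stop-the-world pass (the actual downtime) copies far less data.

def incremental_checkpoint(memory, dirty_log):
    """memory: {page_no: bytes}; dirty_log: one set of dirtied pages per
    iteration. Returns the list of snapshots S1..Sn."""
    snapshots = [dict(memory)]          # S1: full dump while the app runs
    for dirtied in dirty_log:           # S2..Sn: only the dirtied pages
        snapshots.append({p: memory[p] for p in dirtied})
    return snapshots

def restore(snapshots):
    """Merge all snapshots; newer pages override older ones."""
    merged = {}
    for snap in snapshots:
        merged.update(snap)
    return merged

# 4-page working set, 2 iterations: all pages go to S1, pages 2 and 4 to S2.
mem = {1: b"A", 2: b"B", 3: b"C", 4: b"D"}
snaps = incremental_checkpoint(mem, [{2, 4}])
assert len(snaps[1]) == 2               # final pass dumps only 2 of 4 pages
assert restore(snaps) == mem            # restore reproduces the full state
```

Note how restore has to walk every snapshot to find each page's newest copy: this per-page merging is the restore-time cost the deck attributes to multiple snapshots.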

SLIDE 35

On-demand restore

  • Rebind the memory once the application accesses it
    – Only map the memory region from the snapshot and restart the application
  • Decreases the downtime (by up to 99.6%)
  • Problem: incompatible with incremental checkpoint
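A minimal sketch of the underlying mechanism, assuming a file-backed snapshot: mapping the snapshot with mmap is cheap, and each page is only brought in when first touched. This illustrates demand paging, not KUP's actual restore path.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Write a tiny two-page "snapshot" file (stand-in for a real snapshot).
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * PAGE + b"B" * PAGE)

# Mapping the snapshot is a few cheap operations: no page is copied
# into the process until it is first accessed.
with open(path, "rb") as f:
    snap = mmap.mmap(f.fileno(), 2 * PAGE, prot=mmap.PROT_READ)
    # Touching a byte faults the page in on demand.
    assert snap[0:1] == b"A"
    assert snap[PAGE:PAGE + 1] == b"B"
    snap.close()
os.close(fd)
os.remove(path)
```

The application restarts as soon as the mapping exists; the data transfer happens lazily afterwards, which is where the large downtime reduction comes from.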

SLIDE 37

Problem: both techniques together result in inefficient application C/R

  • During restore, each page must be mapped individually
    – Individual lookups to find the relevant pages across snapshots
    – Individual page mappings to enable on-demand restore
  • Example: an application has a working set of 4 pages, and the incremental checkpoint runs 2 iterations
    – 1st iteration → all 4 pages (1, 2, 3, 4) are dumped to S1
    – 2nd iteration → the 2 dirtied pages (2, 4) are dumped to S2
  • Increases the restoration downtime (by 42.5%)
SLIDE 38

New abstraction: file-offset based address mapping (FOAM)

  • Flat address-space representation for the snapshot
    – One-to-one mapping between the address space and the snapshot
    – No explicit lookups for pages across snapshots
    – A few map operations map the entire snapshot into the address space
  • Uses a sparse-file representation
    – Relies on the concept of holes, supported by modern file systems
  • Simplifies incremental checkpoint and on-demand restore
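The FOAM layout can be sketched as follows, assuming a file-backed snapshot (the helper name `dump_page` is illustrative): a page's file offset equals its offset in the address space, so locating a page is pure arithmetic, unwritten ranges stay as file-system holes, and an incremental pass simply overwrites pages in place instead of creating a new snapshot.

```python
import os
import tempfile

PAGE = 4096

# FOAM-style sketch: the page at relative virtual offset V is stored at
# file offset V, so the snapshot is a flat image of the address space.
fd, path = tempfile.mkstemp()

def dump_page(voff, data):
    os.pwrite(fd, data, voff)          # file offset == address-space offset

dump_page(0 * PAGE, b"A" * PAGE)       # 1st iteration: pages 0 and 3
dump_page(3 * PAGE, b"D" * PAGE)       # offsets 1-2 are never written: holes
dump_page(3 * PAGE, b"Y" * PAGE)       # 2nd iteration: dirtied page 3 is
                                       # overwritten in place, no lookup

assert os.fstat(fd).st_size == 4 * PAGE        # flat address-space image
assert os.pread(fd, 1, 1 * PAGE) == b"\x00"    # a hole reads back as zeros
assert os.pread(fd, 1, 3 * PAGE) == b"Y"       # newest version wins in place
os.close(fd)
os.remove(path)
```

Because there is only ever one file with one-to-one offsets, on-demand restore reduces to mapping this single file, which is why FOAM makes the two earlier techniques compose.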
SLIDE 39

Techniques to decrease the downtime (checkpoint, restore, in-kernel switch):

  1) Incremental checkpoint
  2) On-demand restore
  3) FOAM: a snapshot abstraction
  4) PPP: reuse memory without an explicit dump

SLIDE 46

Redundant data copy

[Diagram: running → checkpoint → in-kernel switch → restore → running. Memcached's pages (1-4) in RAM are dumped to snapshot S1 under the old OS, then read back into RAM under the new OS]

  • Application C/R copies data back and forth (dump data at checkpoint, read data at restore)
  • Not a good fit for applications with huge memory

Is it possible to avoid the memory copy?

SLIDE 52

Avoid redundant data copy across reboot

[Diagram: Memcached's pages (1-4) stay in RAM; the old OS reserves the memory at checkpoint, the new OS reserves the same memory after the in-kernel switch, and the restored application implicitly maps the region, so the memory is in use again without any copy]

  • Reserve the application's memory across reboot
  • Inherently rebind the memory without any copy

Challenge: how to notify the newer OS without modifying its source?

SLIDE 54

Persist physical pages (PPP) without OS modification

  • Reserve the virtual-to-physical mapping information
    – Static instrumentation of the OS binary
    – Inject our own memory-reservation function, then continue booting the OS
  • Handle page faults for the restored application
    – Dynamic kernel instrumentation
    – Inject our own page-fault handler function for memory binding
  • No explicit memory copy
  • Does not require any kernel source modification
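A toy model of the PPP idea (purely illustrative, not kernel code — all names here are invented for the sketch): physical frames survive the in-kernel switch untouched, and the injected page-fault handler rebinds each reserved frame to the restored process on first access instead of copying it.

```python
# Toy model of PPP: "physical" frames survive the kexec reboot, and the
# injected fault handler rebinds each reserved frame on first touch.

phys = {0: b"P0", 1: b"P1", 2: b"P2", 3: b"P3"}          # RAM: frame -> data
reserved = {0x1000: 0, 0x2000: 1, 0x3000: 2, 0x4000: 3}  # vaddr -> frame

class RestoredProcess:
    def __init__(self, reservation):
        self.reservation = reservation
        self.page_table = {}                  # empty right after restore

    def access(self, vaddr):
        if vaddr not in self.page_table:      # page fault on first touch
            # Injected handler: bind the preserved frame; never copy it.
            self.page_table[vaddr] = self.reservation[vaddr]
        return phys[self.page_table[vaddr]]

# After the in-kernel switch, `phys` and `reserved` are intact.
proc = RestoredProcess(reserved)
assert proc.access(0x2000) == b"P1"   # first touch faults and rebinds
assert len(proc.page_table) == 1      # only accessed pages are bound
```

The model captures the two PPP pieces: the reservation table is what the static instrumentation preserves across the reboot, and the fault-time rebinding is what the dynamic instrumentation injects.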
SLIDE 55

Implementation

  • Application C/R → CRIU
    – Works at the namespace level
  • In-kernel switch → the kexec system call
    – A mini boot loader that bypasses the BIOS while booting
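The cycle can be sketched as the following command sequence, built here as a dry run since actually executing it requires root and reboots the machine; the PID, paths, and image directory are placeholder assumptions.

```python
# Sketch of KUP's update cycle as criu/kexec invocations, constructed
# but not executed (running them needs root and reboots the machine).

def kup_update_cmds(pid, images_dir, kernel, initrd):
    return [
        # 1) Checkpoint the application tree with CRIU.
        ["criu", "dump", "-t", str(pid), "-D", images_dir, "--shell-job"],
        # 2) Load the new kernel and switch to it with kexec,
        #    skipping the BIOS/firmware stage of a full reboot.
        ["kexec", "-l", kernel, "--initrd=" + initrd, "--reuse-cmdline"],
        ["kexec", "-e"],
        # 3) Once the new kernel is up, restore the application.
        ["criu", "restore", "-D", images_dir, "--shell-job"],
    ]

cmds = kup_update_cmds(4242, "/var/lib/kup/images",
                       "/boot/vmlinuz-new", "/boot/initrd-new")
assert cmds[0][0] == "criu" and cmds[2] == ["kexec", "-e"]
```

In a real deployment these steps would be driven by an orchestrator (e.g., via `subprocess.run`) with the restore step arranged to run on first boot of the new kernel.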

SLIDE 56

Evaluation

  • How effective is KUP's approach compared to in-kernel hot patching?
  • What is the effective performance of each technique during the update?

SLIDE 57

KUP can support major and minor updates in Ubuntu

  • KUP supports 23 minor and 4 major updates (v3.17–v4.1)
  • However, kpatch can only update 2 versions
    – kpatch failure scenarios include, e.g., layout changes in data structures

SLIDE 67

Updating OS with memcached

  • PPP has the least degradation
  • Storage also affects the performance

[Chart: memcached bandwidth (MB) over a 190-250 sec timeline during the update, comparing Basic, Incremental checkpoint, On-demand restore, and FOAM on both SSD and RP-RAMFS, plus PPP]
SLIDE 68

Limitations

  • KUP does not support checkpoint-and-restore of all socket implementations
    – TCP, UDP, and netlink are supported
  • Restoration can fail
    – e.g., when a system call is removed or its interface is modified
SLIDE 69

Demo

SLIDE 71

Summary

  • KUP: a simple update mechanism with application checkpoint-and-restore (C/R)
  • Employs various techniques:
    – A new data abstraction for application C/R
    – A fast in-kernel switching technique
    – A simple mechanism to persist the memory

Thank you!

SLIDE 72

Backup Slides

SLIDE 73

Handling in-kernel states

  • Handles namespaces and cgroups
  • The ptrace() syscall handles blocking system calls, timers, registers, etc.
  • Parasite code fetches/puts the application's state
  • The /proc file system exposes the information required for application C/R
  • A new mode (TCP_REPAIR) allows handling TCP connections

SLIDE 74

What cannot be checkpointed

  • X11 applications
  • Tasks with debugger attached
  • Tasks running in compat mode (32 bit)
SLIDE 75

Possible changes after application C/R

  • Per-task statistics
  • Namespace IDs
  • Process start time
  • Mount point IDs
  • Socket IDs (st_ino)
  • VDSO
SLIDE 76

Suitable applications

  • Suitable for all kinds of applications
  • The PPP approach supports all types of applications
    – But it may fail to restore on the previous kernel
  • FOAM is not a good candidate for write-intensive applications
    – But it gives more confidence in safely restoring the application on the previous kernel

SLIDE 79

PPP works effectively

[Chart: downtime (sec, 10-90) vs. working-set size (GB, 8-72) with 50% writes, comparing FOAM - SSD, FOAM - RP-RAMFS (hits an out-of-memory error at large WSS), and PPP]

  • FOAM on SSD → slow
  • FOAM on RP-RAMFS → space inefficient