Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services
Jacob R. Lorch, Andrew Baumann, Lisa Glendenning, Dutch T. Meyer, Andrew Warfield
Jay Lorch, Microsoft Research
Our goal: Turn existing binaries into fault-tolerant services.
Example: FDS Metadata Service [Nightingale et al., OSDI 2012]
[Diagram: FDS cluster, FDS metadata server, Paxos leader election]
Techniques for making code fault-tolerant have limitations
- Use state machine replication library
- Explicitly persist state to reliable back-end
Requires development resources
Potential for oversight:
- Non-determinism
- Failing to persist state
- Exposing non-persisted data
- Bugs in crash recovery
Better: Transparently make the binary fault-tolerant
Outline
- Motivation
- Background: Asynchronous VM replication
- Our solution: Lightweight VM replication
- Challenges and solutions
- Evaluation
Asynchronous virtual machine replication - Remus [Cully et al., NSDI 2008]
[Diagram: primary, backup, checkpoint deltas (Δ), output buffer, Ack(Δ)]
The primary sends checkpoint deltas (Δ) to the backup and holds outgoing packets in an output buffer until the backup returns Ack(Δ).
Primary can crash at any time; backup is always a bit behind.
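The output-buffering rule can be illustrated with a small simulation. This is a minimal sketch, not Remus code: the class and method names (`Primary`, `send_packet`, `take_checkpoint`, `on_ack`) are hypothetical, and a checkpoint delta is reduced to an integer epoch number.

```python
from collections import deque


class Primary:
    """Toy primary that buffers output until the covering checkpoint is acked."""

    def __init__(self):
        self.epoch = 0            # number of the checkpoint currently being built
        self.buffered = deque()   # (epoch, packet) pairs not yet released
        self.released = []        # packets that clients can actually see

    def send_packet(self, packet):
        # Output depends on state the backup does not have yet, so hold it.
        self.buffered.append((self.epoch, packet))

    def take_checkpoint(self):
        # Close the current epoch; its number stands in for the delta we ship.
        delta = self.epoch
        self.epoch += 1
        return delta

    def on_ack(self, delta):
        # The backup holds every epoch up to `delta`; release matching output.
        while self.buffered and self.buffered[0][0] <= delta:
            self.released.append(self.buffered.popleft()[1])


primary = Primary()
primary.send_packet("reply-1")
delta = primary.take_checkpoint()
primary.send_packet("reply-2")   # belongs to the next epoch
primary.on_ack(delta)            # only "reply-1" becomes visible
```

Because nothing is released before the ack, a client never observes output from state that would be lost if the primary crashed.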
High VM activity can delay packets
Latency of ping (ms), by quantile, for each background workload:
                  50th     95th     99th     99.9th
Baseline          43       71       76       88
Safety Scan       66       96       104      151
Search Indexer    67       102      160      276
Update            77       722      1716     2460
Deduplication     81.9     4942     7741     9697
Processes unrelated to the service can balloon client-perceived latency.
Our solution: Use lightweight VMs instead
Lightweight VM system examples
- Xax [Douceur et al., OSDI 2008]
- Native Client [Sehr et al., IEEE S&P 2009]
- Drawbridge [Porter et al., ASPLOS 2011]
- Embassies [Howell et al., NSDI 2013]
- Bascule [Baumann et al., EuroSys 2013]
[Diagram: service process, other processes, LVM host, host OS; narrow API (e.g., ~45 calls in Bascule)]
Lightweight VMs can support unmodified binaries via a library OS
[Diagram: a service process containing the service binary and a library OS; the library OS provides the OS API to the binary and uses the LVM API of the LVM host]
Bascule has a Windows LibOS and a Linux LibOS
A lightweight VM is encapsulated by virtue of having a narrow interface
[Diagram: service process (service binary, OS API, library OS) above the LVM API and LVM host]
Our approach: Checkpoint by interposing on existing LVM API
[Diagram: service binary, OS API, library OS, LVM API, checkpointer, LVM API, LVM host; the checkpointer produces the checkpoint]
Interposition using existing API means LVM and LibOS don’t have to change
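Interposition on an existing API can be sketched as a wrapper that exposes the same interface it consumes. This is an illustrative sketch only: `LvmApi`, `open_stream`, and `write_stream` are hypothetical names, not calls from Bascule or any real LVM API.

```python
from abc import ABC, abstractmethod


class LvmApi(ABC):
    """Hypothetical two-call stand-in for a narrow LVM API."""

    @abstractmethod
    def open_stream(self, name: str) -> int: ...

    @abstractmethod
    def write_stream(self, handle: int, data: bytes) -> int: ...


class Host(LvmApi):
    def __init__(self):
        self.streams = {}

    def open_stream(self, name):
        handle = len(self.streams) + 1
        self.streams[handle] = bytearray()
        return handle

    def write_stream(self, handle, data):
        self.streams[handle] += data
        return len(data)


class Checkpointer(LvmApi):
    """Sits between guest and host; both sides see the unchanged API."""

    def __init__(self, below: LvmApi):
        self.below = below
        self.observed = []   # calls seen so far, available to checkpoint logic

    def open_stream(self, name):
        handle = self.below.open_stream(name)
        self.observed.append(("open_stream", name, handle))
        return handle

    def write_stream(self, handle, data):
        self.observed.append(("write_stream", handle, len(data)))
        return self.below.write_stream(handle, data)


def library_os_main(api: LvmApi) -> int:
    # The guest code is identical whether `api` is the host or the checkpointer.
    handle = api.open_stream("log")
    return api.write_stream(handle, b"hello")


plain = library_os_main(Host())
interposed = library_os_main(Checkpointer(Host()))
```

The guest function takes any `LvmApi`, which is the point of the design: neither the host nor the library OS needs to know the checkpointer exists.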
Asynchronous Virtual Machine Replication [Cully et al., NSDI 2008]
[Diagram: primary and backup, each with a guest (service+OS) and a checkpointing host]
Lightweight Virtual Machine Replication
[Diagram: primary and backup, each with service, library OS, checkpointing, and host]
Our implementation of LVMR is called Tardigrade
Practical LVMR poses challenges
- Challenge: Maintaining consistency across reconfigurations. Solution: Vertical Paxos
- Challenge: Achieving performance potential. Solutions: Incremental checkpointing, checkpoint capping, parallelism, scaling send buffer size
- Challenge: Checkpointing via an existing LVM API. Solutions: Quiescing, pre-checkpointing, enforcing determinism, terminating connections
See paper for details
Lessons for LVM API designers
Checkpointing uses certain LVM API features
- Feature: Ability to track changed memory pages. Purpose: Efficiently compute checkpoint deltas
- Feature: Ability to suspend and inspect other threads. Purpose: Capture consistent snapshot
- Feature: Determinism when API calls are replayed. Purpose: Prevent divergence on failover
- Feature: Host state either replayable or regeneratable. Purpose: Recreate host state on backup
Features may not always be in LVM APIs
- Ability to track changed memory pages
- Missing ability to suspend and inspect other threads. Workaround: Use exceptions, pre-checkpointing
- Non-determinism when API calls are replayed. Workaround: Hide non-determinism
- Host state not replayable or regeneratable. Workaround: Expose divergence as error condition
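To make the link between tracking changed memory pages and computing checkpoint deltas concrete, here is a minimal sketch. It assumes writes go through a `write` method so they can be tracked in software; a real system would rely on the host to report dirty pages. `TrackedMemory`, `take_delta`, and `apply_delta` are hypothetical names.

```python
PAGE_SIZE = 4096


class TrackedMemory:
    """Toy guest memory that remembers which pages changed since the last delta."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()

    def write(self, page, data):
        # Pad or truncate to one page, then record the page as changed.
        self.pages[page] = data.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
        self.dirty.add(page)

    def take_delta(self):
        # The delta contains only pages written since the previous delta.
        delta = {page: self.pages[page] for page in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def apply_delta(backup_pages, delta):
    # The backup overwrites just the pages named in the delta.
    for page, contents in delta.items():
        backup_pages[page] = contents


memory = TrackedMemory(num_pages=8)
backup = list(memory.pages)
memory.write(2, b"metadata")
memory.write(5, b"lease")
apply_delta(backup, memory.take_delta())
```

Only two of the eight pages cross the network here, which is why dirty-page tracking matters for checkpoint size.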
To capture a checkpoint, we must quiesce and capture all threads’ state.
[Diagram: primary with guest (service + library OS), checkpointing layer, host, and memory]
What if the API doesn’t let a thread suspend and inspect another thread?
We can use exceptions to quiesce guest threads
Exception handler quiesces and captures each guest thread’s state
[Diagram: primary with guest (service + library OS), checkpointing layer running ExceptionHandler( , ), host, memory, and checkpoint]
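The idea of quiescing a thread by delivering an exception to it, and letting the handler capture that thread's state, can be mimicked with Python generators. This is only an analogy under stated assumptions: generators stand in for guest threads, `generator.throw()` stands in for exception delivery, and `Quiesce`, `guest_thread`, and `take_checkpoint` are hypothetical names.

```python
class Quiesce(Exception):
    """Hypothetical exception injected into a guest thread at checkpoint time."""


def guest_thread():
    counter = 0   # stands in for the thread's registers and stack
    while True:
        try:
            counter += 1
            yield counter                 # normal execution point
        except Quiesce:
            # Handler runs in the interrupted thread's own context, so it can
            # see that thread's state; it reports the state and parks.
            yield {"counter": counter}


def take_checkpoint(threads):
    # Deliver the exception to every thread; each handler returns its state.
    return {name: thread.throw(Quiesce()) for name, thread in threads.items()}


threads = {"a": guest_thread(), "b": guest_thread()}
for _ in range(3):
    next(threads["a"])      # thread a runs three steps
next(threads["b"])          # thread b runs one step
checkpoint = take_checkpoint(threads)
resumed = next(threads["a"])   # thread a continues where it left off
```

No thread ever inspects another thread directly; each one hands over its own state from inside its handler, which is what makes the approach workable when the API lacks a suspend-and-inspect call.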
Synchronous system calls complicate quiescence
The wait system call is easy to deal with
[Diagram: on the primary, the guest calls select() with file descriptor list 0x1AC, 0x3BB, 0x907; the checkpointing layer calls select() with the same list plus a time-to-checkpoint entry]
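Adding a time-to-checkpoint entry to a wait call can be sketched with Python's standard `select` and `socket` modules. This is an illustration of the pattern, not Tardigrade's code: `CheckpointingLayer`, `guest_select`, and `request_checkpoint` are hypothetical names, and a socket pair plays the role of the time-to-checkpoint descriptor.

```python
import select
import socket


class CheckpointingLayer:
    def __init__(self):
        # Writing to _wake_w makes _wake_r readable, which unblocks select().
        self._wake_r, self._wake_w = socket.socketpair()

    def request_checkpoint(self):
        self._wake_w.send(b"c")

    def guest_select(self, guest_fds, timeout=None):
        # Wait on the guest's descriptors plus the time-to-checkpoint one.
        ready, _, _ = select.select(list(guest_fds) + [self._wake_r], [], [], timeout)
        if self._wake_r in ready:
            self._wake_r.recv(1)   # consume the wake-up byte
            guest_ready = [fd for fd in ready if fd is not self._wake_r]
            return "checkpoint", guest_ready
        return "ready", ready


layer = CheckpointingLayer()
client, server = socket.socketpair()   # stand-in for a guest connection
layer.request_checkpoint()
reason, ready = layer.guest_select([server], timeout=1.0)
```

A thread blocked in the wait therefore returns promptly when a checkpoint is due, without any change to how the guest itself calls `select()`.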
General synchronous system calls require pre-checkpointing
API non-determinism undermines replay
- Primary: CreateSemaphore() returns descriptor 0xAAA
- Backup: CreateSemaphore() returns descriptor 0xBBB
An indirection table can hide non-determinism
Primary (checkpointing layer):
Guest descriptor    Host descriptor
0x001               0xAAA
0x002               0x932
Backup (checkpointing layer):
Guest descriptor    Host descriptor
0x001               0xBBB
0x002               0x909
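A descriptor indirection table of this kind can be sketched in a few lines. The sketch assumes a toy host whose descriptor values differ per machine; `Host`, `CheckpointingLayer`, `create_semaphore`, and `restore_on` are hypothetical names rather than any real LVM API.

```python
import itertools


class Host:
    """Toy host that hands out descriptors starting at a machine-specific value."""

    def __init__(self, first_descriptor):
        self._next = itertools.count(first_descriptor)

    def create_semaphore(self):
        return next(self._next)


class CheckpointingLayer:
    def __init__(self, host):
        self.host = host
        self.table = {}                         # guest descriptor -> host descriptor
        self._next_guest = itertools.count(1)   # guest values chosen deterministically

    def create_semaphore(self):
        guest_fd = next(self._next_guest)
        self.table[guest_fd] = self.host.create_semaphore()
        return guest_fd                          # the guest never sees the host value

    def restore_on(self, new_host):
        # On failover, recreate each object on the new host and remap it,
        # keeping every guest-visible descriptor unchanged.
        restored = CheckpointingLayer(new_host)
        for guest_fd in sorted(self.table):
            restored.table[guest_fd] = new_host.create_semaphore()
        restored._next_guest = itertools.count(max(self.table, default=0) + 1)
        return restored


primary = CheckpointingLayer(Host(0xAAA))
guest_fd = primary.create_semaphore()          # guest sees 0x001
backup = primary.restore_on(Host(0xBBB))       # same guest value, new host value
```

The guest's memory may contain descriptor 0x001 anywhere, and it remains valid after failover because only the table entry behind it changes.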
State external to guest needs to be replayable or regeneratable
[Diagram: primary and backup, each with guest (service + library OS), checkpointing layer, LVM API, and host; TCP session state is held in the host]
- API provides sockets, not packets
- Checkpointer can’t capture TCP session state!
System-specific modifications may be necessary
- TCP connections get dropped on a failover.
- Fixing this requires a major API change to make it use packets rather than sockets.
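One way to picture dropped connections being exposed to the guest as an error condition is the toy model below. It is an assumption-laden sketch: `SocketShim` and its methods are hypothetical, and the only state the checkpoint carries is the list of guest descriptors, since the TCP session state itself is out of the checkpointer's reach.

```python
class SocketShim:
    """Toy checkpointing-layer view of the guest's sockets."""

    def __init__(self):
        self.connected = {}   # guest descriptor -> is the host connection alive?

    def connect(self, guest_fd):
        self.connected[guest_fd] = True

    def send(self, guest_fd, data):
        if not self.connected.get(guest_fd, False):
            # The guest sees an ordinary network failure it already handles.
            raise ConnectionResetError(f"descriptor {guest_fd:#x} lost on failover")
        return len(data)

    def checkpoint(self):
        # Only descriptor numbers can be saved; session state stays in the host.
        return sorted(self.connected)

    @classmethod
    def restore(cls, descriptors):
        shim = cls()
        # Descriptors stay valid names, but every connection is marked dropped.
        shim.connected = {fd: False for fd in descriptors}
        return shim


primary = SocketShim()
primary.connect(0x001)
sent = primary.send(0x001, b"hi")
backup = SocketShim.restore(primary.checkpoint())
try:
    backup.send(0x001, b"hi")
    failed = False
except ConnectionResetError:
    failed = True
```

Since network services must already cope with reset connections, surfacing failover this way asks nothing new of the unmodified binary.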
Effect of external processes - Remus
Latency of ping (ms), by quantile, for each background workload:
                  50th     95th     99th     99.9th
Baseline          43       71       76       88
Safety Scan       66       96       104      151
Search Indexer    67       102      160      276
Update            77       722      1716     2460
Deduplication     81.9     4942     7741     9697
Effect of external processes - Tardigrade
[Figure: latency (ms) at the 50%, 95%, 99%, and 99.9% quantiles for Baseline, Safety Scan, Search Indexer, Update, and Deduplication; shown on a log axis from 1 to 10000 ms, then on a linear axis up to 25 ms]
Memory dirtying affects checkpoint latency
[Figure: CDF (%) of latency (ms) with no dirtying and with dirtying at 10%, 20%, 30%, 40%, and 50% of net b/w]
FDS metadata service
[Figure: CDF (%) of checkpoint interval (ms) for three phases: metadata server initially idle, cluster starting up, cluster operating normally. Annotations: checkpoint delta average size 0.9 MB; checkpoint delta average size 1.8 MB]
ZKLite, a simple non-fault-tolerant Java implementation of the ZooKeeper API
[Figure: CDF (%) of client request latency (ms)]
Conclusions