SLIDE 1

Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services

Jacob R. Lorch, Andrew Baumann, Lisa Glendenning, Dutch T. Meyer, Andrew Warfield

SLIDE 2

Jay Lorch, Microsoft Research Tardigrade 2

Our goal: Turn existing binaries into fault-tolerant services.

SLIDE 3

FDS Cluster FDS Metadata server

Example: FDS Metadata Service

[Nightingale et al., OSDI 2012]

SLIDE 4

FDS Cluster FDS Metadata server

Example: FDS Metadata Service

Paxos leader election

[Nightingale et al., OSDI 2012]

SLIDE 5

Techniques for making code fault-tolerant have limitations

  • Use state machine replication library
  • Explicitly persist state to reliable back-end

Requires development resources

Potential for oversight

  • Non-determinism
  • Failing to persist state
  • Exposing non-persisted data
  • Bugs in crash recovery

Better: Transparently make the binary fault-tolerant

SLIDE 6

Outline

  • Motivation
  • Background: Asynchronous VM replication
  • Our solution: Lightweight VM replication
  • Challenges and solutions
  • Evaluation

SLIDE 7

Outline

  • Motivation
  • Background: Asynchronous VM replication
  • Our solution: Lightweight VM replication
  • Challenges and solutions
  • Evaluation

SLIDE 8

Asynchronous virtual machine replication - Remus

Δ Δ

primary backup

Primary can crash at any time; backup is always a bit behind.

[Cully et al., NSDI 2008]

SLIDE 9

Output buffer

Asynchronous virtual machine replication - Remus

primary backup [Cully et al., NSDI 2008]

SLIDE 10

Asynchronous virtual machine replication - Remus

Output buffer

Ack(Δ)

primary backup [Cully et al., NSDI 2008]
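
The output-buffering rule shown on these Remus slides can be sketched in a few lines of Python. This is a minimal illustration, not code from Remus or Tardigrade: the class `OutputBuffer` and its methods `send`, `checkpoint_sent`, and `on_ack` are hypothetical names, and checkpoints are assumed to be acknowledged in order.

```python
from collections import deque


class OutputBuffer:
    """Holds outbound packets until the backup acknowledges the checkpoint
    covering the state that produced them (hypothetical sketch)."""

    def __init__(self):
        self.epoch = 0          # checkpoint the primary is currently working toward
        self.pending = deque()  # (epoch, packet) pairs not yet released
        self.released = []      # packets actually put on the wire

    def send(self, packet):
        # Output generated during the current epoch is only buffered.
        self.pending.append((self.epoch, packet))

    def checkpoint_sent(self):
        # The delta for the current epoch has been shipped to the backup.
        sent = self.epoch
        self.epoch += 1
        return sent

    def on_ack(self, acked_epoch):
        # Release everything produced at or before the acknowledged epoch.
        while self.pending and self.pending[0][0] <= acked_epoch:
            self.released.append(self.pending.popleft()[1])


buf = OutputBuffer()
buf.send("reply-1")
epoch = buf.checkpoint_sent()
buf.send("reply-2")   # produced after the checkpoint, so it stays buffered
buf.on_ack(epoch)
```

After the acknowledgment only `reply-1` is released. If the primary crashed now, the backup's state would already account for every packet a client has seen.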

SLIDE 11

High VM activity can delay packets

[Chart: latency of ping (ms, log scale from 1 to 10000) at the 50th, 95th, 99th, and 99.9th quantiles under five conditions: Baseline, Safety Scan, Search Indexer, Update, Deduplication. Data labels in extraction order: 43, 71, 76, 88; 66, 96, 104, 151; 67, 102, 160, 276; 77, 722, 1716, 2460; 81.9, 4942, 7741, 9697.]

Processes unrelated to the service can balloon client-perceived latency.

SLIDE 12

Outline

  • Motivation
  • Background: Asynchronous VM replication
  • Our solution: Lightweight VM replication
  • Challenges and solutions
  • Evaluation

SLIDE 13

Our solution: Use lightweight VMs instead

Lightweight VM system examples

  • Xax [Douceur et al., OSDI 2008]
  • Native Client [Sehr et al., IEEE S&P 2009]
  • Drawbridge [Porter et al., ASPLOS 2011]
  • Embassies [Howell et al., NSDI 2013]
  • Bascule [Baumann et al., EuroSys 2013]

Diagram labels: Service process, Other processes, LVM host, Host OS

Narrow API (e.g., ~45 calls in Bascule)

SLIDE 14

Lightweight VMs can support unmodified binaries via a library OS

Service process LVM host

LVM API

SLIDE 15

Service process

Lightweight VMs can support unmodified binaries via a library OS

Service binary

LVM API OS API

Library OS LVM host

Bascule has a Windows LibOS and a Linux LibOS

SLIDE 16

A lightweight VM is encapsulated by virtue of having a narrow interface

LVM host

LVM API

Service process Service binary

OS API

Library OS

SLIDE 17

Service process

Our approach: Checkpoint by interposing on existing LVM API

Service binary

LVM API OS API

Library OS Checkpointer

LVM API

LVM host Checkpoint

Interposition using existing API means LVM and LibOS don’t have to change
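
The interposition idea on this slide can be illustrated with a thin forwarding layer. The sketch below is an assumption-laden toy, not Tardigrade's checkpointer: `FakeHost`, its calls `alloc_page` and `get_time`, and the `Checkpointer` class are all hypothetical stand-ins for an LVM host and its narrow API.

```python
class FakeHost:
    """Stand-in for an LVM host exposing a narrow API (hypothetical calls)."""

    def __init__(self):
        self.pages = {}

    def alloc_page(self, address):
        self.pages[address] = bytearray(4096)
        return address

    def get_time(self):
        return 1234  # fixed value keeps the demo deterministic


class Checkpointer:
    """Exposes the same calls as the host and forwards each one,
    recording what it would need in order to build a checkpoint later."""

    def __init__(self, host):
        self._host = host
        self.call_log = []

    def __getattr__(self, name):
        # Only reached for names not defined on Checkpointer itself,
        # that is, for the host's API calls.
        target = getattr(self._host, name)

        def forward(*args):
            result = target(*args)
            self.call_log.append((name, args, result))
            return result

        return forward


api = Checkpointer(FakeHost())  # the library OS would be handed `api`
api.alloc_page(0x1000)
api.get_time()
```

Because the layer presents the same interface it consumes, neither the caller above it nor the host below it needs to change, which is the property the slide highlights.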

SLIDE 18

Asynchronous Virtual Machine Replication
Lightweight Virtual Machine Replication

primary backup

Service Library OS Checkpointing Host

[Cully et al., NSDI 2008]

SLIDE 19

[Cully et al., NSDI 2008]

Asynchronous Virtual Machine Replication
Lightweight Virtual Machine Replication

primary backup

Guest (service+OS) Checkpointing Host

Our implementation of LVMR is called Tardigrade

SLIDE 20

Outline

  • Motivation
  • Background: Asynchronous VM replication
  • Our solution: Lightweight VM replication
  • Challenges and solutions
  • Evaluation

SLIDE 21

Practical LVMR poses challenges

Challenges and solutions

  • Maintaining consistency across reconfigurations: Vertical Paxos
  • Achieving performance potential: incremental checkpointing, checkpoint capping, parallelism, scaling send buffer size
  • Checkpointing via an existing LVM API: quiescing, pre-checkpointing, enforcing determinism, terminating connections

See paper for details

Lessons for LVM API designers

SLIDE 22

Checkpointing uses certain LVM API features

Feature: purpose

  • Ability to track changed memory pages: efficiently compute checkpoint deltas
  • Ability to suspend and inspect other threads: capture consistent snapshot
  • Determinism when API calls are replayed: prevent divergence on failover
  • Host state either replayable or regeneratable: recreate host state on backup
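
To make the first feature concrete, the following sketch shows how a set of changed pages turns into a checkpoint delta. It is an illustration under stated assumptions rather than Tardigrade's mechanism: `TrackedMemory`, `take_delta`, `apply_delta`, and the 4096-byte `PAGE_SIZE` are hypothetical, and dirty tracking is done in software here, whereas a real LVM host would report changed pages itself.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class TrackedMemory:
    """Guest memory that remembers which pages were written since the
    last checkpoint (hypothetical sketch)."""

    def __init__(self, num_pages):
        self.data = bytearray(num_pages * PAGE_SIZE)
        self.dirty = set()

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        self.dirty.update(range(first, last + 1))

    def take_delta(self):
        # Copy only the pages changed since the previous checkpoint.
        delta = {
            page: bytes(self.data[page * PAGE_SIZE:(page + 1) * PAGE_SIZE])
            for page in sorted(self.dirty)
        }
        self.dirty.clear()
        return delta


def apply_delta(backup, delta):
    # The backup overwrites exactly the pages named in the delta.
    for page, contents in delta.items():
        backup[page * PAGE_SIZE:(page + 1) * PAGE_SIZE] = contents


primary = TrackedMemory(num_pages=4)
backup = bytearray(4 * PAGE_SIZE)
primary.write(PAGE_SIZE - 2, b"abcd")  # straddles pages 0 and 1
delta = primary.take_delta()
apply_delta(backup, delta)
```

Only two of the four pages travel to the backup, so the cost of a checkpoint scales with the amount of memory dirtied rather than with total memory size.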

SLIDE 23

Features may not always be in LVM APIs

Feature: workaround

  • Ability to track changed memory pages
  • Missing ability to suspend and inspect other threads: use exceptions, pre-checkpointing
  • Non-determinism when API calls are replayed: hide non-determinism
  • Host state not replayable or regeneratable: expose divergence as error condition

SLIDE 24

Checkpointing layer Host Guest (service + library OS)

To capture a checkpoint, we must quiesce and capture all threads’ state.

primary Memory

What if the API doesn’t let a thread suspend and inspect another thread?

SLIDE 25

Checkpointing layer Host Guest (service + library OS)

We can use exceptions to quiesce guest threads

primary Checkpoint

SLIDE 26

Checkpoint Checkpointing layer Host Guest (service + library OS)

Exception handler quiesces and captures each guest thread’s state

primary ExceptionHandler( , ) Memory

SLIDE 27

Checkpointing layer Host Guest (service + library OS)

Synchronous system calls complicate quiescence

primary

SLIDE 28

Checkpointing layer Host Guest (service + library OS)

The wait system call is easy to deal with

primary

Guest's wait call: select() file descriptor list 0x1AC 0x3BB 0x907

Call issued by the checkpointing layer: select() file descriptor list 0x1AC 0x3BB 0x907 time-to-checkpoint
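
The trick on this slide, adding a time-to-checkpoint descriptor to the guest's wait list, can be tried with Python's standard `select` module. This is a sketch under assumptions: `interposed_select` is a hypothetical helper, and socket pairs stand in for both the guest's descriptors and the checkpoint signal.

```python
import select
import socket
import threading


def interposed_select(guest_fds, checkpoint_fd, timeout=None):
    """Wait on the guest's descriptors plus one extra descriptor that the
    checkpointing layer makes readable when it is time to checkpoint."""
    readable, _, _ = select.select(list(guest_fds) + [checkpoint_fd], [], [], timeout)
    checkpoint_due = any(fd is checkpoint_fd for fd in readable)
    guest_ready = [fd for fd in readable if fd is not checkpoint_fd]
    return guest_ready, checkpoint_due


# The guest waits on a socket that never becomes readable in this demo.
guest_rx, guest_tx = socket.socketpair()
# The checkpointing layer signals through its own socket pair.
ckpt_rx, ckpt_tx = socket.socketpair()

timer = threading.Timer(0.05, ckpt_tx.send, args=(b"x",))
timer.start()
guest_ready, checkpoint_due = interposed_select([guest_rx], ckpt_rx, timeout=5)
timer.join()

for s in (guest_rx, guest_tx, ckpt_rx, ckpt_tx):
    s.close()
```

The blocked thread wakes because of the extra descriptor, not because any guest descriptor is ready, so it can be quiesced and the guest's wait reissued afterward.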

SLIDE 29

Checkpointing layer Host Guest (service + library OS)

General synchronous system calls require pre-checkpointing

primary

SLIDE 30

API non-determinism undermines replay

primary: CreateSemaphore() returns descriptor 0xAAA

backup: CreateSemaphore() returns descriptor 0xBBB

SLIDE 31

An indirection table can hide non-determinism

primary and backup each show: Checkpointing layer, Host, Guest (service + library OS)

Indirection table on primary (guest descriptor -> host descriptor)

  • 0x001 -> 0xAAA
  • 0x002 -> 0x932

Indirection table on backup (guest descriptor -> host descriptor)

  • 0x001 -> 0xBBB
  • 0x002 -> 0x909
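
The indirection table can be sketched as a small mapping kept by the checkpointing layer. The names `DescriptorTable`, `wrap`, `to_host`, and `run_guest` are hypothetical; the descriptor values are the ones from the slide.

```python
import itertools


class DescriptorTable:
    """Maps stable guest-visible descriptors to whatever the local host
    happened to return (hypothetical sketch)."""

    def __init__(self):
        self._next_guest = itertools.count(1)
        self._guest_to_host = {}

    def wrap(self, host_descriptor):
        # Hand the guest a deterministic descriptor in creation order.
        guest_descriptor = next(self._next_guest)
        self._guest_to_host[guest_descriptor] = host_descriptor
        return guest_descriptor

    def to_host(self, guest_descriptor):
        # Translate back when the guest uses the descriptor in a later call.
        return self._guest_to_host[guest_descriptor]


def run_guest(host_results):
    """Replay the same descriptor-creating calls against a host that
    returns `host_results`, and report what the guest observes."""
    table = DescriptorTable()
    seen_by_guest = [table.wrap(h) for h in host_results]
    return table, seen_by_guest


primary_table, primary_view = run_guest([0xAAA, 0x932])
backup_table, backup_view = run_guest([0xBBB, 0x909])
```

Both guests observe descriptors 1 and 2 even though the hosts returned different values, so guest memory captured on the primary remains valid on the backup.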

SLIDE 32

State external to guest needs to be replayable or regeneratable

primary backup Checkpointing layer Host Guest (service + library OS)

LVM API LVM API

API provides sockets, not packets

TCP session state

Checkpointer can’t capture TCP session state!

SLIDE 33

System-specific modifications may be necessary

primary backup Checkpointing layer Host Guest (service + library OS)

TCP session state

Checkpointing layer Host Guest (service + library OS)

TCP connections get dropped on a failover. Fixing this would require a major API change: the API would have to expose packets rather than sockets.
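
One way to picture dropped connections being exposed as an error condition is the toy below. It is a sketch under assumptions, not the Tardigrade implementation: `ReplicatedSocket`, `mark_lost`, and `fail_over` are hypothetical names, and a plain callable stands in for the host socket.

```python
class ReplicatedSocket:
    """Guest-visible socket whose host-side TCP state cannot be checkpointed."""

    def __init__(self, host_recv):
        self._host_recv = host_recv  # callable standing in for the host socket
        self._lost = False

    def mark_lost(self):
        self._lost = True
        self._host_recv = None

    def recv(self):
        if self._lost:
            # Surface the divergence as an ordinary network error that the
            # service already has to handle.
            raise ConnectionResetError("connection dropped by failover")
        return self._host_recv()


def fail_over(open_sockets):
    # On the backup, every connection that existed on the primary is gone.
    for sock in open_sockets:
        sock.mark_lost()


sock = ReplicatedSocket(lambda: b"hello")
before = sock.recv()
fail_over([sock])
try:
    sock.recv()
    after = "no error"
except ConnectionResetError:
    after = "reset"
```

From the service's point of view a failover then looks like a connection reset, a condition that networked code generally already handles by reconnecting.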

SLIDE 34

Outline

  • Motivation
  • Background: Asynchronous VM replication
  • Our solution: Lightweight VM replication
  • Challenges and solutions
  • Evaluation

SLIDE 35

Effect of external processes - Remus

[Chart: latency of ping (ms, log scale from 1 to 10000) at the 50th, 95th, 99th, and 99.9th quantiles under five conditions: Baseline, Safety Scan, Search Indexer, Update, Deduplication. Data labels in extraction order: 43, 71, 76, 88; 66, 96, 104, 151; 67, 102, 160, 276; 77, 722, 1716, 2460; 81.9, 4942, 7741, 9697.]

SLIDE 36

Effect of external processes - Tardigrade

[Chart: latency (ms, log scale from 1 to 10000) by quantile (50%, 95%, 99%, 99.9%) for Baseline, Safety Scan, Search Indexer, Update, and Deduplication.]

SLIDE 37

Effect of external processes - Tardigrade

[Chart: latency (ms, linear scale with ticks from 5 to 25) by quantile (50%, 95%, 99%, 99.9%) for Baseline, Safety Scan, Search Indexer, Update, and Deduplication.]

SLIDE 38

Memory dirtying affects checkpoint latency

[Chart: CDF (%) of latency (ms, ticks from 20 to 140) for six memory dirtying rates: no dirtying, and 10%, 20%, 30%, 40%, and 50% of net b/w.]

SLIDE 39

FDS metadata service

[Chart: CDF (%) of checkpoint interval (ms, ticks from 10 to 70) for three phases: metadata server initially idle, cluster starting up, cluster operating normally. Annotations: checkpoint delta average size 0.9 MB; checkpoint delta average size 1.8 MB.]

SLIDE 40

ZKLite, a simple non-fault-tolerant Java implementation of the Zookeeper API

[Chart: CDF (%) of client request latency (ms, ticks from 20 to 160).]

SLIDE 41

Conclusions

Lightweight VM replication is practical for making existing service binaries fault-tolerant

  • No changes to binaries needed, making deployment simple
  • Replicating processes rather than VMs substantially reduces worst-case latency
  • Reasonable performance if memory dirtying rate and load are low
  • Lightweight virtual machine API designers should consider effect on replication

Examples of good targets:

  • Metadata services
  • Coordination services
  • Niche web services