DETECTING FAILURES IN DISTRIBUTED SYSTEMS WITH THE FALCON SPY NETWORK



SLIDE 1

DETECTING FAILURES IN DISTRIBUTED SYSTEMS WITH THE FALCON SPY NETWORK

Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, Michael Walfish
SOSP’11
Presented by Khiem Ngo

SLIDE 2

PROBLEM

  • Reliable distributed systems must handle crash failures
  • Application crashes, hardware failures, etc.
  • Detecting failures can take longer than recovery
  • Building a fast, reliable, and unobtrusive failure detector is challenging
  • Distributed systems are built on an asynchronous communication environment
  • Existing failure detection techniques (e.g., end-to-end timeouts) are unreliable or disruptive
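
The trade-off behind an end-to-end timeout can be sketched as follows. This is a minimal illustration, not code from the paper: the `TimeoutDetector` class, its method names, and the timeout values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimeoutDetector:
    """Suspects a peer once no heartbeat has arrived for `timeout` seconds."""
    timeout: float
    last_heartbeat: float = 0.0

    def heartbeat(self, now: float) -> None:
        # Record the arrival time of the latest heartbeat.
        self.last_heartbeat = now

    def suspects(self, now: float) -> bool:
        # Silence longer than the timeout is treated as a crash.
        return now - self.last_heartbeat > self.timeout


# A short timeout reacts quickly but wrongly suspects a slow, live peer.
fast = TimeoutDetector(timeout=1.0)
fast.heartbeat(now=0.0)
false_alarm = fast.suspects(now=3.0)        # peer was only delayed by 3 s

# A long timeout avoids that mistake but reacts slowly to a real crash.
slow = TimeoutDetector(timeout=30.0)
slow.heartbeat(now=0.0)
missed_crash = not slow.suspects(now=10.0)  # peer died at t=0, still trusted at t=10
```

With only a timeout to go on, the detector cannot tell a delayed peer from a dead one, so any timeout value trades detection speed against accuracy.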

SLIDE 3

PROBLEM: PRIMARY VS. SUPPLEMENTAL

[PRIMARY]

  • Formal theories and definitions of several classes of failure detectors
  • How consensus and atomic broadcast are made possible in an asynchronous network with failure detectors

[SUPPLEMENTAL]

  • How to build a failure detector that is fast, reliable, and minimally disruptive

SLIDE 4

KEY TECHNIQUES: FALCON

  • FALCON: a network of spies chained together to monitor different layers of the system
  • FALCON: Fast And Lethal Component Observation Network
  • Monitored layers: Application, Operating System, Virtual Machine Monitor, and Network
  • Each spy uses inside information (e.g., process table, internal timeouts, etc.) → fast
  • Lower-level spies monitor higher-level ones
  • Kill the layer when in doubt to achieve reliability
  • Try to kill the smallest possible component → low disruption
  • Use end-to-end timeout as the last resort
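
The chain of spies, the kill-when-in-doubt rule, and the timeout fallback can be sketched as below. This is a simplified model under stated assumptions, not FALCON's implementation: the `State` and `Layer` types, the `check` function, and its parameters are hypothetical, and each layer's spy is assumed to run in the layer directly below it.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    UP = "up"
    DOWN = "down"   # crashed; a spy can confirm this from inside information
    HUNG = "hung"   # unresponsive; a spy cannot tell slow from dead


@dataclass
class Layer:
    name: str
    state: State = State.UP
    killed_by_spy: bool = False


def check(layers: list[Layer], spies_reachable: bool = True,
          timeout_expired: bool = False) -> str:
    """Return "UP" or "DOWN" for a stack ordered top (application) to bottom."""
    if not spies_reachable:
        # Last resort: no spy answers, so fall back to an end-to-end timeout.
        return "DOWN" if timeout_expired else "UP"
    for i, layer in enumerate(layers):
        if layer.state is State.HUNG:
            # Kill when in doubt, so that a DOWN report is always true.
            layer.state = State.DOWN
            layer.killed_by_spy = True
            for upper in layers[:i]:
                upper.state = State.DOWN  # layers above die with it
        if layer.state is State.DOWN:
            return "DOWN"
    return "UP"


stack = [Layer("application", State.HUNG), Layer("os"),
         Layer("vmm"), Layer("network")]
verdict = check(stack)  # only the application is killed; the OS stays up
```

Walking the stack from the top means the first doubtful layer found is also the smallest component that can be killed, which is what keeps disruption low in this model.
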
SLIDE 5

[Figures: FALCON architecture; spy architecture]

SLIDE 6

KEY TECHNIQUES: PRIMARY VS. SUPPLEMENTAL

[PRIMARY]

  • Formal theories and definitions of several classes of failure detectors
  • (Theoretically) shows that simpler solutions for consensus and atomic broadcast are possible with reliable failure detectors (RFDs)

[SUPPLEMENTAL]

  • Builds a failure detector that is fast, reliable, and minimally disruptive
  • (Experimentally) shows that some distributed system tasks can be made simpler with an RFD

SLIDE 7

KEY FINDINGS

  • FALCON is fast and achieves sub-second detection
  • Its detection time is an order of magnitude faster than that of baseline FDs
  • FALCON’s CPU overhead is small (< 1% per component)
  • FALCON causes little disruption despite surgical killing
  • FALCON reduces the unavailability period after crashes (6x)
  • FALCON helps simplify distributed system programming

Replication approach    Lines of code    # replicas/witnesses
Paxos                   1759             3
Primary-backup          1388             2
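
The replica counts in the table match the standard crash-fault bounds: majority-quorum protocols such as Paxos need 2f + 1 replicas to tolerate f crashes, while primary-backup with a reliable failure detector needs f + 1. A small sketch of that arithmetic follows. The slides do not state f, so f = 1 is an assumption, and `replicas_needed` is a hypothetical helper.

```python
def replicas_needed(f: int, reliable_fd: bool) -> int:
    """Replicas required to tolerate f crash failures."""
    # With a reliable failure detector, primary-backup needs f + 1 copies;
    # majority-quorum protocols such as Paxos need 2f + 1.
    return f + 1 if reliable_fd else 2 * f + 1


f = 1  # assumption: a single tolerated crash, consistent with the table
paxos = replicas_needed(f, reliable_fd=False)           # 3
primary_backup = replicas_needed(f, reliable_fd=True)   # 2

# Relative code-size reduction computed from the table's line counts.
loc_saving_percent = round((1759 - 1388) / 1759 * 100, 1)  # 21.1
```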

SLIDE 8

Detection time of FALCON and baseline failure detector under various failures

SLIDE 9

KEY TAKEAWAYS

  • FALCON: a chained network of spies monitoring different layers
  • FALCON: uses inside information and local timeouts for fast detection, surgical killing for accuracy
  • FALCON: causes little disruption, helps simplify distributed system programming
  • FALCON does not contradict the FLP impossibility result
  • FALCON cannot handle Byzantine faults
  • FALCON: cannot differentiate between a slow network and a failed network