Detecting Failures in Distributed Systems with the FALCON Spy Network

May 29, 2023

SLIDE 1

DETECTING FAILURES IN DISTRIBUTED SYSTEMS WITH THE FALCON SPY NETWORK

Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, Michael Walfish
SOSP’11
Presented by Khiem Ngo

SLIDE 2

PROBLEM

Reliable distributed systems must handle crash failures
Application crashes, hardware failures, etc.
Detecting failures can take longer than recovery
Building a fast, reliable, and unobtrusive failure detector is challenging
Distributed systems are built on an asynchronous communication environment
Existing failure detection techniques (e.g., end-to-end timeouts) are unreliable or disruptive
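
The trade-off behind end-to-end timeouts can be illustrated with a minimal sketch. The `TimeoutDetector` class, its fields, and the timing values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TimeoutDetector:
    """Suspects a peer once no heartbeat has arrived for `timeout` seconds."""
    timeout: float
    last_heartbeat: float = 0.0

    def heartbeat(self, now: float) -> None:
        # Record the arrival time of the latest heartbeat.
        self.last_heartbeat = now

    def suspects(self, now: float) -> bool:
        # Silence longer than the timeout is treated as a crash,
        # even though the peer may only be slow.
        return now - self.last_heartbeat > self.timeout


# Short timeout: reacts quickly, but a live peer whose heartbeat is
# delayed by 3 s is wrongly suspected.
short = TimeoutDetector(timeout=1.0)
short.heartbeat(now=0.0)
false_suspicion = short.suspects(now=3.0)

# Long timeout: tolerates the same delay, but a real crash right after
# t=0 is only noticed once more than 10 s have passed.
long_ = TimeoutDetector(timeout=10.0)
long_.heartbeat(now=0.0)
missed_at_3s = not long_.suspects(now=3.0)
detected_at_11s = long_.suspects(now=11.0)
```

A short timeout gives speed at the cost of false suspicions; a long timeout gives accuracy at the cost of slow detection. No single value is both fast and reliable in an asynchronous environment.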

SLIDE 3

PROBLEM: PRIMARY VS. SUPPLEMENTAL

[PRIMARY]

Formal theories and definitions of several classes of failure detectors
How consensus and atomic broadcast are made possible in an asynchronous network with failure detectors

[SUPPLEMENTAL]

How to build a failure detector that is fast, reliable, and minimally disruptive

SLIDE 4

KEY TECHNIQUES: FALCON

FALCON: a network of spies chained together to monitor different layers of the system
FALCON: Fast And Lethal Component Observation Network
Monitored layers: Application, Operating System, Virtual Machine Monitor, and Network
Each spy uses inside information (e.g., process table, internal timeouts, etc.) → fast
Lower-level spies monitor higher-level ones
Kill the layer when in doubt to achieve reliability
Try to kill the smallest possible component → low disruption
Use end-to-end timeout as the last resort
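
The chained-spy idea above can be sketched as a small simulation. This is a simplified model under stated assumptions, not FALCON's actual interface: the names `Layer`, `Status`, `detect`, `make_stack`, and the `FALLBACK_TIMEOUT` result are all hypothetical, and each spy is reduced to a status flag plus a flag saying whether the spy itself answers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNSURE = "unsure"


@dataclass
class Layer:
    name: str
    status: Status = Status.UP   # what this layer's spy sees from inside
    spy_responds: bool = True    # False when the spy itself is silent
    killed: bool = False


def detect(layers: List[Layer]) -> Tuple[str, Optional[str]]:
    """Query spies from the highest layer down (hypothetical model)."""
    for i, layer in enumerate(layers):
        if not layer.spy_responds:
            continue  # escalate to the spy one level below
        if layer.status is Status.UP:
            if i == 0:
                return ("UP", None)
            # This spy is alive but the one above is silent: in doubt,
            # so kill only the layer directly above.
            victim = layers[i - 1]
            victim.killed = True
        else:
            victim = layer
            if layer.status is Status.UNSURE:
                victim.killed = True  # kill to turn doubt into certainty
        return ("DOWN", victim.name)
    # No spy answered (e.g., network trouble): last resort.
    return ("FALLBACK_TIMEOUT", None)


def make_stack() -> List[Layer]:
    # Layers ordered from highest to lowest, as listed on the slide.
    return [Layer("application"), Layer("os"), Layer("vmm"), Layer("network")]
```

In this model a DOWN report is only issued once the layer is certainly dead, because any doubtful layer is killed first. Killing the layer directly above the first responsive spy keeps the killed component as small as possible.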

SLIDE 5

[Figures: FALCON architecture; spy architecture]

SLIDE 6

KEY TECHNIQUES: PRIMARY VS. SUPPLEMENTAL

[Primary]

Formal theories and definitions of several classes of failure detectors
(Theoretically) show that simpler solutions for consensus and atomic broadcast are possible with reliable failure detectors (RFD)

[Supplemental]

Build a failure detector that is fast, reliable, and minimally disruptive
(Experimentally) show that some distributed system tasks can be made simpler with RFD

SLIDE 7

KEY FINDINGS

FALCON is fast and achieves sub-second detection
Its detection time is an order of magnitude faster than that of baseline FDs
FALCON’s CPU overhead is small (< 1% per component)
FALCON causes little disruption despite its surgical kills
FALCON reduces the unavailability period after crashes (6x)
FALCON helps simplify distributed system programming

Replication approach    Lines of code    # replicas/witnesses
Paxos                   1759             3
Primary-backup          1388             2
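
The replica counts in the table are consistent with the standard bounds for tolerating f crash failures: 2f + 1 replicas for Paxos and f + 1 for primary-backup with a reliable failure detector. The sketch below assumes f = 1 (the slide does not state f) and uses hypothetical helper names; the line counts are the ones reported in the table.

```python
def paxos_replicas(f: int) -> int:
    # Majority-based consensus needs 2f + 1 replicas to survive f crashes.
    return 2 * f + 1


def primary_backup_replicas(f: int) -> int:
    # With a reliable failure detector, f + 1 replicas are enough.
    return f + 1


f = 1  # assumption: one tolerated crash, which matches the table's 3 and 2
replicas = (paxos_replicas(f), primary_backup_replicas(f))

# Relative reduction in lines of code, from the table's figures.
loc_paxos, loc_primary_backup = 1759, 1388
loc_reduction = round((loc_paxos - loc_primary_backup) / loc_paxos, 3)
```

Under these assumptions, the primary-backup version needs one fewer replica and about 21% fewer lines of code.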

SLIDE 8

Detection time of FALCON and baseline failure detector under various failures

SLIDE 9

KEY TAKEAWAYS

FALCON: a chained network of spies monitoring different layers
FALCON: uses inside information and local timeouts for fast detection, surgical killing for accuracy
FALCON: causes little disruption, helps simplify distributed system programming

FALCON does not contradict the FLP impossibility result
FALCON cannot handle Byzantine faults
FALCON: cannot differentiate between a slow network and a failed network