Improving Client Web Availability with MONET (David G. Andersen, CMU)



SLIDE 1

Improving Client Web Availability with MONET David G. Andersen, CMU

Hari Balakrishnan, M. Frans Kaashoek, Rohit Rao, MIT

http://nms.csail.mit.edu/ron/ronweb/

SLIDE 2

Availability We Want

  • Carrier Airlines (2002 FAA Fact Book)

– 41 accidents, 6.7M departures ✔ 99.9993% availability

  • 911 Phone service (1993 NRIC report +)

– 29 minutes per year per line ✔ 99.994% availability

  • Std. Phone service (various sources)

– 53+ minutes per line per year ✔ 99.99+% availability
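The nines above are plain downtime arithmetic; a quick sketch (numbers taken from this slide):

```python
def availability(downtime_minutes_per_year: float) -> float:
    """Fraction of the year a line is up, given its annual downtime."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return 1 - downtime_minutes_per_year / minutes_per_year

print(f"911 service (29 min/yr): {availability(29):.5%}")   # ~99.994%
print(f"Std. phone (53 min/yr): {availability(53):.5%}")    # ~99.99%
```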

SLIDE 3

The Internet Has Only Two Nines

✘ End-to-End Internet Availability: 95%–99.6% [Paxson, Dahlin, Labovitz, Andersen]

Insufficient substrate for:

  • New / critical apps:

– Medical collaboration
– Financial transactions
– Telephony, real-time services, ...

  • Users leave if a page takes more than 4–8 seconds [Forrester Research, Zona Research]

SLIDE 4

MONET: Goals

  • Mask Internet failures

– Total outages
– Extended high-loss periods

  • Reduce exceptional delays

– Look like failures to the user
– Save seconds, not milliseconds

MONET achieves 99.9–99.99% availability (not enough, but a good step!)

SLIDE 5

[Windows crash screen:]
A fatal exception 0E has occurred at 0028:C00068F8 in PPT.EXE<01> + 000059F8. The current application will be terminated.
* Press any key to terminate the application.
* Press CTRL+ALT+DEL to restart your computer. You will lose any unsaved information in all applications.
Press any key to continue

SLIDE 6

[Same Windows crash screen]

Not about client failures...

SLIDE 7

[Same Windows crash screen]

Not about client failures...
Nor fixing server failures (though we need to understand them)
There’s another nine hidden in here, but today... “It’s about the network!”

SLIDE 8

End-to-End Availability: Challenges

  • Internet services depend on many components:

Access networks, routing, DNS, servers, ...

  • End-to-end failures persist despite availability mechanisms for each component.

  • Failures unannounced, unpredictable, silent
  • Many different causes of failures:

– Misconfiguration, deliberate attacks, hardware/software failures, persistent congestion, routing convergence

SLIDE 9

Our Approach

  • Expose multiple paths to end system

– How to get access to them?

  • End-systems determine if a path works via probing/measurement

– How to do this probing?

  • Let host choose a good end-to-end path

[Diagram: Client → MONET Web Proxy → Server]

SLIDE 10

Contributions

  • MONET Web Proxy design and implementation

  • Waypoint Selection algorithm explores paths with low overhead

  • Evaluation of deployed system with live user traces; roughly an order-of-magnitude availability improvement

SLIDE 11

MONET: Bypassing Web Failures

[Diagram: MIT clients, lab proxy, links via Cogent, Internet2, Genuity to the "Internet"]

  • A Web-proxy-based system to improve availability

  • Three ways to obtain paths
SLIDE 12

MONET: Obtaining Paths

[Diagram: proxy multihomed via Cogent, DSL, Internet2, Genuity]

  • 10-50% of failures at client access link

➔ Multihome the proxy (no routing needed)

SLIDE 13

MONET: Obtaining Paths

[Diagram: proxy multihomed; multiple server replicas]

  • 10-50% of failures at client access link

➔ Multihome the proxy (no routing needed)

  • Many failures at server access link

➔ Contact multiple servers

SLIDE 14

MONET: Obtaining Paths

[Diagram: proxy multihomed; peer proxy provides an overlay path]

  • 10-50% of failures at client access link

➔ Multihome the proxy (no routing needed)

  • Many failures at server access link

➔ Contact multiple servers

  • 40-60% of failures “in network” ➔ Overlay paths
SLIDE 15

Parallel Connections Validate Paths

Near-concurrent TCP, peer proxy, and DNS queries.

[Timing diagram: local proxy, peer proxy, Web server]

1. Request starts
2. Local DNS resolution
3. Peer proxy query

SLIDE 16

Parallel Connections Validate Paths

Near-concurrent TCP, peer proxy, and DNS queries.

[Timing diagram: DNS queries, peer proxy query, TCP SYNs and SYN/ACK]

1. Request starts
2. Local DNS resolution
3. Peer proxy query
4. Local TCP connections

SLIDE 17

Parallel Connections Validate Paths

Near-concurrent TCP, peer proxy, and DNS queries.

[Timing diagram: DNS queries, peer proxy query and response, TCP SYN/SYN-ACK exchanges]

1. Request starts
2. Local DNS resolution
3. Peer proxy query
4. Local TCP connections
5. Fetch via first connection
6. Close the others
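The race above can be sketched with non-blocking sockets. This is an illustrative sketch, not MONET's Squid-based implementation; `addrs` (the candidate (host, port) pairs obtained via the different interfaces and resolutions) is an assumed input:

```python
import socket
import selectors

def first_connected(addrs, timeout=3.0):
    """Open non-blocking TCP connects to every candidate address and return
    the first socket whose handshake completes (first SYN/ACK wins).
    The losing connections are closed, as in step 6 on the slide."""
    sel = selectors.DefaultSelector()
    socks = []
    for addr in addrs:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        s.connect_ex(addr)  # returns immediately (EINPROGRESS)
        sel.register(s, selectors.EVENT_WRITE, addr)
        socks.append(s)

    winner = None
    for key, _ in sel.select(timeout=timeout):
        s = key.fileobj
        # Writable with no pending error means the handshake completed.
        if s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0:
            winner = s
            break

    for s in socks:
        if s is not winner:
            s.close()
    sel.close()
    return winner
```

A fuller version would loop on `select` until the deadline and retry addresses that return errors; this sketch just shows the first-response-wins idea.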

SLIDE 18

A More Practical MONET

Evaluated MONET tries all combinations:

  • l local interfaces, p peers, s servers

  • ls + lps paths

  • With l = 3, p = 3, s = 1–8: 12–96 paths

SLIDE 19

A More Practical MONET

Evaluated MONET tries all combinations:

  • l local interfaces, p peers, s servers

  • ls + lps paths

  • With l = 3, p = 3, s = 1–8: 12–96 paths

  • Waypoint Selection chooses the right subset

– What order to try interfaces?
– How long to wait between tries?

SLIDE 20

Waypoint Selection Problem

[Diagram: client C, paths P1 ... PN, servers S1 ... Ss]

  • Client C, paths P1, ..., PN, servers S1, ..., Ss

➔ Find a good order of the s × N (Px, Sy) pairs.
➔ Find the delay between each pair.

SLIDE 21

Waypoint Selection

[Diagram: server selection vs. waypoint selection, client C to server S]

SLIDE 22

Waypoint Selection

[Diagram: server selection picks among servers S2, S3, S4; waypoint selection picks among paths to S]

SLIDE 23

Waypoint Selection

Shared learning

[Diagram: measurements shared across server selection and waypoint selection]

  • History teaches about paths, not just servers

➔ Better initial guess (ephemeral...)

SLIDE 24

Using Waypoint Results to Probe

  • DNS: Current best + random interface
  • TCP: Current best path (interface or peer)
  • 2nd TCP w/5% chance via random path
  • Pass results back to waypoint algorithm
SLIDE 25

Using Waypoint Results to Probe

  • DNS: Current best + random interface
  • TCP: Current best path (interface or peer)
  • 2nd TCP w/5% chance via random path
  • Pass results back to waypoint algorithm
  • While no response within thresh:

– connect via next-best path
– increase thresh

➔ What information affects thresh?
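The retry loop above can be sketched as follows. `try_connect` is a hypothetical helper (returns a connection object, or None on timeout), and the ranked `paths` list stands in for the waypoint algorithm's current ordering:

```python
def fetch_with_waypoints(paths, try_connect, thresh=0.5, backoff=2.0):
    """Try paths in the waypoint algorithm's ranked order.  If no response
    arrives within `thresh` seconds, move to the next-best path and grow
    the threshold, mirroring the loop on the slide."""
    for path in paths:
        conn = try_connect(path, timeout=thresh)
        if conn is not None:
            return conn, path          # success: report back to the ranker
        thresh *= backoff              # increase thresh before the next try
    return None, None                  # every candidate path failed
```

The results (which path answered, and how fast) would then be passed back to the waypoint algorithm, as the previous slide describes.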

SLIDE 26

TCP Response Time Knee

[CDF of response time (0–0.5 s) for DNS-DSL, TCP-Cogent, TCP-MIT, TCP-DSL; the TCP curves show a knee]

SLIDE 27

TCP Response Time Knee

MIT: 105ms DSL: ~145ms

[Same CDF of response time, with the knee in the TCP curves marked for each link]

  • When to probe - right after knee
  • Small extra latency ➔ much less overhead

Two ways to approximate the knee in the paper

SLIDE 28

Implementation

[Architecture: clients → ad-blocking Squid and normal Squid front-ends → MONET Squid → Cogent / DSL / MIT links]

  • Squid Web proxy + parallel DNS resolver
  • Front-end squids mask back-end failures (ad-blocking squid as a bribe)

  • Choose outbound link with FreeBSD / Mac OS X ipfw or Linux policy routing

SLIDE 29

6-site MONET Deployment

[Deployment map: proxies at MIT (lab), NYU, Utah, Aros, Mazu; links via Cogent, DSL, Internet2, Genuity, UUNET, ELI, Wi-ISP; saved traces]

  • Two years, ∼ 50 users/week
  • Primary traces at MIT, replay at Mazu
  • Three peer proxies: NYU, Utah, Aros
  • Focus on 1 Dec 2003 – 27 Jan 2004
  • Record everything
SLIDE 30

Measurement Challenges

  • Invalid DNS responses (packet traces)
  • Invalid IPs (0.0.0.0, 127.0.0.1, ...)
  • Anomalous servers (e.g., drop 90% of SYNs)
  • Implementation and design flaws

– Network anomalies hit corner cases (Must avoid correlated measurement & network failures!)

  • Identify, automate detection, iterate...

Excluded consistently anomalous services.

SLIDE 31

MIT Trace Statistics

Request type         Count
Client object fetch  2.1M
Cache misses         1.3M
Data fetch size      28.5 Gb
Cache hit size       1 Gb
TCP connections      616,536
DNS lookups          82,957
Sessions             137,341

Session: first request to a server after 60+ idle seconds (avoids bias)

SLIDE 32

Characterizing Failures

[Failure matrix: rows DNS / server unreach / server RST / client access / wide-area; columns local interfaces (MIT, Cogent, DSL) and peer proxies. X marks the server.]

  • 2+ peers reachable, but no peer or link could reach the server (40% of these servers unreachable during post-analysis)

SLIDE 33

Failure Breakdown

MIT: 137,612 sessions

Failure type     Srv    MIT    Cogent    DSL
DNS                1
Srv. unreach     173
Srv. RST          50
Client access           152       14    2016
Wide-area               201      238    1828
Availability          99.6%    99.7%     97%

Factor out server failures (until they use MONET!)

SLIDE 34

Single Link Availability

97% of MIT connections established within 1 s

[CDF of dns+connect() time. Fraction of successful connects: DSL 0.972, MIT 0.9974, Cogent 0.9977, Cogent+MIT+DSL 0.9995]

SLIDE 35

Single Link Availability

DNS retransmissions at 2 seconds

[CDF of dns+connect() time: DSL 0.972, MIT 0.9974, Cogent 0.9977, Cogent+MIT+DSL 0.9995]

SLIDE 36

Single Link Availability

DNS retransmissions at 2 seconds; TCP SYN retransmissions at 3, 6, 9, ... seconds

[CDF of dns+connect() time: DSL 0.972, MIT 0.9974, Cogent 0.9977, Cogent+MIT+DSL 0.9995]

SLIDE 37

Combined Link Availability

[CDF of dns+connect() time: DSL 0.972, MIT 0.9974, Cogent 0.9977, Cogent+DSL 0.9992, Cogent+MIT+DSL 0.9995]

  • Cheap DSL augments 100Mbit link
SLIDE 38

MONET Achieves 4 Nines

[CDF of dns+connect() time: DSL 0.972, DSL+Peers 0.974, MIT 0.9974, Cogent 0.9977, Cogent+DSL 0.9992, MIT+Peers 0.9992, Cogent+MIT+DSL 0.9995, Cogent+Peers 0.9997, All 0.9999]

  • Cheap DSL augments 100Mbit link
  • Overlays + reliable link very good
SLIDE 39

MONET with Low Overhead

How do the practical MONETs compare?

  • Optimal, Liveness, Random
  • Post-best:

– Analyze trace, determine the single “best” interface to always use first
– While no response within thresh:
  ∗ connect via a random interface or peer
  ∗ increase thresh

(Requires omniscience, but quasi-realistic.)

SLIDE 40

Achievable Resilience

[Fraction of successful connects vs. dns+connect() time (0.2–15 s): Optimal vs. single links (Cogent, DSL)]

SLIDE 41

Achievable Resilience

[Fraction of successful connects vs. dns+connect() time: Optimal, Random, Cogent, DSL]

SLIDE 42

Achievable Resilience

[Fraction of successful connects vs. dns+connect() time: Optimal, Post-Best, Random, Cogent, DSL]

SLIDE 43

Achievable Resilience

[Fraction of successful connects vs. dns+connect() time: Optimal, Liveness, Post-Best, Random, Cogent, DSL; Liveness tracks Optimal closely]

  • 10% more SYNs (< 1% packets), near optimal
SLIDE 44

What we didn’t talk about

  • Discounted server failures: some servers are really bad.

  • Paper: MONET + replicated services

– A more reliable subset of servers
– Presumably, operators care more...

✔ 8x better availability including server failures.

SLIDE 45

Related Work

  • SOSR (OSDI ’04): single-hop NAT-based overlay routing; probing-based study

  • Akella et al.: multihoming, Akamai-based study ➔ similar underlying network performance

  • Commercial products (Stonesoft, Sockeye, ...): tactics, performance, formalizing the problem

  • Content Delivery Networks: MONET improves availability

SLIDE 46

Summary

  • Expose multiple paths to end-system

– Choose one that works end-to-end

  • Necessary location for availability engineering
  • Multihoming without routing support
  • Resilience achievable with low overhead
  • Experience with a 2-year deployment and 100s of users: avoids 90% of failures to reliable sites

http://nms.lcs.mit.edu/ron/ronweb/

SLIDE 47

Bulk Transfers

  • Use application knowledge

– Static objects only
– HTTP parallel transfers (“Paraloaders”)

  • Dykes et al. server selection + our tests

– First-response SYN effective

  • Mid-stream failover

– SCTP, Migrate, Host ID schemes, others...
– Range requests / app-specific tactics

SLIDE 48

TCP CONTROL DEFER socket option

  • Switch to a new server if the SYN is lost; still works if the SYN is delayed > 3 seconds

  • Avoid 3-way handshake completion for all but one connection

Time   Source        Dest           Type
54:31  client.3430 > server-A.80    SYN
54:34  client.3430 > server-A.80    SYN
· · ·
55:05  client.3430 > server-A.80    SYN
55:17  client.3432 > server-B.80    SYN
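The trace above (retry server-A's SYN, then open a new connection to server-B) can be approximated at user level; a sketch that uses an ordinary connect timeout rather than the proposed socket option, with `primary` and `backup` as hypothetical server addresses:

```python
import socket

def connect_with_failover(primary, backup, syn_timeout=3.0):
    """Connect to `primary`; if the SYN gets no answer (or is refused)
    within syn_timeout seconds, fall back to `backup`.
    (User-level approximation: the slide's socket option instead defers
    handshake completion in the kernel for all but one connection.)"""
    for addr in (primary, backup):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(syn_timeout)
        try:
            s.connect(addr)
            s.settimeout(None)     # back to blocking mode for the transfer
            return s, addr
        except OSError:            # timeout or connection refused
            s.close()
    return None, None
```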

SLIDE 49

Characterizing Failures

[Failure matrix: rows DNS / server unreach / server RST / client access / wide-area; columns local interfaces (MIT, Cogent, DSL) and peer proxies. X marks DNS.]

  • Peers reachable, but no peer or interface could resolve DNS.

SLIDE 50

Characterizing Failures

[Failure matrix: X marks the server (server unreachable).]

  • 2+ peers reachable, but no peer or link could reach the server (40% of these servers unreachable during post-analysis)

SLIDE 51

Characterizing Failures

[Failure matrix: server answered with RST.]

  • Server refused TCP connections; network OK end-to-end.

SLIDE 52

Characterizing Failures

[Failure matrix: X marks one local interface and everything reached through it (client access failure).]

  • No peers, DNS, or server reachable via one link; peers and server working via other links.

SLIDE 53

Characterizing Failures

[Failure matrix: X marks the wide-area path from one link to the server.]

  • Server not reachable via one link, though that link can reach peers; server reachable via a peer or another link.

SLIDE 54

Measurement

Packet-level traces at each node:

  • TCP to server, all DNS lookups
  • UDP overlay queries

Application traces:

  • Proxy request parameters, TCP sessions, DNS queries, overlay queries

  • DNS server query log

Sliding-window join links application logs to local and remote packet logs.

SLIDE 55

When to probe: Practical Solution

Conservative estimator from aggregate connection behavior:

  • rttest: expected connect() time

    rttest ← q · rttest + (1 − q) · rtt

  • rttdev: average linear deviation (> σ)
  • thresh = rttest + 4 · rttdev

✔ Easily computed, little state, effective
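A minimal sketch of this estimator; the gain q is not given on the slide, so 0.875 (the familiar TCP-style smoothing default) is assumed here:

```python
class ProbeThreshold:
    """Conservative probe timer from aggregate connect() behavior:
    rttest <- q*rttest + (1-q)*rtt; thresh = rttest + 4*rttdev."""
    def __init__(self, q=0.875, initial_rtt=0.5):
        self.q = q
        self.rttest = initial_rtt   # smoothed connect() time estimate
        self.rttdev = 0.0           # smoothed linear deviation

    def update(self, rtt):
        # Update the deviation against the old estimate, then the estimate.
        self.rttdev = self.q * self.rttdev + (1 - self.q) * abs(rtt - self.rttest)
        self.rttest = self.q * self.rttest + (1 - self.q) * rtt

    def thresh(self):
        return self.rttest + 4 * self.rttdev
```

As the slide says: easily computed, little state, and the threshold widens automatically when connect() times become variable.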