Best-Path vs. Multi-Path Overlay Routing, David G. Andersen (MIT)


slide-1
SLIDE 1

Best-Path vs. Multi-Path Overlay Routing

David G. Andersen (MIT) Alex C. Snoeren (UCSD) Hari Balakrishnan (MIT)

October 2003 http://nms.lcs.mit.edu/ron/

slide-2
SLIDE 2

Overview

Best-path vs. redundant overlay routing

  • What tactics work best to
    – Reduce loss?
    – Reduce latency?
    – Avoid outages?

  • In what circumstances do they perform best?
  • Implications for new strategies
slide-3
SLIDE 3

Context: Reliability via Path Diversity

  • Backup links provide alternatives

➔ Mechanisms for obtaining diversity (existing diversity)
➔ Mechanisms for using diversity (overlay techniques)

slide-4
SLIDE 4

Obtaining Diversity

Engineered diversity:

  • Exploiting existing diversity:
slide-5
SLIDE 5

Existing AS-level Redundancy

  • Traceroute between 12 hosts, showing Autonomous Systems (AS’s)

[Figure: AS-level graph of the traceroute paths between the hosts (CCI, Utah, Aros, Sightpath, NYU, CA−T1, CMU, MA−Cable, MIT, VU−NL, Cornell, ...); legend: vBNS, NYSERNet, Abilene, known private peering]

slide-6
SLIDE 6

Exploiting Diversity via overlays

  • Send packets through cooperating peers
  • End-hosts only, no network support
slide-7
SLIDE 7

Exploiting Diversity via Overlays

Reactive Routing

  • Probe paths
  • Route via best
  • RON (SOSP’01)
  • Detour

Probes and Routing Updates

slide-8
SLIDE 8

Exploiting Diversity via Overlays

Reactive Routing

  • Probe paths
  • Route via best

Probes and Routing Updates

Redundant Routing

  • Parallel paths
  • No probing
  • Mesh routing

(SOSP’01)
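The difference between the two approaches can be sketched in a few lines. This is only an illustration, not RON or mesh routing code: the path names, the loss estimates, and both function names are hypothetical, and the redundant sender is assumed to pair the direct path with one randomly chosen peer.

```python
import random

def reactive_route(loss_estimates):
    """Reactive: use probe results to pick the single best path."""
    return min(loss_estimates, key=loss_estimates.get)

def redundant_routes(direct, intermediates, rng=random):
    """Redundant: no probing; send one copy directly and one copy
    through a randomly chosen peer (assumed policy)."""
    return [direct, rng.choice(intermediates)]

# Hypothetical probe results: path name -> estimated loss rate.
estimates = {"direct": 0.05, "via-B": 0.01, "via-C": 0.02}

best = reactive_route(estimates)
copies = redundant_routes("direct", ["via-B", "via-C"], random.Random(0))
```

The reactive sender pays for its choice in probe traffic; the redundant sender pays by doubling the data traffic.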

slide-9
SLIDE 9

Reactive vs. Redundant Routing

[Figure: Desired Loss Rate Improvement (0% to 100%) vs. % Capacity used by data (0% to 100%); Data Traffic plus Probe/Redundant Traffic is bounded by the Capacity limit]

  • Capacity limits probing and redundancy
slide-10
SLIDE 10

Reactive vs. Redundant Routing

[Figure: Desired Loss Rate Improvement (0% to 100%) vs. % Capacity used by data (0% to 100%); marked limits: Capacity limit, Best Expected Path Limit, Independence Limit]

  • Reactive limit: best path performance
  • Redundant limit: Path independence
slide-11
SLIDE 11

Reactive vs. Redundant Routing

[Figure: Desired Loss Rate Improvement (0% to 100%) vs. % Capacity used by data (0% to 100%); marked limits: Capacity limit, Best Expected Path Limit, Independence Limit; regions: Reactive, Redundant]

  • Reactive limit: best path performance
  • Redundant limit: Path independence
slide-12
SLIDE 12

Reactive vs. Redundant Routing

[Figure: Desired Loss Rate Improvement (0% to 100%) vs. % Capacity used by data (0% to 100%); marked limits: Capacity limit, Best Expected Path Limit, Independence Limit; regions: Reactive, Redundant]

  • Reactive limit: best path performance
  • Redundant limit: Path independence
  • Overhead scaling: throughput vs. nodes
slide-13
SLIDE 13

8 Routing Methods

Method          Description
Direct          Single packet, direct path
Direct Direct   2 packets, direct, no spacing
DD 10ms         2 packets, direct, 10ms spacing
DD 20ms         2 packets, direct, 20ms spacing

slide-14
SLIDE 14

8 Routing Methods

Method          Description
Direct          Single packet, direct path
Direct Direct   2 packets, direct, no spacing
DD 10ms         2 packets, direct, 10ms spacing
DD 20ms         2 packets, direct, 20ms spacing
Lat             Reactive routing, min latency
Loss            Reactive routing, min loss

slide-15
SLIDE 15

8 Routing Methods

Method          Description
Direct          Single packet, direct path
Direct Direct   2 packets, direct, no spacing
DD 10ms         2 packets, direct, 10ms spacing
DD 20ms         2 packets, direct, 20ms spacing
Lat             Reactive routing, min latency
Loss            Reactive routing, min loss
Direct Rand     2 packets, redundant routing, simplest

slide-16
SLIDE 16

8 Routing Methods

Method          Description
Direct          Single packet, direct path
Direct Direct   2 packets, direct, no spacing
DD 10ms         2 packets, direct, 10ms spacing
DD 20ms         2 packets, direct, 20ms spacing
Lat             Reactive routing, min latency
Loss            Reactive routing, min loss
Direct Rand     2 packets, redundant routing, simplest
Lat Loss        2 packets, reactive + redundant (falls back to random)

slide-17
SLIDE 17

Probing on Internet Testbed

Each node repeats:

  1. Pick random node j
  2. Pick one of the 8 routing types (direct, loss, lat, etc.) in round-robin order. Send to j.
  3. Delay for random interval [0.6s - 1.2s]

Probes are one-way, recorded at sender & receiver.
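The per-node loop above might look roughly like the following sketch. It only generates the schedule (target, routing type, delay) and sends nothing. The function name, the node names, and the choice to exclude the node itself from the random pick are assumptions, not details from the testbed code.

```python
import itertools
import random

# The 8 routing types named on the "8 Routing Methods" slide.
ROUTING_TYPES = ["direct", "direct direct", "dd 10ms", "dd 20ms",
                 "lat", "loss", "direct rand", "lat loss"]

def probe_schedule(me, nodes, count, rng):
    """Yield (target, routing_type, delay_s) for `count` probe rounds."""
    others = [n for n in nodes if n != me]   # assumption: never probe self
    types = itertools.cycle(ROUTING_TYPES)   # round-robin over the 8 types
    for _ in range(count):
        target = rng.choice(others)          # step 1: pick random node j
        routing_type = next(types)           # step 2: next routing type
        delay = rng.uniform(0.6, 1.2)        # step 3: random delay in seconds
        yield target, routing_type, delay

rounds = list(probe_schedule("A", ["A", "B", "C", "D"], 16, random.Random(1)))
```

Cycling through the types in a fixed order gives every method the same number of samples, while the random target and random delay keep any one method from always seeing the same path at the same phase.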

slide-18
SLIDE 18

Datasets From Internet Deployment

Dataset     Nodes   Time      Measurements
RONwide     17      5 days    4.7M
RONnarrow   17      3 days    2.8M
RON2003     30      14 days   32.6M

✔ Variety of network types and bandwidths
  5 int’l, 3 Cable/DSL, 7 universities...
✔ N^2 path scaling: ∼ 900 paths

slide-19
SLIDE 19

One-way Loss Rates Are Low

[Figure: fraction of paths vs. average path−wide loss rate (%), 2002 and 2003 datasets; 90% of paths under 1% loss rate]

  • Overall loss 0.42% in 2003

  • Includes quiescent periods
  • Outages still (painfully) apparent
slide-20
SLIDE 20

Duplication Reduces Overall Loss

Type            Loss %
direct          0.42
direct direct   0.30
dd 10ms         0.27
dd 20ms         0.27

slide-21
SLIDE 21

Duplication Reduces Overall Loss

Type            Loss %
direct          0.42
direct direct   0.30
dd 10ms         0.27
dd 20ms         0.27
Lat             0.43
Loss            0.33
Direct Rand     0.26
Lat Loss        0.23

slide-22
SLIDE 22

Loss Probabilities Sanity Check

  • 0.42% loss << [Paxson 94,95] (2.8%, 5%).
  • Unloaded paths vs. loaded by TCP transfer
  • Conditional loss probabilities are similar

Study     Setup         P(lose P2 | lost P1)
Paxson    TCP           ∼ 50%
Bolot     8ms spacing   60%
RON2003   no spacing    72%
RON2003   20ms          65%
RON2003   direct rand   62%
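As a rough cross-check, the conditional probabilities can be multiplied back into the 0.42% direct loss rate: a duplicated packet is lost only if both copies are lost. The sketch below assumes the first copy is lost at the overall direct rate; the function name is hypothetical. The products come out close to the measured loss rates reported for the duplicated methods (0.30, 0.27, 0.26), and far above what fully independent paths would give.

```python
def duplicate_loss(p_first, p_second_given_first):
    """P(both copies lost) = P(lost P1) * P(lose P2 | lost P1)."""
    return p_first * p_second_given_first

P_DIRECT = 0.42  # overall direct loss rate, in percent

# Predicted loss (percent) for each duplicated method.
predicted = {
    "direct direct": duplicate_loss(P_DIRECT, 0.72),
    "dd 20ms": duplicate_loss(P_DIRECT, 0.65),
    "direct rand": duplicate_loss(P_DIRECT, 0.62),
}

# If the two copies failed independently, the second copy would be lost
# with the same 0.42% probability as the first.
independent = duplicate_loss(P_DIRECT, P_DIRECT / 100)  # percent
```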

slide-23
SLIDE 23

Latency Improvements

[Figure: fraction of paths (0.7 to 1) vs. latency (ms, 50 to 300) for lat loss, lat, direct rand, and direct]

Method        Mean Latency (ms)
lat loss      46.8
lat           48.0
direct rand   51.7
direct        54.1

5% of connections exhibit large latency improvement

Unlike loss, most latency from specific bad paths

slide-24
SLIDE 24

# High Loss Periods (1 hr, normalized)

Type            > 0%
direct          1 (8817)
direct direct   0.59
dd 20ms         0.43
Lat             1.2
Loss            0.80       ← Worse than naive duplication for low loss situations
Direct Rand     0.44
Lat Loss        0.38

slide-25
SLIDE 25

# High Loss Periods (1 hr, normalized)

Type            > 0%       > 30%
direct          1 (8817)   1 (630)
direct direct   0.59       0.93
dd 20ms         0.43       0.91
Lat             1.2        0.96
Loss            0.80       0.91      ← on par
Direct Rand     0.44       0.92
Lat Loss        0.38       0.89

slide-26
SLIDE 26

# High Loss Periods (1 hr, normalized)

Type            > 0%       > 30%     > 60%
direct          1 (8817)   1 (630)   1 (255)
direct direct   0.59       0.93      0.98
dd 20ms         0.43       0.91      0.98
Lat             1.2        0.96      0.91
Loss            0.80       0.91      0.86 ★
Direct Rand     0.44       0.92      0.92 ★
Lat Loss        0.38       0.89      0.84 ★
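One way to compute a table like this is sketched below. It assumes that a "period" is one path observed for one hour, that a period counts when its loss rate exceeds the threshold, and that each method is normalized by the count for direct. The function names and the sample probes are hypothetical, not taken from the measurement scripts.

```python
from collections import defaultdict

def high_loss_periods(probes, threshold, period_s=3600):
    """Count (path, period) buckets whose loss rate exceeds `threshold`.

    probes: iterable of (path, send_time_s, lost) tuples.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for path, t, was_lost in probes:
        key = (path, int(t // period_s))
        sent[key] += 1
        lost[key] += was_lost
    return sum(1 for k in sent if lost[k] / sent[k] > threshold)

def normalized(method_probes, direct_probes, threshold):
    """Method count relative to direct (assumes direct has at least one)."""
    return (high_loss_periods(method_probes, threshold)
            / high_loss_periods(direct_probes, threshold))

# Hypothetical probes on one path over three hours, four probes per hour.
direct = ([("A-B", 10 * i, i < 2) for i in range(4)]            # 50% loss
          + [("A-B", 3600 + 10 * i, i < 1) for i in range(4)]   # 25% loss
          + [("A-B", 7200 + 10 * i, False) for i in range(4)])  # no loss
mesh = ([("A-B", 10 * i, i < 1) for i in range(4)]              # 25% loss
        + [("A-B", 3600 + 10 * i, False) for i in range(4)]
        + [("A-B", 7200 + 10 * i, False) for i in range(4)])

ratio = normalized(mesh, direct, 0.0)
```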

slide-27
SLIDE 27

Measurement Summary

✔ Redundant beats reactive for low loss
  – “Meshing” beats controls during outages
✔ Reactive finds specific good paths
  – Latency improvements
  – Low loss paths
✘ No overlay technique near independent paths
  – Hypothesis: Access link failures
  – More severe outages harder to correct

slide-28
SLIDE 28

Why Not FEC?

Redundant assumption: Fast recovery, low rate
0.42% loss rate → need little redundancy

[Figure: a run of ...100 packets... in which the 1st packet is lost (X), followed by Recovery]

Failure losses bursty (≥ 0.5 conditional loss)
✘ Spread FEC over even more packets
➔ Latency-critical traffic: 2-redundant mesh
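The latency argument can be made concrete with a small worked example. It assumes a block code whose repair packet is sent after the last data packet of a 100-packet block, and a hypothetical 20 ms spacing between packets of a low-rate flow; neither number nor either function name comes from the measurements.

```python
def fec_recovery_delay_ms(block_packets, spacing_ms):
    """Wait before the first packet of a block can be repaired, if the
    repair packet follows the last data packet of the block."""
    return block_packets * spacing_ms

def duplicate_recovery_delay_ms(gap_ms):
    """With 2-redundant sending, the second copy arrives one gap later."""
    return gap_ms

SPACING_MS = 20  # hypothetical inter-packet spacing of a low-rate flow

fec = fec_recovery_delay_ms(100, SPACING_MS)  # 2000 ms
dup = duplicate_recovery_delay_ms(20)         # 20 ms, as in "dd 20ms"
```

Under these assumptions, low-overhead FEC recovers the lost packet about two seconds late, and bursty losses push toward even longer blocks, which is why duplication looks better for latency-critical traffic.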

slide-29
SLIDE 29

Conclusions

  • Loss rate for low-rate traffic low (0.42%)
  • Conditional loss probability high (0.72), even for random mesh (0.62)

  • 40-60% of loss avoidable

✔ Redundant: Avoiding low loss rates
✔ Reactive: Avoiding high loss, latency
➔ Low loss suggests selective approach ...

slide-30
SLIDE 30

Future Work

Strategies for avoiding losses and outages:

  • Selective redundancy: Protecting SYNs, etc. (shameless plug: Currently implementing)

  • Selective probing: Activate on first loss

Measurements:

  • Engineered network redundancy impact? (testing now, looking for multihomed sites)

http://nms.lcs.mit.edu/ron/

slide-31
SLIDE 31

Scaling

  • Reactive: Scales with # nodes
  • Redundant: Scales with traffic volume
slide-32
SLIDE 32

Best Path Scaling

Routing and probing add packets: Responsiveness vs. overhead vs. size

[Figure: Overhead (bits/second, 0 to 35000) vs. Number of Nodes (5 to 50); marked points: 10 nodes 2.2Kbps, 30 nodes 13.3Kbps, 50 nodes 33Kbps]

  • 50 nodes near limit, enough for many apps.
slide-33
SLIDE 33

Best Path Routing

Probes and Routing

  • Frequently measure all inter-node paths
  • Exchange routing information
  • Route along app-specific best path consistent with routing policy

slide-34
SLIDE 34

Probing and Outage Detection

[Figure: three-packet probe exchange between Node A and Node B (Initial Ping, Response 1, Response 2), each tagged ID 5, with local timestamps 10, 15, 33, and 39; one side records "success" with RTT 5, the other records "success" with RTT 6]

  • Probe every random(14) seconds
  • 3 packets, both sides get RTT and reachability
  • If “lost probe,” send next immediately

Timeout based on RTT and RTT variance

  • If N lost probes, notify outage
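The loss-handling rules above can be sketched as a small state machine. This is not the RON implementation: the smoothed RTT and variance update is borrowed from TCP's retransmission timer, and `max_lost`, `k`, `initial_timeout`, the class name, and the reading of random(14) as uniform over 0 to 14 seconds are all assumptions.

```python
import random

class ProbeState:
    """Tracks RTT estimates and consecutive lost probes for one peer."""

    def __init__(self, max_lost=4, k=4.0, initial_timeout=1.0):
        self.max_lost = max_lost              # the slide's "N" (assumed 4)
        self.k = k                            # variance multiplier (assumed)
        self.initial_timeout = initial_timeout
        self.srtt = None                      # smoothed RTT, seconds
        self.rttvar = None                    # smoothed RTT deviation
        self.lost_in_a_row = 0

    def timeout(self):
        """Timeout based on RTT and RTT variance."""
        if self.srtt is None:
            return self.initial_timeout
        return self.srtt + self.k * self.rttvar

    def on_reply(self, rtt):
        """Record a successful probe and update the estimates."""
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        self.lost_in_a_row = 0
        return "ok"

    def on_timeout(self):
        """A probe was lost: re-probe at once, or report an outage."""
        self.lost_in_a_row += 1
        if self.lost_in_a_row >= self.max_lost:
            return "outage"
        return "probe-now"

def next_probe_delay(event, rng=random):
    """Seconds until the next probe; immediate after a lost probe."""
    return 0.0 if event == "probe-now" else rng.uniform(0, 14)
```

Sending the next probe immediately after a loss shortens the time to confirm an outage to roughly N timeouts, instead of N full probe intervals.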
slide-35
SLIDE 35

Architecture: Probing

➔ Probe between nodes, determine path qualities
  – O(N^2) probe traffic with active probes
  – Passive measurements