SLIDE 1

mPlane: An Architecture for Scalable Fault Localization

Ramana Rao Kompella, Purdue University

Alex C. Snoeren and George Varghese, University of California, San Diego

ReARCH 2009

SLIDE 2

Disruptions are costly

  • Stringent latency and loss requirements
      • VoIP, IPTV, gaming (100-200 ms, small loss)
      • High-performance computing (tens of µs, very small loss)
  • Very tight SLAs with little room
      • Small amounts of extra delay (1 ms) can cause SLA violations

Disruptions in the network are significant in their impact

SLIDE 3

Debugging ISP networks

  • ISPs use active probes to detect delay spikes or loss episodes
  • Problem: active probes do not scale well
      • O(n^2) probes in today's networks
      • Paths share links, so many probes are redundant
  • Solutions:
      • Probe less frequently (one probe every few minutes or seconds)
      • Reduce the value of n by aggregating end-points
      • Measure between a smaller number of points and extrapolate (iPlane [Madhyastha06])
  • Drawback: these workarounds cannot detect many performance problems

SLIDE 4

Localizing the root cause

  • Active probes indicate that a problem occurred, not where the problem occurred
      • Problems are delay spikes or loss episodes
  • Tomography approaches help answer this to some extent [Chen04, Duffield03]
      • Problem: inference is under-constrained
  • Hence, manual debugging and troubleshooting

Main question: How to perform scalable fault localization?

SLIDE 5

mPlane: Basic idea

[Figure: example network of routers connected by links]

SLIDE 6

mPlane: Basic idea

  • Key idea: break end-to-end paths into “segments” (e.g., router forwarding paths, links)
  • High-fidelity measurements local to a segment
  • For a network with n routers and m links, the total number of segments is O(nd^2 + m), where d is the average degree

[Figure: example network with one segment labeled]
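
The segment count above can be checked with a small sketch. It assumes that a router with k interfaces contributes at most k^2 forwarding paths (one per input/output interface pair) and that each link is one more segment; the function name and the ring topology are illustrative and not from the talk.

```python
def count_segments(adjacency):
    """Upper bound on segments: per-router interface pairs plus links.

    adjacency maps each router to the set of its neighbors (undirected).
    """
    # Assumption: a router of degree k has at most k*k forwarding paths.
    forwarding_paths = sum(len(nbrs) ** 2 for nbrs in adjacency.values())
    # Every undirected link appears twice in the adjacency map.
    links = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    return forwarding_paths + links


def ring(n):
    """Hypothetical test topology: n routers connected in a ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


topology = ring(100)
segments = count_segments(topology)              # 100 * 2^2 + 100 links
end_to_end_paths = len(topology) * (len(topology) - 1)
print(segments, end_to_end_paths)
```

For this sparse example the segment count grows linearly in n, while the number of ordered end-to-end pairs grows quadratically.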

SLIDE 7

Advantages of segment approach

  • Advantage 1: Probes can be injected within a local segment at high frequency
      • Measurements are not amplified by path lengths
  • Advantage 2: Direct fault localization of end-to-end paths
      • No need for indirect approaches such as tomography
  • End-to-end active probes may still need to be issued, but with lower frequency

SLIDE 8

Architecture of mPlane

  • Internal component: measures properties of all forwarding paths within routers
  • External component: measures properties of links across routers
  • Network Operations Center (NOC): measurements are reported periodically to the NOC

SLIDE 9

Internal component

  • Use data structures such as the Lossy Difference Aggregator (LDA) [Kompella09]
      • Reports aggregate latency measurements using a few counters at both interfaces
      • State is periodically transmitted between sender and receiver (very little overhead)
      • Can measure loss and latency in a scalable fashion
      • Typically required for each measurement equivalence class if QoS is enabled
  • mPlane itself is oblivious to LDA
      • Any data structure that can report router latency measurements works fine
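
To make the "few counters at both interfaces" idea concrete, here is a deliberately simplified sketch in the spirit of LDA. It is not the published design of [Kompella09], which has further mechanisms that are not reproduced here. The class name, the bucket count, and the use of CRC32 as the packet hash are all assumptions made for illustration.

```python
import zlib


class DelayAggregator:
    """One side (sender or receiver) of a simplified LDA-style aggregator."""

    def __init__(self, num_buckets=8):
        self.ts_sum = [0.0] * num_buckets   # sum of timestamps per bucket
        self.count = [0] * num_buckets      # number of packets per bucket

    def record(self, packet_id: bytes, timestamp: float) -> None:
        # Both sides hash the packet the same way, so a packet lands in
        # the same bucket index at the sender and at the receiver.
        bucket = zlib.crc32(packet_id) % len(self.count)
        self.ts_sum[bucket] += timestamp
        self.count[bucket] += 1


def estimate(sender, receiver):
    """Return (average delay or None, number of lost packets)."""
    delay_total, usable_packets = 0.0, 0
    for s_sum, s_cnt, r_sum, r_cnt in zip(
        sender.ts_sum, sender.count, receiver.ts_sum, receiver.count
    ):
        # A bucket is usable only if it saw no loss (equal counts).
        if s_cnt == r_cnt and s_cnt > 0:
            delay_total += r_sum - s_sum
            usable_packets += s_cnt
    lost = sum(sender.count) - sum(receiver.count)
    average = delay_total / usable_packets if usable_packets else None
    return average, lost


# Hypothetical traffic: 100 packets, 0.5 time units of delay, one loss.
sender, receiver = DelayAggregator(), DelayAggregator()
for i in range(100):
    pkt = str(i).encode()
    sender.record(pkt, float(i))
    if i != 42:
        receiver.record(pkt, i + 0.5)

avg_delay, lost = estimate(sender, receiver)
print(avg_delay, lost)
```

Only the per-bucket sums and counts need to be exchanged between the two interfaces, which is why the reporting overhead stays small.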

SLIDE 10

External component

  • Measure properties of links
      • Link properties typically vary less
      • Optical re-configuration may change the delay to some extent
  • Routers periodically inject active probes to the neighbor
  • Can also piggyback on control packets exchanged between two neighboring routers
      • Example: OSPF Hello packets, time synchronization packets, etc.

SLIDE 11

mPlane deployment

  • Clean-slate deployment is fairly straightforward
      • Each router reports measurements of all forwarding paths and links
      • Any end-to-end path problem can now be correlated directly with individual segment measurements to isolate the root cause
  • Fork-lift upgrade is difficult
      • How to deploy mPlane incrementally?
      • How can a subset of routers cover the entire network?
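
For the clean-slate case, the correlation step might look like the sketch below. It assumes the NOC holds a baseline and a latest latency report for every segment; the segment labels, the latency values, and the threshold factor are invented for illustration and do not come from the talk.

```python
def localize(path_segments, latest_ms, baseline_ms, factor=2.0):
    """Return the segments on a path whose latest latency is anomalous.

    A segment is flagged when its latest report exceeds `factor` times
    its baseline (a hypothetical rule chosen for this sketch).
    """
    return [
        seg for seg in path_segments
        if latest_ms[seg] > factor * baseline_ms[seg]
    ]


# Hypothetical end-to-end path, broken into links and a forwarding path.
path = ["link A-F", "router F: A->C", "link F-C"]
baseline_ms = {"link A-F": 1.0, "router F: A->C": 0.05, "link F-C": 1.2}
latest_ms = {"link A-F": 1.1, "router F: A->C": 3.0, "link F-C": 1.2}

suspects = localize(path, latest_ms, baseline_ms)
print(suspects)
```

Because every segment on the path is measured directly, the lookup needs no inference step of the kind tomography requires.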

SLIDE 12

Partial deployment example

[Figure: example topology with routers A to F, OSPF weights on the links, a subset of upgraded routers, and a measurement server (m-server)]

SLIDE 13

Self-sourced OSPF shortest paths

[Figure: OSPF shortest-path tree rooted at A, spanning routers A to F]

  • Cut the OSPF shortest-path tree whenever an upgraded router or a leaf is encountered
  • Measuring the nodes within the sub-tree below F is handled by F
  • The m-set for A consists of {F, B}
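
The cut rule above can be sketched as a shortest-path-tree traversal. The topology and weights below are hypothetical (the figure's actual links are not recoverable from the slide text); they were chosen only so that the m-set for A comes out as {F, B}. Ties between equal-cost paths are broken arbitrarily in this sketch.

```python
import heapq


def shortest_path_tree(graph, root):
    """Dijkstra from root; returns {node: [children in the tree]}."""
    dist, parent = {root: 0}, {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    children = {u: [] for u in graph}
    for v, u in parent.items():
        children[u].append(v)
    return children


def m_set(graph, upgraded, root):
    """Walk the tree from root, cutting at upgraded routers and leaves."""
    children = shortest_path_tree(graph, root)
    result, stack = set(), list(children[root])
    while stack:
        node = stack.pop()
        if node in upgraded or not children[node]:
            result.add(node)              # cut here: probe this node
        else:
            stack.extend(children[node])  # keep descending
    return result


# Hypothetical topology with OSPF weights; A and F are upgraded.
graph = {
    "A": {"F": 1, "B": 2},
    "B": {"A": 2},
    "C": {"F": 1, "E": 1},
    "D": {"F": 2},
    "E": {"C": 1},
    "F": {"A": 1, "C": 1, "D": 2},
}
upgraded = {"A", "F"}
print(sorted(m_set(graph, upgraded, "A")))
print(sorted(m_set(graph, upgraded, "F")))
```

In this example A stops at F because F is upgraded, and F in turn probes the leaves D and E in its own sub-tree.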

SLIDE 14

Evaluating the benefits

  • Two main metrics
      • Probe Hop Count: sum of all hops taken by every active probe in the network
      • Localization Granularity: average segment size in number of hops
  • Upgrade strategy involves picking the right routers for upgrade
      • Naïve strategy: pick routers at random
      • Intelligent strategy: pick routers that decrease the probe hop count and localization granularity the most
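
Both metrics are simple sums over paths, as the following sketch shows. It assumes each probed path (a list of routers) is treated as one segment, which is an interpretation rather than something the slides state; the probe paths are hypothetical.

```python
def probe_hop_count(probe_paths):
    """Sum of hops taken by every probe; a path of k routers has k-1 hops."""
    return sum(len(path) - 1 for path in probe_paths)


def localization_granularity(segments):
    """Average segment size in hops."""
    return sum(len(seg) - 1 for seg in segments) / len(segments)


# Hypothetical probes issued by two upgraded routers, A and F.
probes = [
    ["A", "F"],
    ["A", "B"],
    ["F", "A"],
    ["F", "D"],
    ["F", "C", "E"],
]
print(probe_hop_count(probes))
print(localization_granularity(probes))
```

A granularity close to 1 means that most faults can be pinned to a single hop.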

SLIDE 15

Intelligent upgrade strategy

  • Greedily pick routers that are present on the largest number of shortest paths
      • Benefits localization granularity by reducing the length of most segments
      • Benefits probe hop count since the largest number of paths are shortened
  • Not necessarily optimal, but greedy works much better than random, as our evaluation shows
      • An LP formulation should be possible
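
One way to read the greedy rule is as a one-pass ranking: count how many shortest paths each router lies on and upgrade the top k. The sketch below makes that assumption, uses a single shortest path per router pair (no ECMP), and counts path endpoints as well as interior routers; the talk does not specify these details, and the topology is hypothetical.

```python
import heapq
from collections import Counter


def shortest_paths_from(graph, src):
    """Dijkstra from src; returns {dst: [src, ..., dst]}."""
    dist, parent = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    paths = {}
    for dst in dist:
        if dst == src:
            continue
        path = [dst]
        while path[-1] != src:
            path.append(parent[path[-1]])
        paths[dst] = path[::-1]
    return paths


def pick_upgrades(graph, k):
    """Rank routers by the number of shortest paths they appear on."""
    on_paths = Counter()
    for src in graph:
        for path in shortest_paths_from(graph, src).values():
            on_paths.update(path)
    # Highest count first; ties broken by router name for determinism.
    ranked = sorted(on_paths.items(), key=lambda kv: (-kv[1], kv[0]))
    return [router for router, _ in ranked[:k]]


# Hypothetical topology with OSPF weights.
graph = {
    "A": {"F": 1, "B": 2},
    "B": {"A": 2},
    "C": {"F": 1, "E": 1},
    "D": {"F": 2},
    "E": {"C": 1},
    "F": {"A": 1, "C": 1, "D": 2},
}
print(pick_upgrades(graph, 2))
```

A stricter greedy variant would recompute the counts after each pick, discounting paths already cut by previously chosen routers.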

SLIDE 16

Benefit in bandwidth reduction (Sprint Rocketfuel topology, 315 routers)

  • Upgrading just 40 routers reduces probe hop count by 2 orders of magnitude
  • The random strategy falls off relatively slowly

SLIDE 17

Localization granularity (segment size in number of hops)

[Figure: localization granularity (y-axis, 1 to 5) vs. number of upgraded routers (x-axis, 50 to 350), for the Intelligent (avg) and Random (avg) strategies]

  • Upgrading just 50 routers reduces localization granularity to about 1.5

SLIDE 18

Summary

  • Proposed mPlane for direct and scalable fault localization
  • Key idea is to break end-to-end paths into segments, and monitor them with high fidelity
  • Partial deployment using OSPF shortest path trees to determine upgraded routers to probe
  • Benefits of an intelligent upgrade strategy
      • 100x reduction in bandwidth
      • Localization granularity of 1.5
      • With only 15% of routers upgraded

SLIDE 19

Thanks! Questions…

SLIDE 20

Other details

  • Routers advertise the measurement capability using one reserved bit within the options field of HELLO messages
  • M-sets may change with OSPF LSAs as shortest paths change during link failures
  • During periods of churn, need to conduct measurements to both old and new m-sets
  • ECMP is handled by routers issuing separate probes through separate paths