Monitoring the NZ Internet with AMP: WAND Update, NZNOG 2014



SLIDE 1

Monitoring the NZ Internet with AMP

WAND Update NZNOG 2014

SLIDE 2

The Active Measurement Project

  • Monitor machines situated throughout the NZ Internet
      • Participating NZ ISPs
      • Universities
      • Alongside name servers
  • Monitors continually perform a scheduled set of measurements (tests)
      • Target other AMP monitors and sites of interest
  • Test frequency
      • Low impact tests: at least once a minute
      • High impact tests: multiple minutes between tests
  • Results give a view of network performance between sites over time
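
A minimal sketch of how a test schedule with different frequencies might be expressed. The test names, targets and periods below are hypothetical placeholders chosen for illustration; they are not AMP's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledTest:
    name: str       # hypothetical test name
    target: str     # another monitor or a site of interest (placeholder)
    period_s: int   # seconds between runs (assumed values)


# Low impact tests run every minute, a higher impact test every ten minutes.
SCHEDULE = [
    ScheduledTest("icmp", "monitor-b.example", 60),
    ScheduledTest("dns", "ns.example", 60),
    ScheduledTest("http", "www.example", 600),
]


def due_tests(schedule, now_s):
    """Return the tests whose period evenly divides the current second count."""
    return [t for t in schedule if now_s % t.period_s == 0]
```

Calling `due_tests(SCHEDULE, now_s)` once a second (or once a minute) yields the tests to launch at that moment.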
SLIDE 3

The Active Measurement Project

  • We’ve been running AMP for 10 years
      • You may remember it from previous NZNOGs
  • Current MBIE-funded project
      • NZ Internet is critical infrastructure
      • Requires constant monitoring
  • Rewriting AMP software from the ground up to better serve this purpose
      • Apply lessons learned from the earlier deployment
  • Combine with our existing work in anomaly detection
      • Find network events, report on them so they can be resolved asap
SLIDE 4

Measuring the Internet

  • The modern Internet is very dynamic
      • Some services move around in ways that most people can’t predict
      • Physical or logical location can affect how services treat you
  • All tests make extensive use of DNS
      • Important to test to the addresses that users will actually hit
      • Resolve addresses every time the test is run
      • Often there are many of these, test to them all!
  • We are *starting* to see IPv6 deployment
      • A large number of sites don’t do IPv6, but we are ready for them
      • The IPv6 path is often poor compared to the IPv4 one
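
The idea of resolving a name afresh on every run and testing every address it returns, IPv4 and IPv6 alike, can be sketched with the Python standard library. This is an illustration only, not AMP's own resolver code; `resolve_all` and `FAMILY_NAMES` are names made up for this example.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def resolve_all(host, port=80):
    """Resolve host now and return every distinct (family, address) pair."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    seen = []
    for family, _type, _proto, _canonname, sockaddr in infos:
        entry = (FAMILY_NAMES.get(family, str(family)), sockaddr[0])
        if entry not in seen:
            seen.append(entry)
    return seen


# Usage: call resolve_all() at the start of each test run and probe every
# address in the result, rather than caching one address for the target.
```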
SLIDE 5

Current AMP Tests

  • ICMP Ping
      • Latency and loss from monitor to target
  • Traceroute
      • Route and path lengths from monitor to target
  • DNS
      • Response time for queries to target DNS server
  • HTTP
      • Performance when fetching all elements of a webpage
      • Pipelining, multiple/parallel connections, caching, etc.
      • “User experience”
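
As a rough illustration of what "latency and loss" means for a single round of probes, the sketch below summarises a list of round-trip times in which a lost probe is recorded as `None`. The function name and the choice of summary fields are assumptions made for this example, not AMP's result format.

```python
from statistics import median


def summarise_ping(rtts_ms):
    """Summarise one round of probes; None marks a probe with no reply."""
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    replies = [r for r in rtts_ms if r is not None]
    return {
        "sent": len(rtts_ms),
        "loss": 1 - len(replies) / len(rtts_ms),
        "min_ms": min(replies) if replies else None,
        "median_ms": median(replies) if replies else None,
    }
```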
SLIDE 6

Upcoming AMP Tests

  • Throughput
      • How much data can we push between monitors
  • TCP Ping
      • Many sites firewall ICMP, so need an alternative
  • High-rate Ping and UDP Packet Streams
      • Network jitter and reordering
      • Loss characterisation
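
One common way to approximate a TCP ping is to time the TCP connection handshake, which works where ICMP is firewalled but a TCP port is open. The sketch below takes that approach and also shows one simple jitter measure (mean absolute difference between consecutive samples). Both are assumptions for illustration; the slides do not say how the AMP tests are implemented.

```python
import socket
import time


def tcp_ping(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def jitter_ms(rtts_ms):
    """Mean absolute difference between consecutive round-trip times."""
    if len(rtts_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return sum(diffs) / len(diffs)
```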
SLIDE 7

Data Collection

  • Developed system for persistent storage of network measurement data
      • Backed by a PostgreSQL database
  • Don’t aggregate or discard any measurements (unlike RRD)
      • Always get full detail, even if you go back a year
      • Disk space is cheap, so storage isn’t a major issue
      • Query speed is the big obstacle
  • Flexible design
      • Easily extended to store data collected by other measurement tools
          • Smokeping, Munin, Cacti, etc.
          • Other WAND measurement projects
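
The "store everything, aggregate only when asked" idea can be sketched as follows. The example uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; the real system is backed by PostgreSQL, and the table and column names here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE latency (
        source TEXT NOT NULL,
        target TEXT NOT NULL,
        ts     INTEGER NOT NULL,  -- Unix time of the measurement
        rtt_ms REAL               -- NULL when the probe was lost
    )
    """
)
conn.execute("CREATE INDEX latency_lookup ON latency (source, target, ts)")

# Every raw measurement is kept; nothing is rolled up at insert time.
rows = [
    ("mon-a", "www.example", 0, 30.0),
    ("mon-a", "www.example", 60, 32.0),
    ("mon-a", "www.example", 120, None),
    ("mon-a", "www.example", 300, 150.0),
]
conn.executemany("INSERT INTO latency VALUES (?, ?, ?, ?)", rows)


def binned_mean(conn, source, target, bin_s):
    """Aggregate at query time: (bin start, mean RTT, lost probes) per bin."""
    return conn.execute(
        """
        SELECT (ts / ?) * ? AS bin_start,
               AVG(rtt_ms),
               COUNT(*) - COUNT(rtt_ms)
        FROM latency
        WHERE source = ? AND target = ?
        GROUP BY bin_start
        ORDER BY bin_start
        """,
        (bin_s, bin_s, source, target),
    ).fetchall()
```

Because the raw rows are untouched, the same data can be re-binned at any resolution later; the cost moves from storage to query time, which matches the point that query speed is the main obstacle.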
SLIDE 8

Visualisation

  • Revamped AMP graphs
      • Interactive rather than static graphs
      • Easier on the eyes
  • Matrix
      • Condensed design to support large meshes
      • Show IPv4 and IPv6 results at the same time
  • Graph styles
      • Grouping measurements to create Smokeping-style graphs
      • Rainbow graphs to visualise traceroute paths
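
A Smokeping-style graph draws a central line with a band showing the spread of individual measurements around it. A minimal sketch of the grouping step, assuming samples arrive as `(timestamp, rtt_ms)` pairs and using min/median/max as the band (a simplification chosen for this example):

```python
from collections import defaultdict
from statistics import median


def smoke_bands(samples, bin_s=60):
    """Group (timestamp, rtt_ms) samples into bins of bin_s seconds.

    Returns one (bin_start, min, median, max) tuple per non-empty bin.
    """
    bins = defaultdict(list)
    for ts, rtt_ms in samples:
        bins[ts - ts % bin_s].append(rtt_ms)
    return [
        (start, min(values), median(values), max(values))
        for start, values in sorted(bins.items())
    ]
```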
SLIDE 9
SLIDE 10

Matrix

SLIDE 11
SLIDE 12
SLIDE 13
SLIDE 14
SLIDE 15

Event Detection

  • Automate finding interesting changes in network behaviour
      • Increases in latency, taking the scenic route, traffic bursts / plunges
  • Needs to happen in close to real-time
      • Inform the network operator before the phones start ringing
  • Minimal false positive rate
      • Don’t want to be crying wolf too often
  • Grade events based on their significance and alert accordingly
      • Send text to operator if event is very urgent
      • Send email if less urgent
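
Grading events and alerting accordingly could look something like the sketch below. The 0 to 1 severity scale, the two thresholds and the "log" fallback for minor events are all assumptions made for illustration; the slides only specify text for very urgent events and email for less urgent ones.

```python
def alert_channel(severity, text_threshold=0.8, email_threshold=0.4):
    """Map a severity score in [0, 1] to a notification channel."""
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must be between 0 and 1")
    if severity >= text_threshold:
        return "text"    # very urgent: reach the operator immediately
    if severity >= email_threshold:
        return "email"   # less urgent: can wait for the inbox
    return "log"         # assumed fallback: record only, no notification
```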
SLIDE 16

Event Detection

  • Network measurements are essentially time series data
      • Plenty of techniques for finding anomalies in time series
      • Not many of these techniques work well in real-time
      • Trade-off between accuracy and timeliness
  • No one technique to rule them all
      • Detecting spikes vs detecting plunges
      • Noisy data vs consistent data
      • Tendency towards false positives
SLIDE 17

Event Detection - Fusion

  • Our approach: data fusion
      • Implement any potentially useful detector
      • Combine results and infer likelihood of an event
  • If many detectors fire around the same time, it’s probably an event
      • Less reliable detectors tend to fire first -- early warning
      • More reliable detectors fire later -- confirmation
  • Exploring different data fusion techniques
      • Dempster-Shafer, Bayes, Fuzzy Logic and others
      • Current Masters project
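
To make the fusion idea concrete, here is a small worked example of Dempster's rule of combination, one of the techniques named above. Each detector assigns belief mass to "event", "normal", or "either" (uncertainty). The mass values are invented for illustration, and this is a textbook form of the rule, not necessarily how the project applies it.

```python
from itertools import product

EVENT = frozenset({"event"})
NORMAL = frozenset({"normal"})
EITHER = EVENT | NORMAL  # mass left uncommitted by a detector


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            conflict += wa * wb  # the two detectors contradict each other
    if conflict >= 1.0:
        raise ValueError("detectors are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical masses: an early, less reliable detector and a later,
# more reliable one that mostly agrees with it.
early = {EVENT: 0.4, EITHER: 0.6}
later = {EVENT: 0.7, NORMAL: 0.1, EITHER: 0.2}
fused = combine(early, later)
```

With these inputs the fused belief in an event is 0.8125, higher than either detector alone, which is the "early warning, then confirmation" behaviour described above.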
SLIDE 18

Event Detection - Techniques

  • Heuristic methods
      • Mode, Loss, Path Change, Plunge, Status Change
  • Threshold methods
      • Plateau, ARIMA-Shewhart
  • Variance methods
      • Jitter variance, T-Entropy
  • Probabilistic methods
      • Changepoint, Hidden Markov Model
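
As an example of the threshold family, the sketch below is a plain Shewhart-style control chart applied directly to the raw series: a point is flagged when it lies more than `k` standard deviations from the mean of the preceding window. It is not the ARIMA-Shewhart detector listed above, whose details the slides do not give; the window size, `k` and the `min_sigma` floor are assumed values.

```python
from statistics import mean, pstdev


def shewhart_alarms(series, window=10, k=3.0, min_sigma=0.5):
    """Return indices of points outside mean +/- k*sigma of the prior window."""
    alarms = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        centre = mean(history)
        # Floor sigma so a perfectly flat history does not flag tiny changes.
        sigma = max(pstdev(history), min_sigma)
        if abs(series[i] - centre) > k * sigma:
            alarms.append(i)
    return alarms
```

Note that a large spike inflates the window's standard deviation for the next few points, so a detector this simple goes briefly blind after firing; that is one reason a single technique is not enough.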
SLIDE 19

Event Detection - Ongoing Work

  • Group events based on common properties
      • If a site goes down, don’t report separate events for each monitor
  • Ranking events based on severity
      • Combine magnitude and likelihood of being significant
  • Develop system for alerting operators when major events occur
      • User-configurable
      • Learn from operator feedback
  • Continue to extend and improve library of detection methods
      • Especially for metrics other than latency
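
Grouping and ranking might be sketched as below: events for the same target that occur close together in time are merged into one group, and groups are ordered by a score. The event fields, the 300-second window and the score (magnitude multiplied by likelihood) are hypothetical choices for this example, since this is described as ongoing work.

```python
def score(event):
    """Hypothetical severity: magnitude weighted by likelihood."""
    return event["magnitude"] * event["likelihood"]


def group_events(events, window_s=300):
    """Merge same-target events within window_s; return groups by severity."""
    groups = []
    for ev in sorted(events, key=lambda e: (e["target"], e["ts"])):
        last = groups[-1] if groups else None
        if (
            last is not None
            and last["target"] == ev["target"]
            and ev["ts"] - last["last_ts"] <= window_s
        ):
            last["monitors"].append(ev["monitor"])
            last["last_ts"] = ev["ts"]
            last["severity"] = max(last["severity"], score(ev))
        else:
            groups.append(
                {
                    "target": ev["target"],
                    "first_ts": ev["ts"],
                    "last_ts": ev["ts"],
                    "monitors": [ev["monitor"]],
                    "severity": score(ev),
                }
            )
    return sorted(groups, key=lambda g: g["severity"], reverse=True)
```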
SLIDE 20
SLIDE 21

Interesting Observations

  • Unusual things happen all the time
      • Networking is hard
  • Event detection is really good for finding interesting behaviours
      • Very helpful in producing the following slides :)
SLIDE 22

Traceroute graph showing alternative paths

SLIDE 23

Latency graphs with “smoke” and loss colouring

SLIDE 24

Google moves around a lot! 30ms to >150ms

SLIDE 25

YouTube can be far away (especially on v6)

SLIDE 26

F-root redeployment at APE went well

SLIDE 27

Well, unless you’re these guys...

SLIDE 28

Or these guys...

SLIDE 29

Or even us at Waikato...

SLIDE 30

Trade Me likes to swap datacentres pretty regularly

SLIDE 31

Traceroute graphs can make pretty patterns

SLIDE 32

Hosting an AMP monitor

http://wand.net.nz/amp/request/monitor

SLIDE 33

Feedback

  • Public website is at http://amp.wand.net.nz
  • Send us any comments, suggestions, bug reports
      • amp@wand.net.nz
  • Content providers
      • Put yourself forward as a test target
      • http://wand.net.nz/amp/request/target