Fault Detection at Scale – Giacomo Bagnoli

SLIDE 1

INFRASTRUCTURE

SLIDE 2

Fault Detection at Scale

Giacomo Bagnoli
Production Engineer, PE Network Monitoring

INFRASTRUCTURE

SLIDE 3

Agenda

  • Monitoring the network
  • How and what
  • NetNORAD
  • NFI
  • TTLd
  • Netsonar
  • Lessons learned
  • Future

SLIDE 4

Network Monitoring

SLIDE 5

Fabric Networks

  • Multi-stage Clos topologies
  • Lots of devices and links
  • BGP only
  • IPv6 >> IPv4
  • Large ECMP fan-out
SLIDE 6

[Map: data center locations – Prineville, OR; Los Lunas, NM; Papillion, NE; Fort Worth, TX; Forest City, NC; Altoona, IA; Clonee, Ireland; Luleå, Sweden; Odense, Denmark]

SLIDE 7

Active Network Monitoring

Why is passive not enough?

  • SNMP: trusting the network devices
  • Host TCP retransmits: packet loss is everywhere
  • Active network monitoring:
    • Inject synthetic packets in the network
    • Treat the devices as black boxes, see if they can forward traffic
    • Detect which service is impacted, triangulate loss to device/interface

SLIDE 8
SLIDE 9

NetNORAD, NFI, NetSONAR, TTLd

SLIDE 10

What!?

  • NetNORAD: rapid detection of faults
  • TTLd: end-to-end retransmit and loss detection using production traffic
  • NFI: isolate a fault to a specific device or link
  • Netsonar: up/down reachability info

SLIDE 11

NetNORAD

  • A set of agents injects synthetic UDP traffic (sketch below)
  • Targeting all machines in the fleet
  • targets >> agents
  • Responder is deployed on all machines
  • Collect packet loss and RTT
  • Report and analyze
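
The core loop is simple. Below is a minimal agent/responder sketch in the NetNORAD style; the UDP port number and packet format are made up for illustration, and the production tool is far more elaborate.

    # Minimal agent/responder sketch (assumed port and packet format).
    import socket
    import struct
    import time

    PORT = 31338  # hypothetical responder port

    def responder():
        """Echo each probe back so the agent can measure RTT and loss."""
        sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
        sock.bind(("::", PORT))
        while True:
            data, addr = sock.recvfrom(64)
            sock.sendto(data, addr)

    def probe(target, count=100, timeout=0.5):
        """Send `count` UDP probes to `target`; return (loss_ratio, rtts)."""
        sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        rtts, lost = [], 0
        for seq in range(count):
            sent = time.monotonic()
            sock.sendto(struct.pack("!Id", seq, sent), (target, PORT))
            try:
                sock.recvfrom(64)
                rtts.append(time.monotonic() - sent)
            except socket.timeout:
                lost += 1
        return lost / count, rtts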

SLIDE 12

NetNORAD – Target Selection

[Diagram: pods P01/P02 in clusters FRC1, FRC2, CLN1, CLN2 across regions FRC and CLN, with targets selected at DC, REGION, and GLOBAL scope]
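
As a rough illustration of the idea only: each agent probes pods in its own DC, in other DCs of its region, and in other regions, so loss can later be attributed to the right scope. Field names and data shapes below are assumptions, not NetNORAD's actual schema.

    # Hypothetical per-scope target selection.
    import random

    def pick_targets(all_pods, me, per_scope=2):
        """all_pods: dicts like {"name": "frc1.p01", "dc": "frc1", "region": "frc"}."""
        scopes = {
            "dc": [p for p in all_pods
                   if p["dc"] == me["dc"] and p["name"] != me["name"]],
            "region": [p for p in all_pods
                       if p["region"] == me["region"] and p["dc"] != me["dc"]],
            "global": [p for p in all_pods if p["region"] != me["region"]],
        }
        return {scope: random.sample(pods, min(per_scope, len(pods)))
                for scope, pods in scopes.items()}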

SLIDE 13

NetNORAD – Data Pipeline

  • Agents report to a fleet of aggregators
    • pre-aggregated result data per target
  • Aggregators
    • calculate per-pod percentiles for loss/RTT (sketch below)
    • augment data with locality info
  • Reporting
    • to SCUBA
    • timeseries data
    • alarming
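
A toy version of the per-pod aggregation step; the report shape and the particular percentiles are assumptions for illustration.

    # Aggregator sketch: group per-target reports by pod and compute percentiles.
    from collections import defaultdict

    def percentile(values, p):
        """Nearest-rank percentile; good enough for a sketch."""
        values = sorted(values)
        idx = min(len(values) - 1, round(p / 100.0 * (len(values) - 1)))
        return values[idx]

    def aggregate(reports):
        """reports: iterable of (pod, loss_ratio, rtt_ms) tuples."""
        by_pod = defaultdict(lambda: {"loss": [], "rtt": []})
        for pod, loss, rtt in reports:
            by_pod[pod]["loss"].append(loss)
            by_pod[pod]["rtt"].append(rtt)
        return {pod: {"loss_p50": percentile(d["loss"], 50),
                      "loss_p90": percentile(d["loss"], 90),
                      "rtt_p99": percentile(d["rtt"], 99)}
                for pod, d in by_pod.items()}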

SLIDE 14

Observability

Using SCUBA

  • By scope – isolates issue to
    • Backbone, region, DC
  • By cluster/pod – isolates issue to
    • A small number of FSWs
  • By EBB/CBB
    • Is replication traffic affected?
  • By Tupperware Job
    • Is my service affected?
SLIDE 15

NFI – Network Fault Isolation

  • Gray network failures
  • Detect and triangulate to device/link
  • Auto remediation
  • Also useful dashboards (timeseries and SCUBA)

SLIDE 16

NFI – How (shortest version possible)

  • Probe all paths by rotating the SRC port (ECMP; sketch below)
  • Run traceroutes (also on reverse paths)
  • Associate loss with path info
  • Scheduling and data processing similar to NetNORAD
  • Thrift-based (TCP)
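
The path-exploration trick in a very condensed sketch: each distinct source port changes the 5-tuple, so the fabric typically hashes the probe onto a different ECMP path; per-source-port loss plus traceroute data then points at a device or link. Note this sketch uses UDP against the hypothetical responder port from the earlier NetNORAD sketch purely for brevity, whereas NFI itself is Thrift/TCP based.

    # ECMP path exploration sketch: one probe per pinned source port.
    import socket

    def probe_paths(target, dst_port, src_ports, payload=b"nfi-probe"):
        results = {}
        for sport in src_ports:
            sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
            sock.bind(("::", sport))       # pin the source port for this probe
            sock.settimeout(0.5)
            try:
                sock.sendto(payload, (target, dst_port))
                sock.recvfrom(64)
                results[sport] = "ok"
            except socket.timeout:
                results[sport] = "lost"
            finally:
                sock.close()
        return results

    # e.g. probe_paths("2401:db00::1", 31338, range(33000, 33064))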

SLIDE 17

NetSONAR

  • Blackbox monitoring tool
  • Sends ICMP probes to network switches (sketch below)
  • Provides reachability information: is it up or down?
  • Scheduling and data pipeline similar to NetNORAD and NFI
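
A bare-bones reachability check in the same spirit, just shelling out to the system ping (Linux-style flags); the actual NetSONAR crafts its own ICMP probes and feeds a full pipeline.

    # Up/down sketch: a switch is "up" if it answers at least one ICMP echo.
    import subprocess

    def is_reachable(switch, count=3, timeout_s=1):
        proc = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout_s), switch],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return proc.returncode == 0

    # e.g. status = {sw: is_reachable(sw) for sw in switch_inventory}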

SLIDE 18

TTLd – Mixed Passive / Active approach

  • Main goal: surface end-to-end retransmits throughout the network
  • Use production packets as probes
  • A mixed approach (not passive, not active)
  • End hosts mark one bit in the IP header when the packet is a retransmission
    • Uses MSB of TTL/Hop Limit (sketch below)
    • Marking is done by an eBPF program on end hosts
  • A collection framework collects stats from devices (sampled data)
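
The marking trick itself fits in a few lines. This is only a conceptual illustration in Python; in TTLd the marking runs as an eBPF program on the host egress path, and the sketch assumes normal traffic keeps its TTL/Hop Limit below 128 so the MSB is free to carry the retransmit flag.

    # Conceptual TTLd marking/decoding: the MSB of the 8-bit TTL / Hop Limit
    # carries "this packet is a TCP retransmission".
    RETRANSMIT_BIT = 0x80

    def mark_hop_limit(hop_limit, is_retransmit):
        """Keep normal packets in 0-127 and set the MSB on retransmissions."""
        hop_limit &= 0x7F
        return hop_limit | RETRANSMIT_BIT if is_retransmit else hop_limit

    def retransmit_share(sampled_hop_limits):
        """Estimate the retransmit ratio from TTLs sampled by a device."""
        if not sampled_hop_limits:
            return 0.0
        marked = sum(1 for h in sampled_hop_limits if h & RETRANSMIT_BIT)
        return marked / len(sampled_hop_limits)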

SLIDE 19

Visualization

  • High-density dashboards
  • Using cubism.js
  • Fancy JavaScript UIs
  • (various iterations)
  • Other experimental views
SLIDE 20

Examples

SLIDE 21

Example Fabric Layout

[Diagram: fabric layout showing RSW, FSW, and SSW tiers]

SLIDE 22

Example 1 – Bad FSW causing 25% loss in a pod

  • Clear signal in NetNORAD
  • Triangulated successfully by NFI
  • Also seen in passive collections

SLIDE 23

Example 2 – Bad FSW could not be drained (failed automation)

  • Low loss seen in NetNORAD
  • NFI drained the device
  • But the drain failed
  • NFI alarms again
  • Clear signal in TTLd too

SLIDE 24

Example 3 – Congestion, false alarm

  • Congestion happens
  • NFI uses outlier detection
  • Not perfect
  • Loss in NetNORAD was limited to just a single DSCP

SLIDE 25

Lessons Learned

SLIDE 26

Lesson learned – Multiple tools, similar problems

  • Having multiple tools helps
    • Separate failure domains
    • Separation of concerns
  • But also adds a lot of overhead
  • Reliability:
    • Regressions are usually the biggest problem
    • Holes in coverage are the next big problem
    • Dependency / cascading failures

SLIDE 27

How to avoid regressions or holes?

i.e. how to know we can catch events reliably.

  • After validating the proof of concept
  • How to make sure it continues working?
    • … that it can detect failures
    • … maintaining coverage
    • … and keeping up with scale
    • … and doesn’t fail with its dependencies?

SLIDE 28

Coverage

  • It’s a function of time and space
    • e.g. we cover 90% of the devices 99% of the time (sketch below)
  • Should not regress
  • New devices should be covered once provisioned
  • Monitor and alarm!
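
One toy way to put a number on that definition; the thresholds and data shapes here are assumptions, not the real coverage pipeline.

    # Coverage sketch: per device, the fraction of probing intervals in which it
    # was actually probed; fleet coverage is the share of devices meeting a time
    # threshold, e.g. "90% of devices are probed 99% of the time".
    def fleet_coverage(per_device_coverage, time_threshold=0.99):
        if not per_device_coverage:
            return 0.0
        covered = sum(1 for c in per_device_coverage.values() if c >= time_threshold)
        return covered / len(per_device_coverage)

    # e.g. fleet_coverage({"fsw001": 0.999, "fsw002": 0.97}) -> 0.5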

SLIDE 29

Accuracy

false positive vs true positive

  • Find a way to categorize events
  • Possibly automatically
  • Measure and keep history
  • Make sure there are no regressions

SLIDE 30

Regression detection – Not just for performance

  • How do we know we can detect events?
    • Before we get an event, possibly!
  • End-to-End (E2E) testing:
    • Introduce fake faults and see if the tool can detect them
    • Usually done via ACL injection to block traffic
  • Middle-to-End (M2E) testing:
    • Introduce fake data in the aggregation pipeline (sketch below)
    • Useful for more complex failures
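
A hypothetical M2E check, reusing the toy aggregate() from the earlier data-pipeline sketch; the threshold and record shape are made up. The idea is to feed a synthetic lossy pod into the aggregation step and assert the alarm condition fires.

    # Middle-to-end sketch: inject fake per-target reports for a test pod and
    # verify the aggregation + alarming logic would have flagged it.
    def m2e_regression_check(alarm_threshold=0.05):
        fake_reports = [("pod-under-test", 0.25, 1.2)] * 20   # 25% loss, 1.2 ms RTT
        summary = aggregate(fake_reports)                     # from the earlier sketch
        fired = summary["pod-under-test"]["loss_p50"] >= alarm_threshold
        assert fired, "pipeline failed to flag injected loss"
        return fired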

SLIDE 31

Performance

  • Time to detection
  • Time to alarm
  • But also more classic metrics (CPU, memory, errors)
  • Measure and alarm!

SLIDE 32

Dependency failures – Or how to survive when things start to fry

  • Degrade gracefully during large-scale events
    • i.e. what if SCUBA is down?
    • or the timeseries database?
  • “doomsday” tooling:
    • Review dependencies, drop as many as possible
    • Provide a subset of functionality
    • Make sure it’s user friendly (both the UI and the help)
    • Make sure it’s continuously tested

SLIDE 33

Future

SLIDE 34

Future work – Lots of work to do

  • Keep up with scale
  • Support new devices and networks
  • Continue to provide a stable signal
  • Exploring ML for data analysis
  • Improve coverage