Trumpet: Timely and Precise Triggers in Data Centers



slide-1
SLIDE 1

Trumpet: Timely and Precise Triggers in Data Centers

Masoud Moshref, Minlan Yu Ramesh Govindan, Amin Vahdat

slide-2
SLIDE 2

The Problem

2

Long failure repair times in large networks

Human-in-the-loop failure assessment and repair

Evolve or Die, SIGCOMM 2016

slide-3
SLIDE 3

Humans in the Loop

3

Detect → Locate → Inspect → Fix

slide-4
SLIDE 4

Programs in the Loop

4

Detect → Locate → Inspect → Fix

slide-5
SLIDE 5

Our Focus

5

Detect

A framework for programmed detection of events in large datacenters
slide-6
SLIDE 6

Events

6

Event classes:
❖ Availability
❖ Performance
❖ Security

Example events: link failure, DDoS, traffic surge, packet delay, lost packet, packet burst, switch failure, incast, load imbalance, blackhole, congestion, traffic hijack, loop, middlebox failure, burst loss

slide-7
SLIDE 7

Our Focus

7

Detect

Aggregated, often sampled measures of network health

slide-8
SLIDE 8

8

Fine Timescale Events

Detecting Transient Congestion
❖ 40 ms burst
❖ Timeouts lasting several 100 ms

slide-9
SLIDE 9

Fine Timescale Events

9

Detecting Attack Onset

Did this tenant see a sudden increase in traffic over the last few milliseconds?

slide-10
SLIDE 10

Inspect Every Packet

10

Some event definitions may require inspecting every packet

Example events: link failure, DDoS, traffic surge, packet delay, lost packet, packet burst, switch failure, incast, load imbalance, blackhole, congestion, traffic hijack, loop, middlebox failure, burst loss

slide-11
SLIDE 11

Eventing Framework Requirements

Expressivity
▸ Set of possible events not known a priori
Fine timescale eventing
▸ Capture transient and onset events
Per-packet processing
▸ Precise event determination

11

Because data centers will require high availability and high utilization

slide-12
SLIDE 12

12

A Key Architectural Question

Where do we place eventing functionality?

Switches, Hosts, NICs

❖ Are programmable
❖ Have processing power for fine-timescale eventing
❖ Already inspect every packet

slide-13
SLIDE 13

13

We explore the design of a host-based eventing framework

slide-14
SLIDE 14

Research Questions

What eventing architecture permits programmability and visibility?
How can we achieve precise eventing at fine timescales?
What is the performance envelope of such an eventing framework?

14

slide-15
SLIDE 15

Research Questions

What eventing architecture permits programmability and visibility?
How can we achieve precise eventing at fine timescales?
What is the performance envelope of such an eventing framework?

15

Trumpet has a logically centralized event manager that aggregates local events from per-host packet monitors

slide-16
SLIDE 16

Event Definition

For each packet matching [Filter], group by [Flow-granularity], and report every [Time-interval] each group that satisfies [Predicate]

Flow volumes, loss rate, loss pattern (bursts), delay

slide-17
SLIDE 17

17

Event Example

For each packet matching [Service IP Prefix], group by [5-tuple], and report every [10ms] any flow whose [sum (is_lost & is_burst) > 10%]

Is there any flow sourced by a service that sees a burst of losses in a small interval?

slide-18
SLIDE 18

18

Event Example

For each packet matching [Cluster IP Prefix and Port], group by [Job IP Prefix], and report every [10ms] any job whose [sum (volume) > 100MB]

Is there a job in a cluster that sees abnormal traffic volumes in a small interval?
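
The four-part event template (filter, flow granularity, time interval, predicate) can be sketched as a small data structure. This is a minimal Python illustration, not Trumpet's implementation: the names `Packet`, `Trigger`, `src_in`, and `evaluate` are hypothetical, the predicate is restricted to summed volume, and the threshold is scaled down from the slide's 100MB so the example stays tiny.

```python
import ipaddress
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class Packet:
    ts_ms: float   # arrival time in milliseconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    size: int      # bytes

    def five_tuple(self):
        return (self.src_ip, self.dst_ip, self.src_port, self.dst_port, self.proto)


@dataclass
class Trigger:
    matches: Callable[[Packet], bool]        # filter
    group_key: Callable[[Packet], Hashable]  # flow granularity
    interval_ms: float                       # time interval
    predicate: Callable[[int], bool]         # predicate over summed volume


def src_in(prefix: str) -> Callable[[Packet], bool]:
    """Hypothetical helper: filter on a source IP prefix."""
    net = ipaddress.ip_network(prefix)
    return lambda p: ipaddress.ip_address(p.src_ip) in net


def evaluate(trigger: Trigger, packets):
    """Return sorted (interval_index, group) pairs whose volume satisfies the predicate."""
    volume = defaultdict(int)
    for p in packets:
        if trigger.matches(p):
            epoch = int(p.ts_ms // trigger.interval_ms)
            volume[(epoch, trigger.group_key(p))] += p.size
    return sorted(k for k, v in volume.items() if trigger.predicate(v))


# Group by 5-tuple, report every 10 ms any flow whose volume exceeds 2000 bytes.
big_flow = Trigger(src_in("10.1.1.0/24"), Packet.five_tuple, 10.0, lambda v: v > 2000)
packets = [
    Packet(1.0, "10.1.1.5", "10.2.0.1", 4000, 80, 6, 1500),
    Packet(4.0, "10.1.1.5", "10.2.0.1", 4000, 80, 6, 1500),
    Packet(5.0, "10.9.9.9", "10.2.0.1", 4000, 80, 6, 9000),   # fails the filter
    Packet(12.0, "10.1.1.5", "10.2.0.1", 4000, 80, 6, 1500),  # next interval
]
events = evaluate(big_flow, packets)
```

Only the first interval is reported: the matching flow sends 3000 bytes in interval 0 and 1500 bytes in interval 1.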

slide-19
SLIDE 19

19

Trumpet Design

[Figure: a Controller running the Trumpet Event Manager, and Servers each with VMs, a Hypervisor, a Software switch, and a Trumpet Packet Monitor. Labels: Triggers, Trigger Reports, Event Report]

slide-20
SLIDE 20

20

Trumpet Event Manager

Congestion? → Congestion Triggers

Triggers: contain event attributes, detect local events

slide-21
SLIDE 21

21

Trumpet Event Manager


slide-22
SLIDE 22

22

Trumpet Event Manager

Large flow? → Large Flow Triggers

Trumpet can be used by programs to drill down to potential root causes
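
The drill-down pattern (a congestion event prompting the program to install large-flow triggers) can be sketched as a toy controller. Everything here is an assumption made for illustration: the `EventManager` class, its method names, and the rule of invoking a handler once per trigger and interval are hypothetical and are not Trumpet's API.

```python
from collections import defaultdict


class EventManager:
    """Toy controller: installs triggers on hosts and aggregates their reports."""

    def __init__(self, hosts):
        self.installed = {h: set() for h in hosts}  # host -> trigger names
        self.reports = defaultdict(set)             # (trigger, interval) -> reporting hosts
        self.handlers = {}                          # trigger -> callback

    def install(self, trigger, hosts, handler):
        for h in hosts:
            self.installed[h].add(trigger)
        self.handlers[trigger] = handler

    def on_trigger_report(self, host, trigger, interval):
        if trigger not in self.installed[host]:
            return  # report for a trigger this host is not running
        first = not self.reports[(trigger, interval)]
        self.reports[(trigger, interval)].add(host)
        if first:
            # Raise the event once per interval, on the first local report.
            self.handlers[trigger](self, host, interval)


large_flows = []


def on_congestion(mgr, host, interval):
    # Drill down: ask the congested host to report large flows as well.
    mgr.install("large_flow", [host], lambda m, h, i: large_flows.append((h, i)))


mgr = EventManager(["h1", "h2"])
mgr.install("congestion", ["h1", "h2"], on_congestion)
mgr.on_trigger_report("h2", "large_flow", 7)  # ignored: not installed yet
mgr.on_trigger_report("h2", "congestion", 7)  # installs large_flow on h2
mgr.on_trigger_report("h2", "large_flow", 8)  # now handled
```

The second trigger is installed only on the host that reported congestion, so hosts without the symptom pay nothing for the drill-down.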

slide-23
SLIDE 23

Research Questions

What eventing architecture permits programmability and visibility?
How can we achieve precise eventing at fine timescales?
What is the performance envelope of such an eventing framework?

23

The monitor optimizes packet processing to inspect every packet and evaluate predicates at fine timescales

slide-24
SLIDE 24

The Packet Monitor

24

[Figure: a Server with VMs, a Hypervisor, a Software switch, and the Trumpet Packet Monitor]

slide-25
SLIDE 25

A Key Assumption

25

[Figure: a Server with VMs, a Hypervisor, a Software switch, and the Trumpet Packet Monitor]

Piggyback on CPU core used by software switch
❖ Conserves server CPU resources
❖ Avoids inter-core synchronization

slide-26
SLIDE 26

26

Can a single core monitor thousands of triggers at full packet rate (14.8 Mpps) on a 10G NIC?

slide-27
SLIDE 27

Two Obvious Tricks

Use kernel bypass
▸ Avoid kernel stack overhead
Use polling to have tighter scheduling
▸ Trigger time intervals at 10ms

Necessary, but far from sufficient…

slide-28
SLIDE 28

28

Monitor Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Example triggers:

Filter                     Flow granularity     Predicate           Time interval
Source IP = 10.1.1.0/24    Service IP prefix    Sum(loss) > 10%     10ms
Source IP = 20.2.2.0/24    5-tuple              Sum(size) < 10MB    100ms

With 1000s of triggers
slide-29
SLIDE 29

29

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which of these should be performed
❖ On-path
❖ Off-path

slide-30
SLIDE 30

30

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which operations to do on-path?

❖ 70ns to forward and inspect packet
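
The 14.8 Mpps and roughly 70ns figures follow from Ethernet framing arithmetic. A short sketch, assuming standard Ethernet overhead of 20 bytes per frame on the wire (preamble, start-of-frame delimiter, inter-frame gap); the function name `line_rate_pps` is illustrative only.

```python
def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second at line rate for a given Ethernet frame size."""
    # Each frame also occupies 20 extra bytes on the wire:
    # 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
    wire_bits = (frame_bytes + 20) * 8
    return link_bps / wire_bits


pps = line_rate_pps(10e9, 64)   # minimum-size frames on a 10G link
budget_ns = 1e9 / pps           # time available per packet on one core

print(f"{pps / 1e6:.2f} Mpps, {budget_ns:.1f} ns per packet")
```

At about 67 ns per packet, a single miss to main memory can consume most of the budget, which is why the later slides focus on cache behavior.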

slide-31
SLIDE 31

31

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

How to schedule off-path operations?

❖ Off-path on same core, can delay packets
❖ Bound delay to a few µs

slide-32
SLIDE 32

32

Strawman Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

[Figure labels: Packet History, On-Path, Off-Path]

Doesn’t scale to large numbers of triggers

slide-33
SLIDE 33

33

Strawman Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

[Figure labels: On-Path, Off-Path]

Still cannot reach goal

❖ Memory subsystem becomes a bottleneck

slide-34
SLIDE 34

34

Trumpet Monitor Design

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval

slide-35
SLIDE 35

35

Optimizations

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity
❖ Use tuple-space search for matching
❖ Match on first packet, cache match
❖ Lay out tables to enable cache prefetch
❖ Use TLB huge pages for tables
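
The first two on-path optimizations can be sketched together. Tuple-space search keeps one hash table per distinct combination of prefix lengths, so a lookup costs one probe per combination rather than one comparison per trigger; caching the result per 5-tuple means only a flow's first packet pays for the search. The class below is a simplified illustration under stated assumptions: the names are hypothetical, filters cover only source and destination IPv4 prefixes, and cached matches are not invalidated when triggers are added later.

```python
import ipaddress
from collections import defaultdict


def ip_int(addr: str) -> int:
    return int(ipaddress.ip_address(addr))


def mask(bits: int) -> int:
    """IPv4 netmask with `bits` leading ones, as an integer."""
    return ((1 << bits) - 1) << (32 - bits)


class TupleSpace:
    """Filters on (src prefix, dst prefix), grouped by prefix-length pair."""

    def __init__(self):
        # (src_len, dst_len) -> {(masked_src, masked_dst): [trigger ids]}
        self.tables = defaultdict(lambda: defaultdict(list))
        self.flow_cache = {}  # 5-tuple -> matched trigger ids
        self.lookups = 0      # number of full tuple-space searches

    def add(self, trigger_id, src_prefix, dst_prefix="0.0.0.0/0"):
        s = ipaddress.ip_network(src_prefix)
        d = ipaddress.ip_network(dst_prefix)
        key = (int(s.network_address), int(d.network_address))
        self.tables[(s.prefixlen, d.prefixlen)][key].append(trigger_id)

    def search(self, src: int, dst: int):
        # One hash probe per distinct prefix-length pair, not per trigger.
        hits = []
        for (sl, dl), table in self.tables.items():
            hits += table.get((src & mask(sl), dst & mask(dl)), [])
        return hits

    def match(self, five_tuple):
        # Only the first packet of a flow pays for the search.
        if five_tuple not in self.flow_cache:
            self.lookups += 1
            src, dst = ip_int(five_tuple[0]), ip_int(five_tuple[1])
            self.flow_cache[five_tuple] = self.search(src, dst)
        return self.flow_cache[five_tuple]


ts = TupleSpace()
ts.add("t1", "10.1.1.0/24")
ts.add("t2", "10.1.0.0/16", "10.2.0.0/16")
flow = ("10.1.1.5", "10.2.0.1", 4000, 80, 6)
first = ts.match(flow)   # full search
second = ts.match(flow)  # served from the flow cache
```

Both calls return the same cached list, and `ts.lookups` stays at 1 after the second packet.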

slide-36
SLIDE 36

36

Optimizations

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval
❖ Lazy cleanup of statistics across intervals
❖ Lay out tables to enable cache prefetch
❖ Bounded-delay cooperative scheduling
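
One way to read "lazy cleanup of statistics across intervals" is that counters are not zeroed in bulk at every interval boundary; instead each counter records the interval it was last written in and resets itself on its next update. The sketch below illustrates that reading only. `LazyCounter` is a hypothetical name, and the exact mechanism Trumpet uses is described in the paper, not on the slide.

```python
class LazyCounter:
    """Per-flow counter that is reset only when next touched in a new interval."""

    __slots__ = ("epoch", "value")

    def __init__(self):
        self.epoch = -1
        self.value = 0

    def add(self, epoch: int, amount: int) -> None:
        if self.epoch != epoch:  # leftover data from an older interval
            self.epoch = epoch
            self.value = 0
        self.value += amount

    def read(self, epoch: int) -> int:
        # A counter not touched in this interval reads as zero without being cleared.
        return self.value if self.epoch == epoch else 0


c = LazyCounter()
c.add(3, 100)
c.add(3, 50)
in_interval_3 = c.read(3)
stale_in_4 = c.read(4)    # no write happened in interval 4 yet
c.add(4, 7)
in_interval_4 = c.read(4)
```

The benefit is that idle flows cost nothing at an interval boundary, which keeps the off-path sweep short.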

slide-37
SLIDE 37

Bounded Delay Cooperative Scheduling

37

[Figure labels: Off-Path, On-Path, Bounded Delay]

Bound delay to a few µs
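
Cooperative scheduling on a single core can be sketched with a generator: the off-path sweep does a bounded slice of work and then hands the core back to packet processing. This is an illustration under assumptions, not Trumpet's scheduler. The names `sweep` and `run` are hypothetical, and the bound here is a count of predicate checks per slice, whereas the slide states the bound as a delay of a few µs.

```python
from collections import deque


def sweep(flow_table, check, batch):
    """Off-path work as a generator: check `batch` flows, then yield the core."""
    done = 0
    for key, stats in list(flow_table.items()):
        check(key, stats)
        done += 1
        if done % batch == 0:
            yield


def run(rx_queue, on_packet, flow_table, check, batch=2):
    """Single-core loop: drain packets, then run one bounded slice of off-path work."""
    pending = sweep(flow_table, check, batch)
    while rx_queue or pending is not None:
        while rx_queue:              # on-path work has priority
            on_packet(rx_queue.popleft())
        if pending is not None:
            try:
                next(pending)        # at most `batch` checks before polling again
            except StopIteration:
                pending = None


log = []
flows = {f"f{i}": i for i in range(5)}
rx = deque(["p1", "p2"])


def on_packet(p):
    log.append(p)


def check(key, stats):
    log.append(key)
    if key == "f1":
        rx.append("p3")  # a packet arrives while the sweep is running


run(rx, on_packet, flows, check, batch=2)
```

In the resulting log, packet `p3` is handled right after the slice that checked `f0` and `f1`, before the sweep resumes with `f2`.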

slide-38
SLIDE 38

Research Questions

What eventing architecture permits programmability and visibility?
How can we achieve precise eventing at fine timescales?
What is the performance envelope of such an eventing framework?

38

Trumpet can monitor thousands of triggers at full packet rate on a 10G NIC

slide-39
SLIDE 39

39

Evaluation

Trumpet is expressive
❖ Transient congestion
❖ Burst loss
❖ Attack onset

Trumpet scales to thousands of triggers

Trumpet is DoS-resilient

slide-40
SLIDE 40

Detecting Transient Congestion

40

[Figure labels: Congestion, Large Flow (Reactive), 40 ms]

Trumpet can detect millisecond-scale congestion events

slide-41
SLIDE 41

Scalability

41

Trumpet can process❉ 14.8 Mpps
❖ 64-byte packets at 10G
❖ 650-byte packets at 4x10G
… while evaluating 16K triggers at 10ms granularity

❉ Xeon E5-2650, 10-core 2.3 GHz, Intel 82599 10G NIC

slide-42
SLIDE 42

Performance Envelope

42

Triggers matched by each flow
How often each predicate is checked
Above this rate, Trumpet would miss events

slide-43
SLIDE 43

Performance Envelope

43

At moderate packet rates, can detect events at 1ms
Number of <trigger, flow> pairs increases statistics gathering overhead

slide-44
SLIDE 44

Performance Envelope

44

Need to profile and provision Trumpet deployment
Above 10ms, CPU can sustain full packet rate

slide-45
SLIDE 45

Conclusion

Future datacenters will need fast and precise eventing
▸ Trumpet is an expressive system for host-based eventing
Trumpet can process 16K triggers at full packet rate
▸ … without delaying packets by more than 10 µs
Future work: scale to 40G NICs
▸ … perhaps with NIC or switch support

45

https://github.com/USC-NSL/Trumpet

slide-46
SLIDE 46

A Big Discrepancy

46

Outage budget for five 9s availability (99.999% uptime): 24 seconds per month

Long failure durations due to time to root-cause failures

slide-47
SLIDE 47

47

Every optimization is necessary❉

❉Details in the paper