NetPoirot: Taking The Blame Game Out of Data Center Operations

SLIDE 1

NetPoirot: Taking The Blame Game Out of Data Center Operations

Behnaz Arzani, Selim Ciraci, Boon Thau Loo, Assaf Schuster, Geoff Outhred

SLIDE 2

Datacenters can fail …

SLIDE 3

Failures are disruptive

SLIDE 4

Why is debugging hard?

Diagram labels: Penn researcher, Azure VM, Azure Network, Service X, Network

SLIDE 5

In the case of a failure…

  • Someone accepts responsibility
  • Each blames the other

Diagram labels: Network, Network

SLIDE 6

A real example… Event X

SLIDE 7

Current tools are insufficient

  • Sherlock (SIGCOMM-07)
  • NetMedic (SIGCOMM-09)
  • NSDI-11
  • TRat (SIGCOMM-02)
  • NetProfiler (P2Psys-05)

SLIDE 8

Can we do better? (Overview)

  • Introducing…

Diagram labels: NetPoirot, Fault injector, Learning Agent

SLIDE 9

The monitoring agent

SLIDE 10

What is the TCP event digest?

SLIDE 11

Why do we think this can work?

SLIDE 12

To distinguish failures…

SLIDE 13

Decision trees…

His uncertainty is X

SLIDE 14

Decision trees…

His uncertainty is X - Y
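The two slides above describe a question reducing uncertainty from X to X - Y. A minimal sketch of that idea follows, assuming (the slides do not say so) that uncertainty is measured as Shannon entropy over failure labels and that Y is the reduction obtained by splitting on a threshold. The feature values, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting observations on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    total = len(labels)
    remaining = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - remaining


# Hypothetical observations: one feature value and one label each.
timeouts = [0, 1, 0, 7, 9, 8]
labels = ["normal", "normal", "normal", "network", "network", "network"]

x = entropy(labels)                        # uncertainty before the question: X
y = information_gain(timeouts, labels, 3)  # reduction from the question: Y
print("before:", x, "after:", x - y)
```

A tree builder would evaluate many candidate questions this way and keep the one with the largest Y at each node.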

SLIDE 15

Decision trees alone are not enough

SLIDE 16

Decision trees alone are not enough

SLIDE 17

Decision trees alone are not enough

Plot axes: Feature 1, Feature 2

SLIDE 18

Decision trees alone are not enough

Plot labels: easiest to classify, hardest to classify
Plot axes: Feature 1, Feature 2

SLIDE 19

What we do to deal with this

Plot axes: Feature 1, Feature 2

SLIDE 20

Upper portion of an example tree…

  • Mean of max congestion window
  • Min of the last congestion window
  • 50th percentile of number of triple duplicate ACKs
  • 50th percentile of connection duration
  • Max of the number of triple duplicate ACKs
  • 95th percentile of the max congestion window
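The tree nodes above are aggregates (mean, min, max, percentiles) of per-connection TCP metrics. A minimal sketch of how such features could be computed follows, assuming each connection contributes one record per measurement interval. The record field names, the sample values, and the nearest-rank percentile method are assumptions, not taken from the slides.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(connections):
    """Collapse per-connection records into one feature vector."""
    max_cwnd = [c["max_cwnd"] for c in connections]
    last_cwnd = [c["last_cwnd"] for c in connections]
    dup_acks = [c["triple_dup_acks"] for c in connections]
    duration = [c["duration_s"] for c in connections]
    return {
        "mean_max_cwnd": mean(max_cwnd),
        "min_last_cwnd": min(last_cwnd),
        "p50_triple_dup_acks": percentile(dup_acks, 50),
        "p50_duration_s": percentile(duration, 50),
        "max_triple_dup_acks": max(dup_acks),
        "p95_max_cwnd": percentile(max_cwnd, 95),
    }


# Hypothetical per-connection records for one interval.
connections = [
    {"max_cwnd": 10, "last_cwnd": 4, "triple_dup_acks": 0, "duration_s": 1.5},
    {"max_cwnd": 20, "last_cwnd": 8, "triple_dup_acks": 2, "duration_s": 0.5},
    {"max_cwnd": 30, "last_cwnd": 6, "triple_dup_acks": 1, "duration_s": 2.5},
]

features = summarize(connections)
print(features)
```

Aggregating over connections keeps the feature vector the same size no matter how many flows the machine has open.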

SLIDE 21

What we do to deal with this

Plot axes: Feature 1, Feature 2

SLIDE 22

Upper portion of an example tree…

  • 50th percentile of the max RTT
  • Number of flows
  • 50th percentile of amount of data received
  • 95th percentile of the number of timeouts

SLIDE 23

Decision trees alone are not enough

Plot axes: Feature 1, Feature 2

SLIDE 24

The upper portion of an example tree…

  • Mean time spent in zero window probing
  • 95th percentile of the ratio of number of bytes posted to received
  • Number of flows
  • Number of flows
  • 95th percentile of connection durations
  • Minimum of the number of bytes received

SLIDE 25

  • Is it a network failure?
  • Is it a server problem?
  • Is it a client-side problem?
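One possible reading of the three questions above is a sequence of yes/no checks that maps a feature vector to a general label. The sketch below assumes that reading; the slides do not say how the answers are combined. The predicates are made-up placeholders standing in for trained classifiers, and the feature names and thresholds are hypothetical.

```python
def diagnose(features, questions, default="normal"):
    """Return the label of the first question answered 'yes', else the default."""
    for label, is_match in questions:
        if is_match(features):
            return label
    return default


# Placeholder predicates; a real system would use learned models here.
questions = [
    ("network", lambda f: f["p95_timeouts"] > 3),
    ("server", lambda f: f["mean_zero_window_time"] > 0.5),
    ("client", lambda f: f["min_bytes_received"] < 100),
]

healthy = {"p95_timeouts": 0, "mean_zero_window_time": 0.0, "min_bytes_received": 4096}
lossy = {"p95_timeouts": 9, "mean_zero_window_time": 0.0, "min_bytes_received": 4096}

print(diagnose(healthy, questions), diagnose(lossy, questions))
```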

SLIDE 26

Other details

  • If throughput < x: open more connections
  • If throughput < x: send more data on the same connection
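The two rules above show different reactions to the same low-throughput condition. A minimal sketch makes the contrast concrete, assuming a simple client state of connection count and bytes per connection. The `ClientState` type, the function names, and the doubling factor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientState:
    connections: int
    bytes_per_connection: int


def open_more_connections(state, throughput, x):
    """First rule: below the threshold x, add one connection."""
    if throughput < x:
        return ClientState(state.connections + 1, state.bytes_per_connection)
    return state


def send_more_data(state, throughput, x):
    """Second rule: below the threshold x, send more on existing connections
    (doubling is an arbitrary choice for illustration)."""
    if throughput < x:
        return ClientState(state.connections, state.bytes_per_connection * 2)
    return state


start = ClientState(connections=2, bytes_per_connection=1000)
after_a = open_more_connections(start, throughput=5.0, x=10.0)
after_b = send_more_data(start, throughput=5.0, x=10.0)
print(after_a, after_b)
```

Under the same condition the first rule changes the number of flows while the second changes the data volume per flow, so the resulting TCP metrics differ.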

SLIDE 27

What did we learn from all this?

SLIDE 28

Evaluation

SLIDE 29

How did we get labeled data?

SLIDE 30

Worst-case application

SLIDE 31

What if we haven’t seen the failure before?

SLIDE 32

Performance on real applications

General label   Normal    Client    Network
Precision       97.78%    99.7%     100%
Recall          99.68%    98.25%    99.37%

YouTube, Event X
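For reference, precision and recall for one label follow the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below computes both from true and predicted labels. The sample labels are hypothetical and are not the data behind the figures above.

```python
def precision_recall(true_labels, predicted_labels, label):
    """Precision and recall for one label, treating it as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and predictions.
truth = ["normal", "normal", "client", "network", "network", "client"]
predicted = ["normal", "client", "client", "network", "network", "client"]

for name in ("normal", "client", "network"):
    print(name, precision_recall(truth, predicted, name))
```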

SLIDE 33

Things we did not talk about

SLIDE 34

What’s next?

SLIDE 35

Conclusion
