
NetPoirot: Taking The Blame Game Out of Data Center Operations



  1. NetPoirot: Taking The Blame Game Out of Data Center Operations Behnaz Arzani, Selim Ciraci, Boon Thau Loo, Assaf Schuster, Geoff Outhred

  2. Datacenters can fail …

  3. Failures are disruptive

  4. Why is debugging hard? (diagram labels: Penn researcher, Azure VM, Service X, Azure Network)

  5. In the case of a failure… Someone accepts responsibility. Each blames the other.

  6. A real example… Event X

  7. Current tools are insufficient: TRat (SIGCOMM-02), NetProfiler (P2Psys-05), Sherlock (SIGCOMM-07), NetMedic (SIGCOMM-09), NSDI-11

  8. Can we do better? (Overview) • Introducing… NetPoirot (diagram labels: Learning Agent, Fault injector)

  9. The monitoring agent

  10. What is the TCP event digest?

  11. Why do we think this can work?

  12. To distinguish failures…

  13. Decision trees… • His uncertainty is X

  14. Decision trees… • His uncertainty is X - Y

  15. Decision trees alone are not enough

  16. Decision trees alone are not enough

  17. Decision trees alone are not enough (plot axes: Feature 1, Feature 2)

  18. Decision trees alone are not enough (plot axes: Feature 1, Feature 2; labels: "Hardest to classify", "Easiest to")

  19. What we do to deal with this (plot axes: Feature 1, Feature 2)

  20. Upper portion of an example tree… Node features: mean of max congestion window; 50th percentile of number of triple duplicate ACKs; min of the last congestion window; 50th percentile of connection duration; 95th percentile of the max congestion window; max of the number of triple duplicate ACKs

  21. What we do to deal with this (plot axes: Feature 1, Feature 2)

  22. Upper portion of an example tree… Node features: 50th percentile of the max RTT; number of flows; 50th percentile of amount of data received; 95th percentile of the number of timeouts

  23. Decision trees alone are not enough (plot axes: Feature 1, Feature 2)

  24. The upper portion of an example tree… Node features: mean time spent in zero window probing; 95th percentile of the ratio of number of bytes posted to received; number of flows; 95th percentile of connection durations; minimum of the number of bytes received; number of flows

  25. Is it a network failure? Is it a server problem? Is it a client side problem?

  26. Other details • If throughput < x: send more data on the same connection • If throughput < x: open more connections

  27. What did we learn from all this?

  28. Evaluation

  29. How did we get labeled data?

  30. Worst-case application

  31. What if we haven’t seen the failure before?

  32. Performance on real applications (labels on slide: YouTube, Event X)

      General label   Normal    Client    Network
      Precision       97.78%    99.7%     100%
      Recall          99.68%    98.25%    99.37%

  33. Things we did not talk about

  34. What’s next?

  35. Conclusion
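
The decision-tree slides describe a split as taking the uncertainty from X down to X - Y. One common way to read this is as entropy and information gain. The sketch below is only an illustration of that reading, not the authors' implementation: the function names, the toy labels, the feature values, and the threshold are all hypothetical, and the slides do not say which splitting criterion NetPoirot actually uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (the 'X' on the slide)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, feature_values, threshold):
    """Entropy reduction (the 'Y' on the slide) from splitting at feature <= threshold."""
    if not labels:
        return 0.0
    left = [lab for lab, v in zip(labels, feature_values) if v <= threshold]
    right = [lab for lab, v in zip(labels, feature_values) if v > threshold]
    n = len(labels)
    # Weighted entropy that remains after the split (X - Y).
    remaining = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remaining


# Hypothetical toy data: one aggregated TCP feature per observation window.
labels = ["network", "network", "client", "client"]
timeouts_p95 = [9.0, 7.5, 0.5, 1.0]  # made-up values for illustration only

x = entropy(labels)
y = information_gain(labels, timeouts_p95, threshold=2.0)
print(f"uncertainty before split: {x:.2f} bits, removed by split: {y:.2f} bits")
```

In this toy case the threshold separates the two classes perfectly, so the split removes all of the initial uncertainty. On real data a single feature rarely does that, which fits the later slides' point that decision trees alone are not enough.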
