Network Failure Mitigation Xin Wu , Daniel Turner, Chao-Chih Chen, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

NetPilot: Automating Datacenter Network Failure Mitigation

Xin Wu, Daniel Turner, Chao-Chih Chen,

David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

slide-2
SLIDE 2

Failures are Common and Harmful

  • Network failures are common

10,000+ switches

2

slide-3
SLIDE 3

Failures are Common and Harmful

  • Network failures are common
  • Failures cause long down times

3

slide-4
SLIDE 4

Failures are Common and Harmful

  • Network failures are common
  • Failures cause long down times

Time from detection to repair (minutes)

Six-month failure logs of production datacenters

25% of failures take 13+ hours to repair

4

slide-5
SLIDE 5

Failures are Common and Harmful

  • Failures are common due to VERY large datacenters
  • Failures cause long down times
  • Long failure duration → large revenue loss

5

slide-6
SLIDE 6


6

slide-7
SLIDE 7

How to Shorten Failure Recovery Time?

slide-8
SLIDE 8

Previous Work

  • Conventional failure recovery takes 3 steps

Detection → Diagnosis → Repair

[Diagram labels: passive, ping, active]

8

slide-9
SLIDE 9

Previous Work

  • Conventional failure recovery takes 3 steps
  • Failure localization/diagnosis

– [M. K. Aguilera, SOSP’03]
– [M. Y. Chen, NSDI’04]
– [R. R. Kompella, NSDI’05]
– [P. Bahl, SIGCOMM’07]
– [S. Kandula, SIGCOMM’09]…

Detection → Diagnosis → Repair

9

slide-10
SLIDE 10

Automating Failure Diagnosis is Challenging

  • Root causes are deep in network stack
  • Diagnosis involves multiple parties

10

slide-11
SLIDE 11

Category             Failure types                 Diagnosis & Repair     %
Software 21%         Link layer loop               Find and fix bugs      19%
                     Imbalance → overload                                 2%
Hardware 18%         FCS error                     Replace cable          13%
                     Unstable power                Repair power           5%
Unknown 23%          Switch stops forwarding       N/A                    9%
                     Imbalance → overload                                 7%
                     Lost configuration                                   5%
                     High CPU utilization                                 2%
Configuration 38%    Errors on multiple switches   Update configuration   32%
                     Errors on one switch                                 6%

  • Six-month failure logs from several production DCNs
  • 1. Root causes are deep in the network stack

11

slide-12
SLIDE 12

[Failure table as on slide 11]

  • Six-month failure logs from several production DCNs
  • 1. Root causes are deep in the network stack
  • 2. Diagnosis involves multiple parties

12

slide-13
SLIDE 13

[Failure table as on slide 11]

  • Six-month failure logs from several production DCNs
  • 1. Root causes are deep in the network stack
  • 2. Diagnosis involves multiple parties

Failure Diagnosis Requires Human Intervention!

13

slide-14
SLIDE 14

Can we do something other than failure diagnosis?

slide-15
SLIDE 15

NetPilot: Mitigating rather than Diagnosing Failures

  • Mitigate failure symptoms ASAP, at the cost of reduced capacity

Detection → Diagnosis → Repair

15

slide-16
SLIDE 16

16

slide-17
SLIDE 17

NetPilot Benefits

  • Short recovery time
  • Small network disruption
  • Low operation cost

17

Automated Mitigation

Detection → Diagnosis → Repair

slide-18
SLIDE 18

Failure Mitigation is Effective

  • Most failures can be mitigated by simple actions
  • Mitigation is feasible due to redundancy

18

slide-19
SLIDE 19

Category             Failure types                   Mitigation          Repair                 %
Software 21%         Link layer loop                 Deactivate port     Find and fix bugs      19%
                     Imbalance-triggered overload    Restart switch                             2%
Hardware 18%         FCS error                       Deactivate port     Replace cable          13%
                     Unstable power                  Deactivate switch   Repair power           5%
Unknown 23%          Switch stops forwarding         Restart switch      N/A                    9%
                     Imbalance-triggered overload    Restart switch                             7%
                     Lost configuration              Restart switch                             5%
                     High CPU utilization            Restart switch                             2%
Configuration 38%    Errors on multiple switches     n/a                 Update configuration   32%
                     Errors on single switch         Deactivate switch                          6%

19

slide-20
SLIDE 20

[Mitigation table as on slide 19]

20

slide-21
SLIDE 21

[Mitigation table as on slide 19]

68% of failures can be mitigated by simple actions
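The 68% figure can be checked directly against the table. The short Python sketch below transcribes the table's percentages (the tuple layout and the `(software)`/`(unknown)` suffixes are only for illustration) and sums the share of failures that have a mitigation action other than "n/a".

```python
# Rows transcribed from the failure table: (failure type, mitigation, % of failures).
# A mitigation of None stands for the table's "n/a" entry.
failures = [
    ("Link layer loop", "Deactivate port", 19),
    ("Imbalance-triggered overload (software)", "Restart switch", 2),
    ("FCS error", "Deactivate port", 13),
    ("Unstable power", "Deactivate switch", 5),
    ("Switch stops forwarding", "Restart switch", 9),
    ("Imbalance-triggered overload (unknown)", "Restart switch", 7),
    ("Lost configuration", "Restart switch", 5),
    ("High CPU utilization", "Restart switch", 2),
    ("Errors on multiple switches", None, 32),
    ("Errors on single switch", "Deactivate switch", 6),
]

total = sum(pct for _, _, pct in failures)
mitigable = sum(pct for _, action, pct in failures if action is not None)

print(total, mitigable)  # 100 68
```

The only category without a simple action is configuration errors spanning multiple switches (32%), which is why the remaining 68% is mitigable.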

21

slide-22
SLIDE 22

22

slide-23
SLIDE 23

23

slide-24
SLIDE 24

Outline

  • Automating failure diagnosis is challenging
  • Failure mitigation is effective
  • How to automate mitigation?
  • NetPilot evaluations
  • Conclusion

24

slide-25
SLIDE 25

A Strawman NetPilot: Trial-and-error

Network failure → Localization → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)
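The strawman loop can be sketched in a few lines of Python. This is only an illustration of the flowchart, not NetPilot's implementation: the `Action` type, the `trial_and_error` function, and the toy port table are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Action:
    """A hypothetical mitigation action with an undo step."""
    name: str
    execute: Callable[[], None]
    roll_back: Callable[[], None]


def trial_and_error(candidates: Iterable[Action],
                    failure_mitigated: Callable[[], bool]) -> Optional[Action]:
    """Try each candidate in turn; undo it if the failure persists."""
    for action in candidates:
        action.execute()
        if failure_mitigated():
            return action          # "Yes" branch: end
        action.roll_back()         # "No" branch: roll back and try the next one
    return None                    # nothing worked; a human has to step in


# Toy example: three suspect ports, of which p2 is the faulty one.
ports = {"p1": "up", "p2": "up", "p3": "up"}


def deactivate(port: str) -> Action:
    return Action(
        name=f"deactivate {port}",
        execute=lambda: ports.__setitem__(port, "down"),
        roll_back=lambda: ports.__setitem__(port, "up"),
    )


chosen = trial_and_error([deactivate(p) for p in ports],
                         failure_mitigated=lambda: ports["p2"] == "down")
print(chosen.name)  # deactivate p2
```

With blind ordering the loop may execute and roll back many actions before it hits the right one, which is the weakness the following slides address.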

25

slide-26
SLIDE 26

NetPilot: Challenges & Solutions

  • 1. Blind trial-and-error takes a long time

Network failure → Localization → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)

26

slide-27
SLIDE 27

NetPilot: Challenges & Solutions

  • 1. Blind trial-and-error takes a long time

Network failure → Localization → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)

Solution: failure-specific localization

27

slide-28
SLIDE 28

NetPilot: Challenges & Solutions

Network failure → Localization → Estimate impact → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)

  • 2. An action may partition or overload the network

Solution: impact estimation

28

slide-29
SLIDE 29

29

slide-30
SLIDE 30

30

slide-31
SLIDE 31

NetPilot: Challenges & Solutions

Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)

  • 3. Different actions have different side-effects

Solution: rank actions based on impact

31

slide-32
SLIDE 32

Failure Specific Localization

  • Limited # of failure types
  • Domain knowledge improves accuracy

Failure types

  • 1. Link layer loop
  • 2. Imbalance-triggered overload
  • 3. FCS error
  • 4. Unstable power
  • 5. Switch stops forwarding
  • 6. Imbalance-triggered overload
  • 7. Lost configuration
  • 8. High CPU utilization
  • 9. Errors on multiple switches
  • 10. Errors on single switch

32

slide-33
SLIDE 33

Example: Frame Check Sequence (FCS) Errors

  • 13% of all the failures
  • Cut-through switching

– Forward frames before checksums are verified

  • Increase application latency

33

slide-34
SLIDE 34

Localizing FCS Errors

error frames seen on L = frames corrupted by L + frames corrupted by other links that traverse L

  • x_L: link corruption rate
  • # of variables = # of equations = # of links
  • Corrupted links: x_L > 0
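One way to read this slide is as a square linear system with one unknown corruption rate per link. The Python sketch below works under stated assumptions and is not the authors' formulation: `frames_via[k][l]` (frames that cross link k and then link l) is a hypothetical input assumed to be known from traffic and routing data, the model is taken to be linear, and a small threshold stands in for "x_L > 0" to absorb rounding noise.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: counters do not determine the rates")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def localize_fcs(links, frames_via, errors_seen, threshold=1e-6):
    """Return (rates, corrupted_links).

    frames_via[k][l]: frames that cross link k and then link l;
                      frames_via[l][l] is the number of frames crossing l.
    errors_seen[l]:   FCS error count observed on link l.
    """
    # One equation per link l: sum over k of x_k * frames_via[k][l] = errors_seen[l]
    a = [[frames_via[k].get(l, 0.0) for k in links] for l in links]
    b = [errors_seen[l] for l in links]
    rates = dict(zip(links, solve_linear(a, b)))
    return rates, [l for l in links if rates[l] > threshold]


links = ["A", "B", "C"]
frames_via = {
    "A": {"A": 1000.0, "B": 1000.0},  # every frame crossing A continues over B
    "B": {"B": 1000.0},
    "C": {"C": 500.0},
}
errors_seen = {"A": 10.0, "B": 10.0, "C": 0.0}  # B's errors are inherited from A

rates, corrupted = localize_fcs(links, frames_via, errors_seen)
print(corrupted)  # ['A']
```

In the toy data both A and B report 10 error frames, but solving the system attributes all of them to A: with cut-through switching, B merely forwards frames that A already corrupted.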

34

slide-35
SLIDE 35

NetPilot Overview

Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)

35

slide-36
SLIDE 36

Impact Metrics

  • Derived from Service Level Agreement (SLA)

– Availability: online_server_ratio
– Packet loss: total_lost_pkt
– Latency: max_link_utilization

  • Small link utilization → small (queuing) delay
  • total_lost_pkt and max_link_utilization are derived from the utilization of individual links
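As a rough illustration of how the three metrics could be computed from per-link numbers, consider the sketch below. The function name, its arguments, and the rule that traffic above a link's capacity counts as lost are assumptions made here; the slides only state that the last two metrics derive from individual link utilization.

```python
def impact_metrics(link_load, link_capacity, servers_online, servers_total):
    """Compute the three SLA-derived metrics from per-link load and capacity.

    link_load and link_capacity map a link name to a rate in the same unit.
    Assumption: any load above a link's capacity is dropped.
    """
    utilization = {l: link_load[l] / link_capacity[l] for l in link_load}
    lost = sum(max(0.0, link_load[l] - link_capacity[l]) for l in link_load)
    return {
        "online_server_ratio": servers_online / servers_total,
        "total_lost_pkt": lost,
        "max_link_utilization": max(utilization.values()),
    }


metrics = impact_metrics(
    link_load={"agg-core1": 1.5, "agg-core2": 0.5},
    link_capacity={"agg-core1": 1.0, "agg-core2": 1.0},
    servers_online=95,
    servers_total=100,
)
print(metrics)
```

Here the first link is offered 1.5 units against a capacity of 1.0, so 0.5 units are counted as lost and the maximum utilization is 1.5.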

36

slide-37
SLIDE 37

Estimating Link Utilization

  • # of flows >> redundant paths

– Traffic evenly distributed under ECMP

  • Estimate the load contributed by each flow on each link
  • Sum up the loads to compute utilization

[Diagram: the Impact Estimator takes an action, the traffic, and the topology as inputs and outputs link utilization]
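The estimation steps above can be sketched as follows. This is a simplified illustration, not NetPilot's estimator: it assumes the set of equal-cost paths per source/destination pair is given (`ecmp_paths` is a hypothetical input) and splits each flow evenly across those paths, which matches the "evenly distributed under ECMP" assumption only approximately for real per-hop hashing.

```python
from collections import defaultdict


def estimate_link_load(flows, ecmp_paths):
    """Sum, per link, the load each flow contributes under an even ECMP split.

    flows:      list of (src, dst, rate)
    ecmp_paths: (src, dst) -> list of equal-cost paths, each a list of nodes
    """
    load = defaultdict(float)
    for src, dst, rate in flows:
        paths = ecmp_paths.get((src, dst), [])
        if not paths:
            continue  # no path left: the flow is cut off, not load on any link
        share = rate / len(paths)
        for path in paths:
            for link in zip(path, path[1:]):
                load[link] += share
    return dict(load)


def without_node(ecmp_paths, node):
    """Model a 'deactivate switch' action by dropping paths through the node."""
    return {pair: [p for p in paths if node not in p]
            for pair, paths in ecmp_paths.items()}


flows = [("tor1", "tor2", 4.0), ("tor1", "tor2", 2.0)]
ecmp_paths = {("tor1", "tor2"): [["tor1", "agg_a", "tor2"],
                                 ["tor1", "agg_b", "tor2"]]}
capacity = 10.0  # same capacity on every link, for simplicity

before = estimate_link_load(flows, ecmp_paths)
after = estimate_link_load(flows, without_node(ecmp_paths, "agg_a"))

print(max(before.values()) / capacity)  # 0.3
print(max(after.values()) / capacity)   # 0.6
```

Running the estimator on the topology with and without a candidate action applied gives the before/after utilization that the impact metrics are computed from.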

37

slide-38
SLIDE 38

Link Utilization Estimation is Highly Accurate

  • 1-month traffic from an 8000-server network

– Log socket events on each server

  • Ground truth: SNMP counters

38

slide-39
SLIDE 39

NetPilot Overview

Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? (Yes: End. No: roll back if necessary and try again.)

Choose the action with the least impact
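Putting the pieces together, the ranked loop might look like the sketch below. The priority order inside `impact_key` (availability first, then packet loss, then peak utilization) is an assumption made for illustration, as are the function names and the impact numbers; the slides say only that the action with the least impact is chosen.

```python
def impact_key(metrics):
    """Sort key: smaller is better.

    Assumed priority: more servers online, then fewer lost packets,
    then lower peak link utilization.
    """
    return (-metrics["online_server_ratio"],
            metrics["total_lost_pkt"],
            metrics["max_link_utilization"])


def rank_actions(estimated_impact):
    """Return action names ordered from least to most estimated impact."""
    return sorted(estimated_impact, key=lambda a: impact_key(estimated_impact[a]))


def mitigate(estimated_impact, execute, roll_back, failure_mitigated):
    """Try actions in ranked order, rolling back those that do not help."""
    for action in rank_actions(estimated_impact):
        execute(action)
        if failure_mitigated():
            return action
        roll_back(action)
    return None


# Hypothetical impact estimates for three candidate actions.
estimated_impact = {
    "deactivate switch": {"online_server_ratio": 0.9, "total_lost_pkt": 0.0,
                          "max_link_utilization": 0.5},
    "restart switch": {"online_server_ratio": 1.0, "total_lost_pkt": 0.0,
                       "max_link_utilization": 0.8},
    "deactivate port 7": {"online_server_ratio": 1.0, "total_lost_pkt": 0.0,
                          "max_link_utilization": 0.6},
}

applied = set()
tried = []


def execute(action):
    tried.append(action)
    applied.add(action)


def roll_back(action):
    applied.discard(action)


chosen = mitigate(estimated_impact, execute, roll_back,
                  failure_mitigated=lambda: "restart switch" in applied)
print(rank_actions(estimated_impact))
print(chosen)  # restart switch
```

In this toy run the least disruptive action is tried first and rolled back when it does not help, and the second-ranked action mitigates the failure.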

39

slide-40
SLIDE 40

Outline

  • Automating failure diagnosis is challenging
  • Failure mitigation is effective
  • How to automate mitigation?

– Localization → impact estimation → ranking

  • NetPilot evaluations

– Mitigating load imbalance
– Mitigating FCS errors
– Mitigating overload

  • Conclusion

40

slide-41
SLIDE 41

Load Imbalance

  • Agg_a stops receiving traffic
  • Localize to 4 suspects

[Topology diagram: core_a, core_b, Agg_a, Agg_b]

41

slide-42
SLIDE 42

42

slide-43
SLIDE 43

43

slide-44
SLIDE 44

44

slide-45
SLIDE 45

45

slide-46
SLIDE 46

46

slide-47
SLIDE 47

47

slide-48
SLIDE 48

Fast FCS Error Mitigation

NetPilot: deactivates 2 links in 1 trial within 15 minutes

Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated

48

slide-49
SLIDE 49

Fast FCS Error Mitigation

NetPilot: deactivates 2 links in 1 trial within 15 minutes

Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated

3.5 hours → 15 minutes

49

slide-50
SLIDE 50

50

slide-51
SLIDE 51

51

slide-52
SLIDE 52

Mitigating Link Overload

  • Mitigate overload by deactivating healthy links

– Many candidate links in production networks
– Choose the link(s) with the least impact

52

[Figure: three example topologies (agg, core1, core2) annotated with link loads 1.5/1.5/3, 1/1.5/3, and 3/3; one panel is marked "lost 0.5"]

slide-53
SLIDE 53

Action Ranking Lowers Link Utilization

  • Replay 97 overload incidents due to link failures

53

slide-54
SLIDE 54

Conclusion

  • Mitigation reduces failure recovery time

– Simple actions are effective
– Made possible by redundancy

  • NetPilot: automating failure mitigation

– Recovery time: hours → minutes
– Several mitigation scenarios deployed in Bing

54

slide-55
SLIDE 55

Thank You!

Detection → Diagnosis → Repair

NetPilot: Automated Mitigation

netpilot@microsoft.com

55

slide-56
SLIDE 56

56

slide-57
SLIDE 57

NetPilot Shortens Recovery Time

  • Time from detection to mitigation

– 6 months, many production datacenters

Operators work around 50% of failures in 2 hours

NetPilot mitigates 3 types of failures, all within 30 minutes

57