Motivation: Root cause analysis (PowerPoint PPT Presentation)

SLIDE 1

Motivation: Root cause analysis

  • Networks can (and frequently do!) have bugs
  • We need a good debugger!

[Figure: topology with the Internet, an SDN controller, and two web servers (4.3.2.0/24 and 4.3.3.0/24); Bob observes that traffic is arriving at the wrong server.]

From: alice@xyz.com
To: Admin (bob@xyz.com)
Title: Help! My server is receiving suspicious traffic from 4.3.2.0/24 -- it should have been sent to the low-security server. Packets from 4.3.3.0/24 are still being routed correctly. Can you help?

SLIDE 2

Debugging networks with provenance

  • Existing debuggers tell us what happened
    • Example: NetSight [NSDI’14]
  • Provenance offers a richer explanation
    • Example: Y! [SIGCOMM’14]

[Figure: provenance graph across switches A, B, C -- "C received packet" because "B sent packet", which depends on "Rule match on B" (the rule for 4.3.2.0/24 was installed by the controller after a PacketIn from 4.3.2.1: "next hop should be port2!") and "B received packet", tracing back through "A sent packet" and "Rule match on A".]

SLIDE 3

Problem: The explanation can be too big!

[Figure: a large provenance tree. The symptom at the root ("Packet arrives at the wrong server") is far away from the actual root cause deep in the tree ("Rule 7: Next-hop=port2"); intermediate vertexes such as "C received packet", "B sent packet", and "Rule match on B" are elided.]

SLIDE 4

What can we do?

  • Idea: Reason about the differences between the symptom and the reference

[Figure: topology with switches S1-S6, a DPI box, two web servers, and Bob, annotated with Alice's email from the first slide.]

Outages mailing list, Sept.-Dec. 2014: 66% of the reported problems had a working reference!

SLIDE 5

Differential provenance

  • Input: a bad symptom and a good reference
  • Debugger reasons about the differences
  • Output: root cause

[Figure: the failing execution (4.3.3.1 fails) and the working reference (4.3.2.1 works) are fed into Differential Provenance, which outputs the root cause: "Rule 7's next hop is wrong!"]

SLIDE 6

Outline

  • Motivation: Root cause analysis
  • Differential provenance
    • Background: Provenance
    • Strawman solution
    • Algorithm
  • Evaluation
    • Prototype implementation
    • Usability
    • Query processing speed
    • Complex network diagnostics
  • Conclusion

SLIDE 7

Background: Provenance

  • Provenance tracks causal connections between network events and state [ExSPAN-SIGMOD’10]

  • Provenance graph: vertexes → events/state; edges → causality
  • Provenance tree: Recursive explanation of an event/state

[Figure: a provenance tree rooted at the observed symptom PktRecv(@C, 4.3.3.0), explained by the event PktSend(@B, 4.3.3.0, next=C), which depends on the configuration state Flow(@B, 4.3.3.0, next=C) and the event PktRecv(@B, 4.3.3.0); the latter traces back through PktSend(@A, 4.3.3.0, next=B), Flow(@A, 4.3.3.0, next=B), and PktRecv(@A, 4.3.3.0).]
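The provenance tree sketched on this slide maps naturally onto a small recursive data structure. A minimal Python sketch (the `Vertex` class and `explain` helper are illustrative names, not part of any cited system):

```python
from dataclasses import dataclass, field
from typing import List

# A provenance vertex: a network event or piece of state, plus the
# vertexes that caused it (its children in the provenance tree).
@dataclass
class Vertex:
    label: str
    children: List["Vertex"] = field(default_factory=list)

# The tree from this slide: C received the packet because B sent it,
# which required both B's flow entry and B having received the packet,
# and so on back to A.
tree = Vertex("PktRecv(@C, 4.3.3.0)", [
    Vertex("PktSend(@B, 4.3.3.0, next=C)", [
        Vertex("Flow(@B, 4.3.3.0, next=C)"),
        Vertex("PktRecv(@B, 4.3.3.0)", [
            Vertex("PktSend(@A, 4.3.3.0, next=B)", [
                Vertex("Flow(@A, 4.3.3.0, next=B)"),
                Vertex("PktRecv(@A, 4.3.3.0)"),
            ]),
        ]),
    ]),
])

def explain(v, depth=0):
    """Recursively build the explanation of an event/state, one line per vertex."""
    lines = ["  " * depth + v.label]
    for child in v.children:
        lines += explain(child, depth + 1)
    return lines

print("\n".join(explain(tree)))
```

Calling `explain` on the symptom vertex reproduces the recursive explanation the slide describes: each line is an event or state, indented under the event it helps explain.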

SLIDE 8

Strawman solution

  • Strawman solution: Find vertexes that are different in the two trees

  • Problem: The diff can be larger than the individual trees!
[Figure: diffing the symptom tree against the reference tree; the faulty rule is buried deep in the symptom tree, and the naive diff can be larger than either tree.]
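The strawman can be sketched as a symmetric set difference over tree vertexes. A toy Python illustration (the tree encoding and helper names are assumptions) showing why the diff can rival the trees themselves in size:

```python
# Strawman: diff two provenance trees by collecting every vertex that
# appears in one tree but not the other. Tree = (label, [children]).

def vertices(tree):
    label, children = tree
    out = {label}
    for child in children:
        out |= vertices(child)
    return out

def naive_diff(symptom, reference):
    # Symmetric difference: vertexes unique to either tree.
    return vertices(symptom) ^ vertices(reference)

# Toy trees: one wrong flow entry at A sends the packet down a completely
# different path, so almost every vertex ends up in the diff.
symptom = ("Pkt@Srv2", [("Pkt@D", [("Pkt@C", [("Pkt@A", [("Flow(@A,next=C)", [])])])])])
reference = ("Pkt@Srv1", [("Pkt@E", [("Pkt@A", [("Flow(@A,next=E)", [])])])])

diff = naive_diff(symptom, reference)
# Only Pkt@A is shared; 7 of the 8 distinct vertexes land in the diff.
```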

SLIDE 9

  • Observation: The diff can be larger than the individual trees
  • Reason: “Butterfly effect”
  • A small initial difference can lead to drastically different events later on

[Figure: symptom and reference trees side by side. The only initial difference is a single flow entry at B (next=E vs. next=C), but every downstream event differs: Pkt@E vs. Pkt@C and Pkt@D, and delivery to Srv1 vs. Srv2.]

Why does the strawman solution not work?

SLIDE 10

Outline

  • Motivation: Root cause analysis
  • Key insight
  • Differential provenance
    • Background: Provenance
    • Strawman solution
    • Algorithm
  • Evaluation
    • Prototype implementation
    • Usability
    • Query processing speed
    • Complex network diagnostics
  • Conclusion

SLIDE 11

Algorithm: Refinement #1

  • This approach finds only the (small) initial differences
  • The (potentially large) consequences are ignored

  1. Roll back the execution to a divergence point
  2. Change the faulty node to be like the correct node
  3. Roll forward the execution to align the trees

SLIDE 12

Algorithm: Refinement #1 (Cont’d)

  • Approach: Roll back the execution, change the first faulty node, and roll forward again to align the trees

[Figure: symptom and reference trees. After rolling back to the divergent flow entry at B, changing Flow(next=C) to Flow(next=E), and rolling forward, the symptom tree re-derives Pkt@E, Flow(next=Srv1), and Pkt@Srv1, aligning it with the reference.]
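On a toy model in which forwarding is a pure function of the flow table, the roll-back / change / roll-forward loop of Refinement #1 can be sketched as follows (the tables and helper names are illustrative, not DiffProv's actual implementation):

```python
# Refinement #1 on a toy model: forwarding is a pure function of the flow
# table, so "rolling forward" just means re-deriving the packet's path.

good_table = {"A": "B", "B": "E", "E": "Srv1"}            # reference run
bad_table  = {"A": "B", "B": "C", "C": "D", "D": "Srv2"}  # symptom run

def derive_path(table, start, servers):
    """Roll forward: follow next-hop entries until a server is reached."""
    path, node = [start], start
    while node not in servers:
        node = table[node]
        path.append(node)
    return path

# Step 1 (roll back): find the first point where the executions diverge.
reference_path = derive_path(good_table, "A", {"Srv1", "Srv2"})
divergence = next(n for n in reference_path
                  if bad_table.get(n) != good_table.get(n))

# Step 2: change the faulty entry to be like the correct one.
patched = dict(bad_table)
patched[divergence] = good_table[divergence]

# Step 3 (roll forward): re-derive the path; entries the old execution
# never created are filled in from the reference.
for node, hop in good_table.items():
    patched.setdefault(node, hop)
new_path = derive_path(patched, "A", {"Srv1", "Srv2"})
```

Here the divergence point is switch B; after changing its next hop, rolling forward yields the same path as the reference run.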

SLIDE 13

How to preserve crucial differences?

  • Problem: There are differences that we need to preserve
  • Example: The packets whose provenance we are looking at

[Figure: symptom and reference provenance trees for two different packets (4.3.2.1 and 4.3.3.1); the trees necessarily differ in the packet headers themselves, and these differences must be preserved rather than "corrected" away.]

SLIDE 14

Solution: Establish equivalence

  • Establish an equivalence relation between the trees
  • Example: IP addresses 4.3.2.1 and 4.3.3.1
  • Values on the trees can be identical, equivalent, or different
  • Goal: Make the trees equivalent, not necessarily identical!

[Figure: symptom and reference trees with an equivalence relation between the IP addresses 4.3.2.1 and 4.3.3.1; corresponding vertexes are marked identical, equivalent, or different.]
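A minimal sketch of the three-way test, assuming the initial equivalence relation is just a mapping between the two packets' IP addresses (the function and variable names are mine, for illustration):

```python
# Two vertexes are "identical" if their labels match exactly, "equivalent"
# if they match after applying the initial mapping between the packets'
# differing fields, and "different" otherwise.

mapping = {"4.3.2.1": "4.3.3.1"}   # field in one tree -> field in the other

def classify(label_a, label_b):
    if label_a == label_b:
        return "identical"
    mapped = label_a
    for src, dst in mapping.items():
        mapped = mapped.replace(src, dst)
    return "equivalent" if mapped == label_b else "different"
```

Under this relation, `Flow(next=B)` in both trees is identical, `Pkt(4.3.2.1)@B` vs. `Pkt(4.3.3.1)@B` are equivalent, and `Pkt(4.3.2.1)@C` vs. `Pkt(4.3.3.1)@E` are genuinely different: the goal is to make the trees equivalent, not identical.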

SLIDE 15

Algorithm: Refinement #2

  • Benefit: Preserves the crucial differences between the trees

  1. Roll back the execution to a non-equivalent point
  2. Change the faulty node to its equivalent
  3. Roll forward the execution to align the trees

SLIDE 16

Establishing and propagating equivalence

  • Start with an initial equivalence relation between the packets
  • Establish a mapping between packet fields that are different
  • Keep track of the mapping while going up the tree
  • Stop at the first non-equivalent(!) node
  • More general approach: taint analysis

[Figure: bad and reference provenance trees; the equivalence mapping between the two packets' fields is propagated upward through both trees until the first non-equivalent vertex is reached.]

SLIDE 17

Propagating equivalence with taints

  • Approach:
  • Create taints for equivalent fields
  • Propagate taints up the tree
  • Repeat until we find a non-equivalent node

[Figure: taint propagation through a computation. The rule derives pktstat(pt, 8*sz, c+1) from pkt(pt, sz) and pktcnt(c). In the symptom, pkt(80, 100) yields pktstat(80, 800, 2); in the reference, pkt(51, 101) yields pktstat(51, 808, 2). The fields differ in value but carry matching taints, so the vertexes are equivalent.]
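The taint propagation on this slide can be sketched with a tiny tainted-value wrapper: arithmetic carries the taint along, so derived fields such as `8*sz` stay comparable across the two runs even though their values differ. (Toy Python; the class, taint ids, and rule encoding are assumptions for illustration.)

```python
# Toy tainted-value wrapper: arithmetic propagates the taint, so fields
# derived from equivalent inputs remain comparable across the two runs.

class Tainted:
    def __init__(self, value, taints=frozenset()):
        self.value, self.taints = value, frozenset(taints)
    def __mul__(self, k):
        # f(x) = k*x: the result carries the same taints as x.
        return Tainted(self.value * k, self.taints)

def pktstat(pt, sz, cnt):
    """The slide's rule: derive pktstat(pt, 8*sz, c+1) from pkt(pt, sz) and pktcnt(c)."""
    return (pt, sz * 8, cnt + 1)

# Symptom pkt(80, 100) vs. reference pkt(51, 101): mark the differing
# fields as pairwise equivalent by giving them shared taint ids.
sym = pktstat(Tainted(80, {"t_port"}), Tainted(100, {"t_size"}), 1)
ref = pktstat(Tainted(51, {"t_port"}), Tainted(101, {"t_size"}), 1)

# Fields with matching taints are equivalent even though their values differ.
equivalent = [s.taints == r.taints for s, r in zip(sym[:2], ref[:2])]
```

Climbing the tree, the algorithm keeps going as long as the taints line up, and stops at the first field whose taints do not match.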

SLIDE 18

Changing the faulty node

  • Change the faulty node to its equivalent: Pkt(4.3.2.1)@C → Pkt(4.3.2.1)@E
  • Has dependent nodes → create their equivalents recursively
    • Example: Flow(next=C) → Flow(next=E)
  • No dependent nodes → insert its equivalent
    • Example: Insert Flow(next=E)
  • See paper for how to propagate taints in the reverse direction

[Figure: to align the trees, the algorithm needs Pkt(4.3.2.1)@E and Flow(next=E) in the bad provenance; equivalents of existing nodes are created recursively, and nodes without dependents are inserted directly.]
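The recursive part of this step can be sketched in a few lines: replace a vertex by its equivalent and recursively create equivalents for its dependents. (Toy tree encoding; the mapping and helper names are assumptions, not DiffProv's code.)

```python
# Sketch of "changing the faulty node": rewrite a vertex to its equivalent
# and recursively create equivalents for its dependent vertexes; vertexes
# not covered by the mapping are kept as-is. Tree = (label, [children]).

mapping = {
    "Pkt(4.3.2.1)@C": "Pkt(4.3.2.1)@E",
    "Flow(next=C)": "Flow(next=E)",
}

def make_equivalent(node):
    label, children = node
    new_label = mapping.get(label, label)
    # Recursively create equivalents for the dependent nodes.
    return (new_label, [make_equivalent(c) for c in children])

faulty = ("Pkt(4.3.2.1)@C", [("Flow(next=C)", []), ("Pkt(4.3.2.1)@B", [])])
fixed = make_equivalent(faulty)
```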

SLIDE 19

Problem: Multiple faults

  • Problem: There could be more than one difference between the two trees
  • Solution: Repeat until the trees are completely aligned

[Figure: the bad and reference trees are aligned piece by piece; each iteration (changes #1-#3) fixes one divergence and aligns another subtree, until the trees are completely aligned.]

SLIDE 20

Refinement #3: Final algorithm

  1. Roll back the execution to a non-equivalent point
  2. Change the faulty node to its equivalent
  3. Roll forward the execution to align the trees
  4. Completely equivalent? NO → repeat from step 1; YES → output the changes
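Putting the refinements together, the final loop can be sketched on the same kind of toy flow-table model as before: repeat until no non-equivalent entry remains, then output the accumulated changes. (Illustrative only; real DiffProv operates on provenance trees, and this degenerate table model only shows the control flow.)

```python
# Toy version of the final loop: repeat (roll back to the first
# non-equivalent entry, change it to its equivalent) until both runs are
# completely equivalent, then output the accumulated changes.

good = {"A": "B", "B": "E", "E": "Srv1"}
bad  = {"A": "B", "B": "C", "E": "Srv2", "C": "D", "D": "Srv2"}  # two faults

def diffprov_loop(bad_table, good_table):
    changes, current = [], dict(bad_table)
    while True:
        # Roll back: first entry (in the reference's order) that diverges.
        faulty = next((n for n in good_table
                       if current.get(n) != good_table[n]), None)
        if faulty is None:            # completely equivalent: we're done
            return changes
        changes.append((faulty, current.get(faulty), good_table[faulty]))
        current[faulty] = good_table[faulty]   # change, then roll forward

changes = diffprov_loop(bad, good)
```

With two injected faults, the loop runs twice and reports both faulty entries, mirroring the "repeat until completely aligned" step above.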

SLIDE 21

Rolling forward the execution

  • Roll the execution forward to align the trees
  • Output the accumulated change(s): Flow(next=C) → Flow(next=E)!

[Figure: after the change, rolling forward re-derives the bad provenance so that it is equivalent to the reference; the accumulated change Flow(next=C) → Flow(next=E) is reported as the root cause.]

SLIDE 22

Outline

  • Motivation: Root cause analysis
  • Differential provenance
    • Background: Provenance
    • Strawman approach
    • Algorithm
  • Evaluation
    • Prototype implementation
    • Usability
    • Query processing speed
    • Complex network diagnostics
  • Conclusion

SLIDE 23

Prototype implementation: DiffProv

  • Mostly focuses on Network Datalog (NDlog) [CACM ’09] programs, where provenance is easy to see
  • NetCore [NSDI ’13] programs are also supported
  • Applicable beyond SDN: Hadoop MapReduce
  • Integrated with Mininet + the Beacon controller; based on RapidNet

SLIDE 24

Evaluation: Overview

  • Q1: How well does DiffProv find the root cause?
  • Q2: How much overhead does DiffProv incur at runtime?
  • Q3: How quickly does DiffProv answer diagnostic queries?
  • Q4: How well does DiffProv recognize bad reference events?
  • Q5: How well does DiffProv work for complex networks?

SLIDE 25

Experimental setup

  • We adapted seven diagnostic scenarios:
    • SDN1: Broken flow entry [Empr.Soft.Eng.’09]
    • SDN2: Multi-controller inconsistency [CoNEXT’14]
    • SDN3: Unexpected rule expiration [P2P’13]
    • SDN4: Multiple faulty entries [Empr.Soft.Eng.’09]
    • Complex network diagnostics [CoNEXT’12]
    • MR1: Configuration changes [Industry collaborators]
    • MR2: Code changes [Industry collaborators]
  • Baseline: Y!, a provenance debugger without reference support [SIGCOMM’14]

SLIDE 26

How well does DiffProv find the root cause?

[Figure: naive diff vs. DiffProv. The symptom and reference trees have on the order of 201 and 278 nodes, and the naive diff still contains 156 nodes, but DiffProv narrows the result to a single node, the faulty rule: "Next hop of rule 7 is wrong".]

SLIDE 27

How well does DiffProv find the root cause? (Cont’d)

  • DiffProv finds one or two nodes (the faulty rules or MapReduce configuration entries), which are the actual root cause

[Table: for queries SDN1-SDN4 and MR1/MR2 (direct and indirect), the sizes of the symptom tree, the reference tree, and a plain tree diff (up to hundreds of nodes) versus DiffProv's output of one or two nodes, matching the number of injected faults.]

SLIDE 28

How long does DiffProv take to find the root causes?

  • DiffProv answered most of our queries within one minute!

[Chart: DiffProv query turnaround times for SDN1-SDN4, MR1, and MR2: 52.6 s, 52.9 s, 53.7 s, 38.9 s, 105.2 s, and 39.6 s, respectively; all but one query finished within about a minute.]

SLIDE 29

How well does DiffProv work in complex networks?

  • Setup: larger topology, complex config, background traffic
    • ‘Forwarding error’ scenario [ATPG-CoNEXT’12]
    • Stanford network: 757,000 forwarding entries and 1,500 ACL rules
    • Multiple faults: injected 20 additional faulty entries
    • Background traffic: 12 GB of traffic, 69 protocol types
  • Results:
    • DiffProv returned a single node: the faulty entry for the misconfigured subnet
    • It identified the root cause despite heavy interference
  • Why is DiffProv not confused by the interference?
    • Provenance captures causality, not merely correlations!
SLIDE 30

Summary

  • Debugging networks is hard
  • Need good debuggers to find root causes!
  • Key insight: We can use reference events
  • We often have more information than we are using
  • Idea: Reason about the differences between bad events and reference events

  • Approach: Differential provenance
  • We have built a prototype debugger for SDNs
  • Applicable to other distributed systems beyond SDNs
  • Result: Very precise diagnostics
  • Differential provenance can often identify a single root cause

Thank you!