Ang Chen, Yang Wu, Andreas Haeberlen, Boon Thau Loo, Wenchao Zhou
Data Provenance at Internet Scale: Architecture, Experiences, and the Road Ahead
Motivation
- An example scenario: network routing
– System administrator observes strange behavior
– Example: the route to foo.com has suddenly changed
– Anomalies in distributed systems
- Need a way to explain system behavior.
[Figure: Alice reaches foo.com through a network of nodes A, B, C, D, and E; her route changes from r1 to r2.]
Alice: "Why did my route to foo.com change?!"
Possible causes: Innocent reason? Software bugs? Malicious attack?
Data-centric Perspective on Network Debugging
- We assume a general distributed system
– Network consists of nodes (routers, middleboxes, ...)
– The state of a node is a set of tuples (routes, config, ...)
– Idea: Explanation as reasoning about distributed state dependencies
[Figure: nodes A, B, C, D, and E between Alice and foo.com. The state of node A includes route(A, B), route(A, D), route(A, foo.com), link(A, B), link(A, D), and more. The explanation of route(A, foo.com) is traced back step by step: first to route(B, foo.com) and link(A, B), then to route(C, foo.com) and link(B, C), and finally to link(C, foo.com).]
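The data-centric view can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: the two derivation rules (a link yields a route; a link to a neighbor that has a route yields a longer route) and the helper name `derive_routes` are assumptions chosen to match the link and route tuples in the figure.

```python
# Each tuple is (relation, node, argument); the tuple is part of `node`'s state.
links = {
    ("link", "A", "B"),
    ("link", "B", "C"),
    ("link", "C", "foo.com"),
}

def derive_routes(links):
    """Derive route tuples to a fixpoint and record what each one depends on.

    Assumed rules (illustrative only):
      route(X, Y) <- link(X, Y)
      route(X, Z) <- link(X, Y), route(Y, Z)
    """
    state = set(links)
    depends_on = {}  # derived tuple -> list of tuples it was derived from
    changed = True
    while changed:
        changed = False
        routes = [t for t in state if t[0] == "route"]  # snapshot for this round
        for (_, x, y) in links:
            candidates = [(("route", x, y), [("link", x, y)])]
            for (_, via, z) in routes:
                if via == y and z != x:
                    body = [("link", x, y), ("route", y, z)]
                    candidates.append((("route", x, z), body))
            for head, body in candidates:
                if head not in state:
                    state.add(head)
                    depends_on[head] = body
                    changed = True
    return state, depends_on

state, depends_on = derive_routes(links)
print(depends_on[("route", "A", "foo.com")])
```

Keeping the `depends_on` map next to the state is the point of the sketch: once every derived tuple remembers its inputs, "why does A have this route?" becomes a lookup rather than guesswork.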
Network Provenance [SIGMOD 2010]
- Provenance for encoding distributed state dependencies
– Explains the derivation of tuples
– Captures the dependencies between tuples as a graph
– Explanation of a tuple is an acyclic graph rooted at the tuple
[Figure: provenance graph over the tuples route(A, foo.com), link(A, B), route(B, foo.com), link(B, C), route(C, foo.com), link(C, foo.com), route(D, foo.com), link(D, E), route(E, foo.com), and link(E, B). The explanation of route(A, foo.com) is the subgraph containing link(A, B), route(B, foo.com), link(B, C), route(C, foo.com), and link(C, foo.com).]
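As a rough sketch of "explanation as a graph rooted at the tuple", the Python below stores dependencies as a dictionary and extracts the part reachable from a queried tuple. The edges are an assumption read off the tuple names in the figure (in particular the D and E entries), and `explain` and `render` are hypothetical helper names, not part of any tool mentioned in the talk.

```python
# Dependency graph: derived tuple -> tuples it was derived from.
# Edges are assumed from the figure's tuple names; base tuples (links) have no entry.
provenance = {
    "route(A, foo.com)": ["link(A, B)", "route(B, foo.com)"],
    "route(B, foo.com)": ["link(B, C)", "route(C, foo.com)"],
    "route(C, foo.com)": ["link(C, foo.com)"],
    "route(D, foo.com)": ["link(D, E)", "route(E, foo.com)"],
    "route(E, foo.com)": ["link(E, B)", "route(B, foo.com)"],
}

def explain(tuple_name, provenance):
    """Return the edges of the dependency graph reachable from tuple_name."""
    edges = []
    seen = set()
    stack = [tuple_name]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for child in provenance.get(node, []):
            edges.append((node, child))
            stack.append(child)
    return edges

def render(tuple_name, provenance, depth=0):
    """Render the explanation as indented text lines, root first."""
    lines = ["  " * depth + tuple_name]
    for child in provenance.get(tuple_name, []):
        lines.extend(render(child, provenance, depth + 1))
    return lines

print("\n".join(render("route(A, foo.com)", provenance)))
```

Querying route(A, foo.com) touches only the A, B, and C tuples; the D and E tuples stay out of the explanation because nothing on A's derivation path depends on them.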
NetTrails: First Generation Network Provenance Tool
- http://netdb.cis.upenn.edu/nettrails/ [SIGMOD 2011 demo]
Network Provenance Research (2010 – 2017)
- Network provenance [SIGMOD’10]
- Secure network provenance [SOSP’11]
- Provenance in dynamic environments [VLDB’13]
- Negative provenance [SIGCOMM’14]
- Distributed provenance compression [SIGMOD’17]
- Differential provenance [SIGCOMM’16]
- Meta-provenance [NSDI’17]
Ph.D. dissertation work of Ang Chen (2017), Chen Chen (2017), Yang Wu (2017), and Wenchao Zhou (2012).
[Figure labels: "Explanations" and "Deeper diagnostics and repair".]
Assumption #1: All nodes in the network can be trusted
[Figure: Alice queries the network (nodes A, B, C, D, and E) about her route r2 to foo.com.]
Alice: "Why did my route to foo.com change?!"
Q: Explain why the route to foo.com changed to r2.
A: Because someone accessed Router D and changed the configuration from X to Y.
Not realistic: an adversary can tell lies.
Challenge: Adversaries Can Lie
Q: Explain why the route to foo.com changed to r2.
Adversary (thinking): "I should cover up the intrusion."
Adversary (answering): "Everything is fine. Router E advertised a new route."
Problem: adversary can …
– fabricate plausible (yet incorrect) response
– point accusation towards innocent nodes
Secure Network Provenance (SNP) [SOSP 2011]
- Step 1: Each node keeps vertices about local actions
– Split cross-node communications
- Step 2: Make the graph tamper-evident
[Figure: the provenance graph over route(A, foo.com), link(A, B), route(B, foo.com), link(B, C), route(C, foo.com), link(C, foo.com), route(D, foo.com), link(D, E), route(E, foo.com), and link(E, B) is partitioned by node; communication between nodes is split into SEND and RECV vertices.]
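One common way to make a per-node log tamper-evident is a hash chain, in which every entry commits to the digest of the entry before it. The sketch below illustrates that general technique only; it is not the SNP implementation, the class and function names are hypothetical, and it omits the signatures and cross-node commitments that a real system would need to stop a node from silently rewriting its whole log.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest that precedes the first entry

def digest_of(record):
    """Hash a record deterministically (keys sorted so the encoding is stable)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

class TamperEvidentLog:
    """Append-only log of local actions such as SEND and RECV vertices."""

    def __init__(self):
        self.entries = []  # list of (record, digest) pairs

    def append(self, kind, payload):
        prev = self.entries[-1][1] if self.entries else GENESIS
        record = {"kind": kind, "payload": payload, "prev": prev}
        digest = digest_of(record)
        self.entries.append((record, digest))
        return digest

def verify(entries):
    """Check that each entry hashes correctly and links to its predecessor."""
    prev = GENESIS
    for record, digest in entries:
        if record["prev"] != prev or digest_of(record) != digest:
            return False
        prev = digest
    return True

log = TamperEvidentLog()
log.append("RECV", "route(C, foo.com) from C")
log.append("SEND", "route(B, foo.com) to A")
print(verify(log.entries))
```

Changing any earlier entry after the fact changes its digest, which breaks the `prev` link of every later entry, so an auditor who holds the latest digest can detect the edit.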
SNP Guarantees
- No faults: Explanation is complete and accurate
- Byzantine fault: Explanation identifies at least one faulty node
Q: Why did my route to foo.com change to r2?
A: Because someone accessed Router D and changed its router configuration from X to Y.
Alice: "Aha, at least I know which node is compromised."
Assumption #2: Operators react only to presence of anomaly events
- What if something expected is not happening?
- Missing events cannot be handled by existing tools
[Figure: an HTTP server in a data center network, connected to the Internet and managed by a controller.]
Operator: "Why is the HTTP server NOT getting requests?"
How common are missing events?
Mailing list      Missing events   Positive events
Outages           83%              17%
NANOG-user        52%              48%
floodlight-dev    74%              26%
- Missing events are consistently in the majority
- Lengthier email threads for missing events
Negative Provenance [SIGCOMM 2014]
[Figure: an HTTP server in a data center network, connected to the Internet and managed by a controller.]
Why is the HTTP server NOT getting requests?
Vertices of the negative provenance tree, in the order the slide reveals them:
– No HTTP Packet arrived at HTTP Server
– No Forwarding-FlowEntry installed at Switch
– HTTP Packet received at Switch
– Dropping-FlowEntry existed at Switch
– … Program
– …
Approach: Counter-factual reasoning
Find all the ways a missing event could have occurred, and show why each of them did not happen.
Example: Why did Bob NOT arrive at CIDR?
[Figure: map showing Philly and Santa Cruz.]
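The counter-factual idea can be sketched as a recursive "why not" query over derivation rules. Everything below is an assumption made for illustration: the rule set, the fact names, and the helpers `holds` and `why_not` are hypothetical and only loosely modeled on the HTTP server example, not taken from the SIGCOMM 2014 system.

```python
# Rules (assumed): the head event occurs if every event in the body occurred.
RULES = [
    ("packet_at_server", ["packet_at_switch", "forwarding_entry_at_switch"]),
    ("forwarding_entry_at_switch", ["controller_installed_forwarding_entry"]),
]

# Events that were actually observed (assumed).
FACTS = {"packet_at_switch"}

def holds(goal, facts, rules):
    """True if the goal was observed or can be derived by some rule."""
    if goal in facts:
        return True
    return any(
        head == goal and all(holds(item, facts, rules) for item in body)
        for head, body in rules
    )

def why_not(goal, facts, rules, depth=0):
    """List every way the goal could have occurred and why each one failed."""
    pad = "  " * depth
    if holds(goal, facts, rules):
        return [f"{pad}{goal}: present"]
    lines = [f"{pad}{goal}: MISSING"]
    ways = [body for head, body in rules if head == goal]
    if not ways:
        lines.append(f"{pad}  no rule derives it and it was never observed")
    for body in ways:
        for item in body:
            lines.extend(why_not(item, facts, rules, depth + 1))
    return lines

print("\n".join(why_not("packet_at_server", FACTS, RULES)))
```

The output walks down from the missing event to the first precondition that has no support at all, which is the shape of explanation the slide describes: the packet reached the switch, but no forwarding entry was ever installed.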
Assumption #3: Provenance trees alone are sufficient for diagnostics
[Figure: a large provenance tree with vertices such as "C received packet", "B sent packet", "Rule match on B", … Two vertices are labeled. Symptom: packet arrives at the wrong server. Root cause: Rule 7: Next-hop=port2.]
What can we do?
From: alice@xyz.com
To: Admin (bob@xyz.com)
Title: Help!
My server is receiving suspicious traffic from 4.3.2.0/24--it should have been sent to the low-security server. Packets from 4.3.3.0/24 are still being routed correctly. Can you help?
[Annotation on the email: Working reference!]
Outages mailing list, Sept. to Dec. 2014: 66% have references!
[Figure: Bob's network with switches S1, S2, S3, S4, S5, and S6, Web server 1, Web server 2, and a DPI device.]
- Idea: Reason about the differences between the symptom and the reference
Differential provenance [SIGCOMM 2016]
- Input: a bad symptom and a good reference
- Debugger reasons about the differences
- Output: root cause
Example: 4.3.3.1 fails, 4.3.2.1 works. Differential provenance reports: "Rule 7’s next hop is wrong!"
Strawman solution
- Strawman solution: Find vertexes that are different in the two trees
Provenance (Symptom) - Provenance (Reference) = faulty rule?
- Problem: The diff can be larger than the individual trees!
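A toy example shows why the plain diff misbehaves. The vertex labels and the helper `provenance_vertices` below are hypothetical; the only property that matters is that most vertices mention the packet they describe, so two trees for two different packets share almost nothing even when they took nearly the same path.

```python
def provenance_vertices(src_ip, out_port):
    """Vertex labels of a toy provenance tree for one packet (hypothetical shape)."""
    return {
        f"packet({src_ip}) received at server",
        f"packet({src_ip}) sent by S2 via {out_port}",
        f"packet({src_ip}) matched rule at S2",
        f"packet({src_ip}) received at S2",
        f"packet({src_ip}) sent by S1",
        "link(S1, S2) configured",  # the only vertex that does not mention the packet
    }

symptom = provenance_vertices("4.3.3.1", "port2")    # the failing packet
reference = provenance_vertices("4.3.2.1", "port1")  # the working packet

# Strawman: report every vertex that appears in exactly one of the two trees.
diff = symptom ^ reference

print(len(symptom), len(reference), len(diff))
```

Each tree has 6 vertices, yet the diff has 10, because 5 vertices on each side differ only in the packet address. The single difference that matters, the outgoing port, is buried among differences that are harmless.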
Overly Simplified Approach in a nutshell
- Roll back the execution to a divergence point
- Change the faulty node to be like the correct node
- Roll forward the execution to align the trees
Assumption #4: Software is correct and static
- Networks are software and can have bugs
[Figure: an SDN controller managing switches S0, S1, and S2 (ports 3, 4, and 5 are labeled), with a Main Web Server, a Backup Web Server, and a DNS Server. Traffic labels: "HTTP traffic" and "Off-loading HTTP".]
Controller program excerpt:
else if (switch == S1 && protocol == HTTP) then action = output:3.
else if (switch == S1 && protocol == HTTP) then action = output:5.
Copy-and-paste bug!!! Both branches test the same condition, so the second one can never fire.
Why is the backup web server not getting requests?
- How can we find and fix bugs quickly?
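The effect of the duplicated condition is easy to reproduce. The Python below mirrors the two branches from the slide; the function names are hypothetical, and the "fixed" variant assumes the second branch was meant for switch S2, which the slide does not state.

```python
def buggy_action(switch, protocol):
    """Mirror of the slide's branch chain, including the duplicated condition."""
    if switch == "S1" and protocol == "HTTP":
        return "output:3"
    elif switch == "S1" and protocol == "HTTP":  # copy-and-paste bug: same test as above
        return "output:5"  # unreachable, the first branch always wins
    return None

def fixed_action(switch, protocol):
    """One plausible fix, ASSUMING the second branch was intended for S2."""
    if switch == "S1" and protocol == "HTTP":
        return "output:3"
    elif switch == "S2" and protocol == "HTTP":
        return "output:5"
    return None

# Try every combination of a few inputs: the buggy program never emits output:5.
inputs = [(s, p) for s in ("S0", "S1", "S2") for p in ("HTTP", "DNS")]
reachable = {buggy_action(s, p) for s, p in inputs}
print("output:5" in reachable)
```

No input makes the buggy program choose output:5, so whatever is attached to that port never receives traffic. This is a missing event caused by the program itself rather than by any configuration or packet.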
Approach: Meta provenance [NSDI 2017]
- Key idea: Treating program as data
- Idea: Provenance can pinpoint the root cause
- But previous provenance work focuses exclusively on data
- Problem: Finding fixes is hard
[Figure: meta provenance tree with vertices "HTTP Packet received at Main Web Server", "Matching Flow Entry installed at S1", "Executed If Clause in Controller Program", and "PacketIn received at Controller".]
Repairing Software-defined Networking Programs
- Full support of Network Datalog (declarative)
- Supports a subset of Trema (imperative Ruby)
- Supports a subset of Pyretic (Python + DSL)
- Uses Z3 SMT solver to enumerate repairs
- More details are in [HotNets’15, NSDI’17]
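To give a feel for "program as data" and repair enumeration, the sketch below stores a controller program as a list of rule tuples and searches for single-constant changes that satisfy a set of expectations. It deliberately substitutes a brute-force search for the Z3-based enumeration named on the slide, and the rule encoding, the expectations, and the function names are all assumptions made for illustration.

```python
from itertools import product

# The controller program as data: (switch, protocol, action) rules, first match wins.
PROGRAM = [
    ("S1", "HTTP", "output:3"),
    ("S1", "HTTP", "output:5"),  # suspected copy-and-paste bug
]

# Expected behavior (assumed): keep what works, restore what is missing.
EXPECTED = {
    ("S1", "HTTP"): "output:3",
    ("S2", "HTTP"): "output:5",
}

def run(program, switch, protocol):
    """Evaluate the rule list like an if / else-if chain."""
    for rule_switch, rule_protocol, action in program:
        if (rule_switch, rule_protocol) == (switch, protocol):
            return action
    return None

def enumerate_repairs(program, expected, switches=("S0", "S1", "S2")):
    """Try changing the switch constant of one rule; keep changes that meet expectations."""
    repairs = []
    for index, new_switch in product(range(len(program)), switches):
        old_switch, protocol, action = program[index]
        if new_switch == old_switch:
            continue
        candidate = list(program)
        candidate[index] = (new_switch, protocol, action)
        if all(run(candidate, s, p) == a for (s, p), a in expected.items()):
            repairs.append((index, old_switch, new_switch))
    return repairs

print(enumerate_repairs(PROGRAM, EXPECTED))
```

Because the program is ordinary data, a candidate repair is just a modified copy that can be re-evaluated against the expectations. Of the four single-constant changes tried here, only changing the second rule's switch from S1 to S2 satisfies both. A solver-based approach replaces the exhaustive loop with constraints, which matters once the space of candidate changes is too large to enumerate.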
Network Provenance Research (2010 – 2017)
- Network provenance model [SIGMOD’10]
- Secure network provenance [SOSP’11]
- Provenance in dynamic environments [VLDB’13]
- Negative provenance [SIGCOMM’14]
- Distributed provenance compression [SIGMOD’17]
- Differential provenance [SIGCOMM’16]
- Meta-provenance [NSDI’17]
Ph.D. dissertation work of Ang Chen (2017), Chen Chen (2017), Yang Wu (2017), and Wenchao Zhou (2012).
[Figure labels: "Explanations" and "Deeper diagnostics and repair".]
The Road Ahead
- Network forensics meets data provenance is a rich area of exploration!
- Sampling of the problems we are working on:
– Network forensics on the data plane
– Privacy-preserving provenance on sensitive networks
– Probabilistic provenance
– Automated repairs of complex events
– Timing-based provenance
– and more….
Thank You!
- Network provenance team at Penn/Georgetown:
– Ang Chen, Chen Chen, Ling Ding, Qiong Fei, Andreas Haeberlen, Zachary Ives, Yang Li, Boon Thau Loo, Suyog Mapara, Arjun Narayan, Yiqing Ren, Micah Sherr, Shengzhi Sun, Tao Tao, Yang Wu, Mingchen Zhao, Wenchao Zhou