Tolerating Application Failures with LegoSDN
Balakrishnan Chandrasekaran, Theophilus Benson
Duke University
Quality of Code
“In C, I never learned to use the debugger, so I used to never make mistakes …” “I went millions and millions of hours with no problems—probably tens of millions of hours with no problems.” — Arthur Whitney, creator of A, K and Q.
ACM Queue, Feb 2009.
October 28, 2014 HotNets 2014 | LegoSDN 2
Bugs are endemic in software!
§ Bugs can be deterministic or non-deterministic
§ [STS] POX: premature PacketIn
– l2_multi routing module failed unexpectedly with a KeyError.
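To make this failure mode concrete, here is a hypothetical Python sketch (not POX's actual l2_multi code) of a routing module that crashes with a KeyError when a PacketIn arrives before the module has learned about the switch that sent it. The class `RoutingApp` and its handlers are illustrative names only.

```python
class RoutingApp:
    """Hypothetical routing module that keeps per-switch MAC tables."""

    def __init__(self):
        self.mac_tables = {}            # dpid -> {MAC address -> port}

    def on_switch_up(self, dpid):
        self.mac_tables[dpid] = {}

    def on_packet_in(self, dpid, src_mac, in_port):
        # Assumes on_switch_up(dpid) already ran. A PacketIn that arrives
        # first ("premature") makes this lookup raise KeyError.
        table = self.mac_tables[dpid]
        table[src_mac] = in_port
        return table


app = RoutingApp()
try:
    app.on_packet_in(dpid=1, src_mac="00:00:00:00:00:01", in_port=3)
except KeyError as err:
    print("crash:", repr(err))          # prints: crash: KeyError(1)
```

Whether the crash happens depends on the order in which events arrive, which is one way such bugs end up non-deterministic.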
Cascading Crashes
[Figure: SDN controller hosting App1, App2, …]
LegoSDN
§ Availability is of utmost importance
– Second only to security
Fate-sharing
§ Fate-sharing relationships between
– the SDN controller and the SDN application(s) (and also among SDN applications)
– the SDN application and the network
§ A failure in any one SDN application brings down the other applications and the SDN controller.
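A minimal sketch of fate-sharing, assuming a hypothetical in-process dispatcher (`dispatch`, `buggy_app` are illustrative names) that calls each app's handler in turn on the controller's own stack: an unhandled exception in one handler unwinds the whole dispatch loop, so later apps never see the event, and if nothing above catches it the controller process exits too.

```python
def dispatch(apps, event):
    """Naive in-process dispatcher: every app runs on the controller's stack."""
    delivered = []
    for name, handler in apps:
        handler(event)                  # an exception here escapes dispatch()
        delivered.append(name)
    return delivered


def buggy_app(event):
    raise KeyError(event["dpid"])       # e.g. per-switch state never learned


seen = []
apps = [("App1", buggy_app), ("App2", seen.append)]

try:
    dispatch(apps, {"dpid": 1})
except KeyError:
    # Only a handler above the loop stops the crash; App2 never got the event.
    print("App2 saw:", seen)            # prints: App2 saw: []
```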
Three-pronged approach
[Figure: SDN controller hosting App1, App2, …]
1. Contain crash
2. Undo changes
3. Handle message
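The three prongs can be sketched together as one recovery step. This is a minimal Python sketch under our own assumptions, not LegoSDN's implementation: the app is a black-box object whose state can be checkpointed with `copy.deepcopy`, its network changes are modeled as a list of rules, and the crash-inducing message is simply dropped. `SupervisedApp`, `deliver`, and the `handle(msg, rules)` interface are hypothetical.

```python
import copy


class SupervisedApp:
    def __init__(self, app):
        self.app = app                  # black box exposing handle(msg, rules)
        self.rules = []                 # network rules this app has installed
        self._checkpoint()

    def _checkpoint(self):
        self.saved_app = copy.deepcopy(self.app)
        self.saved_rules = list(self.rules)

    def deliver(self, msg):
        try:
            self.app.handle(msg, self.rules)
        except Exception:
            # 1. Contain the crash: the exception stops here.
            # 2. Undo changes: restore app state and rules from the checkpoint.
            self.app = copy.deepcopy(self.saved_app)
            self.rules = list(self.saved_rules)
            # 3. Handle the message: this sketch simply drops it.
            return False
        self._checkpoint()              # processed cleanly; advance checkpoint
        return True


class Counter:
    def __init__(self):
        self.count = 0

    def handle(self, msg, rules):
        self.count += 1
        rules.append(("match", msg))
        if msg == "bad":
            raise ValueError("unexpected message")


sup = SupervisedApp(Counter())
sup.deliver("a")
sup.deliver("bad")                      # mutates state and rules, then crashes
print(sup.app.count, sup.rules)         # prints: 1 [('match', 'a')]
```

Deep-copying the app after every message keeps the sketch short; a practical system would likely checkpoint less often or incrementally.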
Controller architecture must support two new abstractions
Current architecture
[Figure: App1 and App2 running inside the controller]
Isolate SDN-Apps from the controller
[Figure: App1 and App2, each in its own sandbox, alongside the controller]
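One way to get this isolation, sketched below in Python under our own assumptions, is to run each app in a separate OS process so that a crash kills only that process. Here the "app" is a hypothetical snippet of Python source run with `subprocess.run`, receiving one JSON event on stdin; `APP_SOURCE` and `run_sandboxed` are illustrative names, and LegoSDN's actual sandbox mechanism may differ.

```python
import json
import subprocess
import sys

# Hypothetical app source: crashes on events from switch 1, echoes others.
APP_SOURCE = """
import json, sys
event = json.loads(sys.stdin.read())
if event["dpid"] == 1:
    raise KeyError(event["dpid"])
print(json.dumps({"ok": event["dpid"]}))
"""


def run_sandboxed(source, event, timeout=5):
    """Deliver one event to an app running in its own interpreter process."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        input=json.dumps(event),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        return None                     # the app crashed; the caller keeps running
    return json.loads(proc.stdout)


print(run_sandboxed(APP_SOURCE, {"dpid": 1}))   # prints: None
print(run_sandboxed(APP_SOURCE, {"dpid": 2}))   # prints: {'ok': 2}
```

Spawning a process per event keeps the sketch short; a long-lived sandbox process per app, exchanging messages over a pipe, would avoid that cost.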
Isolate SDN-Apps from the network
[Figure: App1 in a sandbox, alongside the controller]
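Isolating an app from the network means its network changes can be undone if it crashes. A minimal sketch of that idea, assuming the network is modeled as a dict from match to action and that every change is logged together with its inverse; `Network` and `Transaction` are hypothetical names, not LegoSDN's NetLog API.

```python
_ABSENT = object()                      # marks "no rule existed for this match"


class Network:
    """Toy flow table shared by all apps: match -> action."""

    def __init__(self):
        self.flows = {}


class Transaction:
    """Records the inverse of each change so the batch can be rolled back."""

    def __init__(self, net):
        self.net = net
        self.undo_log = []              # (match, previous action or _ABSENT)

    def install(self, match, action):
        self.undo_log.append((match, self.net.flows.get(match, _ABSENT)))
        self.net.flows[match] = action

    def delete(self, match):
        if match in self.net.flows:
            self.undo_log.append((match, self.net.flows.pop(match)))

    def commit(self):
        self.undo_log.clear()           # keep the changes; nothing to undo

    def abort(self):
        # Apply inverses newest-first to restore the pre-transaction table.
        while self.undo_log:
            match, previous = self.undo_log.pop()
            if previous is _ABSENT:
                self.net.flows.pop(match, None)
            else:
                self.net.flows[match] = previous


net = Network()
net.flows["dst=h1"] = "port1"
tx = Transaction(net)                   # assumption: one transaction per message
tx.install("dst=h1", "port2")
tx.install("dst=h2", "port3")
tx.delete("dst=h1")
tx.abort()                              # the app crashed: undo its changes
print(net.flows)                        # prints: {'dst=h1': 'port1'}
```

This sketch only updates an in-memory table; a controller would also have to push the inverse operations to the switches.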
LegoSDN
AppVisor Stub: lightweight wrapper
AppVisor Proxy: message dispatcher
– The SDN-App is treated as a black box.
– The stub and proxy allow SDN-Apps to talk to the controller.
NetLog: transactional support
[Figure: App1 in a sandbox, the controller, AppVisor Stub, AppVisor Proxy, and NetLog]
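A minimal sketch of the stub/proxy split, assuming messages cross the sandbox boundary as JSON strings. `AppStub`, `AppProxy`, and the `handle(event) -> list of commands` app interface are hypothetical; in LegoSDN the two sides live in different processes, whereas here a direct function call stands in for the channel.

```python
import json


class AppStub:
    """Runs next to the app in its sandbox; the app itself is a black box."""

    def __init__(self, app):
        self.app = app

    def on_wire_message(self, raw):
        event = json.loads(raw)
        commands = self.app.handle(event)   # e.g. rule installs for the proxy
        return json.dumps(commands)


class AppProxy:
    """Runs inside the controller and dispatches events to registered stubs."""

    def __init__(self):
        self.stubs = {}

    def register(self, name, send):
        self.stubs[name] = send             # send: str -> str (the channel)

    def dispatch(self, event):
        raw = json.dumps(event)
        replies = {}
        for name, send in self.stubs.items():
            try:
                replies[name] = json.loads(send(raw))
            except Exception:
                replies[name] = None        # crashed app: recorded, not fatal
        return replies


class Forwarder:
    def handle(self, event):
        return [{"install": {"dst": event["dst"], "port": 1}}]


class Broken:
    def handle(self, event):
        raise RuntimeError("bug")


proxy = AppProxy()
proxy.register("fwd", AppStub(Forwarder()).on_wire_message)
proxy.register("broken", AppStub(Broken()).on_wire_message)
print(proxy.dispatch({"dst": "h1"}))
# prints: {'fwd': [{'install': {'dst': 'h1', 'port': 1}}], 'broken': None}
```

Because the proxy only ever sees serialized messages, it needs no knowledge of the app's internals, which is what lets the app stay a black box.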
LegoSDN
§ Built on top of FloodLight
§ Ported three applications bundled with FloodLight to LegoSDN
How do you handle the crash-inducing message?
1. Crash and burn
§ Halt the application
– SDN-App cannot continue processing
– Other SDN-Apps can continue unaffected
§ No Compromise
– Think of security-related SDN-Apps
Correctness: SDN-App’s ability to implement its functionality without change, according to the specification.
2. Induce amnesia
§ Ignore or drop the crash-inducing message
– SDN-App will not see the message again
§ Complete Compromise
3. Apply transformations
§ Transform the offending message into another one that the application can handle
– The application continues with a modified input
§ Equivalence Compromise
Course of action?
No Compromise ↔ Apply Transformation(s) ↔ Complete Compromise
– The operator chooses the course of action
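The three options can be sketched as an operator-selected policy. In this Python sketch, `Policy`, `handle_crash`, and the `transform` callback are hypothetical names; deciding whether a transformed message is really equivalent to the original is the open question raised under "Message equivalence", so the example transformation (defaulting a missing port) is only a placeholder.

```python
from enum import Enum


class Policy(Enum):
    NO_COMPROMISE = "crash and burn"        # halt the crashed app
    EQUIVALENCE = "apply transformations"   # retry with a rewritten message
    COMPLETE = "induce amnesia"             # drop the message


def handle_crash(policy, app, msg, transform=None):
    """Decide what happens after app.handle(msg) has crashed once.

    Returns the app's result, or None if the message ends up dropped.
    Raises RuntimeError when the app must be halted.
    """
    if policy is Policy.NO_COMPROMISE:
        raise RuntimeError("app halted; other apps keep running")
    if policy is Policy.COMPLETE:
        return None                         # the app never sees msg again
    # Policy.EQUIVALENCE: retry once with the transformed message.
    try:
        return app.handle(transform(msg))
    except Exception:
        return None                         # assumption: fall back to dropping


class PortApp:
    def handle(self, msg):
        return "handled port %d" % msg["port"]


def default_port(msg):
    # Hypothetical transformation: replace a missing port with port 0.
    return {**msg, "port": 0}


bad = {"port": None}                        # crashes PortApp.handle
print(handle_crash(Policy.EQUIVALENCE, PortApp(), bad, default_port))  # prints: handled port 0
print(handle_crash(Policy.COMPLETE, PortApp(), bad))                   # prints: None
```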
Related work
§ Fault tolerance
– via reboots
– applying Paxos for leader selection
§ Debugging SDN-Apps or the controller
Message equivalence
§ How do you determine two messages are equivalent?
Rollbacks are non-trivial
§ Rolling back one or more installed rules changes the controller’s view of the network state
– This might crash other SDN applications that rely on a consistent view of network state
Error propagation
§ Last message received by the SDN-App prior to the crash need not be the culprit!
– How far back in history should we go to find the root cause of the crash?
– If we recover from an earlier checkpoint, how many checkpoints should we maintain?
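One possible search strategy, sketched in Python under our own assumptions (it is not described in the slides): keep the last `k` checkpoints of the app together with the messages delivered after each, and on a crash, replay from progressively older checkpoints while dropping one earlier message at a time, keeping the final crash-triggering message. `CheckpointLog`, `recover`, and the `Fragile` app are hypothetical; replay here only touches app state, so network side effects would still need NetLog-style handling.

```python
import copy
from collections import deque


class CheckpointLog:
    """Keeps the last k checkpoints of an app plus the messages after each."""

    def __init__(self, app, k=3):
        self.app = app
        # Each entry: (snapshot of app state, messages delivered since it).
        # With maxlen=k the oldest checkpoint is evicted automatically.
        self.history = deque(maxlen=k)
        self.checkpoint()

    def checkpoint(self):
        self.history.append((copy.deepcopy(self.app), []))

    def deliver(self, msg):
        self.history[-1][1].append(msg)     # log first, so a crashing msg is kept
        self.app.handle(msg)

    def recover(self):
        """Drop one earlier message so that replay, including the last one, succeeds.

        Searches the newest checkpoint first and goes further back only if needed.
        """
        snapshots = [snap for snap, _ in self.history]
        batches = [batch for _, batch in self.history]
        for start in range(len(snapshots) - 1, -1, -1):
            msgs = [m for batch in batches[start:] for m in batch]
            # Keep the final (crash-triggering) message; try dropping each
            # earlier one, newest first.
            for culprit in range(len(msgs) - 2, -1, -1):
                app = copy.deepcopy(snapshots[start])
                if self._replay(app, msgs[:culprit] + msgs[culprit + 1:]):
                    self.app = app
                    self.history.clear()
                    self.checkpoint()
                    return msgs[culprit]
        raise RuntimeError("no earlier culprit within the retained checkpoints")

    @staticmethod
    def _replay(app, msgs):
        try:
            for m in msgs:
                app.handle(m)
        except Exception:
            return False
        return True


class Fragile:
    def __init__(self):
        self.poisoned = False

    def handle(self, msg):
        if msg == "poison":
            self.poisoned = True            # silently corrupts state, no crash yet
        elif self.poisoned:
            raise ValueError("crash long after the real cause")


log = CheckpointLog(Fragile(), k=3)
log.deliver("ok")
log.checkpoint()
log.deliver("poison")
log.checkpoint()
try:
    log.deliver("tick")
except ValueError:
    print("dropped:", log.recover())       # prints: dropped: poison
```

With `k=1` the same scenario cannot be recovered this way, because the only retained checkpoint already contains the poisoned state; that is the trade-off behind the question of how many checkpoints to keep. If no earlier culprit is found, dropping the last message (amnesia) remains a fallback.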
Road ahead
§ Rethink controller architecture
– LegoSDN is only the tip of the iceberg.
§ Resilient controllers can catalyze adoption
§ Failures need to be first-class citizens