Automated Bootstrapping of A Fault-Resilient In-Band Control Plane. Ermin Sakic, Amaury Van Bemten, Mirza Avdic, Wolfgang Kellerer. Technical University Munich & Siemens Germany. ACM SOSR 2020, San Jose, March 3, 2020.


slide-1
SLIDE 1

Automated Bootstrapping of A Fault-Resilient In-Band Control Plane

Ermin Sakic, Amaury Van Bemten, Mirza Avdic, Wolfgang Kellerer

Technical University Munich & Siemens Germany

ACM SOSR 2020, San Jose, March 3, 2020

slide-2
SLIDE 2

INTRODUCTION

slide-4
SLIDE 4
Industrial Networks Overview

  • Strict requirements:
    • QoS: Sub-ms hard real-time E2E delays
    • Dependability: Control & data plane HA & reliability
    • Topology dynamics: factory cell / work-piece (de-)attachment
  • TSN group (802.1) standardizes industrial CP & DP:
    • E.g., TAS (Qbv), Frame Pre-emption (Qbu), FRER (CB), Policing (Qci), etc.
  • Centralized (CNC) and distributed stream reservation
  • TSN requires a highly available CNC w/ in-band, dynamically extensible CP & DP

slide-5
SLIDE 5

Industrial Network Topologies

slide-6
SLIDE 6

Industrial Network Topologies

VirtuWind – Virtual and programmable industrial network prototype deployed in an operational wind park - https://5g-ppp.eu/virtuwind/

slide-7
SLIDE 7

Control Plane Design

Out-of-Band In-Band

slide-8
SLIDE 8

Control Plane Design

In-Band

Goal of Bootstrapping: Automated establishment of a functional and resilient in-band SDN control plane

Required:

  • Initial controller-to-switch (C2S) and controller-to-controller (C2C) connections
  • Control plane fault tolerance
  • Full topology available (no blocked ports!)
  • Network extensions
  • Compliant with current implementations

Constraints:

  • Switches know nothing about the controllers
  • Controllers know whitelisted IP addresses of remote controllers (e.g., standardized)
  • Switches and controllers exchange PKI certificates
slide-9
SLIDE 9

Control Plane Design

High-level steps:

1. Controllers distribute IP addresses to switches from a common pool
2. Controllers provide each switch with controller lists (e.g., OF)
3. Controllers establish control channels to each switch (e.g., OF)
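
The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the example address range, and the `tcp:<ip>:<port>` target format are hypothetical choices, and 6653 is simply the IANA-registered OpenFlow port.

```python
import ipaddress

def assign_addresses(pool_cidr, switch_ids, reserved=()):
    """Hand out one host address per switch from a shared pool (hypothetical helper)."""
    taken = {ipaddress.ip_address(a) for a in reserved}
    free = (h for h in ipaddress.ip_network(pool_cidr).hosts() if h not in taken)
    leases = {}
    for switch in switch_ids:
        try:
            leases[switch] = str(next(free))
        except StopIteration:
            raise RuntimeError("address pool exhausted") from None
    return leases

def controller_list(controller_ips, port=6653):
    """Build the list of controller targets that a switch would be given."""
    # Assumption: plain TCP targets; a PKI-secured deployment would differ.
    return [f"tcp:{ip}:{port}" for ip in controller_ips]

# Example values are invented for illustration only.
controllers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
leases = assign_addresses("10.0.0.0/28", ["s1", "s2"], reserved=controllers)
targets = controller_list(controllers)
```

Reserving the controller addresses before leasing keeps a switch from ever receiving an address that a controller already uses.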

slide-10
SLIDE 10

Resilience Requirements

CP: Must tolerate F out of 2F+1 Fail-Stop controller failures

DP: Must tolerate k element failures

  • k+1 fully or maximally disjoint paths
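
The two requirements reduce to simple arithmetic, sketched below. The helper names are hypothetical; the formulas are taken directly from the slide (F tolerated failures out of 2F+1 controllers, k+1 disjoint paths for k element failures).

```python
def tolerated_controller_failures(n_controllers):
    """Largest F such that 2F + 1 <= n_controllers (fail-stop failures)."""
    if n_controllers < 1:
        raise ValueError("need at least one controller")
    return (n_controllers - 1) // 2

def controllers_needed(f):
    """Controllers required to tolerate f fail-stop controller failures."""
    return 2 * f + 1

def disjoint_paths_needed(k):
    """Disjoint paths required to tolerate k data plane element failures."""
    return k + 1
```

For example, a 3-controller cluster tolerates one controller failure, and surviving a single link or switch failure requires two disjoint paths.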
slide-11
SLIDE 11

Bootstrapping Co-Dependency

  • DP requires appropriate table rules
  • Rule configuration requires C2C
  • In-Band C2C requires DP connectivity

→ Break bootstrapping procedure into sub-phases

[Figure labels: Fully Bootstrapped Data Plane; Bootstrapped Controllers; Partially Bootstrapped Data Plane; Flow Configurations]

slide-12
SLIDE 12

Design Overview

Contribution: Two automated bootstrapping schemes for a reliable multi-controller in-band control plane

  • Hybrid Switch Approach (HSW): Assumes (R)STP
  • Hop-By-Hop Approach (HHC): No (R)STP
slide-13
SLIDE 13

Why regard (R)STP?

+ Beneficial for effortless initial C2C connectivity

- Dimensioning the (R)STP-disable timer is non-trivial
  → Delays in bootstrapping convergence
- Added complexity in the data plane
  → Prone to additional failure vectors (YMMV)

slide-14
SLIDE 14

DESIGN OF THE TWO SCHEMES

slide-16
SLIDE 16

System Initialization

HSW - (R)STP enabled:

  • standalone mode
    → Heavy use of NORMAL port
  • in-band mode enabled

HHC - (R)STP unavailable:

  • secure mode
  • in-band mode disabled
    → "generic" OF rules

HHC: How to fight initial broadcast storms without (R)STP?

  • Police problematic C2C traffic (ARP, TCP SYN, TCP SYN-ACK)
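
The slides do not say how the policing is realized, so the following is only a generic token-bucket sketch of the idea: storm-prone packet types are rate-limited while other traffic passes. The class, the packet labels, and the rate and burst values are all hypothetical.

```python
class TokenBucket:
    """Generic token-bucket policer (an assumption, not the authors' mechanism)."""

    def __init__(self, rate_pps, burst):
        self.rate = float(rate_pps)    # tokens added per second
        self.burst = float(burst)      # maximum bucket size
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if a packet arriving at time `now` conforms."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Traffic classes named on the slide; the string labels are hypothetical.
POLICED = {"arp", "tcp_syn", "tcp_syn_ack"}

def police(packets, rate_pps=2, burst=2):
    """Rate-limit policed packet types; let everything else through."""
    bucket = TokenBucket(rate_pps, burst)
    return [kind not in POLICED or bucket.allow(ts) for ts, kind in packets]

storm = [(0.0, "arp"), (0.0, "arp"), (0.0, "arp"),
         (0.1, "tcp_syn"), (1.0, "arp"), (1.0, "openflow")]
verdicts = police(storm)
```

With a burst of two, the third simultaneous ARP frame and the early SYN are dropped, while the later ARP frame and the unpoliced packet pass.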

slide-17
SLIDE 17

HSW Phases 0 and 1

slide-18
SLIDE 18

HHC Phases 0 and 1

slide-19
SLIDE 19

Output: Phases 0 and 1

[Figure panels: HSW (with (R)STP); HHC (no (R)STP)]

slide-20
SLIDE 20

Phase 2: Resilience Embedding

HSW (with (R)STP):

Step 2a:

  • Establish OF sessions FCFS, install initial rules, disable in-band rules

Step 2b:

  • Disable (R)STP
  • Install resilient flow rules

HHC (no (R)STP):

Step 2a:

  • Establish OF sessions hop-by-hop, install tree flow rules

Step 2b:

  • Install resilient flow rules whenever possible
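
One way to picture hop-by-hop session establishment is a breadth-first traversal outward from the controller, where each newly reached switch is attached through an already-bootstrapped neighbor. This is an assumption made for illustration; the slides do not spell out the ordering, and the topology and function name below are hypothetical.

```python
from collections import deque

def hop_by_hop_order(adjacency, controller):
    """Return (order, parent): switches by hop distance and their tree parent."""
    parent = {controller: None}
    order = []
    queue = deque([controller])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency[node]):  # sorted for a deterministic result
            if neighbor not in parent:
                parent[neighbor] = node           # tree rule: reach neighbor via node
                order.append(neighbor)
                queue.append(neighbor)
    return order, parent

# Hypothetical 4-switch topology with one controller attached to s1.
topology = {
    "c1": {"s1"},
    "s1": {"c1", "s2", "s3"},
    "s2": {"s1", "s4"},
    "s3": {"s1", "s4"},
    "s4": {"s2", "s3"},
}
order, parent = hop_by_hop_order(topology, "c1")
```

The `parent` map describes a tree rooted at the controller, which is the kind of structure tree flow rules would follow in this sketch.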

slide-21
SLIDE 21

HSW Phase 2a

slide-22
SLIDE 22

HSW Phase 2a

slide-23
SLIDE 23

HSW Phase 2b

slide-24
SLIDE 24

HSW Phase 2b

slide-25
SLIDE 25

HHC Phase 2a

slide-26
SLIDE 26

HHC Phase 2a

slide-27
SLIDE 27

HHC Phase 2b

slide-28
SLIDE 28

HHC Phase 2b

slide-29
SLIDE 29

Phase 2: Outcome of both schemes

  • k+1 max. disjoint paths for C2C pairs
  • k+1 max. disjoint paths for C2S pairs (here only S4)
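
Whether a topology can offer k+1 disjoint paths between two endpoints can be checked with a unit-capacity max-flow computation. The sketch below counts edge-disjoint paths only (node-disjointness is not handled) and uses BFS augmenting paths; it is a standard textbook technique, not necessarily what the authors use, and the example topology is hypothetical.

```python
from collections import deque

def count_edge_disjoint_paths(edges, src, dst):
    """Count edge-disjoint paths between src and dst in an undirected graph."""
    if src == dst:
        raise ValueError("src and dst must differ")
    cap, adj = {}, {}
    for u, v in edges:
        # Each undirected link carries at most one path: capacity 1 per direction.
        cap[(u, v)] = cap.get((u, v), 0) + 1
        cap[(v, u)] = cap.get((v, u), 0) + 1
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    flow = 0
    while True:
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        v = dst
        while parent[v] is not None:  # push one unit of flow along the found path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

edges = [("c1", "s1"), ("c1", "s2"), ("s1", "s2"), ("s1", "s4"), ("s2", "s4")]
paths = count_edge_disjoint_paths(edges, "c1", "s4")
tolerated_link_failures = paths - 1
```

In this example the controller has two edge-disjoint paths to S4, so a single link failure (k = 1) can be tolerated.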

slide-30
SLIDE 30

Dynamic network extensions

  • Allow new traffic to reach the leader via tree
  • HSW: Prim’s algorithm
  • HHC: Custom Hop-By-Hop Algorithm
  • Special rule: in_port=inactive port, udp, udp_src=68, actions=controller
  • Extend tree by parsing the DHCPDISCOVER message
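
For the HSW case, the tree toward the leader can be sketched with Prim's algorithm as named on the slide. Link weights, node names, and the way the new switch is attached are hypothetical; the sketch simply recomputes the tree after a new switch appears.

```python
import heapq

def prim_tree(weighted_adj, root):
    """Prim's algorithm; returns a parent map describing a spanning tree rooted at root."""
    parent = {root: None}
    heap = [(w, root, v) for v, w in weighted_adj[root].items()]
    heapq.heapify(heap)
    while heap:
        _, u, v = heapq.heappop(heap)
        if v in parent:
            continue                      # v already joined the tree via a cheaper link
        parent[v] = u
        for nxt, w in weighted_adj[v].items():
            if nxt not in parent:
                heapq.heappush(heap, (w, v, nxt))
    return parent

# Hypothetical topology: leader controller c1 attached to s1.
graph = {
    "c1": {"s1": 1},
    "s1": {"c1": 1, "s2": 1, "s3": 2},
    "s2": {"s1": 1, "s3": 1},
    "s3": {"s1": 2, "s2": 1},
}
tree = prim_tree(graph, "c1")

# A new switch s4 is discovered behind s3; extend the graph and recompute.
graph["s4"] = {"s3": 1}
graph["s3"]["s4"] = 1
extended = prim_tree(graph, "c1")
```

The new switch ends up as a child of the switch that first saw its traffic, so its packets can follow the tree toward the leader.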
slide-31
SLIDE 31

Data Plane Failures

  • Proactively compute alternative trees
  • Embed an alternative tree in case a DP element fails
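
Proactive computation can be pictured as building one backup tree per link of the primary tree, each computed on the topology with that link removed. The sketch below uses plain BFS trees and considers single link failures only; both are simplifying assumptions, and the topology is hypothetical.

```python
from collections import deque

def bfs_tree(adjacency, root, failed_link=None):
    """BFS spanning tree from root, ignoring one failed link if given."""
    failed = frozenset(failed_link) if failed_link else None
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adjacency[u]):
            if v in parent or frozenset((u, v)) == failed:
                continue
            parent[v] = u
            queue.append(v)
    return parent

def precompute_alternatives(adjacency, root):
    """Map each primary-tree link to a backup tree (None if the failure partitions)."""
    primary = bfs_tree(adjacency, root)
    alternatives = {}
    for child, par in primary.items():
        if par is None:
            continue
        backup = bfs_tree(adjacency, root, failed_link=(par, child))
        alternatives[(par, child)] = backup if len(backup) == len(adjacency) else None
    return primary, alternatives

topology = {
    "c1": {"s1"},
    "s1": {"c1", "s2", "s3"},
    "s2": {"s1", "s4"},
    "s3": {"s1", "s4"},
    "s4": {"s2", "s3"},
}
primary, alternatives = precompute_alternatives(topology, "c1")
```

When a link fails, the matching precomputed tree can be embedded directly; a `None` entry marks a failure that no tree can survive, such as the controller's only uplink in this example.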
slide-32
SLIDE 32

EVALUATION

slide-33
SLIDE 33

Evaluation - KPIs

  • Global Bootstrapping Convergence Time (GBCT)
  • Network Extension Time (T_EXT)
  • Flow Table Occupancy (FTO)

[Figure: parameters varied (topology types, topology sizes, controller placements, number of controllers) and KPIs measured (GBCT, T_EXT, FTO)]

slide-34
SLIDE 34

Global Bootstrapping Convergence Time: Single Controller

* normalized by minimum mean ~13.5 s
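
One plausible reading of this normalization is that every measurement is divided by the smallest per-configuration mean, so the best configuration averages 1.0. The sketch below shows that reading; the interpretation is an assumption, and the sample values are invented purely for illustration (they are not results from the talk).

```python
from statistics import mean

def normalize_by_min_mean(samples_by_config):
    """Divide all samples by the smallest per-configuration mean."""
    means = {cfg: mean(vals) for cfg, vals in samples_by_config.items()}
    baseline = min(means.values())
    normalized = {cfg: [v / baseline for v in vals]
                  for cfg, vals in samples_by_config.items()}
    return normalized, baseline

# Invented sample data in seconds; configuration names are hypothetical.
samples = {"cfg_a": [10.0, 14.0], "cfg_b": [24.0, 24.0]}
normalized, baseline = normalize_by_min_mean(samples)
```

Here `cfg_a` has the smallest mean and becomes the baseline, so `cfg_b` reads as twice as slow.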

slide-35
SLIDE 35

Global Bootstrapping Convergence Time: Multiple Controllers

* normalized by minimum mean ~33.9 s

slide-36
SLIDE 36

Network Extension Time: Single Controller

* normalized by minimum mean ~6.5 s

slide-37
SLIDE 37

Network Extension Time: Multiple Controllers

* normalized by minimum mean ~33.5 s

slide-38
SLIDE 38

Flow Table Occupancy

Ratios of per-switch FTOs, normalized with respect to the FTO in the 1-controller case

slide-39
SLIDE 39

SUMMARY

slide-40
SLIDE 40

Summary - Pros and Cons

HSW - (R)STP

+ Straightforward; easier to implement
- Dependency on legacy protocols (and implementation)
- Worse performance due to (R)STP timer

HHC - No (R)STP

+ Fewer legacy protocol dependencies
+ Faster on average
- Slightly more complex implementation

slide-41
SLIDE 41

Artifacts and Future Updates

Source code for both approaches and Docker-based OpenFlow emulator available! https://github.com/ermin-sakic/sdn-automated-bootstrapping

slide-43
SLIDE 43

Artifacts and Future Updates

Potential optimizations:

  • Automated rule compression for lower FTO
  • Tree merging instead of swapping
  • Support for concurrent multi-controller bootstrapping? (Raft membership issues?)

Source code for both approaches and Docker-based OpenFlow emulator available! https://github.com/ermin-sakic/sdn-automated-bootstrapping
