SLIDE 1

Challenges in Distributed SDN

Duarte Nunes duarte@midokura.com @duarte_nunes

SLIDE 2

MidoNet transforms this...

[Diagram: two bare-metal servers hosting VMs, interconnected by an IP fabric]

SLIDE 3

...into this...

[Diagram: the same servers and VMs, now joined by virtual FWs and LBs with connectivity to the Internet/WAN]

SLIDE 4

Packet processing

[Diagram: the same virtual topology of FWs and LBs between the VMs and the Internet/WAN]

SLIDE 5

Physical view

[Diagram: compute hosts, MidoNet NSDB nodes 1-3, and MidoNet gateway nodes 1-3 on the IP fabric; the gateways connect to the Internet/WAN]

SLIDE 6

MidoNet

  • Fully distributed architecture
  • All traffic is processed at the edges, i.e., where it ingresses the physical network
    ○ virtual devices become distributed
    ○ a packet can traverse a particular virtual device at any host in the cloud
    ○ distributed virtual bridges, routers, NATs, FWs, LBs, etc.
  • No SPOF
  • No middleboxes
  • Horizontally scalable L2 and L3 gateways
SLIDE 7

MidoNet Hosts

[Diagram: Gateway 1 runs the MidoNet Agent (Java daemon), Quagga/bgpd, and the OVS kernel module; eth0/eth1 NICs face the Internet/WAN, and a VXLAN tunnel port faces the IP fabric. Compute 1 runs the same agent and OVS kernel module, with a VXLAN tunnel port and per-VM tap ports (e.g. port5, tap12345).]
SLIDE 8

Flow computation and tunneling

  • Flows are computed at the ingress host
    ○ by simulating a packet’s path through the virtual topology
    ○ without fetching any information off-box (~99% of the time)
  • Just-in-time flow computation
  • If the egress port is on a different host, then the packet is tunneled
    ○ the tunnel key encodes the egress port
    ○ no computation is needed at the egress
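As a rough illustration of the tunnel-key idea (the mapping scheme and names below are invented, not MidoNet's actual encoding), the ingress host resolves the egress port to a small key, and the egress host reverses the mapping to deliver the packet with no flow computation:

```python
# Sketch (not MidoNet's code): a bidirectional map between virtual port
# IDs and tunnel keys (e.g. VXLAN VNIs). The ingress host encodes the
# egress port into the key; the egress host decodes it and delivers.
import itertools

class TunnelKeyTable:
    def __init__(self):
        self._next = itertools.count(1)   # key 0 is often reserved
        self._by_port = {}
        self._by_key = {}

    def key_for(self, port_id):
        # Assign a fresh key the first time a port is seen.
        if port_id not in self._by_port:
            key = next(self._next)
            self._by_port[port_id] = key
            self._by_key[key] = port_id
        return self._by_port[port_id]

    def port_for(self, key):
        # Egress side: direct lookup, no simulation needed.
        return self._by_key[key]

table = TunnelKeyTable()
key = table.key_for("vm-port-12345")           # ingress host encodes
assert table.port_for(key) == "vm-port-12345"  # egress host decodes
```

In a real deployment both sides must agree on the mapping, which is why deriving the key deterministically from shared topology data (rather than per-host counters as above) is the practical choice.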

SLIDE 9

Virtual Devices

SLIDE 10

Device state

  • ZooKeeper serves the virtual network topology
    ○ reliable subscription to topology changes
  • Agents fetch, cache, and “watch” virtual devices on demand to process packets
  • Packets naturally traverse the same virtual device at different hosts
  • This affects device state:
    ○ a virtual bridge learns a MAC-port mapping at one host and needs to read it at other hosts
    ○ a virtual router emits an ARP request out of one host and receives the reply on another host
  • Store device state tables (ARP, MAC-learning, routes) in ZooKeeper
    ○ interested agents subscribe to tables to get updates
    ○ the owner of an entry manages its lifecycle
    ○ ZK ephemeral nodes make entries go away if a host fails
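The state-table pattern can be sketched as a toy in-memory model of what ZooKeeper provides here (MidoNet's real tables are backed by ZK ephemeral nodes; the class and method names below are invented):

```python
# Illustrative sketch only: owners write entries, all subscribers see
# updates, and an owner's entries vanish when that owner (host) dies,
# mimicking ZooKeeper deleting ephemeral nodes on session expiry.
class ReplicatedTable:
    def __init__(self):
        self._entries = {}       # key -> (value, owner)
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def put(self, key, value, owner):
        self._entries[key] = (value, owner)
        for cb in self._subscribers:
            cb("added", key, value)

    def owner_died(self, owner):
        # Drop every entry owned by the failed host and notify subscribers.
        for key in [k for k, (_, o) in self._entries.items() if o == owner]:
            value, _ = self._entries.pop(key)
            for cb in self._subscribers:
                cb("removed", key, value)

events = []
arp = ReplicatedTable()
arp.subscribe(lambda op, k, v: events.append((op, k, v)))
arp.put("10.0.0.2", "aa:bb:cc:dd:ee:ff", owner="host-1")
arp.owner_died("host-1")
assert events == [("added", "10.0.0.2", "aa:bb:cc:dd:ee:ff"),
                  ("removed", "10.0.0.2", "aa:bb:cc:dd:ee:ff")]
```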

SLIDE 11

[Diagram: two hosts running VMs over the IP fabric, each holding a local view of the virtual router's ARP table]
SLIDE 12

[Diagram: the same two hosts and ARP tables (animation step)]
SLIDE 13

[Diagram: an encapsulated ARP request is tunneled across the IP fabric to the other host]
SLIDE 14

[Diagram: the ARP reply is handled locally and written to ZooKeeper; a ZK notification propagates the entry to the other host's ARP table]
SLIDE 15

[Diagram: the original packet is then encapsulated and tunneled across the IP fabric]
SLIDE 16

Flow State

SLIDE 17

Flow state

  • Per-flow L4 state, e.g. connection tracking or NAT
  • Forward and return flows are typically handled by different hosts
    ○ thus, they need to share state

SLIDE 18

Virtual NAT

[Diagram: a forward flow from the Internet/WAN to 180.0.1.100:80 is translated to the VM at 10.0.0.2 behind an LB; the return flow from 10.0.0.2:6456 is reverse-translated]

SLIDE 19

Asymmetric routing

[Diagram: a VM behind an LB, with multiple gateway NICs connecting to the Internet/WAN]

SLIDE 20

Asymmetric routing

[Diagram: the forward flow enters through one gateway NIC]

SLIDE 21

Asymmetric routing

[Diagram: the return flow (animation step)]

SLIDE 22

Asymmetric routing

[Diagram: the return flow exits through a different gateway NIC than the forward flow entered]

SLIDE 23

Flow state

  • Connection tracking
    ○ Key: 5-tuple + ingress device UUID
    ○ Value: N/A
    ○ Forward state not needed
    ○ One flow state entry per flow
  • NAT
    ○ Key: 5-tuple + UUID of the device under which NAT was performed
    ○ Value: (IP, port) binding
    ○ Possibly multiple flow state entries per flow
  • The key must always be derivable from the packet
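A small sketch of why key derivability matters (field and function names are illustrative, not MidoNet's): the host that sees the return packet can reconstruct the forward key by inverting the 5-tuple, with no help from the ingress host.

```python
# Sketch of flow-state keys: the key is derived from the packet alone,
# so any host can look up state for either direction of the connection.
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "proto src_ip src_port dst_ip dst_port")

def conntrack_key(pkt, ingress_device_id):
    # Connection-tracking key: 5-tuple + ingress device UUID.
    return (pkt, ingress_device_id)

def reverse(pkt):
    # The return packet has source and destination swapped.
    return FiveTuple(pkt.proto, pkt.dst_ip, pkt.dst_port,
                     pkt.src_ip, pkt.src_port)

fwd = FiveTuple("tcp", "10.0.0.2", 6456, "216.58.210.164", 80)
ret = FiveTuple("tcp", "216.58.210.164", 80, "10.0.0.2", 6456)
# The host seeing the return packet derives the forward key by inverting:
assert conntrack_key(reverse(ret), "router-1") == conntrack_key(fwd, "router-1")
```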
SLIDE 24

Sharing state - Peer-to-peer handoff

[Diagram: Node 1 (ingress) tunnels to Node 2 (egress); Node 3 is a possible asymmetric return path and Node 4 a possible asymmetric forward path]

  1. New flow arrives
  2. Check or create local state
  3. Replicate the flow state to the interested set
  4. Tunnel the packet
  5. Deliver the packet
SLIDE 25

Sharing state - Peer-to-peer handoff

[Diagram: same nodes; Node 3 is a possible asymmetric return path and Node 4 a possible asymmetric forward path]

  1. Return flow arrives
  2. Lookup local state
  3. Tunnel the packet
  4. Deliver the packet
SLIDE 26

Sharing state - Peer-to-peer handoff

[Diagram: same nodes as before]

  1. Existing flow arrives at a different node
  2. Lookup local state
  3. Tunnel the packet
  4. Deliver the packet
SLIDE 27

Sharing state - Peer-to-peer handoff

  • No added latency
  • Fire-and-forget or reliable?
  • How often to retry?
  • Delay tunneling the packets until the flow state has propagated, or accept the risk of the return flow being computed without the flow state?
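A minimal fire-and-forget sketch of the handoff (the transport, message format, and helper names are invented for illustration):

```python
# Sketch: the ingress host pushes flow state to every peer that might see
# the forward or return flow, then tunnels the packet without waiting.
def interested_set(ingress_host, egress_host, asym_fwd_hosts, asym_ret_hosts):
    """Hosts that may see this flow and therefore need its state."""
    peers = {egress_host, *asym_fwd_hosts, *asym_ret_hosts}
    peers.discard(ingress_host)   # the ingress host already has the state
    return peers

def handoff(state, peers, send):
    # Fire-and-forget: one push per peer, no acknowledgement. A lost
    # message means the return flow may be computed without the state.
    for peer in peers:
        send(peer, state)

sent = []
peers = interested_set("node1", "node2", {"node4"}, {"node3"})
handoff({"nat": ("10.0.0.2", 6456)}, peers, lambda p, s: sent.append(p))
assert sorted(sent) == ["node2", "node3", "node4"]
```

A reliable variant would retry until acknowledged, trading extra messages (and possibly delayed tunneling) for a lower chance of the return flow missing state, which is exactly the trade-off the questions above pose.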

SLIDE 28

SNAT block reservation

[Diagram: a VM at 10.0.0.2 sends from port 6456 to 216.58.210.164:80; SNAT rewrites the source to 180.0.1.100:9043 on the way to the Internet/WAN]
SLIDE 29

SNAT block reservation

[Diagram: same flow; the NAT target specifies the translation range (start_ip..end_ip, start_port..end_port), e.g. 180.0.1.100..180.0.1.100, ports 5000..65535]
SLIDE 30

SNAT block reservation

[Diagram: a second VM at 10.0.0.1 sends from port 7182 to the same destination; SNAT rewrites it to 180.0.1.100:9044]
SLIDE 31

SNAT block reservation

[Diagram: the second translation happens at another host, which must pick a free source port for 180.0.1.100 (180.0.1.100:?) without colliding with ports chosen elsewhere]

SLIDE 32

SNAT block reservation

  • Performed through ZooKeeper
  • /nat/{device_id}/{ip}/{block_idx}
  • 64 ports per block, 1024 blocks in total
  • LRU-based allocation
  • Blocks are referenced by flow state
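The block arithmetic implied by these numbers can be sketched as follows (helper names are invented; 1024 blocks of 64 ports cover the full 16-bit port space):

```python
# Sketch of the SNAT block scheme from the slides: a host reserves a whole
# 64-port block under a ZooKeeper path, then allocates individual ports
# from it locally, avoiding a ZK round-trip per flow.
PORTS_PER_BLOCK = 64
NUM_BLOCKS = 1024            # 1024 * 64 = 65536 ports

def block_of(port):
    return port // PORTS_PER_BLOCK

def block_range(block_idx):
    start = block_idx * PORTS_PER_BLOCK
    return range(start, start + PORTS_PER_BLOCK)

def zk_path(device_id, ip, port):
    # Matches the /nat/{device_id}/{ip}/{block_idx} layout above.
    return f"/nat/{device_id}/{ip}/{block_of(port)}"

assert block_of(9043) == 141
assert 9043 in block_range(141)
assert zk_path("router-1", "180.0.1.100", 9043) == "/nat/router-1/180.0.1.100/141"
```

Since two hosts can never reserve the same block, ports allocated for the same public IP at different hosts cannot collide.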

SLIDE 33

Thank you! Q&A

SLIDE 34

Low-level

SLIDE 35

Inside the Agent

[Diagram: in user space, per-CPU simulation threads each own a flow table, flow state, and an ARP broker, and exchange messages through backchannels; upcalls arrive from the kernel datapath, simulations consult the virtual topology, and output goes back to the datapath]

SLIDE 36

Performance

  • Sharding
    ○ Share-nothing model
    ○ Each simulation thread is responsible for a subset of the installed flows
    ○ Each simulation thread is responsible for a subset of the flow state
    ○ Each thread ARPs individually
    ○ Communication by message passing through “backchannels”
  • Run-to-completion model
    ○ When a piece of the virtual topology is needed, simulations are parked
  • Lock-free algorithms where sharding is not possible
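A toy sketch of the share-nothing idea (the hashing and queue scheme is illustrative, not MidoNet's code): every packet of a flow maps to one simulation thread, so per-flow state needs no locks, and cross-shard work goes through backchannel queues.

```python
# Sketch: flows are hashed to shards; a thread handling a packet that
# belongs to another shard's flow hands it off via that shard's
# backchannel queue instead of touching shared state.
from queue import Queue

NUM_SHARDS = 4
backchannels = [Queue() for _ in range(NUM_SHARDS)]

def shard_of(five_tuple):
    # All packets of a flow hash to the same simulation thread.
    return hash(five_tuple) % NUM_SHARDS

def send_to_shard(five_tuple, message):
    backchannels[shard_of(five_tuple)].put(message)

flow = ("tcp", "10.0.0.2", 6456, "216.58.210.164", 80)
send_to_shard(flow, "packet")
assert backchannels[shard_of(flow)].get() == "packet"
```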