HW High-Availability and Link Aggregation for Ethernet switch and NIC RDMA using Linux bonding/team


Proceedings of NetDev 1.1: The Technical Conference on Linux Networking (February 10th-12th, 2016, Seville, Spain)


SLIDE 1

Tzahi Oved <tzahio@mellanox.com>, Or Gerlitz <ogerlitz@mellanox.com>

Netdev 1.1 | 2016

HW High-Availability and Link Aggregation for Ethernet switch and NIC RDMA using Linux bonding/team


SLIDE 2

Bonding / Team drivers

  • Both expose a software netdevice that provides LAG / HA toward the networking stack
  • The team/bond is considered an “upper” device to the “lower” NIC net-devices through which packets flow to the wire (see the sketch below)
  • Different modes of operation: Active/Passive, 802.3ad (LAG), and policies: link monitoring, xmit hash, etc.
  • Bonding – the legacy driver
  • Team – introduced in 3.3; more modular/flexible design, extendable, state machine in a user-space library/daemon
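
A minimal kernel-side sketch (not code from the talk) of the upper/lower relationship: a lower NIC driver asking the stack whether it has been enslaved to a bond/team master. netdev_master_upper_dev_get() and netif_is_lag_master() are the real <linux/netdevice.h> APIs; the surrounding helper is illustrative only.

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

/* Illustrative helper: is this port netdev currently a bond/team slave? */
static bool port_is_lag_slave(struct net_device *port_dev)
{
    struct net_device *upper;

    ASSERT_RTNL();  /* upper/lower adjacency must be read under rtnl_lock() */

    upper = netdev_master_upper_dev_get(port_dev);
    if (!upper)
        return false;   /* no master: plain standalone port */

    /* covers both bonding and team masters (netif_is_lag_master(), 4.5+) */
    return netif_is_lag_master(upper);
}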


SLIDE 3

HW LAG using SW Team/Bond

  • Idea: use SW LAG on netdevices to apply LAG to HW-offloaded traffic
  • Offloaded traffic doesn’t pass through the network stack

100Gb/s Switch

  • Each port is represented by a netdevice
  • SW LAG on a few port netdevs → set HW LAG on the physical ports (mlxsw, upstream 4.5); see the sketch below

40/100Gb/s NIC

  • Each port of the device is an Eth netdevice
  • RDMA traffic is offloaded from the network stack
  • The port netdevice serves for plain Eth networking and as the control path for the RDMA stack
  • SW LAG on two NIC port netdevs → HW LAG for RDMA traffic (mlx4, upstream 4.0)
  • Under SRIOV, SW LAG on the PF NIC ports → HW LAG for the vport used by the VF (mlx4, upstream 4.5)
  • For the 100Gb/s NIC (mlx5) – coming soon…
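
A sketch of the prerequisite check such a driver performs before offloading: every lower netdev of the bond/team must belong to the same ASIC, otherwise the LAG stays in software. netdev_for_each_lower_dev() is the real kernel iterator; mydrv_netdev_ops and the helper names are hypothetical stand-ins for a driver's own symbols (comparing netdev_ops is the common "is this port ours" idiom, e.g. in mlxsw).

#include <linux/netdevice.h>

static const struct net_device_ops mydrv_netdev_ops;   /* this driver's ops (placeholder) */

/* common idiom: a netdev belongs to this driver if it uses our netdev_ops */
static bool mydrv_port_dev_check(const struct net_device *dev)
{
    return dev->netdev_ops == &mydrv_netdev_ops;
}

/* can this SW LAG be mirrored into a single HW LAG on our device? */
static bool mydrv_lag_is_offloadable(struct net_device *lag_dev)
{
    struct net_device *lower;
    struct list_head *iter;

    netdev_for_each_lower_dev(lag_dev, lower, iter) {
        if (!mydrv_port_dev_check(lower))
            return false;   /* mixed lower devices: leave it to SW LAG */
    }
    return true;
}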


SLIDE 4

Network notifiers && their usage for HW LAG

  • A notification is sent to subscribed consumers in the networking stack on a change which is about to take place, or that has just happened
  • The notification contains the event type and the affected parties
  • Notifications used for LAG: pre change-upper, change-upper

HW driver usage of the LAG notifications (see the sketch below):

  • pre-change upper: refuse certain configurations, NAK the change
  • change upper: create / configure the HW LAG
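
A sketch of the subscription side for a hypothetical driver "mydrv": register_netdevice_notifier(), netdev_notifier_info_to_dev(), notifier_from_errno() and the NETDEV_PRECHANGEUPPER / NETDEV_CHANGEUPPER events are the real kernel API; the two mydrv_port_*changeupper() handlers are sketched on the next slide.

#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/notifier.h>

/* handlers sketched on the next slide */
int mydrv_port_prechangeupper(struct net_device *dev,
                              struct netdev_notifier_changeupper_info *info);
int mydrv_port_changeupper(struct net_device *dev,
                           struct netdev_notifier_changeupper_info *info);

static int mydrv_netdevice_event(struct notifier_block *nb,
                                 unsigned long event, void *ptr)
{
    struct net_device *dev = netdev_notifier_info_to_dev(ptr);
    int err = 0;

    switch (event) {
    case NETDEV_PRECHANGEUPPER:
        /* about to be linked to an upper device: last chance to refuse */
        err = mydrv_port_prechangeupper(dev, ptr);
        break;
    case NETDEV_CHANGEUPPER:
        /* linking/unlinking already happened: create / configure HW LAG */
        err = mydrv_port_changeupper(dev, ptr);
        break;
    }
    return notifier_from_errno(err);
}

static struct notifier_block mydrv_netdevice_nb = {
    .notifier_call = mydrv_netdevice_event,
};

static int __init mydrv_init(void)
{
    return register_netdevice_notifier(&mydrv_netdevice_nb);
}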


SLIDE 5

Switch HW driver

  • ip link set dev sw1p1 master team0
  • NETDEV_PRECHANGEUPPER
  • if the LAG type is not LACP, etc. – NAK → the operation fails
  • NETDEV_CHANGEUPPER
  • observe that a new LAG is being created for the switch → create a HW LAG and add this port to it
  • ip link set dev sw1p2 master team0
  • NETDEV_PRECHANGEUPPER
  • […]
  • NETDEV_CHANGEUPPER
  • observe that this LAG already exists → add this port to it

[Sequence diagram: switch driver ↔ team]
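
A sketch (hypothetical "mydrv", loosely following the flow above rather than the actual mlxsw code) of the two handlers that the notifier sketch on the previous slide dispatches to. info->upper_dev, info->linking, info->upper_info and NETDEV_LAG_TX_TYPE_HASH are the real kernel fields/enums; the mydrv_lag_* helpers are hypothetical.

#include <linux/errno.h>
#include <linux/netdevice.h>

/* hypothetical per-driver LAG bookkeeping */
struct mydrv_lag;
struct mydrv_lag *mydrv_lag_get_or_create(struct net_device *lag_dev);
int mydrv_lag_port_join(struct mydrv_lag *lag, struct net_device *port_dev);
void mydrv_lag_port_leave(struct net_device *port_dev);

int mydrv_port_prechangeupper(struct net_device *dev,
                              struct netdev_notifier_changeupper_info *info)
{
    struct netdev_lag_upper_info *lag_info = info->upper_info;

    if (!netif_is_lag_master(info->upper_dev))
        return 0;               /* not a bond/team master: nothing to veto */

    /* e.g. only offload hash-based (802.3ad / xor) transmit policies */
    if (info->linking && lag_info &&
        lag_info->tx_type != NETDEV_LAG_TX_TYPE_HASH)
        return -EINVAL;         /* NAK: "ip link set ... master" will fail */

    return 0;
}

int mydrv_port_changeupper(struct net_device *dev,
                           struct netdev_notifier_changeupper_info *info)
{
    struct net_device *upper = info->upper_dev;

    if (!netif_is_lag_master(upper))
        return 0;

    if (info->linking) {
        /* first slave creates the HW LAG, later slaves just join it */
        struct mydrv_lag *lag = mydrv_lag_get_or_create(upper);

        return mydrv_lag_port_join(lag, dev);
    }

    /* unlinking: remove the port; destroy the HW LAG once it empties */
    mydrv_lag_port_leave(dev);
    return 0;
}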


SLIDE 6

RDMA over Ethernet (RoCE) / RDMA-CM

  • The upstream RDMA stack supports multiple transports: RoCE, IB, iWARP
  • RoCE – RDMA over Converged Ethernet; RoCE v2 (upstream 4.5) carries the IBTA RDMA headers over UDP. Uses IPv4/6 addresses set over the regular Eth NIC port netdev
  • RoCE apps use the RDMA-CM API for the control path and the verbs API for the data path (see the sketch below)
  • RDMA-CM API (include/rdma/rdma_cm.h)
  • Address resolution – local route lookup + ARP/ND services (rdma_resolve_addr())
  • Route resolution – path lookup in IB networks (rdma_resolve_route())
  • Connection establishment – per-transport CM to wire up the offloaded connection (rdma_connect())
  • Verbs API
  • Send/RDMA – send a message or perform an RDMA operation (post_send())
  • Poll – poll for completion of a Send/RDMA or Receive operation (poll_cq())
  • Async completion handling and fd semantics are supported
  • Post Receive Buffer – hand receive buffers to the NIC (post_recv())
  • RDMA Device
  • The device structure, exposes all of the above operations
  • Associated with a net_device
  • Available for both RoCE and user-mode Ethernet programming (DPDK)
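
A minimal user-space sketch of the RDMA-CM control path just listed (client side; event handling on the rdma_event_channel and all error checks are reduced to comments, and the verbs data path is not shown). The calls are the standard librdmacm API; "server_addr" is simply the IPv4/6 address configured on the peer's (possibly bonded) Ethernet netdev.

#include <rdma/rdma_cma.h>

static struct rdma_cm_id *roce_client_connect(struct sockaddr *server_addr)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_conn_param param = { .retry_count = 7 };
    struct rdma_cm_id *id;

    rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);

    /* address resolution: route lookup + ARP/ND, binds id to an RDMA device */
    rdma_resolve_addr(id, NULL, server_addr, 2000 /* ms */);
    /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED on ch ... */

    /* route resolution: path lookup (relevant mainly on IB fabrics) */
    rdma_resolve_route(id, 2000);
    /* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */

    /* create the CQ/QP here (e.g. rdma_create_qp()), then connect */
    rdma_connect(id, &param);
    /* ... wait for RDMA_CM_EVENT_ESTABLISHED, then post_send()/post_recv() */

    return id;
}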


SLIDE 7

Native Model – HW Teaming

  • Configuration
  • Native Linux administration
  • RoCE bonding is mostly auto-configured
  • RoCE
  • Use a transport object (QP, TIS) attribute: port affinity
  • The RDMA devices associated with eth0, eth1 will be used for port management only (through immutable caps)
  • And will unregister and re-register in order to drop existing consumers
  • A new ib_dev attached to the bond is registered
  • eth0, eth1 will listen for Linux bond enslavement netlink events (see the sketch below)
  • The new RDMA device will always use the vendor's pick of PCIe function (PF0/1 or both)
  • LACP (802.3ad)
  • Either handled by the Linux bonding/teaming driver
  • Or in HW/FW for supporting NICs (required for many-PFs-to-a-single-phys-port configurations)
  • HW Bond
  • NIC logic for HW forwarding of ingress traffic to the bond/team RDMA device
  • net_dev traffic is passed directly to the owner net_dev according to the ingress port

[Diagram: NIC with Phys Port1/Phys Port2 and PCIe PF0/PF1 exposing eth0/eth1 under Linux bonding/teaming, with a single RDMA device on top of the HW bond]
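
A conceptual sketch only (not the actual mlx4/mlx5 code) of how a RoCE driver can follow bond enslavement of its port netdevs, as described above. The notifier plumbing reuses the real kernel API from slide 4; the mydrv_* helpers and the re-registration policy are hypothetical.

#include <linux/netdevice.h>
#include <linux/notifier.h>

/* hypothetical helpers for this sketch */
bool mydrv_is_port_netdev(const struct net_device *dev);
void mydrv_roce_switch_to_bonded_dev(struct net_device *bond_dev);
void mydrv_roce_switch_to_per_port_devs(void);

static int mydrv_roce_netdev_event(struct notifier_block *nb,
                                   unsigned long event, void *ptr)
{
    struct net_device *dev = netdev_notifier_info_to_dev(ptr);
    struct netdev_notifier_changeupper_info *info = ptr;

    if (event != NETDEV_CHANGEUPPER || !mydrv_is_port_netdev(dev))
        return NOTIFY_DONE;

    if (info->linking && netif_is_lag_master(info->upper_dev)) {
        /*
         * Port became a bond/team slave:
         *  - unregister the per-port ib_devs (drops existing consumers)
         *  - program the HW bond so ingress RDMA traffic from either
         *    physical port reaches the same transport objects
         *  - register one new ib_dev attached to the bond netdev
         */
        mydrv_roce_switch_to_bonded_dev(info->upper_dev);
    } else if (!info->linking) {
        /* slave released: tear down the HW bond, back to per-port ib_devs */
        mydrv_roce_switch_to_per_port_devs();
    }
    return NOTIFY_DONE;
}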


SLIDE 8

eSwitch Software Model – Option I

[Diagram: Linux/OVS bridge br0 over eth0, rep_vf0, rep_vf1 (Linux switch device); NIC eSwitch with Phys Port and PCIe PF0 (native OS, RDMA device), PCIe VF0.0/VF0.1 assigned to SRIOV VMs]


SLIDE 9

eSwitch Software Model – Option II

[Diagram: Linux/OVS bridge over representor netdevs rep_phy0, rep_eth0, rep_vf0, rep_vf1 plus eth0 (Linux switch device); NIC eSwitch with Phys Port and PCIe PF0 (native OS, RDMA device), PCIe VF0.0/VF0.1 assigned to SRIOV VMs]


SLIDE 10

eSwitch Software Model with HA

[Diagram: representor-based Linux/OVS bridge (rep_phy0, rep_phy1, rep_eth0, rep_vf0, rep_vf1, eth0); NIC with Phys Port1/Port2 and PCIe PF0/PF1 under Linux bonding, eSwitch HW bond, a single RDMA device, PCIe VF0.0/VF1.0 assigned to SRIOV VMs]


SLIDE 11

eSwitch Software Model with Tunneling

[Diagram: representor-based Linux/OVS bridge plus an OVS-VXLAN bridge with a vxlan net_device (VNI/key); NIC eSwitch with a UDP/IP stack for the HW tunnel, Phys Port, PCIe PF0 (RDMA device), PCIe VF0.0/VF0.1 assigned to SRIOV VMs]


SLIDE 12

Multi-PCI Socket NIC

  • Multiple PCIe endpoint NIC – the NIC can be connected through one or more PCIe buses
  • Each PCIe bus is connected to a different NUMA node
  • Exposed as 2 or more net_devices, each with its own associated RDMA device
  • Enjoy direct device access to the local NUMA node
  • Application use & feel – would like to work with a single net interface
  • Use Linux bonding together with RDMA device bonding
  • For TCP/IP traffic on TX, select the slave according to the calling context affinity
  • For RDMA traffic, select the slave according to:
  • Transport object (QP) logical port affinity
  • Or the CPU affinity of the thread that creates the transport object (see the sketch below)
  • Don’t share HW resources (CQ, SRQ) across CPU sockets – each device has its own HW resources

[Diagram: two CPU sockets connected over QPI; the NIC exposes one physical port through two PCIe PFs (eth0 on PF0, eth1 on PF1), Linux bonding/teaming on top, and a single RDMA device over the HW bond]
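
A user-space sketch of the NUMA-affinity idea (assumes libibverbs and libnuma are available; the selection policy itself is only an illustration): pick the RDMA device whose PCIe function is local to the calling thread's NUMA node, using the standard numa_node sysfs attribute of the underlying PCI device.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>              /* sched_getcpu() */
#include <numa.h>               /* numa_node_of_cpu() */
#include <infiniband/verbs.h>   /* ibv_get_device_list() */

static struct ibv_device *pick_local_rdma_dev(void)
{
    int local_node = numa_node_of_cpu(sched_getcpu());
    struct ibv_device **list = ibv_get_device_list(NULL);
    struct ibv_device *best = NULL;

    for (int i = 0; list && list[i]; i++) {
        char path[256];
        int node = -1;
        FILE *f;

        /* NUMA node of the PCIe function backing this RDMA device */
        snprintf(path, sizeof(path),
                 "/sys/class/infiniband/%s/device/numa_node",
                 ibv_get_device_name(list[i]));
        f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &node) != 1)
                node = -1;
            fclose(f);
        }
        if (node == local_node) {
            best = list[i];
            break;
        }
    }
    /* fall back to the first device if nothing is node-local */
    if (!best && list)
        best = list[0];
    return best;    /* caller: ibv_open_device(best); keep the list alive */
}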
