
HW High-Availability and Link Aggregation for Ethernet switch and NIC RDMA using Linux bonding/team


  1. Proceedings of NetDev 1.1: The Technical Conference on Linux Networking (February 10th-12th 2016. Seville, Spain)
     HW High-Availability and Link Aggregation for Ethernet switch and NIC RDMA using Linux bonding/team
     Tzahi Oved tzahio@mellanox.com; Or Gerlitz ogerlitz@mellanox.com
     Netdev 1.1 | 2016

  2. Bonding / Team drivers
     • Both expose a software netdevice that provides LAG / HA toward the networking stack.
     • The team/bond device is considered the "upper" device to the "lower" NIC netdevices through which packets flow to the wire.
     • Different modes of operation (Active/Passive, 802.3ad (LAG)) and policies (link monitoring, xmit hash, etc.).
     • Bonding: legacy.
     • Team: introduced in 3.3; more modular/flexible design, extendable, state machine in a user-space library/daemon.
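
The xmit hash policy mentioned on this slide decides which lower device carries a given flow. A minimal Python sketch, assuming a simplified layer2-style policy (XOR of the source and destination MAC addresses, modulo the number of lower devices). The function name and device names are illustrative, and the hash in the real bonding driver differs in detail:

```python
def pick_lower_device(src_mac: str, dst_mac: str, lowers: list) -> str:
    """Toy layer2-style xmit hash: XOR the MAC addresses, then take the
    result modulo the number of lower devices. Simplified for illustration."""
    if not lowers:
        raise ValueError("LAG has no lower devices")
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return lowers[(src ^ dst) % len(lowers)]


lowers = ["eth0", "eth1"]  # hypothetical lower netdevices of a bond/team
flow_a = pick_lower_device("00:11:22:33:44:55", "00:11:22:33:44:66", lowers)
flow_b = pick_lower_device("00:11:22:33:44:55", "00:11:22:33:44:67", lowers)
```

Because the result depends only on the address pair, all packets of one flow stay on the same lower device, while different flows can spread across the LAG.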

  3. HW LAG using SW Team/Bond
     • Idea: use SW LAG on netdevices to apply LAG to HW-offloaded traffic.
     • Offloaded traffic does not pass through the network stack.
     • 100Gb/s switch
       • Each port is represented by a netdevice.
       • SW LAG on a few port netdevs → set HW LAG on the physical ports (mlxsw, upstream 4.5).
     • 40/100Gb/s NIC
       • Each port of the device is an Eth netdevice.
       • RDMA traffic is offloaded from the network stack.
       • The port netdevice serves plain Eth networking and the control path for the RDMA stack.
       • SW LAG on two NIC port netdevs → HW LAG for RDMA traffic (mlx4, upstream 4.0).
       • Under SRIOV, SW LAG on PF NIC ports → HW LAG for the vport used by the VF (mlx4, upstream 4.5).
       • For the 100Gb/s NIC (mlx5): coming soon…

  4. Network notifiers && their usage for HW LAG
     • A notification is sent to subscribed consumers in the networking stack on a change that is about to take place or that has just happened.
     • The notification contains the event type and the affected parties.
     • Notifications used for LAG: pre-change-upper, change-upper.
     • HW driver usage of the LAG notifications:
       • pre-change-upper: refuse certain configurations, NAK the change.
       • change-upper: create / configure the HW LAG.
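
The two-phase pattern on this slide (a pre-change event that can be refused, then a change event that is acted upon) can be sketched as a toy model. This is plain Python and not kernel code: `NotifierChain`, `set_master`, `hw_driver` and the mode strings are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Event(Enum):
    PRECHANGEUPPER = "prechangeupper"  # change is about to happen, may be vetoed
    CHANGEUPPER = "changeupper"        # change has just happened


class NotifierChain:
    """Toy stand-in for a notifier chain with subscribed consumers."""

    def __init__(self):
        self._subscribers = []

    def register(self, callback):
        self._subscribers.append(callback)

    def call(self, event, info):
        # Stops and returns False as soon as one subscriber refuses (NAKs).
        return all(cb(event, info) is not False for cb in self._subscribers)


def set_master(chain, lower, upper, mode):
    """Enslave `lower` to `upper`, giving subscribers a chance to refuse."""
    info = {"lower": lower, "upper": upper, "mode": mode}
    if not chain.call(Event.PRECHANGEUPPER, info):
        return False               # vetoed: nothing is changed
    chain.call(Event.CHANGEUPPER, info)
    return True


hw_lags = []  # what the hypothetical HW driver has programmed


def hw_driver(event, info):
    if event is Event.PRECHANGEUPPER and info["mode"] != "802.3ad":
        return False               # refuse a configuration the HW cannot offload
    if event is Event.CHANGEUPPER:
        hw_lags.append((info["upper"], info["lower"]))
    return True


chain = NotifierChain()
chain.register(hw_driver)
ok = set_master(chain, "eth0", "bond0", "802.3ad")
refused = set_master(chain, "eth1", "bond1", "active-backup")
```

Splitting the veto from the action means a refused configuration never reaches the point where hardware state would have to be rolled back.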

  5. Switch HW driver
     • ip link set dev sw1p1 master team0
       • NETDEV_PRECHANGEUPPER
         • if the LAG type is not LACP, etc.: NAK → the operation fails
       • NETDEV_CHANGEUPPER
         • observe that a new LAG is created for the switch → create a HW LAG and add this port to it
     • ip link set dev sw1p2 master team0
       • NETDEV_PRECHANGEUPPER
         • […]
       • NETDEV_CHANGEUPPER
         • observe that this LAG already exists → add this port to it
     [Diagram labels: team, switch driver]
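
The create-or-join behaviour of the two `ip link` commands above can be sketched as follows. This is a toy Python model under stated assumptions, not the mlxsw driver: `SwitchDriver`, `ip_link_set_master` and the `"lacp"` type string are hypothetical.

```python
class SwitchDriver:
    """Toy model of the pre-change/change-upper handling on this slide."""

    def __init__(self):
        self.hw_lags = {}  # upper device name -> list of member ports

    def prechangeupper(self, port, upper, lag_type):
        # Refuse anything that is not LACP; the caller must then abort.
        return lag_type == "lacp"

    def changeupper(self, port, upper):
        if upper not in self.hw_lags:
            self.hw_lags[upper] = []      # first port: create the HW LAG
        self.hw_lags[upper].append(port)  # then add this port to it


def ip_link_set_master(driver, port, upper, lag_type="lacp"):
    """Rough analogue of `ip link set dev <port> master <upper>`."""
    if not driver.prechangeupper(port, upper, lag_type):
        raise RuntimeError(f"{port}: enslaving to {upper} refused")
    driver.changeupper(port, upper)


drv = SwitchDriver()
ip_link_set_master(drv, "sw1p1", "team0")  # creates the HW LAG for team0
ip_link_set_master(drv, "sw1p2", "team0")  # joins the existing HW LAG
```

Keying the table by the upper device is what lets the second port find the LAG created for the first.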

  6. RDMA over Ethernet (RoCE) / RDMA-CM
     • The upstream RDMA stack supports multiple transports: RoCE, IB, iWARP.
     • RoCE: RDMA over Converged Ethernet. RoCE v2 (upstream 4.5): IBTA RDMA headers over UDP. Uses the IPv4/6 addresses set on the regular Eth NIC port netdev.
     • RoCE apps use the RDMA-CM API for the control path and the verbs API for the data path.
     • RDMA-CM API (include/rdma/rdma_cm.h)
       • Address resolution: local route lookup + ARP/ND services (rdma_resolve_addr())
       • Route resolution: path lookup in IB networks (rdma_resolve_route())
       • Connection establishment: per-transport CM to wire the offloaded connection (rdma_connect())
     • Verbs API
       • Send/RDMA: send a message or perform an RDMA operation (post_send())
       • Poll: poll for completion of a Send/RDMA or Receive operation (poll_cq())
         • Async completion handling and fd semantics are supported
       • Post Receive Buffer: hand receive buffers to the NIC (post_recv())
     • RDMA Device
       • The DEVICE structure exposes all of the above operations
       • Associated with a net_device
       • Available for both RoCE and user-mode Ethernet programming (DPDK)
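
The control-path ordering listed above (address resolution, then route resolution, then connection establishment) can be sketched as a small state machine. This is a toy Python model that only borrows the function names from the slide; it does not call the RDMA-CM library, the real API is asynchronous and event driven, and the state names and `ToyCmId` class are hypothetical.

```python
# Each control-path step is valid in exactly one state and moves to the next.
STEPS = {
    "rdma_resolve_addr": ("idle", "addr_resolved"),
    "rdma_resolve_route": ("addr_resolved", "route_resolved"),
    "rdma_connect": ("route_resolved", "connected"),
}


class ToyCmId:
    """Stand-in for a connection identifier; it tracks state only."""

    def __init__(self):
        self.state = "idle"

    def step(self, call):
        required, result = STEPS[call]
        if self.state != required:
            raise RuntimeError(f"{call} called in state {self.state!r}")
        self.state = result
        return self.state


cm_id = ToyCmId()
history = [cm_id.step(call) for call in
           ("rdma_resolve_addr", "rdma_resolve_route", "rdma_connect")]
```

Only after the last step would an application move on to the verbs data path (posting receives, posting sends and polling for completions).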

  7. Native Model: HW Teaming
     • Configuration
       • Native Linux administration
       • RoCE bonding is mainly auto-configured
     • RoCE
       • Use transport object (QP, TIS) attribute: port affinity
       • RDMA devices associated with eth0, eth1 will be used for port management only (through Immutable caps)
       • And will unregister and register to drop existing consumers
       • Register a new ib_dev attached to the bond
       • eth0, eth1 will listen on Linux bond enslavement netlink events
       • The new RDMA device will always use the vendor's pick of PCIe function (PF0/1 or both)
     • LACP (802.3ad)
       • Either handled by the Linux bonding/teaming driver
       • Or in HW/FW for supporting NICs (required for many-PFs-to-single-phys-port configurations)
     • HW Bond
       • NIC logic for HW forwarding of ingress traffic to the bond/team RDMA device
       • net_dev traffic is passed directly to the owner net_dev according to ingress port
     [Diagram labels: RDMA Device, Linux Bonding/Teaming, eth0, eth1, PCIe PF0, PCIe PF1, HW Bond, NIC, Phys Port1, Phys Port2]

  8. eSwitch Software Model: Option I
     [Diagram labels: VM0, VM1 (SRIOV), VM2, VM3, Native OS, Linux/OVS Bridge, RDMA Device, br0, eth0, rep_vf0, rep_vf1, Linux Switch Device, PCIe PF0, PCIe VF0.0, PCIe VF0.1, eSwitch, NIC, Phys Port]

  9. eSwitch Software Model: Option II
     [Diagram labels: VM0, VM1 (SRIOV), VM2, VM3, Native OS, Linux/OVS Bridge, RDMA Device, eth0, rep_eth0, rep_phy0, rep_vf0, rep_vf1, Linux Switch Device, PCIe PF0, PCIe VF0.0, PCIe VF0.1, eSwitch, NIC, Phys Port]

  10. eSwitch Software Model with HA
     [Diagram labels: Linux/OVS Bridge, Native OS, VM0, VM1 (SRIOV), Linux Bonding, RDMA Device, Linux Switch Device, rep_vf0, rep_vf1, eth0, rep_eth0, rep_phy0, rep_phy1, PCIe PF0, PCIe PF1, PCIe VF0.0, PCIe VF1.0, eSwitch, HW Bond, NIC, Phys Port1, Phys Port2]

  11. eSwitch Software Model with Tunneling
     [Diagram labels: VM0, VM1 (SRIOV), VM2, VM3, OVS-VX Bridge, Linux/OVS Bridge, vxlan net_device, VNI (Key), UDP/IP Stack, rep_eth0, rep_phy0, rep_vf0, rep_vf1, Linux Switch Device, RDMA Device, PCIe PF0, eth0, PCIe VF0.0, PCIe VF0.1, eSwitch, HW Tunnel, NIC, Phys Port]

  12. Multi-PCI Socket NIC
     • Multiple PCIe endpoint NIC: the NIC can be connected through one or more PCIe buses
       • Each PCIe bus is connected to a different NUMA node
       • Exposed as 2 or more net_devices, each with its own associated RDMA device
       • Enjoys direct device-to-local-NUMA access
     • Application use & feel: would like to work with a single net interface
     • Use Linux bonding with RDMA device bonding
       • For TCP/IP traffic on TX, select the slave according to calling-context affinity
       • For RDMA traffic, select the slave according to:
         • Transport object (QP) logical port affinity
         • Or transport object creation thread CPU affinity
       • Don't share HW resources (CQ, SRQ) across different CPU sockets: each device has its own HW resources
     [Diagram labels: CPU, CPU, QPI, RDMA Device, Linux Bonding/Teaming, eth0, eth1, PCIe PF0, PCIe PF1, HW Bond, NIC, Phys Port]
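
The slave selection rules for RDMA traffic on this slide can be sketched in a few lines. This is a toy Python model under stated assumptions: the two-socket topology, the CPU-to-node map, the mapping of eth0/eth1 to NUMA nodes and the `pick_slave` function are all hypothetical.

```python
def pick_slave(cpu, cpu_to_node, slave_node, qp_port=None):
    """Pick the lower device for a transport object.

    An explicit logical port affinity on the QP wins; otherwise use the
    slave whose PCIe function sits on the creating thread's NUMA node.
    """
    if qp_port is not None:
        return qp_port
    node = cpu_to_node[cpu]
    for slave, local_node in slave_node.items():
        if local_node == node:
            return slave
    raise LookupError(f"no slave local to NUMA node {node}")


# Hypothetical two-socket host: CPUs 0-3 on node 0, CPUs 4-7 on node 1.
cpu_to_node = {cpu: 0 if cpu < 4 else 1 for cpu in range(8)}
slave_node = {"eth0": 0, "eth1": 1}  # assumed: eth0 behind PF0, eth1 behind PF1

local = pick_slave(5, cpu_to_node, slave_node)
pinned = pick_slave(5, cpu_to_node, slave_node, qp_port="eth0")
```

Choosing the NUMA-local device keeps DMA traffic off the inter-socket link, which is the motivation the slide gives for not sharing CQ/SRQ resources across sockets.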
