The Next Generation Lossless Network in the Data Center



SLIDE 1

The Next Generation Lossless Network

in the Data Center

BrightTalk, Data Center Transformation 3.0, January 2019 Paul Congdon, PhD

IEEE 802 Industry Connections Report

SLIDE 2

Disclaimer

All speakers presenting information on IEEE standards speak as individuals, and their views should be considered the personal views of that individual rather than the formal position, explanation, or interpretation of the IEEE.

SLIDE 3

Acknowledgements

The initial technical contribution and sponsorship for this work was provided by Huawei Technologies Co., Ltd.

This presentation summarizes work from the IEEE 802 Network Enhancements for the Next Decade Industry Connections Activity (Nendica).

Nendica, the IEEE 802 “Network Enhancements for the Next Decade” Industry Connections Activity, is organized under the IEEE 802.1 Working Group.

https://1.ieee802.org/802-nendica/

Report Freely Available at: https://ieeexplore.ieee.org/document/8462819

SLIDE 4

Our Digital Lives are driving Innovation in the DC

  • Interactive speech recognition
  • Interactive image recognition
  • Human/machine interaction
  • Autonomous driving

SLIDE 5

Critical Use Case – Online Data Intensive Services (OLDI)

  • OLDI applications have real-time deadlines and run in parallel on 1000s of servers.
  • Incast is a naturally occurring phenomenon.
  • Tail latency reduces the quality of the results.

[Figure: a request fans out through a tree of aggregators to workers; deadlines tighten from 250 ms at the root aggregator to 50 ms at mid-level aggregators to 10 ms at the workers.]
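The tail-latency point can be made concrete with a little arithmetic: if each worker independently has even a small chance of responding late, the chance that at least one worker is late approaches certainty as fan-out grows. The 1% straggler probability below is an illustrative assumption, not a number from the report.

```python
# Illustrative only: probability that a partition-aggregate request sees no
# straggler, assuming each worker is independently slow with probability 1%.
def p_no_straggler(n_workers, p_slow=0.01):
    """P(all n workers respond on time) = (1 - p_slow) ** n."""
    return (1 - p_slow) ** n_workers

for n in (1, 10, 100, 1000):
    print(f"{n:4d} workers: P(no straggler) = {p_no_straggler(n):.3f}")
```

At 1000 workers a 1% per-worker straggler rate makes an on-time response for the whole request essentially impossible, which is why the network's tail latency, not its average, governs OLDI quality.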

SLIDE 6

Critical Use Case – Deep Learning

  • Massively parallel HPC applications, such as AI training, depend on a low-latency, high-throughput network.
  • Models have billions of parameters.
  • Scale-out is limited by network performance.

[Figure: data-parallel training timeline per rank (feed data, training, MPI Allreduce of weights, send weights) over partitions of the dataset, and a plot of computing time, network time, and overall time versus number of computing nodes, with a "sweet spot" where overall time is minimized.]
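The MPI Allreduce step in the figure is commonly implemented as a ring. The following pure-Python simulation is a sketch of that pattern only (real training stacks use MPI or NCCL); the function name and chunk layout are assumptions for illustration.

```python
# Pure-Python sketch of ring Allreduce: n ranks, each holding n chunks.
# Phase 1 (reduce-scatter): after n-1 steps, rank r holds the full sum of
# chunk (r+1) % n. Phase 2 (allgather): fully reduced chunks circulate
# until every rank has the complete summed result.
def ring_allreduce(values):
    n = len(values)                      # n ranks, n chunks per rank
    data = [list(row) for row in values]
    for step in range(n - 1):            # reduce-scatter
        for r in range(n):
            c = (r - step) % n           # chunk rank r forwards this step
            data[(r + 1) % n][c] += data[r][c]
    for step in range(n - 1):            # allgather
        for r in range(n):
            c = (r + 1 - step) % n       # fully reduced chunk to pass on
            data[(r + 1) % n][c] = data[r][c]
    return data
```

Each rank exchanges only 2(n-1)/n of the data volume, which is why the ring is bandwidth-optimal; the cost is 2(n-1) latency-bound steps, so network latency directly bounds the "sweet spot" in the figure.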

SLIDE 7

Critical Use Case – NVMe Over Fabrics

  • Disaggregated resource pooling, such as NVMe over Fabrics, uses RDMA and runs over converged network infrastructure.
  • Low latency and lossless delivery are critical.
  • Ease of deployment and cloud scale are important success factors.

SLIDE 8

Critical Use Case – Cloudification of the Central Office

[Figure: a traditional central office built from purpose-built appliances (firewall, VPN, BRAS, base-band units, IP telephony, CDN, DPI) versus a cloudified central office running network function virtualization with orchestration and high-speed storage on standard Ethernet switches, both serving subscribers.]

  • Massive growth in mobile and Internet traffic is driving infrastructure investment.
  • To meet the performance requirements of traditional purpose-built equipment, SDN and NFV must run on low-latency, low-loss, scalable, and highly available network infrastructure.

SLIDE 9

We are dealing with massive amounts of data and computing

Requirements:

  • Fast, scalable storage
  • Parallel applications and data
  • Cloudified infrastructure

[Figure: high-speed network storage, neural networks, and cloud infrastructure combine, via divide-and-conquer, to deliver real-time, natural human/machine response.]

SLIDE 10

Congestion Creates the Problems

Massive data, massive compute, and massive messaging drive network congestion, which causes packet loss, latency loss, and throughput loss. Parallelism can create congestion; congestion leads to loss; loss makes the end user unhappy.

SLIDE 11

The Impact of Congestion in a Lossless Network

The impact of congestion on network performance can be very serious. As shown by Pedro J. Garcia et al. (IEEE Micro, 2006) [1], injecting hot-spot traffic causes throughput to drop by 70% and latency to increase by three orders of magnitude: network performance degrades dramatically after congestion appears.

[Figure: network throughput versus generated traffic, and average packet latency, before and after injecting hot-spot traffic.]

[1] Garcia, Pedro Javier, et al. "Efficient, scalable congestion management for interconnection networks." IEEE Micro 26.5 (2006): 52-66.

SLIDE 12

Dealing with Congestion Today

  • ECMP (Equal-Cost Multipath routing) spreads flows across the parallel paths of the fabric.
  • Explicit Congestion Notification (ECN) marks packets at the congested switch, and the congestion feedback tells the sender to slow down.
  • Priority-based Flow Control (PFC) pauses the upstream link so that packets are not dropped.

[Figure: leaf-spine fabric showing ECMP path choice, ECN marking with congestion feedback to the source, and PFC toward the upstream hop.]
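The two reactions can be sketched as threshold checks on queue depth: ECN marks probabilistically between a low and high watermark (RED-style), while PFC pauses the upstream port at a higher watermark. The function name and threshold values below are illustrative assumptions, not values from any standard.

```python
# Illustrative switch reaction to queue depth (thresholds are assumptions):
# ECN-mark between KMIN and KMAX, always mark above KMAX, PFC-pause above XOFF.
import random

KMIN, KMAX = 100, 400      # ECN marking thresholds (packets)
XOFF = 600                 # PFC pause threshold (packets)

def on_enqueue(queue_depth):
    """Return the actions a switch would take at this queue depth."""
    actions = []
    if queue_depth >= XOFF:
        actions.append("send PFC PAUSE upstream")   # lossless, but pauses the whole priority
    if queue_depth >= KMAX:
        actions.append("ECN-mark packet")
    elif queue_depth > KMIN:
        p = (queue_depth - KMIN) / (KMAX - KMIN)    # probabilistic marking region
        if random.random() < p:
            actions.append("ECN-mark packet")
    return actions
```

The ordering matters: ECN is the end-to-end signal that should act first, and PFC is the last-resort backstop that keeps the network lossless while the ECN control loop catches up.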

SLIDE 13

Ongoing challenges with congestion

  • ECMP collisions: hash-based path selection is load-oblivious, so flows can collide (e.g., two 30G flows hashed onto the same 40G link while another 40G link carries only 15G).
  • ECN control-loop delay: queues build up before the congestion feedback reaches the sender and takes effect.
  • Head-of-line blocking (HoLB): a PFC pause stops an entire priority, so victim flows queued behind the congested flow are blocked too.

[Figure: leaf-spine fabric showing ECMP collisions on the 40G uplinks, the ECN feedback loop, and PFC-induced head-of-line blocking.]
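The collision problem is just the birthday problem in disguise: hashing flows onto paths without looking at load makes collisions likely even with few flows. The sketch below is illustrative (crc32 stands in for a switch's hash; function names are assumptions).

```python
# ECMP hashes a flow identifier to one uplink, oblivious to load, so two
# elephant flows can land on the same link while others sit idle.
import zlib

def ecmp_path(flow_id, num_paths):
    """Static hash-based path choice, as ECMP does (crc32 as a stand-in hash)."""
    return zlib.crc32(flow_id.encode()) % num_paths

def p_no_collision(k, n):
    """Birthday-problem view: chance that k flows all pick distinct paths out of n."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p
```

Even 4 elephant flows on 8 uplinks pick distinct paths only about 41% of the time, so some link carries two elephants more often than not.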

SLIDE 14

Potential New Lossless Technologies for the Data Center

Goal: no loss. No packet loss, no latency loss, no throughput loss.

Solutions:

  • Virtual Input Queuing (VIQ)
  • Dynamic Virtual Lanes (DVL)
  • Load-Aware Packet Spraying (LPS)
  • Push & Pull Hybrid Scheduling (PPH)

SLIDE 15

VIQ (Virtual Input Queues): Resolve Internal Packet Loss

Incast congestion leads to internal packet loss:

  • 1. During an incast scenario, no single ingress queue counter exceeds the PFC threshold, so no PFC pause frame is sent upstream and packets keep arriving on the ingress ports.
  • 2. But the physical egress queue builds a backlog because of the convergence effect, and packets are lost inside the switch when egress and ingress are not coordinated.

VIQ can be viewed as giving each egress port a dedicated virtual queue per ingress port. Buffer memory changes from fully shared to virtually partitioned per ingress port, so every ingress port gets fair scheduling and application tail latency can be controlled effectively.

Coordinated egress-ingress queuing

[Figure: ingress queue counters remain below the PFC threshold while the egress queue overflows.]
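The fair-scheduling idea can be sketched as round-robin service over per-ingress virtual queues at the egress port, so one hot ingress cannot starve the others. This toy model (class and method names are assumptions) ignores buffer accounting and PFC entirely.

```python
# Toy model of virtual input queuing at one egress port: a virtual queue per
# ingress port, served round-robin.
from collections import deque

class EgressPortVIQ:
    def __init__(self, num_ingress_ports):
        self.viq = [deque() for _ in range(num_ingress_ports)]
        self._next = 0                       # round-robin pointer

    def enqueue(self, ingress_port, packet):
        self.viq[ingress_port].append(packet)

    def dequeue(self):
        """Serve per-ingress virtual queues round-robin; None if all empty."""
        for _ in range(len(self.viq)):
            q = self.viq[self._next]
            self._next = (self._next + 1) % len(self.viq)
            if q:
                return q.popleft()
        return None
```

Even if ingress 0 floods the egress with a burst, a lone packet from ingress 1 is served on the very next turn, which is the tail-latency control the slide describes.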

SLIDE 16

DVL (Dynamic Virtual Lanes)

  • 1. Identify the flow causing congestion and isolate it locally.
  • 2. Signal the upstream neighbor (CIP) when the congested queue fills.
  • 3. The upstream switch isolates the flow too, eliminating head-of-line blocking.
  • 4. If the congested queue continues to fill, invoke PFC to stay lossless.

[Figure: downstream and upstream switches with ingress virtual queues and egress ports; congested flows are steered into a dedicated queue while non-congested flows pass, eliminating HoL blocking.]
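Steps 1 and 2 can be sketched as a queue that, past a fill threshold, moves its heaviest contributor into a dedicated "congested" queue and reports when that queue is full enough to warrant signaling upstream. Class name, thresholds, and the heaviest-contributor heuristic are all illustrative assumptions.

```python
# Toy sketch of dynamic virtual lanes: isolate the hot flow into its own
# queue so other flows on the same priority are not blocked behind it.
from collections import deque

ISOLATE_THRESHOLD = 4   # normal-queue depth at which we isolate the hot flow
SIGNAL_THRESHOLD = 8    # congested-queue depth at which we signal upstream

class DVLQueue:
    def __init__(self):
        self.normal = deque()
        self.congested = deque()
        self.isolated_flows = set()

    def enqueue(self, flow_id, pkt):
        """Queue a packet; return True when upstream should be signaled."""
        if flow_id in self.isolated_flows:
            self.congested.append((flow_id, pkt))
        else:
            self.normal.append((flow_id, pkt))
            if len(self.normal) >= ISOLATE_THRESHOLD:
                counts = {}                      # find the heaviest contributor
                for f, _ in self.normal:
                    counts[f] = counts.get(f, 0) + 1
                hot = max(counts, key=counts.get)
                self.congested.extend(p for p in self.normal if p[0] == hot)
                self.normal = deque(p for p in self.normal if p[0] != hot)
                self.isolated_flows.add(hot)
        return len(self.congested) >= SIGNAL_THRESHOLD
```

The two-level threshold mirrors the slide: isolation is the fast local action, and only if the isolated queue keeps filling does the switch escalate (first a signal upstream, ultimately PFC).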

SLIDE 17

LPS (Load-Aware Packet Spraying)

LPS = packet spraying + endpoint reordering + load awareness

Load-balancing design space:

  • Framework: centralized (e.g., Hedera, B4, SWAN), which is slow to react for data centers, or distributed.
  • State: stateless; local (e.g., ECMP, Flare, LocalFlow), which handles asymmetric traffic poorly; or global.
  • Granularity: flow, flowlet, flowcell, or packet. Finer granularity may require packet re-ordering at the endpoint.
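The core of the idea can be sketched in a few lines: instead of hashing a whole flow to one path, send each packet to the currently least-loaded path and let the endpoint reorder. Tracking load as a simple byte counter per path is an assumption of this sketch, as is the function name.

```python
# Sketch of load-aware packet spraying: per-packet choice of the least-loaded
# path, rather than ECMP's per-flow hash.
def spray(packet_sizes, num_paths):
    """Assign each packet to the least-loaded path; return (assignment, load)."""
    load = [0] * num_paths
    assignment = []
    for size in packet_sizes:
        path = min(range(num_paths), key=lambda p: load[p])  # load-aware choice
        load[path] += size
        assignment.append(path)
    return assignment, load
```

Because the decision is per packet and load-aware, an elephant flow is spread across all uplinks instead of colliding onto one, at the cost of possible reordering that the receiving endpoint must repair.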

SLIDE 18

PPH (Push & Pull Hybrid Scheduling)

PPH = congestion-aware traffic scheduling: push when load is light, pull when load is high.

  • Light load: all push, to achieve low latency.
  • Light congestion: open pull for the congested part of the path.
  • Heavy load: all pull, to reduce queuing delay and improve throughput.

[Figure: leaf-spine fabric; pushed data is sent immediately (short RTT), while pulled data waits for a request/grant exchange between source and destination (longer RTT).]
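The mode switch itself is simple to sketch: below a load threshold the destination lets sources push immediately; above it, sources must wait for a grant. The threshold value and function name are illustrative assumptions.

```python
# Sketch of the PPH decision at a destination: push while its offered load is
# light, switch to request/grant (pull) when load passes a threshold.
PULL_THRESHOLD = 0.7   # fraction of receive capacity in use (assumed value)

def schedule_mode(offered_load, capacity):
    """Return 'push' (send immediately) or 'pull' (wait for a grant)."""
    utilization = offered_load / capacity
    return "pull" if utilization > PULL_THRESHOLD else "push"
```

The trade-off this encodes is the one in the figure: push saves a request/grant round trip when queues are short, while pull keeps queues short (and incast loss away) when many sources converge on one destination.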

SLIDE 19

Innovation for the Lossless Network

  • Dynamic Virtual Lanes: isolate congestion. Priority-based Flow Control is coarse-grained, and victim flows are hurt by the congested flows. DVL moves congested flows out of the way, eliminating head-of-line blocking and allowing time for end-to-end congestion control to act.

  • Push & Pull Hybrid Scheduling: schedule appropriately. Unscheduled, network-resource-unaware many-to-one communication leads to incast packet loss. PPH integrates information from the source, the network, and the destination into the scheduling decision.

  • Load-Aware Packet Spraying: spread the load. Unbalanced load sharing lets elephant-flow collisions block mice flows. LPS load-balances at a finer granularity and uses congestion awareness to avoid collisions.

  • Virtual Input Queues: coordinate resources. Ingress thresholds are unrelated to egress buffer availability, so incast causes internal packet loss. VIQ coordinates egress availability with ingress demand, avoiding internal switch packet loss.

SLIDE 20

Thank You