RoGUE: RDMA over Generic Unconverged Ethernet



slide-1
SLIDE 1

RoGUE: RDMA over Generic Unconverged Ethernet

Yanfang Le

with Brent Stephens, Arjun Singhvi, Aditya Akella, Mike Swift

slide-2
SLIDE 2

RDMA Overview

[Figure: two hosts with USER, KERNEL, and HARDWARE layers; RDMA transfers data between the application buffers. Labels: Zero Copy, Kernel Bypass, Protocol Offload.]

slide-3
SLIDE 3

RDMA Overview

[Figure: two hosts with USER, KERNEL, and HARDWARE layers; RDMA transfers data between the application buffers. Labels: Zero Copy, Kernel Bypass, Protocol Offload.]

Low latency, high throughput, low CPU utilization

slide-4
SLIDE 4

RDMA Overview

  • RoCE: a protocol that provides RDMA over a lossless Ethernet network

[Figure: two hosts with USER, KERNEL, and HARDWARE layers; RDMA transfers data between the application buffers. Labels: Zero Copy, Kernel Bypass, Protocol Offload.]

Low latency, high throughput, low CPU utilization

slide-5
SLIDE 5

Priority Flow Control

RoCE assumes the Ethernet network is lossless, which is achieved by enabling Priority Flow Control (PFC).

[Figure: a sender (server or switch) linked to a receiver (switch or server)]

slide-6
SLIDE 6

Priority Flow Control

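RoCE assumes the Ethernet network is lossless, which is achieved by enabling Priority Flow Control (PFC).

[Figure: a sender (server or switch) linked to a receiver (switch or server); a pause frame travels from the receiver back to the sender]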
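
The pause-frame mechanism can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name `PfcQueue`, the `xoff`/`xon` thresholds, and the "PAUSE"/"RESUME" strings are hypothetical names chosen for illustration. Real PFC pause frames carry per-priority pause timers, which this model leaves out.

```python
from collections import deque


class PfcQueue:
    """Toy ingress queue for one priority class (illustrative thresholds)."""

    def __init__(self, xoff=8, xon=4):
        self.xoff = xoff      # occupancy at which the upstream sender is paused
        self.xon = xon        # occupancy at which the upstream sender is resumed
        self.q = deque()
        self.paused = False   # True while the upstream sender is paused

    def enqueue(self, pkt):
        """Store a packet; return "PAUSE" if a pause frame should be sent."""
        self.q.append(pkt)
        if not self.paused and len(self.q) >= self.xoff:
            self.paused = True
            return "PAUSE"
        return None

    def dequeue(self):
        """Remove a packet; return (pkt, "RESUME") once the queue has drained."""
        pkt = self.q.popleft()
        if self.paused and len(self.q) <= self.xon:
            self.paused = False
            return pkt, "RESUME"
        return pkt, None
```

Because the pause applies to every flow in the priority class rather than to the flow that caused the buildup, a model like this also hints at why head-of-line blocking can arise.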

slide-7
SLIDE 7

Motivation

slide-8
SLIDE 8

Motivation

slide-9
SLIDE 9

Motivation

HOL Blocking

slide-10
SLIDE 10

Motivation

HOL Blocking

Unfairness

slide-11
SLIDE 11

Motivation

  • Data center providers are reluctant to enable PFC
    – Instead, they isolate RDMA traffic and TCP traffic

HOL Blocking

Unfairness

slide-12
SLIDE 12

Motivation

  • Data center providers are reluctant to enable PFC
    – Instead, they isolate RDMA traffic and TCP traffic
  • RDMA has not seen the uptake it deserves

HOL Blocking

Unfairness

slide-13
SLIDE 13

Can we run RDMA over a generic Ethernet network without any reliance on PFC?

slide-14
SLIDE 14

Can we run RDMA over a generic Ethernet network without any reliance on PFC?

RoCE + PFC: congestion control, no packet drop

slide-15
SLIDE 15

Can we run RDMA over a generic Ethernet network without any reliance on PFC?

RoCE + PFC: congestion control, no packet drop

RoGUE

slide-16
SLIDE 16

Can we run RDMA over a generic Ethernet network without any reliance on PFC?

RoCE + PFC: congestion control, no packet drop

RoGUE: congestion control

slide-17
SLIDE 17

Can we run RDMA over a generic Ethernet network without any reliance on PFC?

RoCE + PFC: congestion control, no packet drop

RoGUE: congestion control, retransmission

slide-18
SLIDE 18

Can we run RDMA over a generic Ethernet network without any reliance on PFC?

RoCE + PFC: congestion control, no packet drop

RoGUE: congestion control, retransmission, yet retains low latency and low CPU utilization

slide-19
SLIDE 19

RoCE Overview

[Figure: an RDMA app on the CPU and an RNIC holding a queue pair (QP: send queue and receive queue) and a completion queue. Labels: Verb, Signal.]

slide-22
SLIDE 22

RoCE Overview

[Figure: an RDMA app on the CPU and an RNIC holding a queue pair (QP: send queue and receive queue) and a completion queue. Labels: Verb, Signal (shown twice).]

slide-23
SLIDE 23

Where to fix: HW or SW?

Hardware

  • ✅ Low CPU utilization, low latency
  • ❌ Requires working with the NIC vendor
  • ❌ Heterogeneous network hardware with non-standard protocol implementations
  • ❌ Complicates network evolution

Software

  • ✅ Easy to implement
  • ❌ Packet-level congestion signals are unavailable
  • ❌ High CPU utilization if operations are per-packet

slide-24
SLIDE 24

RoGUE Overview

Components, split across CPU and RNIC:

  • Congestion Control
  • Loss Recovery

slide-25
SLIDE 25

RoGUE Overview

Components, split across CPU and RNIC:

  • Congestion Control
    – CPU: congestion control loop, CPU-efficient segmenting
  • Loss Recovery

slide-26
SLIDE 26

RoGUE Overview

Components, split across CPU and RNIC:

  • Congestion Control
    – CPU: congestion control loop, CPU-efficient segmenting
    – RNIC: hardware timestamp to measure RTT, hardware rate limiter to pace packets
  • Loss Recovery

slide-27
SLIDE 27

RoGUE Overview

Components, split across CPU and RNIC:

  • Congestion Control
    – CPU: congestion control loop, CPU-efficient segmenting
    – RNIC: hardware timestamp to measure RTT, hardware rate limiter to pace packets
  • Loss Recovery
    – CPU: shadow queue pair

slide-28
SLIDE 28

RoGUE Overview

Components, split across CPU and RNIC:

  • Congestion Control
    – CPU: congestion control loop, CPU-efficient segmenting
    – RNIC: hardware timestamp to measure RTT, hardware rate limiter to pace packets
  • Loss Recovery
    – CPU: shadow queue pair
    – RNIC: hardware retransmission

slide-29
SLIDE 29

Congestion Signal

[Figure: Sender, Switch, Receiver; packets from different flows pass through the switch]

slide-30
SLIDE 30

Congestion Signal

[Figure: Sender, Switch, Receiver; packets from different flows pass through the switch. Labels: ACK, RTT.]

slide-32
SLIDE 32

Congestion Signal

[Figure: Sender, Switch, Receiver; packets from different flows pass through the switch. Two ACKs and two RTT intervals are marked.]

slide-33
SLIDE 33

Congestion Signal

[Figure: Sender, Switch, Receiver; packets from different flows pass through the switch. Two ACKs and two RTT intervals are marked.]

  • If the RTT is high, the queue is building up: reduce the sending rate
  • If the RTT is low, the network is idle: increase the sending rate
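
The two rules above can be sketched as a single rate update. This is a minimal illustration, not RoGUE's actual control law: the function name `adjust_rate`, the thresholds `t_low_us` and `t_high_us`, the additive step, and the multiplicative factor `beta` are all assumptions made for the example.

```python
def adjust_rate(rate_gbps, rtt_us, t_low_us=50.0, t_high_us=500.0,
                step_gbps=0.5, beta=0.8, line_rate_gbps=10.0):
    """Return a new sending rate from one RTT sample (hypothetical parameters)."""
    if rtt_us > t_high_us:
        # High RTT: the switch queue is building up, so back off.
        return max(rate_gbps * beta, 0.01)
    if rtt_us < t_low_us:
        # Low RTT: the network is idle, so probe for more bandwidth.
        return min(rate_gbps + step_gbps, line_rate_gbps)
    # In between: hold the current rate.
    return rate_gbps
```

For example, a 5 Gbps flow that observes a 1000 us RTT would drop to about 4 Gbps under these assumed parameters, while a 10 us RTT would raise it to 5.5 Gbps.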

slide-34
SLIDE 34

CPU Efficient Segmenting

  • Two key questions
    – How large a verb should RoGUE send?
    – How often should the RNIC be signaled?
  • Small verb (< 64KB)
    – signal every 64KB
    – CPU utilization (< 20%)
  • Large verb (>= 64KB)
    – chunk, and signal every 64KB
    – CPU utilization (< 10%)

[Figure: timeline across Host, RNIC, and RNIC. Labels: Verb 1, 2, 3, 4, 5; Verb 6; Signal 1.]

slide-36
SLIDE 36

CPU Efficient Segmenting

  • Two key questions
    – How large a verb should RoGUE send?
    – How often should the RNIC be signaled?
  • Small verb (< 64KB)
    – signal every 64KB
    – CPU utilization (< 20%)
  • Large verb (>= 64KB)
    – chunk, and signal every 64KB
    – CPU utilization (< 10%)

[Figure: timeline across Host, RNIC, and RNIC. Labels: Verb 1, 2, 3, 4, 5; Verb 6; Verb 6 packets; Signal 1; Signal 2; Signal 3.]
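
The segmenting policy above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function `segment`, the tuple layout, and the byte counter are assumptions. Large verbs are split into chunks of at most 64KB, and a chunk is marked as signaled whenever at least 64KB has been posted since the last signal.

```python
CHUNK = 64 * 1024  # 64KB segmenting and signaling granularity from the slide


def segment(verb_sizes, chunk=CHUNK):
    """Return a list of (verb_index, nbytes, signaled) work requests."""
    out = []
    unsignaled = 0  # bytes posted since the last signaled request
    for i, size in enumerate(verb_sizes):
        remaining = size
        while remaining > 0:
            n = min(remaining, chunk)   # split large verbs into 64KB chunks
            remaining -= n
            unsignaled += n
            signaled = unsignaled >= chunk
            if signaled:
                unsignaled = 0
            out.append((i, n, signaled))
    return out
```

With four 16KB verbs, only the fourth request is signaled; with one 160KB verb, the two full 64KB chunks are signaled and the 32KB tail is not. A real sender would also need some way to flush a trailing unsignaled request, which this sketch omits.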

slide-37
SLIDE 37

RTT measurement

[Figure: timeline across Host, RNIC, and RNIC. Events: Verb 1, Verb 1 packets, Signal 1, Send Ack 1, Send Ack 2. Timestamps: Tenc_s1, Tenc_s2, Tcomp_s1, Tcomp_s2.]

slide-38
SLIDE 38

RTT measurement

[Figure: timeline across Host, RNIC, and RNIC. Events: Verb 1, Verb 2, Verb 1 packets, Verb 2 packets, Signal 1, Signal 2, Send Ack 1, Send Ack 2. Timestamps: Tenc_s1, Tenc_s2, Tcomp_s1, Tcomp_s2.]

slide-39
SLIDE 39

RTT measurement

[Figure: timeline across Host, RNIC, and RNIC. Events: Verb 1, Verb 2, Verb 1 packets, Verb 2 packets, Signal 1, Signal 2, Send Ack 1, Send Ack 2. Timestamps: Tenc_s1, Tenc_s2, Tcomp_s1, Tcomp_s2.]

Tstart_si = max(time Verb i is enqueued, time the last packet of Verb i-1 leaves the NIC)

slide-40
SLIDE 40

RTT measurement

[Figure: timeline across Host, RNIC, and RNIC. Events: Verb 1, Verb 2, Verb 1 packets, Verb 2 packets, Signal 1, Signal 2, Send Ack 1, Send Ack 2. Timestamps: Tenc_s1, Tenc_s2, Tstart_s2, Tcomp_s1, Tcomp_s2.]

Tstart_si = max(time Verb i is enqueued, time the last packet of Verb i-1 leaves the NIC)

slide-41
SLIDE 41

RTT measurement

[Figure: timeline across Host, RNIC, and RNIC. Events: Verb 1, Verb 2, Verb 1 packets, Verb 2 packets, Signal 1, Signal 2, Send Ack 1, Send Ack 2. Timestamps: Tenc_s1, Tenc_s2, Tstart_s2, Tcomp_s1, Tcomp_s2.]

Tstart_si = max(time Verb i is enqueued, time the last packet of Verb i-1 leaves the NIC)

RTT_i = Tcomp_si - Tstart_si - bytes / rate_limit

slide-42
SLIDE 42

RTT measurement

[Figure: timeline across Host, RNIC, and RNIC. Events: Verb 1, Verb 2, Verb 1 packets, Verb 2 packets, Signal 1, Signal 2, Send Ack 1, Send Ack 2. Timestamps: Tenc_s1, Tenc_s2, Tstart_s2, Tcomp_s1, Tcomp_s2.]

Tstart_si = max(time Verb i is enqueued, time the last packet of Verb i-1 leaves the NIC)

RTT_i = Tcomp_si - Tstart_si - bytes / rate_limit

RTT is measured using hardware timestamps.
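
The two formulas above translate directly into a small helper. This is a sketch only: the function and parameter names are hypothetical, times are assumed to be in seconds, and `rate_limit` is assumed to be in bytes per second so that bytes / rate_limit is the time the rate limiter spends pacing the verb onto the wire.

```python
def measure_rtt(t_enqueued, t_prev_last_pkt_out, t_completion,
                nbytes, rate_limit_bytes_per_s):
    """Estimate RTT for verb i from timestamps (all times in seconds)."""
    # Tstart_si: the verb cannot start before it is enqueued, nor before
    # the last packet of the previous verb has left the NIC.
    t_start = max(t_enqueued, t_prev_last_pkt_out)
    # Time spent pacing the verb's bytes at the configured rate limit.
    pacing_time = nbytes / rate_limit_bytes_per_s
    # RTT_i = Tcomp_si - Tstart_si - bytes / rate_limit
    return t_completion - t_start - pacing_time
```

For instance, a 64KB verb paced at 1.25e9 bytes per second (10 Gbps) that starts at 1 ms and completes at 2 ms yields an RTT estimate of roughly 0.95 ms.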

slide-43
SLIDE 43

Congestion Response

slide-44
SLIDE 44

Congestion Response

  • Similar to TCP Vegas and Timely
slide-45
SLIDE 45

Congestion Response

  • Similar to TCP Vegas and Timely
  • If the congestion window is >= 64KB: window-based control plus rate limiter

slide-46
SLIDE 46

Congestion Response

  • Similar to TCP Vegas and Timely
  • If the congestion window is >= 64KB: window-based control plus rate limiter
  • If the congestion window is < 64KB: rate limiter only
slide-47
SLIDE 47

Congestion Response

  • Similar to TCP Vegas and Timely
  • If the congestion window is >= 64KB: window-based control plus rate limiter
  • If the congestion window is < 64KB: rate limiter only
  • The rate limiter is offloaded to the RNIC
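
The mode switch at 64KB can be sketched as below. This is a hedged illustration: the function `congestion_mode`, the returned dictionary, and in particular the pacing rate of cwnd / RTT are assumptions made for the example; the slides do not state how the rate is derived.

```python
THRESHOLD = 64 * 1024  # 64KB boundary from the slide


def congestion_mode(cwnd_bytes, rtt_s, threshold=THRESHOLD):
    """Pick the enforcement mechanisms for a given congestion window."""
    # Assumed pacing rate (not from the slides): one window per RTT.
    rate = cwnd_bytes / rtt_s
    if cwnd_bytes >= threshold:
        # Large window: limit bytes in flight and also pace with the rate limiter.
        return {"window_bytes": cwnd_bytes, "rate_bytes_per_s": rate}
    # Small window: rely on the rate limiter alone.
    return {"window_bytes": None, "rate_bytes_per_s": rate}
```

One plausible reason for the split is that a window below the 64KB segmenting granularity is too coarse to enforce per chunk, so pacing in the RNIC takes over.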
slide-48
SLIDE 48

Evaluation

  • Mellanox ConnectX-3 Pro 10Gbps RNICs, DCQCN
  • Baselines: DCTCP, DCQCN

slide-49
SLIDE 49

Evaluation-Cluster Experiments

  • Each of 16 hosts generates 1MB RPCs for random destinations and sends a 1KB RPC once every ten 1MB RPCs

[Figure: (a) Large RPCs (1MB), median FCT; (b) Small RPCs (1KB), 90th-percentile FCT. x-axis: Network Load (%) at 10, 25, 50, 75. y-axis: Flow Completion Time, in ms for (a) and us for (b). Series: RoGUE, RoCE (w/ DCQCN), DCTCP.]

slide-50
SLIDE 50

Evaluation-Congestion Response

[Figure: Throughput (Gbps) versus Time (s) for flow 0 through flow 4]

slide-51
SLIDE 51

Evaluation-CPU Utilization

[Figure: CPU Utilization (%) on the client and the server for DCTCP, RoCE (READ RC), and RoGUE (READ RC)]

slide-52
SLIDE 52

Summary

  • It is possible to support RoCE without relying on PFC
  • Judicious division of labor between SW and HW for congestion control and retransmission, while retaining low CPU utilization
  • RoGUE supports RC and UC transport types of CC
  • Evaluation results show that RoGUE's performance is competitive with native RoCE