RoGUE: RDMA over Generic Unconverged Ethernet
Yanfang Le
with Brent Stephens, Arjun Singhvi, Aditya Akella, Mike Swift
RDMA Overview
USER KERNEL HARDWARE
Application Buffer to Application Buffer: Zero Copy, Kernel Bypass, Protocol Offload
Low Latency, High Throughput, Low CPU Utilization
RoCE assumes the Ethernet network is lossless, which is achieved by enabling Priority Flow Control (PFC).
Pause frame
– Instead, isolate RDMA traffic and TCP traffic
Congestion Control, No packet drop
Congestion Control, Retransmission, yet retain low latency and low CPU utilization
RDMA APP
Verb
QP: Send Queue, Receive Queue
Completion Queue
Signal
✅ Low CPU utilization, low latency
❌ Requires working with the NIC vendor
❌ Heterogeneous network hardware with non-standard protocol implementations
❌ Complicates network evolution

✅ Easy to implement
❌ Packet-level congestion signals are unavailable
❌ High CPU utilization with per-packet operations
Congestion Control
– Congestion control loop
– CPU-efficient segmenting
– Hardware timestamp to measure RTT
– Hardware rate limiter to pace packets
Loss Recovery
– Hardware retransmission
– Shadow Queue Pair
Sender Switch Receiver
Packets from different flows share the switch
RTT is measured from each packet to its ACK
– Queue builds up (RTT grows): reduce the sending rate
– Queue idle (RTT stays low): increase the sending rate
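The rule on this slide (slow down when delay builds up, speed up when the path is idle) can be sketched as a small rate-update function. This is a minimal illustration, not RoGUE's actual control loop: the two RTT thresholds, the additive step, the multiplicative backoff factor, and the function name `adjust_rate` are all assumptions made for the example.

```python
def adjust_rate(rate_gbps, rtt_us, low_us=20.0, high_us=100.0,
                step_gbps=0.5, backoff=0.8,
                line_rate_gbps=10.0, min_rate_gbps=0.1):
    """Return the next sending rate for one RTT sample.

    All thresholds and gains are hypothetical defaults chosen for
    illustration; the slides do not give concrete values.
    """
    if rtt_us > high_us:
        # RTT is high: a queue is building up, so back off multiplicatively.
        new_rate = rate_gbps * backoff
    elif rtt_us < low_us:
        # RTT is low: the path looks idle, so probe for more bandwidth.
        new_rate = rate_gbps + step_gbps
    else:
        # RTT is in the target band: hold the current rate.
        new_rate = rate_gbps
    # Never exceed the line rate or drop below a small floor.
    return max(min_rate_gbps, min(line_rate_gbps, new_rate))
```

In a design like the one outlined in the talk, the returned value would be handed to the hardware rate limiter rather than enforced per packet in software.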
RoGUE decides for each verb: send? signaled?
Host RNIC RNIC
Verb 1, 2, 3, 4, 5
Verb 6, Verb 6 packets
Signal 1, Signal 2, Signal 3
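One plausible reading of the "signaled?" decision is selective signaling: a large message is cut into several verbs and only some of them request a completion signal, so the CPU handles fewer completions. The sketch below illustrates that idea only. The `Verb` record, the `segment_message` helper, and the `signal_every` policy are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Verb:
    offset: int      # byte offset of this segment within the message
    length: int      # segment size in bytes
    signaled: bool   # whether this verb asks for a completion signal


def segment_message(total_bytes, seg_bytes, signal_every):
    """Split a message into verbs and mark a subset as signaled.

    Assumed policy: every `signal_every`-th verb is signaled, and the
    final verb is always signaled so the sender learns the message is done.
    """
    if seg_bytes <= 0 or signal_every <= 0:
        raise ValueError("seg_bytes and signal_every must be positive")
    verbs = []
    offset = 0
    index = 0
    while offset < total_bytes:
        length = min(seg_bytes, total_bytes - offset)
        index += 1
        is_last = offset + length == total_bytes
        signaled = (index % signal_every == 0) or is_last
        verbs.append(Verb(offset, length, signaled))
        offset += length
    return verbs
```

For example, `segment_message(6000, 1000, 3)` yields six verbs, of which the third and sixth are signaled.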
Host RNIC RNIC
Verb 1, Verb 2
Verb 1 packets, Verb 2 packets
Signal 1, Signal 2
Send Ack 1, Send Ack 2
Timestamps: Tenc_s1, Tenc_s2, Tstart_s2, Tcomp_s2, Tcomp_s1
Tstart_si = max(Verb i enqueued, last packet of Verb i-1 goes out of NIC)
RTTi = Tcomp_si - Tstart_si - bytes / rate_limit
RTT is measured by hardware timestamps.
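The two formulas above translate directly into code. The following is a small worked sketch; the function and parameter names are illustrative, and the units (microseconds, bytes per microsecond) are an assumption made so the arithmetic is easy to follow.

```python
def verb_start_time(t_enqueued, t_prev_last_packet_out):
    """Tstart_si: a verb starts transmitting when it has been enqueued
    and the last packet of the previous verb has left the NIC."""
    return max(t_enqueued, t_prev_last_packet_out)


def verb_rtt(t_comp, t_start, num_bytes, rate_limit):
    """RTTi = Tcomp_si - Tstart_si - bytes / rate_limit.

    Subtracting bytes / rate_limit removes the time spent pacing the
    verb's own bytes out of the NIC, leaving the network round trip.
    """
    return t_comp - t_start - num_bytes / rate_limit


# Worked example with made-up numbers (times in microseconds).
t_start = verb_start_time(t_enqueued=10.0, t_prev_last_packet_out=60.0)
rtt = verb_rtt(t_comp=200.0, t_start=t_start,
               num_bytes=62500, rate_limit=1250.0)  # 1250 B/us = 10 Gbps
print(t_start, rtt)  # prints 60.0 90.0
```

Here Verb 2 is enqueued at 10 us but cannot start until Verb 1's last packet leaves at 60 us; sending 62500 bytes at 1250 bytes/us takes 50 us, so the measured RTT is 200 - 60 - 50 = 90 us.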
Workload: send one 1KB RPC once every ten 1MB RPCs
Figure: Flow completion time vs. network load (10, 25, 50, 75%) for RoGUE, RoCE (w/ DCQCN), and DCTCP.
(a) Large RPCs (1MB): median FCT, in ms.
(b) Small RPCs (1KB): 90th percentile FCT, in us.
Figure: Throughput (Gbps) over time (s) for flow 0, flow 1, flow 2, flow 3, and flow 4.
Figure: CPU utilization (%) at the client and the server for DCTCP, RoCE (READ RC), and RoGUE (READ RC).