Performance Isolation Anomalies in RDMA
Yiwen Zhang
with Juncheng Gu, Youngmoon Lee, Mosharaf Chowdhury, and Kang G. Shin
Cloud operators are aggressively deploying RDMA in datacenters [1][2][3].
[1] Guo, Chuanxiong, et al. "RDMA over Commodity Ethernet at Scale." SIGCOMM 2016.
[2] Mittal, Radhika, et al. "TIMELY: RTT-based Congestion Control for the Datacenter." SIGCOMM 2015.
[3] Zhu, Yibo, et al. "Congestion Control for Large-Scale RDMA Deployments." SIGCOMM 2015.
Growing demand from ultra-low-latency applications
High-bandwidth applications
RDMA provides both low latency and high bandwidth.
At large-scale deployments, RDMA-enabled applications are unlikely to run in a vacuum: the network must be shared. The HPC community uses static partitioning to minimize sharing [1]. Research on RDMA over Ethernet-based datacenters focuses on congestion control [2][3].
[1] Ranadive, Adit, et al. "FaReS: Fair Resource Scheduling for VMM-Bypass InfiniBand Devices." CCGRID 2010.
[2] Guo, Chuanxiong, et al. "RDMA over Commodity Ethernet at Scale." SIGCOMM 2016.
[3] Zhu, Yibo, et al. "Congestion Control for Large-Scale RDMA Deployments." SIGCOMM 2015.
Scenario        Fair?
10B vs. 10B     ?
10B vs. 1MB     ?
1MB vs. 1MB     ?
1MB vs. 1GB     ?
Benchmark modified from the Mellanox Perftest tool [1].
[1] https://github.com/Infiniswap/frdma_benchmark
RDMA verbs
Transport type
INLINE messages
Request pipelining
Polling mechanism
Message acknowledgement
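The factors above define the parameter space the benchmark sweeps. As a rough sketch of how large that space is, it can be enumerated as below; the names and value sets are illustrative assumptions, not taken from the benchmark's code:

```python
from itertools import product

# Hypothetical enumeration of the design space listed above;
# the value sets are illustrative, not from the Perftest source.
verbs = ["SEND", "WRITE", "READ", "WRITE_WITH_IMM"]
transports = ["RC", "UC", "UD"]     # reliable/unreliable connection, unreliable datagram
inline = [True, False]              # payload inlined into the work request?
pipelining = [1, 16, 64]            # outstanding requests kept in flight
polling = ["busy", "event"]         # busy-polling vs. event-triggered completions
acked = [True, False]               # application-level message acknowledgement

configs = list(product(verbs, transports, inline, pipelining, polling, acked))
print(len(configs))  # 4 * 3 * 2 * 3 * 2 * 2 = 288 combinations
```

Note that not every combination is valid on real hardware (e.g., one-sided READ/WRITE require a reliable connection), so an actual sweep would filter this list.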
[Diagram: sender and receiver build a connection and register memory; completions are polled as work completions (WC) from the completion queue (CQ).]
[Figure: throughput (Gbps) and messages per second vs. message size (10 B to 1 GB) for WRITE, READ, and SEND. Small messages are labeled mice; large messages are labeled elephants.]
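Throughput and message rate in such a figure are tied by simple arithmetic: goodput equals message size times message rate. A quick sanity check, with illustrative (not measured) rates:

```python
def throughput_gbps(msg_bytes: int, msgs_per_sec: float) -> float:
    """Goodput in Gbps implied by a message size and message rate."""
    return msg_bytes * 8 * msgs_per_sec / 1e9

# Small messages are message-rate-bound: even a million 10-byte
# messages per second is far below a 56 Gbps line rate.
print(throughput_gbps(10, 1_000_000))      # 0.08 (Gbps)

# Large messages are bandwidth-bound: ~6,700 one-megabyte messages
# per second already approaches line rate.
print(throughput_gbps(1_000_000, 6_700))   # 53.6 (Gbps)
```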
Compare two throughput-sensitive flows by varying verb types, message sizes, and polling mechanisms.
[Matrix: pairwise fairness of 1MB and 1GB flows across SEND, WRITE, READ, and WRITE_WITH_IMM (WIMM) verbs; same-size pairs are fair, while 1MB vs. 1GB pairs are unfair.]
[Plot: throughput ratio (y-axis, 0.75 to 1.5) vs. message size ratio (x-axis, 1 to 1000) for base message sizes of 1MB, 2MB, 5MB, 10MB, and 100MB.]
Both flows use busy-polling.
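The fairness metric in these experiments is the ratio between the two flows' throughputs, with 1.0 meaning a perfectly even split. A minimal sketch, with illustrative example numbers:

```python
def throughput_ratio(a_gbps: float, b_gbps: float) -> float:
    """Larger flow's throughput divided by the smaller's; 1.0 is perfectly fair."""
    return max(a_gbps, b_gbps) / min(a_gbps, b_gbps)

# Illustrative (not measured) throughputs of two elephants sharing a link:
# a near-even split keeps the ratio close to 1.
print(round(throughput_ratio(27.1, 26.9), 2))  # 1.01
```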
[Plot: throughput ratio (0.75 to 1.5) vs. message size ratio (1 to 1000) for base message sizes of 1MB, 2MB, 5MB, 10MB, and 100MB.]
[Figure: CPU usage (%) vs. message size (10 B to 1 GB) for event-triggered vs. busy-polling completion handling.]
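The two completion-handling modes trade CPU for latency: busy-polling spins on the completion queue even when it is empty, while event-triggered polling blocks until a completion arrives. The generic producer/consumer sketch below mimics that trade-off with Python queues; it is an analogy for the behavior, not the ibverbs API:

```python
import queue
import threading

def busy_poll(q, out, n):
    """Spin on the queue: returns immediately when empty, burning CPU while idle."""
    got = 0
    while got < n:
        try:
            out.append(q.get_nowait())
            got += 1
        except queue.Empty:
            pass  # keep spinning -> high CPU usage, minimal wake-up latency

def event_triggered(q, out, n):
    """Block until an item arrives: near-zero CPU while idle, extra wake-up cost."""
    for _ in range(n):
        out.append(q.get())

q1, q2 = queue.Queue(), queue.Queue()
r1, r2 = [], []
t1 = threading.Thread(target=busy_poll, args=(q1, r1, 3))
t2 = threading.Thread(target=event_triggered, args=(q2, r2, 3))
t1.start(); t2.start()
for msg in ("a", "b", "c"):
    q1.put(msg); q2.put(msg)
t1.join(); t2.join()
print(r1, r2)  # both receive ['a', 'b', 'c']
```

Both consumers see the same completions; they differ only in how much CPU they burn waiting, which is why fairness between elephants can "depend on CPU."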
Scenario        Fair?
10B vs. 10B     ?
10B vs. 1MB     ?
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU
Compare two latency-sensitive flows with varying message sizes. Measure each flow's latency and message rate in the presence of a competing flow.
Scenario        Median (µs)  99.99th (µs)  Mmsgs/s
10B alone       1.3          5.4           0.78
10B vs. (10B)   1.3          5.8           0.76
10B vs. (100B)  1.4          7.0           0.72
10B vs. (1KB)   1.3          7.8           0.75
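The median and 99.99th-percentile numbers in such a table come from per-message latency samples. A minimal nearest-rank percentile sketch on synthetic data (not the measurements above):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering fraction p (p in [0, 100])."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]

# Synthetic per-message latencies in microseconds, with one straggler.
latencies_us = [1.2, 1.3, 1.3, 1.4, 1.3, 7.8, 1.3, 1.2, 1.4, 1.3]
print(percentile(latencies_us, 50))     # 1.3
print(percentile(latencies_us, 99.99))  # 7.8 -- a single straggler dominates the tail
```

This is why the tail columns move long before the median does: a handful of delayed messages is enough.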
Scenario        Fair?
10B vs. 10B     Good enough
10B vs. 1MB     ?
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU
Study performance isolation of a mouse flow running under a background elephant flow. The mouse flow uses event-triggered polling.
Scenario         Median (µs)  99.99th (µs)  Mmsgs/s
10B alone        1.3          5.4           0.79
10B vs. (1MB)    2.8          8.1           0.36
10B vs. (1GB)    2.9          9.2           0.34
100B alone       1.4          5.5           0.71
100B vs. (1MB)   2.6          8.5           0.39
100B vs. (1GB)   2.9          9.5           0.35
1KB alone        2.4          6.0           0.42
1KB vs. (1MB)    6.3          11.1          0.16
1KB vs. (1GB)    6.0          14.7          0.17
Scenario        Fair?
10B vs. 10B     Good enough
10B vs. 1MB     Unfair
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU
So far, we ran all experiments using a Mellanox FDR ConnectX-3 (56 Gbps) NIC on CloudLab. We now switch to a Mellanox EDR ConnectX-4 (100 Gbps) NIC on the Umich Conflux cluster.
The elephant-vs.-elephant unfairness persists on the 100 Gbps NIC, with a throughput ratio of 1.32.
Mouse flows still suffer under background elephant flows, where the median latency increases by up to 5×.
We are interested in how isolation is maintained in HERD in the presence of a background elephant flow. We run HERD on the Umich Conflux cluster.
HERD latency with and without a background flow:
[1] Kalia, Anuj, et al. "Using RDMA Efficiently for Key-Value Services." SIGCOMM 2014.
Scenario        Median (µs)  99.99th (µs)
GET alone       3.4          9.0
GET vs. (1MB)   9.5          15.9
GET vs. (1GB)   13.2         26.9
PUT alone       2.9          7.9
PUT vs. (1MB)   8.8          14.5
PUT vs. (1GB)   12.5         27.1
HERD also has isolation issues when running with big background flows. Currently, we are working on a solution to provide isolation in RDMA. Special thanks to Yue Tan for the great help in generating isolation data on HERD.
When the competing flows are both very small or both very big, the isolation appears to be good; otherwise, sharing causes performance degradation of the smaller flow.
[Figure: latency (µs) vs. message size (10 B to 1 GB), with a zoomed panel for 10 B to 10 KB.]
[Table: pairwise throughput ratios for 1MB and 1GB flows across SEND, WRITE, READ, and WRITE_WITH_IMM (WIMM) verb combinations; same-size pairs yield ratios near 1.0, while mixed 1GB-vs.-1MB pairs yield ratios of roughly 1.4 (equivalently, roughly 0.7 from the 1MB flow's side).]
[Plots: latency ratio and message-rate (MPS) ratio for mouse-flow pairs (10B vs. 10B, 100B vs. 100B, 1KB vs. 1KB, 10B vs. 100B, 10B vs. 1KB, 100B vs. 1KB).]
GET latency (µs):
Scenario            Median  99.99th
GET alone           3.4     9.0
GET vs. (1GB Req)   13.2    26.9
GET vs. (1GB Resp)  13.1    34.0
GET vs. (1MB Req)   9.5     15.9
GET vs. (1MB Resp)  8.5     17.8

PUT latency (µs):
Scenario            Median  99.99th
PUT alone           2.9     7.9
PUT vs. (1GB Req)   12.5    27.1
PUT vs. (1GB Resp)  12.4    29.0
PUT vs. (1MB Req)   8.8     14.5
PUT vs. (1MB Resp)  7.6     18.8
Elephant vs. Elephant: CPU usage (i.e., the polling mechanism) is the only thing that matters in a throughput-sensitive environment.
Mouse vs. Mouse: isolation is good enough.
Mouse vs. Elephant: the mouse's requests wait in the queue buffer while the RNIC is doing continuous DMA reads from the main memory due to the background flow.
HERD vs. Elephant: up to a 4× increase in the median latency.
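The mouse-vs.-elephant queueing effect above can be sketched as a toy model: a mouse request that lands behind an elephant's queued DMA chunks must wait for those chunks to drain at line rate. The chunk size, line rate, and queue depth below are assumptions for illustration, not measured values:

```python
def added_delay_us(queued_chunks: int,
                   chunk_bytes: int = 4096,       # assumed DMA-read chunk size
                   line_rate_gbps: float = 56) -> float:
    """Extra wait (µs) for a mouse request queued behind the elephant's chunks."""
    bits_ahead = queued_chunks * chunk_bytes * 8
    return bits_ahead / (line_rate_gbps * 1e9) * 1e6

# With nothing queued, the mouse sees no added delay.
print(added_delay_us(0))            # 0.0
# A few queued 4 KB chunks already add microseconds of delay -- on the
# order of the mouse's entire standalone median latency.
print(round(added_delay_us(3), 2))  # 1.76
```

Under these assumptions, a handful of in-flight elephant chunks is enough to roughly double a ~1.3 µs mouse latency, which is the flavor of degradation the measurements show.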