SLIDE 1

Performance Isolation Anomalies in RDMA

Yiwen Zhang

with Juncheng Gu, Youngmoon Lee, Mosharaf Chowdhury, and Kang G. Shin

SLIDE 2

RDMA Is Being Deployed in Datacenters

Cloud operators are aggressively deploying RDMA in datacenters [1][2][3]

[1] Guo, Chuanxiong, et al. “RDMA over Commodity Ethernet at Scale.” SIGCOMM’16
[2] Mittal, Radhika, et al. “TIMELY: RTT-based Congestion Control for the Datacenter.” SIGCOMM’15
[3] Zhu, Yibo, et al. “Congestion Control for Large-Scale RDMA Deployments.” SIGCOMM’15

SLIDE 3

RDMA Is Being Deployed in Datacenters

Cloud operators are aggressively deploying RDMA in datacenters [1][2][3]

Growing demand for ultra-low-latency applications

  • Key-value store & remote paging

High bandwidth applications

  • Cloud storage & memory-intensive workloads

[1] Guo, Chuanxiong, et al. “RDMA over Commodity Ethernet at Scale.” SIGCOMM’16
[2] Mittal, Radhika, et al. “TIMELY: RTT-based Congestion Control for the Datacenter.” SIGCOMM’15
[3] Zhu, Yibo, et al. “Congestion Control for Large-Scale RDMA Deployments.” SIGCOMM’15

SLIDE 4

RDMA Is Being Deployed in Datacenters

Cloud operators are aggressively deploying RDMA in datacenters

RDMA provides both low latency and high bandwidth

  • Order-of-magnitude improvements in latency and throughput
  • With minimal CPU overhead!
SLIDE 5

Great! But There Are Limits …

In large-scale deployments, RDMA-enabled applications are unlikely to run in a vacuum; the network must be shared

SLIDE 6

Great! But There Are Limits …

In large-scale deployments, RDMA-enabled applications are unlikely to run in a vacuum; the network must be shared

The HPC community uses static partitioning to minimize sharing [1]

Research on RDMA over Ethernet-based datacenters focuses on the vagaries of Priority-based Flow Control (PFC) [2][3]

[1] Ranadive, Adit, et al. “FaReS: Fair Resource Scheduling for VMM-Bypass InfiniBand Devices.” CCGRID’10
[2] Guo, Chuanxiong, et al. “RDMA over Commodity Ethernet at Scale.” SIGCOMM’16
[3] Zhu, Yibo, et al. “Congestion Control for Large-Scale RDMA Deployments.” SIGCOMM’15

SLIDE 7

What Happens When Multiple RDMA-Enabled Applications Share the Network?

SLIDE 8

At A First Glance…

Scenarios       Fair?
10B vs. 10B

SLIDE 9

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB

SLIDE 10

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB

SLIDE 11

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB
1MB vs. 1GB

SLIDE 12

Benchmarking Tool [1]

Modified from the Mellanox perftest tool

  • Creates 2 flows to simultaneously transfer a stream of messages
  • Single queue pair for each flow
  • Measures bandwidth and latency characteristics only when both flows are active

[1] https://github.com/Infiniswap/frdma_benchmark

SLIDE 13

Benchmarking Tool [1]

Modified from the Mellanox perftest tool

  • Creates 2 flows to simultaneously transfer a stream of messages
  • Single queue pair for each flow
  • Measures bandwidth and latency characteristics only when both flows are active
  • Both flows share the same link

[1] https://github.com/Infiniswap/frdma_benchmark

SLIDE 14

RDMA Design Parameters

RDMA Verbs

  • WRITE, READ, WRITE WITH IMM (WIMM), and SEND/RECEIVE

Transport Type

  • All experiments use Reliable Connected (RC) queue pairs

INLINE Message

  • Enabled INLINE messages for the 10-byte and 100-byte messages in the experiments (see the sketch below)
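
The slides do not include posting code; below is a minimal libibverbs sketch of how a small message can be posted as an INLINE RDMA WRITE over an RC queue pair. The qp, buf, remote_addr, and rkey parameters are assumed to come from the usual connection setup, and the payload must fit within the QP’s max_inline_data; neither detail is spelled out in the deck.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post a small RDMA WRITE with the payload inlined into the work request.
     * With IBV_SEND_INLINE the RNIC copies the data out of the WQE itself,
     * skipping the DMA read of the source buffer; lkey is ignored. */
    static int post_inline_write(struct ibv_qp *qp, const void *buf,
                                 uint32_t len, uint64_t remote_addr,
                                 uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,          /* must be <= the QP's max_inline_data */
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }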
SLIDE 15

Application-Level Parameters

Request Pipelining

  • Provides better performance, but hard to configure for a fair comparison
  • Disabled by default

Polling Mechanism

  • Busy polling vs. event-triggered polling (see the sketch below)
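
To make the two polling modes concrete, here is a hedged libibverbs sketch of event-triggered completion handling; it assumes the CQ was created with a completion channel (ch), which the deck does not show. Busy polling instead spins on ibv_poll_cq without ever blocking, trading a full core for lower latency (see the CPU-usage tradeoff on Slide 23).

    #include <infiniband/verbs.h>

    /* Event-triggered polling: block in the kernel until one work
     * completion arrives, then drain it from the CQ. */
    static int wait_one_completion(struct ibv_comp_channel *ch,
                                   struct ibv_cq *cq, struct ibv_wc *wc)
    {
        if (ibv_req_notify_cq(cq, 0))       /* arm the CQ for the next WC */
            return -1;

        struct ibv_cq *ev_cq;
        void *ev_ctx;
        if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))  /* sleep until notified */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);

        while (ibv_poll_cq(cq, 1, wc) == 0) /* drain the completion */
            ;
        return 0;
    }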
SLIDE 16

Application-Level Parameters

Message Acknowledgement

  • The next work request is not posted until the work completion (WC) of the previous one is polled from the CQ (sketched below)
  • No other flow-control acknowledgment is used

[Diagram: sender and receiver build a connection, register memory, and poll WCs from the CQ]
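
A minimal sketch of this self-clocked discipline, assuming busy polling and a hypothetical post_write_msg() helper (any signaled verb post would do): exactly one work request is outstanding, and the next one is posted only after the previous completion is polled.

    #include <infiniband/verbs.h>

    extern int post_write_msg(struct ibv_qp *qp, long seq);  /* hypothetical */

    /* Stream nmsgs messages with at most one outstanding work request. */
    static void send_stream(struct ibv_qp *qp, struct ibv_cq *cq, long nmsgs)
    {
        struct ibv_wc wc;
        for (long i = 0; i < nmsgs; i++) {
            if (post_write_msg(qp, i))        /* post a signaled WRITE */
                break;
            while (ibv_poll_cq(cq, 1, &wc) == 0)
                ;                             /* busy-poll for its WC */
            if (wc.status != IBV_WC_SUCCESS)
                break;                        /* no other flow-control ack */
        }
    }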

SLIDE 17

Define an Elephant and a Mouse

[Plot: throughput (Gbps) and messages per second vs. message size (10B to 1G) for WRITE, READ, and SEND; small messages are labeled “Mouse,” large messages “Elephant”]

SLIDE 18

Elephant vs. Elephant

Compare two throughput-sensitive flows by varying verb types, message sizes, and polling mechanisms.

  • WRITE, READ, WIMM, & SEND verbs transferring 1MB & 1GB messages
  • Total amount of data transferred fixed at 1TB
  • Both flows using event-triggered polling
  • Generated a bandwidth ratio matrix
SLIDE 19

Elephant vs. Elephant: Larger Flows Win

[Matrix: pairwise bandwidth ratios for WRITE, READ, WIMM, and SEND verbs at 1MB and 1GB, cells marked Fair or Unfair; equal-size pairings are fair, 1GB vs. 1MB pairings are unfair. Full numbers on Slide 46.]

SLIDE 20

Getting Better with Larger Base Flows

[Plot: throughput ratio (0.75 to 1.5) vs. message size ratio (1 to 1000)]

SLIDE 21

Getting Better with Larger Base Flows

[Plot: throughput ratio (0.75 to 1.5) vs. message size ratio (1 to 1000) for base flows of 1MB, 2MB, 5MB, 10MB, and 100MB]

SLIDE 22

Polling Matters: Is Busy-polling Better?

Both flows use busy-polling.

[Plot: throughput ratio (0.75 to 1.5) vs. message size ratio (1 to 1000) for base flows of 1MB, 2MB, 5MB, 10MB, and 100MB, both flows busy-polling]

SLIDE 23

But There Is a Tradeoff in CPU Usage

[Plot: CPU usage (%) vs. message size (10B to 1G) for event-triggered vs. busy polling]

SLIDE 24

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB
1MB vs. 1GB

SLIDE 25

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB     Fair
1MB vs. 1GB

SLIDE 26

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB     Fair
1MB vs. 1GB     Unfair

SLIDE 27

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Unfair

SLIDE 28

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU

SLIDE 29

Mouse vs. Mouse: Pick a Base Flow

Compare two latency-sensitive flows with varying message sizes.

  • All flows using the WRITE verb with busy polling
  • 10B, 100B, and 1KB messages
  • Pick 10B as the base flow
  • Measured latency and MPS of the base flow transferring 10 million messages in the presence of a competing flow (see the sketch below)
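
The deck does not show the measurement loop itself; a plausible sketch, assuming CLOCK_MONOTONIC timestamps around each post/poll pair and the hypothetical post_write_msg() helper from before (MPS would then be messages divided by total elapsed time while both flows are active):

    #include <infiniband/verbs.h>
    #include <time.h>

    extern int post_write_msg(struct ibv_qp *qp, long seq);  /* hypothetical */

    /* Latency of one message: timestamp before posting the WRITE,
     * timestamp after its completion is busy-polled; returns microseconds. */
    static double measure_one(struct ibv_qp *qp, struct ibv_cq *cq, long seq)
    {
        struct timespec t0, t1;
        struct ibv_wc wc;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        post_write_msg(qp, seq);
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) * 1e6 +
               (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }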

SLIDE 30

Mouse vs. Mouse: Worse Tails

Scenario         Median (µs)   99.99th (µs)   Million msgs/sec
10B Alone        1.3           5.4            0.78
10B vs. (10B)    1.3           5.8            0.76
10B vs. (100B)   1.4           7.0            0.72
10B vs. (1KB)    1.3           7.8            0.75

SLIDE 31

At A First Glance…

Scenarios       Fair?
10B vs. 10B
10B vs. 1MB
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU

SLIDE 32

At A First Glance…

Scenarios       Fair?
10B vs. 10B     Good enough
10B vs. 1MB
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU

SLIDE 33

Mouse vs. Elephant

Study performance isolation of a mouse flow running under a background elephant flow.

  • All flows using the WRITE verb
  • All mouse flows sending 10 million messages
  • Mouse flows using busy polling while background elephant flows use event-triggered polling
  • Measured latency and MPS of the mouse flows
SLIDE 34

Mouse vs. Elephant: Mouse Flows Suffer

Scenario          Median (µs)   99.99th (µs)
10B Alone         1.3           5.4
10B vs. (1MB)     2.8           8.1
10B vs. (1GB)     2.9           9.2
100B Alone        1.4           5.5
100B vs. (1MB)    2.6           8.5
100B vs. (1GB)    2.9           9.5
1KB Alone         2.4           6.0
1KB vs. (1MB)     6.3           11.1
1KB vs. (1GB)     6.0           14.7

SLIDE 35

Mouse vs. Elephant: Mouse Flows Suffer

Scenario          Million msgs/sec
10B Alone         0.79
10B vs. (1MB)     0.36
10B vs. (1GB)     0.34
100B Alone        0.71
100B vs. (1MB)    0.39
100B vs. (1GB)    0.35
1KB Alone         0.42
1KB vs. (1MB)     0.16
1KB vs. (1GB)     0.17

SLIDE 36

At A First Glance…

Scenarios       Fair?
10B vs. 10B     Good enough
10B vs. 1MB
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU

SLIDE 37

At A First Glance…

Scenarios       Fair?
10B vs. 10B     Good enough
10B vs. 1MB     Unfair
1MB vs. 1MB     Depends on CPU
1MB vs. 1GB     Depends on CPU

SLIDE 38

Hardware is Not Enough for Isolation

So far, all experiments used Mellanox FDR ConnectX-3 (56 Gbps) NICs on CloudLab. We then switched to Mellanox EDR ConnectX-4 (100 Gbps) NICs on the Umich Conflux cluster.

  • The isolation problem in the elephant vs. elephant case still exists, with a throughput ratio of 1.32
  • In the mouse vs. mouse case the problem appears to be mitigated; we did not observe large tail-latency variations when two mouse flows compete
  • In the mouse vs. elephant scenario, mouse flows are still affected by large background flows, with median latency increasing by up to 5×

SLIDE 39

What Happens to Isolation in More Sophisticated and Optimized Applications?

SLIDE 40

Performance Isolation in HERD[1]

We study how isolation is maintained in HERD in the presence of a background elephant flow, running HERD on the Umich Conflux cluster.

  • 5 million PUT/GET requests
  • Background flows using 1MB or 1GB messages with event-triggered polling
  • Measured median and tail latency of HERD requests with and without a background flow

[1] Kalia, Anuj, et al. “Using RDMA Efficiently for Key-Value Services.” SIGCOMM’14

slide-41
SLIDE 41

HERD vs. Elephant: HERD Also Suffers

Scenario         Median (µs)   99.99th (µs)
GET Alone        3.4           9.0
GET vs. (1MB)    9.5           15.9
GET vs. (1GB)    13.2          26.9
PUT Alone        2.9           7.9
PUT vs. (1MB)    8.8           14.5
PUT vs. (1GB)    12.5          27.1

SLIDE 42

HERD vs. Elephant: Summary

HERD also has isolation issues when running with big background flows.

Currently, we are working on a solution to provide isolation in RDMA.

Special thanks to Yue Tan for great help in generating the HERD isolation data.

SLIDE 43

Summary

  • When the size difference between two flows is small, whether they are small flows or very big flows, isolation appears to be good
  • How fast an application can post RDMA requests onto the RNIC is the only thing that matters in a throughput-sensitive environment
  • When the size difference between two flows is big, the smaller flow suffers performance degradation
  • Current hardware might not entirely resolve the issue
SLIDE 44

SLIDE 45

Mouse Flow Latency

[Plots: latency (µs) vs. message size; full range 10B to 1G and a zoomed view from 10B to 10KB]

SLIDE 46

Elephant vs. Elephant: Matrix

Bandwidth ratio of the row flow to the column flow. Rows and columns cycle through the four verbs (WRITE, READ, WIMM, SEND), rows alternating 1GB/1MB and columns alternating 1MB/1GB:

        1MB    1GB    1MB    1GB    1MB    1GB    1MB    1GB
1GB     1.41   1.00   1.44   1.00   1.39   1.00   1.40   1.00
1MB     1.02   0.71   1.00   0.72   0.99   0.71   1.00
1GB     1.40   1.00   1.43   1.00   1.37   1.00
1MB     1.08   0.71   1.04   0.71   1.00
1GB     1.40   1.00   1.44   1.00
1MB     1.00   0.70   1.00
1GB     1.41   1.00
1MB     1.00

Every 1GB vs. 1MB pairing gives ≈1.4 and every equal-size pairing gives ≈1.0, regardless of verb.

SLIDE 47

Mouse vs. Mouse: Unpredicted Behavior

[Plots: latency ratio and MPS ratio (0.5 to 2) for 10B vs. 10B, 100B vs. 100B, 1KB vs. 1KB, 10B vs. 100B, 10B vs. 1KB, and 100B vs. 1KB]

SLIDE 48

HERD vs. Elephant: HERD Also Suffers

Scenario              Median (µs)   99.99th (µs)
GET Alone             3.4           9.0
GET vs. (1GB Req)     13.2          26.9
GET vs. (1GB Resp)    13.1          34.0
GET vs. (1MB Req)     9.5           15.9
GET vs. (1MB Resp)    8.5           17.8

SLIDE 49

HERD vs. Elephant: HERD Also Suffers

Scenario              Median (µs)   99.99th (µs)
PUT Alone             2.9           7.9
PUT vs. (1GB Req)     12.5          27.1
PUT vs. (1GB Resp)    12.4          29.0
PUT vs. (1MB Req)     8.8           14.5
PUT vs. (1MB Resp)    7.6           18.8

SLIDE 50

Summary

Elephant vs. Elephant:

  • Polling mechanism dictates bandwidth allocation
  • How fast an application can post RDMA requests onto the RNIC is the only thing that matters in a throughput-sensitive environment
  • Tradeoff between CPU and bandwidth

Mouse vs. Mouse:

  • Little predictability between flows using equal-sized messages
  • Increase in tail latency and decrease in MPS
  • Isolation issue mitigated when switching to better hardware
SLIDE 51

Summary

Mouse vs. Elephant:

  • In the presence of both types of flows, latency-sensitive flows suffer
  • Requests posted by mouse flows may queue up in the RNIC’s queue buffer while the RNIC performs continuous DMA reads from main memory for the background flow

HERD vs. Elephant:

  • Isolation issues remain when running with background elephant flows
  • Up to 4× increase in the median latency