SLIDE 1

Using RDMA Efficiently for Key-Value Services

Anuj Kalia (CMU) Michael Kaminsky (Intel Labs), David Andersen (CMU)

SLIDE 2

RDMA

Remote Direct Memory Access: A network feature that allows direct access to the memory of a remote computer.

SLIDE 3

HERD

  • 1. Improved understanding of RDMA through micro-benchmarking
  • 2. High-performance key-value system:
    • Throughput: 26 Mops (2X higher than others)
    • Latency: 5 µs (2X lower than others)

SLIDE 4

RDMA intro

Features:

  • Ultra-low latency: 1 µs RTT
  • Zero copy + CPU bypass

Providers:

  • InfiniBand, RoCE, …

[Figure: user buffer, DMA buffer, and NIC on machines A and B]

SLIDE 5

RDMA in the datacenter

48-port 10 GbE switches:

Switch           RDMA  Cost
Mellanox SX1012  YES   $5,900
Cisco 5548UP     NO    $8,180
Juniper EX5440   NO    $7,480

SLIDE 6

In-memory KV stores

[Figure: web servers querying memcached servers, backed by a database]

Interface: GET, PUT

  • Requirements:
    • Low latency
    • High request rate

SLIDE 7

RDMA basics

Verbs exposed by the RNIC:

RDMA read:

 READ(local_buf, size, remote_addr)

RDMA write:

 WRITE(local_buf, size, remote_addr)

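For concreteness, here is a minimal sketch in C of how such a verb is posted with libibverbs; it is not from the talk, and it assumes an already-connected queue pair, a local buffer registered with ibv_reg_mr() (giving lkey), and a remote address/rkey exchanged out of band. All names are illustrative.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA READ or WRITE on a connected queue pair.
 * opcode is IBV_WR_RDMA_READ or IBV_WR_RDMA_WRITE. */
static int post_rdma(struct ibv_qp *qp, enum ibv_wr_opcode opcode,
                     void *local_buf, uint32_t lkey, uint32_t size,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,
        .length = size,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = opcode;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;      /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;   /* where to read/write at the remote side */
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr); /* returns 0 on success */
}

The completion is later reaped from the completion queue with ibv_poll_cq(); the remote CPU never sees the operation.
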
SLIDE 8

Life of a WRITE

[Figure: timeline between requester (CPU/RAM, RNIC) and responder (RNIC, CPU/RAM), steps 1-6]

1: Request descriptor, PIO
2: Payload, DMA read
3: RDMA write request
4: Payload, DMA write
5: RDMA ACK
6: Completion, DMA write

SLIDE 9

Recent systems

  • Pilaf [ATC 2013]
  • FaRM-KV [NSDI 2014]: an example usage of FaRM

  • Approach: RDMA reads to access remote data structures

Reason: the allure of CPU bypass

SLIDE 10

The price of CPU bypass

Key-value stores have an inherent level of indirection: an index maps a key to an address, and values are stored separately.

[Figure: server's DRAM, holding the index and the values]

At least 2 RDMA reads are required: ≥ 1 to fetch the address, 1 to fetch the value.
(Not true if the value is stored in the index.)

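To make the indirection concrete, here is a hypothetical index-entry layout in C (not Pilaf's or FaRM's actual structures): the entry stores only the value's address, so a client doing a GET purely with RDMA reads must first read the entry and then read the value.

#include <stdint.h>

/* Hypothetical layout: the index entry maps a key to the address of the
 * value, which lives elsewhere in the server's DRAM. */
struct index_entry {
    uint64_t key_hash;    /* identifies the key */
    uint64_t value_addr;  /* remote address of the value */
    uint32_t value_len;   /* length of the value in bytes */
};

/* READ-based GET:
 *   READ #1: fetch the index_entry               -> learn value_addr, value_len
 *   READ #2: fetch value_len bytes at value_addr -> the value itself
 * Only if the value is stored inside the entry can the second read be avoided. */
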
SLIDE 11

The price of CPU bypass

SLIDE 12

The price of CPU bypass

READ #1 (fetch pointer): client → server

SLIDE 13

The price of CPU bypass

READ #2 (fetch value): client → server

SLIDE 14

Our approach

Goal                          Main ideas
#1: Use a single round trip   Request-reply with server CPU involvement + WRITEs faster than READs
#2: Increase throughput       Low-level verbs optimizations
#3: Improve scalability       Use datagram transport

SLIDE 15

#1: Use a single round trip

[Figure: client sends WRITE #1 (request); the server performs DRAM accesses and sends WRITE #2 (reply)]

SLIDE 16

#1: Use a single round trip

Operation        Round trips   Operations at server's RNIC
READ-based GET   2+            2+ RDMA reads
HERD GET         1             2 RDMA writes

→ Lower latency, high throughput

SLIDE 17

RDMA WRITEs faster than READs

Setup: Apt cluster, 192 nodes, 56 Gbps InfiniBand; one server (S), several clients (C)

[Figure: throughput (Mops) vs. payload size (4-256 bytes) for READ and WRITE]

SLIDE 18

RDMA WRITEs faster than READs

[Figure: server-side view; an RDMA WRITE causes a PCIe DMA write at the server's RNIC (RDMA write request, then RDMA ACK), while an RDMA READ causes a PCIe DMA read (RDMA read request, then RDMA read response)]

Reason: PCIe writes are faster than PCIe reads.

SLIDE 19

High-speed request-reply

Setup: one-to-one client-server communication (S, C1…C8), 32-byte payloads

[Figure: request-reply throughput (Mops) vs. READ throughput; a request-reply costs 2 WRITEs, compared with 1 READ and 2 READs]

SLIDE 20

#2: Increase throughput

Simple request-reply:

[Figure: client (CPU/RAM, RNIC) and server (RNIC, CPU/RAM); WRITE #1 carries the request, the server processes it, WRITE #2 carries the response]

SLIDE 21

Optimize WRITEs

[Figure: requester/responder timeline with steps 1-6 from "Life of a WRITE"]

  • +inlining: encapsulate the payload in the request descriptor (merges step 2 into step 1)
  • +unreliable: use unreliable transport (removes step 5, the ACK)
  • +unsignaled: don't ask for request completions (removes step 6)

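In libibverbs terms, these three optimizations are roughly the following flags and queue-pair choices. This is an illustrative sketch, not HERD's code; it assumes the payload fits within the QP's max_inline_data and that the QP was created with qp_type = IBV_QPT_UC (unreliable connected).

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a small request as an inlined, unsignaled RDMA WRITE over UC. */
static int post_optimized_write(struct ibv_qp *uc_qp, void *req, uint32_t len,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t) req, .length = len, .lkey = 0 };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode  = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    /* +inlining: the payload is copied into the descriptor, so the RNIC does
     * not DMA-read the source buffer (lkey is ignored for inline data).
     * +unsignaled: IBV_SEND_SIGNALED is omitted, so no completion (step 6). */
    wr.send_flags = IBV_SEND_INLINE;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    /* +unreliable: uc_qp was created with qp_type = IBV_QPT_UC,
     * so the responder sends no ACK (step 5). */
    return ibv_post_send(uc_qp, &wr, &bad_wr);
}

In practice a WRITE still has to be signaled every so often so that send-queue slots can be reclaimed by polling the completion queue.
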
SLIDE 22

#2: Increase throughput

Optimized request-reply:

[Figure: the same client/server diagram, now using inlined, unsignaled WRITEs over unreliable transport]

SLIDE 23

#2: Increase throughput

Setup: one-to-one client-server communication (S, C1…C8)

[Figure: request-reply throughput (Mops) as optimizations are added (basic, +unreliable, +unsignaled, +inlined), compared with READ]

SLIDE 24

#3: Improve scalability

Setup: one server (S), clients C1…CN

[Figure: request-reply throughput (Mops) vs. number of client/server processes (1-16)]

SLIDE 25

#3: Improve scalability

[Figure: SRAM holding per-client connection state (State 1 … State N) for clients C1 … CN]

‖state‖ > SRAM: the connection state for all clients does not fit in SRAM.

SLIDE 26

#3: Improve scalability

[Figure: SRAM holding connection state for clients C1, C2, C3]

Inbound scalability ≫ outbound, because the state needed for inbound operations is much smaller than for outbound operations.

  • Use datagram transport for outbound replies
  • Datagram only supports SEND/RECV, and SEND/RECV is slow
  • But SEND/RECV is slow only at the receiver, and for replies the server is the sender

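A sketch of the reply path in C (illustrative, not HERD's actual code): the server posts the reply as a SEND on an unreliable datagram (UD) queue pair and addresses each client through an address handle, so a single QP can reply to any number of clients.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Send a reply over a UD (datagram) QP. 'ah' is an address handle for the
 * client (from ibv_create_ah()); remote_qpn/remote_qkey identify the
 * client's UD QP. */
static int send_reply_ud(struct ibv_qp *ud_qp, struct ibv_ah *ah,
                         uint32_t remote_qpn, uint32_t remote_qkey,
                         void *reply, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t) reply, .length = len, .lkey = lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = ah;
    wr.wr.ud.remote_qpn  = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;
    return ibv_post_send(ud_qp, &wr, &bad_wr);
}

The client pre-posts receive buffers with ibv_post_recv() to absorb these SENDs, so the receiver-side cost of SEND/RECV falls on the clients rather than the server.
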
SLIDE 27

Scalable request-reply

Setup: one server (S), clients C1…CN

[Figure: request-reply throughput (Mops) vs. number of client/server processes (1-16); Naive (connected RDMA writes in both directions) vs. Hybrid (connected RDMA write for requests, datagram SEND for replies)]

SLIDE 28

Evaluation

HERD = Request-Reply + MICA [NSDI 2014]

  • Compare against emulated versions of Pilaf and FaRM-KV
  • No datastore
  • Focus on maximum performance achievable

SLIDE 29

Latency vs throughput

48-byte items, GET-intensive workload

[Figure: latency (µs, 5th and 95th percentiles) vs. throughput (Mops) for HERD; 3.4 µs at low load, 5 µs at 26 Mops]

SLIDE 30

Latency vs throughput

48-byte items, GET-intensive workload

[Figure: latency (µs, 5th and 95th percentiles) vs. throughput (Mops) for Emulated Pilaf, Emulated FaRM-KV, and HERD; annotations: 26 Mops at 5 µs, 12 Mops at 8 µs, 3.4 µs at low load]

SLIDE 31

Throughput comparison

16-byte keys, 95% GET workload

[Figure: throughput (Mops) vs. value size (4-1024 bytes) for Emulated Pilaf, Emulated FaRM-KV, and HERD; HERD is 2X higher]

SLIDE 32

HERD

  • Re-designing RDMA-based KV stores to use a single round trip
  • WRITEs outperform READs
  • Reduce PCIe and InfiniBand transactions
  • Embrace SEND/RECV
  • Code is online: https://github.com/efficient/HERD

SLIDE 33

Throughput comparison

16-byte keys, 95% GET workload

Faster than RDMA reads

[Figure: throughput (Mops) vs. value size (4-1024 bytes) for Emulated Pilaf, Emulated FaRM-KV, HERD, and READ]

SLIDE 34

Throughput comparison

48-byte items

[Figure: throughput (Mops) of Emulated Pilaf, Emulated FaRM-KV, and HERD at 5% PUT, 50% PUT, and 100% PUT]