Using RDMA Efficiently for Key-Value Services
Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)
RDMA
Remote Direct Memory Access: a network feature that allows direct access to the memory of a remote computer.
[Diagram: machine B directly accessing machine A's memory over the network]

Features:
[Diagram: data path from the user buffer through the DMA buffer to the NIC, on each host]

Providers:
48-port 10 GbE switches:

Switch            RDMA   Cost
Mellanox SX1012   YES    $5,900
Cisco 5548UP      NO     $8,180
Juniper EX5440    NO     $7,480
[Diagram: web servers querying memcached servers in front of a database]

Interface: GET, PUT
RDMA read:  READ(local_buf, size, remote_addr)
RDMA write: WRITE(local_buf, size, remote_addr)
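The one-sided semantics of these two verbs can be sketched as follows. (A toy model, assuming plain byte arrays stand in for registered memory regions; real code would go through a verbs library such as libibverbs.)

```python
# Toy model of one-sided RDMA verbs: plain byte arrays stand in for
# registered memory regions, just to illustrate the semantics.
remote_mem = bytearray(1024)  # stands in for the server's registered memory

def rdma_read(local_buf, size, remote_addr):
    # READ: pull `size` bytes from remote memory into local_buf,
    # without involving the remote CPU.
    local_buf[:size] = remote_mem[remote_addr:remote_addr + size]

def rdma_write(local_buf, size, remote_addr):
    # WRITE: push `size` bytes from local_buf into remote memory.
    remote_mem[remote_addr:remote_addr + size] = local_buf[:size]

msg = bytearray(b"hello")
rdma_write(msg, 5, 100)   # place "hello" at remote offset 100
out = bytearray(5)
rdma_read(out, 5, 100)    # fetch it back
```

The key point the model preserves: neither call runs any code on the remote CPU.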
Verbs

[Diagram: requester and responder, each with CPU/RAM and an RNIC; the six steps of an RDMA write]
1: Request descriptor, PIO
2: Payload, DMA read
3: RDMA write request
4: Payload, DMA write
5: RDMA ACK
6: Completion, DMA write
Pilaf [ATC 2013], FaRM-KV [NSDI 2014] (an example use of FaRM)
Reason: the allure of CPU bypass
Key-value stores have an inherent level of indirection: an index maps keys to addresses, and values are stored separately.

[Diagram: server's DRAM holding the index and, separately, the values]
At least 2 RDMA reads required:
  ≥ 1 to fetch the address
  1 to fetch the value
(Not true if the value is stored in the index itself.)
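The access pattern above can be sketched as a toy model (an assumption for illustration, not Pilaf's or FaRM-KV's actual memory layout): each GET must first READ the index to learn the value's address, then READ the value.

```python
# Toy sketch of the indirection in a READ-based GET.
index = {"k": 0}    # index: key -> address (offset) of the value
values = [b"v0"]    # values stored separately from the index
reads = 0           # count of simulated RDMA reads

def remote_read(region, addr):
    global reads
    reads += 1      # each call models one RDMA read
    return region[addr]

def read_based_get(key):
    addr = remote_read(index, key)    # READ #1: fetch the address
    return remote_read(values, addr)  # READ #2: fetch the value

val = read_based_get("k")
```

The counter makes the slide's claim concrete: a GET costs at least two round trips when the index and values live in separate regions.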
[Diagram: READ #1 fetches the pointer from the server's index; READ #2 fetches the value]
Goal                         Main ideas
#1: Use a single round trip  Request-reply with server CPU involvement; WRITEs faster than READs
#2: Increase throughput      Low-level verbs optimizations
#3: Improve scalability      Use datagram transport
[Diagram: client WRITEs the request to the server; the server performs DRAM accesses and WRITEs the reply]
Operation       Round trips   Operations at server's RNIC
READ-based GET  2+            2+ RDMA reads
HERD GET        1             2 RDMA writes

Lower latency, higher throughput.
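The HERD GET row can be sketched end to end. (A toy model under stated assumptions: real HERD clients WRITE requests into server memory over RDMA and the server CPU polls for them; plain dicts stand in for those memory regions here.)

```python
# Toy sketch of HERD's request-reply GET: one round trip, two RDMA writes.
request_region = {}      # per-client request slots in server memory
reply_region = {}        # reply slots in each client's memory
store = {"foo": b"bar"}  # the server's local key-value store

def server_poll():
    # The server CPU detects new requests, does the DRAM accesses
    # locally, and WRITEs the reply back into the client's memory.
    for cid, key in list(request_region.items()):
        reply_region[cid] = store.get(key)  # WRITE #2: reply
        del request_region[cid]

def herd_get(cid, key):
    request_region[cid] = key     # WRITE #1: request lands at the server
    server_poll()                 # (in reality the server polls continuously)
    return reply_region.pop(cid)  # reply has landed in client memory

result = herd_get(1, "foo")
```

The design trade the table captures: the server CPU is involved (unlike a pure READ-based GET), but only one round trip crosses the network.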
Setup: Apt cluster
[Diagram: one server, multiple clients]
[Plot: throughput (Mops) vs. payload size (bytes) for READ and WRITE]
[Diagram: at the server's RNIC, an RDMA WRITE is an RDMA write request, a PCIe DMA write, and an RDMA ACK; an RDMA READ is an RDMA read request, a PCIe DMA read, and an RDMA read response]

Reason: PCIe writes are faster than PCIe reads.
Request-reply throughput:
Setup: one-to-one client-server communication, 32-byte payloads
[Plot: throughput (Mops) of request-reply (2 WRITEs) vs. READ-based designs (1 READ, 2 READs)]
Simple request-reply:
[Diagram: client WRITEs the request; server processes it and WRITEs the response]
[Diagram: the six steps of an RDMA write between requester and responder]
+inlining: encapsulate the payload in the request descriptor (2 → 1)
+unreliable: use unreliable transport (removes 5, the ACK)
+unsignaled: don't request completions (removes 6)
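The effect of the three optimizations can be tallied against the six per-WRITE steps named on the earlier verbs slide. (A toy accounting for illustration, not a performance model.)

```python
# The six per-WRITE steps from the verbs slide, and which optimization
# removes which step.
BASELINE = {
    1: "request descriptor, PIO",
    2: "payload, DMA read",
    3: "RDMA write request",
    4: "payload, DMA write",
    5: "RDMA ACK",
    6: "completion, DMA write",
}

def steps(inlined=False, unreliable=False, unsignaled=False):
    remaining = dict(BASELINE)
    if inlined:     # payload rides inside the descriptor (2 -> 1)
        remaining.pop(2)
    if unreliable:  # unreliable transport needs no ACK (removes 5)
        remaining.pop(5)
    if unsignaled:  # no completion is requested (removes 6)
        remaining.pop(6)
    return remaining

optimized = steps(inlined=True, unreliable=True, unsignaled=True)
```

With all three applied, only the descriptor PIO, the network request, and the remote DMA write remain.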
Optimized request-reply:
[Diagram: client WRITEs the request; server processes it and WRITEs the response, with inlined, unreliable, unsignaled WRITEs]
Setup: one-to-one client-server communication
[Plot: request-reply throughput (Mops) vs. READ, for the basic, +unreliable, +unsignaled, +inlined variants]
Setup: one server, N clients
[Plot: request-reply throughput (Mops) vs. number of client/server processes (1-16)]
[Diagram: per-client connection state in the RNIC's SRAM; with many clients, total state exceeds SRAM capacity]

Inbound scalability ≫ outbound, because inbound state ≪ outbound state.
Use datagram transport for outbound replies.
Datagram only supports SEND/RECV, and SEND/RECV is slow; but it is slow only at the receiver.
Setup: one server, N clients; requests use connected RDMA WRITEs, replies use datagram SENDs
[Plot: throughput (Mops) vs. number of client/server processes (1-16) for naive and hybrid request-reply]
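The hybrid's effect on server-side state can be sketched with a toy tally. (Assumption for illustration: one connected queue pair per client, while a single datagram queue pair can send to any client; real RNIC SRAM usage is more involved.)

```python
# Toy accounting for server-side connection state under the hybrid
# transport: connected QPs for inbound requests, datagram for replies.
def server_connection_state(num_clients, hybrid):
    inbound = num_clients  # clients' WRITEs arrive on connected QPs,
                           # whose inbound state is small (scales well)
    outbound = 1 if hybrid else num_clients  # datagram SENDs share one QP
    return inbound + outbound

naive_state = server_connection_state(100, hybrid=False)
hybrid_state = server_connection_state(100, hybrid=True)
```

The tally mirrors the plot: the naive design's outbound state grows with the client count, while the hybrid's stays constant.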
HERD = Request-Reply + MICA [NSDI 2014]
48-byte items, GET-intensive workload
[Plot: HERD latency (µs, 5th and 95th percentile) vs. throughput (Mops): 26 Mops at 5 µs; 3.4 µs at low load]
48-byte items, GET-intensive workload
[Plot: latency (µs, 5th and 95th percentile) vs. throughput (Mops) for emulated Pilaf, emulated FaRM-KV, and HERD: HERD reaches 26 Mops at 5 µs (3.4 µs at low load) vs. 12 Mops at 8 µs]
16-byte keys, 95% GET workload
[Plot: throughput (Mops) vs. value size (4-1024 bytes) for emulated Pilaf, emulated FaRM-KV, and HERD: HERD is up to 2x higher, using a single round trip]
16-byte keys, 95% GET workload
[Plot: throughput (Mops) vs. value size, including a raw READ baseline: HERD is faster than RDMA reads]
48-byte items
[Plot: throughput (Mops) for emulated Pilaf, emulated FaRM-KV, and HERD at 5%, 50%, and 100% PUT workloads]