SLIDE 1

Using RDMA Efficiently for Key-Value Services

Anuj Kalia (CMU) Michael Kaminsky (Intel Labs), David Andersen (CMU)

SLIDE 2

RDMA

Remote Direct Memory Access: A network feature that allows direct access to the memory of a remote computer.

SLIDE 3

HERD

  • 1. Improved understanding of RDMA through micro-benchmarking
  • 2. High-performance key-value system:
    • Throughput: 26 Mops (2X higher than others)
    • Latency: 5 µs (2X lower than others)

SLIDE 4

RDMA intro

Features:

  • Ultra-low latency: 1 µs RTT
  • Zero copy + CPU bypass

Providers:

  • InfiniBand, RoCE, …

[Figure: user buffer, DMA buffer, and NIC on machines A and B]

SLIDE 5

RDMA in the datacenter

48-port 10 GbE switches:

Switch           RDMA  Cost
Mellanox SX1012  YES   $5,900
Cisco 5548UP     NO    $8,180
Juniper EX5440   NO    $7,480

SLIDE 6

In-memory KV stores

[Figure: web servers querying memcached servers, backed by a database]

Interface: GET, PUT

  • Requirements:
    • Low latency
    • High request rate

SLIDE 7

RDMA basics

Verbs exposed by the RNIC:

RDMA read:

 READ(local_buf, size, remote_addr)

RDMA write:

 WRITE(local_buf, size, remote_addr)

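For concreteness, here is a minimal sketch in C of how such a verb is posted with libibverbs; it is not from the talk, and it assumes an already-connected queue pair, a local buffer registered with ibv_reg_mr() (giving lkey), and a remote address/rkey exchanged out of band. All names are illustrative.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA READ or WRITE on a connected queue pair.
 * opcode is IBV_WR_RDMA_READ or IBV_WR_RDMA_WRITE. */
static int post_rdma(struct ibv_qp *qp, enum ibv_wr_opcode opcode,
                     void *local_buf, uint32_t lkey, uint32_t size,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,
        .length = size,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = opcode;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;      /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;   /* where to read/write at the remote side */
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr); /* returns 0 on success */
}

The completion is later reaped from the completion queue with ibv_poll_cq(); the remote CPU never sees the operation.
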
SLIDE 8

Life of a WRITE

[Figure: timeline between requester (CPU/RAM, RNIC) and responder (RNIC, CPU/RAM), steps 1-6]

1: Request descriptor, PIO
2: Payload, DMA read
3: RDMA write request
4: Payload, DMA write
5: RDMA ACK
6: Completion, DMA write

SLIDE 9

Recent systems

  • Pilaf [ATC 2013]
  • FaRM-KV [NSDI 2014]: an example usage of FaRM

  • Approach: RDMA reads to access remote data structures

Reason: the allure of CPU bypass

SLIDE 10

The price of CPU bypass

Key-value stores have an inherent level of indirection: an index maps a key to an address, and values are stored separately.

[Figure: server's DRAM, holding the index and the values]

At least 2 RDMA reads are required: ≥ 1 to fetch the address, 1 to fetch the value.
(Not true if the value is stored in the index.)

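To make the indirection concrete, here is a hypothetical index-entry layout in C (not Pilaf's or FaRM's actual structures): the entry stores only the value's address, so a client doing a GET purely with RDMA reads must first read the entry and then read the value.

#include <stdint.h>

/* Hypothetical layout: the index entry maps a key to the address of the
 * value, which lives elsewhere in the server's DRAM. */
struct index_entry {
    uint64_t key_hash;    /* identifies the key */
    uint64_t value_addr;  /* remote address of the value */
    uint32_t value_len;   /* length of the value in bytes */
};

/* READ-based GET:
 *   READ #1: fetch the index_entry               -> learn value_addr, value_len
 *   READ #2: fetch value_len bytes at value_addr -> the value itself
 * Only if the value is stored inside the entry can the second read be avoided. */
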
SLIDE 11

The price of CPU bypass

SLIDE 12

The price of CPU bypass

READ #1 (fetch pointer): client → server

SLIDE 13

The price of CPU bypass

READ #2 (fetch value): client → server

SLIDE 14

Our approach

Goal                          Main ideas
#1: Use a single round trip   Request-reply with server CPU involvement + WRITEs faster than READs
#2: Increase throughput       Low-level verbs optimizations
#3: Improve scalability       Use datagram transport

SLIDE 15

#1: Use a single round trip

[Figure: client sends WRITE #1 (request); the server performs DRAM accesses and sends WRITE #2 (reply)]

SLIDE 16

#1: Use a single round trip

Operation        Round trips   Operations at server's RNIC
READ-based GET   2+            2+ RDMA reads
HERD GET         1             2 RDMA writes

→ Lower latency, high throughput

SLIDE 17

RDMA WRITEs faster than READs

Setup: Apt cluster, 192 nodes, 56 Gbps InfiniBand; one server (S), several clients (C)

[Figure: throughput (Mops) vs. payload size (4-256 bytes) for READ and WRITE]

SLIDE 18

RDMA WRITEs faster than READs

[Figure: server-side view; an RDMA WRITE causes a PCIe DMA write at the server's RNIC (RDMA write request, then RDMA ACK), while an RDMA READ causes a PCIe DMA read (RDMA read request, then RDMA read response)]

Reason: PCIe writes are faster than PCIe reads.

SLIDE 19

High-speed request-reply

Setup: one-to-one client-server communication (S, C1…C8), 32-byte payloads

[Figure: request-reply throughput (Mops) vs. READ throughput; a request-reply costs 2 WRITEs, compared with 1 READ and 2 READs]

SLIDE 20

#2: Increase throughput

Simple request-reply:

[Figure: client (CPU/RAM, RNIC) and server (RNIC, CPU/RAM); WRITE #1 carries the request, the server processes it, WRITE #2 carries the response]

SLIDE 21

Optimize WRITEs

[Figure: requester/responder timeline with steps 1-6 from "Life of a WRITE"]

  • +inlining: encapsulate the payload in the request descriptor (merges step 2 into step 1)
  • +unreliable: use unreliable transport (removes step 5, the ACK)
  • +unsignaled: don't ask for request completions (removes step 6)

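In libibverbs terms, these three optimizations are roughly the following flags and queue-pair choices. This is an illustrative sketch, not HERD's code; it assumes the payload fits within the QP's max_inline_data and that the QP was created with qp_type = IBV_QPT_UC (unreliable connected).

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a small request as an inlined, unsignaled RDMA WRITE over UC. */
static int post_optimized_write(struct ibv_qp *uc_qp, void *req, uint32_t len,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t) req, .length = len, .lkey = 0 };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode  = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    /* +inlining: the payload is copied into the descriptor, so the RNIC does
     * not DMA-read the source buffer (lkey is ignored for inline data).
     * +unsignaled: IBV_SEND_SIGNALED is omitted, so no completion (step 6). */
    wr.send_flags = IBV_SEND_INLINE;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    /* +unreliable: uc_qp was created with qp_type = IBV_QPT_UC,
     * so the responder sends no ACK (step 5). */
    return ibv_post_send(uc_qp, &wr, &bad_wr);
}

In practice a WRITE still has to be signaled every so often so that send-queue slots can be reclaimed by polling the completion queue.
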
SLIDE 22

#2: Increase throughput

Optimized request-reply:

[Figure: the same client/server diagram, now using inlined, unsignaled WRITEs over unreliable transport]

SLIDE 23

#2: Increase throughput

Setup: one-to-one client-server communication (S, C1…C8)

[Figure: request-reply throughput (Mops) as optimizations are added (basic, +unreliable, +unsignaled, +inlined), compared with READ]

SLIDE 24

#3: Improve scalability

Setup: one server (S), clients C1…CN

[Figure: request-reply throughput (Mops) vs. number of client/server processes (1-16)]

SLIDE 25

#3: Improve scalability

[Figure: SRAM holding per-client connection state (State 1 … State N) for clients C1 … CN]

‖state‖ > SRAM: the connection state for all clients does not fit in SRAM.

SLIDE 26

#3: Improve scalability

[Figure: SRAM holding connection state for clients C1, C2, C3]

Inbound scalability ≫ outbound, because the state needed for inbound operations is much smaller than for outbound operations.

  • Use datagram transport for outbound replies
  • Datagram only supports SEND/RECV, and SEND/RECV is slow
  • But SEND/RECV is slow only at the receiver, and for replies the server is the sender

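A sketch of the reply path in C (illustrative, not HERD's actual code): the server posts the reply as a SEND on an unreliable datagram (UD) queue pair and addresses each client through an address handle, so a single QP can reply to any number of clients.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Send a reply over a UD (datagram) QP. 'ah' is an address handle for the
 * client (from ibv_create_ah()); remote_qpn/remote_qkey identify the
 * client's UD QP. */
static int send_reply_ud(struct ibv_qp *ud_qp, struct ibv_ah *ah,
                         uint32_t remote_qpn, uint32_t remote_qkey,
                         void *reply, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t) reply, .length = len, .lkey = lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = ah;
    wr.wr.ud.remote_qpn  = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;
    return ibv_post_send(ud_qp, &wr, &bad_wr);
}

The client pre-posts receive buffers with ibv_post_recv() to absorb these SENDs, so the receiver-side cost of SEND/RECV falls on the clients rather than the server.
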
SLIDE 27

Scalable request-reply

Setup: one server (S), clients C1…CN

[Figure: request-reply throughput (Mops) vs. number of client/server processes (1-16); Naive (connected RDMA writes in both directions) vs. Hybrid (connected RDMA write for requests, datagram SEND for replies)]

SLIDE 28

Evaluation

HERD = Request-Reply + MICA [NSDI 2014]

  • Compare against emulated versions of Pilaf and FaRM-KV
  • No datastore
  • Focus on maximum performance achievable

SLIDE 29

Latency vs throughput

48-byte items, GET-intensive workload

[Figure: latency (µs, 5th and 95th percentiles) vs. throughput (Mops) for HERD; 3.4 µs at low load, 5 µs at 26 Mops]

SLIDE 30

Latency vs throughput

48-byte items, GET-intensive workload

[Figure: latency (µs, 5th and 95th percentiles) vs. throughput (Mops) for Emulated Pilaf, Emulated FaRM-KV, and HERD; annotations: 26 Mops at 5 µs, 12 Mops at 8 µs, 3.4 µs at low load]

SLIDE 31

Throughput comparison

16-byte keys, 95% GET workload

[Figure: throughput (Mops) vs. value size (4-1024 bytes) for Emulated Pilaf, Emulated FaRM-KV, and HERD; HERD is 2X higher]

SLIDE 32

HERD

  • Re-designing RDMA-based KV stores to use a single round trip
  • WRITEs outperform READs
  • Reduce PCIe and InfiniBand transactions
  • Embrace SEND/RECV
  • Code is online: https://github.com/efficient/HERD

SLIDE 33

Throughput comparison

16-byte keys, 95% GET workload

Faster than RDMA reads

[Figure: throughput (Mops) vs. value size (4-1024 bytes) for Emulated Pilaf, Emulated FaRM-KV, HERD, and READ]

SLIDE 34

Throughput comparison

48-byte items

[Figure: throughput (Mops) of Emulated Pilaf, Emulated FaRM-KV, and HERD at 5% PUT, 50% PUT, and 100% PUT]