

  1. Using RDMA Efficiently for Key-Value Services. Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)

  2. RDMA: Remote Direct Memory Access, a network feature that allows direct access to the memory of a remote computer.

  3. HERD:
     1. Improved understanding of RDMA through micro-benchmarking.
     2. A high-performance key-value system:
        • Throughput: 26 Mops (2X higher than prior systems)
        • Latency: 5 µs (2X lower than prior systems)

  4. RDMA intro
     Providers: InfiniBand, RoCE, …
     Features:
        • Ultra-low latency: 1 µs RTT
        • Zero copy + CPU bypass
     [Figure: data moves from machine A's user buffer through the NICs directly into machine B's user buffer via DMA, bypassing the remote CPU.]

  5. RDMA in the datacenter
     48-port 10 GbE switches:
     Switch            RDMA   Cost
     Mellanox SX1012   Yes    $5,900
     Cisco 5548UP      No     $8,180
     Juniper EX5440    No     $7,480

  6. In-memory KV stores
     Interface: GET, PUT
     Requirements:
        • Low latency
        • High request rate
     [Figure: webservers query a memcached tier in front of the database.]

  7. RDMA basics: verbs
     RDMA read:  READ(local_buf, size, remote_addr)
     RDMA write: WRITE(local_buf, size, remote_addr)
     Both verbs are executed by the remote NIC (RNIC) without involving the remote CPU.
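For concreteness, here is roughly how these two verbs map onto the libibverbs API. This is a minimal sketch, assuming a connected queue pair and a pre-registered memory region; the helper name post_rdma and its parameters are illustrative, and error handling is omitted.

    /* Post a one-sided RDMA READ or WRITE on a connected QP. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int post_rdma(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *local_buf, uint32_t size,
                         uint64_t remote_addr, uint32_t rkey, int is_read)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) local_buf,   /* must lie inside mr */
            .length = size,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = is_read ? IBV_WR_RDMA_READ : IBV_WR_RDMA_WRITE;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;      /* request a completion */
        wr.wr.rdma.remote_addr = remote_addr;   /* where to read/write remotely */
        wr.wr.rdma.rkey        = rkey;          /* remote region's access key */
        return ibv_post_send(qp, &wr, &bad_wr); /* the RNICs do the rest */
    }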

  8. Life of a WRITE (requester CPU/RAM → requester RNIC → responder RNIC → responder CPU/RAM):
     1. Request descriptor, PIO
     2. Payload, DMA read
     3. RDMA write request
     4. Payload, DMA write
     5. RDMA ACK
     6. Completion, DMA write

  9. Recent systems: Pilaf [ATC 2013] and FaRM-KV [NSDI 2014] (an example usage of FaRM)
     Approach: RDMA reads to access remote data structures.
     Reason: the allure of CPU bypass.

  10. The price of CPU bypass
      Key-value stores have an inherent level of indirection: an index maps a key to an address, and values are stored separately in the server's DRAM.
      At least 2 RDMA reads are therefore required (see the sketch after slide 13):
         • ≥ 1 read to fetch the address
         • 1 read to fetch the value
      (Not true if the value is stored inside the index itself.)

  11. The price of CPU bypass

  12. The price of CPU bypass [Figure: the client issues READ #1 to the server to fetch the pointer.]

  13. The price of CPU bypass [Figure: the client issues READ #2 to fetch the value.]
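Using the hypothetical post_rdma() helper sketched after slide 7, the two-read GET of slides 10-13 looks roughly like this; the index layout, hash(), and poll_until_done() are illustrative assumptions, not details from the talk.

    /* A READ-based GET: one read for the pointer, one for the value. */
    struct index_entry {
        uint64_t value_addr;   /* where the value lives in server DRAM */
        uint32_t value_len;
    };

    uint64_t hash(uint64_t key);              /* hypothetical hash function */
    void poll_until_done(struct ibv_cq *cq);  /* hypothetical: spin on ibv_poll_cq */

    static void read_based_get(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *mr, char *local_buf,
                               uint64_t key, uint64_t index_base, uint32_t rkey)
    {
        struct index_entry *ent = (struct index_entry *) local_buf;
        char *value = local_buf + sizeof(*ent);   /* both inside mr */

        /* READ #1: fetch the index entry (address + length) for the key. */
        post_rdma(qp, mr, ent, sizeof(*ent),
                  index_base + hash(key) * sizeof(*ent), rkey, 1);
        poll_until_done(cq);                      /* first round trip */

        /* READ #2: follow the pointer to fetch the value itself. */
        post_rdma(qp, mr, value, ent->value_len, ent->value_addr, rkey, 1);
        poll_until_done(cq);                      /* second round trip */
    }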

  14. Our approach
      Goal                           Main ideas
      #1: Use a single round trip    Request-reply with server CPU involvement; WRITEs are faster than READs
      #2: Increase throughput        Low-level verbs optimizations
      #3: Improve scalability        Use datagram transport

  15. #1: Use a single round trip [Figure: the client WRITEs its request into server memory (WRITE #1); the server CPU handles it with local DRAM accesses and WRITEs the reply back (WRITE #2).]

  16. #1: Use a single round trip (a server-side sketch follows)
      Operation        Round trips   Operations at server's RNIC
      READ-based GET   2+            2+ RDMA reads
      HERD GET         1             2 RDMA writes
      → Lower latency, high throughput.
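A simplified server-side sketch of the single-round-trip pattern: clients WRITE requests into per-client slots that the server CPU polls, and the server WRITEs the reply back (2 RDMA writes total, 1 round trip). The slot layout, the valid-byte polling convention, and the helper names are illustrative assumptions, not HERD's actual wire format.

    /* Server loop for one client `cl`: poll its request slot, process,
     * and WRITE the reply into the client's memory. */
    struct req_slot { volatile uint8_t valid; uint8_t data[255]; };

    struct req_slot *slot = &req_area[cl];  /* req_area is in a registered MR */
    for (;;) {
        while (!slot->valid)                /* spin: the client's WRITE lands  */
            ;                               /* here via the RNIC's DMA         */
        size_t reply_len = process_request(slot->data, reply_buf);
        slot->valid = 0;                    /* free the slot for the next request */

        /* WRITE #2: push the reply straight back into client memory. */
        post_rdma(qp[cl], mr, reply_buf, reply_len,
                  client_reply_addr[cl], client_rkey[cl], 0 /* write */);
    }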

  17. RDMA WRITEs are faster than READs [Plot: READ vs. WRITE throughput (Mops) for payload sizes of 4-256 bytes. Setup: Apt cluster, 192 nodes, 56 Gbps InfiniBand.]

  18. RDMA WRITEs are faster than READs
      Reason: PCIe writes are faster than PCIe reads.
      At the server, an RDMA WRITE costs: RDMA write request → PCIe DMA write → RDMA ACK.
      An RDMA READ costs: RDMA read request → PCIe DMA read → RDMA read response.

  19. High-speed request-reply [Plot: throughput with 32-byte payloads; request-reply via 2 WRITEs vs. via 2 READs, compared against a single READ. Setup: one-to-one client-server communication.]

  20. #2: Increase throughput
      Simple request-reply: [Figure: WRITE #1 (request) from client to server, processing at the server, WRITE #2 (response) back; each message crosses the RNIC and CPU/RAM on both sides.]

  21. Optimize WRITEs (step numbers refer to the six steps on slide 8; see the sketch after this list)
      • +inlining: encapsulate the payload in the request descriptor (merges steps 1 and 2)
      • +unreliable: use unreliable transport (removes step 5, the ACK)
      • +unsignaled: don't ask for request completions (removes step 6)
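A minimal sketch of the optimized request WRITE in libibverbs, assuming the QP was created with qp_type = IBV_QPT_UC (unreliable connected, which removes the ACK) and max_inline_data large enough for the payload:

    /* Inlined + unsignaled RDMA WRITE on an unreliable (UC) QP. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t) req_buf,
        .length = size,               /* small payload, fits inline */
    };                                /* lkey is ignored for inlined data */
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    /* IBV_SEND_INLINE: the payload travels inside the descriptor, so the
     * RNIC skips the payload DMA read (step 2). Omitting IBV_SEND_SIGNALED
     * means no completion is DMA-written back (step 6). */
    wr.send_flags = IBV_SEND_INLINE;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    ibv_post_send(qp, &wr, &bad_wr);

One practical caveat: a send queue only drains when completions are generated, so an occasional signaled send must still be posted among the unsignaled ones.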

  22. #2: Increase throughput
      Optimized request-reply: [Figure: the same exchange as slide 20, with fewer per-message operations after inlining, unreliable transport, and unsignaled sends.]

  23. #2: Increase throughput [Plot: request-reply and READ throughput (Mops) as the optimizations are stacked: basic, +unreliable, +unsignaled, +inlined. Setup: one-to-one client-server communication.]

  24. #3: Improve scalability [Plot: request-reply throughput (Mops) vs. number of client/server processes, 1-16.]

  25. #3: Improve scalability [Figure: per-connection state for clients C1 ... CN in the RNIC; with many connections, total state exceeds the RNIC's SRAM (||state|| > SRAM).]

  26. #3: Improve scalability
      Inbound scalability ≫ outbound, because inbound state ≪ outbound state and fits in SRAM.
      → Use datagram transport for outbound replies (see the sketch below).
      Datagram only supports SEND/RECV, and SEND/RECV is slow, but it is slow only at the receiver, so the server can still SEND replies quickly.
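A sketch of the hybrid design's reply path: the server answers over a single UD (datagram) queue pair, addressing each client via an address handle. The names ah, client_qpn, and client_qkey are assumed to have been exchanged at connection setup.

    /* Reply via SEND on a datagram (UD) QP; one QP serves every client. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t) reply_buf,
        .length = reply_len,
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_INLINE;      /* small replies can be inlined too */
    wr.wr.ud.ah          = ah;            /* per-client address handle */
    wr.wr.ud.remote_qpn  = client_qpn;    /* client's QP number */
    wr.wr.ud.remote_qkey = client_qkey;   /* client's queue key */
    ibv_post_send(ud_qp, &wr, &bad_wr);

Because one datagram QP can reach any number of clients, the server's outbound connection state no longer grows with the number of clients.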

  27. Scalable request-reply [Plot: throughput (Mops) vs. number of client/server processes (1-16); naive design (connected RDMA writes in both directions) vs. hybrid design (connected WRITE requests, datagram SEND replies).]

  28. Evaluation
      HERD = request-reply + MICA [NSDI 2014]
      Compared against emulated versions of Pilaf and FaRM-KV:
         • No datastore
         • Focus on the maximum achievable performance

  29. Latency vs. throughput [Plot: 48-byte items, GET-intensive workload; HERD's 5th- and 95th-percentile latency (µs) vs. throughput: 26 Mops at 5 µs, and 3.4 µs at low load.]

  30. Latency vs. throughput [Plot: same workload, adding emulated Pilaf and emulated FaRM-KV, which reach 12 Mops at 8 µs; HERD reaches 26 Mops at 5 µs (3.4 µs at low load).]

  31. Throughput comparison [Plot: 16-byte keys, 95% GET workload; throughput (Mops) vs. value size (4-1024 bytes) for emulated Pilaf, emulated FaRM-KV, and HERD; HERD is 2X higher.]

  32. HERD
      • Re-designing RDMA-based KV stores to use a single round trip
      • WRITEs outperform READs
      • Reduce PCIe and InfiniBand transactions
      • Embrace SEND/RECV
      • Code is online: https://github.com/efficient/HERD

  33. Throughput comparison [Plot: 16-byte keys, 95% GET workload; throughput vs. value size (4-1024 bytes), with a raw READ baseline added; HERD is faster than RDMA reads.]

  34. Throughput comparison [Plot: 48-byte items; throughput (Mops) under 5%, 50%, and 100% PUT workloads for emulated Pilaf, emulated FaRM-KV, and HERD.]
