NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation
Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan
Memory Solution Lab, Samsung Semiconductor Inc.


  1. NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation. Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan, Memory Solution Lab, Samsung Semiconductor Inc.

  2. Synopsis: performance characterization of NVMe-oF in the context of Flash disaggregation.
     – Overview: NVMe and NVMe-over-Fabrics; Flash disaggregation
     – Performance characterization: stress-testing remote storage; disaggregating RocksDB
     – Summary

  3. Non-Volatile Memory Express (NVMe)
     – A storage protocol standard on top of PCIe: standardizes access to local non-volatile memory over PCIe.
     – The predominant protocol for PCIe-based SSD devices: NVMe-SSDs connect through PCIe and support the standard. (A small device-enumeration sketch follows below.)
     – High performance through parallelization: a large number of deep submission/completion queues.
     – NVMe-SSDs deliver lots of IOPS/BW: 1 MIOPS and 6 GB/s from a single device, 5x more than SAS-SSD and 20x more than SATA-SSD.
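The slides describe NVMe at the protocol level only. As a small illustration of how NVMe controllers and namespaces surface on a Linux host, the hedged sketch below walks /sys/class/nvme and prints each controller's model and namespaces. The sysfs layout is the standard Linux NVMe driver layout, not something shown in the deck; adjust the paths if your kernel differs.

```python
# Minimal sketch: enumerate NVMe controllers and namespaces via Linux sysfs.
# Assumes the standard Linux NVMe driver layout under /sys/class/nvme
# (an assumption of this sketch, not something described in the slides).
import os

SYSFS_NVME = "/sys/class/nvme"

def read_attr(path):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def list_nvme_devices():
    devices = []
    if not os.path.isdir(SYSFS_NVME):
        return devices
    for ctrl in sorted(os.listdir(SYSFS_NVME)):          # e.g. nvme0, nvme1, ...
        ctrl_path = os.path.join(SYSFS_NVME, ctrl)
        model = read_attr(os.path.join(ctrl_path, "model"))
        # Namespaces show up as nvme0n1, nvme0n2, ... under the controller dir.
        namespaces = sorted(d for d in os.listdir(ctrl_path)
                            if d.startswith(ctrl + "n"))
        devices.append((ctrl, model, namespaces))
    return devices

if __name__ == "__main__":
    for ctrl, model, namespaces in list_nvme_devices():
        print(f"{ctrl}: model={model!r} namespaces={namespaces}")
```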

  4. Storage Disaggregation
     – Separates compute and storage onto different nodes: storage is accessed over a network rather than locally.
     – Enables independent resource scaling: allows flexible infrastructure tuning to dynamic loads, reduces resource underutilization, and improves cost-efficiency by eliminating waste.
     – Remote access introduces overheads: additional interconnect latencies, plus network/protocol processing that affects both storage and compute nodes.
     – HDD disaggregation is common in datacenters: HDDs are so slow that these overheads are negligible.

  5. Flash Storage Disaggregation
     – NVMe disaggregation is more challenging:
       • ~90 µs device latency → network/protocol latencies are more pronounced
       • ~1 MIOPS per device → protocol overheads tax the CPU and degrade performance
     – Flash disaggregation via iSCSI is difficult: iSCSI "introduces 20% throughput drop at the application level"*, though even then it can still be a cost-efficiency win.
     – We show that these overheads go away with NVMe-oF. (A back-of-the-envelope comparison follows below.)
     * A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and S. Kumar, "Flash Storage Disaggregation," EuroSys '16
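To make the "negligible for HDD, significant for flash" point concrete, here is a small worked comparison. The ~90 µs flash latency comes from this slide and the ~11.7 µs NVMe-oF addition from backup slide 18; the ~5 ms HDD access time is an assumed typical figure, not a number from this deck.

```python
# Back-of-the-envelope: the same fixed remote-access overhead is negligible
# for an HDD but noticeable for an NVMe-SSD.
FLASH_LATENCY_US = 90.0       # ~90 us NVMe-SSD access latency (from the slide)
HDD_LATENCY_US = 5000.0       # ~5 ms typical HDD access time (assumption, not from the deck)
REMOTE_OVERHEAD_US = 11.7     # measured NVMe-oF addition (backup slide 18)

for name, base in [("NVMe-SSD", FLASH_LATENCY_US), ("HDD", HDD_LATENCY_US)]:
    rel = REMOTE_OVERHEAD_US / base * 100
    print(f"{name}: {base:.0f} us local -> {base + REMOTE_OVERHEAD_US:.1f} us remote "
          f"(+{rel:.1f}%)")
# NVMe-SSD: ~13% latency increase; HDD: ~0.2%. This is why HDD disaggregation
# has long been an easy win while flash disaggregation needs a leaner protocol.
```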

  6. NVMe-oF: NVMe-over-Fabrics
     – A recent extension of the NVMe standard: enables access to remote NVMe devices over different network fabrics.
     – Maintains the current NVMe architecture and adds support for message-based NVMe operations.
     – Advantages:
       • Parallelism: extends the multi-queue-pair design of NVMe
       • Efficiency: eliminates protocol translations along the I/O path
       • Performance
     – Supported fabrics: RDMA (InfiniBand, iWARP, RoCE), Fibre Channel, FCoE.
     (A sketch of a Linux NVMe-oF target setup follows below.)
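The deck describes NVMe-oF only at the protocol level and does not show how the target in the testbed was configured. The sketch below is one common way to export a local NVMe namespace over NVMe-oF/RDMA with the Linux in-kernel target's nvmet configfs interface; the NQN, device path, IP address, and port are illustrative placeholders, and the nvmet and nvmet-rdma modules are assumed to be loaded. A host would then attach with nvme-cli, e.g. `nvme connect -t rdma -n <nqn> -a <ip> -s 4420`, and see a regular /dev/nvmeXnY block device.

```python
# Hedged sketch: export a local NVMe namespace over NVMe-oF/RDMA using the
# Linux kernel nvmet configfs interface. Must run as root; the nvmet and
# nvmet-rdma modules must already be loaded. The NQN, device, IP, and port
# below are placeholders, not values from the presentation.
import os

NVMET = "/sys/kernel/config/nvmet"
NQN = "nqn.2016-06.io.example:pm1725-1"    # illustrative subsystem NQN
DEVICE = "/dev/nvme0n1"                    # local namespace to export
TRADDR, TRSVCID = "192.168.1.10", "4420"   # target RDMA address / port

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# 1. Create the subsystem and allow any host to connect (lab setting only).
subsys = os.path.join(NVMET, "subsystems", NQN)
os.makedirs(subsys, exist_ok=True)
write(os.path.join(subsys, "attr_allow_any_host"), "1")

# 2. Add namespace 1 backed by the local NVMe device and enable it.
ns = os.path.join(subsys, "namespaces", "1")
os.makedirs(ns, exist_ok=True)
write(os.path.join(ns, "device_path"), DEVICE)
write(os.path.join(ns, "enable"), "1")

# 3. Create an RDMA port and bind the subsystem to it.
port = os.path.join(NVMET, "ports", "1")
os.makedirs(port, exist_ok=True)
write(os.path.join(port, "addr_adrfam"), "ipv4")
write(os.path.join(port, "addr_trtype"), "rdma")
write(os.path.join(port, "addr_traddr"), TRADDR)
write(os.path.join(port, "addr_trsvcid"), TRSVCID)
os.symlink(subsys, os.path.join(port, "subsystems", NQN))

print(f"Exported {DEVICE} as {NQN} on {TRADDR}:{TRSVCID} (rdma)")
```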

  7. Methodology
     – Three configurations:
       1. Baseline: local, direct-attached storage (DAS)
       2. Remote storage with NVMe-oF over RoCEv2
       3. Remote storage with iSCSI
     – Followed best-known-methods for tuning.
     – Hardware setup:
       • 3 host servers (a.k.a. compute nodes or datastore servers): dual-socket Xeon E5-2699
       • 1 target server (a.k.a. storage server): quad-socket Xeon E7-8890
       • 3x Samsung PM1725 NVMe-SSDs: random 750/120 KIOPS read/write, sequential 3000/2000 MB/s read/write
       • Network: ConnectX-4 100Gb Ethernet NICs with RoCE support, 100Gb top-of-rack switch
     [Slide figures: the direct-attached (DAS) baseline and the remote storage setup]

  8. Maximum Throughput
     – NVMe-oF throughput is the same as DAS; iSCSI cannot keep up at high IOPS rates (a 40% gap is annotated on the chart).
     [Chart: 4KB Random Traffic Throughput, showing IOPS for DAS, NVMf, and iSCSI across the 100/0, 80/20, 50/50, 20/80, and 0/100 read/write mixes]
     (A sketch of this kind of fio mix sweep follows below.)
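The deck does not list the exact benchmark parameters, but backup slide 18 shows fio driving /dev/nvmeXnY, so the read/write-mix sweep presumably looked something like the sketch below. The block size and mixes match the chart; queue depth, job count, runtime, and the device path are illustrative assumptions.

```python
# Hedged sketch: sweep 4KB random read/write mixes with fio, roughly matching
# the 100/0 ... 0/100 mixes on the throughput chart. Queue depth, job count,
# runtime, and the target device are assumptions, not values from the deck.
import subprocess

DEVICE = "/dev/nvme0n1"        # a local NVMe-SSD or an NVMe-oF attached one
READ_PERCENTAGES = [100, 80, 50, 20, 0]

for read_pct in READ_PERCENTAGES:
    cmd = [
        "fio",
        f"--name=randrw_{read_pct}_{100 - read_pct}",
        f"--filename={DEVICE}",
        "--direct=1",                    # bypass the page cache
        "--ioengine=libaio",
        "--rw=randrw",
        f"--rwmixread={read_pct}",       # read share of the mix
        "--bs=4k",
        "--iodepth=32",                  # assumed queue depth
        "--numjobs=8",                   # assumed parallel jobs
        "--time_based",
        "--runtime=60",
        "--group_reporting",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```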

  9. Host CPU Overheads
     – NVMe-oF CPU processing overheads are minimal.
     – iSCSI adds significant load on the host (30%), even when its performance is on par with DAS.
     [Chart: Host CPU Utilization [%] for DAS, NVMf, and iSCSI across the 100/0, 80/20, 50/50, 20/80, and 0/100 read/write mixes]
     (A sketch of sampling host CPU utilization during a run follows below.)
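The deck does not say how host CPU utilization was collected; a common approach is to sample it alongside the I/O run, for example with mpstat or, as in the hedged sketch below, the third-party psutil package. The sampling interval and duration are arbitrary choices, not values from the study.

```python
# Hedged sketch: sample system-wide CPU utilization once per second while a
# benchmark runs, then report the average and peak. Requires psutil
# (pip install psutil); interval and duration are arbitrary assumptions.
import time
import psutil

def sample_cpu_utilization(duration_s=60, interval_s=1.0):
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        # cpu_percent blocks for interval_s and returns utilization over that window.
        samples.append(psutil.cpu_percent(interval=interval_s))
    return samples

if __name__ == "__main__":
    samples = sample_cpu_utilization(duration_s=60)
    print(f"avg CPU utilization: {sum(samples) / len(samples):.1f}% "
          f"(peak {max(samples):.1f}%)")
```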

  10. Storage Server CPU Overheads
     – CPU processing on the target is limited: 90% of DAS read-only throughput with 1/12th of the cores.
     – Cost-efficiency win: fewer cores per NVMe-SSD in the storage server.
     [Chart: IOPS and target CPU utilization [%] for NVMf and iSCSI with 32, 16, and 8 target cores at the 100/0 and 80/20 read/write mixes; a 2.4x gap between NVMf and iSCSI is annotated]

  11. Latency Under Load
     – NVMe-oF latencies are the same as DAS for all practical loads, both average and tail.
     – iSCSI saturates sooner and is 10x slower even under light loads.
     [Chart: 4KB Random Read Load Latency, showing average and 95th-percentile latency [usec] vs. IOPS for DAS, NVMf, and iSCSI]

  12. Latency Under Load (continued)
     – NVMe-oF latencies are the same as DAS for all practical loads, both average and tail.
     – iSCSI saturates sooner and is 10x slower even under light loads.
     [Chart: the same 4KB Random Read Load Latency data, re-plotted with the latency axis zoomed in to ~1,200 usec]
     (A sketch of extracting average and tail latency from fio logs follows below.)
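The charts report average and 95th-percentile latency as a function of load. If per-I/O latencies are logged (fio's --write_lat_log=<prefix> option emits CSV files whose second column is the latency, in nanoseconds on recent fio versions, microseconds on older ones), the summary statistics can be computed as in the sketch below. The column layout and units are assumptions to check against your fio version.

```python
# Hedged sketch: compute average, 95th-, and 99th-percentile latency from a
# fio latency log (e.g. produced with --write_lat_log=myjob). Assumes the
# common CSV layout "time, latency, direction, block size, ..." with latency
# in nanoseconds (older fio versions logged microseconds -- check yours).
import csv
import statistics
import sys

def load_latencies_us(path):
    """Read latency values (column 2) and convert nanoseconds -> microseconds."""
    lats = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                lats.append(float(row[1]) / 1000.0)
    return lats

if __name__ == "__main__":
    lats = load_latencies_us(sys.argv[1])
    lats.sort()
    p95 = lats[int(0.95 * (len(lats) - 1))]
    p99 = lats[int(0.99 * (len(lats) - 1))]
    print(f"n={len(lats)} avg={statistics.mean(lats):.1f}us "
          f"p95={p95:.1f}us p99={p99:.1f}us")
```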

  13. KV-Store Disaggregation (1/3)
     – Evaluated using RocksDB, driven with db_bench:
       • 3 hosts, 3 RocksDB instances per host
       • 800B and 10KB objects
       • 80/20 read/write mix
     (A sketch of a comparable db_bench invocation follows below.)
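The exact db_bench flags are not given in the deck. The hedged sketch below shows one plausible way to drive an 80/20 read/write mix against several instances with 800-byte values using standard db_bench options (readrandomwriterandom with readwritepercent); the key count, thread count, and database paths are placeholders, not the study's actual parameters.

```python
# Hedged sketch: launch several db_bench instances with an 80/20 read/write
# mix and 800B values, loosely matching the workload described on the slide.
# Key count, threads, and paths are placeholders, not values from the study.
import subprocess

INSTANCES = 3              # RocksDB instances per host (from the slide)
VALUE_SIZE = 800           # bytes; rerun with 10 * 1024 for the 10KB case
NUM_KEYS = 100_000_000     # placeholder database size

procs = []
for i in range(INSTANCES):
    cmd = [
        "db_bench",
        "--benchmarks=readrandomwriterandom",
        "--readwritepercent=80",          # 80% reads / 20% writes
        f"--value_size={VALUE_SIZE}",
        f"--num={NUM_KEYS}",
        "--threads=16",                   # assumed client threads per instance
        f"--db=/mnt/nvmf/rocksdb_{i}",    # one directory per instance (placeholder)
        "--use_existing_db=0",
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```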

  14. KV-Store Disaggregation (2/3)
     – NVMe-oF performance is on par with DAS: a 2% throughput difference, vs. a 40% performance degradation for iSCSI.
     [Charts: RocksDB Performance (operations per second for DAS, NVMf, and iSCSI at 800B and 10KB object sizes) and Disk Bandwidth over Time on the Target]

  15. KV-Store Disaggregation (3/3)
     – NVMe-oF performance is on par with DAS: a 2% throughput difference, vs. a 40% performance degradation for iSCSI.
     – Average latency increases by 11%, tail latency by 2%:
       • Average latency: 507 µs → 568 µs
       • 99th percentile: 3.6 ms → 3.7 ms
     – 10% CPU-utilization overhead on the host.
     [Chart: Read Latency CDF (percentage vs. latency [us]) for DAS and NVMf]

  16. Summary
     – NVMe-oF reduces remote storage overheads to a bare minimum:
       • Negligible throughput difference, similar latency
       • Low processing overheads on both host and target: applications (host) get the same performance, and the storage server (target) can support more drives with fewer cores
     – NVMe-oF makes disaggregation more viable: no need to offset a well-over-20% iSCSI performance loss.
     Thank You!  zvika.guz@samsung.com

  17. Backup

  18. Unloaded Latency Breakdown
     – NVMe-oF adds 11.7 µs over the DAS access latency, close to the 10 µs target in the spec.
     – 4K unloaded read latency breakdown [usec]:
       • NVMe DAS path: 81.6
       • NVMf host modules: 3.25
       • NVMf target modules: 4.57
       • Fabric: 2.43
       • Others: 1.52
     [Figure: the host-side and target-side I/O paths (fio on /dev/nvmeXnY, VFS/file system, block layer, NVMe_Core, NVMe_PCI or NVMe_RDMA, RDMA stack, fabric, NVMeT_RDMA, NVMeT_Core)]

  19. FAQ #1: SPDK
     – Storage Performance Development Kit (SPDK):
       • Provides user-mode storage drivers (NVMe, NVMe-oF target, and NVMe-oF host)
       • Better performance by eliminating kernel context switches and polling rather than using interrupts
     – Will improve NVMe-oF performance, BUT it was not stable enough for our setup.
     – For unloaded latency:
       • An SPDK target further reduces the latency overhead (8.9 µs vs. 11.7 µs over DAS)
       • SPDK local → SPDK target is similar to local → NVMe-oF
     [Chart: Unloaded Latency [usec] for DAS, SPDK DAS, NVMf, and SPDK NVMf target]

  20. FAQ #1: SPDK (duplicate of the previous slide).

  21. FAQ #2: Hyper-convergence vs. Disaggregation
     – Hyper-converged infrastructure (HCI):
       • A software-defined approach that bundles commodity servers into a clustered pool
       • Abstracts the underlying hardware into a virtualized computing platform
     – We focus on web-scale data centers, where disaggregation fits well within the deployment model:
       • Several classes of servers, some of which are storage-centric
       • HDDs are already disaggregated
     – NVMe-oF, HCI, and disaggregation are not mutually exclusive: HCI on top of NVMe-oF, hybrid architectures.
