
NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation

Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan Memory Solution Lab Samsung Semiconductor Inc.

SLIDE 2: Synopsis

Performance characterization of NVMe-oF in the context of Flash disaggregation

  • Overview
    – NVMe and NVMe-over-Fabrics
    – Flash disaggregation
  • Performance characterization
    – Stress-testing remote storage
    – Disaggregating RocksDB
  • Summary


SLIDE 3: Non-Volatile Memory Express (NVMe)

  • A storage protocol standard on top of PCIe:
    – Standardizes access to local non-volatile memory over PCIe (see the access sketch below)

  • The predominant protocol for PCIe-based SSD devices

– NVMe-SSDs connect through PCIe and support the standard

  • High performance through parallelism:
    – A large number of deep submission/completion queues
  • NVMe-SSDs deliver high IOPS and bandwidth
    – 1M IOPS, 6 GB/s from a single device
    – 5x more than SAS-SSDs, 20x more than SATA-SSDs

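A minimal sketch of the local access path, assuming a Linux host and a scratch device at a placeholder path; at queue depth 1 it exercises none of the queue parallelism above (tools such as fio drive many deep queues in parallel):

    import mmap, os, random, time

    DEV = "/dev/nvme0n1"   # placeholder: pick a scratch NVMe device
    BLOCK = 4096
    READS = 10_000

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)                     # page-aligned, as O_DIRECT requires

    start = time.perf_counter()
    for _ in range(READS):
        offset = random.randrange(size // BLOCK) * BLOCK
        os.preadv(fd, [buf], offset)               # one synchronous 4KB random read
    elapsed = time.perf_counter() - start
    os.close(fd)

    print(f"{READS / elapsed:,.0f} IOPS at queue depth 1")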

SLIDE 4: Storage Disaggregation

  • Separates compute and storage into different nodes

– Storage is accessed over a network rather than locally

  • Enables independent resource scaling

    – Allows flexible infrastructure tuning to dynamic loads
    – Reduces resource underutilization
    – Improves cost-efficiency by eliminating waste

  • Remote access introduces overheads

    – Additional interconnect latencies
    – Network/protocol processing affects both storage and compute nodes

  • HDD disaggregation is common in datacenters

    – HDDs are so slow that these overheads are negligible


SLIDE 5: Flash Storage Disaggregation

  • NVMe disaggregation is more challenging

    – ~90μs latency → network/protocol latencies are more pronounced
    – ~1M IOPS → protocol overheads tax the CPU and degrade performance

  • Flash disaggregation via iSCSI is difficult:
    – iSCSI “introduces 20% throughput drop at the application level”*
    – Even then, it can still be a cost-efficiency win

  • We show that these overheads go away with NVMe-oF


* A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and S. Kumar, “Flash storage disaggregation,” EuroSys ’16.

SLIDE 6: NVMe-oF: NVMe-over-Fabrics

  • Recent extension of the NVMe standard

– Enables access to remote NVMe devices over different network fabrics

  • Maintains the current NVMe architecture, and:

– Adds support for message-based NVMe operations

  • Advantages:

    – Parallelism: extends the multiple queue-pair design of NVMe
    – Efficiency: eliminates protocol translations along the I/O path
    – Performance

  • Supported fabrics (see the connection sketch below):
    – RDMA: InfiniBand, iWARP, RoCE
    – Fibre Channel, FCoE

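As a rough host-side illustration, a sketch that discovers and connects to an NVMe-oF subsystem over RDMA with the standard nvme-cli tool; the address and NQN below are placeholders:

    import subprocess

    TADDR = "192.168.1.10"                    # placeholder: target's RoCE-reachable address
    NQN = "nqn.2016-06.io.example:subsys0"    # placeholder subsystem NQN

    # List the subsystems the target exports over RDMA (RoCEv2), port 4420.
    subprocess.run(["nvme", "discover", "-t", "rdma", "-a", TADDR, "-s", "4420"],
                   check=True)

    # Connect; the remote namespace then shows up as a local /dev/nvmeXnY.
    subprocess.run(["nvme", "connect", "-t", "rdma", "-a", TADDR, "-s", "4420",
                    "-n", NQN], check=True)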

SLIDE 7: Methodology

  • Three configurations:
    1. Baseline: local, direct-attached storage (DAS)
    2. Remote storage with NVMe-oF over RoCEv2
    3. Remote storage with iSCSI
  • Followed best-known methods for tuning (see the fio sweep sketch below)
  • Hardware setup:

– 3 host servers (a.k.a. compute nodes, or datastore servers)

  • Dual-socket Xeon E5-2699

– 1 target server (a.k.a. storage server)

  • Quad-socket Xeon E7-8890

– 3x Samsung PM1725 NVMe-SSDs

  • Random: 750/120 KIOPS read/write
  • Sequential: 3000/2000 MB/sec read/write

– Network:

  • ConnectX-4 100Gb Ethernet NICs with RoCE support
  • 100Gb top-of-rack switch


[Figure: baseline direct-attached (DAS) topology vs. the remote-storage setup.]
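
A sketch of how the 4KB random-traffic sweep could be driven with fio, one run per read/write mix; the device path, queue depth, and job count are placeholder assumptions:

    import json, subprocess

    DEV = "/dev/nvme0n1"           # placeholder: DAS or NVMe-oF attached device
    MIXES = [100, 80, 50, 20, 0]   # % reads: 100/0, 80/20, 50/50, 20/80, 0/100

    for read_pct in MIXES:
        out = subprocess.run(
            ["fio", "--name=mix", f"--filename={DEV}",
             "--ioengine=libaio", "--direct=1", "--bs=4k",
             "--rw=randrw", f"--rwmixread={read_pct}",
             "--iodepth=32", "--numjobs=16",          # placeholder queue sizing
             "--time_based", "--runtime=60", "--group_reporting",
             "--output-format=json"],
            capture_output=True, text=True, check=True)
        job = json.loads(out.stdout)["jobs"][0]
        iops = job["read"]["iops"] + job["write"]["iops"]
        print(f"{read_pct}/{100 - read_pct}: {iops:,.0f} IOPS")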

SLIDE 8: Maximum Throughput

  • NVMe-oF throughput is the same as DAS
    – iSCSI cannot keep up at high IOPS rates

[Figure: 4KB random-traffic throughput (IOPS) vs. read/write mix (100/0 through 0/100) for DAS, NVMf, and iSCSI; a 40% gap is annotated.]

SLIDE 9: Host CPU Overheads

  • NVMe-oF CPU processing overheads are minimal
    – iSCSI adds significant load on the host (30%), even when its performance is on par with DAS (a CPU-sampling sketch follows the figure)

[Figure: host CPU utilization (%) vs. read/write mix for DAS, NVMf, and iSCSI.]
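
One simple way the host-utilization numbers could be collected, sketched with the psutil package; the one-minute, 1 Hz sampling window is an arbitrary choice:

    import psutil

    # cpu_percent(interval=1) blocks for one second and averages across all cores.
    samples = [psutil.cpu_percent(interval=1) for _ in range(60)]
    print(f"mean host CPU utilization: {sum(samples) / len(samples):.1f}%")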

SLIDE 10: Storage Server CPU Overheads

[Figure: throughput (IOPS) and target CPU utilization (%) for the 100/0 and 80/20 mixes; DAS vs. NVMf and iSCSI with 32, 16, and 8 target cores; a 2.4x gap is annotated.]

  • CPU processing on the target is modest
    – 90% of DAS read-only throughput with 1/12th of the cores
  • Cost-efficiency win: fewer cores per NVMe-SSD in the storage server (a core-capping sketch follows)
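
A sketch of one way to cap the target at 8, 16, or 32 cores by offlining the rest through Linux's CPU-hotplug sysfs interface (requires root); this is an assumed mechanism, not necessarily the one used in the study:

    import os

    KEEP = 8                                  # e.g., the 8-core configuration

    for cpu in range(KEEP, os.cpu_count()):
        # Writing "0" to the hotplug file takes the core offline.
        with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
            f.write("0")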
SLIDE 11: Latency Under Load

  • NVMe-oF latencies are the same as DAS for all practical loads
    – Both average and tail
  • iSCSI:
    – Saturates sooner
    – 10x slower even under light loads

[Figure: 4KB random-read load-latency: average and 95th-percentile latency (up to 10,000 μs) vs. IOPS for DAS, NVMf, and iSCSI.]
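
Each point on these curves is the average or the 95th percentile of per-I/O completion latencies at a given offered load; a minimal summary sketch, with illustrative sample values only:

    import statistics

    def summarize(latencies_us):
        avg = statistics.fmean(latencies_us)
        p95 = statistics.quantiles(latencies_us, n=100)[94]   # 95th percentile
        return avg, p95

    # Illustrative per-I/O latency samples in microseconds (not measured data).
    avg, p95 = summarize([88.0, 91.5, 90.2, 95.3, 180.7, 89.9, 92.4, 101.0])
    print(f"avg {avg:.1f} us, p95 {p95:.1f} us")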

SLIDE 12: Latency Under Load (cont.)

(Same takeaways as the previous slide.)

[Figure: the same load-latency curves zoomed in to the 200–1,200 μs range.]

SLIDE 13: KV-Store Disaggregation (1/3)

  • Evaluated using RocksDB, driven with db_bench (see the invocation sketch below)
    – 3 hosts
    – 3 RocksDB instances per host
    – 800B and 10KB objects
    – 80/20 read/write mix

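A sketch of one such RocksDB instance driven by db_bench with the 80/20 mix; the database path, key count, and thread count are placeholders:

    import subprocess

    subprocess.run(
        ["db_bench",
         "--db=/mnt/nvme0/rocksdb",            # placeholder: filesystem on the SSD under test
         "--benchmarks=readrandomwriterandom",
         "--readwritepercent=80",              # 80% reads / 20% writes
         "--value_size=800",                   # 800B objects (10KB in the second run)
         "--num=100000000",                    # placeholder working-set size
         "--threads=32"],                      # placeholder thread count
        check=True)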

SLIDE 14: KV-Store Disaggregation (2/3)

[Figure: disk bandwidth over time on the target.]

  • NVMe-oF performance is on par with DAS
    – 2% throughput difference
    – vs. a 40% performance degradation for iSCSI

[Figure: RocksDB operations per second for 800B and 10KB object sizes; DAS, NVMf, and iSCSI.]

SLIDE 15: KV-Store Disaggregation (3/3)

  • NVMe-oF performance is on par with DAS
    – 2% throughput difference, vs. a 40% performance degradation for iSCSI
    – Average latency increases by 11%, tail latency by 2% (a CDF sketch follows the figure)
      • Average latency: 507μs → 568μs
      • 99th percentile: 3.6ms → 3.7ms
    – 10% CPU utilization overhead on the host

[Figure: read-latency CDF (cumulative percentage vs. latency in μs) for DAS and NVMf.]
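
A minimal sketch of how such a CDF is built from per-operation read latencies; the sample values are illustrative only:

    def latency_cdf(latencies_us):
        xs = sorted(latencies_us)
        # Point i: the fraction of reads completing within xs[i] microseconds.
        return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]

    # Illustrative latency samples in microseconds (not measured data).
    for x, p in latency_cdf([430.0, 505.5, 512.1, 568.9, 3650.0]):
        print(f"{p:5.0%} of reads finished within {x:,.1f} us")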

SLIDE 16: Summary

  • NVMe-oF reduces remote-storage overheads to a bare minimum
    – Negligible throughput difference, similar latency
    – Low processing overheads on both host and target
  • Applications (host) get the same performance
  • The storage server (target) can support more drives with fewer cores
  • NVMe-oF makes disaggregation more viable
    – No need to offset iSCSI’s >20% performance loss


Thank You!

zvika.guz@samsung.com

SLIDE 17: Backup

SLIDE 18: Unloaded Latency Breakdown

4K unloaded read-latency breakdown:

    Component              Latency [μs]
    NVMe DAS path          81.6
    NVMf host modules      3.25
    NVMf target modules    4.57
    Fabric                 2.43
    Others                 1.52
    Total (NVMe-oF)        93.4

  • NVMe-oF adds 11.7μs over DAS access latency

– Close to the 10μs spec target
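
The quoted 11.7μs is just the sum of the non-DAS rows in the table above:

    # Sum the NVMe-oF additions from the breakdown table (values in microseconds).
    parts = {"NVMf host modules": 3.25, "NVMf target modules": 4.57,
             "fabric": 2.43, "others": 1.52}
    overhead = sum(parts.values())     # 11.77 us, rounded to the 11.7 us quoted
    print(f"NVMe-oF adds {overhead:.2f} us on top of the 81.6 us DAS path")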


[Figure: NVMe-oF I/O path. Host side (userspace → kernel): fio on /dev/nvmeXnY → VFS/file system → block layer → NVMe_Core → NVMe_RDMA → RDMA stack → fabric. Target side: RDMA stack → NVMeT_RDMA → NVMeT_Core → block layer → NVMe_Core → NVMe_PCI.]

SLIDE 19: FAQ #1: SPDK

  • Storage Performance Development Kit (SPDK)
    – Provides user-mode storage drivers
      • NVMe, NVMe-oF target, and NVMe-oF host
    – Better performance through:
      • Eliminating kernel context switches
      • Polling rather than interrupts
      • Will improve NVMe-oF performance
    – But it was not stable enough for our setup
  • For unloaded latency:
    – The SPDK target further reduces latency overhead
    – SPDK local → SPDK target is similar to local → NVMe-oF

[Figure: unloaded latency (μs) for DAS, SPDK DAS, SPDK NVMf target, and NVMf; annotated overheads of 8.9μs and 11.7μs.]


SLIDE 21: FAQ #2: Hyper-convergence vs. Disaggregation

  • Hyper-converged infrastructure (HCI)
    – Software-defined approach
    – Bundles commodity servers into a clustered pool
    – Abstracts the underlying hardware into a virtualized computing platform

  • We focus on web-scale data centers

– Disaggregation fits well within their deployment model

  • Several classes of servers, some of which are storage-centric
  • Already disaggregate HDDs

    – HCI on top of NVMe-oF
    – Hybrid architectures
