
NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation

Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan Memory Solution Lab Samsung Semiconductor Inc.

SLIDE 2: Synopsis

Performance characterization of NVMe-oF in the context of Flash disaggregation

  • Overview
    – NVMe and NVMe-over-Fabrics
    – Flash disaggregation
  • Performance characterization
    – Stress-testing remote storage
    – Disaggregating RocksDB
  • Summary


SLIDE 3: Non-Volatile Memory Express (NVMe)

  • A storage protocol standard on top of PCIe:
    – Standardizes access to local non-volatile memory over PCIe (see the access sketch below)

  • The predominant protocol for PCIe-based SSD devices

– NVMe-SSDs connect through PCIe and support the standard

  • High performance through parallelism:
    – A large number of deep submission/completion queues
  • NVMe-SSDs deliver high IOPS and bandwidth
    – 1M IOPS, 6 GB/s from a single device
    – 5x more than SAS-SSDs, 20x more than SATA-SSDs

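A minimal sketch of the local access path, assuming a Linux host and a scratch device at a placeholder path; at queue depth 1 it exercises none of the queue parallelism above (tools such as fio drive many deep queues in parallel):

    import mmap, os, random, time

    DEV = "/dev/nvme0n1"   # placeholder: pick a scratch NVMe device
    BLOCK = 4096
    READS = 10_000

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)                     # page-aligned, as O_DIRECT requires

    start = time.perf_counter()
    for _ in range(READS):
        offset = random.randrange(size // BLOCK) * BLOCK
        os.preadv(fd, [buf], offset)               # one synchronous 4KB random read
    elapsed = time.perf_counter() - start
    os.close(fd)

    print(f"{READS / elapsed:,.0f} IOPS at queue depth 1")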

SLIDE 4: Storage Disaggregation

  • Separates compute and storage into different nodes

– Storage is accessed over a network rather than locally

  • Enables independent resource scaling

    – Allows flexible infrastructure tuning to dynamic loads
    – Reduces resource underutilization
    – Improves cost-efficiency by eliminating waste

  • Remote access introduces overheads

    – Additional interconnect latencies
    – Network/protocol processing affects both storage and compute nodes

  • HDD disaggregation is common in datacenters

    – HDDs are so slow that these overheads are negligible


SLIDE 5: Flash Storage Disaggregation

  • NVMe disaggregation is more challenging

    – ~90μs latency → network/protocol latencies are more pronounced
    – ~1M IOPS → protocol overheads tax the CPU and degrade performance

  • Flash disaggregation via iSCSI is difficult:
    – iSCSI “introduces 20% throughput drop at the application level”*
    – Even then, it can still be a cost-efficiency win

  • We show that these overheads go away with NVMe-oF


* A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and S. Kumar, “Flash storage disaggregation,” EuroSys ’16.

SLIDE 6: NVMe-oF: NVMe-over-Fabrics

  • Recent extension of the NVMe standard

– Enables access to remote NVMe devices over different network fabrics

  • Maintains the current NVMe architecture, and:

– Adds support for message-based NVMe operations

  • Advantages:

    – Parallelism: extends the multiple queue-pair design of NVMe
    – Efficiency: eliminates protocol translations along the I/O path
    – Performance

  • Supported fabrics (see the connection sketch below):
    – RDMA: InfiniBand, iWARP, RoCE
    – Fibre Channel, FCoE

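As a rough host-side illustration, a sketch that discovers and connects to an NVMe-oF subsystem over RDMA with the standard nvme-cli tool; the address and NQN below are placeholders:

    import subprocess

    TADDR = "192.168.1.10"                    # placeholder: target's RoCE-reachable address
    NQN = "nqn.2016-06.io.example:subsys0"    # placeholder subsystem NQN

    # List the subsystems the target exports over RDMA (RoCEv2), port 4420.
    subprocess.run(["nvme", "discover", "-t", "rdma", "-a", TADDR, "-s", "4420"],
                   check=True)

    # Connect; the remote namespace then shows up as a local /dev/nvmeXnY.
    subprocess.run(["nvme", "connect", "-t", "rdma", "-a", TADDR, "-s", "4420",
                    "-n", NQN], check=True)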

SLIDE 7: Methodology

  • Three configurations:
    1. Baseline: local, direct-attached storage (DAS)
    2. Remote storage with NVMe-oF over RoCEv2
    3. Remote storage with iSCSI
  • Followed best-known methods for tuning (see the fio sweep sketch below)
  • Hardware setup:

– 3 host servers (a.k.a. compute nodes, or datastore servers)

  • Dual-socket Xeon E5-2699

– 1 target server (a.k.a. storage server)

  • Quad-socket Xeon E7-8890

– 3x Samsung PM1725 NVMe-SSDs

  • Random: 750/120 KIOPS read/write
  • Sequential: 3000/2000 MB/sec read/write

– Network:

  • ConnectX-4 100Gb Ethernet NICs with RoCE support
  • 100Gb top-of-rack switch


[Figure: baseline direct-attached (DAS) topology vs. the remote-storage setup.]
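
A sketch of how the 4KB random-traffic sweep could be driven with fio, one run per read/write mix; the device path, queue depth, and job count are placeholder assumptions:

    import json, subprocess

    DEV = "/dev/nvme0n1"           # placeholder: DAS or NVMe-oF attached device
    MIXES = [100, 80, 50, 20, 0]   # % reads: 100/0, 80/20, 50/50, 20/80, 0/100

    for read_pct in MIXES:
        out = subprocess.run(
            ["fio", "--name=mix", f"--filename={DEV}",
             "--ioengine=libaio", "--direct=1", "--bs=4k",
             "--rw=randrw", f"--rwmixread={read_pct}",
             "--iodepth=32", "--numjobs=16",          # placeholder queue sizing
             "--time_based", "--runtime=60", "--group_reporting",
             "--output-format=json"],
            capture_output=True, text=True, check=True)
        job = json.loads(out.stdout)["jobs"][0]
        iops = job["read"]["iops"] + job["write"]["iops"]
        print(f"{read_pct}/{100 - read_pct}: {iops:,.0f} IOPS")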

SLIDE 8: Maximum Throughput

  • NVMe-oF throughput is the same as DAS
    – iSCSI cannot keep up at high IOPS rates

[Figure: 4KB random-traffic throughput (IOPS) vs. read/write mix (100/0 through 0/100) for DAS, NVMf, and iSCSI; a 40% gap is annotated.]

SLIDE 9: Host CPU Overheads

  • NVMe-oF CPU processing overheads are minimal
    – iSCSI adds significant load on the host (30%), even when its performance is on par with DAS (a CPU-sampling sketch follows the figure)

[Figure: host CPU utilization (%) vs. read/write mix for DAS, NVMf, and iSCSI.]
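
One simple way the host-utilization numbers could be collected, sketched with the psutil package; the one-minute, 1 Hz sampling window is an arbitrary choice:

    import psutil

    # cpu_percent(interval=1) blocks for one second and averages across all cores.
    samples = [psutil.cpu_percent(interval=1) for _ in range(60)]
    print(f"mean host CPU utilization: {sum(samples) / len(samples):.1f}%")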

SLIDE 10: Storage Server CPU Overheads

[Figure: throughput (IOPS) and target CPU utilization (%) for the 100/0 and 80/20 mixes; DAS vs. NVMf and iSCSI with 32, 16, and 8 target cores; a 2.4x gap is annotated.]

  • CPU processing on the target is modest
    – 90% of DAS read-only throughput with 1/12th of the cores
  • Cost-efficiency win: fewer cores per NVMe-SSD in the storage server (a core-capping sketch follows)
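
A sketch of one way to cap the target at 8, 16, or 32 cores by offlining the rest through Linux's CPU-hotplug sysfs interface (requires root); this is an assumed mechanism, not necessarily the one used in the study:

    import os

    KEEP = 8                                  # e.g., the 8-core configuration

    for cpu in range(KEEP, os.cpu_count()):
        # Writing "0" to the hotplug file takes the core offline.
        with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
            f.write("0")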
SLIDE 11: Latency Under Load

  • NVMe-oF latencies are the same as DAS for all practical loads
    – Both average and tail
  • iSCSI:
    – Saturates sooner
    – 10x slower even under light loads

[Figure: 4KB random-read load-latency: average and 95th-percentile latency (up to 10,000 μs) vs. IOPS for DAS, NVMf, and iSCSI.]
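
Each point on these curves is the average or the 95th percentile of per-I/O completion latencies at a given offered load; a minimal summary sketch, with illustrative sample values only:

    import statistics

    def summarize(latencies_us):
        avg = statistics.fmean(latencies_us)
        p95 = statistics.quantiles(latencies_us, n=100)[94]   # 95th percentile
        return avg, p95

    # Illustrative per-I/O latency samples in microseconds (not measured data).
    avg, p95 = summarize([88.0, 91.5, 90.2, 95.3, 180.7, 89.9, 92.4, 101.0])
    print(f"avg {avg:.1f} us, p95 {p95:.1f} us")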

SLIDE 12: Latency Under Load (cont.)

(Same takeaways as the previous slide.)

[Figure: the same load-latency curves zoomed in to the 200–1,200 μs range.]

SLIDE 13: KV-Store Disaggregation (1/3)

  • Evaluated using RocksDB, driven with db_bench (see the invocation sketch below)
    – 3 hosts
    – 3 RocksDB instances per host
    – 800B and 10KB objects
    – 80/20 read/write mix

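A sketch of one such RocksDB instance driven by db_bench with the 80/20 mix; the database path, key count, and thread count are placeholders:

    import subprocess

    subprocess.run(
        ["db_bench",
         "--db=/mnt/nvme0/rocksdb",            # placeholder: filesystem on the SSD under test
         "--benchmarks=readrandomwriterandom",
         "--readwritepercent=80",              # 80% reads / 20% writes
         "--value_size=800",                   # 800B objects (10KB in the second run)
         "--num=100000000",                    # placeholder working-set size
         "--threads=32"],                      # placeholder thread count
        check=True)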

SLIDE 14: KV-Store Disaggregation (2/3)

[Figure: disk bandwidth over time on the target.]

  • NVMe-oF performance is on par with DAS
    – 2% throughput difference
    – vs. a 40% performance degradation for iSCSI

[Figure: RocksDB operations per second for 800B and 10KB object sizes; DAS, NVMf, and iSCSI.]

SLIDE 15: KV-Store Disaggregation (3/3)

  • NVMe-oF performance is on par with DAS
    – 2% throughput difference, vs. a 40% performance degradation for iSCSI
    – Average latency increases by 11%, tail latency by 2% (a CDF sketch follows the figure)
      • Average latency: 507μs → 568μs
      • 99th percentile: 3.6ms → 3.7ms
    – 10% CPU utilization overhead on the host

[Figure: read-latency CDF (cumulative percentage vs. latency in μs) for DAS and NVMf.]
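
A minimal sketch of how such a CDF is built from per-operation read latencies; the sample values are illustrative only:

    def latency_cdf(latencies_us):
        xs = sorted(latencies_us)
        # Point i: the fraction of reads completing within xs[i] microseconds.
        return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]

    # Illustrative latency samples in microseconds (not measured data).
    for x, p in latency_cdf([430.0, 505.5, 512.1, 568.9, 3650.0]):
        print(f"{p:5.0%} of reads finished within {x:,.1f} us")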

SLIDE 16: Summary

  • NVMe-oF reduces remote-storage overheads to a bare minimum
    – Negligible throughput difference, similar latency
    – Low processing overheads on both host and target
  • Applications (host) get the same performance
  • The storage server (target) can support more drives with fewer cores
  • NVMe-oF makes disaggregation more viable
    – No need to offset iSCSI’s >20% performance loss


Thank You!

zvika.guz@samsung.com

SLIDE 17: Backup

SLIDE 18: Unloaded Latency Breakdown

4K unloaded read-latency breakdown:

    Component              Latency [μs]
    NVMe DAS path          81.6
    NVMf host modules      3.25
    NVMf target modules    4.57
    Fabric                 2.43
    Others                 1.52
    Total (NVMe-oF)        93.4

  • NVMe-oF adds 11.7μs over DAS access latency

– Close to the 10μs spec target
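
The quoted 11.7μs is just the sum of the non-DAS rows in the table above:

    # Sum the NVMe-oF additions from the breakdown table (values in microseconds).
    parts = {"NVMf host modules": 3.25, "NVMf target modules": 4.57,
             "fabric": 2.43, "others": 1.52}
    overhead = sum(parts.values())     # 11.77 us, rounded to the 11.7 us quoted
    print(f"NVMe-oF adds {overhead:.2f} us on top of the 81.6 us DAS path")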


[Figure: NVMe-oF I/O path. Host side (userspace → kernel): fio on /dev/nvmeXnY → VFS/file system → block layer → NVMe_Core → NVMe_RDMA → RDMA stack → fabric. Target side: RDMA stack → NVMeT_RDMA → NVMeT_Core → block layer → NVMe_Core → NVMe_PCI.]

SLIDE 19: FAQ #1: SPDK

  • Storage Performance Development Kit (SPDK)
    – Provides user-mode storage drivers
      • NVMe, NVMe-oF target, and NVMe-oF host
    – Better performance through:
      • Eliminating kernel context switches
      • Polling rather than interrupts
      • Will improve NVMe-oF performance
    – But it was not stable enough for our setup
  • For unloaded latency:
    – The SPDK target further reduces latency overhead
    – SPDK local → SPDK target is similar to local → NVMe-oF

[Figure: unloaded latency (μs) for DAS, SPDK DAS, SPDK NVMf target, and NVMf; annotated overheads of 8.9μs and 11.7μs.]


SLIDE 21: FAQ #2: Hyper-convergence vs. Disaggregation

  • Hyper-converged infrastructure (HCI)
    – Software-defined approach
    – Bundles commodity servers into a clustered pool
    – Abstracts the underlying hardware into a virtualized computing platform

  • We focus on web-scale data centers

– Disaggregation fits well within their deployment model

  • Several classes of servers, some of which are storage-centric
  • Already disaggregate HDDs

    – HCI on top of NVMe-oF
    – Hybrid architectures
