

SLIDE 1

SMB3 Extensions for Low Latency

Tom Talpey, Microsoft. May 12, 2016, SambaXP 2016, Berlin.

SLIDE 2

Problem Statement

  • “Storage Class Memory”
    • A new, disruptive class of storage
    • Nonvolatile medium with RAM-like performance: low latency, high throughput, high capacity
    • Resides on the memory bus (byte addressable), or also on the PCIe bus (block semantics)
  • New interface paradigms are rising to utilize it
    • Many based on time-honored methods (mapped files, etc.)

SLIDE 3

Low Latency Storage

  • 2000 – HDD latency; SAN arrays accelerated using memory
    • ~5000 usec latency
  • 2010 – SSD latency; mere mortals can configure high-performance storage
    • ~100 usec latency (50x improvement)
  • 2016 – beginning of the Storage Class Memory (SCM) revolution
    • <1 usec latency (local), <10 usec latency (remote) (~100x improvement)
    • Volume deployment imminent (NVDIMM today)

5000x change over 15 years!

SLIDE 4

Storage Latencies and Storage API

[Chart: storage latency for HDD, SSD, SCM and DRAM on a log scale from 1 to 10,000,000 usec, with equivalent cycle counts at 2 GHz; annotated “always use async” at the slow (HDD) end and “never use async” at the fast (DRAM) end.]

  • HDD to SSD: 50x reduction in latency, 1000x more IOPs
    • Moving from SAN to SDS
    • Commoditization of storage IT
  • SSD to SCM: >500x reduction in latency, >500x more IOPs
    • Requires re-architecture of the IO stack
    • Requires re-architecture of the net stack (for replication)
    • Applications will program differently
      • Instant-on in-memory
      • Will consider moving to sync

SLIDE 5

Need for a New Programming Model

  • Current programming model
    • Data records are created in volatile memory (memory operations)
    • Copied to HDD or SSD to make them persistent (I/O operations)
  • Opportunities provided by NVM devices
    • Software can skip the steps that copy data from memory to disks
    • Software can take advantage of the unique capabilities of both persistent memory and flash NVM
  • Need for a new programming model
    • Application writes persistent data directly to NVM, which can be treated just like RAM
    • Mapped files, DAX, NVML (see the sketch below)
  • Storage can follow this new model
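
As a concrete illustration of the mapped-file model, here is a minimal sketch using NVML's libpmem; the file path and size are placeholders. The application maps a file on persistent memory, stores into it directly, and makes the stores durable without issuing any read/write system calls:

```c
#include <stdio.h>
#include <string.h>
#include <libpmem.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (creating if needed) a 4 KiB file on a PM-aware filesystem. */
    char *addr = pmem_map_file("/mnt/pmem/record", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Create the data record directly in persistent memory... */
    strcpy(addr, "hello, persistent world");

    /* ...and make it durable: CPU cache flushes if the mapping is
       true PM, msync() otherwise. No block I/O in the PM case. */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len);

    pmem_unmap(addr, mapped_len);
    return 0;
}
```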

SLIDE 6

Local Filesystems and Local APIs

  • DAX
    • Direct Access filesystem
    • Windows and Linux (very) similar (a raw mapped-file sketch follows this list)
  • NVML
    • NVM Programming Library
    • Open source; included in Linux, with Windows support to follow
  • Specialized interfaces
    • Databases
    • Transactional libraries
    • Language extensions (!)
    • etc.
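
For comparison with the libpmem sketch above, the same direct access is available through a plain mapped file on a DAX filesystem. A minimal Linux sketch, with a placeholder path, assuming a 2016-era kernel where msync() is the portable way to flush a DAX mapping:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Open a file residing on a DAX-mounted filesystem. */
    int fd = open("/mnt/dax/record", O_RDWR);
    if (fd < 0)
        return 1;

    /* On DAX the mapping goes straight to persistent memory:
       no page cache; loads and stores touch the medium. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "stored directly in PM");

    /* Request durability for the written range. */
    msync(p, 4096, MS_SYNC);

    munmap(p, 4096);
    close(fd);
    return 0;
}
```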

SLIDE 7

Push Mode

SLIDE 8

RDMA Transfers – Storage Protocols Today

  • Direct placement model (simplified and optimized)
    • Client advertises an RDMA region in the scatter/gather list (a registration sketch follows the diagram note)
    • Server performs all RDMA
      • More secure: the client does not access the server’s memory
      • More scalable: the server does not preallocate to the client
      • Faster: for parallel (typical) storage workloads
  • SMB3 uses this for READ and WRITE
    • Server ensures durability
  • NFS/RDMA and iSER are similar
  • Interrupts and CPU on both sides

[Diagram: READ and WRITE ladders between Client and Server. The client registers its buffer; for READ the server RDMA Writes the data and Sends the response (with invalidate); for WRITE the server RDMA Reads the data (with local invalidate) and Sends the response.]
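
The client-side registration that “advertises” a region might look as follows with libibverbs. This is a sketch only: protection domain and queue pair setup are omitted, and the SMB Direct framing that actually carries the triple to the server is left as a comment.

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Register a client buffer so the server can RDMA Read/Write it.
   'pd' is an already-created protection domain (setup omitted). */
struct ibv_mr *advertise_region(struct ibv_pd *pd, void *buf, size_t len)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL)
        return NULL;

    /* The {addr, length, rkey} triple is what the storage protocol
       carries in the request's scatter/gather list, e.g.:
       send_to_server((uint64_t)(uintptr_t)buf, len, mr->rkey);     */
    return mr;
}
```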

SLIDE 9

Latencies

  • Undesirable latency contributions
    • Interrupts, work requests
    • Server request processing
    • Server-side RDMA handling
    • CPU processing time: request processing
    • I/O stack processing and buffer management
    • Data copies to “traditional” storage subsystems
  • Can we reduce or remove all of the above when the target is PM?

SLIDE 10

RDMA Push Mode (Schematic)

  • Enhanced direct placement model
    • Client requests a server resource: file, memory region, etc.
      • MAP_REMOTE_REGION(offset, length, mode r/w)
      • Server pins/registers/advertises an RDMA handle for the region
    • Client performs all RDMA (a client-side sketch follows the diagram note)
      • RDMA Write to the region
      • RDMA Read from the region (“Pull mode”)
      • No requests of the server (no server CPU/interrupt)
      • Achieves near-wire latencies
    • Client remotely commits to PM (new RDMA operation!)
      • Ideally, no server CPU interaction
      • RDMA NIC optionally signals the server CPU
      • Operation completes at the client only when remote durability is guaranteed
    • Client periodically updates the server via the master protocol
      • E.g. file change, timestamps, other metadata
    • Server can call back to the client
      • To recall, revoke, manage resources, etc.
    • Client signals the server (closes) when done

[Diagram: Push and Pull ladders. The server registers the region once; the client then moves data with RDMA Write (push) or RDMA Read (pull) and issues the new RDMA Commit, with no per-I/O server activity until Unregister.]
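
On the client, push mode then reduces to posting plain RDMA Writes against the advertised region. A libibverbs sketch follows; the commit step is hypothetical, since no such verb exists today, and all connection setup is omitted:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Push 'len' bytes into the server's advertised region.
   remote_addr/rkey come from MAP_REMOTE_REGION; 'mr' is a
   registered local buffer. */
int push_write(struct ibv_qp *qp, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* no server CPU */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The proposed RDMA Commit would follow here as another work
       request (hypothetical opcode, not in any verbs API today). */
    return ibv_post_send(qp, &wr, &bad);
}
```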

SLIDE 11

Push Mode Implications

  • Historically, RDMA storage protocols avoided push mode, for good reasons:
    • Non-exposure of server memory
    • Resource conservation
    • Performance (perhaps surprisingly)
      • Server scheduling of data with I/O
      • Write congestion control: server-mediated data pull
  • Today:
    • Server memory can be well protected with little performance compromise
    • Resources are scalable
    • However, the congestion issue remains
      • Upper storage layer crediting
      • Hardware (RDMA NIC) flow control
      • QoS infrastructure
      • Existing Microsoft/MSR innovation to the rescue?

SLIDE 12

Consistency and Durability - Platform

SLIDE 13

RDMA with byte-addressable PM – Intel HW Architecture - Background

  • ADR – Asynchronous DRAM Refresh
    • Allows DRAM contents to be saved to NVDIMM on power loss
    • Requires special hardware with PS or supercap support
    • ADR Domain – all data inside the domain is protected by ADR and will make it to NVM before power dies. The integrated memory controller (iMC) is currently inside the ADR Domain.
    • HW does not guarantee the order in which cache lines are written to NVM during an ADR event
  • IIO – Integrated IO Controller
    • Controls IO flow between PCIe devices and main memory
    • “Allocating write transactions”
      • PCI Root Port will utilize write buffers backed by LLC core cache when the target write buffer has the WB attribute
      • Data buffers naturally age out of cache to main memory
    • “Non-allocating write transactions”
      • PCI Root Port write transactions utilize buffers not backed by cache
      • Forces write data to move to the iMC without cache delay
      • Various enable/disable methods, non-default
  • DDIO – Data Direct IO
    • Allows bus-mastering PCI & RDMA IO to move data directly in/out of LLC core caches
    • Allocating write transactions will utilize DDIO

[Diagram: RNIC attached via the PCI Root Port to the IIO, which feeds the iMC and DRAM/NVDIMM inside the ADR domain; the LLC and cores sit alongside. Arrows contrast allocating vs. non-allocating read and write flows for PCI DMA, RDMA and the CPU. Credit: Intel]

SLIDE 14

Durability Workarounds

  • Alternatives have been proposed; see also the SDC 2015 Intel presentation
  • Significant performance (latency) implications, however! (a flush-callback sketch follows the diagram note)

[Diagrams: two workarounds, credit Intel. With non-allocating write transactions, a trailing RDMA Read forces prior RDMA Write data through the RNIC internal buffers and IIO into the ADR domain, where ADR guarantees persistence. With allocating write transactions, a Send/Receive callback on the server CPU runs CLFLUSHOPT/SFENCE to force the written cache lines to the iMC, followed by PCOMMIT/SFENCE to NVM.]
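
The allocating-write workaround needs server CPU involvement: on each Send/Receive callback the server flushes the just-written range. A minimal sketch of that flush using compiler intrinsics, assuming a 64-byte cache line; PCOMMIT, still current in 2016, was later deprecated by Intel, so it appears only as a comment:

```c
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE 64

/* Flush an RDMA-written range toward the memory subsystem.
   Intended to run in the server's Send/Receive completion callback. */
static void flush_rdma_write(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflushopt((void *)p);   /* evict the written lines */
    _mm_sfence();                    /* order the flushes */

    /* On 2016-era parts, PCOMMIT + SFENCE would follow here to
       commit from the iMC to NVM (intrinsic since removed). */
}
```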

SLIDE 15

RDMA Durability – Protocol Extension

SLIDE 16

“Doing it right” - RDMA protocols

  • Need a remote guarantee of durability
  • RDMA Write alone is not sufficient for this semantic
    • Completion at the sender does not mean the data was placed
      • NOT even that it was sent on the wire, much less received
      • Some RNICs give stronger guarantees, but never that data was stored remotely
    • Processing at the receiver means only that the data was accepted
      • NOT that it was sent on the bus
      • Segments can be reordered, by the wire or the bus
    • Only an RDMA completion at the receiver guarantees placement
      • And placement != commit/durable
  • No Commit operation exists
    • Certain platform-specific guarantees can be made, but the remote client cannot know them
    • E.g. RDMA Read-after-RDMA-Write (which won’t generally work)

SLIDE 17

RDMA protocol extension

  • Two “obvious” possibilities
  • RDMA Write with placement acknowledgement
    • Advantage: simple API; set a “push bit”
    • Disadvantage: significantly changes the RDMA Write semantic and data path (flow control, buffering, completion); requires creating a “Write Ack”
      • Requires significant changes to RDMA Write hardware design
      • And also to the initiator work request model (flow-controlled RDMA Writes would block the send work queue)
    • Undesirable
  • RDMA “Commit”
    • New operation, flow controlled/acknowledged like RDMA Read or Atomic
    • Disadvantage: new operation
    • Advantage: simple API (“flush”); operates on one or more regions (allows batching); preserves the existing RDMA Write semantic (minimizing RNIC implementation change)
    • Desirable

SLIDE 18

RDMA Commit (concept)

  • RDMA Commit
    • New wire operation
    • Implementable in iWARP and IB/RoCE
  • Initiating RNIC provides a region list and other commit parameters
    • Under control of the local API at the client/initiator
  • Receiving RNIC queues the operation to proceed in order
    • Like RDMA Read or Atomic processing currently
    • Subject to flow control and ordering
  • RNIC pushes pending writes to the targeted regions
    • Alternatively, the NIC may simply opt to push all writes
  • RNIC performs the PM commit
    • Possibly interrupting the CPU on current architectures
    • In future (highly desirable, to avoid latency), performed via PCIe
  • RNIC responds when durability is assured (a hypothetical wire-layout sketch follows)
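
To make the shape of the operation concrete, here is a hypothetical commit request layout, loosely following the concept in draft-talpey-rdma-commit. Every field name and size below is invented for illustration; this is not a standardized wire format:

```c
#include <stdint.h>

/* Hypothetical RDMA Commit request: a list of target regions the
   responder must make durable before completing the operation. */
struct rdma_commit_region {
    uint64_t remote_addr;   /* start of a previously written region */
    uint32_t rkey;          /* the region's registered handle */
    uint32_t length;        /* bytes to commit */
};

struct rdma_commit_request {
    uint32_t num_regions;   /* allows batching several regions */
    uint32_t flags;         /* e.g. "commit all pending writes" */
    struct rdma_commit_region regions[];
};
```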

SLIDE 19

Other RDMA Commit Semantics

  • Desirable to include other semantics with Commit:
    • Atomically-placed data after commit, e.g. a “log pointer update”
    • Immediate data, e.g. to signal the upper layer
    • An entire message, for more complex signaling
      • Can be an ordinary send/receive, only with new specific ordering requirements
    • Additional processing, e.g. an integrity check
  • These may be best implemented as ordered following operations
  • Decisions will be workload-dependent
    • A small log-write scenario will always commit
    • Bulk data movement will permit batching

SLIDE 20

Platform-specific Extensions

  • PCI extension to support Commit
    • Allows the NIC to provide durability directly and efficiently
      • To memory, CPU, PCI Root, PM device, PCIe device, …
    • Avoids CPU interaction
    • Supports a strong data consistency model
    • Performs the equivalent of:
      • CLFLUSHOPT (region list)
      • PCOMMIT
  • Or, if the NIC is on the memory bus or within the CPU complex…
    • Other possibilities exist
  • Platform-specific implementations, on a platform-local basis
    • Standard extensions are most desirable

SLIDE 21

Latencies (expectations)

  • Single-digit-microsecond remote Write+Commit
    • Push mode minimal write latencies <<10 us (2-3 us + data wire time)
    • Commit time is NIC-managed and platform+payload dependent
    • Note: this is an order-of-magnitude improvement over today’s transfer mode (30-50 us, as mentioned)
  • Remote Read also possible
    • Roughly the same latency as write, but without commit
  • No server interrupt
    • Zero server CPU overhead, once the RDMA and PCIe extensions are in place
  • Single client interrupt
    • Moderation and batching can reduce this further when pipelining
  • Deep parallelism with Multichannel and flow-control management

SLIDE 22

SMB3 Push Mode

SLIDE 23

SMB3 Low Latency - Architecture

  • The following is only a possible approach!
    • Applies to any DAX-enabled OS, e.g. Linux, Windows, et al.
    • Disclaimer: this is NOT a specification, and NOT a protocol
    • The hope is that Samba, and all SMB3 and SMB Direct implementations, can adopt this to take advantage of PMEM-capable systems as they appear
  • DAX:
    • http://www.snia.org/sites/default/files/NVM/2016/presentations/Neal%20Christiansen_SCM_in_Windows_NVM_Summit.pdf
    • http://www.snia.org/sites/default/files/NVM/2016/presentations/JeffMoyer_Persistent-Memory-in-Linux.pdf

SLIDE 24

SMB3 Commit - Traditional Mode

  • Basic steps:
    • Open the file, obtain a handle and lease
    • Write the file with SMB2_WRITE (or read with SMB2_READ)
      • Buffered or writethrough
      • The server process performs load/store operations to the PMEM file
    • Flush the file with SMB2_FLUSH
      • Performs Commit when the share is DAX-enabled
    • Handle possible recalls
    • Close the file
  • Advantages:
    • Traditional API, fully functional
  • Disadvantages:
    • Higher latencies from roundtrips, message signaling, work requests (a client-side sketch follows the diagram note)

[Diagram: the same READ/WRITE ladder as slide 8; every transfer is server-mediated, with the client registering buffers and the server performing the RDMA and the final Send (with invalidate).]
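
From a Windows client these steps are just the ordinary file API, which the SMB3 redirector maps onto the wire operations noted above. A minimal sketch, with a placeholder UNC path:

```c
#include <windows.h>

int main(void)
{
    char buf[512] = "log record";
    DWORD written;

    /* Open -> SMB2_CREATE (handle + lease) */
    HANDLE h = CreateFileA("\\\\server\\pmshare\\log.bin",
                           GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* Write -> SMB2_WRITE (buffered here; FILE_FLAG_WRITE_THROUGH
       at open time would request writethrough instead) */
    WriteFile(h, buf, sizeof(buf), &written, NULL);

    /* Flush -> SMB2_FLUSH; on a DAX-enabled share this is where
       the server commits the data to persistent memory */
    FlushFileBuffers(h);

    /* Close -> SMB2_CLOSE */
    CloseHandle(h);
    return 0;
}
```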

SLIDE 25

SMB3 Push Mode

  • Basic steps:
    • Open a DAX-enabled file
    • Obtain a lease
    • Request a push-mode registration
    • While (TRUE):
      • Push (or pull) data
      • Commit data to durability
    • Release the registration
    • Drop the lease
    • Close the handle
  • Details follow

[Diagram: the same Push/Pull ladder as slide 10: register once, then client-driven RDMA Write/Read of data plus the new RDMA Commit, ending with Unregister.]

SLIDE 26

SMB3 Push Mode - Open

  • Opening the file performs the “usual” processing
    • Authorization, handle setup, etc.
    • A new SMB2 create context (perhaps) signals the desired Push Mode
      • Sent in conjunction with requesting a lease
      • With a DAX filesystem and a lease, establishes a mapped-file r/w interface on the server
      • The create context is returned to indicate push mode is available
  • Obtaining a lease performs the “usual” processing, plus:
    • The lease provides a means to manage the mapping
    • If the mapping changes, or if a later registration revokes it, a lease recall is performed
  • DAX mode may require certain “traditional” client steps
    • Issued via SMB3
    • E.g. extending the file to allocate blocks / establish the mapping
  • Otherwise, nothing unusual (a hypothetical create-context sketch follows)
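
For illustration only, such a create context might be shaped like the following. The tag, structures and fields are entirely invented here, since no such context is specified:

```c
#include <stdint.h>

/* Hypothetical push-mode create context payload. The 4-byte tag
   and every field below are invented; SMB2 defines no such
   context today. */
#define SMB2_CREATE_PUSH_MODE_TAG "PuMo"

struct smb2_push_mode_request {
    uint32_t flags;          /* e.g. request read, write, or both */
    uint32_t reserved;
};

struct smb2_push_mode_response {
    uint32_t flags;          /* server: push mode granted or not */
    uint32_t reserved;
};
```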

SLIDE 27

SMB3 Push Mode – Registration

  • Need to register file region(s) for RDMA Write from the client
  • Client issues a new FSCTL
    • MAP_REMOTE_REGION(offset, length, mode r/w)
    • Server pins the mapping, RDMA-registers the pages, and returns an RDMA region handle
  • Client can request multiple regions
    • And multiple r/w modes
  • Client can remotely write or read (a hypothetical FSCTL sketch follows)
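
A client might issue such an FSCTL from user mode via DeviceIoControl, sketched below. The control code and in/out structures are hypothetical (no such FSCTL exists, and the function number is chosen arbitrarily); only the ioctl plumbing is real:

```c
#include <windows.h>
#include <winioctl.h>

/* Hypothetical control code and structures; MAP_REMOTE_REGION is
   not a real Windows FSCTL. */
#define FSCTL_MAP_REMOTE_REGION \
    CTL_CODE(FILE_DEVICE_FILE_SYSTEM, 0x0F00, METHOD_BUFFERED, FILE_ANY_ACCESS)

typedef struct {
    UINT64 Offset;       /* file offset of the region */
    UINT64 Length;       /* region length in bytes */
    UINT32 Mode;         /* read, write, or both */
} MAP_REMOTE_REGION_IN;

typedef struct {
    UINT64 RemoteAddr;   /* address to target with RDMA */
    UINT32 RegionHandle; /* rkey-like handle for the region */
} MAP_REMOTE_REGION_OUT;

BOOL map_remote_region(HANDLE h, UINT64 off, UINT64 len, UINT32 mode,
                       MAP_REMOTE_REGION_OUT *out)
{
    MAP_REMOTE_REGION_IN in = { off, len, mode };
    DWORD ret;
    return DeviceIoControl(h, FSCTL_MAP_REMOTE_REGION,
                           &in, sizeof(in), out, sizeof(*out),
                           &ret, NULL);
}
```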

SLIDE 28

SMB3 Push Mode – Operation

  • Client performs remote writes and reads directly to and from the file
    • Entirely within the RDMA layers, no server processing at all!
  • If the remote RDMA Commit operation is available:
    • Client commits writes via RDMA
  • Otherwise:
    • Client commits via SMB2_FLUSH
  • Client may periodically update file metadata timestamps, etc.
    • Using SMB2_IOCTL(FILE_BASIC_INFORMATION), etc.
  • Note – Push Mode cannot:
    • Add blocks / append to the file
    • Punch holes, etc.
    • These must be done with traditional SMB3 operations (a write-path sketch follows)
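
Putting the pieces together, the client's steady-state write path might look like the following pseudocode. Every helper named here (rdma_write, rdma_commit, smb2_flush) is hypothetical shorthand for the operations described above:

```c
/* Hypothetical helpers standing in for the operations above. */
int rdma_write(void *region, unsigned off, const void *buf, unsigned len);
int rdma_commit(void *region, unsigned off, unsigned len); /* new RDMA op */
int smb2_flush(void *handle);                              /* round trip  */

/* Steady-state push-mode write: pure RDMA when Commit is
   available, SMB2_FLUSH fallback otherwise. */
int push_durable_write(void *region, void *handle, int have_commit,
                       unsigned off, const void *buf, unsigned len)
{
    int rc = rdma_write(region, off, buf, len);   /* no server CPU */
    if (rc != 0)
        return rc;
    return have_commit ? rdma_commit(region, off, len)
                       : smb2_flush(handle);
}
```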

SLIDE 29

SMB3 Push Mode - Recall

  • Server must manage sharing and registration
    • Client ownership of a lease covers both
  • Recalled upon:
    • Sharing violation (as usual)
    • Mapping change (new DAX callback)
    • Registration change (RDMA resource behavior)
      • Managed by the server itself
  • When recalled, the client performs the usual:
    • Flush any dirty buffers (and possibly commit)
    • Return the lease
    • Possibly re-obtain a new lease and re-register for push mode
    • If not, fall back to “normal” SMB3 operations

SLIDE 30

SMB3 Push Mode – Done

  • When done, the “usual” processing
    • Return the lease / close the handle
    • Returning the lease and/or closing the handle clears the push-mode registration
    • Or also… an explicit fsctl? TBD
  • Push Mode registration is reset during handle recovery
    • Must be re-obtained after the handle is recovered (like leases)
    • New mapping, new RDMA handle
    • If issued early, RDMA operations would fail and kill the connection

SLIDE 31

SMB3 Push Mode – Fun Fact #1

  • All push mode handles are in “buffered” mode
    • The server opens DAX files this way (both Linux and Windows)
    • The direct access create context is a hint to do this
    • The server may do it anyway, when DAX is detected
    • Allows load/store for r/w
  • Buffered mode enables direct mapping and direct RDMA reading/writing of the PMEM-resident file
    • Direct mapping allows RDMA read/write, and RDMA Commit

[Diagram: server stack. SMB3 I/O requests reach the PMEM-resident file via load/store through the “buffer cache”, while SMB3 RDMA push goes from the RDMA NIC via RDMA R/W to the DAX filesystem’s direct file mapping.]

SLIDE 32

SMB3 Push Mode – Fun Fact #2

  • Push mode management requires new server plumbing
  • Registration recalls may be needed for:
    • Sharing violations/file changes (as usual)
    • DAX filesystem events (file relocation, extension, etc.)
    • RDMA resource constraints
    • Any other server matter
  • Lease recall is the mechanism for all of these
    • Server “upcalls” may originate both within and outside the filesystem

SLIDE 33

SMB3 Push Mode – Fun Fact #3

  • Push Mode fan-in from many (1000s+) clients can congest the interface
    • And also congest the PMEM
  • RDMA itself does not provide congestion control
    • But SMB3 does, with credits, and also Storage QoS
    • Existing client-side behavior can mitigate this
    • RNICs can also provide QoS control
  • More thinking needed here

SLIDE 34

SMB3 Push Mode – Fun Fact #4

  • Data-in-flight protection and data-at-rest integrity validation?
    • In SMB3, provided by signing and/or encryption, and by the backend file store
    • Push mode transfers all data via RDMA, “below” SMB3
  • RDMA protocols do not currently support signing or encryption
    • Specifications refer to IPsec or another lower-layer facility
    • Extensions to be discussed in the RDMA standards areas
  • Remote integrity validation is also not available
    • The remote server CPU is not at all involved in transfers
    • Extensions are also being discussed for RDMA remote integrity validation

SLIDE 35

External RDMA Efforts

  • Requirements and protocol
    • For the RDMA Commit operation
    • Also local PM behaviors
      • Memory registration
    • Independent of transport
      • Applies to iWARP, IB, RoCE
  • IETF Working Group
    • STORM: RDMA (iWARP) and storage (iSCSI)
    • Recently closed, but active for discussion
    • Another WG, or the individual process: TBD
  • Also being discussed in
    • IBTA (IB/RoCE) expected
    • SNIA NVM TWG
    • Open Fabrics DS/DA? etc.


https://datatracker.ietf.org/doc/draft-talpey-rdma-commit/

SLIDE 36

Resources

  • SNIA NVM Programming TWG: http://www.snia.org/forums/sssi/nvmp
  • SDC 2015: http://www.snia.org/events/storage-developer/presentations15
  • NVM Summit 2016: http://www.snia.org/nvmsummit2016
  • Open Fabrics Workshop: https://openfabrics.org/index.php/2016-ofa-workshop-presentations.html
