  1. Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs
     Jinkyu Jeong, Sungkyunkwan University (SKKU)
     Source: Gyusun Lee et al., "Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs," USENIX ATC '19
     NVRAMOS 2019

  2. Storage Performance Trends
     • Emerging ultra-low latency SSDs deliver I/Os in a few µs
     [Figure: storage access latency (ns) by year, 1985-2017, for SRAM, DRAM, ultra-low latency SSDs (Samsung Z-SSD, Intel Optane SSD), SSDs, and HDDs]
     Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015

  3. Overhead of Kernel I/O Stack
     • Low-latency SSDs expose the overhead of the kernel I/O stack
     [Figure: normalized latency broken down into user, kernel, and device time for SATA SSD, NVMe SSD, Z-SSD, and Optane SSD; read on the left, write (+fsync) on the right]

  4. Synchronous I/O vs. Asynchronous I/O
     • Synchronous I/O: computation A runs on the CPU, then the device performs I/O B; the total latency is the sum of the two
     • Asynchronous I/O: the part of A (A") that is independent of B can overlap with the device I/O, reducing total latency and improving throughput
     • Our idea: apply the asynchronous I/O concept to the I/O stack itself
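The same overlap can be illustrated in user space with POSIX AIO (a minimal sketch; "data.bin" and the busy loop standing in for A" are placeholders): the read is submitted, independent computation proceeds while the device works, and the caller blocks only when the result is needed. This is an analogy only; the paper applies the same principle inside the kernel's own read and fsync paths, not at the system-call interface.

```c
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for computation A" that does not depend on the I/O result. */
static void do_independent_work(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        x += i;
}

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* placeholder input file */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(4096);
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = 4096;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }  /* submit I/O (B) */

    do_independent_work();                 /* A" runs while the device works */

    const struct aiocb *const list[1] = { &cb };
    aio_suspend(list, 1, NULL);            /* block only when the data is needed */
    printf("read %zd bytes while computing in parallel\n", aio_return(&cb));

    free(buf);
    close(fd);
    return 0;                              /* link with -lrt on older glibc */
}
```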

  5. Read Path Overview
     • Vanilla read path: sys_read() runs I/O stack operations on the CPU, waits for the device I/O, then runs the remaining I/O stack operations before returning to user
     • Proposed read path: the I/O stack operations are performed asynchronously, overlapped with the device I/O, which reduces latency

  6. Write Path Overview
     • Vanilla write path: sys_write() performs a buffered write on the CPU and returns to user without issuing any device I/O

  7. Write Path Overview
     • Vanilla fsync path: sys_fsync() alternates I/O stack operations on the CPU with device I/Os; each I/O waits for the preceding stack operations and vice versa
     • Proposed fsync path: the I/O stack operations are overlapped with the device I/Os, reducing latency

  8. Agenda
     • Read path
       − Analysis of vanilla read path
       − Proposed read path
     • Light-weight block I/O layer
     • Write path
       − Analysis of vanilla write path
       − Proposed write path
     • Evaluation
     • Conclusion

  9. Analysis of Vanilla Read Path
     • sys_read() traverses the page cache, file system, block layer, and device driver on submission, and the interrupt handler on completion
     • CPU-side costs on submission: page cache lookup 0.30 µs, page allocation 0.19 µs, page cache insertion 0.33 µs, LBA retrieval 0.09 µs, BIO submission 0.72 µs, DMA mapping 0.29 µs, NVMe I/O submission 0.37 µs, context switch 0.95 µs
     • Device I/O: 7.26 µs
     • CPU-side costs on completion: context switch 0.95 µs, request completion 0.81 µs, DMA unmapping 0.23 µs, copy-to-user 0.21 µs
     • Total latency: 12.82 µs

 10. Page Allocation / DMA Mapping
     • In the vanilla read path, page allocation (0.19 µs) and DMA mapping (0.29 µs) are performed on the critical path, before the I/O is submitted to the device (device I/O: 7.26 µs)

 11. Asynchronous Page Allocation / DMA Mapping
     • DMA-mapped page pool: each core keeps 64 pre-allocated, pre-DMA-mapped 4 KB pages
     • The read path takes a page from its per-core pool instead of allocating (0.19 µs) and DMA-mapping (0.29 µs) a page on the critical path

 12. Asynchronous Page Allocation / DMA Mapping
     • Page pool allocation costs only 0.016 µs, replacing page allocation (0.19 µs) and DMA mapping (0.29 µs) on the submission path
     • The pool is refilled asynchronously (page refill), off the critical path
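A minimal user-space sketch of the idea, assuming plain aligned_alloc() stands in for page allocation and a dummy dma_map() stands in for the real IOMMU mapping call: each core keeps a small stack of pages that were allocated and DMA-mapped ahead of time, so the submission path only pops a ready page, and refilling happens off the critical path.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE  4096
#define POOL_PAGES 64              /* 64 pre-mapped pages per core, as on the slide */

/* Stand-in for the real DMA mapping call; returns a fake device address. */
static uint64_t dma_map(void *page) { return (uint64_t)(uintptr_t)page; }

struct pooled_page {
    void     *vaddr;               /* virtual address of the 4 KB page */
    uint64_t  dma_addr;            /* device-visible address, mapped in advance */
};

struct page_pool {                 /* one instance per core */
    struct pooled_page pages[POOL_PAGES];
    int top;                       /* number of ready pages */
};

/* Refill runs off the critical path, e.g. while waiting for a completion. */
static void pool_refill(struct page_pool *p)
{
    while (p->top < POOL_PAGES) {
        void *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
        if (!page)
            break;
        p->pages[p->top].vaddr    = page;
        p->pages[p->top].dma_addr = dma_map(page);   /* mapping cost paid here */
        p->top++;
    }
}

/* The submission path only pops a ready page (0.016 µs on the slide) instead
 * of paying 0.19 µs for allocation plus 0.29 µs for DMA mapping. */
static int pool_get(struct page_pool *p, struct pooled_page *out)
{
    if (p->top == 0)
        pool_refill(p);            /* slow path, rarely taken */
    if (p->top == 0)
        return -1;                 /* out of memory */
    *out = p->pages[--p->top];
    return 0;
}

int main(void)
{
    static struct page_pool pool0; /* the pool belonging to core 0 */
    struct pooled_page pg;
    pool_refill(&pool0);           /* done ahead of time, off the critical path */
    if (pool_get(&pool0, &pg) == 0) {
        /* pg.vaddr receives the read data; pg.dma_addr goes into the NVMe command */
    }
    return 0;
}
```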

 13. Page Cache Insertion
     • In the vanilla path, the page is inserted into the page cache tree before the I/O is issued
       − Insertion costs 0.33 µs on the critical path (page cache lookup overhead plus tree extension overhead)
       − Early insertion prevents duplicated I/O requests: a second request for the same file index hits the cache and waits for the page read instead of issuing its own I/O

 14. Lazy Page Cache Insertion
     • On a cache miss, the I/O request is made first; the page is inserted into the page cache lazily, off the critical path
     • If two requests miss on the same file index, both issue I/Os; only one lazy insertion succeeds and the loser frees its page
       − Duplicated I/O requests occur at extremely low frequency
     • Page cache insertion (0.35 µs) is now overlapped with the device I/O
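A simplified sketch of the reordering, assuming a toy array-indexed cache and a stub issue_io() in place of the real page cache tree and block layer: the I/O is issued on a miss without inserting the page first, the insertion is attempted afterwards, and if another thread won the race for the same index the local page is freed and the cached page is used.

```c
#include <pthread.h>
#include <stdlib.h>

#define CACHE_SLOTS 1024               /* toy page cache, indexed by page index */

static void *cache[CACHE_SLOTS];
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for submitting the 4 KB device read into the given page. */
static void issue_io(unsigned long index, void *page) { (void)index; (void)page; }

static void *cache_lookup(unsigned long index)
{
    return cache[index % CACHE_SLOTS];
}

/* Attempt the deferred insertion; returns whichever page ends up cached. */
static void *cache_insert_lazy(unsigned long index, void *page)
{
    void *winner;
    pthread_mutex_lock(&cache_lock);
    if (cache[index % CACHE_SLOTS] == NULL)
        cache[index % CACHE_SLOTS] = page;   /* our insertion succeeded */
    winner = cache[index % CACHE_SLOTS];
    pthread_mutex_unlock(&cache_lock);
    return winner;
}

static void *read_page(unsigned long index)
{
    void *page = cache_lookup(index);
    if (page)
        return page;                         /* cache hit */

    /* Miss: issue the I/O right away, without inserting into the cache first. */
    page = aligned_alloc(4096, 4096);
    if (!page)
        return NULL;
    issue_io(index, page);

    /* Deferred insertion; a concurrent reader may have inserted its own page. */
    void *cached = cache_insert_lazy(index, page);
    if (cached != page)
        free(page);   /* lost the race: a duplicate I/O happened (extremely rare) */
    return cached;
}

int main(void)
{
    void *p1 = read_page(42);
    void *p2 = read_page(42);                /* second read now hits the cache */
    return (p1 == p2) ? 0 : 1;
}
```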

 15. DMA Unmapping
     • In the vanilla read path, DMA unmapping (0.23 µs) is performed on the critical path after the device I/O (7.26 µs) completes

 16. Lazy DMA Unmapping
     • Implementation
       − Delays DMA unmapping until the system is idle or waiting for another I/O request
       − Extended version of the deferred protection scheme in Linux [ASPLOS '16]
       − Optionally disabled for safety
     • Lazy DMA unmapping (0.35 µs) is moved off the critical path
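The deferral can be modeled as a small per-core list of completed mappings that is drained when the CPU is idle or already waiting on another I/O; dma_unmap() below is a placeholder for the real unmapping call, and the eager fallback on a full list is an assumption of this sketch, not part of the slide.

```c
#include <stdint.h>

#define DEFER_MAX 128

/* Placeholder for the real DMA unmapping call. */
static void dma_unmap(uint64_t dma_addr) { (void)dma_addr; }

struct deferred_unmaps {              /* one instance per core */
    uint64_t addrs[DEFER_MAX];
    int      count;
};

/* Completion path: record the mapping instead of unmapping it immediately. */
static void lazy_dma_unmap(struct deferred_unmaps *d, uint64_t dma_addr)
{
    if (d->count == DEFER_MAX)        /* list full: fall back to eager unmapping */
        dma_unmap(dma_addr);
    else
        d->addrs[d->count++] = dma_addr;
}

/* Called when the core is idle or already waiting for another I/O. */
static void drain_deferred_unmaps(struct deferred_unmaps *d)
{
    for (int i = 0; i < d->count; i++)
        dma_unmap(d->addrs[i]);
    d->count = 0;
}

int main(void)
{
    static struct deferred_unmaps core0;
    lazy_dma_unmap(&core0, 0x1000);   /* on I/O completion */
    drain_deferred_unmaps(&core0);    /* later, off the critical path */
    return 0;
}
```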

 17. Remaining Overheads in the Proposed Read Path
     • BIO submission: 0.72 µs
     • NVMe I/O submission: 0.37 µs
     • Request completion: 0.81 µs
     • These block-layer and driver costs still sit on the critical path around the 7.26 µs device I/O

 18. Agenda
     • Read path
       − Analysis of vanilla read path
       − Proposed read path
     • Light-weight block I/O layer
     • Write path
       − Analysis of vanilla write path
       − Proposed write path
     • Evaluation
     • Conclusion

 19. Linux Multi-queue Block I/O Layer
     • Structure conversion
       − submit_bio() takes a bio (LBA, length, page(s), ...) and merges it with a pending request via I/O merging, or assigns a new tag and request and converts the bio into a request (length, bio(s))
     • Multi-queue structure
       − Per-core software staging queues (SW queues) support I/O scheduling and reordering
       − Hardware dispatch queues (HW queues) deliver the I/O request to the device driver and its NVMe queue pairs
     • Multiple dynamic memory allocations
       − bio (block layer)
       − NVMe iod, scatter/gather list, NVMe PRP* list (device driver)
     *PRP: physical region page
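To make the per-request cost concrete, here is a heavily simplified model (not the real kernel definitions) of the objects the vanilla path touches for one 4 KB read: a dynamically allocated bio, a pre-allocated request selected by tag, and the driver-side iod with its scatter/gather and PRP lists, each its own allocation on the critical path.

```c
#include <stdint.h>
#include <stdlib.h>

/* Heavily simplified stand-ins for the structures named on the slide. */
struct bio      { uint64_t lba; uint32_t len; void *page; };
struct request  { uint32_t tag; uint32_t len; struct bio *bio; };
struct nvme_iod { struct request *req; uint64_t *sg_list; uint64_t *prp_list; };

static struct request request_table[256];   /* pre-allocated, indexed by tag */

/* One 4 KB read in the vanilla path: a bio is allocated, converted into a
 * tagged request, and the driver then allocates its own iod, scatter/gather
 * list, and PRP list, all before the NVMe command is issued. */
static struct nvme_iod *vanilla_submit(uint64_t lba, void *page, uint32_t tag)
{
    struct bio *bio = malloc(sizeof(*bio));             /* allocation 1: bio */
    bio->lba = lba; bio->len = 4096; bio->page = page;

    struct request *req = &request_table[tag];          /* structure conversion */
    req->tag = tag; req->len = bio->len; req->bio = bio;

    struct nvme_iod *iod = malloc(sizeof(*iod));        /* allocation 2: iod */
    iod->req      = req;
    iod->sg_list  = malloc(sizeof(uint64_t));           /* allocation 3: s/g list */
    iod->prp_list = malloc(sizeof(uint64_t));           /* allocation 4: PRP list */
    return iod;
}

int main(void)
{
    static char page[4096];
    struct nvme_iod *iod = vanilla_submit(0, page, 7);
    (void)iod;
    return 0;
}
```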

 20. Linux Multi-queue Block I/O Layer
     • Structure conversion
       − Inefficiency of I/O merging [Zhang, OSDI '18]: merging is a useful feature for low-performance storage devices, but adds overhead for ultra-low latency SSDs
     • Multi-queue structure
     • Multiple dynamic memory allocations

 21. Linux Multi-queue Block I/O Layer
     • Structure conversion
       − Inefficiency of I/O merging [Zhang, OSDI '18]
     • Multi-queue structure
       − Inefficiency of I/O scheduling for low-latency SSDs [Saxena, ATC '10] [Xu, SYSTOR '15]; the default configuration is the noop scheduler
       − Prior work bypasses the multi-queue structure [Zhang, OSDI '18] or relies on device-side I/O scheduling [Peter, OSDI '14] [Joshi, HotStorage '17]
     • Multiple dynamic memory allocations

 22. Light-weight Block I/O Layer
     • Light-weight bio (lbio) structure
       − Contains only the essential arguments to make an NVMe I/O request: LBA, length, prp_list, page(s), dma_addr(s)
       − Eliminates unnecessary structure conversions and allocations; submit_lbio() goes directly to the device driver
       − Pages come from the per-core DMA-mapped page pool
     • Per-CPU lbio pool
       − Supports lockless lbio object allocation
       − Supports the tagging function
     • Single dynamic memory allocation
       − Only the NVMe PRP* list is dynamically allocated (device driver)
     *PRP: physical region page
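A simplified model of the lbio design, with illustrative sizes and hypothetical field names: the lbio carries only what the NVMe command needs, objects come from a statically allocated per-core array so allocation needs no lock, the array index doubles as the tag, and only the PRP list would still require a dynamic allocation.

```c
#include <stdint.h>

#define NR_CORES       8               /* illustrative sizes, not the paper's */
#define LBIO_PER_CORE  64
#define LBIO_MAX_PAGES 8

/* Light-weight bio: only what is needed to build the NVMe command. */
struct lbio {
    uint64_t  lba;
    uint32_t  length;
    uint64_t *prp_list;                    /* the single remaining dynamic allocation */
    void     *pages[LBIO_MAX_PAGES];
    uint64_t  dma_addrs[LBIO_MAX_PAGES];   /* pages come pre-DMA-mapped */
    int       in_use;
};

/* Statically allocated per-core pool: allocation needs no lock, and the
 * position in the global array doubles as the command tag. */
static struct lbio lbio_pool[NR_CORES][LBIO_PER_CORE];

static struct lbio *lbio_get(int core)
{
    for (int i = 0; i < LBIO_PER_CORE; i++) {
        struct lbio *l = &lbio_pool[core][i];
        if (!l->in_use) {                  /* only this core allocates from here */
            l->in_use = 1;
            return l;
        }
    }
    return NULL;                           /* pool exhausted: caller waits and retries */
}

static uint32_t lbio_tag(int core, const struct lbio *l)
{
    return (uint32_t)(core * LBIO_PER_CORE + (int)(l - lbio_pool[core]));
}

int main(void)
{
    struct lbio *l = lbio_get(0);                 /* on the issuing core */
    uint32_t tag = l ? lbio_tag(0, l) : 0;        /* used to match the completion */
    (void)tag;
    return 0;
}
```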

 23. Read Path Comparison
     • Proposed read path before the light-weight block I/O layer: BIO submission 0.72 µs, NVMe I/O submission 0.37 µs, and request completion 0.81 µs remain on the critical path
     • Proposed read path with the light-weight block I/O layer: LBIO submission 0.13 µs and LBIO completion 0.65 µs replace them, further reducing latency
     • Device I/O: 7.26 µs in both cases

 24. Read Path Comparison
     • Vanilla read path: total latency 12.82 µs (device I/O 7.26 µs)
     • Proposed read path: total latency 10.10 µs (device I/O 7.26 µs)

 25. Agenda
     • Read path
       − Analysis of vanilla read path
       − Proposed read path
     • Light-weight block I/O layer
     • Write path
       − Analysis of vanilla write path
       − Proposed write path
     • Evaluation
     • Conclusion

 26. Analysis of Vanilla Fsync Path (Ext4 Ordered Mode)
     • sys_fsync() first writes back the dirty data blocks, then calls jbd2 to commit the journal transaction
     • CPU-side costs: data writeback 5.68 µs, jbd2 call 0.80 µs, journal block preparation 5.55 µs (allocating buffer pages, allocating journal area blocks, checksum computation, ...), commit block preparation 2.15 µs
     • Device I/Os, issued in order: data block I/O 12.57 µs, journal block I/O 12.73 µs, flush and commit block I/O 10.72 µs
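The stub-based sketch below (hypothetical helper names; only the sequencing and the measured costs follow the slide) shows why this path is slow: every CPU-side preparation step waits for the preceding device I/O, so roughly 14 µs of CPU work never overlaps the roughly 36 µs of device I/O. The proposed fsync path (slide 7) overlaps the journal and commit preparation with the preceding device I/Os.

```c
/* Hypothetical stubs; only the ordering and the costs follow the slide. */
static void writeback_data_blocks(void)       {}   /* CPU: 5.68 µs */
static void wait_data_block_io(void)          {}   /* device: 12.57 µs */
static void jbd2_call(void)                   {}   /* CPU: 0.80 µs */
static void prepare_journal_blocks(void)      {}   /* CPU: 5.55 µs */
static void wait_journal_block_io(void)       {}   /* device: 12.73 µs */
static void prepare_commit_block(void)        {}   /* CPU: 2.15 µs */
static void flush_and_wait_commit_block(void) {}   /* device: 10.72 µs */

/* Vanilla fsync in ext4 ordered mode: each device I/O waits for the CPU-side
 * preparation before it, and each preparation waits for the previous I/O,
 * so nothing overlaps. */
static void vanilla_fsync(void)
{
    writeback_data_blocks();
    wait_data_block_io();

    jbd2_call();
    prepare_journal_blocks();      /* buffer pages, journal area blocks, checksums */
    wait_journal_block_io();

    prepare_commit_block();
    flush_and_wait_commit_block(); /* cache flush, then the commit block */
}

int main(void)
{
    vanilla_fsync();
    return 0;
}
```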
