Optimizing Memory-mapped I/O for Fast Storage Devices

  1. Optimizing Memory-mapped I/O for Fast Storage Devices. Anastasios Papagiannis (1,2), Giorgos Xanthakis (1,2), Giorgos Saloustros (1), Manolis Marazakis (1), and Angelos Bilas (1,2). Foundation for Research and Technology – Hellas (FORTH) (1) & University of Crete (2). USENIX ATC 2020.

  2. Fast storage devices • Fast storage devices → Flash, NVMe • Millions of IOPS • < 10 μs access latency • Small I/Os are not as big an issue as with rotational disks • Require many outstanding I/Os for peak throughput

  3. Read/write system calls [diagram: application cache in user space, page cache in kernel space, storage device below] • Read/write system calls + DRAM cache • Reduce accesses to the device • Kernel-space cache: requires system calls also for hits; used for raw (serialized) blocks • User-space cache: lookups for hits + system calls only for misses; holds application-specific (deserialized) data • A user-space cache removes system calls for hits • Hit lookups in user space introduce significant overhead [SIGMOD’08]
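To make the hit/miss trade-off above concrete, here is a minimal user-space sketch, not taken from the paper: a tiny direct-mapped block cache in front of pread(). All names (cache_init, cache_get, BLOCK_SIZE, CACHE_BLOCKS) are illustrative.

```c
/* Minimal sketch (illustrative, not from the paper): a user-space block cache
 * in front of read-style system calls. Hits avoid the kernel entirely but
 * still pay a software lookup; misses pay the lookup plus a pread(). */
#include <stdint.h>
#include <unistd.h>

#define BLOCK_SIZE   4096
#define CACHE_BLOCKS 1024              /* tiny direct-mapped cache */

struct cache_slot {
    int64_t block_no;                  /* -1 means empty */
    char    data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_BLOCKS];

static void cache_init(void)
{
    for (int i = 0; i < CACHE_BLOCKS; i++)
        cache[i].block_no = -1;
}

/* Return a pointer to the cached block, reading it from the device on a miss. */
static void *cache_get(int fd, int64_t block_no)
{
    struct cache_slot *slot = &cache[block_no % CACHE_BLOCKS];

    if (slot->block_no != block_no) {  /* miss: one system call */
        if (pread(fd, slot->data, BLOCK_SIZE, block_no * BLOCK_SIZE) != BLOCK_SIZE)
            return NULL;
        slot->block_no = block_no;
    }
    return slot->data;                 /* hit: no system call, only a lookup */
}
```

A kernel-space cache serves the same purpose but also charges a system call on every hit, which is the overhead the lookup-in-user-space design and, ultimately, memory-mapped I/O try to avoid.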

  4. Memory-mapped I/O • In memory-mapped I/O (mmio) hits are handled in hardware → MMU + TLB • Less overhead compared to a cache lookup • In mmio a file is mapped into the virtual address space • Load/store processor instructions access the data • The kernel fetches/evicts pages on demand • Additionally, mmio removes serialization/deserialization and memory copies between user and kernel
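The load/store model above maps directly to the standard POSIX calls; a minimal sketch follows (the file name and offsets are arbitrary, and error handling is trimmed for brevity):

```c
/* Minimal sketch of memory-mapped I/O: the file is mapped once, after which
 * plain loads and stores access it; the kernel fetches and evicts pages
 * on demand via page faults. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);          /* illustrative file name */
    struct stat st;
    fstat(fd, &st);

    /* One mmap() replaces per-access read()/write() system calls. */
    uint8_t *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

    uint8_t x = base[4096];                     /* load: may fault in a page */
    base[8192] = x;                             /* store: marks the page dirty */

    msync(base, st.st_size, MS_SYNC);           /* explicit writeback, optional */
    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```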

  5. Disadvantages of mmio • Misses require a page fault instead of a system call • 4KB page size → small & random I/Os • With fast storage devices this is not a big issue • The Linux mmio path fails to scale with #threads

  6. Mmio path scalability [chart: million page-faults/sec (IOPS) vs. number of threads (1–32) for Linux-Read and Linux-Write; device: null_blk, dataset: 4TB, DRAM cache: 192GB]

  7. Mmio path scalability [chart: million page-faults/sec (IOPS) vs. number of threads (1–32) for Linux-Read and Linux-Write on kernels 4.14 and 5.4; device: null_blk, dataset: 4TB, DRAM cache: 192GB; annotations: queue depth ≈ 27, 2M IOPS, 1.3M IOPS]

  8. FastMap • A novel mmio path that achieves high scalability and I/O concurrency • In the Linux kernel • Avoids all centralized contention points • Reduces CPU processing in the common path • Uses dedicated data structures to minimize interference

  9. Mmio path scalability [chart: million page-faults/sec (IOPS) vs. number of threads (1–32) comparing Linux (4.14 and 5.4) with FastMap; device: null_blk, dataset: 4TB, DRAM cache: 192GB; FastMap achieves 3x in reads and 6x in writes]

  10. Outline • Introduction • Motivation • FastMap design • Experimental analysis • Conclusions

  11. Outline • Introduction • Motivation • FastMap design • Experimental analysis • Conclusions

  12. FastMap design: 3 main techniques • Separates data structures that keep clean and dirty pages • Avoids all centralized contention points • Optimizes reverse mappings • Reduces CPU processing in the common path • Uses a scalable DRAM cache • Minimizes interference and reduces latency variability

  13. FastMap design: 3 main techniques • Separates data structures that keep clean and dirty pages • Avoids all centralized contention points • Optimizes reverse mappings • Reduces CPU processing in the common path • Uses a scalable DRAM cache • Minimizes interference and reduces latency variability

  14. Linux mmio design [diagram: VMAs map pages; the address_space's page_tree is protected by tree_lock; annotations: 126x contended lock acquisitions, 155x more page wait time] • tree_lock is acquired for 2 main reasons • Insert/remove elements from page_tree (lookups themselves are lock-free via RCU) • Modify tags for a specific entry → used to mark a page dirty
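A simplified user-space model of why this design serializes concurrent page faults; a pthread spinlock stands in for the kernel's tree_lock, and the structure and function names (file_mapping, insert_page, mark_page_dirty) are illustrative, not kernel code:

```c
/* Simplified user-space model of the Linux mmio path on slide 14 (not actual
 * kernel code). Every insert/remove on the shared page_tree and every
 * dirty-tag update funnels through the same per-file lock, so page faults
 * from many cores contend on it. */
#include <pthread.h>

struct page;                            /* opaque here */
struct radix_tree;                      /* stand-in for the kernel radix tree */

struct file_mapping {                   /* rough analogue of struct address_space */
    pthread_spinlock_t tree_lock;       /* the centralized contention point */
    struct radix_tree *page_tree;       /* all cached pages of the file */
};

void insert_page(struct file_mapping *m, unsigned long offset, struct page *p)
{
    pthread_spin_lock(&m->tree_lock);
    /* the radix-tree insert for (offset, p) would go here; lookups are
     * lock-free (RCU) and do not take the lock */
    pthread_spin_unlock(&m->tree_lock);
}

void mark_page_dirty(struct file_mapping *m, unsigned long offset)
{
    pthread_spin_lock(&m->tree_lock);
    /* setting the dirty tag for offset would go here: even marking a page
     * dirty serializes against all inserts and removals */
    pthread_spin_unlock(&m->tree_lock);
}
```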

  15. FastMap design [diagram: per-VMA PFD with N per-core page_tree and dirty_tree instances, indexed 0..N-1] • Keep dirty pages in a separate data structure • Marking a page dirty/clean does not serialize insert/remove ops • Choose the data structure based on page_offset % num_cpus • Radix trees keep ALL cached pages → lock-free (RCU) lookups • Red-black trees keep ONLY dirty pages → sorted by device offset
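A sketch of that partitioning, assuming one structure per core; the type and function names (per_core_pfd, fastmap_pfd, pick_partition) are illustrative, not FastMap's actual identifiers:

```c
/* Sketch of partitioned per-file data (slide 15), with illustrative names:
 * each of the N per-core entries holds a radix tree of ALL cached pages
 * (lock-free lookups) plus a red-black tree of ONLY the dirty pages (sorted
 * by device offset). A page is assigned to a partition by its offset. */
#include <pthread.h>

struct radix_tree;                      /* all cached pages of this partition */
struct rb_tree;                         /* dirty pages only, sorted by offset */

struct per_core_pfd {
    pthread_spinlock_t lock;            /* protects this partition only */
    struct radix_tree *page_tree;
    struct rb_tree    *dirty_tree;
};

struct fastmap_pfd {
    int                  num_cpus;
    struct per_core_pfd *percpu;        /* array of num_cpus partitions */
};

/* Marking a page dirty touches only one partition's dirty_tree, so it no
 * longer serializes insert/remove operations on the other partitions. */
static struct per_core_pfd *pick_partition(struct fastmap_pfd *pfd,
                                           unsigned long page_offset)
{
    return &pfd->percpu[page_offset % (unsigned long)pfd->num_cpus];
}
```

Keeping only the dirty pages in a red-black tree sorted by device offset presumably also lets eviction and writeback issue I/O in device order.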

  16. FastMap design: 3 main techniques • Separates data structures that keep clean and dirty pages • Avoids all centralized contention points • Optimizes reverse mappings • Reduces CPU processing in the common path • Uses a scalable DRAM cache • Minimizes interference and reduces latency variability

  17. Reverse mappings • Find out which page table entries map a specific page • Needed for page eviction → due to memory pressure or explicit writeback • And for destroying mappings → munmap • Linux uses object-based reverse mappings • Executables and libraries (e.g. libc) introduce a large amount of sharing • Object-based mappings reduce DRAM consumption and housekeeping costs • Storage applications that use memory-mapped I/O exhibit minimal sharing • Full reverse mappings can therefore be applied selectively to certain devices or files

  18. Linux object-based reverse mappings [diagram: each page holds _mapcount and points to its address_space; the address_space's i_mmap, protected by a read/write semaphore, links the VMAs (and their PGDs) that may map the page] • _mapcount can still result in useless page table traversals • The rw-semaphore is acquired for read on all operations • Cross NUMA-node traffic • Spends many CPU cycles
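A simplified user-space model of that walk; this is a rough analogue rather than the kernel's rmap code, a pthread rwlock stands in for the i_mmap rw-semaphore, and find_pte() is a hypothetical helper:

```c
/* Simplified model of the object-based reverse-mapping walk on slide 18
 * (not kernel code): finding the PTEs of a page takes the mapping's
 * rw-semaphore for read and probes the page table of every VMA in i_mmap,
 * including VMAs that turn out not to map this page at all. */
#include <pthread.h>
#include <stddef.h>

struct pte;                             /* opaque page-table entry */

struct vma {
    struct vma *next;                   /* stand-in for the i_mmap interval tree */
    void       *page_table_root;
};

struct file_mapping {                   /* analogue of struct address_space */
    pthread_rwlock_t i_mmap_sem;        /* taken for read on every rmap walk */
    struct vma      *i_mmap;            /* all VMAs mapping some part of the file */
};

struct page {
    int                  mapcount;      /* number of PTEs mapping this page */
    struct file_mapping *mapping;
    unsigned long        offset;        /* page offset within the file */
};

/* Hypothetical helper: probe one VMA's page table for this page's PTE. */
struct pte *find_pte(struct vma *v, struct page *pg);

void for_each_mapping_pte(struct page *pg, void (*fn)(struct pte *))
{
    struct file_mapping *m = pg->mapping;

    pthread_rwlock_rdlock(&m->i_mmap_sem);
    for (struct vma *v = m->i_mmap; v != NULL; v = v->next) {
        struct pte *p = find_pte(v, pg);    /* may be a useless traversal */
        if (p)
            fn(p);
    }
    pthread_rwlock_unlock(&m->i_mmap_sem);
}
```

FastMap sidesteps this walk by keeping explicit per-page entries, shown on the next slide.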

  19. FastMap full reverse mappings [diagram: each cached page keeps explicit (VMA, vaddr) entries pointing back to the mapping VMAs] • Full reverse mappings • Reduce CPU overhead • Efficient munmap • No ordering required → scalable updates • More DRAM required per core • Limited by the small degree of sharing in pages
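A sketch of the full reverse mapping, under the assumption that each cached page keeps an explicit list of (VMA, virtual address) pairs; the per-page locking a real implementation needs is omitted, and all names (rmap_entry, cached_page, rmap_add) are illustrative:

```c
/* Sketch of full reverse mappings (slide 19), with illustrative names: each
 * cached page records exactly where it is mapped, so eviction and munmap
 * walk a short per-page list instead of every VMA of the file. Per-page
 * locking is omitted for brevity. */
struct vma;                             /* opaque */

struct rmap_entry {
    struct vma        *vma;
    unsigned long      vaddr;           /* where this page is mapped in that VMA */
    struct rmap_entry *next;
};

struct cached_page {
    struct rmap_entry *rmaps;           /* usually a single entry: little sharing */
};

/* Adding an entry needs no global ordering across pages, so updates scale
 * across cores; the cost is extra DRAM per cached page. */
void rmap_add(struct cached_page *pg, struct rmap_entry *e,
              struct vma *vma, unsigned long vaddr)
{
    e->vma    = vma;
    e->vaddr  = vaddr;
    e->next   = pg->rmaps;
    pg->rmaps = e;
}
```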

  20. FastMap design: 3 main techniques • Separates data structures that keep clean and dirty pages • Avoids all centralized contention points • Optimizes reverse mappings • Reduces CPU processing in the common path • Uses a scalable DRAM cache • Minimizes interference and reduces latency variability

  21. Batched TLB invalidations • Under memory pressure FastMap evicts a batch of clean pages • Cache-related operations • Page table cleanup • TLB invalidation • TLB invalidation requires an IPI (Inter-Processor Interrupt) • Limits scalability [EuroSys’13, USENIX ATC’17, EuroSys’20] • Single TLB invalidation for the whole batch • Convert the batch to a range, at the cost of some unnecessary invalidations
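A sketch of the batching idea; flush_tlb_range_once() is a hypothetical stand-in for a range-flush primitive (the kernel's x86 helper flush_tlb_mm_range() is mentioned on slide 31), evict_clean_batch() is an illustrative name, and the cache and page-table cleanup steps are elided:

```c
/* Sketch of batched TLB invalidation (slide 21): instead of one IPI-backed
 * shootdown per evicted page, compute the covering virtual-address range of
 * the whole batch and issue a single range invalidation, accepting that the
 * range may also invalidate entries for pages that are not in the batch. */
#include <stddef.h>

/* Hypothetical stand-in for the kernel's range-flush primitive. */
void flush_tlb_range_once(unsigned long start, unsigned long end);

void evict_clean_batch(unsigned long *vaddrs, size_t n, unsigned long page_size)
{
    if (n == 0)
        return;

    unsigned long lo = vaddrs[0], hi = vaddrs[0];
    for (size_t i = 1; i < n; i++) {        /* covering range of the batch */
        if (vaddrs[i] < lo) lo = vaddrs[i];
        if (vaddrs[i] > hi) hi = vaddrs[i];
    }

    /* ... drop the pages from the DRAM cache and clear their PTEs here ... */

    /* One invalidation (one round of IPIs) instead of n of them. */
    flush_tlb_range_once(lo, hi + page_size);
}
```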

  22. Other optimizations in the paper • DRAM cache • Eviction/writeback operations • Implementation details

  23. Outline • Introduction • Motivation • FastMap design • Experimental analysis • Conclusions

  24. Testbed • 2x Intel Xeon E5-2630 v3 CPUs (2.4GHz) • 32 hyper-threads • Different devices • Intel Optane SSD DC P4800X (375GB) in workloads • null_blk in microbenchmarks • 256 GB of DDR4 DRAM • CentOS v7.3 with Linux 4.14.72

  25. Workloads • Microbenchmarks • Storage applications • Kreon [ACM SoCC’18] – persistent key-value store (YCSB) • MonetDB – column-oriented DBMS (TPC-H) • Extend available DRAM over fast storage devices • Silo [SOSP’13] – key-value store with scalable transactions (TPC-C) • Ligra [PPoPP’13] – graph algorithms (BFS)

  26. FastMap Scalability [chart: million page-faults/sec (IOPS) vs. #threads (1–80) on 4x Intel Xeon E5-4610 v3 CPUs (1.7 GHz, 80 hyper-threads), comparing FastMap-Rd/Wr, FastMap-Rd/Wr-SPF, and mmap-Rd/Wr; annotations: 11.8x, 37.4%, 32%, 25.4%, 7.6%]

  27. FastMap execution time breakdown [chart: #samples (x1000) for mmap-Read, mmap-Write, FastMap-Read, and FastMap-Write, broken down into mark_dirty, address-space, page-fault, and other]

  28. Kreon key-value store • Persistent key-value store based on LSM-tree • Designed to use memory-mapped I/O in the common path • YCSB with 80M records • 80GB dataset • 16GB DRAM

  29. Kreon – 100% inserts [chart: execution time (sec) vs. #cores (1–32) for FastMap and mmap, broken down into kreon, ycsb, others, pthread, pgfault, kworker, iowait, and idle; annotated improvement of 3.2x for FastMap]

  30. Kreon – 100% lookups [chart: execution time (sec) vs. #cores (1–32) for FastMap and mmap, broken down into kreon, ycsb, others, pgfault, kworker, iowait, and idle; annotated improvement of 1.5x for FastMap]

  31. Batched TLB invalidations (Silo key-value store, TPC-C) • TLB batching results in 25.5% more TLB misses • The improvement comes from fewer IPIs • 24% higher throughput • 23.8% lower average latency • Less time in flush_tlb_mm_range(): 20.3% → 0.1%

  32. Conclusions • FastMap, an optimized mmio path in Linux • Scalable with the number of threads & low CPU overhead • FastMap has significant benefits for data-intensive applications • Fast storage devices • Multi-core servers • Up to 11.8x more IOPS with 80 cores and null_blk • Up to 5.2x more IOPS with 32 cores and an Intel Optane SSD
