SLIDE 1

Optimizing Memory-mapped I/O for Fast Storage Devices

Anastasios Papagiannis¹,², Giorgos Xanthakis¹,², Giorgos Saloustros¹, Manolis Marazakis¹, and Angelos Bilas¹,²

Foundation for Research and Technology – Hellas (FORTH)¹ & University of Crete²

USENIX ATC 2020

SLIDE 2

Fast storage devices

  • Fast storage devices → Flash, NVMe
  • Millions of IOPS
  • < 10 μs access latency
  • Small I/Os are not as big an issue as with rotational disks
  • Require many outstanding I/Os for peak throughput

SLIDE 3

Read/write system calls

  • Read/write system calls + DRAM cache
  • Reduce accesses to the device
  • Kernel-space cache
  • Requires system calls also for hits
  • Used for raw (serialized) blocks
  • User-space cache
  • Lookups for hits + system calls only for misses
  • Application-specific (deserialized) data
  • A user-space cache removes system calls for hits
  • Hit lookups in user space introduce significant overhead [SIGMOD’08] (see the sketch below)
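To make the trade-off concrete, here is a minimal, hedged C sketch (not from the talk) of a direct-mapped user-space block cache in front of pread(). A hit is a pure user-space lookup with no system call; a miss pays one pread(). The cache itself is illustrative only: no write path, no eviction policy, and the caller is assumed to initialize all tags to -1.

#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define NSLOTS     1024                    /* direct-mapped, illustrative */

struct blk_cache {
        off_t tags[NSLOTS];                /* block held by each slot; -1 = empty */
        char  data[NSLOTS][BLOCK_SIZE];
};

/* Hit: a user-space lookup only. Miss: one pread() into the slot. */
static void *cached_read(struct blk_cache *c, int fd, off_t block)
{
        size_t slot = (size_t)(block % NSLOTS);

        if (c->tags[slot] != block) {      /* miss: go through the kernel */
                pread(fd, c->data[slot], BLOCK_SIZE,
                      block * BLOCK_SIZE);
                c->tags[slot] = block;
        }
        return c->data[slot];              /* hit path costs only the lookup */
}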


[Diagram: a user-space cache and a kernel-space cache layered above the device.]

SLIDE 4

Memory-mapped I/O

  • In memory-mapped I/O (mmio), hits are handled in hardware → MMU + TLB
  • Less overhead compared to a cache lookup
  • In mmio, a file is mapped into the virtual address space
  • Load/store processor instructions access the data
  • The kernel fetches/evicts pages on demand
  • Additionally, mmio removes
  • Serialization/deserialization
  • Memory copies between user and kernel (see the sketch below)
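The access path above can be illustrated with a minimal user-space C example (hedged: "data.bin" is a placeholder for a non-empty file, and error handling is abbreviated). After mmap(), hits are plain loads/stores resolved by the MMU and TLB; only a miss faults into the kernel, with no read()/write() calls or user/kernel copies.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        int fd = open("data.bin", O_RDWR);      /* placeholder file */

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        /* Map the whole file into the virtual address space. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        p[0] = 'x';                    /* store: at most one page fault,
                                          then pure memory accesses     */
        char last = p[st.st_size - 1]; /* load: no read() system call   */
        (void)last;

        munmap(p, st.st_size);
        close(fd);
        return 0;
}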

SLIDE 5

Disadvantages of mmio

  • Misses require a page fault instead of a system call
  • 4KB page size → small & random I/Os
  • With fast storage devices this is not a big issue
  • The Linux mmio path fails to scale with the number of threads

SLIDE 6

Mmio path scalability

[Plot: million page-faults/sec (IOPS) vs. number of threads (1–32) for Linux-Read and Linux-Write. Device: null_blk, dataset: 4TB, DRAM cache: 192GB.]

SLIDE 7

Mmio path scalability

[Plot: million page-faults/sec (IOPS) vs. number of threads (1–32) for Linux-Read and Linux-Write on kernels 4.14 and 5.4. Reads peak at 2M IOPS and writes at 1.3M IOPS, at a queue depth of ≈ 27. Device: null_blk, dataset: 4TB, DRAM cache: 192GB.]

SLIDE 8

FastMap

  • A novel mmio path that achieves high scalability and I/O concurrency
  • In the Linux kernel
  • Avoids all centralized contention points
  • Reduces CPU processing in the common path
  • Uses dedicated data structures to minimize interference

SLIDE 9

Mmio path scalability

[Plot: million page-faults/sec (IOPS) vs. number of threads (1–32) for Linux (4.14 and 5.4) and FastMap, reads and writes. FastMap achieves 3x higher IOPS for reads and 6x for writes. Device: null_blk, dataset: 4TB, DRAM cache: 192GB.]

SLIDE 10

Outline

  • Introduction
  • Motivation
  • FastMap design
  • Experimental analysis
  • Conclusions


SLIDE 12

FastMap design: 3 main techniques

  • Separates data structures that keep clean and dirty pages
  • Avoids all centralized contention points
  • Optimizes reverse mappings
  • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
  • Minimizes interference and reduces latency variability


SLIDE 14

Linux mmio design


[Diagram: VMA → address_space with a single page_tree protected by tree_lock; 126x more contended lock acquisitions, 155x more wait time.]

  • tree_lock is acquired for 2 main reasons:
  • Insert/remove elements in page_tree (lookups are lock-free via RCU)
  • Modify tags for a specific entry → used to mark a page dirty
SLIDE 15

FastMap design


  • Keep dirty pages in a separate data structure
  • Marking a page dirty/clean does not serialize insert/remove ops
  • Choose the data structure based on page_offset % num_cpus
  • Radix trees keep ALL cached pages → lock-free (RCU) lookups
  • Red-black trees keep ONLY dirty pages → sorted by device offset (see the sketch after the diagram below)

[Diagram: VMA → PFD (per-file data) holding per-core structures page_tree 0 … N-1 and dirty_tree 0 … N-1.]
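A rough C sketch of this per-core split, using stock kernel data types; the struct and helper names are illustrative, not FastMap's actual code. Each page offset hashes to one partition, so inserts/removals and dirty-marking on different partitions never touch the same lock.

#include <linux/radix-tree.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Illustrative per-core partition of the per-file data (PFD). */
struct pfd_partition {
        spinlock_t             lock;        /* protects this partition only */
        struct radix_tree_root page_tree;   /* ALL cached pages; lookups
                                               stay lock-free via RCU       */
        struct rb_root         dirty_tree;  /* ONLY dirty pages, sorted by
                                               device offset for writeback  */
};

/* Pick the partition as on the slide: page_offset % num_cpus. */
static inline struct pfd_partition *
pfd_part(struct pfd_partition *parts, pgoff_t offset, int num_cpus)
{
        return &parts[offset % num_cpus];
}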

SLIDE 16

FastMap design: 3 main techniques

  • Separates data structures that keep clean and dirty pages
  • Avoids all centralized contention points
  • Optimizes reverse mappings
  • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
  • Minimizes interference and reduces latency variability

SLIDE 17

Reverse mappings

  • Find which page table entries map a specific page
  • Page eviction → due to memory pressure or explicit writeback
  • Destroy mappings → munmap
  • Linux uses object-based reverse mappings
  • Executables and libraries (e.g. libc) introduce a large amount of sharing
  • Reduces DRAM consumption and housekeeping costs
  • Storage applications that use memory-mapped I/O
  • Require minimal sharing
  • Can be applied selectively to certain devices or files

SLIDE 18

Linux object-based reverse mappings


[Diagram: page → address_space → i_mmap tree of VMAs (each with its page table, PGD), protected by a read/write semaphore; each page keeps a _mapcount.]

  • _mapcount can still result in useless page table traversals
  • The rw-semaphore is acquired for read on all operations
  • Cross NUMA-node traffic
  • Many CPU cycles spent
SLIDE 19

FastMap full reverse mappings


[Diagram: each page keeps a per-core list of the exact (VMA, vaddr) pairs that map it.]

  • Full reverse mappings
  • Reduce CPU overhead
  • Efficient munmap
  • No ordering required → scalable updates
  • More DRAM required
  • Limited by the small degree of sharing in pages (see the sketch below)
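A hedged C sketch of what full reverse mappings can look like; the struct and field names are illustrative, not FastMap's code. Each cached page records the exact (VMA, virtual address) pairs that map it, so eviction and munmap locate the affected page table entries directly instead of walking every VMA that maps the file.

#include <linux/list.h>
#include <linux/mm_types.h>

/* One exact mapping of a cached page: enough to find its PTE directly. */
struct fm_rmap {
        struct vm_area_struct *vma;
        unsigned long          vaddr;   /* user virtual address */
        struct list_head       node;
};

/* Per-page state: the full list of mappers. Because mmio storage files
 * see little sharing, this list typically holds a single entry, which
 * bounds the extra DRAM the design costs. */
struct fm_page {
        struct page     *page;
        struct list_head rmaps;         /* all (vma, vaddr) pairs */
};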

SLIDE 20

FastMap design: 3 main techniques

  • Separates data structures that keep clean and dirty pages
  • Avoids all centralized contention points
  • Optimizes reverse mappings
  • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
  • Minimizes interference and reduces latency variability

SLIDE 21

Batched TLB invalidations

  • Under memory pressure, FastMap evicts a batch of clean pages
  • Cache-related operations
  • Page table cleanup
  • TLB invalidation
  • A TLB invalidation requires an IPI (Inter-Processor Interrupt)
  • Limits scalability [EuroSys’13, USENIX ATC’17, EuroSys’20]
  • FastMap issues a single TLB invalidation for the whole batch
  • Converts the batch to a range, including some unnecessary invalidations (see the sketch below)
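A rough kernel-style C sketch of the batching idea; evict_batch() and its surrounding logic are illustrative, while flush_tlb_mm_range() is a real x86 helper (its exact signature varies across kernel versions, so treat the call below as an assumption).

#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Illustrative eviction of a batch of clean pages: compute the covering
 * virtual-address range and issue ONE TLB shootdown (one set of IPIs)
 * instead of one per page. The range may cover addresses that were not
 * evicted, i.e., some unnecessary invalidations. */
static void evict_batch(struct mm_struct *mm,
                        unsigned long *vaddrs, int n)
{
        unsigned long start = ULONG_MAX, end = 0;
        int i;

        for (i = 0; i < n; i++) {
                /* ... cache cleanup and page table entry removal ... */
                start = min(start, vaddrs[i]);
                end   = max(end, vaddrs[i] + PAGE_SIZE);
        }

        /* Single invalidation for the whole batch. */
        flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
}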

SLIDE 22

Other optimizations in the paper

  • DRAM cache
  • Eviction/writeback operations
  • Implementation details

SLIDE 23

Outline

  • Introduction
  • Motivation
  • FastMap design
  • Experimental analysis
  • Conclusions

SLIDE 24

Testbed

  • 2x Intel Xeon E5-2630 v3 CPUs (2.4 GHz)
  • 32 hyper-threads
  • Different devices
  • Intel Optane SSD DC P4800X (375GB) for application workloads
  • null_blk for microbenchmarks
  • 256 GB of DDR4 DRAM
  • CentOS v7.3 with Linux 4.14.72

SLIDE 25

Workloads

  • Microbenchmarks
  • Storage applications
  • Kreon [ACM SoCC’18] – persistent key-value store (YCSB)
  • MonetDB – column oriented DBMS (TPC-H)
  • Extend available DRAM over fast storage devices
  • Silo [SOSP’13] – key-value store with scalable transactions (TPC-C)
  • Ligra [PPoPP’13] – graph algorithms (BFS)

SLIDE 26

FastMap Scalability

[Plot: million page-faults/sec (IOPS) vs. number of threads (1, 10, 20, 40, 80) for FastMap-Rd/Wr with and without SPF, and mmap-Rd/Wr; FastMap achieves up to 11.8x more IOPS. Testbed: 4x Intel Xeon E5-4610 v3 CPUs (1.7 GHz), 80 hyper-threads.]

SLIDE 27

FastMap execution time breakdown

[Plot: execution-time breakdown in #samples (x1000) for mmap-Read, mmap-Write, FastMap-Read, and FastMap-Write, split into mark_dirty, address-space, page-fault, and other.]

SLIDE 28

Kreon key-value store

  • Persistent key-value store based on LSM-tree
  • Designed to use memory-mapped I/O in the common path
  • YCSB with 80M records
  • 80GB dataset
  • 16GB DRAM

SLIDE 29

Kreon – 100% inserts

[Plot: execution time (sec) vs. number of cores (1–32) for FastMap and mmap, broken down into idle, iowait, kworker, pgfault, pthread, ycsb, kreon, and other; FastMap is up to 3.2x faster.]

SLIDE 30

Kreon – 100% lookups

[Plot: execution time (sec) vs. number of cores (1–32) for FastMap and mmap, broken down into idle, iowait, kworker, pgfault, ycsb, kreon, and other; FastMap is up to 1.5x faster.]

SLIDE 31

Batched TLB invalidations

  • Results with the Silo key-value store & TPC-C
  • TLB batching results in 25.5% more TLB misses
  • The improvement comes from fewer IPIs
  • 24% higher throughput
  • 23.8% lower average latency
  • Less time spent in flush_tlb_mm_range()
  • 20.3% → 0.1%

SLIDE 32

Conclusions

  • FastMap, an optimized mmio path in Linux
  • Scalable with number of threads & low CPU overhead
  • FastMap has significant benefits for data-intensive applications
  • Fast storage devices
  • Multi-core servers
  • Up to 11.8x more IOPS with 80 cores and null_blk
  • Up to 5.2x more IOPS with 32 cores and Intel Optane SSD

SLIDE 33

Optimizing Memory-mapped I/O for Fast Storage Devices

Anastasios Papagiannis
Foundation for Research and Technology – Hellas (FORTH) & University of Crete
email: apapag@ics.forth.gr
