SLIDE 1

CompoundFS: Compounding I/O Operations in Firmware File Systems

Yujie Ren¹, Jian Zhang² and Sudarsun Kannan¹

¹ Rutgers University; ² ShanghaiTech University

SLIDE 2

Outline

  • Background
  • Analysis
  • Design
  • Evaluation
  • Conclusion

SLIDE 3

In-storage Processors Are Powerful

              Intel X25-M (2008)   Samsung 840 (2013)   Samsung 970 (2018)
  CPU:        2-core               3-core               5-core
  RAM:        128 MB DDR2          512 MB LPDDR2        1 GB LPDDR4
  Price:      $7.4/GB              $0.92/GB             $0.80/GB
  Latency:    ~70 µs               ~60 µs               ~40 µs
  B/W:        250 MB/s             500 MB/s             3300 MB/s

SLIDE 4

Software Latency Matters Now

[Figure: path of a write() call through the OS storage stack (Application → VFS Layer → Actual FS (ext4, PMFS) → Page Cache → Block I/O Layer → Device Driver), annotated with kernel traps, data copies, and OS overhead on the order of 1-4 µs per operation. Software overhead matters!]

SLIDE 5

Current Solutions

  • DirectFS designs (e.g., Strata, SplitFS, DevFS) reduce software overhead by bypassing the OS kernel partially or fully

[Figure: architectures of Strata (SOSP '17), SplitFS (SOSP '19), and DevFS (FAST '18), distinguishing data-plane from control-plane ops. Strata: Application → FS Lib → FS Server → Storage. SplitFS: Application → FS Lib → Storage for data-plane ops, with a kernel DAX FS on the control plane. DevFS: Application → FS Lib → Firmware FS inside Storage.]

SLIDE 6

Limitation of Current Solutions

  • DirectFS designs do not reduce boundary crossings
      • Strata needs boundary crossings between FS Lib and FS Server
      • SplitFS needs kernel traps for control-plane operations
      • DevFS suffers high PCIe latency for every operation
  • DirectFS designs do not efficiently reduce data copies
      • Current solutions need multiple data copies back and forth between the application and the storage stack
  • DirectFS designs do not utilize in-storage computation
      • Current solutions only use host CPUs for I/O-related operations
SLIDE 7

Outline

  • Background
  • Analysis
  • Design
  • Evaluation
  • Conclusion

SLIDE 8

Analysis Methodology

  • File systems
      • ext4-DAX: ext4 on byte-addressable storage, bypassing the page cache
      • SplitFS: direct-access file system bypassing the kernel for data-plane ops
  • Application
      • LevelDB: well-known persistent key-value store
      • db_bench: random-write and random-read benchmarks
  • Storage
      • Persistent memory emulated on DRAM, as in prior work (e.g., SplitFS)
SLIDE 9

LevelDB Overhead Breakdown

  • LevelDB spends significant time (~50%) in the OS storage stack
      • ~15% of time on data copies between the application and the OS
      • ~20% of time on application-level crash consistency (CRC of data)

[Figure: run-time breakdown (%) for value sizes of 256 and 4096 bytes on ext4-DAX and SplitFS, split into data allocation (OS), data copy (OS), file-system update (OS), lock (OS), data allocation (user), data copy (user), and CRC32 (user).]

SLIDE 10

Outline

  • Background
  • Analysis
  • Design
  • Evaluation
  • Conclusion

SLIDE 11

Our Solution: CompoundFS

  • Combine (compound) multiple file-system I/O ops into one
  • Offload I/O pre- and post-processing to storage-level CPUs
  • Bypass the OS kernel and provide direct access
SLIDE 12

Our Solution: CompoundFS

  • Combine (compound) multiple file-system I/O ops into one
      • e.g., a write() after a read() compounded into write-after-read()
      • Reduces boundary crossings between host and storage (e.g., syscalls)
  • Offload I/O pre- and post-processing to storage-level CPUs
      • e.g., a checksum() after a write() compounded into write-and-checksum()
      • Storage CPUs perform the computation (e.g., checksum) and persist the result
      • Reduces data-movement cost across boundaries
  • Bypass the OS kernel and provide direct access
      • Firmware file-system design provides direct access for data-plane and most control-plane operations

SLIDE 13

I/O-Only Compound Operations

Read-modify-write:

[Figure: traditional FS path vs. CompoundFS path. Traditionally, read(data) followed by write(data) costs 2 syscalls (kernel traps) and 2 data copies across the user/kernel boundary, with the modify step in user space. With CompoundFS, a single read_modify_write(data) request goes directly from user space to storage; StorageFS performs the compound operation in the device, and only 1 data copy with direct access is needed.]
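To make the two paths concrete, here is a minimal C sketch of the read-modify-write semantics. cfs_read_modify_write() is a hypothetical stand-in that runs on the host purely for illustration; in CompoundFS the read, modify, and write steps would execute inside StorageFS on a device CPU, so the data never round-trips to the host.

```c
/* Minimal sketch of read-modify-write semantics. cfs_read_modify_write()
 * is a hypothetical stand-in, NOT the real CompoundFS API: it emulates on
 * the host what StorageFS would execute inside the device. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Traditional path: 2 syscalls, data crosses the boundary twice. */
static int traditional_rmw(int fd, off_t off, size_t sz) {
    char buf[4096];
    if (pread(fd, buf, sz, off) < 0)  return -1; /* copy 1: storage -> user */
    memset(buf, 'X', sz);                        /* modify in user space    */
    if (pwrite(fd, buf, sz, off) < 0) return -1; /* copy 2: user -> storage */
    return 0;
}

/* Compound path: one request; the modify callback runs next to the data
 * (here on the host, in CompoundFS on a device CPU). */
static int cfs_read_modify_write(int fd, off_t off, size_t sz,
                                 void (*modify)(char *, size_t)) {
    char buf[4096];
    if (pread(fd, buf, sz, off) < 0) return -1;
    modify(buf, sz);
    return pwrite(fd, buf, sz, off) < 0 ? -1 : 0;
}

static void fill_x(char *b, size_t n) { memset(b, 'X', n); }

int main(void) {
    int fd = open("demo.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    pwrite(fd, "hello", 5, 0);
    traditional_rmw(fd, 0, 5);               /* 2 traps, 2 copies */
    cfs_read_modify_write(fd, 0, 5, fill_x); /* 1 request         */
    return close(fd);
}
```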

SLIDE 14

I/O + Compute Compound Operations

Write-and-checksum:

[Figure: traditional FS path vs. CompoundFS path. Traditionally, write(data) followed by write(checksum) costs 2 syscalls and 2 data copies, with the checksum computed in user space. With CompoundFS, a single write_and_checksum(data) request goes directly from user space to storage; StorageFS handles the checksum calculation in the device, and only 1 data copy with direct access is needed.]
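A sketch of the write-and-checksum semantics in the same spirit. The CRC32 helper and the tail placement of the checksum are assumptions for illustration (slide 15's Op3 also allows checksum_pos=head); in CompoundFS the CRC would be computed by StorageFS on a device CPU, so the application issues one request instead of two write() calls.

```c
/* Sketch of write-and-checksum semantics; write_and_checksum() is a
 * host-side stand-in, not the real API. In CompoundFS the device computes
 * the CRC next to the data and persists data + checksum together. */
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Minimal bitwise CRC32; StorageFS would run this on a device CPU. */
static uint32_t crc32_sw(const void *p, size_t n) {
    const uint8_t *b = p;
    uint32_t c = 0xFFFFFFFFu;
    while (n--) {
        c ^= *b++;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
    }
    return ~c;
}

/* One compound request instead of write(data) + write(checksum). Here the
 * checksum is appended after the data (a "tail" placement assumption). */
static int write_and_checksum(int fd, const void *data, size_t sz, off_t off) {
    uint32_t crc = crc32_sw(data, sz);              /* compute near the data */
    if (pwrite(fd, data, sz, off) < 0) return -1;   /* persist data          */
    if (pwrite(fd, &crc, sizeof crc, off + sz) < 0) /* persist checksum      */
        return -1;
    return 0;
}
```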

SLIDE 15

CompoundFS Architecture

[Figure: application threads issue plain ops, e.g. Op1 open(File1) -> fd1 and Op4 read(fd2, buf, off=30, sz=5), alongside compound ops Op2 read_modify_write(fd2, buf, off=30, sz=5) and Op3 write_and_checksum(fd1, buf, off=10, sz=1K, checksum_pos=head). UserLib (in the host) converts POSIX I/O syscalls into CompoundFS compound ops and places them in per-inode I/O queues with per-inode data buffers. StorageFS (in the device) runs I/O-request-processing threads on device CPU cores, compounds I/O ops (e.g., performing the CRC calculation before a write()), and maintains a journal (TxB/TxE records, metadata, NVM data block addresses) plus a credential table mapping CPU IDs to credentials.]
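As a concrete illustration of what travels through the per-inode I/O queue, here is a guessed command-descriptor layout in C. Every field name and the exact set of opcodes are assumptions inferred from the figure, not the actual CompoundFS wire format.

```c
/* Hypothetical compound-command descriptor for the shared per-inode I/O
 * queue; field names and sizes are illustrative assumptions only. */
#include <stdint.h>

enum cfs_opcode {
    CFS_OPEN,               /* Op1: open(File1) -> fd1               */
    CFS_READ,               /* Op4: plain read                       */
    CFS_READ_MODIFY_WRITE,  /* Op2: I/O-only compound op             */
    CFS_WRITE_AND_CHECKSUM, /* Op3: I/O + compute compound op        */
};

struct cfs_cmd {
    uint32_t opcode;        /* one of enum cfs_opcode                */
    uint32_t fd;            /* handle returned by CFS_OPEN           */
    uint64_t offset;        /* file offset, e.g. off=30              */
    uint64_t size;          /* payload size, e.g. sz=5 or sz=1K      */
    uint64_t buf_addr;      /* location in the per-inode data buffer */
    uint32_t checksum_pos;  /* e.g. head placement for Op3's CRC     */
    uint32_t cpu_id;        /* key into StorageFS's credential table */
};
```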

SLIDE 16

CompoundFS Implementation

  • Command-based architecture based on PMFS (EuroSys '14)
      • Control-plane ops (e.g., open) issued as commands via ioctl()
      • ioctl() carries the arguments for each I/O op
  • Avoids VFS overhead
      • Control-plane ops are issued via ioctl(), bypassing the VFS layer
  • Avoids system-call overhead
      • UserLib and StorageFS share a command buffer
      • UserLib adds requests to the command buffer
      • StorageFS processes requests from the buffer
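A minimal sketch of how a control-plane command might be issued, assuming an ioctl()-based command ABI. The device node /dev/compoundfs, the request code CFS_IOC_OPEN, and the argument struct are all invented for illustration; the talk only states that control-plane ops travel via ioctl() and skip the VFS layer.

```c
/* Sketch of issuing a control-plane open command via ioctl(); device
 * node, request code, and argument struct are illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct cfs_open_args {
    char     path[256];  /* file to open                  */
    uint32_t flags;      /* open flags                    */
    int32_t  out_fd;     /* handle filled in by StorageFS */
};

#define CFS_IOC_OPEN _IOWR('C', 1, struct cfs_open_args)

int main(void) {
    /* One control node instead of per-file VFS paths: the command goes
     * straight to the firmware FS, avoiding the VFS layer. */
    int dev = open("/dev/compoundfs", O_RDWR);
    if (dev < 0) { perror("open /dev/compoundfs"); return 1; }

    struct cfs_open_args args = { .flags = O_RDWR };
    strncpy(args.path, "/mnt/cfs/file1", sizeof args.path - 1);
    if (ioctl(dev, CFS_IOC_OPEN, &args) < 0) { perror("ioctl"); return 1; }
    printf("fd1 = %d\n", args.out_fd);
    return close(dev);
}
```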
SLIDE 17

CompoundFS Challenges

  • Crash-consistency model for compound I/O operations
  • All-or-nothing model (current solution)
      • An entire compound operation is one transaction
      • Partially completed operations cannot be recovered
      • e.g., in write-and-checksum, data may be persisted but not the checksum
  • All-or-something model (ongoing)
      • Fine-grained journaling and partial recovery are supported
      • Recovery could become complex
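A sketch of the all-or-nothing model for a write-and-checksum transaction, assuming invented journal helpers and an in-memory stand-in for the NVM journal region. The point it illustrates: the TxE record seals the transaction, so a crash never persists the data without its checksum, but nothing partial is recoverable either.

```c
/* All-or-nothing sketch: a compound op commits only when its TxE record
 * is durable. The journal region and barrier are stand-ins for NVM. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static unsigned char journal[4096]; /* stand-in for the NVM journal */
static size_t jtail;

static void journal_append(const void *rec, size_t sz) {
    memcpy(journal + jtail, rec, sz);
    jtail += sz;
}

static void persist_barrier(void) {
    /* stand-in for a persistence fence (e.g., clwb + sfence on NVM) */
}

enum { TXB = 1, TXE = 2 };

void txn_write_and_checksum(const void *data, size_t sz, uint32_t crc) {
    uint32_t txb = TXB, txe = TXE;
    journal_append(&txb, sizeof txb); /* 1. begin transaction         */
    journal_append(data, sz);         /* 2. log data                  */
    journal_append(&crc, sizeof crc); /* 3. log checksum              */
    persist_barrier();                /* 4. make steps 1-3 durable    */
    journal_append(&txe, sizeof txe); /* 5. TxE seals the transaction */
    persist_barrier();
    /* Recovery replays only transactions with a durable TxE: a crash
     * before step 5 discards data AND checksum together, but partially
     * completed work cannot be salvaged (hence "all or nothing"). */
}
```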
SLIDE 18

Outline

  • Background
  • Analysis
  • Design
  • Evaluation
  • Conclusion

SLIDE 19

Evaluation Goals

  • Effectiveness in reducing boundary crossings
  • Effectiveness in reducing data-copy overheads
  • Ability to exploit the compute capability of modern storage
SLIDE 20

Experimental Setup

  • Hardware platform
      • Dual-socket 64-core Xeon Scalable CPU @ 2.6 GHz
      • 512 GB Intel DC Optane NVM
  • Emulated firmware-level FS (see the sketch after this list)
      • Reserve dedicated device threads to handle I/O requests
      • Add PCIe latency for every I/O operation
      • Reduce CPU frequency to 1.2 GHz for device CPUs
  • State-of-the-art file systems
      • ext4-DAX (kernel-level file system)
      • SplitFS (user-level file system)
      • DevFS (device-level file system)
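A sketch of one way the per-operation PCIe latency could be injected in such an emulation: a calibrated busy-wait on the dedicated device threads. The delay constant and helper names are assumptions, not values or code from the paper.

```c
/* Sketch of injecting an emulated PCIe latency per I/O request on a
 * dedicated device thread; the 900 ns figure is an assumption. */
#include <stdint.h>
#include <time.h>

#define EMULATED_PCIE_DELAY_NS 900

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait rather than sleep: device threads are pinned and the target
 * delay is below usable sleep granularity. */
static void emulate_pcie_latency(void) {
    uint64_t deadline = now_ns() + EMULATED_PCIE_DELAY_NS;
    while (now_ns() < deadline)
        ; /* spin */
}
```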
SLIDE 21

Micro-Benchmarks

  • CompoundFS reduces unnecessary data movement and system-call overhead by combining operations
  • Even with slow device CPUs, CompoundFS can still provide gains for in-storage computation

[Figure: throughput (MB/s) of the read-modify-write and write-and-checksum micro-benchmarks for 256-byte and 4096-byte values, comparing ext4-DAX, SplitFS, DevFS, CompoundFS, and CompoundFS-slowCPU; CompoundFS achieves up to 2.1x and 1.25x speedups.]

SLIDE 22

LevelDB

  • CompoundFS also shows promising speedups in LevelDB

[Figure: db_bench random-write throughput (MB/s) and random-read latency (µs/op) for 512-byte and 4096-byte values with 500k keys, comparing ext4-DAX, SplitFS, DevFS, CompoundFS, and CompoundFS-slowCPU; up to 1.75x speedup.]

SLIDE 23

Conclusion

  • Storage hardware is moving into the microsecond era
      • Software overhead matters, and providing direct access is critical
      • Storage compute capability can benefit I/O-intensive applications
  • CompoundFS combines I/O ops and offloads computation
      • Reduces boundary-crossing (system-call) and data-copy overheads
      • Takes advantage of in-storage compute resources
  • Our ongoing work
      • Fine-grained crash-consistency mechanism
      • Efficient I/O scheduler for managing computation in storage
SLIDE 24

Thanks!

Questions?

yujie.ren@rutgers.edu