SplitFS: Reducing Software Overhead in File Systems for Persistent Memory
Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap*, Taesoo Kim, Aasheesh Kolli, Vijay Chidambaram
* on the job market
Persistent Memory (PM): non-volatile, fast.
PM file system: the application accesses File 1, File 2, … File n on PM through the POSIX API (read(), write()), served by the PM file system.
Existing PM file systems: ext4-DAX, PMFS, NOVA, Strata.

ext4-DAX: a modification of the ext4 file system for Persistent Memory. It works with modern Linux kernels, is under active development by the ext4 community, and is the only PM file system that is widely used.
Time (ns) for a file operation on each PM file system (raw device latency: 700 ns):

device: 700
PMFS: 2450 (2.5x)
NOVA: 3021 (3x)
Strata: 4150 (5x)
ext4-DAX: 9002 (12x)

Everything above the 700 ns device latency is software overhead. File systems suffer from high software overhead! ext4-DAX, although widely used, suffers from the highest software overhead and provides weak guarantees.
SplitFS: a POSIX file system aimed at reducing software overhead for PM.
Serves data operations from user space and metadata operations from the kernel.
Provides strong guarantees such as atomic and synchronous data operations.
Reduces software overhead by up to 17x compared to ext4-DAX.
Improves application throughput by up to 2x compared to NOVA.
https://github.com/utsaslab/splitfs
SplitFS targets POSIX applications that use read()/write() system calls to access their data on Persistent Memory. SplitFS does not optimize for the case where multiple processes concurrently access the same file.
SplitFS lies both in user space and in the kernel. The design space for PM file systems:

Data in kernel, metadata in kernel: ext4-DAX, PMFS [EuroSys 14], NOVA [FAST 16]. Low performance.
Data in user, metadata in user: Strata [SOSP 17]. High complexity.
Data in user, metadata in user, allocations in kernel: Aerie [EuroSys 14]. High complexity, low performance.
Data in user, metadata in kernel: SplitFS. High performance, low complexity.
How SplitFS achieves high performance with low complexity:
Accelerate data operations from user space.
Use ext4-DAX for metadata operations.
Architecture: the application runs with U-Split in user space, above K-Split (ext4-DAX) in the kernel. Data operations (read(), write()) on files (File 1, File 2, File 3) are served by U-Split directly against PM; metadata operations (creat(), delete()) are passed to K-Split. U-Split also maintains a log on PM.

SplitFS accelerates common-case data operations while leveraging the maturity of ext4-DAX for metadata operations.
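The U-Split/K-Split split can be pictured as a dispatch on each POSIX call: data operations on tracked PM files are served from a user-space mapping, and everything else falls through to the kernel. A minimal sketch with hypothetical names (tracked_fd and tracked_base stand in for U-Split's per-file mapping table; the real system interposes on libc calls rather than exposing a split_write function):

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-ins for U-Split's mapping table: one tracked file
 * whose contents are reachable through a user-space mapping. */
static char   tracked_base[4096]; /* would be a DAX mmap of the file */
static int    tracked_fd  = 100;  /* fd U-Split associates with a PM file */
static size_t tracked_off = 0;    /* current write offset */

/* Data path: if the fd belongs to a tracked PM file, service the write
 * with plain memory stores and never enter the kernel; otherwise fall
 * back to the real write() so K-Split (ext4-DAX) handles it. */
ssize_t split_write(int fd, const void *buf, size_t count) {
    if (fd == tracked_fd && tracked_off + count <= sizeof(tracked_base)) {
        memcpy(tracked_base + tracked_off, buf, count);
        tracked_off += count;
        return (ssize_t)count;        /* fast path: no syscall */
    }
    return write(fd, buf, count);     /* slow path: kernel */
}
```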
SplitFS uses logging and out-of-place updates to provide atomic and synchronous operations.
How SplitFS handles reads and updates: on the first access to a file, U-Split asks K-Split (ext4-DAX) to mmap it. A DAX mmap maps PM pages directly into the application's address space, so subsequent reads become loads and updates become stores on the mapping.

In the common case, file reads and updates do not pass through the kernel.
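Once the file is mapped, the common-case read path amounts to a bounds-checked memcpy out of the mapping. A sketch, assuming map and filesz describe a file that has already been DAX-mmaped (names are illustrative):

```c
#include <string.h>
#include <sys/types.h>

/* Serve a read of `len` bytes at offset `off` from an existing mapping
 * of the file; with DAX, the memcpy loads directly from PM with no
 * kernel entry. Returns the number of bytes read. */
ssize_t mmap_read(const char *map, size_t filesz, size_t off,
                  void *dst, size_t len) {
    if (off >= filesz) return 0;                 /* read past EOF */
    if (off + len > filesz) len = filesz - off;  /* truncate at EOF */
    memcpy(dst, map + off, len);
    return (ssize_t)len;
}
```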
How SplitFS handles appends (example: file foo, size 10, with its inode in the kernel):

1. At application start, U-Split obtains a pre-allocated staging file (e.g., size 100) from the kernel and mmaps it.
2. append(foo, "abc"): the data is written to the staging file with stores; the kernel is not entered.
3. read(foo): served with loads, including the appended data in the staging file.
4. fsync(foo): U-Split invokes relink(), which moves the appended blocks from the staging file to foo within a single ext4-journal transaction; no data is copied.

In the common case, file appends do not pass through the kernel.
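relink() can be thought of as moving block pointers between two per-file block maps under one journal transaction. A toy sketch of that bookkeeping (the inode_map struct and fixed-size arrays are illustrative; the real relink manipulates ext4 extent trees):

```c
/* Toy model of an inode's block mapping: an ordered list of physical
 * block numbers. SplitFS itself operates on ext4 extent trees. */
#define NBLK 8
typedef struct { int blocks[NBLK]; int nblocks; } inode_map;

/* Move the first `count` blocks of the staging file to the end of the
 * target file. Only pointers move; the data on PM is never copied.
 * In SplitFS this happens inside a single ext4 journal transaction,
 * which is what makes the append visible atomically at fsync(). */
void relink(inode_map *staging, inode_map *target, int count) {
    for (int i = 0; i < count; i++)
        target->blocks[target->nblocks++] = staging->blocks[i];
    for (int i = count; i < staging->nblocks; i++)  /* compact staging */
        staging->blocks[i - count] = staging->blocks[i];
    staging->nblocks -= count;
}
```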
Guarantees by mode (yes = provided):

Mode   | Metadata Atomicity | Synchronous Operations | Data Atomicity | File System
POSIX  | yes                | no                     | no             | ext4-DAX, SplitFS-POSIX
Sync   | yes                | yes                    | no             | PMFS, SplitFS-Sync
Strict | yes                | yes                    | yes            | NOVA, Strata, SplitFS-Strict

Optimized logging is used to provide the stronger guarantees in sync and strict modes.
SplitFS employs a per-application log in sync and strict modes, which logs every logical operation. Each application gets its own U-Split instance in its chosen mode (e.g., App 1 with U-Split-POSIX, App 2 with U-Split-sync, App 3 with U-Split-strict), all backed by a shared K-Split (ext4-DAX): data paths go directly to the files on PM, metadata paths go through the kernel.
Technique | Benefit
SplitFS architecture | Low-overhead data operations, correct metadata operations
Staging + relink | Optimized appends, no data copy
Optimized logging + out-of-place writes | Stronger guarantees
Setup:
File systems compared:
Time (ns) with SplitFS-strict included:

device: 700
SplitFS-strict: 1251 (0.8x)
PMFS: 2450 (2.5x)
NOVA: 3021 (3x)
Strata: 4150 (5x)
ext4-DAX: 9002 (12x)
Workloads evaluated:
Microbenchmarks: Seq reads, Rand reads, Seq writes, Rand writes, Appends
Data intensive: YCSB on LevelDB, TPCC on SQLite, Redis
Metadata intensive: Tar, Git, Rsync

YCSB: Yahoo! Cloud Serving Benchmark, an industry-standard macro-benchmark. Insert 5M keys, run 5M operations; key size = 16 bytes, value size = 1K.

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

[Chart: throughput of SplitFS-Strict normalized to NOVA across the eight workloads; annotated absolute throughputs: Load A 13.39 kops/s, Run A 32.24 kops/s, Run B 139.94 kops/s, Run C 174.85 kops/s, Run D 191.54 kops/s, Load E 13.59 kops/s, Run E 17.75 kops/s, Run F 66.54 kops/s]

Read-heavy workloads are optimized because reads are converted to loads. Write-heavy workloads are optimized because of staging and relink.
SplitFS introduces a new architecture for building PM file systems that reduces software overhead, provides strong guarantees, and leverages the widely-used ext4-DAX.

https://github.com/utsaslab/splitfs
YCSB on LevelDB, SplitFS-Strict vs Strata: insert 5M keys, run 1M operations; key size = 16 bytes, value size = 1K. Workloads Load A through Run F as above.

[Chart: throughput of SplitFS-Strict normalized to Strata across the eight workloads]