Strata: A Cross Media File System
Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson
Let's build a fast server
NoSQL store, Database, File server, Mail server …
Device       Latency   $/GB
DRAM         100 ns    8.6
NVM (soon)   300 ns    4.0
SSD          10 us     0.25
HDD          10 ms     0.02
Toward DRAM: better performance. Toward HDD: higher capacity.
NVM: Byte-addressable: cache-line granularity IO
SSD: Large erasure blocks need to be sequentially written. Random writes: 5~6x slowdown due to GC [FAST’15]
Kernel file system: NOVA [FAST 16, SOSP 17]
Stack: Application / Kernel file system / NVM
Small, random IO is slow!
[Figure: 1 KB IO latency (us), split into "Write to device" and "Kernel code"; annotated 91%]
Need huge capacity, but NVM alone is too expensive! ($40K for 10TB)
Stack: Application / Kernel file system / NVM
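The $40K figure can be checked against the $/GB column of the device table earlier in the talk. The sketch below is only a worked calculation; the function name `capacity_cost` and the convention 1 TB = 1000 GB are assumptions made for illustration.

```python
# $/GB figures copied from the device table in this talk.
PRICE_PER_GB = {"DRAM": 8.6, "NVM": 4.0, "SSD": 0.25, "HDD": 0.02}


def capacity_cost(device: str, terabytes: float) -> float:
    """Dollar cost of `terabytes` of capacity, assuming 1 TB = 1000 GB."""
    return PRICE_PER_GB[device] * terabytes * 1000


# Cost of a 10 TB store on each medium.
for device in PRICE_PER_GB:
    print(f"{device}: ${capacity_cost(device, 10):,.0f} for 10 TB")
```

At 4.0 $/GB, 10 TB of NVM comes to $40,000, which matches the number on the slide; the same capacity on HDD is two orders of magnitude cheaper.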
Block-level caching manages data in blocks, but NVM is byte-addressable
Stack: Application / Kernel file system / Block-level caching / NVM, SSD, HDD
[Figure: 1 KB IO latency (us), NOVA vs. block-level caching; lower is better]
[Figure: Crash vulnerabilities (axis 2 to 10) per application: SQLite, HDFS, ZooKeeper, LevelDB, HSQLDB, Mercurial, Git. Source: Pillai et al., OSDI 2014]
NVM is so fast that the kernel is the bottleneck
For low-cost capacity with high performance, must leverage multiple device types: NVM (soon), SSD, HDD
File systems write to the device only when they must (fsync), so applications struggle for crash consistency
Performance: especially small, random IO
Low-cost capacity: leverage NVM, SSD & HDD
Simplicity: intuitive crash consistency model
Performance: Kernel bypass, but private
Simplicity: Intuitive crash consistency
Apply log operations to shared data
Coordinate multi-process accesses
Stack: unmodified application / POSIX API / Strata: LibFS / NVM (kernel bypass)
Private operation log: creat, write, rename, …
File operations (data & metadata)
Synchronous IO
fsync() is no-op
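The LibFS slide can be pictured with a toy model. This is a minimal sketch, not Strata's code: `LogEntry`, `PrivateLog`, and the example paths are hypothetical names, and a Python list stands in for the NVM log region.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    op: str                          # "creat", "write", "rename", ...
    path: str
    data: bytes = b""
    new_path: Optional[str] = None   # only used by "rename"


@dataclass
class PrivateLog:
    """Toy per-process operation log; the list stands in for NVM."""
    entries: List[LogEntry] = field(default_factory=list)

    def append(self, op, path, data=b"", new_path=None):
        # Modelled as synchronous IO: the entry counts as durable
        # by the time append() returns.
        self.entries.append(LogEntry(op, path, data, new_path))

    def fsync(self):
        # No-op: every earlier operation is already in the log.
        return 0


log = PrivateLog()
log.append("creat", "/mail/inbox")
log.append("write", "/mail/inbox", b"hello")
log.append("rename", "/mail/inbox", new_path="/mail/archive")
print([e.op for e in log.entries], log.fsync())
```

Because both data and metadata operations go through the same ordered log, the order in which the application issued them is the order in which they become durable in this model.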
Stack: Application / POSIX API / Strata: LibFS / Strata: KernelFS
NVM: Private operation log, NVM Shared area
Write: to the private operation log in NVM
Digest (Background copy): from the private operation log to the NVM Shared area
Digest makes the private log visible to other applications
Digest turns the write-optimized log into a read-optimized format (extent tree)
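A digest can be sketched as replaying logged writes into a shared structure that is cheap to read. The code below is an illustration under loud assumptions: the tuple layout `(path, offset, data)` and the function name `digest` are hypothetical, and a flat `bytearray` per file stands in for the extent tree named on the slide.

```python
def digest(log, shared):
    """Replay (path, offset, data) writes from a private log into `shared`."""
    for path, offset, data in log:
        buf = shared.setdefault(path, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            # Grow the file with zero bytes up to the end of this write.
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data
    # Digested entries are no longer needed in the private log.
    log.clear()


private_log = [("/db", 0, b"aaaa"), ("/db", 2, b"BB"), ("/db", 4, b"cc")]
shared_area = {}
digest(private_log, shared_area)
print(bytes(shared_area["/db"]))  # b'aaBBcc'
```

Before the digest, only the writing process can see its updates; afterwards they sit in the shared area, where other processes can read them without scanning a log.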
Digest eliminates unneeded work
SQLite, Mail server: crash-consistent update using write-ahead logging
Private operation log: Create journal file, Write data to journal file, Write data to database file, Delete journal file
After digest, NVM Shared area: Write data to database file
Remove temporary durable writes
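The journal example above can be sketched as a small filter over the log. This is a simplified illustration, not Strata's algorithm: `coalesce` is a hypothetical name, entries are `(op, path)` pairs, and the sketch assumes a temporary file is created before it is deleted and is not recreated within the same log.

```python
def coalesce(log):
    """Drop every entry for files both created and deleted within this log."""
    created = {path for op, path in log if op == "create"}
    deleted = {path for op, path in log if op == "delete"}
    temporary = created & deleted
    return [(op, path) for op, path in log if path not in temporary]


log = [
    ("create", "journal"),
    ("write", "journal"),
    ("write", "database"),
    ("delete", "journal"),
]
print(coalesce(log))  # [('write', 'database')]
```

The journal file never reaches the shared area, so its create, write, and delete are dropped at digest time and only the database write remains.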
Stack: Application / Strata: LibFS / Strata: KernelFS
NVM: Private operation log, NVM Shared area (NVM data)
SSD Shared area, HDD Shared area
Digest: Logs to NVM data
… to lower layers
… collection overhead
Resembles log-structured merge (LSM) tree
Search order
1. Log data (Private OP log)
2. NVM data (NVM Shared area)
3. SSD data (SSD Shared area)
4. HDD data (HDD Shared area)
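A read under this search order can be sketched as a first-match lookup across layers. The sketch assumes the numbers on the slide mean private log first, then NVM, SSD, and HDD; the function `read`, the layer names, and the dictionaries standing in for each layer are all hypothetical.

```python
def read(path, layers):
    """Return (layer name, data) from the first layer that holds `path`."""
    for name, store in layers:
        if path in store:
            return name, store[path]
    raise FileNotFoundError(path)


# Layers listed in search order: newest data first, coldest data last.
layers = [
    ("log", {"/a": b"newest"}),
    ("nvm", {"/a": b"older", "/b": b"warm"}),
    ("ssd", {"/c": b"cool"}),
    ("hdd", {"/d": b"cold"}),
]

print(read("/a", layers))  # ('log', b'newest')
print(read("/d", layers))  # ('hdd', b'cold')
```

Searching the log first means a process always sees its own most recent writes, even before they have been digested into the shared areas.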
PMFS [EuroSys 14]: In-place update file system
Can Strata sustain high throughput?
when managing data across storage layers?
Compared systems: PMFS, NOVA, EXT4-DAX
[Figure: Latency (us, lower is better) by IO size (128 B, 1 KB, 4 KB, 16 KB) for Strata, PMFS, NOVA, EXT4-DAX; off-scale bar labels: 17, 21, 23, 29]
Avg.: 26% better
Tail: 43% better
[Figure: Latency (us, lower is better) for Write sync., Write seq., Write rand., Overwrite, Read rand. on Strata, PMFS, NOVA, EXT4-DAX; off-scale bar labels: 35.2, 49.2, 37.7]
25% better
Tied
asynchronous digests
Mail server workload from Filebench
[Figure: Throughput (op/s, 0K to 400K; higher is better) for Strata, PMFS, NOVA, EXT4-DAX]
29% better
Log coalescing eliminates 86% of log entries, saving 14 GB of IO
Digest eliminates unneeded work
Application / LibFS log: Create journal file, Write data to journal, Write data to database file, Delete journal file
After log coalescing, KernelFS: Write data to database file
Removes temporary durable writes
No kernel file system has both low latency and high throughput
Strata achieves both low latency and high throughput
File server workload from Filebench
Compared with: User-level migration, Block-level caching
[Figure: bars for Strata, User-level migration, Block-level caching; axis 0K to 10K]
2x faster than block-level caching
22% faster than user-level migration
Cross layer optimization: placing hot metadata in faster layers
Source code is available at https://github.com/ut-osa/strata
Server applications need fast, small random IO on vast datasets with intuitive crash consistency
Strata, a cross media file system, addresses these concerns
Performance: low latency, high throughput
Low-cost capacity: leverage NVM, SSD & HDD
Simplicity: intuitive crash consistency model
SSD, HDD prefer large sequential IO
For example, SSD random write: 5-6x difference by hardware GC
Sequential writes avoid management overhead
[Figure: SSD Throughput (MB/s, 250 to 1000) vs. SSD utilization (0.1 to 1); series: 64 MB, 128 MB, 256 MB, 512 MB, 1024 MB]
[Figure: Latency (us, lower is better) by RPC size (IO size: 1 KB, 4 KB, 64 KB) for Strata, PMFS, NOVA, EXT4-DAX, No persist; off-scale bar label: 98]
sending ACK to client
7x faster than EXT4-DAX
35% better