SLIDE 1

Strata: A Cross Media File System


Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

SLIDE 2

Let’s build a fast server

NoSQL store, Database, File server, Mail server …

Requirements:

  • Small updates (1 KB) dominate
  • Dataset scales up to 10 TB
  • Updates must be crash consistent

SLIDE 3

Storage diversification

  Device       Latency   $/GB
  DRAM         100 ns    8.6
  NVM (soon)   300 ns    4.0
  SSD          10 µs     0.25
  HDD          10 ms     0.02

Devices near the top offer better performance; devices near the bottom offer higher capacity per dollar.

  • NVM is byte-addressable: cache-line granularity IO
  • SSD: large erasure blocks need to be written sequentially; random writes suffer a 5-6x slowdown due to GC [FAST ’15]

SLIDE 4

A fast server on today’s file system

Requirements:

  • Small updates (1 KB) dominate
  • Dataset scales up to 10 TB
  • Updates must be crash consistent

Setup: application running on a kernel file system (NOVA [FAST ’16, SOSP ’17]) backed by NVM.

Small, random IO is slow!

[Chart: 1 KB IO latency (µs), split into “write to device” and “kernel code”; kernel code accounts for 91% of the latency]

NVM is so fast that the kernel is the bottleneck.

SLIDE 5

A fast server on today’s file system

Requirements:

  • Small updates (1 KB) dominate
  • Dataset scales up to 10 TB
  • Updates must be crash consistent

Need huge capacity, but NVM alone is too expensive! ($40K for 10 TB)

For low-cost capacity with high performance, we must leverage multiple device types.

SLIDE 6

A fast server on today’s file system

Requirements:

  • Small updates (1 KB) dominate
  • Dataset scales up to 10 TB
  • Updates must be crash consistent

Setup: application on a kernel file system with block-level caching across NVM, SSD, and HDD.

  • Block-level caching manages data in blocks, but NVM is byte-addressable
  • Extra level of indirection

[Chart: 1 KB IO latency (µs) for NOVA vs. block-level caching; lower is better]

Block-level caching is too slow.

SLIDE 7

A fast server on today’s file system

Requirements:

  • Small updates (1 KB) dominate
  • Dataset scales up to 10 TB
  • Updates must be crash consistent

[Chart: crash vulnerabilities (count, 0 to 10) found in SQLite, HDFS, ZooKeeper, LevelDB, HSQLDB, Mercurial, and Git; Pillai et al., OSDI 2014]

Applications struggle for crash consistency.

SLIDE 8

Problems in today’s file systems

  • Kernel mediates every operation
    • NVM is so fast that the kernel is the bottleneck
  • Tied to a single type of device (NVM (soon), SSD, HDD)
    • For low-cost capacity with high performance, must leverage multiple device types
  • Aggressive caching in DRAM; write to device only when you must (fsync)
    • Applications struggle for crash consistency

SLIDE 9

Strata: A Cross Media File System

Performance: especially small, random IO
  • Fast user-level device access

Low-cost capacity: leverage NVM, SSD & HDD
  • Transparent data migration across different storage media
  • Efficiently handle device IO properties

Simplicity: intuitive crash consistency model
  • In-order, synchronous IO
  • No fsync() required
SLIDE 10

Strata: main design principle

LibFS: log operations to NVM at user level
  • Performance: kernel bypass, but the log is private
  • Simplicity: intuitive crash consistency

KernelFS: digest and migrate data in kernel
  • Apply log operations to shared data
  • Coordinate multi-process accesses

SLIDE 11

Outline

  • LibFS: log operations to NVM at user level
    • Fast user-level access
    • In-order, synchronous IO
  • KernelFS: digest and migrate data in kernel
    • Asynchronous digest
    • Transparent data migration
  • Shared file access
  • Evaluation

SLIDE 12

Log operations to NVM at user-level

An unmodified application issues POSIX calls; Strata’s LibFS appends all file operations (data & metadata: creat, write, rename, …) to a private operation log in NVM, bypassing the kernel. A sketch of the append path follows the list.

  • Fast writes
    • Directly access fast NVM
    • Sequentially append data
    • Cache-line granularity
    • Blind writes
  • Crash consistency
    • On crash, the kernel replays the log
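To make the append path concrete, here is a minimal sketch of a cache-line-granularity log append with explicit persistence. The record layout, persist() helper, and log_append() are illustrative assumptions, not Strata’s actual API.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>   /* _mm_sfence  */
#include <emmintrin.h>   /* _mm_clflush */

#define CACHELINE 64

/* Hypothetical on-NVM record header; Strata's real layout differs. */
struct log_record {
    uint32_t type;       /* e.g. LOG_WRITE, LOG_CREAT, LOG_RENAME */
    uint32_t inum;       /* inode the operation applies to */
    uint64_t offset;     /* file offset for data writes */
    uint32_t len;        /* payload length in bytes */
    uint32_t seq;        /* monotonically increasing sequence number */
};

/* Flush a byte range to NVM at cache-line granularity. */
static void persist(const void *addr, size_t len)
{
    const char *p   = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1));
    const char *end = (const char *)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflush(p);   /* clwb on newer CPUs avoids evicting the line */
    _mm_sfence();         /* order the flushes before later stores */
}

/* Append one operation to the private log; durable when this returns. */
void log_append(char *log_tail, const struct log_record *hdr,
                const void *payload)
{
    memcpy(log_tail, hdr, sizeof(*hdr));
    memcpy(log_tail + sizeof(*hdr), payload, hdr->len);
    persist(log_tail, sizeof(*hdr) + hdr->len);
    /* A real implementation persists a commit/valid flag last, so a
       torn append is simply ignored during log replay after a crash. */
}
```

Sequential appends at cache-line granularity are exactly what NVM is good at, which is why the log can be both the fast path and the durability point.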

SLIDE 13

Intuitive crash consistency

LibFS performs synchronous IO to the NVM log with kernel bypass.

  • When each system call returns:
    • Data/metadata is durable
    • Updates are in order
    • Writes are atomic (up to the log size)
  • fsync() is a no-op (example below)

Fast synchronous IO: NVM and kernel bypass.
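From the application’s point of view, this model removes the usual fsync() dance. A minimal sketch under Strata’s synchronous-IO semantics; update_config() and its file names are hypothetical:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Crash-consistent replace-by-rename. Under Strata, every call below
   is durable and ordered when it returns, so no fsync() is needed. */
int update_config(const char *path, const void *buf, size_t len)
{
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    close(fd);
    /* On a conventional file system we would need fsync(fd) here, and
       often an fsync of the parent directory after the rename. */
    return rename(tmp, path);   /* durable and ordered on return */
}
```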

SLIDE 14

Outline

  • LibFS: log operations to NVM at user level
    • Fast user-level access
    • In-order, synchronous IO
  • KernelFS: digest and migrate data in kernel
    • Asynchronous digest
    • Transparent data migration
  • Shared file access
  • Evaluation

SLIDE 15

Digest data in kernel

[Diagram: application → LibFS (POSIX API) → private operation log; KernelFS manages the shared area; both regions live in NVM]

NVM holds two regions:
  • Private operation log
    • Private data, read/writable by LibFS
  • Shared area
    • Managed by KernelFS
    • Globally visible
    • Read-only to LibFS
SLIDE 16

Digest data in kernel

LibFS writes to the private operation log; KernelFS digests the log into the NVM shared area with a background copy.

Digest:
  • Visibility: make the private log visible to other applications
  • Data layout: turn the write-optimized log into a read-optimized format (extent tree, sketched below)
  • Large, batched IO
  • Coalesce the log
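The layout change is the heart of the digest. Below is a self-contained sketch of turning a time-ordered log into an offset-sorted per-file index; the structures, and the sorted array standing in for Strata’s extent tree, are illustrative assumptions:

```c
#include <stdint.h>

/* Write-optimized: records in append (time) order. */
struct log_rec { uint32_t inum; uint64_t off; uint32_t len; };

/* Read-optimized: extents sorted by file offset (a stand-in for
   Strata's per-inode extent tree). */
struct extent { uint64_t off; uint32_t len; uint64_t shared_addr; };

/* Digest one file's records into an offset-sorted extent list.
   Returns the number of extents produced. */
int digest_file(const struct log_rec *log, int n, uint32_t inum,
                struct extent *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (log[i].inum != inum)
            continue;
        /* Insertion sort by offset keeps the index sorted, so reads
           can binary-search it instead of scanning the log. */
        int j = m;
        while (j > 0 && out[j - 1].off > log[i].off) {
            out[j] = out[j - 1];
            j--;
        }
        out[j].off = log[i].off;
        out[j].len = log[i].len;
        out[j].shared_addr = 0;  /* would point into the shared area */
        m++;
    }
    return m;
}
```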

SLIDE 17

Digest optimization: Log coalescing

SQLite and mail servers make updates crash consistent using write-ahead logging:

  1. Create journal file
  2. Write data to journal file
  3. Write data to database file
  4. Delete journal file

Digest eliminates the unneeded work: coalescing removes the temporary durable writes, so only “write data to database file” reaches the shared area.

Throughput optimization: log coalescing saves IO while digesting.
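A self-contained sketch of the coalescing rule for this pattern: records of a file that is created and deleted within one digest batch never need to reach the shared area. The record layout and coalesce() are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdio.h>

enum op { CREAT, WRITE, UNLINK };
struct rec { enum op op; int inum; };

/* Mark keep[i] = false for every record whose inode dies in-batch. */
void coalesce(const struct rec *log, bool *keep, int n)
{
    for (int i = 0; i < n; i++)
        keep[i] = true;
    for (int i = 0; i < n; i++) {
        if (log[i].op != UNLINK)
            continue;
        for (int j = 0; j <= i; j++)   /* drop the whole lifetime */
            if (log[j].inum == log[i].inum)
                keep[j] = false;
    }
}

int main(void)
{
    /* journal file = inode 1, database file = inode 2 */
    struct rec log[] = { {CREAT, 1}, {WRITE, 1}, {WRITE, 2}, {UNLINK, 1} };
    bool keep[4];

    coalesce(log, keep, 4);
    for (int i = 0; i < 4; i++)
        printf("rec %d: %s\n", i, keep[i] ? "digest" : "skip");
    /* Only the database write survives, as on the slide. */
    return 0;
}
```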

SLIDE 18

Digest and migrate data in kernel

[Diagram: application → LibFS → private operation log; KernelFS → NVM shared area]

SLIDE 19

Digest and migrate data in kernel

Shared areas now span NVM, SSD, and HDD.

  • Low-cost capacity
    • KernelFS digests logs into the NVM shared area, then migrates cold data to lower layers (policy sketched below)
  • Handle device IO properties
    • Migrate 1 GB blocks
    • Avoid SSD garbage collection overhead

The design resembles a log-structured merge (LSM) tree.
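A minimal sketch of the cold-data policy, assuming an LRU choice over fixed migration units and a high-water mark; the unit structure, layer encoding, and threshold are illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* One fixed-size migration unit (the slide uses 1 GB blocks). */
struct unit {
    uint64_t last_access;   /* LRU timestamp */
    int layer;              /* 0 = NVM, 1 = SSD, 2 = HDD */
};

/* When NVM utilization passes a high-water mark, demote the
   least-recently-used unit one layer down. */
void migrate_cold(struct unit *units, size_t n, double nvm_util)
{
    if (nvm_util < 0.9)     /* illustrative threshold */
        return;

    struct unit *victim = NULL;
    for (size_t i = 0; i < n; i++)
        if (units[i].layer == 0 &&
            (victim == NULL || units[i].last_access < victim->last_access))
            victim = &units[i];

    if (victim != NULL)
        victim->layer = 1;  /* copy the whole unit to SSD sequentially;
                               large sequential writes sidestep SSD GC */
}
```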

SLIDE 20

Read: hierarchical search

Search order: (1) private operation log, (2) NVM shared area, (3) SSD shared area, (4) HDD shared area.
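A sketch of that lookup order; the lookup() helper and layer encoding are hypothetical:

```c
#include <stdint.h>

enum layer { IN_LOG, IN_NVM, IN_SSD, IN_HDD, NOT_FOUND };

/* Assumed per-layer index probe; returns nonzero on a hit. */
extern int lookup(enum layer where, uint32_t inum, uint64_t off);

/* The private log holds the newest data, so it is searched first,
   then shared areas from fastest to slowest device. */
enum layer find_block(uint32_t inum, uint64_t off)
{
    static const enum layer order[] = { IN_LOG, IN_NVM, IN_SSD, IN_HDD };

    for (int i = 0; i < 4; i++)
        if (lookup(order[i], inum, off))
            return order[i];
    return NOT_FOUND;
}
```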

SLIDE 21

Shared file access

  • Leases grant access rights to applications [SOSP ’89]
    • Function like locks, but revocable
    • Required for files and directories
    • Exclusive writer, shared readers (state machine sketched below)
  • On revocation, LibFS digests leased data
    • Private data is made public before losing the lease
    • Leases serialize concurrent updates
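A minimal sketch of the one-writer/many-readers lease state; the struct and lease_acquire() are illustrative assumptions, with revocation noted only in comments:

```c
/* Hypothetical lease state for one file or directory: one exclusive
   writer or many shared readers, revocable by KernelFS. */
enum lease_kind { LEASE_NONE, LEASE_READ, LEASE_WRITE };

struct lease {
    enum lease_kind kind;
    int readers;       /* holder count while kind == LEASE_READ */
    int writer_pid;    /* holder while kind == LEASE_WRITE      */
};

/* Returns 0 on grant; -1 means the caller must wait until KernelFS
   revokes the current holders. */
int lease_acquire(struct lease *l, int pid, enum lease_kind want)
{
    if (want == LEASE_WRITE) {
        if (l->kind != LEASE_NONE)
            return -1;              /* readers or writer must be revoked */
        l->kind = LEASE_WRITE;
        l->writer_pid = pid;
        return 0;
    }
    if (l->kind == LEASE_WRITE)
        return -1;                  /* writer digests its log, then yields */
    l->kind = LEASE_READ;
    l->readers++;
    return 0;
}
/* On revocation, the holder's LibFS digests its private log so the
   leased data is public before the lease moves on. */
```
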
SLIDE 22

Outline

  • LibFS: log operations to NVM at user level
    • Fast user-level access
    • In-order, synchronous IO
  • KernelFS: digest and migrate data in kernel
    • Asynchronous digest
    • Transparent data migration
  • Shared file access
  • Evaluation

SLIDE 23

Experimental setup

  • 2x Intel Xeon E5-2640 CPU, 64 GB DRAM
  • 400 GB NVMe SSD, 1 TB HDD
  • Ubuntu 16.04 LTS, Linux kernel 4.8.12
  • Emulated NVM
    • Uses 40 GB of DRAM
    • Performance model [Y. Zhang et al., MSST 2015]
    • Throttles latency & throughput in software (sketched below)
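A minimal sketch of how software throttling of DRAM-backed NVM emulation can work: spin until the modeled device time has elapsed. The constants and emulate_nvm_delay() are illustrative assumptions, not the exact MSST 2015 model:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static const uint64_t NVM_WRITE_NS     = 300;  /* modeled write latency   */
static const uint64_t NVM_BYTES_PER_NS = 8;    /* ~8 GB/s modeled bandwidth */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call after a persisted write of `len` bytes to the DRAM backing
   store; spins (never sleeps) because the delays are nanosecond-scale. */
void emulate_nvm_delay(uint64_t start_ns, size_t len)
{
    uint64_t target = start_ns + NVM_WRITE_NS + len / NVM_BYTES_PER_NS;

    while (now_ns() < target)
        ;  /* busy-wait to model the slower medium */
}
```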

SLIDE 24

Related work

  • NVM file systems
    • PMFS [EuroSys ’14]: in-place update file system
    • NOVA [FAST ’16]: log-structured file system
    • EXT4-DAX: NVM support for EXT4
  • SSD file systems
    • F2FS [FAST ’15]: log-structured file system
SLIDE 25

Evaluation questions

  • Latency
    • Does Strata efficiently support small, random writes?
    • Does asynchronous digest have an impact on latency?
  • Throughput
    • Strata writes data twice (logging and digesting); can Strata sustain high throughput?
    • How well does Strata perform when managing data across storage layers?

SLIDE 26

Microbenchmark: write latency

  • Strata logs to NVM
  • Compared to NVM kernel file systems: PMFS, NOVA, EXT4-DAX
  • Strata and NOVA: in-order, synchronous IO; atomic writes
  • PMFS and EXT4-DAX: no atomic writes

[Chart: write latency (µs) vs. IO size (128 B, 1 KB, 4 KB, 16 KB) for Strata, PMFS, NOVA, and EXT4-DAX; lower is better]

Result: average latency 26% better, tail latency 43% better.

SLIDE 27

Latency: LevelDB

  • LevelDB running on NVM
    • Key size: 16 B; value size: 1 KB; 300,000 objects
    • Workload causes asynchronous digests
  • Fast user-level logging
    • Random write: 25% better than PMFS
    • Random read: tied with PMFS

[Chart: latency (µs) for synchronous, sequential, and random writes, overwrites, and random reads across Strata, PMFS, NOVA, and EXT4-DAX; lower is better]

Low-latency IO despite the background digest.

SLIDE 28

Evaluation questions

  • Latency
    • Does Strata efficiently support small, random writes?
    • Does asynchronous digest have an impact on latency?
  • Throughput
    • Strata writes data twice (logging and digesting); can Strata sustain high throughput?
    • How well does Strata perform when managing data across storage layers?

SLIDE 29

Throughput: Varmail

Mail server workload from Filebench:
  • Using only NVM
  • 10,000 files
  • Read/write ratio is 1:1
  • Uses write-ahead logging (create journal, write journal, write database file, delete journal), so log coalescing applies: digest eliminates the temporary durable writes

[Chart: throughput (ops/s, 0K to 400K) for Strata, PMFS, NOVA, and EXT4-DAX; higher is better]

Results:
  • Strata is 29% better
  • Log coalescing eliminates 86% of log entries, saving 14 GB of IO

No kernel file system has both low latency and high throughput:
  • PMFS: better latency
  • NOVA: better throughput

Strata achieves both low latency and high throughput.

SLIDE 30

Throughput: data migration

File server workload from Filebench:
  • Working set starts in NVM and grows onto SSD, then HDD
  • Read/write ratio is 1:2

Baselines:
  • User-level migration: LRU at whole-file granularity, treating each file system as a black box (NVM: NOVA, SSD: F2FS, HDD: EXT4)
  • Block-level caching: Linux LVM cache, formatted with F2FS

[Chart: average throughput (ops/s, 0K to 10K) for Strata, user-level migration, and block-level caching; higher is better]

Results:
  • 2x faster than block-level caching
  • 22% faster than user-level migration
  • Cross-layer optimization: Strata places hot metadata in faster layers
SLIDE 31

Conclusion

Server applications need fast, small random IO on vast datasets with intuitive crash consistency. Strata, a cross media file system, addresses these concerns.

Performance: low latency, high throughput
  • Novel split of LibFS and KernelFS
  • Fast user-level access

Low-cost capacity: leverage NVM, SSD & HDD
  • Asynchronous digest
  • Transparent data migration with large, sequential IO

Simplicity: intuitive crash consistency model
  • In-order, synchronous IO

Source code is available at https://github.com/ut-osa/strata
SLIDE 32

Backup

SLIDE 33

Device management

SSD and HDD prefer large, sequential IO.

For example, SSD random writes show a 5-6x throughput difference caused by hardware GC; sequential writes avoid this management overhead.

[Chart: SSD throughput (MB/s, 250 to 1000) vs. SSD utilization (0.1 to 1.0) for write sizes of 64 MB, 128 MB, 256 MB, 512 MB, and 1024 MB]

SLIDE 34

Latency: persistent RPC

  • Persistent RPC is the foundation of most servers: persist RPC data before sending the ACK to the client (ordering sketched below)
  • RPC over RDMA, 40 Gb/s InfiniBand NIC
  • For small IO (1 KB), Strata is:
    • 25% slower than no-persist
    • 35% faster than PMFS
    • 7x faster than EXT4-DAX

[Chart: latency (µs, 15 to 60, with one bar at 98) vs. RPC size (1 KB, 4 KB, 64 KB) for Strata, PMFS, NOVA, EXT4-DAX, and no-persist; lower is better]
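The invariant behind these numbers is simple: the payload must be durable before the client can observe the ACK. A sketch of that ordering, independent of the RDMA transport; persist() is the helper from the earlier log-append sketch and send_ack() is a hypothetical transport call:

```c
#include <string.h>

extern void persist(const void *addr, size_t len); /* as sketched earlier */
extern void send_ack(void);                        /* hypothetical transport */

/* Persist-then-ACK: if the server crashes after the ACK is sent, the
   request must already be recoverable from NVM. */
void handle_rpc(const void *payload, size_t len, char *nvm_log_tail)
{
    memcpy(nvm_log_tail, payload, len);
    persist(nvm_log_tail, len);  /* durable before the client sees the ACK */
    send_ack();
}
```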