BetrFS: A path-based write-optimized file system (CSCI 333, Spring 2019)



SLIDE 1

BetrFS: A path-based write-optimized file system

CSCI 333 Spring 2019

SLIDE 2

Last Class

  • Bε-trees
      ◦ Operations
      ◦ Asymptotics
  • Write optimization: tips, tricks, and secret sauce
      ◦ Batched updates: only do work when there is enough of it to make the setup worthwhile
      ◦ Read-write asymmetry
          • Blind updates whenever possible
      ◦ Big nodes, modest fanout

SLIDE 3

This Class

  • The pros and cons of indirection
  • How do we make a file system using Bε-trees?
      ◦ Converting file system operations to kv-operations
      ◦ Synergies with write-optimization and the OS
  • Evaluating performance and being critical
  • The value of iteration and rethinking designs

SLIDE 4

Today’s Strategy

  • Two conference talks on BetrFS v1 and v2
      ◦ What is the goal of a conference talk?
      ◦ What is the goal of a lecture?
  • Why present this work?
      ◦ Long project history, spanning 6+ years
  • I’ll fill in the gaps and give context, but ask questions because I have “the curse of knowledge”
  • 4 consecutive FAST papers, 3 BP nominations, 1 BP
      ◦ I hope you’ll poke holes!

SLIDE 5

BetrFS: A right-optimized, write-optimized file system

William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter

Stony Brook University, Tokutek Inc., Rutgers University, Massachusetts Institute of Technology

SLIDE 6

ext4 is good at sequential I/O

  • Disk bandwidth spec: 125 MB/s
  • Workload: 1 GiB sequential write
  • ext4 bandwidth: 104 MB/s

[Bar chart "Sequential I/O": ext4 vs. raw disk, in MB/s; higher is better]

SLIDE 7

ext4 struggles with random writes

  • Disk bandwidth spec: 125 MB/s
  • Workload: Small, random writes of cached data
  • ext4 write bandwidth: 1.5 MB/s

[Bar chart "Random Overwrites": ext4 vs. raw disk, in MB/s; higher is better]

SLIDE 8

What is going on here?

  • Random write performance dominated by seeks
  • Back-of-the-envelope:
      ◦ Average disk seek time is 11 ms
      ◦ Seek for every 4 KB write
      ◦ Implies maximum 0.4 MB/s bandwidth
  • Previous benchmark benefits from locality, good I/O scheduling
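The back-of-the-envelope bound above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes "4KB" means 4096 bytes and "MB" means 10^6 bytes; the function name is made up for illustration.

```python
# Back-of-the-envelope estimate: every 4 KB write pays one full seek.
SEEK_TIME_S = 0.011       # average seek time quoted on the slide (11 ms)
WRITE_SIZE_B = 4 * 1024   # assumption: "4KB" means 4096 bytes


def seek_bound_bandwidth_mb_s(seek_time_s: float, write_size_b: int) -> float:
    """Upper bound on throughput in MB/s (10^6 bytes) with one seek per write."""
    writes_per_second = 1.0 / seek_time_s
    return writes_per_second * write_size_b / 1e6


bandwidth = seek_bound_bandwidth_mb_s(SEEK_TIME_S, WRITE_SIZE_B)
print(f"{bandwidth:.2f} MB/s")  # about 0.37 MB/s, which the slide rounds to 0.4
```

The measured 1.5 MB/s for ext4 is above this bound, which is consistent with the slide's remark that the benchmark benefits from locality and I/O scheduling.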

SLIDE 9

Ext4 Sequential I/O

SLIDE 10

Ext4 Random I/O

SLIDE 11

Avoiding seeks: log-structured file systems

  • Pros:
      ◦ writing data is just an append to the log
  • Cons:
      ◦ file blocks can become scattered on disk
      ◦ reading data becomes slow

Logging still presents a tradeoff between random-write and sequential-I/O performance

SLIDE 12

BetrFS

  • Use write-optimized dictionaries (WODs)
      ◦ on-disk data structures that rapidly ingest new data while maintaining logical locality
  • Create a schema that maps file operations to efficient WOD operations
  • Implemented in the Linux kernel
      ◦ exposed new performance opportunities

SLIDE 13

Advancing write-optimized FSes

  • Prior work: WODs can accelerate FS operations
      ◦ TokuFS [Esmet, Bender, Farach-Colton, Kuszmaul ’12], KVFS [Shetty, Spillane, Malpani, Andrews, Seyster, and Zadok ’13], TableFS [Ren and Gibson ’13]
      ◦ Prior WOFSes in user space
  • BetrFS goal: explore all the ways write-optimization can be used in a file system
      ◦ explore the impact of write-optimization on the interaction with the rest of the system

SLIDE 14

BetrFS uses Bε-Trees

  • Bε-trees: an asymptotically optimal key-value store
  • Bε-trees asymptotically dominate log-structured merge-trees
  • We use Fractal Trees, an open-source Bε-tree implementation from Tokutek

SLIDE 15

Bε-Tree Operations

  • Implement a dictionary on key-value pairs
      ◦ insert(k,v)
      ◦ v = search(k)
      ◦ delete(k)
      ◦ k’ = successor(k)
      ◦ k’ = predecessor(k)
  • New operation:
      ◦ upsert(k, ƒ)
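To make the dictionary interface listed above concrete, here is a minimal in-memory sketch. It is not a Bε-tree: it keeps a sorted Python list and applies upserts eagerly, and the class name `SortedDict` is invented for this example. It only shows what each operation returns.

```python
import bisect


class SortedDict:
    """In-memory stand-in for the dictionary interface (not a Bε-tree)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._vals = {}   # key -> value

    def insert(self, k, v):
        if k not in self._vals:
            bisect.insort(self._keys, k)
        self._vals[k] = v

    def search(self, k):
        return self._vals.get(k)

    def delete(self, k):
        if k in self._vals:
            del self._vals[k]
            self._keys.pop(bisect.bisect_left(self._keys, k))

    def successor(self, k):
        """Smallest stored key strictly greater than k, or None."""
        i = bisect.bisect_right(self._keys, k)
        return self._keys[i] if i < len(self._keys) else None

    def predecessor(self, k):
        """Largest stored key strictly smaller than k, or None."""
        i = bisect.bisect_left(self._keys, k)
        return self._keys[i - 1] if i > 0 else None

    def upsert(self, k, f):
        # Applied eagerly in this sketch; f receives the old value or None.
        self.insert(k, f(self.search(k)))


d = SortedDict()
for key in ("b", "d", "a"):
    d.insert(key, key.upper())
d.upsert("c", lambda old: "new" if old is None else old)
print(d.successor("a"), d.predecessor("d"))  # b c
```

The successor and predecessor operations are what make range queries possible, since a scan can walk the key space in sorted order.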

SLIDE 16

Bε-trees search/insert asymmetry

  • Queries (point and range) comparable to B-trees
      ◦ with caching, ~1 seek + disk bandwidth
      ◦ hundreds of random queries per second
  • Extremely fast inserts
      ◦ tens of thousands per second

SLIDE 17

upsert = update + insert

upsert(k,ƒ)

  • An upsert specifies a mutation to a value
      ◦ e.g. increment a reference count
      ◦ e.g. modify the 5th byte of a string
  • upserts are encoded as messages and inserted into the tree
      ◦ defer and batch expensive queries
      ◦ we can perform tens of thousands of upserts per second
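The idea of encoding an upsert as a message can be illustrated with a toy model. The sketch below is an assumption-laden simplification: a real Bε-tree keeps messages in the buffers of internal nodes and flushes them toward the leaves in batches, whereas this `UpsertBuffer` class (a made-up name) keeps one flat queue per key and replays it on the next query.

```python
from collections import defaultdict


class UpsertBuffer:
    """Toy model: upserts are queued as messages and applied lazily on query."""

    def __init__(self):
        self.stored = {}                   # values that have been materialized
        self.pending = defaultdict(list)   # key -> queued mutation functions
        self.reads_done = 0                # counts lookups of stored values

    def upsert(self, k, f):
        # No read of the old value: only the message is recorded.
        self.pending[k].append(f)

    def search(self, k):
        # A query pays for one lookup and replays the queued messages in order.
        self.reads_done += 1
        value = self.stored.get(k)
        for f in self.pending.pop(k, []):
            value = f(value)
        if value is not None:
            self.stored[k] = value
        return value


buf = UpsertBuffer()
increment = lambda old: (old or 0) + 1   # "increment a reference count"
for _ in range(3):
    buf.upsert("refcount", increment)
print(buf.reads_done)           # 0: enqueueing the updates needed no reads
print(buf.search("refcount"))   # 3
```

The point of the design is that the expensive read is deferred until some query actually needs the value, and several mutations can then be applied in one pass.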

SLIDE 18

File System -> Bε-Tree

  • Maintain two separate Bε-tree indexes:
      ◦ metadata index: path -> struct stat
      ◦ data index: (path,blk#) -> data[4096]
  • Implications:
      ◦ fast directory scans
      ◦ data blocks are laid out sequentially
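A short sketch can show why full-path keys give the two implications above. It assumes plain lexicographic ordering of path strings, which may differ from the key comparator BetrFS actually uses; `range_query` and the sample paths are invented for illustration.

```python
import bisect


def range_query(sorted_keys, prefix):
    """Return all keys starting with prefix; they are contiguous in sorted order."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return sorted_keys[lo:hi]


# Metadata index keys: full paths (values would be struct stat; omitted here).
metadata_index = sorted([
    "/home/bill/foo.txt",
    "/home/alice/a.txt",
    "/home/bill/bar/b.c",
    "/home/bill",
    "/var/log",
])
scan = range_query(metadata_index, "/home/bill/")
print(scan)  # ['/home/bill/bar/b.c', '/home/bill/foo.txt']

# Data index keys: (path, block number). Tuples sort by path, then by block,
# so the blocks of one file are adjacent and in order.
data_index = sorted([
    ("/home/bill/foo.txt", 1),
    ("/home/alice/a.txt", 0),
    ("/home/bill/foo.txt", 0),
])
print(data_index[1:])  # [('/home/bill/foo.txt', 0), ('/home/bill/foo.txt', 1)]
```

Because everything under a directory shares a key prefix, a recursive scan becomes a single range query, and reading a file in order becomes a range query over consecutive block keys.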

SLIDE 19

Operation Roundup

Operation        -> Implementation
read             -> range query
write            -> upsert
metadata update  -> upsert
readdir          -> range query
mkdir/rmdir      -> upsert
unlink           -> *delete each block
rename           -> *delete then reinsert each block

SLIDE 20

Integrating BetrFS with the page cache

  • Problem: Write-back caching can convert single-byte to full-page writes
  • upserts enable BetrFS to avoid this write amplification

SLIDE 21

Page cache integration #1: blind write

Page cache

/home/bill/foo.txt

upsert(/home/bill/foo.txt, ) write(/home/bill/foo.txt, ) upsert(/home/bill/foo.txt, )

SLIDE 22

Page cache integration #2: write-after-read

Page cache

/home/bill/foo.txt

upsert(/home/bill/foo.txt, ) write(/home/bill/foo.txt, ) upsert(/home/bill/foo.txt, )

Target page is cached.

SLIDE 23

Page cache integration #3: write to mmap’ed file

Page cache

/home/bill/foo.txt

write(/home/bill/foo.txt, )

Target page is cached.

SLIDE 24

Page-cache takeaways

  • By rethinking the interaction between the page cache and the file system, we benefit more than simply speeding up individual operations
      ◦ use upserts to avoid unnecessary reads
      ◦ use upserts to avoid write amplification
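The sketch below gives one possible reading of the three page-cache cases (blind write, write-after-read, write to an mmap'ed file). It is a toy model, not BetrFS code: `handle_write`, the dictionary used as a page cache, and the message tuples are all assumptions made for illustration, and the write is assumed not to cross a page boundary.

```python
PAGE_SIZE = 4096  # assumption: pages match the 4096-byte data blocks


def handle_write(page_cache, messages, path, offset, data, mmapped=False):
    """Toy write path; returns the payload bytes handed to the tree right away.

    page_cache: dict mapping (path, page_no) -> bytearray(PAGE_SIZE)
    messages:   list standing in for upsert messages sent to the tree
    """
    key = (path, offset // PAGE_SIZE)
    start = offset % PAGE_SIZE
    if key in page_cache:
        # Keep the cached copy current so later reads see the new bytes.
        page_cache[key][start:start + len(data)] = data
    if mmapped:
        # Stores through a mapping only touch the cached page; the whole
        # page has to be written back later.
        return 0
    # Cached or not, send only the changed bytes; the old page is never read.
    messages.append(("upsert", key, start, bytes(data)))
    return len(data)


cache = {("/home/bill/foo.txt", 0): bytearray(PAGE_SIZE)}
msgs = []
sent = handle_write(cache, msgs, "/home/bill/foo.txt", 5, b"x")      # page cached
sent += handle_write(cache, msgs, "/home/bill/foo.txt", 8192, b"y")  # blind write
print(sent, len(msgs))  # 2 2
```

In this model two single-byte writes hand 2 bytes of payload to the tree, whereas writing back two dirty pages would transfer 2 × 4096 bytes; that gap is the write amplification the slides refer to.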

SLIDE 25

System Architecture

[Diagram labels: VFS, ext4, Page Cache, Disk; legend: unmodified*, new code]

SLIDE 26

Performance Questions

  • Do we meet our performance goals for small, random, unaligned writes?
  • Is BetrFS competitive for sequential I/O?
  • Do any real-world applications benefit?

SLIDE 27

Experimental Setup

  • Dell OptiPlex desktop:
      ◦ 4-core 3.4 GHz i7, 4 GB RAM
      ◦ 7200 RPM 250 GB Seagate Barracuda
  • Compare with btrfs, ext4, xfs, zfs
      ◦ default settings for all
  • All tests are cold cache

SLIDE 28

Small, random, unaligned writes are an order-of-magnitude faster

  • 1 GiB file, random data
  • 1,000 random 4-byte writes
  • fsync() at end

[Bar chart "1000 Random 4-byte writes": BetrFS, btrfs, ext4, xfs, zfs; time in seconds on a log scale from 0.1 to 100; lower is better]
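The workload described above can be approximated with a short script. This is a sketch, not the authors' benchmark harness: it uses `os.pwrite` (available on Unix-like systems), runs against a 1 MiB demonstration file instead of 1 GiB, and does not drop caches, so its timings are not comparable to the cold-cache numbers on the slide.

```python
import os
import random
import tempfile
import time


def random_small_writes(path, file_size, n_writes=1000, write_size=4, seed=0):
    """Time n_writes random write_size-byte overwrites followed by one fsync()."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            # Offsets are arbitrary byte positions, so most writes are unaligned.
            offset = rng.randrange(0, file_size - write_size + 1)
            os.pwrite(fd, os.urandom(write_size), offset)
        os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


# Small demonstration file; the slide used a 1 GiB file of random data.
demo_size = 1 << 20
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(demo_size))
elapsed = random_small_writes(tmp.name, demo_size)
final_size = os.path.getsize(tmp.name)
os.remove(tmp.name)
print(f"{elapsed:.4f} s for 1000 writes")
```

To approach the slide's setup, one would pass a 1 GiB file on the file system under test and clear the page cache before the run.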

SLIDE 29

Small file creates are an order-of-magnitude faster

  • create 3 million files and write 200 bytes to each
  • balanced directory tree with fanout 128
  • performance over time

[Line chart "Small File Creation": BetrFS, btrfs, ext4, xfs, zfs; x-axis: files created (up to 3M); y-axis: files/second on a log scale from 100 to 100000; higher is better]
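One way to lay out files so that no directory holds more than 128 entries is to use the base-128 digits of a file's index as its path components. The sketch below shows that mapping; the naming scheme (`d<digit>` for directories, `f<digit>` for files) and the fixed depth are assumptions, and the benchmark behind the slide may build its tree differently.

```python
FANOUT = 128


def file_path(index, depth, fanout=FANOUT):
    """Map a file index to a path whose components are its base-`fanout` digits."""
    digits = []
    for _ in range(depth):
        index, digit = divmod(index, fanout)
        digits.append(digit)
    if index:
        raise ValueError("index does not fit in a tree of this depth")
    *dirs, leaf = reversed(digits)   # most significant digit first
    return "/".join([f"d{d}" for d in dirs] + [f"f{leaf}"])


# 128**3 is about 2.1 million, so 3 million files need four levels.
first = file_path(0, depth=4)
last = file_path(2_999_999, depth=4)
print(first)  # d0/d0/d0/f0
print(last)   # d1/d55/d13/f63
```

With this mapping every directory has at most 128 children, although the top level is only sparsely populated for 3 million files.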

SLIDE 30

Sequential I/O

  • Write random data to file, 10 4K-blocks at a time
  • Sequentially read data back

[Bar chart "1GiB Sequential I/O": read and write throughput for BetrFS, btrfs, ext4, xfs, zfs; MiB/s; higher is better]

SLIDE 31

BetrFS forgoes indirection for locality: delete, rename O(n)

  • write random data to file, fsync() it
  • delete file

[Chart "BetrFS Delete Scaling": BetrFS delete time in seconds vs. file size (256 MiB, 512 MiB, 1 GiB, 2 GiB, 4 GiB)]

SLIDE 32

BetrFS forgoes indirection for locality: fast directory scans

  • recursive scans from root of Linux 3.11.10 source
  • GNU find scans file metadata
  • grep -r scans file contents

[Bar charts "grep -r" and "GNU Find": BetrFS, btrfs, ext4, xfs, zfs; time in seconds]

SLIDE 33

BetrFS Benefits Mailserver Workloads

  • Dovecot 2.2.13 mail server using maildir
  • 26,000 sync() operations

[Bar chart "IMAP (50% read, 50% mark or move)": BetrFS, btrfs, ext4, xfs, zfs; time in seconds; lower is better]

SLIDE 34

BetrFS Benefits rsync

  • rsync Linux source tree to new directory on same FS
  • copying to an empty directory

[Bar chart "In-place rsync of Linux 3.11.10": BetrFS, btrfs, ext4, xfs, zfs; MB/s; higher is better]

SLIDE 35

Performance Questions

  • Do we meet our performance goals for small, random writes?
  • Is BetrFS competitive for sequential I/O?
      ◦ More work to do here
  • Do any real-world applications benefit?
      ◦ More experiments in paper

SLIDE 36

BetrFS

  • Cake && Eat: One file system can have good sequential and random I/O performance
  • WOI performance requires revisiting many design decisions
      ◦ inodes
      ◦ write-through vs. write-back caching
      ◦ perform blind writes whenever possible

betrfs.org – github.com/oscarlab/betrfs

SLIDE 37

Thinking Critically

  • What problems do you see?
      ◦ Are there operations that were slower than expected?
      ◦ What are the bottlenecks of those operations?
  • What information was left out?
      ◦ Bε-tree details
      ◦ SSDs
  • Next steps?