BetrFS: A path-based write-optimized file system
CSCI 333, Spring 2019

Last Class
• Bε-trees
  § Operations
  § Asymptotics
• Write optimization: tips, tricks, and secret sauce
  § Batched updates: only do work when you have enough to do that the setup is worth it
  § Read-write asymmetry
    • Blind updates whenever possible
  § Big nodes, modest fanout

This Class
• The pros and cons of indirection
• How do we make a file system using Bε-trees?
  § Converting file system operations to kv-operations
  § Synergies with write-optimization and the OS
• Evaluating performance and being critical
• The value of iteration and rethinking designs

Today's Strategy
• Two conference talks on BetrFS v1 and v2
  § What is the goal of a conference talk?
  § What is the goal of a lecture?
• Why present this work?
  § Long project history, spanning 6+ years
    • I'll fill in the gaps and give context, but ask questions because I have "the curse of knowledge"
    • 4 consecutive FAST papers, 3 best-paper nominations, 1 best paper
  § I hope you'll poke holes!

BetrFS: A right-optimized, write-optimized file system
William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter
Stony Brook University, Tokutek Inc., Rutgers University, Massachusetts Institute of Technology

ext4 is good at sequential I/O
• Disk bandwidth spec: 125 MB/s
• Workload: 1 GiB sequential write
• ext4 bandwidth:
  § 104 MB/s
[Figure: "Sequential I/O", bandwidth in MB/s for ext4 and raw disk; higher is better]

ext4 struggles with random writes
• Disk bandwidth spec: 125 MB/s
• Workload: Small, random writes of cached data
• ext4 write bandwidth:
  § 1.5 MB/s
[Figure: "Random Overwrites", bandwidth in MB/s for ext4 and raw disk; higher is better]

What is going on here?
• Random write performance dominated by seeks
• Back-of-the-envelope:
  § Average disk seek time is 11 ms
  § Seek for every 4 KB write
  § Implies maximum 0.4 MB/s bandwidth
• Previous benchmark benefits from locality, good I/O scheduling
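The back-of-the-envelope bound above can be checked with a few lines of arithmetic. This is only a sketch: it assumes every 4 KB write pays exactly one average seek and ignores rotational delay and transfer time, and the function name is made up for illustration.

```python
# Back-of-the-envelope bound on random-write bandwidth for a seek-bound disk.
SEEK_TIME_S = 11e-3      # average seek time from the slide: 11 ms
WRITE_SIZE_BYTES = 4096  # one 4 KB write per seek


def seek_bound_bandwidth_mb_s(seek_time_s, write_size_bytes):
    """Bandwidth in MB/s (10^6 bytes) if every write pays one full seek."""
    writes_per_second = 1.0 / seek_time_s
    return writes_per_second * write_size_bytes / 1e6


bandwidth = seek_bound_bandwidth_mb_s(SEEK_TIME_S, WRITE_SIZE_BYTES)
print(f"{bandwidth:.2f} MB/s")  # prints "0.37 MB/s", which rounds to 0.4 MB/s
```

The measured 1.5 MB/s for ext4 sits above this bound, which is consistent with the slide's remark that the benchmark benefits from locality and I/O scheduling.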

Ext4 Sequential I/O

Ext4 Random I/O

Avoiding seeks: log-structured file systems
• Pros:
  § writing data is just an append to the log
• Cons:
  § file blocks can become scattered on disk
  § reading data becomes slow
Logging still presents a tradeoff between random-write and sequential-I/O performance

BetrFS
• Use write-optimized dictionaries (WODs)
  § on-disk data structures that rapidly ingest new data while maintaining logical locality
• Create a schema that maps file operations to efficient WOD operations
• Implemented in the Linux kernel
  § exposed new performance opportunities

Advancing write-optimized FSes
• Prior work: WODs can accelerate FS operations
  § TokuFS [Esmet, Bender, Farach-Colton, Kuszmaul '12], KVFS [Shetty, Spillane, Malpani, Andrews, Seyster, and Zadok '13], TableFS [Ren and Gibson '13]
  § Prior WOFSes in user space
• BetrFS goal: explore all the ways write-optimization can be used in a file system
  § explore the impact of write-optimization on the interaction with the rest of the system

BetrFS uses Bε-trees
• Bε-trees: an asymptotically optimal key-value store
• Bε-trees asymptotically dominate log-structured merge-trees
• We use Fractal Trees, an open-source Bε-tree implementation from Tokutek

Bε-tree Operations
• Implement a dictionary on key-value pairs
  § insert(k, v)
  § v = search(k)
  § delete(k)
  § k' = successor(k)
  § k' = predecessor(k)
• New operation:
  § upsert(k, ƒ)
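To make the dictionary interface concrete, here is a minimal in-memory sketch of the five classic operations. It is not a Bε-tree: the class name `SortedDict` and the sorted-list representation are assumptions chosen only to show what each call returns.

```python
import bisect


class SortedDict:
    """Toy ordered dictionary; a stand-in for the interface, not a Bε-tree."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def insert(self, k, v):
        if k not in self._values:
            bisect.insort(self._keys, k)
        self._values[k] = v

    def search(self, k):
        return self._values.get(k)

    def delete(self, k):
        if k in self._values:
            del self._values[k]
            self._keys.pop(bisect.bisect_left(self._keys, k))

    def successor(self, k):
        """Smallest stored key strictly greater than k, or None."""
        i = bisect.bisect_right(self._keys, k)
        return self._keys[i] if i < len(self._keys) else None

    def predecessor(self, k):
        """Largest stored key strictly smaller than k, or None."""
        i = bisect.bisect_left(self._keys, k)
        return self._keys[i - 1] if i > 0 else None


d = SortedDict()
for key in ("a", "c", "e"):
    d.insert(key, key.upper())
```

Repeated `successor` calls are what make range queries possible, which matters later when directory scans are mapped onto the tree.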

Bε-trees search/insert asymmetry
• Queries (point and range) comparable to B-trees
  § with caching, ~1 seek + disk bandwidth
  § hundreds of random queries per second
• Extremely fast inserts
  § tens of thousands per second

upsert = update + insert
upsert(k, ƒ)
• An upsert specifies a mutation to a value
  § e.g. increment a reference count
  § e.g. modify the 5th byte of a string
• upserts are encoded as messages and inserted into the tree
  § defer and batch expensive queries
  § we can perform tens of thousands of upserts per second
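The idea of encoding a mutation as a message can be sketched as follows. This is a deliberately simplified model, assuming a single flat buffer of pending messages per key; a real Bε-tree buffers messages in internal nodes and flushes them toward the leaves in batches. The names `UpsertStore`, `increment`, and `set_fifth_byte` are hypothetical.

```python
from collections import defaultdict


class UpsertStore:
    """Toy model: upserts are buffered as messages and applied lazily."""

    def __init__(self):
        self._values = {}                  # stands in for data stored in leaves
        self._pending = defaultdict(list)  # key -> buffered mutation functions

    def upsert(self, k, f):
        """Record mutation f for key k without reading the old value."""
        self._pending[k].append(f)

    def search(self, k):
        """Apply buffered mutations, in order, while answering the query."""
        v = self._values.get(k)
        for f in self._pending.get(k, []):
            v = f(v)
        return v

    def flush(self):
        """Apply all buffered mutations in one batch."""
        for k, fs in self._pending.items():
            v = self._values.get(k)
            for f in fs:
                v = f(v)
            self._values[k] = v
        self._pending.clear()


def increment(old):
    """Increment a reference count; a missing value counts as 0."""
    return (old or 0) + 1


def set_fifth_byte(old):
    """Overwrite the 5th byte of a byte string with 'X'."""
    buf = bytearray(old or b"\0" * 5)
    buf[4] = ord("X")
    return bytes(buf)


store = UpsertStore()
for _ in range(3):
    store.upsert("refcount", increment)
store.upsert("name", lambda old: b"hello")
store.upsert("name", set_fifth_byte)
```

Note that `upsert` never touches `_values`: the caller pays no read, and the expensive work is deferred until a query or a batched flush needs the result.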

File System → Bε-tree
• Maintain two separate Bε-tree indexes:
  § metadata index: path -> struct stat
  § data index: (path, blk#) -> data[4096]
• Implications:
  § fast directory scans
  § data blocks are laid out sequentially
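The two implications follow from key order, which a short sketch can show. It assumes plain lexicographic ordering of full-path strings and of (path, block number) tuples; the comparator BetrFS actually uses may differ. The sample paths, the stat fields, and the helper names are made up for illustration.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of path keys that start with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        out.append(key)
    return out


def file_blocks(sorted_data_keys, path):
    """Return the contiguous run of (path, blk#) keys for one file."""
    start = bisect.bisect_left(sorted_data_keys, (path, 0))
    out = []
    for key in sorted_data_keys[start:]:
        if key[0] != path:
            break
        out.append(key)
    return out


# metadata index: path -> stat-like record (fields are illustrative only)
metadata_index = {
    "/home/alice/notes.txt": {"st_size": 100},
    "/home/bill/foo.txt": {"st_size": 8192},
    "/home/bill/bar.txt": {"st_size": 4096},
    "/usr/bin/grep": {"st_size": 4096},
}

# data index: (path, blk#) -> 4096-byte block
data_index = {
    ("/home/bill/foo.txt", 1): b"b" * 4096,
    ("/usr/bin/grep", 0): b"g" * 4096,
    ("/home/bill/foo.txt", 0): b"a" * 4096,
    ("/home/bill/bar.txt", 0): b"c" * 4096,
}

meta_keys = sorted(metadata_index)
data_keys = sorted(data_index)

bill_entries = prefix_scan(meta_keys, "/home/bill/")
foo_blocks = file_blocks(data_keys, "/home/bill/foo.txt")
```

Because everything under `/home/bill/` is adjacent in key order, a directory scan becomes one range query, and the blocks of a file come back in block order from a single contiguous run.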

Operation Roundup

Operation          Implementation
read               range query
write              upsert
metadata update    upsert
readdir            range query
mkdir/rmdir        upsert
unlink             * delete each block
rename             * delete then reinsert each block

Integrating BetrFS with the page cache
• Problem: Write-back caching can convert single-byte to full-page writes
• upserts enable BetrFS to avoid this write amplification

Page cache integration #1: blind write
[Diagram labels: write(/home/bill/foo.txt, ...); page cache entry for /home/bill/foo.txt; upsert(/home/bill/foo.txt, ...)]

Page cache integration #2: write-after-read
[Diagram labels: write(/home/bill/foo.txt, ...); page cache entry for /home/bill/foo.txt, "Target page is cached."; upsert(/home/bill/foo.txt, ...)]

Page cache integration #3: write to mmap'ed file
[Diagram labels: write(/home/bill/foo.txt, ...); page cache entry for /home/bill/foo.txt, "Target page is cached."]

Page-cache takeaways
• By rethinking the interaction between the page cache and the file system, we benefit more than simply speeding up individual operations
  § use upserts to avoid unnecessary reads
  § use upserts to avoid write amplification
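One way to picture these two takeaways is the sketch below. It is an interpretation of the blind-write and write-after-read slides, not BetrFS code: it assumes a write never crosses a page boundary, models the page cache as a dictionary, and models the tree as a list of logged upsert messages. `handle_write` and the variable names are hypothetical.

```python
PAGE_SIZE = 4096


def handle_write(page_cache, upsert_log, path, offset, data):
    """Handle a small write; return the number of payload bytes sent to the tree.

    Assumes the write does not cross a page boundary.
    """
    key = (path, offset // PAGE_SIZE)
    page = page_cache.get(key)
    if page is not None:
        # Write-after-read: the page is cached, so keep the cached copy current.
        start = offset % PAGE_SIZE
        page[start:start + len(data)] = data
    # In both cases the tree receives only the changed bytes as an upsert.
    # An uncached page is never read from disk first (a blind write).
    upsert_log.append(("upsert", path, offset, bytes(data)))
    return len(data)


# Case 1: blind write, target page not cached.
cold_cache, cold_log = {}, []
sent_cold = handle_write(cold_cache, cold_log, "/home/bill/foo.txt", 5, b"Z")

# Case 2: write-after-read, target page already cached.
warm_cache = {("/home/bill/foo.txt", 0): bytearray(PAGE_SIZE)}
warm_log = []
sent_warm = handle_write(warm_cache, warm_log, "/home/bill/foo.txt", 5, b"Z")

# Write-back caching would eventually flush the whole dirty page instead.
amplification_avoided = PAGE_SIZE // sent_cold
```

For a one-byte write, the upsert carries one byte of payload where a write-back flush would carry a full 4096-byte page.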

System Architecture
[Diagram: layered stack showing VFS, ext4, page cache, and disk; the legend distinguishes "unmodified*" components from "new code"]

Performance Questions
• Do we meet our performance goals for small, random, unaligned writes?
• Is BetrFS competitive for sequential I/O?
• Do any real-world applications benefit?

Experimental Setup
• Dell OptiPlex desktop:
  § 4-core 3.4 GHz i7, 4 GB RAM
  § 7200 RPM 250 GB Seagate Barracuda
• Compare with btrfs, ext4, xfs, zfs
  § default settings for all
• All tests are cold cache

Small, random, unaligned writes are an order-of-magnitude faster
• 1 GiB file, random data
• 1,000 random 4-byte writes
• fsync() at end
[Figure: "Random 4-byte writes", time (s) on a log scale for BetrFS, btrfs, ext4, xfs, and zfs; lower is better]

Small file creates are an order-of-magnitude faster
• create 3 million files and write 200 bytes to each
• balanced directory tree with fanout 128
• performance over time
[Figure: "Small File Creation", files/second on a log scale versus files created (0 to 3M) for BetrFS, btrfs, ext4, xfs, and zfs; higher is better]

Sequential I/O
• Write random data to file, 10 4K-blocks at a time
• Sequentially read data back
[Figure: "1GiB Sequential I/O", MiB/s for read and write operations on BetrFS, btrfs, ext4, xfs, and zfs; higher is better]

BetrFS forgoes indirection for locality: delete, rename O(n)
• write random data to file, fsync() it
• delete file
[Figure: "BetrFS Delete Scaling", time (s) versus file size (256MiB to 4GiB) for BetrFS]

BetrFS forgoes indirection for locality: fast directory scans
• recursive scans from root of Linux 3.11.10 source
• GNU find scans file metadata
• grep -r scans file contents
[Figure: two panels, "grep -r" and "GNU Find", time (s) for BetrFS, btrfs, ext4, xfs, and zfs]

BetrFS Benefits Mailserver Workloads
• Dovecot 2.2.13 mail server, IMAP using maildir (50% read, 50% mark or move)
• 26,000 sync() operations
[Figure: time (s) for BetrFS, btrfs, ext4, xfs, and zfs; lower is better]

BetrFS Benefits rsync
• rsync Linux source tree to new directory on same FS
• copying to an empty directory
[Figure: "In-place rsync of Linux 3.11.10", MB/s for BetrFS, btrfs, ext4, xfs, and zfs; higher is better]

Performance Questions
• Do we meet our performance goals for small, random writes?
• Is BetrFS competitive for sequential I/O?
  § More work to do here
• Do any real-world applications benefit?
  § More experiments in paper

BetrFS
• Cake && Eat: One file system can have good sequential and random I/O performance
• WOI performance requires revisiting many design decisions
  § inodes
  § write-through vs. write-back caching
  § perform blind writes whenever possible
betrfs.org
github.com/oscarlab/betrfs

Thinking Critically
• What problems do you see?
  § Are there operations that were slower than expected?
  § What are the bottlenecks of those operations?
• What information was left out?
  § Bε-tree details
  § SSDs
• Next steps?
