SLIDE 1 File Systems Fated for Senescence? Nonsense, Says Science!
Alex Conway🃠, Ainesh Bakshi🃠, Yizheng Jiao♢, Yang Zhan♢, Michael A. Bender♠, William Jannen♠, Rob Johnson♠, Bradley C. Kuszmaul♡, Donald E. Porter♢, Jun Yuan♣ and Martin Farach-Colton🃠
🃠Rutgers University, ♢The University of North Carolina at Chapel Hill, ♠Stony Brook University, ♡Oracle Corporation and Massachusetts Institute of Technology, ♣Farmingdale State College of SUNY
SLIDE 3 File Systems Fated for Senescence? Nonsense, Says Science; The Essence of Semperjuvenescence is Coalescence!
Semperjuvenescence: being young forever. Coalescence: merging together.
SLIDE 4 File System Aging
Aging is fragmentation that builds up over time and hurts performance
SLIDE 5
In this talk
Do file systems age? What can we do about it?
SLIDE 6
Is aging a problem?
SLIDE 10 Is aging a problem?
Chris Hoffman at howtogeek.com says:
- “Linux’s ext2, ext3, and ext4 file systems… [are] designed to avoid fragmentation in normal use.”
- “If you do have problems with fragmentation on Linux, you probably need a larger hard disk.”
- “Modern Linux filesystems keep fragmentation at a minimum…Therefore it is not necessary to worry about fragmentation in a Linux system.”
Nope
SLIDE 12 Is aging a problem?
Aging happens in real filesystems
Benchmarks should incorporate aging
- Zhu, Chen and Chiueh (’05)
- Agrawal, A. Arpaci-Dusseau and R. Arpaci-Dusseau (’09)
Yep
SLIDE 13
Is aging a problem?
Yep Nope
SLIDE 14
Let’s do some science!
SLIDE 15 Inducing Aging
We use three different workloads:
- Developer workload
- Server workload
- Synthetic workloads
See the paper
SLIDE 28 Simulating a Developer
A developer's routine: get coffee, git pull, make, get coffee, git pull, add awesome features, get coffee, git pull, fix bugs, ...
We can simulate a developer by replaying Git histories
SLIDE 30
Simulating a Developer
- Use the Linux kernel repo from github.com
- Repeat: do 100 git pulls, then measure performance
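The pull-then-measure loop on this slide can be sketched roughly as follows. This is a minimal sketch and not the authors' scripts: `replay_history`, `apply_pull`, and `measure` are hypothetical names, and the two callbacks stand in for whatever actually performs one pull and one benchmark run.

```python
from typing import Callable, Iterable, List, Tuple


def replay_history(
    commits: Iterable[str],
    apply_pull: Callable[[str], None],
    measure: Callable[[], float],
    pulls_per_measurement: int = 100,
) -> List[Tuple[int, float]]:
    """Apply commits in order and record a measurement after every N pulls."""
    results: List[Tuple[int, float]] = []
    for pulls_done, commit in enumerate(commits, start=1):
        # Hypothetical hook: bring the working tree up to `commit`.
        apply_pull(commit)
        if pulls_done % pulls_per_measurement == 0:
            # Hypothetical hook: run the benchmark and return its cost.
            results.append((pulls_done, measure()))
    return results


# Dry run with stand-in callbacks, so nothing touches a real repository.
applied: List[str] = []
points = replay_history(
    commits=[f"commit-{i}" for i in range(250)],
    apply_pull=applied.append,
    measure=lambda: float(len(applied)),
)
print(points)
```

In the dry run, 250 stand-in commits yield two measurement points, one after pull 100 and one after pull 200. Keeping the pull and the measurement as callbacks means the same loop could drive any file system under test.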
SLIDE 41 Measuring Aging
time grep -r random_string /path/to/filesystem
[Diagram: on-disk layout of dir, file1, file2, file3 and file4, illustrating intrafile fragmentation and interfile fragmentation]
Then normalize per gigabyte read
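The measurement above, timing a recursive grep and then normalizing per gigabyte read, can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the bytes read are approximated by the total size of regular files under the root, and nothing here empties the page cache, so a cold-cache run would need that done separately.

```python
import os
import subprocess
import time

GIB = 2**30


def tree_size_bytes(root: str) -> int:
    """Total size of regular files under root (assumed proxy for bytes read)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def timed_grep(root: str, pattern: str = "random_string") -> float:
    """Wall-clock seconds for `grep -r pattern root`."""
    start = time.perf_counter()
    # grep exits with status 1 when nothing matches, so that is not an error here.
    subprocess.run(
        ["grep", "-r", pattern, root],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return time.perf_counter() - start


def seconds_per_gib(elapsed_seconds: float, bytes_read: float) -> float:
    """Normalize a scan time to seconds per GiB read."""
    return elapsed_seconds / (bytes_read / GIB)


# Worked check using figures quoted in this talk: 15 minutes for 1.2GiB.
print(round(seconds_per_gib(15 * 60, 1.2 * GIB), 1))
```

The worked check uses the talk's own "15 minutes to grep 1.2GiB" figure, which comes to about 750 seconds per GiB. A real run would combine `timed_grep(root)` with `tree_size_bytes(root)`.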
SLIDE 42
Do modern file systems age?
SLIDE 43 Git Workload on ext4 on HDD
[Plot: grep time in seconds / GiB (y-axis, ticks 200 to 800) against Git pulls performed (x-axis). Lower is better.]
Annotations on the plot: 2x slowdown, 4x slowdown, 14.3x, 15 minutes to grep 1.2GiB
Our Setup: Cold Cache, 3.4 GHz Quad Core, 4GiB RAM, 20 GiB HDD partition - SATA 7200 RPM
SLIDE 47
How can we be sure this slowdown is due to aging?
The file system objects: “I’m not old. My directory structure is different!”
SLIDE 49 File System Rejuvenation
Idea: Copy the same logical state to a new file system
- After each 100 pulls
- Compare grep cost
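The rejuvenation comparison can be sketched as below: copy the same files to a fresh location, benchmark both, and report the ratio. This is a hedged illustration, not the authors' tooling. `copy_logical_state` and `aging_slowdown` are hypothetical names, the demo copies between two temporary directories purely to show the mechanics (a real experiment would copy onto a newly created file system), and the timings passed to `aging_slowdown` are made-up example values.

```python
import os
import shutil
import tempfile


def copy_logical_state(aged_root: str, fresh_root: str) -> None:
    """Copy the directory tree so both locations hold the same logical content."""
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(aged_root, fresh_root, dirs_exist_ok=True)


def aging_slowdown(aged_seconds_per_gib: float, unaged_seconds_per_gib: float) -> float:
    """How many times slower the aged copy is than the fresh copy."""
    return aged_seconds_per_gib / unaged_seconds_per_gib


# Mechanics only: both directories live on the same file system here.
with tempfile.TemporaryDirectory() as aged, tempfile.TemporaryDirectory() as fresh:
    os.makedirs(os.path.join(aged, "src"))
    with open(os.path.join(aged, "src", "main.c"), "w") as handle:
        handle.write("int main(void) { return 0; }\n")
    copy_logical_state(aged, fresh)
    copied = sorted(os.listdir(os.path.join(fresh, "src")))

# Hypothetical timings, chosen only to illustrate the ratio.
example_ratio = aging_slowdown(880.0, 100.0)
print(copied, example_ratio)
```

Because both copies hold identical files and directories, any difference in grep cost between them can be attributed to on-disk layout and not to the workload's changing directory structure.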
SLIDE 50 Aging ext4 with Git on HDD
[Plot: grep time in seconds / GiB (y-axis, ticks 200 to 800) against Git pulls performed (x-axis). Series: Aged, Unaged. Annotation: 8.8x. Lower is better.]
Smaller average file size makes the unaged file system 60% slower
SLIDE 52
Is this specific to ext4?
SLIDE 53 Aging other file systems with Git on HDD
[Four plots, one per file system: Btrfs (y-axis ticks 200 to 800), F2FS (500 to 2000), ZFS (500 to 2000), XFS (200 to 800). Lower is better.]
Slowdown annotations on the plots: 20.6x, 22.4x, 2.2x, 11.8x
Note: weird unaged behavior on XFS
SLIDE 54
Will SSDs save us?
SLIDE 55 Git Workload on XFS on SSD
[Plot: grep time in seconds / GiB (y-axis, ticks 10 to 30) against Git pulls performed (x-axis). Series: Aged, Unaged. Annotation: 1.9x. Lower is better.]
SLIDE 56 Git Workload on SSD
[Four plots, one per file system: Btrfs (y-axis ticks 10 to 30), ext4 (10 to 30), ZFS (10 to 40), F2FS (10 to 30). Lower is better.]
Slowdown annotations on the plots: 2.2x, 1.5x
ZFS and ext4 slow down with smaller average file size
Told ya!
SLIDE 59 Aging is real
Btrfs, ext4, F2FS, XFS, ZFS all age
- Up to 22x on HDD
- Up to 2x on SSD
Git lets us replay a real development history
- Induce aging by simulating years of use
- Takes between 5 hours and 2 days
- Download these scripts from betrfs.org
SLIDE 60
How can we prevent aging?
SLIDE 63 Design goals to address fragmentation
- Intrafile Fragmentation: Avoid breaking large files into small fragments
- Interfile Fragmentation: Cluster logically related small files
What do we mean by small?
SLIDE 64 I/O Size vs Effective Bandwidth
[Plot: Read Length vs Bandwidth. Y-axis: bandwidth in MiB/sec (log scale, 0.1 to 1000). X-axis: sequential read length, 4KiB to 256MiB in powers of two. Series: SSD, HDD. Higher is better.]
SLIDE 67 Design goals to address fragmentation
- Intrafile Fragmentation: Avoid breaking large files into small fragments
- Interfile Fragmentation: Cluster logically related small files
Prediction: 4MiB chunks will substantially reduce aging
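One way to see why read length matters is a simple cost model in which every read pays a fixed setup cost (such as a seek) and then transfers at peak bandwidth. The sketch below is not taken from the talk: the function name is hypothetical, and the 10 ms setup cost and 100 MiB/s peak are assumed round numbers for a hard disk, not measurements from the authors' hardware.

```python
KIB = 2**10
MIB = 2**20

# Assumed, illustrative device parameters (not from the talk).
ASSUMED_SETUP_SECONDS = 0.010
ASSUMED_PEAK_MIB_S = 100.0


def effective_bandwidth_mib_s(read_bytes: int, setup_seconds: float, peak_mib_s: float) -> float:
    """Bandwidth achieved when each read pays setup_seconds before transferring."""
    read_mib = read_bytes / MIB
    transfer_seconds = read_mib / peak_mib_s
    return read_mib / (setup_seconds + transfer_seconds)


for size in (4 * KIB, 64 * KIB, 1 * MIB, 4 * MIB, 64 * MIB):
    bandwidth = effective_bandwidth_mib_s(size, ASSUMED_SETUP_SECONDS, ASSUMED_PEAK_MIB_S)
    print(f"{size // KIB:>6} KiB reads: {bandwidth:7.2f} MiB/s")
```

Under these assumed parameters, 4KiB reads stay below 1 MiB/s, while 4MiB reads reach roughly 80 MiB/s, about four fifths of the assumed peak. Real devices will differ, but the shape of the curve is the point.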
SLIDE 68
Testing this with Btrfs
SLIDE 69 Btrfs: Larger leaves = less aging?
[Diagram: a B-tree of numbered keys, with Big File, Bigger File and Large File stored outside the tree]
- Metadata and small files are stored in a B-tree
- Large files get written elsewhere
SLIDE 70 Btrfs Leaf Size Performance
[Plot: grep time in seconds / GiB (y-axis, ticks 150 to 600) against Git pulls performed (x-axis). Series: leaf sizes 4k, 8k, 16k, 32k, 64k. Lower is better.]
Bigger leaves do mean less aging!
Btrfs allows leaf size to be configured between 4KiB and 64KiB.
SLIDE 71
Cost of large leaves
Why don’t B-trees usually have big leaves? Because making small changes to big leaves causes a lot of writing
SLIDE 72 Btrfs Leaf Size Writing
[Plot: blocks written in thousands (y-axis, ticks 75 to 300) against Git pulls performed (x-axis). Series: leaf sizes 4k, 8k, 16k, 32k, 64k. Lower is better.]
Bigger leaves do mean more writing
Btrfs allows leaf size to be configured between 4KiB and 64KiB.
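The extra writing can be illustrated with a deliberately simplified model in which every small update rewrites one whole leaf, so the bytes written per update grow linearly with leaf size. This is a sketch under that assumption only: it ignores everything else a real file system writes, the function name is hypothetical, and the 128-byte update size is an arbitrary example value.

```python
KIB = 1024

# Assumed size of one small update, for illustration only.
ASSUMED_UPDATE_BYTES = 128


def write_amplification(leaf_bytes: int, update_bytes: int) -> float:
    """Bytes written per byte changed when one update rewrites one whole leaf."""
    return leaf_bytes / update_bytes


amplification = {
    leaf_kib: write_amplification(leaf_kib * KIB, ASSUMED_UPDATE_BYTES)
    for leaf_kib in (4, 8, 16, 32, 64)
}
for leaf_kib, factor in amplification.items():
    print(f"{leaf_kib:>2} KiB leaf: {factor:.0f}x")
```

In this model, going from 4KiB to 64KiB leaves multiplies the bytes written per small update by 16, whatever update size is assumed.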
SLIDE 73
B-Tree Performance Tradeoff
- Small leaves: more aging, less writing
- Large leaves: less aging, more writing
This tradeoff is inherent to B-trees
SLIDE 75 Other File System Types
Must other types of file systems age?
- Update-in-place
- Log-structured
- Write-Optimized: BεtrFS
See the paper
SLIDE 78
BεtrFS
- BεtrFS packs small, logically related data in a Bε-tree with 4MiB nodes.
- Bε-trees batch updates, which allows leaves to be big without increasing the amount of writing.
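Why batching helps can be shown with a toy count of leaf rewrites. This is not the Bε-tree algorithm and not BεtrFS code: it only models the idea that updates collected into a batch cost one rewrite per distinct leaf touched, instead of one rewrite per update. The function names, the batch size, and the synthetic update stream are all assumptions made for illustration.

```python
from typing import List


def leaf_writes_unbatched(updates: List[int]) -> int:
    """Each update immediately rewrites the leaf it targets."""
    return len(updates)


def leaf_writes_batched(updates: List[int], batch_size: int) -> int:
    """Updates are applied batch by batch; each touched leaf is rewritten once per batch."""
    writes = 0
    for start in range(0, len(updates), batch_size):
        batch = updates[start:start + batch_size]
        writes += len(set(batch))
    return writes


# Synthetic stream: 1000 small updates spread over 4 leaves (leaf ids 0..3).
updates = [i % 4 for i in range(1000)]
unbatched = leaf_writes_unbatched(updates)
batched = leaf_writes_batched(updates, batch_size=100)
print(unbatched, batched)
```

With this synthetic stream, batching 100 updates at a time cuts leaf rewrites from 1000 to 40. The saving depends on how many updates in a batch land on the same leaf, which is why it pairs naturally with large leaves.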
SLIDE 79 Git on BetrFS on HDD
[Plot: grep time in seconds / GiB (y-axis, ticks 200 to 800) against Git pulls performed (x-axis). Aged and unaged series for BetrFS, XFS, ext4, F2FS, ZFS and Btrfs. Lower is better.]
SLIDE 80 Git on BetrFS on HDD
[Plot: grep time in seconds / GiB (y-axis, ticks 20 to 80) against Git pulls performed (x-axis). Aged and unaged series; labeled file systems: BetrFS, Btrfs, F2FS, ext4, ZFS. Lower is better.]
SLIDE 82
And SSDs?
SLIDE 83 Git on BetrFS on SSD
[Plot: grep time in seconds / GiB (y-axis, ticks 10 to 30) against Git pulls performed (x-axis). Aged and unaged series for BetrFS, Btrfs, ZFS, ext4, XFS and F2FS. Lower is better.]
SLIDE 84
How to prevent aging
- Batch updates to avoid too much writing
- Rewrite to keep related data in large blocks
SLIDE 85
Conclusion
- Aging is avoidable
- It’s easy to age file systems quickly and substantially
SLIDE 86
Thank you!
Alex Conway alexander.conway@rutgers.edu betrfs.org