The Btrfs Filesystem, Chris Mason

SLIDE 1

The Btrfs Filesystem Chris Mason

SLIDE 2

Btrfs Design Goals

  • Broad development community
  • General purpose filesystem that scales to very large storage
    – Extents for large files
    – Small files packed in as metadata
  • Flexible disk format that can adapt to new features
    – Btree indexes based on extensible key/value lookups
    – Key ordering determines relative location in the btree
  • Data and metadata checksumming
    – crc32c used for fast, hardware-enabled CRCs
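The key-ordering bullet can be illustrated with a small Python sketch. It assumes the (objectid, type, offset) key layout used by Btrfs; the type codes, object numbers, and payload strings are placeholders for illustration, and a sorted list stands in for the btree.

```python
import bisect

# Type codes are illustrative placeholders, not an authoritative list.
INODE_ITEM, INODE_REF, EXTENT_DATA = 1, 12, 108

# Items inserted in arbitrary order; each key is (objectid, type, offset).
items = {
    (257, EXTENT_DATA, 0): "first extent of file 257",
    (256, INODE_ITEM, 0): "inode for directory 256",
    (257, INODE_ITEM, 0): "inode for file 257",
    (257, EXTENT_DATA, 4096): "second extent of file 257",
    (257, INODE_REF, 256): "name of file 257 inside directory 256",
    (258, INODE_ITEM, 0): "inode for file 258",
}

# Sorting the keys is what a btree does implicitly: tuple comparison
# orders by objectid first, then type, then offset.
keys = sorted(items)


def keys_for_object(sorted_keys, objectid):
    """Return all keys belonging to one object.

    Because of the ordering, they form one contiguous run, so two
    binary searches are enough to find the whole group.
    """
    lo = bisect.bisect_left(sorted_keys, (objectid, 0, 0))
    hi = bisect.bisect_left(sorted_keys, (objectid + 1, 0, 0))
    return sorted_keys[lo:hi]


for key in keys_for_object(keys, 257):
    print(key, "->", items[key])
```

Everything about file 257 (inode, name reference, extents) ends up adjacent, which is the sense in which key ordering determines relative location.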

SLIDE 3

Btrfs Design Goals

  • Data and metadata copy on write
    – Block contents preserved until replacement is safely on disk
  • Data and metadata reference counting with back references
    – Every block and filename links back to its owners
  • Fast, writable snapshots
    – COW enables O(1) snapshots of subvolumes
    – O(number of extents in the file) snapshots of single files
  • Efficient detection of recently modified files
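Why copy on write makes subvolume snapshots O(1) can be sketched in a few lines of Python. This is a toy model, not the Btrfs implementation: nested dicts stand in for tree blocks, the `Subvolume` class and `cow_set` helper are hypothetical names, and reference counting is left to Python's garbage collector.

```python
def cow_set(node, path, value):
    """Return a new root with `path` set to `value`.

    Only the nodes along the path are copied; every sibling subtree is
    shared with the old version, which is never modified in place.
    """
    new_node = dict(node)  # shallow copy: children are still shared
    head, *rest = path
    if rest:
        new_node[head] = cow_set(node.get(head, {}), rest, value)
    else:
        new_node[head] = value
    return new_node


class Subvolume:
    """Toy subvolume: just a pointer to the current tree root."""

    def __init__(self, root=None):
        self.root = root if root is not None else {}

    def write(self, path, value):
        # The old root stays intact until the new one replaces it.
        self.root = cow_set(self.root, path, value)

    def snapshot(self):
        # O(1): the snapshot shares the root, no data is copied.
        return Subvolume(self.root)


vol = Subvolume()
vol.write(["etc", "hosts"], b"v1")
vol.write(["home", "notes"], b"n1")

snap = vol.snapshot()
vol.write(["etc", "hosts"], b"v2")   # only the etc/ path is copied
snap.write(["home", "notes"], b"n2")  # snapshots are writable too
```

After the two writes, `vol` and `snap` have diverged, yet each still shares with the other every subtree it did not touch.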
SLIDE 4

Btrfs Design Goals

  • Simple, online disk administration
    – btrfs dev add /dev/xxx /mnt
    – btrfs dev delete /dev/xxx /mnt
    – btrfs filesystem resize XX /mnt
        • Can also resize a single device
    – btrfs filesystem balance /mnt
  • Multiple device support
    – Flexible relocation of space
    – Easily find good copies when crcs fail
  • Efficient synchronous operations that do not stall the rest of the filesystem
  • These goals have been met!
SLIDE 5

Snapshots and Subvolumes

  • Subvolume is the unit of snapshotting
  • Snapshots are very efficient, even when many are in place against the same source
    – Individual files may be cloned without a full snapshot
    – Cloning support now in cp --reflink
  • Subvolumes and snapshots may be created anywhere
  • Subvolumes are roughly as expensive as directories
    – But, you may not rename or hardlink files between subvolumes
  • Snapshots can be written and snapshotted again
SLIDE 6

Snapshot Rollback

  • The snapshot or subvolume used as the root of the filesystem can be specified
    – btrfs subvol list to find subvolumes
    – btrfs subvolume set-default to set a new default
  • Allows you to snapshot before upgrading and roll back if things don't work well

SLIDE 7

Current Work In Progress

  • Fsck with repair
    – Initially fs rescue
  • Robust error handling
  • RAID5/6
    – Reuse MD's parity calculation code
    – Single stripe size, adapt allocator and FS writeback to send down full stripes
  • SSD front end cache
  • Locking bottlenecks
SLIDE 8

SSD Optimizations

  • Really just turning off rotational optimizations
  • Send IO to the device right away
    – No stalling or waiting to collect more IO
  • Don't avoid fragmentation
  • Send large writes whenever possible
  • Reuse blocks instead of spreading across the device
    – Unless you're on a cheap SSD
  • Send discards down in large batches
    – Collected in bulk and sent down right after transaction commit
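The last bullet, collecting discards and sending them in large batches after commit, can be sketched as follows. This is a simplified Python model under stated assumptions: the `Transaction` class, `merge_ranges`, and the `send_discard` callback are hypothetical names, and the actual commit work is omitted.

```python
def merge_ranges(ranges):
    """Coalesce (start, length) byte ranges that touch or overlap."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Adjacent to or overlapping the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


class Transaction:
    """Toy transaction that defers discards until commit."""

    def __init__(self, send_discard):
        self.freed = []
        self.send_discard = send_discard  # callback: (start, length)

    def free_extent(self, start, length):
        # Only remember the range; nothing is sent to the device yet.
        self.freed.append((start, length))

    def commit(self):
        # (the real commit work would happen before this point)
        batch = merge_ranges(self.freed)
        self.freed = []
        for start, length in batch:
            self.send_discard(start, length)
        return batch


sent = []
txn = Transaction(lambda start, length: sent.append((start, length)))
txn.free_extent(8192, 4096)
txn.free_extent(0, 4096)
txn.free_extent(4096, 4096)
txn.free_extent(65536, 8192)
txn.commit()
print(sent)  # [(0, 12288), (65536, 8192)]
```

Four small frees become two large discards, and none of them reaches the device before the commit.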

SLIDE 9

Why Discard/Trim

SLIDE 10

SSD Front End Cache

  • Stage writes to a set of fast SSD devices
  • Remapping layer to remember which blocks are up to date on the SSD
  • Push frequently read extents into the SSD as well
  • Hot data will stay on the SSD without hitting spinning disks
  • Work in progress, slightly different from IBM's experiments over the summer
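The bullets above describe a staging cache with a remapping layer. The Python sketch below shows one way such a layer could behave; it is not the Btrfs design, and the class name, the `hot_threshold` parameter, and the promotion policy are all assumptions made for illustration.

```python
class FrontEndCache:
    """Toy SSD front end: dicts stand in for the two device tiers."""

    def __init__(self, hot_threshold=2):
        self.hdd = {}      # block number -> data on spinning disk
        self.ssd = {}      # block number -> data on the SSD
        self.dirty = set() # blocks whose newest copy is only on the SSD
        self.reads = {}    # block number -> reads served from disk
        self.hot_threshold = hot_threshold

    def write(self, block, data):
        # Writes are staged on the SSD and remembered as dirty.
        self.ssd[block] = data
        self.dirty.add(block)

    def read(self, block):
        if block in self.ssd:
            return self.ssd[block]  # hot data never touches the disk
        data = self.hdd[block]
        self.reads[block] = self.reads.get(block, 0) + 1
        if self.reads[block] >= self.hot_threshold:
            self.ssd[block] = data  # promote frequently read data
        return data

    def flush(self):
        # Copy staged writes down; the SSD copy stays for future reads.
        for block in sorted(self.dirty):
            self.hdd[block] = self.ssd[block]
        self.dirty.clear()
```

The `dirty` set plays the role of the remapping layer's "up to date on the SSD" record: until `flush` runs, the disk copy of those blocks is stale or missing.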

SLIDE 11

Thin Provisioning

  • Btrfs storage chunks are well suited to thin provisioning
  • Btrfs can return large chunks of storage back to the array
  • Btrfs can quickly expand the FS
  • Discard support in Btrfs sends information about unused blocks down to the storage at run time
  • FITRIM ioctl support is important for thin provisioning
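For readers unfamiliar with FITRIM, here is a hedged Python sketch of how a program can issue it on Linux. The request number and the three-field `fstrim_range` layout follow `<linux/fs.h>`; the helper names are illustrative, and the call needs root privileges and a filesystem that supports trimming.

```python
import fcntl
import os
import struct

# FITRIM is _IOWR('X', 121, struct fstrim_range) in <linux/fs.h>.
FITRIM = 0xC0185879

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
FSTRIM_RANGE = struct.Struct("=QQQ")


def pack_fstrim_range(start=0, length=2**64 - 1, minlen=0):
    """Build the ioctl argument; the defaults cover the whole filesystem."""
    return FSTRIM_RANGE.pack(start, length, minlen)


def fitrim(mountpoint, start=0, length=2**64 - 1, minlen=0):
    """Ask the filesystem at `mountpoint` to discard its free space.

    Returns the number of bytes the kernel reports as trimmed.
    """
    buf = bytearray(pack_fstrim_range(start, length, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        # The kernel writes the trimmed byte count back into `len`.
        fcntl.ioctl(fd, FITRIM, buf)
    finally:
        os.close(fd)
    return FSTRIM_RANGE.unpack(buf)[1]
```

A caller would run something like `fitrim("/mnt")` as root; this is roughly what the `fstrim` utility does.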
SLIDE 12

Atomic Writes for Applications

  • COW writes to Btrfs can be atomic up to large sizes
  • Some hardware supports fast atomic writes of larger IOs as well
  • Work in progress to wire up Btrfs atomic write support and use optimizations from the hardware
  • We may also support linked atomic writes between two or more files

SLIDE 13

Database Write Performance

  • Poor random write performance in COW mode
  • Large files tend to fragment badly, leading to huge amounts of metadata and seeking
  • New data from random writes can be collected in bulk after transaction commit and copied back to the original location
  • Work in progress
SLIDE 14

Finding Recent Modifications

  • btrfs subvol find-new
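The idea that makes this kind of search cheap can be sketched in Python. The model assumes that every tree block records the generation (transaction number) that last wrote it and that, because of copy on write, a parent is rewritten whenever one of its children is. The `Node` class and `find_new` function are illustrative, not the command's actual code or exact semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    generation: int  # transaction that last wrote this block
    name: str = ""   # only meaningful for leaves in this toy model
    children: list = field(default_factory=list)


def find_new(node, min_gen):
    """Yield names of leaves written at generation >= min_gen.

    A parent's generation is never older than its children's, so an
    old parent lets the walk skip its entire subtree unread.
    """
    if node.generation < min_gen:
        return
    if not node.children:
        yield node.name
        return
    for child in node.children:
        yield from find_new(child, min_gen)


tree = Node(12, children=[
    Node(5, children=[Node(5, "a.txt"), Node(3, "b.txt")]),
    Node(12, children=[Node(12, "c.log"), Node(7, "d.cfg")]),
])
print(list(find_new(tree, 10)))  # ['c.log']
```

The cost depends on how much changed since `min_gen`, not on the total size of the tree.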
SLIDE 15

Btrfs Scrubbing

  • Scrubbing finds and repairs bad data
  • Read all the allocated extents
  • Verify checksums
  • Replace bad copies with correct mirror
  • Work in progress, initial implementation working
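The scrub steps listed above (read, verify checksum, repair from a good mirror) can be sketched in Python. The `crc32c` function is a plain bitwise CRC-32C with the Castagnoli polynomial; the `scrub` function and its dict-based mirrors are a toy model for illustration, not the kernel implementation.

```python
def crc32c(data):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair bad copies.

    mirrors:   list of dicts, block number -> bytes (one per copy)
    checksums: dict, block number -> expected crc32c
    Returns (repaired, unrecoverable): a list of (block, mirror index)
    pairs that were rewritten, and a list of blocks with no good copy.
    """
    repaired, unrecoverable = [], []
    for block, want in sorted(checksums.items()):
        good = next(
            (m[block] for m in mirrors if crc32c(m[block]) == want), None
        )
        if good is None:
            unrecoverable.append(block)  # nothing to repair from
            continue
        for index, mirror in enumerate(mirrors):
            if crc32c(mirror[block]) != want:
                mirror[block] = good  # replace the bad copy
                repaired.append((block, index))
    return repaired, unrecoverable


mirrors = [
    {0: b"hello", 1: b"wor1d"},  # block 1 is corrupted on this mirror
    {0: b"hello", 1: b"world"},
]
checksums = {0: crc32c(b"hello"), 1: crc32c(b"world")}
print(scrub(mirrors, checksums))  # ([(1, 0)], [])
```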
SLIDE 16

Conclusions

  • Many things working and stable
  • Focused on stability and performance
  • http://btrfs.wiki.kernel.org/
  • chris.mason@oracle.com