Slide 1

ZFS UTH (Under The Hood)

Superlite

Jason Banham & Jarod Nash

Systems TSC, Sun Microsystems

ZFS I/O Stack (ZFS → DMU → Storage Pool)

Object-Based Transactions

  • “Make these 7 changes to these 3 objects”
  • All-or-nothing

Transaction Group Commit

  • Again, all-or-nothing
  • Always consistent on disk
  • No journal – not needed

Transaction Group Batch I/O

  • Schedule, aggregate, and issue I/O at will
  • No resync if power lost
  • Runs at platter speed

Slide 2

ZFS Elevator Pitch

  • Data Integrity

> Historically considered “too expensive”
> Turns out, no it isn't
> Real-world evidence shows silent corruption is a reality
> The alternative is unacceptable

  • Ease of Use

> Combined filesystem and volume management
> Underlying storage managed as pools, which simplify administration
> Two commands: zpool & zfs (see the example below)
  > zpool: manage storage pools (aka volume management)
  > zfs: manage filesystems

“To create a reliable storage system from inherently unreliable components”
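
A minimal sketch of the two-command model, assuming a hypothetical pool named tank built from two equally hypothetical disks:

  # zpool create tank mirror c0t0d0 c0t1d0    (create a mirrored storage pool)
  # zfs create tank/home                      (create a filesystem inside it)
  # zfs list                                  (show the datasets in the pool)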

Slide 3

ZFS Data Integrity

Two aspects:

1. Always consistent on-disk format

> Everything is copy-on-write (COW)

> Never overwrite live data
> On-disk state always valid – no “windows of vulnerability”
> Provides snapshots “for free”

> Everything is transactional

> Related changes succeed or fail as a whole

– AKA Transaction Group (TXG)

> No need for journaling

2. End-to-end checksums

> Filesystem metadata and file data protected using checksums
> Protects end to end across the interconnect, handling failures between storage and host

Slide 4

ZFS COW: Copy On Write

  • 1. Initial block tree
  • 2. COW some blocks
  • 3. COW indirect blocks
  • 4. Rewrite uberblock (atomic)

Slide 5

FS/Volume Model vs. ZFS

FS/Volume I/O Stack (FS → Volume → disks)

Block Device Interface (FS to volume)

  • “Write this block, then that block, ...”
  • Loss of power = loss of on-disk consistency
  • Workaround: journaling, which is slow & complex

Block Device Interface (volume to disks)

  • Write each block to each disk immediately to keep mirrors in sync
  • Loss of power = resync
  • Synchronous and slow

ZFS I/O Stack (ZFS → DMU → Storage Pool)

Object-Based Transactions

  • “Make these 7 changes to these 3 objects”
  • All-or-nothing

Transaction Group Commit

  • Again, all-or-nothing
  • Always consistent on disk
  • No journal – not needed

Transaction Group Batch I/O

  • Schedule, aggregate, and issue I/O at will
  • No resync if power lost
  • Runs at platter speed

Slide 6

ZFS End to End Checksums

Disk Block Checksums

  • Checksum stored with data block
  • Any self-consistent block will pass
  • Can't even detect stray writes
  • Inherent FS/volume interface limitation

Disk checksum only validates media

  ✔ Bit rot
  ✗ Phantom writes
  ✗ Misdirected reads and writes
  ✗ DMA parity errors
  ✗ Driver bugs
  ✗ Accidental overwrite

ZFS Data Authentication

  • Checksum stored in parent block pointer
  • Fault isolation between data and checksum
  • Entire storage pool is a self-validating Merkle tree

ZFS validates the entire I/O path

  ✔ Bit rot
  ✔ Phantom writes
  ✔ Misdirected reads and writes
  ✔ DMA parity errors
  ✔ Driver bugs
  ✔ Accidental overwrite
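
The checksum algorithm is an ordinary per-dataset property; a hedged example (the pool name tank is hypothetical, sha256 is one of the supported values):

  # zfs get checksum tank          (show the current checksum algorithm)
  # zfs set checksum=sha256 tank   (newly written blocks use SHA-256)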

Slide 7

Traditional Mirroring

(Application → FS → xxVM mirror)

  • 1. Application issues a read. Mirror reads the first disk, which has a corrupt block. It can't tell.
  • 2. Volume manager passes bad block up to filesystem. If it's a metadata block, the filesystem panics. If not...
  • 3. Filesystem returns bad data to the application.

Slide 8

Self-Healing Data in ZFS

(Application → ZFS → mirror)

  • 1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
  • 2. ZFS tries the second disk. Checksum indicates that the block is good.
  • 3. ZFS returns good data to the application and repairs the damaged block.
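
A hedged sketch of how a self-healed read surfaces to the administrator (pool name, device names, and counter values are illustrative): the repaired block shows up as a non-zero CKSUM count in zpool status, while the application never sees an error.

  # zpool status tank
          NAME         STATE   READ WRITE CKSUM
          tank         ONLINE     0     0     0
            mirror     ONLINE     0     0     0
              c0t0d0   ONLINE     0     0     3
              c0t1d0   ONLINE     0     0     0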

Slide 9

ZFS Administration

  • Pooled storage – no more volumes!

> All storage is shared – no wasted space, no wasted bandwidth

  • Hierarchical filesystems with inherited properties

> Filesystems become administrative control points

– Per-dataset policy: snapshots, compression, backups, privileges, etc.
– Who's using all the space? du(1) takes forever, but df(1M) is instant!

> Manage logically related filesystems as a group
> Control compression, checksums, quotas, reservations, and more (see the example below)
> Mount and share filesystems without /etc/vfstab or /etc/dfs/dfstab
> Inheritance makes large-scale administration a snap

  • Online everything
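
A hedged sketch of per-dataset properties and inheritance, assuming a hypothetical tank/home hierarchy:

  # zfs create tank/home
  # zfs set compression=on tank/home      (inherited by every descendant filesystem)
  # zfs set sharenfs=on tank/home         (shared over NFS, no dfstab editing)
  # zfs create tank/home/alice            (picks up compression and sharenfs automatically)
  # zfs set quota=10G tank/home/alice     (per-user space policy)
  # zfs get -r compression tank/home      (children report the property SOURCE as inherited)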

Slide 10

FS/Volume Model vs. ZFS

Traditional Volumes

  • Abstraction: virtual disk
  • Partition/volume for each FS
  • Grow/shrink by hand
  • Each FS has limited bandwidth
  • Storage is fragmented, stranded

ZFS Pooled Storage

  • Abstraction: malloc/free
  • No partitions to manage
  • Grow/shrink automatically
  • All bandwidth always available
  • All storage in the pool is shared


Slide 11

Dynamic Striping

  • Automatically distributes load across all devices

With four mirrors in the pool:

  • Writes: striped across all four mirrors
  • Reads: wherever the data was written
  • Block allocation policy considers:
    > Capacity
    > Performance (latency, BW)
    > Health (degraded mirrors)

After adding mirror 5:

  • Writes: striped across all five mirrors
  • Reads: wherever the data was written
  • No need to migrate existing data
    > Old data striped across mirrors 1-4
    > New data striped across mirrors 1-5
    > COW gently reallocates old data
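
A hedged example of widening the stripe by adding a fifth top-level mirror (pool and device names are hypothetical):

  # zpool add tank mirror c2t0d0 c2t1d0   (new writes immediately stripe across all five mirrors)
  # zpool status tank                     (the new mirror appears as an additional top-level vdev)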

Slide 12

Snapshots for Free

  • The combination of COW and TXGs means constant-time snapshots fall out for free*
  • At the end of a TXG, don't free the COWed blocks
    > Actually cheaper to take a snapshot than not!

(Diagram: Snapshot root vs. Live root)

*Nothing is ever free; old COWed blocks of course consume space
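
A hedged illustration (the dataset and snapshot names are hypothetical):

  # zfs snapshot tank/home@tuesday   (constant time, regardless of filesystem size)
  # zfs list -t snapshot             (snapshots consume space only as live data diverges)
  # zfs rollback tank/home@tuesday   (revert the filesystem to the snapshot)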

Slide 13

Disk Scrubbing

  • Finds latent errors while they're still correctable

> ECC memory scrubbing for disks

  • Verifies the integrity of all data

> Traverses pool metadata to read every copy of every block
> Verifies each copy against its 256-bit checksum
> Self-healing as it goes

  • Provides fast and reliable resilvering

> Traditional resilver: whole-disk copy, no validity check
> ZFS resilver: live-data copy, everything checksummed
> All data-repair code uses the same reliable mechanism

– Mirror resilver, RAIDZ resilver, attach, replace, scrub
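
A hedged sketch of running a scrub by hand (the pool name is hypothetical):

  # zpool scrub tank        (walk every copy of every block and repair what it can)
  # zpool status tank       (reports scrub progress and any errors found or repaired)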

Slide 14

ZFS Commands

  • zfs(1M): used to administer filesystems, zvols, and dataset properties
  • zpool(1M): used to control the storage pool

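For orientation, a hedged pair of starting points (output omitted; the pool and dataset names are hypothetical):

  # zpool status -x          (one-line health summary across all pools)
  # zfs get all tank/home    (every property and its value for one dataset)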

Slide 15

ZFS Live Demo

Slide 16

ZFS Availability

  • OpenSolaris

> Open-source version of the latest Solaris in development (Nevada)
> Available via:

  > Solaris Express Developer Edition
  > Solaris Express Community Edition
  > OpenSolaris Developer Preview 2 (Project Indiana)
  > Other distros (Belenix, Nexenta, Schilix, MarTux)

  • Solaris 10

> Since Update 2 (latest is Update 4)

  • OpenSolaris versions will always have the latest and greatest bits, and are therefore the best place to play with and explore the potential of ZFS

Slide 17

ZFS Under The Hood

  • Full day of ZFS Presentations and Talks

> Covering:

> Overview – more of this presentation, in a “manager safe” form
> Issues – known issues around the current ZFS implementation
> Under The Hood – how ZFS does what it does

  • If you are seriously interested in ZFS and want to know more, would like a discussion, or are just plainly interested in how it works, then drop us a line:

    > Jarod.Nash@sun.com
    > Jason.Banham@sun.com