
Safety First

New Features in Linux File & Storage Systems

Ric Wheeler rwheeler@redhat.com September 27, 2010


Overview

Why Care about Data Integrity

Common Causes of Data Loss

Data Loss Exposure Timeline

Reliability Building Blocks

Examples

Grading our Progress

Questions?


Why Care about Data Integrity

Linux is used to store critical personal data

  • On Linux based desktops and laptops
  • Linux based devices like Tivo, cell phones or a home NAS storage device
  • Remote backup providers use Linux systems to backup your personal data

Linux servers are very common in corporate data centers

  • Financial services
  • Medical images
  • Banks

Used as the internal OS for appliances


Linux has one IO & FS Stack

One IO stack has to run on critical and non-critical devices

  • Data integrity is the only “safe” default configuration
  • Sophisticated users can choose non-default options that might disable data reliability features

Trade-offs need to be made with care

  • Some configurations can hurt data integrity (like ext3 writeback mode)
  • Some expose data integrity issues (non-default support for write barriers)

Where possible, the system should auto-configure these options

  • Detection of battery or UPS can allow for mounting without write barriers
  • Special tunings for SSDs and high-end arrays to avoid barrier operations
  • Similar to the way we auto-configure features for other types of hardware

Common Causes of Data Loss

Uncountable ways to lose data

  • Honesty requires that vendors who sell you a storage device provide a disclaimer about when, not if, you will lose data

Storage device manufacturers work hard to

  • Track the causes of data loss or system unavailability of real, deployed systems

  • Classify those instances into discrete buckets
  • Carefully select the key issues that are most critical (or common) to fix first

Tracking real world issues and focusing on fixing high priority issues requires really good logging and reporting from deployed systems

  • Voluntary methods like the kernel-oops project
  • Critical insight comes from being able to gather good data, do real analysis and then monitor how your fixes impact the deployed base


User Errors

Accidental destruction or deletion of data

  • Overwriting good data with bad
  • Restoring old backups to good data

Service errors

  • Replacing the wrong drive in a RAID array
  • Not plugging in the UPS

Losing track of where you put it

  • Too many places to store a photo, MP3, etc

Misunderstanding or lack of knowledge about how to configure the system properly

Developers need to be responsible for providing easy-to-use systems in order to minimize this kind of error!


Application Developer Errors

Some application coders do not worry about data integrity to begin with

  • Occasionally on purpose – speed is critical, data is not (can be rerun)
  • Occasionally by ignorance
  • Occasionally by mistake

For application authors that do care, we make data integrity difficult by giving them poor documentation:

  • rename: How many fsync() calls should we use when renaming a file?
  • fsync: just the file? File and parent directory?
  • Best practices that change with type of storage, file system type and file system journal mode

Primitives need to be clearly understood & well documented

  • Too many choices make it next to impossible for application developers to deliver reliable software

“Mostly works” is not good enough for data integrity
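The rename question above has a conventional answer on POSIX systems: write a temporary file, fsync it, rename it over the target, then fsync the parent directory. A minimal sketch (the helper name `atomic_write` is mine, not from the talk):

```python
import os

def atomic_write(path, data):
    """Replace `path` with `data` so that a crash leaves either the
    old contents or the new contents, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp = os.path.join(directory, ".tmp." + os.path.basename(path))
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # push the file data to stable storage first
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic replacement on POSIX file systems
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)         # fsync the parent directory so the
    finally:                  # rename itself is durable
        os.close(dfd)
```

Whether the directory fsync is required in practice is exactly the kind of question the slide says is under-documented; it varies with file system and journal mode.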


OS & Configuration Errors

Configuration Errors

  • Does the system have the write barrier enabled if needed?

Bugs in the IO Stack

  • Do we use write barriers correctly to flush volatile write caches?
  • Do we properly return error codes so the application can handle failures?
  • Do we log the correct failure in /var/log/messages in a way that can point the user to a precise component?


Hardware Failures

Hard disk failures

Power supply failures

DRAM failures

Cable failures


Disasters

Name your favorite disaster

  • Fire, Flood, Blizzards....
  • Power outage
  • Terrorism

Avoiding a single point of failure requires that any site have a method to keep a copy at some secondary location

  • Remote data mirroring can be done at the file level or block level
  • Backup and storage of backup images off site
  • Buy expensive hardware support for remote replication, etc

Data Loss Exposure Timeline

Rough Timeline:

  • State 0: Data creation
  • State 1: Stored to persistent storage
  • State 2: Component failure in your system
  • State 3: Detection of the failed component
  • State 4: Data repair started
  • State 5: Data repair completed

Minimizing the time spent out of State 1 is what storage designers lose sleep over!
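A toy calculation of that exposure window; every timestamp below is invented purely for illustration:

```python
# Hypothetical timestamps (in hours) for the states above.
events = {
    0: 0.0,     # State 0: data creation
    1: 0.1,     # State 1: stored to persistent storage
    2: 500.0,   # State 2: component failure
    3: 548.0,   # State 3: failure detected (a scrub found it 2 days later)
    4: 549.0,   # State 4: repair started
    5: 557.0,   # State 5: repair completed (8 hour RAID rebuild)
}

# The window during which a second failure can cause permanent loss
# runs from the component failure to the completed repair.
exposure = events[5] - events[2]
print(f"exposure window: {exposure:.1f} hours")
```

Most of the window here is detection time, not repair time, which is why the next slides dwell on scanning and scrubbing.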

What is the expected frequency of disk failure?

Hard failures

  • Total disk failure
  • Read or write failure
  • Usually instantaneously detected

Soft failures

  • Can happen at any time
  • Usually detection requires scrubbing or scanning the storage
  • Unfortunately, can be discovered during RAID rebuild

Note that this is not just a rotating disk issue

  • SSDs wear out, paper and ink fade, CDs deteriorate, etc

How long does it take you to detect a failure?

Does your storage system detect latent errors in

  • Hours? Days? Weeks?
  • Only when you try to read the data back?

Most storage systems do several levels of data scanning and scrubbing

  • Periodic reads (proper read or read_verify commands) to ensure that any latent errors are detected

  • File servers and object based storage systems can do whole file reads and compare data to a digital hash, for example

  • Balance is needed between frequent scanning/scrubbing and system performance for its normal workload
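The whole-file hash comparison mentioned above can be sketched as a toy file-level scrubber (the function names are mine, not from any real file server):

```python
import hashlib
from pathlib import Path

def record_hashes(paths):
    """Store a SHA-256 digest per file, as a file server might
    at the time the data is written."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest()
            for p in paths}

def scrub(paths, known):
    """Re-read every file and report any whose contents no longer
    match the recorded digest (i.e. a latent error)."""
    return [p for p in paths
            if hashlib.sha256(Path(p).read_bytes()).hexdigest() != known[p]]
```

A real scrubber would also rate-limit its reads, which is the performance balance the last bullet describes.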


How long does it take you to repair a failure?

Repair the broken storage physically

  • Rewrite a few bad sectors (multiple seconds)?
  • Replace a broken drive and rebuild the RAID group (multiple hours)?

Can we repair the damage done to the file system?

  • Are any files present but damaged?
  • Do I need to run fsck?
  • Very useful to be able to map an IO error back to a user file, metadata or unallocated space

Repair the logical structure of the file system metadata

  • Fsck time can take hours or days
  • Restore any data lost from backup

Users like to be able to verify file system integrity after a repair

  • Are all of my files still on the system?
  • Is the data in those files intact and unchanged?
  • Can you tell me precisely what I lost?

Exposure to Permanent Data Loss

Combination of the factors described:

  • Robustness of storage system
  • Rate of failure of components
  • Time to detect the failure
  • Time to repair the physical media
  • Time to repair the file system metadata (fsck)
  • Time to summarize for the user any permanent loss

If the time required to detect and repair is longer than your time between failures, you will lose data!
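A standard back-of-the-envelope model (not from the talk) makes this concrete: the classic RAID5 mean-time-to-data-loss approximation, MTBF^2 / (N * (N-1) * MTTR), where a second disk must fail inside the detect-plus-repair window of the first. All the numbers below are illustrative:

```python
def mttdl_raid5(mtbf_hours, n_disks, mttr_hours):
    """Approximate mean time to data loss for an N-disk RAID5 group:
    data is lost if a second disk fails while the first is still
    being detected and repaired (the MTTR window)."""
    return mtbf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

# 6 disks, 1,000,000 hour MTBF per drive (an illustrative figure)
fast = mttdl_raid5(1_000_000, 6, 8)    # 8 hour detect + rebuild window
slow = mttdl_raid5(1_000_000, 6, 72)   # 3 day detect + rebuild window

# A 9x longer exposure window cuts expected time to data loss 9x.
print(f"fast repair: {fast:.3g} h, slow repair: {slow:.3g} h")
```

The repair window divides straight into the expected life of the data, which is why every slide in this section pushes on detection and repair times.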


Storage Downtime without Data Loss Counts

Unavailable data loses money

  • Banking transactions
  • Online transactions

Data unavailability can be really mission critical

  • X-rays are digital and used in operations
  • Digital maps used in search

Horrible performance during repair is downtime in many cases

  • Minimizing repair time minimizes this loss as well

How Many Concurrent Failures Can Your Data Survive?

Protection against failure is expensive

  • Storage system performance
  • Utilized capacity
  • Extra costs for hardware, power and cooling for less efficient storage systems

Single drive can survive soft failures

  • A single disk is 100% efficient

RAID5 can survive 1 hard failure & soft failures

  • RAID5 with 5 data disks and 1 parity disk is 83% efficient

RAID6 can survive 2 hard failures & soft failures

  • RAID6 with 4 data disks and 2 parity disks is only 66% efficient!

Fancy schemes (erasure encoding schemes) can survive many failures

  • Any “k” drives out of “n” are sufficient to recover data
  • Popular in cloud and object storage systems
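The efficiency figures above are just the ratio of data disks to total disks; a quick sketch:

```python
def efficiency(data_disks, redundancy_disks):
    """Fraction of raw capacity available for user data."""
    return data_disks / (data_disks + redundancy_disks)

print(f"single disk: {efficiency(1, 0):.0%}")   # no redundancy at all
print(f"RAID5 5+1:   {efficiency(5, 1):.0%}")   # survives 1 hard failure
print(f"RAID6 4+2:   {efficiency(4, 2):.0%}")   # survives 2 hard failures
print(f"10-of-14 EC: {efficiency(10, 4):.0%}")  # erasure code surviving 4
```

The erasure-coding row (10 data, 4 redundancy) is an invented example: a k-of-n scheme can beat RAID6 on both efficiency and failure tolerance by spreading redundancy across more drives.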


Example: MD RAID5 & EXT3

RAID5 gives us the ability to survive 1 hard failure

  • Any second soft failure during RAID rebuild can cause data loss since we need to read each sector of all other disks during rebuild

  • Rebuild can begin only when we have a new or spare drive to use for rebuild

Concurrent hard drive failures in a RAID group are rare

  • ... but detecting latent (soft) errors during rebuild is increasingly common!

MD has the ability to “check” RAID members on demand

  • Useful to be able to de-prioritize this background scan
  • Should run once every 2 to 4 weeks

RAID rebuild times are linear with drive size

  • Can run up to 1 day for a healthy set of disk drives

EXT3 fsck times can run a long time

  • 1TB FS fsck with 45 million files ran 1 hour (reports in the field of run times up to 1 week!)

  • Hard (not impossible) to map bad sectors back to user files using ncheck/icheck
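Why a rebuild must read every sector of every surviving disk follows directly from how parity works; a toy XOR-parity sketch (not MD's actual code, just the principle):

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together, as RAID5 parity does."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A stripe across 4 data disks plus its parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Disk 2 fails: rebuilding it means reading EVERY surviving disk.
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[2]
# A latent sector error on any one survivor would make this
# reconstruction silently wrong, hence the need for scrubbing.
```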


EXT4 Improvements

Checksums are computed for the journal transactions

  • Help detect corrupted transactions on replay of the journal after a crash
  • Do not cover user data blocks

Extent based allocation structures do support quicker fsck

  • Uninitialized inodes can be skipped (upstream patches posted that will help mkfs eventually)

  • Preallocated data blocks can be flagged as unwritten, which lets us avoid writing blocks of zero data

  • These features – especially on larger file systems – need thorough testing!

Better allocation policies lead to faster fsck times

  • Directory related blocks are stored on disk in contiguous allocations
  • High file count fsck times are 6-8 times faster

Updated Example: MD RAID5 & EXT4

RAID5 gives us the ability to survive 1 hard failure

  • Any second soft failure during RAID rebuild can cause data loss since we need to read each sector of all other disks during rebuild

  • Rebuild can begin only when we have a new or spare drive to use for rebuild

Concurrent hard drive failures in a RAID group are rare

  • ... but detecting latent (soft) errors during rebuild is increasingly common!

MD has the ability to “check” RAID members on demand

  • Useful to be able to de-prioritize this background scan
  • Should run once every 2 to 4 weeks

RAID rebuild times are linear with drive size

  • Can run up to 1 day for a healthy set of disk drives

EXT4 fsck times are much improved

  • 1TB FS fsck with 45 million files ran under 7 minutes, much faster than the ext3 time of 1 hour

  • 1 billion file ext4 fsck finished in 2.5 hours!
  • Credit goes to better meta-data layout

Native Support for Sector Level Scans

Data protection schemes need ways to detect partial failures

  • Background scans of storage are essential parts of high end storage
  • Need to run at a rate which allows you to detect partial failures before double failures occur

  • Need to balance impact on performance against frequent scanning
  • Over-zealous scanning can prematurely age storage devices

File System Level Scans

  • BTRFS has data and metadata checksums which enable a full scan to be done by simply reading each file

  • Ext4 has some meta-data checksums

Block level scanning

  • READ_VERIFY command can be used for SCSI and S-ATA to read from platter to device cache without transfer of data to host

  • Checks un-allocated space as well

Better APIs for Data Integrity

Enhanced APIs between the block layer and file systems

  • Ability to retry a given mirror for a RAID1 device?
  • Better and more specific error values – EIO is not always enough

New system calls?

  • Support for batching of expensive calls like fsync()?
  • Exporting transactions to user space?

Better documentation on existing APIs like fsync(), O_DIRECT writes and rename()

Better documentation on the data path

  • write() moves data from application buffer to page cache
  • fsync() without write barrier or normal page flushing moves data from page cache to storage device and its write caches

  • fsync() with write barrier returns only when data has been flushed from storage write cache

  • O_DIRECT write bypasses page cache but still can end up in a volatile storage device cache
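The first two steps of that data path can be annotated directly in code; the file name `journal.log` is just an illustration:

```python
import os

fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

# write(): application buffer -> page cache.  Returns quickly;
# the data is NOT yet on stable storage.
os.write(fd, b"transaction committed\n")

# fsync(): page cache -> storage device.  Only with write barriers /
# cache flushes enabled does this also drain the drive's volatile
# write cache; without them, a power cut can still lose the data.
os.fsync(fd)

os.close(fd)
```

The last two bullets are exactly the distinctions that, per the slide, deserve better documentation: an application cannot see from the fsync() return value whether the drive's cache was flushed.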


Grading our Progress

How do we know if we are getting better?

  • Define data loss and data unavailability
  • Gather real data on installed systems
  • Gather real data on field failures

Good information allows us to focus on fixing real issues

  • Rate of bad incidents should go down as we advance
  • Blips in the rate help us focus on bad code, bad hardware, etc

Providing robust test suites that are easy to run by both experienced QA testers and developers

Providing extensive and correct documentation for developers

  • For kernel developers internally
  • For application developers
  • For system administrators
  • For end users

Questions?

Project Wiki Pages:

  • http://btrfs.wiki.kernel.org
  • http://ext4.wiki.kernel.org
  • http://xfs.org
  • http://oss.sgi.com/projects/xfs

Contact Information

  • Ric Wheeler
  • rwheeler@redhat.com