Safety First: New Features in Linux File & Storage Systems
Ric Wheeler
rwheeler@redhat.com
September 27, 2010
Overview
Why Care about Data Integrity
Common Causes of Data Loss
Data Loss Exposure Timeline
Reliability Building Blocks
Examples
Grading our Progress
Questions?
Why Care about Data Integrity
Linux is used to store critical personal data
- On Linux based desktops and laptops
- Linux based devices like Tivo, cell phones or a home NAS storage device
- Remote backup providers use Linux systems to backup your personal data
Linux servers are very common in corporate data centers
- Financial services
- Medical images
- Banks
Used as the internal OS for appliances
Linux has one IO & FS Stack
One IO stack has to run on critical and non-critical devices
- Data integrity is the only “safe” default configuration
- Sophisticated users can choose non-default options that might disable data reliability features
Trade-offs need to be made with care
- Some configurations can hurt data integrity (like ext3 writeback mode)
- Some expose data integrity issues (non-default support for write barriers)
Where possible, the system should auto-configure these options
- Detection of battery or UPS can allow for mounting without write barriers
- Special tunings for SSDs and high-end arrays to avoid barrier operations
- Similar to the way we auto-configure features for other types of hardware
Common Causes of Data Loss
Uncountable ways to lose data
- Honesty requires that vendors selling you a storage device provide a disclaimer about when – not if – you will lose data
Storage device manufacturers work hard to
- Track the causes of data loss or system unavailability in real, deployed systems
- Classify those instances into discrete buckets
- Carefully select the key issues that are most critical (or common) to fix first
Tracking real world issues and focusing on fixing high priority issues requires really good logging and reporting from deployed systems
- Voluntary methods like the kernel-oops project
- Critical insight comes from being able to gather good data, do real analysis and then monitor how your fixes impact the deployed base
User Errors
Accidental destruction or deletion of data
- Overwriting good data with bad
- Restoring old backups over good data
Service errors
- Replacing the wrong drive in a RAID array
- Not plugging in the UPS
Losing track of where you put it
- Too many places to store a photo, MP3, etc
Misunderstanding or lack of knowledge about how to configure the system properly
Developers need to be responsible for providing easy-to-use systems in order to minimize this kind of error!
Application Developer Errors
Some application coders do not worry about data integrity to begin with
- Occasionally on purpose – speed is critical, data is not (can be rerun)
- Occasionally by ignorance
- Occasionally by mistake
For application authors that do care, we make data integrity difficult by giving them poor documentation:
- rename: How many fsync() calls should we use when renaming a file?
- fsync: just the file? File and parent directory?
- Best practices that change with type of storage, file system type and file system journal mode
Primitives need to be clearly understood & well documented
- Too many choices make it next to impossible for application developers to deliver reliable software
“Mostly works” is not good enough for data integrity
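The rename/fsync ambiguity above can be made concrete. One commonly recommended sequence for atomically replacing a file (write a temporary file, fsync it, rename over the target, then fsync the parent directory so the rename itself is durable) is sketched below. This is a sketch under assumptions: `safe_replace` is an illustrative name, and, as the slide notes, the exact calls actually required vary by file system and journal mode.

```c
/* Sketch of one commonly recommended "safe replace" sequence.
 * safe_replace() is an illustrative helper, not a standard API. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int safe_replace(const char *dir, const char *name, const char *data)
{
    char tmp[4096], dst[4096];
    snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
    snprintf(dst, sizeof(dst), "%s/%s", dir, name);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* Step 1: get the new contents down to stable storage */
    if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
        fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    /* Step 2: atomically swap the new file into place */
    if (rename(tmp, dst) != 0)
        return -1;

    /* Step 3: make the rename itself durable by syncing the directory */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Whether step 3 is required, or whether fsync() on the file alone suffices, is exactly the kind of question the slide argues should be clearly documented.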
OS & Configuration Errors
Configuration Errors
- Does the system have the write barrier enabled if needed?
Bugs in the IO Stack
- Do we use write barriers correctly to flush volatile write caches?
- Do we properly return error codes so the application can handle failures?
- Do we log the correct failure in /var/log/messages in a way that can point the user to a precise component?
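Whether barriers are enabled is visible in the mount options: ext3 and ext4 accept barrier=1/barrier=0 (and ext3 historically shipped with barriers off by default). As a sketch, a helper that checks one line of /proc/mounts for a given option; the function name is illustrative:

```c
/* Sketch: does a /proc/mounts entry carry a given mount option?
 * Parsing a line, rather than reading the live file, keeps the
 * sketch testable anywhere. has_mount_option() is an illustrative name. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int has_mount_option(const char *mount_line, const char *opt)
{
    /* /proc/mounts fields: device mountpoint fstype options dump pass */
    char line[1024], *field, *save, *o;
    int i;

    snprintf(line, sizeof(line), "%s", mount_line);
    field = strtok_r(line, " ", &save);          /* device */
    for (i = 0; i < 3 && field; i++)
        field = strtok_r(NULL, " ", &save);      /* advance to options */
    if (!field)
        return 0;

    for (o = strtok_r(field, ",", &save); o; o = strtok_r(NULL, ",", &save))
        if (strcmp(o, opt) == 0)
            return 1;
    return 0;
}
```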
Hardware Failures
Hard disk failures
Power supply failures
DRAM failures
Cable failures
Disasters
Name your favorite disaster
- Fire, Flood, Blizzards....
- Power outage
- Terrorism
Avoiding any single point of failure requires that every site have a way to keep a copy of its data at some secondary location
- Remote data mirroring can be done at the file level or block level
- Backup and storage of backup images off site
- Buy expensive hardware support for remote replication, etc
Data Loss Exposure Timeline
Rough Timeline:
- State 0: Data creation
- State 1: Stored to persistent storage
- State 2: Component failure in your system
- State 3: Detection of the failed component
- State 4: Data repair started
- State 5: Data repair completed
Minimizing the time spent out of State 1 is what storage designers lose sleep over!
What is the expected frequency of disk failure?
Hard failures
- Total disk failure
- Read or write failure
- Usually instantaneously detected
Soft failures
- Can happen at any time
- Usually detection requires scrubbing or scanning the storage
- Unfortunately, can be discovered during RAID rebuild
Note that this is not just a rotating disk issue
- SSDs wear out, paper and ink fade, CDs deteriorate, etc
How long does it take you to detect a failure?
Does your storage system detect latent errors in
- Hours? Days? Weeks?
- Only when you try to read the data back?
Most storage systems do several levels of data scanning and scrubbing
- Periodic reads (proper read or read_verify commands) to ensure that any latent errors are detected
- File servers and object based storage systems can do whole file reads and compare data to a digital hash, for example
- Balance is needed between frequent scanning/scrubbing and system performance for its normal workload
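The whole-file read and hash comparison mentioned above can be sketched as follows. Real systems would use a cryptographic digest; a 32-bit FNV-1a checksum keeps the example self-contained, and the function names are illustrative:

```c
/* Sketch of a file-level scrub: read the whole file and compare a
 * checksum against a previously stored value. file_checksum() and
 * scrub_file() are illustrative names, not a real kernel API. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int file_checksum(const char *path, uint32_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    uint32_t h = 2166136261u;             /* FNV-1a offset basis */
    unsigned char buf[4096];
    ssize_t n, i;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 16777619u;               /* FNV-1a prime */
        }
    close(fd);
    if (n < 0)
        return -1;
    *out = h;
    return 0;
}

/* Returns 0 if the file matches, 1 if latent corruption is detected. */
int scrub_file(const char *path, uint32_t expected)
{
    uint32_t h;
    if (file_checksum(path, &h) < 0)
        return -1;
    return h == expected ? 0 : 1;
}
```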
How long does it take you to repair a failure?
Repair the broken storage physically
- Rewrite a few bad sectors (multiple seconds)?
- Replace a broken drive and rebuild the RAID group (multiple hours)?
Can we repair the damage done to the file system?
- Are any files present but damaged?
- Do I need to run fsck?
- Very useful to be able to map an IO error back to a user file, metadata or unallocated space
Repair the logical structure of the file system metadata
- Fsck time can take hours or days
- Restore any data lost from backup
Users like to be able to verify file system integrity after a repair
- Are all of my files still on the system?
- Is the data in those files intact and unchanged?
- Can you tell me precisely what I lost?
Exposure to Permanent Data Loss
Combination of the factors described:
- Robustness of storage system
- Rate of failure of components
- Time to detect the failure
- Time to repair the physical media
- Time to repair the file system metadata (fsck)
- Time to summarize for the user any permanent loss
If the time required to detect and repair failures is longer than the time between failures, you will lose data!
Storage Downtime without Data Loss Counts
Unavailable data loses money
- Banking transactions
- Online transactions
Data unavailability can be really mission critical
- X-rays are digital and used in operations
- Digital maps used in search
Horrible performance during repair is downtime in many cases
- Minimizing repair time minimizes this loss as well
How Many Concurrent Failures Can Your Data Survive?
Protection against failure is expensive
- Storage system performance
- Utilized capacity
- Extra costs for hardware, power and cooling for less efficient storage systems
A single drive can survive soft failures
- A single disk is 100% efficient
RAID5 can survive 1 hard failure & soft failures
- RAID5 with 5 data disks and 1 parity disk is 83% efficient
RAID6 can survive 2 hard failures & soft failures
- RAID6 with 4 data disks and 2 parity disks is only 66% efficient!
Fancy schemes (erasure encoding schemes) can survive many failures
- Any “k” drives out of “n” are sufficient to recover data
- Popular in cloud and object storage systems
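The efficiency figures above all come from the same ratio of data units to total units; as a quick sketch:

```c
/* Space efficiency of a data + parity layout: the fraction of raw
 * capacity that holds user data (k data units out of n total for a
 * k-of-n erasure code). */
double storage_efficiency(int data_units, int parity_units)
{
    return (double)data_units / (double)(data_units + parity_units);
}
```

storage_efficiency(5, 1) gives the 83% RAID5 figure and storage_efficiency(4, 2) the 66% RAID6 figure quoted on the slide.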
Example: MD RAID5 & EXT3
RAID5 gives us the ability to survive 1 hard failure
- Any second soft failure during RAID rebuild can cause data loss since we need to read each sector of all other disks during rebuild
- Rebuild can begin only when we have a new or spare drive to use for rebuild
Concurrent hard drive failures in a RAID group are rare
- ... but detecting latent (soft) errors during rebuild are increasingly common!
MD has the ability to “check” RAID members on demand
- Useful to be able to de-prioritize this background scan
- Should run once every 2 to 4 weeks
RAID rebuild times are linear with drive size
- Can run up to 1 day for a healthy set of disk drives
EXT3 fsck times can run a long time
- 1TB FS fsck with 45 million files ran 1 hour (reports in the field of run times up to 1 week!)
- Hard (not impossible) to map bad sectors back to user files using ncheck/icheck
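The on-demand “check” of MD RAID members mentioned above is driven through sysfs: writing the word check to /sys/block/&lt;md-device&gt;/md/sync_action starts the background scan. A minimal sketch (requires root and an actual MD array; the helper name is illustrative):

```c
/* Sketch: kick off MD's background consistency check by writing "check"
 * to the array's sync_action file in sysfs. trigger_md_check() is an
 * illustrative helper name. Returns 0 on success, -1 on failure. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int trigger_md_check(const char *md_dev)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/block/%s/md/sync_action", md_dev);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;          /* no such array, or not running as root */
    ssize_t n = write(fd, "check", 5);
    close(fd);
    return n == 5 ? 0 : -1;
}
```

The shell equivalent, for an array named md0, would be writing check into /sys/block/md0/md/sync_action from a periodic cron job, matching the every-2-to-4-weeks cadence suggested above.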
EXT4 Improvements
Checksums are computed for the journal transactions
- Help detect corrupted transactions on replay of the journal after a crash
- Do not cover user data blocks
Extent based allocation structures do support quicker fsck
- Uninitialized inodes can be skipped – upstream patches posted that will help mkfs eventually
- Preallocated data blocks can be flagged as unwritten which will let us avoid writing blocks of zero data
- These features – especially on larger file systems – need thorough testing!
Better allocation policies lead to faster fsck times
- Directory related blocks are stored on disk in contiguous allocations
- High file count fsck times are 6-8 times faster
Updated Example: MD RAID5 & EXT4
RAID5 gives us the ability to survive 1 hard failure
- Any second soft failure during RAID rebuild can cause data loss since we need to read each sector of all other disks during rebuild
- Rebuild can begin only when we have a new or spare drive to use for rebuild
Concurrent hard drive failures in a RAID group are rare
- ... but detecting latent (soft) errors during rebuild are increasingly common!
MD has the ability to “check” RAID members on demand
- Useful to be able to de-prioritize this background scan
- Should run once every 2 to 4 weeks
RAID rebuild times are linear with drive size
- Can run up to 1 day for a healthy set of disk drives
EXT4 fsck times are much improved
- 1TB FS fsck with 45 million files ran under 7 minutes – much faster than the ext3 1 hour time
- 1 billion file ext4 fsck finished in 2.5 hours!
- Credit goes to better meta-data layout
Native Support for Sector Level Scans
Data protection schemes need ways to detect partial failures
- Background scans of storage are essential parts of high end storage
- Need to run at a rate which allows you to detect partial failures before double failures occur
- Need to balance impact on performance against frequent scanning
- Over-zealous scanning can prematurely age storage devices
File System Level Scans
- BTRFS has data and metadata checksums, which enable a full scan to be done by simply reading each file
- Ext4 has some meta-data checksums
Block level scanning
- READ_VERIFY command can be used for SCSI and S-ATA to read from platter to device cache without transfer of data to host
- Checks un-allocated space as well
Better API's For Data Integrity
Enhanced API's between the block layer and file systems
- Ability to retry a given mirror for a RAID1 device?
- Better and more specific error values – EIO is not always enough
New system calls?
- Support for batching of expensive calls like fsync()?
- Exporting transactions to user space?
Better documentation on existing API's like fsync(), O_DIRECT writes and rename()
Better documentation on the data path
- write() moves data from application buffer to page cache
- fsync() without write barrier or normal page flushing moves data from page cache to the storage device and its write caches
- fsync() with write barrier returns only when data has been flushed from the storage write cache
- O_DIRECT write bypasses the page cache but can still end up in a volatile storage device cache
Grading our Progress
How do we know if we are getting better?
- Define data loss and data unavailability
- Gather real data on installed systems
- Gather real data on field failures
Good information allows us to focus on fixing real issues
- Rate of bad incidents should go down as we advance
- Blips in the rate help us focus on bad code, bad hardware, etc
Providing robust test suites that are easy to run by both experienced QA testers and developers
Providing extensive and correct documentation for developers
- For kernel developers internally
- For application developers
- For system administrators
- For end users
Questions?
Project Wiki Pages:
- http://btrfs.wiki.kernel.org
- http://ext4.wiki.kernel.org
- http://xfs.org
- http://oss.sgi.com/projects/xfs
Contact Information
- Ric Wheeler
- rwheeler@redhat.com