CS5460: Operating Systems - Lecture 20: File System Reliability


SLIDE 1

CS5460: Operating Systems
Lecture 20: File System Reliability

SLIDE 2

File System Optimizations

Technique               Effect
Disk buffer cache       Eliminates problem
Aggregated disk I/O     Reduces seeks
Prefetching             Overlaps/hides disk access
Disk head scheduling    Reduces seeks
Disk interleaving       Reduces rotational latency

(Techniques range from modern to historic.)
• Goal: Reduce or hide expensive disk operations

SLIDE 3

Buffer/Page Cache

• Idea: Keep recently used disk blocks in kernel memory
• Process reads from a file:
  – If blocks are not in the buffer cache:
    » Allocate space in the buffer cache (Q: What do we purge, and how?)
    » Initiate a disk read
    » Block the process until disk operations complete
  – Copy data from the buffer cache to process memory
  – Finally, the system call returns
• Usually, a process does not see the buffer cache directly
  – mmap() maps buffer cache pages into process RAM
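The read path above can be sketched as a toy cache with an LRU purge policy (one common answer to the "what do we purge" question). The `BufferCache` class and the `disk` dict are illustrative stand-ins, not a real kernel interface:

```python
from collections import OrderedDict

class BufferCache:
    """Toy buffer cache with LRU eviction. `disk` is a stand-in for the
    device: a dict mapping block numbers to block contents."""
    def __init__(self, disk, capacity=4):
        self.disk = disk
        self.capacity = capacity
        self.cache = OrderedDict()   # block# -> data, in LRU order
        self.disk_reads = 0          # counts simulated disk I/Os

    def read_block(self, blkno):
        if blkno in self.cache:              # cache hit: no disk access
            self.cache.move_to_end(blkno)
            return self.cache[blkno]
        self.disk_reads += 1                 # miss: "initiate a disk read"
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # purge the least recently used block
        data = self.disk[blkno]              # (a real kernel blocks the process here)
        self.cache[blkno] = data
        return data
```

Repeated reads of the same block then cost one disk access instead of many.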

SLIDE 4

Buffer/Page Cache

• Process writes to a file:
  – If blocks are not in the buffer cache:
    » Allocate pages
    » Initiate a disk read
    » Block the process until disk operations complete
  – Copy written data from process RAM to the buffer cache
• Default: writes create dirty pages in the cache, then the system call returns
  – Data gets written to the device in the background
  – What if the file is unlinked before it goes to disk?
• Optional: Synchronous writes, which go to disk before the system call returns
  – Really slow!
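The default (write-back) path and the synchronous path can be contrasted in a small sketch; `WritebackCache`, the `disk` dict, and the `sync` flag are hypothetical names for illustration:

```python
class WritebackCache:
    """Toy write-back cache: a write dirties an in-memory page and returns
    immediately; a later flush (the background writer, or an explicit sync
    as with fsync) pushes dirty pages to `disk`, a stand-in dict."""
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}      # block# -> data
        self.dirty = set()   # blocks modified but not yet on "disk"

    def write_block(self, blkno, data, sync=False):
        self.cache[blkno] = data     # default path: dirty the page, return
        self.dirty.add(blkno)
        if sync:                     # synchronous write: hit the disk first
            self.flush(blkno)

    def flush(self, blkno=None):
        """Write one dirty block (or all of them) to the disk."""
        targets = [blkno] if blkno is not None else list(self.dirty)
        for b in targets:
            self.disk[b] = self.cache[b]
            self.dirty.discard(b)
```

Note how an unlinked-before-flush file corresponds to dirty blocks that could simply be discarded instead of flushed.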

SLIDE 5

Performing Large File I/Os

• Idea: Try to allocate contiguous chunks of a file in large contiguous regions of the disk
  – Disks have excellent bandwidth, but lousy latency!
  – Amortize expensive seeks over many block reads/writes
• Question: How?
  – Maintain a free-block bitmap (cache parts of it in memory)
  – When you allocate blocks, use a modified "best fit" algorithm rather than allocating one block at a time (even pre-allocate)
• Problem: Hard to do this when the disk is full or fragmented
  – Solution A: Keep a reserve (e.g., 10%) available at all times
  – Solution B: Run a disk "defragger" occasionally
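A minimal sketch of a best-fit extent search over a free-block bitmap (True = free); this is one way to realize the slide's idea, not the algorithm any particular file system uses:

```python
def best_fit_extent(bitmap, want):
    """Scan a free-block bitmap (True = free) for the smallest free run of
    at least `want` blocks; return (start, length) or None. A sketch of
    the 'modified best fit' idea, not any real allocator."""
    best = None
    run_start, run_len = None, 0
    for i, free in enumerate(list(bitmap) + [False]):  # sentinel closes last run
        if free:
            if run_start is None:
                run_start = i
            run_len += 1
        else:
            if run_start is not None and run_len >= want:
                if best is None or run_len < best[1]:
                    best = (run_start, run_len)        # smallest adequate run
            run_start, run_len = None, 0
    return best
```

Allocating a whole extent at once (rather than block-by-block) is what lets seeks be amortized over many contiguous transfers.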

SLIDE 6

Prefetching

• Idea: Read blocks from disk ahead of user requests
• Goal: Reduce the number of seeks visible to the user
  – If a block is read before it is requested → it hits in the file buffer cache
• Problem: What blocks should we prefetch?
  – Easy: Detect sequential access and prefetch ahead N blocks
  – Harder: Detect periodic/predictable "random" accesses

[Timeline diagram: the user's reads (Read 0, Read 1, Read 2) vs. the file system's reads, issued ahead of the user's requests]
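The "easy" case can be sketched as a detector that notices consecutive block numbers and schedules reads ahead; `Prefetcher` and `n_ahead` are illustrative names, not a real kernel API:

```python
class Prefetcher:
    """Sketch of sequential-access detection: if a read continues the
    previous block number, schedule reads for the next `n_ahead` blocks.
    on_read() returns the block numbers it would fetch asynchronously."""
    def __init__(self, n_ahead=2):
        self.n_ahead = n_ahead
        self.last = None         # last block number the user read
        self.prefetched = set()  # blocks already scheduled

    def on_read(self, blkno):
        issued = []
        if self.last is not None and blkno == self.last + 1:  # sequential?
            for b in range(blkno + 1, blkno + 1 + self.n_ahead):
                if b not in self.prefetched:
                    self.prefetched.add(b)
                    issued.append(b)   # a real FS would start async disk reads
        self.last = blkno
        return issued
```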

SLIDE 7

Fault Tolerance and Reliability

SLIDE 8

Fault Tolerance

• What kinds of failures do we need to consider?
  – OS crash, power failure
    » Data not yet on disk is lost; rarely, partial writes
  – Disk media failure
    » Data on disk corrupted or unavailable
  – Disk controller failure
    » Large swaths of data unavailable, temporarily or permanently
  – Network failure
    » Clients and servers cannot communicate (transient failure)
    » Only have access to stale data (if any)
  – … (what else?)

SLIDE 9

Techniques to Tolerate Failure

• Careful disk writes and "fsck"
  – Leave the disk in a recoverable state even if not all writes finish
  – Run a "disk check" program to identify/fix inconsistent disk state
• RAID: Redundant Array of Inexpensive/Independent Disks
  – Write each block on more than one independent disk
  – If a disk fails, can recover block contents from the non-failed disks
• Logging
  – Rather than overwrite-in-place, write changes to a log file
  – Use two-phase commit to make log updates transactional
• Clusters
  – Replicate data at the server level

SLIDE 10

Careful Writes

• Order writes so that the disk state is recoverable
  – Accept that disk contents may be inconsistent or stale
  – Run a sanity-check program to detect and fix problems
• Properties that should hold at all times:
  – All blocks pointed to are not marked free
  – All blocks not pointed to are marked free
  – No block belongs to more than one file
• Goal: Avoid major inconsistency
• Not a goal: Never lose data

SLIDE 11

Careful Writes Example

• To create a file, you must:
  – Allocate and initialize an inode
  – Allocate and initialize some data blocks
  – Modify the directory file of the directory containing the file
  – Modify the directory file's inode (last-modified time, size)
• In what order should we do these writes?
• How do we add transactional (all-or-nothing) semantics?
• How do careful writes interact with optimizations?

SLIDE 12

Careful Writes Exercise

• To delete a file, you must:
  – Deallocate the file's inode
  – Deallocate the file's disk blocks
  – Modify the directory file of the directory containing the file
  – Update the directory file's inode
• In what order should we do these operations?
  – Consider which intermediate states are recoverable via fsck

SLIDE 13

Soft Update Rules

• Never point to a block before initializing it
• Never reuse a block before nullifying pointers to it
• Never reset the last pointer to a live block before setting a new one
• Always mark free-block bitmap entries as used before making the directory entry point to the block
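An ordering for the earlier file-creation example that respects these rules can be sketched and checked; the write labels ("bitmap", "inode", ...) are illustrative names, not real on-disk structures:

```python
def create_file_writes():
    """Careful-write order for file creation under the soft-update rules:
    mark blocks used and initialize them before anything points at them.
    Labels ('bitmap', 'inode', ...) are illustrative only."""
    return [
        ("bitmap",    "mark the new inode and data blocks as used"),
        ("data",      "initialize the data blocks"),
        ("inode",     "initialize the inode; point it at the data"),
        ("directory", "add the directory entry pointing at the inode"),
        ("dir_inode", "update the directory inode (mtime, size)"),
    ]

def recoverable_after(n_done):
    """If a crash completes only the first n_done writes, the state is
    fsck-recoverable: either nothing points at the new file yet (worst
    case: leaked blocks, which fsck reclaims), or everything the
    directory entry points at was already initialized."""
    done = {name for name, _ in create_file_writes()[:n_done]}
    return "directory" not in done or {"bitmap", "data", "inode"} <= done
```

The key property: a crash at any prefix leaves, at worst, allocated-but-unreferenced blocks, which the sanity checker can reclaim without losing referenced data.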

SLIDE 14

Careful Writes: More Exercises

• To write a file, you must:
  – Modify (and perhaps allocate) the file's disk blocks
  – Modify the file's inode (size and last-modified time)
  – Maybe modify indirect block(s)
• To move a file between directories, you must:
  – Modify the source directory
  – Modify the destination directory
  – Modify the inodes of both directories

SLIDE 15

RAID

• Goal: Organize multiple physical disks into a single high-performance, high-reliability logical disk
• Issues to consider:
  – Multiple disks → higher aggregate throughput (more spindles)
  – Multiple disks → (hopefully) independent failure modes
  – Multiple disks → vulnerable to individual disk failures (MTTF)
  – Writing to multiple disks for replication → higher write overhead

[Diagram: CPU connected over the I/O bus to a RAID controller managing several disks]

SLIDE 16

Possible Uses of Multiple Disks

• Striping
  – Spread pieces of a single file across multiple disks
  – Advantages:
    » Can service multiple independent requests in parallel
    » Can service a single "large" request in parallel
  – Issues:
    » Interleave factor
    » How the data is striped across disks
• Redundancy (replication)
  – Store multiple copies of blocks on independent disks
  – Advantages:
    » Can tolerate partial system failure → how much?
  – Issues:
    » How widely do you want to spread the data?
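Round-robin striping can be captured in a few lines of arithmetic; `stripe_map` is a hypothetical helper mapping a logical block to a (disk, block-on-disk) pair for a given interleave factor:

```python
def stripe_map(logical_block, n_disks, interleave=1):
    """Hypothetical RAID 0 mapping: round-robin stripe units of
    `interleave` blocks across `n_disks`, returning (disk, block_on_disk)
    for a logical block number."""
    unit = logical_block // interleave   # which stripe unit the block is in
    disk = unit % n_disks                # round-robin units across disks
    block_on_disk = (unit // n_disks) * interleave + logical_block % interleave
    return disk, block_on_disk
```

With interleave 1, consecutive blocks land on consecutive disks, so a large sequential read keeps all spindles busy at once.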

SLIDE 17

Types of RAID

RAID level   Description
0            Data striping w/o redundancy
1            Disk mirroring
2            Parallel array of disks w/ error-correcting disk (checksum)
3            Bit-interleaved parity
4            Block-interleaved parity
5            Block-interleaved, distributed parity

SLIDE 18

RAID Level 0

• Striping
  – Spread contiguous blocks of a file across multiple spindles
  – Simple round-robin distribution
• Non-redundant
  – No fault tolerance
• Advantages
  – Higher throughput
  – Larger storage
• Disadvantages
  – Lower reliability: any drive failure destroys the file system
  – Added cost

SLIDE 19

RAID Level 1

• Mirroring
  – Write complete copies of all blocks to multiple disks
  – How many copies → how much reliability
• No striping
  – No added write bandwidth
  – Potential for pipelined reads
• Advantage:
  – Can tolerate disk failures ("availability")
• Disadvantage:
  – High cost (extra disks and RAID controller)
• Q: How to recover from a drive failure?

SLIDE 20

RAID Level 5

• Striping + distributed parity
  – Spread contiguous blocks of a file across multiple spindles
  – Adds parity information
    » Example: XOR of the other blocks
• Combines features of levels 0 & 1
• Advantages
  – Higher throughput
  – Lower cost (than level 1)
  – Any single disk can fail
• Disadvantages
  – More complexity in the RAID controller
  – Slower recovery time than RAID 1
• RAID 6: 2 parity disks
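The parity idea, and single-disk recovery from it, is just XOR; a minimal demonstration (byte strings stand in for disk blocks):

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings: the parity computation of RAID 4/5."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

# Parity lets any single lost block be rebuilt from the survivors:
stripes = [b"\x01\x02", b"\x0f\x00", b"\xa0\x55"]       # blocks on disks 0..2
parity = xor_blocks(stripes)                            # stored on another disk
rebuilt = xor_blocks([stripes[0], stripes[2], parity])  # suppose disk 1 failed
assert rebuilt == stripes[1]
```

Recovery is slower than RAID 1 because rebuilding one block requires reading the corresponding block from every surviving disk.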

SLIDE 21

RAID Tradeoffs

• Space efficiency
• Minimum number of disks
• Number of simultaneous failures tolerated
• Read performance
• Write performance
• Time to recover from a failed disk
• Complexity of the controller

SLIDE 22

RAID Discussion

• RAID can be implemented in hardware or software
  – Hardware RAID is implemented by a RAID controller
    » Often supports hot swapping using hot-spare disks
    » Not totally clear that cheap RAID hardware is worth it
  – Software RAID is implemented by the OS kernel (device driver)
• Multiple parity disks can handle multiple errors
• Nested RAID
  – Can use a RAID array as a "disk" in a higher-level RAID
    » RAID 1+0: RAID 0 (striping) run across RAID 1 (mirrored) arrays
    » RAID 0+1: RAID 1 (mirroring) run across RAID 0 (striped) arrays

SLIDE 23

RAID Discussion

• What are the risks of purchasing a large number of disks at the same time for use in a RAID?
• Hot spares can be useful
• What does a RAID look like to the file system code?
• RAID summary
  – Tolerates failed disks
  – May not deal well with correlated failure modes
  – Can improve sustained transfer rate
  – Does not improve individual seek latencies

SLIDE 24

Logging / Journaling

• Observations:
  – Recreating a consistent disk after failure is problematic
  – Conventional file systems are optimized for large contiguous reads
  – The file buffer cache eliminates reads → writes are often the bottleneck
    » Recall "careful writes" → cannot defer metadata writes indefinitely
    » Metadata ops access non-contiguous parts of the disk (file, inode, directory)
• Idea: Redesign the file system around a "log"
  – Contiguous log structure → append at the end
  – Usage is similar to a database transaction log
  – Eliminates random seeks from the critical path
• Sweeper process
  – Copies data from the log to its "real" locations
  – Kicked off periodically (e.g., when the log is filling up)

[Log layout: StartTransaction <transaction info> EndTransaction, repeated back-to-back]

SLIDE 25

Example: File Creation

• Conventional file system:
  – Allocate and initialize the inode
  – Write the inode to disk
  – Load the directory file
  – Load the directory inode
  – Update the directory file
  – Write the directory file to disk
  – Update the directory inode
  – Write the directory inode to disk
  – Later: Flush the free-inode bitmap
  → Lots of seeks, lots of small writes

• Log-based file system:
  – Allocate and initialize the inode
  – Load the directory file
  – Load the directory inode
  – Write:
    » BeginTransaction (FileCreate)
    » Filename: /tmp/foo
    » Inode#: 1234
    » Inode contents: …
    » Directory contents: …
    » EndTransaction (FileCreate)
  – Later: Copy data from the log to the "real" structures
  → Few seeks + one big write
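The single log write above can be sketched as appending one bracketed record to an in-memory log; the tuple layout mirrors the slide's record and is purely illustrative:

```python
def append_create_txn(log, filename, inode_no, inode_bytes, dir_bytes):
    """Append one file-create transaction to an in-memory log list:
    the whole metadata update becomes a single contiguous append.
    Tuple record shapes mirror the slide and are illustrative only."""
    log.extend([
        ("BeginTransaction", "FileCreate"),
        ("Filename", filename),
        ("Inode#", inode_no),
        ("InodeContents", inode_bytes),
        ("DirectoryContents", dir_bytes),
        ("EndTransaction", "FileCreate"),
    ])
```

On a real disk, this whole record goes out as one sequential write at the log's tail, which is why the critical path needs few seeks.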

SLIDE 26

Using the Operation Log

• Issue: Inconsistency between log contents and "real" contents (for anything not yet copied back)
• Questions:
  – What problems can this cause?
  – How do you get around these problems?
• Issue: What if I re-modify a file/inode before the flush?

SLIDE 27

Using the Operation Log

• Issue: Inconsistency between log contents and "real" contents (for anything not yet copied back)
• Questions:
  – What problems can this cause?
    » Cannot simply read data/metadata from the "real" locations
    » Need to check log contents on any lookup/read
  – How do you get around these problems?
    » Maintain an index of logged-but-not-flushed state in DRAM
    » Always check the index first whenever you want to read data/metadata
• Issue: What if I re-modify a file/inode before the flush?
  – Correct: Simply flush changes in the order they appear in the log
  – Optimized: If the 2nd change negates the 1st, only flush the 2nd → be careful!

SLIDE 28

What About File Data Writes?

• Option one:
  – Write the new data into the log
  – Later, copy the data from the log to the "real" disk blocks
• Option two:
  – Write new data to the "real" disk blocks right away
• Tradeoffs?

SLIDE 29

Crash Recovery

• Question: How do you recover after a crash?
  – What inconsistencies are possible?
  – How do you detect and correct inconsistencies?
• Answer: Run a log sweeper (a la fsck/ChkDsk)
  – Search through the log to find the "oldest" valid record
  – Walk the log from oldest to newest:
    » If a complete transaction is present in the log → complete it (if necessary)
    » If an incomplete transaction is found → abort/undo it
  – Recovery is analogous to transaction logs in database systems
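The sweep can be sketched as a single pass that replays only bracketed (complete) transactions and drops a trailing incomplete one; the tuple record shapes are illustrative, not a real log format:

```python
def recover(log):
    """One sweep from oldest to newest record: replay payloads of
    transactions whose End record reached the log; silently drop an
    incomplete transaction (its Begin has no matching End)."""
    replayed, current = [], None
    for rec in log:
        kind = rec[0]
        if kind == "Begin":
            current = []                  # start buffering a transaction
        elif kind == "End":
            if current is not None:
                replayed.extend(current)  # complete: safe to apply
            current = None
        elif current is not None:
            current.append(rec[1])        # payload inside a transaction
    return replayed
```

Because an aborted transaction was never applied to the "real" structures, dropping it restores the pre-operation state, which is exactly the all-or-nothing semantics the log provides.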

SLIDE 30

Logging vs. Not

• Advantages of logging:
  – Fast metadata operations → one big synchronous write
  – Efficient for small write operations (if normal writes are logged)
  – Clean, fast recovery mechanism
• Disadvantages of logging:
  – Space overhead → log and in-memory structures
  – Complexity → transactions, extra data structures, sweeper process
  – Duplication of effort → write to both the log and the "real" locations

SLIDE 31

Logging Filesystems in Practice

• NTFS uses a log
• Recent versions of UFS+ use a log
• Linux EXT2 does not use a log
  – Works using the techniques we discussed through the last lecture
• Linux EXT3 is log-based, and is forward-compatible
  – You can take an EXT2 filesystem and start using it as EXT3 by adding a log
  – EXT3 can be converted back to EXT2
• EXT4 is more sophisticated than EXT3 but still retains backward compatibility
• Btrfs does not use logging

SLIDE 32

Questions?