filesystem reliability

redo logging: file creation
  disk layout: super block, free map, inode array, data, log
  log contents:
    BEGIN
    data blk 17 = … (new.txt, 53) … (dir)
    data blk 34 = … (file)
    inode #53 = … addr[0]=34 …
    free map pt 2 = … 1 0 1 …
    COMMIT
    BEGIN
    data blk 74 = … (file)
    …
  write log entries with intended operations
  write commit message to log
  later, start applying results to actual disk
    …and start more transactions
  when everything is written, can overwrite log
  filesystem needs to ensure that committed updates will definitely happen!
    mechanism: check this log for commit messages later, and redo them (just in case)

redo logging: file creation (normal operation and recovery)
  normal operation:
    write to log transaction steps:
      data blocks to create
      directory entry, inode to write
      directory inode (size, time) update
    write to log "commit transaction"
    in any order:
      update file data blocks
      update directory entry
      update file inode
      update directory inode
    reclaim space in log ("garbage collection")
  recovery (after system reboots/recovers): read log and…
    ignore any operation with no "commit"
    redo any operation with "commit"
      already done? okay, setting inode twice
    reclaim space in log
  crash before commit? file not created
    no partial operation to real data
  crash after commit? file created
    promise: will perform logged updates
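
The normal-operation and recovery steps above can be sketched roughly as follows. This is a toy in-memory model, not any real filesystem's code: `disk`, `log`, `log_transaction` and `recover` are names made up for illustration, and a Python dict stands in for disk blocks.

```python
# Hypothetical in-memory model: `disk` maps block names to values,
# `log` is an append-only list of records.

def log_transaction(log, updates, commit=True):
    """Append a transaction's intended writes, then (optionally) its commit marker."""
    log.append(("BEGIN",))
    for block, value in updates.items():
        log.append(("WRITE", block, value))  # idempotent: "set block to value"
    if commit:
        log.append(("COMMIT",))


def recover(log, disk):
    """Redo every committed transaction; ignore any without a commit marker."""
    pending = {}
    for record in log:
        if record[0] == "BEGIN":
            pending = {}
        elif record[0] == "WRITE":
            _, block, value = record
            pending[block] = value
        elif record[0] == "COMMIT":
            disk.update(pending)  # safe even if some writes already reached disk
            pending = {}
    log.clear()  # reclaim log space once everything is applied


disk = {"inode #53": None, "data blk 34": None}
log = []
log_transaction(log, {"inode #53": "addr[0]=34", "data blk 34": "(file)"})
log_transaction(log, {"data blk 74": "(file)"}, commit=False)  # crash before commit
disk["data blk 34"] = "(file)"  # one real write landed before the crash
recover(log, disk)
```

After `recover`, the committed transaction is fully applied (including the write that had already landed), and the uncommitted one leaves no trace on `disk`.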

idempotency
  logged operations should be okay to do twice = idempotent
  good example: set inode link count to 4
  bad example: increment inode link count
  good example: overwrite inode number X with new value
    as long as last committed inode value in log is right…
  bad example: allocate new inode with particular contents
  good example: overwrite data block with new value
  bad example: append data to last used block of file
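
A small illustration of the link-count examples above, assuming an inode is modelled as a plain dict (the field name `nlink` and both function names are made up for this sketch). Each operation is applied twice, as if recovery replayed a log entry that had already been applied.

```python
def set_link_count(inode, value):
    """Idempotent: applying it twice leaves the same state as applying it once."""
    inode["nlink"] = value


def increment_link_count(inode):
    """Not idempotent: each replay changes the result again."""
    inode["nlink"] += 1


good = {"nlink": 3}
set_link_count(good, 4)
set_link_count(good, 4)  # replayed during recovery: still 4

bad = {"nlink": 3}
increment_link_count(bad)
increment_link_count(bad)  # replayed during recovery: now 5, not the intended 4
```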

redo logging summary
  write intended operation to the log before ever touching ‘real’ data
    in format that’s safe to do twice
  write marker to commit to the log
    if exists, the operation will be done eventually
  actually update the real data

redo logging and filesystems
  filesystems that do redo logging are called journalling filesystems

the xv6 journal
  xv6 log (one transaction):
    log header (one sector): number of blocks, location for first block, location for second block, …
    first block (log copy), second block (log copy), …
  non-log blocks: rest of disk
  number of blocks in log header:
    non-0: number of blocks of committed data
    otherwise: no transaction or not committed
  transaction:
    start: num blocks = 0
    1 write changed blocks
    2 write log header, number of blocks = N (commits transaction)
    3 write data (redone on recovery if number of blocks ≠ 0)
    4 clear log header, number of blocks = 0
    ready for next transaction
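
The four steps can be modelled with a toy simulation. This is a sketch in Python, not xv6's actual C code: `commit`, `install`, `recover`, `Crash` and the dict-based `log` are all made up for illustration, and the header's length stands in for "number of blocks".

```python
class Crash(Exception):
    """Raised to simulate power loss right after a chosen step."""


def install(disk, log):
    """Copy each logged block to its real location (safe to repeat)."""
    for location, copy in zip(log["header"], log["copies"]):
        disk[location] = copy


def commit(disk, log, changes, crash_after=None):
    """Run steps 1-4; `crash_after` (1-4) simulates a crash after that step."""
    def step_done(n):
        if crash_after == n:
            raise Crash(n)

    log["copies"] = list(changes.values())  # 1: write changed blocks to the log
    step_done(1)
    log["header"] = list(changes.keys())    # 2: non-empty header commits the transaction
    step_done(2)
    install(disk, log)                      # 3: write data to real locations
    step_done(3)
    log["header"] = []                      # 4: clear header, ready for next transaction
    step_done(4)


def recover(disk, log):
    """On reboot: redo the writes if the header is non-empty, then clear it."""
    install(disk, log)
    log["header"] = []


disk = {17: "old dir", 34: "old data"}
log = {"header": [], "copies": []}
try:
    commit(disk, log, {17: "new dir", 34: "new data"}, crash_after=2)
except Crash:
    pass
recover(disk, log)  # header was non-empty, so the update is redone
```

A crash after step 1 leaves the header empty, so `recover` changes nothing; a crash after step 2 or 3 leaves it non-empty, so the whole update is redone.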

what is a transaction?
  so far: each file update?
  faster to do batch of updates together
    one log write finishes lots of things
    don’t wait to write
  xv6 solution: combine lots of updates into one transaction
  only commit when…
    no active file operation, or
    not enough room left in log for more operations

redo logging problems
  doesn’t the log get infinitely big?
  writing everything twice?

limiting log size
  once transaction is written to real data, can discard
    sometimes called “garbage collecting” the log
  may sometimes need to block to free up log space
    perform logged updates before adding more to log
  hope: usually log cleanup happens “in the background”

lots of writing? (1)
  entire log can be written sequentially
    ideal for hard disk performance
    also pretty good for SSDs
  multiple updates can be done in any order
    can reorder to minimize seek time/rotational latency/etc.
    can interleave updates that make up multiple transactions
  no waiting for ‘real’ updates
    application can proceed while updates are happening
    files will be updated even if system crashes
  often better for performance!

lots of writing? (2)
  updating 1000 files?
  with redo logging: 2 big seeks
    write all updates to log in order
    write all updates to file/inode/directory data in order
  careful ordering: lots of seeks?
    write to free block map
    seek + write to inode
    seek + write to directory entry
    repeat 1000x
  maybe could also combine file updates with careful ordering??
    but sure starts to get complicated to track order requirements
    redo logging is probably simpler?

degrees of consistency
  not all journalling filesystems use redo logging for everything
  some use it only for metadata operations
  some use it for both metadata and user data
  only metadata: avoids lots of duplicate writing
  metadata+user data: integrity of user data guaranteed

multiple copies
  FAT: multiple copies of file allocation table and header
  in inode-based filesystems: often multiple copies of superblocks
  if part of disk’s data is lost, have an extra copy
    always update both copies
  hope: disk failure limited to small group of sectors
  hope: enough to recover most files on disk failure
    extra copy of metadata that is important for all files
    but won’t recover specific files/directories whose data was lost

mirroring whole disks
  alternate strategy: write everything to two disks
  always write to both
  read from either (or different parts of both, which is faster!)

beyond mirroring
  mirroring seems to waste a lot of space
  10 disks of data? mirroring → 20 disks
  10 disks of data? how good can we do with 15 disks?
    best possible: lose 5 disks, still okay
    can’t do better or it wasn’t really 10 disks of data
  schemes that do this based on erasure codes
    erasure code: encode data in way that handles parts missing (being erased)

erasure code example
  store 2 disks of data on 3 disks
  recompute original 2 disks of data from any 2 of the 3 disks
  extra disk of data: some formula based on the original disks
    common choice: bitwise XOR
  common set of schemes like this: RAID
    Redundant Array of Independent Disks
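
A minimal sketch of the XOR choice, assuming each "disk" is just a short byte string of equal length (the helper `xor_blocks` and the sample contents are invented for illustration). Any one of the three can be recomputed from the other two.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


disk1 = b"hello wo"
disk2 = b"rld!\x00\x00\x00\x00"
disk3 = xor_blocks(disk1, disk2)  # the extra disk

# Lose disk 1: rebuild it from disks 2 and 3.
recovered1 = xor_blocks(disk3, disk2)
# Lose disk 2: rebuild it from disks 1 and 3.
recovered2 = xor_blocks(disk1, disk3)
```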

snapshots
  filesystem snapshots
  idea: filesystem keeps old versions of files around
    accidental deletion? old version still there
    eventually discard some old versions
  can access snapshot of files at prior time
  mechanism: copy-on-write
    changing file makes new copy of filesystem
    common parts shared between versions

inode and copy-on-write
  [diagram: old inode and new inode, each pointing to indirect blocks and file data]
  inode update: new data blocks + new indirect blocks + new inode
  both old+new inode valid
  unchanged parts of file shared
  challenge: FFS/xv6/ext2 design has big array of inodes
    don’t want to write new copy of entire inode array

extra indirection for inode array
  [diagram: root inode → arrays of inodes → inode → indirect blocks → …]
  inode array split into pieces: arrays of inodes
  update one inode? create new root inode + pointers
    unchanged parts of inode array shared between versions
  multiple snapshots? array of root inodes

copy-on-write indirection
  file update = replace with new version
  array of versions of entire filesystem
  only copy modified parts
    keep reference counts, like for paging assignment
  lots of pointers: only change pointers where modifications happen
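
One way to picture "only change pointers where modifications happen" is path copying over nested dicts. This is a sketch under loud assumptions: the filesystem is modelled as dicts of dicts, `cow_update` is a made-up helper, and Python's own references play the role of on-disk pointers (so no explicit reference counts appear).

```python
def cow_update(node, path, value):
    """Return a new version of `node` with `path` set to `value`.

    Only the dicts along `path` are copied; everything else is shared.
    """
    if not path:
        return value
    new_node = dict(node)  # shallow copy: children are still shared
    new_node[path[0]] = cow_update(node[path[0]], path[1:], value)
    return new_node


v1 = {"a.txt": {"block0": "AAA", "block1": "BBB"}, "b.txt": {"block0": "CCC"}}
v2 = cow_update(v1, ["a.txt", "block1"], "bbb")
versions = [v1, v2]  # array of versions of the entire (toy) filesystem
```

Here `v1` is untouched and still readable as a snapshot, while `v2` shares the whole of `b.txt` with it.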

snapshots in practice
  ZFS supports this (if turned on)
  example: .zfs/snapshots/11.11.18-06 pseudo-directory
    contains contents of files at 11 November 2018 6AM

backup/if time slides

copy-on-write and logging
  copy-on-write is a nice solution to duplicate writes
  before (data journalling):
    write new data to journal
    copy new data to real location
  after (copy-on-write):
    write new data to new location
    update pointer to point to new location
  useful even without snapshots
  but maybe not keeping file data in best place?

aside: fsync
  filesystem can order things carefully
  filesystem can make sure data on disk before proceeding
  what if I, non-OS programmer want to do that?
  POSIX mechanism: fsync
    “please actually write this file to disk now, I’ll wait”
  some stories of broken implementations of fsync
    nasty problem: how do you test it???
  some varying interpretations
    some only send to disk, but don’t wait for disk to finish writing
    some do not guarantee updating file’s directory entry

changing file atomically?
  often applications want to update a file all at once
  on Unix, one way to do this:
    create a new file with a hard-to-guess name in the same directory
    rename the new file to replace the old file
      overwrites that directory entry
  no one will ever read partially written file
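
The create-then-rename recipe might look like this in Python, using only standard-library calls (`tempfile.mkstemp`, `os.fsync`, `os.replace`). The function name `replace_file_atomically` is made up for this sketch, and it deliberately leaves out syncing the directory itself, which some systems may need for the rename to survive a crash.

```python
import os
import tempfile


def replace_file_atomically(path, data: bytes):
    """Write `data` to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Hard-to-guess name, created in the same directory as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask for the data to be on disk before renaming
        os.replace(tmp_path, path)  # overwrites the old directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # don't leave the temporary file behind on failure
        raise
```

Usage would be something like `replace_file_atomically("config.txt", b"new contents")`: readers see either the old file or the new one, never a partially written mix.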

log-structured filesystems
  logging is a great access pattern for hard drives and SSDs
    sequential
    right for SSDs: write everything once before writing again
  how about designing a filesystem around it!
  idea: log-structured filesystems

log-structured filesystem
  image: Rosenblum and Ousterhout, “The Design and Implementation of a Log-Structured File System”

log-structured filesystem ideas
  write inodes + data + free map + etc. to log instead of disk
  problem: scanning log to find latest version of inode?
  periodically write inode maps to log
    computed latest location of inodes
    searching limited to last inode map

log-structured FS garbage collection
  challenge: what happens when log gets to the end of the disk?
  want to start from beginning of disk again…
  either: copy data to free space or ‘thread’ log around used space
  image: Rosenblum and Ousterhout, “The Design and Implementation of a Log-Structured File System”

log-structured filesystems in practice
  the kind of ideas you’d use to implement an SSD
  used for some filesystems that work directly with Flash chips

RAID 4 parity
  ⊕: bitwise xor
  disk 1: A_1: sector 0, B_1: sector 2, …
  disk 2: A_2: sector 1, B_2: sector 3, …
  disk 3: A_p: A_1 ⊕ A_2, B_p: B_1 ⊕ B_2, …
  can compute contents of any disk!
    A_p = A_1 ⊕ A_2
    A_1 = A_p ⊕ A_2
    A_2 = A_1 ⊕ A_p
  exercise: how to replace sector 3 (B_2) with new value?
    how many writes? how many reads?
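
One possible approach to the exercise, offered as a sketch and not as the slides' own answer: read the old data and old parity, then write the new data and a parity updated by XORing out the old value and XORing in the new one. The disks are modelled as Python lists of byte strings, and `xor_blocks` and `replace_sector` are made-up names.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def replace_sector(data_disk, parity_disk, index, new_value):
    """Update one data sector and its parity without touching the other data disk."""
    old_value = data_disk[index]      # read 1
    old_parity = parity_disk[index]   # read 2
    data_disk[index] = new_value      # write 1
    # new parity = old parity with the old value removed and the new value added
    parity_disk[index] = xor_blocks(xor_blocks(old_parity, old_value), new_value)  # write 2


disk1 = [b"\x01\x02", b"\x10\x20"]  # A_1, B_1
disk2 = [b"\x04\x08", b"\x40\x80"]  # A_2, B_2
disk3 = [xor_blocks(a, b) for a, b in zip(disk1, disk2)]  # A_p, B_p

replace_sector(disk2, disk3, 1, b"\xff\x00")  # replace B_2
```

In this sketch the update costs two reads and two writes regardless of how many data disks there are, because the untouched data disks never need to be read.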

RAID 4 parity (more disks)
  disk 1: A_1: sector 0, B_1: sector 3, …
  disk 2: A_2: sector 1, B_2: sector 4, …
  disk 3: A_3: sector 2, B_3: sector 5, …
  disk 4: A_p: A_1 ⊕ A_2 ⊕ A_3, B_p: B_1 ⊕ B_2 ⊕ B_3, …
  can still compute contents of any disk!
    A_p = A_1 ⊕ A_2 ⊕ A_3
    A_1 = A_p ⊕ A_2 ⊕ A_3
    A_2 = A_1 ⊕ A_p ⊕ A_3
    A_3 = A_1 ⊕ A_2 ⊕ A_p
  exercise: how to replace sector … with new value now?
    how many writes? how many reads?
