File System Project Seminar: On-Disk Layout
Prof. Andreas Polze
Andreas Grapentin, Sven Köhler, Max Plauth, Jossekin Beilharz, Felix Eberhardt
Hasso Plattner Institute
File System Seminar Overview
Polze, Grapentin, Köhler Plauth, Beilharz, Eberhardt 14.11.2017 File System Seminar On-Disk Layout Chart 2
[Figure: program → open, readdir → Virtual File System → ext4, procfs, btrfs → block buffer → disk; a "today" marker highlights the part covered in this session.]
File System Seminar Tasks of A File System (Simplified)
A file system needs to be …
Searchable
□ resolve filename to metadata
□ resolve filename to data (streams, forks)
□ find the block corresponding to a given file position
Modifiable
□ find space to add new data
□ find space to add new metadata
□ mark bad blocks
□ query existing free space
Cylinder-Head-Sector
For several decades, continuously spinning magnetic disks were the gold standard for secondary storage. Data was originally addressed block-wise by Cylinder-Head-Sector (CHS) tuples. To reduce movement of the head (arm), data is laid out along cylinders first. Modern buses allow Logical Block Addressing (LBA) with linear block numbers.
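The relation between CHS tuples and linear block numbers can be sketched as follows. This is a minimal illustration of the conventional mapping; the function names and the geometry parameters (`heads_per_cylinder`, `sectors_per_track`) are chosen here for the example and do not come from the slides.

```python
def chs_to_lba(c, h, s, heads_per_cylinder, sectors_per_track):
    """Map a (cylinder, head, sector) tuple to a linear block number.

    Cylinders and heads count from 0, sectors count from 1.
    """
    if not 0 <= h < heads_per_cylinder:
        raise ValueError("head out of range")
    if not 1 <= s <= sectors_per_track:
        raise ValueError("sector out of range")
    return (c * heads_per_cylinder + h) * sectors_per_track + (s - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: linear block number back to (cylinder, head, sector)."""
    c, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    h, s0 = divmod(rest, sectors_per_track)
    return c, h, s0 + 1
```

Consecutive block numbers first walk along one track, then switch heads within the same cylinder, and only then move to the next cylinder, which reflects the "cylinders first" layout described above.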
Sector Vs. Block Vs. Cluster
Main overhead factors when accessing data:
■ latency (seek + rotational): How long to wait for the first byte?
■ throughput (transfer rate): How many bytes per second once started?
During the latency incurred for a single byte, many more bytes could be transferred.
Insight: Group bytes into blocks, and into even bigger units at the FS level.
Multiple names: sector (within a track), cluster, physical block, logical block, device block, file system block
Cylinder Groups
Many file systems (UFS, ext, NTFS) use block groups to keep semantically connected data within one cylinder. This reduces head seeks and limits fragmentation within the partition.
[Figure: partition divided into three block groups, each laid out as SB | Meta | Files; the repeated superblocks (SB) are redundant backups.]
Tracking Available (Free) Space
■ Blocks can be occupied.
■ Blocks can become bad (bit rot).
■ Free inodes need to be tracked likewise.
■ Use a linked list:
  [0x001, 0x003, 0x040 | next] → [0xa00, 0xab0, 0xab1 | next]
■ Use a bitmap:
  1110001010001010 0100010111010011 0101111011101001 0011100101110010
  block #0 is occupied, block #54 is free
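A bitmap lookup can be sketched in a few lines. The example below reads the slide's bitmap from left to right with 1 meaning "occupied"; that reading is an assumption, chosen because it agrees with the two labelled blocks (#0 occupied, #54 free). All function names are illustrative.

```python
# Bitmap from the slide, read left to right; 1 = occupied, 0 = free (assumed).
SLIDE_BITMAP = (
    "1110001010001010 0100010111010011 "
    "0101111011101001 0011100101110010"
)


def parse_bitmap(text):
    """Turn a string of 0/1 characters into a list of 'occupied' flags."""
    return [ch == "1" for ch in text if ch in "01"]


def is_free(bitmap, block):
    return not bitmap[block]


def find_free(bitmap, start=0):
    """Return the first free block at or after `start`, or None."""
    for block in range(start, len(bitmap)):
        if not bitmap[block]:
            return block
    return None


def count_free(bitmap):
    """Answer 'query existing free space' in blocks."""
    return bitmap.count(False)


bitmap = parse_bitmap(SLIDE_BITMAP)
```

A linked list only needs space proportional to the number of free blocks, while the bitmap has a fixed size of one bit per block and answers "is block n free?" with a single lookup.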
Storing Files – Overview
How do we find a file's data once the inode/file descriptor is at hand? Different methods to store files exist, each suited to different use cases:
□ continuous allocation
□ linked list – separated linked list (FAT)
□ indexed block references – direct data
□ extents
Storing Files – Continuous Allocation
The file system keeps track of free blocks. For a new file, the necessary number of blocks is reserved and only a (start_block, size) tuple is stored.
Advantages
□ Simple implementation
□ No file fragmentation
□ Very few seeks
Disadvantages
□ Growing files need an expensive move
□ High external fragmentation
[Figure: row of disk blocks holding File 1, File 3 and File 2, with free regions.]
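The external fragmentation problem can be made concrete with a small first-fit allocator. This is only a sketch: first-fit is one possible placement policy (the slides do not name one), and the helper names are made up for the example.

```python
def allocate_contiguous(free_map, n_blocks):
    """First-fit: reserve the first run of n_blocks free blocks.

    free_map[i] is True if block i is free. Returns the
    (start_block, size) tuple that would be stored for the file,
    or None if no sufficiently long run exists.
    """
    if n_blocks <= 0:
        raise ValueError("n_blocks must be positive")
    run_start, run_len = 0, 0
    for i, free in enumerate(free_map):
        if not free:
            run_len = 0
            continue
        if run_len == 0:
            run_start = i
        run_len += 1
        if run_len == n_blocks:
            for b in range(run_start, run_start + n_blocks):
                free_map[b] = False
            return run_start, n_blocks
    return None


def release(free_map, extent):
    """Mark all blocks of a (start_block, size) tuple as free again."""
    start, size = extent
    for b in range(start, start + size):
        free_map[b] = True


disk = [True] * 10
file1 = allocate_contiguous(disk, 3)
file2 = allocate_contiguous(disk, 3)
file3 = allocate_contiguous(disk, 4)
release(disk, file1)
release(disk, file3)
# Seven blocks are free now, but the longest free run is only four blocks.
too_big = allocate_contiguous(disk, 5)
```

After the two releases, `too_big` is `None` although enough blocks are free in total: the free space is split into runs of three and four blocks around `file2`.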
Storing Files – Linked List
The inode/file descriptor/metadata points to the first block. Each block contains data and a pointer to the next block in this file.
Advantages
□ Data can be distributed across the device
□ Files can be resized
□ No external fragmentation
Disadvantages
□ “Odd” wasted space per data block
□ High file fragmentation risk
□ No random access
□ High seek times
[Figure: Inode → (data | next) → (data | next) → (data | end)]
Storing Files – Separated Linked List
The inode/file descriptor/metadata points to the first block. Data fills entire blocks. A separate table records, for each block, either its successor or a marker that it is the last block. Free and bad blocks can be tracked in the same table. Example: FAT
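Following such a table can be sketched as below. The sentinel values `END_OF_CHAIN`, `FREE` and `BAD` are placeholders invented for this example and are not the on-disk values of any real FAT variant.

```python
# Hypothetical sentinel values for table entries (not real FAT encodings).
END_OF_CHAIN = -1
FREE = -2
BAD = -3


def file_blocks(table, first_block):
    """Collect the block numbers of a file by following the table."""
    blocks = []
    block = first_block
    while block != END_OF_CHAIN:
        if table[block] in (FREE, BAD):
            raise ValueError(f"chain runs into unusable block {block}")
        blocks.append(block)
        if len(blocks) > len(table):
            raise ValueError("cycle in allocation table")
        block = table[block]
    return blocks


def free_blocks(table):
    """The same table also answers which blocks are available."""
    return [i for i, entry in enumerate(table) if entry == FREE]


# Block:  0     1  2             3    4  5     6  7
table = [FREE, 4, END_OF_CHAIN, BAD, 6, FREE, 2, FREE]
chain = file_blocks(table, 1)
```

Because the links live in one compact table rather than inside the data blocks, the whole chain can be resolved without touching the data blocks themselves, and each data block keeps its full size.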
Storing Files – Indexed Block References
The inode contains a fixed number of block references. For files exceeding that number, an additional block may be pointed at that contains further block references, and so forth.
Advantages
□ Small files require little overhead
□ Random access is efficient for small files
□ Sparse files are possible
Disadvantages
□ Limits file size
□ High file fragmentation risk
□ Many seeks for random access on large files
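How a file position maps onto the reference levels can be sketched as follows. The numbers are assumptions for illustration: 11 direct references plus one single, one double and one triple indirect reference (one reading of the "(11 + 3)" split used in the direct-data slide), and 256 references per indirect block (1 KiB blocks with 4-byte references).

```python
def locate_block(file_block, n_direct=11, refs_per_block=256):
    """Find where the reference for the file's n-th block lives.

    Returns (level, path): level 0 means a direct reference in the
    inode, level 1..3 the single/double/triple indirect tree. `path`
    lists the index to follow at each step, outermost first.
    """
    if file_block < 0:
        raise ValueError("negative block index")
    if file_block < n_direct:
        return 0, [file_block]
    index = file_block - n_direct
    capacity = refs_per_block
    for level in (1, 2, 3):
        if index < capacity:
            digits = []
            for _ in range(level):
                index, digit = divmod(index, refs_per_block)
                digits.append(digit)
            return level, digits[::-1]
        index -= capacity
        capacity *= refs_per_block
    raise ValueError("file position exceeds the maximum file size")


def max_file_blocks(n_direct=11, refs_per_block=256):
    """Upper bound on the file size in blocks for this scheme."""
    r = refs_per_block
    return n_direct + r + r * r + r * r * r
```

The returned level is also the number of extra block reads needed before the data block itself can be read, which is why random access into the far end of a large file is comparatively expensive.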
Storing Files – Indexed Block References (Direct Data)
The inode repurposes the block references to store the actual data. Commonly used for lock files, pid files and symbolic links. No additional seeks are required.
E.g. for 32-bit block references: (11 + 3) · 32 bit = 14 · 4 bytes = 56 bytes available
Storing Files – Extents
Multiple large, contiguous areas are reserved for file ranges. Each extent is addressed by only a (first block, length) tuple. Each file's extents can be stored as (linked) lists or trees.
Advantages
□ Very little overhead data required
□ Limits file fragmentation
Disadvantages
□ Difficult to add extents on fragmented systems
□ Copy-on-write for small changes is very expensive
□ Requires a block buffer to allocate large areas on flush
[Figure: Inode → Extent 1, Extent 2, Extent 3]
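Resolving a file position through a sorted extent list can be sketched like this. In addition to the (first block, length) tuple from the slide, each extent here also records the file-relative block it starts at; that extra field and all names are assumptions made for the example.

```python
from bisect import bisect_right


def find_disk_block(extents, file_block):
    """Translate a file-relative block index to a disk block.

    `extents` is a list of (first_file_block, first_disk_block, length)
    tuples sorted by first_file_block. Returns None for a hole.
    """
    starts = [extent[0] for extent in extents]
    i = bisect_right(starts, file_block) - 1
    if i < 0:
        return None
    first_file_block, first_disk_block, length = extents[i]
    if file_block >= first_file_block + length:
        return None  # position lies between two extents (sparse region)
    return first_disk_block + (file_block - first_file_block)


# Three extents describing 14 blocks of data with a hole at blocks 12..19.
extents = [(0, 1000, 8), (8, 5000, 4), (20, 7000, 2)]
```

Three tuples describe fourteen data blocks here, whereas an indexed scheme would need fourteen separate references; this is the "very little overhead data" advantage listed above.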
Directory Entry Structure
Attributes in directory entry (FAT):
[bin | attributes] [home | attributes] [usr | attributes] [var | attributes]
Separated attributes (Unix):
[bin | inode] [home | inode] [usr | inode] [var | inode], with the attributes stored in the inode
Directory Structure (Canonical)
Table of fixed-length entries (Minix 1, FAT16):
[ino | name[MAX]] [ino | name[MAX]] [ino | name[MAX]] [ino | name[MAX]]
Linked list of variable-sized entries (ext2, VFAT):
□ Can still be contiguous on disk
□ Fast unlink (ext2) marks ino as 0
[ino | rec_len | name] [ino | rec_len | name] [ino | rec_len | name] [rec_len | unused]
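Walking such a list of variable-sized entries can be sketched with Python's `struct` module. The byte layout below (little-endian `ino`, `rec_len`, a `name_len` byte, one padding byte, then the name) is a simplified layout invented for this example; it is loosely modelled on the slide's fields and is not the exact ext2 format.

```python
import struct

# Hypothetical entry header: ino (u32), rec_len (u16), name_len (u8), pad.
HEADER = struct.Struct("<IHBx")


def pack_entry(ino, name):
    """Build one entry, padded so that rec_len is a multiple of 4."""
    raw = name.encode()
    rec_len = (HEADER.size + len(raw) + 3) & ~3
    return (HEADER.pack(ino, rec_len, len(raw)) + raw).ljust(rec_len, b"\0")


def walk(block):
    """Yield (offset, ino, name) for every entry, following rec_len."""
    offset = 0
    while offset + HEADER.size <= len(block):
        ino, rec_len, name_len = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size + name_len:
            raise ValueError("corrupt directory entry")
        start = offset + HEADER.size
        yield offset, ino, bytes(block[start:start + name_len]).decode()
        offset += rec_len


def list_entries(block):
    """Live entries only: entries with ino == 0 count as unused."""
    return [(ino, name) for _, ino, name in walk(block) if ino != 0]


def unlink(block, name):
    """Fast unlink as on the slide: overwrite ino with 0, keep rec_len."""
    for offset, ino, entry_name in walk(block):
        if ino != 0 and entry_name == name:
            struct.pack_into("<I", block, offset, 0)
            return True
    return False


directory = bytearray(
    pack_entry(2, "bin") + pack_entry(11, "home")
    + pack_entry(12, "usr") + pack_entry(13, "var")
)
```

Since `rec_len` stays untouched on unlink, the walk still steps over the dead entry correctly, and its space can later be reused for a new name that fits.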
Recap: Glossary Search Trees
[Figure: binary search tree with the keys 7, 10, 14, 19, 21, 23]
■ node, root, leaf
■ parent, child, sibling
■ degree, depth, height
■ pre-order, in-order, post-order, level-order
■ full, complete, balanced
■ value vs. key, value-tuple
■ AVL trees, red-black trees
Definition
[Figure: example B-tree with the keys 7, 10, 14, 19, 21, 23, 25; labels B = 2 and B = 3]
■ A B-Tree is a search tree for keys
■ All leaves have the same depth
■ Classified by a parameter B:
□ B ≤ #children < 2 · B
□ B – 1 ≤ #keys < 2 · B – 1
■ (Keys within a node are sorted)
Theory Vs. Reality: Complexity
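[Figure: binary search tree with the keys 7, 10, 14, 19, 21, 23] $O(\log_2 n)$
[Figure: B-tree with the keys 7, 10, 14, 19, 21, 23, 25] $O(\log_3 n)$
$O(f) = \{\, g : \mathbb{N} \to \mathbb{R} \mid \exists c > 0 \;\exists n_0 \in \mathbb{N} \;\forall n \ge n_0 : g(n) \le c \cdot f(n) \,\}$
$\log_3 n = \frac{\log_2 n}{\log_2 3} = c \cdot \log_2 n$ with $c = \frac{1}{\log_2 3}$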
Theory Vs. Reality: Number of Block Seek Operations
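[Figure: binary search tree with the keys 7, 10, 14, 19, 21, 23] $O(\log_2 n)$
[Figure: B-tree with the keys 7, 10, 14, 19, 21, 23, 25, stored in level-order] $O(\log_3 n)$
k = 1: $\log_2(2^{30}) = 30$ seeks
k = 2: $\log_3(2^{30}) \approx 18.9$, i.e. about 18 to 19 seeks
k = 1024: $\log_{1024}(2^{30}) = 3$ seeks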
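The seek counts can be reproduced with a short calculation. The sketch assumes that one level of the tree costs one block seek and that the logarithm base equals the fan-out used on the slide; the function name is made up.

```python
import math


def estimated_seeks(n_keys, fanout):
    """Tree height estimate log_fanout(n_keys), one seek per level."""
    return math.log2(n_keys) / math.log2(fanout)


n = 2 ** 30
binary = estimated_seeks(n, 2)
ternary = estimated_seeks(n, 3)
wide = estimated_seeks(n, 1024)
```

All three trees are in the same complexity class, yet the constant factor separates 30 seeks from 3, which is why on-disk search trees use nodes that fill a whole block.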
Operations: Search
B = 2
[Figure: B-tree holding the keys 3, 7, 10, 14, 17, 20, 24, 30, 32, 38, 42, 48]
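A search over such a tree can be sketched as follows. The tree shape built at the end is one arrangement of the slide's keys that satisfies the B = 2 bounds; the exact shape drawn on the slide is an assumption, and the `Node` class is invented for the example.

```python
from bisect import bisect_left


class Node:
    """B-tree node: sorted keys and, for inner nodes, len(keys) + 1 children."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def search(root, key):
    """Return the number of nodes visited to find key, or None if absent."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return visited
        # Descend into the child between keys[i - 1] and keys[i].
        node = node.children[i] if node.children else None
    return None


# Assumed arrangement of the keys for B = 2 (1 to 2 keys per node).
tree = Node([30], [
    Node([10, 17], [Node([3, 7]), Node([14]), Node([20, 24])]),
    Node([38], [Node([32]), Node([42, 48])]),
])
```

Each visited node corresponds to one block read, so the return value of `search` is the number of seeks discussed in the previous slide.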
Operations: Insert
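B = 4
[Figure: key 42 is inserted into a B-tree; keys shown: 9, 10, 17, 21, 30, 38, 49, 56]
Split (recursive up to and beyond the root):
[Figure: the tree after the split; keys shown: 9, 10, 17, 21, 30, 38, 42, 49, 56]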
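The split step can be sketched for a single node. The example assumes that the full node holds the keys 10, 17, 21, 30, 38, 49 below a parent with the keys 9 and 56, and that the median key moves up on overflow; both points are a reading of the figure rather than something the slide states, and the function names are made up.

```python
from bisect import insort


def max_keys(b):
    """Largest key count allowed by B - 1 <= #keys < 2 * B - 1."""
    return 2 * b - 2


def insert_into_leaf(keys, key, b):
    """Insert key into a leaf; split around the median on overflow.

    Returns (left, median, right). Without overflow, median and right
    are None and left holds all keys.
    """
    keys = sorted(keys + [key])
    if len(keys) <= max_keys(b):
        return keys, None, None
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


parent = [9, 56]
left, median, right = insert_into_leaf([10, 17, 21, 30, 38, 49], 42, b=4)
if median is not None:
    # The median moves up; the parent may overflow in turn, which is the
    # "recursive up to and beyond the root" part.
    insort(parent, median)
```

Both halves end up with B – 1 = 3 keys, the minimum the definition allows, so neither violates the bounds right after the split.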
Operations: Deletion I
B = 4
[Figure: B-tree before a deletion; keys shown: 5, 10, 19, 23, 27, 42, 51, 62, 77, 83]
Operations: Deletion II
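B = 4
[Figure: B-tree before the rotation; keys shown: 10, 23, 27, 42, 51, 62, 77, 83, 99, with 19 set apart]
Rotation (if siblings are filled)
[Figure: B-tree after the rotation; keys shown: 10, 23, 27, 42, 51, 62, 77, 83, 99]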
Operations: Deletion III
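B = 4
[Figure: B-tree before the merge; keys shown: 10, 23, 27, 42, 51, 62, 77, 83, 99]
Merge (underflowing siblings)
[Figure: B-tree after the merge; keys shown: 10, 23, 27, 51, 62, 77, 83, 99]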