
SLIDE 1

Operating Systems

Design and Implementation

Chapter 05

(version January 30, 2008)

Melanie Rieback

Vrije Universiteit Amsterdam, Faculty of Sciences

Dept. Computer Science

Room R4.23. Tel: (020) 598 7874 E-mail: melanie@cs.vu.nl, URL: www.cs.vu.nl/∼melanie/

01 Introduction 02 Processes 03 Input/Output 04 Memory Management 05 File Systems


SLIDE 2

File Systems

Files
Directories
File system implementation
Security
MINIX file system

05 – 1 File Systems/

SLIDE 3

File systems

Basic model: a file is just an abstract storage device with the following operations:

type FILE_ID is INT
create(id: out FILE_ID)
delete(id: in FILE_ID)
open(id: in FILE_ID)
close(id: in FILE_ID)
read(f: in FILE_ID, pos: in INT, data: out BYTE)
write(f: in FILE_ID, pos: in INT, data: in BYTE)

Idea: the operating system returns a unique file identifier when a file is created. This file id is used on all subsequent operations.

Variations:

Only support sequential files: (1) reads can only be done starting at the head of the file, (2) writes imply appending data to the file.

Support structured files by providing records instead of bytes, which might further be organized as a tree.

05 – 2 File Systems/5.1 Files

SLIDE 4

File Organization

Observation: Although many files are byte-oriented and unstructured from the OS point of view, internally, things may be quite different:

Figure: (a) an executable file: a header (magic number, text size, data size, BSS size, symbol table size, entry point, flags) followed by text, data, relocation bits, and symbol table. (b) an archive: a sequence of object modules, each preceded by a header (module name, date, owner, protection, size).

05 – 3 File Systems/5.1 Files

SLIDE 5

File Attributes

Field                 Meaning
--------------------  -------------------------------------------------
Protection            Who can access the file and in what way
Password              Password needed to access the file
Creator               Id of the person who created the file
Owner                 Current owner
Read-only flag        0 for read/write; 1 for read only
Hidden flag           0 for normal; 1 for do not display in listings
System flag           0 for normal files; 1 for system file
Archive flag          0 for has been backed up; 1 for needs to be backed up
ASCII/binary flag     0 for ASCII file; 1 for binary file
Random access flag    0 for sequential access only; 1 for random access
Temporary flag        0 for normal; 1 for delete file on process exit
Lock flags            0 for unlocked; nonzero for locked
Record length         Number of bytes in a record
Key position          Offset of the key within each record
Key length            Number of bytes in the key field
Creation time         Date and time the file was created
Time of last access   Date and time the file was last accessed
Time of last change   Date and time the file was last changed
Current size          Number of bytes in the file
Maximum size          Number of bytes the file may grow to

05 – 4 File Systems/5.1 Files

SLIDE 6

Directories

Basic model: It is necessary to organize files, i.e. keep track of them. A directory is a data structure that holds information on files, either in the directory entry itself (a) or in a separate place that the entry refers to (b).

Figure: (a) a directory with entries games, mail, news, work, each holding the file’s attributes; (b) the same entries, each referring to a separate data structure containing the attributes.

Note: the information maintained in a directory is often so important that it is worth the trouble storing it ... in a file.

05 – 5 File Systems/5.2 Directories

SLIDE 7

Directories – Pathnames

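Idea: if files are organized in a graph, with intermediate nodes acting as directories, we can easily identify files by naming:

Figure: a directory tree starting at root, containing (among others) usr, ast, mbox, bin, and image; two different files named gcc are told apart by their pathnames /bin/gcc and /usr/ast/bin/gcc.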

Question: Do we actually need a root?

05 – 6 File Systems/5.2 Directories

SLIDE 8

File System Design

Problem: how do we actually design & implement a file system?

File storage: in order to manipulate files, we have to know exactly where each part of a file is stored on disk.

Directory implementation: directories can be implemented as files, but there are also other possibilities.

Disk space management: we have to map files to disk, implying that we need to keep track of used and free blocks. Note: this is a different problem than file storage.

Consistency: file systems can be inconsistent in different ways. How do we handle this?

05 – 7 File Systems/5.3 File system implementation

SLIDE 9

File Storage: Disk Layout

Recall: A file system is usually stored on disk, possibly with several other file systems. All information on this organization needs to be stored as well:

Figure: the entire disk starts with the MBR and the partition table, followed by the disk partitions; each partition holds a boot block, super block, free space management, i-nodes, the root directory, and files and directories.

Note: Partition info is generally stored at the end of the master boot record ⇒ limits the number of possible partitions. Solution: use extended partitions, or subpartition tables stored in a partition.

05 – 8 File Systems/5.3 File system implementation

SLIDE 10

File Storage – Linked List Allocation

Essence: just record in a file block, which disk block contains the next file block. Pretty bad for things like random access...

Figure: File A consists of file blocks 0 to 4, stored in physical blocks 4, 7, 2, 10, 12. File B consists of file blocks 0 to 3, stored in physical blocks 6, 3, 11, 14.

05 – 9 File Systems/5.3 File system implementation

SLIDE 11

File Storage – File Allocation Table

Associate with each disk exactly one table FAT, where FAT[k] contains information on the kth disk block.

Each file is uniquely associated with its first block on disk, e.g. first(F) = 8 means that the first block of file F is disk block #8.

FAT[k] denotes the next disk block of the file whose data is also stored on disk block #k.

If FAT[k] = EOF then disk block #k was the file’s last block.

FAT[k] = FREE denotes an unused block; FAT[k] = BAD denotes a bad block, etc.

Figure: file F starts at disk block 4 and follows the chain FAT[4] = 7, FAT[7] = 2, FAT[2] = 10, FAT[10] = 12; FAT[12] holds the end-of-file marker.
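The chain-following rule can be sketched in a few lines of C. This is only an illustration, not the code of any real FAT driver: the sentinel values FAT_EOF, FAT_FREE, FAT_BAD and the function names are assumptions made for the sketch.

```c
#include <stddef.h>

/* Sentinel values are assumptions for this sketch; real FAT variants
 * reserve different bit patterns for these cases. */
#define FAT_EOF  (-1)
#define FAT_FREE (-2)
#define FAT_BAD  (-3)

/* Return the disk block that holds file block n (0-based) of the file
 * whose first disk block is `first`, or -1 if the chain is too short. */
int fat_nth_block(const int *fat, size_t fat_len, int first, int n)
{
    int k = first;
    while (n > 0) {
        if (k < 0 || (size_t)k >= fat_len) return -1;
        k = fat[k];              /* FAT[k] is the next block of the file */
        n--;
    }
    if (k < 0 || (size_t)k >= fat_len) return -1;
    return k;
}

/* Count the disk blocks in the chain that starts at `first`.
 * The length is capped at fat_len so a corrupt, cyclic chain terminates. */
int fat_chain_length(const int *fat, size_t fat_len, int first)
{
    int len = 0;
    int k = first;
    while (k >= 0 && (size_t)k < fat_len && (size_t)len < fat_len) {
        len++;
        k = fat[k];
    }
    return len;
}
```

Note that reaching file block n costs n table lookups, but the table can be kept in memory, so random access no longer needs n disk reads as with plain linked list allocation.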

Question: Where do we store that file F has disk block #k as its first block?

05 – 10 File Systems/5.3 File system implementation

SLIDE 12

File Storage – Inodes

Figure: an i-node holds the file mode, number of links, owner’s user id, owner’s group id, file size, time created, time last accessed, time last modified, and disk block numbers #1 ... #10, followed by a single indirect, a double indirect, and a triple indirect disk block number. The indirect blocks contain pointers to the disk blocks that hold the file data.

05 – 11 File Systems/5.3 File system implementation

SLIDE 13

Directory structure – Windows 98

The root directory is special: it is a fixed table of directory entries.

A base directory entry is 32 bytes and corresponds to the following record:

Field                              Bytes
---------------------------------  -----
Base name                          8
Ext                                3
Attributes                         1
NT                                 1
Sec                                1
Creation date/time                 4
Last access                        2
Upper 16 bits of starting block    2
Last write date/time               4
Lower 16 bits of starting block    2
File size                          4

Extra directory entries allow for long names, with a base directory entry acting as a sentinel:

Field                              Bytes
---------------------------------  -----
Sequence                           1
5 characters                       10
Attributes                         1
(zero)                             1
Checksum                           1
6 characters                       12
(zero)                             2
2 characters                       4

05 – 12 File Systems/5.3 File system implementation

SLIDE 14

File storage – UNIX

Really simple: a directory is just an ordinary file consisting of directory entries that (used to) have the form:

typedef struct {
    unsigned short inode_number;   /* 2 bytes */
    char file_name[14];            /* 14 bytes */
} DIRECTORY_ENTRY;

which adds up to 16 bytes. Now: inode numbers are often 4 bytes, file names can be up to 255 characters.

Note:

An inode number uniquely identifies an inode.

There is precisely one inode per file.

Inodes have to be stored somewhere: where? Is there a limit to the number of inodes?

The inode number of the root directory is known in advance. So is its place on disk.

05 – 13 File Systems/5.3 File system implementation

SLIDE 15

File Storage – Name Resolution

Figure: the steps in looking up /usr/ast/mbox.

Root directory: . = 1, .. = 1, bin = 4, dev = 7, lib = 14, etc = 9, usr = 6, tmp = 8. Looking up usr yields i-node 6.

I-node 6 is for /usr (mode, size, times, 132). I-node 6 says that /usr is in block 132.

Block 132 is the /usr directory: . = 6, .. = 1, dick = 19, erik = 30, jim = 51, ast = 26, bal = 45. /usr/ast is i-node 26.

I-node 26 is for /usr/ast (mode, size, times, 406). I-node 26 says that /usr/ast is in block 406.

Block 406 is the /usr/ast directory: . = 26, .. = 6, grants = 64, books = 92, mbox = 60, minix = 81, src = 17. /usr/ast/mbox is i-node 60.
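The lookup can be mimicked with a small in-memory sketch. It skips the i-node and disk block indirection and maps an i-node number straight to a directory's entries; the types and the function names find_dir() and resolve_path() are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>
#include <string.h>

struct dir_entry { int inode; const char *name; };

/* A directory, identified by its own i-node number (sketch only: the
 * step "i-node -> disk block" is collapsed into this table). */
struct directory {
    int inode;
    const struct dir_entry *entries;
    size_t count;
};

static const struct directory *find_dir(const struct directory *dirs,
                                        size_t ndirs, int inode)
{
    for (size_t i = 0; i < ndirs; i++)
        if (dirs[i].inode == inode) return &dirs[i];
    return NULL;
}

/* Resolve an absolute path one component at a time, starting at the
 * root i-node. Returns the i-node number, or -1 if a component is
 * missing or an intermediate component is not a directory. */
int resolve_path(const struct directory *dirs, size_t ndirs,
                 int root_inode, const char *path)
{
    int cur = root_inode;
    const char *p = path;

    for (;;) {
        while (*p == '/') p++;                 /* skip separators */
        if (*p == '\0') return cur;            /* nothing left to look up */
        size_t len = strcspn(p, "/");          /* length of this component */
        const struct directory *d = find_dir(dirs, ndirs, cur);
        if (d == NULL) return -1;              /* cur is not a directory */
        int next = -1;
        for (size_t i = 0; i < d->count; i++) {
            const char *name = d->entries[i].name;
            if (strlen(name) == len && memcmp(name, p, len) == 0) {
                next = d->entries[i].inode;
                break;
            }
        }
        if (next < 0) return -1;               /* no such entry */
        cur = next;
        p += len;
    }
}
```

Filled with the three directories from the figure and root i-node 1, resolving /usr/ast/mbox visits i-nodes 1, 6, and 26 and ends at i-node 60.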

05 – 14 File Systems/5.3 File system implementation

SLIDE 16

Disk Space Management (1/2)

Problem: the administration of free disk space has to be kept on disk as well ⇒ you don’t want too much space for that.

Figure: (a) the free list stored as a linked list of disk blocks (here kept in free disk blocks 16, 17, 18); a 1-KB disk block can hold 256 32-bit disk block numbers. (b) a bitmap with one bit per disk block.
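A bitmap needs only one bit per disk block, which is why it is usually the more compact of the two. A minimal sketch of bitmap allocation follows; the names bitmap_alloc(), bitmap_free(), and NO_BIT are assumptions for this illustration and are not taken from any particular file system.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_BIT ((size_t)-1)   /* returned when every bit is in use */

/* Find a clear bit, scanning nbits bits circularly from `start`,
 * set it, and return its index. A set bit means "block in use". */
size_t bitmap_alloc(uint8_t *map, size_t nbits, size_t start)
{
    for (size_t i = 0; i < nbits; i++) {
        size_t bit = (start + i) % nbits;
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        if (!(map[bit / 8] & mask)) {
            map[bit / 8] |= mask;
            return bit;
        }
    }
    return NO_BIT;
}

/* Mark a block as free again by clearing its bit. */
void bitmap_free(uint8_t *map, size_t bit)
{
    map[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}
```

Starting the scan at the position of the previous allocation, rather than at bit 0, avoids rescanning the densely used front of the map on every call.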

05 – 15 File Systems/5.3 File system implementation

SLIDE 17

Disk Space Management (2/2)

Problem: There’s a tradeoff when choosing a disk block size: larger blocks improve the data rate, but lower the disk space utilization.

Figure: data rate (KB/sec, axis 0 to 1000) and disk space utilization (percent, axis 0 to 100) plotted against block size (128 bytes to 16K).

05 – 16 File Systems/5.3 File system implementation

SLIDE 18

File System Reliability

Essence: storage devices still mess up – they have so-called bad blocks that make it hard to keep a file system reliable.

Solution: simply back up the system regularly so that parts of it can be restored when a bad block occurs. The problem is how to do backups efficiently:

incremental dumps, by which changes are added to the backup, say, every day

use a doubling technique, such as doing writes to two drives, but reading only from one.

05 – 17 File Systems/5.3 File system implementation

SLIDE 19

File system consistency (1/2)

Figure: use counts and free-list counts for block numbers 0 to 15 in four cases.

(a) Consistent
Block number:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Blocks in use:  1  1  0  1  0  1  1  1  1  0  0  1  1  1  0  0
Free blocks:    0  0  1  0  1  0  0  0  0  1  1  0  0  0  1  1

(b) Missing block (block 2)
Block number:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Blocks in use:  1  1  0  1  0  1  1  1  1  0  0  1  1  1  0  0
Free blocks:    0  0  0  0  1  0  0  0  0  1  1  0  0  0  1  1

(c) Duplicate block in free list (block 4)
Block number:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Blocks in use:  1  1  0  1  0  1  1  1  1  0  0  1  1  1  0  0
Free blocks:    0  0  1  0  2  0  0  0  0  1  1  0  0  0  1  1

(d) Duplicate data block (block 5)
Block number:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Blocks in use:  1  1  0  1  0  2  1  1  1  0  0  1  1  1  0  0
Free blocks:    0  0  1  0  1  0  0  0  0  1  1  0  0  0  1  1

05 – 18 File Systems/5.3 File system implementation

SLIDE 20

File system consistency (2/2)

Block consistency. (1) Go through all inodes. If block #k is being used, increment count[k]. (2) Check all free blocks. If block #k is free, increment free[k].
– count[k] = free[k] = 0: missing block
– count[k] > 0 and free[k] > 0: free block in use?
– count[k] > 1: block is being used by more than 1 file!
– free[k] > 1: use another disk space algorithm (actually: rebuild the list)

File consistency. (1) Go through all directories. Count, per file, the number of references to it. (2) Go through all inodes and check the link counts. The two should match; if they do not, always adjust the link count.
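The block consistency cases can be written down as a small classifier. This is a sketch of the decision logic only: it assumes the two counter arrays have already been filled by the passes over the inodes and the free list, and the enum and function names are hypothetical.

```c
enum block_state {
    BLOCK_OK,            /* exactly one of the two counters is 1 */
    BLOCK_MISSING,       /* neither in use nor on the free list */
    BLOCK_FREE_IN_USE,   /* in use and on the free list at the same time */
    BLOCK_DUP_FREE,      /* appears more than once on the free list */
    BLOCK_DUP_DATA       /* used by more than one file */
};

/* Classify one block from its use count and its free-list count. */
enum block_state classify_block(int count, int free_count)
{
    if (count > 1) return BLOCK_DUP_DATA;
    if (free_count > 1) return BLOCK_DUP_FREE;
    if (count == 0 && free_count == 0) return BLOCK_MISSING;
    if (count == 1 && free_count == 1) return BLOCK_FREE_IN_USE;
    return BLOCK_OK;
}

/* Number of blocks that fail the check. */
int count_inconsistent(const int *count, const int *free_count, int nblocks)
{
    int bad = 0;
    for (int k = 0; k < nblocks; k++)
        if (classify_block(count[k], free_count[k]) != BLOCK_OK)
            bad++;
    return bad;
}
```

When a block shows more than one problem at once, this sketch reports the duplicate data block first, since that is the case that can actually lose file contents.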

05 – 19 File Systems/5.3 File system implementation

SLIDE 21

Buffer cache

Idea: when files are read, you have to copy data blocks from disk into main memory (otherwise reading/writing is impossible). Keep those blocks in main memory for some time.

Question: Why do you have to copy a block into main memory?

Problem: what to do when you write to a block:

MS-DOS: immediately write the block to disk.

UNIX: don’t write immediately – you may want to do another write.

The UNIX strategy turns out to work well: disk I/O is strongly reduced. However, in order to avoid nasty situations, the buffer cache is flushed every 30 seconds.

Question: What’s really so bad about removing a floppy disk in a UNIX-based system?

05 – 20 File Systems/5.3 File system implementation

SLIDE 22

Log-Structured File Systems

Essence: Instead of writing data to disk blocks, we simply collect pending writes into a single segment, and append it to a huge log. Note that a segment can contain anything: directory entries, i-nodes, data blocks, etc.

Finding an i-node is now harder: we don’t have a clue where the i-node actually is ⇒ maintain a separate index of i-nodes.

Problem: we have to clean the log from time to time to make space for new writes. This means checking which part of the log is no longer referenced.

05 – 21 File Systems/5.3 File system implementation

SLIDE 23

Security

We have a collection of resources (CPU, memory, files, processes), and a collection of users (people, processes) of those resources.

Resources are part of a system. Users need to be part of that system before they can use resources.

Resources need to be protected against unauthorized usage. Operations by authorized users can be checked beforehand.

Users need to be authenticated before they can be let into the system.

Keywords: Authentication and authorization.

05 – 22 File Systems/5.3 File system implementation

SLIDE 24

Mechanism vs. Policy

Question: Should an operating system dictate what is to be authenticated or authorized?

Answer: No, but it should provide the means to allow for authentication or authorization to take place.

Example:

UNIX allows files to be protected by rwx bits for owners, groups, and the rest. We can use that to enforce a policy by which every file is initially readable and writable only for its owner.

Most systems provide a mechanism to have a system administrator enter new users, who can later be authenticated by use of passwords. Anarchist administrators may decide to allow access to anyone by ignoring this mechanism entirely.

05 – 23 File Systems/5.3 File system implementation

SLIDE 25

Authentication

Problem: How do I prove that I am who I say I am ⇒ for most systems, you use a password:

In order to ever enter the system, your name and password have to be entered in the user list. At that moment, you are authorized to enter the system.

Entering the system means: (1) tell me your name, and (2) give proof by typing in your password. The system authenticates you by matching the name with the registered password.

Problem: is password registration really such a good idea?

Solution: use encryption – store a perfectly well readable version of the encrypted password in a file. When the user types in a password, (1) encrypt, and (2) match with what you stored.

Question: What’s the assumption here?

05 – 24 File Systems/5.3 File system implementation

SLIDE 26

Protection Domains (1/2)

Idea: we wish to develop a model that allows us to speak about protection of resources against users ⇒ protection domains.

A resource is assumed to have a set of associated operations (e.g. read() and write() a file).

Processes are the users of resources. At any time a process should only be allowed access to the resources it has been authorized to access.

A protection domain describes for all resources exactly which operations are allowed.

Processes operate in protection domains. Processes may possibly switch between protection domains.

Question: What would be the simplest implementation of a protection domain? (Hint: think of the UNIX file system.)

05 – 25 File Systems/5.3 File system implementation

SLIDE 27

Protection Domains (2/2)

Figure: three protection domains. Domain 1: File1[R], File2[RW]. Domain 2: File3[R], File4[RWX], File5[RW]. Domain 3: File6[RWX], Plotter2[W]. Printer1[W] belongs to both Domain 2 and Domain 3.

Protection domains translate into an access matrix (R = Read, W = Write, X = Execute):

Domain  File1  File2  File3  File4  File5  File6  Printer1  Plotter2
------  -----  -----  -----  -----  -----  -----  --------  --------
1       R      RW
2                     R      RWX    RW            W
3                                          RWX    W         W
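An access matrix maps directly onto a two-dimensional table of rights bits. The sketch below encodes the matrix of this slide; the enum names and the function allowed() are hypothetical and serve only to illustrate the check.

```c
/* Rights are single bits so that they can be combined with |. */
enum { RIGHT_R = 1, RIGHT_W = 2, RIGHT_X = 4 };

enum object {
    FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, PRINTER1, PLOTTER2,
    NOBJECTS
};

#define NDOMAINS 3

/* One row per domain, one column per object; an unset entry is 0,
 * meaning no access at all. */
static const unsigned char access_matrix[NDOMAINS][NOBJECTS] = {
    { [FILE1] = RIGHT_R, [FILE2] = RIGHT_R | RIGHT_W },
    { [FILE3] = RIGHT_R, [FILE4] = RIGHT_R | RIGHT_W | RIGHT_X,
      [FILE5] = RIGHT_R | RIGHT_W, [PRINTER1] = RIGHT_W },
    { [FILE6] = RIGHT_R | RIGHT_W | RIGHT_X, [PRINTER1] = RIGHT_W,
      [PLOTTER2] = RIGHT_W },
};

/* Return 1 if a process in `domain` (numbered 1..3 as on the slide)
 * holds all of the requested rights on `obj`, and 0 otherwise. */
int allowed(int domain, int obj, unsigned rights)
{
    if (domain < 1 || domain > NDOMAINS) return 0;
    if (obj < 0 || obj >= NOBJECTS) return 0;
    return (access_matrix[domain - 1][obj] & rights) == rights;
}
```

Most entries of such a matrix are empty, which is the reason practical systems store it by column (access control lists) or by row (capabilities) instead.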

05 – 26 File Systems/5.3 File system implementation

SLIDE 28

Changing Domains

Problem: suppose a process needs to change domains from time to time?

Example: after you’ve done a system call, your process generally continues execution in kernel mode, which is associated with another domain.

Solution: model domains as resources with operation enter(). Then, just fill in the access matrix.

Figure: the access matrix for File1 ... File6, Printer1, and Plotter2, extended with the columns Domain1, Domain2, and Domain3; the new columns contain a single Enter entry.

05 – 27 File Systems/5.3 File system implementation

SLIDE 29

Changing Domains – Example

In UNIX passwords are kept in a file /etc/passwd:

-rw-r--r-- 1 root root 1364 Mar 20 1995 /etc/passwd

Problem: how can you ever change your own password if you’re not allowed to write to the file /etc/passwd?

The passwd program changes your password:

-r-sr-sr-x 1 root bin 3964 Mar 21 1995 /usr/bin/passwd

Everyone’s allowed to execute the file. Because the set-uid bit is set (the s in the owner’s permissions), the effective uid of the executing process becomes that of the file’s owner, root ⇒ you’re suddenly allowed to write the /etc/passwd file!
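How the permission characters in these listings relate to the mode bits can be sketched as follows. The octal values are the conventional UNIX ones; the function names mode_string() and effective_uid_after_exec() are made up for this illustration and ignore details such as the sticky bit.

```c
/* Conventional octal values of the set-uid and set-gid bits. */
#define MODE_SETUID 04000u
#define MODE_SETGID 02000u

/* Write the nine permission characters for `mode` into `out`
 * (nine characters plus the terminating '\0'). */
void mode_string(unsigned mode, char out[10])
{
    static const char letters[] = "rwxrwxrwx";
    for (int i = 0; i < 9; i++)
        out[i] = (mode & (0400u >> i)) ? letters[i] : '-';
    /* set-uid/set-gid replace the execute character by s (or S when
     * the execute bit itself is not set) */
    if (mode & MODE_SETUID) out[2] = (mode & 0100u) ? 's' : 'S';
    if (mode & MODE_SETGID) out[5] = (mode & 0010u) ? 's' : 'S';
    out[9] = '\0';
}

/* Effective uid of a process after executing a file with this mode:
 * the file owner's uid if set-uid is on, else the caller's own uid. */
int effective_uid_after_exec(unsigned mode, int caller_uid, int owner_uid)
{
    return (mode & MODE_SETUID) ? owner_uid : caller_uid;
}
```

With mode 06555 the string is r-sr-sr-x, as for /usr/bin/passwd above, and a caller with uid 1000 executing a file owned by uid 0 ends up with effective uid 0.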

05 – 28 File Systems/5.3 File system implementation

SLIDE 30

Impl. the Access Matrix (1/2)

Idea: Maintaining a central table with access privileges per domain is not really the way to go (not efficient):

Access Control List: each resource keeps a list of the permitted operations per domain.

Figure: processes owned by A, B, and C run in user space; in kernel space, each file carries its ACL. F1: A: RW; B: R. F2: A: R; B: RW; C: R. F3: B: RWX; C: RX.

05 – 29 File Systems/5.3 File system implementation

SLIDE 31

Impl. the Access Matrix (2/2)

Capabilities: each process gets a list of what it can do per resource. Effectively, it carries this list around while accessing resources.

Problem: how can we prevent processes from changing their own lists?

Have the lists maintained by the operating system.

Add an encryption key to the list, provided by the operating system, and which depends on the capabilities:

Figure: a capability with the fields Server, Object, Rights, and f(Objects,Rights,Check).

05 – 30 File Systems/5.3 File system implementation

SLIDE 32

MINIX File system

Figure: disk layout, in order: boot block, super block, inode bit map, zone bit map, inodes, disk blocks.

boot block: Placed in main memory at startup. Contains code for loading the rest of the OS.

super block: Contains the information about the layout of the filesystem.

inode bit map: A bit map describing which inodes are free or in use.

zone bit map: A bit map describing which disk blocks are free or in use.

05 – 31 File Systems/5.6–5.7 MINIX file system

SLIDE 33

MINIX Super Block (1/2)

Present on disk and in memory:
– Number of i-nodes
– (unused)
– Number of i-node bitmap blocks
– Number of zone bitmap blocks
– First data zone
– Log2 (block/zone)
– Padding
– Maximum file size
– Number of zones
– Magic number
– padding
– Block size (bytes)
– FS sub-version

Present in memory but not on disk:
– Pointer to i-node for root of mounted file system
– Pointer to i-node mounted upon
– i-nodes/block
– Device number
– Read-only flag
– Native or byte-swapped flag
– FS version
– Direct zones/i-node
– Indirect zones/indirect block
– First free bit in i-node bitmap
– First free bit in zone bitmap

05 – 32 File Systems/5.6–5.7 MINIX file system

SLIDE 34

MINIX Super Block (2/2)

Note I: There is redundancy in the superblock (e.g. if you know the block size, and the number of inodes, you know how many inode bit map blocks you need).

Note II: A zone is a number of blocks. Disk blocks are numbered using 32 bit integers ⇒ with 1K blocks we can address up to 4 terabytes of disk space. Counting zones (e.g. each having 4 disk blocks) allows for extremely large file systems.
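The redundancy of Note I is plain arithmetic: a bit map block holds 8 bits per byte, so the number of bit map blocks follows from the number of bits and the block size. A minimal sketch, with a hypothetical function name and ignoring any bits that a real implementation might reserve:

```c
#include <stdint.h>

/* Number of bit map blocks needed to hold `nbits` bits when a block
 * is `block_size` bytes: the division is rounded up. */
uint32_t bitmap_blocks(uint32_t nbits, uint32_t block_size)
{
    uint32_t bits_per_block = block_size * 8u;
    return (nbits + bits_per_block - 1u) / bits_per_block;
}
```

For example, with 1K blocks one bit map block covers 8192 inodes, so 128K inodes need 16 inode bit map blocks.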

05 – 33 File Systems/5.6–5.7 MINIX file system

SLIDE 35

Inodes (1/2)

Field                  Meaning
---------------------  -------------------------------------------------------
Mode                   File type and rwx bits
Number of links        Directory entries for this file
Uid                    Identifies user who owns file
Gid                    Owner’s group
File size              Number of bytes in the file
Access time            Times are all in seconds since Jan 1, 1970
Modification time      Times are all in seconds since Jan 1, 1970
Status change time     Times are all in seconds since Jan 1, 1970
Zone 0 ... Zone 6      Zone numbers for the first seven data zones in the file
Indirect zone          Used for files larger than 7 zones
Double indirect zone   Used for files larger than 7 zones
Unused                 (Could be used for triple indirect zone)

The i-node is 64 bytes in total; the figure also marks a field width of 16 bits.

05 – 34 File Systems/5.6–5.7 MINIX file system

SLIDE 36

Inodes (2/2)

Question: How large can MINIX files actually be? (Assume a zone size of 1K.)

Note: All MINIX inodes are 64 bytes. With 1K sized disk blocks, up to 16 inodes fit in a block. A 128K-inode file system requires 8K disk blocks to store all inodes.
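The arithmetic behind the question and the note can be sketched as below. It rests on assumptions that the slides do not state: zone numbers are taken to be 4 bytes, an indirect block is taken to be one zone, and only the direct, single indirect, and double indirect zones of the inode are counted. The function names are hypothetical, and a limit stored elsewhere (such as the maximum file size field of the super block) may be lower.

```c
#include <stdint.h>

/* Upper bound on the file size reachable through `ndirect` direct
 * zones, one single indirect zone, and one double indirect zone. */
uint64_t max_file_bytes(uint32_t zone_size, uint32_t ndirect,
                        uint32_t zone_nr_size)
{
    uint64_t per_indirect = zone_size / zone_nr_size; /* zone numbers per block */
    uint64_t zones = ndirect + per_indirect + per_indirect * per_indirect;
    return zones * zone_size;
}

/* Number of disk blocks needed to store `ninodes` inodes. */
uint32_t inode_blocks(uint32_t ninodes, uint32_t block_size,
                      uint32_t inode_size)
{
    uint32_t per_block = block_size / inode_size;
    return (ninodes + per_block - 1u) / per_block;   /* round up */
}
```

Under these assumptions a 1K zone holds 256 zone numbers, giving 7 + 256 + 256 × 256 = 65799 zones, or 67,378,176 bytes (about 64 MB); and 128K inodes at 16 per block need 8192 blocks, matching the note.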

05 – 35 File Systems/5.6–5.7 MINIX file system

SLIDE 37

Block Cache (1/2)

Figure: the buffers form a double-linked chain running from the front (LRU) to the rear (MRU); a hash table indexes the buffers.

Buffers are contained in a double-linked list and hashed for quick lookups.

Buffers can be marked as being “in use”: they will never be overwritten (e.g. bit maps).

05 – 36 File Systems/5.6–5.7 MINIX file system

SLIDE 38

Block Cache (2/2)

When a block is needed:

1: Check if block is cached through the hash table. If found, mark it “in use” and return.

2: (Block is not cached.) Start at the front of the linked list to evict a buffer to cache the required block. Choose the first buffer available.

3: Flush chosen buffer if its contents have changed. Then read new block into buffer, and hand over to requester.

Returning a buffer: either append or prepend the buffer to the double-linked list. Depends on whether you expect the block to be reused soon (why?). Append the buffer to the list corresponding to its hash value.

05 – 37 File Systems/5.6–5.7 MINIX file system

SLIDE 39

Mount Files - Implementation (1/2)

Recall: Mounting a file system means taking its root and attaching it to a leaf node (called a mount point) of an existing file system.

Figure: a root filesystem (/usr, /bin, /home, with bal, ast, and steen below /home) and an unmounted filesystem (edu with dos.html, www with bs.html), shown before and after mounting. Once mounted, the file bs.html is reachable as /home/steen/www/bs.html.

05 – 38 File Systems/5.6–5.7 MINIX file system

SLIDE 40

Mount Files - Implementation (2/2)

What do we need?

Every filesystem has a super block ⇒ get the super block of the unmounted filesystem into memory.

Set its field inode-mounted-on to point to the entry in the inode table that contains the mount point.

Set the field inode-mounted-filesystem to point to the entry in the inode table that contains the root node of the to-be mounted filesystem.

Set a flag in the inode of the mount point that indicates that the node is a mount point for another filesystem.

Searching: when a mount point is inspected, we continue to scan the table of superblocks, looking for a superblock that points to the entry in the inode table that contains the mount point. From there on, we continue with the root node of the mounted filesystem.
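The search step can be sketched as a scan over the superblock table. The structures below are stripped down to the fields named on this slide; the field and function names, and the value chosen for NO_DEV, are assumptions for the sketch rather than the actual MINIX declarations.

```c
#include <stddef.h>

#define NO_DEV 0   /* assumed marker for an unused superblock slot */

struct inode {
    int i_num;               /* inode number */
    int i_is_mount_point;    /* flag set when a filesystem is mounted here */
};

struct super_block {
    int s_dev;                      /* device, or NO_DEV if slot is unused */
    struct inode *s_root_inode;     /* "inode-mounted-filesystem" */
    struct inode *s_mounted_on;     /* "inode-mounted-on" */
};

/* If `ip` is a mount point, return the root inode of the filesystem
 * mounted on it; otherwise return `ip` unchanged. Returns NULL if the
 * flag is set but no superblock refers to `ip` (inconsistent tables). */
struct inode *cross_mount_point(const struct super_block *table, size_t n,
                                struct inode *ip)
{
    if (!ip->i_is_mount_point) return ip;
    for (size_t i = 0; i < n; i++)
        if (table[i].s_dev != NO_DEV && table[i].s_mounted_on == ip)
            return table[i].s_root_inode;
    return NULL;
}
```

Because both fields point into the in-memory inode table, the comparison is a plain pointer comparison and needs no disk access.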

05 – 39 File Systems/5.6–5.7 MINIX file system

SLIDE 41

File Descriptors

Idea: once a file has been opened, the file manager returns an integer that is subsequently used for read and write operations.

File descriptors are handed out on a per-process basis: they cannot be shared between processes.

If a process forks a new child, the child will inherit all opened files ⇒ all file descriptors of the parent are copied to the child.

If the parent had left off reading/writing at position pos in one of its files, the child should proceed from that point as well.

Problem: Where do we store the file pointer? Note: it cannot be stored in the process table, nor can it be stored in the inode. Why not?

Solution: keep it in a shared data structure.
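The shared data structure can be pictured as a table of open-file entries that descriptors point to. In the sketch below, struct filp, struct proc, NR_FDS, and both functions are simplified, hypothetical stand-ins, not the real MINIX declarations.

```c
#include <stddef.h>

#define NR_FDS 8   /* hypothetical per-process descriptor limit */

/* Shared open-file entry: the file position lives here, not in the
 * process table and not in the inode. */
struct filp {
    int  refcount;   /* number of descriptors referring to this entry */
    long pos;        /* current read/write position */
    int  inode_nr;   /* which file this entry refers to */
};

struct proc {
    struct filp *fd[NR_FDS];   /* per-process descriptor table */
};

/* On fork, the child's descriptors refer to the SAME entries. */
void fork_descriptors(const struct proc *parent, struct proc *child)
{
    for (int i = 0; i < NR_FDS; i++) {
        child->fd[i] = parent->fd[i];
        if (child->fd[i] != NULL)
            child->fd[i]->refcount++;
    }
}

/* Advance the position after transferring nbytes; returns the new
 * position, or -1 for an invalid descriptor. */
long advance(struct proc *p, int fd, long nbytes)
{
    if (fd < 0 || fd >= NR_FDS || p->fd[fd] == NULL) return -1;
    p->fd[fd]->pos += nbytes;
    return p->fd[fd]->pos;
}
```

If the position were kept in the process table, parent and child would drift apart after the fork; if it were kept in the inode, unrelated processes that open the same file independently would be forced to share one position.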

05 – 40 File Systems/5.6–5.7 MINIX file system

SLIDE 42

Special Cases

Problem: Normally, we let the file manager do a read or write request, and have it wait until the request has been completed. This will not work in all cases:

When a user wants to read from a pipe that is currently empty, it should be blocked until data is appended to it. The file manager should not be suspended as well.

Likewise, if a user wants to read from the terminal when there is no data, it should also be suspended, but without also suspending the file manager.

Solution: Straightforward – just have the file manager block users when necessary by not responding to their request.

05 – 41 File Systems/5.6–5.7 MINIX file system

SLIDE 43

File System Tables

Figure: the three file system tables and the routines that operate on them.

BUFFER CACHE (an entry is either an empty buffer or a buffer filled with a disk block): get_block(), put_block(), rw_block()

INODE TABLE (an entry is either empty or an inode stored in the inode table): get_inode(), put_inode(), rw_inode()

SUPERBLOCK TABLE (an entry is the superblock of a mounted filesystem, with its inode bitmap): read_super(), get_super(), alloc_bit(), free_bit()

05 – 42 File Systems/5.6–5.7 MINIX file system

SLIDE 44

File System Overview

Figure: overview of where file system data lives. On disk, a filesystem consists of the superblock, inode bitmaps, diskblock bitmaps, inodes (one inode per file), and data blocks. The file manager’s address space holds the superblock table, the inode table, and the buffer cache with cached file data. File data finally ends up in the user’s address space.

05 – 43 File Systems/5.6–5.7 MINIX file system

SLIDE 45

Buffer Cache - Get a Block

Problem: you want to read or write a specific disk block. That block has to be read into a free buffer from the buffer cache.

1: Check whether the requested block is already contained in a buffer. If so, just return a pointer to that buffer. Mark the buffer as being in use.

2: If the block hasn’t been cached, get a free buffer and issue a disk I/O operation to fill that buffer. When completed, return pointer to buffer.

3: If there is no free buffer available, choose an unused buffer to read in the disk block. If the buffer is dirty, write it to disk first before reading new disk block.

Note: a buffer that has been assigned to a process by means of get_block(), is said to be in use.

05 – 44 File Systems/5.6–5.7 MINIX file system

SLIDE 46

Getting a Block (1/2)

PUBLIC struct buf *get_block(dev, block, only_search)
register dev_t dev;         /* on which device is the block? */
register block_t block;     /* which block is wanted? */
int only_search;            /* if NO_READ, don't read, else act normal */
{
  ....
  int b;
  register struct buf *bp, *prev_ptr;

  /* Search the hash chain for (dev, block). Do_read() can use
   * get_block(NO_DEV ...) to get an unnamed block to fill with zeros when
   * someone wants to read from a hole in a file, in which case this search
   * is skipped
   */
  if (dev != NO_DEV) {
        b = (int) block & HASH_MASK;
        bp = buf_hash[b];
        while (bp != NIL_BUF) {
                if (bp->b_blocknr == block && bp->b_dev == dev) {
                        /* Block needed has been found. */
                        if (bp->b_count == 0) rm_lru(bp);
                        bp->b_count++;  /* record that block is in use */

                        return(bp);
                } else {
                        /* This block is not the one sought. */
                        bp = bp->b_hash; /* move to next block on hash chain */
                }
        }
  }

  /* Desired block is not on available chain. Take oldest block ('front'). */
  if ((bp = front) == NIL_BUF) panic(__FILE__,"all buffers in use", NR_BUFS);
  rm_lru(bp);
  ....

05 – 45 File Systems/5.6–5.7 MINIX file system

SLIDE 47

Getting a Block (2/2)

  /* Remove the block that was just taken from its hash chain. */
  b = (int) bp->b_blocknr & HASH_MASK;
  prev_ptr = buf_hash[b];
  if (prev_ptr == bp) {
        buf_hash[b] = bp->b_hash;
  } else {
        /* The block just taken is not on the front of its hash chain. */
        while (prev_ptr->b_hash != NIL_BUF)
                if (prev_ptr->b_hash == bp) {
                        prev_ptr->b_hash = bp->b_hash;  /* found it */
                        break;
                } else {
                        prev_ptr = prev_ptr->b_hash;    /* keep looking */
                }
  }

  /* If the block taken is dirty, make it clean by writing it to the disk.
   * Avoid hysteresis by flushing all other dirty blocks for the same device.
   */
  if (bp->b_dev != NO_DEV) {
        if (bp->b_dirt == DIRTY) flushall(bp->b_dev);
  }

  /* Fill in block's parameters and add it to the hash chain where it goes. */
  bp->b_dev = dev;              /* fill in device number */
  bp->b_blocknr = block;        /* fill in block number */
  bp->b_count++;                /* record that block is being used */
  b = (int) bp->b_blocknr & HASH_MASK;
  bp->b_hash = buf_hash[b];
  buf_hash[b] = bp;             /* add to hash list */

  /* Go get the requested block unless searching or prefetching. */
  if (dev != NO_DEV) {
        if (only_search == PREFETCH) bp->b_dev = NO_DEV;
        else
        if (only_search == NORMAL) {
                rw_block(bp, READING);
        }
  }
  return(bp);                   /* return the newly acquired block */
}

05 – 46 File Systems/5.6–5.7 MINIX file system

SLIDE 48

Buffer Cache - Return a Block

Idea: After you’ve been allocated a buffer with the block you wanted to read or write, you’ll have to return the buffer to the cache.

Make a distinction between different types of blocks:
– Blocks that can be expected to be needed soon (e.g. partially filled) ⇒ put at the end of the buffer list.
– Blocks that will probably not be used soon (e.g. full data blocks) ⇒ put at the head of the list.
– Blocks that need to be written immediately to disk (e.g. blocks containing updated inodes) ⇒ do immediate disk I/O.

Combined with get_block() (that evicts blocks from the head of the list), we have implemented LRU for the buffer cache.

Question: What actually happens between get_block() and put_block()? Does a process actually get the buffer?

05 – 47 File Systems/5.6–5.7 MINIX file system

SLIDE 49

Returning a Block

22520 PUBLIC void put_block(bp, block_type) 22521 register struct buf *bp; /* pointer to the buffer to be released */ 22522 int block_type; /* INODE_BLOCK, DIRECTORY_BLOCK, or whatever */ 22523 { .... 22534 bp->b_count--; /* there is one use fewer now */ 22535 if (bp->b_count != 0) return; /* block is still in use */ 22537 bufs_in_use--; /* one fewer block buffers in use */ 22538 22539 /* Put this block back on the LRU chain. If the ONE_SHOT bit is set in 22540 * ’block_type’, the block is not likely to be needed again shortly, so put 22541 * it on the front of the LRU chain where it will be the first one to be 22542 * taken when a free buffer is needed later. 22543 */ 22544 if (bp->b_dev == DEV_RAM || block_type & ONE_SHOT) { 22545 /* Block probably won’t be needed quickly. Put it on front of chain. 22546 * It will be the next block to be evicted from the cache. 22547 */ 22548 bp->b_prev = NIL_BUF; 22549 bp->b_next = front; 22550 if (front == NIL_BUF) rear = bp; /* LRU chain was empty */ 22552 else front->b_prev = bp; 22554 front = bp; 22555 } else { 22556 /* Block probably will be needed quickly. Put it on rear of chain. 22557 * It will not be evicted from the cache for a long time. 22558 */ 22559 bp->b_prev = rear; 22560 bp->b_next = NIL_BUF; 22561 if (rear == NIL_BUF) front = bp; 22563 else rear->b_next = bp; 22565 rear = bp; 22566 } 22567 22568 /* Some blocks are so important (e.g., inodes, indirect blocks) that they 22569 * should be written to the disk immediately to avoid messing up the file 22570 * system in the event of a crash. 22571 */ 22572 if ((block_type & WRITE_IMMED) && bp->b_dirt==DIRTY && bp->b_dev != NO_DEV) { 22573 rw_block(bp, WRITING); 22574 } 22575 }


SLIDE 50

Inode Management

In order to do anything with a file, you will have to have its inode in main memory. The size of the inode table determines the maximum number of open files ⇒ get_inode() and put_inode() routines (straightforward).

When you create a file, you will have to assign an inode to it ⇒ adjust the inode bit map on disk. This means that the bit maps have to be in memory (we'll get to that).

Deleting a file is really simple: (1) adjust the inode bit map, (2) adjust the disk block bit map. This can all be done (almost) without disk I/O.

When you expand a file you have to allocate disk blocks to it. Also really (well, almost) simple: (1) adjust the inode as stored in the inode table, (2) adjust the disk block map ("stored" in the superblock table).
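"Adjusting a bit map" boils down to finding a zero bit and setting it, or clearing a bit again. A minimal sketch, assuming the bit map is simply an array of bytes in memory; `toy_alloc_bit()`, `toy_free_bit()` and `TOY_NO_BIT` are illustrative names, and the wrap-around search is a choice made for this example rather than a statement about the MINIX routines.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_NO_BIT ((size_t)-1)   /* hypothetical "no free bit" marker */

/* Find the first zero bit at or after 'start', set it, and return its index.
 * The search wraps around once; returns TOY_NO_BIT if every bit is set.
 */
size_t toy_alloc_bit(unsigned char *map, size_t nbits, size_t start)
{
    size_t i, b;
    for (i = 0; i < nbits; i++) {
        b = (start + i) % nbits;
        if ((map[b / CHAR_BIT] & (1u << (b % CHAR_BIT))) == 0) {
            map[b / CHAR_BIT] |= (unsigned char)(1u << (b % CHAR_BIT));
            return b;
        }
    }
    return TOY_NO_BIT;
}

/* Mark bit 'b' as free again. */
void toy_free_bit(unsigned char *map, size_t b)
{
    map[b / CHAR_BIT] &= (unsigned char)~(1u << (b % CHAR_BIT));
}
```

Passing the index of the last allocation as `start` avoids rescanning the densely used beginning of the map each time.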


SLIDE 51

Allocating an Inode

PUBLIC struct inode *alloc_inode(dev_t dev, mode_t bits)
{
  register struct inode *rip;
  register struct super_block *sp;
  int major, minor, inumb;
  bit_t b;

  sp = get_super(dev);          /* get pointer to super_block */
  if (sp->s_rd_only) {          /* can't allocate an inode on a read only device. */
        err_code = EROFS;
        return(NIL_INODE);
  }

  /* Acquire an inode from the bit map. */
  b = alloc_bit(sp, IMAP, sp->s_isearch);
  if (b == NO_BIT) {
        err_code = ENFILE;
        major = (int) (sp->s_dev >> MAJOR) & BYTE;
        minor = (int) (sp->s_dev >> MINOR) & BYTE;
        printf("Out of i-nodes on %sdevice %d/%d\n",
                sp->s_dev == root_dev ? "root " : "", major, minor);
        return(NIL_INODE);
  }
  sp->s_isearch = b;            /* next time start here */
  inumb = (int) b;              /* be careful not to pass unshort as param */

  /* Try to acquire a slot in the inode table. */
  if ((rip = get_inode(NO_DEV, inumb)) == NIL_INODE) {
        /* No inode table slots available.  Free the inode just allocated. */
        free_bit(sp, IMAP, b);
  } else {
        /* An inode slot is available. Put the inode just allocated into it. */
        rip->i_mode = bits;             /* set up RWX bits */
        rip->i_nlinks = 0;              /* initial no links */
        rip->i_uid = fp->fp_effuid;     /* file's uid is owner's */
        rip->i_gid = fp->fp_effgid;     /* ditto group id */
        rip->i_dev = dev;               /* mark which device it is on */
        rip->i_ndzones = sp->s_ndzones; /* number of direct zones */
        rip->i_nindirs = sp->s_nindirs; /* number of indirect zones per blk*/
        rip->i_sp = sp;                 /* pointer to super block */
        ....
        wipe_inode(rip);                /* Clear the other parts of the inode */
  }

  return(rip);
}


SLIDE 52

Reading/Writing an Inode

PUBLIC void rw_inode(rip, rw_flag)
register struct inode *rip;     /* pointer to inode to be read/written */
int rw_flag;                    /* READING or WRITING */
{
/* An entry in the inode table is to be copied to or from the disk. */

  register struct buf *bp;
  register struct super_block *sp;
  d1_inode *dip;
  d2_inode *dip2;
  block_t b, offset;

  /* Get the block where the inode resides. */
  sp = get_super(rip->i_dev);   /* get pointer to super block */
  rip->i_sp = sp;               /* inode must contain super block pointer */
  offset = sp->s_imap_blocks + sp->s_zmap_blocks + 2;
  b = (block_t) (rip->i_num - 1)/sp->s_inodes_per_block + offset;
  bp = get_block(rip->i_dev, b, NORMAL);
  dip  = bp->b_v1_ino + (rip->i_num - 1) % V1_INODES_PER_BLOCK;
  dip2 = bp->b_v2_ino + (rip->i_num - 1) %
         V2_INODES_PER_BLOCK(sp->s_block_size);

  /* Do the read or write. */
  if (rw_flag == WRITING) {
        if (rip->i_update) update_times(rip);   /* times need updating */
        if (sp->s_rd_only == FALSE) bp->b_dirt = DIRTY;
  }

  /* Copy the inode from the disk block to the in-core table or vice versa.
   * If the fourth parameter below is FALSE, the bytes are swapped.
   */
  if (sp->s_version == V1)
        old_icopy(rip, dip,  rw_flag, sp->s_native);
  else
        new_icopy(rip, dip2, rw_flag, sp->s_native);

  put_block(bp, INODE_BLOCK);
  rip->i_dirt = CLEAN;
}
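The arithmetic that rw_inode() uses to find an inode on disk can be checked with a few numbers. A minimal sketch: `toy_layout` and both functions are hypothetical names, and the fields stand in for the superblock values used above. The constant 2 is taken from the listing; reading it as "skip the boot block and the superblock" is an assumption of this example.

```c
/* Hypothetical layout parameters, normally taken from the superblock. */
struct toy_layout {
    unsigned imap_blocks;       /* blocks used by the inode bit map */
    unsigned zmap_blocks;       /* blocks used by the zone bit map */
    unsigned inodes_per_block;  /* on-disk inodes per block */
};

/* Block number that holds inode 'inum' (inode numbers start at 1).
 * The "+ 2" is assumed to skip the boot block and the superblock.
 */
unsigned long toy_inode_block(const struct toy_layout *l, unsigned long inum)
{
    unsigned long offset = l->imap_blocks + l->zmap_blocks + 2;
    return (inum - 1) / l->inodes_per_block + offset;
}

/* Index of inode 'inum' within that block. */
unsigned toy_inode_index(const struct toy_layout *l, unsigned long inum)
{
    return (unsigned)((inum - 1) % l->inodes_per_block);
}
```

With 1 inode bit map block, 2 zone bit map blocks and 64 inodes per block, inodes 1 through 64 live in block 5 and inode 65 is the first entry of block 6.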


SLIDE 53

Superblock Management

Idea: When a filesystem is mounted, we need to have its superblock in main memory, as it fully describes where data related to that filesystem is located on disk.

We have to allocate a slot in the superblock table that can hold the file system's superblock. Note: the size of the superblock table determines the maximum number of file systems that can be mounted.

We have to load the blocks of inode bit maps. One block contains a bit map indicating which inodes are free or in use.

The bit maps for available and used disk blocks have to be loaded as well.

Both the inode bit maps and the disk block bit maps are kept in the buffer cache: they are not copied to the superblock table!

Question: How do I actually allocate an inode? (Note: we already discussed this...)


SLIDE 54

File Manipulation

Problem: Regardless of what you want to do with a file (read, write), you'll have to get the block you want to manipulate into main memory first.

Given an inode and a position in the file, you can calculate which block you want. Distinguish between direct, single indirect, and double indirect blocks. This gives you the block number on disk.

Get that block by issuing a request to the buffer cache, then transfer data between the allocated buffer and the requester's address space.

Additional problem: sometimes you'll have to allocate a new block for a file before you can do any reading at all. This turns out to be almost symmetric to finding a block number.


SLIDE 55

Getting the Block Number

Problem: Given a position in a file, what's the number of the disk block that contains the referenced data?

1: Check if the position falls within the first blocks of the file ⇒ the addresses of these blocks are stored in the file's inode.

2: Check if the position falls within the range indicated by single indirect blocks ⇒ read in the block containing the single indirect pointers. You can then determine the right block number.

3: Check if the position falls within the range indicated by double indirect blocks ⇒ read in the right double indirect block, then do a second read for the block containing the disk block numbers.
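The three checks above are pure arithmetic until the first indirect block has to be read. A minimal sketch of just that classification step, ignoring zones (i.e. assuming one block per zone); `toy_classify()` and its parameters are illustrative, and the numbers in the usage note are example values, not MINIX constants.

```c
enum toy_level { TOY_DIRECT, TOY_SINGLE, TOY_DOUBLE, TOY_TOO_BIG };

/* Decide which level of the inode's block map covers byte 'position'.
 * ndirect  : number of direct block addresses in the inode
 * nindirs  : number of block addresses that fit in one indirect block
 * All parameters are assumptions made for the example.
 */
enum toy_level toy_classify(unsigned long position, unsigned long block_size,
                            unsigned long ndirect, unsigned long nindirs)
{
    unsigned long blk = position / block_size;  /* relative block # in file */

    if (blk < ndirect) return TOY_DIRECT;
    blk -= ndirect;                   /* direct blocks don't count any more */
    if (blk < nindirs) return TOY_SINGLE;
    blk -= nindirs;                   /* single indirect range doesn't count */
    if (blk < nindirs * nindirs) return TOY_DOUBLE;
    return TOY_TOO_BIG;
}
```

With 1024-byte blocks, 7 direct addresses and 256 addresses per indirect block, byte 7167 is still direct, byte 7168 needs the single indirect block, and byte (7 + 256) * 1024 is the first one that needs the double indirect block.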


SLIDE 56

Searching the Inode

PUBLIC block_t read_map(rip, position)
register struct inode *rip;     /* ptr to inode to map from */
off_t position;                 /* position in file whose blk wanted */
{
  ....
  scale = rip->i_sp->s_log_zone_size;           /* for block-zone conversion */
  block_pos = position/rip->i_sp->s_block_size; /* relative blk # in file */
  zone = block_pos >> scale;                    /* position's zone */
  boff = (int) (block_pos - (zone << scale) );  /* relative blk # within zone */
  dzones = rip->i_ndzones;
  nr_indirects = rip->i_nindirs;

  /* Is 'position' to be found in the inode itself? */
  if (zone < dzones) {
        zind = (int) zone;      /* index should be an int */
        z = rip->i_zone[zind];
        if (z == NO_ZONE) return(NO_BLOCK);
        b = ((block_t) z << scale) + boff;
        return(b);
  }

  /* It is not in the inode, so it must be single or double indirect. */
  excess = zone - dzones;       /* first Vx_NR_DZONES don't count */

  if (excess < nr_indirects) {
        /* 'position' can be located via the single indirect block. */
        z = rip->i_zone[dzones];
  } else {
        /* 'position' can be located via the double indirect block. */
        if ( (z = rip->i_zone[dzones+1]) == NO_ZONE) return(NO_BLOCK);
        excess -= nr_indirects;                 /* single indir doesn't count*/
        b = (block_t) z << scale;
        bp = get_block(rip->i_dev, b, NORMAL);  /* get double indirect block */
        index = (int) (excess/nr_indirects);
        z = rd_indir(bp, index);                /* z= zone for single*/
        put_block(bp, INDIRECT_BLOCK);          /* release double ind block */
        excess = excess % nr_indirects;         /* index into single ind blk */
  }

  /* 'z' is zone num for single indirect block; 'excess' is index into it. */
  if (z == NO_ZONE) return(NO_BLOCK);
  b = (block_t) z << scale;                     /* b is blk # for single ind */
  bp = get_block(rip->i_dev, b, NORMAL);        /* get single indirect block */
  ex = (int) excess;                            /* need an integer */
  z = rd_indir(bp, ex);                         /* get block pointed to */
  put_block(bp, INDIRECT_BLOCK);                /* release single indir blk */
  if (z == NO_ZONE) return(NO_BLOCK);
  b = ((block_t) z << scale) + boff;
  return(b);
}


SLIDE 57

Reading/Writing File Block

1: First get the right disk block number associated with the block in the file that you're addressing (read_map()).

2a: If you're reading from a non-existent block, just get a buffer and read only zeroes. (Question: Why don't you have to assign a disk block?)

2b: If you're writing to a non-existent block, allocate it first, i.e. get a free disk block and assign it to the file. (Note: you'll have to adjust the inode.)

3: Otherwise, just get a buffer that contains the requested disk block.

4: Transfer data between buffer and requesting process. (Question: Why do you have to do this explicitly?)

5: Return the allocated buffer to the cache (put_block()).
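Steps 2a and 2b are what make files with holes possible. The following toy model, which is an assumption-laden sketch and not MINIX code, keeps a per-file block map and a tiny in-memory "disk": reading an unmapped block yields zeros without touching the map, while writing allocates a block first. All names (`toy_file`, `toy_read_block()`, `toy_write_block()`) are hypothetical.

```c
#include <string.h>

#define TOY_BLOCK_SIZE 8
#define TOY_NBLOCKS    4
#define TOY_NO_BLOCK   (-1)

/* Hypothetical in-memory "file": a block map plus a tiny "disk". */
struct toy_file {
    int map[TOY_NBLOCKS];                   /* disk block per file block */
    char disk[TOY_NBLOCKS][TOY_BLOCK_SIZE]; /* the disk blocks themselves */
    int next_free;                          /* next unused disk block */
};

/* Steps 2a/3: read a file block; a hole reads as zeros, nothing is allocated. */
void toy_read_block(const struct toy_file *f, int fblk, char *out)
{
    if (f->map[fblk] == TOY_NO_BLOCK)
        memset(out, 0, TOY_BLOCK_SIZE);
    else
        memcpy(out, f->disk[f->map[fblk]], TOY_BLOCK_SIZE);
}

/* Steps 2b/3: write a file block; allocate a disk block first if needed.
 * Returns 0 on success, -1 if the toy disk is full.
 */
int toy_write_block(struct toy_file *f, int fblk, const char *in)
{
    if (f->map[fblk] == TOY_NO_BLOCK) {
        if (f->next_free >= TOY_NBLOCKS) return -1;
        f->map[fblk] = f->next_free++;      /* "adjust the inode" */
    }
    memcpy(f->disk[f->map[fblk]], in, TOY_BLOCK_SIZE);
    return 0;
}
```

Writing only file block 2 consumes a single disk block; file blocks 0 and 1 stay unmapped and still read back as zeros.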


SLIDE 58

Doing the I/O (1/2)

PRIVATE int rw_chunk(rip, position, off, chunk, left, rw_flag, buff,
 seg, usr, block_size, completed)
register struct inode *rip;     /* pointer to inode for file to be rd/wr */
off_t position;                 /* position within file to read or write */
unsigned off;                   /* off within the current block */
int chunk;                      /* number of bytes to read or write */
unsigned left;                  /* max number of bytes wanted after position */
int rw_flag;                    /* READING or WRITING */
char *buff;                     /* virtual address of the user buffer */
int seg;                        /* T or D segment in user space */
int usr;                        /* which user process */
int block_size;                 /* block size of FS operating on */
int *completed;                 /* number of bytes copied */
{
/* Read or write (part of) a block. */

  register struct buf *bp;
  register int r = OK;
  int n, block_spec;
  block_t b;
  dev_t dev;

  *completed = 0;

  block_spec = (rip->i_mode & I_TYPE) == I_BLOCK_SPECIAL;
  if (block_spec) {
        b = position/block_size;
        dev = (dev_t) rip->i_zone[0];
  } else {
        b = read_map(rip, position);
        dev = rip->i_dev;
  }

  if (!block_spec && b == NO_BLOCK) {
        if (rw_flag == READING) {
                /* Reading from a nonexistent block.  Must read as all zeros.*/
                bp = get_block(NO_DEV, NO_BLOCK, NORMAL);    /* get a buffer */
                zero_block(bp);
        } else {
                /* Writing to a nonexistent block. Create and enter in inode.*/
                if ((bp= new_block(rip, position)) == NIL_BUF) return(err_code);
        }
  } ....


SLIDE 59

Doing the I/O (2/2)

  .... else if (rw_flag == READING) {
        /* Read and read ahead if convenient. */
        bp = rahead(rip, b, position, left);
  } else {
        /* Normally an existing block to be partially overwritten is first read
         * in.  However, a full block need not be read in.  If it is already in
         * the cache, acquire it, otherwise just acquire a free buffer.
         */
        n = (chunk == block_size ? NO_READ : NORMAL);
        if (!block_spec && off == 0 && position >= rip->i_size) n = NO_READ;
        bp = get_block(dev, b, n);
  }

  /* In all cases, bp now points to a valid buffer. */
  if (bp == NIL_BUF) {
        panic(__FILE__,"bp not valid in rw_chunk, this can't happen", NO_NUM);
  }
  if (rw_flag == WRITING && chunk != block_size && !block_spec &&
                                position >= rip->i_size && off == 0) {
        zero_block(bp);
  }

  if (rw_flag == READING) {
        /* Copy a chunk from the block buffer to user space. */
        r = sys_vircopy(FS_PROC_NR, D, (phys_bytes) (bp->b_data+off),
                        usr, seg, (phys_bytes) buff,
                        (phys_bytes) chunk);
  } else {
        /* Copy a chunk from user space to the block buffer. */
        r = sys_vircopy(usr, seg, (phys_bytes) buff,
                        FS_PROC_NR, D, (phys_bytes) (bp->b_data+off),
                        (phys_bytes) chunk);
        bp->b_dirt = DIRTY;
  }
  n = (off + chunk == block_size ? FULL_DATA_BLOCK : PARTIAL_DATA_BLOCK);
  put_block(bp, n);

  return(r);
}
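rw_chunk() handles at most one block per call, so its caller (not shown on these slides) has to cut a request into pieces that each stay within a block, supplying `off` and `chunk` for every piece. A minimal sketch of that splitting, under the assumption that it is done with plain modulo arithmetic; `toy_chunk` and `toy_split()` are made-up names.

```c
#include <stddef.h>

/* One piece of a transfer that stays within a single block. */
struct toy_chunk {
    unsigned long block;  /* relative block number in the file */
    unsigned off;         /* offset within that block */
    unsigned len;         /* bytes to transfer in this block */
};

/* Split 'nbytes' starting at 'position' into per-block chunks.
 * Writes at most 'max' entries to 'out' and returns the number written.
 */
size_t toy_split(unsigned long position, unsigned long nbytes,
                 unsigned block_size, struct toy_chunk *out, size_t max)
{
    size_t n = 0;
    while (nbytes > 0 && n < max) {
        unsigned off = (unsigned)(position % block_size);
        unsigned len = block_size - off;        /* room left in this block */
        if (len > nbytes) len = (unsigned)nbytes;
        out[n].block = position / block_size;
        out[n].off = off;
        out[n].len = len;
        position += len;
        nbytes -= len;
        n++;
    }
    return n;
}
```

A 2100-byte request at position 1000 with 1024-byte blocks becomes four chunks of 24, 1024, 1024 and 28 bytes. Only chunks with `off + len == block_size` reach the end of their block, which is the case the listing marks as FULL_DATA_BLOCK.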


SLIDE 60

Getting a File by Name

Idea: So far, we have only been able to read and write files. What we also need is to look up files, and get their inode into memory.

Starting point: We already have a pointer to a directory inode, and a string containing the file name relative to that directory.

1: Search the directory's content (i.e. read the directory file), and see if you can get the inode number of the next component.

2: Read the inode into the inode table, and check if it actually refers to a mounted filesystem (i.e. we're dealing with a mount point).

3: If the inode is a mount point, get the root inode of the mounted filesystem, and replace the previously found inode with the root inode.

Note: and this is what happens in advance()...
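Step 1 is repeated once per pathname component. The sketch below shows only that loop: directories are modelled as a flat, hard-coded table of (directory inode, name, inode) triples, the inode numbers are invented, and mount points (steps 2 and 3) are left out. `toy_advance()` and `toy_lookup()` are hypothetical names that merely echo the role of advance().

```c
#include <string.h>

/* Hypothetical flat directory table: (directory inode, name) -> inode. */
struct toy_dirent { int dir; const char *name; int inum; };

static const struct toy_dirent toy_dirs[] = {
    { 1, "usr", 2 }, { 1, "bin", 3 },
    { 2, "ast", 4 }, { 4, "mbox", 5 }, { 3, "gcc", 6 },
};

/* Look up one component in directory 'dir'; returns 0 if not found. */
int toy_advance(int dir, const char *name, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof toy_dirs / sizeof toy_dirs[0]; i++)
        if (toy_dirs[i].dir == dir && strlen(toy_dirs[i].name) == len &&
            memcmp(toy_dirs[i].name, name, len) == 0)
            return toy_dirs[i].inum;
    return 0;
}

/* Resolve 'path' relative to directory 'dir', one component at a time. */
int toy_lookup(int dir, const char *path)
{
    while (dir != 0) {
        size_t len;
        while (*path == '/') path++;          /* skip separators */
        if (*path == '\0') break;             /* no components left */
        len = strcspn(path, "/");             /* length of next component */
        dir = toy_advance(dir, path, len);
        path += len;
    }
    return dir;
}
```

Starting from inode 1 as the root, `toy_lookup(1, "usr/ast/mbox")` walks through inodes 2 and 4 and ends at 5; a missing component makes the result 0.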


SLIDE 61

Parsing a Pathname (1/2)

PUBLIC struct inode *advance(dirp, string)
struct inode *dirp;             /* inode for directory to be searched */
char string[NAME_MAX];          /* component name to look for */
{
/* Given a directory and a component of a path, look up the component in
 * the directory, find the inode, open it, and return a pointer to its inode
 * slot.  If it can't be done, return NIL_INODE.
 */

  register struct inode *rip;
  struct inode *rip2;
  register struct super_block *sp;
  int r, inumb;
  dev_t mnt_dev;
  ino_t numb;

  /* If 'string' is empty, yield same inode straight away. */
  if (string[0] == '\0') { return(get_inode(dirp->i_dev, (int) dirp->i_num)); }

  /* Check for NIL_INODE. */
  if (dirp == NIL_INODE) { return(NIL_INODE); }

  /* If 'string' is not present in the directory, signal error. */
  if ( (r = search_dir(dirp, string, &numb, LOOK_UP)) != OK) {
        err_code = r;
        return(NIL_INODE);
  }

  /* Don't go beyond the current root directory, unless the string is dot2. */
  if (dirp == fp->fp_rootdir && strcmp(string, "..") == 0 && string != dot2)
        return(get_inode(dirp->i_dev, (int) dirp->i_num));

  /* The component has been found in the directory.  Get inode. */
  if ( (rip = get_inode(dirp->i_dev, (int) numb)) == NIL_INODE) {
        return(NIL_INODE);
  }
  ....


SLIDE 62

Parsing a Pathname (2/2)

  ....
  if (rip->i_num == ROOT_INODE)
        if (dirp->i_num == ROOT_INODE) {
            if (string[1] == '.') {
                for (sp = &super_block[1]; sp < &super_block[NR_SUPERS]; sp++){
                        if (sp->s_dev == rip->i_dev) {
                                /* Release the root inode.  Replace by the
                                 * inode mounted on.
                                 */
                                put_inode(rip);
                                mnt_dev = sp->s_imount->i_dev;
                                inumb = (int) sp->s_imount->i_num;
                                rip2 = get_inode(mnt_dev, inumb);
                                rip = advance(rip2, string);
                                put_inode(rip2);
                                break;
                        }
                }
            }
        }
  if (rip == NIL_INODE) return(NIL_INODE);

  /* See if the inode is mounted on.  If so, switch to root directory of the
   * mounted file system.  The super_block provides the linkage between the
   * inode mounted on and the root directory of the mounted file system.
   */
  while (rip != NIL_INODE && rip->i_mount == I_MOUNT) {
        /* The inode is indeed mounted on. */
        for (sp = &super_block[0]; sp < &super_block[NR_SUPERS]; sp++) {
                if (sp->s_imount == rip) {
                        /* Release the inode mounted on.  Replace by the
                         * inode of the root inode of the mounted device.
                         */
                        put_inode(rip);
                        rip = get_inode(sp->s_dev, ROOT_INODE);
                        break;
                }
        }
  }
  return(rip);          /* return pointer to inode's component */
}


SLIDE 63

Mounting a Filesystem

Basic idea: Mounting a filesystem is actually simple:

Provide a pathname to a mount point, and a pathname to a block special device.

Find a free slot in the superblock table to hold the superblock of the filesystem that is to be mounted.

Read the superblock from the device specified by the given device name.

Allocate an inode for the mount point.

Allocate an inode for the root inode of the filesystem.

Set the appropriate fields.

Note: The really disturbing aspect is that a lot of things can go wrong ⇒ approximately 75% of the mount code consists of checking for errors.

Note: This is probably the best place to appreciate error checking in operating systems: just imagine that errors were not checked...
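Even a toy version of mount shows how quickly the error checks outnumber the useful work. The sketch below is hypothetical throughout (`toy_super`, `toy_supers`, `toy_mount()`): it reduces a superblock table entry to a device number and the inode number of the mount point, and it covers only the "find a free slot" and "set the fields" steps.

```c
#include <stddef.h>

#define TOY_NR_SUPERS 4
#define TOY_NO_DEV    0

/* Hypothetical, stripped-down superblock table entry. */
struct toy_super {
    int dev;          /* device mounted in this slot, or TOY_NO_DEV if free */
    int imount;       /* inode number of the mount point */
};

static struct toy_super toy_supers[TOY_NR_SUPERS];

/* Mount 'dev' on inode 'imount'. Returns 0 on success, -1 on error.
 * Most of the body is error checking, as the slide points out.
 */
int toy_mount(int dev, int imount)
{
    struct toy_super *sp, *free_slot = NULL;

    if (dev == TOY_NO_DEV) return -1;               /* not a valid device */
    for (sp = toy_supers; sp < toy_supers + TOY_NR_SUPERS; sp++) {
        if (sp->dev == dev) return -1;              /* already mounted */
        if (sp->dev != TOY_NO_DEV && sp->imount == imount)
            return -1;                              /* mount point busy */
        if (sp->dev == TOY_NO_DEV && free_slot == NULL)
            free_slot = sp;                         /* remember first free */
    }
    if (free_slot == NULL) return -1;               /* table is full */

    free_slot->dev = dev;                           /* set the fields */
    free_slot->imount = imount;
    return 0;
}
```

Only the last two assignments do the mounting; everything before them rejects a bad device, a double mount, a busy mount point or a full table.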

