I/O / Filesystems 1


slide-1
SLIDE 1

I/O / Filesystems 1

1

slide-2
SLIDE 2

last time

when LRU fails: special-case for single-access file data; readahead — handle scans by predicting reads

device driver halves:
top half (from system call): use buffer, request data, wait for data
bottom half (from interrupt): fill buffer, wake up

devices as magic memory

2

slide-3
SLIDE 3

exercise

system is running two applications

A: reading from network
B: doing tons of computation

timeline:

A calls read() to get 8KB of data from network
16KB of data comes in 10ms later
A calls read() again to get 4KB more

exercise 1: how many kernel/user mode switches?
exercise 2: how many context switches?

3

slide-4
SLIDE 4

how many mode switches?

A calls read() to get 8KB of data from network
16KB of data comes in 10ms later
A calls read() again to get 4KB more

timeline (alternating user mode running A, kernel mode, user mode running B):
read() 8KB syscall; start wait for device (driver ‘top half’)
run B while A waits
interrupt: copy from device (driver ‘bottom half’); mark A ready; run scheduler
switch to A (kernel); copy first 8KB (resume driver ‘top half’); return from read() syscall
read() syscall; copy 4KB from buffer (driver ‘top half’); return from read() syscall

answer depends: does the scheduler run A right away?

4


slide-7
SLIDE 7

how many context switches?

A calls read() to get 8KB of data from network
16KB of data comes in 10ms later
A calls read() again to get remaining 4KB

timeline (same as before): read() 8KB; wait for device (driver ‘top half’); run B while A waits; copy from device (driver ‘bottom half’); mark A ready; run scheduler; switch to A (kernel); copy first 8KB; return from read() syscall; read() again; copy 4KB from buffer; return

answer depends: does the scheduler run A right away?

5


slide-9
SLIDE 9

direct memory access (DMA)

[diagram: processor, interrupt controller, other processors, actual memory, and other devices share the memory bus; a device controller on the bus connects to external hardware]

observation: devices can read/write memory
can have device copy data to/from memory

6

slide-10
SLIDE 10

direct memory access (DMA)

[diagram: processor, interrupt controller, actual memory, other devices, and a device controller on the memory bus; the controller contains control registers (status, read?, write?, buffer addr, …) and buffers/queues, and connects to external hardware]

OS chooses memory address (this example: 0x9000 (physical))
device writes to 0x9000 (instead of internal buffer)
OS reads from 0x9000 rather than copying from device buffer
best case: OS chooses location user program passed to read()/etc. (avoids copy!)

7

slide-11
SLIDE 11

direct memory access (DMA)

[same diagram, with the controller’s buffer addr register now set to 0x9000]

OS chooses memory address (this example: 0x9000 (physical))
device writes to 0x9000 (instead of internal buffer)
OS reads from 0x9000 rather than copying from device buffer
best case: OS chooses location user program passed to read()/etc. (avoids copy!)

7


slide-15
SLIDE 15

direct memory access (DMA)

much faster, e.g., for disk or network I/O: avoids having processor run a loop to copy data

OS can run normal program during data transfer; interrupt tells OS when copy finished

device uses memory as very large buffer space; device puts data where OS wants it directly (maybe)

OS specifies physical address to use… instead of reading from device controller

8


slide-17
SLIDE 17

OS puts data where it wants

so far: where it wants is the device driver’s buffer
seems like OS could also put it directly where application wants it?

i.e. the pointer passed to the read() system call: called “zero-copy I/O”

should be faster, but, in practice, very rarely done:
if part of regular file, can’t easily share with page cache
device might expect contiguous physical addresses
device might expect physical address at start of physical page
device might write data in different format than application expects
device might read too much data
need to deal with application exiting/being killed before device finishes
…

9


slide-20
SLIDE 20

devices summary

device controllers connected via memory bus

usually assigned physical memory addresses
sometimes separate “I/O addresses” (special load/store instructions)

controller looks like “magic memory” to OS

load/store from device controller registers like memory
setting/reading control registers can trigger device operations

two options for data transfer

programmed I/O: OS reads from/writes to a buffer within the device controller
direct memory access (DMA): device controller reads/writes normal memory

10

slide-21
SLIDE 21

the FAT filesystem

FAT: File Allocation Table
probably the simplest widely used filesystem (family)
named for its important data structure: the file allocation table

11

slide-22
SLIDE 22

FAT and sectors

FAT divides disk into clusters composed of one or more sectors
sector = minimum amount hardware can read
determined by disk hardware: historically 512 bytes, but often bigger now
cluster: typically 512 to 4096 bytes

[diagram: the disk as a row of numbered clusters (1-35); one cluster (the filesystem’s unit) is shown spanning two sectors, 24 and 25]

12


slide-24
SLIDE 24

FAT: clusters and fjles

a file’s data is stored in a list of clusters
file size isn’t a multiple of cluster size? waste space
reading a file? need to find the list of clusters

[diagram: the disk as numbered clusters (1-35), with example.txt’s data spread across several non-adjacent clusters]

13


slide-26
SLIDE 26

FAT: the file allocation table

big array on disk, one entry per cluster
each entry contains a number — usually “next cluster”

cluster num. | entry value
1 | 4
2 | 7
3 | 5
… | 1434
… | …
1000 | 4503
1001 | 1523
… | …

14

slide-27
SLIDE 27

FAT: reading a file (1)

get (from elsewhere) first cluster of data
linked list of cluster numbers
next pointers? the file allocation table entry for the cluster
special value for NULL (-1 in this example; maybe different in real FAT)

cluster num. | entry value
… | …
10 | 14
11 | 23
12 | 54
13 | -1 (end mark)
14 | 15
15 | 13
… | …

file starting at cluster 10 contains data in: cluster 10, then 14, then 15, then 13

15

slide-28
SLIDE 28

FAT: reading a file (2)

[diagram: the disk as numbered clusters (1-35); one file’s blocks 0-3 stored in clusters 10, 14, 15, 13, and another file’s blocks 0-2 stored in clusters 7, 8, 9]

file allocation table:
index | entry value
… | …
6 | 21
7 | 8
8 | 9
9 | -1 (end mark)
10 | 14
11 | 23
12 | 54
13 | -1 (end mark)
14 | 15
15 | 13
16 | 20
… | …

16


slide-31
SLIDE 31

FAT: reading files

to read a file given its start location:
read the starting cluster X
get the next cluster Y from FAT entry X
read the next cluster
get the next cluster from FAT entry Y
… until you see an end marker

17

slide-32
SLIDE 32

start locations?

really want filenames, stored in directories!
in FAT: a directory is a file, but its data is a list of: (name, starting location, other data about the file)

18

slide-33
SLIDE 33

finding files with directory

[diagram: the disk as numbered clusters (1-35); the directory’s data (“dir pt 0”, “dir pt 1”) and index.html’s four parts highlighted]

directory contents:
file “index.html” starting at cluster 10, 12792 bytes
file “assignments.html” starting at cluster 17, 4312 bytes
…
directory “examples” starting at cluster 20
unused entry
…
file “info.html” starting at cluster 50, 23789 bytes

index.html pt 0 (bytes 0-4095), pt 1 (bytes 4096-8191), pt 2 (bytes 8192-12287), pt 3 (bytes 12288-12791; rest of cluster unused)

19


slide-37
SLIDE 37

FAT directory entry

box = 1 byte; entry for README.TXT, 342 byte file, starting at cluster 0x104F4

'R' 'E' 'A' 'D' 'M' 'E' ' ␣' ' ␣' 'T' 'X' 'T': filename + extension (README.TXT)
0x00: attrs (directory? read-only? hidden? …)
0x9C 0xA1 0x20 0x7D 0x3C: creation date + time (2010-03-29 04:05:03.56)
0x7D 0x3C: last access (2010-03-29)
0x01 0x00: cluster # (high bits)
0xEC 0x62 0x76 0x3C: last write (2010-03-22 12:23:12)
0xF4 0x04: cluster # (low bits)
0x56 0x01 0x00 0x00: file size (0x156 bytes)
'F' 'O' 'O' …: next directory entry…

32-bit first cluster number split into two parts (history: used to only be 16 bits)
8 character filename + 3 character extension (history: used to be all that was supported)
longer filenames? encoded using extra directory entries (special attrs values to distinguish from normal entries)
attributes: is a subdirectory, read-only, …; also marks directory entries used to hold extra filename data
convention: if first character is 0x00 or 0xE5 — unused
0x00: for filling empty space at end of directory
0xE5: ‘hole’ (e.g. from file deletion)

20


slide-43
SLIDE 43

aside: FAT date encoding

separate date and time fields (16 bits, little-endian integers)
time: bits 0-4: seconds (divided by 2), 5-10: minute, 11-15: hour
date: bits 0-4: day, 5-8: month, 9-15: year (minus 1980)
sometimes extra field for 100ths of a second

21

slide-44
SLIDE 44

FAT directory entries (from C)

struct __attribute__((packed)) DirEntry {
    uint8_t  DIR_Name[11];     // short name
    uint8_t  DIR_Attr;         // file attribute
    uint8_t  DIR_NTRes;        // set value to 0, never change this
    uint8_t  DIR_CrtTimeTenth; // millisecond timestamp for file creation time
    uint16_t DIR_CrtTime;      // time file was created
    uint16_t DIR_CrtDate;      // date file was created
    uint16_t DIR_LstAccDate;   // last access date
    uint16_t DIR_FstClusHI;    // high word of this entry's first cluster number
    uint16_t DIR_WrtTime;      // time of last write
    uint16_t DIR_WrtDate;      // date of last write
    uint16_t DIR_FstClusLO;    // low word of this entry's first cluster number
    uint32_t DIR_FileSize;     // file size in bytes
};

__attribute__((packed)): GCC/Clang extension to disable padding
(normally compilers add padding to structs, to avoid splitting values across cache blocks or pages)
uint8_t/uint16_t/uint32_t: 8/16/32-bit unsigned integers; use exact size that’s on disk
just copy byte-by-byte from disk to memory (and everything happens to be little-endian)
why are the names so bad (“FstClusHI”, etc.)? comes from Microsoft’s documentation this way

22


slide-48
SLIDE 48

nested directories

foo/bar/baz/file.txt:
read root directory entries to find foo
read foo’s directory entries to find bar
read bar’s directory entries to find baz
read baz’s directory entries to find file.txt

23

slide-49
SLIDE 49

the root directory?

but where is the first directory?

24

slide-50
SLIDE 50

FAT disk header

[diagram: the disk laid out as numbered clusters (1-35): filesystem header and OS startup data in the reserved sectors at the start, then the FAT and a backup FAT, then the data area; root directory starts at cluster 10]

filesystem header fields (example values):
bytes per sector: 512
reserved sectors: 5
sectors per cluster: 4
total sectors: 4096
FAT size: 11
number of FATs: 2
root directory cluster: 10

25


slide-56
SLIDE 56

fjlesystem header

fixed location near beginning of disk
determines size of clusters, etc.
tells where to find FAT, root directory, etc.

26

slide-57
SLIDE 57

FAT header (C)

struct __attribute__((packed)) Fat32BPB {
    uint8_t  BS_jmpBoot[3];   // jmp instr to boot code
    uint8_t  BS_oemName[8];   // indicates what system formatted this volume, default=MSWIN4.1
    uint16_t BPB_BytsPerSec;  // count of bytes per sector
    uint8_t  BPB_SecPerClus;  // no. of sectors per allocation unit
    uint16_t BPB_RsvdSecCnt;  // no. of reserved sectors in the reserved region of the volume starting at 1st sector
    uint8_t  BPB_NumFATs;     // count of FAT data structures on the volume
    uint16_t BPB_rootEntCnt;  // count of 32-byte entries in root dir; for FAT32 set to 0
    uint16_t BPB_totSec16;    // total sectors on the volume
    uint8_t  BPB_media;       // value of fixed media
    ....
    uint16_t BPB_ExtFlags;    // flags indicating which FATs are active

size of sector (in bytes) and size of cluster (in sectors)
space before file allocation table
number of copies of the file allocation table: extra copies in case disk is damaged; typically two, with writes made to both

27


slide-61
SLIDE 61

FAT: creating a file

add a directory entry
choose clusters to store file data (how???)
update FAT to link clusters together

28


slide-63
SLIDE 63

FAT: free clusters

[diagram: the disk, clusters numbered 1 to 35]

file allocation table (index: entry value):

…
18: 20
19: 0 (free)
20: -1 (end mark)
21: 0 (free)
22: 0 (free)
23: -1 (end)
24: 0 (free)
25: 35
26: 48
27: 0 (free)
…

29

slide-64
SLIDE 64

FAT: writing file data

[diagram: the disk, clusters numbered 1 to 35]

file allocation table (index: old entry value -> new entry value):

…
18: 20
19: 0 (free)
20: -1 (end mark)
21: 0 (free) -> 22
22: 0 (free) -> 24
23: -1 (end)
24: 0 (free) -> -1 (end)
25: 35
26: 48
27: 0 (free)
…

30

slide-65
SLIDE 65

FAT: replacing unused directory entry

[diagram: the disk, clusters numbered 1 to 35]

file allocation table (index: old entry value -> new entry value):

…
18: 20
19: 0 (free)
20: -1 (end mark)
21: 0 (free) -> 22
22: 0 (free) -> 24
23: -1 (end)
24: 0 (free) -> -1 (end)
25: 35
26: 48
27: 0 (free)
…

directory’s data:

“foo.txt”, cluster 11, size …, created …
unused entry -> “new.txt”, cluster 21, size … (directory entry of new file)
…

31

slide-66
SLIDE 66

FAT: extending directory

[diagram: the disk, clusters numbered 1 to 35]

file allocation table (index: old entry value -> new entry value):

…
18: 20
19: 0 (free)
20: -1 (end mark)
21: 0 (free) -> 22
22: 0 (free) -> 24
23: -1 (end)
24: 0 (free) -> -1 (end)
25: 35
26: 48
27: 0 (free)
…

directory’s data (first cluster):

“foo.txt”, cluster 11, size …, created …
…
“quux.txt”, cluster 104, size …, created …

directory’s data (new second cluster):

“new.txt”, cluster 21, size …, created … (directory entry of new file)
unused entry
unused entry
unused entry
…

32

slide-67
SLIDE 67

FAT: exercise

C.txt is a file in directory B, which is in directory A; consider the following items on disk:

[a] FAT entries for A [b] FAT entries for B [c] FAT entries for C.txt [d] data clusters for A [e] data clusters for B [f] data clusters for C.txt

Ignoring modification timestamp updates, which of the above may be modified to:

1) assuming directories existed previously, create C.txt
2) truncate C.txt, making it have size 0 bytes (assume prev. not empty)
3) move C.txt from directory B into directory A

33

slide-68
SLIDE 68

FAT: deleting files

reset FAT entries for file clusters to free (0)
write “unused” character in filename for directory entry

maybe rewrite directory if that’ll save space?

34

slide-69
SLIDE 69

exercise

say a FAT filesystem with:

4-byte FAT entries
32-byte directory entries
2048-byte clusters

how many FAT entries + clusters (outside of the FAT) are used to store a directory of 200 30KB files?

count clusters for both directory entries and the file data

how many FAT entries + clusters are used to store a directory of 2000 3KB files?

35

slide-70
SLIDE 70

FAT pros and cons?

36

slide-71
SLIDE 71

backup slides

37

slide-72
SLIDE 72

IOMMUs

typically, direct memory access requires using physical addresses

devices don’t have page tables
need contiguous physical addresses (multiple pages if buffer > page size)
a device that messes up can overwrite arbitrary memory

recent systems have an IO Memory Management Unit

“pagetables for devices”
allows non-contiguous buffers
enforces protection — broken device can’t write wrong memory location
helpful for virtual machines

38

slide-73
SLIDE 73

disk scheduling

schedule I/O to the disk schedule = decide what read/write to do next

by OS: what to request from disk next? by controller: which OS request to do next?

typical goals: minimize seek time; don’t starve requests

39


slide-75
SLIDE 75

shortest seek time first

[diagram: disk I/O requests over time; disk head position between inside of disk and outside of disk]

some requests starved, potentially forever if enough other reads
missing consideration: rotational latency
modification called shortest positioning time first

40


slide-79
SLIDE 79

disk scheduling

schedule I/O to the disk schedule = decide what read/write to do next

by OS: what to request from disk next? by controller: which OS request to do next?

typical goals: minimize seek time; don’t starve requests

41

slide-80
SLIDE 80
one idea: SCAN

[diagram: disk I/O requests over time; disk head sweeps between inside of disk and outside of disk]

42

slide-81
SLIDE 81

another idea: C-SCAN (C=circular)

[diagram: disk I/O requests over time; disk head position between inside of disk and outside of disk]

scan in single direction
maybe more fair than SCAN (doesn’t favor middle of disk)
maybe disk has fast way of ‘resetting’ head to outside?

43


slide-84
SLIDE 84

some disk scheduling algorithms (text)

SSTF: take request with shortest seek time next

subject to starvation — stuck on one side of disk
could also take into account rotational latency — yields SPTF

shortest positioning time first

SCAN/elevator: move disk head towards center, then away

let requests pile up between passes
limits starvation; good overall throughput

C-SCAN: take next request closer to center of disk (if any)

variant of SCAN that moves head in one direction
avoids bias towards center of disk

44