  • 1

12: FFS, LFS and other file systems

Last Modified: 6/9/2004 12:17:20 PM

  • 2

Building a file system

To build a file system from an array of disk sectors we have to decide things like:

Must files be allocated contiguously? If not, how will we find the pieces?
What information is stored about each file in the directory?
Where do we put new files that are created?
What do we do when files grow or shrink?
How do we recover the FS after a crash?

  • 3

Answers?

We are going to look at two different file systems:

Fast File System (FFS)
Log-Structured File Systems (LFS)

  • 4

How are they the same?

Both allow files to be broken into multiple pieces
Both use fixed sized blocks (for the most part)
Both use the inode structure we discussed last time

  • 5

Fast File System

Fast? Well, faster than the original UNIX file system (1970's)

Original system had poor disk bandwidth utilization
Remember why that is a problem? Too many seeks

BSD UNIX folks redesigned it in the mid 1980's

Improved disk utilization by breaking files into larger pieces
Made FFS aware of disk structure (cylinder groups) and tried to keep related things together
Other semi-random improvements like support for long file names etc.

  • 6

Managing Free Space

Break disk into cylinder groups and then into fixed size pieces called blocks (commonly 4 KB)
Each cylinder group has a certain number of blocks
Cylinder group's free list maps which blocks are free and which are taken
Cylinder groups also store a copy of the superblock, which contains special bootstrapping information like the location of the root directory (replicated)
Cylinder groups also contain a fixed number of inodes
Rest of blocks used to store file/directory data
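The per-cylinder-group free map can be sketched as a simple bitmap. This is an illustration, not real FFS code: the names (`cg_map`, `cg_alloc`, `cg_free`) and the 2048-block group size are assumptions.

```c
#include <stdint.h>

#define CG_NBLOCKS 2048               /* blocks per cylinder group (example) */

struct cg_map {
    uint8_t bits[CG_NBLOCKS / 8];     /* 1 = taken, 0 = free */
};

/* Find and claim the first free block; return its index, or -1 if the
   cylinder group is full. */
static int cg_alloc(struct cg_map *cg) {
    for (int b = 0; b < CG_NBLOCKS; b++) {
        if (!(cg->bits[b / 8] & (1u << (b % 8)))) {
            cg->bits[b / 8] |= (uint8_t)(1u << (b % 8));
            return b;
        }
    }
    return -1;
}

/* Return a block to the free map. */
static void cg_free(struct cg_map *cg, int b) {
    cg->bits[b / 8] &= (uint8_t)~(1u << (b % 8));
}
```

A real allocator would prefer blocks near the file's other blocks; first-fit is used here only to keep the sketch short.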


  • 7

Inodes in FFS

In FFS, a fixed number of inodes is set at FS format time
When you create a file, pick an inode; it will never move (so the directory entry need not be updated)
Can run out of inodes and not be able to create a file even though there is free space

  • 8

Creating a new file

In the pre-FFS UNIX file system

Free list for the entire disk
Started out ordered nicely, such that if you asked for 3 free blocks you were likely to get 3 together
Randomized over time as files were created and deleted, such that the pieces of a new file were scattered over the disk

Also, when creating a new file you need a new inode too

  • All inodes at beginning of disk, far from the data

When reading through a file there are likely to be seeks between each block – slow!

  • 9

FFS

Divide the disk into cylinder groups

Try to put all blocks of a file into the same cylinder group
Inodes in each cylinder group, so inodes are near their files
Try to put files in the same directory into the same cylinder group
Big things forced into a new cylinder group

Is this fundamentally a new approach?

Not really…
Space within a cylinder group gets treated just like the whole disk was
Space in a cylinder group gets fragmented etc.
Basically sort files into bins to reduce the frequent long seeks

  • 10

Cylinder Groups

To keep things together, must know when to keep things apart

Put large files into a different cylinder group

FFS reserves 10% of the disk as free space

To be able to sort things into cylinder groups, must have free space in each cylinder group
10% free space avoids the worst allocation choices as the disk approaches full (ex. one block free in each cylinder group)

  • 11

Other FFS Improvements

Small or large blocks?

Orig UNIX FS had small blocks (1 KB) => less efficient BW utilization
Larger blocks have problems too
For files < 4K, results in internal fragmentation

FFS uses 4K blocks but allows fragments within a block
Last < 4K of a file can be in fragments

Exactly 4K?

FFS allows the FS to be parameterized to the disk and CPU characteristics
Another cool example: when laying out logically sequential blocks, skip a few blocks in between each to allow for CPU interrupt processing, so you don't just miss the blocks and force a whole rotation
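The block/fragment split above can be made concrete with a little arithmetic. A sketch assuming 4 KB blocks divided into 1 KB fragments (the block/fragment ratio is configurable in FFS; these numbers are just the slide's example):

```c
#define BLOCK 4096u   /* full FFS block */
#define FRAG  1024u   /* fragment size (BLOCK / 4 here) */

/* Compute how many full blocks and how many trailing fragments a file of
   `size` bytes needs: the last partial block is stored as fragments
   rather than wasting a whole 4K block. */
static void layout(unsigned size, unsigned *blocks, unsigned *frags) {
    *blocks = size / BLOCK;
    unsigned tail = size % BLOCK;
    *frags = (tail + FRAG - 1) / FRAG;   /* round the tail up to fragments */
}
```

So a 2500-byte file consumes three 1 KB fragments instead of a full 4 KB block, cutting internal fragmentation for small files.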

  • 12

Update In Place

Both the original UNIX FS and FFS were update-in-place

When block X of a file is written, then forevermore, reads or writes to block X go to that location until the file is deleted or truncated
As things get fragmented, need a "defragmenter" to reorganize things


  • 13

Another Problem with Update-in-place

Poor crash recovery performance

Some operations take multiple disk requests so are impossible to do atomically

  • Ex. Write a new file (update directory, remove space from free list, write inode and data blocks, etc.)

If the system crashes (lose power or software failure), there may be file operations in progress
When the system comes back up, may need to find and fix these half-done operations

Where are they?

Could be anywhere! How can we restore consistency to the file system?

  • 14

Fixed order

Solution: Specify the order in which FS ops are done

Example: to add a file

Update free list structures to show the data block taken
Write the data block
Update free list structures to show an inode taken
Write the inode
Add entry to the directory

If a crash occurs, on reboot scan the disk looking for half-done operations

Inodes that are marked taken but are not referred to by any directory
Data blocks that are marked taken but are not referred to by any inode

  • 15

Fixed order (con't)

We've found a half-done operation, now what?

If data blocks are not pointed to by any inode, then release them
If an inode is not pointed to by any directory, link it into Lost and Found

Fsck and similar FS recovery programs do these kinds of checks

Problems can be anywhere with update in place, so must scan the whole FS!!

Problems?

Recovery takes a long time! (System shut down uncleanly.. checking your FS.. for the next 10 minutes!)
Even worse(?), normal operation takes a long time, because the specific order = many small synchronous writes = slow!
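The fsck-style orphan checks above reduce to a set comparison: which inodes are marked taken in the allocation structures but referenced by no directory? A toy sketch (all names hypothetical; a real fsck builds the `referenced` set by walking every directory):

```c
#include <stdbool.h>

#define NINODES 64   /* tiny example file system */

/* Scan for orphaned inodes: marked taken but unreferenced by any
   directory. Records orphan inode numbers in orphans[] and returns how
   many were found; these are candidates for lost+found. */
static int find_orphan_inodes(const bool taken[NINODES],
                              const bool referenced[NINODES],
                              int orphans[NINODES]) {
    int n = 0;
    for (int i = 0; i < NINODES; i++)
        if (taken[i] && !referenced[i])
            orphans[n++] = i;
    return n;
}
```

The same shape of scan (taken-but-unreferenced) applies to data blocks vs inodes; and because damage can be anywhere, both scans must cover the entire disk, which is exactly why recovery is slow.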

  • 16

Write-Ahead Logging (Journaling)

How can we solve the problem of recovery in update-in-place systems?

Borrow a technique from databases!

Logging or journaling

Before performing a file system operation like creating a new file or moving a file, make a note in the log
If crash, can simply examine the log to find interrupted operations
Don't need to examine the whole disk

  • 17

Checkpoints

Periodically write a checkpoint to a well-known location

Checkpoint establishes a consistent point in the file system
Checkpoint also contains a pointer to the tail of the log (changes since the checkpoint was written)

On recovery, start at the checkpoint and then "roll forward" through the log

Checkpoint points to the location the system will use for the first log write after the checkpoint; then each log write has a pointer to the next location to be used
Eventually go to the next location and find it empty or invalid

When writing a checkpoint, can discard earlier portions of the log
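The roll-forward loop above can be sketched directly: start at the tail the checkpoint names, follow each record's next-pointer, and stop at the first empty or invalid slot. The record layout (`valid` flag, `next` index, opaque `op`) is an assumption for illustration, not a real journal format.

```c
struct log_rec {
    int valid;   /* nonzero only if this record was completely written */
    int next;    /* index of the next log slot, -1 = none allocated */
    int op;      /* opaque description of the operation to redo */
};

/* Roll forward from the checkpoint's tail pointer, returning how many
   logged operations were redone. A real FS would re-apply log[i].op to
   the on-disk structures inside the loop. */
static int roll_forward(const struct log_rec *log, int tail) {
    int redone = 0;
    for (int i = tail; i >= 0 && log[i].valid; i = log[i].next)
        redone++;
    return redone;
}
```

The key property is that the scan is bounded by the log length since the last checkpoint, not by the size of the disk.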

  • 18

Problems with write-ahead logging

Do writes twice

Once to log and once to "real" data (still organized like FFS)

Surprisingly, can be more efficient than update-in-place!

Batched to log and then replayed to "real" in relaxed order (elevator scheduling on the disk)


  • 19

Recovery of the file system (not your data)

Write-ahead logging or journaling techniques could be used to protect FS and user data

Normally just used to protect the FS
"I look like a consistent FS, but your data may be inconsistent"
Even if some of the last files you were modifying are inconsistent, still better than FS corrupted (insert bootable device please)

Still, why do we need a "real" data layout? Why couldn't the log be the FS? Then user data would get the same benefits?

  • 20

Log-St ruct ured File Syst em

Treat t he disk as an inf init e append only

log

Dat a blocks, inodes, dir ect or ies ever yt hing

wr it t en t o t he log Bat ch writ es in large unit s called segment s

(~ 1 MB)

Garbage collect ion process called cleaner

reclaims holes in t he log t o regenerat e large expanses of f ree space f or log writ es

  • 21

Log Writes and Cleaning

  • 22

Finding Dat a

I nodes used t o f ind dat a blocks Finding inodes?

Dir ect or ies specif y locat ion of a f ile’s inode

I n an FSS, inodes are preallocat ed in each

cylinder group and a given f ile’s inode never moves (updat e in place)

I n an LFS, inodes writ t en t o t he log and so

t hey move

  • 23

Chain Reaction

LFS is not update in place: when a file block is written, its location changes

File location changes => entry in inode (and possibly also indirect blocks) changes => inode (and indirect blocks) must be rewritten
Parent directory contains the location of the inode – must the directory be rewritten too?
If so, then all directories up to the root must be rewritten?

No! – introduce another level of indirection

Directory stores the inode *number* (rather than location)
Inode map to map inode number to current location

  • 24

Inode Map

Inode map maps inode numbers to inode locations

Map kept in a special file, the ifile
When a file's inode is written, its parent directory does not change, only the ifile does
Caching the inode map (ifile) in memory is pretty important for good performance

How big is this? Approx 2*4 bytes (inode number and disk LBA) = 8 bytes for every file/directory in the file system
Can grow dynamically, unlike FFS
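The indirection can be sketched as a tiny table: directories hold stable inode numbers, and only the map entry changes when an inode moves in the log. Names (`imap_entry`, `imap_update`, `imap_lookup`) are illustrative, not Sprite LFS code; the entry matches the slide's 8-bytes-per-file estimate.

```c
#include <stdint.h>

#define NFILES 1024   /* toy fixed-size map; a real ifile grows */

struct imap_entry {
    uint32_t inum;    /* inode number (what directories store; stable) */
    uint32_t lba;     /* current disk address of the inode in the log */
};

static struct imap_entry imap[NFILES];

/* When an inode is rewritten to a new log location, only this map entry
   changes – the parent directory is untouched. */
static void imap_update(uint32_t inum, uint32_t new_lba) {
    imap[inum].inum = inum;
    imap[inum].lba  = new_lba;
}

static uint32_t imap_lookup(uint32_t inum) {
    return imap[inum].lba;
}
```

This is the step that stops the chain reaction: rewriting a file's inode updates the ifile, not every directory on the path to the root.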


  • 25

Checkpoint

Like in Write-Ahead Logging, write periodic checkpoints

Kind of like FFS superblocks
Checkpoint region has a fixed location
Actually two fixed locations, and alternate between them in case we die in the middle of writing and leave one corrupt
Checksums to verify consistency; timestamps say which is most recent

What's in a checkpoint?

Location of the inode for the ifile and the inode number of the root directory
Location of the next segment the log will be written to
Basic FS parameters like segment size, block size, etc.

  • 26

LFS Pros and Cons

What is good about this?

Leverage disk BW with large sequential writes
Near perfect write performance
Read performance? Good if you read the same way as you write, and many reads are absorbed by caches
Cleaning can often be done in idle time
Fast, efficient crash recovery
User data gets the benefits of a log

What's bad about this?

Cleaning overhead can be high – especially in the case of random updates to a full disk with little idle time
Reads may not follow write patterns (they may not follow directory structure either, though!)
Additional metadata handling (inodes, indirect blocks and ifile rewritten frequently)

  • 27

Cleaning Costs

We are going to focus on talking about the problem of high cleaning costs

Often cleaning is not a problem

If there is plenty of idle time (many workloads have this), cleaning costs are hidden
Also if there is locality to writes, then it is easier to clean
If the disk is not very full, then segments clean themselves (overwrite everything in old segments before running out of free space for new writes)

So when is cleaning a problem?

Cleaning is expensive when there are random writes to a full disk with no idle time

  • 28

High Cleaning Costs

Random writes, full disk (little free space), no idle time = sky-rocketing cleaning costs
For every 4 blocks written, also read 4 segments and write 3 segments!

  • 29

Copy cleaning vs Hole-plugging

Alternate cleaning method?

Hole-plugging = take one segment, extract the live data and use it to plug holes in other segments
This will work well for full disk, random updates, little idle time!!
Hole-plugging avoids problems with copy cleaning, but transfers many small blocks, which uses the disk less efficiently

Could we get the best of both worlds?

First we have to talk about how to quantify the tradeoffs

  • 30

Write Cost

How do we quantify the benefits of large I/Os vs the penalty of copying data?

Original LFS paper evaluated the efficiency of cleaning algorithms according to the following metric:

(DataWrittenNewData + DataReadCleaning + DataWrittenCleaning) / DataWrittenNewData

Quantifies cleaning overhead in terms of the amount of data transferred while cleaning

What about the impact of large vs small transfers?
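The metric above is a direct ratio and can be written down as-is; a cost of 1.0 means no cleaning traffic at all, and everything above 1.0 is overhead.

```c
/* LFS write cost, as stated on the slide:
   (new data written + data read while cleaning + data written while
    cleaning) / new data written. */
static double write_cost(double new_data,
                         double read_cleaning,
                         double written_cleaning) {
    return (new_data + read_cleaning + written_cleaning) / new_data;
}
```

For example, writing 100 MB of new data while the cleaner reads 50 MB and writes back 50 MB of live data gives a write cost of 2.0: every byte of new data costs two bytes of disk traffic.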


  • 31

Cost of Small Transfers

Quantify overhead due to using the disk inefficiently

TransferTimeActual / TransferTimeIdeal
Where TransferTimeActual includes seek, rotational delay and transfer time, and TransferTimeIdeal only includes transfer time

By factoring in the cost of small transfers, we see the cost of hole-plugging

  • 32

Overall Write Cost

Ratio of actual to ideal costs, where

Actual includes the cost of garbage collection and includes seek/rotational latency for each transfer
Ideal includes only the cost of the original writes to an infinite append-only log – no seek/rotational delay and no garbage collection

Now we have a metric that lets us compare hole-plugging to copy-cleaning

System can use this to choose which one to do! Adaptive cleaning ☺
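A sketch of how these pieces could fit together. The composition here (cleaning-traffic ratio scaled by transfer-time inefficiency) and all names are assumptions for illustration, not the exact formulation from the paper:

```c
/* Transfer inefficiency: actual time (seek + rotational delay + transfer)
   over ideal time (transfer alone). 1.0 = perfectly sequential; small
   transfers, as in hole-plugging, push this up. */
static double transfer_inefficiency(double seek_ms, double rot_ms,
                                    double xfer_ms) {
    return (seek_ms + rot_ms + xfer_ms) / xfer_ms;
}

/* Overall write cost: data-transferred overhead scaled by how
   inefficiently that data moved. */
static double overall_write_cost(double cleaning_traffic_ratio,
                                 double inefficiency) {
    return cleaning_traffic_ratio * inefficiency;
}

/* Adaptive cleaning: estimate both costs and pick the cheaper method
   for this cleaning pass. */
static int prefer_hole_plugging(double copy_cost, double plug_cost) {
    return plug_cost < copy_cost;
}
```

Copy cleaning moves more data but in big sequential chunks (low inefficiency); hole-plugging moves less data but in small scattered writes (high inefficiency). The single metric makes the two directly comparable.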

  • 33

Adaptive Cleaning

When starting to run out of segments, do garbage collection

Look in a special file called the segmap that tells you how full each segment is
When you rewrite a block in a segment, record in the segmap file that the segment is one block less full

Estimate the cost to do copy cleaning and the cost to do hole-plugging

Compute overall write cost by seeing how full segments are
Choose the most cost-effective method this time
Can choose a different one next time ☺

  • 34

Adaptive Cleaning For Random Update Workload

Assume no idle time to clean

  • 35

Adaptive Cleaning for Normal Usage Trace

Assume no idle time to clean

  • 36

As Technology Changes


  • 37

Other factors?

How does this layout work for reads?

Good if you read in the same way you write
Well, until you start reorganizing during cleaning (hole-plugging is worse than copy cleaning here)
Special kind of hole-plugging that writes back on top of where it used to be?

Accounting for additional metadata handling in the cache?

Modifying the write cost metric to account for "churn" in the metadata?

Model FFS in this same way

  • 38

Improving FFS also

Extent-like performance (McVoy)
FFS-realloc (McKusick)
FFS-frag and FFS-nochange (Smith)
Colocating FFS (Ganger)
Soft Updates (Ganger)

  • 39

Other FS?

Update-in-place

FAT
ext2 (extent based rather than fixed size blocks)

Write-ahead Logging (journaling)

NTFS
ReiserFS (B+ tree indices, optimizations for small files)
SGI's XFS (extent based and B+ trees)
Ext3 (journaling version of ext2)
Veritas VxFS
BeOS's BeFS

No Update?

CD-ROM FS: no update and often contiguous allocations (why does that make sense?)

  • 40

Network/Distributed FS

Sun's NFS
CMU's AFS and Coda
Transarc's (now IBM's) commercial AFS
InterMezzo (Linux Coda-like system)
Netware's NCP
SMB

  • 41

Multiple FS?

With all these choices, do we really have to choose just one FS for our OS?

If we want to allow multiple FS in the same OS, what would we have to do?

Merge them into one directory hierarchy for the user
Make them obey a common interface for the rest of the OS

  • 42

Mount points

Another kind of special file interpreted by the file system is a mount point

Contains information about how to access the root of a separate FS tree (device information if local, server information if remote, type of FS, etc.)


  • 43

Mount Points

[Diagram: File System 1 has root / with entries a, b, c; File System 2 has root / with entries x, y, z. Mount file system 2 on /b; then you can refer to z as /b/x/z]

  • 44

Common Interface?

Different FS usually need the same "hooks" into the OS

Some need special things?

Vnode interface

Proposed in 1986
Allow multiple FS in the same OS (without ugly case statements everywhere)
Allow FS to work on multiple OSes? (that's harder)

  • 45

struct vnode

One vnode structure for every opened (in-use) file

Contains:

Array of pointers to procedures that implement basic operations on files
Pointer to parent FS
Pointer to FS that is mounted on top of this file (if any)
Reference count so we know when to release the vnode
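The fields listed above can be sketched in C. Field and type names here are illustrative, not the actual SunOS/BSD definitions:

```c
struct vfs;                         /* per-mounted-FS structure (opaque here) */
struct vnode;

struct vnodeops {                   /* array of pointers to procedures */
    int (*vop_open)(struct vnode *vp, int flags);
    int (*vop_read)(struct vnode *vp, void *buf, unsigned len);
    /* ...close, create, remove, write, mkdir, readdir, etc. */
};

struct vnode {
    const struct vnodeops *v_op;    /* basic operations on this file */
    struct vfs *v_vfsp;             /* pointer to parent FS */
    struct vfs *v_vfsmountedhere;   /* FS mounted on top of this file, if any */
    int v_count;                    /* reference count */
};

/* Reference counting so the OS knows when to release the vnode. */
static void vn_hold(struct vnode *vp) { vp->v_count++; }
static int  vn_rele(struct vnode *vp) { return --vp->v_count == 0; }
```

The `v_op` table is what removes the "ugly case statements": the OS calls `vp->v_op->vop_read(...)` without knowing which file system implements it.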

  • 46

Vnode ops

Open, close, create, remove, read, write
Mkdir, rmdir, readdir

You don't know what that FS's directory format will be

Symlink, link, readlink (soft/hard links)
Getattr, setattr, access (get/set/check attributes like permissions)
Fsync
Seek
Map, getpage, putpage (memory map a file)
Ioctl (misc I/O control ops)
Rename
…

  • 47

struct vfs

One vfs structure in the OS for each mounted FS

Contains:

Array of pointers to procedures that implement basic operations on file systems
FS type
Native block size
Pointer to vnode this FS is mounted on
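As with the vnode, the per-mount structure can be sketched in C; names are illustrative, not the real vfs layout:

```c
struct vnode;
struct vfs;

struct vfsops {                     /* operations on whole file systems */
    int (*vfs_mount)(struct vfs *fsp, struct vnode *on);
    int (*vfs_unmount)(struct vfs *fsp);
    struct vnode *(*vfs_root)(struct vfs *fsp);
    /* ...statvfs, sync, vget, mountroot, swapvp */
};

struct vfs {
    const struct vfsops *vfs_op;    /* basic operations on this FS */
    const char *vfs_fstype;         /* FS type, e.g. "ufs" or "nfs" */
    unsigned vfs_bsize;             /* native block size */
    struct vnode *vfs_vnodecovered; /* vnode this FS is mounted on */
};
```

Note the symmetry with the vnode: `vfs_vnodecovered` points down at the mount point, while that vnode's mounted-here field points back up at this vfs – that pair of pointers is how path lookup crosses mount points.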

  • 48

vfsops

Mount: procedure called to mount a FS of this type on a specified vnode
Unmount: procedure to release this FS
Root: return the root vnode of this FS
Statvfs: return resource usage status of the FS
Sync: flush all dirty memory buffers to persistent storage managed by this FS
Vget: turn a fileId into a pointer to the vnode for a specific file
Mountroot: mount this FS as the root FS on this host
Swapvp: return the vnode of a file in this FS to which the OS can swap


  • 49

Evolving vnode interface?

Kleiman86 => Rosenthal90

  • 50

Do we need an FS interface?

FS Interface

Giving things file names seems a bit arbitrary

FS hierarchy vs directory search

People like to find information both ways
"I know exactly what I want; don't bother looking for me, I will get it myself"
"Give me everything matching these characteristics"

  • 51

Outtakes