CS 423 Operating System Design: Distributed File Systems

SLIDE 1

CS 423 Operating System Design: Distributed File Systems

Acknowledgement: This slide set is based on lecture slides by Prof. John Kubiatowicz, UC Berkeley, Dr. Guohui Wang, Rice University, and Prof. Kenneth Chiu, SUNY Binghamton.

SLIDE 2

Distributed File Systems

A file system provides a service for clients. The server interface is the normal set of file operations: create, read, etc. on files.

■ A Distributed File System (DFS) is simply the classical file-system model distributed across multiple machines. Its purpose is to promote sharing of dispersed files.
■ The resources on a particular machine are local to that machine; resources on other machines are remote.
SLIDE 3

Distributed File Systems

■ Naming: the mapping between logical and physical objects.
■ Location transparency:
  • The name of a file does not reveal any hint of the file's physical storage location.
■ Location independence:
  • The name of a file does not need to be changed when the file's physical storage location changes.

SLIDE 4

Naming Schemes

■ Files are named with a combination of host name and local name.
  • This guarantees a unique name, but it is neither location transparent nor location independent.
  • The same naming works on local and remote files. The DFS is a loose collection of independent file systems.
■ Remote directories are mounted onto local directories.
  • The local system then appears to have a coherent directory structure.
  • The remote directories must be explicitly mounted. The files are location transparent.
  • Sun NFS is a good example of this technique.
■ A single global name structure spans all the files in the system.
  • The DFS is built the same way as a local file system. Location independent.

SLIDE 5

Example 1

Diagram: a naming tree rooted at //, with branches //host1, //host2, //host3, //host4; a file is named by its full path, e.g., //host1/path/file.

No location transparency: the host name is embedded in every file name.

SLIDE 6

Example 2

Location transparency in NFS

Diagram: Machine #1 has a local tree / with /home, /home/usr, /bin, /lib; Machine #2 has a local tree / with /john, /foo, /bar.

SLIDE 7

Example 2

Location transparency in NFS: the mount operation

Diagram: Machine #2's directory tree (/john, /foo, /bar) is attached at a mount point under Machine #1's /home/usr.

SLIDE 8

Example 2

Location transparency in NFS: the logical view

Diagram: after the mount, Machine #1 sees a single tree: / with /home, /home/usr, /bin, /lib, and the remote directories appear as /home/usr/john, /home/usr/foo, /home/usr/bar.

SLIDE 9

Example 2

Location transparency in NFS: the logical view (continued)

Diagram: the same logical view as the previous slide, with the remote directories appearing as /home/usr/john, /home/usr/foo, /home/usr/bar.

No location independence: if files are moved from server to server, the mount points may need to change. This is a machine-centric view (the view from Machine #1).

SLIDE 10

Example 2

Local and Remote File Systems on an NFS Client

Diagram: the client's tree contains local directories (usr, vmunix) plus two remote mounts: Server 1 exports /export/people (containing big, bob, jon, ...) mounted at /usr/students, and Server 2 exports /nfs/users (containing jim, jane, joe, ann, ...) mounted at /usr/staff.

mount -t nfs Server1:/export/people /usr/students
mount -t nfs Server2:/nfs/users /usr/staff

SLIDE 11

Example 3

Location independence in Andrew

Diagram: Hosts 1 through N share a single global name space: / with /home, /home/usr, /bin, /lib, and /home/usr/john, /home/usr/foo, /home/usr/bar are visible under the same names on every host.

SLIDE 12

Simple Distributed FS

■ Remote disk: reads and writes are forwarded to the server
  • Use RPC to translate file system calls
  • No local caching
■ Advantage: the server provides a completely consistent view of the file system to multiple clients
■ Problems?

Diagram: each client sends Read (RPC) / Write (RPC) requests to the server, which returns data / an ACK.

SLIDE 13

Simple Distributed FS

■ Remote disk: reads and writes are forwarded to the server
  • Use RPC to translate file system calls
  • No local caching
■ Advantage: the server provides a completely consistent view of the file system to multiple clients
■ Problems?
  • Going over the network is slower than going to local memory
  • The server can be a bottleneck

Diagram: each client sends Read (RPC) / Write (RPC) requests to the server, which returns data / an ACK.
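
To make the remote-disk idea concrete, here is a minimal client-side sketch in C. The rpc_read/rpc_write stubs are hypothetical placeholders for whatever RPC layer the system uses; every call crosses the network, which is exactly why the server stays consistent but is slow and can become a bottleneck.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical RPC stubs: each call is one network round trip to the server. */
    ssize_t rpc_read(int server_handle, off_t offset, void *buf, size_t len);
    ssize_t rpc_write(int server_handle, off_t offset, const void *buf, size_t len);

    /* Client-side read: no local cache, so every byte comes from the server
     * and all clients always see the server's single, authoritative copy. */
    ssize_t remote_read(int server_handle, off_t offset, void *buf, size_t len)
    {
        return rpc_read(server_handle, offset, buf, len);
    }

    /* Client-side write: the server applies the write before the ACK returns. */
    ssize_t remote_write(int server_handle, off_t offset, const void *buf, size_t len)
    {
        return rpc_write(server_handle, offset, buf, len);
    }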

SLIDE 14

Distributed FS w/ Caching

■ Idea: use caching to reduce the network load
■ Advantage: if open/read/write/close can be done locally, no network traffic is needed... fast!
■ Problems:
  • Failure: client caches hold data not yet committed at the server
  • Cache consistency: client caches are not consistent with the server or with each other

Diagram: the server cache goes from F1:V1 to F1:V2 after one client writes F1; the other client keeps reading its cached F1:V1 and sees the stale V1 until its cache is eventually refreshed to V2.

SLIDE 15

Virtual FS

■ VFS: a virtual abstraction similar to a local file system
  • Instead of "inodes" it has "vnodes"
  • Compatible with a variety of local and remote file systems
■ VFS allows the same system call interface (the API) to be used for different types of file systems (the API is to the VFS interface)
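
A rough sketch of the dispatch idea behind vnodes, using hypothetical types and names (real VFS layers such as Linux's differ in detail): each file system fills in a table of function pointers, and the system call layer calls through that table without knowing whether the target is local or remote.

    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;   /* one per open file object, regardless of file system type */

    /* Per-file-system operations table; each concrete FS supplies its own. */
    struct vnode_ops {
        int     (*open) (struct vnode *vn, int flags);
        ssize_t (*read) (struct vnode *vn, void *buf, size_t len, off_t off);
        ssize_t (*write)(struct vnode *vn, const void *buf, size_t len, off_t off);
        int     (*close)(struct vnode *vn);
    };

    struct vnode {
        const struct vnode_ops *ops;   /* e.g. a local FS, or the NFS client code */
        void *fs_private;              /* FS-specific data (inode, remote handle, ...) */
    };

    /* The system-call layer dispatches through the vnode, so the same read()
     * API works for local and remote files alike. */
    ssize_t vfs_read(struct vnode *vn, void *buf, size_t len, off_t off)
    {
        return vn->ops->read(vn, buf, len, off);
    }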

SLIDE 16

Network File System (NFS)

■ Three layers in the NFS system
  • UNIX file-system interface: open, read, write, close calls + file descriptors
  • VFS layer: distinguishes local from remote files; calls the NFS protocol procedures for remote requests
  • NFS service layer: the bottom layer of the architecture; implements the NFS protocol
■ NFS protocol: RPC for file operations on the server
  • Reading/searching a directory
  • Manipulating links and directories
  • Accessing file attributes, reading and writing files
■ Write-through caching: modified data is committed to the server's disk before results are returned to the client
  • Loses some of the advantages of caching
  • The time to perform a write() can be long
  • Some mechanism is still needed for readers to eventually notice changes!

SLIDE 17

Schematic View of NFS

(Diagram of the NFS architecture.)

SLIDE 18

Network File System (NFS)

■ NFS servers are stateless; each request provides all arguments required for its execution
  • E.g. reads include the information for the entire operation, such as ReadAt(inumber, position), not Read(openfile)
  • No need to perform a network open() or close() on a file; each operation stands on its own
■ Idempotent: performing a request multiple times has the same effect as performing it exactly once
  • Example: the server crashes between the disk I/O and the message send; the client resends the read and the server simply does the operation again
  • Example: reads and writes of file blocks: just re-read or re-write the block, with no side effects
  • Example: what about "remove"? NFS does the operation twice, and the second time it returns an advisory error
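
As an illustration of statelessness, here is a hypothetical wire format for the read request (the field names are invented, but the NFS READ call is similar in spirit): the request names the file and the byte range itself, so the server needs no per-client open-file table, and re-sending the same request is harmless.

    #include <stdint.h>

    /* Everything the server needs is in the request itself (stateless). */
    struct read_request {
        uint8_t  file_handle[32];   /* stable identifier for the file (like an inumber) */
        uint64_t offset;            /* where to read, in bytes */
        uint32_t count;             /* how many bytes to read */
    };

    /* Executing the same read_request twice returns the same data with no side
     * effects, which is what makes it idempotent: the client can simply
     * retransmit after a timeout, even if the server already performed the
     * first copy of the request and crashed before replying. */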

SLIDE 19

Network File System (NFS)

■ Failure model: transparent to the client system
■ Options (NFS provides both):
  • Hang until the server comes back up (next week?)
  • Return an error (of course, most applications don't know they are talking over a network)
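
As a point of reference (an observation about typical NFS client implementations, not something stated on the slide): this choice is usually exposed as a per-mount option, where a "hard" mount retries indefinitely and the application blocks until the server responds, while a "soft" mount gives up after a number of retries and returns an I/O error to the application.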

SLIDE 20

NFS Cache Consistency

Diagram: one client writes F1 (the server now holds F1:V2); another client with F1:V1 in its cache later asks the server "F1 still ok?" and is told "No: (F1:V2)".

■ NFS protocol: weak consistency
  • A client polls the server periodically to check for changes
  • It polls the server if the cached data hasn't been checked in the last 3-30 seconds (the exact timeout is a tunable parameter)
  • Thus, when a file is changed on one client, the server is notified, but other clients keep using the old version of the file until the timeout expires
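
A sketch of the polling logic on the client, with invented names (real NFS clients implement this with cached attributes and tunable attribute-cache timeouts, but the idea is the same): only revalidate with the server when the cached copy is older than the timeout.

    #include <stdbool.h>
    #include <time.h>

    struct cached_file {
        time_t last_checked;    /* when we last asked the server about this file */
        time_t server_mtime;    /* modification time the server reported then */
        /* ... cached data blocks ... */
    };

    /* Hypothetical RPC: ask the server for the file's current modification time. */
    time_t rpc_get_mtime(const char *path);

    /* Revalidate at most once per timeout window (e.g. 3-30 seconds). */
    bool cache_is_fresh(struct cached_file *cf, const char *path, int timeout_secs)
    {
        time_t now = time(NULL);
        if (now - cf->last_checked < timeout_secs)
            return true;                        /* trust the cache, no network traffic */
        time_t mtime = rpc_get_mtime(path);     /* one GETATTR-style round trip */
        cf->last_checked = now;
        if (mtime == cf->server_mtime)
            return true;                        /* unchanged on the server */
        cf->server_mtime = mtime;
        return false;                           /* stale: caller must refetch the data */
    }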

SLIDE 21

NFS Cache Consistency

Diagram: the same write scenario as the previous slide, with one client holding F1:V2 while another still holds F1:V1 in its cache.

■ NFS protocol: weak consistency
  • What if multiple clients write to the same file?
  • In NFS, you can get either version (or parts of both)
  • Completely arbitrary!

SLIDE 22

Andrew File System

■ Andrew File System (AFS, late 80's)
■ Callbacks: the server records who has a copy of each file
  • On changes, the server immediately tells everyone holding an old copy
  • No polling bandwidth (continuous checking) needed
■ Write-through on close
  • Changes are not propagated to the server until close()
  • Session semantics: updates are visible to other clients only after the file is closed
  • As a result, you do not get partial writes: all or nothing!
  • Although, for processes on the local machine, updates are visible immediately to other programs that have the file open
■ In AFS, everyone who already has the file open sees the old version
  • They don't get the newer version until they reopen the file
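
A minimal sketch of the server-side callback bookkeeping, with hypothetical names: the server remembers which clients hold a copy of each file and breaks those callbacks when the file changes, which is what replaces NFS-style polling.

    #define MAX_CLIENTS 64

    /* Hypothetical RPC from server to client: "your cached copy is now invalid". */
    void rpc_break_callback(int client_id, const char *file);

    struct callback_list {
        const char *file;
        int holders[MAX_CLIENTS];   /* clients known to cache this file */
        int nholders;
    };

    /* Called when a client fetches the file: remember that it holds a copy. */
    void register_callback(struct callback_list *cb, int client_id)
    {
        if (cb->nholders < MAX_CLIENTS)
            cb->holders[cb->nholders++] = client_id;
    }

    /* Called when a client closes a modified file (write-through on close):
     * notify everyone else that their copy is stale. They will refetch on
     * their next open; clients that already have the file open keep seeing
     * the old version (session semantics). */
    void file_updated(struct callback_list *cb, int writer_id)
    {
        for (int i = 0; i < cb->nholders; i++)
            if (cb->holders[i] != writer_id)
                rpc_break_callback(cb->holders[i], cb->file);
        cb->holders[0] = writer_id;
        cb->nholders = 1;           /* only the writer's copy is still current */
    }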

SLIDE 23

Andrew File System

■ Data is cached on the client's local disk as well as in memory
■ On open with a cache miss (file not on local disk):
  • Get the file from the server and set up a callback with the server
■ On write followed by close:
  • Send a copy to the server; the server tells all clients with copies to fetch the new version from the server on their next open (using callbacks)
■ What if the server crashes? It loses all callback state!
  • Reconstruct the callback information from the clients: go ask everyone "who has which files cached?"
■ For both AFS and NFS, the central server is the bottleneck!
  • Performance: all writes go to the server, and all cache misses go to the server
  • Availability: the server is a single point of failure
  • Cost: the server machine's high cost

SLIDE 24

Google File System (GFS)

■ The system is so large that failures are the norm
■ Files are very big (multi-GB is the norm)
■ Files are modified by "appending" and are usually read sequentially

SLIDE 25

GFS Assumptions

■ The system is built of commodity components that often fail
■ The system stores a few million files that are 100 MB or larger
■ Operations consist of large streaming reads and small random reads
■ Most writes are large appends
■ Large sustained bandwidth is more important than low latency

SLIDE 26

GFS Architecture

■ Master, chunk servers, and clients
■ Chunks (64 MB) are replicated on multiple chunk servers
■ No file data caching
■ Clients cache metadata from the master

SLIDE 27

Single Master

■ General disadvantages for distributed systems:
  • Single point of failure
  • Bottleneck (scalability)
■ Solution?
  • Clients use the master only for metadata, not for reading/writing data
  • A new master recovers its state from the chunk servers

SLIDE 28

Chunk Size

■ Key design parameter: GFS chose 64 MB
■ Each chunk is a plain Linux file, extended as needed
■ Fragmentation: internal vs. external?
■ Hotspots: some files may be accessed too much
  • How to deal with it?
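
A quick back-of-the-envelope illustration of why the 64 MB choice matters for the single master (the per-chunk metadata size is an assumption for illustration, not a figure from the slides): with 64 MB chunks, a 1 TB file occupies 1 TB / 64 MB = 16,384 chunks, so even at, say, under a hundred bytes of master metadata per chunk the whole file costs the master on the order of a megabyte of memory; with 64 KB chunks it would be roughly a thousand times more.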

SLIDE 29

Leases

■ The master grants a "lease" to the primary replica of each chunk
■ The primary replica decides the order of updates to the chunk
■ A lease is given for 60 seconds and is renewable as needed
■ A lease can be revoked by the master
■ What happens if the master loses communication with the primary replica?
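
A sketch of the master's lease bookkeeping under assumed names. It also illustrates the usual answer to the last question: if the master cannot reach the primary, it simply waits for the old lease term to run out before granting the lease to another replica, so two primaries are never active at once.

    #include <stdbool.h>
    #include <time.h>

    #define LEASE_SECS 60

    struct chunk_lease {
        int    primary_replica;    /* which chunkserver currently holds the lease */
        time_t expires_at;         /* lease end time, extended on each renewal */
    };

    bool lease_valid(const struct chunk_lease *l)
    {
        return time(NULL) < l->expires_at;
    }

    /* Grant (or re-grant/renew) the lease. If the current primary is
     * unreachable, the master must not hand the lease to someone else until
     * the old term has expired, otherwise two primaries could order
     * mutations differently. */
    bool grant_lease(struct chunk_lease *l, int replica)
    {
        if (lease_valid(l) && l->primary_replica != replica)
            return false;                       /* wait for the old lease to expire */
        l->primary_replica = replica;
        l->expires_at = time(NULL) + LEASE_SECS;
        return true;
    }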

SLIDE 30

The "Append" Operation

■ Concurrent "append" operations are allowed, but the order in which they are applied is up to GFS
■ All chunk replicas are appended/modified in the same order, as decided by the primary replica

SLIDE 31

Append

(1) The client requests the locations of the primary and secondary replicas from the master
(2) The master replies
(3) The client sends the data to the replicas and receives acks
(4) The client asks the primary replica to do the "append" on the data
(5) The primary replica decides on an order for the appends and asks the secondary replicas to follow the same order
(6) The secondary replicas perform the operation and ack the primary
(7) The primary replies to the client
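
The same seven steps as a client-side sketch in C, using invented RPC stub names; the point is that data transfer (step 3) and the ordering decision by the primary (steps 4-7) are separate phases.

    #include <stddef.h>

    struct replicas { int primary; int secondaries[2]; int nsecondaries; };

    /* Hypothetical RPC stubs for the steps above. */
    struct replicas master_lookup(const char *file, int chunk_index);      /* steps 1-2 */
    int  rpc_push_data(int chunkserver, const void *buf, size_t len);      /* step 3 */
    long rpc_append(int primary, const char *file, int chunk_index);       /* steps 4-7 */

    /* Client-side record append: returns the offset chosen by the primary,
     * or -1 on failure (the client would then retry). */
    long gfs_append(const char *file, int chunk_index, const void *buf, size_t len)
    {
        struct replicas r = master_lookup(file, chunk_index);      /* (1)-(2) */

        if (rpc_push_data(r.primary, buf, len) != 0)                /* (3) */
            return -1;
        for (int i = 0; i < r.nsecondaries; i++)
            if (rpc_push_data(r.secondaries[i], buf, len) != 0)
                return -1;

        /* (4)-(7): the primary picks the append offset, forwards the request
         * to the secondaries in that order, collects their acks, and replies. */
        return rpc_append(r.primary, file, chunk_index);
    }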

SLIDE 32

Append

Linear data pipeline: the same seven steps as on the previous slide, except that in step (3) the client pushes the data linearly along a chain of chunk servers (each forwards to the next) rather than sending a separate copy to every replica itself.

SLIDE 33

File Copy

■ The master receives a copy (snapshot) request and revokes all leases on the file's chunks
  • Why revoke?
  • What if it loses communication with a chunk server?
■ After all leases are revoked or expire, the master duplicates the metadata (pointing to the same chunks as the original) and increments the reference counts; no actual data copy is performed
■ When a client wants to write to a chunk, its request for the primary goes to the master, which notices that the chunk has a reference count of two and asks the chunk servers to replicate the chunk
■ A handle to the new chunk copy is returned
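
A sketch of the copy-on-write bookkeeping described above, with invented names: the snapshot only bumps reference counts, and the actual chunk duplication is deferred until some client first writes to a shared chunk.

    /* Hypothetical helper: ask the chunk servers to clone a chunk locally and
     * return the handle of the new copy. */
    long replicate_chunk(long chunk_handle);

    struct chunk_meta {
        long chunk_handle;
        int  refcount;      /* how many files currently point at this chunk */
    };

    /* Snapshot: the new file's metadata points at the same chunk; no data moves. */
    void snapshot_chunk(struct chunk_meta *c)
    {
        c->refcount++;
    }

    /* First write after a snapshot: refcount > 1 means the chunk is shared, so
     * the master has the chunk servers duplicate it and the writer gets a
     * handle to its own private copy. */
    long chunk_handle_for_write(struct chunk_meta *c)
    {
        if (c->refcount > 1) {
            c->refcount--;                          /* the other file keeps the old chunk */
            return replicate_chunk(c->chunk_handle);
        }
        return c->chunk_handle;
    }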

SLIDE 34

Microbenchmarks

■ GFS cluster consisting of:
  • One master
  • Two master replicas
  • 16 chunkservers
  • 16 clients
■ Machines were:
  • Dual 1.4 GHz PIII
  • 2 GB of RAM
  • Two 80 GB 5400 RPM disks
  • 100 Mbps full-duplex Ethernet to a switch
  • Servers connected to one switch, clients to another; the switches connected via gigabit Ethernet

SLIDE 35

Reads

■ N clients read a 4 MB region from a 320 GB file set.
■ The read rate per client drops slightly as the number of clients goes up, due to the growing probability that several clients read from the same chunkserver at the same time.

SLIDE 36

Writes

■ N clients write simultaneously to N files; each client writes 1 GB to a new file in a series of writes.
■ The low performance is due to the network stack.

SLIDE 37

Appends

■ N clients append simultaneously to a single file.