Page 1
Distributed File Systems
Paul Krzyzanowski pxk@cs.rutgers.edu
Distributed Systems
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
Page 2
NFS • AFS • CODA • DFS • SMB • CIFS • Dfs • WebDAV • GFS • Gmail-FS? • xFS
Page 3
Page 4
– Any machine can be a client or server – Must support diskless workstations – Heterogeneous systems must be supported
– Access transparency: remote files are accessed via the same system calls as local files (through VFS in UNIX; see the sketch after this list)
– High Performance
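A minimal sketch of access transparency, assuming a hypothetical NFS mount at /mnt/nfs: the application issues ordinary system calls, and the kernel's VFS layer routes them to the NFS client code.

    import os

    # The same system calls work whether the file is on a local disk or on
    # an NFS server; the VFS routes the request to the right implementation.
    fd = os.open("/mnt/nfs/home/paul/notes.txt", os.O_RDONLY)
    data = os.read(fd, 4096)
    os.close(fd)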
Page 5
If resource moves to another server, client must remount resource.
Page 6
Stateless design: file locking is a problem, and not all UNIX file system controls may be available.
Page 7
Must support diskless workstations, where every file is remote. Remote device files refer back to local devices.
Page 8
Initially NFS ran over UDP using Sun RPC
– Designed for LAN environments, which are relatively reliable
Page 9
Request access to exported directory tree
Access files and directories (read, write, mkdir, readdir, …)
Page 10
– File device #, inode #, instance #
client: parses the pathname, contacts the server for a file handle
client: creates an in-core vnode at the mount point
– the vnode points to an inode for local files
– the vnode points to an rnode for remote files
Page 11
– mount request contacts server
– Server: edit /etc/exports
– Client: mount fluffy:/users/paul /home/paul
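For illustration, a modern Linux-style /etc/exports entry that would export the directory above (the client address range is hypothetical):

    # /etc/exports on server fluffy
    /users/paul  192.168.1.0/24(rw,sync)

The client's mount command then makes fluffy:/users/paul appear at /home/paul.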
Page 12
– returns file handle and attributes
– No information is stored on server
– e.g. read(handle, offset, count)
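A sketch of what statelessness implies for the client, with read_rpc standing in as a hypothetical stub for the NFS read call: every request carries its full context (handle, offset, count), so the server remembers nothing between calls.

    def read_whole_file(read_rpc, handle, filesize, chunk=8192):
        # Each request is self-contained: (handle, offset, count).
        # If the server crashes and reboots, the client simply retries.
        data = b""
        offset = 0
        while offset < filesize:
            data += read_rpc(handle, offset, min(chunk, filesize - offset))
            offset += chunk
        return data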
Page 13
– (version 2; six more added in version 3)
Page 14
– Goal: reduce number of remote operations – Cache results of read, readlink, getattr, lookup, readdir – Cache file data at client (buffer cache) – Cache file attribute information at client – Cache pathname bindings for faster lookups
– Caching is “automatic” via buffer cache – All NFS writes are write-through to disk to avoid unexpected data loss if server dies
Page 15
– Save the timestamp of the file when it is cached
– Validate the cached copy when the file is opened or when the server is contacted for a new block
Page 16
– After 3 seconds for open files (data blocks) – After 30 seconds for directories
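A sketch of this validation policy, assuming a hypothetical cache-entry object and a get_attr stub for the server's getattr call:

    import time

    def cache_valid(entry, is_dir, get_attr):
        ttl = 30 if is_dir else 3             # thresholds from the slide
        if time.time() - entry.validated < ttl:
            return True                       # trust the cache within the window
        if get_attr(entry.handle).mtime == entry.mtime:
            entry.validated = time.time()     # server copy unchanged: revalidate
            return True
        return False                          # stale: refetch from the server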
– Marked dirty – Scheduled to be written – Flushed on file close
Page 17
– 8K bytes default
– Optimize for sequential file access – Send requests to read disk blocks before they are requested by the application
Page 18
– Separate lock manager added (stateful)
– You can delete a file you (or others) have open!
Page 19
– You can delete a file you (or others) have open!
– Common workaround: create a temp file, delete it, continue to access it
– Sun’s hack: if an open file is deleted, rename it instead (e.g., to .nfsXXXX) and remove it when the last process closes it
Page 20
– File permissions may change on the server, invalidating access to the file
– Requests via unencrypted RPC – Authentication methods available
– Rely on user-level software to encrypt
Page 21
– Monitored locks: a status monitor provides crash recovery of lock state
– Improves write performance – Normally NFS must write to disk on server before responding to client write requests – Relax this rule through the use of non-volatile RAM
Page 22
– Reduce network congestion from excess RPC retransmissions under load
– Retransmission timeouts adjust based on measured performance
– cacheFS – Extend buffer cache to disk for NFS
Page 23
Problem with mounts – If a client has many remote resources mounted, boot-time can be excessive – Each machine has to maintain its own name space
Automounter – Allows administrators to create a global name space – Support on-demand mounting
Page 24
– Attempt to unmount every 5 minutes
Page 25
automount /usr/src srcmap

srcmap contains:
cmd      doc:/usr/src/cmd
kernel   frodo:/release/src \
         bilbo:/library/source/kernel
lib      sneezy:/usr/local/lib

Access /usr/src/cmd: request goes to doc
Access /usr/src/kernel: ping frodo and bilbo, mount first response
Page 26
[Diagram: an application's file request enters the kernel VFS; the automounter intercepts the NFS request, performs the NFS mount, and subsequent NFS requests go directly to the NFS server]
Page 27
– UDP caused more problems on WANs (errors) – All traffic can be multiplexed on one connection
– No fixed limit on amount of data that can be transferred between client and server
Page 28
– Check with server after a write operation to see if data is committed – If commit fails, client must resend data – Reduce number of write requests to server – Speeds up write requests
– Saves extra RPCs
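A sketch of the resulting write path, with write_rpc and commit_rpc as hypothetical stubs: writes are sent unstably and buffered, then one commit confirms durability; on a failed commit the client resends from its buffer.

    def write_back(write_rpc, commit_rpc, handle, data, chunk=8192):
        pending = list(range(0, len(data), chunk))
        for off in pending:
            write_rpc(handle, off, data[off:off + chunk], stable=False)
        if not commit_rpc(handle):            # one round trip checks all writes
            for off in pending:               # commit failed: resend, stably
                write_rpc(handle, off, data[off:off + chunk], stable=True)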
Page 29
Page 30
– Commercialized by Transarc (later acquired by IBM)
Page 31
Page 32
– Once referenced, a file is likely to be referenced again
Page 33
– Send the entire file on open
– Client caches entire file on local disk – Client writes the file back to server on close
Page 34
– Part of disk devoted to AFS (e.g. 100 MB) – Client manages cache in LRU manner
Page 35
– AFS is divided into administrative units called cells
– A cell has its own servers, administrators, users, and clients
– All cells together present users with one uniform name space
Page 36
A disk partition contains files and directories, grouped into volumes
– Administrative unit of organization
– Each volume is a directory tree (one root) – Assigned a name and ID number – A server will often have 100s of volumes
Page 37
/afs/cellname/path /afs/mit.edu/home/paul/src/try.c
Page 38
1. Traverse AFS mount point
E.g., /afs/cs.rutgers.edu
2. AFS client contacts Volume Location DB on Volume Location server to look up the volume 3. VLDB returns volume ID and list of machines
(>1 for replicas on read-only file systems)
4. Request root directory from any machine in the list 5. Root directory contains files, subdirectories, and mount points 6. Continue parsing the file name until another mount point (from step 5) is encountered. Go to step 2 to resolve it.
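A sketch of the resolution loop in the steps above (vldb_lookup and fetch_root are hypothetical stubs for the VLDB query and the directory fetch; directories are modeled as dicts):

    def resolve(path, vldb_lookup, fetch_root):
        vol_id, servers = vldb_lookup("root.afs")     # steps 2-3
        node = fetch_root(servers[0], vol_id)         # step 4
        for name in path.strip("/").split("/"):
            entry = node[name]                        # step 5
            if entry.get("mount_point"):              # step 6: back to step 2
                vol_id, servers = vldb_lookup(entry["volume"])
                node = fetch_root(servers[0], vol_id)
            else:
                node = entry
        return node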
Page 39
Page 40
Kerberos authentication:
– Trusted third party issues tickets
– Mutual authentication
Before a user can access files:
– Authenticate to AFS with the klog command
Page 41
– Server sends entire file to client and provides a callback promise: – It will notify the client when any other process modifies the file
Page 42
– Contents are written to the server when the file is closed
– The server then notifies all clients that hold a callback promise
– Those clients invalidate their cached copies
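A server-side sketch of this invalidation, assuming each client object exposes a hypothetical break_callback method:

    class FileServer:
        def __init__(self):
            self.files = {}            # file id -> contents
            self.callbacks = {}        # file id -> clients holding a promise

        def fetch(self, client, fid):
            # Send the whole file and record a callback promise.
            self.callbacks.setdefault(fid, set()).add(client)
            return self.files[fid]

        def store(self, writer, fid, data):
            # A client closed a modified file: store it, then notify every
            # other promise holder so they invalidate their cached copy.
            self.files[fid] = data
            for client in self.callbacks.pop(fid, set()) - {writer}:
                client.break_callback(fid)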
Page 43
– After a reboot, contact the server with timestamps of all cached files to decide whether to invalidate them
Page 44
– AFS caches in 64KB chunks (by default) – Entire directories are cached
– Query server to see if there is a lock
Page 45
– offers dramatically reduced load on servers
– keeps clients from having to contact the server to validate their cache
Page 46
– AFS scales well – Uniform name space – Read-only replication – Security model supports mutual authentication, data encryption
– Session semantics – Directory based permissions – Uniform name space
Page 47
– 95% NFS, 5% AFS – Approx 20 AFS cells managed by 10 regional organizations – AFS used for:
– NFS used for:
– 25000+ hosts in 50+ sites on 6 continents – AFS is primary distributed filesystem for all UNIX hosts – 24x7 system usage; near zero downtime – Bandwidth from LANs to 64 Kbps inter-continental WANs
Page 48
Page 49
Provide better support for replication than AFS
Support mobility of PCs
Page 50
Page 51
– Volume Storage Group (VSG)
Page 52
– A replicated volume ID maps to a list of servers and their local volume IDs
– Clients cache these results for efficiency
Page 53
– Accessible Volume Storage Group (AVSG): the subset of the VSG that the client can currently reach
Page 54
– When a failed server resumes operation
– Client initiates a resolution process (repairing conflicts, if any)
Page 55
– Client goes to disconnected operation mode
– Log update locally in Client Modification Log (CML) – User does not notice
Page 56
– Reintegration commences upon reconnection
– Optimized to send only the latest changes
– Not always possible: conflicting updates may need manual resolution
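A client-side sketch pulling the last two slides together (apply_locally and the server calls are hypothetical stubs): while disconnected, every update is appended to the CML; on reconnection the log is replayed at the server.

    class CodaClient:
        def __init__(self):
            self.connected = True
            self.cml = []                         # client modification log

        def write(self, server, path, data):
            self.apply_locally(path, data)        # user notices nothing
            if self.connected:
                server.store(path, data)
            else:
                self.cml.append(("store", path, data))

        def reintegrate(self, server):
            for op in self.cml:                   # replay the log; conflicts
                server.replay(op)                 # may still need manual repair
            self.cml.clear()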
Page 57
– Ask server to send updates if necessary
– Automatically constructed by monitoring the user’s activity – And user-directed prefetch
Page 58
– Client-driven reintegration
– Client modification log – Hoard database for needed files
– Log replay on reintegration
Page 59
Page 60
– Most file accesses are sequential – Most file lifetimes are short – Majority of accesses are whole file transfers – Most accesses are to small files
Page 61
Page 62
Page 63
Open tokens
– Allow token holder to open a file
– Token specifies access (read, write, execute, exclusive-write)
Data tokens
– Apply to a byte range
– read token: can use cached data
– write token: write access, cached writes
Status tokens
– read: can cache file attributes
– write: can cache modified attributes
Lock tokens
– Holder can lock a byte range of a file
Page 64
– Multiple read tokens are OK
– Multiple read tokens plus a write token, or multiple write tokens, are not OK if byte ranges overlap
– In that case the server revokes the conflicting tokens from their holders
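A sketch of that compatibility rule for data tokens (the Token type here is illustrative, not the DFS wire format):

    from collections import namedtuple

    Token = namedtuple("Token", "mode start end")   # mode: "read" or "write"

    def compatible(a, b):
        overlap = a.start < b.end and b.start < a.end
        return (not overlap) or (a.mode == "read" and b.mode == "read")

    assert compatible(Token("read", 0, 100), Token("read", 50, 150))
    assert not compatible(Token("write", 0, 100), Token("read", 50, 150))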
Page 65
– Allows for long term caching and strong consistency
Page 66
– Server keeps track of who is reading and who is writing files – Server must be contacted on each open and close
Page 67
Page 68
Windows 95/98/NT/200x/ME/XP/Vista
Files, devices, communication abstractions (named pipes), mailboxes
Page 69
– Send request to server (machine with resource) – Server sends response
– Persistent connection – “session”
– Fixed-size header – Command string (based on message) or reply string
Page 70
– Protocol ID – Command code (0..FF) – Error class, error code – Tree ID – unique ID for resource in use by client (handle) – Caller process ID – User ID – Multiplex ID (to route requests in a process)
– Param count, params, #bytes data, data
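An illustrative packing of these header fields using the classic 32-byte SMB header layout (status and flags zeroed for brevity):

    import struct

    def smb_header(command, tid, pid, uid, mid):
        return struct.pack(
            "<4s B I B H H 8s H H H H H",
            b"\xffSMB",   # protocol ID
            command,      # command code (0..0xFF)
            0,            # status (error class / error code)
            0,            # flags
            0,            # flags2
            0,            # PID high
            b"\x00" * 8,  # security signature
            0,            # reserved
            tid,          # tree ID: handle for the resource in use
            pid,          # caller process ID (low 16 bits)
            uid,          # user ID
            mid)          # multiplex ID: routes replies within a process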
Page 71
– Get disk attr – create/delete directories – search for file(s) – create/delete/rename file – lock/unlock file area – open/commit/close file – get/set file attributes
Page 72
– Open/close spool file – write to spool – Query print queue
Page 73
Page 74
– Client sends a negprot (negotiate protocol) SMB
– Server responds with the version number of the protocol
Page 75
Page 76
– Send tcon (tree connect) SMB with name of shared resource – Server responds with a tree ID (TID) that the client will use in future requests for the resource
Page 77
Page 78
– Clients listen for broadcast – Build list of servers
– Does not scale to WANs – Microsoft introduced browse servers and the Windows Internet Name Service (WINS) – or … explicit pathname to server
Page 79
Share-level security:
– Protection per “share” (resource)
– Each share can have a password
– Client needs the password to access all files in the share
– Only security model in early versions
– Default in Windows 95/98
User-level security:
– Protection applied to individual files in each share based on access rights
– Client must log in to the server and be authenticated
– Client gets a UID, which must be presented for future accesses
Page 80
Page 81
– samba under Linux
– Microsoft released the protocol to X/Open in 1992
Page 82
– Shared files – Byte-range locking – Coherent caching – Change notification – Replicated storage – Unicode file names
Page 83
Page 84
– Support wide-area networks
– But need reliable connection-oriented message stream transport
Page 85
– Caching
– read-ahead
– write-behind
Page 86
– Oplock tells client how/if it may cache data
– Similar to DFS tokens (but more limited)
– An oplock may be: level 1 (exclusive), level 2 (shared read), batch, filter, or none (each described on the following slides)
Page 87
Level 1 (exclusive) oplock:
– Client can open file for exclusive access
– Arbitrary caching
– Cache lock information
– Read-ahead
– Write-behind
If another client opens the file, the server has the former client break its oplock:
– Client must send the server any lock and write data and acknowledge that it does not have the lock
– Purge any read-aheads
Page 88
– Level 1 oplock is replaced with a Level 2 lock if another process tries to read the file – Request this if expect others to read – Multiple clients may have the same file open as long as none are writing – Cache reads, file attributes
Page 89
– Client can keep a file open on the server even if a local process that was using it has closed the file
– Client requests a batch oplock if it expects programs to behave in a way that generates a lot of open and close operations on the same file (e.g., batch scripts)
Page 90
– E.g., an indexing service can run and open files without causing other programs to get an error when they need to access the file
– The indexing service gives up its oplock, stops indexing, and closes the file
Page 91
– All requests must be sent to the server – can work from cache only if byte range was locked by client
Page 92
– N:\junk.doc – \\myserver\users\paul\junk.doc – file://grumpy.pk.org/users/paul/junk.doc
Page 93
– Provides a logical view of files & directories
– Naming: \\servername\dfsname
– Each Dfs tree has one root volume and one level of leaf volumes
– Alternate path: load balancing (read-only) – Similar to Sun’s automounter
Page 94
– Receives STATUS_DFS_PATH_NOT_COVERED – Client requests referral: TRANS2_DFS_GET_REFERRAL – Server replies with new server
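A client-side sketch of this referral chasing (the server objects and connect are hypothetical stubs):

    def dfs_open(server, path, connect):
        while True:
            reply = server.open(path)
            if reply.status != "STATUS_DFS_PATH_NOT_COVERED":
                return reply                        # path served here
            referral = server.get_referral(path)    # TRANS2_DFS_GET_REFERRAL
            server = connect(referral.new_server)   # retry at referred server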
Page 95
Page 96
Page 97
– Group operations together – Receive set of responses – Reduce round-trip latency
– Ensures atomicity of share reservations for Windows file sharing (CIFS) – Supports exclusive creates – Client can cache aggressively
Page 98
– Inform client if the directory changed during the
– Extensible authentication architecture
– To be defined
Page 99
– Similar to CIFS oplocks
– Notify client when file/directory contents change
Page 100
Page 101
– Thousands of storage machines – Some are not functional at any given time
Page 102
– Files are huge by traditional standards
– Don’t optimize for small files
– Large streaming reads – Small random reads – Most files are modified by appending – Access is mostly read-only, sequential
– E.g., atomic append operation
Page 103
– Get (and cache) chunkserver/chunk ID for file
– Periodic logs and replicas
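A sketch of the client read path implied by this design (the master and chunkserver objects are hypothetical stubs; 64 MB is GFS's default chunk size):

    CHUNK = 64 * 2**20                            # 64 MB chunks

    def gfs_read(master, filename, offset, length):
        index = offset // CHUNK                   # which chunk holds the byte
        handle, servers = master.lookup(filename, index)   # cacheable result
        # Single-chunk read for brevity; longer reads span multiple chunks.
        return servers[0].read(handle, offset % CHUNK, length)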
Page 104
RFC 2518
– PROPFIND: retrieve properties from a resource, including a collection (directory) structure – PROPPATCH: change/delete multiple properties on a resource – MKCOL: create a collection (directory) – COPY: copy a resource from one URI to another – MOVE: move a resource from one URI to another – LOCK: lock a resource (shared or exclusive) – UNLOCK: remove a lock
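A minimal PROPFIND request against a hypothetical DAV server using Python's standard library; Depth: 1 asks for the properties of the collection and its immediate members.

    import http.client

    conn = http.client.HTTPConnection("dav.example.com")
    conn.request("PROPFIND", "/users/paul/",
                 headers={"Depth": "1", "Content-Length": "0"})
    resp = conn.getresponse()
    print(resp.status)            # 207 Multi-Status on success
    print(resp.read().decode())   # XML listing each resource's properties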
Page 105
– davfs2: Linux file system driver to mount a DAV server as a file system
– Native filesystem support in OS X (since 10.0) – Microsoft web folders (since Windows 98)
Page 106
– Python application – FUSE userland file system interface
– Read, write, open, close, stat, symlink, link, unlink, truncate, rename, directories
– Subject headers contain file system metadata: group ID, size, etc.
– File data stored in attachments
Page 107
– Point of congestion, single point of failure
– E.g., Coda – Limited replication can lead to congestion – Separate set of machines to administer
– (500 GB disks commodity items @ $45)
Page 108
– See Fraunhofer FS (www.fhgfs.com)
Page 109