SLIDE 1

Scale and Performance in a Distributed File System (AFS)

Howard et al. CMU 1988, ACM TOCS

Presenter: Dhirendra Singh Kholia

SLIDE 2

Outline

  • What is AFS?
  • The Prototype implementation
  • Changes for Performance
  • Effect of Changes for Performance
  • Comparison with NFS
  • Conclusion
  • Q&A
SLIDE 3

AFS (Andrew File System)

  • AFS is a distributed filesystem that enables efficient sharing of storage resources across both local area and wide area networks.

  • Development started at CMU around 1983
  • Goal: 5,000-10,000 nodes (very high scalability!)
  • Scale, yet maintain performance and simple administration

SLIDE 4

AFS

  • Client-Server Architecture
  • Vice: Set of trusted servers
  • Clients run a user-level process called Venus
  • Venus caches files from Vice
  • Caching based on an upload/download (whole-file) transfer model
  • Venus contacts Vice only for open and close operations
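The whole-file transfer model above can be sketched in a few lines. All class and method names here are illustrative, not the real Venus/Vice interfaces:

```python
# Hypothetical sketch of AFS-style whole-file caching: Venus fetches the
# entire file on open, serves reads/writes from the local copy, and ships
# the file back to Vice on close.

class ViceServer:
    """Stands in for a trusted Vice file server."""
    def __init__(self):
        self.files = {}          # path -> file contents

    def fetch(self, path):
        return self.files[path]

    def store(self, path, data):
        self.files[path] = data

class Venus:
    """Stands in for the client-side Venus cache manager."""
    def __init__(self, server):
        self.server = server
        self.cache = {}          # path -> local whole-file copy

    def open(self, path):
        # Contact Vice only on open: download the whole file once.
        if path not in self.cache:
            self.cache[path] = bytearray(self.server.fetch(path))
        return self.cache[path]  # all reads/writes hit this local copy

    def close(self, path):
        # Contact Vice only on close: upload the whole (possibly modified) file.
        self.server.store(path, bytes(self.cache[path]))

server = ViceServer()
server.files["/vice/readme"] = b"hello"
venus = Venus(server)
buf = venus.open("/vice/readme")   # one fetch over the network
buf += b", afs"                    # purely local; no server traffic
```

All reads and writes between open and close are local operations, which is exactly why open/close are the only points of server contact.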
SLIDE 5

The Prototype Implementation

  • Spawned a dedicated process for every client
  • Each server contained a directory hierarchy mirroring the structure of the Vice files:
      – .admin directory: Vice file status information
      – Stub directories: location database embedded in the file tree
  • Pathname resolution done by Vice (servers)
  • Venus verifies a timestamp before using a cached file (open() and stat() force contact with Vice!)

  • Coarse-grained read-only replication
  • Dedicated lock-server process
SLIDE 6

Benchmark Details

  • The benchmark script operates on a collection of source code files
  • 70 files totaling 200 KiB
  • 5 phases: MakeDir, Copy, ScanDir, ReadAll, Make
SLIDE 7

Local FS performance

Benchmark took ~1000 seconds on a Sun2 workstation

SLIDE 8

Prototype Performance

  • 1 Load Unit = 5 Andrew users
  • 70% slower than local FS
  • Doesn’t scale well after 5-8 Load Units

SLIDE 9

Call Distribution

  • TestAuth (validates cache entries) and GetFileStat (gets status information for files not in cache): these 2 calls accounted for almost 90% of total calls!
  • open() and stat() force contact with Vice.
  • Caching works (> 80% hit ratio)
  • “Cache validation driven totally by Venus” is not a good idea

Source: http://dcslab.snu.ac.kr/courses/dip2009f/presentation_old/3.ppt

SLIDE 10

Prototype resource usage

  • 75% CPU utilization over a 5-minute period, 40% over an 8-hour period!
  • CPU is the performance bottleneck!
  • Causes: pathname resolution, excessive context switches
SLIDE 11

Problems with the Prototype

  • High virtual memory paging demands (fork model)
  • High CPU usage
  • Frequently exceeded critical resource limits (especially network-related resources)
  • High frequency of cache validation checks (too many stats)
  • Difficult to move directories around (and thus balance load)
  • Despite all these problems, the prototype was robust, simple, and it worked.

  • “… our users willingly suffered!” 
SLIDE 12

Changes for Performance

  • Cache Management
  • Name Resolution + Low-Level Storage Representation

  • Process structure

Target: Handle at least 50 clients per server.

SLIDE 13

Cache Management

  • Status cache (in virtual memory, for fast stat() performance)
  • Data cache (on local disk)
  • Now caches directory contents and symlinks too!
  • Venus now assumes that cache entries are valid unless otherwise notified by Vice
  • Callback: the server promises to notify the client before allowing a modification. This reduces cache validation traffic and server load.

  • Maintenance of callback state information.
  • There is a potential for inconsistency (how?)
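A minimal sketch of the callback idea, assuming illustrative names (the real Vice/Venus RPC interface differs): the server records a promise per cached file and breaks it before accepting a new version, so clients need not validate on every open.

```python
# Toy model of callback-based cache invalidation (not the real protocol).

class Server:
    def __init__(self):
        self.data = {}           # path -> contents
        self.callbacks = {}      # path -> set of clients holding a promise

    def fetch(self, path, client):
        # Handing out a copy establishes a callback promise for this client.
        self.callbacks.setdefault(path, set()).add(client)
        return self.data[path]

    def store(self, path, contents, writer):
        # Break callbacks first: notify every other client holding a promise.
        for c in self.callbacks.get(path, set()) - {writer}:
            c.break_callback(path)
        self.callbacks[path] = {writer}
        self.data[path] = contents

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, path):
        # Cached copy is assumed valid while the callback promise stands:
        # no validation traffic on a cache hit.
        if path not in self.cache:
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path]

    def break_callback(self, path):
        self.cache.pop(path, None)   # next open() refetches

server = Server()
server.data["/f"] = "v1"
a, b = Client(server), Client(server)
assert a.open("/f") == "v1" and b.open("/f") == "v1"
b.cache["/f"] = "v2"                          # b modifies its local copy...
server.store("/f", b.cache["/f"], writer=b)   # ...and stores it on close
```

The inconsistency window hinted at above shows up here too: if a client is unreachable when its callback is broken, it may keep using a stale copy until it next talks to the server.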
SLIDE 14

Name Resolution + Low-Level Storage Representation

  • Earlier, pathname resolution was done by Vice (a costly implicit namei operation caused server load)
  • Now, Venus maps Vice pathnames to Fids and passes a Fid to Vice
  • 96-bit Fid = 32-bit Volume Number + 32-bit Vnode Number + 32-bit Uniquifier
  • Key idea: eliminate pathname lookups (use Fids on servers and inodes on clients directly)
  • The Volume Number identifies a Volume; the location of a Volume is contained in the Volume Location Database.
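The Fid-based resolution can be illustrated as follows; the field names, cache contents, and server addresses are all hypothetical:

```python
# Sketch of the 96-bit Fid layout described above: clients resolve pathnames
# themselves and hand servers an opaque Fid, so servers never repeat a
# namei-style pathname walk.

from collections import namedtuple

Fid = namedtuple("Fid", ["volume", "vnode", "uniquifier"])  # 32 bits each

# Hypothetical client-side name cache: pathname -> Fid.
name_cache = {
    "/afs": Fid(1, 1, 1),
    "/afs/usr": Fid(1, 7, 1),
    "/afs/usr/paper.txt": Fid(1, 42, 3),
}

# Hypothetical Volume Location Database: volume number -> server address.
vldb = {1: "vice1.example.edu"}

def resolve(path):
    """Venus-style resolution: pathname -> (server, Fid), done on the client."""
    fid = name_cache[path]          # no server involvement in the walk
    return vldb[fid.volume], fid    # the volume number locates the server

server, fid = resolve("/afs/usr/paper.txt")
```

The server only ever sees the Fid, so it can map it straight to storage without touching the namespace.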

SLIDE 15

Process Structure

  • Use fixed number of LWPs within one process.
  • An LWP is bound to a particular client only for the duration of a single server operation

  • User space RPC implementation
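As a rough analogy (not the paper’s LWP package, which is a cooperative user-level threading library), the fixed-pool idea can be modeled with a bounded worker pool: a small, fixed number of workers inside one process, each bound to a request only for the duration of that one operation.

```python
# Approximation of the fixed-LWP design with a bounded worker pool.
# Unlike the prototype, no per-client process is ever spawned.

from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4                      # fixed, regardless of client count

def serve(request):
    # Stand-in for one server operation (fetch, store, validate, ...).
    client, op = request
    return f"{client}:{op}:done"

# 20 clients issue requests, but only 4 workers exist.
requests = [(f"client{i}", "fetch") for i in range(20)]

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    results = list(pool.map(serve, requests))
```

The worker is free again as soon as the operation returns, which is what lets a small fixed pool service many clients.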
SLIDE 16

AFS Consistency Semantics

  • Visibility of writes to an open file by a process on a workstation is limited to that particular workstation
  • Commit on close (write-on-close): changes become visible to new opens; existing open instances do not see the changes
  • All other file operations are visible everywhere immediately
  • No implicit locking; multiple clients can perform the same operation on a file concurrently
  • Applications have to cooperate and manage synchronization
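A toy model of these close-to-open semantics (all names illustrative):

```python
# Write-on-close: writes commit at close() and are visible only to opens
# that happen afterwards; already-open instances keep their snapshot.

class ViceFile:
    """Server-side file: holds the committed (stored) contents."""
    def __init__(self):
        self.committed = ""

class OpenInstance:
    """Client-side open instance: a whole-file snapshot taken at open()."""
    def __init__(self, f):
        self.f = f
        self.snapshot = f.committed

    def write(self, text):
        self.snapshot += text              # local only; invisible elsewhere

    def close(self):
        self.f.committed = self.snapshot   # commit on close

f = ViceFile()
writer = OpenInstance(f)
writer.write("new data")
reader_before = OpenInstance(f)    # opened before the writer closes
writer.close()
reader_after = OpenInstance(f)     # opened after the close
```

Note that if two instances write concurrently, whichever closes last wins, which is exactly the consistency question raised in the Q&A slides later.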

SLIDE 17

Effect of Changes for Performance

SLIDE 18

Effect of Changes for Performance

  • Only 19% slower than a stand-alone workstation
  • ScanDir and ReadAll phases almost independent of load!
  • Scales well and the target of 50 clients is easily met!
SLIDE 19

Comparison with NFS (Remote Open)

  • File Data is not fetched in one go
  • Advantage of remote-open model: Low Latency
SLIDE 20

Comparison with NFS (Time)

SLIDE 21

Comparison with NFS (CPU)

SLIDE 22

Comparison with NFS (Disk)

SLIDE 23

Comparison Report

  • NFS failed to work properly at high loads!
  • For 1 LU, NFS generated ~3 times as many packets as AFS
  • NFS’s performance degrades rapidly with load
  • NFS saturated CPU and disk and still couldn’t keep up (despite the fact that it operates entirely in the kernel!)
  • NFS doesn’t scale well (actually, it doesn’t seem to scale at all) 

SLIDE 24

Changes for Operability

  • A Volume is a collection of files. Each user is assigned a Volume.
  • A Volume is like a mini-filesystem in itself. It can grow/shrink in size.
  • Volumes allow quotas, consistent backups, read-only replication, and painless live migration of data
  • Volumes keep the size of the Volume Location Database manageable
  • The Volume abstraction is indispensable!
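A hedged sketch of the Volume abstraction, with illustrative names: a volume groups files, carries a quota, and can migrate between servers by updating only its Volume Location Database entry.

```python
# Toy volume with a quota, plus a Volume Location Database mapping
# volume names to their current server. Migration flips one VLDB entry;
# clients keep addressing the same volume name.

class Volume:
    def __init__(self, quota):
        self.quota = quota        # size limit in KiB (illustrative units)
        self.files = {}           # name -> size in KiB

    def used(self):
        return sum(self.files.values())

    def add(self, name, size):
        if self.used() + size > self.quota:
            raise OSError("quota exceeded")
        self.files[name] = size

vldb = {"user.alice": "server1"}  # volume name -> current server

vol = Volume(quota=1000)
vol.add("thesis.tex", 800)

# "Live migration": copy the volume's data across, then flip one VLDB
# entry; no client-visible pathname changes.
vldb["user.alice"] = "server2"
```

This is why volumes make load balancing painless compared with the prototype, where location information was embedded in the server file tree itself.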
SLIDE 25

Conclusion

  • Only problems I see are:
      A) limit of 64K files per directory
      B) whole-file caching (making it slow for big files)
  • Overall, AFS is awesome 
SLIDE 26

Questions - Scaling

  • Do they ever reach their goal of 5,000 workstations? Or are distributed file systems fundamentally flawed, unable to scale indefinitely?
  • Yes, AFS should be able to manage that magical number. (http://www.openafs.org/success.html)

SLIDE 27

Questions - Locking

  • Isn't the lack of any form of synchronization amongst the files dangerous? 4.2BSD doesn’t lock files implicitly, and AFS conforms to those semantics. Yes, it seems dangerous, but even under modern *NIX, locks are advisory by default, which again requires applications to behave “correctly”.
  • Couldn't a single badly written program corrupt a whole lot of important data? Blame the program then 

SLIDE 28

Questions - Caching

  • Is the caching of the entire file a good idea, given the huge size of files these days? Latency is a big problem with the whole-file transfer model. Even for a 24KB file the latency was ~0.5 seconds (quite noticeable!). For huge files it would get quite a bit worse (linearly, though). However, whole-file transfer is the key to AFS scaling! Let’s discuss this.
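A back-of-envelope check of the linearity claim, assuming the ~0.5 s for 24KB figure quoted above holds as an effective transfer rate (these are 1980s-era numbers, used purely for illustration):

```python
# If fetching a 24 KiB file took ~0.5 s, the implied effective rate is
# ~48 KiB/s, and whole-file fetch latency grows linearly with file size.

rate_kib_per_s = 24 / 0.5                 # ~48 KiB/s effective

def fetch_latency_s(size_kib):
    return size_kib / rate_kib_per_s

small = fetch_latency_s(24)               # the ~0.5 s case from the slide
large = fetch_latency_s(10 * 1024)        # a hypothetical 10 MiB file
```

Linear is at least predictable, but a multi-minute open() for a 10 MiB file shows why latency is the weak point of the whole-file model.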

SLIDE 29

Questions - Caching

  • Do servers remove callbacks for expired cache items on clients? If so, how would a server know what the workstation has cached, and which items have expired? Will the workstation notify the server about expired cache items?
  • Yes, Venus executes RemoveCallBack (while flushing an item out of the cache), which tells the server which file to remove the callback from.

SLIDE 30

Questions - Locking

  • The authors state that user-level file locking was implemented by a dedicated lock-server process. How does this centralized locking mechanism affect scalability? Locking is not done implicitly, so only particular applications will actually use the lock mechanism.

SLIDE 31

Questions

  • Embedding of file location information in the file storage structure made movement of files across servers difficult, because it required "structural modifications to storage on the servers"... what structural modifications does that mean?
  • My guess: moving a part of the namespace would require a new partition on the new server (since only entire disk partitions could be mounted, and the existing partition could not serve as another mount point).

SLIDE 32

Questions – Cache Size

  • Diskless operation is possible but slow, and files that are larger than the local disk cache cannot be accessed at all. Why couldn't they be accessed using the same slow method as diskless operation? A file always has to fit in the cache (memory or disk)!

SLIDE 33

Questions - Consistency

  • This paper does not mention file conflicts (i.e. users modifying stale copies of files). Are file conflicts possible?
  • What happens when Client A and Client B open and begin modifying the same file? If Client A closes the file first and B closes second, are the changes done by Client A lost? Can the server refuse the close() for Client B because it knows that the callback for B is missing/broken?

SLIDE 34

Questions

  • It seems like the performance of AFS will be quite low for small updates to huge files. So how can we overcome this problem? Will the performance of the system be hampered if small updates to huge files happen very often? Conceptually, something similar to rsync could be used to handle this.

SLIDE 35

Questions – Threading Model

  • AFS uses a single process with a fixed number of LWPs to service clients. Will this design cause problems? For example, the process may be blocked at times. What about fault-tolerance capability?
  • Yes, the N:1 cooperative threading model described in the paper (LWP) is pretty limited (e.g. it can’t use multiple cores). It seems AFS can use both POSIX threads and LWP. On Linux it uses POSIX threads (NPTL), which results in a 1:1 threading model.

SLIDE 36

Questions

  • Why does today’s remote file system look more like NFS (remote open) than AFS? Huge file sizes?
  • Which model is “more” reasonable? (remote-open OR whole-file transfer)
  • Why do we need a DFS? What is the best case to use a DFS?