Distributed Storage and Consistency

Storage moves into the net
- Network delays
- Network cost
- Network bandwidth
- Storage capacity/volume
- Administrative cost
Shared storage with scalable bandwidth and capacity.
Consolidate - multiplex - decentralize - replicate.
Reconfigure to mix-and-match loads and resources.
Storage as a service

SSP: Storage Service Provider
ASP: Application Service Provider
Outsourcing: storage and/or applications as a service.
For ASPs (e.g., Web services), storage is just a component.
Storage Abstractions

- relational database (IBM and Oracle)
tables, transactions, query language
- file system
hierarchical name space of files with ACLs
Each file is a linear space of fixed-size blocks.
- block storage
SAN, Petal, RAID-in-a-box (e.g., EMC)
Each logical unit (LU) or volume is a linear space of fixed-size blocks.
- object storage
object == file, with a flat name space: NASD, DDS, Porcupine
Varying views of the object size: NASD/OSD/Slice objects may act as large-ish “buckets” that aggregate file system state.
- persistent objects
pointer structures, requires transactions: OODB, ObjectStore
Network Block Storage
One approach to scalable storage is to attach raw block storage to a network.
- Abstraction: OS addresses storage by <volume, sector>.
iSCSI, Petal, FibreChannel: access through special device driver
- Dedicated Storage Area Network or general-purpose network.
FibreChannel (FC) vs. Ethernet
- Volume-based administrative tools
backup, volume replication, remote sharing
- Called “raw” or “block”, “storage volumes” or just “SAN”.
- Least common denominator for any file system or database.
“NAS vs. SAN”
In the commercial sector there is a raging debate today about “NAS vs. SAN”.
- Network-Attached Storage has been the dominant approach to shared storage since NFS.
NAS == NFS or CIFS: named files over Ethernet/Internet. E.g., Network Appliance “filers”
- Proponents of FibreChannel SANs market them as a fundamentally faster way to access shared storage.
no “indirection through a file server” (“SAD”)
lower overhead on clients
network is better/faster (if not cheaper) and dedicated/trusted
Brocade, HP, Emulex are some big players.
NAS vs. SAN: Cutting through the BS
- FibreChannel is a high-end technology incorporating NIC enhancements to reduce host overhead....
...but bogged down in interoperability problems.
- Ethernet is getting faster, faster than FibreChannel.
gigabit, 10-gigabit, + smarter NICs, + smarter/faster switches
- Future battleground is Ethernet vs. Infiniband.
- The choice of network is fundamentally orthogonal to storage service design.
Well, almost: flow control, RDMA, user-level access (DAFS/VI)
- The fundamental questions are really about abstractions.
shared raw volume vs. shared file volume vs. private disks
Storage Architecture
Any of these abstractions can be built using any, some, or all of the others.
Use the “right” abstraction for your application.
Basic operations: create/remove, open/close, read/write. The fundamental questions are:
- What is the best way to build the abstraction you want?
division of function between device, network, server, and client
- What level of the system should implement the features and properties you want?
Duke Mass Storage Testbed

[Diagram: IBM Shark/HSM storage on the campus FC net, reached from the Med Ctr and the Brain Lab over their IP LANs.]
Goal: managed storage on demand for cross-disciplinary research.
Direct SAN access for “power clients” and NAS PoPs; other clients access through NAS.
Problems
poor interoperability
- Must have a common volume layout across heterogeneous SAN clients.
poor sharing control
- The granularity of access control is an entire volume.
- SAN clients must be trusted.
- SAN clients must coordinate their access.
$$$
Duke Storage Testbed, v2.0

[Diagram: IBM Shark/HSM on the campus FC net; Med Ctr and Brain Lab clients reach it over the campus IP net through NAS PoPs.]
Each SAN volume is managed by a single NAS PoP.
All access to each volume is mediated by its NAS PoP.
Testbed v2.0: pro and con

Supports resource sharing and data sharing.
Does not leverage Fibre Channel investment.
Does not scale access to individual volumes.
Prone to load imbalances.
Data crosses campus IP network in the clear.
Identities and authentication must be centrally administered.
It’s only as good as the NAS clients, which tend to be fair at best.
Sharing Network Storage
How can we control sharing to a space of files or blocks?
- Access control etc.
- Data model and storage abstraction
- Caching
- Optimistic replication
Consistency
- One-copy consistency vs. weak consistency
- Read-only (immutable) files?
- Read-mostly files with weak consistency?
- Write-anywhere files?
File/Block Cache Consistency
- Basic write-ownership protocol.
Distributed shared memory (software DSM)
- Timestamp validation (NFS).
Timestamp each cache entry, and periodically query the server: “has this file changed since time t?”; invalidate cache if stale.
- Callback invalidation (AFS, Sprite, Spritely NFS).
Request notification (callback) from the server if the file changes; invalidate cache and/or disable caching on callback.
- Leases (NQ-NFS, NFSv4, DAFS)
[Gray&Cheriton89,Macklem93]
Software DSM 101
Software-based distributed shared memory (DSM) provides an illusion of shared memory on a cluster.
- remote-fork the same program on each node
- data resides in common virtual address space
library/kernel collude to make the shared VAS appear consistent
- The Great War: shared memory vs. message passing
for the full story, take CPS 221
[Diagram: cluster nodes, each with its own physical memory, connected by a switched interconnect.]
Page Based DSM (Shared Virtual Memory)
Virtual address space is shared
[Diagram: a shared Virtual Address Space mapped onto the physical DRAM of each node.]
The Sequential Consistency Memory Model
[Diagram: processors P1, P2, P3 connect through a switch to a single Memory.]
- “sequential” processors issue memory ops in program order
- the switch, randomly set after each memory op, ensures some serial order among all operations
Easily implemented with a shared bus.
For page-based DSM, weaker consistency models may be useful... but that’s for later.
Inside Page-Based DSM (SVM)
The page-based approach uses a write-ownership token protocol on virtual memory pages.
- Kai Li [Ivy SVM, 1986], Paul Leach [Apollo, 1982]
- Each node maintains per-node per-page access mode.
{shared, exclusive, no-access} determines local accesses allowed
For SVM, modes are enforced with VM page protection

mode       load (read)  store (write)
shared     yes          no
exclusive  yes          yes
no-access  no           no
Write-Ownership Protocol
A write-ownership protocol guarantees that nodes observe sequential consistency of memory accesses:
- Any node with any access has the latest copy of the page.
On any transition from no-access, fetch current copy of page.
- A node with exclusive access holds the only copy.
At most one node may hold a page in exclusive mode. On transition into exclusive, invalidate all remote copies and set their mode to no-access.
- Multiple nodes may hold a page in shared mode.
Permits concurrent reads: every holder has the same data. On transition into shared mode, revoke any exclusive remote copy and downgrade its mode to shared as well.
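To make the protocol concrete, here is a minimal sketch of the two fault handlers a node might run; the helpers (fetch_latest_copy, invalidate_remote_copies, downgrade_owner_to_shared, set_protection) are hypothetical stand-ins for the manager/directory protocol, and this is not Ivy's actual code.

/* Sketch of write-ownership fault handling for one page.
 * Error handling, locking, and the directory protocol are omitted. */
#include <sys/mman.h>        /* PROT_READ, PROT_WRITE */

enum mode { NO_ACCESS, SHARED, EXCLUSIVE };

struct page_state {
    enum mode mode;          /* this node's current access mode */
    void     *addr;          /* page-aligned virtual address */
};

extern void fetch_latest_copy(void *addr);            /* hypothetical helpers */
extern void invalidate_remote_copies(void *addr);     /* other copies -> no-access */
extern void downgrade_owner_to_shared(void *addr);    /* exclusive holder -> shared */
extern void set_protection(void *addr, int prot);     /* wraps mprotect() */

/* Load fault: the node needs at least shared access. */
void on_read_fault(struct page_state *p)
{
    if (p->mode == NO_ACCESS)
        fetch_latest_copy(p->addr);       /* any holder has the latest copy */
    downgrade_owner_to_shared(p->addr);   /* revoke exclusivity, if any */
    p->mode = SHARED;
    set_protection(p->addr, PROT_READ);   /* loads allowed, stores still trap */
}

/* Store fault: the node needs exclusive access. */
void on_write_fault(struct page_state *p)
{
    if (p->mode == NO_ACCESS)
        fetch_latest_copy(p->addr);
    invalidate_remote_copies(p->addr);    /* this becomes the only copy */
    p->mode = EXCLUSIVE;
    set_protection(p->addr, PROT_READ | PROT_WRITE);
}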
Network File System (NFS)

[Diagram: on the client, user programs enter the syscall layer and VFS, which routes each request to the NFS client or a local *FS; the NFS client talks to the NFS server over RPC on UDP or TCP; the server's VFS routes to its local *FS.]
NFS Protocol
NFS is a network protocol layered above TCP/IP.
- Original implementations (and most today) use UDP datagram transport for low overhead.
Maximum IP datagram size was increased to match FS block size, to allow send/receive of entire file blocks. Some implementations use TCP as a transport.
- The NFS protocol is a set of message formats and types.
Client issues a request message for a service operation. Server performs requested operation and returns a reply message with status and (perhaps) requested data.
File Handles
Question: how does the client tell the server which file or directory the operation applies to?
- Similarly, how does the server return the result of a lookup?
More generally, how to pass a pointer or an object reference as an argument/result of an RPC call?
In NFS, the reference is a file handle or fhandle, a token/ticket whose value is determined by the server.
- Includes all information needed to identify the file/object on the server, and find it quickly.
fhandle contents: volume ID, inode #, generation #
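As a rough illustration, an fhandle carrying those three fields might look like the struct below; the layout is an assumption for this sketch, since real NFS handles are opaque bytes whose internal format is chosen by (and meaningful only to) the server.

#include <stdint.h>

/* Illustrative layout only; clients must treat the handle as opaque. */
struct fhandle {
    uint32_t volume_id;    /* which exported volume (filesystem) */
    uint32_t inode_num;    /* inode number within that volume: find it quickly */
    uint32_t generation;   /* bumped when the inode is reused, so a handle to a
                              deleted-and-recreated file is detected as stale */
};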
Consistency for File Systems
How is the consistency problem different for network file systems, relative to DSM/SVM?
Note: The CDK text includes a lot of detail about the kernel implementation issues for these file systems. These are interesting and useful, but in this course we focus on the distribution aspects.
NFS as a “Stateless” Service
A classical NFS server maintains no in-memory hard state.
The only hard state is the stable file system image on disk.
- no record of clients or open files
- no implicit arguments to requests
E.g., no server-maintained file offsets: read and write requests must explicitly transmit the byte offset for each operation (see the sketch after this list).
- no write-back caching on the server
- no record of recently processed requests
- etc., etc....
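As referenced above, here is a minimal sketch of what “no implicit arguments” means for a read request, with illustrative field names (the real NFSv2/v3 arguments are defined in XDR): every request names the file and the byte range explicitly, so the server keeps no per-client open-file state.

#include <stdint.h>

struct nfs_read_args {
    unsigned char fh[32];   /* opaque file handle: which file, no open-file table */
    uint64_t      offset;   /* byte offset, carried explicitly on every request */
    uint32_t      count;    /* number of bytes to read */
};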
“Statelessness makes failure recovery simple and efficient.”
Recovery in Stateless NFS
If the server fails and restarts, there is no need to rebuild in-memory state on the server.
- Client reestablishes contact (e.g., TCP connection).
- Client retransmits pending requests.
Classical NFS uses a connectionless transport (UDP).
- Server failure is transparent to the client; no connection to break or reestablish.
A crashed server is indistinguishable from a slow server.
- Sun/ONC RPC masks network errors by retransmitting a
request after an adaptive timeout.
A dropped packet is indistinguishable from a crashed server.
Drawbacks of a Stateless Service
The stateless nature of classical NFS has compelling design advantages (simplicity), but also some key drawbacks:
- Recovery-by-retransmission constrains the server interface.
ONC RPC/UDP has execute-mostly-once semantics (“send and pray”), which compromises performance and correctness.
- Update operations are disk-limited.
Updates must commit synchronously at the server.
- NFS cannot (quite) preserve local single-copy semantics.
Files may be removed while they are open on the client. Server cannot help in client cache consistency.
Let’s look at the consistency problem...
Timestamp Validation in NFS [1985]

NFSv2 and NFSv3 cache consistency uses a form of timestamp validation, like today’s Web.
- Timestamp cached data at file grain.
- Maintain per-file expiration time (TTL)
- Probe for new timestamp to revalidate if cache TTL has expired.
Get attributes (getattr)
Key difference: NFS file cache and access primitives are block-grained, and the client may issue many operations in sequence on the same file.
- Clustering: File-grained timestamp for block-grained cache
- Piggyback file attributes on each response
- Adaptive TTL
What happens on server failure? Client failure?
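A minimal sketch of the client-side check, assuming hypothetical helpers nfs_getattr(), invalidate_blocks(), and adaptive_ttl(); real clients also piggyback attributes on other replies and cluster revalidations as noted above.

#include <time.h>

struct file_cache {
    time_t mtime_seen;    /* server modify time when we last validated */
    time_t ttl_expires;   /* trust cached blocks without a probe until then */
    /* ... block-grained cached data ... */
};

extern int  nfs_getattr(const void *fh, time_t *mtime_out);  /* GETATTR RPC */
extern void invalidate_blocks(struct file_cache *fc);
extern int  adaptive_ttl(const struct file_cache *fc);       /* seconds */

/* Called before using cached blocks of the file. */
void revalidate(struct file_cache *fc, const void *fh)
{
    time_t now = time(NULL);
    if (now < fc->ttl_expires)
        return;                            /* TTL still fresh: no probe */

    time_t mtime;
    nfs_getattr(fh, &mtime);               /* "has this file changed since t?" */
    if (mtime != fc->mtime_seen) {
        invalidate_blocks(fc);             /* file changed: drop cached blocks */
        fc->mtime_seen = mtime;
    }
    fc->ttl_expires = now + adaptive_ttl(fc);
}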
AFS [1985]
AFS is an alternative to NFS developed at CMU.
Duke still uses it.
Designed for wide area file sharing:
- Internet is large and growing exponentially.
- Global name hierarchy with local naming contexts and location info embedded in fully qualified names.
Much like DNS
- Security features, with per-domain authentication / access control.
- Whole file caching or 64KB chunk caching
Amortize request/transfer cost
- Client uses a disk cache
Cache is preserved across client failure. Again, it looks a lot like the Web.
Callback Invalidations in AFS-2
AFS-1 uses timestamp validation like NFS; AFS-2 uses callback invalidations.
- Server returns “callback promise” token with file access.
Like ownership protocol, confers a right to cache the file. Client caches the token on its disk.
- Token states: {valid, invalid, cancelled}
- On a sharing collision, server cancels token with a callback.
Client invalidates cached copy of the associated file. Detected on client write to server: last writer wins. (No distinction between read/write token.)
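A minimal sketch of client-side token handling, assuming hypothetical fetch_whole_file() and drop_cached_copy() helpers; real AFS-2 also writes the token to the client's disk cache and applies the timeout T discussed on the next slide.

enum token_state { TOKEN_VALID, TOKEN_INVALID, TOKEN_CANCELLED };

struct afs_cache_entry {
    enum token_state token;   /* callback promise state */
    /* ... whole file or 64KB chunks cached on the client disk ... */
};

extern void fetch_whole_file(struct afs_cache_entry *e, const char *fid);
extern void drop_cached_copy(struct afs_cache_entry *e);

/* Server breaks the promise when another client stores the file. */
void on_callback(struct afs_cache_entry *e)
{
    e->token = TOKEN_CANCELLED;
    drop_cached_copy(e);               /* cached copy may now be stale */
}

/* Open: cached data may be used only under a valid promise. */
void afs_open(struct afs_cache_entry *e, const char *fid)
{
    if (e->token != TOKEN_VALID) {
        fetch_whole_file(e, fid);      /* fetch data and a fresh callback promise */
        e->token = TOKEN_VALID;
    }
}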
Issues with AFS Callback Invalidations
What happens after a failure?
- Client invalidates its tokens on client restart.
Invalid tokens may be revalidated, like NFS getattr or WWW.
- Server must remember tokens across server restart.
- Can the client distinguish a server failure from a network failure?
- Client invalidates tokens after a timeout interval T if the client has no communication with the server.
Weakens consistency in failures.
Then there’s the problem of update semantics: two clients may be actively updating the same file at the same time.
NQ-NFS Leases
In NQ-NFS, a client obtains a lease on the file that permits the client’s desired read/write activity.
“A lease is a ticket permitting an activity; the lease is valid until some expiration time.”
- A read-caching lease allows the client to cache clean data.
Guarantee: no other client is modifying the file.
- A write-caching lease allows the client to buffer modified data for the file.
Guarantee: no other client has the file cached. Allows delayed writes: client may delay issuing writes to improve write performance (i.e., client has a writeback cache).
Using NQ-NFS Leases
- 1. Client NFS piggybacks lease requests for a given file on I/O operation requests (e.g., read/write).
NQ-NFS leases are implicit and distinct from file locking.
- 2. The server determines if it can safely grant the request, i.e., does it conflict with a lease held by another client.
read leases may be granted simultaneously to multiple clients
write leases are granted exclusively to a single client
- 3. If a conflict exists, the server may send an eviction notice to the holder of the conflicting lease.
If a client is evicted from a write lease, it must write back.
Grace period: server grants extensions while the client writes.
Client sends vacated notice when all writes are complete.
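A minimal sketch of the server's grant decision, assuming a per-file record of outstanding leases and a hypothetical send_eviction_notice(); timers, the grace period, and vacated notices are omitted.

#include <time.h>

enum lease_type { LEASE_READ, LEASE_WRITE };

struct lease {
    int             client_id;
    enum lease_type type;
    time_t          expires;    /* leases are valid only until this time */
};

extern void send_eviction_notice(int client_id);   /* hypothetical callback RPC */

/* Return 1 if the lease can be granted now; 0 if the requester must wait
 * for conflicting holders to write back and vacate (or to expire). */
int try_grant(struct lease held[], int nheld,
              int client_id, enum lease_type want, time_t now)
{
    int conflict = 0;
    for (int i = 0; i < nheld; i++) {
        struct lease *l = &held[i];
        if (l->expires <= now || l->client_id == client_id)
            continue;                          /* expired, or our own lease */
        if (want == LEASE_WRITE || l->type == LEASE_WRITE) {
            send_eviction_notice(l->client_id);
            conflict = 1;                      /* evicted writer must write back */
        }
        /* read lease vs. read lease: compatible, granted simultaneously */
    }
    return conflict ? 0 : 1;
}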
NQ-NFS Lease Recovery
Key point: the bounded lease term simplifies recovery.
- Before a lease expires, the client must renew the lease.
- What if a client fails while holding a lease?
Server waits until the lease expires, then unilaterally reclaims the lease; client forgets all about it. If a client fails while writing on an eviction, server waits for write slack time before granting conflicting lease.
- What if the server fails while there are outstanding leases?
Wait for lease period + clock skew before issuing new leases.
- Recovering server must absorb lease renewal requests and/or writes for vacated leases.
NQ-NFS Leases and Cache Consistency
- Every lease contains a file version number.
Invalidate the cache iff the version number has changed.
- Clients may disable client caching when there is concurrent write sharing.
no-caching lease (Sprite)
- What consistency guarantees do NQ-NFS leases provide?
Does the server eventually receive/accept all writes?
Does the server accept the writes in order?
Are groups of related writes atomic?
How are write errors reported?
What is the relationship to NFS V3 commit?
The Distributed Lock Lab
The lock implementation is similar to DSM systems, with reliability features similar to distributed file caches.
- use Java RMI
- lock token caching with callbacks
lock tokens passed through the server, not peer-to-peer as in DSM
- synchronizes multiple threads on same client
- state bit for pending callback on client
- server must reissue callback each lease interval (or use RMI timeouts to detect a failed client)
- client must renew token each lease interval
Remote Method Invocation (RMI)

[Diagram: a client app's stub and a server app's skeleton communicate through their RMI layers and transports; the RMI registry holds bindings obj1, obj2, obj3.]
1: Naming.bind(URL, obj1) on the server
2: stub1 = Naming.lookup(URL) on the client
3: stub2 = stub1->method()
The registry provides a bootstrap naming service using URLs, e.g., rmi://slowww.server.edu/object1.
RMI is “RPC in Java”, supporting Emerald-like distributed object references, invocation, and garbage collection, derived from SRC Modula-3 network objects [SOSP 93].
Background Slides

These slides were not discussed. I use them in CPS 210, the operating systems course. They provide useful background for the material on NFS.
Cluster File Systems

[Diagram: multiple storage clients, each running a cluster FS, share a block storage service (FC/SAN, Petal, NASD).]
Examples: xFS [Dahlin95], Petal/Frangipani [Lee/Thekkath], GFS, Veritas, EMC Celerra.
Issues: trust; compatibility with NAS protocols; sharing, coordination, and recovery.
Sharing and Coordination

[Diagram: *FS clients reach *FS services over NAS or a “SAN”, backed by a storage service plus a lock manager.]
Issues: block allocation and layout; locking/leases and their granularity; shared access; a separate lock service; logging and recovery; network partitions; reconfiguration.
What does Frangipani need from Petal? How does Petal contribute to F’s *ility? Could we build Frangipani without Petal?
A Typical Unix File Tree

File trees are built by grafting volumes from different devices or from network servers.
Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host.
[Diagram: the root volume holds /, tmp, usr, etc, bin, vmunix, ls, sh; a second volume (its root containing project, users, packages, tex, emacs) is grafted at a mount point.]
In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.
mount(coveredDir, volume)
coveredDir: directory pathname
volume: device specifier or network volume
volume root contents become visible at pathname coveredDir
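A minimal sketch of issuing the graft from a privileged C program, using the Linux mount(2) form as one concrete example; the device path, mount point, and filesystem type here are illustrative assumptions, not from the slide.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Graft the volume on /dev/sdb1 into the tree at /mnt/project (both
     * names are illustrative).  While mounted, the volume root's contents
     * are visible at /mnt/project and the covered directory is hidden. */
    if (mount("/dev/sdb1", "/mnt/project", "ext4", 0, NULL) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}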
Filesystems
Each file volume (filesystem) has a type, determined by its disk layout or the network protocol used to access it.
ufs (ffs), lfs, nfs, rfs, cdfs, etc.
Filesystems are administered independently.
Modern systems also include “logical” pseudo-filesystems in the naming tree, accessible through the file syscalls.
procfs: the /proc filesystem allows access to process internals. mfs: the memory file system is a memory-based scratch store.
Processes access filesystems through common system calls.
VFS: the Filesystem Switch

[Diagram: user space sits above the syscall layer (file, uio, etc.); the Virtual File System (VFS) dispatches to NFS, FFS, LFS, *FS, etc.; NFS runs over the network protocol stack (TCP/IP), while local filesystems use the device drivers.]
Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.
VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.
VFS was an internal kernel restructuring with no effect on the syscall interface. Incorporates object-oriented concepts: a generic procedural interface with multiple implementations. Based on abstract objects with dynamic method binding by type...in C.
Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
Vnodes

In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.
[Diagram: the syscall layer holds references to active vnodes; each generic vnode points at its NFS or UFS specific struct; inactive vnodes sit on a free list.]
Each vnode has a standard file attributes struct.
Vnode operations are macros that vector to filesystem-specific procedures.
The generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
Each specific file system maintains a cache of its resident vnodes.
Vnode Operations and Attributes

directories only:
vop_lookup (OUT vpp, name)
vop_create (OUT vpp, name, vattr)
vop_remove (vp, name)
vop_link (vp, name)
vop_rename (vp, name, tdvp, tvp, name)
vop_mkdir (OUT vpp, name, vattr)
vop_rmdir (vp, name)
vop_symlink (OUT vpp, name, vattr, contents)
vop_readdir (uio, cookie)
vop_readlink (uio)

files only:
vop_getpages (page**, count, offset)
vop_putpages (page**, count, sync, offset)
vop_fsync ()

generic operations:
vop_getattr (vattr)
vop_setattr (vattr)
vhold()
vholdrele()

vnode attributes (vattr):
type (VREG, VDIR, VLNK, etc.)
mode (9+ bits of permissions)
nlink (hard link count)
owner user ID
owner group ID
filesystem ID
unique file ID
file size (bytes and blocks)
access time
modify time
generation number
V/Inode Cache

[Diagram: vnodes are located via HASH(fsid, fileid); inactive vnodes hang off the VFS free list head.]
Active vnodes are reference- counted by the structures that hold pointers to them.
- system open file table
- process current directory
- file system mount points
- etc.
Each specific file system maintains its own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
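A minimal sketch of the reference-counting discipline behind vref/vrele/vget, with simplified list handling; real BSD kernels add locking, the per-FS hash lookup, and vgone() teardown.

struct vnode {
    int           refcount;       /* holders: open files, cwd, mount points, ... */
    struct vnode *freelist_next;
    /* ... vattr, pointer to FS-specific inode/rnode ... */
};

static struct vnode *vfs_free_list;   /* inactive vnodes, eligible for reuse */

void vref(struct vnode *vp)           /* take a reference on an active vnode */
{
    vp->refcount++;
}

void vrele(struct vnode *vp)          /* drop a reference */
{
    if (--vp->refcount == 0) {
        vp->freelist_next = vfs_free_list;   /* inactive: cache it on the free list */
        vfs_free_list = vp;
    }
}

struct vnode *vget(struct vnode *vp)  /* reclaim a cached inactive vnode */
{
    struct vnode **pp = &vfs_free_list;
    while (*pp && *pp != vp)          /* unlink vp from the free list */
        pp = &(*pp)->freelist_next;
    if (*pp)
        *pp = vp->freelist_next;
    vp->refcount = 1;
    return vp;
}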
Pathname Traversal
When a pathname is passed as an argument to a system call, the syscall layer must “convert it to a vnode”.
Pathname traversal is a sequence of vop_lookup calls to descend the tree to the named file or directory.
open(“/tmp/zot”)
vp = get vnode for / (rootdir)
vp->vop_lookup(&cvp, “tmp”);
vp = cvp;
vp->vop_lookup(&cvp, “zot”);
Issues:
- 1. crossing mount points
- 2. obtaining root vnode (or current dir)
- 3. finding resident vnodes in memory
- 4. caching name->vnode translations
- 5. symbolic (soft) links
- 6. disk implementation of directories
- 7. locking/referencing to handle races with name create and delete operations
Problem 1: Retransmissions and Idempotency
For a connectionless RPC transport, retransmissions can saturate an overloaded server.
Clients “kick ‘em while they’re down”, causing a steep hockey stick.
Execute-at-least-once constrains the server interface.
- Service operations should/must be idempotent.
Multiple executions should/must have the same effect.
- Idempotent operations cannot capture the full semantics we expect from our file system.
remove, append-mode writes, exclusive create
Solutions to the Retransmission Problem
- 1. Hope for the best and smooth over non-idempotent requests.
E.g., map ENOENT and EEXIST to ESUCCESS.
- 2. Use TCP or some other transport protocol that produces reliable, in-order delivery.
higher overhead...and we still need sessions.
- 3. Implement an execute-at-most once RPC transport.
TCP-like features (sequence numbers)...and sessions.
- 4. Keep a retransmission cache on the server [Juszczak90].
Remember the most recent request IDs and their results, and just resend the result... does this violate statelessness?
DAFS persistent session cache.
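A minimal sketch of a server-side retransmission (duplicate request) cache in the spirit of [Juszczak90], assuming a small fixed table keyed by client and transaction ID; eviction policy and persistence (as in the DAFS session cache) are omitted.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DRC_SLOTS 128

struct drc_entry {
    uint64_t client_id;   /* which client sent the request */
    uint32_t xid;         /* RPC transaction ID chosen by the client */
    int      valid;
    char     reply[512];  /* saved reply to resend on a retransmission */
    size_t   reply_len;
};

static struct drc_entry drc[DRC_SLOTS];

/* Return the cached reply if this (client, xid) was already executed, so the
 * server can resend it instead of re-executing a non-idempotent operation. */
struct drc_entry *drc_lookup(uint64_t client_id, uint32_t xid)
{
    struct drc_entry *e = &drc[xid % DRC_SLOTS];
    if (e->valid && e->client_id == client_id && e->xid == xid)
        return e;
    return NULL;
}

/* Record the reply after executing a request for the first time. */
void drc_insert(uint64_t client_id, uint32_t xid, const void *reply, size_t len)
{
    struct drc_entry *e = &drc[xid % DRC_SLOTS];   /* may evict an older entry */
    e->client_id = client_id;
    e->xid = xid;
    e->reply_len = len < sizeof e->reply ? len : sizeof e->reply;
    memcpy(e->reply, reply, e->reply_len);
    e->valid = 1;
}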
Problem 2: Synchronous Writes
Stateless NFS servers must commit each operation to stable storage before responding to the client.
- Interferes with FS optimizations, e.g., clustering, LFS, and disk write ordering (seek scheduling).
Damages bandwidth and scalability.
- Imposes disk access latency for each request.
Not so bad for a logged write; much worse for a complex operation like an FFS file write.
The synchronous update problem occurs for any storage service with reliable update (commit).
Speeding Up Synchronous NFS Writes
Interesting solutions to the synchronous write problem, used in high-performance NFS servers:
- Delay the response until convenient for the server.
E.g., NFS write-gathering optimizations for clustered writes (similar to group commit in databases). Relies on write-behind from NFS I/O daemons (iods).
- Throw hardware at it: non-volatile memory (NVRAM)
Battery-backed RAM or UPS (uninterruptible power supply). Use as an operation log (Network Appliance WAFL)... ...or as a non-volatile disk write buffer (Legato).
- Replicate server and buffer in memory (e.g., MIT Harp).
NFS V3 Asynchronous Writes
NFS V3 sidesteps the synchronous write problem by adding a new asynchronous write operation.
- Server may reply to the client as soon as it accepts the write, before executing/committing it.
If the server fails, it may discard any subset of the accepted but uncommitted writes.
- Client holds asynchronously written data in its cache, and reissues the writes if the server fails and restarts.
When is it safe for the client to discard its buffered writes? How can the client tell if the server has failed?
NFS V3 Commit
NFS V3 adds a new commit operation to go with async-write.
- Client may issue a commit for a file byte range at any time.
- Server must execute all covered uncommitted writes before replying to the commit.
- When the client receives the reply, it may safely discard any buffered writes covered by the commit.
- Server returns a verifier with every reply to an async write or commit request.
The verifier is just an integer that is guaranteed to change if the server restarts, and to never change back.
- What if the client crashes?
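Putting the verifier rule together, here is a minimal sketch of the client-side check on a commit reply, with simplified types; the real verifier is carried in the WRITE and COMMIT replies defined by the NFSv3 protocol.

#include <stdint.h>

struct write_buffer {
    uint64_t verifier_seen;   /* verifier returned when these writes were accepted */
    int      committed;
    /* ... buffered dirty data, offset, length ... */
};

/* Called when a commit reply arrives.  Returns 1 if the buffered writes may
 * be safely discarded, 0 if they must be reissued (the server restarted and
 * may have lost the accepted-but-uncommitted writes). */
int handle_commit_reply(struct write_buffer *wb, uint64_t reply_verifier)
{
    if (reply_verifier != wb->verifier_seen) {
        /* Verifier changed: the server rebooted since the async writes were
         * accepted, so resend them before trusting this commit. */
        return 0;
    }
    wb->committed = 1;
    return 1;
}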