Duke Systems
Scalable File Storage
Jeff Chase, Duke University
Why a shared network file service?
- Data sharing across people and their apps
– common name space (/usr/project/stuff…)
- Resource sharing
– fluid mapping of data to storage resources
– incremental scalability
– diversity of demands, not predictable in advance
– statistical multiplexing, central limit theorem
- Obvious?
– how is this different from OpenDHT?
Network File System (NFS)
[Figure: NFS architecture. On the client, user programs enter through the syscall layer and the Virtual File System (VFS), which dispatches to the NFS client; requests travel over the network to the NFS server, which goes through VFS to the local file system (UFS) on the server.]
Virtual File System (VFS) enables pluggable file system implementations as OS kernel modules (“drivers”).
Google File System (GFS)
- SOSP 2003
- Foundation for data-intensive parallel cluster computing at Google
– MapReduce OSDI 2004, 2000+ cites
- Client access by RPC library, or through kernel system calls (via FUSE)
- Uses Chubby lock service for consensus
– e.g., Master election
- Hadoop HDFS is an “open-source GFS”
Google File System (GFS)
Similar: Hadoop HDFS, p-NFS, many other parallel file systems.
A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
GFS (or HDFS) and MapReduce
- Large files
- Streaming access (sequential)
- Parallel access
- Append-mode writes
- Record-oriented
- (Sorting.)
MapReduce: Example
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of a task so a slow node does not limit the job.
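For concreteness, here is a toy single-process Python sketch of the map/reduce programming model (word count). The names and driver loop are invented for illustration; real MapReduce distributes map and reduce tasks across the cluster and restarts failed tasks as noted above.

from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for each word in an input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same key.
    return word, sum(counts)

def run_mapreduce(lines):
    intermediate = defaultdict(list)
    for line in lines:                       # "map" phase
        for key, value in map_fn(line):
            intermediate[key].append(value)  # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # "reduce" phase

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))       # {'the': 2, ...}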
HDFS Architecture
GFS Architecture
Separate data (chunks) from metadata (names etc.). Centralize the metadata; spread the chunks around.
Chunks
- Variable size, up to 64MB
- Stored as a file, named by a handle
- Replicated on multiple nodes, e.g., x3
– chunkserver == datanode
- Master caches chunk maps
– per-file chunk map: what chunks make up a file
– chunk replica map: which nodes store each chunk
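For concreteness, a minimal Python sketch of these two maps, with made-up paths, chunk handles, and chunkserver names:

CHUNK_SIZE = 64 * 2**20            # 64 MB

per_file_chunks = {                # per-file chunk map: what chunks make up a file
    "/user/alice/log": ["chunk-0001", "chunk-0002"],
}
chunk_replicas = {                 # chunk replica map: which nodes store each chunk
    "chunk-0001": ["chunkserver-a", "chunkserver-b", "chunkserver-c"],
    "chunk-0002": ["chunkserver-b", "chunkserver-c", "chunkserver-d"],
}

def locate(path, offset):
    # Map a byte offset in a file to (chunk handle, replica list).
    handle = per_file_chunks[path][offset // CHUNK_SIZE]
    return handle, chunk_replicas[handle]

print(locate("/user/alice/log", 70 * 2**20))   # offset 70 MB falls in the second chunk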
GFS Architecture
- Single master, multiple chunkservers
What could go wrong?
Single master
- From distributed systems we know this is a:
– Single point of failure
– Scalability bottleneck
- GFS solutions:
– Shadow masters
– Minimize master involvement
- never move data through it, use only for metadata
– and cache metadata at clients
- large chunk size
- master delegates authority to primary replicas in data mutations (chunk leases)
- Simple, and good enough!
GFS Read
GFS Write
The client asks the master for a list of replicas, and which replica holds the lease to act as primary. If no one has a lease, the master grants a lease to a replica it chooses. ...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed).
GFS writes: control and data flow
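A much-simplified, runnable Python sketch of this flow: the primary picks a serial order for mutations and every replica applies them in that order. The class and method names are invented; real GFS pushes data from the client along a pipeline of chunkservers and uses RPC rather than local calls.

class Chunkserver:
    def __init__(self, name):
        self.name, self.staged, self.chunk = name, None, []

    def push_data(self, data):
        self.staged = data                  # data flow: stage the bytes, not yet applied

    def apply(self, serial):
        self.chunk.append((serial, self.staged))   # apply at the serial number the primary chose

class Primary(Chunkserver):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries, self.next_serial = secondaries, 0

    def write(self, data):
        # In real GFS the client pushes the data to all replicas; the primary
        # forwards only the write request and its serial order (control flow).
        self.push_data(data)
        for s in self.secondaries:
            s.push_data(data)
        serial, self.next_serial = self.next_serial, self.next_serial + 1
        self.apply(serial)                  # primary picks the mutation order
        for s in self.secondaries:
            s.apply(serial)                 # all replicas apply in the same order
        return "ok"

secondaries = [Chunkserver("cs-b"), Chunkserver("cs-c")]
primary = Primary("cs-a", secondaries)
print(primary.write(b"record-1"), primary.write(b"record-2"))
print(secondaries[0].chunk == primary.chunk)    # True: same order on every replica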
Google File System (GFS)
Similar: Hadoop HDFS, p-NFS, many other parallel file systems.
A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
GFS Scale
GFS: leases
- Primary must hold a “lock” on its chunks.
- Use leased locks to tolerate primary failures.
We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary. The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. …Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
Leases (leased locks)
- A lease is a grant of ownership or
control for a limited time.
- The owner/holder can renew or
extend the lease.
- If the owner fails, the lease expires
and is free again.
- The lease might end early.
– lock service may recall or evict
– holder may release or relinquish
A lease service in the real world
[Figure: a lease service in the real world. Client A sends acquire and is granted the lease on x, performs x=x+1, and releases; client B then sends acquire, is granted the lease, and performs its own x=x+1.]
Leases and time
- The lease holder and lease service must agree when
a lease has expired.
– i.e., that its expiration time is in the past
– Even if they can’t communicate!
- We all have our clocks, but do they agree?
– synchronized clocks
- For leases, it is sufficient for the clocks to have a
known bound on clock drift.
– |T(Ci) – T(Cj)| < ε
– Build in slack time > ε into the lease protocols as a safety margin.
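A minimal sketch of that safety margin, assuming an invented drift bound EPSILON and the 60-second initial lease term quoted earlier: the holder stops acting ε before its local expiry time, and the server waits an extra ε before re-granting.

import time

EPSILON = 2.0        # assumed bound on clock drift (illustrative value)
LEASE_TERM = 60.0    # initial lease timeout, as in GFS

class Lease:
    def __init__(self, granted_at):
        self.granted_at = granted_at

    def holder_may_act(self, now):
        # Holder gives up EPSILON early, so it never acts past true expiry.
        return now < self.granted_at + LEASE_TERM - EPSILON

    def server_may_regrant(self, now):
        # Server waits EPSILON extra, so it never re-grants while the holder might still act.
        return now >= self.granted_at + LEASE_TERM + EPSILON

lease = Lease(granted_at=time.time())
print(lease.holder_may_act(time.time()))      # True right after the grant
print(lease.server_may_regrant(time.time()))  # False until well after expiry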
A network partition
[Figure: a crashed router partitions the network.] A network partition is any event that blocks all message traffic between subsets of nodes.
Never two kings at once
[Figure: never two kings at once. Client A acquires and is granted the lease and performs x=x+1, then becomes unreachable (???); the service grants the lease to B only after A’s lease expires, and B then performs its own x=x+1.]
Lease callbacks/recalls
- GFS master recalls primary leases to give the master
control for metadata operations
– rename
– snapshots
...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires…. ...When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.....
GFS: who is the primary?
- The master tells clients which chunkserver is the primary
for a chunk.
- The primary is the current lease owner for the chunk.
- What if the primary fails?
– Master gives lease to a new primary.
– The client’s answer may be cached and may be stale.
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file.
Lease sequence number
- Each lease has a lease sequence number.
– Master increments it when it issues a lease.
– Client and replicas get it from the master.
- Use it to validate that a replica is up to date before
accessing the replica.
– If replica fails/disconnects, its lease number lags.
– Easy to detect by comparing lease numbers.
- The lease sequence number is a common technique. In
GFS it is called the chunk version number.
GFS chunk version number
- In GFS, the sequence number for a chunk lease acts as
a chunk version number.
- Master passes it to the replicas after issuing a lease, and
to the client in the chunk handle.
– If a replica misses updates to a chunk, its version falls behind.
– Client checks for stale chunks on reads.
– Replicas report chunk versions to master: master reclaims any stale chunks, creates new replicas.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas…before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks...
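A tiny sketch of the staleness check, with invented chunk handles and chunkserver names: the master’s version for a chunk advances with each new lease, and any replica whose version lags is stale.

master_version = {"chunk-42": 7}                      # bumped on each new lease grant
replica_versions = {"cs-a": 7, "cs-b": 7, "cs-c": 6}  # cs-c missed a lease grant (was down)

def up_to_date(replica, handle="chunk-42"):
    return replica_versions[replica] == master_version[handle]

print([r for r in replica_versions if up_to_date(r)])       # usable replicas: cs-a, cs-b
print([r for r in replica_versions if not up_to_date(r)])   # stale: reclaim and re-replicate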
Consistency
GFS consistency model
Easy.
Primary chooses an arbitrary total order for concurrent writes. Writes that cross chunk boundaries are not atomic: the two chunk primaries may choose different orders. Records are appended atomically at least once …but the file may contain duplicates and/or padding. Anything can happen if writes fail: a failed write may succeed at some replicas but not others. Reads from different replicas may return different results.
PSM: a closer look
- The following slides are from Roxana Geambasu
– Summer internship at Microsoft Research
– Now at Columbia
- Goal: specify consistency and failure formally
- For primary/secondary/master (PSM) systems
– GFS
– BlueFS (MSR scalable storage behind Hotmail etc.)
- The study is useful to understand where PSM protocols
differ, and the implications.
GFS
[Figure: PSM message flow. Client writes go to the primary, which forwards them to the other replicas and returns an ACK or error; client reads return a value or error.]
Master:
- Maintains replica group config.
- Monitoring
- Reconfiguration
- Recovery
[Roxana Geambasu]
Counter-examples to GFS’ Linearizability
- Counter-example 1 – Non-atomic writes
– In GFS, files are split into chunks, replicated by separate groups
– So, writes to a file can be split and executed non-atomically
- Counter-example 2 – Stale reads
– In GFS, no strong membership checks are done for reads
– So, a read can go to a stale replica
- Counter-example 3 – Read uncommitted
– In GFS, a read can go to any replica
– So, a read can return the value of an in-progress write
“Append atomically at least once”
- If append would cross chunk boundary
– Primary pads to chunk boundary; client retries (on next chunk).
- If append fails on any replica: client retries.
– May cause duplicates on the subset of replicas where the earlier write succeeded.
The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. … If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. …
GFS: choose scalable semantics
- GFS argues that the loose semantics are “good enough”.
– “Regardless of consistency and concurrency issues, this approach has served us well.”
Appending is far more efficient and more resilient to application failures than random writes. In one typical use, a writer generates a file from beginning to end. It … periodically checkpoints how much has been successfully written. Checkpoints may include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. …Checkpointing allows writers to restart incrementally... …Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
Impact on apps
- Sometimes it is useful to push problems up to a higher level so that
the system doesn’t have to solve them.
- Instead, change the semantics and place more burden on apps.
- Here’s how it’s done:
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.... Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway…
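A Python sketch of a reader that copes with at-least-once appends. Each record is framed with a length, a checksum, and a unique id (this framing is invented for illustration, not GFS’s actual record format), so the reader can discard padding, torn records, and duplicates.

import zlib

def make_record(record_id, payload):
    body = record_id.to_bytes(8, "big") + payload
    return len(body).to_bytes(4, "big") + zlib.crc32(body).to_bytes(4, "big") + body

def read_records(chunk_bytes):
    seen, offset = set(), 0
    while offset + 8 <= len(chunk_bytes):
        length = int.from_bytes(chunk_bytes[offset:offset + 4], "big")
        crc = int.from_bytes(chunk_bytes[offset + 4:offset + 8], "big")
        if length < 8 or offset + 8 + length > len(chunk_bytes):
            break                              # trailing padding or a torn record
        body = chunk_bytes[offset + 8:offset + 8 + length]
        offset += 8 + length
        if zlib.crc32(body) != crc:
            continue                           # checksum mismatch: discard the fragment
        record_id = int.from_bytes(body[:8], "big")
        if record_id in seen:
            continue                           # duplicate left by a retried append
        seen.add(record_id)
        yield body[8:]

# A retried append left a duplicate copy of record 1; the reader filters it out.
chunk = make_record(1, b"alpha") + make_record(1, b"alpha") + make_record(2, b"beta")
print(list(read_records(chunk)))               # [b'alpha', b'beta']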
GFS: talking points
- commodity solution
- executables are replicated more widely (why?)
- pipelining and dissemination tree
- hard vs. soft state?
– operation log on the master for namespace
– chunk inventories are soft state on master
- Q: could multiple masters (volumes) share the same
pool of chunkservers?
- atomic namespace ops, and namespace locking
- chunk balancing, and interconnect
- snapshots and garbage collection
Fault Tolerance
- High availability
– fast recovery
- master and chunkservers restartable in a few seconds
– chunk replication
- default: 3 replicas.
– shadow masters
- Data integrity
– checksum every 64KB block in each chunk
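A sketch of per-block checksumming for data integrity, using CRC32 over 64 KB blocks (CRC32 here is just an illustration, not necessarily the checksum GFS uses):

import zlib

BLOCK = 64 * 1024      # 64 KB blocks, as in the GFS design

def checksum_blocks(data):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, checksums):
    return checksum_blocks(data) == checksums

data = b"x" * (3 * BLOCK)
sums = checksum_blocks(data)
print(verify(data, sums))                  # True
print(verify(data[:-1] + b"y", sums))      # False: a corrupted byte is detected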
GFS: master log
[Figure: the master’s operation log and periodic checkpoints/snapshots on disk.]
Recoverable Data with a Log
[Figure: your data structures live in volatile memory, with a log and periodic snapshots on disk.]
Your program (or file system or database software) executes transactions that read or change the data structures in memory. Push a checkpoint or snapshot of the entire structure to disk periodically. Log transaction events as they occur. After a failure, replay the log into the last snapshot to recover the data structure.
Anatomy of a Log
- A log is a sequence of records (entries) on
recoverable storage (e.g., disk).
- Each entry is associated with some operation/transaction T.
- Create log entries for T as T executes, to
record progress of T.
- Log writes are atomic and durable, and
complete detectably in order.
– Append/write entry to log
– Truncate older entries up to time t
- Log entries for T must be durable (e.g., pushed to disk) before any effects of T become visible.
– Called write-ahead logging
[Figure: the log as a sequence of entries from old to new. Each entry carries a Log Sequence Number (LSN) and a Transaction ID (XID), e.g., LSN 11/XID 18, LSN 12/XID 18, LSN 13/XID 19, LSN 14/XID 18 commit; the last entry is T18’s commit record.]
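A toy write-ahead redo log in Python, using an invented JSON-lines format: the entries for a transaction are forced to disk, then its commit record, and only then are the effects applied to the "database".

import json, os

LOG = "wal.log"

def log_append(entry):
    with open(LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())       # durable before proceeding: the write-ahead rule

def run_transaction(xid, writes, store):
    for key, value in writes.items():
        log_append({"xid": xid, "op": "write", "key": key, "value": value})
    log_append({"xid": xid, "op": "commit"})   # commit point
    store.update(writes)                       # only now make the effects visible

store = {}
run_transaction(18, {"x": 1, "y": 2}, store)
print(store)                                   # {'x': 1, 'y': 2}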
GFS master log: details
All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log…. The operation log contains a historical [and persistent] record of critical metadata changes… it also serves as a logical time line that defines the order of concurrent operations…the master’s operation log defines a global total order of these operations.
The log records contain the sequence of commands/requests executed, rather than the updates made as they executed: this is called operational logging. The master recovers its file system state by replaying the operation log… loading the latest checkpoint from local disk and replaying only the limited number of log records after that. “Replay” the log by re-executing the operations listed in the log, in order. They are deterministic: the result is the same as their execution before the crash. First it rolls back to the state of the most recent checkpoint. Then it re-executes only the operations after the checkpoint, since the checkpoint already includes the results from all preceding operations.
GFS master log: details
We must store [the operation log] reliably and not make changes visible to clients until metadata changes are made persistent. …Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.
The log is made durable by writing copies to multiple disks/machines. The log writes must complete before the master responds to the operation request. This is a form of write-ahead logging, also called output commit. Batching is a common optimization called group commit: it increases disk write bandwidth for the log at the price of increasing commit latency.
Parallel File Systems 101
- Manage data sharing in large data stores
[Renu Tewari, IBM]
Asymmetric
- E.g., PVFS2, Lustre, High Road
- Ceph, GFS
Symmetric
- E.g., GPFS, Polyserve
- Classical: Frangipani
Parallel NFS (pNFS)
[Figure: pNFS clients access block (FC), object (OSD), or file (NFS) storage directly for data, coordinated by an NFSv4+ server.]
[David Black, SNIA]
Modifications to standard NFS protocol (v4.1, 2005-2010) to offload bulk data storage to a scalable cluster of block servers or OSDs. Based on an asymmetric structure similar to GFS and Ceph.
pNFS architecture
- Only this is covered by the pNFS protocol
- Client-to-storage data path and server-to-storage control path are
specified elsewhere, e.g.
– SCSI Block Commands (SBC) over Fibre Channel (FC)
– SCSI Object-based Storage Device (OSD) over iSCSI
– Network File System (NFS)
[Figure repeated: pNFS clients, the NFSv4+ server, and block (FC) / object (OSD) / file (NFS) storage; only the client-to-server protocol is pNFS.]
[David Black, SNIA]
pNFS basic operation
- Client gets a layout from the NFS Server
- The layout maps the file onto storage devices and addresses
- The client uses the layout to perform direct I/O to storage
- At any time the server can recall the layout (leases/delegations)
- Client commits changes and returns the layout when it’s done
- pNFS is optional, the client can always use regular NFSv4 I/O
[Figure: the NFSv4+ server hands the client a layout; the client then performs direct I/O to storage.]
[David Black, SNIA]
Cache consistency
- How to ensure that each read sees the value stored by
the most recent write? (Or some reasonable value)?
- This problem also appears in multi-core architecture.
- It appears in distributed data systems of various kinds.
– DNS, Web
- Various solutions are available.
– It may be OK for clients to read data that is “a little bit stale”.
– In some cases, the clients themselves don’t change the data.
- But for “strong” consistency, we need (in essence) a
distributed reader/writer lock (SharedLock) for the clients to coordinate for each piece of data.
NFS: revised picture
[Figure: on the client, applications sit above a buffer cache and file system; the FS client talks to a file server, which has its own buffer cache and FS.]
Multiple clients
[Figure: multiple clients, each with applications above a buffer cache and FS client, all sharing one file server.]
Multiple clients
[Figure: one of the clients issues Read(server=xx.xx…, inode=i27412, blockID=27, …) to the shared file server.]
Multiple clients
[Figure: another client issues Write(server=xx.xx…, inode=i27412, blockID=27, …) for the same block.]
Multiple clients
[Figure: the same multi-client setup.]
What if either of the other clients reads that block? Will it get the right data? What is the “right” data? Will it get the “last” version of the block written?
How to coordinate reads/writes and caching on multiple clients? How to keep the copies “in sync”?
Lease example: network file cache
- A read lease ensures that no other client is writing the data. Holder is free to read from its cache.
- A write lease ensures that no other client is reading or
writing the data. Holder is free to read/write from cache.
- Writer must push modified (dirty) cached data to the
server before relinquishing lease.
– Must ensure that another client can see all updates before it is able to acquire a lease allowing it to read or write.
- If some client requests a conflicting lock, server may
recall or evict on existing leases.
– Callback RPC from server to lock holder: “please release now.”
– Writers get a grace period to push cached writes and release.
Lease example: network file cache consistency
This approach is used in NFS and various other networked data services.
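A sketch of the server’s lease table for this scheme: read leases are shared, write leases are exclusive. Callbacks/recalls, grace periods, and expiration are omitted, and all names are illustrative.

class LeaseTable:
    def __init__(self):
        self.readers, self.writer = {}, {}      # per-object reader set and writer

    def acquire(self, obj, client, mode):
        readers = self.readers.setdefault(obj, set())
        writer = self.writer.get(obj)
        if mode == "read" and writer is None:
            readers.add(client)                 # many clients may hold read leases
            return True
        if mode == "write" and writer is None and readers <= {client}:
            self.writer[obj] = client           # exclusive: no other reader or writer
            return True
        return False                            # conflict: server would recall leases

    def release(self, obj, client):
        self.readers.get(obj, set()).discard(client)
        if self.writer.get(obj) == client:
            del self.writer[obj]                # a writer must flush dirty data first

t = LeaseTable()
print(t.acquire("inode-27412", "clientA", "read"))    # True
print(t.acquire("inode-27412", "clientB", "write"))   # False: conflicts with A's read lease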
Stronger write atomicity
- The systems in this module are not suitable to store
complex data with multiple writers.
- “Complex” means structured data, in which multiple
writes to different parts of the structure must complete together (grouped writes).
- Q: How to ensure that a group of writes completes “all or
nothing” with respect to crashes and other client reads?
- A: transactions.
Transactions
BEGIN T1: read X; read Y; …; write X; END
BEGIN T2: read X; write Y; …; write X; END
Database systems and other systems use a programming construct called atomic transactions (“ACID”) to represent a group of related reads/writes, often on different data items. Transactions commit atomically in a serial order.
ACID: Atomic, Consistent, Independent, Durable.
ACID properties of transactions
- Transactions are Atomic
– Each transaction either commits or aborts: it either executes entirely or not at all.
– Transactions don’t interfere with one another (I).
- Transactions appear to commit in some serial order
(serializable schedule).
- Each transaction is coded to transition the store from one
Consistent state to another.
- One-copy serializability (1SR): Transactions observe the effects of their predecessors, and not of their successors.
- Transactions are Durable.
– Committed effects survive failure.
Serial schedule
[Figure: a serial schedule. Transactions T1, T2, …, Tn carry the store through consistent states S0 → S1 → S2 → … → Sn.]
A consistent state is one that does not violate any internal invariant relationships in the data.
Transaction bodies must be coded correctly!
Limitations of Transactions?
- Why not use ACID transactions for
everything?
- How much work is it to serialize and commit
transactions?
- E.g., what if I want to add more servers?
Two-phase commit (2PC)
[Figure: two-phase commit message flow. The transaction manager/coordinator (TM/C) asks each resource manager/participant (RM/P) “commit or abort?”; each RM replies “here’s my vote”; the TM then announces “commit/abort!”.]
- Phases: precommit or prepare, vote, decide, notify.
- RMs validate the transaction and prepare by logging their local updates and decisions.
- TM logs commit/abort (the commit point).
- If the votes are unanimous to commit, decide to commit; else decide to abort.
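A toy single-process Python sketch of the two phases, with invented class names; real 2PC logs each step durably (write-ahead) and must cope with timeouts and failures, which this omits.

class Participant:                  # an RM/P
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "init"

    def prepare(self):
        # Phase 1: validate and log local updates, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        self.state = decision       # Phase 2: apply the coordinator's decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]        # prepare / vote
    decision = "commit" if all(votes) else "abort"     # TM's decision (commit point)
    for p in participants:
        p.finish(decision)                             # notify
    return decision

print(two_phase_commit([Participant(), Participant()]))        # commit
print(two_phase_commit([Participant(), Participant(False)]))   # abort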
More on 2PC later. What about CAP?
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
→ Single-sector write commits x-action (3)
Invariant: append new data to the log before applying it to the DB. This is called “write-ahead logging”.
[Figure: the log as a timeline: Begin, Write1, …, WriteN, Commit; the transaction is then complete.]
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
→ Single-sector write commits x-action (3)
What if we crash here (between 3 and 4)? On reboot, reapply committed updates in log order.
[Figure: the log as a timeline: Begin, Write1, …, WriteN, Commit.]
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
→ Single-sector write commits x-action (3)
What if we crash here (before the commit record)? On reboot, discard uncommitted updates.
[Figure: the log as a timeline: Begin, Write1, …, WriteN (no Commit).]
Recovery from a log
- Log entries for T record the writes by
T (or operations in T).
– Redo logging: the log records contain sufficient info to “redo” T after a crash.
- To recover, read the checkpoint and
replay committed log entries.
– “Redo” by reissuing writes or reinvoking the operations/methods.
– Redo in order (old to new).
– Skip the records of uncommitted Ts.
– Skip records of any Ts that committed before the atomic checkpoint was taken.
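A sketch of redo recovery that pairs with the toy write-ahead log shown earlier (same invented JSON-lines format): replay the writes of committed transactions in log order, and skip transactions with no commit record.

import json

def recover(log_lines):
    entries = [json.loads(line) for line in log_lines]
    committed = {e["xid"] for e in entries if e["op"] == "commit"}
    store = {}                                  # or start from the last checkpoint
    for e in entries:                           # redo in order, old to new
        if e["op"] == "write" and e["xid"] in committed:
            store[e["key"]] = e["value"]
    return store

log = [
    '{"xid": 18, "op": "write", "key": "x", "value": 1}',
    '{"xid": 19, "op": "write", "key": "y", "value": 2}',   # T19 never committed
    '{"xid": 18, "op": "commit"}',
]
print(recover(log))        # {'x': 1}: only T18's effects survive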
Managing a log
- On checkpoint, truncate the log
– No longer need the entries to recover.
- Checkpoint how often? Tradeoff:
– Checkpoints are expensive, BUT
– Long logs take up space.
– Long logs increase recovery time.
- Checkpoint+truncate is “atomic”
– Is it safe to redo/replay records whose effect is already in the checkpoint?
– Checkpoint “between” transactions, so checkpoint records a consistent state.
– Lots of approaches
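A sketch of checkpoint-and-truncate for the toy log above: force a snapshot of the in-memory state to disk, then truncate the log, since entries before the checkpoint are no longer needed for recovery. File names are invented.

import json, os

def checkpoint(store, log_path="wal.log", snap_path="snapshot.json"):
    with open(snap_path, "w") as f:
        json.dump(store, f)
        f.flush()
        os.fsync(f.fileno())       # snapshot must be durable before truncation
    open(log_path, "w").close()    # truncate: older entries are no longer needed

checkpoint({"x": 1, "y": 2})       # after this, recovery starts from the snapshot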