Duke Systems
Scalable File Storage
Jeff Chase, Duke University
Why a shared network file service?
- Data sharing across people and their apps
– common name space (/usr/project/stuff…)
- Resource sharing
– fluid mapping of data to storage resources
– incremental scalability
– diversity of demands, not predictable in advance
– statistical multiplexing, central limit theorem
- Obvious?
– how is this different from OpenDHT?
Network File System (NFS)
[Figure: NFS architecture. On the client, user programs enter through the syscall layer and the Virtual File System (VFS), which dispatches to the NFS client; requests travel over the network to the NFS server, which goes through VFS to the local file system (UFS) on the server.]
Virtual File System (VFS) enables pluggable file system implementations as OS kernel modules (“drivers”).
Google File System (GFS)
- SOSP 2003
- Foundation for data-intensive parallel cluster computing at Google
– MapReduce OSDI 2004, 2000+ cites
- Client access by RPC library, or through kernel system calls (via FUSE)
- Uses Chubby lock service for consensus
– e.g., Master election
- Hadoop HDFS is an “open-source GFS”
Google File System (GFS)
Similar: Hadoop HDFS, p-NFS, many other parallel file systems.
A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
GFS (or HDFS) and MapReduce
- Large files
- Streaming access (sequential)
- Parallel access
- Append-mode writes
- Record-oriented
- (Sorting.)
MapReduce: Example
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of a task so a slow node does not limit the job.
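For concreteness, here is a toy single-process Python sketch of the map/reduce programming model (word count). The names and driver loop are invented for illustration; real MapReduce distributes map and reduce tasks across the cluster and restarts failed tasks as noted above.

from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for each word in an input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same key.
    return word, sum(counts)

def run_mapreduce(lines):
    intermediate = defaultdict(list)
    for line in lines:                       # "map" phase
        for key, value in map_fn(line):
            intermediate[key].append(value)  # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # "reduce" phase

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))       # {'the': 2, ...}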
HDFS Architecture
GFS Architecture
Separate data (chunks) from metadata (names etc.). Centralize the metadata; spread the chunks around.
Chunks
- Variable size, up to 64MB
- Stored as a file, named by a handle
- Replicated on multiple nodes, e.g., x3
– chunkserver == datanode
- Master caches chunk maps
– per-file chunk map: what chunks make up a file
– chunk replica map: which nodes store each chunk
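For concreteness, a minimal Python sketch of these two maps, with made-up paths, chunk handles, and chunkserver names:

CHUNK_SIZE = 64 * 2**20            # 64 MB

per_file_chunks = {                # per-file chunk map: what chunks make up a file
    "/user/alice/log": ["chunk-0001", "chunk-0002"],
}
chunk_replicas = {                 # chunk replica map: which nodes store each chunk
    "chunk-0001": ["chunkserver-a", "chunkserver-b", "chunkserver-c"],
    "chunk-0002": ["chunkserver-b", "chunkserver-c", "chunkserver-d"],
}

def locate(path, offset):
    # Map a byte offset in a file to (chunk handle, replica list).
    handle = per_file_chunks[path][offset // CHUNK_SIZE]
    return handle, chunk_replicas[handle]

print(locate("/user/alice/log", 70 * 2**20))   # offset 70 MB falls in the second chunk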
GFS Architecture
- Single master, multiple chunkservers
What could go wrong?
Single master
- From distributed systems we know this is a:
– Single point of failure
– Scalability bottleneck
- GFS solutions:
– Shadow masters
– Minimize master involvement
- never move data through it, use only for metadata
– and cache metadata at clients
- large chunk size
- master delegates authority to primary replicas in data mutations (chunk leases)
- Simple, and good enough!
GFS Read
GFS Write
The client asks the master for a list of replicas, and which replica holds the lease to act as primary. If no one has a lease, the master grants a lease to a replica it chooses. ...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed).
GFS writes: control and data flow
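A much-simplified, runnable Python sketch of this flow: the primary picks a serial order for mutations and every replica applies them in that order. The class and method names are invented; real GFS pushes data from the client along a pipeline of chunkservers and uses RPC rather than local calls.

class Chunkserver:
    def __init__(self, name):
        self.name, self.staged, self.chunk = name, None, []

    def push_data(self, data):
        self.staged = data                  # data flow: stage the bytes, not yet applied

    def apply(self, serial):
        self.chunk.append((serial, self.staged))   # apply at the serial number the primary chose

class Primary(Chunkserver):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries, self.next_serial = secondaries, 0

    def write(self, data):
        # In real GFS the client pushes the data to all replicas; the primary
        # forwards only the write request and its serial order (control flow).
        self.push_data(data)
        for s in self.secondaries:
            s.push_data(data)
        serial, self.next_serial = self.next_serial, self.next_serial + 1
        self.apply(serial)                  # primary picks the mutation order
        for s in self.secondaries:
            s.apply(serial)                 # all replicas apply in the same order
        return "ok"

secondaries = [Chunkserver("cs-b"), Chunkserver("cs-c")]
primary = Primary("cs-a", secondaries)
print(primary.write(b"record-1"), primary.write(b"record-2"))
print(secondaries[0].chunk == primary.chunk)    # True: same order on every replica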
Google File System (GFS)
Similar: Hadoop HDFS, p-NFS, many other parallel file systems.
A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
GFS Scale
GFS: leases
- Primary must hold a “lock” on its chunks.
- Use leased locks to tolerate primary failures.
We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary. The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. …Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
Leases (leased locks)
- A lease is a grant of ownership or
control for a limited time.
- The owner/holder can renew or
extend the lease.
- If the owner fails, the lease expires
and is free again.
- The lease might end early.
– lock service may recall or evict
– holder may release or relinquish
A lease service in the real world
[Figure: a lease service in the real world. Client A sends acquire and is granted the lease on x, performs x=x+1, and releases; client B then sends acquire, is granted the lease, and performs its own x=x+1.]
Leases and time
- The lease holder and lease service must agree when
a lease has expired.
– i.e., that its expiration time is in the past
– Even if they can’t communicate!
- We all have our clocks, but do they agree?
– synchronized clocks
- For leases, it is sufficient for the clocks to have a
known bound on clock drift.
– |T(Ci) – T(Cj)| < ε
– Build in slack time > ε into the lease protocols as a safety margin.
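A minimal sketch of that safety margin, assuming an invented drift bound EPSILON and the 60-second initial lease term quoted earlier: the holder stops acting ε before its local expiry time, and the server waits an extra ε before re-granting.

import time

EPSILON = 2.0        # assumed bound on clock drift (illustrative value)
LEASE_TERM = 60.0    # initial lease timeout, as in GFS

class Lease:
    def __init__(self, granted_at):
        self.granted_at = granted_at

    def holder_may_act(self, now):
        # Holder gives up EPSILON early, so it never acts past true expiry.
        return now < self.granted_at + LEASE_TERM - EPSILON

    def server_may_regrant(self, now):
        # Server waits EPSILON extra, so it never re-grants while the holder might still act.
        return now >= self.granted_at + LEASE_TERM + EPSILON

lease = Lease(granted_at=time.time())
print(lease.holder_may_act(time.time()))      # True right after the grant
print(lease.server_may_regrant(time.time()))  # False until well after expiry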
A network partition
[Figure: a crashed router partitions the network.] A network partition is any event that blocks all message traffic between subsets of nodes.
Never two kings at once
[Figure: never two kings at once. Client A acquires and is granted the lease and performs x=x+1, then becomes unreachable (???); the service grants the lease to B only after A’s lease expires, and B then performs its own x=x+1.]
Lease callbacks/recalls
- GFS master recalls primary leases to give the master
control for metadata operations
– rename
– snapshots
...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires…. ...When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.....
GFS: who is the primary?
- The master tells clients which chunkserver is the primary
for a chunk.
- The primary is the current lease owner for the chunk.
- What if the primary fails?
– Master gives lease to a new primary.
– The client’s answer may be cached and may be stale.
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file.
Lease sequence number
- Each lease has a lease sequence number.
– Master increments it when it issues a lease.
– Client and replicas get it from the master.
- Use it to validate that a replica is up to date before
accessing the replica.
– If replica fails/disconnects, its lease number lags.
– Easy to detect by comparing lease numbers.
- The lease sequence number is a common technique. In
GFS it is called the chunk version number.
GFS chunk version number
- In GFS, the sequence number for a chunk lease acts as
a chunk version number.
- Master passes it to the replicas after issuing a lease, and
to the client in the chunk handle.
– If a replica misses updates to a chunk, its version falls behind.
– Client checks for stale chunks on reads.
– Replicas report chunk versions to master: master reclaims any stale chunks, creates new replicas.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas…before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks...
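A tiny sketch of the staleness check, with invented chunk handles and chunkserver names: the master’s version for a chunk advances with each new lease, and any replica whose version lags is stale.

master_version = {"chunk-42": 7}                      # bumped on each new lease grant
replica_versions = {"cs-a": 7, "cs-b": 7, "cs-c": 6}  # cs-c missed a lease grant (was down)

def up_to_date(replica, handle="chunk-42"):
    return replica_versions[replica] == master_version[handle]

print([r for r in replica_versions if up_to_date(r)])       # usable replicas: cs-a, cs-b
print([r for r in replica_versions if not up_to_date(r)])   # stale: reclaim and re-replicate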
Consistency
GFS consistency model
Easy.
Primary chooses an arbitrary total order for concurrent writes. Writes that cross chunk boundaries are not atomic: the two chunk primaries may choose different orders. Records are appended atomically at least once …but the file may contain duplicates and/or padding. Anything can happen if writes fail: a failed write may succeed at some replicas but not others. Reads from different replicas may return different results.
PSM: a closer look
- The following slides are from Roxana Geambasu
– Summer internship at Microsoft Research
– Now at Columbia
- Goal: specify consistency and failure formally
- For primary/secondary/master (PSM) systems
– GFS
– BlueFS (MSR scalable storage behind Hotmail etc.)
- The study is useful to understand where PSM protocols
differ, and the implications.
GFS
[Figure: PSM message flow. Client writes go to the primary, which forwards them to the other replicas and returns an ACK or error; client reads return a value or error.]
Master:
- Maintains replica group config.
- Monitoring
- Reconfiguration
- Recovery
[Roxana Geambasu]
Counter-examples to GFS’ Linearizability
- Counter-example 1 – Non-atomic writes
– In GFS, files are split into chunks, replicated by separate groups
– So, writes to a file can be split and executed non-atomically
- Counter-example 2 – Stale reads
– In GFS, no strong membership checks are done for reads
– So, a read can go to a stale replica
- Counter-example 3 – Read uncommitted
– In GFS, a read can go to any replica
– So, a read can return the value of an in-progress write
“Append atomically at least once”
- If append would cross chunk boundary
– Primary pads to chunk boundary; client retries (on next chunk).
- If append fails on any replica: client retries.
– May cause duplicates on the subset of replicas where the earlier write succeeded.
The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. … If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. …
GFS: choose scalable semantics
- GFS argues that the loose semantics are “good enough”.
– “Regardless of consistency and concurrency issues, this approach has served us well.”
Appending is far more efficient and more resilient to application failures than random writes. In one typical use, a writer generates a file from beginning to end. It … periodically checkpoints how much has been successfully written. Checkpoints may include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. …Checkpointing allows writers to restart incrementally... …Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
Impact on apps
- Sometimes it is useful to push problems up to a higher level so that
the system doesn’t have to solve them.
- Instead, change the semantics and place more burden on apps.
- Here’s how it’s done:
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.... Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway…
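A Python sketch of a reader that copes with at-least-once appends. Each record is framed with a length, a checksum, and a unique id (this framing is invented for illustration, not GFS’s actual record format), so the reader can discard padding, torn records, and duplicates.

import zlib

def make_record(record_id, payload):
    body = record_id.to_bytes(8, "big") + payload
    return len(body).to_bytes(4, "big") + zlib.crc32(body).to_bytes(4, "big") + body

def read_records(chunk_bytes):
    seen, offset = set(), 0
    while offset + 8 <= len(chunk_bytes):
        length = int.from_bytes(chunk_bytes[offset:offset + 4], "big")
        crc = int.from_bytes(chunk_bytes[offset + 4:offset + 8], "big")
        if length < 8 or offset + 8 + length > len(chunk_bytes):
            break                              # trailing padding or a torn record
        body = chunk_bytes[offset + 8:offset + 8 + length]
        offset += 8 + length
        if zlib.crc32(body) != crc:
            continue                           # checksum mismatch: discard the fragment
        record_id = int.from_bytes(body[:8], "big")
        if record_id in seen:
            continue                           # duplicate left by a retried append
        seen.add(record_id)
        yield body[8:]

# A retried append left a duplicate copy of record 1; the reader filters it out.
chunk = make_record(1, b"alpha") + make_record(1, b"alpha") + make_record(2, b"beta")
print(list(read_records(chunk)))               # [b'alpha', b'beta']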
GFS: talking points
- commodity solution
- executables are replicated more widely (why?)
- pipelining and dissemination tree
- hard vs. soft state?
– operation log on the master for namespace
– chunk inventories are soft state on master
- Q: could multiple masters (volumes) share the same
pool of chunkservers?
- atomic namespace ops, and namespace locking
- chunk balancing, and interconnect
- snapshots and garbage collection
Fault Tolerance
- High availability
– fast recovery
- master and chunkservers restartable in a few seconds
– chunk replication
- default: 3 replicas.
– shadow masters
- Data integrity
– checksum every 64KB block in each chunk
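A sketch of per-block checksumming for data integrity, using CRC32 over 64 KB blocks (CRC32 here is just an illustration, not necessarily the checksum GFS uses):

import zlib

BLOCK = 64 * 1024      # 64 KB blocks, as in the GFS design

def checksum_blocks(data):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, checksums):
    return checksum_blocks(data) == checksums

data = b"x" * (3 * BLOCK)
sums = checksum_blocks(data)
print(verify(data, sums))                  # True
print(verify(data[:-1] + b"y", sums))      # False: a corrupted byte is detected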
GFS: master log
[Figure: the master’s operation log and periodic checkpoints/snapshots on disk.]
Recoverable Data with a Log
[Figure: your data structures live in volatile memory, with a log and periodic snapshots on disk.]
Your program (or file system or database software) executes transactions that read or change the data structures in memory. Push a checkpoint or snapshot of the entire structure to disk periodically. Log transaction events as they occur. After a failure, replay the log into the last snapshot to recover the data structure.
Anatomy of a Log
- A log is a sequence of records (entries) on
recoverable storage (e.g., disk).
- Each entry is associated with some operation/transaction T.
- Create log entries for T as T executes, to
record progress of T.
- Log writes are atomic and durable, and
complete detectably in order.
– Append/write entry to log
– Truncate older entries up to time t
- Log entries for T must be durable (e.g., pushed to disk) before any effects of T become visible.
– Called write-ahead logging
[Figure: the log as a sequence of entries from old to new. Each entry carries a Log Sequence Number (LSN) and a Transaction ID (XID), e.g., LSN 11/XID 18, LSN 12/XID 18, LSN 13/XID 19, LSN 14/XID 18 commit; the last entry is T18’s commit record.]
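A toy write-ahead redo log in Python, using an invented JSON-lines format: the entries for a transaction are forced to disk, then its commit record, and only then are the effects applied to the "database".

import json, os

LOG = "wal.log"

def log_append(entry):
    with open(LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())       # durable before proceeding: the write-ahead rule

def run_transaction(xid, writes, store):
    for key, value in writes.items():
        log_append({"xid": xid, "op": "write", "key": key, "value": value})
    log_append({"xid": xid, "op": "commit"})   # commit point
    store.update(writes)                       # only now make the effects visible

store = {}
run_transaction(18, {"x": 1, "y": 2}, store)
print(store)                                   # {'x': 1, 'y': 2}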
GFS master log: details
All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log…. The operation log contains a historical [and persistent] record of critical metadata changes… it also serves as a logical time line that defines the order of concurrent operations…the master’s operation log defines a global total order of these operations.
The log records contain the sequence of commands/requests executed, rather than the updates made as they executed: this is called operational logging. The master recovers its file system state by replaying the operation log… loading the latest checkpoint from local disk and replaying only the limited number of log records after that. “Replay” the log by re-executing the operations listed in the log, in order. They are deterministic: the result is the same as their execution before the crash. First it rolls back to the state of the most recent checkpoint. Then it re-executes only the operations after the checkpoint, since the checkpoint already includes the results from all preceding operations.
GFS master log: details
We must store [the operation log] reliably and not make changes visible to clients until metadata changes are made persistent. …Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.
The log is made durable by writing copies to multiple disks/machines. The log writes must complete before the master responds to the operation request. This is a form of write-ahead logging, also called output commit. Batching is a common optimization called group commit: it increases disk write bandwidth for the log at the price of increasing commit latency.
Parallel File Systems 101
- Manage data sharing in large data stores
[Renu Tewari, IBM]
Asymmetric
- E.g., PVFS2, Lustre, High Road
- Ceph, GFS
Symmetric
- E.g., GPFS, Polyserve
- Classical: Frangipani
Parallel NFS (pNFS)
[Figure: pNFS clients access block (FC), object (OSD), or file (NFS) storage directly for data, coordinated by an NFSv4+ server.]
[David Black, SNIA]
Modifications to standard NFS protocol (v4.1, 2005-2010) to offload bulk data storage to a scalable cluster of block servers or OSDs. Based on an asymmetric structure similar to GFS and Ceph.
pNFS architecture
- Only this is covered by the pNFS protocol
- Client-to-storage data path and server-to-storage control path are
specified elsewhere, e.g.
– SCSI Block Commands (SBC) over Fibre Channel (FC)
– SCSI Object-based Storage Device (OSD) over iSCSI
– Network File System (NFS)
[Figure repeated: pNFS clients, the NFSv4+ server, and block (FC) / object (OSD) / file (NFS) storage; only the client-to-server protocol is pNFS.]
[David Black, SNIA]
pNFS basic operation
- Client gets a layout from the NFS Server
- The layout maps the file onto storage devices and addresses
- The client uses the layout to perform direct I/O to storage
- At any time the server can recall the layout (leases/delegations)
- Client commits changes and returns the layout when it’s done
- pNFS is optional, the client can always use regular NFSv4 I/O
[Figure: the NFSv4+ server hands the client a layout; the client then performs direct I/O to storage.]
[David Black, SNIA]
Cache consistency
- How to ensure that each read sees the value stored by
the most recent write? (Or some reasonable value)?
- This problem also appears in multi-core architecture.
- It appears in distributed data systems of various kinds.
– DNS, Web
- Various solutions are available.
– It may be OK for clients to read data that is “a little bit stale”.
– In some cases, the clients themselves don’t change the data.
- But for “strong” consistency, we need (in essence) a
distributed reader/writer lock (SharedLock) for the clients to coordinate for each piece of data.
NFS: revised picture
[Figure: on the client, applications sit above a buffer cache and file system; the FS client talks to a file server, which has its own buffer cache and FS.]
Multiple clients
[Figure: multiple clients, each with applications above a buffer cache and FS client, all sharing one file server.]
Multiple clients
[Figure: one of the clients issues Read(server=xx.xx…, inode=i27412, blockID=27, …) to the shared file server.]
Multiple clients
[Figure: another client issues Write(server=xx.xx…, inode=i27412, blockID=27, …) for the same block.]
Multiple clients
[Figure: the same multi-client setup.]
What if either of the other clients reads that block? Will it get the right data? What is the “right” data? Will it get the “last” version of the block written?
How to coordinate reads/writes and caching on multiple clients? How to keep the copies “in sync”?
Lease example: network file cache
- A read lease ensures that no other client is writing the data. Holder is free to read from its cache.
- A write lease ensures that no other client is reading or
writing the data. Holder is free to read/write from cache.
- Writer must push modified (dirty) cached data to the
server before relinquishing lease.
– Must ensure that another client can see all updates before it is able to acquire a lease allowing it to read or write.
- If some client requests a conflicting lock, server may
recall or evict on existing leases.
– Callback RPC from server to lock holder: “please release now.”
– Writers get a grace period to push cached writes and release.
Lease example: network file cache consistency
This approach is used in NFS and various other networked data services.
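A sketch of the server’s lease table for this scheme: read leases are shared, write leases are exclusive. Callbacks/recalls, grace periods, and expiration are omitted, and all names are illustrative.

class LeaseTable:
    def __init__(self):
        self.readers, self.writer = {}, {}      # per-object reader set and writer

    def acquire(self, obj, client, mode):
        readers = self.readers.setdefault(obj, set())
        writer = self.writer.get(obj)
        if mode == "read" and writer is None:
            readers.add(client)                 # many clients may hold read leases
            return True
        if mode == "write" and writer is None and readers <= {client}:
            self.writer[obj] = client           # exclusive: no other reader or writer
            return True
        return False                            # conflict: server would recall leases

    def release(self, obj, client):
        self.readers.get(obj, set()).discard(client)
        if self.writer.get(obj) == client:
            del self.writer[obj]                # a writer must flush dirty data first

t = LeaseTable()
print(t.acquire("inode-27412", "clientA", "read"))    # True
print(t.acquire("inode-27412", "clientB", "write"))   # False: conflicts with A's read lease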
Stronger write atomicity
- The systems in this module are not suitable to store
complex data with multiple writers.
- “Complex” means structured data, in which multiple
writes to different parts of the structure must complete together (grouped writes).
- Q: How to ensure that a group of writes completes “all or
nothing” with respect to crashes and other client reads?
- A: transactions.
Transactions
BEGIN T1: read X; read Y; …; write X; END
BEGIN T2: read X; write Y; …; write X; END
Database systems and other systems use a programming construct called atomic transactions (“ACID”) to represent a group of related reads/writes, often on different data items. Transactions commit atomically in a serial order.
ACID: Atomic, Consistent, Independent, Durable.
ACID properties of transactions
- Transactions are Atomic
– Each transaction either commits or aborts: it either executes entirely or not at all.
– Transactions don’t interfere with one another (I).
- Transactions appear to commit in some serial order
(serializable schedule).
- Each transaction is coded to transition the store from one
Consistent state to another.
- One-copy serializability (1SR): Transactions observe the effects of their predecessors, and not of their successors.
- Transactions are Durable.
– Committed effects survive failure.
Serial schedule
[Figure: a serial schedule. Transactions T1, T2, …, Tn carry the store through consistent states S0 → S1 → S2 → … → Sn.]
A consistent state is one that does not violate any internal invariant relationships in the data.
Transaction bodies must be coded correctly!
Limitations of Transactions?
- Why not use ACID transactions for
everything?
- How much work is it to serialize and commit
transactions?
- E.g., what if I want to add more servers?
Two-phase commit (2PC)
[Figure: two-phase commit message flow. The transaction manager/coordinator (TM/C) asks each resource manager/participant (RM/P) “commit or abort?”; each RM replies “here’s my vote”; the TM then announces “commit/abort!”.]
- Phases: precommit or prepare, vote, decide, notify.
- RMs validate the transaction and prepare by logging their local updates and decisions.
- TM logs commit/abort (the commit point).
- If the votes are unanimous to commit, decide to commit; else decide to abort.
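A toy single-process Python sketch of the two phases, with invented class names; real 2PC logs each step durably (write-ahead) and must cope with timeouts and failures, which this omits.

class Participant:                  # an RM/P
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "init"

    def prepare(self):
        # Phase 1: validate and log local updates, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        self.state = decision       # Phase 2: apply the coordinator's decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]        # prepare / vote
    decision = "commit" if all(votes) else "abort"     # TM's decision (commit point)
    for p in participants:
        p.finish(decision)                             # notify
    return decision

print(two_phase_commit([Participant(), Participant()]))        # commit
print(two_phase_commit([Participant(), Participant(False)]))   # abort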
More on 2PC later. What about CAP?
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
→ Single-sector write commits x-action (3)
Invariant: append new data to the log before applying it to the DB. This is called “write-ahead logging”.
[Figure: the log as a timeline: Begin, Write1, …, WriteN, Commit; the transaction is then complete.]
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
→ Single-sector write commits x-action (3)
What if we crash here (between 3 and 4)? On reboot, reapply committed updates in log order.
[Figure: the log as a timeline: Begin, Write1, …, WriteN, Commit.]
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
→ Single-sector write commits x-action (3)
What if we crash here (before the commit record)? On reboot, discard uncommitted updates.
[Figure: the log as a timeline: Begin, Write1, …, WriteN (no Commit).]
Recovery from a log
- Log entries for T record the writes by
T (or operations in T).
– Redo logging: the log records contain sufficient info to “redo” T after a crash.
- To recover, read the checkpoint and
replay committed log entries.
– “Redo” by reissuing writes or reinvoking the operations/methods.
– Redo in order (old to new).
– Skip the records of uncommitted Ts.
– Skip records of any Ts that committed before the atomic checkpoint was taken.
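A sketch of redo recovery that pairs with the toy write-ahead log shown earlier (same invented JSON-lines format): replay the writes of committed transactions in log order, and skip transactions with no commit record.

import json

def recover(log_lines):
    entries = [json.loads(line) for line in log_lines]
    committed = {e["xid"] for e in entries if e["op"] == "commit"}
    store = {}                                  # or start from the last checkpoint
    for e in entries:                           # redo in order, old to new
        if e["op"] == "write" and e["xid"] in committed:
            store[e["key"]] = e["value"]
    return store

log = [
    '{"xid": 18, "op": "write", "key": "x", "value": 1}',
    '{"xid": 19, "op": "write", "key": "y", "value": 2}',   # T19 never committed
    '{"xid": 18, "op": "commit"}',
]
print(recover(log))        # {'x': 1}: only T18's effects survive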
Managing a log
- On checkpoint, truncate the log
– No longer need the entries to recover.
- Checkpoint how often? Tradeoff:
– Checkpoints are expensive, BUT
– Long logs take up space.
– Long logs increase recovery time.
- Checkpoint+truncate is “atomic”
– Is it safe to redo/replay records whose effect is already in the checkpoint?
– Checkpoint “between” transactions, so checkpoint records a consistent state.
– Lots of approaches
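A sketch of checkpoint-and-truncate for the toy log above: force a snapshot of the in-memory state to disk, then truncate the log, since entries before the checkpoint are no longer needed for recovery. File names are invented.

import json, os

def checkpoint(store, log_path="wal.log", snap_path="snapshot.json"):
    with open(snap_path, "w") as f:
        json.dump(store, f)
        f.flush()
        os.fsync(f.fileno())       # snapshot must be durable before truncation
    open(log_path, "w").close()    # truncate: older entries are no longer needed

checkpoint({"x": 1, "y": 2})       # after this, recovery starts from the snapshot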