The Google File System, Joanna Świetlicka, October 13, 2010



SLIDE 1

Design overview Interactions Master operation Fault tolerance and diagnosis Measurements

The Google File System

Joanna Świetlicka, October 13, 2010

Joanna Świetlicka, The Google File System

SLIDE 2

Based on:

  • S. Ghemawat, H. Gobioff, and S.-T. Leung: “The Google file system,” in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003.

SLIDE 3

1. Design overview: Assumptions, Interface, Architecture, Single master, Chunk size, Metadata
2. Interactions: Mutation mechanism, Additional operations
3. Master operation
4. Fault tolerance and diagnosis
5. Measurements

SLIDE 4

Frequent failures

  • Hundreds of machines built from inexpensive commodity parts
  • Component failures are the norm rather than the exception
  • Constant monitoring, error detection, fault tolerance, and prompt automatic recovery must be integral to the system

SLIDE 5

Huge files

  • Modest number of large files
  • Multi-GB files are common
  • Small files are supported, but not optimized for
  • Design assumptions and parameters such as I/O operation and block sizes had to be revisited

SLIDE 6

Writing

  • Mostly appending new data rather than overwriting existing data
  • Large, sequential writes
  • Once written, files are seldom modified again
  • Appending is the focus of performance optimization and atomicity guarantees

SLIDE 7

Reading

  • Once written, files are only read, often only sequentially
  • Mostly large streaming reads and small random reads
  • Applications often batch and sort their small reads to advance steadily through the file

SLIDE 8

Concurrency

  • Files are often used as producer-consumer queues or for many-way merging
  • Hundreds of producers concurrently append to a single file
  • The file may be read later, or a consumer may be reading through the file simultaneously
  • Atomicity with minimal synchronization overhead is essential

SLIDE 9

Bandwidth vs. latency

  • High sustained bandwidth is more important than low latency
  • Most applications place a premium on processing data in bulk at a high rate
  • Few have stringent response time requirements for an individual read or write

SLIDE 10

Interface

  • GFS does not implement a standard API such as POSIX
  • Files are organized hierarchically in directories and identified by pathnames
  • Standard operations: create, delete, open, close, read, and write
  • Additional operations: snapshot and record append
  • Snapshot creates a copy of a file or a directory tree at low cost
  • Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append

SLIDE 11

Architecture

  • A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients
  • Each of these is a commodity Linux machine running a user-level server process

SLIDE 12

Files

  • Files are divided into fixed-size chunks
  • Each chunk is identified by a 64-bit chunk handle
  • Chunkservers store chunks on local disks as Linux files
  • Each chunk is replicated on multiple chunkservers (default: 3)

SLIDE 13

Master

  • Maintains all file system metadata:
      • namespace
      • access control information
      • mapping from files to chunks
      • current locations of chunks
  • Controls system-wide activities:
      • chunk lease management
      • garbage collection of orphaned chunks
      • chunk migration between chunkservers
  • Periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state

SLIDE 14

Communication

  • The GFS client communicates with the master and chunkservers to read or write data on behalf of the application
  • Clients interact with the master only for metadata operations
  • All data-bearing communication goes directly to the chunkservers

SLIDE 15

Cache

  • Clients cache only metadata, never file data
  • Caching data offers little benefit because most applications stream through huge files
  • Not having data caches simplifies the client and the overall system
  • Chunkservers need not cache file data because chunks are stored as local files (Linux’s buffer cache already keeps frequently accessed data in memory)

SLIDE 16

Single master

  • Having a single master simplifies the design
  • Minimizing its involvement in reads and writes ensures that it does not become a bottleneck
  • Clients only ask the master which chunkservers they should contact
  • They cache this information for a limited time and interact with the chunkservers directly for many subsequent operations

SLIDE 17

Interactions

1. The client translates the file name and byte offset into a chunk index within the file
2. It sends the master a request
3. The master replies with the corresponding chunk handle and locations of the replicas
4. The client caches this information
5. The client then sends a request to one of the replicas
6. Further reads of the same chunk require no more client-master interaction
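
The read path above can be sketched roughly as follows. This is only an illustration, assuming the 64 MB chunk size used in these slides; `FakeMaster`, `lookup`, and `Client.locate` are hypothetical names rather than GFS interfaces, and cache expiry is left out.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated in these slides


def chunk_index(byte_offset: int) -> int:
    """Step 1: translate a byte offset into a chunk index within the file."""
    return byte_offset // CHUNK_SIZE


class FakeMaster:
    """Hypothetical stand-in for the master: (file, chunk index) -> (handle, replicas)."""

    def __init__(self, table):
        self.table = table
        self.requests = 0  # counts how often clients had to ask the master

    def lookup(self, filename, index):
        self.requests += 1
        return self.table[(filename, index)]


class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk index) -> (handle, replica locations)

    def locate(self, filename, byte_offset):
        key = (filename, chunk_index(byte_offset))
        if key not in self.cache:  # steps 2-4: ask the master once, then cache
            self.cache[key] = self.master.lookup(*key)
        return self.cache[key]  # step 5 would contact one of these replicas


master = FakeMaster({("/logs/a", 1): ("handle-17", ["cs1", "cs2", "cs3"])})
client = Client(master)
client.locate("/logs/a", 70 * 1024 * 1024)   # offset 70 MB falls in chunk 1
client.locate("/logs/a", 100 * 1024 * 1024)  # same chunk: answered from the cache
print(master.requests)
```

Both lookups hit the same chunk, so the master is contacted only once (step 6).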

SLIDE 18

Interactions - scheme

SLIDE 19

Chunk size

  • 64 MB
  • Lazy space allocation avoids wasting space due to internal fragmentation
  • Advantages:
      • Reduction of clients’ need to interact with the master
      • Reduction of network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time
      • Reduction of the size of metadata

SLIDE 20

Metadata

  • Three types:
      • File and chunk namespaces
      • Mapping from files to chunks
      • Locations of each chunk’s replicas
  • All metadata is kept in the master’s memory
  • Namespaces and the file-to-chunk mapping are also kept persistent in an operation log stored on the master’s local disk and replicated on remote machines
  • The master does not store chunk location information persistently: it asks each chunkserver about its chunks

SLIDE 21

In-Memory Data Structures

  • Since metadata is stored in memory, master operations are fast
  • The amount of memory the master has is not a serious concern: it keeps less than 64 bytes of metadata for each 64 MB chunk, and typically less than 64 bytes for each file
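
A quick back-of-the-envelope check of the memory claim, as a sketch. The 1 PiB of stored data is a hypothetical figure chosen for illustration; only the 64-byte and 64 MB bounds come from the slide.

```python
CHUNK_SIZE = 64 * 2**20      # 64 MB per chunk
METADATA_PER_CHUNK = 64      # upper bound in bytes per chunk, from the slide


def chunk_metadata_bytes(total_data_bytes: int) -> int:
    """Upper bound on chunk metadata for a given amount of stored file data."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


one_pib = 2**50  # hypothetical amount of file data
print(chunk_metadata_bytes(one_pib) / 2**30, "GiB of chunk metadata")
```

The ratio of data to chunk metadata is at least a million to one, which is why keeping everything in the master's memory is workable.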

SLIDE 22

Chunk locations

  • The master does not keep a persistent record of which chunkservers have a replica of a given chunk
  • It polls chunkservers for that information at startup and periodically thereafter (with HeartBeat messages)
  • This eliminates the problem of keeping the master and chunkservers in sync

SLIDE 23

Operation log

  • Contains a historical record of critical metadata changes
  • Serves as a logical time line that defines the order of concurrent operations
  • It is replicated on multiple machines
  • The master responds to a client operation only after flushing the corresponding log record to disk

SLIDE 24

Operation log

  • The master recovers its file system state by replaying the operation log
  • To minimize startup time, the log must be kept small
  • The master checkpoints its state whenever the log grows beyond a certain size
  • The checkpoint is in a compact B-tree-like form that can be directly mapped into memory and used for namespace lookup without extra parsing
  • A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints
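
Recovery from a checkpoint plus the log tail might look like the sketch below. The record layout `(sequence number, operation, path)` and the dictionary-based checkpoint are assumptions made for illustration; the real checkpoint is the B-tree-like structure described above.

```python
def recover(checkpoint, log):
    """Rebuild the namespace from the latest complete checkpoint and the log tail."""
    state = dict(checkpoint["namespace"])  # start from the checkpointed state
    for seq, op, path in log:
        if seq <= checkpoint["last_seq"]:
            continue  # already reflected in the checkpoint, no need to replay
        if op == "create":
            state[path] = []  # new file with no chunks yet
        elif op == "delete":
            state.pop(path, None)
    return state


checkpoint = {"last_seq": 2, "namespace": {"/a": [], "/b": []}}
log = [
    (1, "create", "/a"),
    (2, "create", "/b"),
    (3, "delete", "/a"),
    (4, "create", "/c"),
]
print(sorted(recover(checkpoint, log)))
```

Only records 3 and 4 are replayed, which is the point of checkpointing: startup cost depends on the log written since the last checkpoint, not on the full history.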

SLIDE 25

Leases and mutation order

  • A mutation is an operation that changes the contents or metadata of a chunk (e.g. a write)
  • Leases are used to maintain a consistent mutation order across replicas
  • The master grants a chunk lease to one of the replicas, which we call the primary
  • The primary picks a serial order for all mutations to the chunk
  • All replicas follow this order when applying mutations
  • A lease has an initial timeout of 60 seconds, which can be extended
  • Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires
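
The ordering idea can be illustrated with a small sketch. `Primary`, `Replica`, and `lease_expired` are hypothetical names; the code only shows that replicas applying mutations by the primary's serial numbers end up in the same state even when messages arrive in a different order.

```python
LEASE_TIMEOUT = 60.0  # seconds, the initial lease timeout given above


def lease_expired(granted_at: float, now: float) -> bool:
    """The master may grant a new lease once the old one has timed out."""
    return now - granted_at >= LEASE_TIMEOUT


class Primary:
    """The lease holder: assigns consecutive serial numbers to mutations."""

    def __init__(self):
        self.next_serial = 1

    def order(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation


class Replica:
    """Applies mutations strictly in serial-number order."""

    def __init__(self):
        self.applied = []
        self.pending = {}

    def receive(self, serial, mutation):
        self.pending[serial] = mutation
        # Apply every mutation whose predecessors have all been applied.
        while len(self.applied) + 1 in self.pending:
            self.applied.append(self.pending.pop(len(self.applied) + 1))


primary = Primary()
ordered = [primary.order(m) for m in ["A", "B", "C"]]
r1, r2 = Replica(), Replica()
for serial, mutation in ordered:
    r1.receive(serial, mutation)
for serial, mutation in reversed(ordered):  # same mutations, different arrival order
    r2.receive(serial, mutation)
print(r1.applied == r2.applied)
```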

SLIDE 26

Data flow

  • The flow of data is decoupled from the flow of control to use the network efficiently
  • Control flows from the client to the primary and then to all secondaries
  • Data is pushed linearly along a carefully picked chain of chunkservers
  • Once a chunkserver receives some data, it starts forwarding immediately
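
The benefit of forwarding immediately can be estimated with the idealized model from the paper these slides are based on: pushing B bytes to R replicas takes about B/T + R·L, where T is the link throughput and L the per-hop latency. The sketch below assumes 100 Mbps links and a 0.5 ms hop latency as example values; the store-and-forward comparison is an added illustration, not a figure from the source.

```python
def pipelined_time(num_bytes, replicas, throughput_bps, hop_latency_s):
    """Idealized B/T + R*L: each hop forwards data as soon as it arrives."""
    return num_bytes * 8 / throughput_bps + replicas * hop_latency_s


def store_and_forward_time(num_bytes, replicas, throughput_bps, hop_latency_s):
    """Comparison case (assumption): each hop waits for all data before forwarding."""
    return replicas * (num_bytes * 8 / throughput_bps + hop_latency_s)


one_mb = 1_000_000
print(pipelined_time(one_mb, 3, 100e6, 0.0005))
print(store_and_forward_time(one_mb, 3, 100e6, 0.0005))
```

With these example numbers the pipelined push takes roughly 80 ms, about a third of the store-and-forward time for three replicas.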

SLIDE 27

Write control - scheme

SLIDE 28

Record append

  • The client specifies only the data
  • GFS appends it to the file at least once atomically at an offset of GFS’s choosing
  • If appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB), the chunk is padded to the maximum size and the next chunk is created
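
The padding rule can be written down as a small decision function. This is a sketch only; `record_append` and its return tuple are hypothetical, and replication and failure handling are left out.

```python
CHUNK_SIZE = 64 * 2**20  # maximum chunk size (64 MB)


def record_append(chunk_used: int, record_len: int):
    """Decide what happens to an append; returns (action, offset, new_chunk_used)."""
    if chunk_used + record_len > CHUNK_SIZE:
        # The record does not fit: pad the chunk to its maximum size so the
        # record can go into the next chunk instead of straddling two chunks.
        return ("pad_and_use_next_chunk", None, CHUNK_SIZE)
    # The record fits: GFS, not the client, picks the offset (end of the chunk).
    return ("append", chunk_used, chunk_used + record_len)


print(record_append(0, 100))
print(record_append(CHUNK_SIZE - 10, 100))
```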

SLIDE 29

Record append

  • For the operation to report success, the data must have been written at the same offset on all replicas of some chunk
  • If a record append fails at any replica, the client retries the operation
  • Replicas of the same chunk may contain different data, possibly including duplicates of the same record
  • GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit

SLIDE 30

Snapshot

  • The snapshot operation makes a copy of a file or a directory tree
  • It uses standard copy-on-write techniques

SLIDE 31

Snapshot

1. When the master receives a snapshot request, it first revokes any relevant leases
2. Then, the master logs the operation to disk
3. It then applies this log record to its in-memory state by duplicating the metadata
4. The newly created snapshot files point to the same chunks as the source files
5. The next time a chunk is to be written, the master notices that its reference count is greater than one
6. It then asks each chunkserver that has a current replica of the original chunk to create a copy of it
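
The copy-on-write bookkeeping in steps 3 to 6 can be sketched with reference counts. `Master`, `snapshot`, and `prepare_write` are hypothetical names; lease revocation, logging, and the actual chunk copy on the chunkservers are omitted.

```python
import itertools


class Master:
    """Toy master that tracks which files share which chunks."""

    def __init__(self):
        self.files = {}     # path -> list of chunk handles
        self.refcount = {}  # chunk handle -> number of files pointing at it
        self._ids = itertools.count(1)

    def create(self, path, n_chunks):
        handles = [next(self._ids) for _ in range(n_chunks)]
        self.files[path] = handles
        for h in handles:
            self.refcount[h] = 1

    def snapshot(self, src, dst):
        # Duplicate metadata only: the snapshot points at the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def prepare_write(self, path, index):
        h = self.files[path][index]
        if self.refcount[h] > 1:
            # Shared chunk: allocate a new handle; chunkservers holding the
            # original would be asked to copy it locally at this point.
            new = next(self._ids)
            self.refcount[h] -= 1
            self.refcount[new] = 1
            self.files[path][index] = new
            return new
        return h


m = Master()
m.create("/a", 2)
m.snapshot("/a", "/snap")
m.prepare_write("/a", 0)
print(m.files)
```

After the write is prepared, `/a` refers to a fresh chunk while `/snap` still refers to the original, so the snapshot is unaffected by later writes.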

SLIDE 32

Master operation

  • The master executes all namespace operations
  • It manages chunk replicas throughout the system:
      • Makes placement decisions
      • Creates new chunks and hence replicas
      • Coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage

SLIDE 33

Namespace management and locking

  • GFS represents its namespace as a lookup table mapping full pathnames to metadata
  • Each node in the namespace tree has an associated read-write lock
  • Each master operation acquires a set of locks before it runs (read locks on the pathnames of all superdirectories and a read or write lock on the full pathname)
  • Creating a file does not require a write lock on the parent directory, as there is no inode-like data structure
  • Multiple file creations can be executed concurrently in the same directory
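
The lock set for an operation can be derived directly from the pathname. The sketch below uses hypothetical helpers `locks_for` and `conflicts` and example paths; it shows why two creations in the same directory do not block each other, while an operation that write-locks the directory does.

```python
def locks_for(path: str, write: bool):
    """Read locks on every ancestor pathname, read or write lock on the full path."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = [(p, "read") for p in prefixes[:-1]]
    locks.append((prefixes[-1], "write" if write else "read"))
    return locks


def conflicts(a, b):
    """Two lock sets conflict if they share a pathname and either side writes it."""
    held = dict(a)
    return any(p in held and "write" in (mode, held[p]) for p, mode in b)


create_a = locks_for("/home/user/a", write=True)
create_b = locks_for("/home/user/b", write=True)
lock_dir = locks_for("/home/user", write=True)  # e.g. an operation on the directory itself

print(create_a)
print(conflicts(create_a, create_b))  # both only read-lock /home/user
print(conflicts(lock_dir, create_a))  # write lock on /home/user vs. read lock
```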

SLIDE 34

Replica placement

  • The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization
  • Replicas are spread across different machines and racks

SLIDE 35

Chunk creation

  • When the master creates a chunk, it chooses where to place the initially empty replicas. It considers several factors:
      • Chunkservers with below-average disk space utilization are preferred
      • The number of “recent” creations on each chunkserver should be limited
      • Replicas of a chunk should be spread across racks

SLIDE 36

Re-replication

  • The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
  • Priority of re-replication is based on several factors:
      • How far the chunk is from its replication goal
      • Chunks from live files are replicated before chunks that belong to recently deleted files
      • Chunks that are blocking client progress are prioritized
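
One way to combine these factors is a sort key, sketched below. The slides list the factors but not how they are weighted, so the specific ordering here (blocking chunks first, then live files, then largest replica deficit) is an assumption, as are the `Chunk` fields.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    handle: int
    live_replicas: int
    goal: int = 3                  # replication goal (default of 3 replicas)
    file_deleted: bool = False     # belongs to a recently deleted file
    blocking_client: bool = False  # a client is waiting on this chunk


def priority_key(c: Chunk):
    """Smaller keys are re-replicated first (assumed combination of the factors)."""
    deficit = c.goal - c.live_replicas
    return (not c.blocking_client, c.file_deleted, -deficit)


queue = [
    Chunk(1, live_replicas=2),
    Chunk(2, live_replicas=1),
    Chunk(3, live_replicas=1, file_deleted=True),
    Chunk(4, live_replicas=2, blocking_client=True),
]
print([c.handle for c in sorted(queue, key=priority_key)])
```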

SLIDE 37

Rebalancing

  • Rebalancing is performed periodically
  • The master examines the current replica distribution and moves replicas for better disk space and load balancing
  • Through this process, the master gradually fills up new chunkservers
  • Replicas are preferentially removed from chunkservers with below-average free space

SLIDE 38

Garbage collection

  • After a file is deleted, GFS does not immediately reclaim the physical storage
  • The master logs a file’s deletion immediately
  • The file is renamed to a hidden name
  • During the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days
  • In a similar scan, the master identifies orphaned chunks and erases the metadata for those chunks
  • In a HeartBeat message, each chunkserver reports what chunks it has, and the master replies with the chunks that are no longer present in the master’s metadata
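
The two halves of lazy reclamation can be sketched as plain functions. `scan_namespace` and `heartbeat_reply` are hypothetical names, timestamps are plain seconds, and the hidden-file naming is made up for the example.

```python
GRACE_SECONDS = 3 * 24 * 3600  # hidden files older than three days are removed


def scan_namespace(hidden_files, now):
    """Return hidden (deleted) files that have existed longer than the grace period."""
    return [path for path, deleted_at in hidden_files.items()
            if now - deleted_at > GRACE_SECONDS]


def heartbeat_reply(reported_chunks, master_chunks):
    """Chunks a chunkserver reports that the master no longer knows about.

    The chunkserver is free to delete its replicas of these chunks.
    """
    return sorted(set(reported_chunks) - set(master_chunks))


hidden = {"/.deleted/a": 0, "/.deleted/b": 200_000}
print(scan_namespace(hidden, now=300_000))
print(heartbeat_reply([1, 2, 3], master_chunks=[2, 3, 4]))
```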

SLIDE 39

Garbage collection

  • Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful
  • It merges storage reclamation into the regular background activities of the master
  • The delay in reclaiming storage provides a safety net against accidental, irreversible deletion

SLIDE 40

Stale Replica Detection

  • Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down
  • For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas
  • Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas
  • The master removes stale replicas in its regular garbage collection
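
Version-based staleness detection can be sketched as follows. `grant_lease` and `stale_replicas` are hypothetical helpers, and chunkserver names and version numbers are example values.

```python
def grant_lease(master_version, replica_versions, reachable):
    """Bump the chunk version and inform only reachable, up-to-date replicas."""
    new_version = master_version + 1
    updated = dict(replica_versions)
    for server in reachable:
        if updated[server] == master_version:
            updated[server] = new_version
    return new_version, updated


def stale_replicas(master_version, replica_versions):
    """Replicas whose version lags behind the master's are stale."""
    return sorted(s for s, v in replica_versions.items() if v < master_version)


versions = {"cs1": 4, "cs2": 4, "cs3": 4}
# cs3 is down while a new lease is granted, so it misses the version bump.
version, versions = grant_lease(4, versions, reachable=["cs1", "cs2"])
print(version, versions, stale_replicas(version, versions))
```

When `cs3` comes back and reports version 4, the master can see it is behind and leave that replica to garbage collection.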

SLIDE 41

High availability

  • Fast recovery of master and chunkservers
  • Chunk replication
  • Master replication

SLIDE 42

Master replication

  • The operation log and checkpoints are replicated on multiple machines
  • A mutation is considered committed only after its log record has been flushed to disk locally and on all master replicas
  • If a master machine fails, monitoring infrastructure starts a new master process elsewhere
  • “Shadow” masters provide read-only access to the file system even when the primary master is down
  • A shadow master reads a replica of the log and applies the same changes to its data structures exactly as the primary does
  • Like the primary, it polls chunkservers at startup and exchanges frequent handshake messages with them to monitor their status

SLIDE 43

Data Integrity

  • Chunkservers use checksumming to detect data corruption
  • A chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum
  • Checksums are kept in memory and stored persistently with logging
  • For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data
  • If a block does not match its checksum, the chunkserver returns an error and reports it to the master, which clones the chunk from another replica. The invalid replica is then removed
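
Read-time verification can be sketched with per-block checksums. CRC-32 from Python's `zlib` is used here only as a convenient 32-bit checksum; the slides do not say which algorithm GFS uses, and `block_checksums` and `verified_read` are hypothetical names.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks


def block_checksums(chunk: bytes):
    """One 32-bit checksum per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def verified_read(chunk: bytes, checksums, offset: int, length: int) -> bytes:
    """Verify every block overlapping the read range before returning data."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            # A real chunkserver would also report this to the master.
            raise IOError(f"checksum mismatch in block {b}")
    return chunk[offset:offset + length]


chunk = bytes(range(256)) * 1024     # 256 KB of sample data, i.e. 4 blocks
sums = block_checksums(chunk)
data = verified_read(chunk, sums, offset=65530, length=20)  # spans blocks 0 and 1
print(len(sums), len(data))
```

Because only the overlapping blocks are checked, a small read costs at most a couple of 64 KB checksum computations.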

SLIDE 44

Data Integrity

  • Checksum computation is optimized for appends
  • The checksum is incrementally updated for the last partial block, and computed for any brand new blocks filled by the append
  • For writes, the first and last blocks of the range being overwritten must be read and verified first
  • During idle periods, chunkservers scan and verify inactive chunks
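
The append optimization relies on a checksum that can be extended without rereading old data. The sketch below again uses `zlib.crc32` as a stand-in 32-bit checksum (an assumption, as above); its optional second argument continues a running checksum. `append_checksums` and `full_checksums` are hypothetical helpers.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks


def full_checksums(data: bytes):
    """Reference implementation: recompute every block checksum from scratch."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


def append_checksums(old_len: int, checksums, new_data: bytes):
    """Update checksums for an append without rereading the existing data."""
    sums = list(checksums)
    pos = 0
    partial = old_len % BLOCK_SIZE
    if partial:
        # Extend the running checksum of the last, partially filled block.
        room = BLOCK_SIZE - partial
        sums[-1] = zlib.crc32(new_data[:room], sums[-1])
        pos = room
    while pos < len(new_data):
        # Brand new blocks get a fresh checksum.
        sums.append(zlib.crc32(new_data[pos:pos + BLOCK_SIZE]))
        pos += BLOCK_SIZE
    return sums


existing = b"x" * 100_000
appended = b"y" * 50_000
incremental = append_checksums(len(existing), full_checksums(existing), appended)
print(incremental == full_checksums(existing + appended))
```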

SLIDE 45

Micro-benchmarks
