SLIDE 1

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (SOSP 2003). Presented by Kun Suo.

SLIDE 2

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 3

Motivation

SLIDE 4

What is GFS?

  • The Google File System (GFS) is a scalable distributed file system for large distributed data-intensive applications. It runs on inexpensive commodity hardware and provides fault tolerance and high performance to a large number of clients.
  • GFS shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability.

SLIDE 5

GFS Assumptions

  • Hardware: the system is built from many inexpensive commodity components that often fail
  • Files: the system stores a modest number of large files
  • Workload characteristics:
  • Large streaming reads
  • Small random reads
  • Many large, sequential writes that append data to files
  • Clients: the system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file
  • Target: high sustained bandwidth is more important than low latency
SLIDE 6

Interface of GFS

  • GFS provides a familiar file system interface: it supports the usual operations to create, delete, open, close, read, and write files
  • GFS additionally supports snapshot and record append operations, useful for:
  • Producer-consumer queues
  • Many-way merging
SLIDE 7

Architecture of GFS

  • GFS components:
  • A single master
  • Multiple clients
  • Multiple GFS chunkservers
SLIDE 8

Chunk Size

  • Chunk size is set to 64 MB
  • Pros:
  • Fewer interactions between clients and the master
  • Clients can keep long-lived TCP connections to chunkservers, reducing network overhead
  • Less metadata on the master
  • Cons:
  • A small file occupies few chunks, perhaps just one
  • If too many clients visit the same small file, its chunkservers become hot spots
SLIDE 9

Metadata

  • Three types of metadata:
  • (1) File and chunk namespaces
  • (2) Mapping from files to chunks
  • (3) Locations of each chunk’s replicas
  • All metadata is kept in the master’s memory (performance)
  • Fast
  • Easily accessible
  • (1) & (2) are kept persistent by logging (reliability); (3) is not persisted, but refreshed periodically from the chunkservers (a sketch of these structures follows)
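
A minimal sketch, with hypothetical names (not the real GFS code), of how the three in-memory tables could be laid out; (1) and (2) are recoverable from the operation log, while (3) is rebuilt by polling chunkservers:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 0
    locations: list[str] = field(default_factory=list)  # chunkserver addresses; NOT persisted

class MasterMetadata:
    def __init__(self):
        self.namespace: set[str] = set()                # (1) file and chunk namespaces -> logged
        self.file_to_chunks: dict[str, list[int]] = {}  # (2) file -> chunk handles -> logged
        self.chunks: dict[int, ChunkInfo] = {}          # (3) chunk -> replicas -> from heartbeats
```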

SLIDE 10

Master Node

  • Metadata storage
  • Namespace management
  • Periodically communicate with chunkservers
  • Chunk operations: create, re-replicate, delete, garbage collection, load balancing, etc.
SLIDE 11

System Interaction

  • (1) Mutation
  • (2) Lease: minimizes management overhead at the master
SLIDE 12

Mutation

  • Mutation = a write or append to the contents or metadata of a chunk
  • Must be done on all replicas (consistency)
  • Lease
  • The master picks one replica as primary and gives it a “lease” for mutations; the primary sets the mutation order for all replicas
  • Purpose
  • Data flow decoupled from control flow
  • Minimize master involvement
SLIDE 13

Outline

  • GFS Background, Concepts and Key words (Question)
  • Example of GFS Operations
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 14

Question [1]

  • “…its design has been driven by key observations of our application workloads and technological environment,…” What are the workload and technology characteristics GFS assumed in its design, and what are their corresponding design choices?

—> GFS design assumptions and target workload

SLIDE 15

GFS Assumptions

  • Hardware: the system is built from many inexpensive commodity components that often fail
  • Files: the system stores a modest number of large files
  • Workload characteristics:
  • Large streaming reads
  • Small random reads
  • Many large, sequential writes that append data to files
  • Clients: the system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file
  • Target: high sustained bandwidth is more important than low latency
SLIDE 16

Question [2]

  • “…while caching data blocks in the client loses its appeal.” GFS does not cache file data. Why does this design choice not lead to performance loss? What benefit does this choice have?

No performance loss: (1) applications stream through huge files; (2) working sets are too large to cache, so client caches offer little benefit. (Clients do still cache metadata for future accesses.) Benefits: (a) it simplifies the design of GFS; (b) it eliminates cache coherence issues, which are challenging in a distributed system.

SLIDE 17

Question [3]

  • “Small files must be supported, but we need not optimize for them.” Why? Large and small files exist in almost every system.

(a) GFS is designed to store millions of large files, each typically 100 MB or larger in size. (b) The chunkservers storing chunks that belong to small files may become hot spots if many clients access the same file. In practice, hot spots have not been a major issue because Google’s applications mostly read large multi-chunk files sequentially. (c) This is one of the disadvantages of GFS.

SLIDE 18

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 19

Read in GFS

  • 1. Application originates the read request
  • 2. GFS client translates the request and sends it to the master
  • 3. Master responds with chunk handle and replica locations

[Figure: Application ↔ Client ↔ Master/chunkservers. Messages: ① file name, byte range; ② file name, chunk index; ③ chunk handle, replica locations; ④ chunk handle, byte range; ⑤ data from file; ⑥ data]

SLIDE 20

Read in GFS

[Figure: same read diagram as the previous slide]

  • 4. Client picks a location and sends the request
  • 5. Chunkserver sends the requested data to the client
  • 6. Client forwards the data to the application
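
The six steps condense into a short sketch (hypothetical RPC stubs such as master.lookup and chunkserver.read, not the real GFS client library); note that data never flows through the master:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def gfs_read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE                       # (2) file name, chunk index
    handle, replicas = master.lookup(filename, chunk_index)  # (3) handle + locations
    chunkserver = replicas[0]                                # (4) pick a location
    # (5)/(6) data comes straight from the chunkserver to the client
    return chunkserver.read(handle, offset % CHUNK_SIZE, length)
```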

SLIDE 21

Write on GFS

[Figure: Application ↔ Client ↔ Master; Client pushes to one primary and two replica chunkservers. Messages ①–⑨; ① carries file name and byte range]

  • 1. Application originates the request
  • 2. GFS client translates the request and sends it to the master
  • 3. Master responds with chunk handle and replica locations

SLIDE 22

Write on GFS

[Figure: same write diagram as the previous slide]

  • 4. Client pushes write data to all locations; the data is stored in the chunkservers’ internal buffers
  • 5. Client sends the write command to the primary

SLIDE 23

Write on GFS

[Figure: same write diagram as the previous slide]

  • 6. Primary determines the serial order for the data instances in its buffer and writes the instances in that order to the chunk
  • Primary sends the serial order to the secondaries and tells them to perform the write

SLIDE 24

Write on GFS

[Figure: same write diagram as the previous slide]

  • 7. Secondaries respond back to the primary
  • 8. Primary responds back to the client
  • 9. Client responds to the application
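
The nine steps condense into a sketch (hypothetical RPC stubs such as lookup_for_write, push_data, commit, and apply, not the real GFS API); data is pushed to every replica first, then a small control message to the primary commits it in a fixed serial order:

```python
def gfs_write(master, filename, offset, data):
    # (2)/(3) metadata from the master: chunk handle, primary, secondaries
    handle, primary, secondaries = master.lookup_for_write(filename, offset)
    # (4) push data to every replica; chunkservers buffer it internally
    for server in [primary, *secondaries]:
        server.push_data(handle, data)
    # (5)/(6) the primary assigns a serial order and applies the write
    serial = primary.commit(handle, offset)
    # (6)/(7) secondaries apply the mutation in that same order and acknowledge
    acks = [s.apply(handle, serial) for s in secondaries]
    # (8)/(9) report success to the client only if every replica applied it
    return all(acks)
```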

SLIDE 25

Append on GFS

  • In a traditional write, the client specifies the offset at which data is to be written.
  • Record append is the same as a write, except that there is no offset: GFS picks the offset and returns it to the client, which makes appends safe for concurrent writers. This is the key difference (see the sketch below).
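
A two-line sketch of the interface difference, with hypothetical file objects:

```python
def traditional_write(f, offset, data):
    f.write_at(offset, data)  # client chooses the offset; concurrent writers
                              # to the same region can overwrite each other

def record_append(f, data):
    return f.append(data)     # GFS (the primary) chooses the offset atomically
                              # and returns it, so concurrent appends are safe
```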

SLIDE 26

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations (Question)
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 27

Question [4]

  • “Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.” How does this design help improve the system’s performance?

The single master is a potential bottleneck; sending data directly to chunkservers minimizes the clients’ involvement with the master during reads and writes.

SLIDE 28

Question [5]

  • “A GFS cluster consists of a single master…” What’s the benefit of having only a single master? What’s its potential performance risk? How does GFS minimize such a risk?

1. Benefit: it simplifies the design. 2. Risk: the master is a potential bottleneck. 3. Mitigation: minimize the clients’ involvement with the master in reads and writes.

SLIDE 29

Question [6]

  • “Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.” How does GFS collaborate with the chunkserver’s local file system to store file chunks? What is lazy space allocation and what is its benefit?

GFS is composed of many servers. Each server is typically a commodity Linux machine running a user-level server process. A file in GFS is ultimately stored on local servers as regular Linux files.

SLIDE 30

Question [6]

  • “Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.” How does GFS collaborate with the chunkserver’s local file system to store file chunks? What is lazy space allocation and what is its benefit?

[Figure: chunks are stored with the help of the local file system]

SLIDE 31

Question [6]

  • “Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.” How does GFS collaborate with the chunkserver’s local file system to store file chunks? What is lazy space allocation and what is its benefit? Lazy allocation simply means not allocating a resource until it is actually needed. Benefit: lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
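
A minimal sketch of the idea: the Linux file backing a chunk grows only as data is actually written, rather than being preallocated to the full 64 MB.

```python
import os

CHUNK_SIZE = 64 * 1024 * 1024

def write_to_chunk(chunk_path, offset, data):
    assert offset + len(data) <= CHUNK_SIZE
    # O_CREAT: the plain Linux file backing the chunk appears on first write
    fd = os.open(chunk_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)  # the file is extended only as needed
    finally:
        os.close(fd)
    # A 1 KB record in a fresh chunk consumes ~1 KB on disk, not 64 MB:
    # no internal fragmentation despite the large chunk size.
```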

SLIDE 32

Question [7]

  • “On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages.” Give an example disadvantage.

A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots did develop when GFS was first used by a batch-queue system: the few chunkservers storing an executable file were overloaded by hundreds of simultaneous requests. This was fixed by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times.

SLIDE 33

Question [7]

  • “On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages.” Give an example disadvantage.

[Figure example: many clients accessing the same single-chunk small file make its chunkserver a hot spot]

SLIDE 34

Question [8]

  • “One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has.” Why is GFS’s master able to keep the metadata in memory?

With a 64 MB chunk size, the master keeps less than 64 bytes of metadata per chunk, so the metadata is small enough to fit in memory.
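
A back-of-the-envelope check (illustrative numbers, not from the slides): even a petabyte of stored data needs only about a gigabyte of chunk metadata.

```python
CHUNK_SIZE      = 64 * 1024**2   # 64 MB per chunk
BYTES_PER_CHUNK = 64             # < 64 B of metadata per chunk

data_stored = 1024**5                         # say the cluster stores 1 PB
num_chunks  = data_stored // CHUNK_SIZE       # = 16,777,216 chunks
metadata    = num_chunks * BYTES_PER_CHUNK    # = 1 GB
print(f"{metadata / 1024**3:.0f} GB of chunk metadata for 1 PB of data")
```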

SLIDE 35

Question [9]

  • “We use leases to maintain a consistent mutation order across replicas.” Could you show a scenario where an unexpected result may appear if the lease mechanism is not implemented? Also explain how leases help address the problem.

Without a lease, replicas may apply concurrent mutations in different orders:
  • Primary order: A, B, C
  • Non-primary order: A, C, B
  • Non-primary order: B, A, C

SLIDE 36

Question [9]

  • “We use leases to maintain a consistent mutation order across replicas.” Could you show a scenario where an unexpected result may appear if the lease mechanism is not implemented? Also explain how leases help address the problem.

With a lease, the non-primary replicas follow the primary’s order:
  • Primary order: A, B, C
  • Non-primary order: A, B, C
  • Non-primary order: A, B, C

SLIDE 37

Question [9]

  • “We use leases to maintain a consistent mutation order across replicas.” Could you show a scenario where an unexpected result may appear if the lease mechanism is not implemented? Also explain how leases help address the problem.

Lease: keeps a consistent mutation order; secondary replicas follow the primary replica (a sketch follows).
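
A minimal sketch, with hypothetical names, of how a lease fixes the divergence above: only the lease-holding primary assigns serial numbers, and every replica applies mutations in that serial order.

```python
import heapq

class Replica:
    def __init__(self):
        self.pending, self.applied, self.next_serial = [], [], 0

    def receive(self, serial, mutation):
        heapq.heappush(self.pending, (serial, mutation))
        # apply strictly in serial order, even if messages arrive out of order
        while self.pending and self.pending[0][0] == self.next_serial:
            self.applied.append(heapq.heappop(self.pending)[1])
            self.next_serial += 1

class Primary(Replica):
    def mutate(self, mutation, secondaries):
        serial = self.next_serial  # the primary alone assigns the global order
        for replica in [self, *secondaries]:
            replica.receive(serial, mutation)
```

Even if the network reorders delivery to a secondary, `receive` buffers mutations until the gap fills, so `applied` ends up identical on every replica.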

SLIDE 38

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 39

Some Optimizations on GFS

  • Snapshot
  • Fault tolerance
  • Relaxed Consistency Model
SLIDE 40

Snapshot

  • A snapshot is a copy of a file or directory tree made at a moment in time, at low cost
  • Snapshot is implemented based on standard copy-on-write
  • Why use snapshots?
  • To quickly create branch copies of huge data sets (performance)
  • Quick data access for end users (performance)
  • Changes can be committed or rolled back easily (reliability)
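
A minimal sketch of copy-on-write (illustrative, not the actual GFS implementation): a snapshot copies only metadata, and a chunk is physically duplicated on the first write after the snapshot.

```python
refcount = {}                                 # chunk handle -> reference count

def snapshot(chunks: list) -> list:
    for h in chunks:
        refcount[h] = refcount.get(h, 1) + 1  # chunks become shared, not copied
    return list(chunks)                       # metadata-only copy: cheap

def write(chunks: list, i: int, data: bytes, storage: dict):
    h = chunks[i]
    if refcount.get(h, 1) > 1:                # chunk shared with a snapshot?
        new_h = h + "'"                       # copy-on-write: duplicate it now
        storage[new_h] = bytes(storage[h])
        refcount[h] -= 1
        chunks[i] = h = new_h
        refcount[h] = 1
    storage[h] = data                         # the snapshot's copy is untouched
```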
SLIDE 41

Fault Tolerance

  • High availability
  • Fast recovery
  • Master and chunkservers restore their state and restart in a few seconds after a failure
  • Chunk replication
  • Each chunk is replicated on multiple chunkservers on different racks; users can specify different replication levels for different parts of the file namespace
  • Default: 3 replicas
  • Shadow masters
  • Data integrity: a checksum for every 64 KB block in each chunk
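
A minimal sketch of per-block checksumming (illustrative; GFS uses 32-bit checksums, approximated here with zlib.crc32):

```python
import zlib

BLOCK = 64 * 1024  # 64 KB blocks within each chunk

def checksums(chunk: bytes) -> list[int]:
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, sums: list[int], block_index: int) -> bytes:
    block = chunk[block_index * BLOCK:(block_index + 1) * BLOCK]
    if zlib.crc32(block) != sums[block_index]:
        # on mismatch: report to the master and read from another replica
        raise IOError("corrupt block detected")
    return block
```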
SLIDE 42

Relaxed Consistency Model

  • Applications rely on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records
  • far more efficient and resilient for applications
  • Many writers concurrently append to a file for merged results or as a producer-consumer queue
  • simple and efficient
  • Google’s apps live with it
SLIDE 43

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations
  • Some optimizations in GFS (Question)
  • Evaluation
  • Conclusion
SLIDE 44

Question [10]

  • “When the master creates a chunk, it chooses where to place the initially empty replicas.“ What are the criteria for choosing where to place the initially empty replicas?

1. Place new replicas on chunkservers with below-average disk-space utilization (balance)
2. Limit the number of “recent” creations on each chunkserver (a new chunk implies imminent heavy writes)
3. Spread replicas of a chunk across racks (reliability)
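
A minimal sketch of the three criteria, with hypothetical chunkserver fields (disk_utilization, recent_creations, rack):

```python
def place_replicas(chunkservers, num_replicas=3):
    avg = sum(cs.disk_utilization for cs in chunkservers) / len(chunkservers)
    # criteria 1 and 2: prefer under-utilized servers with few recent creations
    candidates = sorted(
        (cs for cs in chunkservers if cs.disk_utilization <= avg),
        key=lambda cs: cs.recent_creations,
    )
    chosen, racks = [], set()
    for cs in candidates:
        if cs.rack not in racks:       # criterion 3: spread across racks
            chosen.append(cs)
            racks.add(cs.rack)
        if len(chosen) == num_replicas:
            break
    return chosen
```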

SLIDE 45

Question [11]

  • “The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal.” When a new chunkserver is added to the system, the master mostly uses chunk rebalancing rather than filling it up with new chunks. Why?

2. Limit the number of “recent” creations on each chunkserver (a new chunk implies imminent heavy writes): filling the new server with new chunks would swamp it with heavy I/O flow, bad :(
3. Spread replicas of a chunk across racks (reliability): putting all eggs in one basket is not safe

SLIDE 46

Question [12]

  • “After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels.” How are files and chunks deleted? What are the advantages of the delayed space reclamation (garbage collection) over eager deletion?

Files: when a file is deleted by the application, the master logs the deletion immediately, but the file is just renamed to a hidden name that includes the deletion timestamp. During the master’s regular scan of the file system namespace, it removes any such hidden files that have existed for more than three days, and only then removes the namespace entry, metadata, etc. Chunks: in the regular HeartBeat exchanges, the master identifies chunks that are no longer reachable from any file and erases their metadata; the chunkservers then delete those chunks.
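
A minimal sketch, with hypothetical names, of the file-level half: delete = rename to a hidden, timestamped name; reclamation happens later during the regular namespace scan.

```python
import time

GRACE_PERIOD = 3 * 24 * 3600                         # three days

def delete_file(namespace: dict, name: str):
    hidden = f".deleted.{name}.{int(time.time())}"   # hidden name + timestamp
    namespace[hidden] = namespace.pop(name)          # rename only; data intact
    # until the scan below, the file can be undeleted by renaming it back

def namespace_scan(namespace: dict, now=None):
    now = now or time.time()
    for name in list(namespace):
        if name.startswith(".deleted."):
            deleted_at = int(name.rsplit(".", 1)[1])
            if now - deleted_at > GRACE_PERIOD:
                del namespace[name]  # only now are metadata and chunks freed
```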

SLIDE 47

Question [12]

  • “After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels.” How are files and chunks deleted? What are the advantages of the delayed space reclamation (garbage collection) over eager deletion?

Advantages: 1. It is simple and reliable in a large distributed system. 2. It merges storage reclamation into the regular background activities of the master, adding little overhead or burden to the master node. 3. It guards against accidental, irreversible deletion.

SLIDE 48

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 49

Evaluation Environment

  • Cluster
  • 1 master
  • 16 chunkservers (1.4 GHz PIII CPU, 2 GB RAM, two 80 GB disks, 100 Mbps Ethernet)
  • 16 clients
  • Server machines connected to a central switch by 100 Mbps Ethernet
  • The two switches (HP 2524) are connected with a 1 Gbps link
SLIDE 50

Aggregate Throughputs

  • N clients read a 4 MB region from a 320 GB file set simultaneously
  • Read rate drops slightly as the number of clients goes up, due to the growing probability of multiple clients reading from the same chunkserver
  • 1 client: 10 MB/s, 80% of the per-client limit
  • 16 clients: 6 MB/s per client, 75% of the limit
SLIDE 51

Aggregate Throughputs

  • N clients write to N distinct files simultaneously
  • The low write rate is due to the delay in propagating data among replicas
  • Slow individual writes are not a major problem, since aggregate write bandwidth grows with a large number of clients
  • 1 client: 6.3 MB/s, 50% of the limit
  • 16 clients: 2.2 MB/s per client
SLIDE 52

Aggregate Throughputs

  • N clients append to a single file simultaneously
  • Append rate drops slightly as the number of clients goes up, due to network congestion caused by the different clients
  • Chunkserver network congestion is not a major issue in practice, with many clients appending to multiple large shared files
  • 1 client: 6 MB/s
  • 16 clients: 4.8 MB/s per client
SLIDE 53

Real World Clusters

  • Cluster A: research and development
  • Cluster B: production data processing

SLIDE 54

GFS Deployment in Google

  • Many GFS clusters
  • Hundreds/thousands of storage nodes each
  • Managing petabytes of data
  • GFS underlies BigTable and other systems
SLIDE 55

Outline

  • GFS Background, Concepts and Key words
  • Example of GFS Operations
  • Some optimizations in GFS
  • Evaluation
  • Conclusion
SLIDE 56

Conclusion

  • The Google File System is a scalable distributed file system for large distributed data-intensive applications. It runs on inexpensive commodity hardware and provides fault tolerance and high performance to a large number of clients.
  • GFS shares many of the same goals as previous distributed file systems, but has its own innovations and limitations (single-master bottleneck, designed for large files, hot spots, etc.)
  • GFS meets Google’s storage needs and serves Google’s apps and services

SLIDE 57

One Comparison

  • Taobao File System (TFS) from Alibaba
  • Hundreds of millions of products
  • Product images, descriptions, comments, transactions, etc. are all small files

SLIDE 58

Taobao File System

  • One chunk contains many small files, located through a hierarchy of indexes (1st-level index … Nth-level index)
  • Open sourced
  • An optimization for small files (a sketch of the idea follows)
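
A purely illustrative sketch (assumed structure, not TFS’s real format) of the packing idea: many small files share one large chunk, so each small file costs only an index entry rather than a whole chunk.

```python
class PackedChunk:
    def __init__(self):
        self.data = bytearray()
        self.index: dict[str, tuple[int, int]] = {}  # file id -> (offset, length)

    def add(self, file_id: str, blob: bytes):
        self.index[file_id] = (len(self.data), len(blob))
        self.data.extend(blob)                       # small files packed back-to-back

    def get(self, file_id: str) -> bytes:
        offset, length = self.index[file_id]
        return bytes(self.data[offset:offset + length])
```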

SLIDE 59

Reference

  • cs.brown.edu/~debrabant/cis570-website/slides/gfs.ppt
  • https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf
  • http://www.slideshare.net/omerfarukinceedutr/google-file-system-gfs-presentation
  • http://www.slideshare.net/romain_jacotin/the-google-file-system-gfs

SLIDE 60

Q & A