The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003
Presented by Wenhao Xu, University of British Columbia
Outline: Background, Assumptions and Google Workloads, Architecture, System Interaction
– Component failures are the norm: disks, network, machines
– A few million files, each typically 100 MB or larger in size
– Large streaming reads typically read hundreds of KBs of an individual file at a time
– Small random reads typically read a few KBs at arbitrary offsets
– Applications batch and sort small reads to avoid seeking back and forth
– High availability: replicas, heartbeat messages, fast recovery, checksums
– Large data chunks & no client data caching
– Make appending fast: relaxed consistency model & atomic record append
– Customizable API: no need to conform to POSIX semantics; throughput matters more than latency
– A chunk: 64 MB, stored as a plain Linux file on a chunkserver
– Lazy space allocation avoids internal fragmentation despite the large chunk size
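A minimal sketch of what lazy space allocation buys, assuming a Unix sparse file stands in for a chunk (the path and constants are illustrative, not GFS code):

```python
import os

# Illustrative only: a chunkserver can expose a fixed-size 64 MB chunk without
# reserving 64 MB up front, because the underlying Linux file only consumes
# disk blocks as data is actually written.
CHUNK_SIZE = 64 * 1024 * 1024          # logical chunk size
path = "/tmp/chunk_0001"               # hypothetical chunk file name

with open(path, "wb") as f:
    f.truncate(CHUNK_SIZE)             # logical length is 64 MB, nothing allocated yet

st = os.stat(path)
print("logical size :", st.st_size)            # 67108864 bytes
print("bytes on disk:", st.st_blocks * 512)    # ~0 until data is appended
```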
– Single master simplifies the design
– Fault tolerance: shadow masters, operation log
– Minimize master involvement: clients exchange file data directly with chunkservers
– The master handles all metadata management, etc.
– The master makes chunk placement and replication decisions.
– Employs network topology to transfer data: it is pipelined along a chain of chunkservers
– Easy to control the data flow
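For a sense of scale, the paper estimates the time to push B bytes through such a pipeline to R replicas as B/T + R*L (T = per-link throughput, L = per-hop latency). A quick check of that arithmetic; the 1 ms latency is my assumption for illustration:

```python
# Quick check of the paper's pipelining estimate: B/T + R*L.
B = 1_000_000            # 1 MB of data
T = 100e6 / 8            # 100 Mbps links, in bytes per second
R = 3                    # replicas in the chain
L = 1e-3                 # assume 1 ms per hop (the paper says L is well below 1 ms)
print((B / T + R * L) * 1000, "ms")   # ~83 ms, matching the paper's "about 80 ms"
```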
Three types of metadata:
– File and chunk namespaces
– Mapping from files to chunks
– Location of each chunk's replicas
– Feasibility: less than 64 B of metadata per chunk, so 2^26 chunks (64M) * 64 B = 4 GB fits in the master's memory; only a few million files
– In-memory metadata gives fast access & scanning
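A quick sanity check of that arithmetic (illustrative; the data-set size in the comment is just what the slide's numbers imply):

```python
# Back-of-the-envelope check of the slide's numbers: chunk metadata fits in RAM.
bytes_per_chunk = 64          # "< 64 B of metadata per chunk"
num_chunks = 2 ** 26          # 64M chunks (64M chunks * 64 MB/chunk = 4 PB of data)
total_bytes = bytes_per_chunk * num_chunks
print(total_bytes / 2**30, "GB of metadata")   # 4.0 GB -> fits in the master's memory
```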
– Stores the first two types of metadata
– Any change to the first two types of metadata is first written to the operation log
– The log is replicated on multiple remote machines, and the master responds to a client operation only after flushing the corresponding log record to disk both locally and remotely
– Checkpoints are made to keep the log from growing large and to shorten master startup time
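A minimal write-ahead-log sketch of that rule (illustrative only: real GFS replicates to remote machines, here plain local files stand in for them; the paths, class name, and JSON record format are assumptions):

```python
import json
import os

class OperationLog:
    """Minimal write-ahead-log sketch of the behaviour described above."""

    def __init__(self, local_path, remote_paths):
        # Local files stand in for the remote replicas of the log.
        self.logs = [open(p, "ab") for p in [local_path, *remote_paths]]

    def commit(self, mutation):
        record = (json.dumps(mutation) + "\n").encode()
        for log in self.logs:
            log.write(record)
            log.flush()
            os.fsync(log.fileno())   # flushed to disk locally and "remotely" ...
        # ... and only now would the master apply the change to its in-memory
        # metadata and respond to the client.

oplog = OperationLog("/tmp/oplog.local", ["/tmp/oplog.remote1", "/tmp/oplog.remote2"])
oplog.commit({"op": "create", "file": "/data/log-0001"})
```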
– Balance disk space utilization
– Performance: limit the number of "recent" creations (potential hot spots) on each chunkserver
– Availability: place replicas on different racks, as far apart as possible
– Maximize network utilization: cross as few switches as possible; exploit the aggregate bandwidth of multiple racks
– Multiple racks: write traffic has to flow across racks, a tradeoff made willingly
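A toy placement heuristic in the spirit of those criteria (the thresholds and scoring are my assumptions, not the paper's actual policy):

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_used: float        # fraction of disk space in use
    recent_creates: int     # chunks created here recently

def place_replicas(servers, n=3, max_recent=5):
    # Prefer servers with low utilization and few recent creations (a heavily
    # written-to server is a likely hot spot), and spread replicas across racks.
    candidates = sorted(
        (s for s in servers if s.recent_creates < max_recent),
        key=lambda s: s.disk_used,
    )
    chosen, racks = [], set()
    for s in candidates:
        if s.rack not in racks:
            chosen.append(s)
            racks.add(s.rack)
        if len(chosen) == n:
            break
    return chosen

servers = [
    Chunkserver("cs1", "rackA", 0.40, 1),
    Chunkserver("cs2", "rackA", 0.20, 0),
    Chunkserver("cs3", "rackB", 0.35, 2),
    Chunkserver("cs4", "rackC", 0.60, 0),
]
print([s.name for s in place_replicas(servers)])   # ['cs2', 'cs3', 'cs4']
```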
– Consistent: all clients will always see the same data, regardless of which replica they read from
– Defined: consistent, and clients will see what the mutation writes in its entirety
The state of a file region after a mutation depends on:
– whether the operation succeeds or fails
– whether there are concurrent mutations
Applications live with the relaxed model by:
– checking error codes to learn whether a region is consistent
– using application-level checksums to verify that a record is defined
– using record identifiers to filter out duplicates
– retrying failed appends
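A sketch of how a reader can apply those rules, assuming a simple (id, length, checksum, payload) framing that the application itself defines (GFS leaves the exact record format to the application):

```python
import io
import struct
import zlib

# Self-validating, self-identifying records: the framing here is an assumption
# for illustration, not a GFS-defined format.
HEADER = struct.Struct(">QLL")   # record id, payload length, checksum

def write_record(buf, rec_id, payload):
    buf.write(HEADER.pack(rec_id, len(payload), zlib.crc32(payload)))
    buf.write(payload)

def read_records(buf):
    seen = set()
    while True:
        hdr = buf.read(HEADER.size)
        if len(hdr) < HEADER.size:
            return
        rec_id, length, crc = HEADER.unpack(hdr)
        payload = buf.read(length)
        if zlib.crc32(payload) != crc:   # padding or a partial record: skip it
            continue
        if rec_id in seen:               # duplicate left by a retried append: skip it
            continue
        seen.add(rec_id)
        yield rec_id, payload

buf = io.BytesIO()
write_record(buf, 1, b"hello")
write_record(buf, 1, b"hello")           # duplicate caused by a client retry
write_record(buf, 2, b"world")
buf.seek(0)
print(list(read_records(buf)))           # [(1, b'hello'), (2, b'world')]
```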
– Handled exclusively by the master
When the master decides which chunkserver a particular chunk goes in, shouldn't it consider the potential applications that are going to access that chunk?
– Reply: Not really. More replicas are created if a chunk becomes a hot spot.
The paper mentions that the master "rebalances" replicas for load balancing, but it does not say how this is done.
– Reply: The master knows the disk usage of all chunkservers.
When the primary master is down, shadow masters provide read-only access and cannot provide any mutation operations. But the processes that got a lease from the primary before it crashed will continue writing to the disks, right? So how can we keep track of the writes that take place while the shadow master is in operation?
– Reply: Those mutations can run until they finish.
What happens if the primary master goes down? Will one of the "shadow" masters become a new primary master? If so, how do we pick one from several "shadow" ones?
– Reply : Maybe.
Does the single master degrade GFS's scalability?
– Reply: It does not seem to under Google's workloads.
Each chunkserver forwards data to the "closest" machine in the pipeline. However, the term "closest" is never explained: is it based on network latency, does it factor in the network load on that chunkserver, or the system latency? How does one chunkserver figure out the closest server on the fly?
– Reply: Distances are deduced from IP addresses.
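One plausible reading of that reply, assuming addresses are assigned so that machines in the same rack or switch share a longer prefix (a sketch, not how Google's code actually computes it):

```python
import ipaddress

def closeness(a, b):
    # Toy "distance": length of the common IP prefix.
    x = int(ipaddress.ip_address(a))
    y = int(ipaddress.ip_address(b))
    return 32 - (x ^ y).bit_length()     # longer shared prefix => closer

def pick_closest(me, candidates):
    return max(candidates, key=lambda c: closeness(me, c))

print(pick_closest("10.1.2.3", ["10.1.2.7", "10.1.9.4", "10.4.0.1"]))  # 10.1.2.7
```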
Record append lets clients append records atomically, and they can associate an application-defined checksum with each record. This seems to imply that GFS keeps metadata for each record. How is this done?
– Reply : The checksum is in the record data, not in the metadata.
cannot contact the master it may unwittingly serve stale data to clients. Comments?
– Reply: The master includes the chunk version number when it informs clients of chunk locations, so a stale replica can be detected and skipped.
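A sketch of the staleness check implied by that reply; the function and field names are assumptions, only the idea (compare a replica's chunk version against the version the master handed out) comes from the paper:

```python
def choose_replica(master_version, replica_versions):
    # Read only from a replica whose chunk version matches what the master reported.
    for server, version in replica_versions.items():
        if version == master_version:
            return server
    return None                           # all replicas stale: re-ask the master

replicas = {"cs1": 7, "cs2": 6}           # cs2 missed a mutation while it was down
print(choose_replica(7, replicas))        # cs1
```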
What about "concurrent appends to the same file" and the "relaxed" consistency model under a normal workload? (I believe this only suits Google well.) Since appending is key to GFS write performance in a multi-writer environment, it might be that GFS would give up much of its performance advantage even in large serial
– Reply: It is a tradeoff of complexity between GFS and its clients.
– I think it is good for performance.
– It is tailored for Google's workloads (appending).