The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Farnaz Farahanipad 1001134035
Overview
- Introduction
- Design overview
- GFS structure
- System interaction
- Master operation
- Questions and answers
- Conclusion
Master metadata and system interaction
- The operation log records critical metadata changes; chunk data itself never flows through the master.
- The master checkpoints its state whenever the log goes beyond a certain size, so older checkpoints and log files can be deleted freely.
- Leases grant a primary replica the authority to pick a consistent mutation order across replicas.
- Data is pushed linearly along a chain of chunk servers over TCP connections; data flow is decoupled from control flow.
- The ideal time to transfer B bytes through R replicas is B/T + R × L. With B = 1 MB, network rate T = 100 Mbps, and per-hop latency L = 1 ms, the transfer takes roughly t ≈ 80 ms.
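The ~80 ms figure can be checked numerically. This is a minimal sketch of the estimate; the function name is illustrative, not from GFS:

```python
# Sketch: ideal time to pipeline B bytes through R replicas, assuming
# each machine's network link runs at T bits/s with per-hop latency L.
def ideal_transfer_time(B_bytes, R, T_bps, L_s):
    # Data streams at the full link rate T; each replica hop adds latency L.
    return (B_bytes * 8) / T_bps + R * L_s

# 1 MB to 3 replicas over a 100 Mbps link with 1 ms hop latency:
t = ideal_transfer_time(1_000_000, 3, 100_000_000, 0.001)
print(round(t * 1000))  # 83 (ms) -- close to the slide's ~80 ms figure
```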
Namespace management and locking
- The master allows many operations to be active at once and uses locking over regions of the namespace to ensure proper serialization.
- The namespace is a lookup table mapping full pathnames to metadata.
- Each operation acquires read locks on all ancestor directories of the paths involved, plus a read or write lock on the full pathname itself, to ensure serialization.
- Snapshot example: snapshotting /home/user to /save/user takes read locks on /home and /save, and write locks on /home/user and /save/user.
- Creation example: creating /home/user/foo takes read locks on /home and /home/user, and a write lock on /home/user/foo. The snapshot's write lock on /home/user conflicts with this read lock, so the two operations serialize.
- Because creation needs only a read lock on the directory, concurrent mutations in the same directory are allowed, while the write lock on the new file name prevents two files with the same name from being created simultaneously.
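The prefix-locking scheme can be sketched as follows. The class is hypothetical, not GFS code; it only computes which locks an operation would take:

```python
# Sketch (hypothetical class): per-path read/write locks acquired over
# all prefixes of a pathname, as the namespace-locking slide describes.
class NamespaceLocks:
    def locks_for(self, path):
        """Read locks on every ancestor directory, write lock on the leaf."""
        parts = path.strip("/").split("/")
        prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
        modes = ["read"] * (len(prefixes) - 1) + ["write"]
        return list(zip(prefixes, modes))

ns = NamespaceLocks()
# Creating /home/user/foo: read locks on /home and /home/user, write on the leaf.
print(ns.locks_for("/home/user/foo"))
# Snapshotting /home/user needs a write lock on /home/user, which conflicts
# with the read lock file creation needs there, so the two serialize.
print(ns.locks_for("/home/user"))
```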
Replica creation, re-replication, and rebalancing
- New replicas are placed on chunk servers with below-average disk space utilization.
- The number of recent creations on each chunk server is limited: creation itself is cheap, but it predicts imminent heavy write traffic.
- Replicas are spread across racks.
- The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal, with priority determined by several factors.
- The chosen chunk server copies the chunk data directly from an existing replica, and it limits the bandwidth it spends on each replication by controlling its read requests to the source chunk server.
- The master also rebalances replicas periodically for better disk space and load balancing. It fills a new chunk server gradually rather than instantly swamping it with new chunks and the heavy write traffic that comes with them, and it prefers moves that equalize disk space usage.
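The placement criteria can be sketched as a simple filter-and-sort heuristic. This is an assumed simplification for illustration (field names and the `max_recent` cap are made up), not the actual GFS policy code:

```python
# Sketch (assumed heuristic, simplified from the slide's criteria):
# prefer chunk servers with below-average disk utilization, and cap
# the number of recent chunk creations per server.
def pick_replica_targets(servers, n_replicas, max_recent=2):
    """servers: list of dicts with 'name', 'disk_util', 'recent_creates'."""
    avg = sum(s["disk_util"] for s in servers) / len(servers)
    eligible = [s for s in servers
                if s["disk_util"] <= avg and s["recent_creates"] < max_recent]
    eligible.sort(key=lambda s: s["disk_util"])  # emptiest servers first
    return [s["name"] for s in eligible[:n_replicas]]

servers = [
    {"name": "cs1", "disk_util": 0.90, "recent_creates": 0},  # too full
    {"name": "cs2", "disk_util": 0.40, "recent_creates": 0},
    {"name": "cs3", "disk_util": 0.50, "recent_creates": 5},  # too many recent creates
    {"name": "cs4", "disk_util": 0.30, "recent_creates": 1},
]
print(pick_replica_targets(servers, 2))  # ['cs4', 'cs2']
```

A real implementation would also enforce the cross-rack constraint, which is omitted here for brevity.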
Garbage collection
- When a file is deleted, the master logs the deletion but does not immediately reclaim the resources.
- The file is renamed to a hidden name that includes the deletion timestamp.
- During its regular namespace scan, the master removes hidden files that have existed for more than 3 days.
- Once a hidden file is removed from the namespace, its in-memory metadata is erased.
- Until then, the hidden file can still be read and undeleted, which can be useful as a safety net against accidental deletion.
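The rename-then-scan mechanism above can be sketched in a few lines. The namespace is modeled as a plain dict and the hidden-name format is made up for illustration:

```python
# Sketch (hypothetical structures): lazy deletion as described on the slide.
HIDE_SECONDS = 3 * 24 * 3600  # hidden files older than 3 days are removed

def delete_file(namespace, name, now):
    # Deletion just renames the file to a hidden name with a timestamp.
    namespace[f".deleted.{name}.{int(now)}"] = namespace.pop(name)

def gc_scan(namespace, now):
    # Regular scan: drop hidden files whose deletion timestamp is > 3 days old.
    for key in list(namespace):
        if key.startswith(".deleted."):
            ts = int(key.rsplit(".", 1)[-1])
            if now - ts > HIDE_SECONDS:
                del namespace[key]  # in-memory metadata erased here

ns = {"/home/user/foo": "chunk handles"}
delete_file(ns, "/home/user/foo", now=0)   # still recoverable at this point
gc_scan(ns, now=4 * 24 * 3600)             # four days later
print(ns)  # {} -- metadata reclaimed
```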
1) “…its design has been driven by key observations of our application workloads and technological environment,…” What are the workload and technology characteristics GFS assumed in its design, and what are their corresponding design choices? A) Answer in Slides 7 and 8.
2) “…while caching data blocks in the client loses its appeal.” GFS does not cache file data. Why does this design choice not lead to performance loss? What benefit does this choice have? A) Because reads are mostly large streaming reads and cache space is limited, cached data would be overwritten before it could be reused, so caching offers little benefit. The advantage is that client memory is freed for other uses, and eliminating caches also avoids cache-coherence issues, simplifying the system.
5) “A GFS cluster consists of a single master…”. What is the benefit of having only a single master? What is its potential performance risk? How does GFS minimize such a risk? A) A single-master architecture greatly simplifies the design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge; in other words, it increases flexibility. The risk is that the master becomes a bottleneck, so its involvement in reads and writes must be minimized: clients ask the master only for metadata and then contact chunk servers directly for data. The single point of failure is addressed by shadow masters.
6) “Each chunk replica is stored as a plain Linux file on a chunk server and is extended only as needed.” How does GFS collaborate with the chunk server's local file system to store file chunks? What is lazy space allocation and what is its benefit? A) Each chunk replica is an ordinary Linux file that the chunk server extends only as data is written. With lazy space allocation, the chunk server does not physically allocate the full 64 MB up front; space is allocated only as the data grows toward 64 MB. This reduces internal fragmentation.
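Because chunks are plain files, the local file system provides this behavior for free: a file occupies only what has been written. A tiny demonstration, assuming any ordinary POSIX file system:

```python
# Demo: a "chunk" file occupies only the bytes actually written,
# not the full 64 MB chunk size -- lazy allocation by the local FS.
import os
import tempfile

CHUNK_SIZE = 64 * 2**20  # 64 MB logical chunk size
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)  # a chunk currently holding just 1000 bytes
    path = f.name

print(os.path.getsize(path))  # 1000 -- far less than CHUNK_SIZE
os.remove(path)
```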
7) “On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages.” Give an example disadvantage. A) A small file occupies a single chunk, so if many clients access that file at once, its chunk servers can become hotspots. GFS fixed this problem by storing such executables with a higher replication factor. Another solution is to allow clients to read data from other clients.
8) “One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has.” Why is GFS's master able to keep the metadata in memory? A) Because GFS uses large 64 MB chunks and the master keeps less than 64 bytes of metadata per chunk, metadata is a tiny fraction of the data size (about 64 B / 64 MB ≈ 10⁻⁶), so master memory is not a problem.
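The 64 B / 64 MB ratio can be turned into a quick back-of-the-envelope calculation; the helper below is illustrative and ignores file-namespace metadata:

```python
# Back-of-the-envelope check: ~64 bytes of master metadata per 64 MB chunk
# means master memory scales at roughly one millionth of total data size.
CHUNK_SIZE = 64 * 2**20   # 64 MB
META_PER_CHUNK = 64       # ~64 bytes of chunk metadata (per the slide)

def master_memory_bytes(total_data_bytes):
    chunks = total_data_bytes // CHUNK_SIZE
    return chunks * META_PER_CHUNK

# A 1 PB file system needs only about 1 GiB of master memory for chunk metadata:
pb = 2**50
print(master_memory_bytes(pb) // 2**20, "MiB")  # 1024 MiB
```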
9) “We use leases to maintain a consistent mutation order across replicas.” Could you show a scenario where an unexpected result may appear if the lease mechanism is not implemented? Also explain how leases help address the problem. A) Without leases, concurrent mutations could be applied in a different order on different chunk servers, leaving the replicas inconsistent. With leases, the GFS client pushes data to the chunk servers, where it is kept in a buffer. The master grants a lease to one replica, the primary, which writes the chunk data to its storage and assigns the mutations serial numbers. The primary then forwards the order in which it applied the mutations to the secondary chunk servers, which apply them in the same order, keeping the replicas consistent.
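The primary/secondary ordering can be sketched as follows. The classes are hypothetical stand-ins for chunk servers, with network delivery reduced to direct method calls:

```python
# Sketch (hypothetical classes): the lease-holding primary serializes
# mutations; secondaries replay them in the primary's chosen order.
class Secondary:
    def __init__(self):
        self.log = []

    def apply(self, serial, data):
        self.log.append((serial, data))   # apply in the primary's serial order

class Primary:
    def __init__(self, secondaries):
        self.serial = 0
        self.log = []
        self.secondaries = secondaries

    def mutate(self, data):
        self.serial += 1                  # assign a serial number to this mutation
        self.log.append((self.serial, data))
        for s in self.secondaries:        # forward the chosen order to replicas
            s.apply(self.serial, data)

s1, s2 = Secondary(), Secondary()
p = Primary([s1, s2])
for d in ["A", "B", "C"]:                 # concurrent clients' writes, serialized
    p.mutate(d)
print(s1.log == p.log == s2.log)          # True -- one mutation order everywhere
```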
10) “When the master creates a chunk, it chooses where to place the initially empty replicas.” What are the criteria for choosing where to place the initially empty replicas? A) First, place new replicas on chunk servers with below-average disk space utilization (this equalizes disk utilization over time). Second, spread the replicas across different racks. Third, limit the number of recent replica creations on each chunk server, because creation is cheap but predicts imminent heavy write traffic.
11) “The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal.” When a new chunk server is added into the system, the master mostly uses chunk rebalancing rather than new chunks to fill it up. Why? A) The master rebalances chunks gradually, because this avoids swamping the new chunk server with new chunks and the heavy write traffic that comes with them, while still balancing disk usage and load across chunk servers.
12) “After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels.” How are files and chunks deleted? What are the advantages of the delayed space reclamation (garbage collection), rather than eager deletion? A) Mechanism: when the application deletes a file, the master logs the deletion immediately and renames the file to a hidden name that includes the deletion timestamp. During the master's regular namespace scan, it removes any such hidden files that have existed for more than three days. After a hidden file is removed from the namespace, its in-memory metadata is erased. Orphaned chunks are reclaimed similarly: chunk servers report their chunks in heartbeat messages, the master replies with those no longer present in its metadata, and the chunk servers delete them.
Advantages: eager deletion would create too much workload for the master at inconvenient times; delayed garbage collection merges storage reclamation into the master's regular background activities, so the work is batched and its cost amortized. It is also simple and reliable in the presence of component failures, and the three-day window provides a safety net against accidental deletion.