The Google File System • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung • Presented by Farnaz Farahanipad (1001134035)
Overview • Introduction • Design overview • GFS structure • System interaction • Master operation • Questions and answers • Conclusion
Introduction: Distributed file systems • What is a distributed file system? • A DFS is any file system that allows files to be accessed from multiple hosts sharing them over a computer network.
Introduction • Why not use an existing file system? • Bottleneck problems • Load-balancing issues • Different workload and design properties • GFS is designed for Google's apps and workloads
Google File System • GFS is a scalable distributed file system for large, data-intensive applications. • GFS has a master-slave architecture. • It shares many goals with previous distributed file systems: performance, scalability, reliability, and availability.
GFS design assumptions • High component failure rates • “Modest” number of huge files – just a few million large files • Files are write-once and mostly appended to • Large streaming reads • High sustained throughput favored over low latency
GFS design decisions • Files stored as chunks • Fixed size of 64 MB • Single master • Simple centralized management • Reliability through replication • Each chunk is replicated across 3 or more chunk servers • No data caching • Little benefit due to the large size of data sets • Familiar interface, but a customized API • Suited to Google apps • Adds snapshot and record-append operations
GFS architecture
Client • Interacts with the master for metadata operations: • Translates a file offset into a chunk index within the file • Sends the master a request with the file name and chunk index • Caches the reply, using the file name and chunk index as the key • Interacts with chunk servers for read/write operations
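The offset-to-chunk-index translation on the slide above can be sketched in a few lines. This is an illustrative sketch, not the real GFS client code; `offset_to_chunk_index` is a hypothetical name.

```python
# Hypothetical sketch of the client-side translation described above:
# a byte offset in a file maps to a chunk index, which the client then
# sends to the master along with the file name.

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def offset_to_chunk_index(offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return offset // CHUNK_SIZE

# A client reading at offset 200 MB asks the master about chunk 3
# (chunks 0..2 cover the first 192 MB of the file).
print(offset_to_chunk_index(200 * 1024 * 1024))  # -> 3
```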
Chunk servers • Chunk servers are the workers of GFS. • Responsible for storing 64-MB file chunks. • Each chunk replica is stored on a chunk server and is extended only as needed.
Why a large chunk size? • Reduces the size of metadata • Reduces involvement of the master • Reduces network overhead • Lazy space allocation avoids internal fragmentation
Reliability issue • What if a chunk server goes down? • GFS copies every chunk multiple times and stores the replicas on different chunk servers.
Single master weakness: Single point of failure • What if the master goes down? • GFS solution: • Shadow masters
Single master weakness: Scalability bottleneck • How to solve the bottleneck problem? • GFS solution: • Minimize master involvement • Never move data through the master; use it only for metadata • Large chunk size means less metadata • Data mutations are carried out by the chunk servers
Master • The master maintains all file system metadata. • Periodically communicates with chunk servers • Gives instructions, collects state • Chunk creation, re-replication, rebalancing • Garbage collection • Simpler and more reliable • Lazily garbage-collects hidden files
Master: Metadata • Global metadata is stored on the master: • File and chunk namespaces • Mapping from files to chunks • Location of each chunk's replicas • All in memory (about 64 bytes per chunk) • Fast • Easy access
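A back-of-envelope calculation shows why keeping all metadata in memory is feasible. The numbers below (1 PB of stored data) are illustrative, not from the slides:

```python
# Back-of-envelope check (illustrative numbers): at about 64 bytes of
# metadata per 64 MB chunk, even a petabyte of file data needs only a
# modest amount of master memory.

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB per chunk
META_PER_CHUNK = 64                  # bytes of metadata per chunk

data_bytes = 1024 ** 5               # 1 PB of stored file data
num_chunks = data_bytes // CHUNK_SIZE
meta_bytes = num_chunks * META_PER_CHUNK

print(num_chunks)                    # -> 16777216 chunks
print(meta_bytes // (1024 ** 2))     # -> 1024 MB of metadata, i.e. 1 GB
```

So the in-memory approach costs roughly 1 GB of RAM per petabyte of data, which is why the slide can claim both "fast" and "easy access".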
Master: Operation log • The operation log contains a historical record of critical metadata changes. • Defines the order of concurrent operations • Critical, so: • Replicated to multiple remote machines • Respond to the client only after the change is logged both locally and remotely
Master: Operation log • The master checkpoints its state whenever the log grows beyond a certain size. • Fast recovery using checkpoints • Recovery needs only the latest checkpoint and subsequent log files, so older ones can be deleted freely.
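The checkpoint-plus-log recovery described above can be sketched as: restore the last checkpoint, then replay only the log records written after it. This is a minimal sketch with assumed data shapes (`recover`, the `(op, path, value)` record format), not the real implementation.

```python
# Minimal sketch of checkpoint + log-replay recovery (assumed structure):
# the master restores the latest checkpoint, then re-applies only the
# operation-log records written after that checkpoint.

def recover(checkpoint_state: dict, log_records: list) -> dict:
    state = dict(checkpoint_state)       # start from the checkpoint
    for op, path, value in log_records:  # replay post-checkpoint records
        if op == "set":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state

state = recover({"/a": 1}, [("set", "/b", 2), ("delete", "/a", None)])
print(state)  # -> {'/b': 2}
```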
Why it is important to log on information of master? • Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash.
Master: Keeping chunk servers and the master synchronized • By exchanging heartbeat messages
System interaction: lease and mutation order • A lease is a grant of ownership or control for a limited time. • The owner/holder can renew or extend the lease. • If the owner fails, the lease expires and is free again.
System interaction: lease and mutation order • A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append. • Each mutation is performed at all of the chunk's replicas.
System interaction: lease and mutation order • Leases are used to maintain a consistent mutation order across replicas: the master grants a chunk lease to one replica, the primary, which picks a serial order for all mutations to the chunk.
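The lease idea can be sketched as follows. Class and method names (`Lease`, `Primary`, `order`) are illustrative, not the real GFS interfaces; the point is that only the current lease holder assigns serial numbers, so every replica applies mutations in the same order.

```python
# Sketch of lease-based mutation ordering (illustrative names): the
# primary holds a time-limited lease and assigns a serial number to
# each mutation, giving all replicas the same application order.

import time

class Lease:
    def __init__(self, holder: str, duration_s: float = 60.0):
        self.holder = holder
        self.expires = time.monotonic() + duration_s

    def valid(self) -> bool:
        return time.monotonic() < self.expires

    def renew(self, duration_s: float = 60.0) -> None:
        self.expires = time.monotonic() + duration_s

class Primary:
    """Replica holding the lease; serializes mutations for all replicas."""
    def __init__(self, lease: Lease):
        self.lease = lease
        self.next_serial = 0

    def order(self, mutation: str) -> tuple:
        assert self.lease.valid(), "lease expired; master may re-grant it"
        serial = self.next_serial
        self.next_serial += 1
        return (serial, mutation)  # same order applied at every replica

primary = Primary(Lease(holder="chunkserver-1"))
print(primary.order("write A"))   # -> (0, 'write A')
print(primary.order("append B"))  # -> (1, 'append B')
```

If the primary fails, its lease simply expires (no renewal), and the master can safely grant a fresh lease to another replica.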
System interaction: data flow • Data flow is decoupled from control flow • To avoid network bottlenecks and high-latency links: • Each machine forwards data to the closest machine • Latency is minimized by pipelining the data transfer over TCP connections
System interaction: data flow • Ideal time for transferring B bytes to R replicas without network congestion: t = B/T + R·L, where T is the network throughput and L is the latency between two machines • With B = 1 MB, T = 100 Mbps, L = 1 ms: t ≈ 80 ms
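Plugging the slide's numbers into the formula t = B/T + R·L confirms the ~80 ms figure (R = 3 is assumed here, matching the typical GFS replication factor):

```python
# The slide's formula with its numbers: ideal time to push B bytes
# through a pipeline of R replicas is B/T + R*L.

B = 1 * 10**6 * 8      # 1 MB in bits
T = 100 * 10**6        # 100 Mbps network throughput
L = 1e-3               # 1 ms latency per hop
R = 3                  # replicas (typical GFS replication factor)

t = B / T + R * L      # seconds
print(round(t * 1000)) # -> 83 ms, i.e. roughly 80 ms as on the slide
```

The B/T term dominates, which is why pipelining over the full outbound bandwidth of each machine matters more than the per-hop latency.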
Master operation • Namespace management and locking • Replica placement • Creation, Re-replication, Rebalancing
Master operation: Namespace management and locking • GFS allows multiple operations to be active in the master, using locks to ensure proper serialization. • Recall that GFS has no per-directory data structure. • It only stores the file-to-chunk mapping. • So GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. • A read/write lock on each node in the namespace tree ensures serialization. • Each master operation acquires a set of locks before it runs.
Master operation: Namespace management and locking • If an operation involves /d1/d2/…/dn/leaf, it acquires: • Read locks on the directory names /d1, /d1/d2, …, /d1/d2/…/dn • Either a read lock or a write lock on the full pathname /d1/d2/…/dn/leaf
Master operation: Namespace management and locking • How does this locking mechanism prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user? • Snapshot operation: read locks on /home and /save; write locks on /home/user and /save/user • Creation operation: read locks on /home and /home/user; write lock on /home/user/foo • The two operations conflict on /home/user, so they are serialized
Master operation: Namespace management and locking • Creating new files under a directory, e.g. /dir/file3, /dir/file4, … • Concurrent mutations in the same directory are allowed • Key: each creation takes only a read lock on /dir • By write-locking the full pathname, it locks the new file name before the file is created • This prevents two files with the same name from being created simultaneously
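The lock sets described on the last few slides can be sketched as follows. The helper names (`lock_set`, `conflicts`) are illustrative, not the real master code; the sketch just computes, for an operation on a pathname, the read locks on every ancestor and a read or write lock on the full path, then checks two operations for conflicts.

```python
# Sketch of the namespace lock sets (illustrative, not the real code):
# read locks on every ancestor directory, plus a read or write lock on
# the full pathname itself.

def lock_set(path: str, write_leaf: bool):
    """Return (read_locks, write_locks) for an operation on `path`."""
    parts = path.strip("/").split("/")
    reads = {"/" + "/".join(parts[:i]) for i in range(1, len(parts))}
    writes = set()
    (writes if write_leaf else reads).add(path)
    return reads, writes

def conflicts(a, b) -> bool:
    """Two operations conflict if one write-locks a name the other touches."""
    (ra, wa), (rb, wb) = a, b
    return bool(wa & (rb | wb) or wb & (ra | wa))

# Snapshot vs. create: snapshotting /home/user write-locks /home/user,
# while creating /home/user/foo read-locks /home/user -- a conflict,
# so the two operations are serialized. (Only the snapshot's source
# tree is shown here; it also locks the target tree.)
snap = lock_set("/home/user", write_leaf=True)
create = lock_set("/home/user/foo", write_leaf=True)
print(conflicts(snap, create))  # -> True

# Two creates in the same directory only read-lock /dir, so they can
# proceed concurrently:
f3 = lock_set("/dir/file3", write_leaf=True)
f4 = lock_set("/dir/file4", write_leaf=True)
print(conflicts(f3, f4))  # -> False
```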
Master operation: Replica placement • Serves two purposes: • Maximize data reliability and availability • Maximize network bandwidth utilization • Spread chunk replicas across racks: • To ensure chunk survivability • To exploit the aggregate read bandwidth of multiple racks • Trade-off: write traffic has to flow through multiple racks
Master operation: Creation, Re-replication, Rebalancing • Chunk replicas are created for three reasons: • Chunk creation • Re-replication • Rebalancing
Master operation: Creation, Re-replication, Rebalancing • New chunks are created on chunk servers • The master decides which chunk servers to use for chunk creation: • Put new replicas on chunk servers with below-average disk-space utilization • Limit the number of recent creations on each chunk server (creation is cheap but predicts heavy write traffic) • Spread replicas of a chunk across different racks
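Two of the placement heuristics above (prefer low disk utilization, spread across racks) can be sketched together. This is an illustrative sketch; `place_replicas` and the tuple format are assumptions, and the real master combines more factors.

```python
# Illustrative sketch of two placement heuristics above: prefer chunk
# servers with low disk utilization, and spread replicas across racks.

def place_replicas(servers, num_replicas=3):
    """servers: list of (name, rack, disk_utilization) tuples."""
    # Lowest utilization first approximates the "below-average
    # utilization" preference on the slide.
    candidates = sorted(servers, key=lambda s: s[2])
    chosen, racks = [], set()
    for name, rack, util in candidates:
        if rack not in racks:          # one replica per rack
            chosen.append(name)
            racks.add(rack)
        if len(chosen) == num_replicas:
            break
    return chosen

servers = [("cs1", "rackA", 0.30), ("cs2", "rackA", 0.20),
           ("cs3", "rackB", 0.50), ("cs4", "rackC", 0.40)]
print(place_replicas(servers))  # -> ['cs2', 'cs4', 'cs3']
```

Note that cs1 is skipped even though it is less utilized than cs4: rack diversity wins, because losing a whole rack must not destroy all replicas of a chunk.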
Master operation: Creation, Re-replication, Rebalancing • The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. • Each chunk that needs re-replication is prioritized based on several factors. • The master picks the highest-priority chunk and “clones” it by instructing some chunk server to copy the chunk data directly from an existing replica. • Additionally, each chunk server limits the amount of bandwidth it spends on replication by throttling its read requests to the source chunk server.
Master operation: Creation, Re-replication, Rebalancing • The master rebalances replicas periodically: • It examines the current replica distribution and moves replicas for better disk-space and load balancing. • It gradually fills up a new chunk server rather than instantly swamping it with new chunks and the heavy traffic that comes with them. • The master must also choose which existing replica to remove. • It prefers to remove replicas on chunk servers with below-average free space, so as to equalize disk-space usage.
Master operation: Garbage collection • Any replica not known to the master is garbage. • The master logs a deletion immediately, like other changes, but does not reclaim the resources right away. • The file is renamed to a hidden name that includes the deletion timestamp. • During the master's regular scan, it removes hidden files older than 3 days. • Once a hidden file is removed from the namespace, its in-memory metadata is erased.
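The lazy deletion scheme above can be sketched as: a delete renames the file to a hidden name carrying the deletion timestamp, and a later scan erases hidden names past the grace period. The names (`hide`, `scan`, the `.deleted.` prefix) are illustrative assumptions, not the real GFS naming convention.

```python
# Sketch of lazy garbage collection (illustrative naming scheme): a
# deleted file becomes a hidden name with a deletion timestamp; the
# master's regular scan reclaims hidden names older than 3 days.

GRACE_SECONDS = 3 * 24 * 3600   # hidden files kept for 3 days

def hide(namespace: dict, path: str, now: float) -> None:
    """Log-style delete: rename to a hidden name with a timestamp."""
    meta = namespace.pop(path)
    namespace[f".deleted.{path}.{int(now)}"] = meta

def scan(namespace: dict, now: float) -> None:
    """Regular scan: erase hidden files older than the grace period."""
    for name in list(namespace):
        if name.startswith(".deleted."):
            ts = int(name.rsplit(".", 1)[1])
            if now - ts > GRACE_SECONDS:
                del namespace[name]   # in-memory metadata erased

ns = {"/home/foo": "meta"}
hide(ns, "/home/foo", now=0)
scan(ns, now=4 * 24 * 3600)     # 4 days later: past the grace period
print(ns)                        # -> {}
```

Until the grace period expires, the hidden entry still exists, so an accidental delete can be undone simply by renaming the file back.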