  1. Large Scale File Systems Amir H. Payberah payberah@kth.se 31/08/2018

  2. The Course Web Page https://id2221kth.github.io 1 / 69

  3. Where Are We? 2 / 69

  4. File System 3 / 69

  5. What is a File System? ◮ Controls how data is stored in and retrieved from disk. 4 / 69

  7. Distributed File Systems ◮ When data outgrows the storage capacity of a single machine: partition it across a number of separate machines. ◮ Distributed file systems manage the storage across a network of machines. 5 / 69

  8. Google File System (GFS) 6 / 69

  9. Motivation and Assumptions ◮ Node failures happen frequently ◮ Huge files (multi-GB) ◮ Most files are modified by appending at the end • Random writes (and overwrites) are practically non-existent 7 / 69

  10. Files and Chunks ◮ Files are split into chunks. ◮ A chunk is the single unit of storage. • Immutable • Transparent to the user • Each chunk is stored as a plain Linux file 8 / 69
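
A minimal sketch of the chunk arithmetic implied by this slide, assuming the 64 MB chunk size discussed later in the deck; the helper names are illustrative, not part of any GFS API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size assumed in these slides

def chunk_index(offset: int) -> int:
    """Index of the chunk that a byte offset of a file falls into."""
    return offset // CHUNK_SIZE

def chunk_range(offset: int, length: int) -> range:
    """Indices of all chunks touched by an access of `length` bytes at `offset`."""
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return range(first, last + 1)

# Example: a 200 MB access starting at byte 100 MB touches chunks 1..4.
print(list(chunk_range(100 * 1024 * 1024, 200 * 1024 * 1024)))
```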

  11. GFS Architecture ◮ Main components: • GFS master • GFS chunk server • GFS client 9 / 69

  12. GFS Master ◮ Responsible for all system-wide activities 10 / 69

  13. GFS Master ◮ Responsible for all system-wide activities ◮ Maintains all file system metadata • Namespaces, ACLs, mappings from files to chunks, and current locations of chunks 10 / 69

  14. GFS Master ◮ Responsible for all system-wide activities ◮ Maintains all file system metadata • Namespaces, ACLs, mappings from files to chunks, and current locations of chunks • All kept in memory; namespaces and file-to-chunk mappings are also stored persistently in the operation log 10 / 69

  15. GFS Master ◮ Responsible for all system-wide activities ◮ Maintains all file system metadata • Namespaces, ACLs, mappings from files to chunks, and current locations of chunks • All kept in memory; namespaces and file-to-chunk mappings are also stored persistently in the operation log ◮ Periodically communicates with each chunkserver • Determines chunk locations • Assesses the state of the overall system 10 / 69
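
The master state described above can be pictured roughly as follows; the field names, types, and the heartbeat helper are assumptions made for this sketch, not the actual GFS data structures:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkInfo:
    handle: int                                           # globally unique chunk handle
    locations: List[str] = field(default_factory=list)    # chunkserver addresses (not persisted)

@dataclass
class FileMeta:
    acl: str = ""                                         # access control information
    chunks: List[int] = field(default_factory=list)       # ordered chunk handles of the file

class MasterState:
    """In-memory metadata; the namespace and file-to-chunk mappings would also be
    written to the operation log, which is omitted in this sketch."""
    def __init__(self):
        self.namespace: Dict[str, FileMeta] = {}   # pathname -> file metadata
        self.chunks: Dict[int, ChunkInfo] = {}     # chunk handle -> chunk info

    def heartbeat(self, server: str, held_handles: List[int]) -> None:
        """Chunkservers periodically report which chunks they hold; the master learns
        chunk locations from these reports instead of persisting them."""
        for h in held_handles:
            info = self.chunks.setdefault(h, ChunkInfo(handle=h))
            if server not in info.locations:
                info.locations.append(server)
```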

  16. GFS Chunk Server ◮ Manages chunks ◮ Tells the master what chunks it has ◮ Stores chunks as files ◮ Maintains data consistency of chunks 11 / 69

  17. GFS Client ◮ Issues control requests to the master server. ◮ Issues data requests directly to chunk servers. ◮ Caches metadata. ◮ Does not cache data. 12 / 69

  18. Data Flow and Control Flow ◮ Data flow is decoupled from control flow ◮ Clients interact with the master for metadata operations (control flow) ◮ Clients interact directly with chunkservers for all file operations (data flow) 13 / 69

  19. Chunk Size ◮ 64MB or 128MB (much larger than the block size of most file systems) ◮ Advantages ◮ Disadvantages 14 / 69

  20. Chunk Size ◮ 64MB or 128MB (much larger than the block size of most file systems) ◮ Advantages • Reduces the size of the metadata stored on the master • Reduces clients' need to interact with the master ◮ Disadvantages 14 / 69

  21. Chunk Size ◮ 64MB or 128MB (much larger than the block size of most file systems) ◮ Advantages • Reduces the size of the metadata stored on the master • Reduces clients' need to interact with the master ◮ Disadvantages • Wasted space due to internal fragmentation • Small files consist of a few chunks, which can then receive heavy traffic from concurrent clients 14 / 69
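
A back-of-the-envelope illustration of the first advantage; the 64 bytes of metadata per chunk is an assumed figure, used only to show the order of magnitude:

```python
TB = 1024 ** 4
BYTES_PER_CHUNK_META = 64                # assumed metadata cost per chunk on the master

def num_chunks(file_size: int, chunk_size: int) -> int:
    """Number of chunks needed to hold a file (the last chunk may be partly empty)."""
    return -(-file_size // chunk_size)   # ceiling division

for chunk_size, label in [(64 * 1024 ** 2, "64 MB chunks"), (4 * 1024, "4 KB blocks")]:
    n = num_chunks(1 * TB, chunk_size)
    meta_mb = n * BYTES_PER_CHUNK_META / 1024 ** 2
    print(f"1 TB file with {label}: {n:,} units, ~{meta_mb:,.1f} MB of master metadata")
```

With 64 MB chunks the master tracks on the order of ten thousand units per terabyte, instead of hundreds of millions of small blocks.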

  22. System Interactions 15 / 69

  23. The System Interface ◮ Not POSIX-compliant, but supports typical file system operations • create, delete, open, close, read, and write ◮ snapshot: creates a copy of a file or a directory tree at low cost ◮ append: allows multiple clients to append data to the same file concurrently 16 / 69
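
The interface above can be pictured as a small client library; the class below is only an illustrative sketch of that surface, not the real GFS client API, and FileHandle is a hypothetical type:

```python
from abc import ABC, abstractmethod

class GFSClient(ABC):
    """Illustrative shape of the (non-POSIX) file system interface."""

    @abstractmethod
    def create(self, path: str) -> None: ...
    @abstractmethod
    def delete(self, path: str) -> None: ...
    @abstractmethod
    def open(self, path: str) -> "FileHandle": ...
    @abstractmethod
    def close(self, handle: "FileHandle") -> None: ...
    @abstractmethod
    def read(self, handle: "FileHandle", offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, handle: "FileHandle", offset: int, data: bytes) -> None: ...

    @abstractmethod
    def snapshot(self, path: str, target: str) -> None:
        """Copy a file or directory tree at low cost."""

    @abstractmethod
    def record_append(self, handle: "FileHandle", data: bytes) -> int:
        """Append data concurrently with other clients; returns the chosen offset."""
```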

  24. Read Operation (1/2) ◮ 1. Application originates the read request. ◮ 2. The GFS client translates the request and sends it to the master. ◮ 3. The master responds with the chunk handle and replica locations. 17 / 69

  25. Read Operation (2/2) ◮ 4. The client picks a location and sends the request. ◮ 5. The chunk server sends requested data to the client. ◮ 6. The client forwards the data to the application. 18 / 69
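
Putting steps 1-6 together, the client side of a read might look roughly like this, assuming the read stays within a single chunk; master.lookup and server.read_chunk are hypothetical RPC stubs:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    """Sketch of the read path: metadata from the master, data from a chunkserver."""
    index = offset // CHUNK_SIZE                         # 2. translate (file, offset) -> chunk index
    handle, locations = master.lookup(filename, index)   # 3. chunk handle + replica locations
    server = random.choice(locations)                    # 4. pick one replica location
    data = server.read_chunk(handle,                     # 5. chunkserver returns the bytes
                             start=offset % CHUNK_SIZE,
                             length=length)
    return data                                          # 6. hand the data back to the application
```

A real client would also cache the (handle, locations) pair, so that further reads of the same chunk do not involve the master.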

  26. Update Order (1/2) ◮ Update (mutation): an operation that changes the content or metadata of a chunk. 19 / 69

  27. Update Order (1/2) ◮ Update (mutation): an operation that changes the content or metadata of a chunk. ◮ For consistency, updates to each chunk must be ordered in the same way at the different chunk replicas. ◮ Consistency means that replicas will end up with the same version of the data and not diverge. 19 / 69

  28. Update Order (2/2) ◮ For this reason, for each chunk, one replica is designated as the primary. ◮ The other replicas are designated as secondaries. ◮ The primary defines the update order. ◮ All secondaries follow this order. 20 / 69

  29. Primary Leases (1/2) ◮ For correctness there needs to be one single primary for each chunk. 21 / 69

  30. Primary Leases (1/2) ◮ For correctness there needs to be one single primary for each chunk. ◮ At any time, at most one server is primary for each chunk. ◮ The master selects a chunk-server and grants it a lease for the chunk. 21 / 69

  31. Primary Leases (2/2) ◮ The chunk-server holds the lease for a period T after it gets it, and behaves as primary during this period. ◮ If the master does not hear from the primary chunk-server for a period, it gives the lease to another chunk-server. 22 / 69
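
A sketch of the master-side lease bookkeeping; the data structures are assumptions, and the 60-second value merely stands in for the period T:

```python
import time

LEASE_PERIOD = 60.0   # stand-in for the lease period T

class LeaseManager:
    """Tracks, per chunk, which replica currently acts as primary."""
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary chunkserver, lease expiry time)

    def get_primary(self, handle: int, replicas: list) -> str:
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if time.time() < expiry:
            return primary                 # unexpired lease: at most one primary exists
        primary = replicas[0]              # otherwise grant a fresh lease to some replica
        self.leases[handle] = (primary, time.time() + LEASE_PERIOD)
        return primary
```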

  32. Write Operation (1/3) ◮ 1. Application originates the request. ◮ 2. The GFS client translates the request and sends it to the master. ◮ 3. The master responds with the chunk handle and replica locations. 23 / 69

  33. Write Operation (2/3) ◮ 4. The client pushes the write data to all locations. The data is stored in the chunk-servers' internal buffers. 24 / 69

  34. Write Operation (3/3) ◮ 5. The client sends write command to the primary. ◮ 6. The primary determines serial order for data instances in its buffer and writes the instances in that order to the chunk. ◮ 7. The primary sends the serial order to the secondaries and tells them to perform the write. 25 / 69
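
Steps 5-7 at the primary, sketched as code; the buffering, RPC, and reply details are simplified assumptions:

```python
class PrimaryReplica:
    """Sketch of how a primary serializes concurrent writes to one chunk."""
    def __init__(self, chunk: bytearray, secondaries: list):
        self.chunk = chunk                 # local copy of the chunk data
        self.secondaries = secondaries     # the other replicas of this chunk
        self.next_serial = 0

    def apply_write(self, offset: int, data: bytes) -> str:
        serial = self.next_serial          # 6. assign the next position in the update order
        self.next_serial += 1
        self.chunk[offset:offset + len(data)] = data      # apply locally in that order
        for s in self.secondaries:         # 7. secondaries apply the write in the same order
            s.apply(serial, offset, data)
        return "ok"                        # reply only after all replicas have finished
```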

  35. Write Consistency ◮ Primary enforces one update order across all replicas for concurrent writes. ◮ It also waits until a write finishes at the other replicas before it replies. 26 / 69

  36. Write Consistency ◮ Primary enforces one update order across all replicas for concurrent writes. ◮ It also waits until a write finishes at the other replicas before it replies. ◮ Therefore: • We will have identical replicas. • But, a file region may end up containing mingled fragments from different clients: e.g., writes to different chunks may be ordered differently by their different primary chunk-servers. • Thus, writes leave the file in a consistent but undefined state in GFS. 26 / 69

  37. Append Operation (1/2) ◮ 1. Application originates the record append request. ◮ 2. The client translates the request and sends it to the master. ◮ 3. The master responds with the chunk handle and replica locations. ◮ 4. The client pushes the write data to all locations. 27 / 69

  38. Append Operation (2/2) ◮ 5. The primary checks if the record fits in the specified chunk. 28 / 69

  39. Append Operation (2/2) ◮ 5. The primary checks if the record fits in the specified chunk. ◮ 6. If the record does not fit, then the primary: • Pads the chunk, • Tells secondaries to do the same, • And informs the client. • The client then retries the append with the next chunk. 28 / 69

  40. Append Operation (2/2) ◮ 5. The primary checks if the record fits in the specified chunk. ◮ 6. If the record does not fit, then the primary: • Pads the chunk, • Tells secondaries to do the same, • And informs the client. • The client then retries the append with the next chunk. ◮ 7. If the record fits, then the primary: • Appends the record, • Tells secondaries to do the same, • Receives responses from secondaries, • And sends the final response to the client. 28 / 69
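
The primary's decision in steps 5-7, sketched as code; the pad/append calls on the secondaries and the -1 "retry on the next chunk" convention are assumptions of this sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk: bytearray, record: bytes, secondaries: list) -> int:
    """Primary-side sketch of record append: returns the offset of the record,
    or -1 to tell the client to retry on the next chunk."""
    if len(chunk) + len(record) > CHUNK_SIZE:     # 5./6. the record does not fit
        padding = CHUNK_SIZE - len(chunk)
        chunk.extend(b"\0" * padding)             # pad this chunk to its full size
        for s in secondaries:
            s.pad(padding)                        # secondaries pad identically
        return -1                                 # client retries with the next chunk
    offset = len(chunk)                           # 7. the record fits: append here
    chunk.extend(record)
    for s in secondaries:
        s.append(record)                          # secondaries append at the same offset
    return offset
```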

  41. Delete Operation ◮ A metadata operation. ◮ Renames the file to a special name. ◮ After a certain time, deletes the actual chunks. ◮ Supports undelete for a limited time. ◮ Actual removal is done by lazy garbage collection. 29 / 69
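
Rename-then-collect deletion might look roughly like this; the hidden-name convention and the retention window are assumptions for the sketch:

```python
import time

UNDELETE_WINDOW = 3 * 24 * 3600    # assumed retention period before chunks are reclaimed

class Namespace:
    def __init__(self):
        self.files = {}        # pathname -> list of chunk handles
        self.deleted = {}      # hidden name -> (chunk handles, deletion time)

    def delete(self, path: str) -> None:
        """A pure metadata operation: the file is just renamed to a hidden entry."""
        handles = self.files.pop(path)
        self.deleted[f".deleted{path}.{int(time.time())}"] = (handles, time.time())

    def undelete(self, hidden: str, new_path: str) -> None:
        """Possible as long as the garbage collector has not reclaimed the chunks."""
        handles, _ = self.deleted.pop(hidden)
        self.files[new_path] = handles

    def garbage_collect(self) -> list:
        """Run lazily and periodically: reclaim chunks of files deleted long enough ago."""
        now, reclaimed = time.time(), []
        for hidden, (handles, t) in list(self.deleted.items()):
            if now - t > UNDELETE_WINDOW:
                reclaimed.extend(handles)
                del self.deleted[hidden]
        return reclaimed   # these chunk handles can now be erased on the chunkservers
```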

  42. The Master Operations 30 / 69

  43. A Single Master ◮ The master has global knowledge of the whole system ◮ This simplifies the design ◮ The master is (hopefully) never the bottleneck • Clients never read and write file data through the master • A client only asks the master which chunkservers to talk to • Further reads of the same chunk do not involve the master 31 / 69

  44. The Master Operations ◮ Namespace management and locking ◮ Replica placement ◮ Creating, re-replicating and re-balancing replicas ◮ Garbage collection ◮ Stale replica detection 32 / 69

  45. Namespace Management and Locking (1/2) ◮ The master represents its namespace as a lookup table mapping pathnames to metadata. 33 / 69

  46. Namespace Management and Locking (1/2) ◮ The master represents its namespace as a lookup table mapping pathnames to metadata. ◮ Each master operation acquires a set of locks before it runs. ◮ Read locks on the internal nodes, and a read/write lock on the leaf. 33 / 69
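
For example, an operation on /home/user/foo would take read locks on /home and /home/user and a read/write lock on /home/user/foo. A sketch of that locking discipline; the lock table and helper are illustrative, and a plain RLock stands in for a proper read/write lock:

```python
from threading import RLock
from collections import defaultdict
from contextlib import contextmanager

# One lock per pathname; a simple RLock stands in for a real read/write lock here.
locks = defaultdict(RLock)

def ancestors(path: str):
    """Yield /a, /a/b, ... for /a/b/c (all proper prefixes of the path)."""
    parts = path.strip("/").split("/")
    for i in range(1, len(parts)):
        yield "/" + "/".join(parts[:i])

@contextmanager
def namespace_locks(path: str):
    """Lock every ancestor pathname (read) and the leaf pathname (read/write)."""
    acquired = []
    try:
        for p in ancestors(path):       # read locks on the internal nodes
            locks[p].acquire()
            acquired.append(p)
        locks[path].acquire()           # read/write lock on the leaf
        acquired.append(path)
        yield
    finally:
        for p in reversed(acquired):
            locks[p].release()

# Example: a master operation on /home/user/foo
with namespace_locks("/home/user/foo"):
    pass   # mutate the namespace entry for /home/user/foo here
```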
