  1. Università degli Studi di Roma “Tor Vergata”
     Dipartimento di Ingegneria Civile e Ingegneria Informatica
     Distributed and Cloud Storage Systems
     Corso di Sistemi Distribuiti e Cloud Computing, A.A. 2018/19
     Valeria Cardellini

     Why scale the storage?
     • Storage capacities and data transfer rates have increased massively over the years
       – HDD: size ~1 TB, speed 250 MB/s
       – SSD: size ~1 TB, speed 850 MB/s
     • Let's consider the time needed to transfer data*

       Data size   HDD         SSD
       10 GB       40s         12s
       100 GB      6m 49s      2m
       1 TB        1h 9m 54s   20m 33s
       10 TB       ?           ?

     • We need to scale out!
     * assuming no transfer overhead
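The times in the table follow from a simple back-of-the-envelope division. The Python sketch below reproduces them under the same assumption as the slide (no transfer overhead), plus binary prefixes (1 GB = 1024 MB, 1 TB = 1024 GB), which is what makes the figures match:

# Reproduces the transfer times in the table above.
# Assumptions: binary prefixes (1 GB = 1024 MB, 1 TB = 1024 GB) and no transfer overhead.
def transfer_time(size_mb: int, speed_mb_per_s: int) -> str:
    seconds = int(size_mb / speed_mb_per_s)   # truncate fractional seconds
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    parts = ([f"{h}h"] if h else []) + ([f"{m}m"] if h or m else []) + [f"{s}s"]
    return " ".join(parts)

GB, TB = 1024, 1024 * 1024                    # sizes expressed in MB
for label, size in [("10 GB", 10 * GB), ("100 GB", 100 * GB), ("1 TB", TB), ("10 TB", 10 * TB)]:
    print(label, "HDD:", transfer_time(size, 250), "SSD:", transfer_time(size, 850))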

  2. General principles for scalable data storage
     • Scalability and high performance
       – To cope with the continuous growth of data to store
       – Use multiple storage nodes
     • Ability to run on commodity hardware
       – Hardware failures are the norm rather than the exception
     • Reliability and fault tolerance
       – Transparent data replication
     • Availability
       – Data should be available when needed
       – CAP theorem: trade-off with consistency

     Solutions for scalable data storage
     Various forms of scalable data storage:
     • Distributed file systems
       – Manage (large) files on multiple nodes
       – Examples: Google File System, Hadoop Distributed File System
     • NoSQL databases (more generally, NoSQL data stores)
       – Simple and flexible non-relational data models
       – Horizontal scalability and fault tolerance
       – Key-value, column-family, document, and graph stores
       – Examples: BigTable, Cassandra, MongoDB, HBase, DynamoDB
       – Many existing time series databases are built on top of NoSQL databases (examples: InfluxDB, KairosDB)
     • NewSQL databases
       – Add horizontal scalability and fault tolerance to the relational model
       – Examples: VoltDB, Google Spanner

  3. Scalable data storage solutions
     • The whole picture of the different solutions (figure)

     Data storage in the Cloud
     • Main goals:
       – Massive scaling “on demand” (elasticity)
       – Data availability
       – Simplified application development and deployment
     • Some storage systems are offered only as Cloud services
       – Either directly (e.g., Amazon DynamoDB, Google Bigtable, Google Cloud Storage) or as part of a programming environment
     • Other proprietary systems are used only internally (e.g., Dynamo, GFS)

  4. Distributed file systems
     • Represent the primary support for data management
     • Manage data storage across a network of machines
     • Provide an interface to store information in the form of files and later access it for read and write operations
       – Using the traditional file system interface
     • Several solutions with different design choices
       – GFS, Apache HDFS (GFS open-source clone): designed for batch applications with large files
       – Alluxio: in-memory (high-throughput) storage system
       – Lustre, Ceph: designed for high performance

     Where to store data?
     • Memory I/O vs. disk I/O
     • See “Latency numbers every programmer should know”: http://bit.ly/2pZXIU9

  5. Case study: Google File System
     • Distributed, fault-tolerant file system implemented in user space
     • Manages (very) large files: usually multi-GB
     • Divide et impera: each file is divided into fixed-size chunks
     • Chunks:
       – Have a fixed size
       – Are transparent to users
       – Each chunk is stored as a plain file
     • Files follow the write-once, read-many-times pattern
       – Efficient append operation: appends data at the end of a file atomically at least once, even in the presence of concurrent operations (minimal synchronization overhead)
     • Fault tolerance and high availability through chunk replication
     S. Ghemawat, H. Gobioff, S.-T. Leung, “The Google File System”, ACM SOSP ’03.

     GFS operation environment (figure)

  6. GFS: architecture
     • Master
       – Single, centralized entity (this simplifies the design)
       – Manages file metadata (stored in memory)
         • Metadata: access control information, mapping from files to chunks, chunk locations
       – Does not store data (i.e., chunks)
       – Manages chunks: creation, replication, load balancing, deletion
     • Chunkservers (100s – 1000s)
       – Store chunks as files
       – Spread across cluster racks
     • Clients
       – Issue control (metadata) requests to the GFS master
       – Issue data requests directly to GFS chunkservers
       – Cache metadata but not data (simplifies the design)
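To make the master's role concrete, here is a minimal sketch of the kind of in-memory metadata it keeps (namespace, file-to-chunk mapping, chunk locations). The class and field names are hypothetical, not GFS's actual data structures:

# Minimal sketch of the GFS master's in-memory metadata; names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkInfo:
    chunk_id: int                                        # globally unique chunk handle
    version: int                                         # chunk version number
    locations: List[str] = field(default_factory=list)   # chunkservers holding a replica

@dataclass
class FileMetadata:
    acl: str                                             # access control information
    chunk_ids: List[int] = field(default_factory=list)   # ordered list of the file's chunks

class MasterMetadata:
    def __init__(self):
        self.namespace: Dict[str, FileMetadata] = {}     # full pathname -> file metadata
        self.chunks: Dict[int, ChunkInfo] = {}           # chunk handle -> chunk info

    def lookup(self, path: str, chunk_index: int) -> ChunkInfo:
        """Resolve (file name, chunk index) into the info returned to clients:
        chunk handle, version number, and current replica locations."""
        return self.chunks[self.namespace[path].chunk_ids[chunk_index]]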

  7. GFS: metadata
     • The master stores three major types of metadata:
       – File and chunk namespace (directory hierarchy)
       – Mapping from files to chunks
       – Current locations of chunks
     • Metadata are stored in memory (64 B per chunk)
       – Pro: fast; easy and efficient to scan the entire state
       – Con: the number of chunks is limited by the amount of memory of the master
         • “The cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility gained”
     • The master also keeps an operation log with a historical record of metadata changes
       – Persistent on local disk
       – Replicated
       – Checkpointed for fast recovery

     GFS: chunk size
     • Chunk size is either 64 MB or 128 MB
       – Much larger than typical file system block sizes
     • Why? A large chunk size reduces:
       – The number of interactions between client and master
       – The size of the metadata stored on the master
       – Network overhead (the client can keep a persistent TCP connection to the chunkserver over an extended period of time)
     • Potential disadvantage
       – Chunks of small files may become hot spots
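Two quick back-of-the-envelope computations based on the figures above (64 MB chunks, about 64 B of master metadata per chunk); the helper names below are illustrative only:

# Assumed figures from the slides: 64 MB chunks, ~64 B of master metadata per chunk.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
METADATA_PER_CHUNK = 64                # bytes kept in the master's memory

def chunk_index(byte_offset: int) -> int:
    """A client translates a byte offset within a file into a chunk index."""
    return byte_offset // CHUNK_SIZE

def master_memory_bytes(total_data_bytes: int) -> int:
    """Rough lower bound on the master memory needed to index a given data volume."""
    num_chunks = -(-total_data_bytes // CHUNK_SIZE)      # ceiling division
    return num_chunks * METADATA_PER_CHUNK

print(chunk_index(200 * 1024 * 1024))                    # offset at 200 MB -> chunk index 3
print(master_memory_bytes(2**50) / 2**30, "GiB")         # ~1 GiB of metadata for 1 PiB of data

This is the limitation flagged in the “Con” bullet above: the master's memory, not its processing power, bounds the number of chunks a single GFS cell can index.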

  8. GFS: fault tolerance and replication
     • The master replicates each chunk on several chunkservers and maintains the replication degree
       – At least 3 replicas on different chunkservers
       – Replication based on a primary-backup scheme
       – Replication degree > 3 for highly requested chunks
     • Multi-level placement of replicas
       – Different machines on the local rack: + reliability and availability
       – Different machines on different racks: + aggregate bandwidth
     • Data integrity
       – Each chunk is divided into 64 KB blocks, with a 32 B checksum for each block
       – Checksums are kept in memory
       – The checksum is checked every time a client reads data

     GFS: master operations
     • Stores metadata
     • Manages and locks the namespace
       – The namespace is represented as a lookup table
     • Periodic communication with each chunkserver
       – Sends instructions and collects chunkserver state (heartbeat messages)
     • Creates, re-replicates, and rebalances chunks
       – Balances disk space utilization and load
       – Distributes replicas among racks to increase fault tolerance
       – Re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
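A minimal sketch of the per-block integrity check described above (64 KB blocks, one checksum per block). zlib.crc32 is used only as a stand-in checksum, and the function names are illustrative:

# Illustrative per-block integrity check: each chunk is split into 64 KB blocks and
# every block has its own checksum. zlib.crc32 is only a stand-in checksum here.
import zlib
from typing import List

BLOCK_SIZE = 64 * 1024                                   # 64 KB

def compute_block_checksums(chunk_data: bytes) -> List[int]:
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data: bytes, checksums: List[int], offset: int, length: int) -> bool:
    """Before returning data to a reader, verify the blocks overlapping the requested
    byte range; a mismatch means the replica is corrupted and must not be served."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False                                 # report the error; restore from a good replica
    return True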

  9. GFS: master operations (2)
     • Garbage collection
       – A file deletion is logged by the master
       – The file is renamed to a hidden name with a deletion timestamp: its actual deletion is postponed
       – Deleted files can be easily recovered within a limited timespan
     • Stale replica detection
       – Chunk replicas may become stale if a chunkserver fails and misses updates to the chunk
       – For each chunk, the master keeps a chunk version number
       – The chunk version number is updated on each chunk mutation
       – The master removes stale replicas during its regular garbage collection

     GFS: system interactions
     • Files are hierarchically organized in directories
       – There is no data structure that represents a directory
     • A file is identified by its pathname
       – GFS does not support aliases
     • GFS supports traditional file system operations (but not the POSIX API)
       – create, delete, open, close, read, write
     • It also supports two special operations:
       – snapshot: makes a copy of a file or a directory tree almost instantaneously (based on copy-on-write techniques)
       – record append: atomically appends data to a file; multiple clients can append to the same file concurrently without fear of overwriting one another’s data
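A minimal sketch of stale replica detection through chunk version numbers, as described above; the class and function names are hypothetical, and the heartbeat plumbing is omitted:

# Illustrative sketch of stale replica detection via chunk version numbers (hypothetical names).
from typing import Dict, List

class ChunkRecord:
    def __init__(self, chunk_id: int, version: int, replicas: List[str]):
        self.chunk_id = chunk_id
        self.version = version          # master's current version number for the chunk
        self.replicas = replicas        # chunkservers believed to hold a replica

def bump_version_on_mutation(record: ChunkRecord) -> None:
    """The master increments the version number whenever the chunk is mutated."""
    record.version += 1

def detect_stale_replicas(record: ChunkRecord, reported: Dict[str, int]) -> List[str]:
    """Compare the version each chunkserver reports (e.g., in a heartbeat) against the
    master's version; replicas with an older version are stale and get garbage-collected."""
    return [server for server in record.replicas
            if reported.get(server, -1) < record.version]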

 10. GFS: system interactions – read operation
     • Data flow is decoupled from control flow
       (1) Client sends the master: read(file name, chunk index)
       (2) Master's reply: chunk ID, chunk version number, locations of the replicas
       (3) Client sends the “closest” chunkserver holding a replica: read(chunk ID, byte range)
       (4) Chunkserver replies with the data

     GFS: mutations
     • Mutations are write or append
       – Mutations are performed at all the chunk's replicas in the same order
     • Based on a lease mechanism
       – Goal: minimize management overhead at the master
       – The master grants a chunk lease to the primary replica
       – The primary picks a serial order for all the mutations to the chunk
       – All replicas follow this order when applying mutations
       – Leases are renewed using the periodic heartbeat messages between master and chunkservers
     • Data flow is decoupled from control flow
       – (3): The client sends the data to the closest replica first
       – To fully utilize network bandwidth, data are pushed linearly along a chain of chunkservers
       – The primary replies to the client, see (7)
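The read path above can be summarized in a short client-side sketch. MasterStub and ChunkserverStub are hypothetical stand-ins for the real RPC interfaces, included only so the example runs:

# Client-side sketch of the 4-step read path described above.
# MasterStub and ChunkserverStub are hypothetical stand-ins for the real RPC interfaces.
CHUNK_SIZE = 64 * 1024 * 1024                     # assuming 64 MB chunks

class MasterStub:
    def find_chunk(self, file_name, chunk_index):
        """Steps (1)-(2): resolve (file name, chunk index) to chunk ID, version, replicas."""
        return "chunk-0042", 7, ["cs-a", "cs-b", "cs-c"]   # canned reply for the sketch

class ChunkserverStub:
    def __init__(self, distance):
        self.distance = distance                  # "closeness" to the client (e.g., network hops)
    def read(self, chunk_id, byte_range):
        """Step (4): return the requested byte range of the chunk replica."""
        return b"\x00" * (byte_range[1] - byte_range[0])   # dummy data

def gfs_read(master, chunkservers, file_name, offset, length):
    chunk_index = offset // CHUNK_SIZE                                         # which chunk covers the offset
    chunk_id, version, locations = master.find_chunk(file_name, chunk_index)   # (1)-(2) ask the master
    closest = min(locations, key=lambda loc: chunkservers[loc].distance)       # (3) pick the closest replica
    start = offset % CHUNK_SIZE
    return chunkservers[closest].read(chunk_id, (start, start + length))       # (4) data from the chunkserver

servers = {"cs-a": ChunkserverStub(3), "cs-b": ChunkserverStub(1), "cs-c": ChunkserverStub(2)}
data = gfs_read(MasterStub(), servers, "/logs/web.log", offset=150 * 2**20, length=4096)

Note that a real GFS client would cache the master's reply (chunk ID and replica locations) for subsequent reads of the same chunk, but never the data itself, consistent with the earlier architecture slide.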
