Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long
Presenter: Md Rajib Hossen
Ceph: a single, open, and unified platform
- Horizontally scalable, with interoperability
- No single point of failure
- Workloads include tens of thousands of clients concurrently reading from and writing to the same file or directory
- Handles allocation and mapping with a dynamic algorithm, CRUSH
- Enhances the local disk with object storage devices (OSDs)
- MDS (metadata server): performs file operations (open, rename), manages the namespace, and ensures consistency, security, and safety
- OSD (object storage device): stores file data, maintains replication, and handles update serialization and recovery
- Client: supports three different client types: object, block, and POSIX file system
- Monitors: keep track of active and failed cluster nodes
- Files are stored as objects at the storage level, striped into several objects; object size, stripe width, and stripe count are configurable (see the striping sketch after this list)
- CRUSH: removes allocation tables, dynamically maps objects (stripes of files) to storage devices, retrieves object locations, and load-balances across nodes
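A small sketch of the configurable striping mentioned above, assuming a simple round-robin layout; the 4 MB object size, 1 MB stripe unit, and 4-way stripe count are invented example values:

    # Assumed example values: 4 MB objects, 1 MB stripe unit, 4-way striping.
    OBJECT_SIZE, STRIPE_UNIT, STRIPE_COUNT = 4 * 2**20, 1 * 2**20, 4
    UNITS_PER_OBJECT = OBJECT_SIZE // STRIPE_UNIT

    def locate(offset: int) -> tuple[int, int]:
        """Map a file byte offset to (object number, offset within that object)."""
        unit = offset // STRIPE_UNIT                # global stripe-unit index
        set_size = STRIPE_COUNT * UNITS_PER_OBJECT  # stripe units per object set
        ono = (unit // set_size) * STRIPE_COUNT + unit % STRIPE_COUNT
        obj_off = (unit % set_size // STRIPE_COUNT) * STRIPE_UNIT + offset % STRIPE_UNIT
        return ono, obj_off

    print(locate(5 * 2**20))  # 5 MB into the file -> (1, 1048576) with these values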
Ceph provides scalability as well as high performance, reliability, and availability. To achieve these, Ceph has three design features:
Decoupled Data and Metadata
Metadata operations (open, rename) are managed by the MDS, while OSDs perform file I/O. Moreover, CRUSH distributes file objects to storage devices algorithmically.
Dynamic Distributed Metadata Management
Uses dynamic subtree partitioning to distribute responsibility among several MDS nodes. The dynamic hierarchical partition also preserves locality, and the distribution is based on access patterns.
Reliable Autonomic Distributed Object Storage
Delegates responsibility to the OSDs and gives them the intelligence to utilize the memory and CPU of the storage nodes.
Q1. "...performance, reliability and availability through three fundamental design features:..." What are Ceph's design features? Compare Figure 1 with "Figure 1: GFS Architecture" in the GFS paper, read Section 2, and indicate the fundamental differences between them. [Hint: "...Figure 1: GFS Architecture", "Ceph utilizes a novel metadata cluster architecture...", "Ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs..."]
- GFS has a single master to coordinate and manage all work, whereas Ceph has a metadata cluster
- Ceph distributes replication and failure detection to the OSDs, whereas the GFS master manages these tasks
- GFS uses fixed-size chunks (64 MB), whereas Ceph has variable object sizes
- GFS uses a file mapping table kept in the master's memory, whereas Ceph uses CRUSH
- GFS depends on the local file system, whereas Ceph builds intelligent OSDs on top of a local or customized file system
- Ceph doesn't require metadata locks or leases for clients, whereas GFS does
- Replaces the traditional hard disk with an intelligent object storage device (OSD)
- Clients can read and write continuously to an OSD, which isn't possible with a traditional HDD; they can perform continuous reads and writes of large, variable-sized objects
- Object sizes are configurable (2 MB, 4 MB, etc.)
- OSDs distribute low-level block allocation decisions to the devices themselves
- Moreover, reliance on traditional file system principles, i.e., allocation lists and inode tables, limits scalability and performance
- The intelligence present in OSDs can utilize the CPU and memory in the storage nodes
Q2. "...write byte ranges to much larger (and often variably sized) named objects, distributing low-level block allocation decisions to the devices themselves." What are the major differences between an OSD (object storage device) and a conventional hard disk?
Ceph delegates some responsibility to the OSDs and reduces the dependency on the MDS. The MDS manages the file system namespace and file operations; the OSDs perform data access, update serialization, replication, and reliability. Ceph also removes the allocation table, providing the CRUSH algorithm for dynamic mapping between objects and storage devices.
On the other hand, GFS keeps file mapping information in the memory of the master. The master maintains the file system metadata, i.e., namespaces, file-to-chunk mappings, and the locations of replicas. It also performs chunk lease management, garbage collection, and chunk replication and migration.
Q3. "Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery." Does GFS have a file allocation table? Who is responsible for managing "data access, update serialization, replication and reliability, failure detection, and recovery" in GFS?
To store a named object in a pool, the client (a sketch of this calculation follows the list):
- takes the object name and hashes it
- calculates the hash modulo the number of PGs (e.g., 58) to get the PG number
- gets the pool ID for the given pool name (e.g., "juventus" = 4)
- prepends the pool ID to get the full PG ID (e.g., 4.58)
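A minimal Python sketch of this client-side calculation. The hash function, pool table, and PG count below are invented stand-ins; Ceph itself uses its own hash function and reads pool IDs and PG counts from the cluster maps:

    import hashlib

    # Invented example values; real Ceph reads these from its cluster maps.
    POOLS = {"juventus": 4}   # pool name -> pool ID
    PG_NUM = 128              # number of PGs in the pool

    def object_to_pg_id(pool_name: str, object_name: str) -> str:
        # Hash the object name, then take it modulo the number of PGs.
        h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
        pg = h % PG_NUM
        # Prepend the pool ID to form the full PG ID, e.g. "4.58".
        return f"{POOLS[pool_name]}.{pg:x}"

    print(object_to_pg_id("juventus", "my-object"))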
- First, the file is striped into several objects
- Objects are mapped into PGs using a hash function and an adjustable bit mask to control the number of PGs
- Each OSD holds on the order of 100 PGs to balance OSD utilization
- PGs are then mapped to OSDs via CRUSH
- To locate an object, CRUSH requires only the PG ID and the cluster map (see the sketch after the question below)
Q4. "Figure 3: Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs via CRUSH, a specialized replica placement function." Describe how to find the data associated with an inode and an in-file object number ("ino, ono").
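A hedged end-to-end sketch of the (ino, ono) lookup path: the object ID combines the inode and in-file object numbers, the PG ID is the hash of the object ID masked down to the number of PGs, and CRUSH maps the PG to an ordered list of OSDs. The hash and the crush() placement policy here are simplified placeholders, not the real algorithms:

    import hashlib

    def oid(ino: int, ono: int) -> str:
        # The object ID is just the inode number plus the in-file object number.
        return f"{ino:x}.{ono:08x}"

    def pgid(oid_str: str, pg_mask: int) -> int:
        # Hash the oid and apply the adjustable bit mask (pg_num - 1,
        # assuming a power-of-two number of PGs).
        h = int.from_bytes(hashlib.md5(oid_str.encode()).digest()[:4], "little")
        return h & pg_mask

    def crush(pg: int, osds: list[str], replicas: int = 3) -> list[str]:
        # Placeholder for CRUSH: deterministically pick `replicas` distinct
        # OSDs for the PG. Real CRUSH walks the failure-domain hierarchy.
        return [osds[(pg + i) % len(osds)] for i in range(replicas)]

    cluster = [f"osd.{i}" for i in range(10)]
    print(crush(pgid(oid(0x1234, 7), pg_mask=127), cluster))  # primary OSD listed first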
CRUSH was introduced to remove the mapping table, which requires significant memory and overhead to keep consistent. Moreover, any entity can calculate an object's location, and the map needs to be updated only infrequently. Mapping that relies on block or object list metadata has several drawbacks:
- distribution-related metadata must be exchanged between nodes
- upon removal of a node, the block/object list must be made consistent again, which requires many changes
With CRUSH, the PG is simply remapped to a new OSD, and the PG ID is computed dynamically. The same approach also helps with data rebalancing and with adding new OSD nodes, and it removes the dependency on the underlying storage nodes (see the sketch below).
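A small standalone sketch of that consequence, using the same kind of placeholder crush() policy as above; the OSD names and the removed node are invented:

    def crush(pg: int, osds: list[str], replicas: int = 2) -> list[str]:
        # Same placeholder placement policy as the earlier sketch.
        return [osds[(pg + i) % len(osds)] for i in range(replicas)]

    old_map = [f"osd.{i}" for i in range(10)]
    new_map = [o for o in old_map if o != "osd.3"]   # osd.3 fails and is removed

    # Every client recomputes the PG -> OSD mapping from the new cluster
    # map; no per-object table has to be patched or redistributed.
    pg = 42
    print(crush(pg, old_map))  # placement before the failure
    print(crush(pg, new_map))  # placement after, recomputed locally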
Q5. Does a mapping method (from an object number to its hosting storage server) relying on "block or object list metadata" (a table listing all object-server mappings) work as well? What's its drawback?
A PG aggregates a series of objects into a group and maps the group onto a series of OSDs. Tracking per-object placement and metadata is prohibitively expensive; PGs reduce the number of processes and the amount of per-object metadata to track when storing and retrieving data. There are other advantages to having logical placement groups on top of the OSD cluster:
- placement rules can be applied to specific PGs belonging to a pool
- it is easy to express distribution policies such as SSD group vs. HDD group, or same rack vs. different racks
- an OSD can self-report and monitor peers within its own PGs, which reduces load on the master
Mapping an oid directly to OSDs would lose these benefits (a back-of-the-envelope comparison follows the question).
Q6. Why are placement groups (PGs) introduced? Can we construct a hash function mapping an object ("oid") directly to a list of OSDs?
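A back-of-the-envelope sketch of the metadata savings; the ~100 PGs per OSD figure is from the slide above, while the object and OSD counts are assumed round numbers:

    # Assumed round numbers for illustration.
    num_objects = 10**9           # objects stored in the cluster
    num_osds = 1_000              # storage devices
    pgs_per_osd = 100             # target from above: ~100 PGs per OSD

    num_pgs = num_osds * pgs_per_osd     # 100,000 PGs
    print(f"per-object tracking: {num_objects:,} entries")
    print(f"per-PG tracking:     {num_pgs:,} entries "
          f"({num_objects // num_pgs:,}x fewer)")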
CRUSH determines how to store and retrieve data by computing data storage locations. The CRUSH mapping takes a placement group, the cluster map, and placement rules as input, and produces the list of OSDs onto which each PG is mapped. The CRUSH map also considers placement constraints, i.e., placing each PG on OSDs so as to reduce inter-row replication traffic and minimize exposure to power or switch failures.
Q7. What are the inputs of the CRUSH hash function? What can be included in an OSD cluster map? [Hint: read the last paragraph of Section 5.1 for the second question.]
The cluster map contains the cluster's full physical composition. Five maps make up the cluster map (a sketch follows the list):
- The monitor map: contains the current epoch, the cluster fsid, the creation time, and the name, address, and port of each monitor
- The OSD map: the list of pools, replica sizes, PG numbers, and the list of OSDs and their status
- The PG map: the PG version, timestamp, last OSD map epoch, details of each PG ID, and data usage statistics for each pool
- The CRUSH map: the list of storage devices, the failure domain hierarchy (device, host, rack, room, etc.), and the rules for placing data on that hierarchy
- The MDS map: the current MDS map epoch, the metadata pool, and the list of metadata servers and their status
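A minimal sketch of how these five maps might be represented; the field names and values are assumed for illustration and are not Ceph's actual structures:

    # Assumed field names and values, loosely following the list above.
    cluster_map = {
        "monitor_map": {"epoch": 12, "fsid": "a1b2c3d4", "creation_time": "2006-01-01",
                        "monitors": [("mon.a", "10.0.0.1:6789")]},
        "osd_map": {"pools": {"juventus": {"id": 4, "size": 3, "pg_num": 128}},
                    "osds": {f"osd.{i}": "up" for i in range(10)}},
        "pg_map": {"version": 3301, "last_osd_map_epoch": 12,
                   "pg_stats": {}, "pool_usage": {}},
        "crush_map": {"devices": [f"osd.{i}" for i in range(10)],
                      "hierarchy": ["device", "host", "rack", "room"],
                      "rules": ["spread replicas across racks"]},
        "mds_map": {"epoch": 5, "metadata_pool": 1, "mds": {"mds.a": "active"}},
    }

    # CRUSH consumes a PG, this map, and the placement rules to produce
    # the ordered list of OSDs for that PG (see the crush() sketch above).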