  1. BLUESTORE: A NEW STORAGE BACKEND FOR CEPH – ONE YEAR IN
     SAGE WEIL · 2017.03.23

  2. OUTLINE
  ● Ceph background and context
    – FileStore, and why POSIX failed us
  ● BlueStore – a new Ceph OSD backend
  ● Performance
  ● Recent challenges
  ● Future
  ● Status and availability
  ● Summary

  3. MOTIVATION

  4. CEPH
  ● Object, block, and file storage in a single cluster
  ● All components scale horizontally
  ● No single point of failure
  ● Hardware agnostic, commodity hardware
  ● Self-manage whenever possible
  ● Open source (LGPL)
  ● “A Scalable, High-Performance Distributed File System”
  ● “performance, reliability, and scalability”

  5. CEPH COMPONENTS
  ● OBJECT: RGW – a web services gateway for object storage, compatible with S3 and Swift
  ● BLOCK: RBD – a reliable, fully-distributed block device with cloud platform integration
  ● FILE: CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management
  ● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
  ● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

  6. OBJECT STORAGE DAEMONS (OSDS)
  [Diagram: monitors (M) alongside OSDs; each OSD sits on a local file system (xfs, btrfs, or ext4) on its own disk.]

  7. OBJECT STORAGE DAEMONS (OSDS)
  [Diagram: the same stack, with FileStore between each OSD and its local file system (xfs, btrfs, ext4).]

  8. OBJECTSTORE AND DATA MODEL
  ● ObjectStore
    – abstract interface for storing local data
    – implementations: EBOFS, FileStore
  ● EBOFS
    – a user-space extent-based object file system
    – deprecated in favor of FileStore on btrfs in 2009
  ● Object – “file”
    – data (file-like byte stream)
    – attributes (small key/value)
    – omap (unbounded key/value)
  ● Collection – “directory”
    – placement group shard (slice of the RADOS pool)
  ● All writes are transactions
    – Atomic + Consistent + Durable
    – Isolation provided by OSD
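
To make the data model above concrete, here is a minimal sketch of how an OSD-side caller composes a single atomic transaction. The op names (write, setattr, omap_setkeys) mirror the real ObjectStore::Transaction calls, but all types and signatures below are simplified stand-ins invented for illustration, not code from the Ceph tree.

    // Minimal sketch (not Ceph source): simplified stand-ins showing how several
    // updates are composed into one atomic ObjectStore transaction.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using Buffer = std::vector<uint8_t>;   // stand-in for ceph::bufferlist

    struct Transaction {
      // Each call only records an op; nothing reaches the store until the
      // transaction is queued, and then all ops apply atomically or not at all.
      std::vector<std::string> ops;

      void write(const std::string& coll, const std::string& oid,
                 uint64_t off, const Buffer& bl) {
        ops.push_back("write " + coll + "/" + oid + " @" + std::to_string(off) +
                      "~" + std::to_string(bl.size()));        // byte-stream data
      }
      void setattr(const std::string& coll, const std::string& oid,
                   const std::string& name, const Buffer& val) {
        ops.push_back("setattr " + coll + "/" + oid + " " + name +
                      " (" + std::to_string(val.size()) + " bytes)");  // small attr
      }
      void omap_setkeys(const std::string& coll, const std::string& oid,
                        const std::map<std::string, Buffer>& kv) {
        ops.push_back("omap_setkeys " + coll + "/" + oid + " (" +
                      std::to_string(kv.size()) + " keys)");   // unbounded key/value
      }
    };

    int main() {
      Transaction t;
      Buffer data(4096, 0), attr{1, 2, 3}, logent{4, 5};
      t.write("0.6_head", "object23", 0, data);                // write some bytes
      t.setattr("0.6_head", "object23", "_", attr);            // update object attribute
      t.omap_setkeys("0.6_head", "pglog", {{"0000000005.06", logent}});  // append to PG log
      // A backend must make all three ops visible together, or none of them.
      return 0;
    }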

  9. FILESTORE
  ● FileStore
    – PG = collection = directory
    – object = file
  ● Leveldb
    – large xattr spillover
    – object omap (key/value) data
  ● Originally just for development...
    – later, the only supported backend (on XFS)
  ● On-disk layout:
    /var/lib/ceph/osd/ceph-123/
      current/
        meta/
          osdmap123
          osdmap124
        0.1_head/
          object1
          object12
        0.7_head/
          object3
          object5
        0.a_head/
          object4
          object6
        omap/
          <leveldb files>

  10. POSIX FAILS: TRANSACTIONS
  ● Most transactions are simple
    – write some bytes to object (file)
    – update object attribute (file xattr)
    – append to update log (kv insert)
  ● ...but others are arbitrarily large/complex
  ● Serialize and write-ahead txn to journal for atomicity
    – We double-write everything!
    – Lots of ugly hackery to make replayed events idempotent
  ● Example transaction dump:
    [
      {
        "op_name": "write",
        "collection": "0.6_head",
        "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
        "length": 4194304,
        "offset": 0,
        "bufferlist length": 4194304
      },
      {
        "op_name": "setattrs",
        "collection": "0.6_head",
        "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
        "attr_lens": {
          "_": 269,
          "snapset": 31
        }
      },
      {
        "op_name": "omap_setkeys",
        "collection": "0.6_head",
        "oid": "#0:60000000::::head#",
        "attr_lens": {
          "0000000005.00000000000000000006": 178,
          "_info": 847
        }
      }
    ]
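
The double write mentioned above comes from write-ahead journaling: the transaction is serialized, appended to a journal and synced, and only then applied to the file system, so the data payload hits the disk twice. A rough sketch of that flow follows; the helper functions are hypothetical, not FileStore's actual internals.

    // Sketch of FileStore-style write-ahead journaling (illustrative only; the
    // helper names below are made up, not FileStore's real functions).
    #include <string>
    #include <vector>

    struct Txn { std::vector<std::string> ops; };

    std::string serialize(const Txn& t) { return std::to_string(t.ops.size()); }
    void journal_append_and_fsync(const std::string&) { /* append record, fsync journal */ }
    void apply_to_filesystem(const Txn&) { /* write()/setxattr()/leveldb puts */ }
    void journal_trim() { /* drop entries once the file system has synced */ }

    void commit(const Txn& t) {
      // 1) First copy: the whole transaction, data payload included, goes to the
      //    journal so a crash can be recovered by replaying it.
      journal_append_and_fsync(serialize(t));
      // 2) Second copy: the same data is applied to the backing file system.
      //    Replay is not exactly-once, so every op must be made idempotent.
      apply_to_filesystem(t);
      // 3) Only after the file system syncs can the journal entries be trimmed.
      journal_trim();
    }

    int main() { Txn t; t.ops.push_back("write"); commit(t); return 0; }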

  11. POSIX FAILS: ENUMERATION
  ● Ceph objects are distributed by a 32-bit hash
  ● Enumeration is in hash order
    – scrubbing
    – “backfill” (data rebalancing, recovery)
    – enumeration via librados client API
  ● POSIX readdir is not well-ordered
    – and even if it were, it would be a different hash
  ● Need O(1) “split” for a given shard/range
  ● Build directory tree by hash-value prefix
    – split any directory when size > ~100 files
    – merge when size < ~50 files
    – read entire directory, sort in-memory
  ● Example layout:
    …
    DIR_A/
      DIR_A/A03224D3_qwer
      DIR_A/A247233E_zxcv
      …
    DIR_B/
      DIR_B/DIR_8/
        DIR_B/DIR_8/B823032D_foo
        DIR_B/DIR_8/B8474342_bar
      DIR_B/DIR_9/
        DIR_B/DIR_9/B924273B_baz
      DIR_B/DIR_A/
        DIR_B/DIR_A/BA4328D2_asdf
    …
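
As an illustration of the hash-prefix workaround above, the sketch below maps a 32-bit object hash to a nested DIR_* path by peeling off one hex digit per split level. It is a simplified stand-in, not FileStore's HashIndex implementation.

    // Illustrative only: build a FileStore-style directory path from an object's
    // 32-bit hash, one hex digit of prefix per split level.
    #include <cstdint>
    #include <cstdio>
    #include <string>

    std::string hash_to_path(uint32_t hash, int levels) {
      char hex[9];
      std::snprintf(hex, sizeof(hex), "%08X", hash);
      std::string path;
      for (int i = 0; i < levels; ++i) {
        path += "DIR_";
        path += hex[i];            // e.g. DIR_B/DIR_8/ for hash 0xB823032D
        path += "/";
      }
      path += hex;                 // leaf entry name starts with the full hash
      return path;
    }

    int main() {
      // After two levels of splits, hash 0xB823032D lands in DIR_B/DIR_8/.
      std::printf("%s_foo\n", hash_to_path(0xB823032Du, 2).c_str());
      // Hash-ordered enumeration still requires reading a whole leaf directory
      // and sorting it in memory, because readdir() returns entries unordered.
      return 0;
    }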

  12. THE HEADACHES CONTINUE
  ● New FileStore problems continue to surface as we approach the switch to BlueStore
    – recently discovered bug in the FileStore omap implementation, revealed by new CephFS scrubbing
    – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison
    – read/modify/write workloads perform horribly
      ● RGW index objects
      ● RBD object bitmaps
    – QoS efforts thwarted by deep queues and periodicity in FileStore throughput
    – cannot bound deferred writeback work, even with fsync(2)
    – {RBD, CephFS} snapshots triggering inefficient 4MB object copies to create object clones

  13. BLUESTORE

  14. BLUESTORE
  ● BlueStore = Block + NewStore
    – consume raw block device(s)
    – key/value database (RocksDB) for metadata
    – data written directly to block device
    – pluggable block Allocator (policy)
    – pluggable compression
    – checksums, ponies, ...
  ● We must share the block device with RocksDB
  [Diagram: ObjectStore → BlueStore; metadata goes through RocksDB → BlueRocksEnv → BlueFS, while data goes directly to the BlockDevice.]

  15. ROCKSDB: BLUEROCKSENV + BLUEFS
  ● class BlueRocksEnv : public rocksdb::EnvWrapper
    – passes “file” operations to BlueFS
  ● BlueFS is a super-simple “file system”
    – all metadata lives in the journal
    – all metadata loaded in RAM on start/mount
    – no need to store block free list
    – coarse allocation unit (1 MB blocks)
    – journal rewritten/compacted when it gets large
  ● Map “directories” to different block devices
    – db.wal/ – on NVRAM, NVMe, SSD
    – db/ – level0 and hot SSTs on SSD
    – db.slow/ – cold SSTs on HDD
  ● BlueStore periodically balances free space
  [Diagram: BlueFS on-disk layout – superblock, journal extents (file 10, file 11, file 12, rm file 12, file 13, ...), more journal, and data extents.]
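
The glue layer is deliberately thin: BlueRocksEnv redirects RocksDB's file operations to BlueFS. The sketch below shows the shape of that redirection with simplified, made-up types and signatures; it does not reproduce RocksDB's Env interface or Ceph's actual BlueRocksEnv.

    // Shape-of-the-glue sketch: every RocksDB file operation is redirected to
    // BlueFS. Types and signatures are simplified stand-ins, not RocksDB's Env
    // interface or Ceph's BlueRocksEnv.
    #include <memory>
    #include <string>

    struct BlueFS {                       // stand-in for Ceph's BlueFS
      struct FileWriter {};
      int open_for_write(const std::string& /*dir*/, const std::string& /*file*/,
                         std::unique_ptr<FileWriter>* h) {
        h->reset(new FileWriter);         // allocate extents, log to BlueFS journal
        return 0;
      }
      int mkdir(const std::string& /*dir*/) { return 0; }
    };

    struct Env {                          // stand-in for rocksdb::Env / EnvWrapper
      virtual ~Env() = default;
      virtual int NewWritableFile(const std::string& fname,
                                  std::unique_ptr<BlueFS::FileWriter>* result) = 0;
      virtual int CreateDir(const std::string& dirname) = 0;
    };

    // "BlueRocksEnv": SSTs and the WAL end up inside BlueFS rather than on a
    // POSIX file system, so RocksDB and object data can share raw block devices.
    class BlueRocksEnvSketch : public Env {
     public:
      explicit BlueRocksEnvSketch(BlueFS* fs) : fs_(fs) {}
      int NewWritableFile(const std::string& fname,
                          std::unique_ptr<BlueFS::FileWriter>* result) override {
        auto slash = fname.find('/');     // e.g. "db.wal/000003.log"
        return fs_->open_for_write(fname.substr(0, slash),
                                   fname.substr(slash + 1), result);
      }
      int CreateDir(const std::string& dirname) override { return fs_->mkdir(dirname); }
     private:
      BlueFS* fs_;
    };

    int main() {
      BlueFS fs;
      BlueRocksEnvSketch env(&fs);
      std::unique_ptr<BlueFS::FileWriter> h;
      return env.NewWritableFile("db.wal/000003.log", &h);
    }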

  16. MULTI-DEVICE SUPPORT
  ● Single device (HDD or SSD)
    – bluefs db.wal/ + db/ (wal and sst files)
    – object data blobs
  ● Two devices
    – 512MB of SSD or NVRAM
      ● bluefs db.wal/ (rocksdb wal)
    – big device
      ● bluefs db/ (sst files, spillover)
      ● object data blobs
  ● Two devices
    – a few GB of SSD
      ● bluefs db.wal/ (rocksdb wal)
      ● bluefs db/ (warm sst files)
    – big device
      ● bluefs db.slow/ (cold sst files)
      ● object data blobs
  ● Three devices
    – 512MB NVRAM
      ● bluefs db.wal/ (rocksdb wal)
    – a few GB of SSD
      ● bluefs db/ (warm sst files)
    – big device
      ● bluefs db.slow/ (cold sst files)
      ● object data blobs

  17. METADATA

  18. BLUESTORE METADATA
  ● Everything in a flat kv database (rocksdb)
  ● Partition namespace for different metadata
    – S* – “superblock” properties for the entire store
    – B* – block allocation metadata (free block bitmap)
    – T* – stats (bytes used, compressed, etc.)
    – C* – collection name → cnode_t
    – O* – object name → onode_t or bnode_t
    – X* – shared blobs
    – L* – deferred writes (promises of future IO)
    – M* – omap (user key/value data, stored in objects)
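
Because all of this shares one flat RocksDB keyspace, each class of metadata gets a distinct key prefix, and keys are encoded so that lexicographic order matches the enumeration order BlueStore wants. A rough sketch of the idea follows; the real encoding escapes names, bit-reverses the hash, and packs more fields than shown here.

    // Illustrative key construction for a flat, prefix-partitioned keyspace.
    // Only the prefix + ordered-encoding idea is shown.
    #include <cstdint>
    #include <cstdio>
    #include <string>

    static void append_be32(std::string& k, uint32_t v) {
      // Big-endian bytes so lexicographic key order equals numeric order.
      for (int s = 24; s >= 0; s -= 8) k.push_back(char((v >> s) & 0xff));
    }

    // O* keys: objects sort by (pool, hash, name), giving hash-ordered enumeration.
    std::string object_key(uint32_t pool, uint32_t hash, const std::string& name) {
      std::string k = "O";
      append_be32(k, pool);
      append_be32(k, hash);
      k += name;
      return k;
    }

    // C* keys: one entry per collection (PG shard), value is the cnode.
    std::string collection_key(const std::string& cname) { return "C" + cname; }

    int main() {
      std::printf("object key is %zu bytes; collection key is %zu bytes\n",
                  object_key(12, 0x3d3d880e, "foo").size(),
                  collection_key("12.e3d3").size());
      return 0;
    }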

  19. CNODE
  ● Collection metadata
    – an interval of the object namespace

    struct spg_t {
      uint64_t pool;
      uint32_t hash;
      shard_id_t shard;
    };

    struct bluestore_cnode_t {
      uint32_t bits;
    };

  ● Keys encode <shard, pool, hash, name> (C*) and <shard, pool, hash, name, snap, gen> (O*):
    C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>
    O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …
  ● Nice properties
    – ordered enumeration of objects
    – we can “split” collections by adjusting collection metadata only
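
The cheap split works because membership is decided purely by hash bits: roughly, the cnode's bits field says how many low-order bits of an object's hash must match the collection's hash. Below is a hedged sketch of that membership test, with hypothetical PG and hash values; it is illustrative, not BlueStore's exact code.

    // Illustrative membership test: an object belongs to a collection iff the
    // low `bits` bits of its 32-bit hash match the collection's hash.
    #include <cstdint>
    #include <cstdio>

    struct Cnode { uint32_t hash; uint32_t bits; };  // simplified pgid hash + bluestore_cnode_t

    bool contains(const Cnode& c, uint32_t object_hash) {
      uint32_t mask = (c.bits >= 32) ? 0xffffffffu : ((1u << c.bits) - 1);
      return (object_hash & mask) == (c.hash & mask);
    }

    int main() {
      Cnode parent{0x0000e3d3, 14};              // hypothetical PG 12.e3d3, bits=14
      // Splitting the PG only rewrites the cnodes: the children use bits=15 and
      // differ in the newly significant bit. Object keys, already sorted by hash,
      // are not touched or moved.
      Cnode child_a{0x0000e3d3, 15};
      Cnode child_b{0x0000e3d3 ^ (1u << 14), 15};
      uint32_t h = 0x1234e3d3;                   // an example object hash in the parent
      std::printf("parent:%d child_a:%d child_b:%d\n",
                  contains(parent, h), contains(child_a, h), contains(child_b, h));
      return 0;
    }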

  20. ONODE
  ● Per-object metadata
    – lives directly in a key/value pair
    – serializes to 100s of bytes
  ● Size in bytes
  ● Attributes (user attr data)
  ● Inline extent map (maybe)

    struct bluestore_onode_t {
      uint64_t size;
      map<string,bufferptr> attrs;
      uint64_t flags;

      struct shard_info {
        uint32_t offset;
        uint32_t bytes;
      };
      vector<shard_info> shards;

      bufferlist inline_extents;
      bufferlist spanning_blobs;
    };
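
The inline/sharded extent map referenced above is what turns a logical object offset into a blob and an offset within it. Below is a simplified sketch of that lookup; BlueStore's real ExtentMap shards, spans, and lazily decodes this structure.

    // Illustrative extent-map lookup: logical object offset -> (blob, offset in blob).
    #include <cstdint>
    #include <cstdio>
    #include <map>

    struct Extent { uint32_t length; int blob_id; uint32_t blob_off; };

    // Keyed by logical offset within the object; kept sorted so lookup is O(log n).
    using ExtentMap = std::map<uint64_t, Extent>;

    const Extent* find_extent(const ExtentMap& em, uint64_t off, uint64_t* within) {
      auto it = em.upper_bound(off);             // first extent starting after off
      if (it == em.begin()) return nullptr;      // hole before the first extent
      --it;
      if (off >= it->first + it->second.length)  // hole: offset past extent's end
        return nullptr;
      *within = off - it->first;
      return &it->second;
    }

    int main() {
      ExtentMap em;
      em[0x00000] = {0x10000, /*blob*/1, 0};     // first 64 KB lives in blob 1
      em[0x40000] = {0x10000, /*blob*/2, 0};     // a later 64 KB lives in blob 2
      uint64_t within = 0;
      if (const Extent* e = find_extent(em, 0x40123, &within))
        std::printf("blob %d at blob offset 0x%x\n",
                    e->blob_id, unsigned(e->blob_off + within));
      return 0;
    }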

  21. BLOBS
  ● Blob
    – extent(s) on device
    – a lump of data originating from the same object
    – may later be referenced by multiple objects
    – normally checksummed
    – may be compressed

    struct bluestore_blob_t {
      vector<bluestore_pextent_t> extents;
      uint32_t compressed_length_orig = 0;
      uint32_t compressed_length = 0;
      uint32_t flags = 0;
      uint16_t unused = 0;  // bitmap

      uint8_t csum_type = CSUM_NONE;
      uint8_t csum_chunk_order = 0;
      bufferptr csum_data;
    };

  ● SharedBlob
    – extent ref count on cloned blobs
    – in-memory buffer cache

    struct bluestore_shared_blob_t {
      uint64_t sbid;
      bluestore_extent_ref_map_t ref_map;
    };
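
Since blobs carry their own checksums, reads can be verified end to end: blob data is checksummed in chunks of 2^csum_chunk_order bytes, with one csum_data entry per chunk. The sketch below shows the verify-on-read idea using a toy checksum as a stand-in for BlueStore's default crc32c; it is illustrative only.

    // Illustrative verify-on-read: one checksum per 2^csum_chunk_order byte chunk.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static uint32_t toy_csum(const uint8_t* p, size_t n) {
      uint32_t h = 2166136261u;                  // FNV-1a, just for the sketch
      for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 16777619u; }
      return h;
    }

    struct BlobCsum {
      uint8_t csum_chunk_order = 12;             // 4 KB chunks
      std::vector<uint32_t> csum_data;           // one value per chunk
    };

    bool verify_read(const BlobCsum& b, const std::vector<uint8_t>& data) {
      size_t chunk = size_t(1) << b.csum_chunk_order;
      for (size_t i = 0, c = 0; i < data.size(); i += chunk, ++c)
        if (toy_csum(&data[i], std::min(chunk, data.size() - i)) != b.csum_data[c])
          return false;                          // bit rot / misdirected write detected
      return true;
    }

    int main() {
      std::vector<uint8_t> data(8192, 0xab);     // two 4 KB chunks
      BlobCsum b;
      b.csum_data = { toy_csum(&data[0], 4096), toy_csum(&data[4096], 4096) };
      std::printf("verify: %d\n", verify_read(b, data) ? 1 : 0);
      return 0;
    }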
