BLUESTORE: A NEW STORAGE BACKEND FOR CEPH – ONE YEAR IN
SAGE WEIL
2017.03.23
OUTLINE
– Ceph background and context
– FileStore, and why POSIX failed us
– BlueStore, a new Ceph OSD backend
– Performance
– Recent challenges
RGW
– A web services gateway for object storage, compatible with S3 and Swift
LIBRADOS
– A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
– A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD
– A reliable, fully-distributed block device with cloud platform integration
CEPHFS
– A distributed file system with POSIX semantics and scale-out metadata management
[Diagram: many OSDs, each running FileStore on top of a local filesystem (xfs, btrfs, or ext4), together forming RADOS]
ObjectStore
– abstract interface for storing local data
– implementations: EBOFS, FileStore

EBOFS
– a user-space extent-based object file system
– deprecated in favor of FileStore

Objects
– data (file-like byte stream)
– attributes (small key/value)
– omap (unbounded key/value)

Collections
– placement group shard (slice of the RADOS pool)

All writes are transactions
– Atomic + Consistent + Durable
– Isolation provided by the OSD
FileStore
– PG = collection = directory
– object = file
– leveldb for large xattr spillover and object omap data
– later, the only supported backend (on XFS)

Example on-disk layout:

current/
  osdmap123
  osdmap124
  object1
  object12
  object3
  object5
  object4
  object6
  <leveldb files>
Most transactions are simple
– write some bytes to object (file)
– update object attribute (file xattr)
– append to update log (kv insert)
...but others are arbitrarily large/complex

Write-ahead journal for atomicity
– we double-write everything!
– lots of ugly hackery to make replayed events idempotent
[ { "op_name": "write", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "length": 4194304, "offset": 0, "bufferlist length": 4194304 }, { "op_name": "setattrs", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "attr_lens": { "_": 269, "snapset": 31 } }, { "op_name": "omap_setkeys", "collection": "0.6_head", "oid": "#0:60000000::::head#", "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } } ]
Ceph objects are enumerated in hash order
– scrubbing
– "backfill" (data rebalancing, recovery)
– enumeration via the librados client API

POSIX readdir is not well-ordered
– and even if it were, it would be a different hash

Directory tree built by hash-value prefix
– split any directory when size > ~100 files
– merge when size < ~50 files
– read entire directory, sort in-memory
BlueStore
Recent FileStore problems motivating a new backend
– recently discovered bug in the FileStore omap implementation, revealed by new CephFS scrubbing
– FileStore directory splits lead to throughput collapse when an entire pool's PG directories split in unison
– read/modify/write workloads perform horribly
– QoS efforts thwarted by deep queues and periodicity in FileStore throughput
– cannot bound deferred writeback work, even with fsync(2)
– {RBD, CephFS} snapshots trigger inefficient 4MB object copies to create clones
BlueStore = Block + NewStore
– consume raw block device(s)
– key/value database (RocksDB) for metadata
– data written directly to the block device
– pluggable block Allocator (policy)
– pluggable compression
– checksums, ponies, ...

[Diagram: ObjectStore implemented by BlueStore; metadata goes through RocksDB, via BlueRocksEnv and BlueFS, to the BlockDevice; data goes directly to the BlockDevice]
BlueRocksEnv
– passes "file" operations to BlueFS

BlueFS
– all metadata lives in the journal
– all metadata loaded in RAM on start/mount
– no need to store a block free list
– coarse allocation unit (1 MB blocks)
– journal rewritten/compacted when it gets large (replay sketch below)

[Diagram: BlueFS on-disk layout: superblock, journal extents ("file 10", "file 11", "file 12", "file 13", "rm file 12", ...), and data extents]

Map directories to different devices
– db.wal/ – on NVRAM, NVMe, SSD
– db/ – level0 and hot SSTs on SSD
– db.slow/ – cold SSTs on HDD
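A minimal sketch of what "all metadata lives in the journal" implies (toy record types, not BlueFS's real on-disk format): mount is a single pass that replays the log into an in-memory file table, and compaction just rewrites the journal from that table.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Extent { uint64_t offset, length; };  // raw-device extent

struct JournalRecord {
  enum { CREATE, APPEND_EXTENT, REMOVE } op;
  std::string file;
  Extent extent{};  // valid for APPEND_EXTENT
};

// mount(): scan the journal once; the resulting map is the whole FS state.
std::map<std::string, std::vector<Extent>>
replay(const std::vector<JournalRecord>& journal) {
  std::map<std::string, std::vector<Extent>> files;
  for (const auto& r : journal) {
    switch (r.op) {
      case JournalRecord::CREATE:
        files[r.file];  // operator[] inserts an empty file
        break;
      case JournalRecord::APPEND_EXTENT:
        files[r.file].push_back(r.extent);
        break;
      case JournalRecord::REMOVE:
        files.erase(r.file);
        break;
    }
  }
  return files;  // compaction = rewrite the journal from this map
}

int main() {
  std::vector<JournalRecord> j = {
      {JournalRecord::CREATE, "file12"},
      {JournalRecord::APPEND_EXTENT, "file12", {1 << 20, 1 << 20}},
      {JournalRecord::REMOVE, "file12"},  // like "rm file 12" above
      {JournalRecord::CREATE, "file13"},
  };
  std::cout << "files after replay: " << replay(j).size() << "\n";  // 1
}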
One device
– HDD or SSD

Two devices
– 512MB of SSD or NVRAM (RocksDB WAL)
– big device (everything else)

Two devices
– a few GB of SSD (RocksDB WAL + hot SSTs)
– big device (everything else)

Three devices
– 512MB NVRAM (RocksDB WAL)
– a few GB SSD (hot SSTs)
– big device (everything else)
Metadata: RocksDB key prefixes
– S* – "superblock" properties for the entire store
– B* – block allocation metadata (free block bitmap)
– T* – stats (bytes used, compressed, etc.)
– C* – collection name → cnode_t
– O* – object name → onode_t or bnode_t
– X* – shared blobs
– L* – deferred writes (promises of future IO)
– M* – omap (user key/value data, stored in objects)
Collections
– interval of the object namespace

     shard pool hash          name bits
  C<NOSHARD,12,3d3e0000>     "12.e3d3" = <19>

     shard pool hash     name snap   gen
  O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …

struct spg_t {
  uint64_t pool;
  uint32_t hash;
  shard_id_t shard;
};
struct bluestore_cnode_t {
  uint32_t bits;
};

– ordered enumeration of objects
– we can "split" collections by adjusting collection metadata only (sketch below)
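The split trick can be sketched in a few lines. A collection's cnode records how many low-order bits of the object hash are significant, so membership is a mask compare and a split just bumps bits; this is a simplification of BlueStore's real matching logic. (Note how the O* keys above store the hash nibble-reversed — pg "12.e3d3" owns hashes 3d3e… — so each collection's objects also form one contiguous key range.)

#include <cstdint>
#include <iostream>

struct Collection {
  uint32_t pg_hash;  // hash value identifying the PG
  uint32_t bits;     // number of significant low-order hash bits
  bool contains(uint32_t object_hash) const {
    uint32_t mask = (bits >= 32) ? ~0u : ((1u << bits) - 1);
    return (object_hash & mask) == (pg_hash & mask);
  }
};

int main() {
  Collection c{0x5, 3};                   // owns hashes matching ...101
  std::cout << c.contains(0x1D) << "\n";  // 0x1D = ...11101 -> 1 (member)
  c.bits = 4;                             // "split": metadata-only change
  std::cout << c.contains(0x1D) << "\n";  // ...1101 != ...0101 -> 0 (sibling's)
}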
Onode
– lives directly in a key/value pair
– serializes to 100s of bytes

struct bluestore_onode_t {
  uint64_t size;
  map<string,bufferptr> attrs;
  uint64_t flags;
  struct shard_info {
    uint32_t offset;
    uint32_t bytes;
  };
  vector<shard_info> shards;
  bufferlist inline_extents;
  bufferlist spanning_blobs;
};
Blob
– extent(s) on device
– a lump of data originating from the same object
– may later be referenced by multiple objects
– normally checksummed
– may be compressed

SharedBlob
– extent ref count on cloned blobs
– in-memory buffer cache

struct bluestore_blob_t {
  vector<bluestore_pextent_t> extents;
  uint32_t compressed_length_orig = 0;
  uint32_t compressed_length = 0;
  uint32_t flags = 0;
  uint16_t unused = 0;  // bitmap
  uint8_t csum_type = CSUM_NONE;
  uint8_t csum_chunk_order = 0;
  bufferptr csum_data;
};
struct bluestore_shared_blob_t {
  uint64_t sbid;
  bluestore_extent_ref_map_t ref_map;
};
Extent map
– stored inline in the onode value if small
– otherwise sharded across adjacent keys
– blobs stored inline in each shard
– unless referenced across shard boundaries
– "spanning" blobs stored in the onode key
– ref count on allocated extents stored in an external key (consulted for deallocations)

...
O<,,foo,,>  = onode + inline extent map
O<,,bar,,>  = onode + spanning blobs
O<,,bar,,0> = extent map shard
O<,,bar,,4> = extent map shard
O<,,baz,,>  = onode + inline extent map
...

[Diagram: logical offsets mapped to blobs; the extent map split into shard #1 and shard #2, with a spanning blob crossing the shard boundary]
Terms

Sequencer
– an independent, totally ordered queue of transactions
– one per PG

TransContext
– state describing an executing transaction

Three ways to write (decision sketch below)

New allocation
– any write larger than min_alloc_size goes to a new, unused extent on disk
– once that IO completes, we commit the transaction

Deferred writes
– commit a temporary promise to (over)write data with the transaction
– do the async (over)write
– then clean up the temporary k/v pair
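A hedged sketch of the choice between those paths; the real code weighs more factors (compression, partially written blocks, etc.), and min_alloc_size plus its defaults are configurable:

#include <cstdint>
#include <iostream>

enum class WritePath { NewExtent, Deferred };

// Simplified: large writes (or writes to never-written space) go to a
// fresh extent and the kv commit flips the extent map atomically; small
// overwrites are staged in the kv transaction and applied asynchronously.
WritePath choose_write_path(uint64_t length, uint64_t min_alloc_size,
                            bool target_unwritten) {
  if (length >= min_alloc_size || target_unwritten)
    return WritePath::NewExtent;
  return WritePath::Deferred;  // promise now, overwrite later, then clean up
}

int main() {
  uint64_t min_alloc = 64 * 1024;  // illustrative; HDD and SSD defaults differ
  std::cout << (choose_write_path(4096, min_alloc, false) ==
                WritePath::Deferred) << "\n";  // small overwrite -> deferred
}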
[Diagram: TransContext state machine. Normal path: PREPARE → AIO_WAIT → KV_QUEUED → KV_COMMITTING → FINISH; deferred path continues DEFERRED_QUEUED → DEFERRED_AIO_WAIT → DEFERRED_CLEANUP → FINISH. Annotations: initiate some AIO; wait for the next TransContext(s) in the Sequencer to be ready; Sequencer queue; wait for the next commit batch.]
Inline compression
– tunables target min and max blob sizes
– min size sets a ceiling on size reduction
– max size caps read amplification
– blobs compacted (rewritten) when waste exceeds a threshold

[Diagram: object layout with allocated, written, and written (compressed) regions; an uncompressed blob region near the start and end of the object]
Cache
– in-memory ghobject_t → Onode map of decoded onodes
– holds all in-flight writes
– may contain cached on-disk data

Replacement policies
– LRUCache – trivial LRU
– TwoQCache – implements the 2Q cache replacement algorithm (default; sketch below)

Sharding
– collection → shard mapping matches the OSD's op_wq
– the same CPU context that processes client requests touches the LRU/2Q lists
– aio completion execution not yet sharded – TODO?
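For reference, a minimal 2Q sketch; BlueStore's TwoQCache follows this idea, though its actual implementation differs. The point versus plain LRU: a one-time scan parks everything in the FIFO and never displaces the hot list.

#include <algorithm>
#include <iostream>
#include <list>
#include <string>

class TwoQ {
  size_t a1in_max = 2, a1out_max = 2, am_max = 2;
  std::list<std::string> a1in;   // recent, seen once (FIFO)
  std::list<std::string> a1out;  // "ghost" keys evicted from a1in
  std::list<std::string> am;     // frequently used (LRU)

  static bool take(std::list<std::string>& l, const std::string& k) {
    auto it = std::find(l.begin(), l.end(), k);
    if (it == l.end()) return false;
    l.erase(it);
    return true;
  }

 public:
  void access(const std::string& k) {
    if (take(am, k)) {            // already hot: normal LRU touch
      am.push_front(k);
    } else if (take(a1out, k)) {  // evicted once, referenced again: hot
      am.push_front(k);
      if (am.size() > am_max) am.pop_back();
    } else if (std::find(a1in.begin(), a1in.end(), k) == a1in.end()) {
      a1in.push_front(k);         // brand new: FIFO admission only
      if (a1in.size() > a1in_max) {
        a1out.push_front(a1in.back());  // remember it as a ghost key
        a1in.pop_back();
        if (a1out.size() > a1out_max) a1out.pop_back();
      }
    }  // else: still in a1in; classic 2Q leaves its position alone
  }
  bool hot(const std::string& k) const {
    return std::find(am.begin(), am.end(), k) != am.end();
  }
};

int main() {
  TwoQ cache;
  for (auto k : {"a", "b", "c", "a"}) cache.access(k);  // "a" ages to ghost,
  std::cout << cache.hot("a") << "\n";                  // re-reference -> 1
}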
[Charts: Bluestore vs Filestore HDD random write throughput (MB/s) and IOPS, by IO size from 4 KB to 4 MB. Series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw).]
[Charts: Bluestore vs Filestore HDD random read/write throughput (MB/s) and IOPS, by IO size; same series as above.]
[Charts: Bluestore vs Filestore HDD/NVMe random read/write throughput (MB/s) and IOPS, by IO size; same series as above.]
[Charts: Bluestore vs Filestore HDD sequential write throughput (MB/s) and IOPS, by IO size; same series as above.]
min_alloc_size
– writes ≥ min_alloc_size go to newly allocated or unwritten space
– writes < min_alloc_size are journaled as deferred small overwrites
– separate default for HDD and SSD
– journal IO pattern: journal + journal + … + journal + many deferred writes + journal + …
[Chart: 3X Replication RadosGW Write Tests — 32MB objects, 24 HDD OSDs on 4 servers, 4 clients. Throughput (MB/s) for Filestore vs Bluestore with 512KB and 4MB chunks, across 1/4/128/512 buckets, 1 and 4 RGW servers, and rados bench.]
[Chart: 4+2 Erasure Coding RadosGW Write Tests — 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients. Throughput (MB/s) for Filestore vs Bluestore with 512KB and 4MB chunks, same client mix as above.]
Erasure-coded overwrites
– require two-phase commit to avoid "RAID-hole"-like failure conditions
– the OSD creates rollback objects (clones)

Clone cost
– future small overwrites to a shared blob are disallowed; they require a new allocation
– overhead of the SharedBlob ref-counting record
– either hack around it, since (in general) all references are in cache
– or find a general un-sharing solution (if it doesn't incur any additional cost)
[Chart: RBD 4K random writes — 16 HDD OSDs, 8 32GB volumes, 256 IOs in flight. IOPS for Bluestore vs Filestore (HDD/HDD) under 3X replication, EC 4+2, and EC 5+1.]
Memory accounting: mempools (sketch below)
– easily annotate/tag C++ classes and containers
– low overhead
– debug mode provides per-type (vs per-pool) accounting (items and bytes)

Cache memory
– manually track bytes in the cache
– ref-counting behavior can lead to memory use amplification
– bluestore_cache_size (default 1GB)
– bluestore_cache_meta_ratio (default .9: 90% metadata, 10% data)
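A sketch of the mempool idea using a hypothetical pool_allocator (Ceph's real ceph::mempool machinery differs): a tagged allocator bumps per-pool counters on every allocation, so per-pool memory usage is just a counter read.

#include <atomic>
#include <cstddef>
#include <iostream>
#include <vector>

struct PoolStats {
  std::atomic<long long> bytes{0}, items{0};
};
PoolStats bluestore_cache_other;  // one counter set per pool/tag

template <class T>
struct pool_allocator {
  using value_type = T;
  PoolStats* stats;
  explicit pool_allocator(PoolStats* s) : stats(s) {}
  template <class U>
  pool_allocator(const pool_allocator<U>& o) : stats(o.stats) {}
  T* allocate(std::size_t n) {
    stats->bytes += n * sizeof(T);  // account before handing out memory
    stats->items += n;
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) {
    stats->bytes -= n * sizeof(T);
    stats->items -= n;
    ::operator delete(p);
  }
};
template <class A, class B>
bool operator==(const pool_allocator<A>& a, const pool_allocator<B>& b) {
  return a.stats == b.stats;
}
template <class A, class B>
bool operator!=(const pool_allocator<A>& a, const pool_allocator<B>& b) {
  return !(a == b);
}

int main() {
  std::vector<int, pool_allocator<int>> v{
      pool_allocator<int>(&bluestore_cache_other)};
  v.reserve(1024);  // tracked: 1024 * sizeof(int) bytes
  std::cout << bluestore_cache_other.bytes << " bytes tracked\n";
}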
Recent performance improvements
– expand the extent list for an existing blob
– big reduction in metadata size, increase in performance
– client hints to expect sequential read/write → large csum chunks
– can optionally select a weaker checksum (16 or 8 bits per chunk)
– low temporal write locality → many CPU cache misses, failed prefetches
– per-onode slab allocators for extent and blob structs
RocksDB compaction
– awkward to control its priority
– overall impact grows as the total metadata corpus grows
– invalidates the rocksdb block cache (needed for range queries)
– SSDs with low-cost random reads care more about total write overhead

Alternatives?
– ZetaScale (recently open sourced by SanDisk)
– or let Facebook et al make RocksDB great?
Write pipeline stages
– A: prepare transaction, initiate any aio
– B: io completion handler, queue txc for commit
– C: commit to the kv db
– D: completion callbacks

Shortcuts where possible
– skip B, do half of C
– do the C commit synchronously
– avoid D
Shingled (SMR) drives
– GSoC project
– "Evolving Ext4 for Shingled Disks" (FAST'17, Vault'17)
– keep metadata in the journal, avoid random writes
– tracking released extents within each zone is not enough
– need a reverse map to identify the remaining live objects
– clean least-full zones
Multi-device support today
– WAL, KV, object data

BlueStore tiering
– basic multi-device infrastructure is not difficult, but
– policy and auto-tiering are complex and unbounded

[Diagram: BlueStore/BlueFS/RocksDB ObjectStore stack spanning HDD, SSD, and NVDIMM]
Smarter caching below the OSD?
– bcache, dm-cache, FlashCache
– communicate hints down the stack: HOT and COLD flags
– add a new IO_CMD_PWRITEV2 with a usable flags field
– manage the cache device? (interacts with tiering and future tiering plans across OSDs)

[Diagram: BlueStore/BlueFS/RocksDB ObjectStore stack over dm-cache, bcache, ... on HDD + SSD]
Sharing infrastructure across OSDs
– lack of caching for bluefs/rocksdb
– a DPDK polling thread per OSD is not practical

What could be shared
– DPDK infrastructure
– some caches (e.g., the OSDMap cache)
– multiplexing across shared network connections (good for RDMA)

Prerequisites
– a DPDK backend for AsyncMessenger
– the msgr2 messenger protocol, to allow multiplexing
– some common shared-code cleanup (g_ceph_context)
Status
– very different from the current code; no longer useful or interesting
– still marked 'experimental'

Remaining work
– workload throttling
– performance anomalies
– optimizing for CPU time
Migration: fail in place
– fail the FileStore OSD
– create a new BlueStore OSD on the same device → period of reduced redundancy

Migration: disk-wise
– format a new BlueStore OSD on a spare disk
– stop the FileStore OSD on the same host
– local host copy between OSDs/disks → reduced online redundancy, but data still available offline
– requires an extra drive slot per host

Migration: host-wise
– provision a new host with BlueStore OSDs
– swap the new host into the old host's CRUSH position → no reduced redundancy during migration
– requires a spare host per rack, or an extra host migration for each rack
Summary
– good (and rational) performance!
– inline compression and full data checksums

Next up
– performance, performance, performance
– other stuff
sage@redhat.com
FreelistManager
– persists the list of free extents to the key/value store
– prepares incremental updates for each allocation or release

Extent-based
– <offset> = <length>
– kept an in-memory copy
– small initial memory footprint, but very expensive when fragmented
– imposed an ordering constraint on commits :(

Bitmap-based
– <offset> = <region bitmap>, where a region is N blocks
– uses the k/v merge operator to XOR allocations and releases (sketch below):
    merge 10=0000000011
    merge 20=1110000000
– RocksDB's log-structured merge tree coalesces keys during compaction
– no in-memory state, no ordering constraint
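Here is what that merge operator can look like against RocksDB's actual AssociativeMergeOperator interface; the fixed-size region-bitmap encoding is illustrative, not BlueStore's exact format. Because XOR is associative and commutative, allocate and release are the same operation and operands can be coalesced in any order during compaction, with no read-modify-write.

#include <cstddef>
#include <rocksdb/merge_operator.h>
#include <rocksdb/slice.h>
#include <string>

// XOR-merge for fixed-size region bitmaps: flipping a bit either
// allocates or frees one block; we never need to read which it was.
class XorMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/,
             const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value,
             std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    *new_value = value.ToString();
    if (existing_value) {  // absent on the first merge for a key
      for (size_t i = 0;
           i < new_value->size() && i < existing_value->size(); ++i)
        (*new_value)[i] ^= (*existing_value)[i];
    }
    return true;
  }
  const char* Name() const override { return "XorMergeOperator"; }
};

With the operator installed in Options::merge_operator, releasing or allocating a run of blocks becomes a single db->Merge(key, bitmap) call against the region's key.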
Allocator
– abstract interface to allocate blocks

StupidAllocator
– extent-based
– bins free extents by size (powers of 2; sketch below)
– chooses a sufficiently large extent closest to the hint
– highly variable memory usage
– implemented, works
– based on the ancient ebofs policy

BitmapAllocator
– hierarchy of indexes
– 00 = all free, 11 = all used, 01 = mix
– fixed memory consumption
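A simplified sketch of that binning scheme (hypothetical names; the real StupidAllocator search is more careful about partial bins and merging adjacent extents):

#include <cstdint>
#include <iostream>
#include <map>

// Free extents are bucketed by floor(log2(length)); allocation scans
// from the first bin that can satisfy the request, preferring an offset
// near the hint.
static int bin_of(uint64_t len) {
  return 63 - __builtin_clzll(len);  // floor(log2(len)), len > 0
}

struct BinnedFreeList {
  std::map<uint64_t, uint64_t> bins[64];  // bin -> (offset -> length)

  void release(uint64_t off, uint64_t len) { bins[bin_of(len)][off] = len; }

  bool allocate(uint64_t want, uint64_t hint, uint64_t* off) {
    for (int b = bin_of(want); b < 64; ++b) {
      auto& m = bins[b];
      if (m.empty()) continue;
      auto it = m.lower_bound(hint);    // nearest extent at/after the hint
      if (it == m.end()) it = m.begin();
      if (it->second < want) continue;  // simplified: skip to a bigger bin
      *off = it->first;
      uint64_t len = it->second;
      m.erase(it);
      if (len > want) release(*off + want, len - want);  // keep remainder
      return true;
    }
    return false;
  }
};

int main() {
  BinnedFreeList fl;
  fl.release(0, 1 << 20);  // one free 1 MB extent
  uint64_t off;
  if (fl.allocate(4096, 0, &off))
    std::cout << "allocated 4 KB at offset " << off << "\n";
}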
Why checksums?
– there is a window before we detect an error
– we may read bad data
– we may not be sure which copy is bad

Algorithms and granularity (per-chunk sketch below)
– crc32c (default), xxhash{64,32}
– 32-bit csum metadata per chunk: 4MB of data at 4KB chunks → 4KB of csums
– larger csum blocks (needed for compression!)
– smaller csums (16 or 8 bits per chunk)

Hints
– sequential read + write → big chunks
– compression → big chunks
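To ground the chunk-size tradeoff, a per-chunk checksum sketch: the crc32c below is the plain bitwise CRC-32C (BlueStore uses accelerated implementations), and csum_chunk_order matches the bluestore_blob_t field shown earlier. With 4KB chunks, 4MB of data carries 1024 × 4 bytes = 4KB of checksum metadata; doubling the chunk size halves the metadata but forces every read to verify a bigger chunk.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t crc32c(const uint8_t* p, size_t n) {
  uint32_t crc = ~0u;
  for (size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
  }
  return ~crc;
}

// One 32-bit csum per chunk; chunk size is 1 << csum_chunk_order.
std::vector<uint32_t> csum_blob(const std::vector<uint8_t>& data,
                                unsigned csum_chunk_order) {
  const size_t chunk = size_t(1) << csum_chunk_order;
  std::vector<uint32_t> out;
  for (size_t off = 0; off < data.size(); off += chunk) {
    size_t n = std::min(chunk, data.size() - off);
    out.push_back(crc32c(data.data() + off, n));
  }
  return out;
}

int main() {
  std::vector<uint8_t> blob(1 << 16, 0xab);          // 64 KB of data
  auto sums = csum_blob(blob, 12);                   // 4 KB chunks
  std::cout << sums.size() << " chunk checksums\n";  // 16
}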