Ceph Rados Block Device
Venky Shankar, Ceph Developer, Red Hat (SNIA, 2017)


  1. Ceph Rados Block Device Venky Shankar Ceph Developer, Red Hat SNIA, 2017 1

  2. WHAT IS CEPH? ▪ Software-defined distributed storage ▪ All components scale horizontally ▪ No single point of failure ▪ Self managing/healing ▪ Commodity hardware ▪ Object, block & file ▪ Open source 2

  3. CEPH COMPONENTS ▪ OBJECT: RGW, a web services gateway for object storage, compatible with S3 and Swift ▪ BLOCK: RBD, a reliable, fully-distributed block device with cloud platform integration ▪ FILE: CEPHFS, a distributed file system with POSIX semantics and scale-out metadata management ▪ LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) ▪ RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 3

  4. RADOS BLOCK DEVICE (diagram) ▪ a userspace client uses LIBRBD over LIBRADOS; a Linux host uses KRBD over LIBCEPH ▪ both send image updates to the RADOS cluster (M = monitor) 4

  5. LIBRADOS ▪ API around doing transactions ▪ single object ▪ Rich Object API ▪ partial overwrites, reads, attrs ▪ compound “atomic” operations per object ▪ rados classes 5
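
  A minimal librados sketch using the Python binding, illustrating whole-object writes, partial overwrites and per-object attrs. It assumes a reachable cluster with a default /etc/ceph/ceph.conf; the pool name 'rbd' and the object name are illustrative.

      # Minimal librados usage via the Python binding (pool/object names are illustrative).
      import rados

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      try:
          ioctx = cluster.open_ioctx('rbd')          # pool name is an assumption
          try:
              # whole-object write, then a partial overwrite at an offset
              ioctx.write_full('demo-object', b'hello rados')
              ioctx.write('demo-object', b'RADOS', offset=6)
              # per-object attributes
              ioctx.set_xattr('demo-object', 'owner', b'snia-demo')
              print(ioctx.read('demo-object'))
              print(ioctx.get_xattr('demo-object', 'owner'))
          finally:
              ioctx.close()
      finally:
          cluster.shutdown()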

  6. RBD FEATURES ▪ Stripes images across cluster ▪ Thin provisioned ▪ Read-only snapshots ▪ Copy-on-Write clones ▪ Image features (exclusive-lock, fast-diff, ...) ▪ Integration ▪ QEMU, libvirt ▪ Linux kernel ▪ OpenStack ▪ Version ▪ v1 (deprecated), v2 6

  7. IMAGE METADATA ▪ rbd_id.<image-name> ▪ Internal ID, locatable by the user-specified image name ▪ rbd_header.<image-id> ▪ Image metadata (features, snaps, etc…) ▪ rbd_directory ▪ list of images (maps image name to id and vice versa) ▪ rbd_children ▪ list of clones and parent map 7
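
  These bookkeeping objects can be observed directly with librados. A small sketch (Python binding) that lists them by name prefix; the pool name 'rbd' is an assumption.

      # List RBD bookkeeping objects in a pool via librados (pool name is an assumption).
      import rados

      PREFIXES = ('rbd_id.', 'rbd_header.', 'rbd_directory', 'rbd_children')

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')
      try:
          for obj in ioctx.list_objects():
              if obj.key.startswith(PREFIXES):
                  print(obj.key)
      finally:
          ioctx.close()
          cluster.shutdown()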

  8. IMAGE DATA ▪ Striped across the cluster ▪ Thin provisioned ▪ Non-existent data object to start with ▪ Object name based on offset in image ▪ rbd_data.<image-id>.* ▪ Objects are mostly sparse ▪ Snapshots handled by RADOS ▪ Clone CoW performed by librbd 8
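
  Because data objects are named by their position in the image, the object holding a given byte offset can be computed directly. A sketch assuming the default layout (4 MiB objects, no fancy striping); the prefix is taken from the rbd info output on the next slide and is otherwise illustrative.

      # Map an image byte offset to its rbd_data object (default layout, no fancy striping).
      def data_object_name(block_name_prefix, offset, object_size=4 * 1024 * 1024):
          index = offset // object_size              # which object in the image
          return '%s.%016x' % (block_name_prefix, index)

      # e.g. offset 100 MiB in the image shown on the next slide
      print(data_object_name('rbd_data.101774b0dc51', 100 * 1024 * 1024))
      # -> rbd_data.101774b0dc51.0000000000000019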

  9. RBD IMAGE INFO (output of "rbd info r0") 9
         rbd image 'r0':
             size 10240 MB in 2560 objects
             order 22 (4096 kB objects)
             block_name_prefix: rbd_data.101774b0dc51
             format: 2
             features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
             flags:
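
  Roughly the same information is exposed programmatically by the rbd Python binding. A sketch using the image name 'r0' from the output above; the pool name is an assumption.

      # Inspect image metadata via the rbd Python binding (pool name is an assumption).
      import rados, rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')
      image = rbd.Image(ioctx, 'r0')
      try:
          info = image.stat()   # size, obj_size, num_objs, order, block_name_prefix, ...
          print(info['size'], info['num_objs'], info['block_name_prefix'])
          print(image.features())                   # feature bitmask
      finally:
          image.close()
          ioctx.close()
          cluster.shutdown()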

  10. STRIPING ▪ Uniformly sized objects ▪ default: 4M ▪ Objects randomly distributed among OSDs (CRUSH) ▪ Spreads I/O workload across cluster (nodes/spindles) ▪ Tunables ▪ stripe_unit ▪ stripe_count 10
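
  A sketch of creating an image with explicit striping tunables through the rbd Python binding. Pool and image names are illustrative; non-default stripe settings rely on the "fancy" striping (striping v2) feature and on a binding version that exposes the stripe_unit/stripe_count arguments.

      # Create a 10 GiB image with explicit striping parameters (names are illustrative).
      import rados, rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')
      try:
          rbd.RBD().create(ioctx, 'striped-image', 10 * 1024**3,
                           order=22,                 # 2^22 = 4 MiB objects
                           old_format=False,         # v2 image
                           stripe_unit=1 * 1024**2,  # 1 MiB stripe unit
                           stripe_count=4)           # stripe across 4 objects
      finally:
          ioctx.close()
          cluster.shutdown()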

  11. I/O PATH 11

  12. SNAPSHOTS ▪ Per image snapshots ▪ Snapshots handled by RADOS ▪ CoW per object basis ▪ Snapshot context (list of snap ids, latest snap) ▪ stored in image header ▪ sent with each I/O ▪ Self-managed by RBD ▪ Snap spec ▪ image@snap 12
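
  A snapshot sketch with the rbd Python binding, including opening the image at a snapshot (the image@snap spec) read-only. Pool, image and snapshot names are illustrative, and the image is assumed to already exist.

      # Create and read back a snapshot (pool/image/snapshot names are illustrative).
      import rados, rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')

      image = rbd.Image(ioctx, 'r0')
      try:
          image.write(b'before snapshot', 0)
          image.create_snap('snap1')                 # "r0@snap1" in CLI terms
          image.write(b'after  snapshot', 0)
          for snap in image.list_snaps():
              print(snap['id'], snap['name'], snap['size'])
      finally:
          image.close()

      # open the image at the snapshot, read-only: sees the pre-snapshot data
      snap_view = rbd.Image(ioctx, 'r0', snapshot='snap1', read_only=True)
      try:
          print(snap_view.read(0, 15))               # b'before snapshot'
      finally:
          snap_view.close()

      ioctx.close()
      cluster.shutdown()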

  13. CLONES ▪ CoW at object level ▪ performed by librbd ▪ clone has “reference” to parent ▪ optional : CoR ▪ “clone” a protected snapshot ▪ protected : cannot be deleted ▪ Can be “flattened” ▪ copy all data from parent ▪ remove parent “reference” (rbd_children) ▪ Can be in different pool ▪ Can have different feature set 13
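
  A clone sketch with the rbd Python binding: protect the snapshot, clone it (the child ioctx may point at a different pool), and later flatten. Pool, image and snapshot names are illustrative.

      # Protect a snapshot, clone it, then flatten the clone (names are illustrative).
      import rados, rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')
      try:
          parent = rbd.Image(ioctx, 'r0')
          try:
              if not parent.is_protected_snap('snap1'):
                  parent.protect_snap('snap1')       # clones require a protected snap
          finally:
              parent.close()

          # clone "r0@snap1" into a new image (same pool here; a different ioctx works too)
          rbd.RBD().clone(ioctx, 'r0', 'snap1', ioctx, 'r0-clone',
                          features=rbd.RBD_FEATURE_LAYERING)

          clone = rbd.Image(ioctx, 'r0-clone')
          try:
              clone.flatten()                        # copy parent data, drop the reference
          finally:
              clone.close()
      finally:
          ioctx.close()
          cluster.shutdown()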

  14. CLONES : I/O (READ) ▪ Object doesn’t exist ▪ thin provisioned ▪ data objects don’t exist just after a clone ▪ just the metadata (header, etc…) ▪ rbd_header has reference to parent snap ▪ Copy-on-Read (optional) ▪ async object copy after serving read ▪ a 4k read turns into a 4M (stripe_unit) I/O ▪ helpful for some workloads 14

  15. CLONES : I/O (WRITE) ▪ Opportunistically send an I/O guard ▪ fail if the object doesn’t exist ▪ do the write if it does ▪ Object doesn’t exist ▪ copy the data object from the parent ▪ NOTE: parent could have a different object number ▪ optimization (full write) ▪ Object exists ▪ perform the write operation 15
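
  This is not librbd code, just a conceptual sketch of the guarded-write / copy-up flow described above, with Python dicts standing in for the clone's and the parent's data objects.

      # Conceptual copy-up sketch: dicts stand in for child/parent data objects.
      OBJECT_SIZE = 4 * 1024 * 1024

      parent_objects = {0: bytearray(b'P' * OBJECT_SIZE)}   # parent has object 0
      child_objects = {}                                    # fresh clone: no data objects

      def guarded_write(objno, offset, data):
          """Write to the clone; copy the object up from the parent on first touch."""
          if objno not in child_objects:                    # the "guard" failed
              # copy-up: pull the whole object from the parent (may be absent there too);
              # real librbd maps the child object back to the parent's own extents/objects
              src = parent_objects.get(objno)
              child_objects[objno] = bytearray(src) if src else bytearray(OBJECT_SIZE)
          child_objects[objno][offset:offset + len(data)] = data

      guarded_write(0, 10, b'hello')     # triggers copy-up of object 0, then writes
      guarded_write(0, 20, b'world')     # object now exists locally: plain write
      print(bytes(child_objects[0][:32]))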

  16. IMAGE FEATURES ▪ layering : snapshots, clones ▪ exclusive-lock : “lock” the header object ▪ operation(s) forwarded to the lock owner ▪ client blacklisting ▪ object-map : index of which objects exist ▪ fast-diff : fast diff calculation ▪ deep-flatten : snapshot flatten support ▪ journaling : journal data before image update ▪ data-pool : optionally place image data in a separate pool ▪ striping v2 : “fancy” striping 16
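
  A sketch of inspecting and toggling features at runtime with the rbd Python binding; object-map and fast-diff are among the features that can be enabled on an existing image. Pool and image names are illustrative.

      # Inspect and toggle image features (pool/image names are illustrative).
      import rados, rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')
      image = rbd.Image(ioctx, 'r0')
      try:
          if not image.features() & rbd.RBD_FEATURE_OBJECT_MAP:
              # fast-diff depends on object-map; enable both together
              image.update_features(rbd.RBD_FEATURE_OBJECT_MAP | rbd.RBD_FEATURE_FAST_DIFF,
                                    True)
          print(hex(image.features()))               # feature bitmask
      finally:
          image.close()
          ioctx.close()
          cluster.shutdown()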

  17. Kernel RBD ▪ “catches up” with librbd on features ▪ USAGE: rbd map … ▪ fails if the feature set is not supported ▪ /etc/ceph/rbdmap ▪ No specialized cache; uses the kernel page cache, like filesystems do 17

  18. TOOLS, FEATURES & MORE What about data center failures? 18

  19. RBD MIRRORING ▪ Online, continuous backup ▪ Asynchronous replication ▪ across WAN ▪ no I/O slowdown ▪ tolerates transient connectivity issues ▪ Crash consistent ▪ Easy to use/monitor ▪ Horizontally scalable 19

  20. RBD MIRRORING : OVERVIEW ▪ Configuration ▪ pool, images ▪ Journaling feature ▪ Mirroring daemon ▪ rbd-mirror utility ▪ pull model ▪ replays the journal from remote to local ▪ Two-way replication between 2 sites ▪ One-way replication between N sites 20
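
  A sketch of the per-image mirroring configuration with the rbd Python binding. Pool and image names are illustrative; the image needs the journaling feature, and an rbd-mirror daemon configured against the peer cluster must be running for replication to actually happen.

      # Enable per-image mirroring on a pool (pool/image names are illustrative).
      import rados, rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')
      try:
          # mirror selected images only (as opposed to the whole pool)
          rbd.RBD().mirror_mode_set(ioctx, rbd.RBD_MIRROR_MODE_IMAGE)

          image = rbd.Image(ioctx, 'r0')
          try:
              # journaling is a prerequisite for journal-based mirroring
              if not image.features() & rbd.RBD_FEATURE_JOURNALING:
                  image.update_features(rbd.RBD_FEATURE_JOURNALING, True)
              image.mirror_image_enable()
          finally:
              image.close()
      finally:
          ioctx.close()
          cluster.shutdown()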

  21. RBD MIRRORING : DESIGN ▪ Log all image modifications ▪ recall : journaling feature ▪ Journal ▪ separate objects in rados (splayed) ▪ stores appended event logs ▪ Delay image modifications until journal events commit ▪ gives an ordered view of updates 21

  22. RBD MIRROR DAEMON (diagram) ▪ Site-A: a client writes image updates and journal events to cluster A via librbd ▪ Site-B: the rbd-mirror daemon, using librbd, pulls journal events from cluster A and replays the image updates into cluster B 22

  23. RBD MIRRORING : FUTURE ▪ Mirror HA ▪ >1 rbd-mirror daemons ▪ Parallel image replication ▪ WIP ▪ Replication statistics ▪ Image scrub 23

  24. Ceph iSCSI ▪ HA ▪ exclusive-lock + initiator multipath ▪ Started out with kernel only ▪ LIO iblock + krbd ▪ now TCM-User + librbd 24

  25. Questions? 21

  26. THANK YOU! Venky Shankar vshankar@redhat.com 22

  27. CEPH ON STEROIDS ▪ Bluestore ▪ newstore + block (avoid double write) ▪ k/v store : rocksdb ▪ hdd, ssd, nvme ▪ block checksum ▪ Impressive initial benchmark results ▪ helps rbd a lot! 27
