SLIDE 1


Ceph Rados Block Device

Venky Shankar, Ceph Developer, Red Hat (SNIA, 2017)

SLIDE 2

WHAT IS CEPH?

▪ Software-defined distributed storage
▪ All components scale horizontally
▪ No single point of failure
▪ Self-managing/healing
▪ Commodity hardware
▪ Object, block & file
▪ Open source

SLIDE 3

CEPH COMPONENTS

RADOS
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RGW (OBJECT)
Web services gateway for object storage, compatible with S3 and Swift

RBD (BLOCK)
Reliable, fully-distributed block device with cloud platform integration

CEPHFS (FILE)
A distributed file system with POSIX semantics and scale-out metadata management

SLIDE 4

RADOS BLOCK DEVICE

[Diagram: a userspace client (LIBRBD over LIBRADOS) and a Linux host (KRBD over LIBCEPH) both send image updates to the RADOS cluster and its monitors (M).]

SLIDE 5

LIBRADOS

▪ API around transactions
  ▪ on a single object
▪ Rich object API
  ▪ partial overwrites, reads, attrs
  ▪ compound “atomic” operations per object
  ▪ rados classes
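
A minimal sketch of that per-object API using the librados Python binding (the pool name 'rbd' and object name 'foo' are placeholders; error handling omitted):

    import rados

    # connect with the default config and keyring paths
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')          # I/O context for one pool

    ioctx.write_full('foo', b'hello rados')    # whole-object write
    ioctx.write('foo', b'HELLO', 0)            # partial overwrite at offset 0
    ioctx.set_xattr('foo', 'owner', b'demo')   # per-object attribute
    print(ioctx.read('foo'))                   # b'HELLO rados'

    ioctx.close()
    cluster.shutdown()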

SLIDE 6

RBD FEATURES

▪ Stripes images across the cluster
▪ Thin provisioned
▪ Read-only snapshots
▪ Copy-on-Write clones
▪ Image features (exclusive-lock, fast-diff, ...)
▪ Integration
  ▪ QEMU, libvirt
  ▪ Linux kernel
  ▪ OpenStack
▪ Version
  ▪ v1 (deprecated), v2
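
As a rough illustration with the rbd Python bindings (pool and image names are placeholders), creating a thin-provisioned format-2 image looks roughly like this:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # 10 GiB v2 image; no data objects are allocated up front
    rbd.RBD().create(ioctx, 'r0', 10 * 1024**3, old_format=False)

    with rbd.Image(ioctx, 'r0') as image:
        image.write(b'hello rbd', 0)           # the first data object appears only now
        print(image.read(0, 9))                # b'hello rbd'

    ioctx.close()
    cluster.shutdown()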

SLIDE 7

IMAGE METADATA

▪ rbd_id.<image-name>
  ▪ stores the internal image ID, locatable by the user-specified image name
▪ rbd_header.<image-id>
  ▪ image metadata (features, snaps, etc.)
▪ rbd_directory
  ▪ list of images (maps image name to ID and vice versa)
▪ rbd_children
  ▪ list of clones and the parent map
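
One way to see these bookkeeping objects is to list the pool with the librados Python binding and filter on the prefixes above (pool name 'rbd' assumed):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    prefixes = ('rbd_id.', 'rbd_header.', 'rbd_directory', 'rbd_children')
    for obj in ioctx.list_objects():           # iterate every object in the pool
        if obj.key.startswith(prefixes):
            print(obj.key)

    ioctx.close()
    cluster.shutdown()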

SLIDE 8

IMAGE DATA

▪ Striped across the cluster
▪ Thin provisioned
  ▪ no data objects exist to start with
▪ Object name based on the offset in the image
  ▪ rbd_data.<image-id>.*
▪ Objects are mostly sparse
▪ Snapshots handled by RADOS
▪ Clone CoW performed by librbd
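
A sketch of the offset-to-object mapping for the default (non-fancy) striping; the 16-hex-digit suffix illustrates the rbd_data.<image-id>.* naming, and the image id here is just a placeholder:

    def data_object_name(image_id, offset, object_size=4 * 1024**2):
        """Map a byte offset in the image to its backing RADOS object
        (default striping: one 4M object per stripe)."""
        object_no = offset // object_size
        return 'rbd_data.%s.%016x' % (image_id, object_no)

    # offset 100 MiB falls into object number 25 (0x19)
    print(data_object_name('101774b0dc51', 100 * 1024**2))
    # rbd_data.101774b0dc51.0000000000000019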

SLIDE 9

RBD IMAGE INFO

rbd image 'r0':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.101774b0dc51
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        flags:
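
The same details are available programmatically; a small sketch with the rbd Python binding (image 'r0' as in the output above):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'r0') as image:
        info = image.stat()
        print(info['size'], info['obj_size'], info['num_objs'])
        print(info['block_name_prefix'])       # rbd_data.<image-id>
        print('%#x' % image.features())        # bitmask of enabled features

    ioctx.close()
    cluster.shutdown()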

SLIDE 10

STRIPING

▪ Uniformly sized objects
  ▪ default: 4M
▪ Objects randomly distributed among OSDs (CRUSH)
▪ Spreads the I/O workload across the cluster (nodes/spindles)
▪ Tunables
  ▪ stripe_unit
  ▪ stripe_count
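
The stripe tunables can be set at image-creation time; a sketch with the rbd Python binding (keyword names as in recent bindings and may differ by release):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # 1 GiB image, 1 MiB stripe unit, striped across 4 objects at a time
    rbd.RBD().create(ioctx, 'striped0', 1024**3, old_format=False,
                     stripe_unit=1024**2, stripe_count=4)

    ioctx.close()
    cluster.shutdown()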

SLIDE 11

I/O PATH

SLIDE 12

SNAPSHOTS

▪ Per-image snapshots
▪ Snapshots handled by RADOS
  ▪ CoW on a per-object basis
▪ Snapshot context (list of snap ids, latest snap)
  ▪ stored in the image header
  ▪ sent with each I/O
▪ Self-managed by RBD
▪ Snap spec: image@snap
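
A snapshot round trip with the rbd Python binding (image 'r0' assumed to exist):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'r0') as image:
        image.write(b'v1', 0)
        image.create_snap('snap1')             # read-only, per-image snapshot
        image.write(b'v2', 0)                  # RADOS does the CoW per object

    # open the image at the snapshot to read the pre-snapshot data
    with rbd.Image(ioctx, 'r0', snapshot='snap1') as snap:
        print(snap.read(0, 2))                 # b'v1'

    ioctx.close()
    cluster.shutdown()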

SLIDE 13

CLONES

▪ CoW at the object level
  ▪ performed by librbd
  ▪ a clone has a “reference” to its parent
  ▪ optional: CoR (copy-on-read)
▪ “Clone” a protected snapshot
  ▪ protected: cannot be deleted
▪ Can be “flattened”
  ▪ copy all data from the parent
  ▪ remove the parent “reference” (rbd_children)
▪ Can be in a different pool
▪ Can have a different feature set
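
A clone lifecycle sketch with the rbd Python binding (names are placeholders; the clone could just as well target a different pool's ioctx):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'r0') as parent:
        parent.create_snap('base')
        parent.protect_snap('base')            # clones require a protected snapshot

    # CoW clone referencing r0@base
    rbd.RBD().clone(ioctx, 'r0', 'base', ioctx, 'r0-clone',
                    features=rbd.RBD_FEATURE_LAYERING)

    with rbd.Image(ioctx, 'r0-clone') as clone:
        clone.flatten()                        # copy parent data, drop the reference

    ioctx.close()
    cluster.shutdown()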

SLIDE 14

CLONES : I/O (READ)

▪ Object doesn’t exist
  ▪ thin provisioned: data objects don’t exist just after a clone
  ▪ just the metadata (header, etc.)
  ▪ the rbd_header has a reference to the parent snap
▪ Copy-on-Read (optional)
  ▪ async object copy after serving the read
  ▪ a 4k read turns into a 4M (stripe_unit) I/O
  ▪ helpful for some workloads

SLIDE 15

CLONES : I/O (WRITE)

▪ Opportunistically send an I/O guard
  ▪ fail if the object doesn’t exist
  ▪ do the write if it does
▪ Object doesn’t exist
  ▪ copy the data object from the parent
  ▪ NOTE: the parent could have a different object number
  ▪ optimization: a full-object write skips the copy
▪ Object exists
  ▪ perform the write operation
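
A toy sketch of that guarded write / copy-up flow, with plain dicts standing in for the clone's and parent's object stores (purely illustrative, not librbd code):

    def guarded_write(child, parent, obj, offset, data, object_size=4 * 1024**2):
        """Write into a clone's object, copying it up from the parent if needed."""
        if obj not in child:                   # the "guard" failed: object missing
            if offset == 0 and len(data) == object_size:
                child[obj] = b''               # full-object write: skip the copy
            else:
                child[obj] = parent.get(obj, b'')   # copy-up from the parent
        buf = bytearray(child[obj].ljust(offset + len(data), b'\x00'))
        buf[offset:offset + len(data)] = data  # now perform the actual write
        child[obj] = bytes(buf)

    parent = {'obj.0': b'parent data'}
    child = {}
    guarded_write(child, parent, 'obj.0', 7, b'X')
    print(child['obj.0'])                      # b'parent Xata'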

SLIDE 16

IMAGE FEATURES

▪ layering : snapshots, clones
▪ exclusive-lock : “lock” the header object
  ▪ operation(s) forwarded to the lock owner
  ▪ client blacklisting
▪ object-map : index of which objects exist
▪ fast-diff : fast diff calculation
▪ deep-flatten : snapshot flatten support
▪ journaling : journal data before image update
▪ data-pool : optionally place image data in a separate pool
▪ striping v2 : “fancy” striping
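
Some of these features can be toggled on an existing image; a sketch using the rbd Python binding (feature constants assumed to be available under these names):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'r0') as image:
        # object-map and fast-diff both depend on exclusive-lock
        image.update_features(rbd.RBD_FEATURE_OBJECT_MAP |
                              rbd.RBD_FEATURE_FAST_DIFF, True)
        print('%#x' % image.features())

    ioctx.close()
    cluster.shutdown()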

SLIDE 17

Kernel RBD

▪ “Catches up” with librbd on features
▪ Usage: rbd map …
  ▪ fails if the image’s feature set is not supported by the kernel
  ▪ /etc/ceph/rbdmap lists images to map at boot
▪ No specialized cache; relies on the page cache used by filesystems on top

SLIDE 18

TOOLS, FEATURES & MORE

What about data center failures?

SLIDE 19

RBD MIRRORING

▪ Online, continuous backup
▪ Asynchronous replication
  ▪ across a WAN
  ▪ no I/O slowdown
  ▪ tolerates transient connectivity issues
▪ Crash consistent
▪ Easy to use/monitor
▪ Horizontally scalable

SLIDE 20

RBD MIRRORING : OVERVIEW

▪ Configuration
  ▪ per pool or per image
▪ Journaling feature
▪ Mirroring daemon
  ▪ rbd-mirror utility
  ▪ pull model
  ▪ replays the journal from the remote cluster to the local one
▪ Two-way replication between 2 sites
▪ One-way replication between N sites
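
A rough sketch of switching a pool to mirroring with the rbd Python binding (API names are from roughly the Luminous-era bindings and may differ; peer clusters are typically added with the rbd CLI):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # mirror every journaled image in this pool
    rbd.RBD().mirror_mode_set(ioctx, rbd.RBD_MIRROR_MODE_POOL)

    with rbd.Image(ioctx, 'r0') as image:
        # mirroring relies on the journaling image feature
        image.update_features(rbd.RBD_FEATURE_JOURNALING, True)

    ioctx.close()
    cluster.shutdown()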

SLIDE 21

RBD MIRRORING : DESIGN

▪ Log all image modifications
  ▪ recall: the journaling feature
▪ Journal
  ▪ separate objects in RADOS (splayed)
  ▪ stores appended event logs
▪ Image modifications are delayed until the journal events commit
▪ Gives an ordered view of updates

SLIDE 22

RBD MIRROR DAEMON

[Diagram: at SITE-A, a client writes through LIBRBD to CLUSTER A; the resulting journal events are pulled by the RBD-MIRROR daemon at SITE-B, which replays them through LIBRBD into CLUSTER B as image updates.]

SLIDE 23

RBD MIRRORING : FUTURE

▪ Mirror HA
  ▪ >1 rbd-mirror daemons
▪ Parallel image replication
  ▪ WIP
▪ Replication statistics
▪ Image scrub

SLIDE 24

Ceph iSCSI

▪ HA
  ▪ exclusive-lock + initiator multipath
▪ Started out kernel-only
  ▪ LIO iblock + krbd
  ▪ now TCM-User + librbd

SLIDE 25

Questions?

SLIDE 26

THANK YOU!

Venky Shankar

vshankar@redhat.com

SLIDE 27

CEPH ON STEROIDS

▪ BlueStore
  ▪ newstore + raw block device (avoids the double write)
  ▪ k/v store: RocksDB
  ▪ HDD, SSD, NVMe
  ▪ block checksums
▪ Impressive initial benchmark results
  ▪ helps RBD a lot!
