SLIDE 1

Ceph Snapshots: Diving into Deep Waters

Greg Farnum – Red Hat

Vault – 2017.03.23

SLIDE 2

  • Greg Farnum
  • Principal Software Engineer, Red Hat
  • gfarnum@redhat.com

Hi, I’m Greg

SLIDE 3

  • RADOS, RBD, CephFS: (Lightning) overview and how writes happen
  • The (self-managed) snapshots interface
  • A diversion into pool snapshots
  • Snapshots in RBD, CephFS
  • RADOS/OSD Snapshot implementation, pain points

Outline

SLIDE 4

Ceph’s Past & Present

  • Then: UC Santa Cruz Storage Research Systems Center
  • Long-term research project in petabyte-scale storage
  • Trying to develop a Lustre successor.
  • Now: Red Hat, a commercial open-source software & support provider you might have heard of :) (Mirantis, SuSE, Canonical, 42on, Hastexo, ...)
  • Building a business; customers in virtual block devices and object storage
  • ...and reaching for filesystem users!
SLIDE 5

Ceph Projects

RGW

S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication

LIBRADOS

A library allowing apps direct access to RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS

A software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)

RBD

A virtual block device with snapshots, copy-on-write clones, and multi-site replication

CEPHFS

A distributed POSIX file system with coherent caches and snapshots on any directory

SLIDE 6

RADOS: Overview

SLIDE 7

RADOS Components

OSDs:

  • 10s to 10000s in a cluster
  • One per disk (or one per SSD, RAID group…)
  • Serve stored objects to clients
  • Intelligently peer for replication & recovery

Monitors:

  • Maintain cluster membership and state
  • Provide consensus for distributed decision-making

  • Small, odd number
  • These do not serve stored objects to clients


SLIDE 8

Object Storage Daemons

[Diagram: each OSD daemon sits atop a local filesystem (FS) on a disk; monitors (M) run alongside]

SLIDE 9

CRUSH: Dynamic Data Placement

CRUSH:

  • Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
  • Statistically uniform distribution
  • Stable mapping
  • Limited data migration on change
  • Rule-based configuration
  • Infrastructure topology aware
  • Adjustable replication
  • Weighting
SLIDE 10

DATA IS ORGANIZED INTO POOLS

[Diagram: objects are hashed into placement groups (PGs) inside named pools (POOL A-D), which span the cluster]

SLIDE 11

librados: RADOS Access for Apps

LIBRADOS:

  • Direct access to RADOS for applications
  • C, C++, Python, PHP, Java, Erlang
  • Direct access to storage nodes
  • No HTTP overhead
  • Rich object API
  • Bytes, attributes, key/value data
  • Partial overwrite of existing data
  • Single-object compound atomic operations
  • RADOS classes (stored procedures)
SLIDE 12

aio_write(const object_t &oid, AioCompletionImpl *c, const bufferlist& bl, size_t len, uint64_t off);
c->wait_for_safe();
write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off);

RADOS: The Write Path (user)
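For orientation, here is a minimal sketch of that same synchronous write as an application would issue it through the public librados C++ API (librados.hpp); the pool name "mypool" and object name "greeting" are invented for illustration:

#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);                  // connect as the default client.admin
  cluster.conf_read_file(NULL);        // read ceph.conf from default paths
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mypool", io);  // hypothetical pool name

  librados::bufferlist bl;
  bl.append("hello rados");
  // Blocks until the primary and its replicas have committed the write,
  // i.e. the aio_write + wait_for_safe() pair shown above
  int r = io.write("greeting", bl, bl.length(), 0);

  io.close();
  cluster.shutdown();
  return r < 0 ? 1 : 0;
}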

SLIDE 13

RADOS: The Write Path (Network)

[Diagram: the client sends the write to the primary OSD, which forwards it to the replica OSDs; commits flow back the same way]

SLIDE 14

  • Queue write for PG
  • Lock PG
  • Assign order to write op
  • Package it for persistent storage

Find current object state, etc.

  • Send op to replica
  • Send to local persistent storage
  • Unlock PG
  • Wait for commits from persistent storage and replicas
  • Send commit back to client

RADOS: The Write Path (OSD)

SLIDE 15

RBD: Overview

SLIDE 16

STORING VIRTUAL DISKS

[Diagram: a VM on a hypervisor does its block I/O through LIBRBD, which talks directly to the RADOS cluster and its monitors]

SLIDE 17

RBD STORES VIRTUAL DISKS

RADOS BLOCK DEVICE:

  • Storage of disk images in RADOS
  • Decouples VMs from host
  • Images are striped across the cluster (pool)
  • Snapshots
  • Copy-on-write clones
  • Support in:
  • Mainline Linux Kernel (2.6.39+)
  • Qemu/KVM, native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
SLIDE 18

ssize_t Image::write(uint64_t ofs, size_t len, bufferlist& bl);
int Image::aio_write(uint64_t off, size_t len, bufferlist& bl, RBD::AioCompletion *c);

RBD: The Write Path
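And a hedged sketch of that path from a client program using the librbd C++ API; the pool "mypool" and pre-existing image "disk0" are invented, and error handling is elided:

#include <rados/librados.hpp>
#include <rbd/librbd.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);
  cluster.conf_read_file(NULL);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mypool", io);

  librbd::RBD rbd;
  librbd::Image image;
  rbd.open(io, image, "disk0");          // open an existing image

  librados::bufferlist bl;
  bl.append("block data");
  // librbd maps this image offset onto one or more RADOS objects and
  // issues the corresponding librados writes
  image.write(0, bl.length(), bl);

  image.close();
  io.close();
  cluster.shutdown();
  return 0;
}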

SLIDE 19

CephFS: Overview

SLIDE 20

[Diagram: a Linux host mounts CephFS through the kernel module (alternatives: ceph-fuse, Samba, Ganesha); metadata and data take separate paths to the RADOS cluster]

SLIDE 21

extern "C" int ceph_write(struct ceph_mount_info *cmount, int fd, const char *buf, int64_t size, int64_t ofgset)

CephFS: The Write Path (User)
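A minimal sketch of calling that entry point through libcephfs; the file path is invented and error checking elided:

#include <cephfs/libcephfs.h>
#include <fcntl.h>
#include <string.h>

int main() {
  struct ceph_mount_info *cmount;
  ceph_create(&cmount, NULL);          // connect as client.admin
  ceph_conf_read_file(cmount, NULL);   // default ceph.conf search path
  ceph_mount(cmount, "/");             // mount the filesystem root

  int fd = ceph_open(cmount, "/notes.txt", O_CREAT | O_WRONLY, 0644);
  const char *msg = "hello cephfs";
  ceph_write(cmount, fd, msg, strlen(msg), 0);

  ceph_close(cmount, fd);
  ceph_unmount(cmount);
  ceph_release(cmount);
  return 0;
}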

SLIDE 22

CephFS: The Write Path (Network)

[Diagram: the client exchanges capability messages with the MDS, then sends file data directly to the OSDs]

SLIDE 23

  • Request write capability from MDS if not already present
  • Get “cap” from MDS
  • Write new data to the “ObjectCacher”
  • (Inline, or later when flushing:)

Send write to OSD

Receive commit from OSD

  • Return to caller

CephFS: The Write Path

SLIDE 24

The Origin of Snapshots

SLIDE 25

[john@schist backups]$ touch history
[john@schist backups]$ cd .snap
[john@schist .snap]$ mkdir snap1
[john@schist .snap]$ cd ..
[john@schist backups]$ rm -f history
[john@schist backups]$ ls
[john@schist backups]$ ls .snap/snap1
history
# Deleted file still there in the snapshot!

SLIDE 26

  • For CephFS

Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together

  • Must be cheap to create
  • We have external storage for any desired snapshot metadata

Snapshot Design: Goals & Limits

SLIDE 27

  • Snapshots are per-object
  • Driven on object write

So snaps which logically apply to any object don’t touch it if it’s not written

  • Very skinny data

Per-object list of existing snaps

Global list of deleted snaps

Snapshot Design: Outcome

SLIDE 28

RADOS: “Self-managed” snapshots

SLIDE 29

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Librados snaps interface

SLIDE 30

“snapids” are allocated by incrementing the “snapid” and “snap_seq” members of the per-pool “pg_pool_t” OSDMap struct

Allocating Self-managed Snapshots

SLIDE 31

Allocating Self-managed Snapshots

[Diagram: the client asks the monitor leader for a new snapid; the monitor quorum (peons) commits the pool change to disk before replying]

SLIDE 32

Allocating Self-managed Snapshots

[Same diagram] ...or you can just make snapids up yourself (CephFS does so in the MDS)

SLIDE 33

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Librados snaps interface
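Roughly how a client drives this through the public librados C++ API, which exposes these internals as selfmanaged_snap_* methods on IoCtx; a sketch assuming an already-open IoCtx and an invented object name:

#include <rados/librados.hpp>
#include <string>
#include <vector>

// Create a self-managed snapshot, then overwrite an object under it.
// On the next write the OSD sees the newer SnapContext and clones the
// old contents before applying the write (shown a few slides later).
void snapshot_then_overwrite(librados::IoCtx& io, const std::string& oid) {
  uint64_t snapid;
  io.selfmanaged_snap_create(&snapid);      // monitors allocate the snapid

  // The SnapContext: seq plus all snaps that apply, newest first
  std::vector<librados::snap_t> snaps = { snapid };
  io.selfmanaged_snap_set_write_ctx(snapid, snaps);

  librados::bufferlist bl;
  bl.append("new data");
  io.write(oid, bl, bl.length(), 0);        // HEAD is cloned, then written
}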

SLIDE 34

Writing With Snapshots

[Diagram: client → primary → replica, as in the earlier write path]

write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off)

SLIDE 35

  • Queue write for PG
  • Lock PG
  • Assign order to write op
  • Package it for persistent storage

Find current object state, etc.

make_writeable()

  • Send op to replica
  • Send to local persistent storage
  • Wait for commits from persistent storage and replicas
  • Send commit back to client

Snapshots: The OSD Path

SLIDE 36

  • The PrimaryLogPG::make_writeable() function

Is the “SnapContext” newer than what the object already has on disk?

(Create a transaction to) clone the existing object

Update the stats and clone range overlap information

  • PG::append_log() calls update_snap_map()

Updates the “SnapMapper”, which maintains LevelDB entries mapping:

  • snapid → object
  • and object → snapid

Snapshots: The OSD Path
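A self-contained toy model of that clone decision; all types here are simplified stand-ins I made up, not the real OSD structures:

#include <cstdint>
#include <iostream>
#include <vector>

using snapid_t = uint64_t;

struct SnapContext {                    // what the client sends
  snapid_t seq;                         // newest snapid it knows about
  std::vector<snapid_t> snaps;          // all applicable snaps, descending
};

struct ToySnapSet {                     // stand-in for the on-disk SnapSet
  snapid_t seq = 0;                     // newest snap already cloned for
  std::vector<snapid_t> clones;
};

// The heart of make_writeable(): clone only if the incoming context is newer
bool needs_clone(const ToySnapSet& ss, const SnapContext& snapc) {
  return snapc.seq > ss.seq;
}

int main() {
  ToySnapSet ss;
  SnapContext snapc{5, {5}};            // snapshot 5 was taken

  if (needs_clone(ss, snapc)) {         // first write after the snap: clone
    ss.clones.push_back(snapc.seq);
    ss.seq = snapc.seq;
    std::cout << "cloned HEAD for snap " << snapc.seq << "\n";
  }
  std::cout << (needs_clone(ss, snapc) ? "clone again\n"
                                       : "no clone on later writes\n");
  return 0;
}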

SLIDE 37

struct SnapSet {
  snapid_t seq;
  bool head_exists;
  vector<snapid_t> snaps;      // descending
  vector<snapid_t> clones;     // ascending
  map<snapid_t, interval_set<uint64_t> > clone_overlap;
  map<snapid_t, uint64_t> clone_size;
};

  • This is attached to the “HEAD” object in an xattr

Snapshots: OSD Data Structures

SLIDE 38

RADOS: Pool Snapshots :(

SLIDE 39

  • Make snapshots “easy” for admins
  • Leverage the existing per-object implementation

Overlay the correct SnapContext automatically on writes

Spread that SnapContext via the OSDMap

Pool Snaps: Desire

SLIDE 40

int snap_list(vector<uint64_t> *snaps);
int snap_lookup(const char *name, uint64_t *snapid);
int snap_get_name(uint64_t snapid, std::string *s);
int snap_get_stamp(uint64_t snapid, time_t *t);
int snap_create(const char* snapname);
int snap_remove(const char* snapname);
int rollback(const object_t& oid, const char *snapName);

Note how that’s still per-object!

Librados pool snaps interface
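A sketch of that interface in use through the librados C++ API (snapshot and object names invented); note the rollback call still names a single object:

#include <rados/librados.hpp>
#include <vector>

void pool_snap_demo(librados::IoCtx& io) {
  io.snap_create("nightly-2017-03-23");      // snapshot the whole pool

  std::vector<librados::snap_t> snaps;
  io.snap_list(&snaps);                      // enumerate pool snapshots

  // Rollback is still a per-object operation, as the slide points out
  io.snap_rollback("some-object", "nightly-2017-03-23");

  io.snap_remove("nightly-2017-03-23");
}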

SLIDE 41

  • “Spread that SnapContext via the OSDMap”

It’s not a point-in-time snapshot

The SnapContext spreads virally as OSDMaps get pushed out

No guaranteed temporal order between two different RBD volumes in the pool – even when attached to the same VM :(

  • Inflates the OSDMap size:

per-pool map<snapid_t, pool_snap_info_t> snaps;
struct pool_snap_info_t {
  snapid_t snapid;
  utime_t stamp;
  string name;
};

  • They are unlikely to solve a problem you actually have

Pool Snaps: Reality

SLIDE 42

  • “Overlay the correct SnapContext automatically on writes”

No sensible way to merge that with a self-managed SnapContext

...so we don’t: pick one or the other for a pool

All in all, pool snapshots are unlikely to usefully solve any problems.

Pool Snaps: Reality

SLIDE 43

RBD: Snapshot Structures

SLIDE 44

RBD Snapshots: Data Structures

struct cls_rbd_snap {
  snapid_t id;
  string name;
  uint64_t image_size;
  uint64_t features;
  uint8_t protection_status;
  cls_rbd_parent parent;
  uint64_t flags;
  utime_t timestamp;
  cls::rbd::SnapshotNamespaceOnDisk snapshot_namespace;
};

SLIDE 45

RBD Snapshots: Data Structures

  • A cls_rbd_snap for every snapshot
  • Stored in the “omap” (read: LevelDB) key-value space on the RBD volume’s header object
  • The RBD object class exposes a get_snapcontext() function, called on mount
  • RBD clients “watch” the header and get a “notify” when a new snap is created, so they can update themselves
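Client-side, that machinery sits behind a couple of librbd calls; a sketch assuming an already-open Image handle and an invented snapshot name:

#include <rbd/librbd.hpp>
#include <iostream>
#include <vector>

void snapshot_image(librbd::Image& image) {
  // Adds a cls_rbd_snap entry to the header object's omap; watchers
  // are notified so they can refresh their cached SnapContext
  image.snap_create("before-upgrade");

  std::vector<librbd::snap_info_t> snaps;
  image.snap_list(snaps);                 // read the entries back
  for (const auto& s : snaps)
    std::cout << s.id << " " << s.name << " " << s.size << "\n";
}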

SLIDE 46

CephFS: Snapshot Structures

SLIDE 47

  • For CephFS

Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together

  • Must be cheap to create
  • We have external storage for any desired snapshot metadata

CephFS Snapshots: Goals & Limits

SLIDE 48

CephFS Snapshots: Memory

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;  // key is "last" (or NOSNAP)

SLIDE 49

[Diagram (originally built up as a progressive animation across slides 49-52): a directory tree where /home contains greg, sage, and ric, with SnapRealms covering subtrees of the hierarchy]

CephFS Snapshots: SnapRealms


SLIDE 53

CephFS Snapshots: SnapRealms

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;  // key is "last" (or NOSNAP)

Construct the SnapContext!
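A toy sketch of what “construct the SnapContext” could look like: walk the realm and its past_parents, gather every snapid that applies, and sort newest-first. The types are simplified stand-ins, not the real MDS code:

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;

struct ToyRealm {
  std::vector<snapid_t> snaps;                 // snaps created in this realm
  std::map<snapid_t, ToyRealm*> past_parents;  // realms we used to live under
};

void gather(const ToyRealm* realm, std::vector<snapid_t>& out) {
  if (!realm) return;
  out.insert(out.end(), realm->snaps.begin(), realm->snaps.end());
  for (const auto& [last, parent] : realm->past_parents)
    gather(parent, out);                       // recurse into old locations
}

// Returns snapids descending, the order a SnapContext expects
std::vector<snapid_t> build_snap_context(const ToyRealm* realm) {
  std::vector<snapid_t> snaps;
  gather(realm, snaps);
  std::sort(snaps.begin(), snaps.end());
  snaps.erase(std::unique(snaps.begin(), snaps.end()), snaps.end());
  std::reverse(snaps.begin(), snaps.end());
  return snaps;
}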

SLIDE 54

CephFS Snapshots: Memory

  • All “CInodes” have an “old_inode_t” map representing their past states for snapshots:

struct old_inode_t {
  snapid_t first;
  inode_t inode;
  std::map<string,bufferptr> xattrs;
};

SLIDE 55

CephFS Snapshots: Disk

  • SnapRealms are encoded as part of inode
  • Snapshotted metadata stored as old_inode_t map in memory/disk
  • Snapshotted data stored in RADOS object self-managed snapshots

/<v2>/home<v5>/greg<v9>/foo

foo -> ino 1342, 4 MB, [<1>,<3>,<10>]
bar -> ino 1001, 1024 KBytes
baz -> ino 1242, 2 MB

/<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ Mydir[01], total size 7MB

1342.0/HEAD

SLIDE 56

CephFS Snapshots

  • Arbitrary sub-tree snapshots of the hierarchy
  • Metadata stored as old_inode_t map in memory/disk
  • Data stored in RADOS object snapshots

/<v2>/home<v5>/greg<v9>/foo → 1342.0/HEAD
/<v1>/home<v3>/greg<v7>/foo → 1342.0/1

SLIDE 57

CephFS: Snapshot Pain

SLIDE 58

CephFS Pain: Opening past parents

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;

SLIDE 59

CephFS Pain: Opening past parents

  • To construct the SnapContext for a write, we need all the SnapRealms the file has ever participated in

Because it could have been logically snapshotted in an old location but not written to since, and a new write must reflect that old location’s snapid

  • So we must open all the directories the file has been a member of!

With a single MDS, this isn’t too hard

With multi-MDS, this can be very difficult in some scenarios

  • We may not know who is “authoritative” for a directory under all failure and recovery scenarios
  • If there’s been a disaster, metadata may be inaccessible, but we don’t have mechanisms for holding operations and retrying when “unrelated” metadata is inaccessible

SLIDE 60

CephFS Pain: Opening past parents

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;

Why not store snaps in all descendants instead of maintaining ancestor links?

SLIDE 61

CephFS Pain: Eliminating past parents

  • The MDS opens an inode for any operation performed on it

This includes its SnapRealm

  • So we can merge snapid lists down whenever we open an inode that has a new SnapRealm
  • So if we rename a directory/file into a new location, its SnapRealm already contains all the right snapids, and then we don’t need a link to the past!
  • I got this almost completely finished

Reduced code line count

Much simpler snapshot tracking code

But...

SLIDE 62

CephFS Pain: Hard links

  • Hard links and snapshots do not interact :(
  • They should!
  • That means we need to merge SnapRealms from all the linked parents of an inode

And this is the exact same problem we have with past_parents

Since we need to open “remote” inodes correctly, avoiding it in the common case doesn’t help us

  • So, back to debugging and more edge cases
SLIDE 63

RADOS: Deleting Snapshots

SLIDE 64

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Librados snaps interface

SLIDE 65

“Deleting” Snapshots (Client)

[Diagram: the client sends the snap-remove to the monitors, which commit the OSDMap change to disk across the quorum (peons) before acknowledging]

SLIDE 66

  • Generate a new OSDMap updating pg_pool_t’s

interval_set<snapid_t> removed_snaps;

  • This is really space-efficient if you consistently delete your oldest snapshots!

Rather less so if you keep every other one forever

  • ...and that looks sort of like some sensible RBD snapshot strategies (daily for a week, weekly for a month, monthly for a year)

Deleting Snapshots (Monitor)
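To see why the deletion pattern matters, here is a toy stand-in for Ceph’s interval_set (a map of start → end): contiguous oldest-first deletion collapses into one interval, while keeping every other snapshot leaves one interval per deleted snap:

#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

using snapid_t = uint64_t;

struct ToyIntervalSet {
  std::map<snapid_t, snapid_t> m;       // start -> end (exclusive)

  void insert(snapid_t s) {             // assumes s not already present
    auto it = m.emplace(s, s + 1).first;
    if (it != m.begin()) {              // merge with preceding interval
      auto prev = std::prev(it);
      if (prev->second == it->first) {
        prev->second = it->second;
        m.erase(it);
        it = prev;
      }
    }
    auto next = std::next(it);          // merge with following interval
    if (next != m.end() && it->second == next->first) {
      it->second = next->second;
      m.erase(next);
    }
  }
};

int main() {
  ToyIntervalSet oldest_first, every_other;
  for (snapid_t s = 1; s <= 52; ++s) oldest_first.insert(s);   // purge oldest
  for (snapid_t s = 1; s <= 52; s += 2) every_other.insert(s); // keep half
  std::cout << "oldest-first: " << oldest_first.m.size() << " interval(s)\n";
  std::cout << "every-other:  " << every_other.m.size()  << " interval(s)\n";
  return 0;
}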

SLIDE 67

  • OSD advances its OSDMap
  • Asynchronously:

List objects with that snapshot via the “SnapMapper”

  • int get_next_objects_to_trim(snapid_t snap, unsigned max, vector<hobject_t> *out);

For each object:

  • “Unlink” the object clone for that snapshot – 1 coalescable IO

– Sometimes clones belong to multiple snaps, so we might not delete right away

  • Update the object HEAD’s “SnapSet” xattr – 1+ unique IO
  • Remove the SnapMapper’s LevelDB entries for that object/snap pair – 1 coalescable IO
  • Write down a “PGLog” entry recording the clone removal – 1 coalescable IO

Note that if you trim a bunch of snaps, you do this for each one – no coalescing it down to one pass on each object :(

Deleting Snapshots (OSD)

SLIDE 68

  • So that’s at least 1 IO per object in a snap

Potentially a lot more if we needed to fetch KV data off disk, didn’t have directories cached, etc.

This will be a lot better in BlueStore! It’s just coalescable metadata ops

  • Ouch!
  • Even worse: throttling is hard

“Why” is a whole talk on its own

It’s very difficult not to overwhelm clusters if you do a lot of trimming at once

Deleting Snapshots (OSD)

SLIDE 69

RADOS: Alternate Approaches

SLIDE 70

  • Maintain a per-snapid directory with hard links!

Every clone is linked into (all of) its snapid directory(s)

Just list the directory to identify them, then:

  • Update the object’s SnapSet
  • Unlink from all relevant directories
  • Turns out this destroys locality, in addition to being icky code

Past: Deleting Snapshots

SLIDE 71

  • For instance, LVM snapshots?
  • We don’t want to snapshot everything on an OSD at once

No implicit “consistency groups” across RBD volumes, for instance

  • So we ultimately need a snap→object mapping, since each snap touches so few objects

Present: Why per-object?

SLIDE 72

  • Update internal interfaces for more coalescing

There’s no architectural reason we need to scan each object per-snapshot

Instead, maintain iterators for each snapshot we are still purging and advance them through the keyspace in step so we can do all snapshots of a particular object in one go

  • Change the deleted-snapshots representation so it doesn’t inflate OSDMaps

Use a “deleting_snapshots” set instead, which can be trimmed once all OSDs report they’ve been removed

Store the full list of deleted snapshots in config-keys or similar, and handle it with ceph-mgr

Future: Available enhancements
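A toy sketch of the iterator idea under my own simplifying assumptions: model each still-purging snapshot’s SnapMapper listing as a sorted object list and merge them, so every object is visited once no matter how many snaps are being trimmed:

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using snapid_t = uint64_t;

int main() {
  // Sorted object listings per snap, as get_next_objects_to_trim() yields
  std::map<snapid_t, std::vector<std::string>> to_trim = {
    {4, {"obj-a", "obj-c"}},
    {7, {"obj-a", "obj-b"}},
  };

  // Min-heap over (object, snap): advances through the keyspace in step
  using entry = std::pair<std::string, snapid_t>;
  std::priority_queue<entry, std::vector<entry>, std::greater<entry>> pq;
  for (const auto& [snap, objs] : to_trim)
    for (const auto& o : objs) pq.push({o, snap});

  while (!pq.empty()) {
    std::string obj = pq.top().first;
    unsigned snap_count = 0;
    while (!pq.empty() && pq.top().first == obj) {  // group by object
      ++snap_count;
      pq.pop();
    }
    // One combined clone/SnapSet/SnapMapper update per object, instead of
    // one full pass per snapshot
    std::cout << "trim " << obj << " for " << snap_count << " snap(s)\n";
  }
  return 0;
}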

SLIDE 73

  • BlueStore: It makes everything better

Stop having to map our semantics onto a filesystem

Snapshot deletes still require the snapid→object mapping, but the actual delete is a few key updates rolled into RocksDB – easy to coalesce

  • Better throttling for users

Short-term: hacks to enable sleeps so we don’t overwhelm the local FS

Long-term: proper cost estimates for BlueStore that we can budget correctly (not really possible on filesystems, since they don’t expose the number of IOs needed to flush the current dirty state)

Future: Available enhancements

SLIDE 74

THANK YOU!

Greg Farnum

Principal Engineer, Ceph

gfarnum@redhat.com

@gregsfortytwo