SLIDE 1

Ceph Snapshots: Diving into Deep Waters

Greg Farnum – Red Hat

Vault – 2017.03.23

SLIDE 2

  • Greg Farnum
  • Principal Software Engineer, Red Hat
  • gfarnum@redhat.com

Hi, I’m Greg

SLIDE 3

  • RADOS, RBD, CephFS: (Lightning) overview and how writes happen
  • The (self-managed) snapshots interface
  • A diversion into pool snapshots
  • Snapshots in RBD, CephFS
  • RADOS/OSD Snapshot implementation, pain points

Outline

SLIDE 4

Ceph’s Past & Present

  • Then: UC Santa Cruz Storage Research Systems Center
  • Long-term research project in petabyte-scale storage
  • Trying to develop a Lustre successor.
  • Now: Red Hat, a commercial open-source software & support provider you might have heard of :) (Mirantis, SuSE, Canonical, 42on, Hastexo, ...)
  • Building a business; customers in virtual block devices and object storage
  • ...and reaching for filesystem users!
SLIDE 5

Ceph Projects

RGW

S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication

LIBRADOS

A library allowing apps direct access to RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS

A software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)

RBD

A virtual block device with snapshots, copy-on-write clones, and multi-site replication

CEPHFS

A distributed POSIX file system with coherent caches and snapshots on any directory

SLIDE 6

RADOS: Overview

SLIDE 7

RADOS Components

OSDs:

  • 10s to 10000s in a cluster
  • One per disk (or one per SSD, RAID group…)
  • Serve stored objects to clients
  • Intelligently peer for replication & recovery

Monitors:

  • Maintain cluster membership and state
  • Provide consensus for distributed decision-making

  • Small, odd number
  • These do not serve stored objects to clients


SLIDE 8

Object Storage Daemons

[Diagram: each OSD daemon sits atop a local filesystem (FS) on a disk; monitors (M) run alongside]

SLIDE 9

CRUSH: Dynamic Data Placement

CRUSH:

  • Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
  • Statistically uniform distribution
  • Stable mapping
  • Limited data migration on change
  • Rule-based configuration
  • Infrastructure topology aware
  • Adjustable replication
  • Weighting
SLIDE 10

DATA IS ORGANIZED INTO POOLS

[Diagram: objects are hashed into placement groups (PGs) inside named pools (POOL A-D), which span the cluster]

SLIDE 11

librados: RADOS Access for Apps

LIBRADOS:

  • Direct access to RADOS for applications
  • C, C++, Python, PHP, Java, Erlang
  • Direct access to storage nodes
  • No HTTP overhead
  • Rich object API
  • Bytes, attributes, key/value data
  • Partial overwrite of existing data
  • Single-object compound atomic operations
  • RADOS classes (stored procedures)
SLIDE 12

aio_write(const object_t &oid, AioCompletionImpl *c, const bufferlist& bl, size_t len, uint64_t off);
c->wait_for_safe();
write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off);

RADOS: The Write Path (user)
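For orientation, here is a minimal sketch of that same synchronous write as an application would issue it through the public librados C++ API (librados.hpp); the pool name "mypool" and object name "greeting" are invented for illustration:

#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);                  // connect as the default client.admin
  cluster.conf_read_file(NULL);        // read ceph.conf from default paths
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mypool", io);  // hypothetical pool name

  librados::bufferlist bl;
  bl.append("hello rados");
  // Blocks until the primary and its replicas have committed the write,
  // i.e. the aio_write + wait_for_safe() pair shown above
  int r = io.write("greeting", bl, bl.length(), 0);

  io.close();
  cluster.shutdown();
  return r < 0 ? 1 : 0;
}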

SLIDE 13

RADOS: The Write Path (Network)

[Diagram: the client sends the write to the primary OSD, which forwards it to the replica OSDs; commits flow back the same way]

SLIDE 14

  • Queue write for PG
  • Lock PG
  • Assign order to write op
  • Package it for persistent storage

Find current object state, etc.

  • Send op to replica
  • Send to local persistent storage
  • Unlock PG
  • Wait for commits from persistent storage and replicas
  • Send commit back to client

RADOS: The Write Path (OSD)

SLIDE 15

RBD: Overview

SLIDE 16

STORING VIRTUAL DISKS

[Diagram: a VM on a hypervisor does its block I/O through LIBRBD, which talks directly to the RADOS cluster and its monitors]

SLIDE 17

RBD STORES VIRTUAL DISKS

RADOS BLOCK DEVICE:

  • Storage of disk images in RADOS
  • Decouples VMs from host
  • Images are striped across the cluster (pool)
  • Snapshots
  • Copy-on-write clones
  • Support in:
  • Mainline Linux Kernel (2.6.39+)
  • Qemu/KVM, native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
SLIDE 18

ssize_t Image::write(uint64_t ofs, size_t len, bufferlist& bl);
int Image::aio_write(uint64_t off, size_t len, bufferlist& bl, RBD::AioCompletion *c);

RBD: The Write Path
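And a hedged sketch of that path from a client program using the librbd C++ API; the pool "mypool" and pre-existing image "disk0" are invented, and error handling is elided:

#include <rados/librados.hpp>
#include <rbd/librbd.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);
  cluster.conf_read_file(NULL);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mypool", io);

  librbd::RBD rbd;
  librbd::Image image;
  rbd.open(io, image, "disk0");          // open an existing image

  librados::bufferlist bl;
  bl.append("block data");
  // librbd maps this image offset onto one or more RADOS objects and
  // issues the corresponding librados writes
  image.write(0, bl.length(), bl);

  image.close();
  io.close();
  cluster.shutdown();
  return 0;
}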

SLIDE 19

CephFS: Overview

SLIDE 20

[Diagram: a Linux host mounts CephFS through the kernel module (alternatives: ceph-fuse, Samba, Ganesha); metadata and data take separate paths to the RADOS cluster]

SLIDE 21

extern "C" int ceph_write(struct ceph_mount_info *cmount, int fd, const char *buf, int64_t size, int64_t ofgset)

CephFS: The Write Path (User)
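A minimal sketch of calling that entry point through libcephfs; the file path is invented and error checking elided:

#include <cephfs/libcephfs.h>
#include <fcntl.h>
#include <string.h>

int main() {
  struct ceph_mount_info *cmount;
  ceph_create(&cmount, NULL);          // connect as client.admin
  ceph_conf_read_file(cmount, NULL);   // default ceph.conf search path
  ceph_mount(cmount, "/");             // mount the filesystem root

  int fd = ceph_open(cmount, "/notes.txt", O_CREAT | O_WRONLY, 0644);
  const char *msg = "hello cephfs";
  ceph_write(cmount, fd, msg, strlen(msg), 0);

  ceph_close(cmount, fd);
  ceph_unmount(cmount);
  ceph_release(cmount);
  return 0;
}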

SLIDE 22

CephFS: The Write Path (Network)

[Diagram: the client exchanges capability messages with the MDS, then sends file data directly to the OSDs]

SLIDE 23

  • Request write capability from MDS if not already present
  • Get “cap” from MDS
  • Write new data to the “ObjectCacher”
  • (Inline, or later when flushing:)

Send write to OSD

Receive commit from OSD

  • Return to caller

CephFS: The Write Path

SLIDE 24

The Origin of Snapshots

SLIDE 25

[john@schist backups]$ touch history
[john@schist backups]$ cd .snap
[john@schist .snap]$ mkdir snap1
[john@schist .snap]$ cd ..
[john@schist backups]$ rm -f history
[john@schist backups]$ ls
[john@schist backups]$ ls .snap/snap1
history
# Deleted file still there in the snapshot!

SLIDE 26

  • For CephFS

Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together

  • Must be cheap to create
  • We have external storage for any desired snapshot metadata

Snapshot Design: Goals & Limits

SLIDE 27

  • Snapshots are per-object
  • Driven on object write

So snaps which logically apply to any object don’t touch it if it’s not written

  • Very skinny data

Per-object list of existing snaps

Global list of deleted snaps

Snapshot Design: Outcome

SLIDE 28

RADOS: “Self-managed” snapshots

SLIDE 29

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Librados snaps interface

SLIDE 30

“snapids” are allocated by incrementing the “snapid” and “snap_seq” members of the per-pool “pg_pool_t” OSDMap struct

Allocating Self-managed Snapshots

SLIDE 31

Allocating Self-managed Snapshots

[Diagram: the client asks the monitor leader for a new snapid; the monitor quorum (peons) commits the pool change to disk before replying]

SLIDE 32

Allocating Self-managed Snapshots

[Same diagram] ...or you can just make snapids up yourself (CephFS does so in the MDS)

SLIDE 33

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Librados snaps interface
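Roughly how a client drives this through the public librados C++ API, which exposes these internals as selfmanaged_snap_* methods on IoCtx; a sketch assuming an already-open IoCtx and an invented object name:

#include <rados/librados.hpp>
#include <string>
#include <vector>

// Create a self-managed snapshot, then overwrite an object under it.
// On the next write the OSD sees the newer SnapContext and clones the
// old contents before applying the write (shown a few slides later).
void snapshot_then_overwrite(librados::IoCtx& io, const std::string& oid) {
  uint64_t snapid;
  io.selfmanaged_snap_create(&snapid);      // monitors allocate the snapid

  // The SnapContext: seq plus all snaps that apply, newest first
  std::vector<librados::snap_t> snaps = { snapid };
  io.selfmanaged_snap_set_write_ctx(snapid, snaps);

  librados::bufferlist bl;
  bl.append("new data");
  io.write(oid, bl, bl.length(), 0);        // HEAD is cloned, then written
}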

SLIDE 34

Writing With Snapshots

[Diagram: client → primary → replica, as in the earlier write path]

write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off)

SLIDE 35

  • Queue write for PG
  • Lock PG
  • Assign order to write op
  • Package it for persistent storage

Find current object state, etc.

make_writeable()

  • Send op to replica
  • Send to local persistent storage
  • Wait for commits from persistent storage and replicas
  • Send commit back to client

Snapshots: The OSD Path

SLIDE 36

  • The PrimaryLogPG::make_writeable() function

Is the “SnapContext” newer than what the object already has on disk?

(Create a transaction to) clone the existing object

Update the stats and clone range overlap information

  • PG::append_log() calls update_snap_map()

Updates the “SnapMapper”, which maintains LevelDB entries mapping:

  • snapid → object
  • and object → snapid

Snapshots: The OSD Path
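A self-contained toy model of that clone decision; all types here are simplified stand-ins I made up, not the real OSD structures:

#include <cstdint>
#include <iostream>
#include <vector>

using snapid_t = uint64_t;

struct SnapContext {                    // what the client sends
  snapid_t seq;                         // newest snapid it knows about
  std::vector<snapid_t> snaps;          // all applicable snaps, descending
};

struct ToySnapSet {                     // stand-in for the on-disk SnapSet
  snapid_t seq = 0;                     // newest snap already cloned for
  std::vector<snapid_t> clones;
};

// The heart of make_writeable(): clone only if the incoming context is newer
bool needs_clone(const ToySnapSet& ss, const SnapContext& snapc) {
  return snapc.seq > ss.seq;
}

int main() {
  ToySnapSet ss;
  SnapContext snapc{5, {5}};            // snapshot 5 was taken

  if (needs_clone(ss, snapc)) {         // first write after the snap: clone
    ss.clones.push_back(snapc.seq);
    ss.seq = snapc.seq;
    std::cout << "cloned HEAD for snap " << snapc.seq << "\n";
  }
  std::cout << (needs_clone(ss, snapc) ? "clone again\n"
                                       : "no clone on later writes\n");
  return 0;
}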

SLIDE 37

struct SnapSet {
  snapid_t seq;
  bool head_exists;
  vector<snapid_t> snaps;      // descending
  vector<snapid_t> clones;     // ascending
  map<snapid_t, interval_set<uint64_t> > clone_overlap;
  map<snapid_t, uint64_t> clone_size;
};

  • This is attached to the “HEAD” object in an xattr

Snapshots: OSD Data Structures

SLIDE 38

RADOS: Pool Snapshots :(

SLIDE 39

  • Make snapshots “easy” for admins
  • Leverage the existing per-object implementation

Overlay the correct SnapContext automatically on writes

Spread that SnapContext via the OSDMap

Pool Snaps: Desire

SLIDE 40

int snap_list(vector<uint64_t> *snaps);
int snap_lookup(const char *name, uint64_t *snapid);
int snap_get_name(uint64_t snapid, std::string *s);
int snap_get_stamp(uint64_t snapid, time_t *t);
int snap_create(const char* snapname);
int snap_remove(const char* snapname);
int rollback(const object_t& oid, const char *snapName);

Note how that’s still per-object!

Librados pool snaps interface
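A sketch of that interface in use through the librados C++ API (snapshot and object names invented); note the rollback call still names a single object:

#include <rados/librados.hpp>
#include <vector>

void pool_snap_demo(librados::IoCtx& io) {
  io.snap_create("nightly-2017-03-23");      // snapshot the whole pool

  std::vector<librados::snap_t> snaps;
  io.snap_list(&snaps);                      // enumerate pool snapshots

  // Rollback is still a per-object operation, as the slide points out
  io.snap_rollback("some-object", "nightly-2017-03-23");

  io.snap_remove("nightly-2017-03-23");
}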

SLIDE 41

  • “Spread that SnapContext via the OSDMap”

It’s not a point-in-time snapshot

The SnapContext spreads virally as OSDMaps get pushed out

No guaranteed temporal order between two different RBD volumes in the pool – even when attached to the same VM :(

  • Inflates the OSDMap size:

per-pool map<snapid_t, pool_snap_info_t> snaps;
struct pool_snap_info_t {
  snapid_t snapid;
  utime_t stamp;
  string name;
};

  • They are unlikely to solve a problem you actually have

Pool Snaps: Reality

SLIDE 42

  • “Overlay the correct SnapContext automatically on writes”

No sensible way to merge that with a self-managed SnapContext

...so we don’t: pick one or the other for a pool

All in all, pool snapshots are unlikely to usefully solve any problems.

Pool Snaps: Reality

SLIDE 43

RBD: Snapshot Structures

SLIDE 44

RBD Snapshots: Data Structures

struct cls_rbd_snap {
  snapid_t id;
  string name;
  uint64_t image_size;
  uint64_t features;
  uint8_t protection_status;
  cls_rbd_parent parent;
  uint64_t flags;
  utime_t timestamp;
  cls::rbd::SnapshotNamespaceOnDisk snapshot_namespace;
};

SLIDE 45

RBD Snapshots: Data Structures

  • A cls_rbd_snap for every snapshot
  • Stored in the “omap” (read: LevelDB) key-value space on the RBD volume’s header object
  • The RBD object class exposes a get_snapcontext() function, called on mount
  • RBD clients “watch” the header and get a “notify” when a new snap is created, so they can update themselves
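Client-side, that machinery sits behind a couple of librbd calls; a sketch assuming an already-open Image handle and an invented snapshot name:

#include <rbd/librbd.hpp>
#include <iostream>
#include <vector>

void snapshot_image(librbd::Image& image) {
  // Adds a cls_rbd_snap entry to the header object's omap; watchers
  // are notified so they can refresh their cached SnapContext
  image.snap_create("before-upgrade");

  std::vector<librbd::snap_info_t> snaps;
  image.snap_list(snaps);                 // read the entries back
  for (const auto& s : snaps)
    std::cout << s.id << " " << s.name << " " << s.size << "\n";
}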

SLIDE 46

CephFS: Snapshot Structures

SLIDE 47

  • For CephFS

Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together

  • Must be cheap to create
  • We have external storage for any desired snapshot metadata

CephFS Snapshots: Goals & Limits

SLIDE 48

CephFS Snapshots: Memory

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;  // key is "last" (or NOSNAP)

SLIDE 49

[Diagram (originally built up as a progressive animation across slides 49-52): a directory tree where /home contains greg, sage, and ric, with SnapRealms covering subtrees of the hierarchy]

CephFS Snapshots: SnapRealms


SLIDE 53

CephFS Snapshots: SnapRealms

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;  // key is "last" (or NOSNAP)

Construct the SnapContext!
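A toy sketch of what “construct the SnapContext” could look like: walk the realm and its past_parents, gather every snapid that applies, and sort newest-first. The types are simplified stand-ins, not the real MDS code:

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;

struct ToyRealm {
  std::vector<snapid_t> snaps;                 // snaps created in this realm
  std::map<snapid_t, ToyRealm*> past_parents;  // realms we used to live under
};

void gather(const ToyRealm* realm, std::vector<snapid_t>& out) {
  if (!realm) return;
  out.insert(out.end(), realm->snaps.begin(), realm->snaps.end());
  for (const auto& [last, parent] : realm->past_parents)
    gather(parent, out);                       // recurse into old locations
}

// Returns snapids descending, the order a SnapContext expects
std::vector<snapid_t> build_snap_context(const ToyRealm* realm) {
  std::vector<snapid_t> snaps;
  gather(realm, snaps);
  std::sort(snaps.begin(), snaps.end());
  snaps.erase(std::unique(snaps.begin(), snaps.end()), snaps.end());
  std::reverse(snaps.begin(), snaps.end());
  return snaps;
}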

SLIDE 54

CephFS Snapshots: Memory

  • All “CInodes” have an “old_inode_t” map representing their past states for snapshots:

struct old_inode_t {
  snapid_t first;
  inode_t inode;
  std::map<string,bufferptr> xattrs;
};

SLIDE 55

CephFS Snapshots: Disk

  • SnapRealms are encoded as part of inode
  • Snapshotted metadata stored as old_inode_t map in memory/disk
  • Snapshotted data stored in RADOS object self-managed snapshots

/<v2>/home<v5>/greg<v9>/foo

foo -> ino 1342, 4 MB, [<1>,<3>,<10>]
bar -> ino 1001, 1024 KBytes
baz -> ino 1242, 2 MB

/<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ Mydir[01], total size 7MB

1342.0/HEAD

SLIDE 56

CephFS Snapshots

  • Arbitrary sub-tree snapshots of the hierarchy
  • Metadata stored as old_inode_t map in memory/disk
  • Data stored in RADOS object snapshots

/<v2>/home<v5>/greg<v9>/foo → 1342.0/HEAD
/<v1>/home<v3>/greg<v7>/foo → 1342.0/1

SLIDE 57

CephFS: Snapshot Pain

SLIDE 58

CephFS Pain: Opening past parents

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;

SLIDE 59

CephFS Pain: Opening past parents

  • To construct the SnapContext for a write, we need all the SnapRealms the file has ever participated in

Because it could have been logically snapshotted in an old location but not written to since, and a new write must reflect that old location’s snapid

  • So we must open all the directories the file has been a member of!

With a single MDS, this isn’t too hard

With multi-MDS, this can be very difficult in some scenarios

  • We may not know who is “authoritative” for a directory under all failure and recovery scenarios
  • If there’s been a disaster, metadata may be inaccessible, but we don’t have mechanisms for holding operations and retrying when “unrelated” metadata is inaccessible

SLIDE 60

CephFS Pain: Opening past parents

  • Directory “CInodes” have “SnapRealms”
  • Important elements:

snapid_t seq;                  // a version/seq # for changes to _this_ realm
snapid_t created;              // when this realm was created
snapid_t last_created;         // last snap created in _this_ realm
snapid_t last_destroyed;       // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;

Why not store snaps in all descendants instead of maintaining ancestor links?

SLIDE 61

CephFS Pain: Eliminating past parents

  • The MDS opens an inode for any operation performed on it

This includes its SnapRealm

  • So we can merge snapid lists down whenever we open an inode that has a new SnapRealm
  • So if we rename a directory/file into a new location, its SnapRealm already contains all the right snapids, and then we don’t need a link to the past!
  • I got this almost completely finished

Reduced code line count

Much simpler snapshot tracking code

But...

SLIDE 62

CephFS Pain: Hard links

  • Hard links and snapshots do not interact :(
  • They should!
  • That means we need to merge SnapRealms from all the linked parents of an inode

And this is the exact same problem we have with past_parents

Since we need to open “remote” inodes correctly, avoiding it in the common case doesn’t help us

  • So, back to debugging and more edge cases
SLIDE 63

RADOS: Deleting Snapshots

SLIDE 64

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Librados snaps interface

SLIDE 65

“Deleting” Snapshots (Client)

[Diagram: the client sends the snap-remove to the monitors, which commit the OSDMap change to disk across the quorum (peons) before acknowledging]

SLIDE 66

  • Generate a new OSDMap updating pg_pool_t’s

interval_set<snapid_t> removed_snaps;

  • This is really space-efficient if you consistently delete your oldest snapshots!

Rather less so if you keep every other one forever

  • ...and that looks sort of like some sensible RBD snapshot strategies (daily for a week, weekly for a month, monthly for a year)

Deleting Snapshots (Monitor)
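To see why the deletion pattern matters, here is a toy stand-in for Ceph’s interval_set (a map of start → end): contiguous oldest-first deletion collapses into one interval, while keeping every other snapshot leaves one interval per deleted snap:

#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

using snapid_t = uint64_t;

struct ToyIntervalSet {
  std::map<snapid_t, snapid_t> m;       // start -> end (exclusive)

  void insert(snapid_t s) {             // assumes s not already present
    auto it = m.emplace(s, s + 1).first;
    if (it != m.begin()) {              // merge with preceding interval
      auto prev = std::prev(it);
      if (prev->second == it->first) {
        prev->second = it->second;
        m.erase(it);
        it = prev;
      }
    }
    auto next = std::next(it);          // merge with following interval
    if (next != m.end() && it->second == next->first) {
      it->second = next->second;
      m.erase(next);
    }
  }
};

int main() {
  ToyIntervalSet oldest_first, every_other;
  for (snapid_t s = 1; s <= 52; ++s) oldest_first.insert(s);   // purge oldest
  for (snapid_t s = 1; s <= 52; s += 2) every_other.insert(s); // keep half
  std::cout << "oldest-first: " << oldest_first.m.size() << " interval(s)\n";
  std::cout << "every-other:  " << every_other.m.size()  << " interval(s)\n";
  return 0;
}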

SLIDE 67

  • OSD advances its OSDMap
  • Asynchronously:

List objects with that snapshot via the “SnapMapper”

  • int get_next_objects_to_trim(snapid_t snap, unsigned max, vector<hobject_t> *out);

For each object:

  • “Unlink” the object clone for that snapshot – 1 coalescable IO

– Sometimes clones belong to multiple snaps, so we might not delete right away

  • Update the object HEAD’s “SnapSet” xattr – 1+ unique IO
  • Remove the SnapMapper’s LevelDB entries for that object/snap pair – 1 coalescable IO
  • Write down a “PGLog” entry recording the clone removal – 1 coalescable IO

Note that if you trim a bunch of snaps, you do this for each one – no coalescing it down to one pass on each object :(

Deleting Snapshots (OSD)

SLIDE 68

  • So that’s at least 1 IO per object in a snap

Potentially a lot more if we needed to fetch KV data off disk, didn’t have directories cached, etc.

This will be a lot better in BlueStore! It’s just coalescable metadata ops

  • Ouch!
  • Even worse: throttling is hard

“Why” is a whole talk on its own

It’s very difficult not to overwhelm clusters if you do a lot of trimming at once

Deleting Snapshots (OSD)

SLIDE 69

RADOS: Alternate Approaches

SLIDE 70

  • Maintain a per-snapid directory with hard links!

Every clone is linked into (all of) its snapid directory(s)

Just list the directory to identify them, then:

  • Update the object’s SnapSet
  • Unlink from all relevant directories
  • Turns out this destroys locality, in addition to being icky code

Past: Deleting Snapshots

SLIDE 71

  • For instance, LVM snapshots?
  • We don’t want to snapshot everything on an OSD at once

No implicit “consistency groups” across RBD volumes, for instance

  • So we ultimately need a snap→object mapping, since each snap touches so few objects

Present: Why per-object?

SLIDE 72

  • Update internal interfaces for more coalescing

There’s no architectural reason we need to scan each object per-snapshot

Instead, maintain iterators for each snapshot we are still purging and advance them through the keyspace in step so we can do all snapshots of a particular object in one go

  • Change the deleted-snapshots representation so it doesn’t inflate OSDMaps

Use a “deleting_snapshots” set instead, which can be trimmed once all OSDs report they’ve been removed

Store the full list of deleted snapshots in config-keys or similar, and handle it with ceph-mgr

Future: Available enhancements
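A toy sketch of the iterator idea under my own simplifying assumptions: model each still-purging snapshot’s SnapMapper listing as a sorted object list and merge them, so every object is visited once no matter how many snaps are being trimmed:

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using snapid_t = uint64_t;

int main() {
  // Sorted object listings per snap, as get_next_objects_to_trim() yields
  std::map<snapid_t, std::vector<std::string>> to_trim = {
    {4, {"obj-a", "obj-c"}},
    {7, {"obj-a", "obj-b"}},
  };

  // Min-heap over (object, snap): advances through the keyspace in step
  using entry = std::pair<std::string, snapid_t>;
  std::priority_queue<entry, std::vector<entry>, std::greater<entry>> pq;
  for (const auto& [snap, objs] : to_trim)
    for (const auto& o : objs) pq.push({o, snap});

  while (!pq.empty()) {
    std::string obj = pq.top().first;
    unsigned snap_count = 0;
    while (!pq.empty() && pq.top().first == obj) {  // group by object
      ++snap_count;
      pq.pop();
    }
    // One combined clone/SnapSet/SnapMapper update per object, instead of
    // one full pass per snapshot
    std::cout << "trim " << obj << " for " << snap_count << " snap(s)\n";
  }
  return 0;
}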

SLIDE 73

  • BlueStore: It makes everything better

Stop having to map our semantics onto a filesystem

Snapshot deletes still require the snapid→object mapping, but the actual delete is a few key updates rolled into RocksDB – easy to coalesce

  • Better throttling for users

Short-term: hacks to enable sleeps so we don’t overwhelm the local FS

Long-term: proper cost estimates for BlueStore that we can budget correctly (not really possible on filesystems, since they don’t expose the number of IOs needed to flush the current dirty state)

Future: Available enhancements

SLIDE 74

THANK YOU!

Greg Farnum

Principal Engineer, Ceph

gfarnum@redhat.com

@gregsfortytwo