CephFS fsck: Distributed Filesystem Checking (PowerPoint PPT presentation by Greg Farnum)



  2. CephFS fsck: Distributed Filesystem Checking

  3. Hi, I’m Greg. Greg Farnum, CephFS Tech Lead, Red Hat (gfarnum@redhat.com). Been working as a core Ceph developer since June 2009.


  5. What is Ceph? An awesome, software-based, scalable, distributed storage system that is designed for failures • Object storage (our native API) • Block devices (Linux kernel, QEMU/KVM, others) • RESTful S3 & Swift API object store • POSIX Filesystem 5

  6. [Architecture diagram] LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP. RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift (used by APPs). RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver (used by HOSTs/VMs). CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (used by CLIENTs). Underneath them all sits RADOS (Reliable Autonomic Distributed Object Store): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes.

  7. What is CephFS? An awesome, software-based, scalable, distributed POSIX- compliant file system that is designed for failures 7

  8. RADOS: A user perspective

  9. Objects in RADOS. [Diagram] Each object carries three things: data (a byte stream, e.g. 01110010101...), xattrs (e.g. version: 1), and an omap key/value store (e.g. foo -> bar, baz -> qux).

  10. The librados API C, C++, Python, Java, shell. File-like API: • read/write (extent), truncate, remove; get/set/remove xattr or key • efficient copy-on-write clone • Snapshots — single object or pool-wide • atomic compound operations/transactions • read + getxattr, write + setxattr • compare xattr value, if match write + setxattr • “object classes” • load new code into cluster to implement new methods • calc sha1, grep/filter, generate thumbnail • encrypt, increment, rotate image • Implement your own access mechanisms — HDF5 on the node • watch/notify: use object as communication channel between clients (locking primitive) • pgls: list the objects within a placement group 10
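
To make the file-like API concrete, here is a minimal sketch using the Python librados binding (the rados module). The pool name "data" and object name "greeting" are placeholders, and error handling is omitted.

      import rados

      # Connect using the standard config file and the default keyring.
      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      try:
          # An ioctx is bound to a single pool; 'data' is a placeholder name.
          ioctx = cluster.open_ioctx('data')
          try:
              # Write a whole object, attach an xattr, then read both back.
              ioctx.write_full('greeting', b'hello rados')
              ioctx.set_xattr('greeting', 'version', b'1')
              print(ioctx.read('greeting'))                  # b'hello rados'
              print(ioctx.get_xattr('greeting', 'version'))  # b'1'
              ioctx.remove_object('greeting')
          finally:
              ioctx.close()
      finally:
          cluster.shutdown()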

  11. The RADOS Cluster. [Diagram] A CLIENT talks to a small cluster of monitors (M) and a large collection of Object Storage Devices (OSDs).

  12. [Diagram] Objects live in pools: Pool 1 and Pool 2 each hold Object 1 through Object 4, spread across the Object Storage Devices (OSDs), while a Monitor (M) tracks cluster state.

  13. [Placement diagram] An object is mapped to a placement group by hashing its name: pg = hash(object name) % num_pg. CRUSH(pg, cluster state, rule set) then maps that placement group onto a set of OSDs.
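
A toy Python sketch of that two-step mapping. The hash and the stand-in for CRUSH are deliberately simplified placeholders (real CRUSH is a weighted, hierarchy-aware pseudo-random mapping driven by the cluster map and rule set); only the shape of the calculation is the point.

      import hashlib

      def object_to_pg(object_name, num_pg):
          # Step 1: hash the object name onto one of num_pg placement groups.
          # (RADOS uses its own hash function; md5 here is just a stand-in.)
          digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
          return digest % num_pg

      def pg_to_osds(pg, osd_ids, replicas=3):
          # Step 2: map the PG onto `replicas` OSDs. Real CRUSH walks a weighted
          # hierarchy under a placement rule; this toy version just picks a
          # deterministic rotation so the idea is visible.
          start = pg % len(osd_ids)
          return [osd_ids[(start + i) % len(osd_ids)] for i in range(replicas)]

      pg = object_to_pg('10000000005.00000000', num_pg=128)
      print(pg, pg_to_osds(pg, osd_ids=[0, 1, 2, 3, 4, 5]))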

  14. RADOS data guarantees • Any write acked as safe will be visible to all subsequent readers • Any write ever visible to a reader will be visible to all subsequent readers • Any write acked as safe will not be lost unless the whole containing PG is lost • A PG will not be lost unless all N copies are lost ( N is admin- configured, usually 3)… • …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required) 14

  15. RADOS data guarantees • Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise • …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar. • btrfs maintains checksums for certainty and we think this is the future 15

  16. CephFS System Design 16

  17. CephFS Design Goals: Infinitely scalable • Avoid all Single Points Of Failure • Self Managing

  18. [Diagram] The CLIENT talks to a Metadata Server (MDS) for metadata; file data (01, 10) and the monitors (M M M) live in the RADOS cluster.

  19. Scaling Metadata So we have to use multiple MetaData Servers (MDSes) Two Issues: • Storage of the metadata • Ownership of the metadata 19

  20. Scaling Metadata – Storage Some systems store metadata on the MDS system itself But that’s a Single Point Of Failure! • Hot standby? • External metadata storage √ 20

  21. Scaling Metadata – Ownership Traditionally: assign hierarchies manually to each MDS • But if workloads change, your nodes can unbalance Newer: hash directories onto MDSes • But then clients have to jump around for every folder traversal 21

  22. [Diagram] One tree, two metadata servers

  23. [Diagram] One tree, two metadata servers

  24. The Ceph Metadata Server. Key insight: if metadata is stored in RADOS, ownership should be impermanent. One MDS is authoritative over any given subtree, but... • That MDS doesn’t need to keep the whole tree in memory • There’s no reason the authoritative MDS can’t be changed!

  25. The Ceph MDS – Partitioning Cooperative Partitioning between servers: • Keep track of how hot metadata is • Migrate subtrees to keep heat distribution similar • Cheap because all metadata is in RADOS • Maintains locality 25

  26. The Ceph MDS – Persistence All metadata is written to RADOS • And changes are only visible once in RADOS 26

  27. The Ceph MDS – Clustering Benefits Dynamic adjustment to metadata workloads • Replicate hot data to distribute workload Dynamic cluster sizing: • Add nodes as you wish • Decommission old nodes at any time Recover quickly and easily from failures 27

  32. DYNAMIC SUBTREE PARTITIONING 32

  33. Does it work? 33

  34. It scales!

  35. It redistributes!

  36. Cool Extras Besides POSIX-compliance and scaling 36

  37. Snapshots
      $ mkdir foo/.snap/one    # create snapshot
      $ ls foo/.snap
      one
      $ ls foo/bar/.snap
      _one_1099511627776       # parent's snap name is mangled
      $ rm foo/myfile
      $ ls -F foo
      bar/
      $ ls foo/.snap/one
      myfile bar/
      $ rmdir foo/.snap/one    # remove snapshot

  38. Recursive statistics
      $ ls -alSh | head
      total 0
      drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
      drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
      drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
      drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
      drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
      $ getfattr -d -m ceph. pomceph
      # file: pomceph
      ceph.dir.entries="39"
      ceph.dir.files="37"
      ceph.dir.rbytes="10550153946827"
      ceph.dir.rctime="1298565125.590930000"
      ceph.dir.rentries="2454401"
      ceph.dir.rfiles="1585288"
      ceph.dir.rsubdirs="869113"
      ceph.dir.subdirs="2"
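
The same recursive statistics are available programmatically through ordinary xattr calls. A small sketch, assuming a CephFS mount at the hypothetical path /mnt/cephfs:

      import os

      def recursive_bytes(path):
          # ceph.dir.rbytes is the recursively-summed size of everything below
          # `path`, maintained by the MDS, so no tree walk is needed.
          return int(os.getxattr(path, 'ceph.dir.rbytes').decode())

      # Hypothetical mount point and directory, purely for illustration.
      print(recursive_bytes('/mnt/cephfs/pomceph'))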

  39. Different Storage strategies • Set a “virtual xattr” on a directory and all new files underneath it follow that layout. • Layouts can specify lots of detail about storage: • pool file data goes into • how large file objects and stripes are • how many objects are in a stripe set • So in one cluster you can use • one slow pool with big objects for Hadoop workloads • one fast pool with little objects for a scratch space • one slow pool with small objects for home directories • or whatever else makes sense... 39
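
Here is a hedged sketch of pointing a directory at a different pool through that virtual-xattr interface, from Python via the ordinary xattr syscalls. The mount path and pool name are placeholders, and the pool must already be one of the filesystem's data pools.

      import os

      scratch = '/mnt/cephfs/scratch'      # placeholder CephFS path
      os.makedirs(scratch, exist_ok=True)

      # New files created under this directory will be written to the named
      # data pool; files that already exist keep their old layout.
      os.setxattr(scratch, 'ceph.dir.layout.pool', b'fast-ssd-pool')

      # Read back the effective layout as a descriptive string.
      print(os.getxattr(scratch, 'ceph.dir.layout').decode())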

  40. CephFS Important Data structures 40

  41. Directory objects • One (or more!) per directory • Deterministically named: <inode number>.<directory piece> • Embeds dentries and inodes for each child of the folder • Contains a potentially-stale versioned backtrace (path location) • Located in the metadata pool
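
Because a directory's children are just omap entries on its RADOS object, you can peek at one directly with librados. A sketch, assuming the metadata pool is named cephfs_metadata and using the root directory's first fragment (conventionally object 1.00000000); both names are assumptions about a particular cluster.

      import rados

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('cephfs_metadata')    # assumed pool name
      try:
          with rados.ReadOpCtx() as op:
              # List up to 100 omap entries: one per dentry in this dirfrag.
              entries, ret = ioctx.get_omap_vals(op, "", "", 100)
              ioctx.operate_read_op(op, '1.00000000')  # root dir, first fragment
              for dentry_key, encoded_inode in entries:
                  print(dentry_key, len(encoded_inode), 'bytes of encoded metadata')
      finally:
          ioctx.close()
          cluster.shutdown()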

  42. File objects • One or more per file • Deterministically named <ino number>.<object number> • First object contains a potentially-stale versioned backtrace • Located in any of the data pools
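
As a concrete illustration of that naming, a tiny sketch. The 4 MiB object size and the simple non-striped layout are assumptions (layouts are configurable, as slide 39 showed), and the rendering follows the conventional hex-inode, eight-hex-digit-index object names.

      def data_object_name(ino, offset, object_size=4 * 2**20):
          # RADOS object holding byte `offset` of file `ino`, assuming a simple
          # non-striped layout: <inode in hex>.<object index as 8 hex digits>.
          return '{:x}.{:08x}'.format(ino, offset // object_size)

      # Byte 10 MiB of inode 0x10000000005 lands in its third object:
      print(data_object_name(0x10000000005, 10 * 2**20))   # 10000000005.00000002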

  43. MDS log (objects) The MDS fully journals all metadata operations. The log is chunked across objects. • Deterministically named <log inode number>.<log piece> • Log objects may or may not be replayable if previous entries are lost • each entry contains what it needs, but e.g. a file move can depend on a previous rename entry • Located in the metadata pool

  44. MDSTable objects • Single objects • SessionMap (per-MDS) • stores the state of each client Session • particularly: preallocated inodes for each client • InoTable (per-MDS) • Tracks which inodes are available to allocate • (this is not a traditional inode mapping table or similar) • SnapTable (shared) • Tracks system snapshot IDs and their state (in use, pending create/delete) • All located in the metadata pool

  45. CephFS Metadata update flow 45

  46. Client Sends Request. [Diagram] The CLIENT sends a “create dir” request to the MDS; the Object Storage Devices (OSDs) hold the MDS log objects (log.1-log.3) and directory objects (dir.1-dir.3).

  47. MDS Processes Request: “Early Reply” and journaling. [Diagram] The MDS sends an Early Reply to the CLIENT and a Journal Write to the OSDs, appending the event to its log objects.

  48. MDS Processes Request: journaling and safe reply. [Diagram] When the OSDs ack the journal write (the log now includes log.4), the MDS sends the CLIENT a Safe Reply.

  49. …time passes… [Diagram] The journal keeps growing on the OSDs (log.4 through log.8) alongside the directory objects (dir.1-dir.3).

  50. MDS Flushes Log. [Diagram] The MDS performs a Directory Write, flushing the journaled changes out to the directory objects on the OSDs.

  51. MDS Flushes Log. [Diagram] The OSDs ack the directory write; a new directory object (dir.4) now exists.

  52. MDS Flushes Log. [Diagram] With the directory objects updated, the MDS issues a Log Delete for the expired journal objects; only log.5 through log.8 remain.
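
Pulling slides 46-52 together, here is a toy, self-contained model of the whole flow. Everything in it (the in-memory ToyRados, the 200.* log-object names, the helper names) is illustrative rather than real MDS code.

      # Toy model of the MDS update flow: journal + early reply, safe reply on
      # journal ack, then (later) directory writes and log trimming.

      class ToyRados:
          def __init__(self):
              self.objects = {}

          def write(self, name, data):
              self.objects[name] = data     # real life: replicated write + ack
              return True                   # "ack: safe"

          def delete(self, name):
              self.objects.pop(name, None)

      rados = ToyRados()
      log_segments = []     # journal object names, e.g. '200.00000000'
      dirty_dirs = set()    # directories whose RADOS objects are stale

      def handle_request(op):
          # Journal the event; the early reply can go out before the ack.
          print('early reply to client (not yet durable)')
          seg = '200.{:08x}'.format(len(log_segments))
          acked = rados.write(seg, op)            # journal write
          log_segments.append(seg)
          if acked:
              print('safe reply to client')       # durable once journaled
          dirty_dirs.add(op['dir'])               # dir object not updated yet

      def flush_and_trim():
          # Later: write dirty directory objects, then drop covered log segments.
          for d in sorted(dirty_dirs):
              rados.write(d + '.00000000', 'dirfrag contents')   # directory write
          dirty_dirs.clear()
          for seg in list(log_segments):
              rados.delete(seg)                   # log delete
              log_segments.remove(seg)

      handle_request({'type': 'create dir', 'dir': '10000000001'})
      flush_and_trim()
      print(sorted(rados.objects))                # only the dirfrag object remains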

  53. Traditional fsck 53
