SLIDE 1

Linuxtag 2012

Ceph

OR

The link between file systems and octopuses

Udo Seidel

SLIDE 2

Agenda

  • Background
  • CephFS
  • CephStorage
  • Summary
SLIDE 3

Ceph – what?

  • So-called parallel distributed cluster file system
  • Started as part of PhD studies at UCSC
  • Public announcement in 2006 at 7th OSDI
  • File system shipped with Linux kernel since 2.6.34

  • Name derived from pet octopus - cephalopods
SLIDE 4

Shared file systems – short intro

  • Multiple servers access the same data
  • Different approaches
  • Network based, e.g. NFS, CIFS
  • Clustered
    – Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
    – Distributed parallel, e.g. Lustre ... and Ceph

SLIDE 5

Ceph and storage

  • Distributed file system => distributed storage
  • Does not use traditional disks or RAID arrays
  • Does use so-called OSDs
    – Object based Storage Devices
    – Intelligent disks

SLIDE 6

Storage – looking back

  • Not very intelligent
  • Simple and well documented interface, e.g. the SCSI standard

  • Storage management outside the disks
SLIDE 7

Storage – these days

  • Storage hardware powerful => re-define tasks of storage hardware and attached computer
  • Shift of responsibilities towards storage
  • Block allocation
  • Space management
  • Storage objects instead of blocks
  • Extension of interface -> OSD standard
SLIDE 8

Object Based Storage I

  • Objects of quite general nature
  • Files
  • Partitions
  • ID for each storage object
  • Separation of meta data operations and storing of file data

  • HA not covered at all
  • Object based Storage Devices
SLIDE 9

Object Based Storage II

  • OSD software implementation
  • Usually an additional layer between computer and storage

  • Presents object-based file system to the computer
  • Uses a “normal” file system to store data on the storage

  • Delivered as part of Ceph
  • File systems: Lustre, EXOFS
SLIDE 10

Ceph – the full architecture I

  • 4 components
  • Object based Storage Devices
    – Any computer
    – Form a cluster (redundancy and load balancing)
  • Meta Data Servers
    – Any computer
    – Form a cluster (redundancy and load balancing)
  • Cluster Monitors
    – Any computer

  • Clients ;-)
SLIDE 11

Ceph – the full architecture II

SLIDE 12

Ceph client view

  • The kernel part of Ceph
  • Unusual kernel implementation
  • “light” code
  • Almost no intelligence
  • Communication channels
  • To MDS for meta data operation
  • To OSD to access file data
SLIDE 13

Ceph and OSD

  • User land implementation
  • Any computer can act as OSD
  • Uses BTRFS as native file system
  • Since 2009
  • Before self-developed EBOFS
  • Provides functions of the OSD-2 standard
    – Copy-on-write
    – Snapshots

  • No redundancy on disk or even computer level
SLIDE 14

Ceph and OSD – file systems

  • BTRFS preferred
  • Non-default configuration for mkfs
  • XFS and EXT4 possible
  • XATTR (size) is key -> EXT4 less recommended
SLIDE 15

OSD failure approach

  • Any OSD expected to fail
  • New OSD dynamically added/integrated
  • Data distributed and replicated
  • Redistribution of data after change in OSD landscape

SLIDE 16

Data distribution

  • File striped
  • File pieces mapped to object IDs
  • Assignment of so-called placement group (PG) to object ID
    – Via hash function
  • Placement group (PG): logical container of storage objects
  • Calculation of list of OSDs out of PG
    – CRUSH algorithm (sketch of the full chain below)
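A minimal Python sketch of that mapping chain; the stripe size, hash choice, PG count and object naming scheme are illustrative assumptions, not Ceph's actual implementation:

    import hashlib

    STRIPE_SIZE = 4 * 1024 * 1024     # assumed stripe size
    PG_NUM = 128                      # assumed number of placement groups

    def file_to_object_ids(name, file_size):
        """Split a file into stripe-sized pieces and give each piece an object ID."""
        pieces = (file_size + STRIPE_SIZE - 1) // STRIPE_SIZE
        return ["%s.%08d" % (name, i) for i in range(pieces)]   # hypothetical naming

    def object_to_pg(object_id):
        """Hash an object ID onto one of PG_NUM placement groups."""
        return int(hashlib.sha1(object_id.encode()).hexdigest(), 16) % PG_NUM

    # A real cluster would now hand the PG, the cluster map and the placement
    # rules to CRUSH to obtain the ordered list of OSDs (see the next slide).
    for oid in file_to_object_ids("linuxtag.iso", 10 * 1024 * 1024):
        print(oid, "-> PG", object_to_pg(oid))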
SLIDE 17

CRUSH I

  • Controlled Replication Under Scalable Hashing
  • Considers several pieces of information
  • Cluster setup/design
  • Actual cluster landscape/map
  • Placement rules
  • Pseudo random -> quasi statistical distribution
  • Cannot cope with hot spots
  • Clients, MDS and OSDs can calculate object locations
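This is not Ceph's CRUSH code, but a small sketch of the underlying idea: deterministic, weight-aware pseudo-random placement (here done with weighted rendezvous hashing over a flat, hypothetical OSD map), so every client, MDS and OSD holding the same map computes the same answer:

    import hashlib
    import math

    OSD_MAP = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0}   # hypothetical OSD id -> weight

    def draw(pg, osd_id):
        """Deterministic pseudo-random value in (0, 1) for a (PG, OSD) pair."""
        h = int(hashlib.sha1(b"%d.%d" % (pg, osd_id)).hexdigest(), 16)
        return (h % 2**32 + 1) / (2**32 + 2)

    def osds_for_pg(pg, replicas=2):
        """Weighted rendezvous hashing: heavier OSDs win proportionally more PGs."""
        score = {osd: -w / math.log(draw(pg, osd)) for osd, w in OSD_MAP.items()}
        return sorted(score, key=score.get, reverse=True)[:replicas]

    # No lookup table needed: same map plus same PG gives the same OSD list.
    print(osds_for_pg(42))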

SLIDE 18

CRUSH II

SLIDE 19

Data replication

  • N-way replication
  • N OSDs per placement group
  • OSDs in different failure domains
  • First non-failed OSD in PG -> primary
  • Read and write to primary only
  • Writes forwarded by primary to replica OSDs
  • Final write commit after all writes on replica OSDs
  • Replication traffic within OSD network (toy model below)
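A toy model of that write path; plain Python objects stand in for the OSD daemons and their network, and it only illustrates the ordering of the acknowledgements, not Ceph's real messaging:

    class OSD:
        def __init__(self, osd_id):
            self.osd_id = osd_id
            self.store = {}

        def commit(self, oid, data):
            self.store[oid] = data        # stand-in for the local on-disk write
            return True

    def client_write(pg_osds, oid, data):
        """The client talks to the primary only; the primary fans out to replicas."""
        primary, replicas = pg_osds[0], pg_osds[1:]
        primary.commit(oid, data)
        # Replication traffic stays between the OSDs; the client is only
        # acknowledged once every replica has committed the write.
        return all(replica.commit(oid, data) for replica in replicas)

    pg = [OSD(4), OSD(11), OSD(23)]       # hypothetical PG with 3-way replication
    assert client_write(pg, "10000000000.00000000", b"hello ceph")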
SLIDE 20

Ceph caches

  • Per design
  • OSD: identical to normal BTRFS access
  • Client: own caching
  • Concurrent write access
  • Caches discarded
  • Caching disabled -> synchronous I/O
  • HPC extension of POSIX I/O
  • O_LAZY
SLIDE 21

Meta Data Server

  • Form a cluster
  • Don't store any data
  • Data stored on OSDs
  • Journaled writes with cross-MDS recovery
  • Change to MDS landscape
  • No data movement
  • Only management information exchange
  • Partitioning of name space
  • Overlaps on purpose
SLIDE 22

Dynamic subtree partitioning

  • Weighted subtrees per MDS
  • “Load” of MDS re-balanced
SLIDE 23

Meta data management

  • Small set of meta data
  • No file allocation table
  • Object names based on inode numbers
  • MDS combines operations
  • Single request for readdir() and stat()
  • stat() information cached
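Because object names are derived from inode numbers, no allocation table has to be stored or queried; the naming function below only illustrates that idea, the exact format is an assumption and not taken from the slides:

    def data_object_name(inode, stripe_index):
        """Illustrative only: derive a data object name from inode and stripe index."""
        return "%x.%08x" % (inode, stripe_index)

    # Any client that knows the inode can compute where the file data lives:
    print(data_object_name(0x10000000000, 0))   # -> '10000000000.00000000'
    print(data_object_name(0x10000000000, 1))   # -> '10000000000.00000001'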
SLIDE 24

Ceph cluster monitors

  • Status information of Ceph components critical
  • First contact point for new clients
  • Monitors track changes of the cluster landscape
  • Update cluster map
  • Propagate information to OSDs
SLIDE 25

Ceph cluster map I

  • Objects: computers and containers
  • Container: bucket for computers or containers
  • Each object has ID and weight
  • Maps physical conditions
  • rack location
  • fire cells
SLIDE 26

Ceph cluster map II

  • Reflects data rules
  • Number of copies
  • Placement of copies
  • Updated version sent to OSDs
  • OSDs distribute cluster map within OSD cluster
  • OSDs re-calculate PG membership via CRUSH
    – Data responsibilities
    – Order: primary or replica

  • New I/O accepted after information synch
SLIDE 27

Ceph – file system part

  • Replacement of NFS or other DFS
  • Storage just a part
SLIDE 28

Ceph - RADOS

  • Reliable Autonomic Distributed Object Storage
  • Direct access to OSD cluster via librados
  • Drop/skip of POSIX layer (cephfs) on top
  • Visible to all Ceph cluster members => shared storage
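librados ships with language bindings; a minimal sketch with the Python binding, where the config path, pool name and object name are assumptions about the local setup:

    import rados

    # Connect using the local cluster configuration (path is an assumption)
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Talk to a pool directly - no POSIX layer and no CephFS mount involved
    ioctx = cluster.open_ioctx('data')               # 'data' pool assumed to exist
    ioctx.write_full('hello-object', b'stored via librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()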

SLIDE 29

RADOS Block Device

  • RADOS storage exposed as block device
  • /dev/rbd
  • qemu/KVM storage driver via librados
  • Upstream since kernel 2.6.37
  • Replacement of
    – Shared-disk clustered file systems for HA environments
    – Storage HA solutions for qemu/KVM
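A sketch using the Python rbd binding on top of a librados connection; the pool, image name and size are made-up values:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                    # assumed RBD pool

    # Create a 4 GiB image; qemu/KVM or the rbd kernel driver can then attach it
    rbd.RBD().create(ioctx, 'vm-disk-0', 4 * 1024 ** 3)  # hypothetical image name

    with rbd.Image(ioctx, 'vm-disk-0') as image:
        image.write(b'boot sector ...', 0)               # offset 0, like a raw disk

    ioctx.close()
    cluster.shutdown()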
SLIDE 30

RADOS – part I

SLIDE 31

RADOS Gateway

  • RESTful API
  • Amazon S3 -> s3 tools work
  • Swift API
  • Proxy HTTP to RADOS
  • Tested with Apache and lighttpd
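Since the gateway speaks the S3 dialect, ordinary S3 client libraries can talk to it; a sketch with the Python boto library, where host, credentials and bucket name are placeholders:

    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',                  # placeholder credentials
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',                      # gateway behind Apache/lighttpd
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('linuxtag-demo')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('stored through the S3 API, kept in RADOS')
    print([b.name for b in conn.get_all_buckets()])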
SLIDE 32

Ceph storage – all in one

SLIDE 33

Ceph – first steps

  • A few servers
  • At least one additional disk/partition
  • Recent Linux installed
  • ceph installed
  • Trusted ssh connections
  • Ceph configuration
  • Each server is OSD, MDS and Monitor (sketch of a possible configuration below)
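A rough sketch of what an old-style ceph.conf for such a small test cluster could look like, with every node running monitor, MDS and OSD; host names, addresses and paths are placeholders, and newer releases use different deployment tooling:

    [global]
            auth supported = cephx

    [mon.a]
            host = node1
            mon addr = 192.168.0.1:6789

    [mds.a]
            host = node1

    [osd.0]
            host = node1
            osd data = /var/lib/ceph/osd/ceph-0

    # mon.b/mon.c, mds.b/mds.c and osd.1/osd.2 repeat the same
    # pattern on node2 and node3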
SLIDE 34

Summary

  • Promising design/approach
  • High grade of parallelism
  • Still experimental status -> limited recommendation for production

  • Big installations?
  • Back-end file system
  • Number of components
  • Layout
SLIDE 35

References

  • http://ceph.com
  • @ceph-devel
  • http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf
SLIDE 36

Thank you!