SLIDE 1

FROM HPC TO CLOUD … AND BACK AGAIN?

SAGE WEIL – PDSW 2014.11.16

SLIDE 2

AGENDA

  • A bit of history and architecture

– Technology

– Community

  • Challenges
  • Looking forward
SLIDE 3

CEPH

RGW

A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS

A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
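To make the LIBRADOS layer concrete, a minimal sketch using the Python binding (python-rados); the conffile path and the "data" pool are illustrative assumptions, not details from the slides:

import rados

# Connect using a cluster configuration file (path is an assumption).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')    # I/O context bound to one pool
    ioctx.write_full('hello', b'world')   # store a whole object
    print(ioctx.read('hello'))            # read it back: b'world'
    ioctx.close()
finally:
    cluster.shutdown()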

RADOS

A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD

A reliable, fully-distributed block device with cloud platform integration

CEPHFS

A distributed file system with POSIX semantics and scale-out metadata management

[Diagram: APP, HOST/VM, and CLIENT tiers sit atop the corresponding interfaces]

SLIDE 4

SOME HISTORY ...AND ARCHITECTURE

SLIDE 5

ORIGINS

  • Petascale object storage

– DOE: LANL, LLNL, Sandia

– Scalability, reliability, performance

  • Scalable metadata management
  • First line of Ceph code

– Summer internship at LLNL

SLIDE 6

MOTIVATING PRINCIPLES

  • “Intelligent” everything

Smart disks

Smart MDS

Dynamic load balancing

  • Design tenets

All components must scale horizontally

There can be no single point of failure

Self-manage whenever possible

  • Open source

The solution must be hardware agnostic

SLIDE 7

CLIENT / SERVER

[Diagram: clients → access network → redundant heads → reliable disk array]

“clients stripe data across reliable things”

SLIDE 8

CLIENT / CLUSTER

“clients stripe across unreliable things”; “servers coordinate replication, recovery”

[Diagram: access network → mortal targets; backside network (optional)]

SLIDE 9

RADOS CLUSTER

[Diagram: APPLICATION talking to a RADOS CLUSTER of OSDs with monitors (M)]

SLIDE 10

RADOS CLUSTER

[Diagram: APPLICATION → LIBRADOS → RADOS CLUSTER with monitors (M)]

SLIDE 11

MANY OSDS PER HOST

[Diagram: one host, several disks; each disk gets its own OSD daemon atop a local file system (xfs, btrfs, ext4); monitors (M) run alongside]

SLIDE 12

WHERE DO OBJECTS LIVE?

[Diagram: an application holds an OBJECT via LIBRADOS; which OSD in the cluster should store it?]

SLIDE 13

A METADATA SERVER?

[Diagram: (1) LIBRADOS asks a metadata server where the object lives, (2) then contacts the OSD directly]

SLIDE 14

CALCULATED PLACEMENT

location = f(object name, cluster state, policy)

[Diagram: LIBRADOS computes the placement function F client-side; object names hashed across OSDs by range (A-G, H-N, O-T, U-Z); monitors (M)]
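CRUSH (next slides) is the real placement function; purely as a toy illustration of "calculated placement", the sketch below hashes an object name onto a placement group and then deterministically picks OSDs. Every name, constant, and the hash choice is invented for the example; the point is that any client computes the same answer with no lookup table:

import hashlib

def place(obj_name, osds, pg_num, replicas):
    # object name -> placement group (stable hash; md5 as a stand-in)
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], 'little')
    pg = h % pg_num
    # toy stand-in for CRUSH: deterministic, distinct OSDs per replica
    return [osds[(pg + i * 7) % len(osds)] for i in range(replicas)]

osds = list(range(12))        # cluster state: 12 OSDs up
print(place('myobject', osds, pg_num=128, replicas=3))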

SLIDE 15

CRUSH

[Diagram: OBJECTS hash into PLACEMENT GROUPS (PGs); CRUSH maps each PG onto OSDs in the CLUSTER]

SLIDE 16

CRUSH IS A QUICK CALCULATION

[Diagram: OBJECT → PG → OSDs in the RADOS CLUSTER, computed directly by the client]

SLIDE 17

CRUSH AVOIDS FAILED DEVICES

[Diagram: a PG's replicas mapped around a failed OSD in the RADOS CLUSTER]

SLIDE 18

DECLUSTERED PLACEMENT

[Diagram: PGs declustered across the whole RADOS CLUSTER]

  • OSDs store many PGs
  • PGs that map to the same OSD generally have replicas that do not

– No spares

– Highly parallel recovery (see the toy sketch below)

  • Recovery is loosely coordinated

– Monitors publish a new CRUSH map (“OSD.123 is now down”)

– OSDs migrate data cooperatively

– With strong client consistency
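A toy simulation of the declustering claim: when one OSD fails, the surviving replicas of its PGs are scattered across many peers, so recovery reads proceed in parallel and no dedicated spare is needed. All parameters are invented for the example:

import random

random.seed(1)
NUM_OSDS, NUM_PGS, REPLICAS = 20, 512, 3
# random placement as a stand-in for CRUSH: each PG on 3 distinct OSDs
pg_map = {pg: random.sample(range(NUM_OSDS), REPLICAS) for pg in range(NUM_PGS)}

failed = 7
affected = [pg for pg, acting in pg_map.items() if failed in acting]
sources = {osd for pg in affected for osd in pg_map[pg] if osd != failed}

print(len(affected), 'PGs lost a replica on OSD', failed)
print('recovery can read from', len(sources), 'surviving OSDs in parallel')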

SLIDE 19

FILE SYSTEM

[Diagram: a LINUX HOST runs the CEPH CLIENT; metadata and data take separate paths into the RADOS CLUSTER]

SLIDE 20
  • One tree

three metadata servers

??

SLIDE 21
SLIDE 22
SLIDE 23
SLIDE 24
SLIDE 25

DYNAMIC SUBTREE PARTITIONING

SLIDE 26

WHAT NEXT?

SLIDE 27

WHAT CLIENT PROTOCOL?

  • Prototype client was FUSE-based

Slow, some cache consistency limitations

  • Considered [p]NFS

Abandon ad hoc client/MDS protocol and use a standard?

Avoid writing kernel client code?

  • pNFS would abandon most of the MDS value

Dynamic/adaptive balancing, hot spot mitigation, strong fine-grained coherent caching

  • Built native Linux kernel client

Upstream in 2.6.34

SLIDE 28

FOSS >> OPEN STANDARDS

  • Open source client and server
  • Unencumbered integration

Linux, Qemu/KVM

  • No need to adopt standard legacy protocols

iSCSI, NFS, CIFS are client/server

  • Lesson: standards are critical for proprietary products, but offer no value to end-to-end open solutions
  • Intelligent OSDs can do more than read/write blocks

What else should they do?

SLIDE 29

INCUBATION (2007-2011)

  • Skunkworks project at DreamHost

Native Linux kernel client (2007-)

Per-directory snapshots (2008)

Recursive accounting (2008)

librados (2009)

radosgw (2009)

Object classes (2009)

Strong authentication (2009)

RBD: rados block device (2010)

SLIDE 30

LINUX KERNEL SUPPORT

  • Began attending LSF (Linux Storage and Filesystems) workshops
  • Heard stories about early attempts to upstream Lustre
  • Engaged the community through its own processes
  • Eventually merged into mainline in 2.6.34
SLIDE 31

RBD – VIRTUAL BLOCK DEVICES

[Diagram: a VM on a HYPERVISOR accesses the RADOS CLUSTER through LIBRBD]
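A minimal librbd sketch using the Python binding (python-rbd); the pool and image names are illustrative. In practice the same image is attached to a VM through the Qemu/KVM integration rather than written directly like this:

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd.RBD().create(ioctx, 'vm-disk', 10 * 1024**3)   # 10 GiB virtual block device
with rbd.Image(ioctx, 'vm-disk') as image:
    image.write(b'boot sector bytes', 0)           # random-access writes
    print(image.size())

ioctx.close()
cluster.shutdown()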

SLIDE 32

RBD KERNEL MODULE

[Diagram: a LINUX HOST mounts a RADOS CLUSTER block device via the KRBD kernel module]

SLIDE 33

THE RADOS GATEWAY

[Diagram: APPLICATIONs speak REST over a socket to RADOSGW instances, each of which uses LIBRADOS to reach the RADOS CLUSTER]
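Because radosgw speaks the S3 dialect (per the interface slide earlier), stock S3 clients work against it. A hedged sketch using boto3; the endpoint, port, and credentials are placeholders for whatever a given gateway is configured with:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',   # a radosgw endpoint, not AWS
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)
s3.create_bucket(Bucket='demo')
s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored in RADOS')
print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())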

SLIDE 34

RADOS OBJECTS

  • Flat object namespace within logical pools
  • Rich data model for each “object”

Byte array

Attributes (small inline key/value data)

Bulk key/value data

  • Mutable objects

Partial overwrite of existing data

  • Single-object “transactions” (compound operations)

Atomic reads or updates to data and metadata

Atomic test-and-set, conditional updates
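A sketch of that data model through python-rados: byte data, an inline attribute, and bulk key/value (omap) rows updated under a single write operation. Pool and object names are illustrative, and the omap calls assume a binding recent enough to expose WriteOpCtx:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

ioctx.write_full('inode.42', b'payload bytes')       # byte array
ioctx.set_xattr('inode.42', 'version', b'7')         # small inline attribute

# compound operation: both omap rows land atomically or not at all
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('owner', 'mode'), (b'sage', b'0644'))
    ioctx.operate_write_op(op, 'inode.42')

print(ioctx.get_xattr('inode.42', 'version'))
ioctx.close()
cluster.shutdown()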

SLIDE 35

RADOS CLASSES

  • “Objects” in the OOP sense of the word (data + code)
  • RADOS provides basic “methods”

Read, write, setattr, delete, ...

  • Plugin interface to implement new “methods”

Via a dynamically loaded .so

  • Methods executed inside normal IO pipeline

Read methods can accept or return arbitrary data

Write methods generate an update transaction

  • Moving computation is cheap; moving data is not
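From the client's side, invoking a class method is one call naming the class and method; the computation runs on the OSD next to the data. A sketch assuming a python-rados version recent enough to expose Ioctx.execute(), and a hypothetical "counter" class already installed on the OSDs:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# 'counter' / 'increment' are invented names for an installed class method;
# the call executes inside the OSD's normal IO pipeline.
ret, out = ioctx.execute('stats.obj', 'counter', 'increment', b'')
print(ret, out)

ioctx.close()
cluster.shutdown()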
SLIDE 36

RADOS LUA CLASS

  • Noah Watkins (UCSC)
  • RADOS class links embedded LUA interpreter
  • Clients can submit arbitrary script code
  • Simple execution environment

Can call existing methods (like read, write)

  • Caches compiled code
SLIDE 37

INKTANK (2012-2014)

  • Spinout in 2012

DreamHost a poor fit to support open source software

Funding from DreamHost, Mark Shuttleworth

  • Productize Ceph for the enterprise

Focus on stability, testing automation, technical debt

Object and block “Cloud” use-cases

  • Real users, real customers
SLIDE 38

CONTRIBUTORS / MONTH

[Chart: Ceph contributors per month, with the research, incubation, and Inktank periods marked]

SLIDE 39

HPC?

  • Lustre works
  • Lustre hardware model a poor match for Ceph

Redundancy within expensive arrays unnecessary

Ceph replicates or erasure codes across devices

  • More disks, cheaper hardware

Ceph uses NVRAM/flash directly (not buried in array)

  • ORNL experiment

Tuned Ceph on OSTs backed by a DDN array

Started terribly; reached 90% of theoretical peak

Still double-writing, IPoIB, ...

Inefficient HW investment

SLIDE 40

LINUX?

  • Did kernel client investment engage the Linux community?

Not really

Developers have small environments

  • Red Hat bought Gluster Inc.

CephFS not stable enough for production

  • Canonical / Ubuntu

Pulled Ceph into supported distro for librbd

Mark Shuttleworth invested in Inktank

SLIDE 41

THE CLOUD

  • OpenStack mania
  • Inktank focus on object and block interfaces

Start at bottom of stack and work up

Same interfaces needed for IaaS

  • Helped motivate Cinder (block provisioning service)

Enable support of RBD image cloning from Cinder

No data copying, fast VM startup

  • Ceph now #1 block storage backend for OpenStack

More popular than LVM (local disk)

  • Most Inktank customers ran OpenStack
  • Lesson: find some bandwagon to draft behind
SLIDE 42

RED HAT

  • Red Hat buys Inktank in April 2014

45 people

$190MM

  • OpenStack
SLIDE 43

CHALLENGES

SLIDE 44

SUPPORTABILITY

  • Distros

Ubuntu 12.04 LTS at Inktank launch

  • Dependencies

Leveldb suckage – reasonably fast moving project, distros don't keep up

  • Kernels

Occasionally trigger old bugs

  • Rolling upgrades

Large testing matrix

Automation critical

  • Lesson: not shipping hardware makes QA & support harder

SLIDE 45

USING DISK EFFICIENTLY

  • OBFS: simpler data model → faster
  • Ebofs: userspace extent and btree-based object storage

Transaction-based interface

  • Btrfs: how to expose transactions to userspace?

Start and end transaction ioctls

Pass full transaction description to kernel

Snapshot on every checkpoint; rollback on restart

Still need ceph-osd's full data journal for low latency

  • XFS: stable enough for production

Need journal for basic atomicity and consistency

  • Lesson: interfaces tend toward the clean and respectable ...but implementations generally do not
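The journal's role above can be shown with a toy write-ahead scheme: append the whole transaction, fsync, then apply it in place, so a crash mid-apply can be repaired by replaying the journal (replay itself is omitted here). The file layout and record format are invented for the example:

import json
import os

class JournaledStore:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.journal = open(os.path.join(root, 'journal'), 'a')

    def apply_txn(self, writes):
        record = {name: data.hex() for name, data in writes.items()}
        self.journal.write(json.dumps(record) + '\n')   # 1. journal the txn
        self.journal.flush()
        os.fsync(self.journal.fileno())                 # 2. make it durable
        for name, data in writes.items():               # 3. apply in place
            with open(os.path.join(self.root, name), 'wb') as f:
                f.write(data)

store = JournaledStore('/tmp/osd0')
store.apply_txn({'obj_a': b'hello', 'obj_b': b'world'})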

SLIDE 46

MAKING IT WORK AT SCALE

  • Goal: manage to a steady state

Declare desired state of system; components move there

System may never be completely “clean”

  • Dynamic / emergent behaviors

Various feedback loops in autonomic systems

Equilibrium may be unstable

  • Lesson: importance of observability

Convenient state querying, summaries

  • Lesson: operator intervention

Need ability to suspend autonomic processes
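The "steady state" idea reduces to a reconciliation loop: declare desired state, observe actual state, and take one convergence step at a time, with the suspend switch the last bullet calls for. A toy model with invented state:

desired = {'pg.1': 3, 'pg.2': 3}        # declared replica counts
actual = {'pg.1': 2, 'pg.2': 1}         # observed replica counts
autonomic_enabled = True                # operator can suspend convergence

def converge_once():
    for pg, want in desired.items():
        if actual[pg] < want:
            actual[pg] += 1             # e.g. schedule one backfill
            print(pg, ':', actual[pg], '/', want, 'replicas')

while autonomic_enabled and actual != desired:
    converge_once()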

SLIDE 47

ENTERPRISE

  • Ecosystem

Ubuntu dominated early OpenStack

RHEL/CentOS dominate enterprise

  • Vendor needs a compelling product

Simple support on open code is a difficult model

Conundrum: better software → reduces product value

Engineering expertise is necessary but not sufficient

  • Inktank Ceph Enterprise

Management layer, GUI (proprietary add-ons)

Enterprise integrations (SNMP, VMWare, Hyper-V)

  • Legacy

Back to talking about iSCSI, NFS, CIFS (as gateway drug)

SLIDE 48

COMMUNITY BUILDING

  • User community

Huge investment in making things easy to deploy

Documentation

Hand-holding over email, IRC

  • Developer community

Forcing tight developer team to use open processes

Email, IRC, public design and code review

  • Ceph Developer Summits

100% online Google hangout, IRC, wiki

Every few months

  • Lesson: developers need employers; partners matter
SLIDE 49

LOOKING FORWARD

SLIDE 50

PERFORMANCE

  • Have demonstrated Ceph works; now users would like it to be faster

  • Polish internal APIs; replace original implementations

OSD backend (XFS + leveldb)

Message passing interface

  • Modularity helps new developers engage
  • Critical mass of developer community stakeholders

Intel, Mellanox, Fujitsu, UnitedStack

Challenge is in shepherding efforts

SLIDE 51

NEW HARDWARE COMING

  • Flash and NVRAM for high IOPS

Locking and threading → improve parallelism

  • Low-power processors for cold storage

Limit data copies, CRC → reduce memory bandwidth

  • Challenge: remain hardware agnostic

Keep interface general

  • Lesson: LGPL is great for infrastructure software
SLIDE 52

ETHERNET DISKS

  • Ethernet-attached HDDs

On-board, general purpose ARM processors

Standard form factor, ethernet instead of SATA

Eliminate usual Intel-based host tier

  • Seagate Kinetic

New key/value interfaces to move beyond block

Well-suited to new shingled drives

Strategy: define a new “standard” interface

  • HGST open ethernet drives

General purpose Linux host on HDD

Standard block interface from host

Strategy: build ecosystem of solutions around an open disk architecture

  • Prediction

Hiding drive capabilities behind new APIs will limit innovation, adoption

Opportunity to leverage existing “software defined” platforms

SLIDE 53

MULTIPLE VENDORS

  • Avoiding vendor lock-in resonates with users
  • Hardware vendor independence

Architect system for commodity hardware

Customers can buy piecemeal or full solutions

  • Open source → software vendor independence

Code is free (as in speech and beer)

  • Need credible competitors

Linux: Red Hat, SUSE, Canonical

Ceph: Red Hat, ?

  • Lesson: being too successful undermines your value prop

SLIDE 54

ACADEMIA → FOSS PIPELINE

  • Incredible innovation in graduate programs
  • Most academic work based on open platforms
  • Very little work survives post-thesis to become free or open source software
  • I see three key problems

Lack of engagement and education about FOSS communities

Pool of employers are dominated by non-free software vendors

Gap between prototype code that is typical at thesis stage and the production quality needed for paying users or venture investors

SLIDE 55

PARTING THOUGHTS

  • Ceph is awesome.
  • Building a successful community around open source technology is just as challenging as the technology.
  • A successful business model (and business environment) is a huge catalyst for driving community.
  • Sacrificing software freedoms to enable the business opportunity is frequently tempting, but unnecessary.
SLIDE 56

THANK YOU!

Sage Weil

CEPH PRINCIPAL ARCHITECT

sage@redhat.com

@liewegas