SLIDE 1

Cloud Filesystem

Jeff Darcy for BBLISA, October 2011

SLIDE 2

What is a Filesystem?

  • “The thing every OS and language knows”
  • Directories, files, file descriptors
  • Directories within directories
  • Operate on a single record (POSIX: a single byte) within a file
  • Built-in permissions model (e.g. UID, GID, ugo·rwx)
  • Defined concurrency behaviors (e.g. fsync; see the sketch below)
  • Extras: symlinks, ACLs, xattrs
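
A minimal C sketch of those last two points, byte-granular access within a file plus fsync for durability; the path and offset are arbitrary examples, not anything from the talk:

```c
/* Sketch: byte-granular POSIX I/O plus fsync.
 * The path and offset are arbitrary examples. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/example.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Overwrite a single byte in the middle of the file. */
    if (pwrite(fd, "X", 1, 4096) != 1) { perror("pwrite"); return 1; }

    /* fsync defines when that byte is durable on stable storage. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```

The same program works unchanged whether the file lives on a local disk, NFS, or GlusterFS, which is the point of keeping filesystem semantics.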
SLIDE 3

Are Filesystems Relevant?

  • Supported by every language and OS natively
  • Shared data with rich semantics
  • Graceful and efficient handling of multi-GB objects
  • Permission model missing in some alternatives
  • Polyglot storage, e.g. a DB to index data in the FS (see the sketch below)
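
A hedged illustration of that polyglot pattern: the bulk data stays in the filesystem while a small SQLite table indexes it. The table, column, and file names below are invented for the example.

```c
/* Sketch of "polyglot storage": bulk data lives in the filesystem,
 * while a small SQLite index maps logical names to file paths.
 * Table, column, and path names are invented for illustration. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("index.db", &db) != SQLITE_OK) return 1;

    const char *sql =
        "CREATE TABLE IF NOT EXISTS assets (name TEXT PRIMARY KEY, path TEXT);"
        "INSERT OR REPLACE INTO assets VALUES ('logo', '/data/images/logo.png');";
    char *err = NULL;
    if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "sqlite: %s\n", err);
        sqlite3_free(err);
    }

    /* The multi-GB payload itself is read with ordinary open()/read()
     * on the path that a SELECT returns. */
    sqlite3_close(db);
    return 0;
}
```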
SLIDE 4

Network Filesystems

  • Extend filesystem to multiple clients
  • Awesome idea so long as total required capacity/performance doesn't exceed a single server
  • ...otherwise you get server sprawl
  • Plenty of commercial vendors, community experience
  • Making NFS highly available brings extra headaches

SLIDE 5

Distributed Filesystems

  • Aggregate capacity/performance across servers
  • Built-in redundancy
  • ...but watch out: not all deal with HA transparently
  • Among the most notoriously difficult kinds of software to set up, tune and maintain

  • Anyone want to see my Lustre scars?
  • Performance profile can be surprising
  • Result: seen as specialized solution (esp. HPC)
SLIDE 6

Example: NFS4.1/pNFS

  • pNFS distributes data access across servers
  • Referrals etc. offload some metadata
  • Only a protocol, not an implementation
  • OSS clients, proprietary servers
  • Does not address metadata scaling at all
  • Conclusion: a partial solution, good for compatibility; a full solution might layer on top of something else

SLIDE 7

Example: Ceph

  • Two-layer architecture
  • Object layer (RADOS) is self-organizing (see the librados sketch below)
  • can be used alone for block storage via RBD
  • Metadata layer provides POSIX file semantics on top of RADOS objects
  • Full-kernel implementation
  • Great architecture; some day it will be a great implementation
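
To make the two-layer split concrete, here is a hedged sketch of writing a single object straight into the RADOS layer through the librados C API; the pool name, object name, and config path are placeholders, not values from the talk.

```c
/* Hedged sketch: storing one object directly in the RADOS layer via
 * the librados C API.  "mypool", "greeting", and the config path are
 * placeholders. */
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;

    if (rados_create(&cluster, NULL) < 0) return 1;
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    if (rados_connect(cluster) < 0) { rados_shutdown(cluster); return 1; }

    if (rados_ioctx_create(cluster, "mypool", &io) < 0) {
        rados_shutdown(cluster);
        return 1;
    }

    /* Write 5 bytes at offset 0 of object "greeting". */
    rados_write(io, "greeting", "hello", 5, 0);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}
```

The Ceph metadata layer builds directories, inodes, and POSIX semantics on top of exactly this kind of object access.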

SLIDE 8

Ceph Diagram

[Diagram: a client talks to two layers, a Ceph metadata layer (metadata servers) and a RADOS layer (data servers).]

SLIDE 9

Example: GlusterFS

  • Single-layer architecture
  • sharding instead of layering
  • one type of server – data and metadata
  • Servers are dumb, smart behavior driven by clients

  • FUSE implementation (minimal skeleton sketched below)
  • Native, NFSv3, UFO, Hadoop
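
For readers unfamiliar with FUSE, here is a minimal skeleton (illustrative only, not GlusterFS code) showing how a userspace daemon like the GlusterFS client answers kernel filesystem requests:

```c
/* Minimal FUSE skeleton (illustrative only, not GlusterFS code):
 * a userspace process answers kernel filesystem requests.
 * Exposes a single empty read-only root directory. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    st->st_mode = S_IFDIR | 0755;   /* the root is a directory */
    st->st_nlink = 2;
    return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                         off_t off, struct fuse_file_info *fi)
{
    (void)off; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    fill(buf, ".", NULL, 0);
    fill(buf, "..", NULL, 0);
    return 0;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```

Built against FUSE 2.x (e.g. gcc with the flags from pkg-config fuse). The real GlusterFS client is far richer, but the request path through the kernel is the same.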
SLIDE 10

GlusterFS Diagram

[Diagram: a single client spreads both data and metadata across bricks A through D; every brick stores data and metadata, with no separate metadata tier.]

SLIDE 11

OK, What About HekaFS?

  • Don't blame me for the name
  • trademark issues are a distraction from real work
  • Existing DFSes solve many problems already
  • sharding, replication, striping
  • What they don't address is cloud-specific deployment

  • lack of trust (user/user and user/provider)
  • location transparency
  • operationalization
SLIDE 12

Why Start With GlusterFS?

  • Not going to write my own from scratch
  • been there, done that
  • leverage existing code, community, user base
  • Modular architecture allows adding functionality via an API

  • separate licensing, distribution, support
  • By far the best configuration/management
  • OK, so it's FUSE
  • not as bad as people think, and you can always add more servers
SLIDE 13

HekaFS Current Features

  • Directory isolation
  • ID isolation
  • “virtualize” between the server's ID space and tenants' ID spaces
  • SSL
  • encryption useful on its own
  • authentication is needed by other features
  • At-rest encryption
  • Keys ONLY on clients
  • AES-256 through AES-1024, “ESSIV-like” (IV scheme sketched below)
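
The slides don't show HekaFS's exact code, but the general ESSIV idea can be sketched in a few lines: each block's IV is the block number encrypted under a hash of the data key, so IVs are unpredictable yet never need to be stored. The function and variable names below are invented.

```c
/* Hedged sketch of an ESSIV-style IV derivation (the general dm-crypt
 * technique, not HekaFS's exact code): each block's IV is the block
 * number encrypted under a hash of the data key. */
#include <openssl/aes.h>
#include <openssl/sha.h>
#include <stdint.h>
#include <string.h>

void essiv_iv(const unsigned char *data_key, size_t key_len,
              uint64_t block_no, unsigned char iv[16])
{
    unsigned char salt[SHA256_DIGEST_LENGTH];
    unsigned char block[16] = {0};
    AES_KEY iv_key;

    /* IV key = SHA-256(data key); like the data key, it never
     * leaves the client. */
    SHA256(data_key, key_len, salt);
    AES_set_encrypt_key(salt, 256, &iv_key);

    /* IV = AES_ivkey(block number), block number in a zero-padded
     * 16-byte block. */
    memcpy(block, &block_no, sizeof(block_no));
    AES_encrypt(block, iv, &iv_key);
}
```

dm-crypt uses the same construction for disk encryption, which is presumably what “ESSIV-like” alludes to here.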
SLIDE 14

HekaFS Future Features

  • Enough of multi-tenancy, now for other stuff
  • Improved (local/sync) replication
  • lower latency, faster repair
  • Namespace (and small-file?) caching
  • Improved data integrity
  • Improved distribution
  • higher server counts, smoother reconfiguration
  • Erasure codes?
SLIDE 15

HekaFS Global Replication

  • Multi-site asynchronous
  • Arbitrary number of sites
  • Write from any site, even during partition
  • ordered, eventually consistent with conflict resolution (see the version-vector sketch after this list)
  • Caching is just a special case of replication
  • interest is expressed (and withdrawn), not assumed
  • Some infrastructure is being done early, as part of the local replication work
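
The slides don't say how conflicts are detected; one common building block for ordered, eventually consistent replication is a per-file version vector, sketched here purely for illustration (NSITES and all names are invented).

```c
/* Purely illustrative: version vectors are one common way to detect
 * concurrent (conflicting) updates in multi-site replication.
 * The talk does not specify HekaFS's actual mechanism. */
#include <stdio.h>

#define NSITES 3   /* illustrative number of replication sites */

typedef struct { unsigned v[NSITES]; } vclock;

/* Returns 1 if a dominates b, -1 if b dominates a, and 0 if the
 * two updates are concurrent, i.e. a conflict needing resolution. */
int vclock_compare(const vclock *a, const vclock *b)
{
    int a_ge = 1, b_ge = 1;
    for (int i = 0; i < NSITES; i++) {
        if (a->v[i] < b->v[i]) a_ge = 0;
        if (b->v[i] < a->v[i]) b_ge = 0;
    }
    if (a_ge && !b_ge) return 1;
    if (b_ge && !a_ge) return -1;
    return 0;
}

int main(void)
{
    vclock x = {{2, 1, 0}}, y = {{1, 2, 0}};
    printf("compare = %d\n", vclock_compare(&x, &y));   /* prints 0: conflict */
    return 0;
}
```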

SLIDE 16

Project Status

  • All open source
  • code hosted by Fedora, bugzilla by Red Hat
  • Red Hat also pays me (and others) to work on it
  • Close collaboration with Gluster
  • they do most of the work
  • they're open-source folks too
  • completely support their business model
  • “current” = Fedora 16
  • “future” = Fedora 17+ and Red Hat product
SLIDE 17

Contact Info

  • Project
  • http://hekafs.org
  • jdarcy@redhat.com
  • Personal
  • http://pl.atyp.us
  • jeff@pl.atyp.us