MarFS : A Scalable Near-POSIX File System over Cloud Objects Kyle - - PowerPoint PPT Presentation

marfs a scalable near posix file system over cloud objects
SMART_READER_LITE
LIVE PREVIEW

MarFS : A Scalable Near-POSIX File System over Cloud Objects Kyle - - PowerPoint PPT Presentation

MarFS : A Scalable Near-POSIX File System over Cloud Objects Kyle E. Lamb HPC Storage Team Lead Nov 18 th 2015 LA-UR-15-27431 Why Do We Need a MarFS HPC Post Trinity HPC Pre Trinity HPC At Trinity 1-2 PB/sec Memory Residence hours


slide-1
SLIDE 1

LA-UR-15-27431

MarFS: A Scalable Near-POSIX File System

  • ver Cloud Objects

Kyle E. Lamb HPC Storage Team Lead Nov 18th 2015

slide-2
SLIDE 2

Why Do We Need a MarFS

  • If the Burst Buffer does its job very well (and indications

are capacity of in system NV will grow radically) and campaign storage works out well (leveraging cloud), do we need a parallel file system anymore, or an archive?

  • Maybe just a bw/iops tier and a capacity tier.
  • Too soon to say, seems feasible longer term

Memory Burst Buffer Parallel File System (PFS) Campaign Storage Archive Memory IOPS/BW Tier Parallel File System (PFS) Capacity Tier Archive Factoids

LANL HPSS = 53 PB and 543 M files Trinity 2 PB memory, 4 PB flash (11% of HPSS) and 80 PB PFS or 150% HPSS) Crossroads may have 5-10 PB memory, 40 PB solid state or 100% of HPSS with data residency measured in days or weeks

Memory Parallel File System Archive HPC Pre Trinity

HPSS Parallel Tape Lustre Parallel File System DRAM 1-2 PB/sec Residence – hours Overwritten – continuous 4-6 TB/sec Residence – hours Overwritten – hours 1-2 TB/sec Residence – days/weeks Flushed – weeks 100-300 GB/sec Residence – months-years Flushed – months-years 10s GB/sec (parallel tape Residence – forever

HPC Post Trinity HPC At Trinity

slide-3
SLIDE 3

How about a Scalable Near-POSIX File System over Object Erasure?

  • Best of both worlds

– Objects Systems

  • Provide massive scaling and efficient erasure
  • Friendly to applications, not to people. People need a name space.
  • Huge Economic appeal (erasure enables use of inexpensive storage)

– POSIX name space is powerful but has issues scaling

  • The challenges

– Mismatch of POSIX an Object metadata, security, read/write semantics, efficient

  • bject/file sizes and no update in place with Objects

– How do we

  • scale POSIX name space to trillions of files/directories
  • leverage massive Object Scalability for “Capacity Tier) with 100X BW of HSM’s but 1/10 BW of PFS
  • with many years of data protection longevity?
  • Looked at

– GPFS, Lustre, Panasas, OrangeFS, Cleversafe/Scality/EMC ViPR/Ceph/Swift, Glusterfs, General Atomics Nirvana/Storage Resource Broker/IRODS, Maginatics, Camlistore, Bridgestore, Avere, HDFS

slide-4
SLIDE 4

Object Repo X

MarFS Scaling (Metadata and Data)

MarFS ProjectA Dir DirA DirA.A DirA.A. A DirA.A.A.A DirB DirA.A.B FA FF FB FD FC FE FI FH FG ProjectN Dir DirA DirA.A DirA.A.A DirA.A.A.A DirB DirA.A.B FA FF FB FD FC FE FI FH FG Namespace Project A Namespace Project N

N N a m e S p a c e s

PFS MDS A PFS MDS N

PFS MDS A.1 PFS MDS A.2 PFS MDS A.M PFS MDS N.1 PFS MDS N.2 PFS MDS N.M Uni Object File Packed Object File Multi Object File

Object Repo A N X M MDS File Systems (for metadata only)

Namespaces MDS holds Directory Metadata File Metadata is hashed

  • ver M multiple MDS

Striping across 1 to X Object Repos File Name Hash File Name Hash

slide-5
SLIDE 5

MarFS Internals Overview

/GPFS-MarFS-md1 /GPFS-MarFS-mdN Dir1.1 UniFile - Attrs: uid, gid, mode, size, dates, etc.

Xattrs - objid repo=1, id=Obj001, objoffs=0, chunksize=256M, Objtype=Uni, NumObj=1, etc.

trashdir /MarFS top level namespace aggregation M e t a d a t a D a t a Object System 1 Object System X Dir2.1 Obj001

slide-6
SLIDE 6

Users do data movement here Metadata Servers

Simple MarFS Deployment

GPFS Server (NSD) Dual Copy Raided Fast Stg GPFS Server (NSD) Obj md/d ata server Obj md/d ata server Batch FTA Have your enterprise file systems and MarFS mounted Interactive FTA Have your enterprise file systems and MarFS mounted Batch FTA Have your enterprise file systems and MarFS mounted Separate interactive and batch FTAs due to object security and performance reasons. Data Repos Dual Copy Raided Fast Stg Scale Scale Scale

slide-7
SLIDE 7

Open Source, BSD License Partners Welcome

https://github.com/mar-file-system/marfs https://github.com/pftool/pftool)

Thank You For Your Attention

BOF: Two Tiers Scalable Storage: Building POSIX-Like Namespaces with Object Stores Date: Nov 18th, 2015 Time: 5:30PM - 7:00PM Ron: Hilton Salon A Session leaders : Sorin Faibish, Gary A. Grider, John Bent

slide-8
SLIDE 8

Backup

slide-9
SLIDE 9

Won’t someone else do it?

  • There is evidence others see the need but no magic bullets yet: (partial list)

– Cleversafe/Scality/EMC ViPR/Ceph/Swift etc. are moving towards multi-personality data lakes over erasure coded objects, all are young and assume update in place for posix – Glusterfs is probably the closes thing to MarFS. Gluster is aimed more for the enterprise and midrange HPC and less for extreme HPC. It also is making the trade off space for update in place which we can live without. Glusterfs is a way to unify file and object systems, MarFS is another, each coming at it from a different stance in trade space – General Atomics Nirvana, Storage Resource Broker/IRODS is optimized for WAN and HSM metadata rates. There are some capabilities for putting POSIX files over objects, but these methods are largely via NFS or other methods that try to mimic full file system semantics including update in place. These methods are not designed for massive parallelism in a single file, etc. – Maginatics from EMC but it is in its infancy and isnt a full solution to our problem yet. – Camlistore appears to be targeted and personal storage. – Bridgestore is a POSIX name space over objects but they put their metadata in a flat space so rename of a directory is painful. – Avere over objects is focused at NFS so N to 1 is a non starter. – HPSS or SamQFS or a classic HSM? The metadata rates design target way too low. – HDFS metadata doesn’t scale well.

slide-10
SLIDE 10

MarFS Requirements

  • Linux system(s) with C/C++ and FUSE support
  • MPI for parallel comms in Pftool (a parallel data transfer tool)

– MPI library can use many comm methods like TCP/IP, Infiniband OFED, etc.

  • Support lazy data and metadata quotas per user per name

space

  • Wide parallelism for data and metadata
  • Try hard not to walk trees for management (use inode scans etc.)
  • Use trash mechanism for user recovery
  • If use MarFS to combine multiple POSIX file systems into one

mount point, any set of POSIX file systems can be used.

  • Multi-node parallelism MD FS’s must be globally visible somehow
  • Using object store data repo, object store needs globally visible.
  • The MarFS MD FS’s must be capable of POSIX xattr and sparse

– don’t have to use GPFS, we use due to ILM inode scan features

slide-11
SLIDE 11

What is MarFS?

  • Near-POSIX global scalable name space over many POSIX and non POSIX

data repositories (Scalable object systems - CDMI, S3, etc.)

  • It scales name space by sewing together multiple POSIX file systems

both as parts of the tree and as parts of a single directory allowing scaling across the tree and within a single directory

  • It is small amount of code (C/C++/Scripts)

– A small Linux Fuse – A pretty small parallel batch copy/sync/compare/ utility – A set of other small parallel batch utilities for management – A moderate sized library both FUSE and the batch utilities call

  • Data movement scales just like many scalable object systems
  • Metadata scales like NxM POSIX name spaces both across the tree and

within a single directory

  • It is friendly to object systems by

– Spreading very large files across many objects – Packing many small files into one large data object

slide-12
SLIDE 12

What are all these storage layers? Why do we need all these storage layers?

  • Why

– BB: Economics (disk bw/iops too expensive)

Memory Burst Buffer Parallel File System Campaign Storage Archive Memory Parallel File System Archive HPC Before Trinity HPC After Trinity

1-2 PB/sec Residence – hours Overwritten – continuous 4-6 TB/sec Residence – hours Overwritten – hours 1-2 TB/sec Residence – days/weeks Flushed – weeks 100-300 GB/sec Residence – months-year Flushed – months-year 10s GB/sec (parallel tape Residence – forever HPSS Parallel Tape Lustre Parallel File System DRAM 

Campaign: Economics (PFS Raid too expensive, PFS solution too rich in function, PFS metadata not scalable enough, PFS designed for scratch use not years residency, Archive BW too expensive/difficult, Archive metadata too slow)

slide-13
SLIDE 13

What it is not!

  • Doesn’t allow update file in place for object data

repo’s ( no seeking around and writing – it isnt a parallel file system)

  • FUSE

– Does not check for or protect against multiple writers into the same file (when writing into object repos), use batch copy utility or library to do this efficiently) – Fuse is targeted at interactive use – Writing to object backed files works but FUSE will not create data objects that are packed as optimized as the parallel copy utility. – Batch utilities to reshape data written by fuse

slide-14
SLIDE 14

MarFS Internals Overview Multi-File

/GPFS-MarFS-md1 /GPFS-MarFS-mdN Dir1.1 MultiFile - Attrs: uid, gid, mode, size, dates, etc.

Xattrs - objid repo=1, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc.

trashdir /MarFS top level namespace aggregation M e t a d a t a D a t a Object System 1 Object System X Dir2.1 Obj002.1 Obj002.2

slide-15
SLIDE 15

MarFS Internals Overview Multi-File

(striped Object Systems)

/GPFS-MarFS-md1 /GPFS-MarFS-mdN Dir1.1 MultiFile - Attrs: uid, gid, mode, size, dates, etc.

Xattrs - objid repo=S, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc.

trashdir /MarFS top level namespace aggregation M e t a d a t a D a t a Object System 1 Object System X Dir2.1 Obj002.1 Obj002.2

slide-16
SLIDE 16

MarFS Internals Overview Packed-File

/GPFS-MarFS-md1 /GPFS-MarFS-mdN Dir1.1 UniFile - Attrs: uid, gid, mode, size, dates, etc.

Xattrs - objid repo=1, id=Obj003, objoffs=4096, chunksize=256M, Objtype=Packed, NumObj=1, Ojb=4 of 5, etc.

trashdir /MarFS top level namespace aggregation M e t a d a t a D a t a Object System 1 Object System X Dir2.1 Obj003

slide-17
SLIDE 17

Configuration

  • Top level MarFS mountpoint
  • Stanza for every name space/metadata file

system you want to bring into MarFS

– Describes how metadata is to be handled in this part of the MarFS system – Describes where data is to be put for files of various sizes/shapes (into which data repo)

  • Stanza for every Repo - object system/area of
  • bject system, or other access method for

data

– Describes how data is to be stored into this data repo, chunksizes/methods/etc.

slide-18
SLIDE 18

Recoverability

  • Parallel backup of the metadata file system(s)

backs up the metadata for MarFS

– (METADATA ONLY)

  • Objects are encoded with CREATE TIME info

– (path,uidgid,mode,times, MarFS/other xattrs, etc.

  • Parallel backup of object metadata

– Object name has some recovery info in it, so just listing/saving the objects/buckets is useful – Any other info your object server will allow you to back up

slide-19
SLIDE 19

Pftool

  • A highly parallel copy/rsync/compare/list tool
  • Walks tree in parallel, copy/rsync/compare in

parallel.

  • Parallel Readdir’s, stat’s, and copy/rsinc/compare

– Dynamic load balancing – Restart-ability for large trees or even very large files – Repackaging: breaks up big files, coalesces small files – To/From NFS/POSIX/parallel FS/MarFS

Load Balancer Scheduler Reporter Stat Readdir Stat Copy/Rsync/Co mpare

D

  • n

e Q u e u e

Dirs Queue Stat Queue Cp/R/C Queue

slide-20
SLIDE 20

4 PB BB 2 TB/S

How does it fit into our environment (circa FY16) ?

Premier Machine 2PB Dram General IO Nodes Private IO Nodes Capacity machines ~50-300 TB Dram General IO Nodes Capacity machines ~50-300 TB Dram General IO Nodes

/localscratch(s) /sitescratch(s) /home /project /sitescratch(s) /home /project

Local Scratch 100 PB 1 TB/sec 1-4 Weeks Site Scratch 10’s PB 100 GB/sec 1-4 Weeks

Private IB Site IB/Ether/Lnet Routers/switches (damselfly)

HPSS 100 PB 10 GB/sec Forever Site Scratch 10’s PB 100 GB/sec 1-4 Weeks Campaign MarFS 100’s PB 100’s GB/sec Few Years (erasure) Batch File Transfer Agents ~100 at 2-8 GB/sec per Interactive FTA(s) WAN FTA(s) Special Security Rules

100(s) Gbits/sec

/localscratch(s) /sitescratch(s) /home /project /campaign HPSS /analytics (HDFS other)

Parallel load balanced movers

Parallel Tape with Disk Cache

NFS /home /project Analytics machine potentially disk full/big memory HDFS?

/sitescratch(s) /campaign /analytics (HDFS other) Use HDFS – POSIX Shim for access to POSIX resources /sitescratch(s) /home /project

slide-21
SLIDE 21

Security Model

  • All POSIX security is obeyed by MarFS
  • Addition special security can be added by configuration to manage what parts of

the name space allow metadata and data update/read – and you can control those special permissions for interactive and batch separately per name space.

  • rm – read metadata wm – write metadata rd – read data wd – write

data ud – unlink data

  • Can lock down data read/write separately from metadata read/update
  • Value is not stored with the file, real time, fast way to control access.
  • Object Security is provided by the following methods

– Password for Object Server access is stashed safely, can be time based, crypto securely sent to Object Server on every request. – Encryption in the data path to objects can be turned on – Encryption at rest could be implemented and is on the futures list. – Protecting the trash is essential as well

slide-22
SLIDE 22

Futures

  • File data versioning – using data pointers in trash
  • Dual copy/MarFS erasure to enable erasure on erasure
  • Metadata update logging, investigate how this might be done and

what the cost is

  • Compression and Encrypton in MarFS ( for repos that don’t

compress)

  • Offline optimizations/sorting/indexing of attrs and user xattrs etc.
  • Append or sparse support, need to consider carefully, hard to do

because of book keeping

  • Other access methods HPSS, Globus, other.
  • HDFS alternate access of same data, via java hdfs lib
  • Offline deep reconcile/repack – if trash is lost
  • Semi-direct – store data into parallel file system file(s) for Globus or
  • ther parallel N to 1 write staging