LA-UR-15-27431
MarFS: A Scalable Near-POSIX File System over Cloud Objects
Kyle E. Lamb, HPC Storage Team Lead
Nov 18th 2015

Why Do We Need a MarFS?
If burst buffers work out (the capacity of in-system NV will grow radically) and campaign storage works out well (leveraging cloud), do we need a parallel file system anymore, or an archive?
[Tier diagram: Memory; Burst Buffer (IOPS/BW tier); Parallel File System (PFS); Campaign Storage (capacity tier); Archive]

Factoids
– LANL HPSS = 53 PB and 543 M files
– Trinity: 2 PB memory, 4 PB flash (11% of HPSS), and 80 PB PFS (150% of HPSS)
– Crossroads may have 5-10 PB memory and 40 PB solid state (100% of HPSS), with data residency measured in days or weeks
HPC pre-Trinity had three tiers (Memory, Parallel File System, Archive); Trinity adds a burst buffer, and post-Trinity adds campaign storage:

– DRAM: 1-2 PB/sec; residence hours; overwritten continuously
– Burst buffer: 4-6 TB/sec; residence hours; overwritten in hours
– Lustre Parallel File System: 1-2 TB/sec; residence days/weeks; flushed in weeks
– Campaign storage: 100-300 GB/sec; residence months-years; flushed in months-years
– HPSS parallel tape archive: 10s GB/sec; residence forever
– Mismatch of POSIX and Object: metadata, security, read/write semantics, efficiency
– How do we
– Alternatives examined: GPFS, Lustre, Panasas, OrangeFS, Cleversafe/Scality/EMC ViPR/Ceph/Swift, Glusterfs, General Atomics Nirvana/Storage Resource Broker/IRODS, Maginatics, Camlistore, Bridgestore, Avere, HDFS
[Architecture diagram: MarFS presents N namespaces (Project A ... Project N), each a directory tree of directories and files; per-namespace PFS MDS A ... N, each scaled out as MDS A.1 ... A.M; Uni, Packed, and Multi object files; Object Repo A ... X]

– N namespaces, backed by N x M MDS file systems (used for metadata only)
– The MDS holds directory metadata; file metadata is hashed by file name across the MDS file systems
– Data is striped across 1 to X object repos
– A file is stored as a Uni object file (one object), a Multi object file (several objects), or a Packed object file (many small files share one object)
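A minimal sketch of that hashing idea, in Python (the hash function, server counts, and repo-selection scheme here are illustrative assumptions, not the MarFS implementation):

    import hashlib

    N_MDS_FILESYSTEMS = 4   # assumed number of MDS file systems per namespace
    N_OBJECT_REPOS = 3      # assumed number of object repos to stripe across

    def pick_mds(filename: str) -> int:
        """Hash the file name to choose which MDS file system holds its metadata."""
        digest = hashlib.md5(filename.encode()).digest()
        return int.from_bytes(digest[:8], "big") % N_MDS_FILESYSTEMS

    def pick_repo(filename: str) -> int:
        """Hash again (salted) to choose an object repo for the file's data."""
        digest = hashlib.md5(("repo:" + filename).encode()).digest()
        return int.from_bytes(digest[:8], "big") % N_OBJECT_REPOS

    if __name__ == "__main__":
        for name in ("FA", "FB", "FC"):
            print(name, "-> MDS", pick_mds(name), "-> repo", pick_repo(name))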
Uni object file (metadata lives under /GPFS-MarFS-md1 ... /GPFS-MarFS-mdN, e.g. in Dir1.1/Dir2.1, alongside a trashdir; /MarFS is the top-level namespace aggregation):
– Attrs: uid, gid, mode, size, dates, etc.
– Xattrs: objid repo=1, id=Obj001, objoffs=0, chunksize=256M, ObjType=Uni, NumObj=1, etc.
– Data: a single object, Obj001, in one of the object systems (Object System 1 ... X)
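As an illustration of those xattrs, here is how the object-locating information could be read off a metadata file, in Python (the xattr name "user.marfs_objid" and the comma-separated payload are assumptions for this sketch, not the exact MarFS schema):

    import os

    # Hypothetical xattr name; MarFS keeps its object id/type info in xattrs
    # on the metadata file (the real attribute name and format may differ).
    MARFS_OBJID_XATTR = "user.marfs_objid"

    def read_objid(md_path: str) -> dict:
        """Read and parse the object-id xattr from a MarFS metadata file."""
        raw = os.getxattr(md_path, MARFS_OBJID_XATTR).decode()
        # Example payload: "repo=1,id=Obj001,objoffs=0,chunksize=256M,ObjType=Uni,NumObj=1"
        return dict(field.split("=", 1) for field in raw.split(","))

    if __name__ == "__main__":
        info = read_objid("/GPFS-MarFS-md1/Dir1.1/Dir2.1/UniFile")
        print("object", info["id"], "in repo", info["repo"], "type", info["ObjType"])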
[Deployment diagram: metadata servers, data repos, and file transfer agents (FTAs); users do data movement on the FTAs]

– Metadata servers: GPFS servers (NSD) with dual-copy RAIDed fast storage
– Data repos: object metadata/data servers with dual-copy RAIDed fast storage
– Batch and interactive FTAs have the enterprise file systems and MarFS mounted; users do data movement here
– Interactive and batch FTAs are separated for object-security and performance reasons
– Metadata servers, data repos, and FTAs all scale out independently
https://github.com/mar-file-system/marfs
https://github.com/pftool/pftool
BOF: Two Tiers Scalable Storage: Building POSIX-Like Namespaces with Object Stores. Date: Nov 18th, 2015. Time: 5:30PM - 7:00PM. Room: Hilton Salon A. Session leaders: Sorin Faibish, Gary A. Grider, John Bent.
– Cleversafe/Scality/EMC ViPR/Ceph/Swift etc. are moving towards multi-personality data lakes over erasure-coded objects; all are young and assume update-in-place for POSIX.
– Glusterfs is probably the closest thing to MarFS. Gluster is aimed more at the enterprise and midrange HPC and less at extreme HPC. It also trades space for update-in-place, which we can live without. Glusterfs is one way to unify file and object systems, MarFS is another, each coming at it from a different stance in the trade space.
– General Atomics Nirvana / Storage Resource Broker / IRODS is optimized for WAN use and HSM metadata rates. There are some capabilities for putting POSIX files over objects, but these are largely via NFS or other methods that try to mimic full file system semantics, including update-in-place, and they are not designed for massive parallelism in a single file.
– Maginatics (from EMC) is in its infancy and isn't a full solution to our problem yet.
– Camlistore appears to be targeted at personal storage.
– Bridgestore is a POSIX namespace over objects, but it puts its metadata in a flat space, so renaming a directory is painful.
– Avere over objects is focused on NFS, so N-to-1 is a non-starter.
– HPSS, SamQFS, or a classic HSM? Their metadata-rate design targets are way too low.
– HDFS metadata doesn't scale well.
– Data repositories: scalable object systems (CDMI, S3, etc.)
– Metadata repositories: POSIX file systems used both as parts of the tree and as parts of a single directory, allowing scaling across the tree and within a single directory
– A small Linux FUSE daemon
– A pretty small parallel batch copy/sync/compare utility (pftool)
– A set of other small parallel batch utilities for management
– A moderate sized library that both FUSE and the batch utilities call
– Spreading very large files across many objects (see the sketch below)
– Packing many small files into one large data object
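As a rough sketch of that spreading, here is the offset-to-object arithmetic in Python, using the chunksize=256M and Obj002.1/Obj002.2 naming that appears in the xattr examples later in this deck (the exact MarFS naming scheme is assumed, not confirmed):

    CHUNKSIZE = 256 * 1024 * 1024  # 256M, matching the chunksize xattr examples below

    def locate(base_objid: str, logical_offset: int) -> tuple[str, int]:
        """Map a logical file offset to (object id, offset within that object)."""
        chunk = logical_offset // CHUNKSIZE
        # Assumed naming: chunk N of a Multi file lives in "<base>.<N+1>", e.g. Obj002.1
        return f"{base_objid}.{chunk + 1}", logical_offset % CHUNKSIZE

    if __name__ == "__main__":
        # A read 300 MiB into a Multi file backed by Obj002.* lands in the second chunk.
        print(locate("Obj002", 300 * 1024 * 1024))   # ('Obj002.2', 46137344)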
Why campaign storage? Economics: PFS RAID is too expensive, the PFS solution is too rich in function, PFS metadata is not scalable enough, and the PFS is designed for scratch use, not years of residency; archive bandwidth is too expensive/difficult and archive metadata is too slow.
Multi object file (same metadata layout):
– Attrs: uid, gid, mode, size, dates, etc.
– Xattrs: objid repo=1, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc.
– Data: two objects, Obj002.1 and Obj002.2 (one per 256M chunk), in the object systems (Object System 1 ... X)
Multi object file, as above but with the data repo recorded as repo=S in the objid xattr:
– Attrs: uid, gid, mode, size, dates, etc.
– Xattrs: objid repo=S, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc.
– Data: Obj002.1 and Obj002.2 in the object systems (Object System 1 ... X)
Packed object file (same metadata layout):
– Attrs: uid, gid, mode, size, dates, etc.
– Xattrs: objid repo=1, id=Obj003, objoffs=4096, chunksize=256M, ObjType=Packed, NumObj=1, Obj=4 of 5, etc.
– Data: this file is the 4th of 5 small files packed into the shared object Obj003, starting at byte offset 4096
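A minimal sketch of how a packed file's bytes would be fetched from the shared object, assuming an S3-style ranged read (the range math follows from objoffs plus the file's size; the helper is hypothetical):

    def packed_byte_range(objoffs: int, file_size: int) -> str:
        """HTTP Range header for one small file packed inside a shared object."""
        # The packed file occupies bytes [objoffs, objoffs + file_size) of Obj003.
        return f"bytes={objoffs}-{objoffs + file_size - 1}"

    if __name__ == "__main__":
        # e.g. a 10 KiB file stored as "Obj 4 of 5" at objoffs=4096 inside Obj003
        print(packed_byte_range(4096, 10 * 1024))   # bytes=4096-14335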
[pftool diagram: a load balancer / scheduler / reporter dispatches work from three queues, a dirs (readdir) queue, a stat queue, and a copy/rsync/compare queue, to parallel workers]
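A minimal single-machine sketch of that queued tree-walk pattern in Python (pftool itself runs as many cooperating parallel processes across the FTAs; the queue names here simply mirror the diagram):

    import os
    import queue
    import threading

    dirs_q = queue.Queue()   # directories waiting for readdir
    work_q = queue.Queue()   # files waiting for stat/copy/compare

    def walker():
        """Pop directories, list them, and feed files to the work queue."""
        while True:
            d = dirs_q.get()
            try:
                for entry in os.scandir(d):
                    if entry.is_dir(follow_symlinks=False):
                        dirs_q.put(entry.path)
                    else:
                        work_q.put(entry.path)
            except OSError:
                pass                    # unreadable directory: skip it
            finally:
                dirs_q.task_done()

    if __name__ == "__main__":
        dirs_q.put("/tmp")
        for _ in range(4):              # a few walker threads, pftool-style
            threading.Thread(target=walker, daemon=True).start()
        dirs_q.join()                   # every queued directory has been listed
        print(work_q.qsize(), "files queued for stat/copy")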
[Site-wide storage diagram; the recoverable details:]

– Premier machine: 2 PB DRAM, 4 PB burst buffer at 2 TB/s, private IB, general and private IO nodes; mounts /localscratch(s), /sitescratch(s), /home, /project
– Capacity machines: ~50-300 TB DRAM, general IO nodes; mount /sitescratch(s), /home, /project
– Local scratch: 100 PB, 1 TB/sec, 1-4 weeks residency
– Site scratch: 10's PB, 100 GB/sec, 1-4 weeks residency
– Campaign (MarFS): 100's PB, 100's GB/sec, a few years residency (erasure)
– HPSS archive: 100 PB, 10 GB/sec, residency forever; parallel tape with a disk cache
– NFS serves /home and /project
– File transfer agents: ~100 batch FTAs at 2-8 GB/sec each (parallel, load-balanced movers), interactive FTA(s), and WAN FTA(s) with special security rules; FTAs mount /localscratch(s), /sitescratch(s), /home, /project, /campaign, HPSS, and /analytics (HDFS or other)
– WAN: 100(s) Gbits/sec
– Analytics machine: potentially disk-full / big-memory, HDFS?; mounts /sitescratch(s), /campaign, /analytics (HDFS or other) and uses an HDFS-to-POSIX shim for access to POSIX resources (/sitescratch(s), /home, /project)
– Interconnect: private IB plus site IB/Ethernet/LNET routers and switches (damselfly)
Per namespace you can allow metadata and data update/read, and you can control those special permissions separately for interactive and batch use (e.g. ud – unlink data).
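A tiny sketch of how such per-namespace interactive vs. batch permission sets could be modeled in Python (every flag name except UD, "unlink data", is an assumption patterned on that abbreviation; these are not the actual MarFS configuration keywords):

    from enum import Flag, auto

    class Perm(Flag):
        RM = auto()   # read metadata  (assumed)
        WM = auto()   # write metadata (assumed)
        RD = auto()   # read data      (assumed)
        WD = auto()   # write data     (assumed)
        UD = auto()   # unlink data

    # Hypothetical policy: interactive users may only read; batch jobs may also write/unlink.
    namespace_perms = {
        "ProjectA": {"interactive": Perm.RM | Perm.RD,
                     "batch": Perm.RM | Perm.WM | Perm.RD | Perm.WD | Perm.UD},
    }

    def allowed(ns: str, mode: str, need: Perm) -> bool:
        return need in namespace_perms[ns][mode]

    if __name__ == "__main__":
        print(allowed("ProjectA", "interactive", Perm.WD))  # False
        print(allowed("ProjectA", "batch", Perm.UD))        # True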
– The password for Object Server access is stashed safely, can be time based, and is sent cryptographically securely to the Object Server on every request.
– Encryption in the data path to the objects can be turned on.
– Encryption at rest could be implemented and is on the futures list.
– Protecting the trash is essential as well.
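An illustrative sketch of that per-request crypto, in the style of S3 HMAC request signing in Python (the header layout and key handling are assumptions for illustration, not the exact mechanism used against the object servers):

    import base64
    import hashlib
    import hmac
    import time

    def sign_request(secret_key: str, method: str, resource: str) -> dict:
        """S3-style auth headers: the secret never travels, only an HMAC over the request."""
        date = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime())
        string_to_sign = f"{method}\n\n\n{date}\n{resource}"
        digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
        return {"Date": date,
                "Authorization": "AWS accesskey:" + base64.b64encode(digest).decode()}

    if __name__ == "__main__":
        print(sign_request("stashed-secret", "GET", "/repo1/Obj001"))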