MarFS and DeltaFS @ LANL: User Level FS Challenges and Opportunities

Los Alamos National Laboratory, LA-UR-17-24107. Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA.

  1. Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

  2. Los Alamos National Laboratory LA-UR-17-24107. MarFS and DeltaFS @ LANL: User Level FS Challenges and Opportunities. Brad Settlemyer, LANL HPC. May 16, 2017. Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

  3. Los Alamos National Laboratory Acknowledgements • G. Grider, D. Bonnie, C. Hoffman, J. Inman, W. Vining, A. Torrez, H.B. Chen (LANL MarFS Team) • Q. Zheng, G. Amvrosiadis, C. Cranor, S. Kadekodi, G. Gibson (CMU-LANL IRHPIT, CMU Mochi) • P. Carns, R. Ross, S. Snyder, R. Latham, M. Dorier (Argonne Mochi) • J. Soumagne, J. Lee (HDF Group Mochi) • G. Shipman (LANL Mochi) • F. Guo, B. Albright (LANL VPIC) • J. Bent, K. Lamb (Seagate)

  4. Los Alamos National Laboratory Overview • 2 Special Purpose User-Level File Systems • No free lunch! Trade off one thing to get something else • Campaign Storage and MarFS • MarFS constrained to its use case; addresses a data center problem • VPIC and DeltaFS • VPIC is an open source, scalable, beautiful, modern C++ plasma physics sim • Scientist has a needle-in-a-haystack problem • Lessons Learned

  5. Los Alamos National Laboratory MarFS

  6. Los Alamos National Laboratory Why MarFS? • Science campaigns often stretch beyond 6 months • ScratchFS purge is more like 12 weeks • Parallel tape $$$ • High capacity PFS $$$ • Need a new (to HPC) set of tradeoffs • Capacity growth over time (~500 PB in 2020) • Support petabyte files (Silverton uses N:1 I/O at huge scale) • Billions of “tiny” files • Scalable append-only workload (desire 1 GB/s/PB) • Requirements blend PFS and object storage capabilities • LANL wants a compromise of both!

  7. Los Alamos National Laboratory MarFS Access Model • Simplify, simplify, simplify • The data plane of object stores presents some issues • Object stores tend to have an ideal object size • Pack small files together • Split large files • Only write access is via PFTool • Users want to use traditional tools • Read-only mount via FUSE • We could make this read-write (e.g. O_APPEND only), but object store security is not the same as POSIX security
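
The split rule can be made concrete. Below is a minimal sketch of how a large file's byte offsets might map onto fixed-size chunk objects; the 256 MiB chunk size and the ObjNNN naming mirror the labels on the next slide's diagram but are illustrative, not MarFS's actual code.

```cpp
// Sketch: map a file offset onto MarFS-style chunk objects.
// The 256 MiB chunk size and "ObjNNN" naming are illustrative only,
// loosely modeled on the chunksize/Obj001 labels in the next slide.
#include <cstdint>
#include <cstdio>
#include <string>

struct ChunkRef {
    std::string object;   // which object holds this byte
    uint64_t    offset;   // offset of the byte within that object
};

ChunkRef locate(uint64_t file_offset, uint64_t chunk_size) {
    char name[16];
    std::snprintf(name, sizeof name, "Obj%03llu",
                  (unsigned long long)(file_offset / chunk_size + 1));
    return { name, file_offset % chunk_size };
}

int main() {
    const uint64_t chunk = 256ull << 20;          // 256 MiB ideal object size
    ChunkRef r = locate(3ull << 30, chunk);       // byte 3 GiB into the file
    std::printf("%s @ %llu\n", r.object.c_str(),
                (unsigned long long)r.offset);    // prints: Obj013 @ 0
}
```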

  8. Los Alamos National Laboratory Uni-Object and Multi-Object Files [Diagram: the /MarFS top-level namespace aggregates GPFS metadata file systems (/GPFS-MarFS-md1, /GPFS-MarFS-md2), each holding lazy tree info and a trashdir, over object repos (Scality, S3, erasure, etc.) whose access methods live in a config file/db. A uni-object file (UniFile-1) carries xattrs naming its repo and object id (repo=1, id=Obj001). A multi-object file (MultiFile-1) carries xattrs for repo=2, chunksize=256M, and objtype=Multi, meaning it is a multi-part file whose GPFS mdfile lists, per chunk, the object namespace/objname/offset/length (Obj001 at offset 0, Obj002, ...), plus an Xattr-restart marker. All metadata is normal except mtime and size, which are set by pftool/fuse on close.]
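
A hedged sketch of the xattr side of this picture: recording repo, object type, and chunk size on the metadata file with Linux setxattr(2). The user.marfs.* key names and the path are hypothetical stand-ins for whatever key strings MarFS actually uses.

```cpp
// Sketch: record multi-object file metadata as xattrs on the GPFS
// metadata file, mirroring the repo/chunksize/objtype attributes in
// the diagram. Key names and the path are illustrative, not MarFS's
// actual strings. Linux-only (sys/xattr.h).
#include <sys/xattr.h>
#include <cstdio>
#include <cstring>

static int set_attr(const char *path, const char *key, const char *val) {
    // the user.* namespace is writable by ordinary users on Linux
    if (setxattr(path, key, val, std::strlen(val), 0) != 0) {
        std::perror(key);
        return -1;
    }
    return 0;
}

int main() {
    const char *md = "/GPFS-MarFS-md1/Dir1.1/MultiFile-1"; // metadata stub file
    set_attr(md, "user.marfs.repo",      "2");
    set_attr(md, "user.marfs.objtype",   "Multi");
    set_attr(md, "user.marfs.chunksize", "268435456");     // 256 MiB
}
```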

  9. Los Alamos National Laboratory Pftool • A highly parallel copy/rsync/compare/list tool • Walks the tree in parallel; copies/rsyncs/compares in parallel • Parallel readdirs, stats, and copy/rsync/compare – Dynamic load balancing – Repackaging: breaks up big files, coalesces small files – To/From NFS/POSIX/parallel FS/MarFS [Diagram: a scheduler with a load balancer and reporter feeds a dirs queue, a stat queue, and a copy/rsync/compare queue; readdir and stat workers refill the queues, and finished items land on a done queue.]
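
The scheduler/queue structure reduces to a familiar pattern: workers pull from shared queues, directory work fans out into stat work, and stat work fans out into copies. The sketch below collapses pftool's per-phase queues and MPI ranks into one thread-based queue purely to show the dynamic load balancing idea; it is not pftool's implementation.

```cpp
// Sketch of pftool-style dynamic load balancing with one shared queue:
// "readdir" tasks enqueue "stat" tasks, "stat" tasks enqueue "copy"
// tasks, and idle workers always grab the next available item.
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::function<void()>> q;
std::mutex m;
std::condition_variable cv;
std::atomic<int> outstanding{0};        // tasks queued or currently running

void push(std::function<void()> t) {
    outstanding++;
    { std::lock_guard<std::mutex> lk(m); q.push(std::move(t)); }
    cv.notify_one();
}

void worker() {
    for (;;) {
        std::function<void()> t;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || outstanding == 0; });
            if (q.empty()) return;       // no work left anywhere: exit
            t = std::move(q.front()); q.pop();
        }
        t();
        if (--outstanding == 0) {                // last task just finished
            std::lock_guard<std::mutex> lk(m);   // pair with waiters' check
            cv.notify_all();                     // wake idle workers to exit
        }
    }
}

int main() {
    for (int d = 0; d < 4; d++)          // one "readdir" per directory
        push([d] {
            for (int f = 0; f < 3; f++)  // each dir yields "stat" tasks
                push([d, f] {
                    push([d, f] {        // each stat yields a "copy"
                        std::printf("copy dir%d/file%d\n", d, f);
                    });
                });
        });
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; i++) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
}
```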

  10. Los Alamos National Laboratory Moving Data into MarFS Efficiently [Diagram: a user submits a batch job from the front end: pfcp -r /scratch1/fs1 /marfs/fs1. The FTA cluster (FTA1–FTA6), a collection of pftool worker nodes capable of performing data movement in parallel, drains Scratch1 (78 PB) into Store1 (~3 PB), Store2 (38 PB), and Store3 (38 PB).]

  11. Los Alamos National Laboratory Current MarFS Deployment at LANL • Successfully deployed during Trinity Open Science (6 PB of Scality storage (1 RING), 1 GPFS cluster) • Small file packing features weren’t efficient (yet) • Cleanup tools needed better feature sets • It worked! • PFTool cluster deployed for enclaves • Transfers instrumented with Darshan • Deployed 32 PB of MarFS campaign storage for Trinity Phase 2 • Plan to grow that by 50–100 PB every year during Trinity deployment • SMR drive performance is challenging

  12. Los Alamos National Laboratory Future MarFS Deployment at LANL • 2020: 350 PB • Parity of 10+2 over multiple ZFS pools [Diagram: file transfer agents and metadata servers in front of 13 storage nodes in separate racks; each 10+2 stripe (10 data + 2 parity) is round-robined across the storage nodes; each storage node hosts multiple JBODs and four zpools (Zpool 1–4), each zpool a 17+3; storage nodes NFS-export to the FTAs.]
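
A sketch of what "round-robined to storage nodes" can mean for a 10+2 stripe over 13 nodes: rotate each stripe's starting node so no node holds two shards of one stripe and parity doesn't pin to a fixed pair of nodes. The rotation rule below is an illustrative guess, not the production layout.

```cpp
// Sketch: rotate a 12-shard (10 data + 2 parity) erasure stripe
// across 13 storage nodes. The rotation rule is an assumption for
// illustration; only the 10+2-over-13-nodes shape comes from the slide.
#include <cstdio>

int main() {
    const int shards = 12;   // 10 data + 2 parity
    const int nodes  = 13;   // storage nodes in the diagram
    for (int stripe = 0; stripe < 3; stripe++) {
        std::printf("stripe %d:", stripe);
        for (int s = 0; s < shards; s++)
            std::printf(" n%02d", (stripe + s) % nodes);  // simple rotation
        std::printf("\n");   // each stripe skips a different node
    }
}
```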

  13. Los Alamos National Laboratory MarFS at LANL [Chart: MarFS growth over time across Trinity Open Science, Trinity Production (current), and the future MarFS deployment (Crossroads 2020); key: one block = 4 PB.] • Successfully deployed during Trinity Open Science (6 PB of Scality storage (1 RING), 1 GPFS cluster) • Some of the small file packing features aren’t efficient yet • Some of the cleanup tools need better feature sets • But it actually works • All PFTool transfers were logged with Darshan during Open Science • Unfortunately, I haven’t had the opportunity to parse that data yet • Currently deploying 32 PB of MarFS campaign storage for Trinity Phase 2 • Plan is to grow that by 50–100 PB every year during Trinity deployment

  14. Los Alamos National Laboratory DeltaFS

  15. Los Alamos National Laboratory Why DeltaFS? • DeltaFS provides an extremely scalable metadata plane • Don’t want to provision the MDS to suit the most demanding application • Purchase the MDS to address all objects in the data plane • DeltaFS allows applications to be as metadata-intensive as desired • But this is not the DeltaFS vision talk; instead I will talk about what exists • VPIC is a good, but imperfect, match for DeltaFS • State of the art is a single h5part log for VPIC that has written more than 1 trillion particles (S. Byna) • Not much opportunity to improve write performance

  16. Los Alamos National Laboratory Vector Particle-in-Cell (VPIC) Setup • 128 million particles/node • Particles/node should be higher • Need to store a sample in the range of 0.1–10% • Scientist is interested in the trajectories of the particles with the highest energy at the end of the simulation • At the end of the simulation, it is easy to dump the highest-energy particle ids • Hard to get those trajectories
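
Storing a 0.1–10% sample while still following trajectories means the same particles must be kept at every timestep, which suggests sampling deterministically on particle id rather than at random per dump. A minimal sketch, with an assumed 1% rate and a splitmix64-style mix; VPIC's actual sampling may differ.

```cpp
// Sketch: deterministic id-based sampling so the same particles are
// kept every timestep (required to follow a trajectory). The rate and
// the hash are illustrative assumptions, not VPIC's scheme.
#include <cstdint>
#include <cstdio>

bool keep(uint64_t particle_id, uint64_t rate_denom /* 100 => 1% */) {
    // splitmix64-style mix so nearby ids don't cluster into the sample
    uint64_t x = particle_id + 0x9e3779b97f4a7c15ull;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
    x ^= x >> 31;
    return x % rate_denom == 0;
}

int main() {
    long kept = 0;
    for (uint64_t id = 0; id < 1000000; id++) kept += keep(id, 100);
    std::printf("kept %ld of 1000000\n", kept);   // ~10000 at a 1% rate
}
```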

  17. Los Alamos National Laboratory DeltaFS for VPIC • Scientist wants trajectories of the N highest-energy particles • A trajectory is ~40 bytes/particle, once per timestep dump/cycle • Current state of the art (HDF5+H5Part+Byna): 1. At each timestep dump/cycle, write particles in log order 2. At end of simulation, identify the N highest-energy particles 3. Sort each timestep dump by particle id in parallel 4. Extract the N high-energy trajectories (open, seek, read, open, seek, read) • DeltaFS experiments • Original goal: create a file+object per particle • New goal: 2500 Trinity nodes -> 1–4T particles -> 40–160 TB per timestep • Speed up particle extraction, minimize slowdown during write (TANSTAAFL)
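
Step 4 above is where the open/seek/read cost lives. A sketch, assuming a simplified dump format of fixed 40-byte records led by an 8-byte particle id and sorted by id (the real H5Part layout differs): each trajectory point costs one binary search per timestep file.

```cpp
// Sketch of trajectory extraction from id-sorted dumps. The file
// layout (40-byte records, leading 8-byte id, one file per timestep
// named dump.tNNNN.sorted) is a hypothetical stand-in for H5Part.
#include <cstdint>
#include <cstdio>

const int REC = 40;                       // bytes per particle per timestep

// Binary-search a sorted dump for `id`; copy its record into `out`.
bool find_record(std::FILE *f, uint64_t id, char out[REC]) {
    std::fseek(f, 0, SEEK_END);
    long lo = 0, hi = std::ftell(f) / REC;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        uint64_t rec_id;
        std::fseek(f, mid * REC, SEEK_SET);
        if (std::fread(&rec_id, sizeof rec_id, 1, f) != 1) return false;
        if (rec_id == id) {
            std::fseek(f, mid * REC, SEEK_SET);
            return std::fread(out, REC, 1, f) == 1;
        }
        if (rec_id < id) lo = mid + 1; else hi = mid;
    }
    return false;
}

int main() {
    char rec[REC];
    for (int t = 0; t < 100; t++) {       // one open/seek/read per timestep
        char path[64];
        std::snprintf(path, sizeof path, "dump.t%04d.sorted", t);
        if (std::FILE *f = std::fopen(path, "rb")) {
            if (find_record(f, /*particle id*/ 123456789ULL, rec))
                std::fwrite(rec, REC, 1, stdout);
            std::fclose(f);
        }
    }
}
```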

  18. Los Alamos National Laboratory Wait, What!?! • Always confident we could create 1T files • Key-value metadata organization • TableFS -> IndexFS -> BatchFS -> DeltaFS • Planning papers assumed a scalable object store as the data plane • RADOS has some cool features that may have worked • No RADOS on Trinity without root/vendor assist! • Cannot create a data plane faster than Lustre, DataWarp • Trillions of files open at one time • 40-byte appends! (the dribble write problem) • Bitten off far more than originally intended • DeltaFS scope increased to optimize the VPIC data plane
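
The 40-byte append problem has a standard mitigation: buffer the dribble in memory and emit multi-megabyte writes. A minimal sketch, with an assumed 8 MiB flush threshold; DeltaFS's actual data path additionally indexes these buffers LSM-style rather than writing a flat log.

```cpp
// Sketch: batch tiny (e.g. 40-byte) appends into large sequential
// writes. The 8 MiB threshold and flat-log output are assumptions;
// they only illustrate the buffering idea, not DeltaFS itself.
#include <cstdio>
#include <cstring>
#include <vector>

class AppendBuffer {
    std::vector<char> buf_;
    std::FILE *out_;
    size_t flush_at_;
public:
    AppendBuffer(std::FILE *out, size_t flush_at = 8u << 20)  // 8 MiB batches
        : out_(out), flush_at_(flush_at) { buf_.reserve(flush_at); }
    void append(const void *rec, size_t n) {     // one tiny particle record
        const char *p = static_cast<const char *>(rec);
        buf_.insert(buf_.end(), p, p + n);
        if (buf_.size() >= flush_at_) flush();   // amortize into big writes
    }
    void flush() {
        if (!buf_.empty()) {
            std::fwrite(buf_.data(), 1, buf_.size(), out_);
            buf_.clear();
        }
    }
    ~AppendBuffer() { flush(); }
};

int main() {
    std::FILE *f = std::fopen("particles.log", "wb");
    if (!f) return 1;
    AppendBuffer ab(f);
    char rec[40] = {};                     // ~40 bytes/particle/timestep
    for (long i = 0; i < 1000000; i++) {
        std::memcpy(rec, &i, sizeof i);    // stand-in for real particle data
        ab.append(rec, sizeof rec);        // dribble append, batched inside
    }
    ab.flush();
    std::fclose(f);
}
```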

  19. Los Alamos National Laboratory FS Representation vs. Storage Representation

  20. Los Alamos National Laboratory DeltaFS is a composed User-Level FS

  21. Los Alamos National Laboratory Write Indexing Overhead

  22. Los Alamos National Laboratory Trajectory Extraction Time

  23. Los Alamos National Laboratory DeltaFS UDF (only the VPIC example works so far)

  24. Los Alamos National Laboratory Summary
