MarFS and DeltaFS @ LANL: User Level FS Challenges and Opportunities


SLIDE 1

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

SLIDE 2

Los Alamos National Laboratory

MarFS and DeltaFS @ LANL

Brad Settlemyer, LANL HPC, May 16, 2017

User Level FS Challenges and Opportunities

LA-UR-17-24107

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

SLIDE 3

Acknowledgements

  • G. Grider, D. Bonnie, C. Hoffman, J. Inman, W. Vining, A. Torrez, H.B. Chen (LANL MarFS Team)
  • Q. Zheng, G. Amvrosiadis, C. Cranor, S. Kadekodi, G. Gibson (CMU-LANL IRHPIT, CMU Mochi)
  • P. Carns, R. Ross, S. Snyder, R. Latham, M. Dorier (Argonne Mochi)
  • J. Soumagne, J. Lee (HDF Group Mochi)
  • G. Shipman (LANL Mochi)
  • F. Guo, B. Albright (LANL VPIC)
  • J. Bent, K. Lamb (Seagate)
SLIDE 4

Overview

  • 2 Special Purpose User-Level File Systems
  • No free lunch!
  • Trade off one thing to get something else
  • Campaign Storage and MarFS
  • MarFS constrained to its use case
  • Addresses data center problem
  • VPIC and DeltaFS
  • VPIC is an open-source, scalable, beautiful, modern C++ plasma physics sim
  • The scientist has a needle-in-a-haystack problem
  • Lessons Learned
SLIDE 5

MarFS

SLIDE 6

Why MarFS?

  • Science campaigns often stretch beyond 6 months
  • ScratchFS purge more like 12 weeks
  • Parallel Tape $$$
  • High capacity PFS $$$
  • Need new (to HPC) set of tradeoffs
  • Capacity growth over time (~500PB in 2020)
  • Support Petabyte files (Silverton uses N:1 I/O at huge scale)
  • Billions of “tiny” files
  • Scalable append-only workload (desire 1 GB/s/PB)
  • Requirements blend PFS and Object Storage capabilities
  • LANL wants a compromise of both!
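
A quick back-of-envelope check of what the capacity and bandwidth targets above imply together (numbers from this slide; the arithmetic is only illustrative):

```python
# Aggregate append bandwidth implied by ~500 PB of campaign storage
# at the desired 1 GB/s per PB (figures from the slide above).
capacity_pb = 500
bandwidth_per_pb_gbs = 1

aggregate_gbs = capacity_pb * bandwidth_per_pb_gbs
print(f"Aggregate append-bandwidth target: {aggregate_gbs} GB/s (~{aggregate_gbs / 1000:.1f} TB/s)")
# -> 500 GB/s, roughly 0.5 TB/s across the whole campaign store
```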
SLIDE 7

MarFS Access Model

  • Simplify, simplify, simplify
  • The data plane of object stores presents some issues
  • Object stores tend to have an ideal object size
  • Pack small files together
  • Split large files
  • Only write access is via PFTool
  • Users want to use traditional tools
  • Read-only mount via FUSE
  • We could make this read-write (e.g. O_APPEND only), but object store security is not the same as POSIX security

SLIDE 8

Uni-Object and Multi Object File

[Figure: MarFS file layout. A /MarFS top-level namespace aggregates one or more GPFS metadata file systems (/GPFS-MarFS-md1, /GPFS-MarFS-md2), each with its own trash directory and lazy tree info; a config file/db holds the access-method info for each object repo (Scality, S3, erasure, etc.).

  • Uni-object file (UniFile-1): all metadata is normal POSIX metadata except mtime and size, which are set by pftool/fuse on close. Extended attributes record the object id and repo (repo=1, id=Obj001), the object offset (objoffs=0), and chunksize=256M.
  • Multi-object file (MultiFile-1): extended attributes record repo=2, chunksize=256M, and type=Multi, meaning the file is multi-part and the object/offset list lives in the GPFS metadata file as a list of (object namespace, object name, offset, length) entries, e.g. (obj namespace=2, Obj001 offset/length, Obj002, ...), plus an Xattr-restart.]
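
To make the multi-object layout above concrete, here is a minimal sketch (hypothetical names, not MarFS code) of how a logical offset in a multi-part file could be resolved against an object list like the one kept in the GPFS metadata file, assuming the 256M chunksize from the figure:

```python
# Hypothetical illustration (not MarFS code): resolving a logical file offset
# to an (object, offset-within-object) pair for a multi-part file, assuming
# fixed-size chunks as in the figure (chunksize = 256M).
CHUNKSIZE = 256 * 1024 * 1024  # 256 MiB, matching the chunksize xattr

# Assumed per-chunk object list, in the spirit of the metadata-file contents:
objects = ["repo2/Obj001", "repo2/Obj002", "repo2/Obj003"]

def resolve(logical_offset: int) -> tuple[str, int]:
    """Map a logical offset to (object name, offset inside that object)."""
    chunk = logical_offset // CHUNKSIZE
    if chunk >= len(objects):
        raise ValueError("offset past end of file")
    return objects[chunk], logical_offset % CHUNKSIZE

# A read at logical offset 300 MiB lands 44 MiB into the second object:
print(resolve(300 * 1024 * 1024))   # ('repo2/Obj002', 46137344)
```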

SLIDE 9

Pftool

  • A highly parallel copy/rsync/compare/list tool
  • Walks tree in parallel, copy/rsync/compare in parallel.
  • Parallel readdirs, stats, and copy/rsync/compare
    – Dynamic load balancing
    – Repackaging: breaks up big files, coalesces small files
    – To/From NFS/POSIX/parallel FS/MarFS

[Figure: pftool architecture: a load balancer, scheduler, and reporter feed Dirs, Stat, and Copy/Rsync/Compare queues, which are drained by readdir, stat, and copy/rsync/compare workers.]
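
The queue-driven structure in the figure is the familiar work-queue tree walk; the sketch below (generic illustration, not pftool source) shows the general shape of dispatching readdir and copy work items to a pool of workers:

```python
# Generic illustration of a queue-driven parallel tree copy (not pftool source).
# Directory ("readdir") and file ("copy") work items share one queue that a
# pool of workers drains; a real tool would add stat items, load balancing,
# repackaging, and robust termination.
import os, queue, shutil, threading

def parallel_copy_tree(src_root: str, dst_root: str, num_workers: int = 8) -> None:
    work: "queue.Queue[tuple[str, str]]" = queue.Queue()
    work.put(("dir", src_root))                    # seed with the root directory

    def worker() -> None:
        while True:
            try:
                kind, path = work.get(timeout=1)   # idle workers eventually give up
            except queue.Empty:
                return
            dst = os.path.join(dst_root, os.path.relpath(path, src_root))
            if kind == "dir":                      # readdir work item
                os.makedirs(dst, exist_ok=True)
                for entry in os.scandir(path):     # enqueue children for any worker
                    kid = "dir" if entry.is_dir(follow_symlinks=False) else "copy"
                    work.put((kid, entry.path))
            else:                                  # copy work item
                shutil.copy2(path, dst)
            work.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                                    # all queued items processed

# parallel_copy_tree("/scratch1/fs1", "/tmp/fs1-copy")
```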

SLIDE 10

Moving Data into MarFS efficiently

[Figure: an FTA (File Transfer Agent) cluster, a front end plus worker nodes FTA1-FTA6, is a collection of pftool worker nodes capable of performing data movement in parallel between Scratch1 (78 PB) and the MarFS stores Store1 (~3 PB), Store2 (38 PB), and Store3 (38 PB). The user submits a batch job: pfcp -r /scratch1/fs1 /marfs/fs1]

SLIDE 11

Current MarFS Deployment at LANL

  • Successfully deployed during Trinity Open Science (6 PB of Scality storage (1 RING), 1 GPFS cluster)

  • Small file packing features weren’t efficient (yet)
  • Cleanup tools needed better feature sets
  • It worked!
  • PFTool cluster deployed for enclaves
  • Transfers instrumented with Darshan
  • Deployed 32PB of MarFS campaign storage for Trinity Phase 2
  • Plan to grow that by 50 – 100PB every year during Trinity deployment
  • SMR drive performance is challenging
SLIDE 12

Future MarFS Deployment at LANL

  • 2020 – 350PB


Parity over multiple ZFS pools

[Figure: storage nodes, placed in separate racks with multiple JBODs per node, each host four zpools; each zpool is a 17+3 parity of 10+2. Data (D) and parity (P) are round-robined to the storage nodes, metadata servers are separate, and the storage nodes NFS-export to the FTAs.]
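
The figure's two protection levels (17+3 and 10+2) multiply into the usable fraction of raw capacity; the slide does not spell out which code sits at which level, but the product is the same either way:

```python
# Usable-capacity fraction implied by combining a 17+3 code with a 10+2 code
# (zpool-level vs. node-level assignment does not change the product).
inner_data, inner_parity = 17, 3
outer_data, outer_parity = 10, 2

efficiency = (inner_data / (inner_data + inner_parity)) * \
             (outer_data / (outer_data + outer_parity))
print(f"Usable fraction of raw capacity: {efficiency:.1%}")   # ~70.8%
```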

SLIDE 13

MarFS at LANL

  • Successfully deployed during Trinity Open Science (6 PB of Scality storage (1 RING), 1 GPFS cluster)

  • Some of the small file packing features aren’t efficient yet
  • Some of the cleanup tools need better feature sets
  • But it actually works
  • All PFTool transfers were logged with Darshan during Open Science
  • Unfortunately, I haven’t had the opportunity to parse that data yet
  • Currently deploying 32PB of MarFS campaign storage for Trinity Phase 2
  • Plan is to grow that by 50 – 100PB every year during Trinity deployment


MarFS Growth over time

[Figure: MarFS growth over time, from Trinity Open Science (4 PB) through current Trinity production to a future MarFS deployment (Crossroads, 2020).]

SLIDE 14

DeltaFS

SLIDE 15

Why DeltaFS?

  • DeltaFS provides an extremely scalable metadata plane
  • Don’t want to provision MDS to suit most demanding application
  • Purchase MDS to address all objects in the data plane
  • DeltaFS allows applications to be as metadata-intensive as desired
  • But this is not the DeltaFS vision talk
  • Instead I will talk about what exists
  • VPIC is a good, but imperfect match for DeltaFS
  • State of the art is a single h5part log for VPIC that has written greater than 1 trillion particles (S. Byna)

  • Not much opportunity to improve write performance
SLIDE 16

Vector Particle in Cell (VPIC) Setup

  • 128 Million particles/node
  • Particles/node should be higher
  • Need to store a sample in the range of 0.1 – 10%
  • Scientist interested in the trajectory of the particles with the highest energy at the end of the simulation
  • At the end of the simulation, easy to dump the highest-energy particle ids
  • Hard to get those trajectories
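
For scale, the 0.1 – 10% sampling range above works out as follows per node (figures from this slide):

```python
# Sample sizes implied by the slide: 0.1% to 10% of 128 million particles/node.
particles_per_node = 128_000_000

for fraction in (0.001, 0.10):
    sampled = int(particles_per_node * fraction)
    print(f"{fraction:>5.1%} sample -> {sampled:,} particles tracked per node")
# ->  0.1% sample -> 128,000 particles tracked per node
# -> 10.0% sample -> 12,800,000 particles tracked per node
```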
SLIDE 17

DeltaFS for VPIC

  • Scientists want the trajectory of the N highest-energy particles
  • Trajectory is ~40 bytes/particle, once per timestep dump/cycle
  • Current State of Art (HDF5+H5Part+Byna)
  • 1. At each timestep dump/cycle write particles in log order
  • 2. At end of simulation identify N highest energy particles
  • 3. Sort each timestep dump by particle id in parallel
  • 4. Extract N high energy trajectories (open, seek, read, open, seek, read)
  • DeltaFS Experiments
  • Original Goal: Create a file+object per particle
  • New Goal: 2500 Trinity nodes -> 1-4T particles -> 40-160TB per timestep
  • Speed up particle extraction, minimize slowdown during write (TANSTAAFL)
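
A quick check of the data-volume arithmetic behind the new goal above, using the ~40 bytes of trajectory data per particle per dump from this slide:

```python
# Per-timestep trajectory volume for 1-4 trillion particles at ~40 bytes each.
bytes_per_particle = 40

for particles in (1e12, 4e12):
    terabytes = particles * bytes_per_particle / 1e12
    print(f"{particles:.0e} particles -> {terabytes:.0f} TB per timestep dump")
# -> 1e+12 particles -> 40 TB per timestep dump
# -> 4e+12 particles -> 160 TB per timestep dump
```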
SLIDE 18

Wait What!?!

  • Always confident we could create 1T files
  • Key-Value Metadata organization
  • TableFS -> IndexFS -> BatchFS -> DeltaFS
  • Planning papers assumed a scalable object store as data plane
  • RADOS has some cool features that may have worked
  • No RADOS on Trinity without root/vendor assist!
  • Cannot create a data plane faster than Lustre, DataWarp
  • Trillions of files open at one time
  • 40 Byte appends! (the dribble write problem)
  • Bitten off far more than originally intended
  • DeltaFS scope increased to optimize VPIC data plane
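
The 40-byte-append problem above is the classic motivation for absorbing tiny writes in memory and issuing large sequential writes to storage; the sketch below is a generic illustration of that idea (hypothetical buffer size and names, not DeltaFS's actual write path):

```python
# Generic dribble-write mitigation: tiny per-particle records are appended to an
# in-memory buffer and flushed to storage in large chunks.
import io

RECORD_BYTES = 40                 # one trajectory record, per the VPIC slides
FLUSH_BYTES = 8 * 1024 * 1024     # assumed 8 MiB flush size, for illustration only

class BufferedParticleLog:
    def __init__(self, path: str):
        self._file = open(path, "ab", buffering=0)   # we batch writes ourselves
        self._buf = io.BytesIO()

    def append(self, record: bytes) -> None:
        assert len(record) == RECORD_BYTES
        self._buf.write(record)
        if self._buf.tell() >= FLUSH_BYTES:
            self.flush()

    def flush(self) -> None:
        self._file.write(self._buf.getvalue())       # one large sequential write
        self._buf = io.BytesIO()

    def close(self) -> None:
        self.flush()
        self._file.close()

# With 8 MiB flushes, roughly 200,000 forty-byte appends become one storage write.
```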
SLIDE 19

FS Representation vs. Storage Representation

SLIDE 20

DeltaFS is a composed User-Level FS

SLIDE 21

Write Indexing Overhead

SLIDE 22

Trajectory Extraction Time

SLIDE 23

DeltaFS UDF (only VPIC example works so far)

SLIDE 24

Summary

SLIDE 25

Trinity Platform, 2016 - 2020

[Figure: a tightly coupled parallel application writes to a burst buffer (3.7 PB, 3.3 TB/s), a parallel file system (78 PB, 1.145 TB/s), object storage (300 PB*, 100 GB/s), and a tape archive.]

SLIDE 26

Researching a More Scalable Future

[Figure: future platform vision: a loosely coupled parallel application (DeltaFS) uses ***/fast storage, backed by MarFS/object storage.]

SLIDE 27

Lessons Learned: Challenges

  • [Both] Standard performance debugging problems
  • Diagnosing SMR firmware errors is hard!
  • [Both] Maintenance!
  • Possible for LANL, hard elsewhere
  • [MarFS] Doesn’t solve facility archive charging problem
  • Voluntary data deletion rare
  • [DeltaFS] Scaling with application => more engineering
  • Meager memory, CPU available
  • Alberto compared it to embedded-system development; Rob said the same problem shows up in in-situ (both seemed right to me)

SLIDE 28

Lessons Learned: Opportunities

  • [Both] Enable new procurement approach
  • Provision capacity over time
  • Extend purge cycles
  • Metadata plane to address total objects, not application scale
  • [DeltaFS] Enable natural I/O model
  • POSIX (mostly) file per particle - O(trillion)
  • [Both] Augmenting/Reducing POSIX
  • Enable UDF, Indexing, etc. (DeltaFS)
  • Disallow random updates, read-write mounts (MarFS)