 
FROM FILE SYSTEMS TO SERVICES: CHANGING THE DATA MANAGEMENT MODEL IN HPC
Simulation, Observation, and Software: Supporting exascale storage and I/O

ROB ROSS, PHILIP CARNS, KEVIN HARMS, JOHN JENKINS, AND SHANE SNYDER
Argonne National Laboratory
GARTH GIBSON, GEORGE AMVROSIADIS, CHUCK CRANOR, AND QING ZHENG
Carnegie Mellon University
JEROME SOUMAGNE AND JOE LEE
The HDF Group
GALEN SHIPMAN AND BRAD SETTLEMYER
Los Alamos National Laboratory
CHANGES IMPACTING HPC DATA AND STORAGE
MORE STORAGE/MEMORY LAYERS …
HPC Before 2016
§ Memory: DRAM
§ Parallel File System: Lustre
§ Archive: HPSS, parallel tape
HPC After 2016
§ Memory: 1-2 PB/sec; residence – hours; overwritten – continuous
§ Burst Buffer: 4-6 TB/sec; residence – hours; overwritten – hours
§ Parallel File System: 1-2 TB/sec; residence – days/weeks; flushed – weeks
§ Campaign Storage: 100-300 GB/sec; residence – months-year; flushed – months-year
§ Archive: 10s GB/sec (parallel tape); residence – forever
§ Why
– BB: Economics (disk bw/iops too expensive)
– PFS: Maturity and BB capacity too small
– Campaign: Economics (tape bw too expensive)
– Archive: Maturity and we really do need a “forever”
Slide from Gary Grider (LANL).
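To get a feel for the bandwidth gaps between these layers, the sketch below computes how long one data set would take to move at the low end of each bandwidth range quoted on the slide. The 100 TB data set size is a hypothetical value chosen for illustration, and "10s GB/sec" is assumed to mean 10 GB/sec.

```python
# Low end of each aggregate bandwidth range quoted on the slide, in bytes/sec.
GB = 10**9
TB = 10**12
PB = 10**15

TIER_BANDWIDTH = {
    "memory": 1 * PB,
    "burst buffer": 4 * TB,
    "parallel file system": 1 * TB,
    "campaign storage": 100 * GB,
    "archive": 10 * GB,  # assumption: "10s GB/sec" taken as 10 GB/sec
}


def transfer_seconds(size_bytes, tier):
    """Idealized time to move size_bytes at the tier's aggregate bandwidth."""
    return size_bytes / TIER_BANDWIDTH[tier]


DATASET_BYTES = 100 * TB  # hypothetical data set size, not from the slide

for tier in TIER_BANDWIDTH:
    seconds = transfer_seconds(DATASET_BYTES, tier)
    print(f"{tier:>22}: {seconds:10.1f} s")
```

Under these assumptions each step down the hierarchy costs roughly a factor of 4 to 40 in transfer time, which is consistent with the economics arguments listed under "Why".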
SIMULATION WORKFLOW
Source: APEX Workflows, LANL, NERSC, SNL, SAND2015-10342 O, LA-UR-15-29113
SPECIALIZATION OF DATA SERVICES
[Figure: application data divided into executables and libraries, intermediate data products, and checkpoints, with example specialized services: DataSpaces, SCR, SPINDLE, Kelpie, FTI, MDHIM]
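§ ADLB (Rusty), data store and pub/sub: provisioning MPI ranks; comm. MPI; local storage RAM; fault mgmt. and group membership N/A; security N/A
§ DataSpaces (Manish), data store and pub/sub: provisioning indep. job; comm. Dart; local storage RAM (SSD); fault mgmt. and group membership under devel.; security N/A
§ DataWarp, burst buffer mgmt.: provisioning admin./sched.; comm. DVS/lnet; local storage XFS, SSD; fault mgmt. and group membership ext. monitor; security kernel, lnet
§ FTI (Franck), checkpoint/restart mgmt.: provisioning MPI ranks; comm. MPI; local storage RAM, SSD; fault mgmt. and group membership N/A; security N/A
§ Kelpie, dist. in-mem. key/val store: provisioning MPI ranks; comm. Nessie; local storage RAM (Object); fault mgmt. and group membership N/A; security obfusc. IDs
§ SPINDLE, exec. and library mgmt.: provisioning launch MON; comm. TCP; local storage RAMdisk; fault mgmt. and group membership N/A; security shared secret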
COMPOSING DATA SERVICES
OUR GOAL
Enable composition of data services for DOE science and systems
§ Application-driven
– Identify and match to science needs
– Traditional data roles (e.g., checkpoint, data migration)
– New roles (e.g., equation of state/opacity databases)
§ Develop/adapt building blocks
– Communication
– Concurrency
– Local Storage
– Resilience
– Authentication/Authorization
COMMUNICATION: MERCURY
https://mercury-hpc.github.io/
Mercury is an RPC system for use in the development of high performance system services. Development is driven by the HDF Group with Argonne participation.
§ Portable across systems and network technologies
§ Efficient bulk data movement to complement control messages
§ Builds on lessons learned from IOFSL, Nessie, lnet, and others
[Figure: client and server RPC procs layered on a Network Abstraction Layer; metadata travels as unexpected + expected messaging, bulk data as RMA transfer]
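The split between small control messages and separate bulk data movement can be sketched as follows. This is an in-process simulation only, not Mercury's API: the names `BulkHandle`, `Client`, `Server`, `expose`, `rma_read`, and `handle_write_rpc` are all hypothetical, and a plain method call stands in for the network.

```python
from dataclasses import dataclass


@dataclass
class BulkHandle:
    """Small descriptor of a client memory region (stand-in for an RMA handle)."""
    region_id: int
    length: int


class Client:
    def __init__(self):
        self._regions = {}
        self._next_id = 0

    def expose(self, buf):
        """Register a buffer and return a descriptor that fits in a control message."""
        region_id = self._next_id
        self._next_id += 1
        self._regions[region_id] = buf
        return BulkHandle(region_id, len(buf))

    def rma_read(self, handle, offset, length):
        """Simulates the server pulling bytes directly from client memory."""
        return self._regions[handle.region_id][offset:offset + length]


class Server:
    def __init__(self):
        self.store = {}

    def handle_write_rpc(self, client, request):
        """The request carries only metadata; the payload is pulled separately."""
        handle = request["bulk"]
        data = client.rma_read(handle, 0, handle.length)
        self.store[request["name"]] = data
        return {"ret": 0, "bytes": len(data)}


def demo():
    client, server = Client(), Server()
    payload = b"x" * 4096
    request = {"op": "write", "name": "obj0", "bulk": client.expose(payload)}
    response = server.handle_write_rpc(client, request)
    return response, server.store["obj0"] == payload
```

The point of the pattern is that the request stays small regardless of payload size, so the server can decide when and how to pull the data.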
CONCURRENCY: ARGOBOTS
https://collab.cels.anl.gov/display/argobots/
Argobots is a lightweight threading/tasking framework.
§ Features relevant to I/O services:
– Flexible mapping of work to hardware resources
– Ability to delegate service work with fine granularity across those resources
– Modular scheduling
§ We developed asynchronous bindings to:
– Mercury
– LevelDB
– POSIX I/O
§ Working with Argobots team to identify needed functionality (e.g., idling)
[Figure: Argobots Execution Model, showing execution streams ES 1 ... ES n; legend: Scheduler, Pool, ULT, Tasklet, Event]
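The idea of fine-grained work units drawn from a pool by a scheduler can be illustrated with a small cooperative scheduler. This sketch uses Python generators as stand-ins for user-level threads and does not use or mirror the Argobots API; `ult`, `run_pool`, and the work unit names are hypothetical.

```python
from collections import deque


def ult(name, steps, log):
    """A user-level thread modeled as a generator.

    Each yield hands control back to the scheduler, much as a service
    handler would when it blocks on communication or I/O.
    """
    for step in range(steps):
        log.append((name, step))
        yield


def run_pool(pool):
    """Round-robin scheduler: run a work unit until it yields, then requeue it."""
    while pool:
        unit = pool.popleft()
        try:
            next(unit)
        except StopIteration:
            continue  # the unit finished; drop it from the pool
        pool.append(unit)


def demo():
    log = []
    pool = deque([ult("rpc-handler", 2, log), ult("disk-io", 3, log)])
    run_pool(pool)
    return log
```

Swapping `run_pool` for a different policy (priority, FIFO without requeueing) changes behavior without touching the work units, which is the sense in which scheduling is modular here.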
THREE EXAMPLE SERVICES
1. REMOTELY ACCESSIBLE OBJECTS
§ API for remotely creating, reading, writing, destroying fixed-size objects/extents
§ libpmem (http://pmem.io/nvml/libpmemobj/) for management of data on device
[Figure: the client stack (app, Object API, Argobots, Margo, Mercury, CCI) connects over IB/verbs to the target stack (Argobots, libpmem, Margo, Mercury, CCI), which manages RAM, NVM, SSD]
P. Carns et al. “Enabling NVM for Data-Intensive Scientific Services.” INFLOW 2016, November 2016.
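The four operations named above (create, read, write, destroy on fixed-size objects/extents) might look like the following purely local, in-memory sketch. The class and method names are hypothetical and do not reflect the service's actual interface; a real target would place the data on a device through libpmem and be reached over RPC.

```python
class ObjectStore:
    """In-memory stand-in for a target holding fixed-size objects."""

    def __init__(self, object_size):
        self.object_size = object_size
        self._objects = {}
        self._next_id = 0

    def create(self):
        """Allocate a zero-filled object and return its identifier."""
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = bytearray(self.object_size)
        return oid

    def _check(self, oid, offset, length):
        if oid not in self._objects:
            raise KeyError(f"no such object: {oid}")
        if offset < 0 or length < 0 or offset + length > self.object_size:
            raise ValueError("extent out of bounds")

    def write(self, oid, offset, data):
        """Overwrite the extent [offset, offset + len(data)) of an object."""
        self._check(oid, offset, len(data))
        self._objects[oid][offset:offset + len(data)] = data

    def read(self, oid, offset, length):
        """Return a copy of the extent [offset, offset + length)."""
        self._check(oid, offset, length)
        return bytes(self._objects[oid][offset:offset + length])

    def destroy(self, oid):
        """Release an object; later accesses raise KeyError."""
        self._check(oid, 0, 0)
        del self._objects[oid]
```

Keeping objects fixed-size means every extent check is a simple bounds comparison and no operation ever has to reallocate.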
1. REMOTELY ACCESSIBLE OBJECTS: HOW MUCH LATENCY IN THE STACK?
[Figure: write and read latency (us, log scale from 1 to 1000) versus access size (bytes), from a no-op through KiB- and MiB-sized accesses; legend entries C1 and C2. The NOOP latency is 5.8 usec.]
FDR IB, RAM disk, 2.6 usec round-trip (MPI) latency measured separately
2. TRANSIENT FILE SYSTEM VIEWS: DELTAFS
Supporting legacy POSIX I/O in a scalable way.
[Figure: app procs linked with the Deltafs lib, Deltafs server procs, and a Deltafs fuse mount at /deltafs used by tools such as ls -l and tail -F, all within a Deltafs comm world]
Numbered steps in the figure:
1. load snapshot(s)
2. RPC deltafs servers for metadata
3. directly access file data
4. monitor progress
5. dump snapshot(s)
All procs are user-space, and run on compute nodes.
3. CONTINUUM MODEL COUPLED WITH VISCOPLASTICITY MODEL
§ Future applications are exploring the use of multi-scale modeling
§ As an example: Loosely coupling continuum scale models with more realistic constitutive/response properties
§ e.g., Lulesh from ExMatEx
§ Fine scale model results can be cached and new values interpolated from similar prior model calculations
Lulesh continuum model:
- Lagrangian hydro dynamics
- Unstructured mesh
Viscoplasticity model [1]:
- FFT based PDE solver
- Structured sub-mesh
[Figure: shockwave]
[1] R. Lebensohn et al, Modeling void growth in polycrystalline materials, Acta Materialia, http://dx.doi.org/10.1016/j.actamat.2013.08.004.
3. FINE SCALE MODEL DATABASE
§ Goals
– Minimize fine scale model executions
– Minimize query/response time
– Load balance DB distribution
§ Approach
– Start with a key/value store
– Distributed approx. nearest-neighbor query
– Data distributed to co-locate values for interpolation
– Import/export to persistent store
§ Status
– Mercury-based, centralized in-memory DB service
– Investigating distributed, incremental nearest-neighbor indexing
[Figure: the application domain queries a 6D space for nearest neighbors in a distributed DB; DB instances import/export to a persistent store]
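The caching idea, answering a query by interpolating from nearby prior results instead of rerunning the fine scale model, can be sketched with a centralized in-memory store. This sketch swaps in an exact brute-force search where the slide calls for a distributed approximate nearest-neighbor query, and it uses inverse-distance weighting as an assumed interpolation scheme; `FineScaleCache` and its methods are hypothetical names.

```python
import math


class FineScaleCache:
    """Key/value store keyed by points in a 6D space."""

    DIMENSIONS = 6

    def __init__(self):
        self._entries = []  # list of (key, value) pairs

    def insert(self, key, value):
        """Record the result of one fine scale model execution."""
        if len(key) != self.DIMENSIONS:
            raise ValueError("keys must have 6 components")
        self._entries.append((tuple(key), float(value)))

    def nearest(self, query, k):
        """Exact k-nearest neighbors by brute force (Euclidean distance)."""
        ranked = sorted(self._entries, key=lambda e: math.dist(e[0], query))
        return ranked[:k]

    def interpolate(self, query, k=2):
        """Inverse-distance-weighted estimate from the k nearest entries."""
        neighbors = self.nearest(query, k)
        if not neighbors:
            raise LookupError("cache is empty")
        weighted = []
        for key, value in neighbors:
            distance = math.dist(key, query)
            if distance == 0.0:
                return value  # exact hit, no interpolation needed
            weighted.append((1.0 / distance, value))
        total = sum(weight for weight, _ in weighted)
        return sum(weight * value for weight, value in weighted) / total
```

A caller would fall back to running the fine scale model, and then `insert` the result, whenever the nearest neighbors are judged too far away; that distance threshold is application-specific and is left out here.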
FINAL THOUGHTS
§ Stage is set for distributed services in HPC
– Richer resource management
– Increasing emphasis on workflows
– Convergence of data intensive and computational science
§ If we’re going to “get rid of POSIX”, we need alternative(s)
§ Real opportunity to make life easier for applications
– And have fun doing it!
THIS WORK IS SUPPORTED BY THE DIRECTOR, OFFICE OF ADVANCED SCIENTIFIC COMPUTING RESEARCH, OFFICE OF SCIENCE, OF THE U.S. DEPARTMENT OF ENERGY UNDER CONTRACT NO. DE-AC02-06CH11357.