 
              DataMods Programmable File System Services Noah Watkins*, Carlos Maltzahn, Scott Brandt UC Santa Cruz, *Inktank Adam Manzanares California State University, Chico 1
Talk Agenda 1. Middleware and modern IO stacks 2. Services in middleware and parallel file systems 3. Avoid duplicating work with DataMods 4. Case study: Checkpoint/restart 2
Why DataMods? • Applications struggle to scale on POSIX I/O • Parallel FS rarely provide other interfaces – POSIX I/O designed to prevent lock-in • Open-source PFS are now available – Ability to avoid lock-in • Can we generalize PFS services to provide new behavior to new users? 3
Application Middleware • Complex data models and interfaces • Difficult to work directly with simple byte stream • Middleware maps the complex onto the simple 4
Middleware Complexity Bloat • Hadoop and “Big Data” data models – Ordered key/value pairs stored in file – Dictionary for random key-oriented access – Common table abstractions 5
Middleware Complexity Bloat • Scientific data – Multi-dimensional arrays – Imaging – Genomics 6
Middleware Complexity Bloat • IO Middleware – Low-level data models and I/O optimization – Transformative I/O avoids POSIX limitations 7
Middleware Scalability Challenges • Scalable storage system • Exposes one data model • Must find ‘magic’ alignment 8
Data Model Modules • Plug in new “file” interfaces and behavior • Native support, atop existing scalable services [Figure labels: New behavior; Generalized storage services; Pluggable customization (new programmer role)] 9
What does middleware do? • Metadata Management • Data Placement • Intelligent Access • Asynchronous Services 10
Middleware: Metadata Management • Byte stream layout • Data type information • Data model attributes • Example: Mesh Data Model – How is the mesh represented? – What does it represent? [Figure: file with a header region] 11
Middleware: Data Placement • Serialization • Placement index • Physical alignment – Including the metadata • Example: Mesh Data Model – Vertex lists – Mesh elements – Metadata [Figure: file laid out as header, metadata, and data regions] 12
Middleware: Intelligent Access • Data model specific interfaces • Rich access methods – Views, subsetting, filtering • Write-time optimizations • Locality and data movement [Figure: array-based application calling read(array-slice) through the HDF5 library on a file of header, metadata, and data regions] 13
Middleware: Asynchronous Services • Workflows – Regridding • Compression • Indexing • Layout optimization • Performed online [Figure: HDF5 library and workflow driver operating on a file of header, metadata, and data regions] 14
Middleware Challenges • Inflexible byte stream abstraction • Scalability rules are simple – But middleware is complex • Applying ‘magic number’ – Unnatural and difficult to propagate • Loss of detail at lower levels – Difficult for in-transit / co-located compute 15
Storage System Services • Scalable metadata – Clustered service – Scalability invariants • Distributed object store – Local compute resources – Define new behavior • File operations – POSIX • Fault-tolerance – Scrubbing and replication 16
DataMods Abstraction • File Manifold (Metadata and Data Placement) • Typed and Active Storage • Asynchronous Services 17
DataMods Architecture • Generalized file system services • Exposed through programming model 18
File Manifold • Metadata management and data placement – Flexible, custom layouts • Extensible interfaces • Object namespace managed by manifold • Placement rules evaluated by system 19
Typed and Active Storage • Active storage adoption has been slow – Code injection is scary – Security and QoS • Reading, writing, and checksums are not free • Exposing scalable services is tractable – Well-defined data models support optimization – Programming model supports data model creation – Indexing and filtering 20
Asynchronous Services • Re-use of active / typed storage components • Temporal relationship to file manifold – Incremental processing – After file is closed – Object update trigger • Scheduling – Exploit idle time – Integrate with larger ecosystem – Preempted or aborted 21
Case Study: PLFS Checkpoint/Restart • Long-running simulations need fault-tolerance – Checkpoint simulation state • Simulations run on expensive machines – Very expensive machines. Really, very expensive. • Decrease cost (time) of checkpoint/restart • Translation: increase bulk I/O bandwidth 22
Overview of PLFS • Middleware layer – Transforms I/O pattern • IO Pattern: N-1 – Most common • IO Pattern: N-N – File system friendly • Convert N-1 into N-N • Applications see the same logical file 23
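The N-1 to N-N conversion above can be illustrated with a minimal sketch. It is not PLFS code: the names `IndexEntry`, `ProcessLog`, and `read_logical` are hypothetical, the logs live in memory rather than in files, and overlapping writes (which a real system would have to order, for example by timestamp) are assumed not to occur.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    logical_offset: int   # position in the shared logical file
    length: int           # number of bytes written
    physical_offset: int  # position inside this process's log


@dataclass
class ProcessLog:
    """Hypothetical per-process log: data is only ever appended."""
    data: bytearray = field(default_factory=bytearray)
    index: list = field(default_factory=list)

    def write(self, logical_offset, payload):
        # Record where the logical extent lands in the log, then append.
        self.index.append(IndexEntry(logical_offset, len(payload), len(self.data)))
        self.data.extend(payload)


def read_logical(logs):
    """Rebuild the logical file by replaying every index entry.

    Assumes extents never overlap, so replay order does not matter.
    """
    size = max((e.logical_offset + e.length for log in logs for e in log.index), default=0)
    out = bytearray(size)
    for log in logs:
        for e in log.index:
            chunk = log.data[e.physical_offset:e.physical_offset + e.length]
            out[e.logical_offset:e.logical_offset + e.length] = chunk
    return bytes(out)


# Three clients write a strided (N-1) pattern into one logical file.
logs = [ProcessLog() for _ in range(3)]
for step in range(2):
    for rank in range(3):
        logs[rank].write((step * 3 + rank) * 4, bytes([65 + rank]) * 4)
```

Each client's writes are sequential appends to its own log, while the application still sees a single interleaved file when the indexes are replayed.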
Simplified PLFS I/O Behavior [Figure: Clients 1, 2, and 3 write through the Parallel Log-structured File System; each has its own index and log-structured file] 24
PLFS Scalability Challenges • Index maintenance and volume • Optimization above file system – Compression and reorganization [Figure: timeline across application, PLFS, and file system showing compute and optimization process phases] 25
Moving Overhead to Storage System • Checkpoints are not read immediately (if at all) – Index maintenance and optimization in storage [Figure: timeline across application, PLFS, and file system showing compute and optimization process phases; annotation: Return to compute sooner] 26
DataMods Module for PLFS • File Manifold – Logical file view – Per-process log-structured files – Index • Hierarchical Solution – Top-level manifold routes to logs – Inner manifold implements log-structured file – Automatic namespace management (metadata) 27
PLFS Outer File Manifold Logical top-half file is not materialized 28
PLFS Outer File Manifold Logical top-half file is not materialized Routes to per-process log file 29
PLFS Inner File Manifold Logical top-half file is not materialized Routes to per-process log file Append striping within object namespace 30
PLFS Inner File Manifold Logical top-half file is not materialized Routes to per-process log file Append striping within object namespace Index-enabled objects record logical-to-phy 31
PLFS Inner File Manifold Logical top-half file is not materialized Routes to per-process log file Append striping within object namespace Index-enabled objects record logical-to-phy Interface to index maintenance routines 32
Active and Typed Objects • Append-only object • Automatic indexing • Managed layout • Built on existing services • Logical view at lowest level • Index maintenance interface
Offline Index Optimization • Extreme index fragmentation (per-object) • Exploit opportunities for optimization – Storage system idle time – Re-use of analysis I/O – Piggy-backed on scrubbing / healing • Index Compression – Merging contiguous entries – Pattern discovery and replacement – Consolidation 34
Offline Index Optimization • Three-stage pipeline – Incremental compression and consolidation • Incremental compression 1. Merging physically contiguous entries (in PLFS) – Not subject to buffer size limits • Applied technique to 92 PLFS indexes published by LANL 35
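The merging step can be sketched as follows. This is an illustration under stated assumptions, not the PLFS routine: an index entry is modeled as a hypothetical `(logical_offset, length, physical_offset)` tuple, entries arrive in log order, and two neighbors are merged only when they are contiguous both logically and physically.

```python
def merge_contiguous(entries):
    """Collapse runs of adjacent index entries into single extents.

    entries: list of (logical_offset, length, physical_offset) tuples
    in the order they were appended to one log (an assumed layout).
    """
    merged = []
    for logical, length, physical in entries:
        if merged:
            prev_logical, prev_length, prev_physical = merged[-1]
            contiguous = (prev_logical + prev_length == logical
                          and prev_physical + prev_length == physical)
            if contiguous:
                # Extend the previous extent instead of adding a new entry.
                merged[-1] = (prev_logical, prev_length + length, prev_physical)
                continue
        merged.append((logical, length, physical))
    return merged


raw_index = [(0, 4, 0), (4, 4, 4), (8, 4, 8), (100, 4, 12), (104, 4, 16)]
compact_index = merge_contiguous(raw_index)
```

The pass is a single linear scan, so it can run incrementally as entries arrive or later over a whole log.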
Merging Reduces PLFS Index Size [Figure: number of index entries (log scale, 1 to 10,000,000) per PLFS map file (1 to 91); series: Raw Trace (Baseline), Merge Compress; annotations: Large, Strided; Contiguous Writes] 36
Index Compression: Pattern • Compactly represent extents using patterns • Example pattern template – offset + stride * i, low < i < high • Fit data to this pattern to reduce index size • Linear algorithm; parallel across logs 37
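A minimal sketch of fitting offsets to the strided template follows. It is a greedy, single-pass illustration, not the algorithm evaluated in the talk: the function names are hypothetical, only logical offsets are compressed (equal-length extents are assumed), and each pattern is stored as `(offset, stride, count)` with `i` running from `0` to `count - 1` rather than between explicit `low` and `high` bounds.

```python
def compress_strided(offsets):
    """Greedily cover a list of offsets with (offset, stride, count) patterns."""
    patterns = []
    i, n = 0, len(offsets)
    while i < n:
        start = offsets[i]
        if i + 1 == n:
            # A lone trailing offset becomes a pattern of length one.
            patterns.append((start, 0, 1))
            break
        stride = offsets[i + 1] - start
        j = i + 1
        # Extend the run while the same stride keeps holding.
        while j + 1 < n and offsets[j + 1] - offsets[j] == stride:
            j += 1
        patterns.append((start, stride, j - i + 1))
        i = j + 1
    return patterns


def expand(patterns):
    """Inverse of compress_strided: regenerate every offset."""
    return [offset + stride * k
            for offset, stride, count in patterns
            for k in range(count)]


offsets = [0, 12, 24, 36, 100, 104, 108, 500]
patterns = compress_strided(offsets)
```

Because each log is scanned independently, the same routine could be applied to all logs in parallel. The greedy choice is lossless but not always minimal: the first two offsets of a run fix its stride.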
Pattern Compression Improves Over Merging [Figure: number of index entries (log scale, 1 to 10,000,000) per PLFS map file (1 to 91); series: Raw Trace (Baseline), Merge Compress, Pattern Compress; annotation: Strided pattern identified] 38
Index Consolidation • Combines all logs together (in PLFS) • Increases index read efficiency [Figure: index consolidation into an index pack …] 39
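One way to picture consolidation is to combine the per-log indexes into a single list sorted by logical offset, so a read consults one structure instead of every log. The sketch below assumes a hypothetical entry layout of `(logical_offset, length, physical_offset)` per log and non-overlapping extents; it does not reproduce the PLFS on-disk index format.

```python
import bisect


def consolidate(per_log_indexes):
    """Combine per-log indexes into one list sorted by logical offset.

    per_log_indexes: dict mapping log_id -> list of
    (logical_offset, length, physical_offset) tuples (assumed layout).
    """
    combined = [(logical, length, log_id, physical)
                for log_id, entries in per_log_indexes.items()
                for logical, length, physical in entries]
    combined.sort()
    return combined


def lookup(consolidated, logical_offset):
    """Return (log_id, physical_offset) holding the byte, or None for a hole."""
    # Rebuilt on every call for brevity; a real index would cache this list.
    starts = [entry[0] for entry in consolidated]
    pos = bisect.bisect_right(starts, logical_offset) - 1
    if pos < 0:
        return None
    logical, length, log_id, physical = consolidated[pos]
    if logical_offset >= logical + length:
        return None
    return log_id, physical + (logical_offset - logical)


per_log = {
    0: [(0, 4, 0), (12, 4, 4)],
    1: [(4, 4, 0), (16, 4, 4)],
    2: [(8, 4, 0), (20, 4, 4)],
}
consolidated = consolidate(per_log)
```

With the combined index, locating any logical byte is one binary search rather than a scan over every per-process index.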
Wrapping Up • Implementing new data model plugins – Hadoop and Visualization – Refining API with more use cases – Constructing specification language • Thank you to supporters – DOE funding (DE-SC0005428), Gary Grider, John Bent, James Nunez • Questions? Contact: jayhawk@cs.ucsc.edu • Poster session 40
Extra Slides 41