Efficient Scientific Data Management on Supercomputers
Suren Byna, Scientific Data Management Group, LBNL
▪ Simulations
▪ Experiments
▪ Observations
2
Scientific Data - Where is it coming from?
3
Life of scientific data
Generation → In situ analysis → Processing → Storage → Analysis → Preservation → Sharing → Refinement
4
Supercomputing systems
5
Typical supercomputer architecture
Cori system
[Figure: Cori system architecture: compute nodes on the Aries high-speed network; burst buffer nodes (2x SSD each); I/O nodes (2x InfiniBand HCA); an InfiniBand storage fabric; and Lustre storage servers (OSSs/OSTs).]
▪ Data representation
– Metadata, data structures, data models
▪ Data storage
– Storing and retrieving data and metadata on file systems quickly
▪ Data access
– Improving the performance of the data access patterns scientists need
▪ Facilitating analysis
– Strategies for helping scientists find meaning in the data
▪ Data transfers
– Transferring data within a supercomputing system and between different systems
6
Scientific Data Management in supercomputers
▪ Storing and retrieving data – Parallel I/O
– Software stack
– Modes of parallel I/O
– Tuning parallel I/O performance
▪ Autonomous data management system
– Proactive Data Containers (PDC) system
– Metadata management service
– Data management service
8
Focus of this presentation
Trends – Storage system transformation
9
[Figure: Storage system transformation]
- Conventional: Memory → (I/O gap) → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
- Current, e.g. Cori @ NERSC: Memory → Shared burst buffer → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
- Upcoming, e.g. Aurora @ ALCF: Memory → Node-local storage → Shared burst buffer → Parallel file system → Campaign storage → Archival storage (HPSS tape)
- The I/O performance gap in HPC storage is a significant bottleneck because of slow disk-based storage
- SSDs and new memory technologies are trying to fill the gap, but they increase the depth of the storage hierarchy
Applications → High-Level I/O Libraries → I/O Middleware → I/O Forwarding → Parallel File System → I/O Hardware
10
Contemporary Parallel I/O Software Stack
Applications → High-Level I/O Library (HDF5, NetCDF, ADIOS) → I/O Middleware (MPI-IO) → I/O Forwarding → Parallel File System (Lustre, GPFS, ...) → I/O Hardware
11
Parallel I/O software stack
▪ I/O Libraries (a parallel HDF5 example follows this list)
– HDF5 (The HDF Group) [LBL, ANL]
– ADIOS (ORNL)
– PnetCDF (Northwestern, ANL)
– NetCDF-4 (UCAR)
▪ Middleware – POSIX-IO, MPI-IO (ANL)
▪ I/O Forwarding
▪ File systems: Lustre (Intel), GPFS (IBM), DataWarp (Cray), ...
▪ I/O Hardware (disk-based, SSD-based, ...)
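To make the stack concrete, here is a minimal sketch of a collective parallel HDF5 write: the application calls HDF5, HDF5 drives MPI-IO, and MPI-IO talks to the parallel file system. The file name, dataset name, and sizes are illustrative, not taken from the slides.

    /* Minimal sketch: one HDF5 dataset written collectively by all MPI ranks. */
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Open the file through the MPI-IO driver */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One row of 1M doubles per rank in a shared 2-D dataset */
        hsize_t dims[2] = {(hsize_t)nprocs, 1048576};
        hid_t filespace = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate(file, "x", H5T_NATIVE_DOUBLE, filespace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects its own row (hyperslab) of the dataset */
        hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 1048576};
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(2, count, NULL);

        static double buf[1048576];                      /* application data */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);    /* collective MPI-IO */
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
        MPI_Finalize();
        return 0;
    }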
▪ Types of parallel I/O
– 1 writer/reader, 1 file
– N writers/readers, N files (file-per-process)
– N writers/readers, 1 file (see the MPI-IO sketch after this list)
– M writers/readers, 1 file (aggregators, two-phase I/O)
– M aggregators, M files (file-per-aggregator)
– Variations of these modes
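A minimal sketch of the "N writers, 1 shared file" mode expressed directly at the MPI-IO level: each rank writes its block at a rank-derived offset with a collective call, so that aggregators (two-phase I/O) can combine the requests. The file name and element count are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    #define NELEM 1048576                 /* doubles per rank; illustrative size */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(NELEM * sizeof(double));
        for (int i = 0; i < NELEM; i++) buf[i] = rank;    /* fill with rank id */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank's block lands at a rank-based offset in the shared file */
        MPI_Offset offset = (MPI_Offset)rank * NELEM * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, NELEM, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);         /* collective write */

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }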
12
Parallel I/O – Application view
[Figure: Application views of the parallel I/O modes: 1 writer/reader to 1 file; n writers/readers to n files (file-per-process); n writers/readers to 1 shared file; M writers/readers to M files; and M writers/readers to 1 shared file.]
▪ Parallel file systems
– Lustre and Spectrum Scale (GPFS)
▪ Typical building blocks of parallel file systems
– Storage hardware: HDD or SSD RAID
– Storage servers (in Lustre, Object Storage Servers [OSS] and Object Storage Targets [OST])
– Metadata servers
– Client-side processes and interfaces
▪ Management
– Stripe files for parallelism (see the striping sketch after this list)
– Tolerate failures
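Striping is what spreads a file across OSTs for parallelism. Below is a small sketch of requesting Lustre striping through MPI-IO hints when creating a file; "striping_factor" and "striping_unit" are standard ROMIO hint names, but whether they are honored depends on the MPI library and file system, and the values are illustrative rather than recommendations. The same effect can usually be had from the shell with `lfs setstripe -c 8 -S 16m output.dat` before the job runs.

    #include <mpi.h>

    /* Open a file for writing with requested Lustre striping (illustrative). */
    void open_striped(MPI_Comm comm, const char *path, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "8");        /* stripe over 8 OSTs */
        MPI_Info_set(info, "striping_unit", "16777216");   /* 16 MB stripes */
        MPI_File_open(comm, (char *)path,                  /* cast for old MPI prototypes */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
        MPI_Info_free(&info);
    }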
13
Parallel I/O – System view
[Figure: A file's logical view versus its physical view on a parallel file system: the file is striped across OST 0 through OST 3, reached over the communication network.]
14
How to achieve peak parallel I/O performance?
▪ The parallel I/O software stack provides options for performance optimization at each layer (a tuning sketch follows this list):
– Application
– HDF5: alignment, chunking, etc.
– MPI-IO: enabling collective buffering, sieving buffer size, collective buffer size, collective buffer nodes, etc.
– Parallel file system: number of I/O nodes, stripe size, enabling prefetching buffer, etc.
– Storage hardware
▪ Challenge: complex inter-dependencies among the software and hardware layers
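A minimal sketch of setting a few of these knobs programmatically, assuming an MPI + HDF5 application; the specific values are illustrative, since good settings depend on the system and the access pattern.

    #include <mpi.h>
    #include <hdf5.h>

    /* File access property list with MPI-IO collective buffering hints and
     * HDF5 alignment tuned (illustrative values). */
    hid_t make_tuned_fapl(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");    /* collective buffering */
        MPI_Info_set(info, "cb_nodes", "8");               /* aggregator count */
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB per aggregator */

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
        H5Pset_alignment(fapl, 1048576, 16777216);  /* align objects >= 1 MB to 16 MB */
        MPI_Info_free(&info);
        return fapl;
    }

    /* Dataset creation property list with a chunked layout (illustrative shape). */
    hid_t make_chunked_dcpl(void)
    {
        hsize_t chunk[2] = {1, 1048576};
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        return dcpl;
    }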
Tuning parameter space
[Figure: The whole tuning parameter space visualized: Lustre stripe_count and stripe_size (MB), MPI-IO cb_nodes, cb_buffer_size (MB), and sieve buffer size (KB), and HDF5 alignment combine into roughly 23,040 configurations.]
15
▪ Simulation of magnetic reconnection (a space weather phenomenon) with the VPIC code
– 120,000 cores
– 8 arrays (HDF5 datasets)
– 32 TB to 42 TB files at 10 time steps
▪ Extracted I/O kernel
▪ M aggregators to 1 shared file
▪ Trial-and-error selection of Lustre file system parameters while scaling the problem size
▪ Reached peak performance in many instances in a real simulation
16
Tuning for writing trillion particle datasets
More details: SC12 and CUG 2013 papers
Tuning combinations are abundant
- Searching through all combinations manually is impractical
- Users, typically domain scientists, should not be burdened with tuning
- Performance auto-tuning has been explored heavily for optimizing matrix operations
- Auto-tuning for parallel I/O is challenging due to the shared I/O subsystem and slow I/O
- Need a strategy to reduce the search space with some knowledge
17
Our solution: I/O Auto-tuning
- Auto-tuning framework to search the parameter space with a reduced number of combinations
- The HDF5 I/O library sets the optimization parameters
- H5Tuner: dynamic interception of HDF5 calls
- H5Evolve:
– Genetic algorithm based selection
– Model-based selection
18
Dynamic Model-driven Auto-tuning
- Auto-tuning using empirical performance models of I/O
- Steps
– Training phase to develop an I/O model
– Pruning phase to select the top-k configurations
– Exploration phase to select the best configuration
– Refitting step to refine the performance model
[Figure: Overview of dynamic model-driven I/O tuning: a training set of I/O kernel runs on the HPC system feeds model generation; pruning selects the top k configurations from all possible values; exploration selects the best-performing configuration; and refitting (controlled by the user) refines the I/O model with new performance results.]
19
Empirical Performance Model
- Non-linear regression model
- Linear combination of $n_b$ non-linear, low-order polynomial basis functions $\phi_k$ and hyper-parameters $\beta$ (selected with a standard regression approach) for a parameter configuration $x$
- For example:
- f: file size; a: number of aggregators; c: stripe count; s: stripe size
$m(x;\beta) = \sum_{k=1}^{n_b} \beta_k \phi_k(x)$

$m(x) = \beta_1 + \beta_2 \frac{1}{s} + \beta_3 \frac{1}{a} + \beta_4 \frac{c}{s} + \beta_5 \frac{f}{c} + \beta_6 \frac{f}{s} + \beta_7 \frac{c f}{a}$, with a fit to the data yielding $\beta = [10.59, 68.99, 59.83, -1.23, 2.26, 0.18, 0.01]$
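A small sketch that evaluates this example model for one configuration. The coefficients are the fitted values listed above; the sample inputs and the units of f and s are illustrative assumptions, since the slide does not spell them out.

    #include <stdio.h>

    /* Evaluate m(x) = b1 + b2/s + b3/a + b4*c/s + b5*f/c + b6*f/s + b7*c*f/a
     * for file size f, aggregators a, stripe count c, stripe size s. */
    double model_cost(double f, double a, double c, double s)
    {
        const double b[7] = {10.59, 68.99, 59.83, -1.23, 2.26, 0.18, 0.01};
        return b[0] + b[1] / s + b[2] / a + b[3] * c / s
                    + b[4] * f / c + b[5] * f / s + b[6] * c * f / a;
    }

    int main(void)
    {
        /* Hypothetical configuration: f=512, a=64 aggregators, c=32 OSTs, s=16 */
        printf("predicted I/O cost: %.2f\n", model_cost(512.0, 64.0, 32.0, 16.0));
        return 0;
    }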
20
Performance Improvement: 4K cores
[Figure: I/O bandwidth (GB/s) for VPIC-IO, VORPAL-IO, and GCRM-IO at 4K cores on Edison, Hopper, and Stampede, with tuned configurations compared against the default settings on Hopper.]
21
Performance Improvement: 8K cores
[Figure: I/O bandwidth (GB/s) for VPIC-IO, VORPAL-IO, and GCRM-IO at 8K cores on Edison and Hopper, compared against the Hopper defaults; a 94x improvement is highlighted.]
22
23
Autonomous data management
Storage Systems and I/O: Current status
24
[Figure: Current usage: applications hold data in memory and go through the I/O software stack (high-level libraries such as HDF5, POSIX / MPI-IO middleware, I/O forwarding, parallel file systems) down to the storage hardware, ending up as files in the file system.]
- Challenges
– The multi-level hierarchy complicates data movement, especially if the user has to be involved
– POSIX-IO semantics hinder the scalability and performance of file systems and I/O software
– The burden of tuning the middleware and the file systems falls on the user
[Figure: Deepening storage hierarchy: memory, node-local storage, shared burst buffer, parallel file system, campaign storage, archival storage (HPSS tape).]
HPC data management requirements
Use case | Domain | Sim / EOD / Analysis | Data size | I/O requirements
FLASH | High-energy density physics | Simulation | ~1PB | Data transformations, scalable I/O interfaces, correlation among simulation and experimental data
CMB / Planck | Cosmology | Simulation, EOD/Analysis | 10PB | Automatic data movement optimizations
DECam & LSST | Cosmology | EOD/Analysis | ~10TB | Easy interfaces, data transformations
ACME | Climate | Simulation | ~10PB | Async I/O, derived variables, automatic data movement
TECA | Climate | Analysis | ~10PB | Data organization and efficient data movement
HipMer | Genomics | EOD/Analysis | ~100TB | Scalable I/O interfaces, efficient and automatic data movement
25
Easy interfaces and superior performance
Autonomous data management
Information capture and management
25
Storage Systems and I/O: Next Generation
[Figure: Next generation: applications hold data in memory and use a single high-level API, with the data management software mapping the data onto the storage hardware.]
- Next generation I/O software
– Autonomous, proactive data management system beyond POSIX restrictions
- Transparent data movement
- Proactive analysis
– Object-centric storage interface
- Rich metadata
- Data and metadata accessible through queries
– Transparent data object placement and organization across storage hardware layers
26
[Figure: Storage hierarchy spanned by the system: memory, node-local storage, shared burst buffer, parallel file system, campaign storage, archival storage (HPSS tape).]
What is an object store?
– Simple POSIX file system interface: chmod, open, read, lseek, write, close, stat, unlink
– Object store interface: get, put, delete
Slide from Glenn Lockwood
27
What is an object?
- Chunks of a file
- Files (images, videos, etc.)
- Array
- Key-value pairs
- File + Metadata
(Examples: current parallel file systems; cloud services such as S3; HDF5, DAOS, etc.; OpenStack Swift, MarFS, Ceph, etc.)
PDC Interpretation of Objects
Data + Metadata + Provenance + Analysis operations + Information (data products)
Proactive Data Containers (PDC)
[Figure: PDC constructs: containers, collections, and PDC loci.]
30
▪ Interface
– Programming and client-level interfaces
▪ Services
– Metadata management
– Autonomous data movement
– Analysis and transformation task execution
▪ PDC locus services
– Object mapping
– Local metadata management
– Locus task execution
PDC System – High-level Architecture
31
Persistent Storage API
(burst buffer file systems, Lustre, DAOS, ...)
PDC System – High-level Architecture
32
Data Management Using the PDC System
33
[Figure: Application processes exchanging data and metadata with PDC system processes.]
- Storing data
– The application declares persistent data objects → PDC creates metadata objects
– The application adds 'tags' / properties to identify objects in the future → PDC adds these as metadata
– Application processes map memory buffers to regions of objects
– When the data in the objects are ready, the PDC system moves the data to storage and updates the metadata → asynchronous and autonomous
- Retrieving data
– The application queries metadata to find the desired objects ← the PDC system returns handles to those objects
– The application maps a region of the object or gives a query condition ← the PDC system brings the desired data to memory
- Create & open objects
– Create sets object properties (metadata): name, lifetime, user info, provenance, tags, dimensions, data type, transformations, etc.
- Create an object region
– Similar to HDF5 hyperslab selections
- Map / unmap an object region
– Object region <=> memory region
- Lock / unlock a mapped region
– Read / write locks
– Transparently update the memory buffer / object, asynchronously
– Transforms occur "outside" of lock time, managed by the PDC system
- Close & release (delete) objects
PDC API – Object Manipulation
34
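A hypothetical sketch of the write path described above: declare a persistent object with tags, map a memory buffer to an object region, and update it under a write lock so the PDC service can move the data to storage asynchronously. The function names and signatures below are illustrative placeholders only; they are not the actual PDC client API, and they simply mirror the flow on this slide.

    #include <stdint.h>

    /* Illustrative placeholder API; the real PDC calls differ (see the PDC docs). */
    typedef int64_t pdc_id_t;
    pdc_id_t pdc_prop_create(void);
    void     pdc_prop_set_dims(pdc_id_t prop, int ndim, const uint64_t *dims);
    void     pdc_prop_add_tag(pdc_id_t prop, const char *key, const char *value);
    pdc_id_t pdc_obj_create(const char *name, pdc_id_t prop);
    pdc_id_t pdc_region_create(int ndim, const uint64_t *offset, const uint64_t *size);
    void     pdc_buf_map(void *buf, pdc_id_t obj, pdc_id_t region);
    void     pdc_lock(pdc_id_t obj, pdc_id_t region, int write);
    void     pdc_unlock(pdc_id_t obj, pdc_id_t region);
    void     pdc_buf_unmap(void *buf, pdc_id_t obj, pdc_id_t region);
    void     pdc_obj_close(pdc_id_t obj);

    /* Write one array of particle coordinates as a persistent object. */
    void write_particle_x(double *x, uint64_t n)
    {
        pdc_id_t prop = pdc_prop_create();
        pdc_prop_set_dims(prop, 1, &n);                /* object dimensions */
        pdc_prop_add_tag(prop, "app", "VPIC");         /* searchable metadata */
        pdc_prop_add_tag(prop, "timestep", "0");

        pdc_id_t obj = pdc_obj_create("particle_x", prop);

        /* Map the in-memory buffer to a region of the object */
        uint64_t offset = 0;
        pdc_id_t region = pdc_region_create(1, &offset, &n);
        pdc_buf_map(x, obj, region);

        /* Update under a write lock; after release, the PDC service moves
         * the data to storage asynchronously and updates the metadata. */
        pdc_lock(obj, region, 1 /* write */);
        /* ... fill or update x ... */
        pdc_unlock(obj, region);

        pdc_buf_unmap(x, obj, region);
        pdc_obj_close(obj);
    }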
PDC API – I/O
35
PDC operation
- Create query with conditions
– Sets up query execution and, in the future, invokes a query optimization framework
– Allows application developers to search for named objects, as well as objects with particular characteristics
- Execute query
– Query execution can occur at multiple tiers and execute locally on sharded / striped objects
- Iterate_start / Iterate_next
– Iterate over objects from query results, as well as generic actions
- Get_object_handle / Get_object_info
– Retrieve metadata for an object
PDC API – Object Access
40
Metadata Object Management
Capabilities
- Create, update, search, and delete metadata objects.
- All tags are searchable.
- Maintain extended attributes and object relationships.
A collection of tags (key-value pairs)
PDC Namespace Management
PDC Metadata Management
Metadata Search
- Exact match search
○ Similar to stat.
○ Requires all ID attributes.
○ Retrieves a single metadata object, directly from one target server.
- Partial match search
○ Similar to find or grep.
○ Any tag can be specified.
○ Retrieves multiple metadata objects; needs to scan all servers.
■ Done in parallel.
■ Indexing is being implemented.
Performance: Metadata Creation
SoMeta 1: all objects have the same name but different values in other ID attributes (timestep).
SoMeta 4: four unique names are used and each name is used by a quarter of the metadata objects; objects with an identical name have different ID attributes.
SoMeta Unique: each metadata object has a unique name.
Performance of scaling SoMeta by creating 10,000 to 100 million metadata objects with 512 servers and 2560 clients on Cori.
Performance: Metadata Search
[Figures: exact match and partial match search times.]
Searching for up to 20% of 1 million objects takes a fraction of a second with 128 servers. Network transfer time dominates the total time. Exact match search requires many more small network transfers.
Searching for BOSS objects
Total elapsed time to group objects by adding tags (SoMeta), attributes (SciDB), or symlinks (Lustre) with different selectivity, and total elapsed time for searching and retrieving the metadata of previously assigned tags / attributes with different selectivity.
SoMeta is 10 to 90X faster for metadata grouping (tagging), and 2 to 16X faster in searching attributes (tags) than SciDB and MongoDB, up to 800X faster with 80 clients searching in parallel.
PDC System - High-level Architecture
Persistent Storage API
(burst buffer file systems, Lustre, DAOS, ...)
Asynchronous I/O
- PDC supports asynchronous I/O through its client-server architecture
○ Client sends an I/O request
○ Server confirms receipt of the request
○ Client continues to the next computation
[Figures: asynchronous Write and Read timelines.]
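An illustrative sketch of that request / acknowledge / continue pattern, using an ordinary worker thread in place of the PDC server. This is not PDC code, just the overlap idea: the client blocks only until the request is accepted, and the write completes in the background while computation continues.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One outstanding I/O request handed from the "client" to the "server". */
    typedef struct {
        const double *buf;
        size_t n;
        const char *path;
        pthread_mutex_t lock;
        pthread_cond_t accepted_cv;
        int accepted;                  /* set once the server owns the request */
    } io_request_t;

    static void *io_server(void *arg)
    {
        io_request_t *req = arg;

        /* Confirm receipt so the client can continue immediately */
        pthread_mutex_lock(&req->lock);
        req->accepted = 1;
        pthread_cond_signal(&req->accepted_cv);
        pthread_mutex_unlock(&req->lock);

        /* Perform the actual write in the background */
        FILE *f = fopen(req->path, "wb");
        if (f) {
            fwrite(req->buf, sizeof(double), req->n, f);
            fclose(f);
        }
        return NULL;
    }

    int main(void)                     /* build with: cc -pthread async_sketch.c */
    {
        size_t n = 1 << 20;
        double *data = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++) data[i] = (double)i;

        io_request_t req = {data, n, "timestep0.bin",
                            PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0};

        pthread_t server;
        pthread_create(&server, NULL, io_server, &req);

        /* Block only until the request is acknowledged, not until it completes */
        pthread_mutex_lock(&req.lock);
        while (!req.accepted)
            pthread_cond_wait(&req.accepted_cv, &req.lock);
        pthread_mutex_unlock(&req.lock);

        /* ... next computation phase overlaps with the background write ... */

        pthread_join(server, NULL);    /* wait before reusing the buffer or exiting */
        free(data);
        return 0;
    }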
VPIC-IO (Weak Scaling) Multi-timestep Write
Total time to write 5 timesteps from the VPIC-IO kernel to Lustre and to the burst buffer on Cori. PDC is 5x faster than HDF5 and 23x faster than PLFS.
BD-CATS-IO (Weak scaling) Multi-timestep Read
Total time to read 5 timesteps of data with the BD-CATS-IO kernel from Lustre and from the burst buffer. PDC is 11x faster than PLFS and HDF5.
Conclusions
- Easy interfaces and superior performance
- Autonomous data management
- Information capture and management
54
- Simpler object interface
- Applications produce data objects and declare which ones to keep persistent
- Applications request the desired data
- Asynchronous and autonomous data movement
- Bring interesting data to apps
- Manage rich metadata and enhance search capabilities
- Perform analysis and transformations in the data path
▪ Contact:
- Suren Byna (sdm.lbl.gov/~sbyna/) [SByna@lbl.gov]
▪ Contributions to this presentation
- ExaHDF5 project team (sdm.lbl.gov/exahdf5)
- Proactive Data Containers (PDC) team (sdm.lbl.gov/pdc)
- SDM group: sdm.lbl.gov
55