Efficient Scientific Data Management on Supercomputers - PowerPoint PPT Presentation



SLIDE 1

Efficient Scientific Data Management on Supercomputers

Suren Byna

Scientific Data Management Group, LBNL

SLIDE 2

▪ Simulations
▪ Experiments
▪ Observations

Scientific Data - Where is it coming from?

SLIDE 3

Life of scientific data

Generation, in situ analysis, processing, storage, analysis, preservation, sharing, refinement

SLIDE 4


Supercomputing systems

SLIDE 5


Typical supercomputer architecture

Cori system

[Diagram: Cori system architecture – compute nodes (CN) on the Aries high-speed network; burst buffer blades (each blade = 2 burst buffer nodes with 2 SSDs each); I/O nodes (2 InfiniBand HCAs each) connecting through the InfiniBand storage fabric to the storage servers hosting Lustre OSSs/OSTs.]

SLIDE 6

▪ Data representation

– Metadata, data structures, data models

▪ Data storage

– Storing and retrieving data and metadata to/from file systems quickly

▪ Data access

– Improving the performance of data access that scientists desire

▪ Facilitating analysis

– Strategies to support finding meaning in the data

▪ Data transfers

– Transferring data within a supercomputing system and between different systems

Scientific Data Management in supercomputers


SLIDE 8

▪ Storing and retrieving data – Parallel I/O

– Software stack
– Modes of parallel I/O
– Tuning parallel I/O performance

▪ Autonomous data management system

– Proactive Data Containers (PDC) system
– Metadata management service
– Data management service

Focus of this presentation

SLIDE 9

Trends – Storage system transformation

[Diagram: evolution of the HPC storage hierarchy]

Conventional: Memory → (I/O gap) → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)

Current (e.g., Cori @ NERSC): Memory → Shared burst buffer → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)

Upcoming (e.g., Aurora @ ALCF): Memory → Node-local storage → Shared burst buffer → Parallel file system → Campaign storage → Archival storage (HPSS tape)

  • The I/O performance gap in HPC storage is a significant bottleneck because of slow disk-based storage
  • SSDs and new memory technologies are trying to fill the gap, but they increase the depth of the storage hierarchy

SLIDE 10

Applications → High Level I/O Libraries → I/O Middleware → I/O Forwarding → Parallel File System → I/O Hardware

Contemporary Parallel I/O Software Stack

SLIDE 11

Applications
High Level I/O Library (HDF5, NetCDF, ADIOS)
I/O Middleware (MPI-IO)
I/O Forwarding
Parallel File System (Lustre, GPFS, ...)
I/O Hardware

Parallel I/O software stack

▪ I/O Libraries

– HDF5 (The HDF Group) [LBL, ANL]
– ADIOS (ORNL)
– PnetCDF (Northwestern, ANL)
– NetCDF-4 (UCAR)

▪ Middleware – POSIX-IO, MPI-IO (ANL)
▪ I/O Forwarding
▪ File systems: Lustre (Intel), GPFS (IBM), DataWarp (Cray), …
▪ I/O Hardware (disk-based, SSD-based, …)
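
To ground the stack in code, below is a minimal sketch (not from the slides) of a collective write that passes through these layers using the parallel HDF5 and MPI-IO APIs; the file name, dataset name, and sizes are illustrative.

```c
/* Minimal parallel HDF5 write: each rank writes its slice of a shared 1-D dataset.
 * Build with the parallel HDF5 wrapper, e.g.: h5pcc -o pwrite pwrite.c */
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t local_n = 1024;                  /* elements per rank (illustrative) */
    hsize_t global_n = local_n * (hsize_t)nprocs;  /* total dataset size */

    /* File access property list: route HDF5 through the MPI-IO middleware */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One shared dataset; every rank selects its own hyperslab */
    hid_t filespace = H5Screate_simple(1, &global_n, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t offset = local_n * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &local_n, NULL);
    hid_t memspace = H5Screate_simple(1, &local_n, NULL);

    double *buf = malloc(local_n * sizeof(double));
    for (hsize_t i = 0; i < local_n; i++) buf[i] = (double)(offset + i);

    /* Dataset transfer property list: collective I/O (two-phase aggregation below) */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```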

SLIDE 12

▪ Types of parallel I/O (a minimal MPI-IO sketch of the shared-file mode follows this slide)

  • 1 writer/reader, 1 file
  • N writers/readers, N files (file-per-process)
  • N writers/readers, 1 file
  • M writers/readers, 1 file
    – Aggregators
    – Two-phase I/O
  • M aggregators, M files (file-per-aggregator)
    – Variations of this mode

Parallel I/O – Application view

[Diagram: process-to-file mappings for each mode – 1 writer/reader to 1 file; n writers/readers to n files; n writers/readers to 1 shared file; M writers/readers to M files; M writers/readers to 1 shared file.]
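
As a concrete illustration of the "N writers/readers, 1 file" mode, here is a small MPI-IO sketch (my own example, not from the slides); the file name and per-rank sizes are made up. Each rank writes its block at a rank-derived offset with a collective call, which lets the MPI-IO layer apply two-phase aggregation.

```c
/* "N writers/readers, 1 file": every rank writes local_n doubles at its own
 * offset in one shared file, using a collective MPI-IO call so the library
 * can apply two-phase aggregation. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int local_n = 1 << 20;                     /* doubles per rank (illustrative) */
    double *buf = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++) buf[i] = rank + 0.001 * i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    /* _all = collective: all ranks participate, enabling aggregators */
    MPI_File_write_at_all(fh, offset, buf, local_n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

The file-per-process mode would instead open a per-rank file name and use independent writes; it avoids shared-file contention but produces N files to manage.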

SLIDE 13

▪ Parallel file systems

– Lustre and Spectrum Scale (GPFS)

▪ Typical building blocks of parallel file systems

– Storage hardware: HDD or SSD RAID
– Storage servers (in Lustre, Object Storage Servers [OSS] and Object Storage Targets [OST])
– Metadata servers
– Client-side processes and interfaces

▪ Management

– Stripe files for parallelism
– Tolerate failures

Parallel I/O – System view

[Diagram: logical view of a file versus its physical view on a parallel file system, striped across OST 0–OST 3 over the communication network.]
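
Striping is usually configured per file or per directory (e.g., with Lustre's lfs setstripe); from an application, one common route is MPI-IO hints at file-creation time. A sketch, assuming the ROMIO hint names used for Lustre; the values are illustrative, not recommendations:

```c
/* Setting Lustre striping from an application at file-creation time,
 * via MPI-IO hints (ROMIO's Lustre driver). Values are illustrative. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");      /* stripe across 8 OSTs */
    MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MiB stripe size */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* collective buffering */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "striped.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes as in the previous sketch ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

Striping hints only matter when the file is created; changing them later has no effect on an existing file.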

SLIDE 14

How to achieve peak parallel I/O performance?

Tunable layers and example knobs:

Application
HDF5 (alignment, chunking, etc.)
MPI-IO (enabling collective buffering, sieving buffer size, collective buffer size, collective buffer nodes, etc.)
Parallel File System (number of I/O nodes, stripe size, enabling prefetching buffer, etc.)
Storage Hardware

▪ The parallel I/O software stack provides options for performance optimization
▪ Challenge: complex inter-dependencies among the SW and HW layers
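
A hedged sketch of how several of these knobs are set in code, combining HDF5 alignment and chunking with MPI-IO collective-buffering and data-sieving hints; the specific values are placeholders, since good settings depend on the system and access pattern.

```c
/* Applying knobs at several layers (values are placeholders, not recommendations):
 * HDF5 alignment and chunking, plus MPI-IO collective-buffering and
 * data-sieving hints passed down through the file access property list. */
#include <mpi.h>
#include <hdf5.h>

static hid_t tuned_fapl(void) {
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "16");                 /* collective buffer nodes */
    MPI_Info_set(info, "cb_buffer_size", "16777216");     /* 16 MiB per aggregator */
    MPI_Info_set(info, "ind_rd_buffer_size", "4194304");  /* data-sieving read buffer */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
    /* Align every object of 1 MiB or more on 4 MiB boundaries (e.g., the stripe size) */
    H5Pset_alignment(fapl, 1048576, 4194304);
    MPI_Info_free(&info);
    return fapl;
}

static hid_t chunked_dcpl(void) {
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = {1048576};   /* 1M-element chunks for a 1-D dataset */
    H5Pset_chunk(dcpl, 1, chunk);
    return dcpl;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    hid_t fapl = tuned_fapl();
    hid_t file = H5Fcreate("tuned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    hid_t dcpl = chunked_dcpl();
    /* ... create chunked datasets with dcpl and write collectively as before ... */
    H5Pclose(dcpl); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```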

SLIDE 15

Tuning parameter space

[Figure: the whole tuning space visualized – parameters include Stripe_Count, Stripe_Size (MB), cb_nodes, cb_buffer_size (MB), alignment, and siv_buf_size (KB), giving on the order of 23,040 parameter combinations.]

SLIDE 16

▪ Simulation of magnetic reconnection (a space weather phenomenon) with the VPIC code

– 120,000 cores
– 8 arrays (HDF5 datasets)
– 32 TB to 42 TB files at 10 time steps

▪ Extracted I/O kernel
▪ M aggregators writing to 1 shared file
▪ Trial-and-error selection of Lustre file system parameters while scaling the problem size
▪ Reached peak performance in many instances in a real simulation

Tuning for writing trillion-particle datasets

More details: SC12 and CUG 2013 papers

SLIDE 17

Tuning combinations are abundant

  • Searching through all combinations manually is impractical
  • Users, typically domain scientists, should not be burdened with tuning
  • Performance auto-tuning has been explored heavily for optimizing matrix operations
  • Auto-tuning for parallel I/O is challenging due to the shared I/O subsystem and slow I/O
  • A strategy is needed to reduce the search space using some knowledge

SLIDE 18

Our solution: I/O Auto-tuning

  • Auto-tuning framework to search the parameter space with a reduced number of combinations
  • The HDF5 I/O library sets the optimization parameters
  • H5Tuner: dynamic interception of HDF5 calls
  • H5Evolve:
    – Genetic algorithm based selection
    – Model-based selection
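
H5Tuner intercepts HDF5 calls dynamically so that tuned parameters can be injected without modifying the application. The following is a minimal sketch of that general idea (my own illustration using an LD_PRELOAD wrapper, not the actual H5Tuner source):

```c
/* Sketch of dynamic interception (the general H5Tuner idea, not its source code).
 * Build as a shared library and load it with LD_PRELOAD so the application's
 * H5Fcreate calls pass through this wrapper, which injects tuned parameters. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <hdf5.h>

typedef hid_t (*H5Fcreate_fn)(const char *, unsigned, hid_t, hid_t);

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id) {
    /* Find the real H5Fcreate in the next library on the search path */
    H5Fcreate_fn real_create = (H5Fcreate_fn)dlsym(RTLD_NEXT, "H5Fcreate");

    /* Inject tuned settings into (a copy of) the caller's access property list;
     * a real tuner would read these values from a configuration file. */
    hid_t tuned = (fapl_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                           : H5Pcopy(fapl_id);
    H5Pset_alignment(tuned, 1048576, 4194304);

    hid_t file = real_create(name, flags, fcpl_id, tuned);
    H5Pclose(tuned);
    return file;
}
```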

SLIDE 19

Dynamic Model-driven Auto-tuning

  • Auto-tuning using empirical performance models of I/O
  • Steps
    – Training phase to develop an I/O model
    – Pruning phase to select the top-k configurations
    – Exploration phase to select the best configuration
    – Refitting step to refine the performance model

[Diagram: overview of dynamic model-driven I/O tuning – an I/O kernel and training set run on the HPC and storage systems produce performance results used to develop an I/O model; pruning over all possible values selects the top-k configurations; exploration selects the best-performing configuration; refitting (controlled by the user) refines the model.]

SLIDE 20

Empirical Performance Model

  • Non-linear regression model: a linear combination of $n_b$ non-linear, low-order polynomial basis functions $\phi_k$ with hyper-parameters $\beta$ (selected with a standard regression approach) for a parameter configuration $x$:

    $m(x;\beta) = \sum_{k=1}^{n_b} \beta_k \phi_k(x)$

  • For example, with f: file size, a: number of aggregators, c: stripe count, s: stripe size:

    $m(x) = \beta_1 + \beta_2 \frac{1}{s} + \beta_3 \frac{1}{a} + \beta_4 \frac{c}{s} + \beta_5 \frac{f}{c} + \beta_6 \frac{f}{s} + \beta_7 \frac{cf}{a}$,

    with a fit to the data yielding $\beta = [10.59,\ 68.99,\ 59.83,\ -1.23,\ 2.26,\ 0.18,\ 0.01]$.
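
A direct transcription of the example model into code, using the fitted coefficients above; the sample configuration in main is invented, and the units of f, a, c, s are whatever was used during training.

```c
/* Evaluate the example empirical model m(x) for a configuration
 * x = (f: file size, a: aggregators, c: stripe count, s: stripe size). */
#include <stdio.h>

static double model(double f, double a, double c, double s) {
    const double b[7] = {10.59, 68.99, 59.83, -1.23, 2.26, 0.18, 0.01};
    return b[0]
         + b[1] * (1.0 / s)
         + b[2] * (1.0 / a)
         + b[3] * (c / s)
         + b[4] * (f / c)
         + b[5] * (f / s)
         + b[6] * (c * f / a);
}

int main(void) {
    /* Invented configuration; f, a, c, s must be in the units used during training */
    printf("model prediction: %f\n", model(32.0, 16.0, 8.0, 4.0));
    return 0;
}
```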

SLIDE 21

Performance Improvement: 4K cores

[Chart: I/O bandwidth (GB/s) achieved after tuning for the VPIC-IO, VORPAL-IO, and GCRM-IO kernels on Edison, Hopper, and Stampede, compared with the default configurations on Hopper.]

SLIDE 22

Performance Improvement: 8K cores

[Chart: I/O bandwidth (GB/s) achieved after tuning for the VPIC-IO, VORPAL-IO, and GCRM-IO kernels on Edison and Hopper, compared with the default configurations on Hopper; up to 94x improvement.]

SLIDE 23


Autonomous data management

SLIDE 24

Storage Systems and I/O: Current status

[Diagram: current usage, software, and hardware layers – applications hold data in memory, push it through the IO software stack (high-level libraries such as HDF5, IO middleware such as POSIX and MPI-IO, IO forwarding, parallel file systems), and end up with files in the file system on a deep hardware hierarchy (memory, node-local storage, shared burst buffer, parallel file system, campaign storage, archival storage on HPSS tape). Users tune the middleware and the file systems themselves.]

  • Challenges
    – The multi-level hierarchy complicates data movement, especially if the user has to be involved
    – POSIX-IO semantics hinder the scalability and performance of file systems and IO software

SLIDE 25

HPC data management requirements

| Use case | Domain | Sim/EOD/Analysis | Data size | I/O Requirements |
| --- | --- | --- | --- | --- |
| FLASH | High-energy density physics | Simulation | ~1 PB | Data transformations, scalable I/O interfaces, correlation among simulation and experimental data |
| CMB / Planck | Cosmology | Simulation, EOD/Analysis | 10 PB | Automatic data movement optimizations |
| DECam & LSST | Cosmology | EOD/Analysis | ~10 TB | Easy interfaces, data transformations |
| ACME | Climate | Simulation | ~10 PB | Async I/O, derived variables, automatic data movement |
| TECA | Climate | Analysis | ~10 PB | Data organization and efficient data movement |
| HipMer | Genomics | EOD/Analysis | ~100 TB | Scalable I/O interfaces, efficient and automatic data movement |

Common themes: easy interfaces and superior performance, autonomous data management, information capture and management

SLIDE 26

Storage Systems and I/O: Next Generation

[Diagram: applications hold data in memory and use a high-level API; the next-generation IO software places data objects across the hardware hierarchy (memory, node-local storage, shared burst buffer, parallel file system, campaign storage, archival storage on HPSS tape).]

  • Next generation IO software
    – Autonomous, proactive data management system beyond POSIX restrictions
      • Transparent data movement
      • Proactive analysis
    – Object-centric storage interface
      • Rich metadata
      • Data and metadata accessible through queries
    – Transparent data object placement and organization across storage hardware layers

SLIDE 27

What is an object store?

Simple POSIX file system interface: open, read, write, lseek, close, stat, chmod, unlink
Object store interface: get, put, delete

Slide from Glenn Lockwood

SLIDE 28

What is an object?

  • Chunks of a file
  • Files (images, videos, etc.)
  • Array
  • Key-value pairs
  • File + Metadata

Examples of systems that manage such objects: current parallel file systems; cloud services (S3, etc.); HDF5, DAOS, etc.; OpenStack Swift, MarFS, Ceph, etc.

SLIDE 29

PDC Interpretation of Objects

Data + Metadata + Provenance + Analysis operations + Information (data products)

Proactive Data Containers (PDC)

SLIDE 30

Proactive Data Containers

Key concepts: Container, Collection, PDC Locus

SLIDE 31

▪ Interface

– Programming and client-level interfaces

▪ Services

– Metadata management
– Autonomous data movement
– Analysis and transformation task execution

▪ PDC locus services

– Object mapping
– Local metadata management
– Locus task execution

PDC System – High-level Architecture

SLIDE 32

▪ Interface

– Programming and client-level interfaces

▪ Services

– Metadata management
– Autonomous data movement
– Analysis and transformation task execution

▪ PDC locus services

– Object mapping
– Local metadata management
– Locus task execution

Persistent Storage API: BB FS, Lustre, DAOS, …

PDC System – High-level Architecture

SLIDE 33

Data Management Using the PDC System

[Diagram: application processes interacting with PDC system processes.]

  • Storing data
    – Application declares persistent data objects → PDC creates metadata objects
    – Application adds ‘tags’ / properties to identify objects in the future → PDC adds these as metadata
    – Application processes map memory buffers to regions of objects
    – When the data in the objects are ready, the PDC system moves the data to storage and updates the metadata → asynchronous and autonomous
  • Retrieving data
    – Application queries metadata to find the desired objects ← PDC system returns handles to the desired objects
    – Application maps to a region of the object or gives a query condition ← PDC system brings the desired data to memory

SLIDE 34
  • Create & Open objects
    – Create sets object properties (metadata): name, lifetime, user info, provenance, tags, dimensions, data type, transformations, etc.
  • Create an object region
    – Similar to HDF5 hyperslab selections
  • Map / Unmap an object region
    – Object region <=> memory region
  • Lock / Unlock a mapped region
    – Read / write locks
    – Transparently update the memory buffer / object, asynchronously
    – Transforms occur “outside” of lock time, managed by the PDC system
  • Close & Release (delete) objects

PDC API – Object Manipulation
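
To make the flow concrete, here is a pseudo-C sketch of the create → region → map → lock → write → unlock → close sequence described above. All pdc_* names and signatures are invented for illustration and are not the actual PDC API; implementations of the stubs are omitted.

```c
/* Illustrative sketch of the PDC object-manipulation flow.
 * NOTE: every pdc_* name and signature below is hypothetical (for illustration
 * only, not the real PDC API), and stub implementations are omitted. */
#include <stdint.h>

typedef int pdc_handle_t;                       /* stand-in handle type */
enum pdc_lock_mode { PDC_READ, PDC_WRITE };

pdc_handle_t pdc_obj_create(const char *name, int ndim, const uint64_t *dims, int tag);
pdc_handle_t pdc_region_create(pdc_handle_t obj, const uint64_t *offset, const uint64_t *count);
void pdc_region_map(pdc_handle_t region, void *buf);
void pdc_region_lock(pdc_handle_t region, enum pdc_lock_mode mode);
void pdc_region_unlock(pdc_handle_t region);
void pdc_region_unmap(pdc_handle_t region);
void pdc_obj_close(pdc_handle_t obj);

void write_timestep(double *buf, uint64_t n, int timestep) {
    /* Create an object; its properties (name, dims, type, tags) become metadata */
    pdc_handle_t obj = pdc_obj_create("particles/x", 1, &n, timestep);

    /* Select a region of the object (similar to an HDF5 hyperslab selection) */
    uint64_t offset = 0;
    pdc_handle_t region = pdc_region_create(obj, &offset, &n);

    /* Map the application buffer to the object region */
    pdc_region_map(region, buf);

    /* Write lock; after release, the PDC system moves the data to storage
     * asynchronously, with transforms handled outside of lock time */
    pdc_region_lock(region, PDC_WRITE);
    for (uint64_t i = 0; i < n; i++) buf[i] *= 2.0;   /* update the mapped buffer */
    pdc_region_unlock(region);

    pdc_region_unmap(region);
    pdc_obj_close(obj);
}
```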

SLIDE 35

PDC API – I/O



SLIDE 39

PDC operation

SLIDE 40
  • Create query with conditions
    – Sets up query execution; will invoke a query optimization framework in the future
    – Allows application developers to search for named objects, as well as objects with particular characteristics
  • Execute query
    – Query execution can occur at multiple tiers, and can execute locally on sharded / striped objects
  • Iterate_start / Iterate_next
    – Iterate over objects from query results, as well as generic actions
  • Get_object_handle / Get_object_info
    – Retrieve metadata for an object

PDC API – Object Access
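
A matching sketch of the object-access flow (again with invented pdc_* names and signatures, not the real PDC API), showing a query built from conditions, executed, and iterated:

```c
/* Illustrative sketch of the PDC object-access flow (hypothetical pdc_* names
 * and signatures, not the real PDC API; stub implementations omitted). */
#include <stdio.h>

typedef int pdc_handle_t;

pdc_handle_t pdc_query_create(const char *condition);
pdc_handle_t pdc_query_execute(pdc_handle_t query);
pdc_handle_t pdc_iterate_start(pdc_handle_t results);
int          pdc_iterate_next(pdc_handle_t iter, pdc_handle_t *obj);
int          pdc_get_object_info(pdc_handle_t obj, char *name_out, int name_len);

void find_matching_objects(void) {
    /* Search by name and characteristics rather than by file path */
    pdc_handle_t q = pdc_query_create("app=VPIC AND timestep=5 AND energy>1.0e6");
    pdc_handle_t results = pdc_query_execute(q);   /* may run on multiple tiers */

    pdc_handle_t iter = pdc_iterate_start(results);
    pdc_handle_t obj;
    char name[256];
    while (pdc_iterate_next(iter, &obj)) {
        pdc_get_object_info(obj, name, sizeof name);
        printf("matched object: %s\n", name);
    }
}
```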

SLIDE 41

Metadata Object Management

Capabilities

  • Create, update, search, and delete metadata objects.
  • All tags are searchable.
  • Maintain extended attributes and object relationships.

A collection of tags (key-value pairs)
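
A rough C sketch of that idea, a metadata object as a searchable collection of key-value tags plus relationships; these are illustrative structures, not the actual SoMeta implementation.

```c
/* Rough sketch of a metadata object: ID attributes, a searchable set of
 * key-value tags, and relationships to other objects (illustration only,
 * not the actual SoMeta / PDC structures). */
typedef struct tag { char key[64]; char value[256]; } tag_t;

typedef struct metadata_object {
    char   name[256];                    /* object name: an ID attribute */
    int    timestep;                     /* another ID attribute */
    tag_t *tags;                         /* extended attributes; all searchable */
    int    num_tags;
    struct metadata_object **related;    /* object relationships */
    int    num_related;
} metadata_object_t;
```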

SLIDE 42

PDC Namespace Management

SLIDE 43

PDC Metadata Management

SLIDE 44

Metadata Search

  • Exact match search
    ○ Similar to stat.
    ○ Requires all ID attributes.
    ○ Retrieves a single metadata object, directly from one target server.
  • Partial match search
    ○ Similar to find or grep.
    ○ Any tag can be specified.
    ○ Retrieves multiple metadata objects; needs to scan all servers.
      ■ Done in parallel.
      ■ Indexing is being implemented.
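
One way to picture the difference between the two search paths is by how requests are routed to metadata servers; a conceptual sketch (not PDC's implementation, and the hash function and server count are arbitrary):

```c
/* Conceptual sketch of how the two searches are routed (not PDC's code):
 * an exact match hashes the full object ID to one metadata server, while a
 * partial match fans out to every server and scans them in parallel. */
#include <stdint.h>

#define NUM_SERVERS 128   /* illustrative server count */

/* Simple FNV-style hash over the ID attributes to pick the owning server */
static uint32_t hash_id(const char *name, int timestep) {
    uint32_t h = 2166136261u;
    for (const char *p = name; *p; p++) h = (h ^ (uint8_t)*p) * 16777619u;
    return (h ^ (uint32_t)timestep) % NUM_SERVERS;
}

int exact_match_target(const char *name, int timestep) {
    /* stat-like lookup: all ID attributes known, so exactly one server is asked */
    return (int)hash_id(name, timestep);
}

int partial_match_targets(int targets[NUM_SERVERS]) {
    /* find/grep-like lookup: any tag may be given, so every server must scan */
    for (int s = 0; s < NUM_SERVERS; s++) targets[s] = s;
    return NUM_SERVERS;
}
```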

SLIDE 45

Performance: Metadata Creation

SoMeta 1: all objects have the same name but different values in other ID attributes (timestep).
SoMeta 4: four unique names are used and each name is used by a quarter of the metadata objects. The objects with an identical name have different ID attributes.
SoMeta Unique: each metadata object has a unique name.

Performance of scaling SoMeta by creating 10,000 to 100 million metadata objects with 512 servers and 2560 clients on Cori.

SLIDE 46

Performance: Metadata Search

[Charts: exact match search and partial match search.]

Searching up to 20% of 1 million objects takes a fraction of a second with 128 servers. Network transfer time dominates the total time; exact match search requires many more small network transfers.

SLIDE 47

Searching for BOSS objects

Total elapsed time to group objects by adding tags (SoMeta), attributes (SciDB), or symlinks (Lustre) with different selectivity, and total elapsed time for searching and retrieving the metadata of previously assigned tags/attributes with different selectivity.

SoMeta is 10 to 90X faster for metadata grouping (tagging), and 2 to 16X faster in searching attributes (tags) than SciDB and MongoDB, up to 800X faster with 80 clients searching in parallel.

SLIDE 48

PDC System - High-level Architecture

Persistent Storage API

BB FS, Lustre, DAOS, …

SLIDE 49

Asynchronous I/O

  • PDC supports asynchronous I/O through a client-server architecture
    ○ Client sends an I/O request
    ○ Server confirms receipt of the request
    ○ Client continues to the next computation
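
The request/acknowledge pattern can be sketched as follows; this illustration uses MPI point-to-point calls purely to show the overlap idea and is not how PDC's client-server transport is actually implemented.

```c
/* Sketch of the asynchronous request/acknowledge pattern: the client posts an
 * I/O request to a server rank, waits only for a short acknowledgement, and
 * returns to computation while the server performs the data movement later.
 * MPI point-to-point calls are used purely for illustration; this is not the
 * PDC client-server transport. */
#include <mpi.h>

void async_write_request(int server_rank, const void *request_desc, int desc_len) {
    MPI_Request req;
    int ack = 0;

    /* 1. Client sends the I/O request descriptor to the server */
    MPI_Isend(request_desc, desc_len, MPI_BYTE, server_rank, 1, MPI_COMM_WORLD, &req);

    /* 2. Server confirms receipt; this is cheap compared with the data movement */
    MPI_Recv(&ack, 1, MPI_INT, server_rank, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* 3. Client continues with the next computation; the server writes the
     *    data to the storage hierarchy asynchronously. */
}
```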

SLIDE 50

Write

SLIDE 51

Read

SLIDE 52

VPIC-IO (Weak Scaling) Multi-timestep Write

Total time to write 5 timesteps from the VPIC-IO kernel to Lustre and the burst buffer on Cori. PDC is 5x faster than HDF5 and 23x faster than PLFS.

SLIDE 53

BD-CATS-IO (Weak scaling) Multi-timestep Read

Total time to read 5 timesteps of data with the BD-CATS-IO kernel from Lustre and from the burst buffer. PDC is 11x faster than PLFS and HDF5.

SLIDE 54

Conclusions

Easy interfaces and superior performance; autonomous data management; information capture and management

  • Simpler object interface
  • Applications produce data objects and declare them to be kept persistent
  • Applications request the desired data
  • Asynchronous and autonomous data movement
  • Bring interesting data to applications
  • Manage rich metadata and enhance search capabilities
  • Perform analysis and transformations in the data path

SLIDE 55

▪ Contact:

  • Suren Byna (sdm.lbl.gov/~sbyna/) [SByna@lbl.gov]

▪ Contributions to this presentation

  • ExaHDF5 project team (sdm.lbl.gov/exahdf5)
  • Proactive Data Containers (PDC) team (sdm.lbl.gov/pdc)
  • SDM group: sdm.lbl.gov
