SLIDE 1

Member of the Helmholtz Association

Portable Parallel I/O

Handling large datasets in heterogeneous parallel environments

May 21, 2014 Michael Stephan

SLIDE 2

Portable Parallel I/O

Part I: HDF5

SLIDE 3

Learning Objectives

At the end of this lesson, you will be able to:

• Describe the main functionality of HDF5.
• Create a short example for HDF5 file I/O.
• Discuss the advantages and disadvantages of HDF5 file I/O.

Reference material: User's Guide (352 pages), Reference Guide (802 pages)

SLIDE 4

Outline

• Introduction
• Introduction to HDF5 Programming Model and APIs
• General Programming Paradigm
• Parallel HDF5

SLIDE 5

Outline

• Introduction
  • Motivation
  • Terms and Definitions
• Introduction to HDF5 Programming Model and APIs
• General Programming Paradigm
• Parallel HDF5

SLIDE 6

What is HDF5? I

A unique technology suite that makes it possible to manage extremely large and complex data collections.

SLIDE 7

What is HDF5? II

The HDF5 technology suite includes:

• A versatile data model that can represent very complex data objects and a wide variety of metadata.
• A completely portable file format with no limit on the number or size of data objects in the collection.
• A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
• A rich set of integrated performance features that allow for access time and storage space optimizations.
• Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

The HDF5 data model, file format, API, library, and tools are open and distributed without charge.

SLIDE 8

What is HDF5? III

Unlimited size, extensibility, and portability

• HDF5 does not limit the size of files or the size or number of objects in a file.
• The HDF5 format and library are extensible and designed to evolve gracefully to satisfy new demands.
• HDF5 functionality and data are portable across virtually all computing platforms, and the library is distributed with C, C++, Java, and Fortran 90 programming interfaces.

SLIDE 9

What is HDF5? IV

General data model

• HDF5 has a simple but versatile data model.
• The HDF5 data model supports complex data relationships and dependencies through its grouping and linking mechanisms.
• HDF5 accommodates many common types of metadata and arbitrary user-defined metadata.

SLIDE 10

What is HDF5? V

Unlimited variety of datatypes

• HDF5 supports a rich set of pre-defined datatypes as well as the creation of an unlimited variety of complex user-defined datatypes.
• Datatype definitions can be shared among objects in an HDF5 file, providing a powerful and efficient mechanism for describing data.
• Datatype definitions include information such as byte order (endianness), size, and floating-point representation to fully describe how the data is stored, ensuring portability to other platforms.

SLIDE 11

What is HDF5? VI

Flexible, efficient I/O

• HDF5, through its virtual file layer, offers extremely flexible storage and data transfer capabilities.
• Standard (POSIX), parallel, and network I/O file drivers are provided with HDF5.
• Application developers can write additional file drivers to implement customized data storage or transport capabilities.
• The parallel I/O driver for HDF5 reduces access times on parallel systems by reading/writing multiple data streams simultaneously.

SLIDE 12

What is HDF5? VII

Flexible data storage

• HDF5 employs various compression, extensibility, and chunking strategies to improve access, management, and storage efficiency.
• HDF5 provides for external storage of raw data, allowing raw data to be shared among HDF5 files and/or applications, and often saving disk space.

SLIDE 13

What is HDF5? VIII

Data transformation and complex subsetting

• HDF5 enables datatype and spatial transformation during I/O operations.
• HDF5 data I/O functions can operate on selected subsets of the data, reducing transferred data volume and improving access speed.

SLIDE 14

Who uses HDF5?

• Applications that deal with big or complex data
• Over 200 different types of apps
• 2+ million product users worldwide
• Academia, government agencies, industry

SLIDE 16

An HDF5 “file” is a container...

...into which you can put your data objects

Structures to organize objects

SLIDE 17

HDF5 model

• Groups: provide structure among objects
• Datasets: where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes, for metadata
• Other objects
  • Links (point to data in a file or in another HDF5 file)
  • Datatypes (can be stored for complex structures and reused by multiple datasets)

SLIDE 18

HDF5 Dataset

SLIDE 19

HDF5 Dataspace

Two roles:

• A dataspace contains spatial information about a dataset stored in a file
  • Rank and dimensions
  • Permanent part of the dataset definition
• A dataspace describes the application's data buffer and the data elements participating in I/O

SLIDE 20

HDF5 Datatype I

Datatype: how to interpret a data element

• Permanent part of the dataset definition
• Two classes: atomic and compound
• Can be stored in a file as an HDF5 object (HDF5 committed datatype)
• Can be shared among different datasets

SLIDE 21

HDF5 Datatype II

HDF5 atomic types:

• Normal integer and float
• User-definable (e.g., 13-bit integer)
• Variable-length types (e.g., strings)
• References to objects/dataset regions
• Enumeration: names mapped to integers
• Array

SLIDE 22

HDF5 Datatype III

HDF5 compound types:

• Comparable to C structs (“records”)
• Members can be atomic or compound types
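
A minimal sketch of what “comparable to C structs” means in practice: a compound datatype definition needs, for each member, a name, a byte offset, and a type, which is the same information `offsetof` yields for a C struct. The record type `sensor_record_t`, its fields, and the `member_info_t` table are hypothetical and only illustrate the idea; in real HDF5 code such offsets would typically be passed to H5Tinsert via the HOFFSET macro.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record type; the field names are illustrative only. */
typedef struct {
    int    id;
    double temperature;
    float  pressure;
} sensor_record_t;

/* One entry per member: the name, byte offset, and size that a
 * compound datatype definition needs to describe the record. */
typedef struct {
    const char *name;
    size_t      offset;
    size_t      size;
} member_info_t;

static const member_info_t sensor_members[] = {
    { "id",          offsetof(sensor_record_t, id),          sizeof(int)    },
    { "temperature", offsetof(sensor_record_t, temperature), sizeof(double) },
    { "pressure",    offsetof(sensor_record_t, pressure),    sizeof(float)  },
};

enum { SENSOR_MEMBER_COUNT = sizeof sensor_members / sizeof sensor_members[0] };

/* Print the layout table; offsets depend on the compiler's padding rules. */
void print_sensor_layout(void)
{
    for (size_t i = 0; i < SENSOR_MEMBER_COUNT; i++)
        printf("%-12s offset=%zu size=%zu\n",
               sensor_members[i].name,
               sensor_members[i].offset,
               sensor_members[i].size);
}
```

Because the offsets are taken from the compiler rather than hard-coded, the description stays correct even when padding differs between platforms.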

SLIDE 23

HDF5 dataset: array of records

SLIDE 24

Special storage options for dataset

SLIDE 25

HDF5 Attribute

• Attribute: data of the form “name = value”, attached to an object by the application
• Operations are similar to dataset operations, but
  • Not extendible
  • No compression or partial I/O
• Can be overwritten, deleted, or added during the “life” of a dataset

SLIDE 26

HDF5 Group

• A mechanism for organizing collections of related objects
• Every file starts with a root group
• Similar to UNIX directories, for example:
  • / (root)
  • /X
  • /Y
  • /X/temp
• Can have attributes

SLIDE 27

Partial I/O

Move just part of a dataset
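
As a rough illustration of moving “just part of a dataset”, the sketch below models a regular selection in one dimension: `count` blocks of `block` consecutive elements, with block starts `stride` elements apart, beginning at `start`. These four parameters mirror the ones HDF5 uses for hyperslab selections, but the function names are hypothetical and the code is plain C, not a call into the HDF5 library.

```c
#include <stddef.h>

/* Number of elements in a regular 1-D selection. */
size_t selection_size(size_t count, size_t block)
{
    return count * block;
}

/* Dataset index of the n-th selected element (n counts from 0).
 * Block c begins at start + c * stride and holds `block` elements. */
size_t selected_index(size_t start, size_t stride, size_t block, size_t n)
{
    return start + (n / block) * stride + (n % block);
}

/* Copy only the selected elements of `dataset` into the packed `buffer`.
 * Returns the number of elements copied. The caller must make sure the
 * selection lies inside the dataset and the buffer is large enough. */
size_t read_selection(const int *dataset, size_t start, size_t stride,
                      size_t count, size_t block, int *buffer)
{
    size_t total = selection_size(count, block);
    for (size_t n = 0; n < total; n++)
        buffer[n] = dataset[selected_index(start, stride, block, n)];
    return total;
}
```

With start 1, stride 4, count 3, and block 2, the selection covers dataset indices 1, 2, 5, 6, 9, and 10, so only six elements are transferred regardless of the dataset size.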

SLIDE 29

Layers – parallel example

SLIDE 31

Virtual I/O layer

• A public API for writing I/O drivers
• Allows HDF5 to interface to disk, the network, memory, or a user-defined device

SLIDE 32

Portability and Robustness

Runs almost anywhere:

• Linux and UNIX workstations
• Windows, Mac OS X
• Big ASC machines, Crays, VMS systems
• TeraGrid and other clusters

Source and binaries are available from http://www.hdfgroup.org/HDF5/release/index.html

SLIDE 33

Other Software

• The HDF Group
  • HDFView
  • Java tools
  • Command-line utilities
  • Web browser plug-in
  • Regression and performance testing software
  • Parallel h5diff
• 3rd party (IDL, MATLAB, Mathematica, PyTables, HDF Explorer, LabView)
• Communities (EOS, ASC, CGNS)
• Integration with other software (iRODS, OPeNDAP)

SLIDE 34

HDF5 software stack

SLIDE 35

Structure of HDF5 Library

SLIDE 36

Goals of HDF5 Library

• Provide a flexible API to support a wide range of operations on data.
• Support high-performance access in serial and parallel computing environments.
• Be compatible with common data models and programming languages.

Because of these goals, the HDF5 API is rich and large.

SLIDE 37

Operations Supported by the API

• Create groups, datasets, attributes, linkages
• Create complex data types
• Assign storage and I/O properties to objects
• Perform complex subsetting during read/write
• Use a variety of I/O “devices” (parallel, remote, etc.)
• Transform data during I/O
• Query file structure and properties
• Query object structure, content, and properties

SLIDE 38

Characteristics of the HDF5 API

• For flexibility, the API is extensive
  • 300+ functions
• This can be daunting, but there is hope
  • A few functions can do a lot
  • Start simple
  • Build up knowledge as more features are needed
• Library functions are categorized by object type
• The “H5Lite” API supports basic capabilities

SLIDE 39

The General HDF5 API

• Bindings currently exist for C, Fortran 90, Java, and C++.
• C routines begin with the prefix H5?, where ? is a character corresponding to the type of object the function acts on.

Example APIs:

• H5D: Dataset interface, e.g., H5Dread
• H5F: File interface, e.g., H5Fopen
• H5S: dataSpace interface, e.g., H5Sclose

SLIDE 40

Compiling HDF5 Applications

HDF5 wrappers (not fully tested!)

• h5cc: HDF5 C compiler command (similar to mpicc)
• h5fc: HDF5 F90 compiler command (similar to mpif90)
• h5c++: HDF5 C++ compiler command

Makefile in /bgsys/local/hdf5/examples/jugene/

SLIDE 41

General Programming Paradigm

• Properties of the object are optionally defined
  • Creation properties
  • Access property lists
  • Default values are used if none are defined
• Object is opened or created
• Object is accessed, possibly many times
• Object is closed
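
The paradigm above can be sketched in C as follows. This is a minimal serial example, assuming an HDF5 1.8 or later installation (it uses H5Dcreate2); the file name `example.h5`, the dataset name `/dset`, and the 4 x 6 integer array are illustrative choices that do not come from the slides. All property lists are left at their defaults via H5P_DEFAULT.

```c
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative data: a 4 x 6 array of integers. */
    int data[4][6];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            data[i][j] = i * 6 + j;

    /* Create the file with default creation and access properties. */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) {
        fprintf(stderr, "could not create example.h5\n");
        return 1;
    }

    /* Dataspace: rank 2 with dimensions 4 x 6. */
    hsize_t dims[2] = {4, 6};
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Create the dataset; the call needs the file handle as an argument. */
    hid_t dset = H5Dcreate2(file, "/dset", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Access the object: write the whole array in one call. */
    herr_t status = H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                             H5P_DEFAULT, data);

    /* Close the objects. */
    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);

    return status < 0 ? 1 : 0;
}
```

A program like this is typically built with the h5cc wrapper and the resulting file inspected with a command-line utility such as h5dump.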

SLIDE 42

Order of Operations

• An order is imposed on operations by argument dependencies
  • A file must be opened before a dataset because the dataset open call requires a file handle as an argument
• Objects can be closed in any order

SLIDE 43

HDF5 Defined Types

For portability, the HDF5 library has its own defined types:

• hid_t: object identifiers (native integer)
• hsize_t: size used for dimensions (unsigned long or unsigned long long)
• hssize_t: for specifying coordinates and sometimes for dimensions (signed long or signed long long)
• herr_t: function return value
• hvl_t: variable-length datatype

For C, include hdf5.h in your HDF5 application.

SLIDE 44

HDF5 Information

• http://www.hdfgroup.org/HDF5/
• /usr/local/hdf5/examples/ (JUQUEEN / JUROPA)
• /bgsys/local/hdf5/examples/ (JUQUEEN)
• Google ;-)

SLIDE 45

Outline

• Introduction
• Introduction to HDF5 Programming Model and APIs
• General Programming Paradigm
• Parallel HDF5
  • Overview of Parallel HDF5 Design

SLIDE 46

PHDF5 Requirements

• Support MPI programming
• PHDF5 files compatible with serial HDF5 files
  • Shareable between different serial or parallel platforms
• Single file image to all processes
  • One-file-per-process design is undesirable
    • Expensive post-processing
    • Not usable by a different number of processes
• Standard parallel I/O interface
  • Must be portable to different platforms

SLIDE 47

PHDF5 Implementation Layers

SLIDE 48

MPI-IO vs. HDF5 I

• MPI-IO is an input/output API.
• It treats the data file as a “linear byte stream”, and each MPI application needs to provide its own file view and data representations to interpret those bytes.
• All stored data are machine dependent, except for the “external32” representation.
• External32 is defined as big-endian.
  • Little-endian machines have to convert the data in both read and write operations.
  • 64-bit data types may lose information.
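
To illustrate the conversion cost mentioned above: a little-endian host has to reverse the bytes of every element when it writes to or reads from a big-endian representation such as external32. The following is a sketch for 32-bit values only; the function names are illustrative and are not part of MPI or HDF5.

```c
#include <stdint.h>

/* Returns 1 when the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Reverse the byte order of a 32-bit value. */
uint32_t swap_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Convert between host order and big-endian order. Swapping is its own
 * inverse, so the same function serves for both writing and reading. */
uint32_t host_to_big_endian_u32(uint32_t v)
{
    return host_is_little_endian() ? swap_u32(v) : v;
}
```

On a big-endian host the conversion is a no-op, which is why the overhead falls only on little-endian machines.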

SLIDE 49

MPI-IO vs. HDF5 II

• HDF5 is data management software.
• It stores the data and metadata according to the HDF5 data format definition.
• An HDF5 file is self-describing.
• Each machine can store the data in its own native representation for efficient I/O without loss of data precision.
• Any necessary data representation conversion is done automatically by the HDF5 library.

SLIDE 50

Programming Restrictions

• Most PHDF5 APIs are collective
• PHDF5 opens a parallel file with a communicator
  • Returns a file handle
  • Future access to the file is via the file handle
  • All processes must participate in collective PHDF5 APIs
  • Different files can be opened via different communicators

SLIDE 51

Examples of PHDF5 API

Examples of PHDF5 collective API:

• File operations: H5Fcreate, H5Fopen, H5Fclose
• Object creation: H5Dcreate, H5Dopen, H5Dclose
• Object structure: H5Dextend (increase dimension sizes)

Array data transfer can be collective or independent:

• Dataset operations: H5Dwrite, H5Dread
• Collectiveness is indicated by function parameters, not by function names as in the MPI API

SLIDE 52

What Does PHDF5 Support?

After a file is opened by the processes of a communicator:

• All parts of the file are accessible by all processes
• All objects in the file are accessible by all processes
• Multiple processes may write to the same data array
• Each process may write to an individual data array

C and F90 language interfaces

SLIDE 53

Programming model

• HDF5 uses an access template object (property list) to control the file access mechanism
• General model to access an HDF5 file in parallel:
  • Set up an MPI-IO or POSIX access template (access property list)
  • Open file
  • Access data
  • Close file

SLIDE 54

Setup MPI-IO access template

Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information:

• C: herr_t H5Pset_fapl_mpio(hid_t plist_id, MPI_Comm comm, MPI_Info info);
• F90: h5pset_fapl_mpio_f(plist_id, comm, info, hdferr)

plist_id is a file access property list identifier; in the Fortran interface, hdferr returns the error code.

SLIDE 55

Creating and Opening Dataset

• All processes of the communicator open/close a dataset by a collective call
  • C: H5Dcreate or H5Dopen; H5Dclose
  • F90: h5dcreate_f or h5dopen_f; h5dclose_f
• All processes of the communicator must extend an unlimited-dimension dataset before writing to it
  • C: H5Dextend
  • F90: h5dextend_f

SLIDE 56

Accessing a Dataset

• All processes that have opened the dataset may do collective I/O
• Each process may do an independent and arbitrary number of data I/O access calls
  • C: H5Dwrite and H5Dread
  • F90: h5dwrite_f and h5dread_f

SLIDE 57

Dataset transfer property

• Create and set the dataset transfer property
  • C: H5Pset_dxpl_mpio
    • H5FD_MPIO_COLLECTIVE
    • H5FD_MPIO_INDEPENDENT (default)
  • F90: h5pset_dxpl_mpio_f
    • H5FD_MPIO_COLLECTIVE_F
    • H5FD_MPIO_INDEPENDENT_F (default)
• Access the dataset with the defined transfer property

SLIDE 58

Writing and Reading Hyperslabs

• Distributed memory model: data is split among processes
• PHDF5 uses the HDF5 hyperslab model
• Each process defines memory and file hyperslabs
• Each process executes a partial write/read call
  • Collective calls
  • Independent calls

H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block)
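
The pieces introduced so far (MPI-IO access template, collective dataset creation, transfer property, hyperslab selection) can be combined as in the sketch below. It assumes a parallel-enabled HDF5 build (1.8 or later) and an MPI library; the file name `parallel.h5`, the dataset name `data`, and the sizes `NX` and `NY` are hypothetical. Each process writes a contiguous block of rows filled with its own rank, and the process count is assumed to divide `NX` evenly. Error checking is omitted for brevity.

```c
#include <hdf5.h>
#include <mpi.h>

#define NX 8   /* global rows; assumed divisible by the process count */
#define NY 5   /* global columns */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1. Access template: file access property list with MPI-IO info. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* 2. Collective file and dataset creation. */
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    hsize_t dims[2] = {NX, NY};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* 3. File hyperslab: this process's block of rows.
     *    NULL for stride and block means 1 in every dimension. */
    hsize_t count[2]  = {NX / (hsize_t)nprocs, NY};
    hsize_t offset[2] = {(hsize_t)rank * count[0], 0};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

    /* Memory dataspace describing the local buffer. */
    hid_t memspace = H5Screate_simple(2, count, NULL);
    int buf[NX * NY];   /* large enough for any block */
    for (hsize_t i = 0; i < count[0] * count[1]; i++)
        buf[i] = rank;

    /* 4. Transfer property: request collective I/O for the write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);

    /* 5. Close everything, then shut down MPI. */
    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

Switching the transfer property to H5FD_MPIO_INDEPENDENT changes the write from collective to independent without touching the H5Dwrite call itself, which is the sense in which collectiveness is a parameter rather than part of the function name.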

SLIDE 59

Writing dataset by rows
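
The figure for this slide is not available in the transcript. As a hedged sketch of a row-wise decomposition, the helpers below compute which rows each process would own, including the case where the number of rows is not a multiple of the number of processes. The function names are hypothetical; their results correspond to the first entries of the `count` and `offset` arrays that would be passed to H5Sselect_hyperslab.

```c
#include <stddef.h>

/* Number of rows owned by `rank`: the first (nrows % nprocs) ranks
 * receive one extra row so that all rows are covered. */
size_t rows_for_rank(size_t nrows, size_t nprocs, size_t rank)
{
    return nrows / nprocs + (rank < nrows % nprocs ? 1 : 0);
}

/* Index of the first row owned by `rank`. */
size_t row_offset_for_rank(size_t nrows, size_t nprocs, size_t rank)
{
    size_t base  = nrows / nprocs;
    size_t extra = nrows % nprocs;
    return rank * base + (rank < extra ? rank : extra);
}
```

For 10 rows on 4 processes this yields row counts 3, 3, 2, 2 and offsets 0, 3, 6, 8, so the blocks tile the dataset without gaps or overlap.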

SLIDE 61

Writing dataset by columns

SLIDE 63

Writing dataset by pattern

SLIDE 65

Writing dataset by chunks
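
The figure for this slide is not available in the transcript. Assuming that “by chunks” means each process writes one rectangular block of a two-dimensional dataset, the sketch below maps a rank onto a `px` x `py` process grid and computes that block. The struct, the function name, and the row-major rank ordering are assumptions made for illustration; the dataset dimensions are assumed to be divisible by the grid dimensions.

```c
#include <stddef.h>

/* Offset and extent of one rectangular block of a 2-D dataset. */
typedef struct {
    size_t offset_row, offset_col;
    size_t count_rows, count_cols;
} block_2d;

/* Block owned by `rank` when an nx x ny dataset is split over a
 * px x py process grid, with ranks numbered row by row. */
block_2d block_for_rank(size_t nx, size_t ny,
                        size_t px, size_t py, size_t rank)
{
    block_2d b;
    size_t prow = rank / py;   /* row of the process in the grid */
    size_t pcol = rank % py;   /* column of the process in the grid */
    b.count_rows = nx / px;
    b.count_cols = ny / py;
    b.offset_row = prow * b.count_rows;
    b.offset_col = pcol * b.count_cols;
    return b;
}
```

For an 8 x 4 dataset on a 2 x 2 grid, rank 3 owns the 4 x 2 block that starts at row 4, column 2.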
