Member of the Helmholtz Association
Portable Parallel I/O
Handling large datasets in heterogeneous parallel environments
May 21, 2014 Michael Stephan
Part I: HDF5
At the end of this lesson, you will be able to
  Get an idea of the HDF5 functionality.
  Create a short example for HDF5 file I/O.
  Discuss the advantages and disadvantages of HDF5 file I/O.
User's Guide (352 pages)
Reference Guide (802 pages)
Introduction
Introduction to HDF5
Programming Model and APIs
General Programming Paradigm
Parallel HDF5
Introduction: Motivation
HDF5 is a unique technology suite that makes it possible to manage extremely large and complex data collections.
The HDF5 technology suite includes:
  A versatile data model that can represent very complex data
  A completely portable file format with no limit on the number
  A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
  A rich set of integrated performance features that allow for access time and storage space optimizations.
  Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
The HDF5 data model, file format, API, library, and tools are
Unlimited size, extensibility, and portability
HDF5 does not limit the size of files or the size or number of
The HDF5 format and library are extensible and designed to evolve gracefully to satisfy new demands. HDF5 functionality and data is portable across virtually all computing platforms and is distributed with C, C++, Java, and Fortran90 programming interfaces.
General data model
HDF5 has a simple but versatile data model. The HDF5 data model supports complex data relationships and dependencies through its grouping and linking mechanisms. HDF5 accommodates many common types of metadata and arbitrary user-defined metadata.
Unlimited variety of datatypes
HDF5 supports a rich set of pre-defined datatypes as well as the creation of an unlimited variety of complex user-defined datatypes. Datatype definitions can be shared among objects in an HDF5 file, providing a powerful and efficient mechanism for describing data. Datatype definitions include information such as byte order (endianness), size, and floating point representation to fully describe how the data is stored, ensuring portability to other platforms.
Flexible, efficient I/O
HDF5, through its virtual file layer, offers extremely flexible storage and data transfer capabilities. Standard (POSIX), parallel, and network I/O file drivers are provided with HDF5. Application developers can write additional file drivers to implement customized data storage or transport capabilities. The parallel I/O driver for HDF5 reduces access times on parallel systems by reading and writing multiple data streams simultaneously.
Flexible data storage
HDF5 employs various compression, extensibility, and chunking strategies to improve access, management, and storage efficiency. HDF5 provides for external storage of raw data, allowing raw data to be shared among HDF5 files and/or applications, and
Data transformation and complex subsetting
HDF5 enables datatype and spatial transformation during I/O
HDF5 data I/O functions can operate on selected subsets of the data, reducing transferred data volume and improving access speed.
Applications that deal with big or complex data
Over 200 different types of apps
2+ million product users worldwide
Academia, government agencies, industry
Introduction: Terms and Definitions
...into which you can put your data objects Structures to
Groups – provide structure among objects
Datasets – where the primary data goes
  Data arrays
  Rich set of datatype options
  Flexible, efficient storage and I/O
Attributes, for metadata
Other objects
  Links (point to data in a file or in another HDF5 file)
  Datatypes (can be stored for complex structures and reused by multiple datasets)
Two roles
Dataspace contains spatial info about a dataset stored in a file
Rank and dimensions Permanent part of dataset definition
Dataspace describes application’s data buffer and data elements participating in I/O
Datatype – how to interpret a data element
  Permanent part of the dataset definition
  Two classes: atomic and compound
  Can be stored in a file as an HDF5 object (HDF5 committed datatype)
  Can be shared among different datasets
HDF5 atomic types
  normal integer and float
  user-definable (e.g., 13-bit integer)
  variable length types (e.g., strings)
  references to objects/dataset regions
  enumeration - names mapped to integers
  array
HDF5 compound types
  Comparable to C structs ("records")
  Members can be atomic or compound types
Attribute – data of the form “name = value”, attached to an
Operations similar to dataset operations, but
  Not extendible
  No compression or partial I/O
Can be overwritten, deleted, added during the "life" of a dataset
A mechanism for organizing collections of related objects
Every file starts with a root group
Similar to UNIX directories
  / (root)
  /X
  /Y
  /X/temp
Can have attributes
May 21, 2014 Michael Stephan Slide 26
Move just part of a dataset
A public API for writing I/O drivers
Allows HDF5 to interface to disk, the network, memory, or a user-defined device
Runs almost anywhere
  Linux and UNIX workstations
  Windows, Mac OS X
  Big ASC machines, Crays, VMS systems
  TeraGrid and other clusters
  Source and binaries available from http://www.hdfgroup.org/HDF5/release/index.html
The HDF Group
  HDFView
  Java tools
  Command-line utilities
  Web browser plug-in
  Regression and performance testing software
  Parallel h5diff
3rd Party (IDL, MATLAB, Mathematica, PyTables, HDF Explorer, LabView)
Communities (EOS, ASC, CGNS)
Integration with other software (iRODS, OPeNDAP)
Provide a flexible API to support a wide range of operations
Support high performance access in serial and parallel computing environments.
Be compatible with common data models and programming languages.
Because of these goals, the HDF5 API is rich and large
Create groups, datasets, attributes, linkages
Create complex data types
Assign storage and I/O properties to objects
Perform complex subsetting during read/write
Use variety of I/O "devices" (parallel, remote, etc.)
Transform data during I/O
Query about file and structure and properties
Query about object structure, content, properties
For flexibility, the API is extensive
  300+ functions
This can be daunting but there is hope
  A few functions can do a lot
  Start simple
  Build up knowledge as more features are needed
Library functions are categorized by object type
"H5Lite" API supports basic capabilities
Currently C, Fortran 90, Java, and C++ bindings.
C routines begin with the prefix H5?, where ? is a character corresponding to the type of object the function acts on.
Example APIs:
  H5D : Dataset interface, e.g., H5Dread
  H5F : File interface, e.g., H5Fopen
  H5S : dataSpace interface, e.g., H5Sclose
HDF5 wrappers (not fully tested!)
  h5cc   HDF5 C compiler command (similar to mpicc)
  h5fc   HDF5 F90 compiler command (similar to mpif90)
  h5c++  HDF5 C++ compiler command
Makefile in /bgsys/local/hdf5/examples/jugene/
Properties of the object are optionally defined
  Creation properties
  Access property lists
  Default values used if none are defined
Object is opened or created
Object is accessed, possibly many times
Object is closed
An order is imposed on operations by argument dependencies
A file must be opened before a dataset because the dataset
Objects can be closed in any order
For portability, the HDF5 library has its own defined types:
  hid_t: object identifiers (native integer)
  hsize_t: size used for dimensions (unsigned long or unsigned long long)
  hssize_t: for specifying coordinates and sometimes for dimensions (signed long or signed long long)
  herr_t: function return value
  hvl_t: variable length datatype
For C, include hdf5.h in your HDF5 application.
http://www.hdfgroup.org/HDF5/
/usr/local/hdf5/examples/ (JUQUEEN / JUROPA)
/bgsys/local/hdf5/examples/ (JUQUEEN)
Google ;-)
Parallel HDF5: Overview of Parallel HDF5 Design
Support MPI programming
PHDF5 files compatible with serial HDF5 files
  Shareable between different serial or parallel platforms
Single file image to all processes
  One file per process design is undesirable
    Expensive post processing
    Not usable by different number of processes
Standard parallel I/O interface
  Must be portable to different platforms
MPI-IO is an input/output API.
It treats the data file as a "linear byte stream", and each MPI application needs to provide its own file view and data representations to interpret those bytes.
All stored data are machine dependent except for the "external32" representation.
External32 is defined as big-endian.
  Little-endian machines have to convert the data in both read and write operations.
  64-bit data types may lose information.
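To make the byte-order point concrete, the following sketch shows the kind of conversion a little-endian host has to perform when a 32-bit value is stored most significant byte first. It illustrates big-endian storage in general and is not the MPI library's implementation of external32; all function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* inspect the byte at the lowest address */
    return first == 1;
}

/* Store a 32-bit value most significant byte first (big-endian),
   independent of the byte order of the host. */
void store_be32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)(value);
}

/* Reassemble a 32-bit value from big-endian bytes. */
uint32_t load_be32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

On a big-endian host the stored bytes match the in-memory layout, so no reordering is needed; on a little-endian host every element is reordered on both write and read.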
HDF5 is data management software.
It stores the data and metadata according to the HDF5 data format definition.
An HDF5 file is self-describing.
Each machine can store the data in its own native representation for efficient I/O without loss of data precision.
Any necessary data representation conversion is done by the HDF5 library automatically.
Most PHDF5 APIs are collective
PHDF5 opens a parallel file with a communicator
  Returns a file handle
  Future access to the file via the file handle
  All processes must participate in collective PHDF5 APIs
  Different files can be opened via different communicators
Examples of PHDF5 collective API
  File operations: H5Fcreate, H5Fopen, H5Fclose
  Object creation: H5Dcreate, H5Dopen, H5Dclose
  Object structure: H5Dextend (increase dimension sizes)
Array data transfer can be collective or independent
  Dataset operations: H5Dwrite, H5Dread
  Collectiveness is indicated by function parameters, not by function names as in the MPI API
After a file is opened by the processes of a communicator
  All parts of the file are accessible by all processes
  All objects in the file are accessible by all processes
  Multiple processes may write to the same data array
  Each process may write to an individual data array
C and F90 language interfaces
HDF5 uses an access template object (property list) to control the file access mechanism
General model to access an HDF5 file in parallel:
  Set up MPI-IO or POSIX access template (access property list)
  Open file
  Access data
  Close file
Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information
  C: herr_t H5Pset_fapl_mpio(hid_t plist_id, MPI_Comm comm, MPI_Info info);
  F90: h5pset_fapl_mpio_f(plist_id, comm, info)
plist_id is a file access property list identifier
All processes of the communicator open / close a dataset by a collective call
  C: H5Dcreate or H5Dopen; H5Dclose
  F90: h5dcreate_f or h5dopen_f; h5dclose_f
All processes of the communicator must extend an unlimited dimension dataset before writing to it
  C: H5Dextend
  F90: h5dextend_f
All processes that have opened the dataset may do collective I/O
Each process may do independent and arbitrary number
  C: H5Dwrite and H5Dread
  F90: h5dwrite_f and h5dread_f
Create and set dataset transfer property
  C: H5Pset_dxpl_mpio
    H5FD_MPIO_COLLECTIVE
    H5FD_MPIO_INDEPENDENT (default)
  F90: h5pset_dxpl_mpio_f
    H5FD_MPIO_COLLECTIVE_F
    H5FD_MPIO_INDEPENDENT_F (default)
Access dataset with the defined transfer property
Distributed memory model: data is split among processes
PHDF5 uses the HDF5 hyperslab model
Each process defines memory and file hyperslabs
Each process executes partial write/read call
  Collective calls
  Independent calls
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block)
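How each process arrives at its own offset and count is left to the application. One common choice, shown here as a plain C sketch with the hypothetical helper `block_partition`, is a contiguous block decomposition of the first dimension in which the leading ranks take one extra row when the size does not divide evenly.

```c
#include <stddef.h>

/* Split n rows as evenly as possible among nprocs ranks.
   The first (n % nprocs) ranks receive one extra row.
   On return, *offset is the first row owned by `rank` and
   *count is the number of rows it owns. */
void block_partition(size_t n, int nprocs, int rank,
                     size_t *offset, size_t *count)
{
    size_t base = n / (size_t)nprocs;   /* rows every rank gets */
    size_t rem  = n % (size_t)nprocs;   /* leftover rows */
    size_t r    = (size_t)rank;

    *count  = base + (r < rem ? 1 : 0);
    *offset = r * base + (r < rem ? r : rem);
}
```

The resulting values would be used as the first entries of the `offset` and `count` arrays passed to `H5Sselect_hyperslab`; `stride` and `block` can be passed as `NULL` for a contiguous block.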