Benchmarking HDF5 Compression Filters in R, Mike L. Smith

SLIDE 1

Benchmarking HDF5 Compression Filters in R

Mike L. Smith @grimbough

SLIDE 2

HDF5 is a file format for storing large, heterogeneous data

  • Used in a variety of software, e.g.:

      ○ DelayedArray
      ○ Kallisto
      ○ ONT sequencing
      ○ mz5 mass spec file

  • Interfaces in many languages

      ○ C, Python, …
      ○ rhdf5 & Rhdf5lib

  • Key features:

      ○ Hierarchical
      ○ Self-describing
      ○ Efficient subsetting
      ○ Compressed

http://neondataskills.org/HDF5/About

SLIDE 3

HDF5 datasets are not contiguous, but stored in chunks

SLIDE 5

Chunks are stored separately on disk

SLIDE 6

Only read the chunks needed for a subset
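
A minimal sketch of how a subset maps onto chunks, written in plain Python rather than with rhdf5 or the HDF5 library. The function name `chunks_for_subset` and the dataset and chunk sizes are assumptions chosen for illustration; the point is only that a rectangular selection overlaps a small, computable set of chunks.

```python
from itertools import product

def chunks_for_subset(chunk_shape, selection):
    """Return grid coordinates of every chunk that overlaps a selection.

    chunk_shape: chunk size along each dimension, e.g. (100, 100)
    selection:   one (start, stop) pair per dimension, stop exclusive
    """
    per_dim = []
    for size, (start, stop) in zip(chunk_shape, selection):
        first = start // size       # chunk holding the first selected element
        last = (stop - 1) // size   # chunk holding the last selected element
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))

# A 1000 x 1000 dataset stored as 100 x 100 chunks has 100 chunks in total.
# Selecting rows 250..349 and columns 0..49 touches only two of them.
needed = chunks_for_subset((100, 100), [(250, 350), (0, 50)])
print(needed)  # [(2, 0), (3, 0)]
```

Only the chunks in `needed` would have to be read from disk and passed through any filters; the other 98 are never touched.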

SLIDE 7

Chunks can be processed by filters - usually for compression

[Figure: filter pipeline between an in-memory data chunk and on-disk storage. Writing: Shuffle, then GZIP Compress. Reading: GZIP Decompress, then Unshuffle.]
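
The write and read paths can be sketched with Python's standard library. This is an illustration of the idea, not HDF5's own implementation: `zlib` stands in for the GZIP (deflate) filter, and the helper names `shuffle`, `unshuffle`, `write_chunk` and `read_chunk` are made up for this example.

```python
import struct
import zlib

def shuffle(buf, itemsize):
    """Group byte 0 of every element together, then byte 1, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))

def unshuffle(buf, itemsize):
    """Invert shuffle(): interleave the byte planes back into elements."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)

def write_chunk(chunk, itemsize, level=6):
    """Writing: shuffle the chunk, then compress it."""
    return zlib.compress(shuffle(chunk, itemsize), level)

def read_chunk(stored, itemsize):
    """Reading: apply the inverse steps in reverse order."""
    return unshuffle(zlib.decompress(stored), itemsize)

# One chunk of 1000 little-endian 32-bit integers.
chunk = struct.pack("<1000i", *range(1000))
stored = write_chunk(chunk, itemsize=4)
assert read_chunk(stored, itemsize=4) == chunk

# Raw size, shuffled + compressed size, compressed-only size (in bytes).
print(len(chunk), len(stored), len(zlib.compress(chunk, 6)))
```

Shuffling changes no values; it only reorders bytes so that similar bytes sit next to each other, which often gives the compressor longer runs to work with.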

SLIDE 8

There are a number of compression filters available

  • Internal filters

○ HDF5 ships with support for GZIP and SZIP

  • Dynamic filters

      ○ Third party tools can be made available at runtime
      ○ Wrap existing compression tool in small amount of C code
      ○ Provide location to HDF5 and they are loaded when required
      ○ Independent of the application(s) using them
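
The idea of loading a filter only when it is required can be pictured as a lookup table keyed by filter ID. The sketch below is a conceptual model in Python, not how the HDF5 library is implemented: `REGISTRY`, `register_filter` and `decode_chunk` are hypothetical names. The IDs are real ones: 1 is HDF5's built-in deflate (GZIP) filter and 307 is the registered ID of the BZIP2 filter.

```python
import bz2
import zlib

# Hypothetical registry: only the built-in filter is known at start-up.
REGISTRY = {1: ("gzip", zlib.compress, zlib.decompress)}

def register_filter(filter_id, name, encode, decode):
    """Make a third-party filter available without changing the reader."""
    REGISTRY[filter_id] = (name, encode, decode)

def decode_chunk(stored, filter_id):
    """Look up the filter recorded with the dataset and undo it."""
    if filter_id not in REGISTRY:
        raise LookupError(f"filter {filter_id} is not available")
    _, _, decode = REGISTRY[filter_id]
    return decode(stored)

# A chunk that was written with BZIP2 by some other application.
stored = bz2.compress(b"ACGT" * 1000)

try:
    decode_chunk(stored, 307)
except LookupError as err:
    print(err)  # the filter has not been provided yet

# Once the filter is provided, the same reading code works unchanged.
register_filter(307, "bzip2", bz2.compress, bz2.decompress)
assert decode_chunk(stored, 307) == b"ACGT" * 1000
```

The reading code never names a specific compressor, which is what lets a filter stay independent of the applications that use it.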

SLIDE 9

rhdf5filters provides additional filters in R

  • BLOSC meta compressor
  • BZIP2
  • Compiles C code on all platforms, including Windows
  • Integrated with rhdf5

      ○ Writing: Supply argument to function
      ○ Reading: Used automatically if needed

  • msmith.de/rhdf5filters/

SLIDE 10

Filters & parameters have been benchmarked

[Figure: benchmark results in three panels: Reading Time, Writing Time, File Size.]
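
A rough sketch of what measuring these three quantities involves, using compressors from Python's standard library on an in-memory buffer. It is not the benchmark code behind the results: the `benchmark` helper, the synthetic data and the choice of settings are assumptions, and real measurements would also include HDF5 and disk I/O overhead.

```python
import bz2
import struct
import time
import zlib

def benchmark(name, compress, decompress, data, repeats=5):
    """Time one codec and report the best of several runs."""
    write_times, read_times = [], []
    for _ in range(repeats):
        t0 = time.perf_counter()
        stored = compress(data)
        t1 = time.perf_counter()
        restored = decompress(stored)
        t2 = time.perf_counter()
        assert restored == data  # a filter must be lossless
        write_times.append(t1 - t0)
        read_times.append(t2 - t1)
    return {"filter": name, "write_s": min(write_times),
            "read_s": min(read_times), "size": len(stored)}

# Synthetic test data: 50,000 32-bit integers with a repeating pattern.
data = struct.pack("<50000i", *(i % 1000 for i in range(50000)))

# Each candidate is one filter at one parameter setting.
candidates = [
    ("gzip-1", lambda b: zlib.compress(b, 1), zlib.decompress),
    ("gzip-9", lambda b: zlib.compress(b, 9), zlib.decompress),
    ("bzip2-9", lambda b: bz2.compress(b, 9), bz2.decompress),
]
results = [benchmark(n, c, d, data) for n, c, d in candidates]
for row in results:
    print(row)
```

Timings depend on the machine and the data, so the numbers are only meaningful relative to each other within a single run.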

SLIDE 11

You can explore the results with a shiny app

  • Scripts to run benchmarks also available

  • Grateful for any contributions on both style and substance!

  • msmith.de/rhdf5filters-benchmarks

SLIDE 12

Thanks to EMBL Huber Lab & BioC community!

msmith.de/rhdf5filters-benchmarks

@grimbough