HDF5
A data format I need to know about
June 17, 2020 — June 8, 2022
HDF5
: (Python/R/Java/C/Fortran/MATLAB/whatever) An optimised and flexible data format born in the Big Science applications of the 1990s. Built-in compression and indexing support data too big for memory. A good default, with wide support and good performance. For plain numerical data it is very simple to use.
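For example, round-tripping a numeric array is a few lines in h5py (a sketch; assumes h5py and numpy are installed, and a throwaway file path):

```python
# Minimal numeric round-trip with h5py.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "simple.h5")

with h5py.File(path, "w") as f:
    f["x"] = np.arange(10.0)  # dataset created straight from the array

with h5py.File(path, "r") as f:
    chunk = f["x"][3:6]  # slicing reads only the requested range from disk

print(chunk)  # → [3. 4. 5.]
```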
Defining the table when writing structured data is tedious, and HDF5 mangles text encodings if you aren't careful, so a surprising amount of time can be lost writing schemas when there is highly structured data to store.
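To illustrate the encoding fuss (a sketch, not the only way): h5py wants an explicit string dtype for text columns in a compound type, and hands back raw bytes on read.

```python
# Compound ("table") dtype with an explicit UTF-8 string column in h5py.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "table.h5")

# Declaring the encoding up front is what saves you from mangled text later.
dt = np.dtype([
    ("name", h5py.string_dtype(encoding="utf-8")),
    ("value", "f8"),
])
rows = np.array([("café", 1.0), ("naïve", 2.0)], dtype=dt)

with h5py.File(path, "w") as f:
    f.create_dataset("table", data=rows)

with h5py.File(path, "r") as f:
    back = f["table"][:]

# h5py returns bytes for the string field; decoding is on you.
print(back["name"][0].decode("utf-8"))  # → café
```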
HDF5's built-in lossless compression is not impressive on floats, and its built-in lossy compression is poor. Recent HDF5 releases support filter plugins that provide fancier methods.
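To see the float problem concretely, a sketch comparing storage ratios (assumes h5py and numpy; the scale-offset filter is one built-in way to trade precision for compressibility):

```python
# How much does built-in gzip buy on noisy float data? Usually very little.
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((512, 512)).astype(np.float32)  # noisy mantissas

path = os.path.join(tempfile.mkdtemp(), "compress.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("raw", data=data)
    f.create_dataset("gz", data=data, compression="gzip", compression_opts=9)
    # Keeping ~3 decimal digits gives gzip something to work with (lossy!).
    f.create_dataset("trunc", data=data, scaleoffset=3, compression="gzip")

with h5py.File(path, "r") as f:
    for name in ("raw", "gz", "trunc"):
        ratio = f[name].id.get_storage_size() / data.nbytes
        print(f"{name}: {ratio:.2f}")
```

The gzip ratio on noisy float32 stays close to 1.0; only after truncating precision does it shrink noticeably.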
I am currently doing a lot of heavy HDF5 processing, so this is scruffy.
1 Compression
I have been trying to use HDF5's compression, but the built-in options are not great. Better lossy compression is available for HDF5 via third-party filter plugins.
Untidy notes
potentially useful — compression algorithms specialised for smooth fields
- szcompressor/SZ3
- LLNL/H5Z-ZFP: A registered ZFP compression plugin for HDF5
- LLNL/zfp: Compressed numerical arrays that support high-speed random access
Of these, only ZFP is supported in hdf5plugin, making it easy to use from Python.
2 Parallel access
Accessing HDF5 from multiple processes is complicated, but may be worthwhile.
Useful tools: visualisers such as HDFView.
To install h5py against the Homebrew HDF5 libraries (useful on Apple Silicon or other unusual architectures), point the build at Homebrew's HDF5.
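A sketch of the usual incantation (assumes Homebrew's `hdf5` formula; `HDF5_DIR` is the variable h5py's build honours, and `--no-binary` forces a from-source build against it):

```shell
# Build h5py against Homebrew's HDF5 rather than a bundled wheel.
brew install hdf5
HDF5_DIR="$(brew --prefix hdf5)" pip install --no-binary=h5py h5py
```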
3 Virtual datasets
a.k.a. arrays-made-up-of-multiple-arrays-across-multiple-files.
I wonder how this interacts with parallelism? How about when writing? It sounds like virtual datasets are a supported means of getting multiple writers, if we are careful — one file per writer, stitched together afterwards.
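A sketch of that one-file-per-writer pattern with h5py's virtual dataset API (assumes h5py ≥ 2.9; file names and shapes are hypothetical):

```python
# Stitch per-file arrays into one virtual dataset (each file could have its own writer).
import os
import tempfile

import h5py
import numpy as np

d = tempfile.mkdtemp()
parts = [os.path.join(d, f"part{i}.h5") for i in range(3)]

# Each "writer" produces its own file independently.
for i, p in enumerate(parts):
    with h5py.File(p, "w") as f:
        f["data"] = np.full(4, i, dtype="i8")

# The virtual layout maps each source file onto one row of a (3, 4) array.
layout = h5py.VirtualLayout(shape=(3, 4), dtype="i8")
for i, p in enumerate(parts):
    layout[i] = h5py.VirtualSource(p, "data", shape=(4,))

vds_path = os.path.join(d, "vds.h5")
with h5py.File(vds_path, "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=-1)  # fill where sources are missing

with h5py.File(vds_path, "r") as f:
    combined = f["data"][:]

print(combined)
```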