Data storage formats
June 17, 2020 — November 26, 2023
In what format do I send my data over the network or stash it on disk? Especially interesting for experiment data, which is big and numbery.
Files full of stored data are the most basic form of database,¹ and much less arsing around if we can get away with it. Classic enterprise databases are optimised for things I don’t often need in data science research, such as structured records and highly concurrent writes.
1 Textual formats
For CSV, JSON etc, see text data processing.
A hot mess that has the virtue that many projects use it, although they all use it badly, inconsistently and slowly. For exploratory small data analysis, textual tabular is tolerable and common.
For ongoing projects, CSV is best parsed with a tabular library such as pandas or tablib and thereafter stored in some fancier format.
2 HDF5
A numerical data storage system beloved of physicists. So important in my workflow that I made a new notebook. See HDF5.
3 Arrow
Apache Arrow (Python/R/C++/many more) (source) is a fresh system designed to serialize the same types of data as HDF5, but be both simpler from the user side and faster at scale. See Wes McKinney’s pyarrow blog post or Notes from a data witch: Getting started with Apache Arrow. It seems to be optimised for in-memory data, although it does store stuff on disk.
3.1 Polars
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.
Features
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger than RAM datasets)
- Rust | Python | NodeJS | …
4 Parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file format[…]
- Free and open source file format.
- Language agnostic.
- Column-based format - files are organised by column, rather than by row, which saves storage space and speeds up analytics queries.
- Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
- Highly efficient data compression and decompression.
- Supports complex data types and advanced nested data structures.
Seems happier to serialize to disk than Apache Arrow is.
5 Pandas
pandas can use various data formats for backend access, e.g. HDF5, CSV, JSON, SQL, Excel, Parquet.
6 Petastorm
Is this a data file format or some kind of database? Possibly both?
In this article, we describe Petastorm, an open source data access library developed at Uber ATG. This library enables single machine or distributed training and evaluation of deep learning models directly from multi-terabyte datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as Tensorflow, Pytorch, and PySpark. It can also be used from pure Python code.
It sounds like a nice backend built upon pyarrow with special support for Spark and ML formats. It seems like this might be easy to use from Python but less pleasant for non-Python users, even if it is still possible, because there will not be so many luxurious supporting libraries.
7 Dask
NumPy-style arrays and pandas-style dataframes and more, but distributed. See Dask.
8 zarr
Chunked, compressed, N-dimensional arrays with a NumPy-style interface, storable on disk or in cloud object storage. See zarr.
9 Protobuf
protobuf (Python/R/Java/C/Go/whatever): Google’s famous data format. Recommended for TensorFlow, although it’s soooooo boooooooring; if I am reading that page I am very far from what I love. A data format rather than a storage solution per se (there is no assumption that it will be stored on the file system).
You might also want everything not to be hard. Try prototool or buf.
10 flatbuffers
FlatBuffers is an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications…
Why not use Protocol Buffers, or…?
Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.
That sounds like an optimization that I will not need.
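For the record, the "no parsing step" idea can be sketched with nothing but the standard library. This is not the FlatBuffers API or its wire format; the fixed record layout below is a hypothetical stand-in that only illustrates reading one field straight out of a byte buffer by offset.

```python
import struct

# Hypothetical fixed layout: little-endian uint32 id, then x, y, z as float64.
# The "<" prefix disables padding, so each record is 4 + 3 * 8 = 28 bytes.
RECORD = struct.Struct("<Iddd")

records = [(7, 1.0, 2.0, 3.0), (8, 4.0, 5.0, 6.0)]
buffer = b"".join(RECORD.pack(*r) for r in records)


def read_z(buf, index):
    """Read the z field of one record directly, without decoding anything else."""
    offset = index * RECORD.size + 4 + 2 * 8  # skip id (4 bytes) and x, y (8 each)
    (z,) = struct.unpack_from("<d", buf, offset)
    return z


z_of_second = read_z(buffer, 1)
```

A parse-first format would instead decode the whole message into objects before any field could be touched, which is the per-object allocation cost the quote refers to.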
11 Algorithms for compressing scientific data
Lossy compression of dense numerical data is weirdly fascinating. Numerical data have various special features: they are stored in floating point, which is wasteful and typically more precise than needed; they typically have much redundancy; and we need to load them quickly. So we might naturally imagine they can be compressed. But can we control the accuracy with which they are compressed and stored, such that they are reliable to load and the errors introduced by lossy compression do not invalidate our inference? What this even means is context-dependent.
SZ3 and ZFP seem to be leading contenders right now, for flexible and reasonable data storage for science.
SZ
ZFP
sdrbench.github.io/ notionally benchmarks different methods, although I cannot make any sense of it myself. Where are the actual results?
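To make the error-control question concrete, here is a toy error-bounded compressor, assuming only NumPy and the standard library. It is emphatically not the SZ or ZFP algorithm: it merely quantises to a uniform grid and hands the integer codes to zlib. The signal and the helper names are made up for illustration.

```python
import zlib

import numpy as np


def compress(data, abs_err):
    """Quantise to a grid of spacing 2 * abs_err, then compress the codes losslessly."""
    codes = np.round(data / (2 * abs_err)).astype(np.int32)
    return zlib.compress(codes.tobytes()), data.shape


def decompress(blob, shape, abs_err):
    """Invert the lossless step and map the codes back onto the grid."""
    codes = np.frombuffer(zlib.decompress(blob), dtype=np.int32).reshape(shape)
    return codes * (2 * abs_err)


# Smooth synthetic signal standing in for experiment data.
t = np.linspace(0, 10, 100_000)
signal = np.sin(t) + 0.1 * np.cos(7 * t)

abs_err = 1e-3
blob, shape = compress(signal, abs_err)
restored = decompress(blob, shape, abs_err)

max_error = np.max(np.abs(restored - signal))
ratio = signal.nbytes / len(blob)
```

Rounding to the nearest grid point means no value moves by more than `abs_err` (up to floating-point rounding), so the user chooses the pointwise error in advance. Whether a pointwise bound is the right guarantee for a given inference is exactly the context-dependent part.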
Footnotes
¹ In some weird sense of basic, insofar as a filesystem can be thought of as a sophisticated database of sorts.