Data sets for machine learning for partial differential equations
May 15, 2017 — February 26, 2025
Datasets and training harnesses for machine learning on partial differential equations (PDEs).
You’ll notice an emphasis on Computational Fluid Dynamics (CFD) in these problems, especially single-phase problems. That is where the early successes of operator learning have been (although, I’d argue, not where it is most needed).
pdebench/PDEBench: An Extensive Benchmark for Scientific Machine Learning (Takamoto et al. 2022) (Disclaimer: I contributed significantly to this project)
PDEArena (Brandstetter et al. 2022; Gupta and Brandstetter 2022)
Johns Hopkins Turbulence Databases (JHTDB) (Li et al. 2008; Yu et al. 2012)
karlotness/nn-benchmark: An extensible benchmark suite to evaluate data-driven physical simulation (Otness et al. 2021)
The Well
Welcome to the Well, a large-scale collection of machine learning datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain scientists and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite for accelerating research in machine learning and computational sciences.
APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs (Koehler et al. 2024)
APEBench is a JAX-based tool to evaluate autoregressive neural emulators for PDEs on periodic domains in 1d, 2d, and 3d. It comes with an efficient reference simulator based on spectral methods that is used for procedural data generation (no need to download large datasets with APEBench). Since this simulator can also be embedded into emulator training (e.g., for a “solver-in-the-loop” correction setting), this is the first benchmark suite to support differentiable physics.
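To make the “procedural data generation” idea concrete, here is a minimal sketch of the kind of thing a spectral reference simulator does: generate training trajectories on the fly from random initial conditions, with no dataset download. This is not APEBench’s actual API, just an illustration of the technique using a 1D periodic heat equation, solved exactly in Fourier space.

```python
import numpy as np

def spectral_heat_step(u, dt, nu=0.01):
    """Advance the 1D periodic heat equation one step, exactly, in Fourier space."""
    n = u.shape[-1]
    # integer wavenumbers 0..n/2 on a domain of length 2*pi
    k = np.fft.rfftfreq(n, d=2 * np.pi / n) * 2 * np.pi
    u_hat = np.fft.rfft(u)
    u_hat *= np.exp(-nu * k**2 * dt)  # exact integrating factor for diffusion
    return np.fft.irfft(u_hat, n=n)

def make_trajectory(n=128, steps=50, dt=1e-3, seed=0):
    """Procedurally generate one training trajectory from a random smooth
    initial condition: no stored dataset required."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 2 * np.pi, n, endpoint=False)
    # a few random low-frequency sine modes
    u = sum(rng.normal() * np.sin((m + 1) * x) for m in range(5))
    traj = [u]
    for _ in range(steps):
        u = spectral_heat_step(u, dt)
        traj.append(u)
    return np.stack(traj)  # shape (steps + 1, n)
```

Because the simulator is just array arithmetic, the same step function could in principle be written in JAX and differentiated through, which is what makes the “solver-in-the-loop” training setting possible.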
If we have a simulator, we can run it live and generate data on the fly. Here is one tool to facilitate that.
INRIA’s Melissa (Ribés and Raffin 2020; Terraz et al. 2017)
Melissa is a file-avoiding, fault-tolerant, and elastic framework to run large-scale sensitivity analysis (Melissa-SA) and large-scale deep surrogate training (Melissa-DL) on supercomputers. With Melissa-SA, the largest runs so far involved up to 30k cores, executed 80,000 parallel simulations, and generated 288 TB of intermediate data that did not need to be stored on the file system …
Classical sensitivity analysis and deep surrogate training consist of running different instances of a simulation with different sets of input parameters, storing the results to disk to later read them back to train a Neural Network or compute the required statistics. The amount of storage needed can quickly become overwhelming, with the associated long read time making data processing time-consuming. To avoid this pitfall, scientists reduce their study size by running low-resolution simulations or down-sampling output data in space and time.
Melissa (Fig. 1) bypasses this limitation by avoiding intermediate file storage. Melissa processes the data online (in transit), enabling very large-scale data processing.
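The in-transit idea is easy to illustrate in miniature. The sketch below (not Melissa’s API, just the principle) accumulates running statistics over simulation outputs with Welford’s algorithm, so each field is consumed as it arrives and never hits the file system.

```python
import numpy as np

class OnlineMoments:
    """Streaming mean and variance over simulation outputs (Welford's algorithm).
    Each simulation result updates the statistics in transit; nothing is
    written to disk, which is the point of file-avoiding frameworks."""

    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None

    def update(self, field):
        field = np.asarray(field, dtype=float)
        if self.n == 0:
            self.mean = np.zeros_like(field)
            self.m2 = np.zeros_like(field)
        self.n += 1
        delta = field - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (field - self.mean)

    def variance(self):
        """Unbiased sample variance, per grid point."""
        return self.m2 / (self.n - 1)
```

In a real Melissa-style run, each parallel simulation instance would stream its fields to a server holding an accumulator like this (or a training loop, in the surrogate-training case), rather than to storage.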
Working out which data to simulate to optimally train the neural network (active learning) is a key part of the problem, and I’m not aware of much work in that area.
Bhan et al. (2024) tackles the closely related problem of controlling PDEs. Kim, Kim, and Lee (2024) is the only actual active learning approach I have seen in recent literature.
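For flavour, here is what the simplest version of that active-learning loop might look like: a generic query-by-committee sketch (not taken from any of the papers above), where the next simulation parameters are chosen wherever an ensemble of surrogates disagrees most. `run_simulation` is a hypothetical stand-in for an expensive PDE solve.

```python
import numpy as np

def run_simulation(theta):
    """Hypothetical stand-in for an expensive PDE solve:
    here just a cheap nonlinear function of the parameter theta."""
    return np.sin(3 * theta) + 0.5 * theta**2

def acquire_next_parameters(candidates, models, batch_size=4):
    """Query-by-committee acquisition: propose the candidate parameters
    where the surrogate ensemble's predictions have the largest spread."""
    preds = np.stack([m(candidates) for m in models])  # (n_models, n_candidates)
    disagreement = preds.std(axis=0)
    # highest-disagreement candidates are simulated next
    return candidates[np.argsort(disagreement)[-batch_size:]]
```

The loop would then call `run_simulation` on the acquired parameters, add the results to the training set, refit the ensemble, and repeat; the open question is how to do this well for expensive, high-dimensional PDE solves.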