Data sets for machine learning for partial differential equations

May 15, 2017 — February 26, 2025

calculus
data sets
dynamical systems
geometry
machine learning
neural nets
PDEs
physics
regression
sciml
SDEs
signal processing
statistics
statmech
stochastic processes
surrogate
time series

Datasets and training harnesses for machine learning on partial differential equations (PDEs).

Figure 1: Massive turbulence in the cloud.

You’ll notice an emphasis on Computational Fluid Dynamics (CFD) among these datasets, especially single-phase problems. That is where the early successes of operator learning have been (although, I’d argue, not where it is most needed).

If we have a simulator, we can run it live and generate data on the fly. Here is one tool to facilitate that.

INRIA’s Melissa (Ribés and Raffin 2020; Terraz et al. 2017)

Melissa is a file-avoiding, fault-tolerant, and elastic framework to run large-scale sensitivity analysis (Melissa-SA) and large-scale deep surrogate training (Melissa-DL) on supercomputers. With Melissa-SA, the largest runs so far involved up to 30k cores, executed 80,000 parallel simulations, and generated 288 TB of intermediate data that did not need to be stored on the file system …

Classical sensitivity analysis and deep surrogate training consist of running different instances of a simulation with different sets of input parameters, storing the results to disk to later read them back to train a Neural Network or compute the required statistics. The amount of storage needed can quickly become overwhelming, with the associated long read time making data processing time-consuming. To avoid this pitfall, scientists reduce their study size by running low-resolution simulations or down-sampling output data in space and time.

Melissa (Fig. 1) bypasses this limitation by avoiding intermediate file storage. It processes the data online (in transit), enabling very large-scale data processing.
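The in-transit idea can be sketched in a few lines. This is not Melissa’s API — just a toy illustration, with a hypothetical 1D heat-equation solver standing in for the expensive simulation: each (parameter, solution) pair is generated on demand and streamed straight into the training loop, never touching the file system.

```python
import numpy as np

def heat_solver(kappa, n=64, steps=200, dt=1e-4):
    """Hypothetical stand-in simulator: explicit finite differences for
    u_t = kappa * u_xx on [0, 1] with periodic boundaries and a random
    smooth initial condition. (dt * kappa / dx^2 < 0.5, so it is stable.)"""
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    rng = np.random.default_rng()
    u = np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal() * np.cos(4 * np.pi * x)
    dx = 1.0 / n
    for _ in range(steps):
        u = u + dt * kappa * (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    return u

def stream_training_pairs(n_samples, kappa_range=(0.1, 1.0)):
    """Yield (parameter, solution) pairs one at a time: the data exists
    only in memory, en route to the surrogate, never on disk."""
    rng = np.random.default_rng(0)
    for _ in range(n_samples):
        kappa = rng.uniform(*kappa_range)
        yield kappa, heat_solver(kappa)

# A surrogate trainer consumes the stream directly:
for kappa, u in stream_training_pairs(4):
    pass  # model.fit_step(kappa, u) would go here
```

In the real system the solver runs as many parallel MPI jobs feeding a central server, but the data-flow pattern is the same: simulate, transmit, consume, discard.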

Working out which simulations to run so as to train the neural network optimally (active learning) is a key part of the problem, and I’m not aware of much work in that area.

Bhan et al. (2024) tackles the closely related problem of controlling PDEs. Kim, Kim, and Lee (2024) is the only actual active learning approach I have seen in recent literature.
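For concreteness, here is a generic query-by-committee loop — not the method of Kim, Kim, and Lee (2024), just a minimal sketch of the acquisition pattern, with a hypothetical cheap scalar simulator in place of a PDE solve: fit an ensemble of surrogates on bootstrap resamples, then spend the next simulation where the ensemble disagrees most.

```python
import numpy as np

def simulate(theta):
    """Hypothetical stand-in for an expensive PDE solve: a scalar
    quantity of interest as a function of one parameter."""
    return np.sin(3 * theta) + 0.1 * theta**2

def ensemble_predict(thetas, ys, candidates, n_members=8, degree=3):
    """Query-by-committee surrogate: fit polynomial models on bootstrap
    resamples; return per-candidate mean and disagreement (std dev)."""
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, len(thetas), len(thetas))
        coeffs = np.polyfit(thetas[idx], ys[idx], degree)
        preds.append(np.polyval(coeffs, candidates))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# Active-learning loop: simulate next where the committee disagrees most.
thetas = np.linspace(0.0, 2.0, 6)                 # initial design
ys = np.array([simulate(t) for t in thetas])
candidates = np.linspace(0.0, 2.0, 200)
for _ in range(5):
    _, disagreement = ensemble_predict(thetas, ys, candidates)
    theta_next = candidates[np.argmax(disagreement)]
    thetas = np.append(thetas, theta_next)
    ys = np.append(ys, simulate(theta_next))
```

The interesting (and open) part for PDEs is that the “parameter” is usually a whole function — an initial condition, forcing, or geometry — so the candidate set and the disagreement measure both need more thought than this one-dimensional toy suggests.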

1 References

Bhan, Bian, Krstic, et al. 2024. “PDE Control Gym: A Benchmark for Data-Driven Boundary Control of Partial Differential Equations.” In Proceedings of the 6th Annual Learning for Dynamics & Control Conference.
Brandstetter, Berg, Welling, et al. 2022. “Clifford Neural Layers for PDE Modeling.” In.
Gupta, and Brandstetter. 2022. “Towards Multi-Spatiotemporal-Scale Generalized PDE Modeling.”
Kim, Kim, and Lee. 2024. “Flexible Active Learning of PDE Trajectories.”
Koehler, Niedermayr, Westermann, et al. 2024. “APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs.”
Li, Perlman, Wan, et al. 2008. “A Public Turbulence Database Cluster and Applications to Study Lagrangian Evolution of Velocity Increments in Turbulence.” Journal of Turbulence.
Ohana, McCabe, Meyer, et al. 2024. “The Well: A Large-Scale Collection of Diverse Physics Simulations for Machine Learning.” In The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Otness, Gjoka, Bruna, et al. 2021. “An Extensible Benchmark Suite for Learning to Simulate Physical Systems.” In.
Ribés, and Raffin. 2020. “The Challenges of In Situ Analysis for Multiple Simulations.” In.
Takamoto, Praditia, Leiteritz, et al. 2022. “PDEBench: An Extensive Benchmark for Scientific Machine Learning.” In.
Terraz, Ribes, Fournier, et al. 2017. “Melissa: Large Scale in Transit Sensitivity Analysis Avoiding Intermediate Files.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
Yu, Kanov, Perlman, et al. 2012. “Studying Lagrangian Dynamics of Turbulence Using on-Demand Fluid Particle Tracking in a Public Turbulence Database.” Journal of Turbulence.