Simulation-based inference
If I knew the right inputs to the simulator, could I get behaviour which matched my observations?
December 23, 2014 — March 3, 2024
This is chaos right now; I’m consolidating notebooks. Categories may not be well-posed.
Suppose we have access to a simulator of a system of interest, and that if we knew the “right” inputs we could get behaviour from it matching some observations we have made of a related phenomenon in the world. Suppose further that the simulator is messy enough that we have no tractable likelihood. Can we still do statistics, e.g. infer the parameters of the simulator which would give rise to the observations we have made?
Oh my, what a variety of ways we can try.
There are various families of methods here: some work purely with samples; others try to approximate the likelihood. I am not sure how all the methods relate to one another, but let us mention some.
Cranmer, Brehmer, and Louppe (2020) attempt a taxonomy in their Figure 2, and make a case for likelihood-free methods in machine learning for physics.
1 Neural likelihood methods
As summarised in Cranmer, Brehmer, and Louppe (2020); see Neural likelihood inference. See the Mackelab sbi page for implementations of several such methods, targeting simulation-based inference in particular.
Compare to contrastive learning.
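A minimal sketch of what the workflow looks like in the sbi package, here with sequential neural likelihood estimation (SNLE), one of the methods it implements. The toy simulator, prior, and all numbers are my own inventions, and sbi’s API has moved around between versions, so treat the details as indicative:

```python
import torch
from sbi.inference import SNLE
from sbi.utils import BoxUniform

# Toy simulator: we only ever draw samples from it,
# never evaluate its likelihood.
def simulator(theta):
    return theta + 0.1 * torch.randn_like(theta)

prior = BoxUniform(low=-2 * torch.ones(2), high=2 * torch.ones(2))

# Simulate a training set of (parameter, output) pairs.
theta = prior.sample((2000,))
x = simulator(theta)

# Fit a neural surrogate for the likelihood p(x | theta).
inference = SNLE(prior=prior)
inference.append_simulations(theta, x)
likelihood_estimator = inference.train()

# Combine the surrogate likelihood with the prior (MCMC under the hood)
# and sample the approximate posterior given an observation.
posterior = inference.build_posterior(likelihood_estimator)
x_obs = torch.tensor([0.5, -0.3])
samples = posterior.sample((1000,), x=x_obs)
```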
2 Indirect inference
A.k.a. the auxiliary method.
In the (older?) frequentist framing, you can get through an undergraduate program in statistics without simulation-based inference ever arising. However, I am pretty sure it is unavoidable for economists and ecologists.
Quoting Cosma:
[…] your model is too complicated for you to appeal to any of the usual estimation methods of statistics. […] there is no way to even calculate the likelihood of a given data set \(x_1, x_2, \ldots, x_t \equiv x_1^t\) under parameters \(\theta\) in closed form, which would rule out even numerical likelihood maximization, to say nothing of Bayesian methods […] Yet you can simulate; it seems like there should be some way of saying whether the simulations look like the data. This is where indirect inference comes in […] Introduce a new model, called the “auxiliary model”, which is mis-specified and typically not even generative, but is easily fit to the data, and to the data alone. (By that last I mean that you don’t have to impute values for latent variables, etc., etc., even though you might know those variables exist and are causally important.) The auxiliary model has its own parameter vector \(\beta\), with an estimator \(\hat{\beta}\). These parameters describe aspects of the distribution of observables, and the idea of indirect inference is that we can estimate the generative parameters \(\theta\) by trying to match those aspects of observations, by trying to match the auxiliary parameters.
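To make the recipe concrete, here is a toy implementation under a setup of my own invention: the “intractable” generative model is a nonlinear autoregression, the auxiliary model is a deliberately mis-specified AR(1) fit by least squares, and we estimate \(\theta\) by matching the auxiliary estimates:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def simulate(theta, t=500, seed=0):
    """Generative model we pretend has an intractable likelihood:
    a nonlinear autoregression x_i = theta * tanh(x_{i-1}) + noise."""
    rng = np.random.default_rng(seed)
    x = np.zeros(t)
    for i in range(1, t):
        x[i] = theta * np.tanh(x[i - 1]) + rng.normal()
    return x

def auxiliary_fit(x):
    """Auxiliary model: AR(1) coefficient by least squares.
    Mis-specified, not generative, but cheap and fit to the data alone."""
    return np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)

x_obs = simulate(0.7, seed=12345)   # "observed" data
beta_obs = auxiliary_fit(x_obs)

def loss(theta, n_sim=20):
    # Common random numbers (fixed seeds reused across theta values)
    # keep the objective smooth enough for a deterministic optimizer.
    betas = [auxiliary_fit(simulate(theta, seed=s)) for s in range(n_sim)]
    return (np.mean(betas) - beta_obs) ** 2

est = minimize_scalar(loss, bounds=(0.0, 0.99), method="bounded")
print(est.x)  # with luck, somewhere near the true theta = 0.7
```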
Aaron King’s lab at the University of Michigan has stamped its mark on a lot of this research. One wonders whether the optimal summary statistic can be learned from the data. Apparently yes.
I gather the pomp R package does some simulation-based inference, but I have not checked in for a while, so there may be broader and/or fresher options.
3 Scoring rules
See scoring rules (Gneiting and Raftery 2007; Pacchiardi and Dutta 2022). NB, these are calibration scores, not Fisher scores.
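One concrete example is the energy score which, in the convention of Gneiting and Raftery (2007), scores a predictive distribution \(P\) against a realised observation \(y\) as
\[
\mathrm{ES}(P, y) = \mathbb{E}_{X \sim P}\|X - y\| - \tfrac{1}{2}\,\mathbb{E}_{X, X' \sim P}\|X - X'\|,
\]
with \(X, X'\) independent draws from \(P\). Crucially for our setting, both expectations can be estimated from simulator draws alone, with no density evaluations.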
3.1 Energy distances
I thought I knew what this was, but apparently not. The fact that there are so many grandiose publications here (Gneiting and Raftery 2007; Székely and Rizzo 2013, 2017) leads me to suspect there is more going on than the obvious? TBC.
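For what it is worth, the basic sample statistic is simple; a minimal numpy sketch of the Székely–Rizzo energy distance (the toy data are mine):

```python
import numpy as np

def energy_distance(x, y):
    """Sample version of the energy distance (Székely and Rizzo):
    2 E||X - Y|| - E||X - X'|| - E||Y - Y'||,
    with each expectation replaced by a mean over pairwise distances.
    (This keeps the zero diagonal in the within-sample terms, i.e.
    it is the slightly biased V-statistic form.)"""
    def mean_pdist(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(300, 2))
y = rng.normal(0.5, 1.0, size=(300, 2))
print(energy_distance(x, y))  # positive; shrinks as the distributions match
```

Part of the answer to my puzzlement may be that the energy distance turns out to be a special case of the MMD (next section) under a distance-induced kernel.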
3.2 MMD
A particularly convenient discrepancy to use for simulation-based problems is the MMD, because it can be evaluated without reference to a density. See Maximum Mean Discrepancy.
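A minimal numpy sketch of the unbiased MMD² estimator under an RBF kernel (kernel and bandwidth choices are mine; the median heuristic is a common default for the bandwidth):

```python
import numpy as np

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x and y
    under an RBF kernel; no densities needed, only samples."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    # Drop the diagonal terms so the within-sample means are unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

# e.g. compare simulator output at a candidate theta to observations:
rng = np.random.default_rng(0)
x_obs = rng.normal(0.0, 1.0, size=(200, 1))
x_sim = rng.normal(0.3, 1.0, size=(200, 1))
print(mmd2_unbiased(x_obs, x_sim))
```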
4 Approximate Bayesian Computation
A slightly different take, which resembles the indirect inference approach in that it matches summary statistics of simulations to summaries of the data. See Approximate Bayesian Computation. A toy sketch follows.
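The simplest variant, rejection ABC, fits in a few lines. Here the simulator, prior, summary statistic, and tolerance are all my own inventions:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    # Toy simulator: Gaussian with unknown mean, known scale.
    return rng.normal(theta, 1.0, size=n)

x_obs = simulate(0.7)
s_obs = x_obs.mean()  # summary statistic, echoing indirect inference

# Rejection ABC: draw theta from the prior, simulate, and keep theta
# only if the simulated summary lands within epsilon of the observed one.
eps = 0.05
accepted = []
for _ in range(20_000):
    theta = rng.uniform(-2.0, 2.0)      # prior draw
    if abs(simulate(theta).mean() - s_obs) < eps:
        accepted.append(theta)

print(len(accepted), np.mean(accepted))  # approximate posterior sample
```

Shrinking eps (and conditioning on richer summaries) tightens the approximation, at the price of ever fewer acceptances.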