Gibbs posteriors
Bayes-like inference with losses instead of likelihoods
September 26, 2024 — February 11, 2025
Let’s do some Bayesian inference! We have a parameter \(\theta\) and some (i.i.d. for now) data \(X = \{x_1, x_2, \ldots, x_n\}\). Suppose we have a prior \(\pi(\theta)\) and a likelihood \(p(x|\theta)\); for now, we’ll assume it has a density. The update from prior to posterior is given by Bayes’ theorem: \[ \pi_n(\theta) \propto \pi(\theta) \prod_{i=1}^n p(x_i|\theta) \tag{1}\] or in the log domain \[ \log \pi_n(\theta) = \log \pi(\theta) + \sum_{i=1}^n \log p(x_i|\theta) - \log(\text{marginal likelihood}). \] In a Gibbs posterior approach, we decide the likelihood isn’t quite cutting the mustard but still want to get something like a Bayesian posterior in a more general setting. We do two things:
- Replace the log-likelihood \(\log p(x_i|\theta)\), or rather the negative log-likelihood \(-\log p(x_i|\theta)\), with a different loss function \(\ell(\theta, x_i)\).
- Introduce a learning rate factor \(\omega\) that lets us control how much we trust the loss function relative to the prior.
A Gibbs posterior is not the same thing as a Gibbs sampler or a Gibbs distribution; it is a way of doing Bayesian inference that uses a loss function instead of a likelihood. That said, you can, confusingly, use Gibbs samplers to sample from Gibbs posteriors.
How does that look? \[ \begin{aligned} \pi_n(\theta) &\propto \exp\Bigl\{-\omega \sum_{i=1}^n \ell(\theta, x_i) \Bigr\}\,\pi(\theta)\\ &=\exp\Bigl\{-\omega R_n(\theta)\Bigr\}\,\pi(\theta), \end{aligned} \tag{2}\] where \[ R_n(\theta) = \sum_{i=1}^n \ell(\theta, x_i) \] is simply the empirical risk (sometimes we put a factor of \(1/n\) in front of the sum to make it an average, in which case the exponent becomes \(-\omega n R_n(\theta)\)).
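For concreteness, here is a minimal numerical sketch of Equation 2, assuming a scalar parameter, a grid approximation, and an arbitrary per-observation loss. The names (`gibbs_posterior_grid`, `ell`) and the toy data are mine, purely for illustration.

```python
# A minimal sketch of Equation 2 on a grid, for a scalar parameter theta.
# `gibbs_posterior_grid` and `ell` are illustrative names, not standard API.
import numpy as np

def gibbs_posterior_grid(thetas, log_prior, ell, x, omega=1.0):
    """Evaluate the Gibbs posterior of Equation 2 on a grid and normalise numerically."""
    # empirical risk R_n(theta): a plain sum of per-datum losses
    risk = np.array([sum(ell(theta, xi) for xi in x) for theta in thetas])
    log_post = log_prior(thetas) - omega * risk
    log_post -= log_post.max()                            # guard against overflow
    post = np.exp(log_post)
    return post / (post.sum() * (thetas[1] - thetas[0]))  # integrates to ~1 on the grid

# e.g. a location parameter under absolute-error loss, with a standard normal prior
thetas = np.linspace(-5, 5, 2001)
x = np.array([0.3, -0.1, 0.8, 4.0])                       # toy data with an outlier
post = gibbs_posterior_grid(
    thetas,
    log_prior=lambda t: -0.5 * t**2,                      # N(0, 1) prior, up to a constant
    ell=lambda theta, xi: abs(xi - theta),                # absolute-error loss
    x=x,
    omega=1.0,
)
```

Turning \(\omega\) down flattens the risk term relative to the prior: \(\omega \to 0\) gives back the prior, while large \(\omega\) concentrates the posterior near the empirical risk minimiser.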
\(\omega\) is a tempering/temperature factor.
We seem to have given up the likelihood principle, since the empirical risk is estimated directly rather than as an integral of a cost over a posterior predictive decision. Maybe this is okay if we aren’t sure about the likelihood anyway.
N. A. Syring (2018) is a thesis-length introduction. There is a compact explanation in Martin and Syring (2022).
Note that the Gibbs posterior coincides with the classical Bayesian posterior when we choose the loss function to be the negative log-likelihood, \(\ell(\theta, x_i) = -\log p(x_i|\theta)\), and set the learning rate to \(\omega = 1\); the usual Bayesian update is thus a special case of Equation 2.
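A quick numerical check of that claim, under an assumed conjugate normal-normal model (prior \(\theta \sim \mathcal{N}(0,1)\), likelihood \(x_i \sim \mathcal{N}(\theta, 1)\)) so the exact Bayesian posterior is available in closed form; with the negative log-likelihood as loss and \(\omega = 1\), a grid-computed Gibbs posterior should agree with it.

```python
# Sanity check: with ell = negative log-likelihood and omega = 1, the Gibbs
# posterior of Equation 2 matches the classical Bayesian posterior.
# Assumed model: theta ~ N(0, 1), x_i | theta ~ N(theta, 1) (conjugate, so the
# exact posterior is N(n * xbar / (n + 1), 1 / (n + 1))).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=20)
n, xbar = len(x), x.mean()

def nll(theta, xi):
    return 0.5 * (xi - theta) ** 2        # -log N(xi; theta, 1), dropping constants

thetas = np.linspace(-2, 4, 4001)
log_post = -0.5 * thetas**2 - np.array(   # log prior - omega * empirical risk, omega = 1
    [sum(nll(t, xi) for xi in x) for t in thetas]
)
gibbs = np.exp(log_post - log_post.max())
gibbs /= gibbs.sum() * (thetas[1] - thetas[0])

mu_n, var_n = n * xbar / (n + 1), 1.0 / (n + 1)
exact = np.exp(-0.5 * (thetas - mu_n) ** 2 / var_n) / np.sqrt(2 * np.pi * var_n)

print(np.abs(gibbs - exact).max())        # tiny: the two posteriors coincide up to grid error
```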
1 A worked example
Would be a good idea. Note that Equation 2 hides a sneaky implicit integral, just like normal Bayes: the normalising constant is an integral over the parameter space. It’s easy to say “the solution is just the function that satisfies so-and-so”, but calculating it can be tricky. I’m not sure when it would be harder or easier than classical Bayes in practice.
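Here is one possible sketch, chosen by me rather than taken from any particular paper: a Gibbs posterior for a median, using the absolute-error loss \(\ell(\theta, x) = |x - \theta|\), a \(\mathcal{N}(0, 10^2)\) prior, and a random-walk Metropolis sampler, since the normalising integral is rarely available in closed form. Everything here (the data-generating process, the prior scale, the step size, the names `log_gibbs` and `metropolis`) is an illustrative assumption.

```python
# A sketch of a worked example (my choice, not canonical): a Gibbs posterior
# for the median of heavy-tailed data, via absolute-error loss and a random-walk
# Metropolis sampler targeting the unnormalised density of Equation 2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_t(df=2, size=50) + 3.0       # heavy-tailed toy data, true median ~ 3

def log_gibbs(theta, x, omega=1.0):
    """Log of the unnormalised Gibbs posterior in Equation 2."""
    log_prior = -0.5 * (theta / 10.0) ** 2    # N(0, 10^2) prior, up to a constant
    risk = np.abs(x - theta).sum()            # empirical risk under absolute-error loss
    return log_prior - omega * risk

def metropolis(x, n_iter=20_000, step=0.3, omega=1.0):
    theta = np.median(x)                      # start somewhere sensible
    current = log_gibbs(theta, x, omega)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        cand = log_gibbs(prop, x, omega)
        if rng.uniform() < np.exp(min(0.0, cand - current)):   # Metropolis accept/reject
            theta, current = prop, cand
        draws[i] = theta
    return draws[n_iter // 2:]                # discard the first half as burn-in

draws = metropolis(x, omega=1.0)
print(draws.mean(), draws.std())              # centre and spread of the Gibbs posterior
```

The sampler only needs the unnormalised log density, which is exactly what Equation 2 supplies, so the computational story looks much like classical Bayes with a likelihood swapped for a loss.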
2 Theoretical guarantees
Not sure, but see (Martin and Syring 2022; N. Syring and Martin 2023; Luo et al. 2023).
3 As a robust Bayesian method
So it seems, although the literature of Gibbs posteriors looks quite different from the robust Bayes literature I’m used to.
4 Generalized variational inference
Gibbs posteriors seem related to so-called Generalized Variational Inference (Bissiri, Holmes, and Walker 2016). The use of a loss function instead of a likelihood sounds like a shared property.
5 Energy-based models
The connection to Energy-based models is expanded in Andy Jones’ Gibbs posteriors and energy-based models.