Generalized Bayesian inference
Approximating the Gibbs posterior
September 26, 2024 — February 12, 2025
1 ‘Generalized’
I dislike naming things “Generalized”, for all the obvious reasons. Imagine if biologists had named eukaryotes “Generalized Prokaryotes”. You could not get away with this anywhere else, but in machine learning it is somehow normal.
My ongoing battle to have a moratorium on naming anything “generalized” continues with no success.
1.1 Gibbs Posterior and Fancy Methods
The main idea seems to be to take the Gibbs posterior as the target of inference and then do something clever with it, such as approximating it with variational inference or other scalable machinery.
2 Generalized Bayesian Computation
I just saw a presentation on Dellaporta et al. (2022), which stakes a claim to the term “Generalized Bayesian Computation”. She mixes the bootstrap, Bayesian nonparametrics, MMD, and simulation-based inference in an M-open setting. I’m not sure which of the results are specific to that (impressive) paper, but Dellaporta name-checks Fong, Lyddon, and Holmes (2019), Lyddon, Walker, and Holmes (2018), Matsubara et al. (2022), Pacchiardi and Dutta (2022), and Schmon, Cannon, and Knoblauch (2021).
There’s some interesting stuff happening in that group. Maybe this introductory post will be a good start: Generalising Bayesian Inference.
3 Generalized Variational Inference
If we add a variational approximation, we can approximate the Gibbs posterior.
Knoblauch, Jewson, and Damoulas (2022) call this Generalized Variational Inference.1
The argument is that we can interpret the solution to the Robust Bayesian Inference problem variationally. Recall the empirical risk, written as an average over the data:
\[ R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta, x_i) \]
which defines the Gibbs posterior measure as
\[ \pi_n(\theta) \propto \exp\{-\omega\, n\, R_n(\theta)\}\,\pi(\theta). \]
They argue it is equivalent to solving an optimisation problem over probability measures \(q(\theta)\) of the form
\[ q^* = \arg\min_{q \in \mathcal{P}(\Theta)} \left\{\omega\, n\, \mathbb{E}_q\bigl[R_n(\theta)\bigr] + \mathrm{KL}(q\| \pi)\right\}. \]
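To spell out why the minimiser is exactly the Gibbs posterior (a standard rewriting of the objective, added here for completeness): the objective differs from a KL divergence to \(\pi_n\) only by a constant,
\[
\omega\, n\, \mathbb{E}_q\bigl[R_n(\theta)\bigr] + \mathrm{KL}(q\,\|\,\pi)
= \mathbb{E}_q\!\left[\log \frac{q(\theta)}{\pi(\theta)\exp\{-\omega\, n\, R_n(\theta)\}}\right]
= \mathrm{KL}(q\,\|\,\pi_n) - \log Z_n,
\]
where \(Z_n = \int \exp\{-\omega\, n\, R_n(\theta)\}\,\pi(\mathrm{d}\theta)\) is the normalising constant of \(\pi_n\) and does not depend on \(q\). The first term is minimised (at zero) by \(q^* = \pi_n\), which is the claimed equivalence.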
The GVI framework generalizes this by allowing three free ingredients that are fixed in classical Bayesian (or variational Bayesian) inference:
- a loss function \(\ell\), as in Gibbs posteriors
- a divergence function \(D\) (which doesn’t have to be the KL divergence)
- a variational family \(\mathcal{Q}\).
The optimisation objective is
\[ q^* = \arg\min_{q\in \mathcal{Q}} \left\{\mathbb{E}_q\biggl[\sum_{i=1}^n \ell(\theta,x_i)\biggr] + D(q\| \pi)\right\}. \]
In this setup, when \(D\) is the KL divergence, the loss is the (properly scaled) negative log-likelihood, and \(\mathcal{Q}\) is all of \(\mathcal{P}(\Theta)\), the classical Bayesian posterior is recovered.
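As a concrete, toy illustration of the three ingredients (not taken from any of the cited papers), here is a minimal sketch in Python: a Gaussian prior and a Gaussian variational family over a location parameter, a Huber loss standing in for the negative log-likelihood, and the KL divergence as \(D\), optimised by Monte Carlo with fixed base draws. The data, the Huber threshold `delta`, and the prior scale `tau` are all placeholder assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber

rng = np.random.default_rng(0)

# Toy data: a location parameter observed with a handful of gross outliers
# (entirely made up for illustration).
x = np.concatenate([rng.normal(1.0, 0.5, size=95), rng.normal(20.0, 0.5, size=5)])

tau = 10.0                  # prior is N(0, tau^2)
delta = 1.0                 # Huber threshold; this loss replaces the log-likelihood
eps = rng.normal(size=200)  # fixed base draws so the Monte Carlo objective is deterministic

def gvi_objective(params):
    """E_q[sum_i loss(theta, x_i)] + KL(q || prior) for q = N(m, s^2)."""
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps  # reparameterised samples theta ~ q
    # Monte Carlo estimate of the expected total loss over the data.
    expected_loss = huber(delta, theta[:, None] - x[None, :]).sum(axis=1).mean()
    # KL(N(m, s^2) || N(0, tau^2)) in closed form: the divergence ingredient.
    kl = np.log(tau / s) + (s**2 + m**2) / (2 * tau**2) - 0.5
    return expected_loss + kl

res = minimize(gvi_objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
m_hat, s_hat = res.x[0], np.exp(res.x[1])
print(f"GVI posterior over the location: mean {m_hat:.2f}, sd {s_hat:.2f}")
```

With the robust loss, the variational posterior should concentrate near 1 rather than being dragged towards the outliers; swapping the Huber loss for a suitably scaled squared-error (Gaussian) loss would recover ordinary mean-field variational Bayes for this toy model.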
4 Connection to other non-KL inference
TBD. See inference without KL divergence.
5 References
Footnotes
1. A name lab-grown to irritate me. I reject calling things “Generalized” and I also think that “variational inference” as statisticians use it is a misnomer. I acknowledge I will not win this naming fight, but that does not mean I need to like it.