Variational inference
On fitting the best model one can be bothered to
March 22, 2016 — May 24, 2020
Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.
This idea is not intrinsically Bayesian (i.e. the density we are approximating need not be a posterior density or the marginal likelihood of the evidence), but much of the hot literature on it is from Bayesians doing something fashionable in probabilistic deep learning, so for concreteness I will assume Bayesian uses here.
This is usually mentioned in contrast to the other main method of approximating such densities: sampling from them, usually using Markov Chain Monte Carlo. In practice, the two are related (Salimans, Kingma, and Welling 2015) and nowadays even used together (Rezende and Mohamed 2015; Caterini, Doucet, and Sejdinovic 2018).
Once we have decided we are happy to use variational approximations, we are left with the question of … how? There are, AFAICT, two main schools of thought here: methods which leverage the graphical structure of the problem and maintain structural hygiene, typically via variational message passing, and black-box methods which ignore that structure and simply attack the objective with Monte Carlo gradient estimates, as in stochastic and black box VI.
1 “Variational”
The term “variational” comes from the calculus of variations, where one seeks the function that minimises a functional. In the context of variational inference, that functional is typically the KL divergence between the variational approximation and the true posterior.
This usage really annoys me; there are other functionals we could minimise than KL, and the term “variational” is not helpful in distinguishing these.
This name is IMO bad. I want to change it. I will lose this battle; it is too late.
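Concretely, and in generic notation rather than anything from a particular reference: given a target posterior \(p(z\mid x)\) and a tractable family \(\mathcal{Q}\), the variational problem is

\[
q^{\ast} = \arg\min_{q \in \mathcal{Q}} \operatorname{KL}\bigl(q(z)\,\big\|\,p(z\mid x)\bigr)
= \arg\min_{q \in \mathcal{Q}} \int q(z) \log \frac{q(z)}{p(z\mid x)}\,\mathrm{d}z .
\]

The calculus-of-variations flavour comes from optimising over the function \(q\); in practice \(\mathcal{Q}\) is almost always a parametric family, so the optimisation is over ordinary parameters.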
2 Introduction
The classic intro seems to be (Jordan et al. 1999), which considers a diverse range of applications of variational calculus to inference. Typical ML uses these days are more specific; an archetypal example would be the variational auto-encoder (Diederik P. Kingma and Welling 2014).
3 Inference via KL divergence
The most common version uses the KL loss to construct the famous evidence lower bound (ELBO) objective. This is mathematically convenient and highly recommended if you can get away with it.
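For the record, the standard identity behind this (generic notation again): for any \(q\),

\[
\log p(x) = \underbrace{\mathbb{E}_{q(z)}\bigl[\log p(x, z) - \log q(z)\bigr]}_{\text{ELBO}} + \operatorname{KL}\bigl(q(z)\,\big\|\,p(z \mid x)\bigr),
\]

so maximising the ELBO over \(q\) is equivalent to minimising \(\operatorname{KL}(q \,\|\, p(\cdot \mid x))\), and since the KL term is non-negative the ELBO really is a lower bound on the log evidence.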
3.1 Implicit
Implicit VI sounds like it is a special case of this variational-loss setup, one where the variational family can be sampled from but its density cannot be evaluated? TBD.
4 Other loss functions
With respect to which divergence (or metric) should one approximate the target density? For tradition and convenience we usually use the KL loss, but this is not ideal, and alternatives are a hot topic. There are simple ones, such as “reverse KL,” which is sometimes how we justify expectation propagation, and also the modest generalisation to Rényi-divergence inference (Li and Turner 2016).
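For reference, and as I recall it from that paper, the Rényi/VR bound replaces the ELBO with, for \(\alpha \neq 1\),

\[
\mathcal{L}_{\alpha}(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z)}\!\left[\left(\frac{p(x, z)}{q(z)}\right)^{1-\alpha}\right],
\]

which recovers the usual ELBO in the limit \(\alpha \to 1\) and trades mode-seeking against mass-covering behaviour as \(\alpha\) varies.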
Ingmar Schuster’s critique of black box loss (Ranganath et al. 2016) raises some issues:
It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses \(\pi, q\) and test functions from some family \(\mathcal{F}\). I completely agree with the motivation: KL-Divergence in the form \(\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x\) indeed underestimates the variance of \(\pi\) and approximates only one mode. Using KL the other way around, \(\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x\) takes all modes into account, but still tends to underestimate variance.
[…] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density \(q\) at all but uses test functions exclusively.
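To make that mode-seeking/mass-covering point concrete, here is a toy sketch, not drawn from any of the papers above and with all names my own invention: fit a single Gaussian to a two-mode target, once by minimising the exclusive \(\operatorname{KL}(q\,\|\,\pi)\) with reparameterised Monte Carlo gradients (assuming JAX), and once by moment matching, which is what minimising the inclusive \(\operatorname{KL}(\pi\,\|\,q)\) over Gaussians amounts to.

```python
# Toy sketch: exclusive KL(q || pi) is mode-seeking, inclusive KL(pi || q) is
# mass-covering. Single-Gaussian variational family, two-mode target.
# Hypothetical example code, not from any of the papers cited in this post.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def target_logpdf(z):
    # Target density (normaliser irrelevant): the mixture 0.5 N(-3, 1) + 0.5 N(3, 1).
    return jnp.logaddexp(norm.logpdf(z, -3.0, 1.0),
                         norm.logpdf(z, 3.0, 1.0)) + jnp.log(0.5)

def exclusive_kl(params, eps):
    # Monte Carlo estimate of KL(q || pi) (up to pi's normalising constant),
    # using the reparameterisation z = m + exp(log_s) * eps.
    m, log_s = params
    z = m + jnp.exp(log_s) * eps
    return jnp.mean(norm.logpdf(z, m, jnp.exp(log_s)) - target_logpdf(z))

key = jax.random.PRNGKey(0)
k_eps, k_pi = jax.random.split(key)
eps = jax.random.normal(k_eps, (2000,))   # fixed base noise: a deterministic objective
params = jnp.array([1.0, 0.0])            # start slightly towards the right mode
grad_fn = jax.jit(jax.grad(exclusive_kl))
for _ in range(3000):                     # plain gradient descent, no optimiser library
    params = params - 0.02 * grad_fn(params, eps)
print("exclusive KL fit:", params[0], jnp.exp(params[1]))   # expect roughly N(3, 1)

# Minimising the inclusive KL(pi || q) over Gaussians is just moment matching:
z_pi = jnp.repeat(jnp.array([-3.0, 3.0]), 1000) + jax.random.normal(k_pi, (2000,))
print("inclusive KL fit:", z_pi.mean(), z_pi.std())          # expect roughly N(0, 3.2)
```

The exclusive fit should lock onto whichever mode it starts nearest and report a small scale; the inclusive fit straddles both modes with an inflated scale, matching the behaviour described in the quoted critique.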
5 “Generalized”
Variational inference in which the variational loss is not the KL divergence. See Generalized Variational Inference.
6 Philosophical interpretations
John Schulman’s Sending Samples Without Bits-Back is a nifty interpretation of KL variational bounds in terms of coding theory/message sending.
Not grandiose enough? See Karl Friston’s interpretation of variational inference as a principle of cognition.
7 In graphical models
8 Mean-field assumption
TODO: mention the importance of this for classic-flavoured variational inference (Mean Field Variational Bayes). This confused me for aaaaages. AFAICT this is an accident of history: not all variational inference makes the confusingly-named “mean-field” assumption, but for a long while that was the only game in town, so tutorials of a certain vintage treat mean-field variational inference as a synonym for variational inference. If I have just learnt some non-mean-field SVI methods from a recent NeurIPS paper and then run into this older usage, I might well be confused.
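For the record, the mean-field assumption is the factorisation

\[
q(z_1, \dots, z_M) = \prod_{i=1}^{M} q_i(z_i),
\]

under which the classic coordinate-ascent updates have the form \(q_i^{\ast}(z_i) \propto \exp\bigl(\mathbb{E}_{q_{-i}}[\log p(x, z)]\bigr)\), each factor updated with the others held fixed. (Standard textbook material, not specific to any of the methods mentioned above.)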
9 Mixture models
Mixture models are classic and, for ages, seemed to be the default choice for variational approximation. They are an interesting trick for making a graphical model conditionally conjugate by the use of auxiliary variables.
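Spelled out a little, in my notation rather than anyone else’s: a mixture variational family

\[
q(z) = \sum_{k=1}^{K} w_k\, q_k(z)
\qquad\text{is augmented to}\qquad
q(z, c = k) = w_k\, q_k(z),
\]

so that, conditional on the auxiliary indicator \(c\), each component \(q_k\) can be chosen conjugate to the model even though the marginal mixture \(q(z)\) is not.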
10 Reparameterization trick
See reparameterisation.
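A micro-sketch in the same hypothetical JAX setting as above, just to show the point: the sample is a differentiable function of the variational parameters, so gradients of a Monte Carlo objective flow through it.

```python
import jax
import jax.numpy as jnp

def sample_q(params, eps):
    # Reparameterised Gaussian sample: z = mu + sigma * eps with eps ~ N(0, 1).
    mu, log_sigma = params
    return mu + jnp.exp(log_sigma) * eps

eps = jax.random.normal(jax.random.PRNGKey(0), (5,))
# Gradients of any function of the sample with respect to (mu, log_sigma):
grads = jax.grad(lambda p: jnp.mean(sample_q(p, eps) ** 2))(jnp.array([0.0, 0.0]))
print(grads)
```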