Annealing in inference
Tempering, cooling, Platt scaling…
September 30, 2020 — October 17, 2024
Placeholder for a concept that has cropped up a few times in my conversations of late. Informally, annealing methods are ones in which we change the “temperature” of a system whose energy is (up to sign) a certain log-probability density, which ends up being the same thing as raising the density to a power, i.e. multiplying the log density by a constant. Other related concepts include cooling densities, tempering, Platt scaling, fractional densities, cold posteriors and some other stuff.
Call the \(\tau\)-tempering of a density \(p(\mathbf{x})\) the density proportional to \(p(\mathbf{x})^\tau\), for \(\tau\in\mathbb{R}^+\). For it to still be normalized we need to divide by the partition function, so that \[p^\tau(\mathbf{x}) = \frac{p(\mathbf{x})^\tau}{Z(\tau)}, \quad Z(\tau) = \int p(\mathbf{x})^\tau \,\mathrm{d}\mathbf{x}.\] In practice we can normally get away with just using the unnormalized density.
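Here is a minimal numerical sketch (my own, with a made-up mixture density, not anything from the references) of what tempering does: scale the log density by \(\tau\) and recompute \(Z(\tau)\) on a grid.

```python
# Sketch: tau-tempering an unnormalized 1-d density on a grid,
# recomputing the partition function Z(tau) numerically.
import numpy as np
from scipy.special import logsumexp

def temper(log_density, tau):
    """Return the (unnormalized) log of the tau-tempering of `log_density`."""
    return lambda x: tau * log_density(x)

# Made-up example: an unnormalized two-component Gaussian mixture.
def log_p(x):
    return np.logaddexp(-0.5 * (x + 2.0) ** 2, -0.5 * (x - 2.0) ** 2)

xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]

for tau in (0.5, 1.0, 2.0):
    lp = temper(log_p, tau)(xs)
    log_Z = logsumexp(lp) + np.log(dx)   # Z(tau) by quadrature
    p_tau = np.exp(lp - log_Z)           # normalized tempered density
    print(f"tau={tau}: integrates to {p_tau.sum() * dx:.4f}")  # ~1.0000
```

Larger \(\tau\) sharpens the modes (cooling); \(\tau<1\) flattens them.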
I am not sure of the origin of the annealing concept, but Gelfand and Mitter (1990) is perhaps an early introduction with practical application.
1 As data weighting
If we do not wish to accumulate “all the information” in a data point, but would rather prefer to “weight” it somehow (“Let’s only be 60% as influenced by this datum as we might naively be”) then one natural interpretation of the weighting is as a tempering, i.e. using the \(\tau=0.6\)-tempering of the data likelihood. This is an obvious trick to try in Bayesian statistics, but its formal treatment is surprisingly recent; see Wang, Kucukelbir, and Blei (2017).
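As a toy illustration (my own example, not from the cited paper), here is data weighting as likelihood tempering in a conjugate Beta–Bernoulli model, where the weight simply scales the effective counts.

```python
# Sketch: data weighting as likelihood tempering in a Beta-Bernoulli model.
# Prior p ~ Beta(a, b); data are k heads out of n flips.
# The tau-tempered likelihood [p^k (1 - p)^(n - k)]^tau is still conjugate,
# so the posterior is Beta(a + tau * k, b + tau * (n - k)).

def tempered_posterior(a, b, k, n, tau=1.0):
    """Beta posterior parameters when the Bernoulli likelihood is raised to tau."""
    return a + tau * k, b + tau * (n - k)

print(tempered_posterior(1, 1, k=7, n=10, tau=1.0))  # (8, 4): full weight
print(tempered_posterior(1, 1, k=7, n=10, tau=0.6))  # (5.2, 2.8): 60% as influenced
```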
2 In Gibbs posteriors
Tempering seems to arise naturally in the Gibbs posterior framework.
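To fix notation (my own gloss, not a quotation from any particular paper): in a Gibbs or generalized posterior the negative log-likelihood is replaced by an arbitrary loss \(\ell\), and a scalar \(\tau\) (variously called a learning rate or inverse temperature) scales how strongly the data speak relative to the prior, \[ \pi_\tau(\theta \mid x_{1:n}) \propto \exp\left( -\tau \sum_{i=1}^{n} \ell(\theta; x_i) \right) \pi(\theta). \] Taking \(\ell(\theta; x) = -\log p(x \mid \theta)\) recovers a \(\tau\)-tempered likelihood times an untempered prior, which is exactly the “tempered posterior” form that appears below.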
3 “Cold” posteriors
If \(\tau>1\) (equivalently, temperature \(T = 1/\tau < 1\)), we call this “cooling”, and the result a “cold” posterior.
Wenzel et al. (2020) argue, in the context of Bayesian NNs:
…[W]e demonstrate that predictive performance is improved significantly through the use of a “cold posterior” that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate the hypotheses through experiments.
Much debate was sparked. See Aitchison (2020), Adlam, Snoek, and Smith (2020), Noci et al. (2021), Izmailov et al. (2021). They also draw a parallel to Masegosa (2020) which looks somewhat interesting.
Aitchison (2020) introduces the machinery:
Tempered (e.g. Zhang et al. 2018) and cold (Wenzel et al. 2020) posteriors differ slightly in how they apply the temperature parameter. For cold posteriors, we scale the whole posterior, whereas tempering is a method typically applied in variational inference, and corresponds to scaling the likelihood but not the prior, \[ \begin{aligned} \log \mathrm{P}_{\text {cold }}(\theta \mid X, Y) & =\frac{1}{T} \log \mathrm{P}(X, Y \mid \theta)+\frac{1}{T} \log \mathrm{P}(\theta)+\text { const } \\ \log \mathrm{P}_{\text {tempered }}(\theta \mid X, Y) & =\frac{1}{\lambda} \log \mathrm{P}(X, Y \mid \theta)+\log \mathrm{P}(\theta)+\text { const. } \end{aligned} \] While cold posteriors are typically used in SGLD, tempered posteriors are usually targeted by variational methods. In particular, variational methods apply temperature scaling to the KL-divergence between the approximate posterior, \(\mathrm{Q}(\theta)\), and prior, \[ \mathcal{L}=\mathbb{E}_{\mathrm{Q}(\theta)}[\log \mathrm{P}(X, Y \mid \theta)]-\lambda \mathrm{D}_{\mathrm{KL}}(\mathrm{Q}(\theta) \| \mathrm{P}(\theta)) . \] Note that the only difference between cold and tempered posteriors is whether we scale the prior, and if we have Gaussian priors over the parameters (the usual case in Bayesian neural networks), this scaling can be absorbed into the prior variance, \[ \frac{1}{T} \log \mathrm{P}_{\text {cold }}(\theta)=-\frac{1}{2 T \sigma_{\text {cold }}^2} \sum_i \theta_i^2+\text { const }=-\frac{1}{2 \sigma_{\text {tempered }}^2} \sum_i \theta_i^2+\text { const }=\log \mathrm{P}_{\text {tempered }}(\theta), \] in which case \(\sigma_{\text {cold }}^2=\sigma_{\text {tempered }}^2 / T\), so the tempered posteriors we discuss are equivalent to cold posteriors with rescaled prior variances.
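A quick numerical check of that last identity (my own sketch with an arbitrary toy log-likelihood): the cold posterior at temperature \(T\) with prior variance \(\sigma^2_{\text{cold}}\) agrees, up to an additive constant, with the tempered posterior using \(\lambda = T\) and \(\sigma^2_{\text{tempered}} = T \sigma^2_{\text{cold}}\).

```python
# Sketch: a cold posterior with prior variance sigma2_cold matches (up to an
# additive constant) a tempered posterior with lambda = T and
# sigma2_tempered = T * sigma2_cold, whatever the log-likelihood is.
import numpy as np

rng = np.random.default_rng(0)
T = 0.3                                  # temperature; T < 1 means "cold"
sigma2_cold = 1.0
sigma2_tempered = T * sigma2_cold

def log_lik(theta):
    # stand-in log-likelihood; any function of theta works for this check
    return -0.5 * np.sum((theta - 1.7) ** 2)

def log_prior(theta, sigma2):
    # isotropic Gaussian prior, up to a constant
    return -0.5 * np.sum(theta ** 2) / sigma2

def log_post_cold(theta):
    return (log_lik(theta) + log_prior(theta, sigma2_cold)) / T

def log_post_tempered(theta):
    return log_lik(theta) / T + log_prior(theta, sigma2_tempered)

# The two unnormalized log-posteriors differ only by a theta-independent constant.
thetas = [rng.normal(size=5) for _ in range(4)]
diffs = [log_post_cold(th) - log_post_tempered(th) for th in thetas]
print(np.allclose(diffs, diffs[0]))      # True
```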
4 Examples for particular likelihoods
4.1 Gaussian
For a multivariate Gaussian distribution in canonical (information) form, the density is expressed as \[ p(x) \propto \exp\left( -\tfrac{1}{2} x^\top \Lambda x + x^\top \eta \right), \]
where
- \(\Lambda\) is the precision matrix (the inverse of the covariance matrix \(\Sigma\), i.e., \(\Lambda = \Sigma^{-1}\)),
- \(\eta = \Lambda \mu\) is the information vector,
- \(\mu\) is the mean vector.
When we temper this density by \(\tau\), we get
\[ \begin{aligned} p_\tau(x;\eta,\Lambda) &\propto \left[ \exp\left( -\tfrac{1}{2} x^\top \Lambda x + x^\top \eta \right) \right]^\tau \\ &= \exp\left( -\tfrac{1}{2} x^\top (\tau\Lambda) x + x^\top (\tau\eta) \right)\\ &\propto p(x;\tau\eta,\tau\Lambda), \end{aligned} \] i.e. tempering scales both natural parameters by \(\tau\).
In the moments form, tempering a multivariate Gaussian distribution by a scalar \(\tau > 0\) results in (see the numerical check below):
- Unchanged Mean: \(\mu' = \mu\)
- Scaled Covariance: \(\Sigma' = \dfrac{1}{\tau} \Sigma\)
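Here is a small numerical check (my own sketch): convert a Gaussian from moments to canonical form, scale both natural parameters by \(\tau\), and read the moments back off.

```python
# Sketch: tempering a multivariate Gaussian via its canonical (information)
# parameters, then converting back to moments. Expect mean unchanged and
# covariance scaled by 1/tau.
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
tau = 2.5

# moments -> canonical form
Lambda = np.linalg.inv(Sigma)        # precision matrix
eta = Lambda @ mu                    # information vector

# tempering scales both natural parameters by tau
Lambda_t = tau * Lambda
eta_t = tau * eta

# canonical -> moments form of the tempered density
Sigma_t = np.linalg.inv(Lambda_t)
mu_t = Sigma_t @ eta_t

print(np.allclose(mu_t, mu))              # True: mean unchanged
print(np.allclose(Sigma_t, Sigma / tau))  # True: covariance scaled by 1/tau
```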