Normalising flows

Invertible density models, sounding clever by using the word diffeomorphism like a real mathematician

April 3, 2018 — June 12, 2024

approximation
Bayes
density
likelihood free
Monte Carlo
nonparametric
optimization
probabilistic algorithms
probability
statistics
Figure 1: Approximating the desired distribution by perturbation of the available distribution

Cleverly structured “nice” transforms of RVs to sample from tricky target distributions via an easy source distribution. Useful in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables such that I can differentiate through the parameters of a distribution”. Storchastic credits this to (Glasserman and Ho 1991) as perturbation analysis.

Connections to optimal transport and likelihood-free inference, in that this trick can enable some clever approximate-likelihood approaches. Uses transport maps.

Terminology: All variables here are assumed to take values in \(\mathbb{R}^D\). If I am writing about a random variable I write it \(\mathsf{x}\) and if I am writing about the realised values, I write it \(x\). \(\mathsf{x}\sim p(x)\) should be read “The random variable \(\mathsf{x}\) has law with density \(p(x)\).” 1

1 Basic

AFAICT the terminology is chosen to imply that the reparameterization should be differentiable, bijective and computationally convenient Ingmar Shuster’s summary of the foundational (Rezende and Mohamed 2015) is a little snarky:

The paper adopts the term normalising flow for referring to the plain old change of variables formula for integrals.

I have had the dissertation (Papamakarios 2019) recommended to me a lot as a summary for this field, although I have not read it. The shorter summary paper looks good though (Papamakarios et al. 2021). See Kobyzev, Prince, and Brubaker (2021) for a recent review.

Normalizing flows are a method of variational inference. As with typical variational inference, it is convenient for us while doing inference to have a way of handling an approximate posterior density over the latent variables \(\mathsf{z}\) given the data \(\mathsf{x}\). The real posterior density \(p(z|x)\) is intractable, so we construct an approximate density \(q_{\phi}(z|x)\) parameterised by some variational parameters \(\phi\). We have certain desiderata. Specifically, we would like to efficiently…

  1. calculate the density \(q_{\phi}(z|x)\), such that…
  2. we can estimate expectations/integrals with respect to \(q_{\phi}(\cdot|z)\) in the sense that we can estimate \(\int q_{\phi}(\cdot|z)f(z) \mathrm{d} z,\) which we are satisfied to do via Monte Carlo, and so it will suffice if we can simulate from this density, and
  3. we can differentiate through this density with respect to the variational parameters to find \({\textstyle \frac{\partial }{\partial \phi }q_{\phi}(z|x)}.\)
  4. Our method is ‘flexible enough’ to approximate a complicated, messy posterior density.

The whole problem in such a case would entail solving for the variational objective:

\[\begin{aligned} \log p_{ \theta }\left( x\right) &\geq \mathcal{L}\left( \theta, \phi ; x\right)\\ &=\mathbb{E}_{q_{\phi}( z | x )}\left[-\log q_{\phi}( z | x )+ \log p_{ \theta }( x, z )\right]\\ &=-\operatorname{KL}\left(q_{\phi}\left( z | x\right) \| p_{ \theta }( z )\right)+ \mathbb{E}_{q_{\phi}\left( z | x\right)}\left[ \log p_{ \theta }\left( x | z \right)\right]\\ &=-\mathcal{F}(\theta, \phi) \end{aligned}\]

Here \(p_{\theta}(x)\) is the marginal likelihood of the generative model for the data, \(p_{\theta}(x|z)\) the density of observations given the latent factors and \(\theta\) parameterises the density of the generative model.

(For the next bit, we temporarily suppress the dependence on \(\mathsf{x}\) to avoid repetition, and the dependence of the transforms and density upon \({\phi}\).)

Normalizing flows use the reparameterization trick as an answer to those desiderata. Specifically, we assume that for some function (not density!) \(f:\mathbb{R}^D\to\mathbb{R}^D,\) that \[\mathsf{z}=f(\mathsf{z}_0)\] and that \(\mathsf{z}_0\sim q_{0}=\mathcal{N}(0,\mathrm{I} )\) (or some similarly easy distribution). It turns out that by imposing some extra restrictions, we can do most of the heavy lifting in this algorithm through this simple \(\mathsf{z}_0\sim q_0\) and still get the power of a fancy posterior \(\mathsf{z}\sim q.\)

Now we can calculate (sufficiently nice) expectations with respect to \(\mathsf{z}\) using the law of the unconscious statistician \[\mathbb{E} f(\mathsf{z})=\int f(z) q_0(z) \mathrm{d} z.\]

However, we need to impose additional conditions to guarantee a tractable form for the densities; in particular we impose the restriction that \(f\) is invertible, so that we can use the change of variables formula to find the density

\[q\left( z \right) =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f^{-1}(z)}{\partial z }\right| =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f(z)}{\partial z }\right|^{-1}. \]

We can economise on function inversions since we are evaluating always via simulating \(\mathsf{z}\) from \(f(\mathsf{z}_0)\), i.e. \(\mathsf{z}:=f( \mathsf{z}_0)\), so we can write

\[q\left( \mathsf{z} \right) =q_0(\mathsf{z}_0 )\left| \operatorname{det} \frac{\partial f(\mathsf{z})}{\partial \mathsf{z} }\right|^{-1}.\]

We do not need, that is to say, \(f\) inversions, just the Jacobian determinant, which might be easier. Spoiler: later on we construct some \(f\) transforms for which it is easy-ish to find that Jacobian determinant without inverting the function.

We want this mapping to depend upon \(\mathsf{x}\) and the parameters \(\phi\) so we reintroduce that dependence now. We parameterize these functions \(f_{\phi}(\mathsf{z}):=f(\mathsf{z},\phi(\mathsf{x}));\) that is, our \(\phi\) parameters are a learned mapping from observation \(\mathsf{z}_0\) to a posterior sample from the latent variable \(\mathsf{z}\) with density \(q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)\) such that, if we configure everything just right, \(q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)\).

And indeed it is not too hard to find a recipe for optimising these parameters. We may estimate the following derivatives $_{} ( x ) $ and \(\nabla_{\phi} \mathcal{F}( x )\) which is sufficient to minimise \(\mathcal{F}(\theta, \phi)\) by gradient descent.

In practice we are doing this in a big data context, so we use stochastic subsamples from \(x\) to estimate the gradient and Monte Carlo simulations from \(\mathsf{z}_0\) to estimate the necessary integrals with respect to \(q_K(\cdot|\mathsf{x}).\)

So far so good. But it turns out that this is not in general very tractable, because determinants are notoriously expensive to calculate, scaling poorly — \(\mathcal{O}(D^3)\) — in the number of dimensions, which we expect to be large in trendy neural-network-type problems.

We look for restrictions on the form of \(f_{\phi}\) such that \(\operatorname{det} \frac{\partial f_{\phi}}{\partial z }\) is cheap and yet the approximate \(q\) they induce is still “flexible enough”. The normalizing flow solution is to choose compositions of some class of cheap functions \[ \mathsf{z}_{K}=f_{K} \circ \ldots \circ f_{2} \circ f_{1}\left( \mathsf{z} _{0}\right)\sim q_K.\]

By induction on the change of variables formula (and using the logarithm for tidiness), we can find the density of the variable mapped through this flow as

\[\log q_{K}\left( \mathsf{z} _{K} | x \right) =\log q_{0}\left( \mathsf{z} _{0} | x \right) - \sum_{k=1}^{K} \log \left|\operatorname{det}\left(\frac{\partial f_{k}\left( \mathsf{z} _{k-1} \right)}{\partial \mathsf{z} _{k-1}}\right)\right| \]

Compositions of such cheap functions should also be cheap-ish and each iteration would add extra flexibility.

But which \(f_k\) mappings are cheap? The archetypal one from (Rezende and Mohamed 2015) is the “planar flow”:

\[ f(\mathsf{z})= \mathsf{z} + u h\left( w ^{\top} \mathsf{z} +b\right)\] where $ u, w ^{D}, b $ and \(h:\mathbb{R}\to\mathbb{R}\) is some monotonic differentiable activation function.

Figure 2: Rezende and Mohamed (2015)’s planar flows applied to some latent distributions

There is a standard identity, the matrix determinant lemma \(\operatorname{det}\left( \mathrm{I} + u v^{\top}\right)=1+ u ^{\top} v,\) from which it follows that \[ \begin{aligned} \operatorname{det} \frac{\partial f}{\partial z } &=\operatorname{det}\left( \mathrm{I} + u h^{\prime}\left( w ^{\top} z +b\right) w ^{\top}\right) \\ &=1+ u ^{\top} h^{\prime}\left( w ^{\top} z +b\right) w. \end{aligned} \] We can often simply parameterise acceptable domains for functions like these so that they remain invertible; For example if \(h\equiv \tanh\) a sufficient condition is \(u^\top w\geq 1.\) This means that we know it should be simple to parameterise these weights \(u,w,b\). Or, as in our application, that it is easy to construct functions \(\phi:x\mapsto u,w,b\) which are guaranteed to remain in an invertible domain.

For functions like this, the determinant calculation is cheap, and does not depend at all on the ambient dimension of \(\mathsf{x}\). However, we might find it hard to persuade ourselves that this mapping is flexible enough to represent \(q\) well, at least without letting \(K\) be large, as the mapping must pass through a univariate “bottleneck” \(w ^{\top} \mathsf{z}.\) Indeed, empirically this mapping does not in fact perform well and a lot of time has been spent trying to do better.

The parallel with the problem of finding covariance kernels is interesting. In both cases we have some large function class we wish to parameterise so we can search over it, but we restrict it to a subset with computationally convenient properties and a simple domain.

2 A fancier flow

There are various solutions to this problem; an illustrative one is Sylvester flow (van den Berg et al. 2018) which, instead of the matrix determinant lemma uses a generalisation, Sylvester’s determinant identity. This tells us that for all \(\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times R}\), \[ \operatorname{det}\left( \mathrm{I}_{D}+ \mathrm{A} \mathrm{B} \right) =\operatorname{det}\left( \mathrm{I}_{M}+ \mathrm{B} \mathrm{A} \right)\] where $ {M}$ and $ {D}$ are respectively $M $ and $ D$ dimensional identity matrices.

This suggests we might exploit a generalized planar flow, \[ f(\mathsf{z}):= \mathsf{z} + \mathrm{A} h\left(\mathrm{B} \mathsf{z} +b\right) \] with $ ^{D M}, ^{M D}, b ^{M},$ and \(M \leq D.\) The determinant calculation scales as \(\mathcal{O}(M^3)\ll \mathcal{O}(D^3),\) which is cheap-ish and, we hope, gives us enough additional scope to design sufficiently flexible flows. \(M \geq 1\) is a bottleneck parameter.

The price is an additional parameterisation problem. How do we select \(\mathrm{A}\) and \(\mathrm{B}\) such that they are still invertible for a given \(h\)? The solution in (van den Berg et al. 2018) is to break this into two simpler parameterisation problems.

They choose \(f(\mathsf{z}):=\mathsf{z}+\mathrm{Q} \mathrm{R} h\left(\tilde{\mathrm{R}} \mathrm{Q}^{T} \mathsf{z}+\mathrm{b}\right)\) where \(\mathrm{R}\) and \(\tilde{\mathrm{R}}\) are upper triangular \(M \times M\) matrices, and \(Q\) is a \(D\times M\) matrix with orthonormal columns \(\mathrm{Q}=\left(\mathrm{q}_{1} \ldots \mathrm{q}_{M}\right).\) Using the Sylvester identity on this \(f\) we find

\[\operatorname{det} \frac{\partial f}{\partial z } =\operatorname{det}\left( \mathrm{I} _{M}+\operatorname{diag}\left(h^{\prime}\left(\tilde{\mathrm{R} } \mathrm{Q} ^{T} z + b \right)\right) \tilde{\mathrm{R} } \mathrm{R} \right)\]

They also show that if, in addition,

  1. \(h: \mathbb{R} \longrightarrow \mathbb{R}\) is a smooth function with bounded, positive derivative and
  2. if the diagonal entries of \(\mathrm{R}\) and \(\tilde{\mathrm{R}}\) satisfy \(r_{i i} \tilde{r}_{i i}>-1 /\left\|h^{\prime}\right\|_{\infty}\) and
  3. \(\tilde{\mathrm{R}}\) is invertible,

then \(f\) is invertible as required.

Now all we need to do is choose a differentiable parameterisation of the upper-triangular \(\mathrm{R},\) the upper-triangular invertible \(\tilde{\mathrm{R}}\) with appropriate diagonal entries and the orthonormal matrix, \(Q\). That is a whole other story though.

3 Spline

The type of flow that everyone seems to use in practice for moderate dimensionality is the spline flow (Durkan et al. 2019).

4 Glow

Generative flow for images and similar stuff (Durk P. Kingma and Dhariwal 2018).

5 For density estimation

I am no longer sure what distinction I wished to draw with this heading, but see (Esteban G. Tabak and Vanden-Eijnden 2010; E. G. Tabak and Turner 2013) and maybe also check out transport maps.

🏗

6 Tutorials

7 General measure transport

See transport maps.

8 Representational power of

A CMU team argues

The most fundamental restriction of the normalising flow paradigm is that each layer needs to be invertible. We ask whether this restriction has any ‘cost’ in terms of the size, and in particular the depth, of the model. Here we’re counting depth in terms of the number of the invertible transformations that make up the flow. A requirement for large depth would explain training difficulties due to exploding (or vanishing) gradients. Since the Jacobian of a composition of functions is the product of the Jacobians of the functions being composed, the min (max) singular value of the Jacobian of the composition is the product of the min (max) singular value of the Jacobians of the functions. This implies that the smallest (largest) singular value of the Jacobian will get exponentially smaller (larger) with the number of compositions.

A natural way of formalising this question is by exhibiting a distribution which is easy to model for an unconstrained generator network but hard for a shallow normalising flow. Precisely, we ask: is there a probability distribution that can be represented by a shallow generator with a small number of parameters that could not be approximately represented by a shallow composition of invertible transformations?

We demonstrate that such a distribution exists.

…To reiterate the takeaway: a GLOW-style linear layer in between affine couplings could in theory make your network between 5 and 47 times smaller while representing the same function. We now have a precise understanding of the value of that architectural choice!

9 Conditioned

(Koehler, Mehta, and Risteski 2020; Winkler et al. 2023) TBC

10 Implementations

11 References

Ambrosio, Gigli, and Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich.
Charpentier, Borchert, Zügner, et al. 2022. Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions.” arXiv:2105.04471 [Cs, Stat].
Chen, Changyou, Li, Chen, et al. 2017. Continuous-Time Flows for Efficient Inference and Density Estimation.” arXiv:1709.01179 [Stat].
Chen, Tian Qi, Rubanova, Bettencourt, et al. 2018. Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31.
Dinh, Sohl-Dickstein, and Bengio. 2016. Density Estimation Using Real NVP.” In Advances In Neural Information Processing Systems.
Durkan, Bekasov, Murray, et al. 2019. Neural Spline Flows.” In Advances in Neural Information Processing Systems.
Figurnov, Mohamed, and Mnih. 2018. Implicit Reparameterization Gradients.” In Advances in Neural Information Processing Systems 31.
Glasserman, and Ho. 1991. Gradient Estimation Via Perturbation Analysis.
Grumitt, Karamanis, and Seljak. 2023. Flow Annealed Kalman Inversion for Gradient-Free Inference in Bayesian Inverse Problems.”
Huang, Krueger, Lacoste, et al. 2018. Neural Autoregressive Flows.” arXiv:1804.00779 [Cs, Stat].
Jankowiak, and Obermeyer. 2018. Pathwise Derivatives Beyond the Reparameterization Trick.” In International Conference on Machine Learning.
Kingma, Durk P, and Dhariwal. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions.” In Advances in Neural Information Processing Systems 31.
Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.
Kingma, Diederik P., Salimans, and Welling. 2015. Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Kingma, Diederik P., and Welling. 2014. Auto-Encoding Variational Bayes.” In ICLR 2014 Conference.
Kobyzev, Prince, and Brubaker. 2021. Normalizing Flows: An Introduction and Review of Current Methods.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Koehler, Mehta, and Risteski. 2020. Representational Aspects of Depth and Conditioning in Normalizing Flows.” arXiv:2010.01155 [Cs, Stat].
Louizos, and Welling. 2017. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR.
Lu, and Huang. 2020. Woodbury Transformations for Deep Generative Flows.” In Advances in Neural Information Processing Systems.
Massaroli, Poli, Bin, et al. 2020. Stable Neural Flows.” arXiv:2003.08063 [Cs, Math, Stat].
Mohamed, Rosca, Figurnov, et al. 2020. Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research.
Nash, and Durkan. 2019. Autoregressive Energy Machines.” In Proceedings of the 36th International Conference on Machine Learning.
Papamakarios. 2019. Neural Density Estimation and Likelihood-Free Inference.”
Papamakarios, Murray, and Pavlakou. 2017. Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30.
Papamakarios, Nalisnick, Rezende, et al. 2021. Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research.
Pfau, and Rezende. 2020. “Integrable Nonparametric Flows.” In.
Rezende, and Mohamed. 2015. Variational Inference with Normalizing Flows.” In International Conference on Machine Learning. ICML’15.
Ruiz, Titsias, and Blei. 2016. The Generalized Reparameterization Gradient.” In Advances In Neural Information Processing Systems.
Shi, Siddharth, Paige, et al. 2019. Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models.” arXiv:1911.03393 [Cs, Stat].
Tabak, E. G., and Turner. 2013. A Family of Nonparametric Density Estimation Algorithms.” Communications on Pure and Applied Mathematics.
Tabak, Esteban G., and Vanden-Eijnden. 2010. Density Estimation by Dual Ascent of the Log-Likelihood.” Communications in Mathematical Sciences.
van den Berg, Hasenclever, Tomczak, et al. 2018. Sylvester Normalizing Flows for Variational Inference.” In UAI18.
Wang, and Wang. 2019. Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Winkler, Worrall, Hoogeboom, et al. 2023. Learning Likelihoods with Conditional Normalizing Flows.”

Footnotes

  1. I am aware from painful experience that using a font to denote a random variable will irritate some people, but this is usually the same people who have strong opinions about “which” versus “that” and other such bullshit, and such people are welcome to their opinions but they can get their own blog to express those opinions on in notations which suit their purposes. In my context, things come out much clearer if we can distinguish RVs, their realisations, densities and observations using some appropriate notation which is not capitalisation or italics, which I already use for other purposes.↩︎