Reparameterization methods for MC gradient estimation
Pathwise gradient estimation,
April 3, 2018 — May 1, 2023
Reparameterization trick. A trick where we cleverly transform RVs to sample from tricky target distributions, and their jacobians, via a “nice” source distribution. Useful in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning. Pairs well with normalising flows to get powerful target distributions. Storchastic credits pathwise gradients to Glasserman and Ho (1991) as perturbation analysis. The name comes from Diederik P. Kingma and Welling (2014). According to Bloem-Reddy and Teh (2020) the reparameterisation trick is an application of noise outsourcing.
1 Tutorials
The classic tutorial is by Shakir Mohamed, Machine Learning Trick of the Day (4): Reparameterisation Tricks:
Suppose we want the gradient of an expectation of a smooth function \(f\): \[ \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)]=\nabla_\theta \int p(z; \theta) f (z) d z \] […] This gradient is often difficult to compute because the integral is typically unknown and the parameters \(\theta,\) with respect to which we are computing the gradient, are of the distribution \(p(z; \theta).\)
Now we suppose that we know some function \(g\) such that for some easy distribution \(p(\epsilon),\) \(z | \theta=g(\epsilon, \theta)\). Now we can try to estimate the gradient of the expectation by Monte Carlo:
\[ \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)]=\mathbb {E}_{p (c)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \] Let’s derive this expression and explore the implications of it for our optimisation problem. One-liners give us a transformation from a distribution \(p(\epsilon)\) to another \(p (z)\), thus the differential area (mass of the distribution) is invariant under the change of variables. This property implies that: \[ p (z)=\left|\frac{d \epsilon}{d z}\right|-p(\epsilon) \Longrightarrow-p (z) d z|=|p(\epsilon) d \epsilon| \] Re-expressing the troublesome stochastic optimisation problem using random variate reparameterisation, we find: \[ \begin{aligned} \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)] &=\nabla_\theta \int p(z; \theta) f (z) d z \\ &= \nabla_\theta \int p(\epsilon) f (z) d \epsilon\\ &=\nabla_\theta \int p(\epsilon) f(g(\epsilon, \theta)) d \epsilon \\ &=\nabla_\theta \mathbb {E}_{p (c)}[f(g(\epsilon, \theta))]\\ &=\mathbb {E}_{p (e)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \end{aligned} \]
That is a classic; but there are some more pedagogic tutorials these days:
- The Ermon VAE notes
- Matthew N. Bernstein, Blackbox variational inference via the reparameterization gradient
- Gunderson, The Reparameterization Trick
Yuge Shi’s variational inference tutorial is a tour of cunning reparameterisation gradient tricks written for her paper Shi et al. (2019). She punts some details to Mohamed et al. (2020) which in turn tells me that this adventure continues at Monte Carlo gradient estimation, (Figurnov, Mohamed, and Mnih 2018; Devroye 2006; Jankowiak and Obermeyer 2018).
2 Variational Autoencoder
Diederik P. Kingma and Welling (2014) introduces both the reparameterization trick (in its current name) and the variational autoencoder as an aside. Nifty.
3 Normalising flows
Reparameterization at maximum elaboration. Cunning reparameterization maps with desirable properties for nonparametric density inference. See normalising flows.
4 General measure transport
See transport maps.
5 Tooling
6 Incoming
Universal representation theorems? Probably many, here are some I saw: Perekrestenko, Müller, and Bölcskei (2020); Perekrestenko, Eberhard, and Bölcskei (2021).
We show that every \(d\)-dimensional probability distribution of bounded support can be generated through deep ReLU networks out of a 1-dimensional uniform input distribution. What is more, this is possible without incurring a cost-in terms of approximation error measured in Wasserstein-distance-relative to generating the \(d\)-dimensional target distribution from \(d\) independent random variables. This is enabled by a vast generalization of the space-filling approach discovered in [2]. The construction we propose elicits the importance of network depth in driving the Wasserstein distance between the target distribution and its neural network approximation to zero. Finally, we find that, for histogram target distributions, the number of bits needed to encode the corresponding generative network equals the fundamental limit for encoding probability distributions as dictated by quantization theory.