Reparameterization methods for MC gradient estimation

Pathwise gradient estimation

April 3, 2018 — May 1, 2023


Reparameterization trick. A trick where we cleverly transform random variables so that we can sample from a tricky target distribution, and differentiate through those samples via the Jacobian of the transform, starting from a “nice” source distribution. Useful in e.g. variational inference, especially autoencoders, and for density estimation in probabilistic deep learning. Pairs well with normalising flows to get powerful target distributions. Storchastic credits pathwise gradients to Glasserman and Ho (1991), under the name perturbation analysis. The name “reparameterization trick” comes from Diederik P. Kingma and Welling (2014). According to Bloem-Reddy and Teh (2020), the reparameterisation trick is an application of noise outsourcing.

1 Tutorials

The classic tutorial is by Shakir Mohamed, Machine Learning Trick of the Day (4): Reparameterisation Tricks:

Suppose we want the gradient of an expectation of a smooth function \(f\): \[ \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)]=\nabla_\theta \int p(z; \theta) f (z) d z \] […] This gradient is often difficult to compute because the integral is typically unknown and the parameters \(\theta,\) with respect to which we are computing the gradient, are of the distribution \(p(z; \theta).\)

Now suppose that we know some function \(g\) such that, for some easy distribution \(p(\epsilon)\), drawing \(\epsilon \sim p(\epsilon)\) and setting \(z=g(\epsilon, \theta)\) gives \(z \sim p(z; \theta)\). Then we can try to estimate the gradient of the expectation by Monte Carlo:

\[ \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)]=\mathbb {E}_{p(\epsilon)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \]

Let’s derive this expression and explore the implications of it for our optimisation problem. One-liners give us a transformation from a distribution \(p(\epsilon)\) to another \(p (z)\), thus the differential area (mass of the distribution) is invariant under the change of variables. This property implies that:

\[ p (z)=\left|\frac{d \epsilon}{d z}\right| p(\epsilon) \Longrightarrow |p (z) d z|=|p(\epsilon) d \epsilon| \]

Re-expressing the troublesome stochastic optimisation problem using random variate reparameterisation, we find:

\[ \begin{aligned} \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)] &=\nabla_\theta \int p(z; \theta) f (z) d z \\ &= \nabla_\theta \int p(\epsilon) f (z) d \epsilon\\ &=\nabla_\theta \int p(\epsilon) f(g(\epsilon, \theta)) d \epsilon \\ &=\nabla_\theta \mathbb {E}_{p(\epsilon)}[f(g(\epsilon, \theta))]\\ &=\mathbb {E}_{p(\epsilon)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \end{aligned} \]
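As a sanity check on the identity above, here is a minimal sketch in plain Python. Everything specific in it is my own assumption for illustration rather than part of the quoted tutorial: the Gaussian target \(z \sim \mathcal{N}(\mu, \sigma^2)\), the map \(g(\epsilon, \theta)=\mu+\sigma\epsilon\) with \(\epsilon \sim \mathcal{N}(0,1)\), the toy integrand \(f(z)=z^2\), and the helper name `reparam_gradients`. For this choice \(\mathbb{E}[f(z)]=\mu^2+\sigma^2\), so the exact gradient is \((2\mu, 2\sigma)\) and we can compare.

```python
import random


def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Pathwise Monte Carlo estimate of the gradient of E[f(z)] with respect
    to (mu, sigma), for z ~ N(mu, sigma^2) and the toy integrand f(z) = z**2.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # eps ~ p(eps) = N(0, 1), independent of theta
        z = mu + sigma * eps       # z = g(eps, theta)
        df_dz = 2.0 * z            # f'(z) for f(z) = z**2
        grad_mu += df_dz           # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma)
# Closed form for comparison: E[z^2] = mu^2 + sigma^2, so the gradient is (2*mu, 2*sigma).
print("estimate:", g_mu, g_sigma)
print("exact:   ", 2.0 * mu, 2.0 * sigma)
```

The point of the exercise is that the random draw `eps` never depends on the parameters, so the derivative passes straight through the sampling step by the chain rule. In practice we would let an autodiff framework compute `df_dz` and the Jacobian of `g` rather than writing them by hand.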

That is a classic; but there are some more pedagogic tutorials these days:

Yuge Shi’s variational inference tutorial is a tour of cunning reparameterisation gradient tricks written for her paper Shi et al. (2019). She punts some details to Mohamed et al. (2020), which in turn tells me that this adventure continues at Monte Carlo gradient estimation (Figurnov, Mohamed, and Mnih 2018; Devroye 2006; Jankowiak and Obermeyer 2018).

2 Variational Autoencoder

Diederik P. Kingma and Welling (2014) introduce both the reparameterization trick (under its current name) and the variational autoencoder as an aside. Nifty.

3 Normalising flows

Reparameterization at maximum elaboration. Cunning reparameterization maps with desirable properties for nonparametric density inference. See normalising flows.
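To make the link between a reparameterization map and a density concrete, here is a minimal one-dimensional sketch. The particular map \(z=\exp(\mu+\sigma\epsilon)\) and all function names are assumptions chosen for illustration; a real normalising flow stacks many learned invertible maps, but the bookkeeping is the same: \(\log p(z)=\log p(\epsilon)-\log\left|\frac{dz}{d\epsilon}\right|\). This particular map yields the log-normal distribution, so we can check the result against the textbook density.

```python
import math


def std_normal_logpdf(eps):
    """Log-density of the source distribution N(0, 1)."""
    return -0.5 * eps * eps - 0.5 * math.log(2.0 * math.pi)


def flow_forward(eps, mu, sigma):
    """Reparameterization map g: eps -> z = exp(mu + sigma * eps)."""
    return math.exp(mu + sigma * eps)


def flow_logpdf(z, mu, sigma):
    """log p(z) by change of variables: log p(eps) - log|dz/deps|."""
    eps = (math.log(z) - mu) / sigma             # invert the map
    log_abs_jac = math.log(sigma) + math.log(z)  # dz/deps = sigma * z
    return std_normal_logpdf(eps) - log_abs_jac


def lognormal_logpdf(z, mu, sigma):
    """Textbook log-normal log-density, used only as a reference."""
    return (-math.log(z * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(z) - mu) ** 2 / (2.0 * sigma ** 2))


mu, sigma = 0.3, 0.8
z = flow_forward(0.5, mu, sigma)
print(flow_logpdf(z, mu, sigma), lognormal_logpdf(z, mu, sigma))
```

The two printed numbers should agree up to floating-point rounding. The “desirable properties” in question are exactly the ones this sketch relies on: the map must be invertible and its Jacobian determinant must be cheap to evaluate.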

4 General measure transport

See transport maps.

5 Tooling

Storchastic.

6 Incoming

Universal representation theorems? Probably many, here are some I saw: Perekrestenko, Müller, and Bölcskei (2020); Perekrestenko, Eberhard, and Bölcskei (2021).

We show that every \(d\)-dimensional probability distribution of bounded support can be generated through deep ReLU networks out of a 1-dimensional uniform input distribution. What is more, this is possible without incurring a cost, in terms of approximation error measured in Wasserstein distance, relative to generating the \(d\)-dimensional target distribution from \(d\) independent random variables. This is enabled by a vast generalization of the space-filling approach discovered in [2]. The construction we propose elicits the importance of network depth in driving the Wasserstein distance between the target distribution and its neural network approximation to zero. Finally, we find that, for histogram target distributions, the number of bits needed to encode the corresponding generative network equals the fundamental limit for encoding probability distributions as dictated by quantization theory.
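To get some intuition for why a single uniform input can be enough, here is a toy sketch. It is emphatically not the ReLU-network construction from the papers above; it is the classic digit-splitting argument, under my own assumptions (52 random bits, a hypothetical helper `split_uniform`): the binary digits of one uniform variate are independent fair coin flips, so routing the even-position digits to one output and the odd-position digits to the other gives two independent uniforms.

```python
import random


def split_uniform(bits, n_bits=52):
    """Split one n_bits-bit uniform integer into two uniforms on [0, 1),
    each with n_bits // 2 bits of resolution, by sending even-position bits
    to x and odd-position bits to y.

    Hypothetical helper for illustration only.
    """
    half = n_bits // 2
    x = 0
    y = 0
    for i in range(half):
        x |= ((bits >> (2 * i)) & 1) << i
        y |= ((bits >> (2 * i + 1)) & 1) << i
    scale = float(1 << half)
    return x / scale, y / scale


rng = random.Random(0)
pairs = [split_uniform(rng.getrandbits(52)) for _ in range(20_000)]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
# Both means should be near 0.5 and the covariance near 0.
print(mean_x, mean_y, cov_xy)
```

This map is wildly discontinuous, which is part of why the result quoted above is interesting: it achieves something similar with continuous, piecewise-linear networks and quantifies the approximation error.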

7 References

Albergo, Goldstein, Boffi, et al. 2023. “Stochastic Interpolants with Data-Dependent Couplings.”
Albergo, and Vanden-Eijnden. 2023. “Building Normalizing Flows with Stochastic Interpolants.”
Ambrosio, Gigli, and Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich.
Bamler, and Mandt. 2017. “Structured Black Box Variational Inference for Latent Time Series Models.” arXiv:1707.01069 [Cs, Stat].
Bloem-Reddy, and Teh. 2020. “Probabilistic Symmetries and Invariant Neural Networks.”
Caterini, Doucet, and Sejdinovic. 2018. “Hamiltonian Variational Auto-Encoder.” In Advances in Neural Information Processing Systems.
Charpentier, Borchert, Zügner, et al. 2022. “Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions.” arXiv:2105.04471 [Cs, Stat].
Chen, Changyou, Li, Chen, et al. 2017. “Continuous-Time Flows for Efficient Inference and Density Estimation.” arXiv:1709.01179 [Stat].
Chen, Tian Qi, Rubanova, Bettencourt, et al. 2018. “Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31.
Cunningham, Zabounidis, Agrawal, et al. 2020. “Normalizing Flows Across Dimensions.”
Devroye. 2006. “Nonuniform Random Variate Generation.” In Simulation. Handbooks in Operations Research and Management Science.
Dinh, Sohl-Dickstein, and Bengio. 2016. “Density Estimation Using Real NVP.” In Advances In Neural Information Processing Systems.
Figurnov, Mohamed, and Mnih. 2018. “Implicit Reparameterization Gradients.” In Advances in Neural Information Processing Systems 31.
Germain, Gregor, Murray, et al. 2015. “MADE: Masked Autoencoder for Distribution Estimation.”
Glasserman, and Ho. 1991. Gradient Estimation Via Perturbation Analysis.
Grathwohl, Chen, Bettencourt, et al. 2018. “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models.” arXiv:1810.01367 [Cs, Stat].
Huang, Krueger, Lacoste, et al. 2018. “Neural Autoregressive Flows.” arXiv:1804.00779 [Cs, Stat].
Jankowiak, and Obermeyer. 2018. “Pathwise Derivatives Beyond the Reparameterization Trick.” In International Conference on Machine Learning.
Kingma, Durk P, and Dhariwal. 2018. “Glow: Generative Flow with Invertible 1x1 Convolutions.” In Advances in Neural Information Processing Systems 31.
Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.
Kingma, Diederik P., Salimans, and Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Kingma, Diederik P., and Welling. 2014. “Auto-Encoding Variational Bayes.” In ICLR 2014 Conference.
Koehler, Mehta, and Risteski. 2020. “Representational Aspects of Depth and Conditioning in Normalizing Flows.” arXiv:2010.01155 [Cs, Stat].
Lin, Khan, and Schmidt. 2019. “Stein’s Lemma for the Reparameterization Trick with Exponential Family Mixtures.” arXiv:1910.13398 [Cs, Stat].
Lipman, Chen, Ben-Hamu, et al. 2023. “Flow Matching for Generative Modeling.”
Louizos, and Welling. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR.
Lu, and Huang. 2020. “Woodbury Transformations for Deep Generative Flows.” In Advances in Neural Information Processing Systems.
Marzouk, Moselhy, Parno, et al. 2016. “Sampling via Measure Transport: An Introduction.” In Handbook of Uncertainty Quantification.
Massaroli, Poli, Bin, et al. 2020. “Stable Neural Flows.” arXiv:2003.08063 [Cs, Math, Stat].
Maurais, and Marzouk. 2024. “Sampling in Unit Time with Kernel Fisher-Rao Flow.”
Mohamed, Rosca, Figurnov, et al. 2020. “Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research.
Ng, and Zammit-Mangion. 2020. “Non-Homogeneous Poisson Process Intensity Modeling and Estimation Using Measure Transport.” arXiv:2007.00248 [Stat].
Papamakarios. 2019. “Neural Density Estimation and Likelihood-Free Inference.”
Papamakarios, Murray, and Pavlakou. 2017. “Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30.
Papamakarios, Nalisnick, Rezende, et al. 2021. “Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research.
Perekrestenko, Eberhard, and Bölcskei. 2021. “High-Dimensional Distribution Generation Through Deep Neural Networks.” Partial Differential Equations and Applications.
Perekrestenko, Müller, and Bölcskei. 2020. “Constructive Universal High-Dimensional Distribution Generation Through Deep ReLU Networks.”
Pfau, and Rezende. 2020. “Integrable Nonparametric Flows.”
Potapczynski, Loaiza-Ganem, and Cunningham. 2020. “Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax.” In Advances in Neural Information Processing Systems.
Ran, and Hu. 2017. “Parameter Identifiability in Statistical Machine Learning: A Review.” Neural Computation.
Rezende, and Mohamed. 2015. “Variational Inference with Normalizing Flows.” In International Conference on Machine Learning. ICML’15.
Rezende, Mohamed, and Wierstra. 2015. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In Proceedings of ICML.
Rippel, and Adams. 2013. “High-Dimensional Probability Estimation with Deep Density Models.” arXiv:1302.5125 [Cs, Stat].
Ruiz, Titsias, and Blei. 2016. “The Generalized Reparameterization Gradient.” In Advances In Neural Information Processing Systems.
Shi, Siddharth, Paige, et al. 2019. “Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models.” arXiv:1911.03393 [Cs, Stat].
Spantini, Baptista, and Marzouk. 2022. “Coupling Techniques for Nonlinear Ensemble Filtering.” SIAM Review.
Spantini, Bigoni, and Marzouk. 2017. “Inference via Low-Dimensional Couplings.” Journal of Machine Learning Research.
Tabak, E. G., and Turner. 2013. “A Family of Nonparametric Density Estimation Algorithms.” Communications on Pure and Applied Mathematics.
Tabak, Esteban G., and Vanden-Eijnden. 2010. “Density Estimation by Dual Ascent of the Log-Likelihood.” Communications in Mathematical Sciences.
van den Berg, Hasenclever, Tomczak, et al. 2018. “Sylvester Normalizing Flows for Variational Inference.” In UAI18.
Wang, and Wang. 2019. “Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Wehenkel, and Louppe. 2021. “Graphical Normalizing Flows.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.
Xu, Zuheng, Chen, and Campbell. 2023. “MixFlows: Principled Variational Inference via Mixed Flows.”
Xu, Ming, Quiroz, Kohn, et al. 2019. “Variance Reduction Properties of the Reparameterization Trick.” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics.
Yang, Li, and Wang. 2021. “On the Capacity of Deep Generative Networks for Approximating Distributions.” arXiv:2101.12353 [Cs, Math, Stat].
Zahm, Constantine, Prieur, et al. 2018. “Gradient-Based Dimension Reduction of Multivariate Vector-Valued Functions.” arXiv:1801.07922 [Math].
Zhang, and Curtis. 2021. “Bayesian Geophysical Inversion Using Invertible Neural Networks.” Journal of Geophysical Research: Solid Earth.