Quasi-gradients of discrete parameters

December 20, 2022 — May 17, 2024

calculus

classification

optimization

probabilistic algorithms

probability

score function

statistics

Suspiciously similar content

Figure 1: Abraham Bosse, Moyen vniuersel de pratiquer la perspectiue sur les tableaux, ou surfaces irregulieres : ensemble quelques particularitez concernant cet art, & celuy de la graueure en taille-douce (1653)

Notes on taking gradients through things that look like they have no gradients because their arguments are discrete. There are a lot of loosely related concepts in here that may not reflect an actual common theme. TBC.

1 Stochastic gradients via REINFORCE

The classic generic REINFORCE/Score function method for estimating gradients of expectations of functions of random variables can be used to estimate gradients of functions of discrete random variables as a special case. There are particular extra tricks used for discrete random variables; see e.g. (Grathwohl et al. 2018; Liu et al. 2019; Mnih and Gregor 2014; Tucker et al. 2017).

2 Gumbel-(soft)max

A.k.a. the concrete distribution. See Gumbel-max.

3 Gradients of other weird things

Differentiable sorting? See, e.g. Grover et al. (2018) and Prillo and Eisenschlos (2020).

4 Avoiding the need for gradients

Famously, Expectation Maximization can handle some of the same optimisation problems as gradient-based methods, but without needing gradients. There are presumably more variants.

5 Other methods

What even are (Grathwohl et al. 2021; Zhang, Liu, and Liu 2022)? I think they work for quantised continuous vars, or possibly ordinal vars?

6 Examples

7 Incoming

Differentiating the discrete: Automatic Differentiation meets Integer Optimization | μβ

8 References

Arya, Schauer, Schäfer, et al. 2022. “Automatic Differentiation of Programs with Discrete Randomness.” In.

Casella, and Robert. 1996. “Rao-Blackwellisation of Sampling Schemes.” Biometrika.

Grathwohl, Choi, Wu, et al. 2018. “Backpropagation Through the Void: Optimizing Control Variates for Black-Box Gradient Estimation.” In Proceedings of ICLR.

Grathwohl, Swersky, Hashemi, et al. 2021. “Oops I Took A Gradient: Scalable Sampling for Discrete Distributions.”

Grover, Wang, Zweig, et al. 2018. “Stochastic Optimization of Sorting Networks via Continuous Relaxations.” In.

Hyvärinen. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” The Journal of Machine Learning Research.

Jang, Gu, and Poole. 2017. “Categorical Reparameterization with Gumbel-Softmax.”

Kool, Hoof, and Welling. 2019. “Estimating Gradients for Discrete Random Variables by Sampling Without Replacement.” In.

Kool, van Hoof, and Welling. 2019. “Buy 4 Reinforce Samples, Get a Baseline for Free!”

Liang, Norouzi, Berant, et al. 2018. “Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing.”

Lindsten. 2011. “Rao-Blackwellised Particle Methods for Inference and Identification.”

Liu, Regier, Tripuraneni, et al. 2019. “Rao-Blackwellized Stochastic Gradients for Discrete Distributions.” In.

Maddison, Mnih, and Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.” In.

Mnih, and Gregor. 2014. “Neural Variational Inference and Learning in Belief Networks.” In Proceedings of The 31st International Conference on Machine Learning. ICML’14.

Mohamed, Rosca, Figurnov, et al. 2020. “Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research.

Oktay, McGreivy, Aduol, et al. 2020. “Randomized Automatic Differentiation.” arXiv:2007.10412 [Cs, Stat].

Prillo, and Eisenschlos. 2020. “SoftSort: A Continuous Relaxation for the Argsort Operator.”

Rosca, Figurnov, Mohamed, et al. 2019. “Measure–Valued Derivatives for Approximate Bayesian Inference.” In NeurIPS Workshop on Approximate Bayesian Inference.

Schulman, Heess, Weber, et al. 2015. “Gradient Estimation Using Stochastic Computation Graphs.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.

Shi, Zhou, Hwang, et al. 2022. “Gradient Estimation with Discrete Stein Operators.” In Advances in Neural Information Processing Systems.

Staines, and Barber. 2012. “Variational Optimization.”

Swersky, Rubanova, Dohan, et al. 2020. “Amortized Bayesian Optimization over Discrete Spaces.” In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI).

Tucker, Mnih, Maddison, et al. 2017. “REBAR: Low-Variance, Unbiased Gradient Estimates for Discrete Latent Variable Models.” In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.

van Krieken, Tomczak, and Teije. 2021. “Storchastic: A Framework for General Stochastic Automatic Differentiation.” In arXiv:2104.00428 [Cs, Stat].

Weber, Heess, Buesing, et al. 2019. “Credit Assignment Techniques in Stochastic Computation Graphs.”

Williams. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning.

Yin, Yue, and Zhou. 2019. “ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables.” In Proceedings of the 36th International Conference on Machine Learning.

Yin, and Zhou. 2018. “ARM: Augment-REINFORCE-Merge Gradient for Stochastic Binary Networks.”

Zhang, Liu, and Liu. 2022. “A Langevin-Like Sampler for Discrete Distributions.” In Proceedings of the 39th International Conference on Machine Learning.