Quasi-gradients of discrete parameters
December 20, 2022 — May 17, 2024
Notes on taking gradients through things that look like they have no gradients because their arguments are discrete. There are a lot of loosely related concepts in here that may not reflect an actual common theme. TBC.
See also Polya-Gamma…
1 Stochastic gradients via REINFORCE
The classic generic REINFORCE/Score function method for estimating gradients of expectations of functions of random variables can be used to estimate gradients of functions of discrete random variables as a special case. There are particular extra tricks used for discrete random variables; see e.g. (Grathwohl et al. 2018; Liu et al. 2019; Mnih and Gregor 2014; Tucker et al. 2017).
2 Gumbel-(soft)max
A.k.a. the concrete distribution. See Gumbel-max.
3 Gradients of other weird things
Differentiable sorting? See, e.g. Grover et al. (2018) and Prillo and Eisenschlos (2020).
4 Avoiding the need for gradients
Famously, Expectation Maximization can handle some of the same optimisation problems as gradient-based methods, but without needing gradients. There are presumably more variants.
5 Other methods
What even are (Grathwohl et al. 2021; Zhang, Liu, and Liu 2022)? I think they work for quantised continuous vars, or possibly ordinal vars?