Gumbel (soft) max tricks
Concrete distribution, relaxed categorical, etc.
February 20, 2017 — April 1, 2022
The family of Gumbel tricks is useful for sampling from categorical distributions (and relaxations of them on the simplex), and for learning models with categorical latent variables by reparameterisation.
1 Gumbel trick basics
- Francis Bach on Gumbel tricks has his characteristically out-of-the-simplex perspective.
- Chris J. Maddison on Gumbel Machinery
- Laurent Dinh, Gumbel-Max Trick Inference
- The Gumbel-Max Trick for Discrete Distributions
- Tim Vieira, Gumbel-max trick
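For concreteness, a minimal numpy sketch of the trick itself (function and variable names are mine, not from the posts above): perturb the unnormalised log-probabilities with i.i.d. Gumbel(0, 1) noise and take the argmax; the result is distributed exactly as the categorical distribution with probabilities \(\pi_i\).

```python
import numpy as np

def gumbel_max_sample(log_probs, rng):
    """Draw one categorical sample via the Gumbel-max trick.

    argmax_i (log_probs_i + g_i), with g_i ~ Gumbel(0, 1) i.i.d., has the
    same distribution as a draw from Categorical(softmax(log_probs)).
    """
    g = rng.gumbel(size=log_probs.shape)  # i.i.d. Gumbel(0, 1) noise
    return np.argmax(log_probs + g)

# Sanity check: empirical class frequencies should match the target pi.
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])
draws = np.array([gumbel_max_sample(np.log(pi), rng) for _ in range(20000)])
freqs = np.bincount(draws, minlength=3) / len(draws)
```

Note that the trick only needs the log-probabilities up to an additive constant, which is why it is handy when the normalising constant is awkward to compute.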
2 Softmax relaxation
A.k.a. relaxed Bernoulli, relaxed categorical.
One of the co-inventors, Eric Jang, wrote a tutorial Categorical Variational Autoencoders using Gumbel-Softmax:
The main contribution of this work is a “reparameterization trick” for the categorical distribution. Well, not quite—it’s actually a re-parameterization trick for a distribution that we can smoothly deform into the categorical distribution. We use the Gumbel-Max trick, which provides an efficient way to draw samples \(z\) from the Categorical distribution with class probabilities \(\pi_{i}\) : \[ z=\operatorname{OneHot}\left(\underset{i}{\arg \max }\left[g_{i}+\log \pi_{i}\right]\right) \] argmax is not differentiable, so we simply use the softmax function as a continuous approximation of argmax: \[ y_{i}=\frac{\exp \left(\left(\log \left(\pi_{i}\right)+g_{i}\right) / \tau\right)}{\sum_{j=1}^{k} \exp \left(\left(\log \left(\pi_{j}\right)+g_{j}\right) / \tau\right)} \quad \text { for } i=1, \ldots, k \] Hence, we call this the “Gumbel-SoftMax distribution”. \(\tau\) is a temperature parameter that allows us to control how closely samples from the Gumbel-Softmax distribution approximate those from the categorical distribution. As \(\tau \rightarrow 0\), the softmax becomes an argmax and the Gumbel-Softmax distribution becomes the categorical distribution. During training, we let \(\tau>0\) to allow gradients past the sample, then gradually anneal the temperature \(\tau\) (but not completely to 0, as the gradients would blow up).
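The two displayed formulas combine into a few lines of numpy (a sketch; the function name and the numerical-stability shift are mine):

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau, rng):
    """One sample from the Gumbel-Softmax (Concrete) distribution.

    Computes y_i = softmax((log_pi + g) / tau) with g ~ Gumbel(0, 1) i.i.d.
    Small tau pushes samples toward one-hot vertices of the simplex;
    large tau pushes them toward the uniform interior.
    """
    g = rng.gumbel(size=log_pi.shape)
    z = (log_pi + g) / tau
    z -= z.max()                    # shift for numerical stability
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(1)
log_pi = np.log(np.array([0.5, 0.3, 0.2]))
y_cold = gumbel_softmax_sample(log_pi, tau=0.05, rng=rng)  # near a vertex
y_warm = gumbel_softmax_sample(log_pi, tau=5.0, rng=rng)   # near uniform
```

Both samples lie on the simplex; only the temperature controls how close they sit to a one-hot corner.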
Emma Benjaminson, The Gumbel-Softmax Distribution takes it in small pedagogic steps.
3 Straight-through Gumbel
TBC
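Pending the write-up, a minimal numpy sketch of the idea (names and structure are mine): the forward pass emits the one-hot argmax of a Gumbel-Softmax sample, while in an autodiff framework the backward pass would route gradients through the soft sample.

```python
import numpy as np

def straight_through_gumbel(log_pi, tau, rng):
    """Hard (one-hot) forward sample with a soft surrogate for gradients.

    In an autodiff framework one would return
        y_hard + (y_soft - stop_gradient(y_soft))
    so the forward value is exactly one-hot while gradients flow as if the
    output were the relaxed sample y_soft. Here we just return both pieces.
    """
    g = rng.gumbel(size=log_pi.shape)        # i.i.d. Gumbel(0, 1) noise
    z = (log_pi + g) / tau
    z -= z.max()                             # stabilise the softmax
    y_soft = np.exp(z) / np.exp(z).sum()     # relaxed Gumbel-Softmax sample
    y_hard = np.zeros_like(y_soft)
    y_hard[np.argmax(y_soft)] = 1.0          # discretised forward-pass value
    return y_hard, y_soft

rng = np.random.default_rng(2)
y_hard, y_soft = straight_through_gumbel(
    np.log(np.array([0.5, 0.3, 0.2])), tau=0.5, rng=rng)
```

The estimator is biased, but the downstream computation sees a genuinely discrete sample, which matters when the model cannot accept simplex-interior inputs.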
4 Reverse Gumbel
- Gumbel-Max Trick Inference
- Chris J. Maddison, Gumbel Machinery, which introduces Maddison, Tarlow, and Minka (2015).