The softmax function

September 12, 2024 — September 15, 2024

classification
metrics
probability
regression
statistics
Figure 1

A function which maps an arbitrary \(\mathbb{R}^d\)-vector to the weights of a categorical distribution (i.e. the \((d-1)\)-simplex).

The \((K-1)\)-simplex is defined as the set of \(K\)-dimensional vectors whose elements are non-negative and sum to one (so \(K = d\) in the notation above). Specifically,

\[ \Delta^{K-1} = \left\{ \mathbf{p} \in \mathbb{R}^K : p_i \geq 0 \text{ for all } i, \text{ and } \sum_{i=1}^K p_i = 1 \right\} \]

This set describes all possible probability distributions over \(K\) outcomes, which aligns with the purpose of the softmax function in generating probabilities from “logits” (un-normalised log-probabilities) in classification problems.

Ubiquitous in modern classification tasks, particularly in neural networks.

Why? Well, for one, it turns the slightly fiddly problem of estimating a constrained quantity into an unconstrained one, in a computationally expedient way. It’s not the only such option, but it is simple and has lots of nice mathematical symmetries. It is kinda-sorta convex in its arguments: not convex itself, but the gradient of the convex log-sum-exp function \(\log \sum_k e^{z_k}\). It falls out in variational inference via KL, etc.

1 Basic

The softmax function transforms a vector of real numbers into a probability distribution over predicted output classes for classification tasks. Given a vector \(\mathbf{z} = (z_1, z_2, \dots, z_K)\), the softmax function \(\sigma(\mathbf{z})_i\) for the \(i\)-th component is

\[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}.\]
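A minimal sketch of the formula above in plain Python. The max-subtraction is a standard numerical-stability trick, not something the definition requires; it relies on the softmax being unchanged when the same constant is added to every logit. The function name `softmax` and the example logits are illustrative choices.

```python
import math

def softmax(z):
    """Map a list of real logits to a list of probabilities."""
    m = max(z)  # shift by the max so that exp never overflows
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
print(p, sum(p))

# Very large logits would overflow a naive exp(z_i); the shifted version copes.
p_big = softmax([1000.0, 1000.0])
print(p_big)
```

The output always lies on the simplex: every entry is positive and the entries sum to one (up to floating-point rounding).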

2 Derivatives

The first derivative with respect to \(z_j\) is \[ \begin{aligned} \frac{\partial \sigma_i}{\partial z_j} = \sigma_i \left( \delta_{ij} - \sigma_j \right) \end{aligned} \]

where \(\sigma_i\) is shorthand for \(\sigma(\mathbf{z})_i\) and \(\delta_{ij}\) is the Kronecker delta.

The second derivative is then \[ \begin{aligned} \frac{\partial^2 \sigma_i}{\partial z_j \partial z_k} = \sigma_i (\delta_{ik} - \sigma_k)(\delta_{ij} - \sigma_j) - \sigma_i \sigma_j (\delta_{jk} - \sigma_k) \end{aligned} \] i.e.

  • \(i = j = k\): \(\sigma_i (1 - \sigma_i)(1 - 2 \sigma_i)\)
  • \(i = j \neq k\): \(\sigma_i \sigma_k (2 \sigma_i - 1)\) (and, by symmetry in \(j\) and \(k\), \(i = k \neq j\) gives \(\sigma_i \sigma_j (2 \sigma_i - 1)\))
  • \(j = k \neq i\): \(\sigma_i \sigma_j (2 \sigma_j - 1)\)
  • \(i, j, k\) all distinct: \(2 \sigma_i \sigma_j \sigma_k\)
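Derivative formulas like these are easy to get subtly wrong, so here is a rough numerical check in plain Python. It codes the first derivative as a Jacobian, codes the general second-derivative formula, and compares the Jacobian against central finite differences. The helper names (`softmax_jacobian`, `softmax_second`, `fd_jacobian`) and the test point `z` are hypothetical choices for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def delta(a, b):
    """Kronecker delta."""
    return 1.0 if a == b else 0.0

def softmax_jacobian(z):
    """J[i][j] = d softmax_i / d z_j = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    K = len(z)
    return [[s[i] * (delta(i, j) - s[j]) for j in range(K)] for i in range(K)]

def softmax_second(z, i, j, k):
    """d^2 softmax_i / (d z_j d z_k), from the general formula."""
    s = softmax(z)
    return (s[i] * (delta(i, k) - s[k]) * (delta(i, j) - s[j])
            - s[i] * s[j] * (delta(j, k) - s[k]))

def fd_jacobian(z, eps=1e-6):
    """Central finite-difference approximation of the Jacobian."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        zp = list(z); zp[j] += eps
        zm = list(z); zm[j] -= eps
        sp, sm = softmax(zp), softmax(zm)
        for i in range(K):
            J[i][j] = (sp[i] - sm[i]) / (2 * eps)
    return J

z = [0.5, -1.0, 2.0]
J, J_fd = softmax_jacobian(z), fd_jacobian(z)
max_err = max(abs(J[i][j] - J_fd[i][j]) for i in range(3) for j in range(3))
print(max_err)  # should be tiny if the analytic Jacobian is right
```

The same finite-difference idea, applied to `softmax_jacobian`, can be used to check `softmax_second` case by case.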

3 Non-exponential

Suppose we do not use the \(\exp\) map, but generalize the softmax to use some other invertible, differentiable, increasing function \(\phi:\mathbb{R}\to\mathbb{R}^+\). Given a vector \(\mathbf{z} = (z_1, z_2, \dots, z_K)\), the generalized softmax function \(\Phi_{\phi}(\mathbf{z})\) for the \(i\)-th component is defined as

\[ \Phi_{\phi}(\mathbf{z})_i = \frac{\phi(z_i)}{\sum_{k=1}^K \phi(z_k)}.\]
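As a sketch of the generalized definition, the following plain-Python function takes \(\phi\) as an argument. Softplus, \(\log(1 + e^x)\), is used here purely as an example of a positive, increasing, differentiable \(\phi\); it is an assumption for illustration, not a choice made in the text.

```python
import math

def generalized_softmax(z, phi):
    """Normalise phi(z_k) so the outputs sum to one; phi must be positive."""
    vals = [phi(zk) for zk in z]
    total = sum(vals)
    return [v / total for v in vals]

def softplus(x):
    # log(1 + e^x), arranged to avoid overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

z = [0.5, -1.0, 2.0]
p_exp = generalized_softmax(z, math.exp)  # phi = exp recovers the usual softmax
p_sp = generalized_softmax(z, softplus)   # same ranking, different weights
print(p_exp)
print(p_sp)
```

Because \(\phi\) is increasing, the ordering of the probabilities matches the ordering of the logits whichever \(\phi\) is chosen. Unlike the exponential case, a general \(\phi\) is not invariant to adding a constant to every logit.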

4 log-Taylor softmax

TBD

5 Via Gumbel

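Adding independent standard Gumbel noise to the logits and taking the argmax gives an exact sample from the categorical distribution \(\sigma(\mathbf{z})\) (the Gumbel-max trick). Replacing the argmax with a temperature-scaled softmax gives the Gumbel-softmax trick, a differentiable relaxation of such samples, which is useful for training neural networks with discrete outputs.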
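A rough sketch of the connection, using only the standard library. It draws Gumbel noise by inverse-CDF sampling, takes the argmax of the perturbed logits (which should reproduce the softmax probabilities as sampling frequencies), and also shows the relaxed version with a temperature `tau`. The function names, the temperature value and the sample size are all illustrative assumptions.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gumbel(rng):
    """Standard Gumbel(0, 1) noise via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0; avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(z, rng):
    """Categorical sample: the index maximising z_k + g_k."""
    noisy = [zk + sample_gumbel(rng) for zk in z]
    return max(range(len(z)), key=lambda k: noisy[k])

def gumbel_softmax_sample(z, tau, rng):
    """Relaxed sample on the simplex: softmax((z + g) / tau)."""
    noisy = [(zk + sample_gumbel(rng)) / tau for zk in z]
    return softmax(noisy)

rng = random.Random(0)
z = [0.5, -1.0, 2.0]
n = 20000
counts = [0] * len(z)
for _ in range(n):
    counts[gumbel_max_sample(z, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)       # empirical frequencies of the argmax
print(softmax(z))  # target probabilities

relaxed = gumbel_softmax_sample(z, tau=0.5, rng=rng)
print(relaxed)     # a point on the simplex, closer to one-hot as tau shrinks
```

Small values of `tau` push the relaxed samples towards one-hot vectors at the cost of higher-variance gradients; large values smooth them towards the uniform vector.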

6 Entropy

6.1 Softmax

We consider the entropy \(H(\mathbf{p})\) of a categorical distribution with probabilities \(\mathbf{p} = [p_1, p_2, \dots, p_K]^T\), where the probabilities are given by the softmax function, \[ \begin{aligned} p_k = \sigma_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} = \frac{e^{z_k}}{Z}, \end{aligned} \] with \(Z = \sum_{j=1}^K e^{z_j}.\)

The entropy \(H(\mathbf{p})\) is by definition \[ \begin{aligned} H(\mathbf{p}) = -\sum_{k=1}^K p_k \log p_k. \end{aligned} \] Substituting \(p_k\) into the entropy expression, we obtain: \[ \begin{aligned} H(\mathbf{p}) &= -\sum_{k=1}^K p_k \log p_k \\ &= -\sum_{k=1}^K p_k \left( z_k - \log Z \right) \\ &= -\sum_{k=1}^K p_k z_k + \sum_{k=1}^K p_k \log Z \\ &= -\sum_{k=1}^K p_k z_k + \log Z \sum_{k=1}^K p_k \\ &= -\sum_{k=1}^K p_k z_k + \log Z. \end{aligned} \]

Thus, the entropy of the softmax distribution simplifies to \[ \begin{aligned} H(\sigma(\mathbf{z})) = \log Z - \sum_{k=1}^K p_k z_k. \end{aligned} \]
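A quick numerical check of this identity, as a sketch in plain Python: the entropy computed directly from the probabilities should match \(\log Z - \sum_k p_k z_k\) computed from the logits. Here \(\log Z\) is evaluated with the usual max-shift for stability; the function names are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_direct(p):
    """H(p) = -sum_k p_k log p_k, skipping zero-probability terms."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def entropy_from_logits(z):
    """H = log Z - sum_k p_k z_k, with log Z computed stably."""
    m = max(z)
    log_Z = m + math.log(sum(math.exp(zk - m) for zk in z))
    p = softmax(z)
    return log_Z - sum(pk * zk for pk, zk in zip(p, z))

z = [0.5, -1.0, 2.0]
h_direct = entropy_direct(softmax(z))
h_logits = entropy_from_logits(z)
print(h_direct, h_logits)
```

The logit form avoids taking the logarithm of very small probabilities, which can be convenient when some classes have negligible mass.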

If we are using softmax we probably care about derivatives, so let us compute the gradient of the entropy with respect to \(z_i\), \[ \begin{aligned} \frac{\partial H}{\partial z_i} &= \frac{\partial}{\partial z_i} \left( \log Z - \sum_{k=1}^K p_k z_k \right) \\ &= \frac{1}{Z} \frac{\partial Z}{\partial z_i} - \sum_{k=1}^K \left( \frac{\partial p_k}{\partial z_i} z_k + p_k \delta_{ik} \right) \\ &= p_i - \sum_{k=1}^K \left( p_k (\delta_{ik} - p_i) z_k + p_k \delta_{ik} \right) \\ &= p_i - p_i z_i + p_i \sum_{k=1}^K p_k z_k - p_i \\ &= -p_i \left( z_i - \bar{z} \right), \end{aligned} \] where \(\bar{z} = \sum_{k=1}^K p_k z_k\) is the mean logit under \(\mathbf{p}\), and we used \(\frac{\partial Z}{\partial z_i} = e^{z_i} = Z p_i\) and \(\frac{\partial p_k}{\partial z_i} = p_k (\delta_{ik} - p_i)\). Since \(z_i - \bar{z} = \log p_i + H\), this is equivalently \(-p_i (\log p_i + H)\).

Thus, writing \(\mathbf{c} = \mathbf{z} - \bar{z} \mathbf{1}\) for the centred logits, the gradient vector is \[ \begin{aligned} \nabla_\mathbf{z} H = -\mathbf{p} \odot \mathbf{c}, \end{aligned} \] whose components sum to zero, as they must, because \(H\) is unchanged when the same constant is added to every logit. Thence the Hessian matrix \(\nabla^2 H\), using \(\frac{\partial \bar{z}}{\partial z_j} = p_j (1 + c_j)\), \[ \begin{aligned} \frac{\partial^2 H}{\partial z_i \partial z_j} &= \frac{\partial}{\partial z_j} \left( -p_i (z_i - \bar{z}) \right) \\ &= -p_i (\delta_{ij} - p_j) c_i - p_i \left( \delta_{ij} - p_j (1 + c_j) \right) \\ &= -\delta_{ij} p_i (1 + c_i) + p_i p_j (1 + c_i + c_j), \\ \nabla^2 H &= -\operatorname{diag}\left( \mathbf{p} \odot (\mathbf{1} + \mathbf{c}) \right) + \mathbf{p} \mathbf{p}^T + (\mathbf{p} \odot \mathbf{c}) \mathbf{p}^T + \mathbf{p} (\mathbf{p} \odot \mathbf{c})^T. \end{aligned} \]

For compactness, we define \(\mathbf{p} = \sigma(\mathbf{z})\). Using the Taylor expansion, we approximate the entropy after a small change \(\Delta \mathbf{z}\): \[ \begin{aligned} H(\mathbf{z} + \Delta \mathbf{z}) &\approx H(\mathbf{z}) + (\nabla_\mathbf{z} H)^T \Delta \mathbf{z} + \frac{1}{2} \Delta \mathbf{z}^T (\nabla^2 H) \Delta \mathbf{z} \\ &= H(\mathbf{p}) - (\mathbf{p} \odot \mathbf{c})^T \Delta \mathbf{z} - \frac{1}{2} \Delta \mathbf{z}^T \operatorname{diag}\left( \mathbf{p} \odot (\mathbf{1} + \mathbf{c}) \right) \Delta \mathbf{z} + \frac{1}{2} (\mathbf{p}^T \Delta \mathbf{z})^2 + \left( (\mathbf{p} \odot \mathbf{c})^T \Delta \mathbf{z} \right) (\mathbf{p}^T \Delta \mathbf{z}) \\ &= H(\mathbf{p}) - \sum_{i=1}^K p_i c_i \Delta z_i - \frac{1}{2} \sum_{i=1}^K p_i (1 + c_i) (\Delta z_i)^2 + \frac{1}{2} \left( \sum_{i=1}^K p_i \Delta z_i \right)^2 + \left( \sum_{i=1}^K p_i c_i \Delta z_i \right) \left( \sum_{i=1}^K p_i \Delta z_i \right). \end{aligned} \]
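Closed-form derivatives of the entropy are worth checking numerically. The sketch below, in plain Python, differentiates \(H = \log Z - \sum_k p_k z_k\) by central finite differences and compares against the closed forms \(\partial H / \partial z_i = -p_i (z_i - \sum_k p_k z_k)\) and \(\partial^2 H / \partial z_i \partial z_j = -\delta_{ij} p_i (1 + c_i) + p_i p_j (1 + c_i + c_j)\), with \(c_i = z_i - \sum_k p_k z_k\). Function names and the test point are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(z):
    """H = log Z - sum_k p_k z_k for p = softmax(z)."""
    m = max(z)
    log_Z = m + math.log(sum(math.exp(zk - m) for zk in z))
    p = softmax(z)
    return log_Z - sum(pk * zk for pk, zk in zip(p, z))

def entropy_grad(z):
    """dH/dz_i = -p_i * (z_i - z_bar), with z_bar = sum_k p_k z_k."""
    p = softmax(z)
    z_bar = sum(pk * zk for pk, zk in zip(p, z))
    return [-pi * (zi - z_bar) for pi, zi in zip(p, z)]

def entropy_hessian(z):
    """H_ij = -delta_ij p_i (1 + c_i) + p_i p_j (1 + c_i + c_j)."""
    p = softmax(z)
    z_bar = sum(pk * zk for pk, zk in zip(p, z))
    c = [zi - z_bar for zi in z]
    K = len(z)
    return [[(-p[i] * (1 + c[i]) if i == j else 0.0)
             + p[i] * p[j] * (1 + c[i] + c[j])
             for j in range(K)] for i in range(K)]

def fd_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of a scalar function f."""
    g = []
    for i in range(len(z)):
        zp = list(z); zp[i] += eps
        zm = list(z); zm[i] -= eps
        g.append((f(zp) - f(zm)) / (2 * eps))
    return g

z = [0.5, -1.0, 2.0]
g, g_fd = entropy_grad(z), fd_grad(entropy, z)
grad_err = max(abs(a - b) for a, b in zip(g, g_fd))
print(grad_err)  # should be tiny if the closed form is right
print(sum(g))    # approximately zero: H is invariant to shifting all logits
```

At equal logits the gradient vanishes and the Hessian reduces to \(-(\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T)\), consistent with the uniform distribution being the entropy maximiser.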

6.2 Non-exponential

Let’s extend the reasoning to category probabilities given by the generalized softmax function. \[ \begin{aligned} p_k = \Phi_k(\mathbf{z}) = \frac{\phi(z_k)}{\sum_{j=1}^K \phi(z_j)} = \frac{\phi(z_k)}{Z}, \end{aligned} \] where \(\phi: \mathbb{R} \rightarrow \mathbb{R}^+\) is an increasing, differentiable function, and \(Z = \sum_{j=1}^K \phi(z_j)\).

The entropy becomes \[ \begin{aligned} H(\mathbf{p}) = -\sum_{k=1}^K p_k \log p_k = -\sum_{k=1}^K p_k \left( \log \phi(z_k) - \log Z \right) = -\sum_{k=1}^K p_k \log \phi(z_k) + \log Z. \end{aligned} \]

To compute the gradient \(\nabla_\mathbf{z} H\), we note that \[ \begin{aligned} \frac{\partial p_k}{\partial z_i} = p_k \left( s_k \delta_{ik} - \sum_{j=1}^K p_j s_j \delta_{ij} \right) = p_k s_k \delta_{ik} - p_k p_i s_i, \end{aligned} \] where \(s_i = \frac{\phi'(z_i)}{\phi(z_i)}\).

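Then, the gradient is \[ \begin{aligned} \frac{\partial H}{\partial z_i} &= -\sum_{k=1}^K \left( \frac{\partial p_k}{\partial z_i} \log \phi(z_k) + p_k \frac{\phi'(z_k)}{\phi(z_k)} \delta_{ik} \right) + \frac{1}{Z} \phi'(z_i) \\ &= -\sum_{k=1}^K \left( (p_k s_k \delta_{ik} - p_k p_i s_i) \log \phi(z_k) + p_k s_k \delta_{ik} \right) + \frac{1}{Z} \phi'(z_i) \\ &= -p_i s_i \log \phi(z_i) + p_i s_i \sum_{k=1}^K p_k \log \phi(z_k) \\ &= -p_i s_i \left( \log \phi(z_i) - \sum_{k=1}^K p_k \log \phi(z_k) \right), \end{aligned} \] where the terms \(\sum_{k=1}^K p_k s_k \delta_{ik} = p_i s_i\) and \(\frac{1}{Z} \phi'(z_i) = p_i s_i\) cancel.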
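The gradient above can be checked numerically for a concrete \(\phi\). In the displayed expression the terms \(\sum_k p_k s_k \delta_{ik}\) and \(\phi'(z_i)/Z\) both equal \(p_i s_i\) and cancel, leaving \(-p_i s_i (\log \phi(z_i) - \sum_k p_k \log \phi(z_k))\); the sketch below codes that form and compares it with central finite differences. Softplus, whose derivative is the logistic sigmoid, is an assumed example of \(\phi\), and all function names are hypothetical.

```python
import math

def softplus(x):
    # Example phi: log(1 + e^x), positive and increasing
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Derivative of softplus
    return 1.0 / (1.0 + math.exp(-x))

def gen_probs(z, phi):
    """p_k = phi(z_k) / sum_j phi(z_j)."""
    vals = [phi(zk) for zk in z]
    Z = sum(vals)
    return [v / Z for v in vals]

def gen_entropy(z, phi):
    p = gen_probs(z, phi)
    return -sum(pk * math.log(pk) for pk in p)

def gen_entropy_grad(z, phi, dphi):
    """-p_i s_i (log phi(z_i) - sum_k p_k log phi(z_k)), s_i = phi'(z_i)/phi(z_i)."""
    p = gen_probs(z, phi)
    s = [dphi(zk) / phi(zk) for zk in z]
    log_phi = [math.log(phi(zk)) for zk in z]
    mean_log_phi = sum(pk * lk for pk, lk in zip(p, log_phi))
    return [-p[i] * s[i] * (log_phi[i] - mean_log_phi) for i in range(len(z))]

z = [0.5, -1.0, 2.0]
eps = 1e-6
g = gen_entropy_grad(z, softplus, sigmoid)
g_fd = []
for i in range(len(z)):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    g_fd.append((gen_entropy(zp, softplus) - gen_entropy(zm, softplus)) / (2 * eps))
gen_err = max(abs(a - b) for a, b in zip(g, g_fd))
print(gen_err)  # should be tiny if the closed form is right
```

Passing `math.exp` for both `phi` and `dphi` gives \(s_i = 1\) and \(\log \phi(z_k) = z_k\), which is a convenient way to compare against the exponential case.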
