Neural network activation functions
January 12, 2017 — December 5, 2024
There is a cottage industry built upon showing that neural networks are reasonably universal function approximators with various nonlinearities as activations, under various conditions. Usually, we take it as a given that the particular activation function is not too important.
Sometimes, though, we might like to play with the precise form of the nonlinearities, even making them directly learnable. The rationale is that some function shapes may have better approximation properties with respect to various assumptions about the learning problem, in a sense I will not attempt to make rigorous here; vague hand-waving arguments are, after all, the whole point of deep learning. Taking that to its extreme and learning activations instead of weights leads to Kolmogorov-Arnold networks.
I think part of this field has been subsumed into the stability-of-dynamical-systems setting? Or perhaps we do not care because something-something BatchNorm?
1 ReLU
The current default activation function is ReLU, i.e. \(x\mapsto \max\{0,x\}\), which has many nice properties. However, it does lead to piecewise-linear spline approximators. One could regard that as a plus (Unser 2019), but OTOH the second derivative of such an approximator vanishes almost everywhere, which makes it awkward for solving differential equations.
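A minimal sketch of that point (PyTorch, toy 1-D MLP, all names here are my own): differentiating a ReLU network twice through autograd returns zero curvature, which is exactly the problem for PDE-style losses.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

# leaf input so we can take gradients with respect to it
x = torch.linspace(-2.0, 2.0, 5).unsqueeze(-1).requires_grad_(True)
y = net(x).sum()
(dy,) = torch.autograd.grad(y, x, create_graph=True)   # first derivative: piecewise constant
(d2y,) = torch.autograd.grad(dy.sum(), x)               # second derivative
print(d2y)  # zeros (almost) everywhere: no curvature for a differential-equation residual to see
```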
2 Differentiable activations
Sometimes, then, we want something different. Other classic activations such as \(x\mapsto\tanh x\) have fallen from favour, supplanted by ReLU. However, differentiable activations are useful, especially if higher-order gradients of the solution will be important, e.g. in implicit representation NNs. Many virtues of differentiable activation functions for that purpose are documented in Implicit Neural Representations with Periodic Activation Functions (Sitzmann et al. 2020), which argues for \(x\mapsto\sin x\) on the basis of various handy properties. Ramachandran, Zoph, and Le (2017) advocate Swish, \(x\mapsto \frac{x}{1+\exp(-x)}.\)
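Both are one-liners in practice. A hedged sketch in PyTorch: `torch.nn.SiLU` is the built-in name for Swish, \(x\,\sigma(x)\), and the sine activation is easily wrapped in a module; the `omega_0` frequency scaling follows the SIREN convention.

```python
import torch

class Sine(torch.nn.Module):
    """sin(omega_0 * x), as used in SIREN-style implicit representations."""
    def __init__(self, omega_0: float = 30.0):
        super().__init__()
        self.omega_0 = omega_0

    def forward(self, x):
        return torch.sin(self.omega_0 * x)

swish = torch.nn.SiLU()   # x * sigmoid(x)
sine = Sine()

x = torch.randn(4, requires_grad=True)
print(swish(x), sine(x))  # both are smooth, so higher-order gradients exist everywhere
```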
Another fun one is the “self-normalising” SELU (scaled exponential linear unit) of Klambauer et al. (2017).
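SELU is also a one-liner once you have the two constants from the paper; `torch.nn.SELU` bakes them in. The point of the scaling is that, with suitably scaled inputs and weights, activations stay roughly zero-mean and unit-variance across layers. A sketch:

```python
import torch

ALPHA = 1.6732632423543772   # constants from Klambauer et al. (2017)
SCALE = 1.0507009873554805

def selu(x: torch.Tensor) -> torch.Tensor:
    return SCALE * torch.where(x > 0, x, ALPHA * (torch.exp(x) - 1.0))

x = torch.randn(8)
assert torch.allclose(selu(x), torch.nn.functional.selu(x))
```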
All of these, AFAICT, require careful initialization.
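For example, the sine activation above is conventionally paired with the initialisation scheme from the SIREN paper: roughly, first-layer weights drawn from \(U(-1/n, 1/n)\) and later layers from \(U(-\sqrt{6/n}/\omega_0, \sqrt{6/n}/\omega_0)\) with \(\omega_0=30\), where \(n\) is the fan-in. A hedged sketch (helper name is mine):

```python
import math
import torch

def siren_init_(linear: torch.nn.Linear, is_first: bool, omega_0: float = 30.0) -> None:
    """In-place SIREN-style weight init for one linear layer."""
    fan_in = linear.in_features
    bound = 1.0 / fan_in if is_first else math.sqrt(6.0 / fan_in) / omega_0
    with torch.no_grad():
        linear.weight.uniform_(-bound, bound)

layers = [torch.nn.Linear(2, 256)] + [torch.nn.Linear(256, 256) for _ in range(3)]
for i, layer in enumerate(layers):
    siren_init_(layer, is_first=(i == 0))
```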
3 Learnable activations
Learnable activations are a thing, e.g. Ramachandran, Zoph, and Le (2017), Agostinelli et al. (2015), Lederer (2021), achieving their apotheosis in Kolmogorov-Arnold Networks.
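A minimal sketch in the spirit of Ramachandran, Zoph, and Le (2017): Swish with a trainable per-feature \(\beta\), i.e. \(x\,\sigma(\beta x)\), where \(\beta\) is learned jointly with the weights by ordinary backprop. The class name is my own.

```python
import torch

class LearnableSwish(torch.nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        self.beta = torch.nn.Parameter(torch.ones(num_features))  # one beta per feature

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

act = LearnableSwish(16)
y = act(torch.randn(4, 16))   # beta receives gradients like any other parameter
```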
4 Kolmogorov-Arnold networks
A cute related case of a learnable activation function is the Kolmogorov-Arnold network (Liu, Wang, et al. 2024), where every edge learns an activation function and there are no other weights. This has various nice properties such as being easy to compress, somehow. See Kolmogorov-Arnold Networks for a deeper treatment.
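To fix ideas, here is a very rough sketch of a KAN-style layer. This is not the B-spline-plus-SiLU parameterisation of Liu, Wang, et al. (2024); I use a small Gaussian-RBF expansion per edge purely for illustration. Each edge \((i, j)\) carries its own learnable univariate function, and the layer output is the sum over incoming edges, with no separate weight matrix.

```python
import torch

class ToyKANLayer(torch.nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8):
        super().__init__()
        # fixed RBF centres; the learnable part is one coefficient vector per edge
        self.register_buffer("centres", torch.linspace(-2.0, 2.0, num_basis))
        self.coef = torch.nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):                                          # x: (batch, in_dim)
        # evaluate the RBF basis at each input coordinate
        phi = torch.exp(-(x.unsqueeze(-1) - self.centres) ** 2)    # (batch, in_dim, num_basis)
        # each edge's univariate function, summed over incoming edges
        return torch.einsum("bik,oik->bo", phi, self.coef)

layer = ToyKANLayer(3, 5)
print(layer(torch.randn(10, 3)).shape)   # torch.Size([10, 5])
```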