Infinite width limits of neural networks
December 9, 2020 — May 11, 2021
Large-width limits of neural nets. An interesting way of considering overparameterization.
A tractable case of NNs in function space.
1 Neural Network Gaussian Process
For now: See Neural network Gaussian process on Wikipedia.
The field sprang from the insight (Neal 1996a) that random neural nets with i.i.d. Gaussian weights and appropriate scaling converge, in the infinite-width limit, to certain special Gaussian processes, and there are useful conclusions we can draw from that.
More generally, we might consider correlated and/or non-Gaussian weights and deep networks. Unless otherwise stated, though, I am thinking about i.i.d. Gaussian weights and a single hidden layer.
In this single-hidden-layer case, we get a tractable covariance structure. See NN kernels.
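A minimal numerical sketch of that claim (my own illustration, assuming a bias-free single hidden layer, ReLU activations, i.i.d. \(\mathcal{N}(0,1)\) weights and \(1/\sqrt{\text{width}}\) output scaling): the output covariance of a population of random nets matches the closed-form "expected ReLU product" kernel, a rescaled order-1 arc-cosine kernel in the sense of Cho and Saul (2009).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def random_net_outputs(X, width, n_nets):
    """Outputs of n_nets random one-hidden-layer ReLU nets at the rows of X."""
    d = X.shape[1]
    W = rng.standard_normal((n_nets, width, d))    # hidden weights, N(0, 1)
    a = rng.standard_normal((n_nets, width))       # output weights, N(0, 1)
    H = relu(np.einsum("nkd,md->nmk", W, X))       # hidden activations
    return np.einsum("nmk,nk->nm", H, a) / np.sqrt(width)

def expected_relu_product_kernel(X):
    """E[relu(Z_p) relu(Z_q)] for Z = w . x, w ~ N(0, I_d): a rescaled
    order-1 arc-cosine kernel (Cho and Saul 2009)."""
    G = X @ X.T
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)

X = rng.standard_normal((4, 3))
f = random_net_outputs(X, width=200, n_nets=10_000)
print(np.cov(f, rowvar=False, bias=True))    # empirical output covariance over random nets
print(expected_relu_product_kernel(X))       # limiting NNGP covariance; agrees up to MC error
```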
2 Neural Network Tangent Kernel
NTK? See Neural Tangent Kernel.
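For concreteness, a hedged sketch (my own illustration, not from the linked notes) of the *empirical* NTK of the same one-hidden-layer ReLU architecture, \(\Theta(\mathbf{x},\mathbf{x}') = \langle \nabla_\theta f(\mathbf{x}), \nabla_\theta f(\mathbf{x}')\rangle\), whose entries concentrate as the width grows:

```python
# Empirical NTK of f(x) = (1/sqrt(m)) sum_k a_k relu(w_k . x), computed from
# the analytic parameter gradients of this particular architecture.
import numpy as np

rng = np.random.default_rng(1)

def empirical_ntk(X, width):
    d = X.shape[1]
    W = rng.standard_normal((width, d))      # hidden weights, N(0, 1)
    a = rng.standard_normal(width)           # output weights, N(0, 1)
    U = X @ W.T                               # pre-activations, shape (m, width)
    H = np.maximum(U, 0.0)                    # d f / d a_k  (up to 1/sqrt(width))
    D = (U > 0).astype(float) * a             # d f / d w_k is D * x (up to 1/sqrt(width))
    # Sum over all parameters: output-weight block plus hidden-weight block.
    return (H @ H.T + (D @ D.T) * (X @ X.T)) / width

X = rng.standard_normal((4, 3))
print(empirical_ntk(X, width=100))       # fluctuates with the random draw
print(empirical_ntk(X, width=100_000))   # concentrates around the limiting NTK
```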
3 Implicit regularization
Here’s one interesting perspective on wide nets (Zhang et al. 2017), which looks rather like the NTK model, but is it? To read.
The effective capacity of neural networks is large enough for a brute-force memorisation of the entire data set.
Even optimisation on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
[…] Explicit regularisation may improve generalisation performance, but is neither necessary nor by itself sufficient for controlling generalisation error. […] Appealing to linear models, we analyse how SGD acts as an implicit regulariser.
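That last linear-model point admits a tiny demonstration: on an underdetermined least-squares problem, gradient descent started at zero converges to the minimum-norm interpolant with no explicit penalty anywhere. A sketch, using full-batch gradient descent as a stand-in for SGD (both keep their iterates in the row space of the design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 100                          # fewer observations than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                         # start at the origin
lr = 1e-3
for _ in range(10_000):
    w -= lr * X.T @ (X @ w - y)         # gradient of 0.5 * ||X w - y||^2

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # pseudo-inverse (min-norm) solution
print(np.abs(X @ w - y).max())                   # ~0: gradient descent interpolates
print(np.linalg.norm(w - w_min_norm))            # ~0: and it found the min-norm interpolant
```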
4 Dropout
Dropout is sometimes presumed to simulate from a certain kind of Gaussian process induced by a neural net. See Dropout.
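For concreteness, a minimal sketch of the usual Monte Carlo dropout recipe: keep the dropout masks switched on at prediction time and read the scatter of repeated stochastic forward passes as approximate predictive uncertainty. The two-layer net and its weights here are untrained placeholders, purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(3)
d, h = 5, 256
W1 = rng.standard_normal((h, d)) / np.sqrt(d)   # placeholder hidden weights
w2 = rng.standard_normal(h) / np.sqrt(h)        # placeholder output weights

def forward(x, p_drop=0.5):
    hidden = np.maximum(W1 @ x, 0.0)
    mask = rng.random(h) >= p_drop                  # fresh dropout mask on every pass
    return (hidden * mask) @ w2 / (1.0 - p_drop)    # inverted-dropout rescaling

x = rng.standard_normal(d)
samples = np.array([forward(x) for _ in range(1000)])
print(samples.mean(), samples.std())   # predictive mean and spread at x
```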
5 As stochastic DEs
We can find an SDE for a given NN-style kernel if we can find Green’s functions satisfying \(\sigma^2_\varepsilon \langle G_\cdot(\mathbf{x}_p), G_\cdot(\mathbf{x}_q)\rangle = \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big].\) Russell Tsuchida observes that \(G_\mathbf{s}(\mathbf{x}_p) = \psi(\mathbf{s}^\top \mathbf{x}_p) \sqrt{\phi(\mathbf{s})}\), where \(\phi\) is the pdf of an independent standard multivariate normal vector, is a solution.
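A quick numerical sanity check of that observation (my own sketch, taking \(\sigma_\varepsilon = 1\) and \(\psi = \operatorname{ReLU}\)): the inner product on the left is an expectation over \(\mathbf{s} \sim \mathcal{N}(0, I)\), which we can estimate by Monte Carlo and compare against a direct simulation of \(\mathbb{E}[\psi(Z_p)\psi(Z_q)]\) with jointly Gaussian pre-activations.

```python
# Check <G_.(x_p), G_.(x_q)> = E[psi(Z_p) psi(Z_q)] for G_s(x) = psi(s . x) sqrt(phi(s)),
# with sigma_eps = 1 and psi = ReLU (both sides estimated by Monte Carlo).
import numpy as np

rng = np.random.default_rng(4)
d, n_samples = 3, 2_000_000
x_p, x_q = rng.standard_normal(d), rng.standard_normal(d)
relu = lambda z: np.maximum(z, 0.0)

# LHS: the L2 inner product, i.e. the integral of psi(s.x_p) psi(s.x_q) phi(s) ds,
# which is an expectation over s ~ N(0, I_d).
S = rng.standard_normal((n_samples, d))
lhs = np.mean(relu(S @ x_p) * relu(S @ x_q))

# RHS: E[psi(Z_p) psi(Z_q)] for jointly Gaussian (Z_p, Z_q) with covariance
# [[x_p.x_p, x_p.x_q], [x_p.x_q, x_q.x_q]], sampled directly.
cov = np.array([[x_p @ x_p, x_p @ x_q], [x_p @ x_q, x_q @ x_q]])
Z = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
rhs = np.mean(relu(Z[:, 0]) * relu(Z[:, 1]))

print(lhs, rhs)   # agree up to Monte Carlo error
```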