Overparameterization in large models
Improper learning, benign overfitting, double descent
April 3, 2018 — October 29, 2024
Notes on the general weird behaviour of increasing the number of slack parameters we use, especially in machine learning, especially in neural nets. Most of these have far more parameters than we “need,” which is a problem for classical models of learning. Herein we learn to fear having too many parameters.
1 For making optimisation nice
Some classic non-convex optimization problems can be lifted into convex problems by adding slack variables. By analogy, we can imagine that something similar happens in neural nets: perhaps the extra parameters do not lift the problem into a convex one per se, but at least into a better-behaved optimisation in some sense. Is that analogy enough?
The combination of overparameterization and SGD is argued to be the secret to how deep learning works by, e.g., Allen-Zhu, Li, and Song (2018).
RJ Lipton discusses Arno van den Essen’s incidental work on stabilisation methods for polynomials, which relates, AFAICT, to transfer-function-type stability. Does this connect to the analysis of overparameterized rational transfer functions in Hardt, Ma, and Recht (2018)? 🏗.
2 Double descent
The phenomenon where adding data (or parameters?) can make the model worse before it gets better. E.g. Deep Double Descent.
Possibly this phenomenon relates to the concept of data interpolation, although see Resolution of misconception of overfitting: Differentiating learning curves from Occam curves.
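A minimal sketch of the parameter-count version of double descent, under assumptions that are mine rather than from any of the cited works: a noisy linear target, random ReLU features with fixed random weights, and minimum-norm least squares as the fitting rule. The names `random_relu_features` and `median_test_error` are hypothetical helpers. In this toy setting the test error typically spikes when the number of features is close to the number of training points and falls again as the model becomes heavily overparameterized.

```python
import numpy as np

def random_relu_features(X, W):
    """Map inputs X (n, d) to ReLU random features (n, p) using fixed weights W (p, d)."""
    return np.maximum(X @ W.T, 0.0)

def median_test_error(widths, n_train=50, n_test=500, d=10, noise=0.5,
                      n_trials=20, seed=0):
    """Median test MSE over random trials, for each number of random features."""
    rng = np.random.default_rng(seed)
    errors = {p: [] for p in widths}
    for _ in range(n_trials):
        beta = rng.normal(size=d)
        beta /= np.linalg.norm(beta)           # unit-norm "true" linear signal
        X_train = rng.normal(size=(n_train, d))
        X_test = rng.normal(size=(n_test, d))
        y_train = X_train @ beta + noise * rng.normal(size=n_train)
        y_test = X_test @ beta                 # noiseless targets for evaluation
        for p in widths:
            W = rng.normal(size=(p, d)) / np.sqrt(d)
            Phi_train = random_relu_features(X_train, W)
            Phi_test = random_relu_features(X_test, W)
            # lstsq gives the ordinary least-squares fit when p < n_train and
            # the minimum-norm interpolating solution when p > n_train.
            coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
            errors[p].append(np.mean((Phi_test @ coef - y_test) ** 2))
    return {p: float(np.median(errs)) for p, errs in errors.items()}

n_train = 50
widths = [5, 10, 25, 50, 100, 500, 2000]
errors = median_test_error(widths, n_train=n_train)
for p, err in errors.items():
    print(f"features={p:5d}  params/data={p / n_train:6.1f}  test MSE={err:10.3f}")
```

The median over trials is used because the error near the interpolation threshold (features ≈ training points) is heavy-tailed; a mean would be dominated by a few badly conditioned draws.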
3 Data interpolation
a.k.a. benign overfitting. See interpolation/extrapolation in NNs.
4 Lottery ticket hypothesis
The Lottery Ticket hypothesis (Frankle and Carbin 2019; Hayou et al. 2020) asserts something like “there is a good compact network hidden inside the overparameterized one you have.” Intuitively, it is computationally hard to find the hidden optimal network. I am interested in computational bounds for this: how much cheaper is it to calculate with a massive network than to find the tiny networks that do better?
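For concreteness, here is a sketch of the train, prune, rewind loop (iterative magnitude pruning) usually used to search for such a hidden subnetwork. The toy setting is my assumption, not the papers': a sparse linear regression stands in for a neural net, and `train` and `iterative_magnitude_pruning` are hypothetical names. Note that each pruning round needs a full training run, which hints at why finding the small network is not obviously cheaper than using the big one.

```python
import numpy as np

def train(w_init, mask, X, y, lr=0.1, steps=500):
    """Gradient descent on mean squared error, updating only unmasked weights."""
    w = w_init * mask
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad * mask                  # pruned weights stay at zero
    return w

def iterative_magnitude_pruning(X, y, w_init, prune_frac=0.5, rounds=3):
    """Train, prune the smallest surviving weights, rewind to w_init, repeat."""
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        w = train(w_init, mask, X, y)
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        # Indices of the smallest-magnitude weights among the survivors.
        drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
        mask[drop] = 0.0
    # The "ticket": survivors retrained from their original initial values.
    return train(w_init, mask, X, y), mask

rng = np.random.default_rng(0)
n, d = 200, 64
support = np.array([3, 17, 29, 50])            # hypothetical true sparse support
w_true = np.zeros(d)
w_true[support] = 2.0
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
w_init = 0.01 * rng.normal(size=d)

w_ticket, mask = iterative_magnitude_pruning(X, y, w_init)
print("surviving weights:", int(mask.sum()), "of", d)
print("sparse-model training MSE:", float(np.mean((X @ w_ticket - y) ** 2)))
```

With three rounds at a pruning fraction of 0.5, the 64 weights shrink to 32, then 16, then 8, at the cost of four full training runs in total.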
5 In extremely large models
6 In the wide-network limit
See Wide NNs.
7 Convex relaxation
See convex relaxation.