Why does deep learning work?
Are we in the pocket of Big VRAM?
May 30, 2017 — December 14, 2020
No time to frame this well, but there are a lot of versions of the question, so… pick one. The essential idea is that we say: Oh my, that deep learning model I just trained had terribly good performance compared with some simpler thing I tried. Can I make my model simpler and still get good results? Or is the overparameterization essential? Can I get a decent error bound? Can I learn anything about the underlying system by looking at the parameters I learned?
And the answer is not “yes” in any satisfying general sense. Pfft.
1 Synthetic tutorials
- Boaz Barak, ML Theory with bad drawings
- Deep learning theory lecture notes (PDF version)
- Where did this link come from? Unusually wide perspective for a deep learning course: mcallester.github.io/ttic-31230/Fall2020/.
2 Magic of (stochastic) gradient descent
The SGD fitting process looks like processes from statistical mechanics.
Proceed with caution, since there is a lot of messy thinking here. Here are some things I’d like to read, but their inclusion should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.
There are also arguments that SGD amounts to approximate MCMC sampling from a posterior over the parameters (Mandt, Hoffman, and Blei 2017), as exploited in NN ensembles, or that a network trained by gradient descent is effectively a kernel machine (Domingos 2020).
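A toy illustration of the sampler reading (mine, not anything from the papers above): unadjusted Langevin dynamics is just gradient descent plus injected Gaussian noise scaled to the step size, and its stationary distribution is the target posterior. With minibatch gradient noise standing in for the injected noise you get, roughly, the SGD-as-sampler picture. A minimal numpy sketch on a 1-D Gaussian "posterior":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: the "posterior" N(2, 1), whose negative log-density
# 0.5 * (theta - 2)^2 plays the role of the training loss.
def grad_loss(theta):
    return theta - 2.0

# Unadjusted Langevin dynamics: gradient step + Gaussian noise whose
# variance (2 * eps) is tied to the step size eps.
eps = 0.01
theta = 0.0
samples = []
for t in range(100000):
    theta = theta - eps * grad_loss(theta) + np.sqrt(2 * eps) * rng.normal()
    if t > 10000:  # discard burn-in
        samples.append(theta)

samples = np.array(samples)
print(np.mean(samples), np.std(samples))  # ≈ 2.0 and ≈ 1.0
```

The chain wanders around the loss minimum instead of converging to it, and the time-average recovers the posterior mean and spread.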
2.1 … with saddle points
tl;dr it looks like you need to worry about saddle points but you probably do not (Lee et al. 2017, 2016).
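The measure-zero caveat behind that tl;dr is easy to see in a toy example (my sketch, not theirs): gradient descent on \(f(x, y) = x^2 - y^2\) is stuck if started exactly at the saddle \((0, 0)\), but from any generic start the unstable direction grows geometrically and the iterates escape.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a strict saddle at the origin.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

def run_gd(p0, lr=0.1, steps=100):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

# Started exactly at the saddle: stuck forever at [0, 0].
print(run_gd([0.0, 0.0]))
# Started at a generic nearby point: x shrinks by 0.8 per step,
# y grows by 1.2 per step, so the iterates escape the saddle.
print(run_gd([1e-6, 1e-6]))
```

Random initialization lands on the stable manifold of a strict saddle with probability zero, which is the geometric core of the Lee et al. results.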
3 Magic of SGD+overparameterization
Looking at a different part of the problem, the combination of overparameterization and SGD is argued to be the secret (Allen-Zhu, Li, and Song 2018b):
Our main finding demonstrates that, for state-of-the-art network architectures such as fully-connected neural networks, convolutional networks (CNN), or residual networks (Resnet), assuming there are n training samples without duplication, as long as the number of parameters is polynomial in \(n\), first-order methods such as SGD can find global optima of the training objective efficiently, that is, with running time only polynomially dependent on the total number of parameters of the network.
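The linear analogue of this is easy to play with, and only a shadow of the neural-network result, but suggestive: with more parameters than data points, plain gradient descent from a zero initialization drives the training loss to zero, and converges to the minimum-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear regression: n = 20 samples, d = 200 parameters.
n, d = 20, 200
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

# Plain gradient descent from zero on mean squared error.
w = np.zeros(d)
lr = 0.5
for _ in range(20000):
    w = w - lr * X.T @ (X @ w - y) / n

train_loss = np.mean((X @ w - y) ** 2)

# GD from zero never leaves the row space of X, so it converges to the
# minimum-norm solution X^T (X X^T)^{-1} y among all interpolants.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
print(train_loss)                       # ~0: the data is interpolated
print(np.linalg.norm(w - w_min_norm))   # ~0: and it is the min-norm fit
```

The underdetermined system has infinitely many zero-loss solutions; which one the optimizer selects is the implicit-regularization question, and the deep-network version of "which one" is much harder.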
4 Function approximation theory
Ignoring learnability, the pure function-approximation results are an interesting literature in themselves. If you can ignore that troublesome optimisation step, how general a function can your neural network approximate as its depth, width, and sparsity change? The most recent thing I looked at is Elbrächter et al. (2021), which also surveys that literature. See also Bölcskei et al. (2019) and Wiatowski and Bölcskei (2015). They derive some suggestive results, for example that, for a fixed weight budget, scaling in depth is vastly more favourable than scaling in width.
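A classic construction with that flavour is Telgarsky's sawtooth depth separation (my sketch of it, not from the papers above): a "hat" map built from two ReLUs, composed \(k\) times, yields a sawtooth with \(2^{k-1}\) teeth, i.e. exponentially many linear pieces from linearly many layers, whereas a shallow network needs exponentially many units to match it.

```python
import numpy as np

def hat(x):
    # The tent map h(x) = 2x on [0, 1/2], 2(1 - x) on [1/2, 1],
    # written with two ReLU units: 2*relu(x) - 4*relu(x - 0.5).
    relu = lambda z: np.maximum(z, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5)

def sawtooth(x, k):
    # A depth-k "network": compose the hat map k times.
    for _ in range(k):
        x = hat(x)
    return x

# Dyadic grid so every tooth tip lands exactly on a grid point.
xs = np.linspace(0.0, 1.0, 2**12 + 1)
for k in (1, 2, 3, 6):
    ys = sawtooth(xs, k)
    # Count strict local maxima = number of teeth.
    peaks = np.sum((ys[1:-1] > ys[:-2]) & (ys[1:-1] > ys[2:]))
    print(k, peaks)  # teeth double with each extra layer: 1, 2, 4, 32
```

Each extra layer doubles the number of oscillations at constant per-layer width, which is the mechanism behind depth beating width for a fixed weight budget.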
5 Crazy physics stuff I have not read
Wiatowski et al. (Wiatowski, Grohs, and Bölcskei 2018) and Shwartz-Ziv and Tishby (2017) argue that looking at neural networks as random fields with energy-propagation dynamics provides some insight into how they work. Haber and Ruthotto leverage similar insights to argue that you can improve NNs by looking at them as ODEs.
Lin and Tegmark argue that statistical mechanics provides insight into deep learning and neuroscience (Lin and Tegmark 2016b, 2016a). Maybe on a similar tip, Natalie Wolchover summarises Mehta and Schwab (2014). See also Charles H. Martin, Why Deep Learning Works II: The Renormalization Group.
There is also a bunch more filed under statistical mechanics of statistics.
6 There is nothing to see here
There is another school again, which argues that much of deep learning is not so interesting after all, once you blur out the more hyperbolic claims with a publication-bias filter, e.g. Piekniewski, Autopsy of a deep learning paper:
Machine learning sits somewhere in between [science and engineering]. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.
There is also the fourth kind of papers, which indeed contain an idea. The idea may even be useful, but it happens to be trivial. In order to cover up that embarrassing fact a heavy artillery of “academic engineering” is loaded again, such that overall the paper looks impressive.
7 Incoming
- Simon J.D. Prince’s new book Understanding Deep Learning (Prince 2023)
- Gradient Dissent, a list of reasons that large backpropagation-trained networks might be worrisome. There are some interesting points in there, and some hyperbole. Also: if it were true that there are externalities from backprop networks (i.e. that they are a kind of methodological pollution that produces private benefits but public costs), then what kind of mechanisms should be applied to disincentivise them?
- C&C Against Predictive Optimization.