Scaling laws for very large neural nets
Theory of trading off budgets between compute, model size, and data
January 14, 2021 — January 21, 2025
Got good behaviour from a million-parameter model? Want to see if stuff gets weirder as we hit a billion parameters? Turns out it does! It even seems to do so dependably! There is something philosophically deep here. Why does looking at more stuff seem to bring more computationally complex problems within reach? I don’t know, but I’m desperately keen to carve out some time to solve that.
Brief links on the theme of scaling in the extremely large model/large data limit and what that does to the behaviour of the models. A new front in the complexity and/or statistical mechanics of statistics, and whether neural networks extrapolate.
As to how to scale up these models in practice, see distributed gradient descent.
Content on this page has not been updated as fast as the field has been moving; you should follow the references for the latest.
1 Side note: The bitter, better lesson
See optimal cleverness.
2 Big transformers
One fun result comes from Transformer language models. An interesting observation way back in 2020 was an unexpected trade-off: you can reach a given loss faster by training a bigger network. I think this paper was ground zero of modern scaling studies, which try to identify and predict optimal trade-offs and ultimate performance under different scaling regimes (of compute, data, and parameters).
nostalgebraist summarises Henighan et al. (2020); Kaplan et al. (2020):
2.1 L(D): information
OpenAI derives a scaling law called L(D). This law is the best you could possibly do — even with arbitrarily large compute/models — if you are only allowed to train on D data points.
No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
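For concreteness, the fitted form is a pure power law in dataset size; the constants below are Kaplan et al. (2020)’s fitted values for autoregressive transformers on text, so treat the exact numbers as era-specific rather than universal:

$$
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095, \quad D_c \approx 5.4 \times 10^{13}\ \text{tokens}.
$$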
2.2 L(C): budgeting
OpenAI also derives another scaling law called L(C). This is the best you can do with compute C, if you spend it optimally.
What does optimal spending look like? Remember, you can spend a unit of compute on

- a bigger model (N), or
- training the same model for longer (S).
… In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.
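To make that trade-off concrete, here is a minimal sketch in Python. It takes the joint parameter/data law from Kaplan et al. (2020) with their fitted constants, adds the standard C ≈ 6ND approximation for transformer training FLOPs, and sweeps model sizes under a fixed budget. This is a toy proxy for the paper’s analysis (which optimises over steps and batch size), not its exact procedure; under this setup the optimal N grows faster with C than D does, which is the flavour of the quote above.

```python
import numpy as np

# Kaplan et al. (2020) fit a joint law for loss as a function of
# parameter count N and tokens seen D (their eq. 1.5):
#   L(N, D) = ((N_c / N)**(alpha_N / alpha_D) + D_c / D)**alpha_D
ALPHA_N, ALPHA_D = 0.076, 0.095   # fitted exponents from the paper
N_C, D_C = 8.8e13, 5.4e13         # fitted constants (params, tokens)


def loss(n_params, n_tokens):
    """Predicted LM loss (nats/token) for N parameters trained on D tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D


def best_split(flops):
    """Find the loss-minimising model size under a fixed FLOP budget,
    using the standard C ~ 6*N*D approximation for training compute."""
    n_grid = np.logspace(6, 12, 600)   # candidate sizes, 1e6..1e12 params
    d_grid = flops / (6.0 * n_grid)    # tokens affordable at each size
    losses = loss(n_grid, d_grid)
    i = int(losses.argmin())
    return n_grid[i], d_grid[i], losses[i]


for c in (1e18, 1e20, 1e22):
    n, d, l = best_split(c)
    print(f"C={c:.0e} FLOPs -> N~{n:.1e} params, D~{d:.1e} tokens, loss~{l:.2f}")
```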
The scaling laws continue to be revised; most prominently, Hoffmann et al. (2022) (“Chinchilla”) re-fit the compute-optimal frontier and found that training should use far more data per parameter than the Kaplan et al. recipe suggests.
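A rough statement of that revision, under the same C ≈ 6ND accounting (exponents approximate; see the paper for the fitted values):

$$
N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \approx 20\, N_{\text{opt}},
$$

i.e. parameters and tokens should be scaled up roughly in tandem, rather than spending most of a bigger budget on parameters.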
3 Incoming
- The Scaling Paradox — Toby Ord
    - AI progress as a function of time is impressive even if AI progress as a function of resources is not.
    - The scaling laws are impressively smooth and long-lasting, but they are evidence of poor-but-predictable scaling rather than of impressive scaling.
    - While we know that AI quality metrics scale very poorly with respect to resources, the real-world impacts may scale much better.
- Zhang et al. (2020) (how do NNs learn from language as n increases?)
- DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization (targeting large language models)