Bitter lessons in compute and cleverness

Amortizing the cost of being smart

January 14, 2021 — June 19, 2024

bounded compute
functional analysis
machine learning
model selection
optimization
statmech
Figure 1

What to compute, and when. Overparameterization and inductive biases. Operationalizing the scaling hypothesis. Compute overhangs. Tokenomics. Amortization. Reduced order modeling.

1 Background

Sutton’s famous bitter lesson:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Lots of people declaim this one, e.g. On the futility of trying to be clever (the bitter lesson redux).

Part of that is saying “things are better when they can scale up”.

Alternative phrasing: Even the best human mind is not very good, so the quickest path to intelligence is the one that prioritizes replacing that bottleneck.

Figure 2: Via Gwern; I can no longer find the original link because Twitter is being weird

Emin Orhan argues:

I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:

  1. Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.

  2. Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.

  3. Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g. more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).

I think that Orhan’s statement is easier to engage with than Sutton’s because it has concrete examples. Plus his invective is tight.

Anyway, the bitter lesson is sobering, but not sufficient. It is great advice if you have essentially unlimited data, but it is not a helpful framing for me working in hydrology, where a single data point on my last project cost AUD 700,000, because that is what it costs to drill a thousand-meter well. Telling me that I should collect a billion more data points rather than being clever is not useful, because it would not be clever to collapse the entire global economy collecting my data points.

What we generally want is not a homily about being clever being a waste of time, so much as a trade-off curve quantifying how clever to bother being. That would probably be the best lesson.
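One way to make that trade-off concrete is a toy cost model. The sketch below is purely illustrative. It assumes that error falls as a power law in the number of data points, and that being clever costs a fixed amount up front and makes each data point worth some multiple of a naive one. The exponent, the cleverness cost, the multiplier, and the price of scraped text are all made-up numbers; only the AUD 700,000 per data point comes from the hydrology example above.

```python
# Toy model of "how clever to bother being". Every functional form and
# constant here is an illustrative assumption, not an empirical claim.


def points_needed(target_error, alpha=0.5):
    """Data points needed if error = n ** -alpha (an assumed scaling law)."""
    return target_error ** (-1.0 / alpha)


def cost_brute_force(target_error, cost_per_point, alpha=0.5):
    """Total cost of hitting the target error by buying data alone."""
    return cost_per_point * points_needed(target_error, alpha)


def cost_clever(target_error, cost_per_point, cleverness_cost,
                data_multiplier, alpha=0.5):
    """Total cost when a one-off investment in cleverness makes each
    data point worth `data_multiplier` naive points (assumed)."""
    n = points_needed(target_error, alpha) / data_multiplier
    return cleverness_cost + cost_per_point * n


def break_even_price(target_error, cleverness_cost, data_multiplier,
                     alpha=0.5):
    """Price per data point above which cleverness is the cheaper route.

    Only meaningful when data_multiplier > 1.
    """
    n = points_needed(target_error, alpha)
    return cleverness_cost / (n * (1.0 - 1.0 / data_multiplier))


if __name__ == "__main__":
    scenarios = [("drilled wells", 700_000.0), ("scraped text", 0.001)]
    for label, price in scenarios:
        brute = cost_brute_force(0.1, price)
        clever = cost_clever(0.1, price, cleverness_cost=1_000_000.0,
                             data_multiplier=2.0)
        winner = "clever" if clever < brute else "brute force"
        print(f"{label}: brute={brute:.3g}, clever={clever:.3g} -> {winner}")
```

Under these invented numbers, cleverness is the cheaper route for drilled wells and brute force is cheaper for scraped text, with a break-even price of roughly 20,000 per data point. The numbers are not the point. The point is that "be clever or buy data?" has a price-dependent answer, and the bitter lesson describes the regime where data and compute are cheap.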

2 Bitter lessons in career strategy

Career version:

…a lot of the good ideas that did not require massive compute budget have already been published by smart people who did not have GPUs, so we need to leverage our technological advantage relative to them if we want to get cited.

Research, and indeed predictive analytics, is a competitive market, and advice about relative advantage needs strategic context. But it does not sound as profound if we phrase it that way, eh?

3 The best way of spending the limited compute in my skull is working out how to spend the larger compute in my GPU

See also superintelligence.

4 Computing and remembering

5 LLMs in particular

Partially discussed at economics of LLMs.

6 Incoming
