Transformer networks
The transformer-powered subtitle for this article is “Our most terrifyingly effective weapon against the forces of evil is our ability to laugh at them.”
December 20, 2017 — January 5, 2025
Transformers are big attention networks with some extra tricks — self attention, and usually a positional encoding as well.
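For concreteness, here is a toy sketch of those two ingredients: single-head scaled dot-product self-attention applied to token vectors that have had a sinusoidal positional encoding added. It is a minimal illustration under stated assumptions, not anyone's reference implementation. The helper names (`matmul`, `positional_encoding`, `self_attention`, `random_matrix`) are made up for this example, the weights are random rather than learned, and there is no masking, multi-head splitting, layer norm, or feed-forward block.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def positional_encoding(n_tokens, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pe = []
    for pos in range(n_tokens):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no masking."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # scores[i][j] = <q_i, k_j> / sqrt(d_k)
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value rows.
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model = 4, 8
tokens = random_matrix(n_tokens, d_model, rng)  # pretend token embeddings
pe = positional_encoding(n_tokens, d_model)
# Inject order information by adding the encoding to each token vector.
x = [[t + p for t, p in zip(t_row, p_row)] for t_row, p_row in zip(tokens, pe)]
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over the input tokens, and `out` has the same shape as `x`, which is what lets these layers stack.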
I am no expert. Here are some good blog posts explaining everything, for my reference, but I will not write yet another one. This is a fast-moving area and I am not keeping track of it, so if you are on this page looking for guidance you are already in trouble.
Phuong and Hutter (2022)
Transformers are deep feed-forward artificial neural networks with a (self)attention mechanism. They have been tremendously successful in natural language processing tasks and other domains. Since their inception 5 years ago, many variants have been suggested. Descriptions are usually graphical, verbal, partial, or incremental. Despite their popularity, it seems no pseudocode has ever been published for any variant. […] This report intends to rectify the situation for Transformers. It aims to be a self-contained, complete, precise and compact overview of transformer architectures and formal algorithms (but not results)
These networks are massive (heh) in natural language processing right now.
A key point about such networks seems to be that they can be made extremely large but still remain trainable. This leads to interesting scaling laws.
1 Introductions
So many.
TODO: rank in terms of lay-person-friendliness.
Noam Shazeer’s Shape Suffixes post implicitly makes the case that transformers are simply a confusing way of smashing tensors together.
Compact precise definition of a transformer function – foreXiv
Jay Alammar’s Illustrated Transformer series is good.
Lilian Weng, Large Transformer Model Inference Optimization
Lilian Weng, The Transformer Family Version 2.0
Xavier Amatriain, Transformer models: an introduction and catalog — 2023 Edition
nostalgebraist, An exciting new paper on neural language models
This guide to pruning multihead attention NN should probably go somewhere useful if I actually end up doing NLP like all the recruiters seem to want.
John Thickstun, The Transformer Model in Equations
Large language models, explained with a minimum of math and jargon
A good paper read-through is Yannic Kilcher’s.
2 Power of
Transformers are pretty good at weird stuff, e.g. automata — see Unveiling Transformers with LEGO (Y. Zhang et al. 2022).
How about Bayesian inference? (Müller et al. 2022)
Can they be an engine of intelligence? What do they do in society? etc. Controversial — see the Stochastic Parrots paper (Bender et al. 2021), and the entire internet commentariat from November 2022 onwards.
3 As set functions
Transformers are neural set functions (!).
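A quick numerical check of what that means, as a sketch under simplifying assumptions: attention with no positional encoding is permutation-equivariant (shuffle the input tokens and the outputs shuffle the same way), and pooling over tokens then gives a permutation-invariant function of the set. To keep it short the query, key and value projections are taken to be the identity. Learned projections act on each token separately, so they should not change the argument. The names `attend`, `mean_pool` and `close` are invented for this example.

```python
import math
import random

def attend(x):
    """Self-attention with identity projections: softmax(x x^T / sqrt(d)) x."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Weighted average of all rows; weights depend only on the contents of x.
        out.append([sum(e / z * row[j] for e, row in zip(exps, x)) for j in range(d)])
    return out

def mean_pool(rows):
    """Average over tokens, turning the equivariant map into an invariant one."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def close(u, v, tol=1e-9):
    """Compare two vectors up to floating-point noise."""
    return all(abs(a - b) < tol for a, b in zip(u, v))

rng = random.Random(1)
x = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]
x_perm = [x[i] for i in perm]

y = attend(x)
y_perm = attend(x_perm)

# Equivariance: permuting the inputs permutes the outputs the same way.
equivariant = all(close(y_perm[j], y[i]) for j, i in enumerate(perm))
# Invariance after pooling: the pooled output ignores input order.
invariant = close(mean_pool(y), mean_pool(y_perm))
```

Adding a positional encoding to `x` before calling `attend` breaks this symmetry on purpose, which is how order information gets in when the input is a sequence rather than a set.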
4 As recurrent state
5 For forecasting of non-linguistic material
6 Practicalities
For you and me, see AI democratization.
7 Embedding vector databases
8 Incoming
LMQL: Programming Large Language Models: “LMQL is a programming language for language model interaction.” (Beurer-Kellner, Fischer, and Vechev 2022)
LMQL generalizes natural language prompting, making it more expressive while remaining accessible. For this, LMQL builds on top of Python, allowing users to express natural language prompts that also contain code. The resulting queries can be directly executed on language models like OpenAI’s GPT models. Fixed answer templates and intermediate instructions allow the user to steer the LLM’s reasoning process.
-
TL;DR — In-context learning is a mysterious emergent behaviour in large language models (LMs) where the LM performs a task just by conditioning on input-output examples, without optimising any parameters. In this post, we provide a Bayesian inference framework for understanding in-context learning as “locating” latent concepts the LM has acquired from pretraining data. This suggests that all components of the prompt (inputs, outputs, formatting, and the input-output mapping) can provide information for inferring the latent concept. We connect this framework to empirical evidence where in-context learning still works when provided training examples with random outputs. While output randomisation cripples traditional supervised learning algorithms, it only removes one source of information for Bayesian inference (the input-output mapping).
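As I understand it, the "locating latent concepts" idea in that summary amounts to marginalising over a latent concept variable. A sketch of the formula, in my own notation rather than necessarily the authors':

```latex
p(\text{output} \mid \text{prompt})
  = \int p(\text{output} \mid \text{concept}, \text{prompt})\,
         p(\text{concept} \mid \text{prompt})\, \mathrm{d}(\text{concept}),
\qquad
p(\text{concept} \mid \text{prompt})
  \propto p(\text{prompt} \mid \text{concept})\, p(\text{concept}).
```

On this reading, every part of the prompt (inputs, outputs, formatting, and the input-output mapping) enters through the likelihood term, so randomising the outputs weakens the evidence about the concept without removing it entirely.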
Large Language Models as General Pattern Machines
We observe that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences—from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFG), to more rich spatial patterns found in the Abstract Reasoning Corpus (ARC), a general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary. These results suggest that without any additional training, LLMs can serve as general sequence modellers, driven by in-context learning. In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics—from extrapolating sequences of numbers that represent states over time to complete simple motions, to least-to-most prompting of reward-conditioned trajectories that can discover and represent closed-loop policies (e.g., a stabilising controller for CartPole). While difficult to deploy today for real systems due to latency, context size limitations, and compute costs, the approach of using LLMs to drive low-level control may provide an exciting glimpse into how the patterns among words could be transferred to actions.
karpathy/nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.