Mechanistic interpretability
August 29, 2024 — April 18, 2025
Understanding complicated AI models by working out the internal mechanisms through which they compute what they compute.
See developmental interpretability, which looks at how neural networks evolve and develop capabilities during training.
1 Finding circuits
e.g. Wang et al. (2022), who reverse-engineer the circuit for indirect object identification in GPT-2 small, largely via activation/path patching; a sketch of the basic patching move follows.
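To make the patching move concrete, here is a minimal sketch, not Wang et al.'s actual code: run the model on a "clean" prompt, cache activations, re-run on a "corrupted" prompt while splicing a clean activation back in, and measure how much of the clean behaviour is restored. It assumes the TransformerLens library and its hook-name conventions; the prompts, the restriction to the last position, and the choice to patch the residual stream are all illustrative.

```python
# Minimal activation-patching sketch (assumes TransformerLens; illustrative only).
import torch
from transformer_lens import HookedTransformer, utils

torch.set_grad_enabled(False)  # inference only
model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupt_prompt = "When John and Mary went to the store, Mary gave a drink to"
answer_token = model.to_single_token(" Mary")  # correct completion for the clean prompt

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation on the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def answer_logit(logits: torch.Tensor) -> float:
    return logits[0, -1, answer_token].item()

baseline_clean = answer_logit(model(clean_tokens))
baseline_corrupt = answer_logit(model(corrupt_tokens))

# Patch the residual stream (last position only) at each layer and measure recovery.
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)

    def patch_last_pos(activation, hook, clean_act=clean_cache[hook_name]):
        # Overwrite the corrupted activation at the final token with the clean one.
        activation[:, -1, :] = clean_act[:, -1, :]
        return activation

    patched_logits = model.run_with_hooks(
        corrupt_tokens, fwd_hooks=[(hook_name, patch_last_pos)]
    )
    recovery = (answer_logit(patched_logits) - baseline_corrupt) / (
        baseline_clean - baseline_corrupt
    )
    print(f"layer {layer:2d}: recovered {recovery:.2f} of the clean answer logit")
```

In practice one sweeps over positions and individual components (attention heads, MLPs) as well as layers; the layers/positions whose patches recover the behaviour are the candidate circuit.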
2 Disentanglement and monosemanticity
Here’s a placeholder to discuss one hyped way of explaining models, especially large language models: sparse autoencoders. This is popular as an AI Safety technology; a minimal code sketch follows the reading list below.
- Interesting critique of the whole area: Heap et al. (2025)
What’s even the null model of the sparse interpretation?
- Toy Models of Superposition
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- God Help Us, Let’s Try To Understand The Paper On AI Monosemanticity
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Excursions into Sparse Autoencoders: What is monosemanticity?
- Intro to Superposition & Sparse Autoencoders (Colab exercises)
- Lewingtonpitsos, LLM Sparse Autoencoder Embeddings can be used to train NLP Classifiers
- Neel Nanda, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
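The sketch promised above: the core object in those posts is an overcomplete autoencoder trained to reconstruct a model’s internal activations under an L1 sparsity penalty, so that individual latent units are pushed towards encoding single, interpretable features. This is plain PyTorch; the dimensions, the L1 coefficient, the synthetic “activations”, and the training loop are illustrative, not any particular paper’s recipe.

```python
# Minimal sparse-autoencoder sketch (illustrative, not a specific paper's recipe).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_hidden, d_model)  # feature dictionary -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative, hopefully sparse, feature activations
        x_hat = self.decoder(f)
        return x_hat, f

d_model, d_hidden = 512, 4096  # overcomplete: many more features than dimensions
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity pressure; tuned in practice

# Stand-in for a buffer of residual-stream activations harvested from an LLM.
activations = torch.randn(10_000, d_model)

for step in range(1_000):
    batch = activations[torch.randint(0, len(activations), (256,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, a latent unit is "interpreted" by inspecting the inputs that
# make it fire most strongly, e.g. the top-activating tokens and contexts.
```

Real recipes add details this sketch omits (e.g. constraining decoder columns to unit norm and resampling dead features), but the loss is the same reconstruction-plus-sparsity trade-off.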
3 Via causal abstraction
See causal abstraction for a different (?) approach to interpretability and disentanglement.
4 Incoming
- Multimodal Neurons in Artificial Neural Networks / Distill version
- Tracing the thoughts of a large language model (Anthropic):

  > Today, we’re sharing two new papers that represent progress on the development of the “microscope”, and the application of it to see new “AI biology”. In the first paper, we extend our prior work locating interpretable concepts (“features”) inside a model to link those concepts together into computational “circuits”, revealing parts of the pathway that transforms the words that go into Claude into the words that come out. In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that: