Mechanistic interpretability

August 29, 2024 — March 19, 2025

communicating
feature construction
statmech
stochastic processes
high d
language
machine learning
metrics
mind
NLP
sparser than thou

Understanding complicated AI models in terms of how they work internally. Cf. developmental interpretability, which focuses on how neural networks evolve and develop capabilities during training.

1 Finding circuits

E.g. Wang et al. (2022), who identify a circuit for indirect object identification in GPT-2 Small.

2 Disentanglement and monosemanticity

Placeholder to talk about one much-hyped means of explaining models, especially large language models: decomposing their internal activations with sparse autoencoders. The approach is popular as an AI safety technology.
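To make the idea concrete, here is a minimal sketch of what a sparse autoencoder computes on a single activation vector, assuming one common formulation: a ReLU encoder into an overcomplete feature space, a linear decoder, and a loss that adds an L1 sparsity penalty to the squared reconstruction error. The function names (`sae_forward`, `sae_loss`), the `l1_coeff` parameter, and the toy weights are all hypothetical choices for illustration, not taken from any of the cited papers; real implementations train the weights by gradient descent on activations collected from a model.

```python
# Toy sparse autoencoder (SAE) forward pass and loss, standard library only.
# All names and weights here are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [a + b for a, b in zip(matvec(W_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 penalty on the features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Hand-picked weights: 2-dimensional activations, 3 candidate features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat)
print(features, x_hat, loss)
```

In this toy case only one of the three features is active for the input, and the input is reconstructed exactly, so the loss consists entirely of the sparsity penalty. The hope in the interpretability setting is that, after training on real activations, each such feature corresponds to something human-readable.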

3 Via causal abstraction

See causal abstraction for a different (?) approach to interpretability and disentanglement.

4 Incoming

5 References

Arditi, Obeso, Syed, et al. 2024. “Refusal in Language Models Is Mediated by a Single Direction.”
Cloud, Goldman-Wetzler, Wybitul, et al. 2024. “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks.”
Cunningham, Ewart, Riggs, et al. 2023. “Sparse Autoencoders Find Highly Interpretable Features in Language Models.”
Gurnee, Nanda, Pauly, et al. 2023. “Finding Neurons in a Haystack: Case Studies with Sparse Probing.”
Heap, Lawson, Farnik, et al. 2025. “Sparse Autoencoders Can Interpret Randomly Initialized Transformers.”
Jørgensen, Gresele, and Weichwald. 2025. “What Is Causal about Causal Models and Representations?”
Kantamneni and Tegmark. 2025. “Language Models Use Trigonometry to Do Addition.”
Marks, Rager, Michaud, et al. 2024. “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.”
Moran, Sridhar, Wang, et al. 2022. “Identifiable Deep Generative Models via Sparse Decoding.”
O’Neill, Ye, Iyer, et al. 2024. “Disentangling Dense Embeddings with Sparse Autoencoders.”
Park, Choe, and Veitch. 2024. “The Linear Representation Hypothesis and the Geometry of Large Language Models.”
Ravfogel, Svete, Snæbjarnarson, et al. 2025. “Gumbel Counterfactual Generation From Language Models.”
Saengkyongam, Rosenfeld, Ravikumar, et al. 2024. “Identifying Representations for Intervention Extrapolation.”
Saphra and Wiegreffe. 2024. “Mechanistic?”
Tigges, Hollinsworth, Geiger, et al. 2023. “Linear Representations of Sentiment in Large Language Models.”
von Kügelgen, Besserve, Wendong, et al. 2023. “Nonparametric Identifiability of Causal Representations from Unknown Interventions.” In Advances in Neural Information Processing Systems.
Wang, Variengien, Conmy, et al. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.”