Model explanation with sparse autoencoders
Monosemanticity, sparsity and foundation models
August 29, 2024
Tags: adversarial, classification, communicating, feature construction, game theory, high d, language, machine learning, metrics, mind, NLP, sparser than thou
Placeholder to discuss one much-hyped means of explaining models, especially large language models: sparse autoencoders (SAEs). The idea is to train an overcomplete autoencoder, with a sparsity penalty, on a model's internal activations, in the hope that the learned dictionary directions decompose polysemantic neurons into interpretable, monosemantic features. Popular as an AI safety technology. A minimal sketch follows the reading list below.
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Excursions into Sparse Autoencoders: What is monosemanticity?
- Intro to Superposition & Sparse Autoencoders (Colab exercises)
- Lewingtonpitsos, LLM Sparse Autoencoder Embeddings can be used to train NLP Classifiers
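To make the idea concrete, here is a minimal sketch of the kind of SAE used in the dictionary-learning work linked above: an encoder maps a $d_{\text{model}}$-dimensional activation vector $x$ to an overcomplete code $f$, a decoder reconstructs $\hat{x}$, and the loss is $\lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1$. This is an illustrative PyTorch sketch, not anyone's reference implementation; the class names, dimensions, and the $\ell_1$ coefficient are my own placeholder choices.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations.

    Maps a d_model-dimensional activation vector to an overcomplete
    d_dict-dimensional code; an L1 penalty on the code encourages most
    entries to be zero, the hoped-for "monosemantic" features.
    """

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Subtracting the decoder bias before encoding follows the
        # architecture described in "Towards Monosemanticity".
        f = torch.relu(self.enc(x - self.dec.bias))
        x_hat = self.dec(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus L1 sparsity penalty on the code;
    # l1_coeff is an illustrative value, in practice it is tuned.
    mse = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()
    return mse + l1_coeff * sparsity


# Illustrative usage: d_model would be the width of the layer whose
# activations we harvest; d_dict is typically several times larger.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
x = torch.randn(64, 512)  # stand-in for harvested activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

After training, interpretation proceeds by inspecting which inputs most strongly activate each coordinate of $f$; the bet is that sparsity makes those coordinates individually meaningful in a way that raw neurons are not.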
References
Cunningham, Ewart, Riggs, et al. 2023. “Sparse Autoencoders Find Highly Interpretable Features in Language Models.”
Marks, Rager, Michaud, et al. 2024. “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.”
Moran, Sridhar, Wang, et al. 2022. “Identifiable Deep Generative Models via Sparse Decoding.”
O’Neill, Ye, Iyer, et al. 2024. “Disentangling Dense Embeddings with Sparse Autoencoders.”
Park, Choe, and Veitch. 2024. “The Linear Representation Hypothesis and the Geometry of Large Language Models.”
Saengkyongam, Rosenfeld, Ravikumar, et al. 2024. “Identifying Representations for Intervention Extrapolation.”
von Kügelgen, Besserve, Wendong, et al. 2023. “Nonparametric Identifiability of Causal Representations from Unknown Interventions.” In Advances in Neural Information Processing Systems.