Developmental interpretability

March 10, 2025 — March 19, 2025

Developmental interpretability is an emerging subfield of AI interpretability that studies how neural networks develop capabilities during training. Rather than analysing only fully trained models as static objects, this approach examines the dynamics of learning, capability emergence, and concept formation throughout the training process. It builds on mechanistic interpretability by adding a temporal dimension.

Much of this work explores scaling behaviour in training dynamics, particularly the phase transition when a model suddenly starts to generalise well.

Cf. the question of when we learn world models.

1 Key Research Directions

Position paper by the SLT folks and fellow travellers: Lehalleur et al. (2025).

1.1 Mechanistic Phase Transitions

Studies discontinuous capability emergence during training, identifying critical learning thresholds and shifts in representation. Key works:

Nanda et al. (2023) on grokking

1.2 Training Dynamics Analysis

Examines gradient behaviours, loss landscapes, and parameter space geometry through frameworks like Singular Learning Theory (SLT).

1.2.1 Developmental Circuits Tracing

Maps formation of specific computational patterns from initialization:

Toy Models of Superposition (Elhage et al. 2022)

1.3 Grokking and Delayed Generalization

Investigates sudden transitions from ‘memorisation’ to ‘understanding’ (Power et al. 2022; Liu et al. 2022; Liu, Michaud, and Tegmark 2023).
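To make the 'memorisation' versus 'understanding' distinction concrete, here is a minimal sketch of the kind of small algorithmic dataset on which grokking is typically studied: modular addition with a held-out set of input pairs. The function name `modular_addition_split`, the modulus `p=97`, and the 30% training fraction are illustrative assumptions, not settings taken from the papers cited above.

```python
import random

def modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples."""
    # Enumerate every input pair exactly once, so the two splits cannot overlap.
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]

train, test = modular_addition_split()
train_inputs = {pair for pair, _ in train}
held_out = [pair for pair, _ in test if pair not in train_inputs]
print(len(train), len(test), len(held_out))
```

Because no test pair appears in the training set, a model that only memorises the training table scores at chance on the test set. Grokking refers to test accuracy rising long after training accuracy has saturated on a split like this.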

TODO: understand how much of the argument leans upon discovering compact circuit representations, and how much upon generalisation, and the relation.

1.4 Component Trajectory Analysis

Tracks the evolution of individual neurons/layers through training:

- Visualizing Deep Network Training Trajectories with PCA (Lorch 2016)
- Mao et al. (2024)

A biased but credible source says of CTA:

One thing I’d say is the Component Trajectory Analysis […] to my eyes, not very interesting, because PCA on timeseries basically just extracts Lissajous curves and therefore always looks like the same thing. [We can] make more sense of this by applying joint PCA to trajectories which vary in their training distribution.

See Carroll et al. (2025).
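The Lissajous remark in the quote can be checked numerically. The sketch below assumes a high-dimensional Gaussian random walk as a stand-in for a parameter trajectory (no network is trained), and the helper names `pca_of_random_walk` and `cosine_template` are hypothetical. It projects the walk onto its top principal components and compares each projection with a cosine of matching frequency.

```python
import numpy as np

def pca_of_random_walk(n_steps=200, dim=2000, n_components=3, seed=0):
    """Project a high-dimensional Gaussian random walk onto its top PCs."""
    rng = np.random.default_rng(seed)
    # Assumption: a random walk stands in for a training trajectory.
    walk = np.cumsum(rng.standard_normal((n_steps, dim)), axis=0)
    centred = walk - walk.mean(axis=0)  # centre each coordinate over time
    # SVD of the (time x dim) matrix: columns of u are the PC time courses.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def cosine_template(n_steps, k):
    """Cosine with k half-periods across the trajectory."""
    t = (np.arange(n_steps) + 0.5) / n_steps
    return np.cos(np.pi * k * t)

projections = pca_of_random_walk()
correlations = []
for k in range(1, 4):
    template = cosine_template(len(projections), k)
    r = np.corrcoef(projections[:, k - 1], template)[0, 1]
    correlations.append(abs(r))
    print(f"PC{k}: |correlation with cosine template| = {abs(r):.3f}")
```

Plotting one projection against another therefore traces a Lissajous-like curve even though the underlying trajectory has no structure, which is why such plots on a single training run are weak evidence of anything model-specific.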

1.5 Curriculum and Data Influences

Studies how training data order/selection impacts capability development. TODO

1.6 SLT Foundations

Pioneering work connecting Singular Learning Theory to deep learning dynamics:

- Singular Learning Theory resources
  * Filan’s Singular Learning Theory
  * Liam Carroll’s Distilling Singular Learning Theory - AI Alignment Forum
- Murfet’s SLT notes
  * Singular Learning Theory (SLT) | Liam Carroll (see also, perhaps, Carroll (2021))

This body of work demonstrates how SLT’s mathematical framework explains:

Discontinuous capability emergence through bifurcations in loss landscape geometry
Bayesian posterior phase transitions in SGD-trained networks
Fundamental connections between model complexity and generalisation
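As a rough pointer to where these claims come from, the central asymptotic result of SLT is Watanabe's free energy formula, sketched below. The notation is an assumption made for this note: $L_n$ is the empirical negative log-likelihood, $\varphi$ the prior, $w_0$ an optimal parameter, $\lambda$ the learning coefficient (real log canonical threshold), and $m$ its multiplicity.

```latex
% Bayesian free energy (negative log marginal likelihood) for n samples
F_n = -\log \int \exp\bigl(-n L_n(w)\bigr)\, \varphi(w)\, dw

% Watanabe's asymptotic expansion as n \to \infty
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)

% Regular models with d parameters recover the BIC penalty
\lambda = \tfrac{d}{2}, \qquad m = 1
```

Heuristically, if the same expansion is applied locally to two regions of parameter space with losses $L_1 > L_2$ and learning coefficients $\lambda_1 < \lambda_2$, the posterior favours the simpler, higher-loss region at small $n$ and switches to the more complex, lower-loss region once $n (L_1 - L_2)$ outweighs $(\lambda_2 - \lambda_1) \log n$. That switch is the sense of "Bayesian phase transition" used here. Whether SGD training follows the same transitions is a separate question.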

More at Singular Learning Theory.

2 Incoming

Developmental Interpretability Primer - Community hub for latest research

3 References

Berti, Giorgi, and Kasneci. 2025. “Emergent Abilities in Large Language Models: A Survey.”

Carroll. 2021. “Phase Transitions in Neural Networks.”

Carroll, Hoogland, Farrugia-Roberts, et al. 2025. “Dynamics of Transient Structure in In-Context Linear Regression Transformers.”

Chen, Lau, Mendel, et al. 2023. “Dynamical Versus Bayesian Phase Transitions in a Toy Model of Superposition.”

Elhage, Hume, Olsson, et al. 2022. “Toy Models of Superposition.”

Lehalleur, Hoogland, Farrugia-Roberts, et al. 2025. “You Are What You Eat – AI Alignment Requires Understanding How Data Shapes Structure and Generalisation.”

Liu, Kitouni, Nolte, et al. 2022. “Towards Understanding Grokking: An Effective Theory of Representation Learning.” Advances in Neural Information Processing Systems.

Liu, Michaud, and Tegmark. 2023. “Omnigrok: Grokking Beyond Algorithmic Data.”

Lorch. 2016. “Visualizing Deep Network Training Trajectories with PCA.”

Mao, Griniasty, Teoh, et al. 2024. “The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold.” Proceedings of the National Academy of Sciences.

Nanda, Chan, Lieberum, et al. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.”

Olah, Cammarata, Schubert, et al. 2020. “Zoom In: An Introduction to Circuits.” Distill.

Power, Burda, Edwards, et al. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.”

Teehan, Clinciu, Serikov, et al. 2022. “Emergent Structures and Training Dynamics in Large Language Models.” In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models.

Wang, Farrugia-Roberts, Hoogland, et al. 2024. “Loss Landscape Geometry Reveals Stagewise Development of Transformers.”

Watanabe. 2022. “Recent Advances in Algebraic Geometry and Bayesian Statistics.”

Wei, Tay, Bommasani, et al. 2022. “Emergent Abilities of Large Language Models.”