Developmental interpretability
March 10, 2025 — March 19, 2025
Developmental interpretability is an emerging subfield of AI interpretability that studies how neural networks develop capabilities during training. Rather than analysing only fully trained models as static objects, this approach examines the dynamics of learning, capability emergence, and concept formation throughout the training process. It builds on mechanistic interpretability by adding a temporal dimension.
Much of this work explores scaling behaviour in training dynamics, particularly the phase transition when a model suddenly starts to generalise well.
Cf. the question of when we learn world models.
1 Key Research Directions
Position paper by the SLT folks and fellow travellers: Lehalleur et al. (2025).
1.1 Mechanistic Phase Transitions
Studies discontinuous capability emergence through training, identifying critical learning thresholds and representation shifts. Key works:
1.2 Training Dynamics Analysis
Examines gradient behaviours, loss landscapes, and parameter space geometry through frameworks like Singular Learning Theory (SLT).
1.2.1 Developmental Circuits Tracing
Maps the formation of specific computational patterns from initialisation onwards:
See also (Berti, Giorgi, and Kasneci 2025; Teehan et al. 2022; Wei et al. 2022)
1.3 Grokking and Delayed Generalization
Investigates sudden transitions from ‘memorisation’ to ‘understanding’ (Power et al. 2022; Liu et al. 2022; Liu, Michaud, and Tegmark 2023).
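One rough way to make "delayed generalisation" concrete is to measure the gap between the step at which training accuracy crosses a threshold and the step at which held-out accuracy does. The sketch below is an illustration only: the function names, the 0.95 threshold, and the synthetic accuracy curves are assumptions of this example, not taken from the cited papers.

```python
from typing import Optional, Sequence


def first_step_above(
    steps: Sequence[int], accuracy: Sequence[float], threshold: float
) -> Optional[int]:
    """Return the first logged step whose accuracy reaches `threshold`, else None."""
    for step, acc in zip(steps, accuracy):
        if acc >= threshold:
            return step
    return None


def generalisation_delay(
    steps: Sequence[int],
    train_acc: Sequence[float],
    test_acc: Sequence[float],
    threshold: float = 0.95,  # assumed cut-off, chosen for illustration
) -> Optional[int]:
    """Steps between the train and test curves crossing `threshold`.

    Returns None if either curve never reaches the threshold.
    """
    train_step = first_step_above(steps, train_acc, threshold)
    test_step = first_step_above(steps, test_acc, threshold)
    if train_step is None or test_step is None:
        return None
    return test_step - train_step


# Synthetic, hand-written curves with a grokking-like shape (not real data):
# the training set is fit early, held-out accuracy only catches up much later.
steps = [0, 100, 200, 500, 1000, 5000, 10000]
train_acc = [0.01, 0.60, 0.99, 1.00, 1.00, 1.00, 1.00]
test_acc = [0.01, 0.02, 0.03, 0.05, 0.10, 0.70, 0.99]

delay = generalisation_delay(steps, train_acc, test_acc)
print(delay)  # 9800
```

A large delay relative to the step at which the training set is fit is the signature usually described as grokking; the measure depends on the threshold and on how densely checkpoints are logged.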
TODO: understand how much of the argument leans on discovering compact circuit representations, how much on generalisation, and how the two relate.
1.4 Component Trajectory Analysis
Tracks the evolution of individual neurons/layers through training:
- Visualizing Deep Network Training Trajectories with PCA
- Mao et al. (2024)
A biased but credible source says of CTA:
One thing I’d say is the Component Trajectory Analysis […] to my eyes, not very interesting, because PCA on timeseries basically just extracts Lissajous curves and therefore always looks like the same thing. [We can] make more sense of this by applying joint PCA to trajectories which vary in their training distribution.
See Carroll et al. (2025).
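The Lissajous remark quoted above can be checked numerically. The sketch below applies PCA to a structureless high-dimensional random walk and compares the leading principal-component scores with cosines in time. The random walk is only a stand-in for a training trajectory, and the helper names, sizes, and seed are assumptions of this example.

```python
import numpy as np


def pca_scores(trajectory: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project a (time, dim) trajectory onto its top principal components."""
    centred = trajectory - trajectory.mean(axis=0, keepdims=True)
    # SVD of the centred data: the columns of u * s are the PC scores over time.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def cosine_match(scores: np.ndarray, k: int) -> float:
    """Absolute correlation between the k-th PC score (1-indexed) and cos(pi*k*t/T)."""
    n_steps = scores.shape[0]
    t = (np.arange(n_steps) + 0.5) / n_steps
    reference = np.cos(np.pi * k * t)
    return float(abs(np.corrcoef(scores[:, k - 1], reference)[0, 1]))


rng = np.random.default_rng(0)
n_steps, dim = 300, 2000  # illustrative sizes, not tuned

# A random walk with no learned structure at all: cumulative sum of Gaussian noise.
walk = np.cumsum(rng.standard_normal((n_steps, dim)), axis=0)

scores = pca_scores(walk, n_components=2)
match_1 = cosine_match(scores, 1)
match_2 = cosine_match(scores, 2)
print(f"PC1 vs cos(pi t/T):  |corr| = {match_1:.3f}")
print(f"PC2 vs cos(2pi t/T): |corr| = {match_2:.3f}")
```

If the two correlations come out close to 1, plotting PC1 against PC2 traces a Lissajous-like curve even though the trajectory is pure noise, which is why the shape alone says little about what a network learned.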
1.5 Curriculum and Data Influences
Studies how training data order/selection impacts capability development. TODO
1.6 SLT Foundations
Pioneering work connecting Singular Learning Theory to deep learning dynamics:
- Singular Learning Theory resources
  * Filan’s Singular Learning Theory
  * Liam Carroll’s Distilling Singular Learning Theory - AI Alignment Forum
- Murfet’s SLT notes
  * Singular Learning Theory (SLT) | Liam Carroll (see also, perhaps, Carroll (2021))
This body of work demonstrates how SLT’s mathematical framework explains:
- Discontinuous capability emergence through bifurcations in loss landscape geometry
- Bayesian posterior phase transitions in SGD-trained networks
- Fundamental connections between model complexity and generalisation
More at Singular Learning Theory.
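As a toy illustration of the Bayesian phase-transition idea, SLT's leading-order asymptotic for the free energy of a region of parameter space is n·L + λ·log n, where L is the loss at the region's optimum and λ is its learning coefficient. A region with lower loss but higher λ is then only preferred by the posterior once n is large enough. The sketch below finds that crossover for two hypothetical phases; the phase names and all numbers are invented for illustration, lower-order terms are dropped, and nothing here claims that SGD follows the Bayesian posterior.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical phases: name -> (loss at optimum, learning coefficient lambda).
# These values are made up for illustration only.
Phases = Dict[str, Tuple[float, float]]


def free_energy(n: int, loss: float, llc: float) -> float:
    """Leading-order asymptotic free energy: n * loss + llc * log(n)."""
    return n * loss + llc * math.log(n)


def preferred_phase(n: int, phases: Phases) -> str:
    """Phase with the lowest approximate free energy at sample size n."""
    return min(phases, key=lambda name: free_energy(n, *phases[name]))


def crossover_sample_size(
    phases: Phases, simple: str, complex_: str, n_max: int = 10**6
) -> Optional[int]:
    """Smallest n >= 2 at which `complex_` has lower free energy than `simple`."""
    for n in range(2, n_max + 1):
        if free_energy(n, *phases[complex_]) < free_energy(n, *phases[simple]):
            return n
    return None


phases: Phases = {
    "simple": (0.50, 2.0),    # higher loss, lower complexity
    "complex": (0.45, 10.0),  # lower loss, higher complexity
}

n_star = crossover_sample_size(phases, "simple", "complex")
print(preferred_phase(100, phases))     # simple
print(preferred_phase(10_000, phases))  # complex
print(n_star)
```

The design point is the trade-off itself: the loss term grows linearly in n while the complexity penalty grows only logarithmically, so the accuracy advantage eventually wins and the preferred phase switches.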
2 Incoming
- Developmental Interpretability Primer - Community hub for latest research