Causal abstraction
February 24, 2025 — March 4, 2025
I just ran into this area while trying to invent something similar myself, only to find I’m years too late. It’s an interesting style of analysis, suited to relaxed or approximated modelling of causal interventions. It seems to formalise coarse-graining for causal models.
We suspect that the notorious causal inference in LLMs might be built out of such things or understood in terms of them.
1 Causality in hierarchical systems
A. Geiger, Ibeling, et al. (2024) seems to summarise SOTA at the time of writing:
In some ways, studying modern deep learning models is like studying the weather or an economy: they involve large numbers of densely connected ‘microvariables’ with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables (‘macrovariables’). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analyzing a system at multiple levels of detail (Chalupka, Eberhardt, and Perona 2017; Rubenstein et al. 2017; Beckers and Halpern 2019, 2019; Rischel and Weichwald 2021; Massidda et al. 2023). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyze weather patterns (Chalupka et al. 2016), human brains (Dubois, Oya, et al. 2020; Dubois, Eberhardt, et al. 2020), and deep learning models (Chalupka, Perona, and Eberhardt 2015; A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2021; Hu and Tian 2022; A. Geiger, Wu, et al. 2024; Z. Wu et al. 2023).
Imagine trying to understand a bustling city by tracking everyone’s movement. This “micro-level” perspective is overwhelming. Instead, we might analyse neighbourhoods (macro-level) to identify traffic patterns or economic activity. In physics, we call this coarse-graining. Causal abstraction asks a more precise, statistical version of this question: when does a simplified high-level model (macrovariables) accurately represent a detailed low-level system (microvariables)?
For example, a neural network classifies images using millions of neurons (microvariables). A causal abstraction might represent this as a high-level flowchart: Input Image → Detect Edges → Identify Shapes → Classify Object
This flowchart is a macrovariable model that abstracts away neuronal details while preserving the “causal story” of how the network works.
To validate abstractions, we use interventions — controlled changes to a system. There seems to be a generalised hierarchy (a toy sketch of the first two kinds follows the list):
- Hard interventions: Force variables to specific values (e.g., clamping a neuron’s activation); these are the classic Judea-Pearl-style interventions.
- Soft interventions: These replace a variable’s mechanism with a new one rather than pinning the variable to a constant; “distributional” assignments, roughly. They seem simple and intuitive to me in Correa and Bareinboim (2020), where they are stochastic, but the version in A. Geiger, Ibeling, et al. (2024) is deterministic and baffling.
- In the next section these are generalised to Interventionals: arbitrary transformations of mechanisms (e.g., redistributing a concept across multiple neurons). This is the new thing in A. Geiger, Ibeling, et al. (2024) and I have no intuition about it yet at all.
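To fix ideas, here is a minimal numpy sketch of the first two kinds on a toy three-variable structural causal model. Everything in it (variable names, mechanisms, values) is my own invention for illustration, not from any of the papers: the hard intervention replaces Y’s mechanism with a constant, while the stochastic soft intervention replaces it with a new distribution that ignores its parent.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, mechanisms):
    """Sample n draws from an SCM with topological order X -> Y -> Z."""
    X = mechanisms["X"](n)
    Y = mechanisms["Y"](X, n)
    Z = mechanisms["Z"](X, Y, n)
    return X, Y, Z

# Observational mechanisms (all invented for illustration).
base = {
    "X": lambda n: rng.normal(0.0, 1.0, n),
    "Y": lambda X, n: 2.0 * X + rng.normal(0.0, 0.1, n),
    "Z": lambda X, Y, n: X + Y + rng.normal(0.0, 0.1, n),
}

# Hard intervention do(Y = 1): replace Y's mechanism with a constant.
hard = dict(base, Y=lambda X, n: np.full(n, 1.0))

# Soft (stochastic) intervention: replace Y's mechanism with a new
# distribution that ignores X but is still random.
soft = dict(base, Y=lambda X, n: rng.normal(1.0, 0.5, n))

for name, mechs in [("observational", base),
                    ("hard do(Y=1)", hard),
                    ("soft (Y randomised)", soft)]:
    X, Y, Z = simulate(100_000, mechs)
    print(f"{name:>20}: E[Y]={Y.mean():+.2f}  Var[Y]={Y.var():.2f}  E[Z]={Z.mean():+.2f}")
```

Both kinds surgically swap out one mechanism and leave the rest of the model untouched; interventionals, below, relax exactly that restriction.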
Attempted summary:
A high-level model H is an exact transformation of a low-level model L if:

1. There’s a mapping (alignment) between L’s variables and H’s variables.
2. Interventions on H produce the same outcomes as corresponding interventions on L.
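In symbols (my paraphrase of the exact-transformation condition of Rubenstein et al. (2017); the notation is my gloss, not the papers’): we need an alignment map \(\tau\) on states and a corresponding map \(\omega\) on interventions,

\[
\tau : \operatorname{Val}(V_L) \to \operatorname{Val}(V_H), \qquad \omega : \mathcal{I}_L \to \mathcal{I}_H,
\]

such that for every allowed low-level intervention \(i \in \mathcal{I}_L\),

\[
\tau\left(P_{L_i}\right) = P_{H_{\omega(i)}},
\]

i.e. pushing the intervened low-level distribution through \(\tau\) yields the distribution of the correspondingly intervened high-level model.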
Example:
- Low-level: A neural network solving \((a + b) \times c\).
- High-level: A calculator circuit with adders and multipliers. If fixing the adder output in the calculator (high-level) matches fixing the corresponding neurons in the network (low-level), the abstraction is valid.
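A numerical caricature of that check (everything hand-built for illustration; a real test would use a trained network and a learned alignment): neuron 0 of the toy “network” happens to compute \(a + b\), and we verify that clamping it agrees with clamping the adder in the high-level calculator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Low-level "network": hand-built weights standing in for a trained model.
W = np.array([[1.0, 1.0],    # neuron 0 happens to compute a + b
              [1.0, -1.0]])  # neuron 1 computes a - b (a distractor)

def low_level(a, b, c, clamp_h0=None):
    h = W @ np.array([a, b])      # hidden microvariables
    if clamp_h0 is not None:      # hard intervention on neuron 0
        h[0] = clamp_h0
    return h[0] * c               # readout: multiply by c

# High-level "calculator": S = a + b, then output = S * c.
def high_level(a, b, c, clamp_S=None):
    S = a + b
    if clamp_S is not None:       # hard intervention on the adder output
        S = clamp_S
    return S * c

# Alignment: high-level S <-> low-level neuron h[0].
# Interventions on S should match interventions on h[0] for any input.
for _ in range(5):
    a, b, c, s0 = rng.normal(size=4)
    lo = low_level(a, b, c, clamp_h0=s0)
    hi = high_level(a, b, c, clamp_S=s0)
    assert np.isclose(lo, hi)
    print(f"do(S={s0:+.2f}): low-level={lo:+.3f}  high-level={hi:+.3f}")
```

Had any of those random interventions disagreed, the claimed alignment would have been falsified.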
2 Non-hierarchical models
A. Geiger, Ibeling, et al. (2024) extend the theory to a new type of intervention, which they call interventionals:
A shortcoming of existing theory is that macrovariables cannot be represented by quantities formed from overlapping sets of microvariables. Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high level concepts are thought to be represented by modular ‘features’ distributed across individual neural activations […].
Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new (‘translated’) variable space. We elucidate this situation, generalizing earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).
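My (possibly wrong) mental picture of the “translated variable space” idea, as a toy sketch: rotate the neurons into a new basis, clamp one coordinate there, and rotate back. The resulting map from old hidden state to new hidden state touches every neuron, so it is not a hard intervention on any single microvariable, yet it behaves exactly like a hard intervention on one macrovariable of the rotated space. The rotation is random here purely for illustration; in practice such a map would presumably be learned.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 4                                    # number of low-level neurons
# A random orthogonal "translation" of the neuron basis (invented here).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def interventional(h, new_value, coord=0):
    """Map an old hidden state to a new one: overwrite one coordinate of
    the rotated representation z = Q @ h, then rotate back."""
    z = Q @ h
    z[coord] = new_value
    return Q.T @ z

h = rng.normal(size=d)
h_new = interventional(h, new_value=3.0)

print("original neurons:     ", np.round(h, 3))
print("intervened neurons:   ", np.round(h_new, 3))      # in general every neuron moves
print("macro variables after:", np.round(Q @ h_new, 3))  # only coord 0 is pinned to 3.0
```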
3 Testing abstractions: interchange interventions
To validate an abstraction, we use interchange interventions (a code sketch follows the worked example below):
- Base input: Run the low-level model normally.
- Source input: Extract values from a different input.
- Patch: Replace specific low-level values with source values and check if the output matches the high-level prediction.
Example:
Suppose a high-level model claims a neural network uses “noun detection” followed by “pluralisation.” To test this:

- Base input: “The cat sleeps.” → Output: “cat” (singular).
- Source input: “Three dogs bark.” → Extract “dogs” (plural).
- Intervention: Patch “cat” neurons with “dog” activations. If the output becomes “dogs,” the abstraction holds.
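A generic sketch of that patching recipe on a tiny random “model” (the layer sizes, the unit indices, and the claim that units 2 and 5 encode the concept are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny two-layer "model" with fixed random weights, standing in for a
# real network.
W1 = rng.normal(size=(8, 5))
W2 = rng.normal(size=(3, 8))

def run(x, patch=None):
    """Forward pass; `patch` optionally maps hidden-unit indices to
    activations recorded from another (source) run."""
    h = np.tanh(W1 @ x)               # low-level microvariables
    if patch is not None:
        for idx, value in patch.items():
            h[idx] = value            # interchange intervention
    return W2 @ h

base_x = rng.normal(size=5)           # stands in for "The cat sleeps."
source_x = rng.normal(size=5)         # stands in for "Three dogs bark."

# 1. Run the source input and record activations of the units the
#    high-level model claims encode the concept (indices hypothetical).
concept_units = [2, 5]
source_h = np.tanh(W1 @ source_x)
patch = {i: source_h[i] for i in concept_units}

# 2. Re-run the base input with those units overwritten.
base_out = run(base_x)
patched_out = run(base_x, patch=patch)

# 3. Compare against what the high-level model predicts for this swap.
print("base output:   ", np.round(base_out, 3))
print("patched output:", np.round(patched_out, 3))
```

With a real network the recipe is the same, just with named layers and an alignment that is hypothesised or learned in advance; the abstraction passes if the patched outputs match the high-level model’s predictions across many base/source pairs.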