Causal abstraction
February 24, 2025 — March 4, 2025
I just ran into this area while trying to invent something similar myself, only to find I’m years too late. It’s an interesting style of analysis, suited to relaxed or approximated modelling of causal interventions. It seems to formalise coarse-graining for causal models.
We suspect that the notorious causal inference in LLMs might be built out of such things or understood in terms of them.
1 Causality in hierarchical systems
A. Geiger, Ibeling, et al. (2024) seems to summarise SOTA at the time of writing:
In some ways, studying modern deep learning models is like studying the weather or an economy: they involve large numbers of densely connected ‘microvariables’ with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables (‘macrovariables’). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analyzing a system at multiple levels of detail (Chalupka, Eberhardt, and Perona 2017; Rubenstein et al. 2017; Beckers and Halpern 2019, 2019; Rischel and Weichwald 2021; Massidda et al. 2023). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyze weather patterns (Chalupka et al. 2016), human brains (Dubois, Oya, et al. 2020; Dubois, Eberhardt, et al. 2020), and deep learning models (Chalupka, Perona, and Eberhardt 2015; A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2021; Hu and Tian 2022; A. Geiger, Wu, et al. 2024; Z. Wu et al. 2023).
Imagine trying to understand a bustling city by tracking everyone’s movement. This “micro-level” perspective is overwhelming. Instead, we might analyse neighbourhoods (macro-level) to identify traffic patterns or economic activity. In physics, we call this coarse-graining. Causal abstraction asks a more precise, statistical version of this question: when does a simplified high-level model (macrovariables) accurately represent a detailed low-level system (microvariables)?
For example, a neural network classifies images using millions of neurons (microvariables). A causal abstraction might represent this as a high-level flowchart: Input Image → Detect Edges → Identify Shapes → Classify Object
This flowchart is a macrovariable model that abstracts away neuronal details while preserving the “causal story” of how the network works.
To validate abstractions, we use interventions — controlled changes to a system. There seems to be a generalised hierarchy (a toy sketch of the first two kinds follows the list):
- Hard interventions: Force variables to specific values (e.g., clamping a neuron’s activation); these are the classic Judea-Pearl-style interventions.
- Soft interventions: These replace a variable’s mechanism with a new one rather than pinning the variable to a constant; “distributional” assignments, roughly. They seem simple and intuitive to me in Correa and Bareinboim (2020), where they are stochastic, but the version in A. Geiger, Ibeling, et al. (2024) is deterministic and baffling.
- In the next section these are generalised to Interventionals: arbitrary transformations of mechanisms (e.g., redistributing a concept across multiple neurons). This is the new thing in A. Geiger, Ibeling, et al. (2024) and I have no intuition about it yet at all.
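To fix ideas, here is a minimal numpy sketch of the first two kinds on a toy three-variable structural causal model. Everything in it (variable names, mechanisms, values) is my own invention for illustration, not from any of the papers: the hard intervention replaces Y’s mechanism with a constant, while the stochastic soft intervention replaces it with a new distribution that ignores its parent.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, mechanisms):
    """Sample n draws from an SCM with topological order X -> Y -> Z."""
    X = mechanisms["X"](n)
    Y = mechanisms["Y"](X, n)
    Z = mechanisms["Z"](X, Y, n)
    return X, Y, Z

# Observational mechanisms (all invented for illustration).
base = {
    "X": lambda n: rng.normal(0.0, 1.0, n),
    "Y": lambda X, n: 2.0 * X + rng.normal(0.0, 0.1, n),
    "Z": lambda X, Y, n: X + Y + rng.normal(0.0, 0.1, n),
}

# Hard intervention do(Y = 1): replace Y's mechanism with a constant.
hard = dict(base, Y=lambda X, n: np.full(n, 1.0))

# Soft (stochastic) intervention: replace Y's mechanism with a new
# distribution that ignores X but is still random.
soft = dict(base, Y=lambda X, n: rng.normal(1.0, 0.5, n))

for name, mechs in [("observational", base),
                    ("hard do(Y=1)", hard),
                    ("soft (Y randomised)", soft)]:
    X, Y, Z = simulate(100_000, mechs)
    print(f"{name:>20}: E[Y]={Y.mean():+.2f}  Var[Y]={Y.var():.2f}  E[Z]={Z.mean():+.2f}")
```

Both kinds surgically swap out one mechanism and leave the rest of the model untouched; interventionals, below, relax exactly that restriction.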
Attempted summary:
A high-level model H is an exact transformation of a low-level model L if:

1. There’s a mapping (alignment) between L’s variables and H’s variables.
2. Interventions on H produce the same outcomes as corresponding interventions on L.
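In symbols (my paraphrase of the exact-transformation condition of Rubenstein et al. (2017); the notation is my gloss, not the papers’): we need an alignment map \(\tau\) on states and a corresponding map \(\omega\) on interventions,

\[
\tau : \operatorname{Val}(V_L) \to \operatorname{Val}(V_H), \qquad \omega : \mathcal{I}_L \to \mathcal{I}_H,
\]

such that for every allowed low-level intervention \(i \in \mathcal{I}_L\),

\[
\tau\left(P_{L_i}\right) = P_{H_{\omega(i)}},
\]

i.e. pushing the intervened low-level distribution through \(\tau\) yields the distribution of the correspondingly intervened high-level model.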
Example:
- Low-level: A neural network solving \((a + b) \times c\).
- High-level: A calculator circuit with adders and multipliers. If fixing the adder output in the calculator (high-level) matches fixing the corresponding neurons in the network (low-level), the abstraction is valid.
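A numerical caricature of that check (everything hand-built for illustration; a real test would use a trained network and a learned alignment): neuron 0 of the toy “network” happens to compute \(a + b\), and we verify that clamping it agrees with clamping the adder in the high-level calculator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Low-level "network": hand-built weights standing in for a trained model.
W = np.array([[1.0, 1.0],    # neuron 0 happens to compute a + b
              [1.0, -1.0]])  # neuron 1 computes a - b (a distractor)

def low_level(a, b, c, clamp_h0=None):
    h = W @ np.array([a, b])      # hidden microvariables
    if clamp_h0 is not None:      # hard intervention on neuron 0
        h[0] = clamp_h0
    return h[0] * c               # readout: multiply by c

# High-level "calculator": S = a + b, then output = S * c.
def high_level(a, b, c, clamp_S=None):
    S = a + b
    if clamp_S is not None:       # hard intervention on the adder output
        S = clamp_S
    return S * c

# Alignment: high-level S <-> low-level neuron h[0].
# Interventions on S should match interventions on h[0] for any input.
for _ in range(5):
    a, b, c, s0 = rng.normal(size=4)
    lo = low_level(a, b, c, clamp_h0=s0)
    hi = high_level(a, b, c, clamp_S=s0)
    assert np.isclose(lo, hi)
    print(f"do(S={s0:+.2f}): low-level={lo:+.3f}  high-level={hi:+.3f}")
```

Had any of those random interventions disagreed, the claimed alignment would have been falsified.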
2 Non-hierarchical models
A. Geiger, Ibeling, et al. (2024) extend the theory to a new type of intervention, which they call interventionals:
A shortcoming of existing theory is that macrovariables cannot be represented by quantities formed from overlapping sets of microvariables. Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high level concepts are thought to be represented by modular ‘features’ distributed across individual neural activations […].
Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new (‘translated’) variable space. We elucidate this situation, generalizing earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).
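My (possibly wrong) mental picture of the “translated variable space” idea, as a toy sketch: rotate the neurons into a new basis, clamp one coordinate there, and rotate back. The resulting map from old hidden state to new hidden state touches every neuron, so it is not a hard intervention on any single microvariable, yet it behaves exactly like a hard intervention on one macrovariable of the rotated space. The rotation is random here purely for illustration; in practice such a map would presumably be learned.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 4                                    # number of low-level neurons
# A random orthogonal "translation" of the neuron basis (invented here).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def interventional(h, new_value, coord=0):
    """Map an old hidden state to a new one: overwrite one coordinate of
    the rotated representation z = Q @ h, then rotate back."""
    z = Q @ h
    z[coord] = new_value
    return Q.T @ z

h = rng.normal(size=d)
h_new = interventional(h, new_value=3.0)

print("original neurons:     ", np.round(h, 3))
print("intervened neurons:   ", np.round(h_new, 3))      # in general every neuron moves
print("macro variables after:", np.round(Q @ h_new, 3))  # only coord 0 is pinned to 3.0
```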
3 Testing abstractions: interchange interventions
To validate an abstraction, we use interchange interventions (a code sketch follows the worked example below):
- Base input: Run the low-level model normally.
- Source input: Extract values from a different input.
- Patch: Replace specific low-level values with source values and check if the output matches the high-level prediction.
Example:
Suppose a high-level model claims a neural network uses “noun detection” followed by “pluralisation.” To test this:

- Base input: “The cat sleeps.” → Output: “cat” (singular).
- Source input: “Three dogs bark.” → Extract “dogs” (plural).
- Intervention: Patch “cat” neurons with “dog” activations. If the output becomes “dogs,” the abstraction holds.
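A generic sketch of that patching recipe on a tiny random “model” (the layer sizes, the unit indices, and the claim that units 2 and 5 encode the concept are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny two-layer "model" with fixed random weights, standing in for a
# real network.
W1 = rng.normal(size=(8, 5))
W2 = rng.normal(size=(3, 8))

def run(x, patch=None):
    """Forward pass; `patch` optionally maps hidden-unit indices to
    activations recorded from another (source) run."""
    h = np.tanh(W1 @ x)               # low-level microvariables
    if patch is not None:
        for idx, value in patch.items():
            h[idx] = value            # interchange intervention
    return W2 @ h

base_x = rng.normal(size=5)           # stands in for "The cat sleeps."
source_x = rng.normal(size=5)         # stands in for "Three dogs bark."

# 1. Run the source input and record activations of the units the
#    high-level model claims encode the concept (indices hypothetical).
concept_units = [2, 5]
source_h = np.tanh(W1 @ source_x)
patch = {i: source_h[i] for i in concept_units}

# 2. Re-run the base input with those units overwritten.
base_out = run(base_x)
patched_out = run(base_x, patch=patch)

# 3. Compare against what the high-level model predicts for this swap.
print("base output:   ", np.round(base_out, 3))
print("patched output:", np.round(patched_out, 3))
```

With a real network the recipe is the same, just with named layers and an alignment that is hypothesised or learned in advance; the abstraction passes if the patched outputs match the high-level model’s predictions across many base/source pairs.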