Learning graphical models from data
Also, causal discovery, structure discovery
September 19, 2017 — October 1, 2022
Learning the independence graph structure from data in a graphical model. A particular sparse model selection problem where the model is hierarchical, or possibly even non-directional. For bonus sophistication, we might even try to learn causal effects from raw data.
There are a few ways we can learn graphical models. The obvious one, to my mind, is to use a Bayesian network to learn the structure. conditional independence test, an awareness of multiple testing and graph theory. But also Bayesian sampling from possible graph structures is a thing apparently. There are other approaches too.
1 Bayesian learning of Bayesian networks
I am indebted to Dario Draca for introducing this field to me. To quote Dario:
I think the introduction to this paper (Scutari, Graafland, and Gutiérrez 2019) gives an accessible description of the Bayesian approach to Bayesian network structure learning, including how to define parameter priors for discrete and Gaussian networks (the former case being potentially relevant to your signal detection/classification problem).
If you are particularly interested, the introduction to this paper (Scutari 2018) also gives some more information on computing marginal likelihoods (i.e. likelihood of data conditional only on the graph structure) for discrete Bayesian networks.
This one (Kuipers, Moffa, and Heckerman 2014a) gives a rigorous derivation of the marginal graph likelihood for Gaussian networks:
I also wonder about this seminar:
Guido Consonni, Objective Bayes Model Selection of Gaussian Essential Graphs with Observational and Interventional Data.
Graphical models based on Directed Acyclic Graphs (DAGs) represent a powerful tool for investigating dependencies among variables. It is well known that one cannot distinguish between DAGs encoding the same set of conditional independencies (Markov equivalent DAGs) using only observational data. However, the space of all DAGs can be partitioned into Markov equivalence classes, each being represented by a unique Essential Graph (EG), also called Completed Partially Directed Graph (CPDAG). In some fields, in particular genomics, one can have both observational and interventional data, the latter being produced after an exogenous perturbation of some variables in the system, or from randomized intervention experiments. Interventions destroy the original causal structure, and modify the Markov property of the underlying DAG, leading to a finer partition of DAGs into equivalence classes, each one being represented by an Interventional Essential Graph (I-EG) (Hauser and Buehlmann). In this talk we consider Bayesian model selection of EGs under the assumption that the variables are jointly Gaussian. In particular, we adopt an objective Bayes approach, based on the notion of fractional Bayes factor, and obtain a closed form expression for the marginal likelihood of an EG. Next we construct a Markov chain to explore the EG space under a sparsity constraint, and propose an MCMC algorithm to approximate the posterior distribution over the space of EGs. Our methodology, which we name Objective Bayes Essential graph Search (OBES), allows to evaluate the inferential uncertainty associated to any features of interest, for instance the posterior probability of edge inclusion. An extension of OBES to deal simultaneously with observational and interventional data is also presented: this involves suitable modifications of the likelihood and prior, as well as of the MCMC algorithm.
2 Classic methods using independence tests on graphs
Many. In my time in the lectures of Marloes Maathuis I learnt some of the theory, but TBH everything has fallen out my head now, since I have not used them in practice. Notable works are Colombo and Maathuis (2014);Drton and Maathuis (2017);Heinze-Deml, Maathuis, and Meinshausen (2018);Maathuis, Kalisch, and Bühlmann (2009);Maathuis et al. (2010).
Most of these seem to boil down to the cases where belief propagation is well-behaved, to wit, linear-Gaussian and discrete RVs. In these, dependence is simple (in the Gaussian case, essentially, two things are independent if their shared entry in the cross-precision matrix is zero). If we can find a way of estimating actual zeros in that matrix, then what? Often we want causal distributions, so the question remains: can we convert the implied undirected graph into a directed one? tl;dr: sometimes. This is one of those cases where the Bayesian method comes out much cleaner; averaging over possible graphs is a natural way of thinking about this.
For a less causal/intervention focused method, see Nonparanormal skeptic (🏗) (H. Liu et al. 2012) which combines semiparametric regression with non-parametric graph inference.
3 Learning by continuous optimisation
A new trick in the arsenal from those neural network nerds. Xun Zheng, Bryon Aragam and Chen Dan in their blog post Learning DAGs with Continuous Optimization introduce NO-TEARS. This is an interesting bit of work AFAICT. Download from xunzheng/notears, and read the papers (Zheng et al. 2018; Zheng et al. 2020):
Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: We formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.
Key insight:
…that the k-th power of the adjacency matrix of a graph counts the number of k-step paths from one node to another. In other words, if the diagonal of the matrix power turns out to be all zeros, there [are] no k-step cycles in the graph. Then to characterize acyclicity, we just need to set this constraint for all k=1,2,…,d, eliminating cycles of all possible length.
Has this gone anywhere?
A related approach is Lorch et al. (2021):
Continuous characterization of acyclic graphs. Orthogonal to the work on Bayesian inference, Zheng et al. (2018); have recently proposed a differentiable characterization of acyclic graphs for structure learning. In this work, we adopt the formulation of Yu et al. (2019), who show that a graph with adjacency matrix \(\mathbf{G} \in\{0,1\}^{d \times d}\) does not have any cycles if and only if \(h(\mathbf{G})=0\), where \[ h(\mathbf{G}):=\operatorname{tr}\left[\left(\mathbf{I}+\frac{1}{d} \mathbf{G}\right)^{d}\right]-d . \]
NO-BEARS (H.-C. Lee et al. 2019; Zhu et al. 2020) does more tricks and approximations to make NO-TEARS more scalable, by upper-bounding the spectral radius. (HT Dario Draca for mentioning this.) Connection to orthonormal matrix wrangling.
4 Causal discovery from time series
A particular sub-flavour. Very popular, and weird. Accordingly see the graphical models from time series page.
5 Tools
5.1 DIBS
larslorch/dibs: Joint Bayesian inference of graph and parameters of general Bayes nets
This is the Python JAX implementation for DiBS (Lorch et al. 2021), a fully differentiable method for joint Bayesian inference of the DAG and parameters of general, causal Bayesian networks.
In this implementation, DiBS inference is performed with the particle variational inference method SVGD (Q. Liu and Wang 2019). Since DiBS and SVGD operate on continuous tensors and solely rely on Monte Carlo estimation and gradient ascent-like updates, the inference code leverages efficient vectorised operations, automatic differentiation, just-in-time compilation, and hardware acceleration, fully implemented with JAX.
Their code example is impressive:
In this example, we use DiBS to generate 10 DAG and parameter samples from the joint posterior over Gaussian Bayes nets with means modelled by neural networks.
from dibs.inference import JointDiBS
from dibs.target import make_nonlinear_gaussian_model
import jax.random as random
key = random.PRNGKey(0)
# simulate some data
key, subk = random.split(key)
data, model = make_nonlinear_gaussian_model(key=subk, n_vars=20)
# sample 10 DAG and parameter particles from the joint posterior
dibs = JointDiBS(x=data.x, inference_model=model)
key, subk = random.split(key)
gs, thetas = dibs.sample(key=subk, n_particles=10, steps=1000)
In the above, the keyword argument
x
forJointDiBS
is a matrix of shape[N, d]
and could be any real-world data set.
Cripes.
5.2 bnlearn
bnlearn learns belief networks, i.e. directed graphical models. It is restricted to the classic exact belief propagation case, i.e. multinomial or Gaussian models.
bnlearn implements the following constraint-based structure learning algorithms:
- PC (the stable version);
- Grow-Shrink (GS);
- Incremental Association Markov Blanket (IAMB);
- Fast Incremental Association (Fast-IAMB);
- Interleaved Incremental Association (Inter-IAMB);
- Incremental Association with FDR Correction (IAMB-FDR);
- Max-Min Parents & Children (MMPC);
- Semi-Interleaved Hiton-PC (SI-HITON-PC);
- Hybrid Parents & Children (HPC);
the following score-based structure learning algorithms:
- Hill Climbing (HC);
- Tabu Search (Tabu);
the following hybrid structure learning algorithms:
- Max-Min Hill Climbing (MMHC);
- Hybrid HPC (H2PC);
- General 2-Phase Restricted Maximization (RSMAX2);
the following local discovery algorithms:
- Chow-Liu;
- ARACNE;
and the following Bayesian network classifiers:
- naive Bayes;
- Tree-Augmented naive Bayes (TAN).
Discrete (multinomial) and continuous (multivariate normal) data sets are supported, both for structure and parameter learning. The latter can be performed using either maximum likelihood or Bayesian estimators.
5.3 sparsebn
Compare with sparsebn:
A new R package for learning sparse Bayesian networks and other graphical models from high-dimensional data via sparse regularization. Designed from the ground up to handle:
- Experimental data with interventions
- Mixed observational / experimental data
- High-dimensional data with p >> n
- Datasets with thousands of variables (tested up to p=8000)
- Continuous and discrete data
The emphasis of this package is scalability and statistical consistency on high-dimensional datasets. […] For more details on this package, including worked examples and the methodological background, please see our new preprint.
Overview
The main methods for learning graphical models are:
- estimate.dag for directed acyclic graphs (Bayesian networks).
- estimate.precision for undirected graphs (Markov random fields).
- estimate.covariance for covariance matrices.
Currently, estimation of precision and covariance matrices is limited to Gaussian data.
5.4 Causalnex
CausalNex is a Python library that uses Bayesian Networks to combine machine learning and domain expertise for causal reasoning. You can use CausalNex to uncover structural relationships in your data, learn complex distributions, and observe the effect of potential interventions.
5.5 caus2e
The main contribution of cause2e is the integration of two established causal packages that have currently been separated and cumbersome to combine:
- Causal discovery methods from the py-causal package, which is a Python wrapper around parts of the Java TETRAD software. It provides many algorithms for learning the causal graph from data and domain knowledge.
- Causal reasoning methods from the DoWhy package, which is the current standard for the steps of a causal analysis starting from a known causal graph and data
5.6 TETRAD
TETRAD (source, tutorial) is a tool for discovering and visualising and calculating giant empirical DAGs, including general graphical inference and causality. It’s written by eminent causality inference people.
Tetrad is a program which creates, simulates data from, estimates, tests, predicts with, and searches for causal and statistical models. The aim of the program is to provide sophisticated methods in a friendly interface requiring very little statistical sophistication of the user and no programming knowledge. It is not intended to replace flexible statistical programming systems such as Matlab, Splus or R. Tetrad is freeware that performs many of the functions in commercial programs such as Netica, Hugin, LISREL, EQS and other programs, and many discovery functions these commercial programs do not perform. …
The Tetrad programs describe causal models in three distinct parts or stages: a picture, representing a directed graph specifying hypothetical causal relations among the variables; a specification of the family of probability distributions and kinds of parameters associated with the graphical model; and a specification of the numerical values of those parameters.
py-causal is a wrapper around this for Python, and R-causal for R.
5.7 skgmm
skggm (Python) does the Gaussian thing but also has a nice sparsification and good explanation.
The core estimator provided in skggm is
QuicGraphLasso
, which is a scikit-learn compatible interface to QUIC, a proximal Newton-type algorithm that solves the graphical lasso (2) objective.
6 Causeme
Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in such systems present a number of major challenges for causal discovery techniques and it is largely unknown which methods perform best for which challenge.
The CauseMe platform provides ground truth benchmark datasets featuring different real data challenges to assess and compare the performance of causal discovery methods. The available benchmark datasets are either generated from synthetic models mimicking real challenges, or are real-world datasets where the causal structure is known with high confidence. The datasets vary in dimensionality, complexity and sophistication.