Belief propagation and related algorithms
November 25, 2014 — May 17, 2023
Assumed audience:
People who know basic probability but have not applied it to graphs
\[\renewcommand{\var}{\operatorname{Var}} \renewcommand{\corr}{\operatorname{Corr}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\dif}{\backslash} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}}\]
Message-passing inference in its exact form turns out not always to be implementable, but it gives us some basic inspiration. A concrete, implementable version useful to me is Gaussian Belief Propagation. See also graph computations for a more general sense of Message Passing, and graph NNs for the same idea but with more neural dust sprinkled on it.
1 History
The grandparent of Belief Propagation is Pearl (1982), who introduced it for DAGs; it was later generalised to various other graphical models. The framework subsumes such diverse methods as the Viterbi and Baum–Welch algorithms and state filters.
2 Let’s go
Bishop (2006)’s notation is increasingly standard. We can set this up a few ways but a nice general one is factor graphs. Below is a factor graph; if it makes no sense maybe dash over to that page and read about ’em. We assume that the factor graph is a tree, i.e. the variables are all connected and there are no loops.
If we are dealing with continuous RVs this factor graph corresponds to a density function which factorises \[ p(\vv{x})=f_a(x_1)f_b(x_1,x_2)f_d(x_2)f_c(x_2,x_3)f_g(x_3,x_4,x_5)f_k(x_5). \] For reasons of tradition, we often assume the variables are discrete at this point and work with sums. I find densities more intuitive.
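To make the factorisation concrete, here is a minimal sketch with small discrete tables (sums standing in for integrals) and arbitrary made-up factor values; the cardinalities and numbers are purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2  # each of x1..x5 takes values in {0, 1}; an arbitrary toy choice

# Non-negative factor tables for the example graph, filled with arbitrary values.
f_a = rng.random(K)          # f_a(x1)
f_b = rng.random((K, K))     # f_b(x1, x2)
f_d = rng.random(K)          # f_d(x2)
f_c = rng.random((K, K))     # f_c(x2, x3)
f_g = rng.random((K, K, K))  # f_g(x3, x4, x5)
f_k = rng.random(K)          # f_k(x5)

# The joint over (x1,...,x5) is the product of the factors, broadcast over all
# five axes; einsum spells out which axes each factor touches.
joint = np.einsum("a,ab,b,bc,cde,e->abcde", f_a, f_b, f_d, f_c, f_g, f_k)
p = joint / joint.sum()      # normalise, since the factors themselves are not
```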
For the moment we use the mnemonic that \(p\)s are densities and thus normalised, whereas \(f\)s are factors, which are not necessarily normalised (but are non-negative and generate finite Lebesgue measures over their arguments, so they can be normalised.)
OK, what can we do with this kind of factorization? I follow Davison and Ortiz (2019)’s explanation, which is the shortest one I can find that is still comprehensible.
Suppose we want to find the marginal density of some node \(x\), \[ p(x)=\int p(\vv{x})\dd(\vv{x}\dif x). \] Read \(\dd (\vv{x}\dif x)\) as “an integral over all axes except \(x\),” e.g. \(\dd(\vv{x}\dif x_1)=\dd x_2\,\dd x_3\,\dd x_4\,\dd x_5.\)
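Continuing the toy tables above, the brute-force version of this marginal for, say, \(x_2\) is a sum over every axis except the one belonging to \(x_2\):

```python
# Marginal of x2: sum the joint pmf over every axis except x2's (axis 1).
p_x2 = p.sum(axis=(0, 2, 3, 4))
assert np.isclose(p_x2.sum(), 1.0)
```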
We can write the joint density in terms of a product over the groups of factors connected to \(x\): \[p(\vv{x})=\prod_{s\in n(x)}F_s(x,\vv{x}_s).\] Here \(n(x)\) is the set of factor nodes that are neighbours of \(x\), \(F_{s}\) is the product of all the factors in the group associated with \(f_{s},\) and \(\vv{x}_{s}\) is the vector of all variables in the subtree connected to \(x\) via \(f_{s}\). Combining these, \[\begin{aligned} p(x) &=\int \left[\prod_{s \in n(x)} F_{s}\left(x, \vv{x}_{s}\right)\right] \dd(\vv{x}\dif x)\\ &=\prod_{s \in n(x)}\left[\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s)\right]. \end{aligned}\] That second step, reordering the integral and product, is pretty much our whole justification. How can we express that in terms of operations on a factor graph?
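Numerically, the reordering looks like this for \(x_2\) in the toy example above: one bracketed sum per subtree hanging off \(x_2\), multiplied together, giving the same answer as the brute-force marginal.

```python
# One bracket per subtree hanging off x2, then a pointwise product over x2.
branch_b = np.einsum("a,ab->b", f_a, f_b)           # sum over x1 of f_a * f_b
branch_d = f_d                                      # nothing to sum out
branch_c = np.einsum("bc,cde,e->b", f_c, f_g, f_k)  # sum over x3, x4, x5
p_x2_factored = branch_b * branch_d * branch_c
p_x2_factored /= p_x2_factored.sum()                # normalise at the end

assert np.allclose(p_x2_factored, p_x2)             # matches the brute force
```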
A clue: each inner integral \(\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s)\) is an (unnormalised) marginal over the single variable \(x\); the marginalisation has removed the dependence on all the other arguments. Looking ‘through’ each integral we see only one argument, so those other arguments are, from where we sit, irrelevant to calculating this marginal. Next, we represent this decomposition into integrals and products as messages passed between nodes of the graph, each message corresponding to marginalising out dependencies and thus removing variables from consideration at our current position.
In doing so we assume that these marginal integrals have convenient analytic solutions, which is great for proving interesting theorems. In practice… if that assumption looks dangerous, that is because it should. These integrals cannot, in general, be solved in closed form, and later on we make all kinds of approximations to make this go.1
Anyway, let us ignore tedious calculus problems for now and assume that works. Describing a recipe which actually does this message passing looks confusing because it involves some notation challenges, but it is not awful if we dedicate some time to staring and swearing at it.
There are two types of messages needed: factor-to-variable messages, which will be multiplied to give the marginal distribution of that variable, and variable-to-factor messages, which are harder to explain punchily but eventually make sense if we stare long enough.
Ortiz et al. explain it thusly:
Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor’s belief on the receiving node’s variables.
Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalises over all the other nodes’ variables to produce a message that expresses the factor’s belief over the receiving node’s variables.
Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.
The factor-to-variable message looks like this: \[ \mu_{f_{s} \rightarrow x}(x)=\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s). \] We think of this term \(\mu_{f_{s} \rightarrow x}(x)\) as a message from factor \(f_{s}\) to variable \(x\). The message has the form of a function of the variable \(x\) only, corresponding to the marginal probability over \(x\) that results from considering all the factors in one branch of the tree. Thus, \[ p(x) = \prod_{s\in n(x)}\mu_{f_{s} \rightarrow x}(x). \] That is the easy bit. Next is the slightly weirder variable-to-factor message. \[\begin{aligned} \mu_{x_{m} \rightarrow f_{s}}\left(x_{m}\right) &=\int \prod_{l \in n\left(x_{m}\right) \backslash f_{s}} F_{l}\left(x_{m}, \vv{x}_{m_{l}}\right) \dd \vv{x}_{sm}\\ &=\prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \int F_{l}\left(x_{m}, \vv{x}_{m_{l}}\right) \dd \vv{x}_{m_{l}}\\ &=\prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \mu_{f_{l} \rightarrow x_{m}}\left(x_{m}\right). \end{aligned}\]
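Here is a self-contained sketch of those two message types on the example tree, again with discrete toy tables and sums in place of integrals. `msg_factor_to_var` plays the role of \(\mu_{f_s \rightarrow x}\) and `msg_var_to_factor` that of \(\mu_{x_m \rightarrow f_s}\); the data structures and names are mine, not from any particular library, and a serious implementation would schedule and cache messages rather than recompute them recursively.

```python
import numpy as np

# Sum-product on the example tree, with discrete variables (sums standing in
# for the integrals above). The factorisation mirrors
#   p(x) proportional to f_a(x1) f_b(x1,x2) f_d(x2) f_c(x2,x3) f_g(x3,x4,x5) f_k(x5).
rng = np.random.default_rng(0)
K = 2
card = {v: K for v in ["x1", "x2", "x3", "x4", "x5"]}

# Each factor: (ordered list of variables it touches, non-negative table).
factors = {
    "f_a": (["x1"], rng.random(K)),
    "f_b": (["x1", "x2"], rng.random((K, K))),
    "f_d": (["x2"], rng.random(K)),
    "f_c": (["x2", "x3"], rng.random((K, K))),
    "f_g": (["x3", "x4", "x5"], rng.random((K, K, K))),
    "f_k": (["x5"], rng.random(K)),
}

# n(v): the factor nodes neighbouring variable v.
nbrs = {v: [f for f, (vs, _) in factors.items() if v in vs] for v in card}


def msg_var_to_factor(v, f):
    """mu_{v -> f}: product of messages into v from every factor except f."""
    out = np.ones(card[v])
    for g in nbrs[v]:
        if g != f:
            out = out * msg_factor_to_var(g, v)
    return out


def msg_factor_to_var(f, v):
    """mu_{f -> v}: multiply incoming variable messages onto the factor table,
    then sum out every variable other than v."""
    vs, table = factors[f]
    prod = table.copy()
    for i, u in enumerate(vs):
        if u != v:
            m = msg_var_to_factor(u, f)
            shape = [1] * table.ndim
            shape[i] = card[u]          # broadcast m along axis i of the table
            prod = prod * m.reshape(shape)
    keep = vs.index(v)
    sum_axes = tuple(i for i in range(table.ndim) if i != keep)
    return prod.sum(axis=sum_axes) if sum_axes else prod


def marginal(v):
    """Belief at v: normalised product of all incoming factor-to-variable messages."""
    b = np.ones(card[v])
    for f in nbrs[v]:
        b = b * msg_factor_to_var(f, v)
    return b / b.sum()      # the factors are unnormalised, so normalise here


# Cross-check every marginal against brute-force summation of the full joint.
order = ["f_a", "f_b", "f_d", "f_c", "f_g", "f_k"]
joint = np.einsum("a,ab,b,bc,cde,e->abcde", *(factors[f][1] for f in order))
joint /= joint.sum()
for i, v in enumerate(["x1", "x2", "x3", "x4", "x5"]):
    brute = joint.sum(axis=tuple(j for j in range(5) if j != i))
    assert np.allclose(marginal(v), brute)
```

Because the graph is a tree, the mutual recursion bottoms out at leaf variables (which send uniform messages) and unary factors (which send their own table), which is the usual “start at the leaves and sweep inwards” schedule in disguise.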
TBC.
3 Further reading
We can invent a lot of extra machinery here to do belief propagation in ever more sophisticated ways as regards the message-passing schedule and marginals. A classic example is the junction tree, which is a particular non-unique alternative decomposition of the graphical models in a kind of clustered graph. AFAICT most of these extended belief-y algorithms are interesting but complicated, hard to implement in general, and have been useless to me thus far for practical purposes. Exceptions:
- Quantifying independence with these tools leads us to causal inference, which has indeed been useful for me.
- Gaussian Belief Propagation turns out to be what I am writing a paper on right now.
For purely discrete RVs exact belief propagation works fine, but that is a setting I almost never care about, except for generating neat graphs for causal-inference thought experiments.
Having expressed all those negative sentiments about the practical application of exact belief propagation, I still think it is pedagogically useful to learn at least this much, because it provides a heuristic motivation and structure for variational message passing, which does approximately work, approximately generally.
For the dedicated there are many treatises on this topic in the literature, going deep into the most tedious recesses of the subject; they make virtuous and improving class exercises, but offer diminishing returns in understanding. If I wanted to put this into a wider perspective, I would prefer texts which build up this machinery towards modern application (Bishop 2006) or profound reinterpretation (Koller and Friedman 2009).
4 Roll your own
Pedagogically useful, although probably not industrial-grade, David Barber’s discrete graphical model code (Julia) can do queries over graphical models.
Similar domain, more hype: deepmind/PGMax: Loopy belief propagation for factor graphs on discrete variables in JAX (Zhou et al. 2023)
danbar/fglib: factor graph library
The factor graph library (fglib) is a Python package to simulate message passing on factor graphs. It supports the
- sum-product algorithm (belief propagation)
- max-product algorithm
- max-sum algorithm
- mean-field algorithm - in development
with discrete and Gaussian random variables.
Pure numpy: Krashkov’s summary notebooks
- Belief-Propagation/1-SummaryPGMandBP.ipynb at master · krashkov/Belief-Propagation · GitHub
- Belief-Propagation/2-ImplementationFactor.ipynb at master · krashkov/Belief-Propagation · GitHub
- Belief-Propagation/3-ImplementationPGM.ipynb at master · krashkov/Belief-Propagation · GitHub
- Belief-Propagation/4-ImplementationBP.ipynb at master · krashkov/Belief-Propagation · GitHub
Joe Ortiz’ GBP implementations are elegant and easy to understand. Here are two python ones:
-
pgmpy is a pure python implementation for Bayesian Networks with a focus on modularity and extensibility. Implementations of various algorithms for Structure Learning, Parameter Estimation, Approximate (Sampling Based) and Exact inference, and Causal Inference are available.
See the Belief Propagation for GBP.
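For the discrete case, a minimal pgmpy usage sketch might look like the following (a toy two-node network with made-up probabilities; I am assuming the `BayesianNetwork` / `TabularCPD` / `BeliefPropagation` API of recent pgmpy releases, and the GBP entry point may differ):

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import BeliefPropagation

# Toy two-node network Rain -> WetGrass with made-up probabilities.
model = BayesianNetwork([("Rain", "WetGrass")])
cpd_rain = TabularCPD("Rain", 2, [[0.8], [0.2]])
cpd_wet = TabularCPD(
    "WetGrass", 2,
    [[0.9, 0.2],   # P(WetGrass = 0 | Rain = 0), P(WetGrass = 0 | Rain = 1)
     [0.1, 0.8]],  # P(WetGrass = 1 | Rain = 0), P(WetGrass = 1 | Rain = 1)
    evidence=["Rain"], evidence_card=[2],
)
model.add_cpds(cpd_rain, cpd_wet)
model.check_model()

# Exact inference by belief propagation on the model's clique tree.
bp = BeliefPropagation(model)
print(bp.query(variables=["Rain"], evidence={"WetGrass": 1}))
```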
5 References
Footnotes
Question: does anyone ever use a large numerical solver for this? What breaks in that case, apart from the messages growing embarrassingly large?↩︎