Kernel distribution embedding

Conditional mean embeddings etc

August 21, 2016 — June 17, 2024

approximation
functional analysis
Hilbert space
measure
metrics
nonparametric
optimization
probability
statistics

Handling and computing with probability measures via reproducing kernel methods. If we want a dependence test or a probability metric out of this trick, we famously can have one: the resulting integral probability metric is the maximum mean discrepancy (MMD).

1 Basic idea

TBD.
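In the meantime, a minimal sketch of the maximum mean discrepancy mentioned above: each distribution is represented by the mean of its feature map in the RKHS, and the MMD is the RKHS distance between those means, which reduces to averages of kernel evaluations. The RBF kernel, its lengthscale, and the toy data here are arbitrary illustrative choices, not anything mandated by the theory.

```python
import numpy as np

def rbf_gram(A, B, lengthscale=1.0):
    """Gram matrix K_ij = k(A_i, B_j) for a squared-exponential kernel."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale**2))

def mmd2_biased(X, Y, lengthscale=1.0):
    """Biased empirical MMD^2: ||mean_i phi(x_i) - mean_j phi(y_j)||^2 in the RKHS,
    computed entirely through kernel evaluations."""
    Kxx = rbf_gram(X, X, lengthscale)
    Kyy = rbf_gram(Y, Y, lengthscale)
    Kxy = rbf_gram(X, Y, lengthscale)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

# Two samples whose means differ; MMD^2 should be visibly nonzero.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
print(mmd2_biased(X, Y))
```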

2 Operations

From Kernel embedding of distributions on Wikipedia:

As the comments in the Wikipedia article observe, the \(P(X)\) notation is unusual; it only really makes sense in the context of the source papers.

This section illustrates how basic probabilistic rules may be reformulated as (multi)linear algebraic operations in the kernel embedding framework and is primarily based on the work of Song et al. (Song et al. 2009; Song, Fukumizu, and Gretton 2013). The following notation is adopted:

  • \(P(X, Y)=\) joint distribution over random variables \(X, Y\)
  • \(P(X)=\int_{\Omega} P(X, \mathrm{d} y)=\) marginal distribution of \(X\); \(P(Y)=\) marginal distribution of \(Y\)
  • \(P(Y \mid X)=\frac{P(X, Y)}{P(X)}=\) conditional distribution of \(Y\) given \(X\) with corresponding conditional embedding operator \(\mathcal{C}_{Y \mid X}\)
  • \(\pi(Y)=\) prior distribution over \(Y\)
  • \(Q\) is used to distinguish distributions which incorporate the prior from distributions \(P\) which do not rely on the prior

In practice, all embeddings are empirically estimated from data \(\left\{\left(x_1, y_1\right), \ldots,\left(x_n, y_n\right)\right\}\) and it is assumed that a set of samples \(\left\{\tilde{y}_1, \ldots, \tilde{y}_{\tilde{n}}\right\}\) may be used to estimate the kernel embedding of the prior distribution \(\pi(Y)\).

2.1 Kernel sum rule

In probability theory, the marginal distribution of \(X\) can be computed by integrating out \(Y\) from the joint density (including the prior distribution on \(Y\)) \[ Q(X)=\int_{\Omega} P(X \mid Y) \mathrm{d} \pi(Y) \]

The analogue of this rule in the kernel embedding framework states that \(\mu_X^\pi\), the RKHS embedding of \(Q(X)\), can be computed via \[ \mu_X^\pi=\mathbb{E}\left[\mathcal{C}_{X \mid Y} \varphi(Y)\right]=\mathcal{C}_{X \mid Y} \mathbb{E}[\varphi(Y)]=\mathcal{C}_{X \mid Y} \mu_Y^\pi \] where \(\mu_Y^\pi\) is the kernel embedding of \(\pi(Y)\). In practical implementations, the kernel sum rule takes the following form \[ \widehat{\mu}_X^\pi=\widehat{\mathcal{C}}_{X \mid Y} \widehat{\mu}_Y^\pi=\mathbf{\Upsilon}(\mathbf{G}+\lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha} \] where \[ \mu_Y^\pi=\sum_{i=1}^{\tilde{n}} \alpha_i \varphi\left(\tilde{y}_i\right) \] is the empirical kernel embedding of the prior distribution, \(\boldsymbol{\alpha}=\left(\alpha_1, \ldots, \alpha_{\tilde{n}}\right)^T, \boldsymbol{\Upsilon}=\left(\varphi\left(x_1\right), \ldots, \varphi\left(x_n\right)\right)\), and \(\mathbf{G}, \widetilde{\mathbf{G}}\) are Gram matrices with entries \(\mathbf{G}_{i j}=k\left(y_i, y_j\right), \widetilde{\mathbf{G}}_{i j}=k\left(y_i, \tilde{y}_j\right)\) respectively.
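In code, the embedding \(\widehat{\mu}_X^\pi\) is never materialised; we only carry the weight vector \(\boldsymbol{\beta}=(\mathbf{G}+\lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha}\), so that \(\widehat{\mu}_X^\pi=\sum_i \beta_i \varphi(x_i)\). A minimal sketch, reusing the `rbf_gram` helper and `numpy` import from the sketch above; the kernel and the regularisation strength \(\lambda\) are placeholder choices.

```python
def kernel_sum_rule_weights(y, y_tilde, alpha, lam=1e-3, lengthscale=1.0):
    """Empirical kernel sum rule.

    Returns beta = (G + lam I)^{-1} Gtilde alpha, so that the embedding of Q(X)
    is mu_X^pi = sum_i beta_i phi(x_i) over the observed x sample.
    """
    n = len(y)
    G = rbf_gram(y, y, lengthscale)              # G_ij      = k(y_i, y_j)
    G_tilde = rbf_gram(y, y_tilde, lengthscale)  # Gtilde_ij = k(y_i, ytilde_j)
    return np.linalg.solve(G + lam * np.eye(n), G_tilde @ alpha)
```

Evaluating the resulting mean embedding at a new point \(x\) then reduces to a weighted sum of kernel evaluations, \(\widehat{\mu}_X^\pi(x)=\sum_i \beta_i k(x_i, x)\).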

2.2 Kernel chain rule

In probability theory, a joint distribution can be factorised into a product between conditional and marginal distributions \[ Q(X, Y)=P(X \mid Y) \pi(Y) \]

The analogue of this rule in the kernel embedding framework states that \(\mathcal{C}_{X Y}^\pi\), the joint embedding of \(Q(X, Y)\), can be factorised as the composition of the conditional embedding operator with the auto-covariance operator associated with \(\pi(Y)\): \[ \mathcal{C}_{X Y}^\pi=\mathcal{C}_{X \mid Y} \mathcal{C}_{Y Y}^\pi \] where \[ \begin{aligned} \mathcal{C}_{X Y}^\pi & =\mathbb{E}[\varphi(X) \otimes \varphi(Y)] \\ \mathcal{C}_{Y Y}^\pi & =\mathbb{E}[\varphi(Y) \otimes \varphi(Y)] \end{aligned} \] with the expectations taken under \(Q(X, Y)\) and \(\pi(Y)\) respectively.

In practical implementations, the kernel chain rule takes the following form \[ \widehat{\mathcal{C}}_{X Y}^\pi=\widehat{\mathcal{C}}_{X \mid Y} \widehat{\mathcal{C}}_{Y Y}^\pi=\mathbf{\Upsilon}(\mathbf{G}+\lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \operatorname{diag}(\boldsymbol{\alpha}) \widetilde{\boldsymbol{\Phi}}^T \]
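Again the operator is only ever represented by a finite weight matrix sandwiched between feature maps: with \(\mathbf{B}=(\mathbf{G}+\lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \operatorname{diag}(\boldsymbol{\alpha})\) we have \(\widehat{\mathcal{C}}_{X Y}^\pi=\mathbf{\Upsilon} \mathbf{B} \widetilde{\boldsymbol{\Phi}}^T\). A sketch under the same placeholder kernel and helpers as above:

```python
def kernel_chain_rule_weights(y, y_tilde, alpha, lam=1e-3, lengthscale=1.0):
    """Empirical kernel chain rule.

    Returns B = (G + lam I)^{-1} Gtilde diag(alpha), so that the joint embedding
    is C_XY^pi = Upsilon @ B @ Phi_tilde.T (never formed explicitly).
    """
    n = len(y)
    G = rbf_gram(y, y, lengthscale)
    G_tilde = rbf_gram(y, y_tilde, lengthscale)
    # broadcasting by alpha is a right-multiplication by diag(alpha)
    return np.linalg.solve(G + lam * np.eye(n), G_tilde) * alpha
```

Bilinear evaluations \(\left\langle\varphi(x), \widehat{\mathcal{C}}_{X Y}^\pi \varphi(y)\right\rangle\) then become \(\mathbf{k}_x^T \mathbf{B} \tilde{\mathbf{k}}_y\), with \(\mathbf{k}_x=\left(k\left(x_i, x\right)\right)_i\) and \(\tilde{\mathbf{k}}_y=\left(k\left(\tilde{y}_j, y\right)\right)_j\).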

2.3 Kernel Bayes’ rule

In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as \[ Q(Y \mid x)=\frac{P(x \mid Y) \pi(Y)}{Q(x)} \text { where } Q(x)=\int_{\Omega} P(x \mid y) \mathrm{d} \pi(y) \]

The analogue of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution \[ \mu_{Y \mid x}^\pi=\mathcal{C}_{Y \mid X}^\pi \varphi(x)=\mathcal{C}_{Y X}^\pi\left(\mathcal{C}_{X X}^\pi\right)^{-1} \varphi(x) \] where from the chain rule: \[ \mathcal{C}_{Y X}^\pi=\left(\mathcal{C}_{X \mid Y} \mathcal{C}_{Y Y}^\pi\right)^T \]

In practical implementations, the kernel Bayes’ rule takes the following form \[ \widehat{\mu}_{Y \mid x}^\pi=\widehat{\mathcal{C}}_{Y X}^\pi\left(\left(\widehat{\mathcal{C}}_{X X}\right)^2+\tilde{\lambda} \mathbf{I}\right)^{-1} \widehat{\mathcal{C}}_{X X}^\pi \varphi(x)=\widetilde{\mathbf{\Phi}} \mathbf{\Lambda}^T\left((\mathbf{D K})^2+\tilde{\lambda} \mathbf{I}\right)^{-1} \mathbf{K D K}_x \] where \[ \mathbf{\Lambda}=(\mathbf{G}+\tilde{\lambda} \mathbf{I})^{-1} \widetilde{\mathbf{G}} \operatorname{diag}(\boldsymbol{\alpha}), \quad \mathbf{D}=\operatorname{diag}\left((\mathbf{G}+\tilde{\lambda} \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha}\right) \]

Two regularisation parameters are used in this framework: \(\lambda\) for the estimation of \(\hat{\mathcal{C}}_{Y X}^\pi\) and \(\hat{\mathcal{C}}_{X X}^\pi=\mathbf{\Upsilon} \mathbf{D} \mathbf{\Upsilon}^T\), and \(\tilde{\lambda}\) for the estimation of the final conditional embedding operator

\[ \hat{\mathcal{C}}_{Y \mid X}^\pi=\hat{\mathcal{C}}_{Y X}^\pi\left(\left(\hat{\mathcal{C}}_{X X}^\pi\right)^2+\tilde{\lambda} \mathbf{I}\right)^{-1} \hat{\mathcal{C}}_{X X}^\pi \]

The latter regularisation is applied to the square of \(\hat{\mathcal{C}}_{X X}^\pi\) because \(\mathbf{D}\) may not be positive definite.
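Putting the pieces together, here is a sketch of the empirical kernel Bayes' rule under the same placeholder kernel, helpers, and data layout as the sketches above. It follows the displayed formulas literally with a single regulariser \(\tilde{\lambda}\) (in practice one may separate \(\lambda\) and \(\tilde{\lambda}\) as in the text above), and returns the weights \(\mathbf{w}\) over the prior sample such that \(\widehat{\mu}_{Y \mid x}^\pi=\sum_j w_j \varphi\left(\tilde{y}_j\right)\).

```python
def kernel_bayes_posterior_weights(x_query, x, y, y_tilde, alpha,
                                   lam_tilde=1e-3, lengthscale=1.0):
    """Empirical kernel Bayes' rule, following the formulas above.

    Returns weights w over the prior sample so that the posterior embedding is
    mu_{Y|x}^pi = sum_j w_j phi(ytilde_j).
    """
    n = len(x)
    G = rbf_gram(y, y, lengthscale)                          # k(y_i, y_j)
    G_tilde = rbf_gram(y, y_tilde, lengthscale)              # k(y_i, ytilde_j)
    K = rbf_gram(x, x, lengthscale)                          # k(x_i, x_j)
    K_x = rbf_gram(x, x_query[None, :], lengthscale)[:, 0]   # k(x_i, x_query)

    reg = np.linalg.solve(G + lam_tilde * np.eye(n), G_tilde)
    Lam = reg * alpha                 # (G + lam I)^{-1} Gtilde diag(alpha)
    D = np.diag(reg @ alpha)          # diag((G + lam I)^{-1} Gtilde alpha)

    DK = D @ K
    # ((DK)^2 + lam I)^{-1} K D K_x; the *square* is regularised since D may be indefinite
    mid = np.linalg.solve(DK @ DK + lam_tilde * np.eye(n), K @ (D @ K_x))
    return Lam.T @ mid
```

A posterior expectation of an RKHS function then falls out as a weighted sum over the prior sample, \(\mathbb{E}[f(Y) \mid X=x] \approx \sum_j w_j f\left(\tilde{y}_j\right)\).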

3 References

Alquier, and Gerber. 2024. “Universal Robust Regression via Maximum Mean Discrepancy.” Biometrika.
Cherief-Abdellatif, and Alquier. 2020. “MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy.” In Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference.
Chowdhury, Oliveira, and Ramos. 2020. “Active Learning of Conditional Mean Embeddings via Bayesian Optimisation.” In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI).
Grünewälder, Lever, Baldassarre, et al. 2012. “Conditional Mean Embeddings as Regressors.” In Proceedings of the 29th International Conference on Machine Learning. ICML’12.
Muandet, Fukumizu, Sriperumbudur, et al. 2017. “Kernel Mean Embedding of Distributions: A Review and Beyond.” Foundations and Trends® in Machine Learning.
Nishiyama, and Fukumizu. 2016. “Characteristic Kernels and Infinitely Divisible Distributions.” The Journal of Machine Learning Research.
Park, and Muandet. 2021. “A Measure-Theoretic Approach to Kernel Conditional Mean Embeddings.”
Pfister, Bühlmann, Schölkopf, et al. 2018. “Kernel-Based Tests for Joint Independence.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Salvi, Lemercier, Liu, et al. 2024. “Higher Order Kernel Mean Embeddings to Capture Filtrations of Stochastic Processes.” In Advances in Neural Information Processing Systems. NIPS ’21.
Schölkopf, Muandet, Fukumizu, et al. 2015. “Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations.” arXiv:1501.06794 [Cs, Stat].
Smola, Gretton, Song, et al. 2007. “A Hilbert Space Embedding for Distributions.” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4754.
Song, Fukumizu, and Gretton. 2013. “Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models.” IEEE Signal Processing Magazine.
Song, Huang, Smola, et al. 2009. “Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems.” In Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09.
Sriperumbudur, B. K., Gretton, Fukumizu, et al. 2008. “Injective Hilbert Space Embeddings of Probability Measures.” In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
Sriperumbudur, Bharath K., Gretton, Fukumizu, et al. 2010. “Hilbert Space Embeddings and Metrics on Probability Measures.” Journal of Machine Learning Research.