Efficient factoring of GP likelihoods
October 16, 2020 — October 26, 2020
There are many ways to cleverly slice up GP likelihoods so that inference is cheap.
This page is about some of them, especially the union of sparse and variational tricks. Scalable Gaussian process regressions choose cunning factorisations such that the model collapses down, at least approximately, to something lower-dimensional than it first appeared to need. There is a compilation of tricks to make this go: variational approximations to the model, sparse GP models with a small number of inducing points, and so on (Dezfouli and Bonilla 2015; Edwin V. Bonilla, Krauth, and Dezfouli 2019; Krauth et al. 2016; Hensman, Fusi, and Lawrence 2013; Salimbeni and Deisenroth 2017). You might suspect yourself of using such a method if you find that some important high-dimensional expectation can be evaluated via functions of univariate Gaussians.
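To make that last observation concrete, here is a minimal sketch (my own toy construction, not lifted from any of the cited papers) of the usual Gauss-Hermite move: if the approximate posterior over each latent function value is a univariate Gaussian and the likelihood factorises over observations, the awkward high-dimensional expected log-likelihood collapses into \(N\) one-dimensional quadratures. The Gaussian observation model is chosen only so we can check against a closed form.

```python
import numpy as np
from scipy.stats import norm

def expected_log_lik_gh(y, mean, var, noise_var, n_points=20):
    """E_{N(f_n; mean_n, var_n)}[log N(y_n; f_n, noise_var)] for each n,
    via one univariate Gauss-Hermite quadrature per observation.
    Toy illustration; the names and setup are not from the cited papers."""
    x, w = np.polynomial.hermite.hermgauss(n_points)       # nodes, weights
    f = mean[:, None] + np.sqrt(2.0 * var)[:, None] * x    # (N, n_points)
    log_p = norm.logpdf(y[:, None], loc=f, scale=np.sqrt(noise_var))
    return (log_p @ w) / np.sqrt(np.pi)                    # (N,)

# sanity check: for a Gaussian likelihood this expectation has a closed form
rng = np.random.default_rng(0)
y = rng.normal(size=5)
m, v, s2 = rng.normal(size=5), rng.uniform(0.1, 1.0, size=5), 0.3
closed_form = norm.logpdf(y, loc=m, scale=np.sqrt(s2)) - 0.5 * v / s2
print(np.allclose(expected_log_lik_gh(y, m, v, s2), closed_form))  # True
```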
This notion is related to other tricks that factorise a distribution cleverly, such as message-passing inference. There are indeed many different factorisations that can be done here; see filtering GPs for one which factorises over a single input axis. Also, Toeplitz and related structures work out nicely for, e.g., lattice-distributed inputs and some other situations I forget right now. I will bet you they can all be used together.
1 Inducing variables
2 Inducing features
See GP inducing features.
3 Spectral and rank sparsity
Loosely speaking, these are methods where the functions can be represented in terms of a small number of basis functions. See, for example, Adam et al. (2020) and Zammit-Mangion and Cressie (2021).
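For flavour, a hedged sketch of the spectral version of this, using random Fourier features for an RBF kernel (the textbook Bochner-theorem construction); everything concrete below (names, lengthscale, feature count) is my own illustration rather than anything from the cited papers.

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def random_fourier_features(X, n_features, lengthscale=1.0, seed=None):
    """Spectral basis: Phi @ Phi.T approximates the RBF Gram matrix, so the GP
    collapses to Bayesian linear regression on n_features basis functions.
    Toy illustration, not anyone's reference implementation."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], n_features))  # spectral frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)                          # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ omega + b)

X = np.random.default_rng(1).normal(size=(50, 2))
Phi = random_fourier_features(X, n_features=5000, lengthscale=0.7, seed=2)
print(np.abs(Phi @ Phi.T - rbf_kernel(X, X, lengthscale=0.7)).max())  # small-ish
```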
4 SVI for Gaussian processes
As seen in Hensman, Fusi, and Lawrence (2013) and Salimbeni and Deisenroth (2017).
5 Low rank methods
Represent the GP in terms of a controlled budget of basis functions. See low-rank Gaussian processes.
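A minimal sketch of what that buys you, using the classic subset-of-regressors / Nyström predictive mean with a hand-picked set of basis centres \(Z\); all names and numbers below are mine, and this is a toy rather than a recommended implementation. The point is that the \(N\times N\) solve becomes an \(M\times M\) one.

```python
import numpy as np

def rbf(X, Z, ls=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def sor_posterior_mean(X, y, Xstar, Z, ls=1.0, noise=0.1):
    """Subset-of-regressors predictive mean: the full K_xx is replaced by the
    rank-M surrogate K_xz K_zz^{-1} K_zx, so the linear solve is only M x M.
    Toy illustration; values and names are not from the cited pages."""
    Kxz, Kzz, Ksz = rbf(X, Z, ls), rbf(Z, Z, ls), rbf(Xstar, Z, ls)
    A = noise * Kzz + Kxz.T @ Kxz + 1e-8 * np.eye(len(Z))  # M x M system, with jitter
    return Ksz @ np.linalg.solve(A, Kxz.T @ y)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
Z = np.linspace(-3, 3, 20)[:, None]                        # a small budget of basis centres
print(sor_posterior_mean(X, y, np.array([[1.5]]), Z))      # roughly sin(1.5) ~ 1.0
```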
6 Vecchia factorisation
Approximate the precision matrix by one with a sparse Cholesky factorisation. See Vecchia factorisation.
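A toy sketch of the idea for a zero-mean RBF-kernel GP in one dimension: condition each point (in some fixed ordering) on only a handful of its nearest predecessors, which is the same thing as forcing a sparse Cholesky factor onto the precision. The neighbour rule, kernel and names here are my own illustration, not any particular paper's recipe.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def rbf(X, Z, ls=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def vecchia_logpdf(f, X, ls=1.0, m=3, jitter=1e-8):
    """Approximate log N(f; 0, K) as sum_i log p(f_i | f_{c(i)}), where c(i) is
    the set of (at most) m nearest predecessors of point i in the given ordering."""
    total = 0.0
    for i in range(len(f)):
        c = np.argsort(np.abs(X[:i] - X[i]).sum(-1))[:m]   # conditioning set
        k_ii = rbf(X[i:i+1], X[i:i+1], ls)[0, 0] + jitter
        if len(c) == 0:
            mu, var = 0.0, k_ii
        else:
            K_cc = rbf(X[c], X[c], ls) + jitter * np.eye(len(c))
            k_ic = rbf(X[i:i+1], X[c], ls)[0]
            w = np.linalg.solve(K_cc, k_ic)
            mu, var = w @ f[c], k_ii - w @ k_ic
        total += norm.logpdf(f[i], loc=mu, scale=np.sqrt(var))
    return total

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
K = rbf(X, X) + 1e-6 * np.eye(40)
f = np.linalg.cholesky(K) @ rng.normal(size=40)
# the approximation and the exact log-density should be in the same ballpark here
print(vecchia_logpdf(f, X, jitter=1e-6), multivariate_normal(mean=np.zeros(40), cov=K).logpdf(f))
```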
7 Local
8 Latent Gaussian Process models
The Edwin V. Bonilla, Krauth, and Dezfouli (2019) setup for Latent Gaussian Process models (“LGPMs”) goes as follows:
We are learning a mapping \(\boldsymbol{f}:\mathbb{R}^D\to\mathbb{R}^P\) from data. The dataset looks like \(\mathcal{D}=\left\{\mathbf{x}_{n}, \mathbf{y}_{n}\right\}_{n=1}^{N}\equiv \left\{\mathbf{x}, \mathbf{y}\right\}.\) \(\mathbf{x}_{n}\in \mathbb{R}^D\) is an input vector and \(\mathbf{y}_{n}\in\mathbb{R}^P\) is an output. We decree that the mapping from inputs to outputs may be expressed by \(Q\) underlying latent functions \(\left\{f_{j}\right\}_{j=1}^{Q}.\) We assume that the \(Q\) latent functions \(\left\{f_{j}\right\}\) are drawn from (a priori) independent zero-mean Gaussian processes.
\[ \begin{aligned} f_{j} \mid \boldsymbol{\theta}_{j} & \sim \mathcal{GP}\left(0, \kappa_{j}\left(\cdot, \cdot ; \boldsymbol{\theta}_{j}\right)\right), \quad j=1, \ldots, Q, \quad \text { and } \\ p(\mathbf{f} \mid \boldsymbol{\theta}) &=\prod_{j=1}^{Q} p\left(\mathbf{f}_{\cdot j} \mid \boldsymbol{\theta}_{j}\right) \\ &=\prod_{j=1}^{Q} \mathcal{N}\left(\mathbf{f}_{\cdot j} ; \mathbf{0}, \mathbf{K}_{\mathbf{xx}}^{j}\right). \end{aligned} \] Here \(\mathbf{f}\) is the set of all latent function values, and \(\mathbf{f}_{\cdot j}=\left\{f_{j}\left(\mathbf{x}_{n}\right)\right\}_{n=1}^{N}\) denotes the values of latent function \(j\). \(\mathbf{K}_{\mathbf{xx}}^{j}\) is the Gram matrix induced by the covariance kernel \(\kappa_{j}\left(\cdot, \cdot ; \boldsymbol{\theta}_{j}\right)\), and we collect the parameters of all kernel functions as \(\boldsymbol{\theta}=\left\{\boldsymbol{\theta}_{j}\right\}.\) The observation model can have various likelihoods; we call the corresponding parameter \(\boldsymbol{\phi}\). We assume that the multi-dimensional observations \(\left\{\mathbf{y}_{n}\right\}\) are i.i.d. given the latent functions \(\left\{\mathbf{f}_{n\cdot}\right\},\) so that \[ p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\phi})=\prod_{n=1}^{N} p\left(\mathbf{y}_{n} \mid \mathbf{f}_{n \cdot}, \boldsymbol{\phi}\right), \] where \(\mathbf{f}_{n\cdot}=\{f_{j}(\mathbf{x}_n)\}_{j=1}^{Q}\) is the set of latent function values upon which \(\mathbf{y}_{n}\) depends.
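To pin the notation down, here is a small sketch that samples from this generative model, assuming, purely for illustration, RBF kernels with per-function lengthscales and a Gaussian likelihood with \(P=Q\); none of these specific choices are prescribed by the paper.

```python
import numpy as np

def rbf(X, Z, ls):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

N, D, Q, P = 200, 2, 3, 3
rng = np.random.default_rng(5)
X = rng.normal(size=(N, D))                     # inputs x_n in R^D

# prior: Q independent zero-mean GPs, each with its own hyperparameters theta_j
# (RBF lengthscales chosen arbitrarily for this toy)
lengthscales = [0.5, 1.0, 2.0]
F = np.column_stack([                           # F[n, j] = f_j(x_n)
    np.linalg.cholesky(rbf(X, X, ls) + 1e-6 * np.eye(N)) @ rng.normal(size=N)
    for ls in lengthscales
])

# likelihood: i.i.d. across observations given f_n. ; here Gaussian with phi = sigma
sigma = 0.1
Y = F + sigma * rng.normal(size=(N, P))         # y_n | f_n. ~ N(f_n., sigma^2 I)
print(X.shape, F.shape, Y.shape)                # (200, 2) (200, 3) (200, 3)
```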
There are several factorisations to note here:
- The prior factorises over the \(Q\) latent functions, coordinate by coordinate.
- The conditional likelihood factorises over observations (i.e. the noise is independent across data points).
If we further factorise the variational approximation in some convenient way, e.g. as a mixture of Gaussians, this structure pays off when we later devise an inference scheme that maximises the ELBO; a sketch of the resulting decomposition is below. TBC.
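Concretely, and hedging that this is the textbook decomposition one expects from these assumptions rather than a transcription from the cited papers, the two factorisations above mean the evidence lower bound splits into a sum of low-dimensional data terms plus a KL penalty, \[ \mathcal{L}(q)=\sum_{n=1}^{N} \mathbb{E}_{q\left(\mathbf{f}_{n \cdot}\right)}\left[\log p\left(\mathbf{y}_{n} \mid \mathbf{f}_{n \cdot}, \boldsymbol{\phi}\right)\right]-\mathrm{KL}\left(q(\mathbf{f}) \,\|\, p(\mathbf{f} \mid \boldsymbol{\theta})\right), \] so each data term needs only the \(Q\)-dimensional marginal \(q(\mathbf{f}_{n\cdot})\), and, if \(q\) also factorises over the latent functions, only univariate Gaussian marginals, which is exactly where quadrature tricks like the one sketched in the introduction earn their keep.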
For now, though, let us examine exactly tractable inference.