GP inducing features

October 16, 2020 — January 21, 2025

algebra
approximation
Gaussian
generative
graphical models
Hilbert space
kernel tricks
machine learning
networks
optimization
probability
spheres
statistics

TBD


I’m short of time, so I’ll quote a summary from Tiao, Dutordoir, and Picheny (2023), which is not pithy but has the information.

the joint distribution of the model augmented by inducing variables \(\mathbf{u}\) is \(p(\mathbf{y}, \mathbf{f}, \mathbf{u})=p(\mathbf{y} \mid \mathbf{f}) p(\mathbf{f}, \mathbf{u})\) where \(p(\mathbf{f}, \mathbf{u})=p(\mathbf{f} \mid \mathbf{u}) p(\mathbf{u})\) for prior \(p(\mathbf{u})=\mathcal{N}\left(\mathbf{0}, \mathbf{K}_{\mathbf{u u}}\right)\) and conditional \[ p(\mathbf{f} \mid \mathbf{u})=\mathcal{N}\left(\mathbf{f} \mid \mathbf{Q}_{\mathrm{fu}} \mathbf{u}, \mathbf{K}_{\mathrm{ff}}-\mathbf{Q}_{\mathrm{ff}}\right) \]

where \(\mathbf{Q}_{\mathrm{ff}} \triangleq \mathbf{Q}_{\mathrm{fu}} \mathbf{K}_{\mathrm{uu}} \mathbf{Q}_{\mathrm{uf}}\) and \(\mathbf{Q}_{\mathrm{fu}} \triangleq \mathbf{K}_{\mathrm{fu}} \mathbf{K}_{\mathrm{uu}}^{-1}\). The joint variational distribution is defined as \(q(\mathbf{f}, \mathbf{u}) \triangleq p(\mathbf{f} \mid \mathbf{u}) q(\mathbf{u})\), where \(q(\mathbf{u}) \triangleq \mathcal{N}\left(\mathbf{m}_{\mathbf{u}}, \mathbf{C}_{\mathbf{u}}\right)\) for variational parameters \(\mathbf{m}_{\mathbf{u}} \in \mathbb{R}^M\) and \(\mathbf{C}_{\mathbf{u}} \in \mathbb{R}^{M \times M}\) s.t. \(\mathbf{C}_{\mathbf{u}} \succeq 0\). Integrating out \(\mathbf{u}\) yields the posterior predictive

\[ q\left(\mathbf{f}_*\right)=\mathcal{N}\left(\mathbf{Q}_{* \mathbf{u}} \mathbf{m}_{\mathbf{u}}, \mathbf{K}_{* *}-\mathbf{Q}_{* \mathbf{u}}\left(\mathbf{K}_{\mathbf{u u}}-\mathbf{C}_{\mathbf{u}}\right) \mathbf{Q}_{\mathbf{u} *}\right) \]

where parameters \(\mathbf{m}_{\mathbf{u}}\) and \(\mathbf{C}_{\mathbf{u}}\) are learned by minimising the Kullback-Leibler (KL) divergence between the approximate and exact posterior, \(\mathrm{KL}[q(\mathbf{f}) \| p(\mathbf{f} \mid \mathbf{y})]\). Thus, SVGP has time complexity \(\mathcal{O}(M^3)\) at prediction time and \(\mathcal{O}(M^3 + M^2 N)\) during training. In the reproducing kernel Hilbert space (RKHS) associated with \(k\), the predictive has a dual representation in which the mean and covariance share the same basis determined by \(\mathbf{u}\) (Cheng & Boots, 2017; Salimbeni et al., 2018). More specifically, the basis function is effectively the vector-valued function \(\mathbf{k}_{\mathbf{u}}: \mathcal{X} \rightarrow \mathbb{R}^M\) whose \(m\)-th component is defined as

\[ \left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m \triangleq \operatorname{Cov}\left(f(\mathbf{x}), u_m\right) \]
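To make the quoted formulas concrete, here is a minimal NumPy sketch of the SVGP predictive \(q(\mathbf{f}_*)\) for standard inducing points. The squared-exponential kernel, the inducing locations, and the (untrained) variational parameters are illustrative placeholders of my own, not anything prescribed by Tiao, Dutordoir, and Picheny (2023).

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel matrix between inputs A (N, d) and B (M, d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def svgp_predict(X_star, Z, m_u, C_u, kernel=rbf_kernel, jitter=1e-8):
    """q(f_*) = N(Q_*u m_u, K_** - Q_*u (K_uu - C_u) Q_u*)."""
    M = len(Z)
    K_uu = kernel(Z, Z) + jitter * np.eye(M)      # K_uu (jitter for stability)
    K_su = kernel(X_star, Z)                      # K_{*u}
    Q_su = K_su @ np.linalg.inv(K_uu)             # Q_{*u} = K_{*u} K_uu^{-1}
    mean = Q_su @ m_u
    cov = kernel(X_star, X_star) - Q_su @ (K_uu - C_u) @ Q_su.T
    return mean, cov

# Toy usage: M = 5 inducing inputs on [0, 1] with untrained variational parameters.
rng = np.random.default_rng(0)
Z = np.linspace(0.0, 1.0, 5)[:, None]
m_u = rng.normal(size=5)
C_u = 0.1 * np.eye(5)                             # any PSD M x M matrix will do
X_star = np.linspace(0.0, 1.0, 50)[:, None]
mu, Sigma = svgp_predict(X_star, Z, m_u, C_u)
```

In practice one would use Cholesky factorisations and triangular solves rather than an explicit inverse, but the shapes and the \(\mathcal{O}(M^3)\) scaling in \(M\) are already visible here.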

In the standard definition of inducing points, \(\left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m = k\left(\mathbf{z}_m, \mathbf{x}\right)\), so the basis function is solely determined by \(k\) and the local influence of pseudo-input \(\mathbf{z}_m\). Inter-domain inducing features are a generalisation of standard inducing variables in which each variable \(u_m \triangleq L_m[f]\) for some linear operator \(L_m: \mathbb{R}^{\mathcal{X}} \rightarrow \mathbb{R}\). A particularly useful operator is the integral transform, \(L_m[f] \triangleq \int_{\mathcal{X}} f(\mathbf{x}) \phi_m(\mathbf{x}) \mathrm{d} \mathbf{x}\), which was originally employed by Lázaro-Gredilla & Figueiras-Vidal (2009). Refer to the manuscript of van der Wilk et al. (2020) for a more thorough and contemporary treatment. A closely related form is the scalar projection of \(f\) onto some \(\phi_m\) in the RKHS \(\mathcal{H}\),

\[ L_m[f] \triangleq\left\langle f, \phi_m\right\rangle_{\mathcal{H}} \] which leads to \(\left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m=\phi_m(\mathbf{x})\) by the reproducing property of the RKHS. This, in effect, equips the GP approximation with basis functions \(\phi_m\) that are not solely determined by the kernel, and suitable choices can lead to sparser representations and considerable computational benefits (Hensman et al., 2018; Burt et al., 2020; Dutordoir et al., 2020; Sun et al., 2021).
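As a rough numerical illustration of the inter-domain construction (using the integral-transform features of Lázaro-Gredilla & Figueiras-Vidal (2009) rather than the RKHS projection), the sketch below approximates the required covariances by quadrature on a 1-D grid. The Gaussian window functions, the grid, and the widths are arbitrary choices made here for illustration; for a concrete kernel-feature pairing these covariances are usually available in closed form.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel on 1-D inputs A (N,) and B (M,)."""
    return variance * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale**2)

# Quadrature grid over X = [0, 1].
x_grid = np.linspace(0.0, 1.0, 400)
dx = x_grid[1] - x_grid[0]

# Illustrative integral-transform features: Gaussian windows phi_m centred at c_m.
centres = np.linspace(0.1, 0.9, 4)                                        # M = 4 features
Phi = np.exp(-0.5 * (x_grid[None, :] - centres[:, None]) ** 2 / 0.05**2)  # (M, G)

K_grid = rbf_kernel(x_grid, x_grid)                 # k(x, x') evaluated on the grid

# [K_uu]_{mn} = Cov(u_m, u_n) = double integral of phi_m(x) k(x, x') phi_n(x')
K_uu = Phi @ K_grid @ Phi.T * dx**2

# [k_u(x_*)]_m = Cov(f(x_*), u_m) = integral of k(x_*, x') phi_m(x')
X_star = np.linspace(0.0, 1.0, 50)
K_su = rbf_kernel(X_star, x_grid) @ Phi.T * dx      # (N_*, M), plays the role of K_{*u}

# K_uu and K_su replace their point-based counterparts, so the predictive from the
# previous sketch applies unchanged, e.g. mean = K_su @ np.linalg.solve(K_uu, m_u).
```

The point of the exercise is that only the covariances \(\operatorname{Cov}(u_m, u_n)\) and \(\operatorname{Cov}(f(\mathbf{x}), u_m)\) change; the rest of the SVGP machinery, including the predictive above, is untouched.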

Figure 2: Pruning the basis features.

1 Incoming

Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes | Louis Tiao

2 References

Dutordoir, Durrande, and Hensman. 2020. “Sparse Gaussian Processes with Spherical Harmonic Features.” In Proceedings of the 37th International Conference on Machine Learning. ICML’20.
Dutordoir, Hensman, van der Wilk, et al. 2021. “Deep Neural Networks as Point Estimates for Deep Gaussian Processes.” arXiv:2105.04504 [cs, stat].
Lázaro-Gredilla, and Figueiras-Vidal. 2009. “Inter-Domain Gaussian Processes for Sparse Inference Using Inducing Features.” In Advances in Neural Information Processing Systems.
Rossi, Heinonen, Bonilla, et al. 2021. “Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.
Shi, Titsias, and Mnih. 2020. “Sparse Orthogonal Variational Inference for Gaussian Processes.” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.
Tiao, Dutordoir, and Picheny. 2023. “Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes.”
van der Wilk, Dutordoir, John, et al. 2020. “A Framework for Interdomain and Multioutput Gaussian Processes.”