GP inducing features
October 16, 2020 — January 21, 2025
I’m short of time, so I’ll quote a summary from Tiao, Dutordoir, and Picheny (2023), which is not pithy but has the information.
The joint distribution of the model augmented by the \(M\) inducing variables \(\mathbf{u}\) is \(p(\mathbf{y}, \mathbf{f}, \mathbf{u})=p(\mathbf{y} \mid \mathbf{f}) p(\mathbf{f}, \mathbf{u})\), where \(p(\mathbf{f}, \mathbf{u})=p(\mathbf{f} \mid \mathbf{u}) p(\mathbf{u})\) for prior \(p(\mathbf{u})=\mathcal{N}\left(\mathbf{0}, \mathbf{K}_{\mathbf{u u}}\right)\) and conditional \[ p(\mathbf{f} \mid \mathbf{u})=\mathcal{N}\left(\mathbf{f} \mid \mathbf{Q}_{\mathrm{fu}} \mathbf{u}, \mathbf{K}_{\mathrm{ff}}-\mathbf{Q}_{\mathrm{ff}}\right) \]
where \(\mathbf{Q}_{\mathrm{ff}} \triangleq \mathbf{Q}_{\mathrm{fu}} \mathbf{K}_{\mathrm{uu}} \mathbf{Q}_{\mathrm{uf}}\) and \(\mathbf{Q}_{\mathrm{fu}} \triangleq \mathbf{K}_{\mathrm{fu}} \mathbf{K}_{\mathrm{uu}}^{-1}\). The joint variational distribution is defined as \(q(\mathbf{f}, \mathbf{u}) \triangleq p(\mathbf{f} \mid \mathbf{u}) q(\mathbf{u})\), where \(q(\mathbf{u}) \triangleq \mathcal{N}\left(\mathbf{m}_{\mathbf{u}}, \mathbf{C}_{\mathbf{u}}\right)\) for variational parameters \(\mathbf{m}_{\mathbf{u}} \in \mathbb{R}^M\) and \(\mathbf{C}_{\mathbf{u}} \in \mathbb{R}^{M \times M}\) s.t. \(\mathbf{C}_{\mathbf{u}} \succeq 0\). Integrating out \(\mathbf{u}\) yields the posterior predictive
\[ q\left(\mathbf{f}_*\right)=\mathcal{N}\left(\mathbf{Q}_{* \mathbf{u}} \mathbf{m}_{\mathbf{u}}, \mathbf{K}_{* *}-\mathbf{Q}_{* \mathbf{u}}\left(\mathbf{K}_{\mathbf{u u}}-\mathbf{C}_{\mathbf{u}}\right) \mathbf{Q}_{\mathbf{u} *}\right) \]
where the parameters \(\mathbf{m}_{\mathbf{u}}\) and \(\mathbf{C}_{\mathbf{u}}\) are learned by minimizing the Kullback-Leibler (KL) divergence between the approximate and exact posterior, \(\mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f} \mid \mathbf{y})]\). Thus seen, the sparse variational GP (SVGP) approximation has time complexity \(\mathcal{O}(M^3)\) at prediction time and \(\mathcal{O}(M^3 + M^2 N)\) during training, for \(N\) training points. In the reproducing kernel Hilbert space (RKHS) associated with \(k\), the predictive has a dual representation in which the mean and covariance share the same basis determined by \(\mathbf{u}\) (Cheng & Boots, 2017; Salimbeni et al., 2018). More specifically, the basis function is effectively the vector-valued function \(\mathbf{k}_{\mathbf{u}}: \mathcal{X} \rightarrow \mathbb{R}^M\) whose \(m\)-th component is defined as
\[ \left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m \triangleq \operatorname{Cov}\left(f(\mathbf{x}), u_m\right) \]
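As a concrete reading of the predictive above, here is a minimal NumPy sketch that computes the mean and marginal variances of \(q(\mathbf{f}_*)\) from precomputed kernel blocks. All names, and the toy RBF kernel, are illustrative assumptions of mine rather than any particular library's API.

```python
import numpy as np

def svgp_predict_diag(K_su, K_uu, k_ss_diag, m_u, C_u, jitter=1e-6):
    """Mean and marginal variances of q(f_*) = N(Q_su m_u, K_ss - Q_su (K_uu - C_u) Q_us).

    K_su: (S, M) cross-covariances Cov(f_*, u); K_uu: (M, M); k_ss_diag: (S,) prior variances.
    """
    M = K_uu.shape[0]
    L = np.linalg.cholesky(K_uu + jitter * np.eye(M))            # O(M^3)
    Q_su = np.linalg.solve(L.T, np.linalg.solve(L, K_su.T)).T    # K_su K_uu^{-1}, O(S M^2)
    mean = Q_su @ m_u
    var = k_ss_diag - np.einsum("sm,mn,sn->s", Q_su, K_uu - C_u, Q_su)
    return mean, var

# Toy usage with an RBF kernel and random pseudo-inputs.
def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 1))                    # M = 10 pseudo-inputs
X_star = np.linspace(-3.0, 3.0, 50)[:, None]    # 50 test points
m_u = rng.normal(size=10)
C_u = 0.1 * np.eye(10)                          # any valid covariance for q(u)
mean, var = svgp_predict_diag(rbf(X_star, Z), rbf(Z, Z), np.full(50, 1.0), m_u, C_u)
```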
In the standard definition of inducing points, \(\left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m = k\left(\mathbf{z}_m, \mathbf{x}\right)\), so the basis function is solely determined by \(k\) and the local influence of pseudo-input \(\mathbf{z}_m\). Inter-domain inducing features are a generalisation of standard inducing variables in which each variable \(u_m \triangleq L_m[f]\) for some linear operator \(L_m: \mathbb{R}^{\mathcal{X}} \rightarrow \mathbb{R}\). A particularly useful operator is the integral transform, \(L_m[f] \triangleq \int_{\mathcal{X}} f(\mathbf{x}) \phi_m(\mathbf{x}) \mathrm{d} \mathbf{x}\), which was originally employed by Lázaro-Gredilla & Figueiras-Vidal (2009). Refer to the manuscript of van der Wilk et al. (2020) for a more thorough and contemporary treatment. A closely related form is the scalar projection of \(f\) onto some \(\phi_m\) in the RKHS \(\mathcal{H}\),
\[ L_m[f] \triangleq\left\langle f, \phi_m\right\rangle_{\mathcal{H}} \] which leads to \(\left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m=\phi_m(\mathbf{x})\) by the reproducing property of the RKHS. This, in effect, equips the GP approximation with basis functions \(\phi_m\) that are not solely determined by the kernel, and suitable choices can lead to sparser representations and considerable computational benefits (Hensman et al., 2018; Burt et al., 2020; Dutordoir et al., 2020; Sun et al., 2021).
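To spell out the step that the quote compresses: the covariances that populate \(\mathbf{k}_{\mathbf{u}}\) and \(\mathbf{K}_{\mathbf{u u}}\) follow, at least formally, from pushing the linear operators through the covariance. For the integral-transform features, \[ \left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m=\operatorname{Cov}\left(f(\mathbf{x}), \int_{\mathcal{X}} f\left(\mathbf{x}^{\prime}\right) \phi_m\left(\mathbf{x}^{\prime}\right) \mathrm{d} \mathbf{x}^{\prime}\right)=\int_{\mathcal{X}} k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \phi_m\left(\mathbf{x}^{\prime}\right) \mathrm{d} \mathbf{x}^{\prime}, \] while for the RKHS projections the reproducing property gives \[ \left[\mathbf{k}_{\mathbf{u}}(\mathbf{x})\right]_m=\left\langle k(\mathbf{x}, \cdot), \phi_m\right\rangle_{\mathcal{H}}=\phi_m(\mathbf{x}), \qquad \left[\mathbf{K}_{\mathbf{u u}}\right]_{m m^{\prime}}=\left\langle\phi_m, \phi_{m^{\prime}}\right\rangle_{\mathcal{H}} . \] If one is willing to assume, purely for illustration, that the \(\phi_m\) are orthonormal in \(\mathcal{H}\) so that \(\mathbf{K}_{\mathbf{u u}}=\mathbf{I}\), the predictive above collapses to an ordinary basis expansion and the \(\mathcal{O}(M^3)\) solve disappears at prediction time. A minimal NumPy sketch under that assumption (names again illustrative):

```python
import numpy as np

def interdomain_predict_diag(Phi_s, k_ss_diag, m_u, C_u):
    """q(f_*) for RKHS-projection features assumed orthonormal in H, so K_uu = I.

    Phi_s: (S, M) matrix of basis-function evaluations phi_m(x_s), which here plays
    the role of both K_su and Q_su; no Cholesky factorisation of K_uu is needed.
    """
    mean = Phi_s @ m_u
    I = np.eye(len(m_u))
    var = k_ss_diag - np.einsum("sm,mn,sn->s", Phi_s, I - C_u, Phi_s)
    return mean, var
```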
Incoming
Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes | Louis Tiao