Positive (semi-)definite kernels
Covariance functions, Mercer kernels, positive definite functions, spare reproducing kernels for that Hilbert space I bought on eBay real cheap
September 16, 2019 — January 5, 2021
On the interpretation of kernels as the covariance functions of stochastic processes, which is one way to define stochastic processes.
Suppose we have a real-valued stochastic process
\[\{\mathsf{x}(t)\}_{t\in \mathcal{T}} \] For now, we may as well take the index set to be \(\mathcal{T}\subseteq\mathbb{R}^D\), or at least a nice metric space.
The covariance kernel of \(\mathsf{x}\) is a function
\[\begin{aligned}\kappa:\ &\mathcal{T}\times \mathcal{T}\to \mathbb{R}\\ &(s,t)\mapsto\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)). \end{aligned}\]
This is covariance in the usual sense, to wit,
\[\begin{aligned} \operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)) &:=\mathbb{E}\big[(\mathsf{x}(s)-\mathbb{E}[\mathsf{x}(s)])(\mathsf{x}(t)-\mathbb{E}[\mathsf{x}(t)])\big]\\ &=\mathbb{E}[\mathsf{x}(s)\mathsf{x}(t)]- \mathbb{E}[\mathsf{x}(s)]\mathbb{E}[\mathsf{x}(t)]. \end{aligned}\]
These are useful objects. In spatial statistics, Gaussian processes, kernel machines and covariance estimation we are concerned with such covariances between the values of stochastic processes at different indices. The Karhunen–Loève transform decomposes a stochastic process into a basis of eigenfunctions of the covariance kernel operator. We might instead consider them in terms of processes defined through convolution.
Any process with finite second moments has a covariance function. Covariance functions are especially prominent in Gaussian process methods, since a Gaussian process is uniquely specified by its mean function and covariance kernel, and has the usual convenient algebraic properties by virtue of being Gaussian.
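To make this concrete, here is a minimal numpy sketch (my addition, not part of the original notes) that estimates the covariance function of standard Brownian motion from simulated paths and compares it against the known kernel \(\kappa(s,t)=\min(s,t)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, T = 20_000, 100, 1.0
dt = T / n_steps
t = np.linspace(dt, T, n_steps)

# Brownian motion: cumulative sums of independent N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)

# Empirical covariance kernel, estimated across the sample paths.
emp_cov = np.cov(paths, rowvar=False)

# Theoretical covariance kernel of Brownian motion: kappa(s, t) = min(s, t).
theory = np.minimum.outer(t, t)

print(np.abs(emp_cov - theory).max())  # small, and shrinks as n_paths grows
```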
TODO: relate to representer theorems.
1 Covariance kernels of some example processes
1.1 A simple Markov chain
Consider a homogeneous continuous-time Markov process taking values in \(\{0,1\}\). Suppose it has transition rate matrix
\[\left[\begin{array}{cc} -\lambda & \lambda\\ \lambda & -\lambda \end{array}\right] \] and moreover, that we start the chain from the stationary distribution, \([\frac 1 2\; \frac 1 2]^\top,\) which implies that \(\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(s+t))=\operatorname{Cov}(\mathsf{x}(0),\mathsf{x}(t))\) for all \(s\), and further, that \(\mathbb{E}[\mathsf{x}(t)]=\frac 1 2 \,\forall t\). So we know that \(\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(s+t))=\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]- \frac 1 4.\) What is \(\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]\)?
\[\begin{aligned} \mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)] &=\mathbb{P}[\{\mathsf{x}(0)=1\}\cap\{\mathsf{x}(t)=1\}]\\ &=\mathbb{P}[\mathsf{x}(0)=1]\,\mathbb{P}[\text{number of jumps on }[0,t]\text{ is even}]\\ &=\tfrac{1}{2}\mathbb{P}[\mathsf{z}\text{ is even}]\text{ where } \mathsf{z}\sim\operatorname{Poisson}(\lambda t)\\ &=\tfrac{1}{2}\sum_{k=0}^{\infty}\frac{(\lambda t)^{2k} \exp(-\lambda t)}{(2k)!}\\ &= \tfrac{1}{2}\exp(-\lambda t)\cdot\frac{\exp(\lambda t) + \exp(-\lambda t)}{2} &\text{Taylor expansion of }\cosh(\lambda t)\\ &= \frac{\exp(-2\lambda t)}{4} + \frac{1}{4} \end{aligned}\]
From this we deduce that \(\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(s+t))=\frac{\exp(-2\lambda t)}{4}.\)
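As a sanity check on this formula, here is a quick Monte Carlo sketch (mine, exploiting the same facts as the derivation above: the number of jumps on \([0,t]\) is \(\operatorname{Poisson}(\lambda t)\) and each jump flips the state):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t, n = 1.5, 0.7, 1_000_000

# Start from the stationary distribution [1/2, 1/2].
x0 = rng.integers(0, 2, size=n)
# The number of jumps on [0, t] is Poisson(lambda * t); each jump flips the state.
jumps = rng.poisson(lam * t, size=n)
xt = (x0 + jumps) % 2

emp = np.cov(x0, xt)[0, 1]
theory = np.exp(-2 * lam * t) / 4
print(emp, theory)  # these agree to within Monte Carlo error
```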
Question: what functions are admissible as covariance kernels for Markov chains?
1.2 The Hawkes process
Covariance kernels are also important in various point processes. Notably, the Hawkes process was introduced in terms of its covariance. 🚧
1.3 Gaussian processes
Handling some processes in terms of their covariances is particularly convenient. Specifically, a Gaussian process is uniquely specified by its mean function and covariance function.
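As an illustration (a minimal sketch of my own, with the squared exponential kernel as an arbitrary stand-in choice), any finite marginal of a Gaussian process is a multivariate normal whose mean vector and covariance matrix come from evaluating the mean and covariance functions on a grid:

```python
import numpy as np

def k_se(s, t, scale=0.5):
    """Squared exponential kernel, a classic positive semidefinite choice."""
    return np.exp(-((s - t) ** 2) / (2 * scale ** 2))

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 200)
mean = np.zeros_like(t)              # mean function m(t) = 0
gram = k_se(t[:, None], t[None, :])  # Gram matrix K[i, j] = k(t_i, t_j)
gram += 1e-9 * np.eye(len(t))        # tiny jitter for numerical stability

# Draw three sample paths of the GP restricted to the grid.
samples = rng.multivariate_normal(mean, gram, size=3)
```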
2 General real covariance kernels
A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}\) can be a covariance kernel if
- It is symmetric in its arguments \(K(s,t)=K(t,s)\) (more generally, conjugate symmetric — \(K(s,t)=K^*(t,s)\), but I think maybe my life will be simpler if I ignore the complex case for the moment.)
- It is positive semidefinite.
Positive semidefiniteness here means that for arbitrary real numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\in\mathcal{T}\)
\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0 \]
The interpretation is that the finite-dimensional distributions of the process must have valid covariance matrices, so it is necessary that
\[ \operatorname{Var}\left(c_{1} \mathsf{x}(t_{1})+\cdots+c_{k} \mathsf{x}(t_{k})\right)= \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0. \]
This arises from the constraint on the covariance matrix \(\operatorname{Var}(\mathbf X)\) of any random vector \(\mathbf X\in \mathbb{R}^d\), which requires that for all \(\mathbf {b}\in \mathbb{R}^d\)
\[ \operatorname{Var}(\mathbf{b}^{\top}\mathbf{X}) =\mathbf{b}^{\top}\operatorname{Var}(\mathbf{X})\mathbf{b}\geq 0, \]
since the variance of any scalar random variable is non-negative.
Question: What can we say about this covariance if every element of \(\mathbf X\) is non-negative?
Amazingly (to me), this necessary condition is also sufficient for a function to be a covariance kernel. In practice, designing covariance functions directly from the positive semidefiniteness condition is tricky: the condition quantifies over all finite sets of indices and gives no constructive recipe, so the space of positive semidefinite kernels is only implicitly characterised. What we normally do is find a fun class of kernels that guarantees positive semidefiniteness and riffle through that. Most of the rest of this notebook is devoted to such classes.
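One pragmatic diagnostic (my sketch, not from the original notes): evaluate a candidate kernel on a finite grid of indices and check whether the resulting Gram matrix has any substantially negative eigenvalues. Passing on a grid is of course only necessary, never sufficient, evidence of positive semidefiniteness:

```python
import numpy as np

def min_eig_on_grid(k, t):
    """Smallest eigenvalue of the Gram matrix of kernel k on grid t.
    Non-negative (up to roundoff) is necessary for k to be a covariance kernel."""
    gram = k(t[:, None], t[None, :])
    return np.linalg.eigvalsh(gram).min()

t = np.linspace(0, 1, 100)
# An exponential (Ornstein-Uhlenbeck-type) kernel: positive semidefinite.
print(min_eig_on_grid(lambda s, u: np.exp(-np.abs(s - u)), t))  # >= 0 up to roundoff
# A symmetric function that is NOT a valid kernel: its Gram matrix has zero trace
# but nonzero entries, so it must have a negative eigenvalue.
print(min_eig_on_grid(lambda s, u: np.sin(np.abs(s - u)), t))   # clearly negative
```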
3 Bonus: complex covariance kernels
I talked in terms of real kernels above because I generally observe real-valued measurements of processes. But often complex covariances arise in a natural way too.
A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{C}\) can be a covariance kernel if
- It is conjugate symmetric in its arguments: \(K(s,t)=K^*(t,s)\),
- It is positive semidefinite.
Positive semidefiniteness here means that for arbitrary complex numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\in\mathcal{T}\)
\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}} K(t_{i}, t_{j}) \geq 0. \]
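As a concrete check (my worked example, not in the original): the complex exponential kernel \(K(s,t)=e^{i\omega(s-t)}\) for fixed \(\omega\in\mathbb{R}\) is conjugate symmetric, and the quadratic form above factors into a squared modulus:
\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}}\, e^{i\omega(t_i - t_j)} = \left(\sum_{i=1}^{k} c_i e^{i\omega t_i}\right)\overline{\left(\sum_{j=1}^{k} c_j e^{i\omega t_j}\right)} = \left|\sum_{i=1}^{k} c_i e^{i\omega t_i}\right|^2 \geq 0. \]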
Every analytic real kernel should also be a complex kernel, right? 🏗
4 Kernel zoo
For some examples of covariance kernels, see the kernel zoo.
5 Learning kernels
See learning kernels.
6 Non-positive kernels
That is, kernels which are not positive semidefinite. See, e.g., (Ong et al. 2004; Saha and Balamurugan 2020). 🏗