Covariance estimation
Esp Gaussian
November 17, 2014 — April 26, 2023
Estimating the thing that is given to you by oracles in statistics homework assignments: the covariance matrix. Or, if the data is indexed by some parameter, we might consider the covariance kernel. We are especially interested in this for Gaussian processes, where the covariance structure characterizes the process up to its mean.
I am not introducing a complete theory of covariance estimation here, merely some notes.
Two big data problems can arise here: large \(p\) (ambient dimension) and large \(n\) (sample size). Large \(p\) is a problem because the covariance matrix is a \(p \times p\) matrix and frequently we need to invert it to calculate some target estimand.
With Gaussian structure, life can often be made not too bad for large \(n\) because, essentially, the problem tends to have nice, nearly low-rank structure that we can exploit.
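To make the large-\(p\) problem concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the dimensions are arbitrary) comparing the empirical covariance with a Ledoit-Wolf shrinkage estimate, which keeps the estimate better conditioned when \(p\) is not small relative to \(n\):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf

rng = np.random.default_rng(0)
n, p = 50, 40                      # few samples relative to the dimension
X = rng.standard_normal((n, p))    # stand-in data; true covariance is the identity

emp = EmpiricalCovariance().fit(X)
lw = LedoitWolf().fit(X)

# The empirical covariance is poorly conditioned when p is close to n,
# so its inverse (the precision matrix) is numerically fragile.
print("condition number, empirical:  ", np.linalg.cond(emp.covariance_))
print("condition number, Ledoit-Wolf:", np.linalg.cond(lw.covariance_))
print("shrinkage weight chosen:      ", lw.shrinkage_)
```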
1 Bayesian
Inverse Wishart priors. 🏗 Other?
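A minimal sketch of the conjugate inverse-Wishart update, assuming zero-mean Gaussian data and SciPy's `invwishart`; the prior hyperparameters here are illustrative only:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
p, n = 3, 100
X = rng.standard_normal((n, p))           # zero-mean observations

# Prior: Sigma ~ InvWishart(nu0, Psi0)
nu0 = p + 2                               # weak prior degrees of freedom
Psi0 = np.eye(p)

# Conjugate posterior for zero-mean Gaussian data:
#   Sigma | X ~ InvWishart(nu0 + n, Psi0 + X'X)
S = X.T @ X
post = invwishart(df=nu0 + n, scale=Psi0 + S)

print("posterior mean of Sigma:\n", post.mean())
print("a posterior draw:\n", post.rvs(random_state=rng))
```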
2 Precision estimation
The workhorse of learning graphical models under linearity and Gaussianity. See precision estimation for a more complete treatment.
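For flavour, a minimal sketch using scikit-learn's graphical lasso to estimate a sparse precision matrix under a Gaussian model; the regularization strength `alpha` is arbitrary here:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))

# L1-penalized maximum likelihood for the precision matrix.
model = GraphicalLasso(alpha=0.1).fit(X)
precision = model.precision_

# Zeros in the precision matrix correspond to conditional independences
# in the implied Gaussian graphical model.
print("nonzero entries in the estimated precision:", np.count_nonzero(precision))
```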
3 Continuous
See kernel learning.
4 Parametric
4.1 Cholesky methods
4.2 On a lattice
Estimating a stationary covariance function on a regular lattice? That is a whole field of its own. Useful keywords include circulant embedding. Although the problem is strictly more general than Gaussian processes on a lattice, it usually arises in that context, so some extra results live on that page for now.
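A minimal sketch of circulant embedding in one dimension, assuming NumPy and a squared-exponential covariance as a stand-in kernel: embed the Toeplitz covariance of a regular grid in a circulant matrix, diagonalize it with the FFT, and use the (hopefully nonnegative) eigenvalues for fast sampling or matrix-vector products.

```python
import numpy as np

def circulant_embedding_sample(cov_fn, n, rng):
    """Draw two independent stationary Gaussian samples on the grid 0..n-1."""
    # First row of the Toeplitz covariance matrix on the grid.
    c = cov_fn(np.arange(n).astype(float))
    # Embed in a circulant vector of length 2(n-1) by wrapping the lags around.
    c_circ = np.concatenate([c, c[-2:0:-1]])
    m = c_circ.size
    # Circulant matrices are diagonalized by the DFT; these are the eigenvalues.
    lam = np.fft.fft(c_circ).real
    # The embedding is only valid if all eigenvalues are nonnegative;
    # tiny negative values are clipped here as a pragmatic hack.
    lam = np.clip(lam, 0.0, None)
    xi = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    y = np.sqrt(m) * np.fft.ifft(np.sqrt(lam) * xi)
    # Real and imaginary parts are two independent draws with the target covariance.
    return y.real[:n], y.imag[:n]

se_kernel = lambda h: np.exp(-0.5 * (h / 10.0) ** 2)   # squared exponential, lengthscale 10
x1, x2 = circulant_embedding_sample(se_kernel, 512, np.random.default_rng(0))
print(x1[:5])
```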
5 Unordered
Thanks to Rothman (2010), I now think of covariance estimation as a different problem for ordered versus exchangeable data.
6 Sandwich estimators
For robust covariances of vector data, a.k.a. heteroskedasticity-consistent covariance estimators. This family includes the Eicker-Huber-White sandwich estimator, the Andrews kernel HAC estimator, Newey-West, and others. For an intro, see Achim Zeileis, Open-Source Econometric Computing in R.
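For a Python flavour of the same idea, statsmodels exposes heteroskedasticity-consistent and HAC covariance options on fitted regressions. A minimal sketch with simulated data and an arbitrary lag choice:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
# Heteroskedastic noise, so the robust standard errors differ from the classical ones.
e = rng.standard_normal(n) * (1.0 + np.abs(x))
y = 1.0 + 2.0 * x + e

X = sm.add_constant(x)
ols = sm.OLS(y, X)
fit_hc = ols.fit(cov_type="HC3")                            # Eicker-Huber-White flavour
fit_hac = ols.fit(cov_type="HAC", cov_kwds={"maxlags": 4})  # Newey-West style kernel HAC

print("classical SEs:", ols.fit().bse)
print("HC3 SEs:      ", fit_hc.bse)
print("HAC SEs:      ", fit_hac.bse)
```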
7 Incoming
- Basic inference using an Inverse Wishart prior, with a simple “process model” that inflates the uncertainty of the covariance estimate over time (see the sketch after this list).
- general moment combination tricks
- John Cook’s comparison of standard deviation estimation tricks
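One way to read the first bullet above: maintain an inverse-Wishart posterior and, between batches, discount its degrees of freedom so that old data count for less, which inflates the uncertainty again. A hypothetical sketch of such a discounting scheme (the discount factor, hyperparameters, and helper names are made up for illustration):

```python
import numpy as np

def iw_update(nu, Psi, X):
    """Conjugate inverse-Wishart update for zero-mean Gaussian data."""
    return nu + X.shape[0], Psi + X.T @ X

def iw_discount(nu, Psi, p, rho=0.95):
    """Forget a fraction of the accumulated evidence.

    Scaling both Psi and the excess degrees of freedom by rho keeps the
    posterior mean Psi / (nu - p - 1) fixed while widening the posterior,
    mimicking a 'process model' under which the covariance may have drifted.
    """
    return p + 1 + rho * (nu - p - 1), rho * Psi

rng = np.random.default_rng(0)
p = 3
nu, Psi = p + 2.0, np.eye(p)
for _ in range(10):                     # a stream of data batches
    X = rng.standard_normal((20, p))
    nu, Psi = iw_update(nu, Psi, X)
    nu, Psi = iw_discount(nu, Psi, p)   # keep the estimate responsive to drift
print("effective degrees of freedom:", nu)
```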
8 Bounding by harmonic and arithmetic means
There are some known bounds for the univariate case. Wikipedia says, in Relations with the harmonic and arithmetic means, that it has been shown (Mercer 2000) that for a sample \(\left\{y_i\right\}\) of positive real numbers, \[ \sigma_y^2 \leq 2 y_{\max }(A-H) \] where \(y_{\max }\) is the maximum of the sample, \(A\) is the arithmetic mean, \(H\) is the harmonic mean of the sample, and \(\sigma_y^2\) is the (biased) variance of the sample. This bound has since been improved: the variance is bounded by \[ \begin{gathered} \sigma_y^2 \leq \frac{y_{\max }(A-H)\left(y_{\max }-A\right)}{y_{\max }-H}, \\ \sigma_y^2 \geq \frac{y_{\min }(A-H)\left(A-y_{\min }\right)}{H-y_{\min }}, \end{gathered} \] where \(y_{\min }\) is the minimum of the sample (Sharma 2008).
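A quick numerical sanity check of those bounds on a random positive sample (NumPy assumed; “biased” variance means dividing by \(n\)):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=1000)   # any positive sample will do

A = y.mean()                         # arithmetic mean
H = y.size / np.sum(1.0 / y)         # harmonic mean
var = y.var()                        # biased (divide-by-n) sample variance
y_max, y_min = y.max(), y.min()

assert var <= 2 * y_max * (A - H)                                  # Mercer (2000)
assert var <= y_max * (A - H) * (y_max - A) / (y_max - H)          # Sharma (2008), upper
assert var >= y_min * (A - H) * (A - y_min) / (H - y_min)          # Sharma (2008), lower
print("all bounds hold for this sample")
```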
Mond and Pečarić (1996) say
Let us define the arithmetic mean of \(A\) with weight \(w\) as \[ A_n(A ; w)=\sum_{i=1}^n w_i A_i \] and the harmonic mean of \(A\) with weight \(w\) as \[ H_n(A ; w)=\left(\sum_{i=1}^n w_i A_i^{-1}\right)^{-1} \] It is well known \([2,5]\) that \[ H_n(A ; w) \leqslant A_n(A ; w) \] Moreover, if \(A_{i j}(i, j=1, \ldots, n)\) are positive definite matrices from \(H_m\), then the following inequality is also valid [1]: \[ \frac{1}{n} \sum_{j=1}^n\left(\frac{1}{n} \sum_{i=1}^n A_{i j}^{-1}\right)^{-1} \leqslant\left[\frac{1}{n} \sum_{i=1}^n\left(\frac{1}{n} \sum_{j=1}^n A_{i j}\right)^{-1}\right]^{-1} \]
For multivariate covariances we are interested in the PSD-matrix version of these bounds.
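A numerical illustration of the matrix harmonic-arithmetic mean inequality quoted above, checking that the gap is positive semi-definite in the Loewner order (NumPy assumed; random positive definite matrices stand in for covariances):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 6

# Random positive definite matrices standing in for covariance estimates.
As = []
for _ in range(n):
    B = rng.standard_normal((p, p))
    As.append(B @ B.T + 0.1 * np.eye(p))

w = np.full(n, 1.0 / n)                                    # equal weights
A_mean = sum(wi * Ai for wi, Ai in zip(w, As))             # arithmetic mean A_n(A; w)
H_mean = np.linalg.inv(sum(wi * np.linalg.inv(Ai) for wi, Ai in zip(w, As)))  # harmonic mean H_n(A; w)

# H_n(A; w) <= A_n(A; w) in the Loewner (PSD) order,
# i.e. the difference has no negative eigenvalues (up to rounding).
gap_eigs = np.linalg.eigvalsh(A_mean - H_mean)
print("smallest eigenvalue of A_mean - H_mean:", gap_eigs.min())
```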