Matrix square roots

Whitening, preconditioning etc

August 5, 2014 — May 13, 2023

feature construction
functional analysis
high d
linear algebra
networks
probability
signal processing
sparser than thou
statistics

Assumed audience:

People with undergrad linear algebra

Figure 1: Interpretations and tricks for matrix square roots.

Two different things are referred to as the matrix square root of a Hermitian matrix \(\mathrm{M}\):

  1. \(\mathrm{X}\) such that \(\mathrm{X}\mathrm{X}=\mathrm{M}\), and
  2. \(\mathrm{X}\) such that \(\mathrm{X}\mathrm{X}^{\top}=\mathrm{M}\), when \(\mathrm{M}\) is positive semi-definite.

Sometimes either will do; other times we care about which. Sometimes we want a particular structure, e.g. lower triangular as in the Cholesky decomposition.
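For concreteness, here is a minimal numpy/scipy sketch of the two senses, using the symmetric principal root for the first and the lower Cholesky factor for the second (my example, not from any of the referenced papers):

```python
import numpy as np
from scipy.linalg import sqrtm, cholesky

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 4))
M = Z @ Z.T + 1e-3 * np.eye(4)    # a positive-definite test matrix

# Sense 1: X such that X @ X == M (the symmetric principal square root)
X1 = sqrtm(M)
print(np.allclose(X1 @ X1, M))    # True

# Sense 2: X such that X @ X.T == M, e.g. the lower-triangular Cholesky factor
X2 = cholesky(M, lower=True)
print(np.allclose(X2 @ X2.T, M))  # True
```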

1 Almost low-rank roots

Square roots of nearly-low-rank matrices do not obviously inherit cheap low-rank structure, but there are efficient algorithms for finding them, at least in the low-rank-plus-scaled-identity case. Fasi, Higham, and Liu (2023):

We consider the problem of computing the square root of a perturbation of the scaled identity matrix, \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), where \(\mathrm{U}\) and \(\mathrm{V}\) are \(n \times k\) matrices with \(k \leq n\). This problem arises in various applications, including computer vision and optimization methods for machine learning. We derive a new formula for the \(p\)th root of \(\mathrm{A}\) that involves a weighted sum of powers of the \(p\)th root of the \(k \times k\) matrix \(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\). This formula is particularly attractive for the square root, since the sum has just one term when \(p=2\). We also derive a new class of Newton iterations for computing the square root that exploit the low-rank structure.

Their method works for low-rank perturbations of a scaled identity matrix with no eigenvalues on the negative real axis.

Theorem: Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) and assume that \(\mathrm{V}^{H} \mathrm{U}\) is nonsingular. Let \(f\) be defined on the spectrum of \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), and if \(k=n\) let \(f\) be defined at \(\alpha\). Then \[\quad f(\mathrm{A})=f(\alpha) \mathrm{I}_n+\mathrm{U}\left(\mathrm{V}^{H} \mathrm{U}\right)^{-1}\left(f\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)-f(\alpha) \mathrm{I}_k\right) \mathrm{V}^{H}.\] The theorem says two things: that \(f(\mathrm{A})\), like \(\mathrm{A}\), is a perturbation of rank at most \(k\) of the identity matrix and that \(f(\mathrm{A})\) can be computed by evaluating \(f\) and the inverse at two \(k \times k\) matrices.
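Here is a quick numerical sanity check of that theorem in numpy/scipy, taking \(f=\exp\) as a stand-in generic matrix function; the scaling of \(\mathrm{U},\mathrm{V}\) is arbitrary and the code is my own sketch, not theirs:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, k, alpha = 40, 3, 0.7
U = rng.standard_normal((n, k)) / np.sqrt(n)
V = rng.standard_normal((n, k)) / np.sqrt(n)
A = alpha * np.eye(n) + U @ V.T

# f(A) via the theorem: f and one inverse evaluated on k-by-k matrices only
C = V.T @ U                       # assumed nonsingular
small = np.linalg.solve(C, expm(alpha * np.eye(k) + C) - np.exp(alpha) * np.eye(k))
F_lowrank = np.exp(alpha) * np.eye(n) + U @ small @ V.T

print(np.allclose(expm(A), F_lowrank))  # True, up to floating point
```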

I would call this a generalized Woodbury formula, and I think it is pretty cool, which tells you something about my current obsession profile. Anyway, they use it to discover the following:

Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) have full rank and let the matrix \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on \(\mathbb{R}^{-}\). Then for any integer \(p \geq 1\), \[ \mathrm{A}^{1 / p}=\alpha^{1 / p} \mathrm{I}_n+\mathrm{U}\left(\sum_{i=0}^{p-1} \alpha^{i / p} \cdot\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{(p-i-1) / p}\right)^{-1} \mathrm{V}^{H} \]

and in particular

Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) have full rank and let the matrix \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on \(\mathbb{R}^{-}\). Then \[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{U}\left(\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{V}^{H}. \]
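Again as a sanity check, a numpy/scipy sketch of this square-root formula against a dense reference; I scale \(\mathrm{U},\mathrm{V}\) down so that \(\mathrm{A}\) stays well clear of the negative real axis (my code, not theirs):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n, k, alpha = 50, 3, 2.0
U = rng.standard_normal((n, k)) / np.sqrt(n)
V = rng.standard_normal((n, k)) / np.sqrt(n)
A = alpha * np.eye(n) + U @ V.T

# dense reference: principal square root of the full n-by-n matrix
S_dense = sqrtm(A)

# low-rank formula: only a k-by-k square root and a k-by-k solve are needed
core = sqrtm(alpha * np.eye(k) + V.T @ U) + np.sqrt(alpha) * np.eye(k)
S_lowrank = np.sqrt(alpha) * np.eye(n) + U @ np.linalg.solve(core, V.T)

print(np.allclose(S_dense, S_lowrank))  # True
```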

They also derive an explicit iterative scheme for calculating it, based on the “Denman–Beavers iteration”:

The (scaled) DB iteration is \[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}, \end{aligned} \] where the positive scaling parameter \(\mu_i \in \mathbb{R}\) can be used to accelerate the convergence of the method in its initial steps. The choice \(\mu_i=1\) yields the unscaled DB method, for which \(\mathrm{X}_i\) and \(\mathrm{Y}_i\) converge quadratically to \(\mathrm{A}^{1 / 2}\) and \(\mathrm{A}^{-1 / 2}\), respectively.

… for \(i \geq 0\) the iterates \(\mathrm{X}_i\) and \(\mathrm{Y}_i\) can be written in the form \[ \begin{aligned} & \mathrm{X}_i=\beta_i \mathrm{I}_n+\mathrm{U} \mathrm{B}_i \mathrm{V}^{H}, \quad \beta_i \in \mathbb{C}, \quad \mathrm{B}_i \in \mathbb{C}^{k \times k}, \\ & \mathrm{Y}_i=\gamma_i \mathrm{I}_n+\mathrm{U} \mathrm{C}_i \mathrm{V}^{H}, \quad \gamma_i \in \mathbb{C}, \quad \mathrm{C}_i \in \mathbb{C}^{k \times k} . \end{aligned}\]

This gets us a computational speed-up, although of a rather complicated kind. For a start, its constant factor is favourable compared to the naive approach, but it also scales somewhat favourably with \(n\): less than cubic, although more than quadratic, depending on how quickly the iteration converges. That convergence rate depends both on the problem and upon good selection of the scaling parameters, which they give a recipe for, but it gets kinda complicated and engineering-y.
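For reference, the plain dense unscaled (\(\mu_i=1\)) DB iteration from the quote above looks like the following numpy sketch; the paper's trick is to run the equivalent recursion on the small \(k \times k\) matrices \(\mathrm{B}_i, \mathrm{C}_i\) instead, which I do not attempt here:

```python
import numpy as np

def db_sqrt(A, iters=25):
    """Unscaled Denman-Beavers iteration: X_i -> A^{1/2}, Y_i -> A^{-1/2}.
    Assumes A has no eigenvalues on the negative real axis."""
    X, Y = A.astype(float), np.eye(A.shape[0])
    for _ in range(iters):
        X_inv, Y_inv = np.linalg.inv(X), np.linalg.inv(Y)  # invert the *old* iterates
        X, Y = 0.5 * (X + Y_inv), 0.5 * (Y + X_inv)
    return X, Y

rng = np.random.default_rng(2)
Z = rng.standard_normal((6, 2))
A = 2.0 * np.eye(6) + Z @ Z.T
X, Y = db_sqrt(A)
print(np.allclose(X @ X, A), np.allclose(X @ Y, np.eye(6)))  # True True
```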

Anyway, let us suppose \(\mathrm{U}=\mathrm{V}=\mathrm{Z}\) and replay that. Then \[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{Z}\left(\left(\alpha \mathrm{I}_k+\mathrm{Z}^{H} \mathrm{Z}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{Z}^{H}. \]

The Denman–Beavers step becomes \[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}. \end{aligned} \]

This looks useful, although note that it is giving us a full-size square root.
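In the whitening-ish use cases I care about we mostly want to apply the root to vectors, in which case we never need to materialise the \(n\times n\) root at all; a hedged numpy sketch under the \(\mathrm{U}=\mathrm{V}=\mathrm{Z}\) assumption (function name is mine), costing \(O(nk)\) per product after an \(O(nk^2+k^3)\) setup:

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrt_matvec_factory(Z, alpha):
    """Return a function x -> (alpha*I + Z @ Z.T)^{1/2} @ x that never forms the n-by-n root."""
    k = Z.shape[1]
    core = sqrtm(alpha * np.eye(k) + Z.T @ Z) + np.sqrt(alpha) * np.eye(k)
    W = np.linalg.solve(core, Z.T)           # k-by-n, computed once
    return lambda x: np.sqrt(alpha) * x + Z @ (W @ x)

rng = np.random.default_rng(3)
n, k, alpha = 200, 4, 1.0
Z = rng.standard_normal((n, k))
x = rng.standard_normal(n)

apply_root = sqrt_matvec_factory(Z, alpha)
A_half = sqrtm(alpha * np.eye(n) + Z @ Z.T)      # dense reference
print(np.allclose(apply_root(x), A_half @ x))    # True
```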

2 Diagonal-plus-low-rank Cholesky

Louis Tiao, in Efficient Cholesky decomposition of low-rank updates, summarises Seeger (2004).
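I have not worked through Seeger's note in detail, but the classic rank-one update underlying it goes roughly like this numpy sketch (a standard textbook algorithm; the implementation and naming are mine):

```python
import numpy as np

def chol_rank1_update(L, x):
    """Given lower-triangular L with L @ L.T == M, return the Cholesky factor
    of M + x x^T in O(n^2), without refactorising from scratch."""
    L, x = L.copy(), x.copy()
    n = x.size
    for j in range(n):
        r = np.hypot(L[j, j], x[j])
        c, s = r / L[j, j], x[j] / L[j, j]
        L[j, j] = r
        if j + 1 < n:
            L[j+1:, j] = (L[j+1:, j] + s * x[j+1:]) / c
            x[j+1:] = c * x[j+1:] - s * L[j+1:, j]
    return L

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
M = B @ B.T + np.eye(5)
x = rng.standard_normal(5)
L1 = chol_rank1_update(np.linalg.cholesky(M), x)
print(np.allclose(L1 @ L1.T, M + np.outer(x, x)))  # True
```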

3 References

Del Moral, and Niclas. 2018. “A Taylor Expansion of the Square Root Matrix Functional.”
Dolcetti, and Pertici. 2020. “Real Square Roots of Matrices: Differential Properties in Semi-Simple, Symmetric and Orthogonal Cases.”
Fasi, Higham, and Liu. 2023. “Computing the Square Root of a Low-Rank Perturbation of the Scaled Identity Matrix.” SIAM Journal on Matrix Analysis and Applications.
Kessy, Lewin, and Strimmer. 2018. “Optimal Whitening and Decorrelation.” The American Statistician.
Minka. 2000. Old and New Matrix Algebra Useful for Statistics.
Petersen, and Pedersen. 2012. “The Matrix Cookbook.”
Pleiss, Jankowiak, Eriksson, et al. 2020. “Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization.” Advances in Neural Information Processing Systems.
Saad. 2003. Iterative Methods for Sparse Linear Systems: Second Edition.
Seeger. 2004. Low Rank Updates for the Cholesky Decomposition.
Song, Sebe, and Wang. 2022. “Fast Differentiable Matrix Square Root.”