Matrix square roots
Whitening, preconditioning etc
August 5, 2014 — May 13, 2023
Assumed audience:
People with undergrad linear algebra
Interpretations and tricks for matrix square roots.
There are two kinds of object in circulation that get called a matrix square root of a Hermitian matrix \(\mathrm{M}\):
- \(\mathrm{X}\) such that \(\mathrm{X}\mathrm{X}=\mathrm{M}\), and
- \(\mathrm{X}\) such that \(\mathrm{X}\mathrm{X}^{\top}=\mathrm{M}\), when \(\mathrm{M}\) is positive semi-definite.
Sometimes either will do; other times we care about which. Sometimes we want a particular structure, e.g. lower triangular as in the Cholesky decomposition.
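To make the distinction concrete, here is a minimal numpy sketch (a random positive-definite \(\mathrm{M}\); the variable names are mine): the first kind is e.g. the symmetric square root from the eigendecomposition, the second is e.g. the Cholesky factor, and the two generally differ.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
M = B @ B.T + 5 * np.eye(5)           # a random symmetric positive-definite M

# Type 1: the symmetric square root, X X = M, via the eigendecomposition.
w, Q = np.linalg.eigh(M)
X = (Q * np.sqrt(w)) @ Q.T
print(np.allclose(X @ X, M))          # True

# Type 2: any X with X X^T = M, e.g. the lower-triangular Cholesky factor.
L = np.linalg.cholesky(M)
print(np.allclose(L @ L.T, M))        # True

# The two generally differ, and type-2 roots are not unique:
# L @ W is also a type-2 root for any orthogonal W.
print(np.allclose(X, L))              # False
```

Note that for positive semi-definite \(\mathrm{M}\) the symmetric root is itself Hermitian, so it is a root in the second sense as well.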
1 Almost low-rank roots
Square roots of nearly-low-rank matrices are not themselves nearly low rank in general, but there are at least efficient algorithms for computing them in structured cases. Fasi, Higham, and Liu (2023):
We consider the problem of computing the square root of a perturbation of the scaled identity matrix, \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), where \(\mathrm{U}\) and \(\mathrm{V}\) are \(n \times k\) matrices with \(k \leq n\). This problem arises in various applications, including computer vision and optimization methods for machine learning. We derive a new formula for the \(p\)th root of \(\mathrm{A}\) that involves a weighted sum of powers of the \(p\)th root of the \(k \times k\) matrix \(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\). This formula is particularly attractive for the square root, since the sum has just one term when \(p=2\). We also derive a new class of Newton iterations for computing the square root that exploit the low-rank structure.
Their method works for low-rank perturbations of a scaled identity with no eigenvalues on the negative real axis.
Theorem: Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) and assume that \(\mathrm{V}^{H} \mathrm{U}\) is nonsingular. Let \(f\) be defined on the spectrum of \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), and if \(k=n\) let \(f\) be defined at \(\alpha\). Then \[\quad f(\mathrm{A})=f(\alpha) \mathrm{I}_n+\mathrm{U}\left(\mathrm{V}^{H} \mathrm{U}\right)^{-1}\left(f\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)-f(\alpha) \mathrm{I}_k\right) \mathrm{V}^{H}.\] The theorem says two things: that \(f(\mathrm{A})\), like \(\mathrm{A}\), is a perturbation of rank at most \(k\) of the identity matrix and that \(f(\mathrm{A})\) can be computed by evaluating \(f\) and the inverse at two \(k \times k\) matrices.
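Here is a quick numerical sanity check of that identity (a sketch, assuming numpy/scipy, with \(f=\exp\) and an arbitrary small complex perturbation so everything is well defined):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n, k, alpha = 60, 4, 2.0
U = 0.1 * (rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k)))
V = 0.1 * (rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k)))
A = alpha * np.eye(n) + U @ V.conj().T

lhs = expm(A)                          # f evaluated at the full n x n matrix

# The right-hand side needs f and an inverse only at k x k matrices.
VhU = V.conj().T @ U
small = expm(alpha * np.eye(k) + VhU) - np.exp(alpha) * np.eye(k)
rhs = np.exp(alpha) * np.eye(n) + U @ np.linalg.solve(VhU, small) @ V.conj().T

print(np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs))  # floating-point-level error
```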
I would call this a generalized Woodbury formula, and I think it is pretty cool, which tells you something about my current obsession profile. Anyway, they use it to derive the following:
Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) have full rank and let the matrix \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on \(\mathbb{R}^{-}\). Then for any integer \(p \geq 1\), \[ \mathrm{A}^{1 / p}=\alpha^{1 / p} \mathrm{I}_n+\mathrm{U}\left(\sum_{i=0}^{p-1} \alpha^{i / p} \cdot\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{(p-i-1) / p}\right)^{-1} \mathrm{V}^{H} \]
and in particular
Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) have full rank and let the matrix \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on \(\mathbb{R}^{-}\). Then \[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{U}\left(\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{V}^{H}. \]
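A hedged sketch checking the square-root case against scipy.linalg.sqrtm; I scale \(\mathrm{U}, \mathrm{V}\) down so that \(\mathrm{A}\) stays safely away from \(\mathbb{R}^{-}\):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n, k, alpha = 200, 5, 4.0
U = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((n, k))
A = alpha * np.eye(n) + U @ V.T

# Closed form: only a k x k square root and a k x k solve are required.
core = sqrtm(alpha * np.eye(k) + V.T @ U) + np.sqrt(alpha) * np.eye(k)
root = np.sqrt(alpha) * np.eye(n) + U @ np.linalg.solve(core, V.T)

print(np.allclose(root @ root, A))    # True: it squares to A
print(np.allclose(root, sqrtm(A)))    # True: matches the principal root (O(n^3) check)
```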
They also derive an efficient iterative scheme for computing it, based on the Denman–Beavers (DB) iteration:
The (scaled) DB iteration is \[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}, \end{aligned} \] where the positive scaling parameter \(\mu_i \in \mathbb{R}\) can be used to accelerate the convergence of the method in its initial steps. The choice \(\mu_i=1\) yields the unscaled DB method, for which \(\mathrm{X}_i\) and \(\mathrm{Y}_i\) converge quadratically to \(\mathrm{A}^{1 / 2}\) and \(\mathrm{A}^{-1 / 2}\), respectively.
… for \(i \geq 0\) the iterates \(\mathrm{X}_i\) and \(\mathrm{Y}_i\) can be written in the form \[ \begin{aligned} & \mathrm{X}_i=\beta_i \mathrm{I}_n+\mathrm{U} \mathrm{B}_i \mathrm{V}^{H}, \quad \beta_i \in \mathbb{C}, \quad \mathrm{B}_i \in \mathbb{C}^{k \times k}, \\ & \mathrm{Y}_i=\gamma_i \mathrm{I}_n+\mathrm{U} \mathrm{C}_i \mathrm{V}^{H}, \quad \gamma_i \in \mathbb{C}, \quad \mathrm{C}_i \in \mathbb{C}^{k \times k} . \end{aligned}\]
This gets us a computational speed-up, although of a rather complicated kind. The constant factor is favourable compared to the naive approach, and the scaling with \(n\) is also somewhat favourable: less than cubic, though more than quadratic, depending on how fast the iteration converges. That convergence rate depends both on the problem and on a good choice of the scaling parameters \(\mu_i\), for which they give a recipe, but it gets kinda complicated and engineering-y.
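For concreteness, here is a minimal sketch of the plain unscaled DB iteration (\(\mu_i=1\)) on a small test problem. This is the full-size \(n \times n\) version quoted above, not the paper's structured variant that only updates the \(k \times k\) factors \(\mathrm{B}_i, \mathrm{C}_i\), so it shows the convergence but not the speed-up (the function name db_sqrt is mine):

```python
import numpy as np

def db_sqrt(A, iters=25):
    """Unscaled Denman-Beavers iteration: X_i -> A^{1/2}, Y_i -> A^{-1/2}."""
    X, Y = A.copy(), np.eye(A.shape[0])
    for _ in range(iters):
        Xinv, Yinv = np.linalg.inv(X), np.linalg.inv(Y)
        X, Y = 0.5 * (X + Yinv), 0.5 * (Y + Xinv)
    return X, Y

rng = np.random.default_rng(3)
n, k, alpha = 100, 4, 4.0
U = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((n, k))
A = alpha * np.eye(n) + U @ V.T

X, Y = db_sqrt(A)
print(np.allclose(X @ X, A))              # X ~ A^{1/2}
print(np.allclose(Y, np.linalg.inv(X)))   # Y ~ A^{-1/2}
```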
Anyway, let us suppose \(\mathrm{U}=\mathrm{V}=\mathrm{Z}\) and replay that. Then \[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{Z}\left(\left(\alpha \mathrm{I}_k+\mathrm{Z}^{H} \mathrm{Z}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{Z}^{H}. \]
The Denman–Beavers step becomes \[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}. \end{aligned} \]
This looked useful, although note that it gives us a full-size square root.
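On the other hand, the closed form above keeps \(\mathrm{A}^{1/2}\) in scaled-identity-plus-low-rank form, so we can apply it without ever materialising an \(n \times n\) matrix. A sketch assuming real \(\mathrm{Z}\) and \(\alpha>0\), so the inner \(k \times k\) root comes from a small eigendecomposition (the helper name apply_root is mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, alpha = 1000, 3, 2.0
Z = rng.standard_normal((n, k))
A = alpha * np.eye(n) + Z @ Z.T

# k x k pieces only: eigendecomposition of alpha I_k + Z^T Z, then a small inverse.
w, Q = np.linalg.eigh(alpha * np.eye(k) + Z.T @ Z)
small_root = (Q * np.sqrt(w)) @ Q.T
W = np.linalg.inv(small_root + np.sqrt(alpha) * np.eye(k))

# A^{1/2} = sqrt(alpha) I_n + Z W Z^T; apply it as a matrix-vector product.
def apply_root(x):
    return np.sqrt(alpha) * x + Z @ (W @ (Z.T @ x))

x = rng.standard_normal(n)
print(np.allclose(apply_root(apply_root(x)), A @ x))  # (A^{1/2})^2 x = A x
```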
2 Diagonal-plus-low-rank Cholesky
Louis Tiao, in Efficient Cholesky decomposition of low-rank updates, summarises Seeger (2004).
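I will not reproduce it here, but the basic primitive underneath is the classic \(O(n^2)\) rank-one Cholesky update: given the factor of \(\mathrm{A}\), get the factor of \(\mathrm{A}+x x^{\top}\) cheaply. A minimal sketch of that textbook update (not Seeger's exact formulation, which also covers downdates and blocked low-rank updates; the function name is mine):

```python
import numpy as np

def chol_rank1_update(L, x):
    """Given lower-triangular L with L @ L.T == A, return the Cholesky
    factor of A + outer(x, x), in O(n^2) instead of O(n^3)."""
    L, x = L.copy(), x.copy()
    for j in range(x.size):
        r = np.hypot(L[j, j], x[j])       # rotate (L[j, j], x[j]) onto the axis
        c, s = r / L[j, j], x[j] / L[j, j]
        L[j, j] = r
        L[j+1:, j] = (L[j+1:, j] + s * x[j+1:]) / c
        x[j+1:] = c * x[j+1:] - s * L[j+1:, j]
    return L

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
x = rng.standard_normal(n)

L1 = chol_rank1_update(np.linalg.cholesky(A), x)
print(np.allclose(L1 @ L1.T, A + np.outer(x, x)))  # True
```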