Matrix norms, divergences, metrics

June 3, 2016 — May 29, 2023

Tags: algebra, feature construction, functional analysis, high d, Hilbert space, linear algebra, networks, optimization, probability, signal processing, sparser than thou, statistics

Matrix norms! Ways of measuring the “size” of a matrix, or, more typically for me, the “distance” between matrices, which is the size of one minus the other. Useful as optimisation objectives, or as metrics for comparing matrices.

I write the singular value decomposition of an \(m\times n\) matrix \(\mathrm{B}\) as \[ \mathrm{B} = \mathrm{U}\boldsymbol{\Sigma}\mathrm{V}^{\top} \] with unitary matrices \(\mathrm{U},\, \mathrm{V}\) and a matrix \(\boldsymbol{\Sigma}\) that is zero except for non-negative entries on its diagonal, of respective dimensions \(m\times m,\,n\times n,\,m\times n\).

The diagonal entries of \(\boldsymbol{\Sigma}\), written \(\sigma_i(\mathrm{B})\), are the singular values of \(\mathrm{B}\), and I write the entries of \(\mathrm{B}\) itself as \(b_{jk}\).

Now here is a question: Why does this notebook not connect more closely to the matrix concentration page?

1 Operator norm

A classic, defined in terms of norms on two other spaces, using the matrix as an operator mapping between them. Define two norms, one on \(\mathbb{R}^m\) and one on \(\mathbb{R}^n\). These norms induce an operator norm for matrices \(\mathrm{B}: \mathbb{R}^n \to \mathbb{R}^m\), \[ \|\mathrm{B}\|_{\text{op}}=\sup _{\|v\|\leq 1}\|\mathrm{B} v\|=\sup _{v \neq 0} \frac{\|\mathrm{B} v\|}{\|v\|}, \] where \(v \in \mathbb{R}^n\). Note that the operator norm depends on the underlying norms. When the norm on both spaces is \(\ell_p\) for some \(1\leq p \leq \infty\), i.e. \(\|v\|_p=\left(\sum_i |v_i|^p\right)^{1 / p}\), we conventionally write \(\|\mathrm{B}\|_p\). The definition of the operator norm gives rise to the following useful inequality, which was the workhorse of my functional analysis classes for proving convergence and norm equivalence: \[ \|\mathrm{B} v\| \leq\|\mathrm{B}\|_{\text{op}}\|v\| \] for any \(v \in \mathbb{R}^n\).
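
As a quick sanity check, here is a minimal numpy sketch (random test matrix and hypothetical variable names, purely illustrative) comparing the \(\ell_2\to\ell_2\) operator norm returned by `np.linalg.norm(B, 2)` with the supremum definition approximated over random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))

# l2 -> l2 operator norm, i.e. the largest singular value.
op_norm = np.linalg.norm(B, 2)

# Crude approximation of sup_{||v|| <= 1} ||Bv|| over random unit vectors.
vs = rng.standard_normal((3, 10_000))
vs /= np.linalg.norm(vs, axis=0)
approx_sup = np.max(np.linalg.norm(B @ vs, axis=0))
assert approx_sup <= op_norm + 1e-12

# The workhorse inequality ||Bv|| <= ||B||_op ||v|| for an arbitrary v.
v = rng.standard_normal(3)
assert np.linalg.norm(B @ v) <= op_norm * np.linalg.norm(v) + 1e-12
```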

1.1 Spectral norm

The operator norm with \(p=2\) on both spaces; it equals the largest singular value, \(\sigma_1(\mathrm{B})\).

Famous useful relation to the Frobenius norm: \[\|\mathrm{B}\|_2 \leqslant\|\mathrm{B}\|_F \leqslant \sqrt{n}\|\mathrm{B}\|_2,\] and in fact \(\sqrt{\operatorname{rank}(\mathrm{B})}\) suffices in place of \(\sqrt{n}\). There are more relations in the inequality table under Relations below.
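
A minimal numpy sketch checking these bounds on a random matrix (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 7, 4
B = rng.standard_normal((m, n))

spec = np.linalg.norm(B, 2)      # spectral norm: largest singular value
frob = np.linalg.norm(B, 'fro')  # Frobenius norm
rank = np.linalg.matrix_rank(B)

assert spec <= frob + 1e-12
assert frob <= np.sqrt(rank) * spec + 1e-12
assert frob <= np.sqrt(n) * spec + 1e-12
```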

2 Frobenius norm

Coincides with the \(\ell_2\) norm when the matrix happens to be a column vector.

We can define the Frobenius norm in several equivalent ways, \[ \begin{aligned} \|\mathrm{B}\|_F^2 &=\sum_{j=1}^{m}\sum_{k=1}^{n}|b_{jk}|^2\\ &=\operatorname{tr}\left(\mathrm{B}\mathrm{B}^{\top}\right)\\ &=\operatorname{tr}\left(\mathrm{B}^{\top}\mathrm{B}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B})^2 &(*)\\ &=\langle\mathrm{B},\mathrm{B}\rangle_F \end{aligned} \]

It is useful to define the Frobenius norm in terms of the Frobenius inner product that we slyly introduced above: \[ \begin{aligned} \langle\mathrm{A}, \mathrm{B}\rangle_{\mathrm{F}}=\sum_{i, j} \overline{A_{i j}} B_{i j}=\operatorname{Tr}\left(\overline{\mathrm{A}^{\top}} \mathrm{B}\right) \equiv \operatorname{Tr}\left(\mathrm{A}^{\dagger} \mathrm{B}\right). \end{aligned} \]
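
As a sanity check, here is a minimal numpy sketch on real-valued matrices (so the conjugates drop out) confirming that the equivalent definitions of the norm, and the trace form of the inner product, all agree:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 3))

# Equivalent definitions of the squared Frobenius norm.
fro2_entrywise = np.sum(np.abs(B) ** 2)
fro2_trace     = np.trace(B @ B.T)
fro2_singular  = np.sum(np.linalg.svd(B, compute_uv=False) ** 2)
fro2_numpy     = np.linalg.norm(B, 'fro') ** 2
assert np.allclose([fro2_trace, fro2_singular, fro2_numpy], fro2_entrywise)

# Frobenius inner product: entrywise sum equals the trace form.
assert np.isclose(np.sum(A * B), np.trace(A.T @ B))
```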

Handy property for partitioned matrices: \[ \begin{aligned} \left\|\left[\begin{array}{c|c} \mathrm{A}_{11} & \mathrm{A}_{12} \\ \hline \mathrm{A}_{21} & \mathrm{A}_{22} \end{array}\right]\right\|_F^2 &=\left\|\mathrm{A}_{11} \right\|_F^2 + \left\|\mathrm{A}_{12}\right\|_F^2 + \left\|\mathrm{A}_{21}\right\|_F^2 + \left\|\mathrm{A}_{22}\right\|_F^2 \end{aligned} \]
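
A quick numerical check of the block identity, using numpy's `np.block` to assemble a partitioned matrix with arbitrary block sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
A11, A12 = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))
A21, A22 = rng.standard_normal((5, 3)), rng.standard_normal((5, 4))
A = np.block([[A11, A12], [A21, A22]])

lhs = np.linalg.norm(A, 'fro') ** 2
rhs = sum(np.linalg.norm(blk, 'fro') ** 2 for blk in (A11, A12, A21, A22))
assert np.isclose(lhs, rhs)
```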

Handy property for low-rank-style symmetric products of tall, skinny matrices \[ \begin{aligned} \|\mathrm{A}\mathrm{A}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{A}\mathrm{A}^{\top}\mathrm{A}\mathrm{A}^{\top}\right) =\operatorname{tr}\left(\mathrm{A}^{\top}\mathrm{A}\mathrm{A}^{\top}\mathrm{A}\right) =\|\mathrm{A}^{\top}\mathrm{A}\|_F^2 \end{aligned} \]

Handy property for low-rank-style products of non-symmetric tall, skinny matrices: \[ \begin{aligned} \|\mathrm{A}\mathrm{B}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{A}\mathrm{B}^{\top}\mathrm{B}\mathrm{A}^{\top}\right) =\operatorname{tr}\left((\mathrm{A}^{\top}\mathrm{A})(\mathrm{B}^{\top}\mathrm{B})\right) \end{aligned} \] This latter form does not require us to form the large \(m\times n\) matrix \(\mathrm{A}\mathrm{B}^{\top}\); we only ever touch the two small Gram matrices.
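
A sketch of the computational payoff with numpy (made-up dimensions): for tall, skinny \(\mathrm{A}\in\mathbb{R}^{m\times k}\) and \(\mathrm{B}\in\mathbb{R}^{n\times k}\) with \(m, n \gg k\), the squared Frobenius norm of \(\mathrm{A}\mathrm{B}^{\top}\) is computed from two \(k\times k\) Gram matrices; the symmetric identity above is the special case \(\mathrm{B}=\mathrm{A}\).

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 2000, 1500, 10
A = rng.standard_normal((m, k))
B = rng.standard_normal((n, k))

# Naive: materialise the m x n matrix A B^T.
naive = np.linalg.norm(A @ B.T, 'fro') ** 2

# Cheap: only the two k x k Gram matrices are ever formed.
cheap = np.trace((A.T @ A) @ (B.T @ B))
assert np.isclose(naive, cheap)

# Symmetric special case: ||A A^T||_F = ||A^T A||_F.
assert np.isclose(np.linalg.norm(A @ A.T, 'fro'), np.linalg.norm(A.T @ A, 'fro'))
```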

This norm seems to be mostly useful because it is tractable: the squared Frobenius norm is simple to work with, minimising the squared Frobenius norm also minimises the Frobenius norm itself, and the trace representation is easy to differentiate.
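
For example, differentiating the trace form gives the standard matrix-calculus identities \[ \frac{\partial}{\partial \mathrm{B}}\|\mathrm{B}\|_F^2 =\frac{\partial}{\partial \mathrm{B}}\operatorname{tr}\left(\mathrm{B}^{\top}\mathrm{B}\right) =2\mathrm{B}, \qquad \frac{\partial}{\partial \mathrm{B}}\|\mathrm{B}-\mathrm{X}\|_F^2=2(\mathrm{B}-\mathrm{X}), \] which is why squared-Frobenius objectives are so pleasant in gradient-based and least-squares methods.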

It is not completely silly as a norm in its own right, though. The line marked \((*)\) shows that the Frobenius norm is equivalently the \(\ell_2\) norm of the singular value vector. This is surprising to me, since it otherwise feels like a kinda “dumb” norm: “pretending the matrix is a vector” feels like it shouldn’t work, but look! there is some kind of statistic that we might care about in there. Also, the Frobenius norm bounds some other “more interpretable” norms, so it is indirectly useful.

Frobenius also induces a computationally expedient distance for low-rank-plus-diagonal matrices.

3 Nuclear norm

A.k.a. trace norm, Schatten 1-norm. The sum of a matrix’s singular values.

\[ \begin{aligned} \|\mathrm{B}\|_* &=\operatorname{tr}\left(\sqrt{\mathrm{B}^{\top}\mathrm{B}}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B}) \end{aligned} \]
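
A minimal numpy sketch (random matrix, purely illustrative): the sum-of-singular-values definition agrees with the built-in `'nuc'` matrix norm.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((6, 4))

sigma = np.linalg.svd(B, compute_uv=False)  # singular values
assert np.isclose(sigma.sum(), np.linalg.norm(B, 'nuc'))
```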

4 Schatten norms

Generalising nuclear and Frobenius norms. The Schatten \(p\)-norm is defined \[ \|\mathrm{B}\|_{p}=\left(\sum _{i=1}^{\min\{m,\,n\}}\sigma _{i}^{p}(\mathrm{B})\right)^{1/p}. \]

The case \(p = 2\) yields the Frobenius norm, the case \(p =\infty\) yields the spectral norm, and \(p = 1\) yields the nuclear norm, also known as the Ky Fan norm (strictly, the Ky Fan norm of order \(\min(m,n)\)).
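
A minimal sketch of a generic Schatten norm in numpy (the helper `schatten` is my own, not a library function), checking that the three special cases coincide with the built-in matrix norms:

```python
import numpy as np

def schatten(B, p):
    """Schatten p-norm: the l_p norm of the singular value vector."""
    sigma = np.linalg.svd(B, compute_uv=False)
    return np.linalg.norm(sigma, p)  # vector p-norm; also handles p = np.inf

rng = np.random.default_rng(6)
B = rng.standard_normal((6, 4))

assert np.isclose(schatten(B, 1), np.linalg.norm(B, 'nuc'))    # nuclear norm
assert np.isclose(schatten(B, 2), np.linalg.norm(B, 'fro'))    # Frobenius norm
assert np.isclose(schatten(B, np.inf), np.linalg.norm(B, 2))   # spectral norm
```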

5 Bregman divergence

🏗 Relation to exponential family and maximum likelihood.

Mark Reid: Meet the Bregman divergences:

If you have some abstract way of measuring the “distance” between any two points and, for any choice of distribution over points the mean point minimises the average distance to all the others, then your distance measure must be a Bregman divergence.
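
As a toy numerical illustration of that characterisation (a sketch of my own, not anything from Reid's post): squared Euclidean distance is the Bregman divergence generated by \(F(x)=\|x\|^2\), and sure enough no candidate representative point beats the mean.

```python
import numpy as np

# Squared Euclidean distance: the Bregman divergence generated by F(x) = ||x||^2.
def bregman_sq_euclid(x, s):
    return np.sum((x - s) ** 2)

rng = np.random.default_rng(7)
points = rng.standard_normal((50, 3))
mean = points.mean(axis=0)

avg_div = lambda s: np.mean([bregman_sq_euclid(p, s) for p in points])

# No candidate representative point achieves a lower average divergence than the mean.
for _ in range(1000):
    candidate = rng.standard_normal(3)
    assert avg_div(mean) <= avg_div(candidate) + 1e-12
```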

TBC

6 Relations

Ken Tay notes:

Mazumder, Hastie, and Tibshirani (2010) has a very nice little lemma (Lemma 6 in the paper) that links the nuclear norm of a matrix (i.e. the sum of its singular values) to the solution of an optimization problem involving Frobenius norms. Here is the lemma: for any matrix \(Z \in \mathbb{R}^{m \times n}\), we have \[ \|Z\|_*=\min _{U, V: Z=U V^{\top}} \frac{1}{2}\left(\|U\|_F^2+\|V\|_F^2\right) . \] If \(\operatorname{rank}(Z)=k \leq \min (m, n)\), then the minimum above is attained at a factor decomposition \(Z=U_{m \times k} V_{n \times k}^{\top}\).
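
A numerical sketch of the lemma with numpy (made-up sizes): the balanced factorisation built from the SVD, \(U=U_0\sqrt{\boldsymbol{\Sigma}}\), \(V=V_0\sqrt{\boldsymbol{\Sigma}}\), attains the nuclear norm, while rescaled factorisations of the same \(Z\) can only do worse.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, k = 8, 6, 3
Z = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # a rank-k matrix

U0, s, Vt = np.linalg.svd(Z, full_matrices=False)
nuc = s.sum()

# Balanced factorisation attains the minimum.
U = U0 * np.sqrt(s)
V = Vt.T * np.sqrt(s)
assert np.allclose(U @ V.T, Z)
assert np.isclose(0.5 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2), nuc)

# Unbalanced factorisations Z = (U D)(V D^{-1})^T can only increase the objective.
for _ in range(5):
    d = np.exp(rng.standard_normal(U.shape[1]))
    val = 0.5 * (np.linalg.norm(U * d, 'fro')**2 + np.linalg.norm(V / d, 'fro')**2)
    assert val >= nuc - 1e-9
```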

6.1 The inequality table

From Petersen and Pedersen (2012):

E. H. Rasmussen has in yet unpublished material derived and collected the following inequalities. They are collected in a table as below, assuming \(\mathrm{B}\) is an \(m \times n\) matrix and \(d=\operatorname{rank}(\mathrm{B})\):

|  | \(\|\mathrm{B}\|_{max}\) | \(\|\mathrm{B}\|_{1}\) | \(\|\mathrm{B}\|_{\infty}\) | \(\|\mathrm{B}\|_{2}\) | \(\|\mathrm{B}\|_{F}\) | \(\|\mathrm{B}\|_{KF}\) |
|---|---|---|---|---|---|---|
| \(\|\mathrm{B}\|_{max}\) |  | \(1\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(\|\mathrm{B}\|_{1}\) | \(m\) |  | \(m\) | \(\sqrt{m}\) | \(\sqrt{m}\) | \(\sqrt{m}\) |
| \(\|\mathrm{B}\|_{\infty}\) | \(n\) | \(n\) |  | \(\sqrt{n}\) | \(\sqrt{n}\) | \(\sqrt{n}\) |
| \(\|\mathrm{B}\|_{2}\) | \(\sqrt{mn}\) | \(\sqrt{n}\) | \(\sqrt{m}\) |  | \(1\) | \(1\) |
| \(\|\mathrm{B}\|_{F}\) | \(\sqrt{mn}\) | \(\sqrt{n}\) | \(\sqrt{m}\) | \(\sqrt{d}\) |  | \(1\) |
| \(\|\mathrm{B}\|_{KF}\) | \(\sqrt{mnd}\) | \(\sqrt{nd}\) | \(\sqrt{md}\) | \(d\) | \(\sqrt{d}\) |  |

which are to be read as, e.g.

\[ \|\mathrm{B}\|_2 \leq \sqrt{m} \cdot\|\mathrm{B}\|_{\infty} \]
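
A numpy sketch spot-checking a few rows of the table on a random matrix, with `'nuc'` standing in for the Ky Fan norm (the sum of singular values):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 6, 4
B = rng.standard_normal((m, n))
d = np.linalg.matrix_rank(B)

norm = {
    'max': np.max(np.abs(B)),          # largest absolute entry
    '1':   np.linalg.norm(B, 1),       # max absolute column sum
    'inf': np.linalg.norm(B, np.inf),  # max absolute row sum
    '2':   np.linalg.norm(B, 2),       # largest singular value
    'F':   np.linalg.norm(B, 'fro'),
    'KF':  np.linalg.norm(B, 'nuc'),   # Ky Fan / nuclear norm
}

# Entries read as ||B||_row <= c * ||B||_column.
assert norm['max'] <= norm['2']
assert norm['2']   <= np.sqrt(m) * norm['inf']
assert norm['F']   <= np.sqrt(d) * norm['2']
assert norm['KF']  <= d * norm['2']
```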


7 Incoming

8 References

Chen, and Chi. 2018. “Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation: Recent Theory and Fast Algorithms via Convex and Nonconvex Optimization.” IEEE Signal Processing Magazine.
Dokmanić, and Gribonval. 2017. “Beyond Moore-Penrose Part II: The Sparse Pseudoinverse.” arXiv:1706.08701 [Cs, Math].
Mazumder, Hastie, and Tibshirani. 2010. “Spectral Regularization Algorithms for Learning Large Incomplete Matrices.” Journal of Machine Learning Research.
Mercer. 2000. “Bounds for A–G, A–H, G–H, and a Family of Inequalities of Ky Fan’s Type, Using a General Method.” Journal of Mathematical Analysis and Applications.
Minka. 2000. Old and new matrix algebra useful for statistics.
Moakher, and Batchelor. 2006. “Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization.” In Visualization and Processing of Tensor Fields.
Omladič, and Šemrl. 1990. “On the Distance Between Normal Matrices.” Proceedings of the American Mathematical Society.
Petersen, and Pedersen. 2012. “The Matrix Cookbook.”
Sharma. 2008. “Some More Inequalities for Arithmetic Mean, Harmonic Mean and Variance.” Journal of Mathematical Inequalities.
Tropp, Joel A. 2015. An Introduction to Matrix Concentration Inequalities.
Tropp, Joel A. 2019. Matrix Concentration & Computational Linear Algebra / ENS Short Course.
Vershynin. 2015. “Estimation in High Dimensions: A Geometric Perspective.” In Sampling Theory, a Renaissance: Compressive Sensing and Other Developments. Applied and Numerical Harmonic Analysis.