Matrix norms, divergences, metrics
June 3, 2016 — May 29, 2023
Matrix norms! Ways of measuring the “size” of a matrix, or, more typically for me, the “distance” between matrices, which is the size of one minus the other. Useful as optimisation objectives, or as metrics for comparing matrices.
I write the singular value decomposition of an \(m\times n\) matrix \(\mathrm{B}\) \[ \mathrm{B} = \mathrm{U}\boldsymbol{\Sigma}\mathrm{V}^{\top} \] where \(\mathrm{U}\) and \(\mathrm{V}\) are unitary and \(\boldsymbol{\Sigma}\) is diagonal with non-negative entries, of respective dimensions \(m\times m,\,n\times n,\,m\times n\).
The diagonal entries of \(\boldsymbol{\Sigma}\), written \(\sigma_i(\mathrm{B})\), are the singular values of \(\mathrm{B}\), and I write the entries of \(\mathrm{B}\) itself as \(b_{jk}\).
Now here is a question: Why does this notebook not connect more closely to the matrix concentration page?
1 Operator norm
A classic, defined in terms of norms on two other spaces, using the matrix as an operator mapping between them. Fix a norm on \(\mathbb{R}^m\) and one on \(\mathbb{R}^n\). These induce an operator norm for matrices \(\mathrm{B}: \mathbb{R}^n \to \mathbb{R}^m\), \[ \|\mathrm{B}\|_{\text{op}}=\sup _{\|v\|\leq 1}\|\mathrm{B} v\|=\sup _{v \neq 0} \frac{\|\mathrm{B} v\|}{\|v\|}, \] where \(v \in \mathbb{R}^n\). Note that the operator norm depends on the underlying vector norms. When the norm on both spaces is \(\ell_p\) for some \(1\leq p \leq \infty\), i.e. \(\|v\|_p=\left(\sum_i |v_i|^p\right)^{1 / p}\), we conventionally write \(\|\mathrm{B}\|_p\). The definition of the operator norm gives rise to the following useful inequality, which was the workhorse of my functional analysis classes for proving convergence and thus equivalence of norms: \[ \|\mathrm{B} v\| \leq\|\mathrm{B}\|_{\text{op}}\|v\| \] for any \(v \in \mathbb{R}^n\).
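As a concrete numerical sketch (assuming NumPy, whose `np.linalg.norm(B, ord)` with `ord` in `{1, 2, np.inf}` returns the corresponding induced norms), we can brute-force the supremum definition and check the workhorse inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))   # a map from R^3 to R^5

# Induced norms straight from NumPy: max column sum, largest singular value,
# max row sum.
print(np.linalg.norm(B, 1), np.linalg.norm(B, 2), np.linalg.norm(B, np.inf))

# Brute-force the sup over the unit sphere for p = 2 by sampling directions;
# the estimate approaches the true spectral norm from below.
vs = rng.standard_normal((3, 100_000))
vs /= np.linalg.norm(vs, axis=0)
print(np.linalg.norm(B @ vs, axis=0).max(), np.linalg.norm(B, 2))

# The workhorse inequality ||Bv|| <= ||B||_op ||v|| for a random v.
v = rng.standard_normal(3)
assert np.linalg.norm(B @ v) <= np.linalg.norm(B, 2) * np.linalg.norm(v) + 1e-12
```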
1.1 Spectral norm
Operator norm with \(p=2\), i.e. the largest singular value.
Famous useful relation to the Frobenius norm: \[\|\mathrm{B}\|_2 \leqslant\|\mathrm{B}\|_F \leqslant \sqrt{\operatorname{rank}(\mathrm{B})}\,\|\mathrm{B}\|_2 \leqslant \sqrt{n}\,\|\mathrm{B}\|_2.\] There are more such relations collected in the table below.
2 Frobenius norm
Coincides with the \(\ell_2\) norm when the matrix happens to be a column vector.
We can define the Frobenius norm in several equivalent ways, \[ \begin{aligned} \|\mathrm{B}\|_F^2 &=\sum_{j=1}^{m}\sum_{k=1}^{n}|b_{jk}|^2\\ &=\operatorname{tr}\left(\mathrm{B}\mathrm{B}^{\top}\right)\\ &=\operatorname{tr}\left(\mathrm{B}^{\top}\mathrm{B}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B})^2 &(*)\\ &=\langle\mathrm{B},\mathrm{B}\rangle_F \end{aligned} \]
It is useful to define the Frobenius norm in terms of the Frobenius inner product that we slyly introduced above: \[ \begin{aligned} \langle\mathrm{A}, \mathrm{B}\rangle_{\mathrm{F}}=\sum_{i, j} \overline{A_{i j}} B_{i j}=\operatorname{Tr}\left(\overline{\mathrm{A}^{\top}} \mathrm{B}\right) \equiv \operatorname{Tr}\left(\mathrm{A}^{\dagger} \mathrm{B}\right). \end{aligned} \]
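A quick numerical sanity check of these equivalent definitions, assuming NumPy (and a real-valued matrix, so the conjugates are vacuous):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 4))

entrywise = np.sum(np.abs(B) ** 2)                          # sum_{jk} |b_jk|^2
trace_form = np.trace(B @ B.T)                              # tr(B B^T)
singular = np.sum(np.linalg.svd(B, compute_uv=False) ** 2)  # sum_j sigma_j(B)^2
inner = np.sum(B * B)                                       # <B, B>_F
builtin = np.linalg.norm(B, 'fro') ** 2

print(entrywise, trace_form, singular, inner, builtin)      # all agree up to rounding
```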
Handy property for partitioned matrices: \[ \begin{aligned} \left\|\left[\begin{array}{c|c} \mathrm{A}_{11} & \mathrm{A}_{12} \\ \hline \mathrm{A}_{21} & \mathrm{A}_{22} \end{array}\right]\right\|_F^2 &=\left\|\mathrm{A}_{11} \right\|_F^2 + \left\|\mathrm{A}_{12}\right\|_F^2 + \left\|\mathrm{A}_{21}\right\|_F^2 + \left\|\mathrm{A}_{22}\right\|_F^2 \end{aligned} \]
Handy property for low-rank-style symmetric products of tall, skinny matrices \[ \begin{aligned} \|\mathrm{A}\mathrm{A}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{A}\mathrm{A}^{\top}\mathrm{A}\mathrm{A}^{\top}\right) =\operatorname{tr}\left(\mathrm{A}^{\top}\mathrm{A}\mathrm{A}^{\top}\mathrm{A}\right) =\|\mathrm{A}^{\top}\mathrm{A}\|_F^2 \end{aligned} \]
Handy property for low-rank-style products of non-symmetric tall, skinny matrices: \[ \begin{aligned} \|\mathrm{A}\mathrm{B}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{A}\mathrm{B}^{\top}\mathrm{B}\mathrm{A}^{\top}\right) =\operatorname{tr}\left((\mathrm{A}^{\top}\mathrm{A})(\mathrm{B}^{\top}\mathrm{B})\right) \end{aligned} \] This latter form does not require us to form the gigantic outer product \(\mathrm{A}\mathrm{B}^{\top}\); we only need the two small Gram matrices \(\mathrm{A}^{\top}\mathrm{A}\) and \(\mathrm{B}^{\top}\mathrm{B}\).
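A minimal sketch of that trick, assuming NumPy and two hypothetical tall, skinny factors `A` and `B`: the cheap form only ever touches \(k\times k\) Gram matrices, while the direct form materialises the full outer product.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 2_000, 1_500, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((n, k))

# Cheap: two k x k Gram matrices and a k x k trace.
cheap = np.trace((A.T @ A) @ (B.T @ B))

# Direct: forms the m x n product just to take its Frobenius norm.
direct = np.linalg.norm(A @ B.T, 'fro') ** 2

print(cheap, direct)   # agree up to floating-point rounding
```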
This norm seems to be mostly useful because it is tractable: the squared Frobenius norm is a simple, smooth function of the entries, minimising the squared Frobenius norm also minimises the Frobenius norm itself, and the trace representation is easy to differentiate.
It is not completely silly as a norm in its own right, though. The \((*)\) line shows that the Frobenius norm is equivalently the \(\ell_2\) norm of the vector of singular values. This surprises me, since “pretending the matrix is a vector” feels like it shouldn’t work, and yet it recovers a spectral statistic we might genuinely care about. The Frobenius norm also bounds other, “more interpretable” norms, so it is indirectly useful.
Frobenius also induces a computationally expedient distance for low-rank-plus-diagonal matrices.
3 Nuclear norm
A.k.a. trace norm, Schatten 1-norm. The sum of a matrix’s singular values.
\[ \begin{aligned} \|\mathrm{B}\|_* &=\operatorname{tr}\left(\sqrt{\mathrm{B}^{\top}\mathrm{B}}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B}) \end{aligned} \]
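A small sketch, assuming NumPy: the two expressions above, plus NumPy's built-in `ord='nuc'`, all agree.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 4))

# Sum of singular values.
nuclear_from_svd = np.linalg.svd(B, compute_uv=False).sum()

# tr(sqrt(B^T B)) via the eigendecomposition of the PSD matrix B^T B.
eigvals = np.linalg.eigvalsh(B.T @ B)
nuclear_from_trace = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

print(nuclear_from_svd, nuclear_from_trace, np.linalg.norm(B, 'nuc'))
```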
4 Schatten norms
Generalising nuclear and Frobenius norms. The Schatten \(p\)-norm is defined \[ \|\mathrm{B}\|_{p}=\left(\sum _{i=1}^{\min\{m,\,n\}}\sigma _{i}^{p}(\mathrm{B})\right)^{1/p}. \]
The case \(p = 2\) yields the Frobenius norm. The case \(p =\infty\) yields the spectral norm. Finally, \(p = 1\) yields the nuclear norm, also known as the Ky Fan norm.
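A tiny illustrative helper, assuming NumPy (`schatten_norm` is a name I made up, not a library function), making the special cases explicit:

```python
import numpy as np

def schatten_norm(B, p):
    """Schatten p-norm: the l_p norm of the vector of singular values."""
    sigma = np.linalg.svd(B, compute_uv=False)
    return np.linalg.norm(sigma, ord=p)

rng = np.random.default_rng(4)
B = rng.standard_normal((6, 4))

print(schatten_norm(B, 1), np.linalg.norm(B, 'nuc'))    # nuclear norm
print(schatten_norm(B, 2), np.linalg.norm(B, 'fro'))    # Frobenius norm
print(schatten_norm(B, np.inf), np.linalg.norm(B, 2))   # spectral norm
```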
5 Bregman divergence
🏗 Relation to exponential family and maximum likelihood.
Mark Reid: Meet the Bregman divergences:
If you have some abstract way of measuring the “distance” between any two points and, for any choice of distribution over points the mean point minimises the average distance to all the others, then your distance measure must be a Bregman divergence.
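To make the quoted characterisation concrete, here is a hedged sketch assuming NumPy, using the simplest generator \(F(x)=\|x\|^2\) (whose Bregman divergence is squared Euclidean distance); the function names are mine, purely illustrative. Perturbing away from the sample mean only increases the average divergence.

```python
import numpy as np

def bregman_sq_euclidean(x, y):
    # D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>  with  F(x) = ||x||^2,
    # which simplifies to ||x - y||^2.
    return np.sum(x ** 2) - np.sum(y ** 2) - 2.0 * y @ (x - y)

rng = np.random.default_rng(5)
points = rng.standard_normal((200, 3))
mean = points.mean(axis=0)

def avg_divergence(c):
    return np.mean([bregman_sq_euclidean(p, c) for p in points])

print(avg_divergence(mean),                      # the minimum
      avg_divergence(mean + 0.1),                # slightly worse
      avg_divergence(rng.standard_normal(3)))    # typically much worse
```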
TBC
6 Relations
Mazumder, Hastie, and Tibshirani (2010) have a very nice little lemma (Lemma 6 in the paper) that links the nuclear norm of a matrix (i.e. the sum of its singular values) to the solution of an optimization problem involving Frobenius norms. Lemma: For any matrix \(Z \in \mathbb{R}^{m \times n}\), we have \[ \|Z\|_*=\min _{U, V: Z=U V^{\top}} \frac{1}{2}\left(\|U\|_F^2+\|V\|_F^2\right) . \] If \(\operatorname{rank}(Z)=k \leq \min (m, n)\), then the minimum above is attained at a factorization \(Z=U_{m \times k} V_{n \times k}^{\top}\).
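A quick numerical check of the lemma, assuming NumPy: the balanced factors built from the SVD, \(U=\mathrm{U}\boldsymbol{\Sigma}^{1/2}\) and \(V=\mathrm{V}\boldsymbol{\Sigma}^{1/2}\), attain the nuclear norm, while an unbalanced rescaling of the same factorization does strictly worse.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 8, 6, 3
Z = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k matrix

Usvd, s, Vt = np.linalg.svd(Z, full_matrices=False)
nuclear = s.sum()

# Balanced factors: scale the singular vectors by sqrt(sigma_i).
U = Usvd * np.sqrt(s)
V = Vt.T * np.sqrt(s)
balanced = 0.5 * (np.linalg.norm(U, 'fro') ** 2 + np.linalg.norm(V, 'fro') ** 2)

# Any other factorisation Z = U' V'^T, e.g. an unbalanced rescaling, does worse.
unbalanced = 0.5 * (np.linalg.norm(3.0 * U, 'fro') ** 2
                    + np.linalg.norm(V / 3.0, 'fro') ** 2)

print(nuclear, balanced, unbalanced)   # nuclear == balanced <= unbalanced
```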
6.1 The inequality table
From Petersen and Pedersen (2012):
E. H. Rasmussen has in yet unpublished material derived and collected the following inequalities. They are collected in a table as below, assuming \(\mathrm{B}\) is an \(m \times n\) matrix and \(d=\operatorname{rank}(\mathrm{B})\):
|  | \(\|\mathrm{B}\|_{max}\) | \(\|\mathrm{B}\|_{1}\) | \(\|\mathrm{B}\|_{\infty}\) | \(\|\mathrm{B}\|_{2}\) | \(\|\mathrm{B}\|_{F}\) | \(\|\mathrm{B}\|_{KF}\) |
|---|---|---|---|---|---|---|
| \(\|\mathrm{B}\|_{max}\) |  | \(1\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(\|\mathrm{B}\|_{1}\) | \(m\) |  | \(m\) | \(\sqrt{m}\) | \(\sqrt{m}\) | \(\sqrt{m}\) |
| \(\|\mathrm{B}\|_{\infty}\) | \(n\) | \(n\) |  | \(\sqrt{n}\) | \(\sqrt{n}\) | \(\sqrt{n}\) |
| \(\|\mathrm{B}\|_{2}\) | \(\sqrt{mn}\) | \(\sqrt{n}\) | \(\sqrt{m}\) |  | \(1\) | \(1\) |
| \(\|\mathrm{B}\|_{F}\) | \(\sqrt{mn}\) | \(\sqrt{n}\) | \(\sqrt{m}\) | \(\sqrt{d}\) |  | \(1\) |
| \(\|\mathrm{B}\|_{KF}\) | \(\sqrt{mnd}\) | \(\sqrt{nd}\) | \(\sqrt{md}\) | \(d\) | \(\sqrt{d}\) |  |
which are to be read as, e.g.
\[ \|\mathrm{B}\|_2 \leq \sqrt{m} \cdot\|\mathrm{B}\|_{\infty} \]
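A numerical spot check of a few of the table's entries, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 7, 5
B = rng.standard_normal((m, n))
d = np.linalg.matrix_rank(B)

spec = np.linalg.norm(B, 2)         # spectral norm
fro = np.linalg.norm(B, 'fro')      # Frobenius norm
rowsum = np.linalg.norm(B, np.inf)  # induced infinity norm (max row sum)

assert spec <= np.sqrt(m) * rowsum   # ||B||_2  <= sqrt(m) ||B||_inf
assert fro <= np.sqrt(d) * spec      # ||B||_F  <= sqrt(d) ||B||_2
assert spec <= fro                   # ||B||_2  <= ||B||_F
print("table entries check out for this B")
```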