Measure concentration inequalities
The fancy name for probability inequalities
November 25, 2014 — March 3, 2021
Welcome to the probability inequality mines!
Concentration inequalities apply when something in our process (measurement, estimation) lets us be pretty sure that most of the probability mass is somewhere in particular, e.g. being 80% sure I am only 20% wrong.
As undergraduates, we run into central limit theorems, but there are many more diverse ways we can keep track of our probability, or at least most of it. This idea is a basic workhorse in univariate probability and turns out to be yet more essential in multivariate matrix probability, as seen in matrix factorization, compressive sensing, PAC-bounds and suchlike.
1 Background
Overviews include
- Roman Vershynin’s High-Dimensional Probability: An Introduction with Applications in Data Science (Vershynin 2018)
- Thomas Lumley’s super simple intro to chaining and controlling maxima.
- DasGupta, Asymptotic Theory of Statistics and Probability (DasGupta 2008) is practical, and despite its name, introduces some basic non-asymptotic inequalities.
- Raginsky and Sason, Concentration of Measure Inequalities in Information Theory, Communications and Coding (Raginsky and Sason 2012)
- Tropp, An Introduction to Matrix Concentration Inequalities (Tropp 2015) high-dimensional data! free!
- Boucheron, Bousquet & Lugosi, Concentration inequalities (Boucheron, Bousquet, and Lugosi 2004) (Clear and brisk but missing some newer stuff)
- Boucheron, Lugosi & Massart, Concentration inequalities: a nonasymptotic theory of independence (Boucheron, Lugosi, and Massart 2013). Haven’t read it yet.
- Massart, Concentration inequalities and model selection (Massart 2007). Clear, focussed, but brusque. Depressingly, by being applied it also demonstrates the limitations of its chosen techniques, which seem sweet in application but bitter in the required assumptions. (Massart 2000) is an earlier draft.
- Lugosi’s Concentration-of-measure lecture notes are good, especially for their treatment of Efron-Stein inequalities. This taxonomy is used in his Combinatorial statistics notes.
- Terry Tao’s lecture notes
- Divergence in everything: erasure divergence and concentration inequalities
- Talagrand’s opus, which is commonly credited with kicking off the modern fad, particularly the part due to the chaining method (Talagrand 1995).
- Luca Trevisan wrote an example-driven explanation of Talagrand generic chaining.
2 Markov
For any nonnegative random variable \(X\) and \(t>0,\) \[ \mathbb{P}\{X \geq t\} \leq \frac{\mathbb{E} X}{t} \] Corollary: if \(\phi\) is a strictly monotonically increasing non-negative-valued function then for any random variable \(X\) and real number \(t\) with \(\phi(t)>0\) \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\{\phi(X) \geq \phi(t)\} \leq \frac{\mathbb{E} \phi(X)}{\phi(t)} \]
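A minimal numerical sketch of Markov's bound, assuming an Exponential(1) variable (my arbitrary choice of a nonnegative distribution with mean 1) and using NumPy. The helper name `markov_bound` is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Exponential(1) is nonnegative with mean 1 (an arbitrary choice for illustration).
x = rng.exponential(scale=1.0, size=100_000)


def markov_bound(mean, t):
    """Markov's upper bound on P(X >= t) for nonnegative X with the given mean."""
    return min(1.0, mean / t)


for t in (1.0, 2.0, 5.0):
    empirical = float(np.mean(x >= t))
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {markov_bound(1.0, t):.4f}")
```

The bound holds but is loose here, which is typical: Markov uses nothing except the mean.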
3 Chebyshev
A corollary of Markov’s bound with \(\phi(x)=x^{2}\), applied to \(|X-\mathbb{E} X|\), is Chebyshev’s inequality: if \(X\) is an arbitrary random variable with finite variance and \(t>0,\) then \[ \mathbb{P}\{|X-\mathbb{E} X| \geq t\}=\mathbb{P}\left\{|X-\mathbb{E} X|^{2} \geq t^{2}\right\} \leq \frac{\mathbb{E}\left[|X-\mathbb{E} X|^{2}\right]}{t^{2}}=\frac{\operatorname{Var}\{X\}}{t^{2}} \] More generally, taking \(\phi(x)=x^{q}\) \((x \geq 0)\) for any \(q>0,\) we have \[ \mathbb{P}\{|X-\mathbb{E} X| \geq t\} \leq \frac{\mathbb{E}\left[|X-\mathbb{E} X|^{q}\right]}{t^{q}} \] We can choose \(q\) to optimize the upper bound for the problem at hand.
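To see what optimizing over \(q\) buys us, here is a small sketch that assumes \(X\) is standard normal (my choice, because its absolute moments have a closed form) and searches integer \(q\) from 1 to 12. The function names are hypothetical.

```python
import math


def abs_moment_std_normal(q):
    """E|X|^q for X ~ N(0, 1): 2^(q/2) * Gamma((q+1)/2) / sqrt(pi)."""
    return 2 ** (q / 2) * math.gamma((q + 1) / 2) / math.sqrt(math.pi)


def moment_bound(q, t):
    """Bound on P(|X| >= t) from Markov's inequality applied to |X|^q."""
    return abs_moment_std_normal(q) / t**q


t = 3.0
bounds = {q: moment_bound(q, t) for q in range(1, 13)}
best_q = min(bounds, key=bounds.get)
exact_tail = math.erfc(t / math.sqrt(2))  # P(|X| >= t) for a standard normal

print(f"Chebyshev (q=2): {bounds[2]:.4f}")
print(f"best integer q={best_q}: {bounds[best_q]:.4f}")
print(f"exact tail: {exact_tail:.4f}")
```

Higher moments improve noticeably on Chebyshev at this \(t\), though the best moment bound is still some way above the exact tail.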
4 Hoeffding
This one is about sums of RVs, in particular bounded RVs with potentially different bounds.
Let \(X_1, X_2, \ldots, X_n\) be independent bounded random variables, in the sense that \(a_i \leq X_i \leq b_i\) for all \(i.\) Define the sum of these variables \(S_n = \sum_{i=1}^n X_i.\) Let \(\mu = \mathbb{E}[S_n]\) be the expected value of \(S_n.\)
Hoeffding’s inequality states that for any \(t > 0\),
\[ \mathbb{P}(S_n - \mu \geq t) \leq \exp \left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right) \]
and
\[ \mathbb{P}(S_n - \mu \leq -t) \leq \exp \left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right). \]
For example, consider \(X_1, X_2, \ldots, X_n\) independent random variables taking values in \([0, 1]\). For \(S_n = \sum_{i=1}^n X_i\) and \(\mu = \mathbb{E}[S_n],\) Hoeffding’s inequality becomes:
\[ \mathbb{P}(S_n - \mu \geq t) \leq \exp \left( -\frac{2t^2}{n} \right). \]
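A quick simulation of the \([0,1]\) case, as a sketch only: I assume Uniform\([0,1]\) summands (so each has mean \(1/2\)), and `hoeffding_bound` is a made-up helper that evaluates the right-hand side for general per-variable bounds.

```python
import numpy as np


def hoeffding_bound(t, lower, upper):
    """One-sided Hoeffding bound on P(S_n - E[S_n] >= t) for independent X_i in [lower_i, upper_i]."""
    widths = np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)
    return float(np.exp(-2.0 * t**2 / np.sum(widths**2)))


rng = np.random.default_rng(1)
n, trials, t = 100, 20_000, 10.0
# Uniform[0, 1] summands are an illustrative assumption; each has mean 1/2.
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
empirical = float(np.mean(sums - n * 0.5 >= t))
bound = hoeffding_bound(t, np.zeros(n), np.ones(n))
print(f"empirical tail {empirical:.5f} <= Hoeffding bound {bound:.5f}")
```

The bound only uses the range of each variable, so it cannot notice that uniform variables have variance well below the worst case for \([0,1]\).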
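Cool, except my variates are rarely bounded. What do I do then? Probably Chernoff bounds, which need only a finite moment-generating function, or moment-condition variants of Bernstein bounds (the version stated below still assumes boundedness).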
5 Chernoff
Taking \(\phi(x)=e^{sx}\) where \(s>0\), for any random variable \(X\), and any \(t>0\), we have \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\left\{e^{sX} \geq e^{st}\right\} \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}}. \] \(s\) is a free parameter we choose to make the bound as tight as possible.
That’s what I was taught in class. The lazy (e.g., me) might not notice that a useful bound for sums of RVs follows from this result. The bound is stated in terms of the moment-generating function (i.e. the Laplace transform with the sign of the argument flipped), and the moment-generating function of a sum of independent RVs is the product of the individual ones.
For independent random variables \(X_1, X_2, \ldots, X_n\), let \(S_n = \sum_{i=1}^n X_i\). We have \[ \mathbb{P}(S_n \geq t) \leq \inf_{s > 0} \frac{\mathbb{E}[e^{sS_n}]}{e^{st}}. \] If \(X_i\) are i.i.d., then \[ \mathbb{P}(S_n \geq t) \leq \inf_{s > 0} \frac{(\mathbb{E}[e^{sX_1}])^n}{e^{st}}. \]
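As a sketch of the i.i.d. bound in action, suppose the \(X_i\) are Bernoulli(\(p\)) (my assumption, chosen because \(\mathbb{E}[e^{sX_1}] = 1-p+pe^{s}\) is explicit and the exact tail is a binomial sum). The infimum over \(s\) is approximated by a crude grid search rather than solved in closed form.

```python
import math


def chernoff_bound_bernoulli_sum(n, p, t, s_grid):
    """Smallest value of (E[e^{s X_1}])^n / e^{s t} over the supplied grid of s > 0."""
    best = 1.0  # trivial bound, which is also the limit as s -> 0
    for s in s_grid:
        log_mgf = math.log(1.0 - p + p * math.exp(s))  # log E[e^{s X_1}] for Bernoulli(p)
        best = min(best, math.exp(n * log_mgf - s * t))
    return best


def exact_binomial_tail(n, p, t):
    """P(S_n >= t) for S_n ~ Binomial(n, p), summed directly."""
    k0 = math.ceil(t)
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k0, n + 1))


n, p, t = 100, 0.5, 70
s_grid = [0.01 * k for k in range(1, 501)]  # grid over s in (0, 5]
bound = chernoff_bound_bernoulli_sum(n, p, t, s_grid)
exact = exact_binomial_tail(n, p, t)
print(f"exact tail {exact:.3e} <= Chernoff bound {bound:.3e}")
```

Any \(s>0\) gives a valid bound, so a grid search can only be slightly loose, never wrong.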
6 Bernstein inequality
Another one about bounded RVs. For independent zero-mean random variables \(X_1, X_2, \ldots, X_n\) with \(\mathbb{P}(|X_i| \leq M)=1\) and any \(t>0,\)
\[ \mathbb{P}\left( \sum_{i=1}^n X_i \geq t \right) \leq \exp \left( -\frac{t^2/2}{\sum_{i=1}^n \mathbb{E}[X_i^2] + Mt/3} \right) \]
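The point of Bernstein over Hoeffding is that it uses the variances as well as the range. A small sketch with hypothetical numbers (1000 zero-mean summands bounded by \(M=1\), each with second moment 0.01) compares the two right-hand sides; both helper functions are my own illustrative names.

```python
import math


def bernstein_bound(t, second_moments, M):
    """Bernstein bound on P(sum X_i >= t) for independent zero-mean X_i with |X_i| <= M."""
    v = sum(second_moments)
    return math.exp(-(t**2 / 2.0) / (v + M * t / 3.0))


def hoeffding_bound(t, n, M):
    """Hoeffding bound for the same sum, using only the range [-M, M] of each X_i."""
    return math.exp(-2.0 * t**2 / (n * (2.0 * M) ** 2))


n, M, t = 1000, 1.0, 20.0
small_var = [0.01] * n  # hypothetical summands with variance far below M^2
print(f"Bernstein: {bernstein_bound(t, small_var, M):.3e}")
print(f"Hoeffding: {hoeffding_bound(t, n, M):.3e}")
```

When the variances are close to \(M^2\) the advantage disappears, and Hoeffding can even be the tighter of the two.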
7 Efron-Stein
Do these results follow from Stein’s method? They have the right form, but the derivation is not clear. In fact, where did I get these results from? I forgot to provide a citation for what I was cribbing from.
Let \(g: \mathcal{X}^{n} \rightarrow \mathbb{R}\) be a real-valued measurable function of \(n\) variables. Efron-Stein inequalities concern the difference between the random variable \(Z=g\left(X_{1}, \ldots, X_{n}\right)\) and its expected value \(\mathbb{E}Z\) when \(X_{1}, \ldots, X_{n}\) are arbitrary independent random variables.
Define \(\mathbb{E}_{i}\) for the expected value with respect to the variable \(X_{i}\), that is, \[\mathbb{E}_{i} Z=\mathbb{E}\left[Z \mid X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n}\right]\] Then \[ \operatorname{Var}(Z) \leq \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-\mathbb{E}_{i} Z\right)^{2}\right] \]
Now, let \(X_{1}^{\prime}, \ldots, X_{n}^{\prime}\) be an independent copy of \(X_{1}, \ldots, X_{n}\), and define \[ Z_{i}^{\prime}=g\left(X_{1}, \ldots, X_{i}^{\prime}, \ldots, X_{n}\right) \] Then, alternatively, \[ \operatorname{Var}(Z) \leq \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-Z_{i}^{\prime}\right)^{2}\right] \] Nothing here seems to constrain the variables to be real-valued, merely the function \(g\), but apparently these results do not work for matrix variables as written; we need to see matrix Efron-Stein results for that.
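The resampling form is easy to check by Monte Carlo. A rough sketch, assuming \(g\) is the maximum of \(n=5\) independent Uniform\([0,1]\) variables (my choice of example); `efron_stein_estimate` is a hypothetical helper that swaps one coordinate at a time for its independent copy.

```python
import numpy as np


def efron_stein_estimate(g, sampler, n, trials, rng):
    """Monte Carlo estimates of Var(Z) and (1/2) * sum_i E[(Z - Z_i')^2] for Z = g(X_1, ..., X_n)."""
    x = sampler(rng, (trials, n))
    x_prime = sampler(rng, (trials, n))  # independent copy of the inputs
    z = g(x)
    bound = 0.0
    for i in range(n):
        x_swapped = x.copy()
        x_swapped[:, i] = x_prime[:, i]  # replace only coordinate i
        bound += 0.5 * float(np.mean((z - g(x_swapped)) ** 2))
    return float(np.var(z)), bound


rng = np.random.default_rng(2)
var_z, es_bound = efron_stein_estimate(
    g=lambda x: x.max(axis=1),
    sampler=lambda rng, shape: rng.uniform(0.0, 1.0, size=shape),
    n=5,
    trials=100_000,
    rng=rng,
)
print(f"Var(Z) ~ {var_z:.4f} <= Efron-Stein bound ~ {es_bound:.4f}")
```

Both numbers are estimates, so the comparison is only as good as the number of trials; the inequality itself is about the exact expectations.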
8 Kolmogorov
🏗
9 Gaussian
For the Gaussian distribution. Filed there, perhaps?
10 Sub-Gaussian
🏗
E.g. Hanson-Wright.
11 Martingale bounds
🏗
12 Khintchine
Let us copy from Wikipedia:
Heuristically: if we pick \(N\) complex numbers \(x_1,\dots,x_N \in\mathbb{C}\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected value of the sum’s magnitude is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).
Let \(\{\varepsilon_n\}_{n=1}^N\) be i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with Rademacher distribution. Let \(0<p<\infty\) and let \(x_1,\ldots,x_N \in \mathbb{C}\). Then
\[ A_p \left( \sum_{n=1}^N |x_n|^2 \right)^{1/2} \leq \left(\operatorname{E} \left|\sum_{n=1}^N \varepsilon_n x_n\right|^p \right)^{1/p} \leq B_p \left(\sum_{n=1}^N |x_n|^2\right)^{1/2} \]
for some constants \(A_p,B_p>0\). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).
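For small \(N\) the middle term can be computed exactly by enumerating all \(2^N\) sign patterns. A sketch with arbitrary complex coefficients of my choosing, checking the cases \(B_p=1\) for \(p \le 2\) and \(A_p=1\) for \(p \ge 2\); `rademacher_lp_norm` is a hypothetical name.

```python
import itertools
import math


def rademacher_lp_norm(x, p):
    """(E|sum_n eps_n x_n|^p)^(1/p), computed exactly over all 2^N sign patterns."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=len(x)):
        total += abs(sum(e * xn for e, xn in zip(signs, x))) ** p
    return (total / 2 ** len(x)) ** (1.0 / p)


x = [1 + 2j, -0.5j, 3.0, 0.25 - 1j]  # arbitrary complex coefficients
l2 = math.sqrt(sum(abs(xn) ** 2 for xn in x))
for p in (1, 2, 4):
    print(f"p={p}: L^p norm of the random sum {rademacher_lp_norm(x, p):.4f}, l2 norm {l2:.4f}")
```

At \(p=2\) the two sides agree exactly because the cross terms \(\mathbb{E}[\varepsilon_n \varepsilon_m]\) vanish for \(n \neq m\); the other cases then follow from the monotonicity of \(L^p\) norms in \(p\).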
13 Empirical process theory
🏗
14 Matrix concentration
If we restrict our interest to matrices in particular, some fun things arise. See Matrix concentration inequalities.
15 Jensen gap
See Jensen gaps.