Statistical learning theory
Eventually including structural risk minimisation, risk bounds, hopefully-uniform convergence rates, VC-dimension, generalisation-and-stability framings etc
July 6, 2016 — December 21, 2024
Another placeholder far from my own background.
Given some amount of noisy data, how complex a model can I learn? How can I know how well it will generalize to new data? If I can answer such questions a priori, I can fit a complex model with some messy regularisation hyperparameters and choose those hyperparameters analytically.
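To fix notation for everything below (a standard setup, not specific to any one framework): with i.i.d. data and a bounded loss $\ell$, the objects being compared are the true and empirical risks,

$$
R(f) = \mathbb{E}\,\ell(f(X), Y),
\qquad
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(X_i), Y_i),
$$

and the theories below differ mainly in how they control the uniform deviation $\sup_{f\in\mathcal{F}}\bigl(R(f)-\hat{R}_n(f)\bigr)$ over the model class $\mathcal{F}$.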
When I did statistics, we talked about this in terms of regularisation and model selection. However, the ML people started thinking about this from the other end and came up with some interestingly different tools.
Depending on your training, you might have encountered bounds in terms of Rademacher complexity, Gaussian complexity, Vapnik–Chervonenkis dimension, bias-variance tradeoffs… Modern results seem to appeal to concentration inequalities. Also, apparently stability of estimators gets you somewhere?
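The prototypical concentration argument, as I understand it: for a finite class $\mathcal{F}$ and a loss bounded in $[0,1]$, Hoeffding's inequality plus a union bound already gives

$$
\Pr\Bigl(\sup_{f\in\mathcal{F}} \bigl|R(f)-\hat{R}_n(f)\bigr| > \varepsilon\Bigr)
\le 2\,|\mathcal{F}|\,e^{-2n\varepsilon^{2}},
$$

and much of the fancier machinery (VC dimension, Rademacher complexity, chaining) exists to replace the cardinality $|\mathcal{F}|$ with something finite and meaningful for infinite classes.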
AFAICT, nice statistical learning results apply to rather restricted classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise to the models we actually use in practice.
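The kernel/SVM story I have in mind is a margin bound, stated loosely and hiding constants: if $\mathcal{F}$ is an RKHS ball of radius $B$ with kernel $k(x,x)\le 1$, then up to a $\sqrt{\log(1/\delta)/n}$ confidence term the zero-one risk is controlled by the empirical margin loss at margin $\gamma$,

$$
R_{0/1}(f) \;\le\; \hat{R}_{\gamma}(f) + O\!\Bigl(\frac{B}{\gamma\sqrt{n}}\Bigr),
$$

which is dimension-free but leans heavily on the RKHS-norm constraint and the surrogate loss; nothing this clean seems available for the models we actually deploy.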
If I were learning this these days, I’d start from the new Francis Bach textbook (Bach 2024) (online, supporting code).
1 VC dimension
For classifiers under zero-one loss.
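The classical statement, up to constants and log factors: if $\mathcal{F}$ is a class of $\{0,1\}$-valued classifiers with VC dimension $d$, then with probability at least $1-\delta$,

$$
\sup_{f\in\mathcal{F}}\bigl|R(f)-\hat{R}_n(f)\bigr|
\;\lesssim\; \sqrt{\frac{d\log(n/d) + \log(1/\delta)}{n}},
$$

so finite VC dimension buys distribution-free uniform convergence at the parametric $\tilde{O}(n^{-1/2})$ rate.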
2 Rademacher complexity
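The empirical Rademacher complexity of a class $\mathcal{F}$ on a sample $x_1,\dots,x_n$ is $\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma \sup_{f\in\mathcal{F}} \tfrac{1}{n}\sum_i \sigma_i f(x_i)$, with i.i.d. sign variables $\sigma_i$, and it plays the role that $\log|\mathcal{F}|$ plays in the finite-class bound above. A minimal sketch of what it measures, for the one class where the supremum is easy (norm-bounded linear predictors, where it reduces to $\tfrac{B}{n}\,\mathbb{E}_\sigma\|\sum_i\sigma_i x_i\|_2$); the function names and toy data here are mine, not from any particular reference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix: n points in d dimensions.
n, d, B = 200, 5, 1.0
X = rng.normal(size=(n, d))

def empirical_rademacher(X, B, n_mc=2000, rng=rng):
    """Monte-Carlo estimate of the empirical Rademacher complexity of
    the norm-bounded linear class {x -> <w, x> : ||w||_2 <= B}.

    For this class the supremum over w has a closed form:
        sup_{||w|| <= B} (1/n) sum_i sigma_i <w, x_i>
            = (B / n) * || sum_i sigma_i x_i ||_2.
    """
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))   # Rademacher signs
    sups = (B / n) * np.linalg.norm(sigma @ X, axis=1)
    return sups.mean()

estimate = empirical_rademacher(X, B)
# Classical analytic bound: B * sqrt(sum_i ||x_i||^2) / n, i.e. O(B / sqrt(n)).
bound = B * np.sqrt((X ** 2).sum()) / n
print(f"Monte-Carlo estimate {estimate:.4f} <= analytic bound {bound:.4f}")
```

For general classes there is no closed-form supremum, which is presumably why one reaches for structural tools (contraction, Dudley's entropy integral) rather than simulating it.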
3 Stability-based
Apparently studying the stability of estimators (how much the fitted predictor changes when a single training point is perturbed) gets us somewhere?
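The result I have in mind, hedging on exact constants: writing $A_S$ for the predictor learned from sample $S$, if the algorithm is $\beta$-uniformly stable, meaning that replacing any single training point changes the loss of $A_S$ by at most $\beta$ at every test point, then the expected generalisation gap is controlled by $\beta$ alone,

$$
\Bigl|\mathbb{E}_S\bigl[R(A_S) - \hat{R}_n(A_S)\bigr]\Bigr| \;\lesssim\; \beta,
$$

and for strongly regularised ERM (e.g. ridge-style penalties with weight $\lambda$ and Lipschitz convex losses) $\beta = O(1/(\lambda n))$, so we get rates with no capacity measure in sight.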
4 PAC-learning
Probably approximately correct, you mean?
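For my own reference, the definition as I recall it: a hypothesis class $\mathcal{H}$ is PAC-learnable if there is a learner $A$ and a sample-complexity function $m(\varepsilon,\delta)$ such that for every data distribution (and, in the realisable case, every target in $\mathcal{H}$), given a sample $S$ of $n \ge m(\varepsilon,\delta)$ i.i.d. points,

$$
\Pr\bigl(R(A_S) \le \varepsilon\bigr) \ge 1-\delta \quad\text{(realisable)},
\qquad
\Pr\bigl(R(A_S) \le \inf_{h\in\mathcal{H}} R(h) + \varepsilon\bigr) \ge 1-\delta \quad\text{(agnostic)}.
$$

The fundamental theorem for binary classification then says PAC-learnability is equivalent to finite VC dimension $d$, with sample complexity $\tilde{\Theta}(d/\varepsilon)$ in the realisable case and $\Theta(d/\varepsilon^{2})$ in the agnostic case, ignoring $\log(1/\delta)$ terms.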
5 Non-i.i.d. data
6 Incoming
Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).
See also function approximation and/or model selection for the statisticians’ approach, which is more about working out which model the data can support once you have fit it, frequently by appealing to asymptotic large-sample results. In learning theory, it seems one always cares about finite-sample bounds, which is reasonable; the attraction of getting them without, e.g., tedious and computationally expensive cross-validation or bootstrapping is understandable.
I would like to understand the relationships between the different kinds of convergence-rate results we get from the different learning theories.
(Golubev and Nussbaum 1990; Hasminskii and Ibragimov 1990) give minimax convergence rates in Sobolev class regression.
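If I recall correctly, these are the familiar nonparametric rates: writing $\mathcal{W}^{s}$ loosely for a Sobolev ball of smoothness $s$ in dimension $d$, the minimax mean-squared-error rate for regression is

$$
\inf_{\hat f}\,\sup_{f \in \mathcal{W}^{s}} \mathbb{E}\,\bigl\|\hat f - f\bigr\|_2^2 \;\asymp\; n^{-\frac{2s}{2s+d}},
$$

which is the kind of worst-case benchmark the learning-theory bounds above should be compared against.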
This looks like fun (David, Moran, and Yehudayoff 2016):
We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.