Statistical learning theory

Eventually including structural risk minimisation, risk bounds, hopefully-uniform convergence rates, VC-dimension, generalisation-and-stability framings etc

July 6, 2016 — December 21, 2024

approximation
functional analysis
machine learning
measure
metrics
model selection
optimization
probability
statmech
Figure 1

Another placeholder far from my own background.

Given some amount of noisy data, how complex a model can I learn? How can I know how well it will generalize to new data? If I can answer such questions a priori, I can fit a complex model with some messy regularisation hyperparameters and choose those hyperparameters analytically.

When I did statistics, we talked about this in terms of regularisation and model selection. However, the ML people started thinking about this from the other end and came up with some interestingly different tools.

Depending on your training, you might have encountered bounds in terms of Rademacher complexity, Gaussian complexity, Vapnik-Chernovenkis dimension, bias-variance tradeoffs… Modern results seem to appeal to concentration inequalities. Also, apparently stability of estimators gets you somewhere?

AFAICT, nice statistical learning results apply to rather restricted classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise to the models we actually use in practice.

If I were learning this these days, I’d start from the new Francis Bach textbook (Bach 2024) (online, supporting code).

1 VC dimension

For classifier losses.

2 Rademacher complexity

3 Stability-based

Apparently studying stability of estimators gets us somewhere?

4 PAC-learning

Probably approximately correct, you mean?

5 Non-I.I.D data

See learning for dependent data.

6 Incoming

Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and or model selection for the statisticians’ approach, which is more about working out which model our data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory, it seems one always cares about finite sample bounds. Which is reasonable. and the attractiveness of getting them without, e.g. tedious and computationally expensive cross validation, bootstrapping is understandable.

I would like to understand the relationship between the different kinds of convergence rate results that we get, and different learning theories.

(Golubev and Nussbaum 1990; Hasminskii and Ibragimov 1990) give minimax convergence rates in Sobolev class regression.

This looks like fun (David, Moran, and Yehudayoff 2016):

We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.

7 References

Abbeel, Koller, and Ng. 2006. Learning Factor Graphs in Polynomial Time and Sample Complexity.” Journal of Machine Learning Research.
Bach. 2024. Learning Theory from First Principles. Adaptive Computation and Machine Learning.
Backurs, Indyk, and Schmidt. 2017. On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks.” arXiv:1704.02958 [Cs, Stat].
Banerjee, Chen, Fazayeli, et al. 2014. Estimation with Norm Regularization.” In Advances in Neural Information Processing Systems 27.
Barbier, Krzakala, Macris, et al. 2017. Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models.” arXiv:1708.03395 [Cond-Mat, Physics:math-Ph].
Barbier, and Macris. 2017. The Stochastic Interpolation Method: A Simple Scheme to Prove Replica Formulas in Bayesian Inference.” arXiv:1705.02780 [Cond-Mat].
Barbour, and Chen, eds. 2005. An Introduction to Stein’s Method. Lecture Notes Series / Institute for Mathematical Sciences, National University of Singapore, v. 4.
Barron, Huang, Li, et al. 2008. MDL, Penalized Likelihood, and Statistical Risk.” In Information Theory Workshop, 2008. ITW’08. IEEE.
Bartlett, Peter L, Jordan, and McAuliffe. 2006. Convexity, Classification, and Risk Bounds.” Journal of the American Statistical Association.
Bartlett, Peter L., and Mendelson. 2002. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” Journal of Machine Learning Research.
Bellec, and Tsybakov. 2016. Bounds on the Prediction Error of Penalized Least Squares Estimators with Convex Penalty.” arXiv:1609.06675 [Math, Stat].
Bertin, Pennec, and Rivoirard. 2011. Adaptive Dantzig Density Estimation.” Annales de l’Institut Henri Poincaré, Probabilités Et Statistiques.
Birgé. 2008. Model Selection for Density Estimation with L2-Loss.” arXiv:0808.1416 [Math, Stat].
Blumer, Ehrenfeucht, Haussler, et al. 1989. Learnability and the Vapnik-Chervonenkis Dimension.” J. ACM.
Bousquet, Boucheron, and Lugosi. 2004. Introduction to Statistical Learning Theory.” In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 3176.
Bousquet, Hanneke, and Moran. 2020. “A Theory of Universal Learning.”
Bunea, Tsybakov, and Wegkamp. 2007a. Sparsity Oracle Inequalities for the Lasso.” Electronic Journal of Statistics.
Bunea, Tsybakov, and Wegkamp. 2007b. Sparse Density Estimation with ℓ1 Penalties.” In Learning Theory. Lecture Notes in Computer Science.
Canas, and Rosasco. 2012. Learning Probability Measures with Respect to Optimal Transport Metrics.” arXiv:1209.1077 [Cs, Stat].
Carmi. 2013. Compressive System Identification: Sequential Methods and Entropy Bounds.” Digital Signal Processing.
Chichignoud, Lederer, and Wainwright. 2014. A Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees.” arXiv:1410.0247 [Math, Stat].
Daneshmand, Gomez-Rodriguez, Song, et al. 2014. Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-Thresholding Algorithm.” In ICML.
David, Moran, and Yehudayoff. 2016. On Statistical Learning via the Lens of Compression.” arXiv:1610.03592 [Cs, Math].
Devroye, Györfi, and Lugosi. 1996. A Probabilistic Theory of Pattern Recognition.
Dinh, Ho, Nguyen, et al. 2016. Fast Learning Rates with Heavy-Tailed Losses.” In NIPS.
Flammia, Gross, Liu, et al. 2012. Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.” New Journal of Physics.
Foygel, and Srebro. 2011. Fast-Rate and Optimistic-Rate Error Bounds for L1-Regularized Regression.” arXiv:1108.0373 [Math, Stat].
Gnecco, and Sanguineti. 2008. Approximation Error Bounds via Rademacher’s Complexity.” Applied Mathematical Sciences.
Golowich, Rakhlin, and Shamir. 2017. Size-Independent Sample Complexity of Neural Networks.” arXiv:1712.06541 [Cs, Stat].
Golubev, and Nussbaum. 1990. A Risk Bound in Sobolev Class Regression.” The Annals of Statistics.
Goodfellow, Papernot, and McDaniel. 2016. Cleverhans V0.1: An Adversarial Machine Learning Library.” arXiv:1610.00768 [Cs, Stat].
Gribonval, Blanchard, Keriven, et al. 2017. Compressive Statistical Learning with Random Feature Moments.” arXiv:1706.07180 [Cs, Math, Stat].
Hansen, Reynaud-Bouret, and Rivoirard. 2015. Lasso and Probabilistic Inequalities for Multivariate Point Processes.” Bernoulli.
Hasminskii, and Ibragimov. 1990. On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory.” The Annals of Statistics.
Hastie, Trevor J., Tibshirani, Rob, and Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations.
Hastie, Trevor, Tibshirani, and Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction.
Huang, Cheang, and Barron. 2008. Risk of Penalized Least Squares, Greedy Selection and L1 Penalization for Flexible Function Libraries.”
Kabán. 2014. New Bounds on Compressive Linear Least Squares Regression.” In Journal of Machine Learning Research.
Kloft, Rückert, and Bartlett. 2010. A Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.
Koltchinskii. 2011. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033.
Krishnapuram, Carin, Figueiredo, et al. 2005. Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kuznetsov, and Mohri. 2015. Learning Theory and Algorithms for Forecasting Non-Stationary Time Series.” In Advances in Neural Information Processing Systems.
Lederer, and van de Geer. 2014. New Concentration Inequalities for Suprema of Empirical Processes.” Bernoulli.
Maurya. 2018. “Optimal Transport in Statistical Machine Learning : Selected Review and Some Open Questions.” In.
McDonald, Daniel J, Shalizi, and Schervish. 2011a. Estimated VC Dimension for Risk Bounds.” Eprint arXiv:1111.3404.
McDonald, Daniel J., Shalizi, and Schervish. 2011b. Generalization Error Bounds for Stationary Autoregressive Models.” arXiv:1103.0942 [Cs, Stat].
———. 2011c. Risk Bounds for Time Series Without Strong Mixing.” arXiv:1106.0730 [Cs, Stat].
Mei, Bai, and Montanari. 2016. The Landscape of Empirical Risk for Non-Convex Losses.” arXiv:1607.06534 [Stat].
Meng, Wang, Chen, et al. 2016. Generalization Error Bounds for Optimization Algorithms via Stability.” In arXiv:1609.08397 [Stat].
Natarajan. 1989. On Learning Sets and Functions.” Machine Learning.
Oneto. 2018. Model Selection and Error Estimation Without the Agonizing Pain.” WIREs Data Mining and Knowledge Discovery.
Pascanu, Mikolov, and Bengio. 2013. On the Difficulty of Training Recurrent Neural Networks.” In arXiv:1211.5063 [Cs].
Reid, and Williamson. 2011. Information, Divergence and Risk for Binary Experiments.” Journal of Machine Learning Research.
Reinert. 2000. Stein’s Method for Epidemic Processes.” In Complex Stochastic Systems.
Reynaud-Bouret. 2003. Adaptive Estimation of the Intensity of Inhomogeneous Poisson Processes via Concentration Inequalities.” Probability Theory and Related Fields.
Reynaud-Bouret, and Schbath. 2010. Adaptive Estimation for Hawkes Processes; Application to Genome Analysis.” The Annals of Statistics.
Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
———. 2003. A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 2600.
Semenova, Rudin, and Parr. 2021. A Study in Rashomon Curves and Volumes: A New Perspective on Generalization and Model Simplicity in Machine Learning.” arXiv:1908.01755 [Cs, Stat].
Şimşekli, Sener, Deligiannidis, et al. 2020. Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks.” CoRR.
Singh, and Póczos. 2018. Minimax Distribution Estimation in Wasserstein Distance.” arXiv:1802.08855 [Cs, Math, Stat].
Tropp. 2015. An Introduction to Matrix Concentration Inequalities.
Tulabandhula, and Rudin. 2014. Generalization Bounds for Learning with Linear, Polygonal, Quadratic and Conic Side Knowledge.” arXiv Preprint arXiv:1405.7764.
Vainsencher, Mannor, and Bruckstein. 2011. The Sample Complexity of Dictionary Learning.” Journal of Machine Learning Research.
Geer, Sara van de. 2002. “On Hoeffdoing’s Inequality for Dependent Random Variables.” In Empirical Process Techniques for Dependent Data.
Geer, Sara A. van de. 2008. High-Dimensional Generalized Linear Models and the Lasso.” The Annals of Statistics.
Geer, Sara van de. 2014. Worst Possible Sub-Directions in High-Dimensional Models.” In arXiv:1403.7023 [Math, Stat].
Vapnik, Vladimir. 2010. The Nature of Statistical Learning Theory.
Vapnik, V., and Chervonenkis. 1971. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability & Its Applications.
Vapnik, Vladimir, Levin, and Cun. 1994. Measuring the VC-Dimension of a Learning Machine.” Neural Computation.
von Luxburg, and Schölkopf. 2008. Statistical Learning Theory: Models, Concepts, and Results.” arXiv:0810.4752 [Math, Stat].
Wu, and Zhou. 2008. Learning with Sample Dependent Hypothesis Spaces.” Computers & Mathematics with Applications.
Xu, and Raginsky. 2017. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.” In Advances In Neural Information Processing Systems.
Zhang, Bengio, Hardt, et al. 2017. Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.