Penalised/regularised regression

June 23, 2016 — September 19, 2022

Bayes
functional analysis
linear algebra
model selection
optimization
probability
signal processing
sparser than thou
statistics

Regression estimation with penalties on the model parameters. I am especially interested in sparsifying penalties, and I have more notes on sparse regression.

Here I consider general penalties: ridge etc. At least in principle — I have no active projects using penalties without sparsifying them at the moment.

Why might I use such penalties? One reason is that \(L_2\) penalties lead to simple forms for their information criteria (Konishi and Kitagawa 2008, §5.2.4).
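To sketch what that simplicity looks like for ridge regression specifically (this is the generic linear-smoother version of such a criterion, not Konishi and Kitagawa's exact GIC): with design matrix \(X\) and penalty \(\lambda\lVert\beta\rVert_2^2\),

\[
\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y,
\qquad
H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top,
\]

so the fit is linear in \(y\), the effective degrees of freedom is \(\operatorname{tr} H_\lambda\), and an AIC-style criterion reduces to

\[
\operatorname{AIC}(\lambda) \approx n \log\!\Big(\tfrac{1}{n}\lVert y - H_\lambda y\rVert_2^2\Big) + 2 \operatorname{tr} H_\lambda .
\]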

See also matrix factorisations, optimisation, multiple testing, concentration inequalities, sparse flavoured ice cream.

To discuss:

Ridge penalties, relationship with robust regression, statistical learning theory etc.

In nonparametric statistics, we might estimate simultaneously what look like many, many parameters, which we constrain in some clever fashion, which usually boils down to something we can interpret as a “penalty” on the parameters.

“Penalization” has a genealogy unknown to me, but it is probably the least abstruse term for common, general usage.

The “regularisation” nomenclature claims descent from Tikhonov (e.g. Tikhonov and Glasko 1965), who wanted to solve ill-conditioned integral and differential equations, a slightly more general setting.

In statistics, the term “shrinkage” is used for very nearly the same thing.

“Smoothing” seems to be common in the spline and kernel estimation communities (Silverman 1982, 1984; Wahba 1990), who usually really do want to smooth curves. When we say “smoothing” we usually mean that the predictions can be expressed via a “linear smoother”/hat matrix, which has certain nice properties in generalised cross validation.

“Smoothing” is not a great general term, since penalisation does not necessarily cause “smoothness” from any particular perspective. For example, some penalties cause the coefficients to become sparse, which, viewed as a property of the coefficient vector, is the opposite of smooth. Often the thing that becomes smooth is not obvious.

Regardless, what these problems share is that we wish to solve an ill-conditioned inverse problem, so we tame it by adding a penalty to solutions we feel we should be reluctant to accept.
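A toy numerical illustration of that taming, tying it back to the hat-matrix/GCV remark above (a sketch using numpy only; the nearly collinear design and the \(\lambda\) grid are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned inverse problem: columns of X are nearly collinear.
n, p = 50, 10
base = rng.standard_normal((n, 1))
X = base + 1e-3 * rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)


def ridge_fit(X, y, lam):
    """Ridge/Tikhonov estimate and its hat matrix (the 'linear smoother')."""
    G = X.T @ X + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(G, X.T @ y)
    H = X @ np.linalg.solve(G, X.T)
    return beta, H


def gcv(X, y, lam):
    """Generalised cross-validation score for a linear smoother."""
    n = len(y)
    _, H = ridge_fit(X, y, lam)
    resid = y - H @ y
    return (resid @ resid / n) / (1 - np.trace(H) / n) ** 2


lams = np.logspace(-6, 2, 50)
best_lam = min(lams, key=lambda lam: gcv(X, y, lam))
beta_ridge, _ = ridge_fit(X, y, best_lam)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # unpenalised fit, for contrast

print("condition number of X:", np.linalg.cond(X))
print("GCV-chosen lambda:", best_lam)
print("norm of OLS estimate:", np.linalg.norm(beta_ols))
print("norm of ridge estimate:", np.linalg.norm(beta_ridge))
```

The GCV score here is the classic Golub, Heath, and Wahba (1979) criterion; on a design this collinear the ridge solution typically has a far smaller norm, and far smaller variance, than the unpenalised one.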

🏗 specifics

1 Connection to Bayesian priors

Famously, a penalty can have an interpretation as a Bayesian prior on the solution space. It is a fun exercise, for example, to “rediscover” lasso regression as a typical linear regression with a Laplace prior on the coefficients. In that case, the maximum a posteriori estimate under that prior and the lasso solution coincide. If we want the full posterior we have to do a lot more work, but the connection is suggestive nonetheless.
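To spell that exercise out (my notation, and a sketch rather than a careful derivation): take \(y \mid \beta \sim \mathcal N(X\beta, \sigma^2 I)\) and independent Laplace priors \(\beta_j \sim \mathrm{Laplace}(0, b)\). Up to constants, the negative log posterior is

\[
-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\lVert y - X\beta\rVert_2^2 + \frac{1}{b}\lVert\beta\rVert_1 + \text{const},
\]

so the MAP estimate solves the lasso problem

\[
\hat\beta_{\mathrm{MAP}} = \operatorname*{arg\,min}_\beta \; \tfrac12 \lVert y - X\beta\rVert_2^2 + \lambda \lVert\beta\rVert_1,
\qquad \lambda = \frac{\sigma^2}{b}.
\]

Swapping the Laplace prior for a Gaussian one gives the ridge objective by the same argument.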

A related and useful connection is the interpretation of covariance kernels as priors producing smoothness in solutions. A very elegant introduction to these is given in Miller, Glennie, and Seaton (2020).

2 As shrinkage

TBD. See James and Stein (1961). There is a beautiful explainer video on this.
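For reference, the canonical statement: if \(x \sim \mathcal N(\theta, \sigma^2 I_p)\) is a single observation with \(p \ge 3\) and known \(\sigma^2\), the James–Stein estimator

\[
\hat\theta_{\mathrm{JS}} = \left(1 - \frac{(p-2)\sigma^2}{\lVert x \rVert_2^2}\right) x
\]

shrinks the observation towards the origin and has uniformly smaller total squared-error risk than the maximum likelihood estimator \(\hat\theta = x\), which is what makes it such a striking advertisement for shrinkage.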

3 Adaptive regularization

What should we regularize to attain specific kinds of solutions?

Here’s one thing I saw recently:

Venkat Chandrasekaran, Learning Semidefinite Regularizers via Matrix Factorization

Abstract: Regularization techniques are widely employed in the solution of inverse problems in data analysis and scientific computing due to their effectiveness in addressing difficulties due to ill-posedness. In their most common manifestation, these methods take the form of penalty functions added to the objective in optimization-based approaches for solving inverse problems. The purpose of the penalty function is to induce a desired structure in the solution, and these functions are specified based on prior domain-specific expertise. We consider the problem of learning suitable regularization functions from data in settings in which prior domain knowledge is not directly available. Previous work under the title of ‘dictionary learning’ or ‘sparse coding’ may be viewed as learning a polyhedral regularizer from data. We describe generalizations of these methods to learn semidefinite regularizers by computing structured factorizations of data matrices. Our algorithmic approach for computing these factorizations combines recent techniques for rank minimization problems along with operator analogs of Sinkhorn scaling. The regularizers obtained using our framework can be employed effectively in semidefinite programming relaxations for solving inverse problems. (Joint work with Yong Sheng Soh)

4 References

Akaike, Hirotogu. 1973. Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceeding of the Second International Symposium on Information Theory.
Akaike, Hirotugu. 1973. Maximum Likelihood Identification of Gaussian Autoregressive Moving Average Models.” Biometrika.
Azizyan, Krishnamurthy, and Singh. 2015. Extreme Compressive Sampling for Covariance Estimation.” arXiv:1506.00898 [Cs, Math, Stat].
Bach. 2009. Model-Consistent Sparse Estimation Through the Bootstrap.” arXiv:0901.3202 [Cs, Stat].
Banerjee, Chen, Fazayeli, et al. 2014. Estimation with Norm Regularization.” In Advances in Neural Information Processing Systems 27.
Barron, Huang, Li, et al. 2008. MDL, Penalized Likelihood, and Statistical Risk.” In Information Theory Workshop, 2008. ITW’08. IEEE.
Battiti. 1992. First-and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” Neural Computation.
Bellec, and Tsybakov. 2016. Bounds on the Prediction Error of Penalized Least Squares Estimators with Convex Penalty.” arXiv:1609.06675 [Math, Stat].
Bickel, Li, Tsybakov, et al. 2006. Regularization in Statistics.” Test.
Brown, and Lin. 2004. Statistical Properties of the Method of Regularization with Periodic Gaussian Reproducing Kernel.” The Annals of Statistics.
Bühlmann, and van de Geer. 2011. Additive Models and Many Smooth Univariate Functions.” In Statistics for High-Dimensional Data. Springer Series in Statistics.
———. 2015. High-Dimensional Inference in Misspecified Linear Models.” arXiv:1503.06426 [Stat].
Burman, and Nolan. 1995. A General Akaike-Type Criterion for Model Selection in Robust Regression.” Biometrika.
Candès, and Fernandez-Granda. 2013. Super-Resolution from Noisy Data.” Journal of Fourier Analysis and Applications.
Candès, and Plan. 2010. Matrix Completion With Noise.” Proceedings of the IEEE.
Cavanaugh. 1997. Unifying the Derivations for the Akaike and Corrected Akaike Information Criteria.” Statistics & Probability Letters.
Chen, and Wang. n.d. Discussion on ‘Confidence Intervals and Hypothesis Testing for High-Dimensional Regression’.”
Chernozhukov, Hansen, and Spindler. 2015. Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics.
Efron. 2004. The Estimation of Prediction Error.” Journal of the American Statistical Association.
Efron, Hastie, Johnstone, et al. 2004. Least Angle Regression.” The Annals of Statistics.
Flynn, Hurvich, and Simonoff. 2013. Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models.” arXiv:1302.2068 [Stat].
Friedman, Hastie, and Tibshirani. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software.
Fuglstad, Simpson, Lindgren, et al. 2019. Constructing Priors That Penalize the Complexity of Gaussian Random Fields.” Journal of the American Statistical Association.
Giryes, Sapiro, and Bronstein. 2014. On the Stability of Deep Networks.” arXiv:1412.5896 [Cs, Math, Stat].
Golubev, and Nussbaum. 1990. A Risk Bound in Sobolev Class Regression.” The Annals of Statistics.
Golub, Heath, and Wahba. 1979. Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics.
Green, Peter J. 1990. On Use of the EM for Penalized Likelihood Estimation.” Journal of the Royal Statistical Society. Series B (Methodological).
Green, P. J. 1990. Bayesian Reconstructions from Emission Tomography Data Using a Modified EM Algorithm.” IEEE Transactions on Medical Imaging.
Gu. 1993. Smoothing Spline Density Estimation: A Dimensionless Automatic Algorithm.” Journal of the American Statistical Association.
Gui, and Li. 2005. Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, with Applications to Microarray Gene Expression Data.” Bioinformatics.
Hastie, and Tibshirani. 1990. Generalized Additive Models.
Hastie, Tibshirani, Rob, and Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations.
Hawe, Kleinsteuber, and Diepold. 2013. Analysis Operator Learning and Its Application to Image Reconstruction.” IEEE Transactions on Image Processing.
Hegde, Indyk, and Schmidt. 2015. A Nearly-Linear Time Framework for Graph-Structured Sparsity.” In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).
Hoerl, and Kennard. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics.
Huang, Liu, Pourahmadi, et al. 2006. Covariance Matrix Selection and Estimation via Penalised Normal Likelihood.” Biometrika.
James, and Stein. 1961. Estimation with Quadratic Loss.” In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability.
Janson, Fithian, and Hastie. 2015. Effective Degrees of Freedom: A Flawed Metaphor.” Biometrika.
Javanmard, and Montanari. 2014. Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research.
Kaufman, and Rosset. 2014. When Does More Regularization Imply Fewer Degrees of Freedom? Sufficient Conditions and Counterexamples.” Biometrika.
Kloft, Rückert, and Bartlett. 2010. A Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.
Koenker, and Mizera. 2006. Density Estimation by Total Variation Regularization.” Advances in Statistical Modeling and Inference.
Konishi, and Kitagawa. 1996. Generalised Information Criteria in Model Selection.” Biometrika.
Konishi, and Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer Series in Statistics.
Lange. 1990. Convergence of EM image reconstruction algorithms with Gibbs smoothing.” IEEE transactions on medical imaging.
Liu, Roeder, and Wasserman. 2010. Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models.” In Advances in Neural Information Processing Systems 23.
Meinshausen, and Bühlmann. 2010. Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Meyer. 2008. Inference Using Shape-Restricted Regression Splines.” The Annals of Applied Statistics.
Miller, Glennie, and Seaton. 2020. Understanding the Stochastic Partial Differential Equation Approach to Smoothing.” Journal of Agricultural, Biological and Environmental Statistics.
Montanari. 2012. Graphical Models Concepts in Compressed Sensing.” Compressed Sensing: Theory and Applications.
Needell, and Tropp. 2008. CoSaMP: Iterative Signal Recovery from Incomplete and Inaccurate Samples.” arXiv:0803.2392 [Cs, Math].
Rahimi, and Recht. 2009. Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems.
Rezende, Mohamed, and Wierstra. 2015. Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In Proceedings of ICML.
Rigollet, and Weed. 2018. Entropic Optimal Transport Is Maximum-Likelihood Deconvolution.”
Shen, and Huang. 2006. Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association.
Shen, Huang, and Ye. 2004. Adaptive Model Selection and Assessment for Exponential Family Distributions.” Technometrics.
Shen, and Ye. 2002. Adaptive Model Selection.” Journal of the American Statistical Association.
Silverman. 1982. On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method.” The Annals of Statistics.
———. 1984. Spline Smoothing: The Equivalent Variable Kernel Method.” The Annals of Statistics.
Simon, Friedman, Hastie, et al. 2011. Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software.
Smola, Schölkopf, and Müller. 1998. The Connection Between Regularization Operators and Support Vector Kernels.” Neural Networks.
Somekh-Baruch, Leshem, and Saligrama. 2016. On the Non-Existence of Unbiased Estimators in Constrained Estimation Problems.” arXiv:1609.07415 [Cs, Math, Stat].
Stein. 1981. Estimation of the Mean of a Multivariate Normal Distribution.” The Annals of Statistics.
Tansey, Koyejo, Poldrack, et al. 2014. False Discovery Rate Smoothing.” arXiv:1411.6144 [Stat].
Tikhonov, and Glasko. 1965. Use of the Regularization Method in Non-Linear Problems.” USSR Computational Mathematics and Mathematical Physics.
Uematsu. 2015. Penalized Likelihood Estimation in High-Dimensional Time Series Models and Its Application.” arXiv:1504.06706 [Math, Stat].
van de Geer. 2014a. Weakly Decomposable Regularization Penalties and Structured Sparsity.” Scandinavian Journal of Statistics.
———. 2014b. Statistical Theory for High-Dimensional Models.” arXiv:1409.8557 [Math, Stat].
Wahba. 1990. Spline Models for Observational Data.
Weng, Maleki, and Zheng. 2018. Overcoming the Limitations of Phase Transition by Higher Order Analysis of Regularization Techniques.” The Annals of Statistics.
Wood, S. N. 2000. Modelling and Smoothing Parameter Estimation with Multiple Quadratic Penalties.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Wood, Simon N. 2008. Fast Stable Direct Fitting and Smoothness Selection for Generalized Additive Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Wu, and Lange. 2008. Coordinate Descent Algorithms for Lasso Penalized Regression.” The Annals of Applied Statistics.
Xie, Liang, and Song. 2016. Diversity Leads to Generalization in Neural Networks.” arXiv:1611.03131 [Cs, Stat].
Ye. 1998. On Measuring and Correcting the Effects of Data Mining and Model Selection.” Journal of the American Statistical Association.
Zhang, Yiyun, Li, and Tsai. 2010. Regularization Parameter Selections via Generalized Information Criterion.” Journal of the American Statistical Association.
Zhang, Cun-Hui, and Zhang. 2014. Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Zou, and Hastie. 2005. Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Zou, Hastie, and Tibshirani. 2007. On the ‘Degrees of Freedom’ of the Lasso.” The Annals of Statistics.