Model/hyperparameter selection
April 15, 2016 — August 20, 2017
Choosing which of a family of models to use, which is more or less the same problem as choosing the number of predictors, or the degree of regularisation. This is a kind of complement to statistical learning theory, where you hope to quantify how complicated a model is worth fitting to a given amount of data.
If your predictors are discrete and few, you can do this in the traditional fashion, by stepwise model selection, reasoning about the degrees of freedom of the model and of the data. If you are in the luxurious position of having a small, tractable number of parameters and the ability to run controlled trials, then you do ANOVA.
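To make the stepwise idea concrete, here is a minimal sketch of forward stepwise selection by AIC: greedily add whichever predictor most improves AIC, and stop when nothing helps. The synthetic data, column names, and greedy stopping rule are my own illustrative choices, not a canonical recipe.

```python
# Forward stepwise selection by AIC on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(5)])
y = 2.0 * X["x0"] - 1.0 * X["x2"] + rng.normal(size=n)  # only x0, x2 matter

def fit_aic(cols):
    # Intercept-only model when no predictors are selected yet.
    design = sm.add_constant(X[cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

selected, remaining = [], list(X.columns)
best_aic = fit_aic(selected)
while remaining:
    scores = {c: fit_aic(selected + [c]) for c in remaining}
    candidate = min(scores, key=scores.get)
    if scores[candidate] >= best_aic:
        break  # no remaining predictor improves AIC; stop
    selected.append(candidate)
    remaining.remove(candidate)
    best_aic = scores[candidate]

print("selected predictors:", selected, "AIC:", round(best_aic, 2))
```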
When you have penalisation parameters, we sometimes phrase this as regularisation and talk about regularisation-parameter selection, or hyperparameter selection, which we can do in various ways: degrees-of-freedom penalties, cross-validation, and so on. However, I'm not yet sure how to make those work in sparse regression.
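For instance, a bog-standard recipe selects a lasso penalty by k-fold cross-validation over a grid, keeping whichever penalty minimises held-out error. A sketch follows; the alpha grid and synthetic data are arbitrary choices, and scikit-learn's `LassoCV` packages up the same idea.

```python
# Choosing a lasso regularisation penalty by 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

alphas = np.logspace(-3, 1, 30)
cv_mse = [-cross_val_score(Lasso(alpha=a, max_iter=10_000), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
          for a in alphas]
best_alpha = alphas[int(np.argmin(cv_mse))]
print(f"best alpha by 5-fold CV: {best_alpha:.4g}")
```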
Multiple testing is model selection writ large: you consider many hypothesis tests, possibly effectively infinitely many, or you face a combinatorial explosion of possible predictors to include.
🏗 document connection with graphical models and thus conditional independence tests.
1 Bayesian
Bayesian model selection is also a thing, although the framing is a little different. In the classic Bayesian method I keep all my models, although some might become very unlikely. But apparently I can also throw some out entirely? Presumably for reasons of computational tractability or what-have-you.
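As a toy example of the classic machinery, here is a Bayes-factor comparison of two coin-flip models via their marginal likelihoods, in the conjugate Beta-Binomial case where the integrals come out in closed form. The models and data are invented for illustration.

```python
# Bayes factor for a point-mass "fair coin" model versus a Beta(1,1)-prior
# model, given k heads in n flips.
from math import comb
import numpy as np
from scipy.special import betaln

k, n = 34, 50  # 34 heads in 50 flips

# M1: theta = 0.5 exactly.
log_ml_m1 = np.log(comb(n, k)) + n * np.log(0.5)

# M2: theta ~ Beta(1, 1). Integrating theta out gives
#   P(k | M2) = C(n, k) * B(k + 1, n - k + 1) / B(1, 1).
log_ml_m2 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1) - betaln(1, 1)

bayes_factor = np.exp(log_ml_m1 - log_ml_m2)
print(f"Bayes factor (fair coin vs uniform prior): {bayes_factor:.3f}")
```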
2 Consistency
If the model order itself is the parameter of interest, how do you do consistent inference on it? AIC, for example, is derived to optimise prediction loss, not to identify the true model, and it tends to overselect. BIC, with its heavier log n complexity penalty, is model-selection consistent under classical regularity conditions, at some cost in prediction efficiency.
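To see the difference in action, here is a sketch comparing the polynomial orders that AIC and BIC select when the true order is known (a cubic I made up). BIC's log n penalty tends to recover the true order at large n, where AIC is prone to overselecting.

```python
# AIC vs BIC for polynomial order selection with Gaussian errors.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x - 2.0 * x**3 + rng.normal(size=n)  # true order 3

def ic_for_order(p):
    X = np.vander(x, p + 1)                    # p + 1 polynomial coefficients
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = p + 2                                  # coefficients + noise variance
    # Gaussian log-likelihood at the MLE sigma^2 = rss / n.
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return -2 * loglik + 2 * k, -2 * loglik + np.log(n) * k

aics, bics = zip(*(ic_for_order(p) for p in range(1, 9)))
print("AIC picks order", 1 + int(np.argmin(aics)))
print("BIC picks order", 1 + int(np.argmin(bics)))
```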
An exhausting, exhaustive review of various model selection procedures, with an eye to consistency, is given in C. R. Rao and Wu (2001).
3 Cross validation
See cross validation.
4 For mixture models
See mixture models.
5 Under sparsity
6 Hyperparameter search
How do you choose your hyperparameters? NB hyperparameters are not always about model selection per se; some govern, e.g., the convergence rate of the optimiser rather than the complexity of the model. Anyway. One could equally well regard hyperparameters as ordinary parameters to be inferred.
Turns out you can cast this as a bandit problem, or a sequential Bayesian optimisation problem.
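As a concrete instance of the bandit framing, here is a sketch of successive halving: start many random configurations on a tiny budget, repeatedly discard the worse half, and double the budget for the survivors. The noisy quadratic "loss" stands in for an actual training run, and all the constants are arbitrary.

```python
# Successive halving, a bandit-style hyperparameter search.
import numpy as np

rng = np.random.default_rng(2)

def noisy_loss(config, budget):
    # Stand-in for "train with this config for `budget` epochs and report
    # validation loss": a quadratic in the config, with noise that shrinks
    # as the budget grows.
    return (config - 0.3) ** 2 + rng.normal(scale=1.0 / np.sqrt(budget))

configs = rng.uniform(0, 1, size=32)  # 32 random hyperparameter draws
budget = 1
while len(configs) > 1:
    losses = np.array([noisy_loss(c, budget) for c in configs])
    keep = np.argsort(losses)[: len(configs) // 2]  # keep the best half
    configs = configs[keep]
    budget *= 2                                     # double their budget
print("selected hyperparameter:", configs[0])
```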