Ensembling neural nets
Monte Carlo
December 14, 2020 — November 24, 2021
One of the practical forms of Bayesian inference for massively parameterized networks is model averaging.
1 Explicit ensembles
Train a collection of networks and calculate empirical means and variances across their predictions to estimate the mean and variance of the posterior predictive (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and on one hand we might think there is nothing special to do here, since it’s more or less classical model ensembling, as near as I can tell. In practice, though, making this work for neural networks takes some tricks: a single model is often already big enough to strain the GPU, so naively training and storing many of them is presumably ridiculous. BatchEnsemble is one such trick (Wen, Tran, and Ba 2020).
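The combining step can be sketched in a few lines. This is a minimal illustration, not any of the cited papers' code: it assumes each member already yields a prediction at a given input (and optionally its own predictive variance), and the function name `ensemble_moments` and the numbers in the usage note are made up for the example. The variance is that of an equally weighted mixture, via the law of total variance.

```python
from statistics import fmean


def ensemble_moments(member_means, member_vars=None):
    """Combine per-member predictions at a single input.

    member_means: one point prediction per ensemble member.
    member_vars:  optional per-member predictive variances (for members
                  that output a Gaussian); treated as zero if omitted.
    Returns (mean, variance) of the equally weighted mixture.
    """
    if member_vars is None:
        member_vars = [0.0] * len(member_means)
    mean = fmean(member_means)
    # Law of total variance: average within-member variance plus the
    # spread of the member means around the ensemble mean.
    within = fmean(member_vars)
    between = fmean([(m - mean) ** 2 for m in member_means])
    return mean, within + between
```

For instance, `ensemble_moments([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` combines three hypothetical members; disagreement between members inflates the variance beyond the 0.5 that each member reports on its own.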
Cute: Justin Domke, in The human regression ensemble, creates ensembles of curves that he drew through datapoints on a PDF and gets pretty good results.
2 Dropout
Dropout is an implicit ensembling method. Or maybe the implicit ensembling method; I am not aware of others. Recommended reading: Foong et al. (2019); Gal, Hron, and Kendall (2017); Kingma, Salimans, and Welling (2015).
Dropout is a popular kind of noise layer which randomly zeroes out some coefficients in the net when training (and optionally while predicting). A coarse resemblance to random forests etc. is pretty immediate, and indeed you can just use those instead. Here, however, we are trying to average over strong learners, not weak learners.
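Leaving the dropout noise switched on while predicting can be sketched as follows. This is a toy under loud assumptions: the weights `W1`, `B1`, `W2`, `B2` are invented stand-ins for a trained network, the names `forward` and `mc_dropout_predict` are hypothetical, and the masks are applied to hidden units using the common "inverted dropout" rescaling. Each stochastic forward pass is treated as one ensemble member, and we report the empirical mean and variance over passes.

```python
import math
import random
from statistics import fmean, pvariance

# Hypothetical fixed weights for a 1-input, 3-hidden-unit, 1-output net;
# in practice these would come from training with dropout enabled.
W1 = [0.8, -0.5, 0.3]
B1 = [0.1, 0.0, -0.2]
W2 = [1.2, -0.7, 0.5]
B2 = 0.05


def forward(x, rng, p_drop=0.5, stochastic=True):
    """One forward pass; hidden units are zeroed with probability p_drop."""
    out = B2
    for w1, b1, w2 in zip(W1, B1, W2):
        h = math.tanh(w1 * x + b1)
        if stochastic:
            # Inverted dropout: rescale the survivors so that the
            # expected activation is unchanged.
            keep = rng.random() >= p_drop
            h = h / (1.0 - p_drop) if keep else 0.0
        out += w2 * h
    return out


def mc_dropout_predict(x, n_samples=200, p_drop=0.5, seed=0):
    """Mean and variance over forward passes with dropout left on."""
    rng = random.Random(seed)
    draws = [forward(x, rng, p_drop) for _ in range(n_samples)]
    return fmean(draws), pvariance(draws)
```

Calling `mc_dropout_predict(1.0)` returns a predictive mean and a variance; with `p_drop=0.0` every pass is identical and the variance collapses to zero, which is a quick sanity check that the spread comes only from the masks.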
The key insight here is that dropout can be rationalized, apparently, as model averaging and thence as a kind of implicit probabilistic learning because in the limit it approaches a certain deep Gaussian process (Kingma, Salimans, and Welling 2015; Gal and Ghahramani 2016b, 2015). Leveraging this argument, some papers claim to approximate Bayesian inference by randomising dropout (M. Kasim et al. 2019; M. F. Kasim et al. 2020).
AFAICT the current consensus seems to be that the highly cited and very simple model of Gal and Ghahramani (2015) is flawed, and that the rather more onerous approach of Kingma, Salimans, and Welling (2015) is how you would use dropout to get a more reasonable posterior. So much was said in a seminar, but I have not really used either paper in practice, so I cannot comment.
3 Alternate model combinations
Should we stop weighting hypotheses and start “stacking” instead? Yao et al. (2018) argue so (also, how is that different?). From their abstract:
The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability.
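To make the stacking idea concrete, here is a rough sketch with the logarithmic score as the scoring rule. It assumes the leave-one-out predictive densities are already available as a table (the paper gets them via Pareto smoothed importance sampling, which is not reproduced here), and the numbers in `LOO_DENSITIES` are invented. The optimiser is my own choice for the illustration: the standard EM fixed-point update for mixture weights, which never decreases the mean log score. The names `stacking_weights` and `mean_log_score` are hypothetical.

```python
import math

# Hypothetical table: LOO_DENSITIES[i][k] is model k's leave-one-out
# predictive density evaluated at held-out observation i.
LOO_DENSITIES = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.3],
    [0.1, 0.7, 0.3],
    [0.2, 0.9, 0.3],
]


def mean_log_score(weights, loo_dens):
    """Average log density of the weighted mixture over held-out points."""
    return sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in loo_dens
    ) / len(loo_dens)


def stacking_weights(loo_dens, n_iter=500):
    """Simplex weights that (approximately) maximise the mean log score."""
    n, k = len(loo_dens), len(loo_dens[0])
    w = [1.0 / k] * k
    for _ in range(n_iter):
        # Mixture density at each held-out point under current weights.
        mix = [sum(wj * p for wj, p in zip(w, row)) for row in loo_dens]
        # EM update: each weight becomes its average "responsibility".
        w = [
            wj * sum(row[j] / m for row, m in zip(loo_dens, mix)) / n
            for j, wj in enumerate(w)
        ]
    return w
```

With this toy table, `stacking_weights(LOO_DENSITIES)` favours a blend of the first two models, because each predicts well where the other predicts badly. The weights are chosen for the predictive performance of the combination, not for how plausible each model is on its own.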
4 Distilling
So apparently you can train a model to emulate an ensemble of similar models? Great terminology here; Hinton, Vinyals, and Dean (2015) refer to distilling of dark knowledge.
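A minimal sketch of the kind of objective involved, assuming a classification setting: the ensemble's averaged, temperature-softened class probabilities serve as soft targets, and the student is scored by cross-entropy against them. The function names, the logits in the usage note, and the choice of an arithmetic mean over members are assumptions for illustration, not a transcription of the paper's recipe.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(member_logits, temperature=1.0):
    """Average the members' temperature-softened class probabilities."""
    probs = [softmax(logits, temperature) for logits in member_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def distillation_loss(student_logits, soft_targets, temperature=1.0):
    """Cross-entropy between the soft targets and the softened student."""
    q = softmax(student_logits, temperature)
    return -sum(t * math.log(qc) for t, qc in zip(soft_targets, q))
```

For example, `ensemble_soft_targets([[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]], temperature=2.0)` gives targets that still carry the members' relative preferences among the wrong classes, which is the information a hard label throws away. The loss is smallest when the student's softened output matches those targets.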
See Bubeck on this: Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation.
5 Via NTK
How does this work? He, Lakshminarayanan, and Teh (2020).
6 Questions
These methods focus generally on the posterior predictive. How do I find posteriors for parameter values in my model without including them in my predictive loss explicitly? If many of my parameters are not interpretable, I am naturally tempted to fit some by Maximum Likelihood, take them as given, then update posteriors over the remainder, but this does not look like a principled inference procedure.
7 Cascades
Google AI Blog: Model Ensembles Are Faster Than You Think (Wang et al. 2021).
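As I understand the idea, a cascade runs the ensemble members one at a time, cheapest first, and stops as soon as the running average prediction is confident enough. The sketch below is a guess at that control flow under stated assumptions: `models` is a list of hypothetical callables returning class probabilities, confidence is taken to be the largest averaged probability, and `threshold` is an arbitrary illustrative value rather than anything taken from the paper.

```python
def cascade_predict(x, models, threshold=0.9):
    """Run models in order of cost and stop at the first confident answer.

    models: callables mapping an input to a list of class probabilities,
            ordered cheapest first.
    Returns (predicted_class, number_of_models_evaluated).
    """
    if not models:
        raise ValueError("need at least one model")
    summed = None
    for n_used, model in enumerate(models, start=1):
        probs = model(x)
        # Accumulate probabilities so far and average over models used.
        if summed is None:
            summed = list(probs)
        else:
            summed = [s + p for s, p in zip(summed, probs)]
        avg = [s / n_used for s in summed]
        confidence = max(avg)
        if confidence >= threshold:
            break  # early exit: the later, costlier models are skipped
    return avg.index(confidence), n_used
```

Easy inputs then cost only one cheap forward pass, and the full ensemble is paid for only on the inputs where the early members disagree or hedge.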