Ensembling neural nets

Monte Carlo

December 14, 2020 — November 24, 2021

Bayes
machine learning
neural nets
particle

One practical form of Bayesian inference for massively parameterized networks is model averaging.

1 Explicit ensembles

Figure 1

Train a collection of networks, then use the empirical mean and variance of their predictions to estimate the mean and variance of the posterior predictive (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and on one hand we might think there is nothing special to do here, since as near as I can tell it is more or less classical model ensembling. In practice, though, it takes tricks to make this work for neural networks: a single model is often already big enough to strain the GPU, so naively training and storing many of them is expensive. BatchEnsemble is one such trick (Wen, Tran, and Ba 2020).
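To make the "empirical means and variances" step concrete, here is a minimal sketch of how I would combine member predictions, assuming each member returns a predictive mean and optionally a predictive variance per test point. The function name `ensemble_moments` and the toy numbers are mine, not from any of the cited papers; the combination rule is just the mean and variance of a uniform mixture over members.

```python
import numpy as np

def ensemble_moments(means, variances=None):
    """Combine per-member predictions into an ensemble mean and variance.

    means:     array of shape (M, N), prediction of each of M members at N inputs.
    variances: optional array of shape (M, N), per-member predictive variances.
    """
    means = np.asarray(means, dtype=float)
    mean = means.mean(axis=0)
    # Disagreement between members (variance of the member means).
    var = means.var(axis=0)
    if variances is not None:
        # Uniform mixture: add the average within-member variance
        # (law of total variance).
        var = var + np.asarray(variances, dtype=float).mean(axis=0)
    return mean, var

# Hypothetical predictions from M=3 members at N=2 test inputs.
mus = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 2.0]])
sig2 = np.array([[0.5, 0.1], [0.5, 0.1], [0.5, 0.1]])
mean, var = ensemble_moments(mus, sig2)
```

At the first input the members disagree, so the ensemble variance exceeds any single member's variance; at the second they agree, and only the within-member variance remains.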

Cute: Justin Domke, in The human regression ensemble, creates ensembles of curves that he drew through datapoints on a PDF and gets pretty good results.

2 Dropout

Figure 2

Dropout is an implicit ensembling method. Or maybe the implicit ensembling method; I am not aware of others. Recommended reading: Foong et al. (2019); Gal, Hron, and Kendall (2017); Kingma, Salimans, and Welling (2015).

Dropout is a popular kind of noise layer which randomly zeroes out some coefficients in the net when training (and optionally while predicting). A coarse resemblance to random forests etc. is pretty immediate, and indeed you can just use those instead. Here, however, we are trying to average over strong learners, not weak learners.
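A minimal sketch of the "while predicting" variant, usually called Monte Carlo dropout: keep sampling dropout masks at prediction time and treat the stochastic forward passes as draws from an implicit ensemble. Everything here is illustrative and assumed by me: a one-hidden-layer ReLU net, random untrained weights standing in for a fitted model, and masks applied to hidden activations rather than to weights.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, p_drop=0.5, n_samples=200, rng=None):
    """Mean and variance over stochastic forward passes with dropout left on."""
    rng = np.random.default_rng() if rng is None else rng
    keep = 1.0 - p_drop
    draws = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1 + b1, 0.0)   # hidden ReLU layer
        mask = rng.random(h.shape) < keep  # fresh Bernoulli mask on every pass
        h = h * mask / keep                # inverted-dropout rescaling
        draws.append(h @ W2 + b2)
    draws = np.stack(draws)                # shape (n_samples, N, out_dim)
    return draws.mean(axis=0), draws.var(axis=0)

# Random weights as a stand-in for a trained network (an assumption).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)) / 32, np.zeros(1)
x = np.linspace(-1, 1, 5).reshape(-1, 1)
mean, var = mc_dropout_predict(x, W1, b1, W2, b2, rng=rng)
```

Each mask picks out a sub-network, so the spread of `draws` plays the role that disagreement between members plays in an explicit ensemble.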

The key insight here is that dropout can be rationalized, apparently, as model averaging and thence as a kind of implicit probabilistic learning because in the limit it approaches a certain deep Gaussian process (Kingma, Salimans, and Welling 2015; Gal and Ghahramani 2016b, 2015). Leveraging this argument, some papers claim to approximate Bayesian inference by randomising dropout (M. Kasim et al. 2019; M. F. Kasim et al. 2020).

AFAICT the current consensus is that the highly cited and very simple model of Gal and Ghahramani (2015) is flawed, and that the rather more onerous approach of Kingma, Salimans, and Welling (2015) is how you would use dropout to get a more reasonable posterior. So much was said in a seminar, but I have not really used either paper in practice, so I cannot comment.

3 Alternate model combinations

Should we stop weighting hypotheses and start “stacking” instead (Yao et al. 2018)? Also, how is that different? From their abstract:

The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability.
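To see what stacking predictive distributions amounts to under the log score, here is a rough sketch. It assumes the leave-one-out predictive densities are already available as a matrix (Yao et al. compute them with Pareto smoothed importance sampling, which I do not reproduce here), and it swaps in a plain EM-style fixed-point update for the simplex-constrained optimisation, which is my choice and not necessarily theirs. The function name and the toy densities are hypothetical.

```python
import numpy as np

def stacking_weights(loo_dens, n_iter=500):
    """Maximise sum_i log(sum_k w_k * loo_dens[i, k]) over the simplex.

    loo_dens[i, k] is assumed to be the leave-one-out predictive density
    of model k evaluated at held-out observation i.
    """
    loo_dens = np.asarray(loo_dens, dtype=float)
    n, K = loo_dens.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        mix = loo_dens @ w                              # mixture density per point
        # EM update for mixture weights; keeps w non-negative and summing to 1.
        w = w * (loo_dens / mix[:, None]).mean(axis=0)
    return w

# Toy example: model 0 predicts the first two points well, model 1 the
# last two, and model 2 is poor everywhere.
loo = np.array([
    [0.9, 0.1, 0.05],
    [0.8, 0.2, 0.05],
    [0.1, 0.9, 0.05],
    [0.2, 0.8, 0.05],
])
w = stacking_weights(loo)
```

Here the weights split between the two complementary models and the dominated model gets essentially nothing. The weights are chosen for the predictive performance of the combination rather than as posterior probabilities of each model being true.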

4 Distilling

So apparently you can train a model to emulate an ensemble of similar models? Great terminology here; Hinton, Vinyals, and Dean (2015) refer to distilling of dark knowledge.
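A minimal sketch of the distillation objective as I understand it from Hinton, Vinyals, and Dean (2015): soften the ensemble's predictions with a temperature, average them into soft targets, and score the student by cross-entropy against those targets at the same temperature. The logits, shapes and function names below are made up for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_probs, T=2.0):
    """Cross-entropy of the student's softened predictions against soft targets."""
    log_q = np.log(softmax(student_logits, T))
    return -(teacher_probs * log_q).sum(axis=-1).mean()

# Hypothetical ensemble logits: M=3 members, N=2 examples, C=3 classes.
ensemble_logits = np.array([
    [[2.0, 0.5, -1.0], [0.1, 1.5, 0.2]],
    [[1.5, 0.8, -0.5], [-0.2, 2.0, 0.0]],
    [[2.5, 0.0, -1.5], [0.3, 1.0, 0.4]],
])
T = 2.0
# Soft targets: arithmetic mean of the members' softened predictions.
teacher_probs = softmax(ensemble_logits, T).mean(axis=0)

# A student that reproduces the soft targets attains the minimum loss;
# an uninformative student (all-zero logits) does worse.
loss_matched = distillation_loss(T * np.log(teacher_probs), teacher_probs, T)
loss_uniform = distillation_loss(np.zeros((2, 3)), teacher_probs, T)
```

The soft targets carry the relative probabilities of the wrong classes, which is the "dark knowledge" a hard label throws away.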

See Bubeck on this: Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation.

5 Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

6 Questions

These methods generally focus on the posterior predictive. How do I find posteriors for parameter values in my model without including them in my predictive loss explicitly? If many of my parameters are not interpretable, I am naturally tempted to fit some by maximum likelihood, take them as given, then update posteriors over the remainder, but this does not look like a principled inference procedure.

7 Cascades

Google AI Blog: Model Ensembles Are Faster Than You Think (Wang et al. 2021).
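As I understand the idea, a cascade evaluates ensemble members in order of increasing cost and stops as soon as the running prediction is confident enough, so easy inputs never touch the expensive models. The sketch below is my own toy version under that assumption; the threshold rule, function name and stand-in "models" are hypothetical and not taken from Wang et al.

```python
import numpy as np

def cascade_predict(x, models, threshold=0.9):
    """Run models cheapest-first; exit early once the averaged prediction is confident.

    models: non-empty list of callables mapping x to class probabilities.
    Returns (predicted class, number of models evaluated).
    """
    total = None
    for n_used, model in enumerate(models, start=1):
        probs = np.asarray(model(x), dtype=float)
        total = probs if total is None else total + probs
        avg = total / n_used               # average of the members used so far
        if avg.max() >= threshold:         # confident enough: stop early
            break
    return int(avg.argmax()), n_used

# Stand-in models: a cheap one that is only sure about positive inputs,
# and an expensive one that is sure about everything.
small = lambda x: [0.95, 0.05] if x > 0 else [0.55, 0.45]
big = lambda x: [0.9, 0.1] if x > 0 else [0.1, 0.9]

easy = cascade_predict(1.0, [small, big])   # cheap model suffices
hard = cascade_predict(-1.0, [small, big])  # falls through to the big model
```

The average cost then depends on how many inputs are "easy", which is presumably where the claimed speed-ups come from.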

8 References

Alquier. 2021. “User-Friendly Introduction to PAC-Bayes Bounds.” arXiv:2110.11216 [Cs, Math, Stat].
Chada, and Tong. 2022. “Convergence Acceleration of Ensemble Kalman Inversion in Nonlinear Settings.” Mathematics of Computation.
Chipman, George, and Mcculloch. 2006. “Bayesian Ensemble Learning.” In.
Clarke. 2003. “Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research.
Dandekar, Chung, Dixit, et al. 2021. “Bayesian Neural Ordinary Differential Equations.” arXiv:2012.07244 [Cs].
Daxberger, Kristiadi, Immer, et al. 2021. “Laplace Redux — Effortless Bayesian Deep Learning.” In arXiv:2106.14806 [Cs, Stat].
Durasov, Bagautdinov, Baque, et al. 2021. “Masksembles for Uncertainty Estimation.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Foong, Burt, Li, et al. 2019. “Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural Networks.” In 4th Workshop on Bayesian Deep Learning (NeurIPS 2019).
Gal, and Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).
———. 2016a. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In arXiv:1512.05287 [Stat].
———. 2016b. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In 4th International Conference on Learning Representations (ICLR) Workshop Track.
———. 2016c. “Dropout as a Bayesian Approximation: Appendix.” arXiv:1506.02157 [Stat].
Gal, Hron, and Kendall. 2017. “Concrete Dropout.” arXiv:1705.07832 [Stat].
Haber, Lucka, and Ruthotto. 2018. “Never Look Back - A Modified EnKF Method and Its Application to the Training of Neural Networks Without Back Propagation.” arXiv:1805.08034 [Cs, Math].
He, Lakshminarayanan, and Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.
Hinton, Vinyals, and Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [Cs, Stat].
Huang, Sun, Liu, et al. 2016. “Deep Networks with Stochastic Depth.” In Computer Vision – ECCV 2016. Lecture Notes in Computer Science.
Kasim, Muhammad, Topp-Mugglestone, Hatfield, et al. 2019. “A Million Times Speed up in Parameters Retrieval with Deep Learning.” In.
Kasim, M. F., Watson-Parris, Deaconu, et al. 2020. “Up to Two Billion Times Acceleration of Scientific Simulations with Deep Neural Architecture Search.” arXiv:2001.08055 [Physics, Stat].
Kingma, Salimans, and Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Kovachki, and Stuart. 2019. “Ensemble Kalman Inversion: A Derivative-Free Technique for Machine Learning Tasks.” Inverse Problems.
Lakshminarayanan, Pritzel, and Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.
Le, and Clarke. 2017. “A Bayes Interpretation of Stacking for M-Complete and M-Open Settings.” Bayesian Analysis.
Mandt, Hoffman, and Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR.
Minka. 2002. “Bayesian Model Averaging Is Not Model Combination.”
Molchanov, Ashukha, and Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.
Papadopoulos, Edwards, and Murray. 2001. “Confidence Estimation Methods for Neural Networks: A Practical Comparison.” IEEE Transactions on Neural Networks.
Pearce, Zaki, and Neely. 2018. “Bayesian Neural Network Ensembles.” Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada.
Ritter, Kukla, Zhang, et al. 2021. “Sparse Uncertainty Representation in Deep Learning with Inducing Weights.” arXiv:2105.14594 [Cs, Stat].
Sheikh, Phielipp, and Boloni. 2022. “Maximizing Ensemble Diversity in Deep Reinforcement Learning.”
Wang, Kondratyuk, Christiansen, et al. 2021. “Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models.” arXiv:2012.01988 [Cs].
Wen, Tran, and Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In ICLR.
Wilson, and Izmailov. 2020. “Bayesian Deep Learning and a Probabilistic Perspective of Generalization.”
Wortsman, Horton, Guestrin, et al. 2021. “Learning Neural Network Subspaces.” arXiv:2102.10472 [Cs].
Xie, Xu, and Chuang. 2013. “Horizontal and Vertical Ensemble with Deep Representation for Classification.” arXiv:1306.2759 [Cs, Stat].
Yao, Vehtari, Simpson, et al. 2018. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis.