Model averaging, model stacking, model ensembling
On keeping many incorrect hypotheses and using them all as one goodish one
June 20, 2017 — July 23, 2023
Train a bunch of different models and use them all. Fashionable in the form of blending, stacking or staging in machine learning competitions, but also popular in classic frequentist inference as model averaging or bagging.
I’ve seen the idea pop up in disconnected areas recently. Specifically: a Bayesian heuristic for dropout in neural nets, AIC for frequentist model averaging, neural net ensembles, boosting/bagging, and optimal time series prediction in a statistical learning context.
This vexingly incomplete article points out that something like model averaging might work for any convex loss thanks to Jensen’s inequality.
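A sketch of the Jensen argument, in my own notation rather than the article's: take predictors f_1, …, f_K, averaging weights w_k on the simplex, and a loss ℓ(y, ·) that is convex in the prediction. Then the loss of the averaged prediction is bounded by the average of the individual losses.

```latex
\ell\Big(y, \sum_{k=1}^{K} w_k f_k(x)\Big)
  \le \sum_{k=1}^{K} w_k \, \ell\big(y, f_k(x)\big),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1.
```

Note what this does and does not say: the average is never worse than a model picked at random according to the weights, but nothing here guarantees it beats the single best model.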
Two articles (Clarke 2003; Minka 2002) point out that model averaging and model combination are not the same, and that the difference is acute in the M-open setting, i.e. when the true data-generating process is not among the candidate models.
1 Mixtures of models
See mixture models.
2 Stacking
Alternate fun branding: “super learning”. Not actually model averaging, but looks pretty similar if you squint.
See Breiman (1996); Clarke (2003); T. Le and Clarke (2017); Naimi and Balzer (2018); Ting and Witten (1999); Wolpert (1992); Yao et al. (2022); Y. Zhang et al. (2022).
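To make the idea concrete, here is a minimal sketch of stacking as I understand it: get out-of-fold predictions from each base model, then fit combination weights to those predictions. Everything specific here is an assumption for illustration: the toy data, the two deliberately misspecified base models, the helper names, and the unconstrained least-squares meta-learner (Breiman (1996), for one, argues for constraining the weights to be non-negative).

```python
import random

def make_linear_fitter(feature):
    """Return a fitter for the least-squares line y ~ a + b * feature(x)."""
    def fit(xs, ys):
        zs = [feature(x) for x in xs]
        n = len(zs)
        mz, my = sum(zs) / n, sum(ys) / n
        szz = sum((z - mz) ** 2 for z in zs)
        szy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
        slope = szy / szz
        return lambda x: my + slope * (feature(x) - mz)
    return fit

def out_of_fold_predictions(fitters, xs, ys, n_folds=5):
    """Predict each point with base models trained without that point's fold."""
    n = len(xs)
    preds = [[0.0] * len(fitters) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for k, fit in enumerate(fitters):
            model = fit(train_x, train_y)
            for i in held_out:
                preds[i][k] = model(xs[i])
    return preds

def two_model_weights(preds, ys):
    """Unconstrained least-squares weights for two base models (2x2 normal equations)."""
    a = sum(p[0] * p[0] for p in preds)
    b = sum(p[0] * p[1] for p in preds)
    d = sum(p[1] * p[1] for p in preds)
    r0 = sum(p[0] * y for p, y in zip(preds, ys))
    r1 = sum(p[1] * y for p, y in zip(preds, ys))
    det = a * d - b * b
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]

def mse(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

# Toy data (hypothetical): the truth has a linear and a quadratic part,
# and each base model can only see one of them.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + 1.5 * x * x + rng.gauss(0.0, 0.3) for x in xs]

fitters = [make_linear_fitter(lambda x: x), make_linear_fitter(lambda x: x * x)]
preds = out_of_fold_predictions(fitters, xs, ys)
weights = two_model_weights(preds, ys)

mse_a = mse([p[0] for p in preds], ys)
mse_b = mse([p[1] for p in preds], ys)
stacked = [weights[0] * p[0] + weights[1] * p[1] for p in preds]
mse_stacked = mse(stacked, ys)
print(weights, mse_a, mse_b, mse_stacked)
```

The stacked error printed here is measured on the same out-of-fold predictions the weights were fitted to, so it is optimistic. In practice one would refit the base models on all the data, combine them with these weights, and judge the result on fresh data.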
3 Bayesian stacking
As above, but Bayesian. Motivates suggestive invocation of M-open machinery (Clarke 2003; Clyde and Iversen 2013; Hoeting et al. 1999; T. Le and Clarke 2017; T. M. Le and Clarke 2022; Minka 2002; Naimi and Balzer 2018; Polley 2010; Ting and Witten 1999; Wolpert 1992; Yao et al. 2022, 2018).
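As I read Yao et al. (2018), the core computation is to pick simplex weights that maximise the log score of the weighted mixture of held-out predictive densities. The sketch below does only that step, under loud assumptions: the two "models" are fixed Gaussian densities that I made up (so no refitting or leave-one-out is needed), the data come from neither of them, and the optimiser is a plain EM-style fixed-point update for mixture weights, swapped in for whatever solver a real implementation would use.

```python
import math
import random

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_score(weights, dens):
    """Sum over data points of the log density of the weighted mixture."""
    return sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in dens
    )

def log_score_stacking_weights(dens, n_iter=500):
    """Maximise the mixture log score over the simplex by EM-style updates.

    dens[i][k] is model k's predictive density at observation i.
    Each update keeps the weights non-negative and summing to one,
    and never decreases the log score.
    """
    n, k = len(dens), len(dens[0])
    w = [1.0 / k] * k
    for _ in range(n_iter):
        new_w = [0.0] * k
        for row in dens:
            mix = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(k):
                new_w[j] += w[j] * row[j] / mix
        w = [v / n for v in new_w]
    return w

# Hypothetical M-open-flavoured setup: data from N(0.5, 1.5^2),
# candidate predictive densities N(-1, 1) and N(2, 1).
rng = random.Random(1)
data = [rng.gauss(0.5, 1.5) for _ in range(200)]
candidates = [(-1.0, 1.0), (2.0, 1.0)]  # (mean, sd) of each candidate
dens = [[normal_pdf(y, m, s) for (m, s) in candidates] for y in data]

w = log_score_stacking_weights(dens)
uniform = [0.5, 0.5]
print(w, log_score(w, dens), log_score(uniform, dens))
```

With real Bayesian models the entries of `dens` would be leave-one-out (or otherwise held-out) posterior predictive densities, which is where most of the computational effort goes.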
4 Forecasting
Time series prediction? Try ensemble methods for time series.