Meta learning

Few-shot learning, learning fast weights, learning to learn

September 16, 2021

functional analysis
how do science
meta learning
model selection
optimization
statmech

Placeholder for what we now call few-shot learning, I think?

Is this what Schmidhuber means when he discusses neural nets learning to program neural nets with fast weights? He dates that idea to the 1990s (Schmidhuber 1992) and relates it via Schlag, Irie, and Schmidhuber (2021) to transformer models.
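To make the fast-weight idea concrete, here is a minimal toy sketch of my own (not Schmidhuber’s exact model), loosely following the outer-product update that Schlag, Irie, and Schmidhuber (2021) identify inside linear attention: a slow network with learned parameters emits key and value vectors, which additively reprogram a fast weight matrix at run time.

```python
# Toy fast-weight sketch: the "slow" weights would be learned by ordinary SGD;
# the "fast" weights are rewritten on the fly by the slow net's outputs.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_k = rng.normal(scale=0.1, size=(d, d))   # slow (learned) parameters producing keys
W_v = rng.normal(scale=0.1, size=(d, d))   # slow (learned) parameters producing values
W_fast = np.zeros((d, d))                  # fast weights, reprogrammed at run time

for x in rng.normal(size=(5, d)):          # a short input sequence
    k = np.tanh(W_k @ x)                   # the slow net "programs" the fast net...
    v = np.tanh(W_v @ x)
    W_fast += np.outer(v, k)               # ...via an additive outer-product update
    y = W_fast @ x                         # the fast net then maps the current input
```

The additive outer-product update is the step that Schlag, Irie, and Schmidhuber (2021) read as the memory write of a linear transformer.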

A mainstream and current approach is explicit gradient-based meta-learning in the style of MAML (Finn, Abbeel, and Levine 2017), in which an outer loop learns an initialisation that an inner loop rapidly adapts to each new task.

The post “On the futility of trying to be clever (the bitter lesson redux)” summarises some recent negative results:

Two recent papers (Raghu et al. 2020; Tian et al. 2020) show that in practice the inner-loop run doesn’t really do much in these algorithms, so much so that one can safely do away with the inner loop entirely. This means that the success of these algorithms can be explained completely by standard (single-loop) learning on the entire lumped meta-training dataset. Another recent beautiful theory paper (Du et al. 2021) sheds some light on these experimental results.
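For concreteness, here is a minimal sketch of the inner/outer loop structure under discussion, using the first-order approximation to MAML (as discussed by Finn, Abbeel, and Levine 2017) on a made-up family of 1-d linear regression tasks. The task distribution and all names here are my own toy illustration, not the setup of any of the cited papers.

```python
# First-order MAML sketch: meta-learn an initialisation w0 that adapts to a
# new task in one inner gradient step. Toy tasks: predict y = a * x for a
# task-specific slope a, with model y_hat = w * x.
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, x, y):
    """Squared-error loss and its gradient in w for the model y_hat = w * x."""
    err = w * x - y
    return np.mean(err ** 2), np.mean(2 * err * x)

w0, inner_lr, outer_lr = 0.0, 0.1, 0.01
for step in range(2000):
    a = rng.uniform(-2, 2)                    # sample a task (its true slope)
    x = rng.normal(size=20); y = a * x        # support data for the inner loop
    # Inner loop: one adaptation step from the shared initialisation w0.
    _, g_inner = loss_and_grad(w0, x, y)
    w_adapted = w0 - inner_lr * g_inner
    # Outer loop (first-order: second derivatives dropped): evaluate the
    # adapted weight on fresh query data and step the initialisation.
    xq = rng.normal(size=20); yq = a * xq
    _, g_outer = loss_and_grad(w_adapted, xq, yq)
    w0 -= outer_lr * g_outer
```

Collapsing this into a single loop over the lumped data, as the quoted results suggest one often can, amounts to dropping the adaptation step and training w0 directly on all task data.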

References

Antoniou, Edwards, and Storkey. 2019. “How to Train Your MAML.” arXiv:1810.09502 [Cs, Stat].
Arnold, Mahajan, Datta, et al. 2020. “Learn2learn: A Library for Meta-Learning Research.” arXiv:2008.12284 [Cs, Stat].
Brown, Mann, Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [Cs].
Du, Hu, Kakade, et al. 2021. “Few-Shot Learning via Learning the Representation, Provably.”
Fiebrink, Trueman, and Cook. 2009. “A Metainstrument for Interactive, on-the-Fly Machine Learning.” In Proceedings of NIME.
Finn, Abbeel, and Levine. 2017. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” In Proceedings of the 34th International Conference on Machine Learning.
Künzel, Sekhon, Bickel, et al. 2019. “Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” Proceedings of the National Academy of Sciences.
Lee, Maji, Ravichandran, et al. 2019. “Meta-Learning with Differentiable Convex Optimization.”
Medasani, Gamst, Ding, et al. 2016. “Predicting Defect Behavior in B2 Intermetallics by Merging Ab Initio Modeling and Machine Learning.” Npj Computational Materials.
Mikulik, Delétang, McGrath, et al. 2020. “Meta-Trained Agents Implement Bayes-Optimal Agents.”
Munkhdalai, Sordoni, Wang, et al. 2019. “Metalearned Neural Memory.” In Advances in Neural Information Processing Systems.
Oreshkin, Carpov, Chapados, et al. 2020. “Meta-Learning Framework with Applications to Zero-Shot Time-Series Forecasting.”
Ortega, Wang, Rowland, et al. 2019. “Meta-Learning of Sequential Strategies.”
Pestourie, Mroueh, Nguyen, et al. 2020. “Active Learning of Deep Surrogates for PDEs: Application to Metasurface Design.” Npj Computational Materials.
Raghu, Raghu, Bengio, et al. 2020. “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML.”
Rajeswaran, Finn, Kakade, et al. 2019. “Meta-Learning with Implicit Gradients.”
Schlag, Irie, and Schmidhuber. 2021. “Linear Transformers Are Secretly Fast Weight Programmers.” arXiv:2102.11174 [Cs].
Schmidhuber. 1992. “Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks.” Neural Computation.
———. 1993. “Reducing the Ratio Between Learning Complexity and Number of Time Varying Variables in Fully Recurrent Nets.” In Proceedings of the International Conference on Artificial Neural Networks.
Tian, Wang, Krishnan, et al. 2020. “Rethinking Few-Shot Image Classification: A Good Embedding Is All You Need?”
Uttl, White, and Gonzalez. 2017. “Meta-Analysis of Faculty’s Teaching Effectiveness: Student Evaluation of Teaching Ratings and Student Learning Are Not Related.” Studies in Educational Evaluation, Evaluation of Teaching: Challenges and Promises.
van Erven and Koolen. 2016. “MetaGrad: Multiple Learning Rates in Online Learning.” In Advances in Neural Information Processing Systems 29.
Zhang and Wang. 2022. “Deep Learning Meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”