Recurrent / convolutional / state-space

Translating between means of approximating time series dynamics

April 5, 2016 — March 5, 2024

Bayes
convolution
dynamical systems
functional analysis
linear algebra
machine learning
making things
music
networks
neural nets
nonparametric
probability
signal processing
sparser than thou
state space models
statistics
time series
Figure 1

A meeting point for some related ideas from different fields: perspectives on analysing systems in terms of a latent, noisy state, and/or their history of noisy observations. This notebook is dedicated to the possibly surprising fact that we can move between hidden-state-type representations and observed-state-only representations, and indeed mix them together conveniently. I had many thoughts about this, but they are largely irrelevant now, since the S4 family came along and effectively actioned all of them.

1 Linear systems

See linear feedback systems and linear filter design for stuff about FIR vs IIR filters.

1.1 Linear Time-Invariant systems

Let us talk about Fourier transforms and spectral properties: an LTI system is characterised completely by its impulse response, or equivalently by its transfer function.
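To make the FIR/IIR translation concrete, here is a minimal sketch (my own toy filter, using scipy.signal, not anything from the linked notebooks): the output of a recurrent (IIR) filter coincides, up to truncation error, with convolution against its impulse response, i.e. an FIR approximation.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
u = rng.standard_normal(500)          # input sequence

# A stable IIR (recurrent) filter: y[t] = 0.5*u[t] + 0.9*y[t-1]
b, a = [0.5], [1.0, -0.9]
y_recurrent = signal.lfilter(b, a, u)

# Its impulse response decays geometrically; truncating it gives an FIR filter.
impulse = np.zeros(200)
impulse[0] = 1.0
h = signal.lfilter(b, a, impulse)     # h[k] = 0.5 * 0.9**k

# Convolving with the truncated impulse response approximates the recurrence.
y_convolutional = np.convolve(u, h)[: len(u)]
print(np.max(np.abs(y_recurrent - y_convolutional)))  # tiny truncation error
```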

2 Koopman operators

Learning state is pointless! Infer directly from observations! See Koopmania.
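As a concrete, minimal instance of the observations-only viewpoint, here is a dynamic mode decomposition sketch (my own toy example, not from the linked notebook): fit a linear operator that advances the observations one step, with no explicit latent state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Snapshots of some observed dynamics: a damped rotation plus noise.
theta, rho = 0.2, 0.98
A_true = rho * np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
x = np.zeros((2, 300))
x[:, 0] = [1.0, 0.0]
for t in range(299):
    x[:, t + 1] = A_true @ x[:, t] + 0.01 * rng.standard_normal(2)

# Dynamic mode decomposition: least-squares fit of a linear map X' ≈ K X
# acting directly on observation snapshots, a finite-dimensional Koopman proxy.
X, Xp = x[:, :-1], x[:, 1:]
K = Xp @ np.linalg.pinv(X)

print(np.round(K, 3))                # ≈ A_true
print(np.abs(np.linalg.eigvals(K)))  # ≈ rho: decay rate of the recovered modes
```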

3 RNNs

Miller and Hardt (2018)

See RNNs.

4 Stability of learning

Hochreiter et al. (2001); Hochreiter (1998); Lamb et al. (2016); Hardt, Ma, and Recht (2018) etc.

RNNs were traditionally considered hard to train stably. And yet, this was possibly a simple mistake:

Were RNNs All We Needed? discusses Feng et al. (2024):

Researchers from Mila and Borealis AI have shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today’s transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating “minLSTM” and “minGRU”. The key changes: ❶ Removed dependencies on previous hidden states in the gates ❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients ❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don’t worry)

⚡️ As a result, you can use a “parallel scan” algorithm to train these new, minimal RNNs, in parallel, taking 88% more memory but also making them 200× faster than their traditional counterparts for long sequences

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5× fewer training steps than Transformers to reach the same performance! 🚀
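For what it’s worth, the “parallel scan” mentioned above applies because a gated linear recurrence \(h_t = a_t \odot h_{t-1} + b_t\) is a composition of elementwise affine maps, and affine-map composition is associative. A minimal sketch of that observation (my own illustration with made-up gates, not the paper’s code; a real implementation would use something like jax.lax.associative_scan for log-depth parallelism):

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 16, 4
a = rng.uniform(0.0, 1.0, size=(T, D))   # gates in (0, 1)
b = rng.standard_normal((T, D))          # gated candidate states

# Sequential evaluation of h_t = a_t * h_{t-1} + b_t, with h_0 = 0.
h_seq = np.zeros((T, D))
h = np.zeros(D)
for t in range(T):
    h = a[t] * h + b[t]
    h_seq[t] = h

# The same recurrence as an associative scan over affine maps (a, b),
# composed as (a2, b2) ∘ (a1, b1) = (a2 * a1, a2 * b1 + b2).
def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def associative_scan(a, b):
    """Inclusive scan of affine maps by divide and conquer."""
    if len(a) == 1:
        return a, b
    mid = len(a) // 2
    a_l, b_l = associative_scan(a[:mid], b[:mid])
    a_r, b_r = associative_scan(a[mid:], b[mid:])
    # Prefix every cumulative map in the right half with the last
    # cumulative map of the left half.
    a_r, b_r = combine((a_l[-1:], b_l[-1:]), (a_r, b_r))
    return np.concatenate([a_l, a_r]), np.concatenate([b_l, b_r])

_, h_scan = associative_scan(a, b)        # cumulative offsets = states, since h_0 = 0
print(np.allclose(h_seq, h_scan))         # True
```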

cf. Eugenio Culurciello, The fall of RNN / LSTM (“We fell for recurrent neural networks…”), who argued the opposite: that transformers have removed the need for RNNs.

5 S4

Figure 2

Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks and 🚧TODO🚧 clarify. See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying because I spent 2 years of my PhD trying to solve the same problem and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid. Gu et al. (2021):

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u↦y by simply simulating a linear continuous-time state-space representation \(x'=Ax+Bu,y=Cx+Du\). Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices \(A\) that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.

Gu, Goel, and Ré (2021):

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \(x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t)\), and showed that for appropriate choices of the state matrix \(A\), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \(A\) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
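The recurrent/convolutional translation these abstracts describe is easy to check numerically for a fixed state-space system. Here is a minimal sketch (my own toy parameters and a plain bilinear discretisation, not the papers’ structured parameterisation or their code): the same discretised SSM can be evaluated by stepping a recurrence or by convolving with its kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, dt = 4, 128, 1.0                   # state dim, sequence length, step size

# A stable continuous-time SSM  x' = Ax + Bu,  y = Cx + Du (toy numbers).
A = rng.standard_normal((N, N))
A = A - A.T - 2.0 * np.eye(N)            # skew-symmetric part plus decay: Hurwitz
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = np.zeros((1, 1))

# Bilinear (Tustin) discretisation to a discrete-time system.
I = np.eye(N)
Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
Bb = np.linalg.solve(I - dt / 2 * A, dt * B)

u = rng.standard_normal(T)

# Recurrent mode: x_{k+1} = Ab x_k + Bb u_k, stepped sequentially.
x = np.zeros((N, 1))
y_rec = np.zeros(T)
for k in range(T):
    x = Ab @ x + Bb * u[k]
    y_rec[k] = (C @ x).item() + (D * u[k]).item()

# Convolutional mode: the same map is a convolution with the kernel
# K = (C Bb, C Ab Bb, C Ab^2 Bb, ...), formed without stepping through states.
K = np.array([(C @ np.linalg.matrix_power(Ab, k) @ Bb).item() for k in range(T)])
y_conv = np.convolve(u, K)[:T] + D.item() * u

print(np.max(np.abs(y_rec - y_conv)))    # agreement up to machine precision
```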

Related? Y. Li et al. (2022). There is an interesting parallel to the recursive/non-recursive transformer duality in the RWKV language models. Question: can they do the job of transformers?

Nearly (Vardasbi et al. 2023; Nishikawa and Suzuki 2024).

But actually, yes.

6 Mamba

See Mamba.

7 Incoming

  • Simchowitz, Boczar, and Recht (2019)

    We analyse a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. We then demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not incur a worst-case dependence on the rate at which these dependencies decay. The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.

8 References

Arjovsky, Shah, and Bengio. 2016. Unitary Evolution Recurrent Neural Networks.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.
Atal. 2006. The History of Linear Prediction.” IEEE Signal Processing Magazine.
Ben Taieb, and Atiya. 2016. A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting.” IEEE transactions on neural networks and learning systems.
Bengio, Simard, and Frasconi. 1994. Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks.
Bordes, Bottou, and Gallinari. 2009. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.” Journal of Machine Learning Research.
Cai, Zhu, Wang, et al. 2024. MambaTS: Improved Selective State Space Models for Long-Term Time Series Forecasting.”
Cakir, Ozan, and Virtanen. 2016. Filterbank Learning for Deep Neural Network Based Polyphonic Sound Event Detection.” In Neural Networks (IJCNN), 2016 International Joint Conference on.
Chang, Chen, Haber, et al. 2019. AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.” In Proceedings of ICLR.
Chang, Meng, Haber, Tung, et al. 2018. Multi-Level Residual Networks from Dynamical Systems View.” In PRoceedings of ICLR.
Chang, Meng, Haber, Ruthotto, et al. 2018. Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In arXiv:1709.03698 [Cs, Stat].
Chung, Ahn, and Bengio. 2016. Hierarchical Multiscale Recurrent Neural Networks.” arXiv:1609.01704 [Cs].
Chung, Kastner, Dinh, et al. 2015. A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28.
Collins, Sohl-Dickstein, and Sussillo. 2016. Capacity and Trainability in Recurrent Neural Networks.” In arXiv:1611.09913 [Cs, Stat].
Cooijmans, Ballas, Laurent, et al. 2016. Recurrent Batch Normalization.” arXiv Preprint arXiv:1603.09025.
Dai, Lai, Yang, et al. 2019. Re-Examination of the Role of Latent Variables in Sequence Modeling.” arXiv:1902.01388 [Cs, Stat].
Doucet, Freitas, and Gordon. 2001. Sequential Monte Carlo Methods in Practice.
Feng, Tung, Ahmed, et al. 2024. Were RNNs All We Needed?
Fraccaro, Sønderby, Paquet, et al. 2016. Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29.
Goodwin, and Vetterli. 1999. Matching Pursuit and Atomic Signal Models Based on Recursive Filter Banks.” IEEE Transactions on Signal Processing.
Grosse, Raina, Kwong, et al. 2007. Shift-Invariant Sparse Coding for Audio Classification.” In The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007).
Gu, and Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces.”
Gu, Goel, and Ré. 2021. Efficiently Modeling Long Sequences with Structured State Spaces.”
Gu, Johnson, Goel, et al. 2021. Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In Advances in Neural Information Processing Systems.
Haber, and Ruthotto. 2018. Stable Architectures for Deep Neural Networks.” Inverse Problems.
Hardt, Ma, and Recht. 2018. Gradient Descent Learns Linear Dynamical Systems.” The Journal of Machine Learning Research.
Haykin, ed. 2001. Kalman Filtering and Neural Networks. Adaptive and Learning Systems for Signal Processing, Communications, and Control.
Hazan, Singh, and Zhang. 2017. Learning Linear Dynamical Systems via Spectral Filtering.” In NIPS.
Heaps. 2020. Enforcing Stationarity Through the Prior in Vector Autoregressions.” arXiv:2004.09455 [Stat].
Hochreiter. 1998. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.” International Journal of Uncertainty Fuzziness and Knowledge Based Systems.
Hochreiter, Bengio, Frasconi, et al. 2001. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks.
Hochreiter, and Schmidhuber. 1997. Long Short-Term Memory.” Neural Computation.
Hu, Baumann, Gui, et al. 2024. ZigMa: A DiT-Style Zigzag Mamba Diffusion Model.”
Hürzeler, and Künsch. 2001. Approximating and Maximising the Likelihood for a General State-Space Model.” In Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science.
Ionides, Edward L., Bhadra, Atchadé, et al. 2011. Iterated Filtering.” The Annals of Statistics.
Ionides, E. L., Bretó, and King. 2006. Inference for Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences.
Jaeger. 2002. Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the “Echo State Network” Approach.
Jing, Shen, Dubcek, et al. 2017. Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In PMLR.
Kailath. 1980. Linear Systems. Prentice-Hall Information and System Science Series.
Kailath, Sayed, and Hassibi. 2000. Linear Estimation. Prentice Hall Information and System Sciences Series.
Kaul. 2020. Linear Dynamical Systems as a Core Computational Primitive.” In Advances in Neural Information Processing Systems.
Kingma, Salimans, Jozefowicz, et al. 2016. Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.
Kolter, and Manek. 2019. Learning Stable Deep Dynamics Models.” In Advances in Neural Information Processing Systems.
Krishnamurthy, Can, and Schwab. 2022. Theory of Gating in Recurrent Neural Networks.” Physical Review. X.
Krishnan, Shalit, and Sontag. 2015. Deep Kalman Filters.” arXiv Preprint arXiv:1511.05121.
———. 2017. Structured Inference Networks for Nonlinear State Space Models.” In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
Kutschireiter, Surace, Sprekeler, et al. 2015a. “A Neural Implementation for Nonlinear Filtering.” arXiv Preprint arXiv:1508.06818.
Kutschireiter, Surace, Sprekeler, et al. 2015b. Approximate Nonlinear Filtering with a Recurrent Neural Network.” BMC Neuroscience.
Lamb, Goyal, Zhang, et al. 2016. Professor Forcing: A New Algorithm for Training Recurrent Networks.” In Advances In Neural Information Processing Systems.
Laurent, and von Brecht. 2016. A Recurrent Neural Network Without Chaos.” arXiv:1612.06212 [Cs].
Li, Yuhong, Cai, Zhang, et al. 2022. What Makes Convolutional Models Great on Long Sequence Modeling?
Li, Bei, Du, Zhou, et al. 2021. ODE Transformer: An Ordinary Differential Equation-Inspired Model for Neural Machine Translation.”
Lipton. 2016. Stuck in a What? Adventures in Weight Space.” arXiv:1602.07320 [Cs].
Ljung. 1999. System Identification: Theory for the User. Prentice Hall Information and System Sciences Series.
Ljung, and Söderström. 1983. Theory and Practice of Recursive Identification. The MIT Press Series in Signal Processing, Optimization, and Control 4.
MacKay, Vicol, Ba, et al. 2018. Reversible Recurrent Neural Networks.” In Advances In Neural Information Processing Systems.
Marelli, and Fu. 2010. A Recursive Method for the Approximation of LTI Systems Using Subband Processing.” IEEE Transactions on Signal Processing.
Martens, and Sutskever. 2011. Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11.
Mattingley, and Boyd. 2010. Real-Time Convex Optimization in Signal Processing.” IEEE Signal Processing Magazine.
Megretski. 2003. Positivity of Trigonometric Polynomials.” In 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475).
Mehri, Kumar, Gulrajani, et al. 2017. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Mhammedi, Hellicar, Rahman, et al. 2017. Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In PMLR.
Miller, and Hardt. 2018. When Recurrent Models Don’t Need To Be Recurrent.” arXiv:1805.10369 [Cs, Stat].
Moradkhani, Sorooshian, Gupta, et al. 2005. Dual State–Parameter Estimation of Hydrological Models Using Ensemble Kalman Filter.” Advances in Water Resources.
Nerrand, Roussel-Ragot, Personnaz, et al. 1993. Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms.” Neural Computation.
Nishikawa, and Suzuki. 2024. State Space Models Are Comparable to Transformers in Estimating Functions with Dynamic Smoothness.”
Oliveira, and Skelton. 2001. Stability Tests for Constrained Linear Systems.” In Perspectives in Robust Control. Lecture Notes in Control and Information Sciences.
Patro, and Agneeswaran. 2024. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series.”
Roberts, Engel, Raffel, et al. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music.” arXiv:1803.05428 [Cs, Eess, Stat].
Routtenberg, and Tabrikian. 2010. Blind MIMO-AR System Identification and Source Separation with Finite-Alphabet.” IEEE Transactions on Signal Processing.
Seuret, and Gouaisbaut. 2013. Wirtinger-Based Integral Inequality: Application to Time-Delay Systems.” Automatica.
Simchowitz, Boczar, and Recht. 2019. Learning Linear Dynamical Systems with Semi-Parametric Least Squares.” arXiv:1902.00768 [Cs, Math, Stat].
Sjöberg, Zhang, Ljung, et al. 1995. Nonlinear Black-Box Modeling in System Identification: A Unified Overview.” Automatica, Trends in System Identification.
Smith. 2000. “Disentangling Uncertainty and Error: On the Predictability of Nonlinear Systems.” In Nonlinear Dynamics and Statistics.
Söderström, and Stoica, eds. 1988. System Identification.
Stepleton, Pascanu, Dabney, et al. 2018. Low-Pass Recurrent Neural Networks - A Memory Architecture for Longer-Term Correlation Discovery.” arXiv:1805.04955 [Cs, Stat].
Sutskever. 2013. Training Recurrent Neural Networks.”
Szegedy, Liu, Jia, et al. 2015. Going Deeper with Convolutions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Telgarsky. 2017. Neural Networks and Rational Functions.” In PMLR.
Thickstun, Harchaoui, and Kakade. 2017. Learning Features of Music from Scratch.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Vardasbi, Pires, Schmidt, et al. 2023. State Spaces Aren’t Enough: Machine Translation Needs Attention.”
Welch. 1967. The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms.” IEEE Transactions on Audio and Electroacoustics.
Werbos. 1988. Generalization of Backpropagation with Application to a Recurrent Gas Market Model.” Neural Networks.
———. 1990. Backpropagation Through Time: What It Does and How to Do It.” Proceedings of the IEEE.
Wiatowski, Grohs, and Bölcskei. 2018. Energy Propagation in Deep Convolutional Neural Networks.” IEEE Transactions on Information Theory.
Williams, and Peng. 1990. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.” Neural Computation.
Wisdom, Powers, Pitton, et al. 2016. Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In Advances in Neural Information Processing Systems 29.
Yu, and Deng. 2011. Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP].” IEEE Signal Processing Magazine.
Zhang, Zhang, Kong, et al. 2021. Continuous Self-Attention Models with Neural ODE Networks.” Proceedings of the AAAI Conference on Artificial Intelligence.
Zinkevich. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning. ICML’03.