Recurrent / convolutional / state-space
Translating between means of approximating time series dynamics
April 5, 2016 — March 5, 2024
A meeting point for some related ideas from different fields: perspectives on analysing systems in terms of a latent, noisy state, and/or in terms of their history of noisy observations. This notebook is dedicated to the possibly surprising fact that we can move between hidden-state-type representations and observed-state-only representations, and indeed mix them together conveniently. I had many thoughts about this, but they are largely irrelevant now, since the S4 family came along and effectively actioned all of them.
1 Linear systems
See linear feedback systems and linear filter design for stuff about FIR vs IIR filters.
1.1 Linear Time-Invariant systems
A linear time-invariant system is characterised by its impulse response: its action on any input is convolution with that response, and the Fourier transform turns that convolution into multiplication by a frequency response (transfer function). So the same system can be read off in the time domain or in the spectral domain, whichever is handier.
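To make that concrete, here is a minimal sketch (mine, not from the linked notebooks) of the two views of one LTI system using scipy.signal; the particular second-order Butterworth filter is an arbitrary illustrative choice.

```python
# Two equivalent descriptions of one LTI system: impulse response (time domain)
# versus transfer function (frequency domain). Illustrative only.
import numpy as np
from scipy import signal

# An arbitrary stable IIR filter: a second-order low-pass Butterworth.
b, a = signal.butter(2, 0.2)                 # numerator / denominator coefficients

# Time-domain view: (truncated) impulse response h[n].
impulse = np.zeros(64)
impulse[0] = 1.0
h = signal.lfilter(b, a, impulse)

# Frequency-domain view: transfer function H(e^{jw}) = B(e^{jw}) / A(e^{jw}).
w, H = signal.freqz(b, a, worN=256)          # (H is what a Bode / spectral plot would show)

# Filtering is convolution with h in time, i.e. multiplication by H in frequency.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y_recursive = signal.lfilter(b, a, x)        # IIR recursion, carries internal state
y_convolved = np.convolve(x, h)[:64]         # FIR view: convolve with truncated h
print(np.max(np.abs(y_recursive - y_convolved)))  # ~1e-16: identical on these samples
```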
2 Koopman operators
Learning state is pointless! Infer directly from observations! See Koopmania.
3 RNNs
Miller and Hardt (2018)
See RNNs.
4 Stability of learning
Hochreiter et al. (2001); Hochreiter (1998); Lamb et al. (2016); Hardt, Ma, and Recht (2018) etc.
RNNs were traditionally considered hard to train stably. And yet, this was possibly a simple mistake:
Were RNNs All We Needed? discusses Feng et al. (2024): Researchers from Mila and Borealis AI have shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today’s transformers.
They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating “minLSTM” and “minGRU”. The key changes: ❶ Removed dependencies on previous hidden states in the gates ❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients ❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don’t worry)
⚡️ As a result, you can use a “parallel scan” algorithm to train these new, minimal RNNs in parallel, taking 88% more memory but making them 200× faster than their traditional counterparts for long sequences.
🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.
And for Language Modeling, they need 2.5× fewer training steps than Transformers to reach the same performance! 🚀
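To see why change ❶ matters computationally: once the gate depends only on the current input, the hidden-state update is an affine recurrence with state-independent coefficients, which a logarithmic-depth prefix scan can evaluate instead of a step-by-step loop. A minimal sketch of my reading of the minGRU update described above (not the authors’ code; weights and dimensions are arbitrary):

```python
# minGRU-flavoured recurrence evaluated two ways: sequentially and by parallel scan.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 128, 8, 16                  # sequence length, input dim, hidden dim
x = rng.standard_normal((T, d_in))
W_z = rng.standard_normal((d_in, d_h)) / np.sqrt(d_in)
W_h = rng.standard_normal((d_in, d_h)) / np.sqrt(d_in)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Gate and candidate state depend only on x_t, not on h_{t-1}; no tanh.
z = sigmoid(x @ W_z)                       # (T, d_h)
htilde = x @ W_h                           # (T, d_h)
# So h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t is h_t = a_t * h_{t-1} + b_t:
a, b = 1.0 - z, z * htilde

def scan_affine(a, b):
    """Inclusive scan of h_t = a_t*h_{t-1} + b_t with h_0 = 0, via ceil(log2 T)
    rounds of Hillis-Steele prefix composition of affine maps. Each round is one
    vectorized op; a real implementation would use a framework's scan primitive."""
    a, b = a.copy(), b.copy()
    step = 1
    while step < a.shape[0]:
        b[step:] = b[step:] + a[step:] * b[:-step]   # compose with the map `step` back
        a[step:] = a[step:] * a[:-step]
        step *= 2
    return b

h_scan = scan_affine(a, b)

# Reference: the ordinary sequential recurrence.
h_seq = np.zeros((T, d_h))
h_prev = np.zeros(d_h)
for t in range(T):
    h_prev = a[t] * h_prev + b[t]
    h_seq[t] = h_prev

print(np.max(np.abs(h_scan - h_seq)))      # ~1e-15: same trajectory
```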
cf. Eugenio Culurciello, The fall of RNN / LSTM (“We fell for recurrent neural networks…”), who argued the opposite: that transformers have removed the need for RNNs.
5 S4
Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks, temporal convolutions, and continuous-time state-space models. See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying because I spent 2 years of my PhD trying to solve the same problem and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid. Gu et al. (2021):
Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u↦y by simply simulating a linear continuous-time state-space representation \(x'=Ax+Bu,y=Cx+Du\). Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices \(A\) that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.
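The first sentence of that abstract is the crux of this whole notebook: a (discretised) linear state-space model can be run either as a recurrence over a hidden state or as a causal convolution over the observed input history. A minimal sketch of that equivalence with an arbitrary small random discrete-time system (mine; not the LSSL parameterisation, and eliding the continuous-to-discrete step):

```python
# One discrete-time linear SSM, computed as (1) a hidden-state recurrence and
# (2) a convolution over the input history. Toy random system for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 200                                            # state dimension, sequence length
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]   # stable: eigenvalues of modulus 0.9
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))
D = rng.standard_normal((1, 1))
u = rng.standard_normal(T)

# (1) Recurrent view: carry the hidden state x_k.
#     x_k = A x_{k-1} + B u_k,   y_k = C x_k + D u_k
x = np.zeros((n, 1))
y_rec = np.zeros(T)
for k in range(T):
    x = A @ x + B * u[k]
    y_rec[k] = (C @ x + D * u[k]).item()

# (2) Convolutional view: no hidden state, just the kernel K_j = C A^j B,
#     so that y_k = D u_k + sum_j K_j u_{k-j}.
K = np.zeros(T)
AjB = B.copy()
for j in range(T):
    K[j] = (C @ AjB).item()
    AjB = A @ AjB
y_conv = np.convolve(u, K)[:T] + D.item() * u

print(np.max(np.abs(y_rec - y_conv)))                    # agrees to machine precision
```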
Gu, Goel, and Ré (2021):
A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \(x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t)\), and showed that for appropriate choices of the state matrix \(A\), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \(A\) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
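S4’s actual parameterisation (diagonal-plus-low-rank \(A\), with the Cauchy-kernel computation mentioned in that abstract) is more than I want to reproduce here, but the computational payoff is already visible in the simpler diagonal case, in the spirit of later diagonal variants of these models: the whole convolution kernel is a Vandermonde-style sum that can be formed in closed form and applied by FFT in O(T log T), with no sequential pass over the hidden state. A sketch with arbitrary toy dimensions (mine, not the S4 algorithm):

```python
# Diagonal-SSM sketch: form the kernel K_j = C diag(lam)^j B in closed form and
# apply it with an FFT; check against the sequential recurrence. Not S4 itself.
import numpy as np

rng = np.random.default_rng(2)
n, T = 16, 4096
lam = 0.99 * np.exp(2j * np.pi * rng.uniform(size=n))    # diagonal of A, |lam| < 1
B = rng.standard_normal(n).astype(complex)
C = rng.standard_normal(n).astype(complex)
u = rng.standard_normal(T)

# Kernel K_j = sum_i C_i B_i lam_i^j, formed for all j at once (a Vandermonde-style
# product); no hidden-state trajectory is materialized. (Skip-connection D omitted.)
j = np.arange(T)
K = (C * B) @ (lam[:, None] ** j[None, :])               # shape (T,), complex

# Causal convolution y = K * u by FFT in O(T log T); pad to 2T to avoid wraparound.
L = 2 * T
y_fft = np.fft.ifft(np.fft.fft(K, L) * np.fft.fft(u, L))[:T].real

# Reference: the sequential state-space recurrence.
x = np.zeros(n, dtype=complex)
y_rec = np.zeros(T)
for k in range(T):
    x = lam * x + B * u[k]
    y_rec[k] = (C @ x).real

print(np.max(np.abs(y_fft - y_rec)))                     # small (float roundoff)
```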
Related? Y. Li et al. (2022). There is an interesting parallel to the recursive/non-recursive duality of the RWKV language models. Question: can these models do the job of transformers?
Nearly (Vardasbi et al. 2023; Nishikawa and Suzuki 2024).
But actually, yes.
6 Mamba
See Mamba.
7 Incoming
Simchowitz, Boczar, and Recht (2019)
We analyse a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. We then demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not incur a worst-case dependence on the rate at which these dependencies decay. The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.