Kolmogorov-Arnold neural networks

Don’t learn weights; learn activations that encode a physical process.

October 13, 2024 — October 14, 2024

calculus
dynamical systems
geometry
Hilbert space
how do science
Lévy processes
machine learning
neural nets
PDEs
physics
regression
sciml
SDEs
signal processing
statistics
statmech
stochastic processes
surrogate
time series
uncertainty
Figure 1

A hyped variant of classic NNs.

Where the classic NN (i.e. the MLP) relies on layers of linear transformations (weights) followed by fixed activation functions (like ReLU or tanh) at the nodes, Kolmogorov-Arnold Networks (KANs) learn the activation functions themselves and place them on the edges, where an MLP would have scalar weights.

Interesting things about these networks, from my first impression:

  1. They seem to fill a niche between symbolic regression and MLPs.
  2. It seems to be easy to sparsify them in a way that is not possible with MLPs.

1 Kolmogorov-Arnold Theorem

The Kolmogorov-Arnold theorem states that any continuous multivariate function \(f(x_1, \dots, x_n)\) on a bounded domain can be written using only continuous univariate functions and addition. The classic representation looks like this:

\[ f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right) \]

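Here the inner functions \(\varphi_{q,p}\) and the outer functions \(\Phi_q\) are all continuous and univariate. This means that any continuous multivariate function, however complicated, can be broken down into a composition of univariate functions plus some addition.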
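As a small worked example of the shape of such a decomposition (my own illustration, not taken from the paper), consider the product of two variables. It looks irreducibly bivariate, but it can be written with two outer terms:

\[ xy = \Phi_1\big(\varphi_{1,1}(x) + \varphi_{1,2}(y)\big) + \Phi_2\big(\varphi_{2,1}(x) + \varphi_{2,2}(y)\big) = \tfrac{1}{4}(x + y)^2 - \tfrac{1}{4}(x - y)^2 \]

with \(\Phi_1(u) = u^2/4\), \(\Phi_2(u) = -u^2/4\), \(\varphi_{1,1}(x) = \varphi_{2,1}(x) = x\), \(\varphi_{1,2}(y) = y\) and \(\varphi_{2,2}(y) = -y\). The theorem allows up to \(2n+1 = 5\) outer terms for \(n = 2\); this function happens to need only two. The univariate functions here are also unusually pleasant: for a general continuous \(f\), the theorem guarantees that suitable functions exist, not that they are smooth.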

2 KAN networks

In a KAN (Liu, Wang, et al. 2024), we learn how these univariate functions compose themselves into a multivariate structure, instead of fixing the composition in advance. The “weights” between nodes, represented by splines or other parameterized functions, are free to learn what the best local univariate relationship is.

KANs are structured by stacking KAN layers, where each layer looks something like this:

\[ x_{l+1,j} = \sum_{i=1}^{n_l} \varphi_{l,j,i}(x_{l,i}) \]

where \(\varphi_{l,j,i}\) is a learnable univariate function (parameterized as a spline, for instance), and \(x_{l,i}\) is the activation value from the previous layer. So while each function is univariate, the overall transformation still respects the multivariate nature of the input. Writing \(\Phi_l\) for the whole layer map \(x_l \mapsto x_{l+1}\) (a matrix of univariate functions, unlike the scalar outer functions \(\Phi_q\) in the theorem), the final function learned by a KAN is a composition of these layers:

\[ \operatorname{KAN}(x) = (\Phi_L \circ \Phi_{L-1} \circ \dots \circ \Phi_1)(x) \]
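The layer equation and the composition above can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not the paper's implementation: each univariate function is a piecewise-linear spline built from "hat" basis functions on a shared uniform grid (the names `hat_basis`, `kan_layer` and `kan_forward` are hypothetical), there is no training loop, and inputs are assumed to stay inside the grid range.

```python
import numpy as np

def hat_basis(x, grid):
    """Piecewise-linear 'hat' basis on a uniform grid (an assumption made
    for simplicity; other spline families would work the same way).
    x: (batch, n_in), grid: (G,)  ->  (batch, n_in, G)."""
    h = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(x[..., None] - grid) / h, 0.0, None)

def kan_layer(x, coef, grid):
    """One KAN layer: out[b, j] = sum_i phi_{j,i}(x[b, i]),
    where phi_{j,i}(t) = sum_g coef[j, i, g] * hat_g(t).
    coef has shape (n_out, n_in, G)."""
    basis = hat_basis(x, grid)                     # (batch, n_in, G)
    return np.einsum("big,jig->bj", basis, coef)   # (batch, n_out)

def kan_forward(x, coefs, grid):
    """Compose layers: KAN(x) = (Phi_L o ... o Phi_1)(x)."""
    for coef in coefs:
        x = kan_layer(x, coef, grid)
    return x

# A tiny [2, 3, 1] network with random (untrained) spline coefficients.
rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 9)
widths = [2, 3, 1]
coefs = [0.1 * rng.standard_normal((n_out, n_in, grid.size))
         for n_in, n_out in zip(widths[:-1], widths[1:])]
x = rng.uniform(-1.0, 1.0, size=(5, 2))
out = kan_forward(x, coefs, grid)                  # shape (5, 1)
```

Note that there are no weight matrices anywhere: every learnable parameter lives inside some univariate function, and the only multivariate operation is the sum over incoming edges.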

This makes KANs flexible like MLPs, but the information is arguably combined in a way that is easier for a human to inspect: we can plot or probe each learned univariate function \(\varphi_{l,j,i}\) on its own.

3 As symbolic-ish regression

Symbolic regression tries to discover closed-form expressions (think of something like \(y = \sin(x) + x^2\)) directly from the data, i.e. a symbolic representation of a function. Symbolic regression is powerful because it gives you a human-readable formula, something interpretable, but it is not robust to noise. The search space of possible functions is huge, and small changes in the data, or a little noise, can cause symbolic regression to fail completely.

On the other side of the spectrum, we have traditional NNs (MLPs), which are universal function approximators but work as “black boxes.” They don’t tell us how they approximate a function; they just do it. We get almost zero interpretability.

A KAN can, in theory, produce output that mimics symbolic regression by learning a function’s compositional structure. For example, if we’re modelling something like:

\[ f(x, y) = \exp(\sin(\pi x) + y^2) \]

An MLP would use layers of matrix multiplications and fixed activations (like ReLU) to approximate this. A KAN, by contrast, can in principle learn the univariate building blocks themselves (the inner \(\sin(\pi x)\) and \(y^2\), and the outer \(\exp\)) and how to combine them. Once the KAN has trained, we could probe its learned activation functions and discover, for example, that it has closely approximated \(\sin(\pi x)\) and \(\exp(x)\) as part of its learned structure.
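To make the structural claim concrete, here is a sketch showing that a KAN of shape [2, 1, 1] is already expressive enough for this target. Nothing is trained here: as a stand-in for learned splines, I set each univariate function by hand to a piecewise-linear interpolant of the function we hope training would find. The helper `make_phi`, the knot counts, and the input range \([-1, 1]^2\) are all assumptions made for illustration.

```python
import numpy as np

def make_phi(func, lo, hi, n_knots=201):
    """Return a piecewise-linear spline that interpolates func on [lo, hi].
    This plays the role of a 'learned' univariate activation."""
    knots = np.linspace(lo, hi, n_knots)
    values = func(knots)
    return lambda t: np.interp(t, knots, values)

# Layer 1: two edges feeding one hidden node.
phi_x = make_phi(lambda t: np.sin(np.pi * t), -1.0, 1.0)
phi_y = make_phi(np.square, -1.0, 1.0)
# Layer 2: one edge from the hidden node to the output.
# The hidden value sin(pi x) + y^2 lies in [-1, 2] for inputs in [-1, 1].
phi_out = make_phi(np.exp, -1.0, 2.0, n_knots=301)

def kan_2_1_1(x, y):
    hidden = phi_x(x) + phi_y(y)   # sum of univariate functions
    return phi_out(hidden)         # one more univariate function

rng = np.random.default_rng(0)
x, y = rng.uniform(-1.0, 1.0, size=(2, 1000))
target = np.exp(np.sin(np.pi * x) + y**2)
max_err = np.max(np.abs(kan_2_1_1(x, y) - target))
print(f"max abs error: {max_err:.2e}")
```

The interesting direction is the reverse one, of course: training a KAN from data and then reading candidate symbolic forms off plots of `phi_x`, `phi_y` and `phi_out`. This sketch only shows that the target fits the architecture with three univariate functions.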

4 Scaling accuracy

The paper claims that KANs enjoy a neural scaling law of \(\ell \propto N^{-4}\) (where \(\ell\) is the test loss and \(N\) is the number of parameters).
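To get a feel for what an exponent of 4 would mean in practice, suppose the law holds exactly, i.e. \(\ell(N) = C N^{-4}\) for some constant \(C\) (an idealization; the constant and the range over which the law holds are not specified here). Then

\[ \frac{\ell(2N)}{\ell(N)} = 2^{-4} = \frac{1}{16}, \qquad \log \ell = \log C - 4 \log N \]

so doubling the parameter count would cut the test loss by a factor of 16, and a log-log plot of loss against parameter count would be a straight line of slope \(-4\).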

5 References