This time, I’m looking at “KAN: Kolmogorov-Arnold Networks” by Liu et al. The paper introduces Kolmogorov-Arnold Networks (KANs) as a possible alternative to Multi-Layer Perceptrons (MLPs), especially when interpretability matters.

The core idea comes from the Kolmogorov-Arnold representation theorem (KAT), which says that any continuous multivariate function on a bounded domain can be written as a finite composition of sums and continuous univariate functions. Unlike MLPs, which have fixed activation functions on nodes and learnable linear weights on edges, KANs put learnable activation functions, parameterized as splines, directly on the edges. The nodes simply sum incoming signals. This is the architectural shift that makes the paper interesting.


Key concepts

Kolmogorov-Arnold theorem (KAT) inspiration

The network design is inspired by KAT, which states that multivariate functions can be represented using only univariate functions and sums. KANs attempt to learn this kind of decomposition, where complex relationships are built from simpler, learnable 1D functions.
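
For reference, this is the classical form of the theorem as I understand it, for a continuous function f on the unit cube. The inner functions \phi_{q,p} and outer functions \Phi_q are all continuous and univariate, and addition is the only operation that mixes variables:

```latex
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
\qquad \phi_{q,p} : [0,1] \to \mathbb{R}, \quad \Phi_q : \mathbb{R} \to \mathbb{R}
```

Read as a network, this is two layers of univariate functions with sums in between, which is the shape a KAN layer generalizes.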

KAN architecture: activations on edges

The defining feature of KANs is that the learnable components are 1D activation functions on the edges of the network graph. These are typically parameterized as B-splines. The nodes simply perform summation, unlike MLPs where nodes apply fixed non-linearities.
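
To make the contrast concrete for myself, here is a minimal sketch of one layer of each kind in plain Python. The function names are my own, the MLP layer omits biases for brevity, and the edge functions are fixed callables standing in for what would be learnable splines in a real KAN.

```python
import math

def mlp_layer(x, weights, activation=math.tanh):
    """MLP: edges carry scalar weights, each node applies one fixed non-linearity."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def kan_layer(x, edge_functions):
    """KAN: each edge applies its own 1D function, each node only sums.

    edge_functions[j][i] is the function on the edge from input i to output j.
    """
    return [sum(phi(xi) for phi, xi in zip(row, x)) for row in edge_functions]

# Two inputs, one output. These edge functions are fixed for illustration;
# in a KAN they would be learnable splines.
edges = [[math.sin, lambda t: t ** 2]]
y = kan_layer([0.5, 2.0], edges)  # computes sin(0.5) + 2.0 ** 2
```

The only place non-linearity enters `kan_layer` is on the edges; the node itself is a plain sum.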

Learnable activation functions

Instead of fixed functions like ReLU or Sigmoid in MLPs, KAN edges learn the shape of their activation function. This allows the network to adapt its non-linearity locally and potentially capture the underlying structure of the data more directly.
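
A rough sketch of what “learning the shape” means: the activation is a weighted sum of B-spline basis functions, and the weights are the trainable parameters. The code below uses the standard Cox-de Boor recursion. The helper names, the uniform knot layout, and the coefficient values are my own assumptions for illustration, not the paper’s implementation.

```python
def bspline_basis(x, knots, i, k):
    """Value at x of the i-th B-spline basis function of degree k (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] != knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(x, knots, i, k - 1))
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(x, knots, i + 1, k - 1))
    return left + right

def make_knots(lo, hi, num_intervals, k):
    """Uniform knot vector on [lo, hi], extended by k intervals on each side."""
    h = (hi - lo) / num_intervals
    return [lo + (j - k) * h for j in range(num_intervals + 2 * k + 1)]

def edge_activation(x, coeffs, knots, k):
    """phi(x) = sum_i coeffs[i] * B_i(x); coeffs are the learnable parameters."""
    return sum(c * bspline_basis(x, knots, i, k) for i, c in enumerate(coeffs))

k, G = 3, 5                           # cubic splines on 5 grid intervals
knots = make_knots(-1.0, 1.0, G, k)
coeffs = [0.0, 0.2, -0.5, 1.0, 0.3, -0.1, 0.4, 0.0]  # G + k = 8 values
value = edge_activation(0.25, coeffs, knots, k)
```

Training would adjust `coeffs` by gradient descent; changing them reshapes the curve without changing the code that evaluates it.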

Splines and adaptive grids

The learnable edge activations are represented using B-splines defined over a grid. KANs can refine these grids during training (“grid extension”): a spline learned on a coarse grid is refit on a finer one, which adds resolution, and therefore representational power, without restarting training. The grids can also be updated to follow the range of the activations they actually receive.
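
Here is a simplified sketch of refitting a spline on a finer grid. As I understand it, the paper initializes the fine-grid coefficients with a least-squares fit to the coarse spline. I use degree-1 (piecewise linear) splines instead, an assumption made to keep the example short: on a nested grid the refit then reduces to evaluating the coarse spline at the new grid points. All names are my own.

```python
def uniform_grid(lo, hi, num_intervals):
    """Grid points of a uniform grid with `num_intervals` intervals on [lo, hi]."""
    h = (hi - lo) / num_intervals
    return [lo + j * h for j in range(num_intervals + 1)]

def eval_linear_spline(x, grid, values):
    """Degree-1 spline: linear interpolation of `values` at the `grid` points."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    for j in range(len(grid) - 1):
        if grid[j] <= x <= grid[j + 1]:
            t = (x - grid[j]) / (grid[j + 1] - grid[j])
            return (1 - t) * values[j] + t * values[j + 1]

def extend_grid(grid, values, new_num_intervals):
    """Refit the spline on a finer uniform grid covering the same range."""
    fine = uniform_grid(grid[0], grid[-1], new_num_intervals)
    fine_values = [eval_linear_spline(x, grid, values) for x in fine]
    return fine, fine_values

coarse = uniform_grid(-1.0, 1.0, 4)
coarse_values = [0.0, 0.8, -0.3, 0.5, 1.0]
fine, fine_values = extend_grid(coarse, coarse_values, 8)
```

Right after the extension the fine spline reproduces the coarse one; the extra coefficients only start to matter once training continues.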


What I learned

A different theoretical basis

My initial thought was that KANs might be trying to completely replace MLPs. Digging deeper, especially into their foundations, clarified things. MLPs rely on the Universal Approximation Theorem (UAT), focusing on approximation power through linear layers and fixed non-linearities. KANs are built on the Kolmogorov-Arnold Theorem (KAT). It felt analogous to Fourier transforms: just as Fourier analysis breaks a complex signal into a sum of simple sine waves, KAT suggests breaking a complex multivariate function into sums and compositions of simpler 1D functions. KANs try to learn these 1D components, the splines on the edges. This suggests KANs and MLPs may be good at different things: MLPs for general function approximation, KANs for interpretability and for uncovering mathematical structure when it exists.

Universal approximation vs. practical reality

Both UAT for MLPs and KAT for KANs imply universal approximation capabilities. Theoretically, given enough capacity, both can approximate any continuous function. But the way they do it is different. It is not just about whether you can approximate, but how efficiently, how trainably, and how interpretably. MLPs are general workhorses, highly optimized for parallel hardware, but often opaque. KANs offer a path to interpretability and may handle functions with inherent structure better, but currently train more slowly. Choosing between them is a practical engineering trade-off: raw speed and general approximation, or interpretability and possible symbolic structure.

Edges doing the work, not nodes

The shift from node-based fixed activations in MLPs to edge-based learnable activations in KANs is the core architectural change. It feels quite different conceptually: the connections themselves learn the transformations, while nodes just add things up. This structure is closely tied to interpretability.

Potentially dodging the curse of dimensionality?

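The paper’s analysis (Theorem 2.1) gives an approximation error bound whose scaling exponent (the “residual rate”) does not depend on the input dimension n. This confused me at first, but the key was understanding it as approximation error, not network residuals. The result assumes the target function admits a smooth Kolmogorov-Arnold-style compositional representation, and the constant in the bound can still depend on the function. If this holds true in practice, it’s a big deal. It suggests KANs might handle high-dimensional functions with that kind of structure more efficiently than traditional methods that suffer from the curse of dimensionality.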
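
As I read the statement, the bound has the following form, where G is the grid size, k is the spline order, \Phi^G_l are the spline-approximated layers, and C is a constant that depends on f and its representation:

```latex
\left\| f - \left( \Phi^{G}_{L-1} \circ \cdots \circ \Phi^{G}_{0} \right) x \right\|_{C^m}
\le C \, G^{-k-1+m}, \qquad 0 \le m \le k
```

For cubic splines (k = 3) and m = 0 this gives an error that shrinks like G^{-4}. The dimension n does not appear in the exponent, though it may still be hidden in C.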

The interpretability pipeline

This was one of the most appealing parts. KANs aren’t just interpretable by design; there’s a process. They use regularization, an entropy term plus an L1-like norm on splines, to encourage sparsity, then prune away inactive edges and nodes. The neat part is “symbolification”: the system tries to match learned spline shapes to known symbolic functions like sin, exp, x^2, or linear functions. If a match is found, the spline is replaced by the symbolic function, and its parameters are fine-tuned. This can potentially extract clean mathematical formulas from the trained network.
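
The two ends of this pipeline can be sketched in a few lines. The first function combines an L1 term and an entropy term over per-edge magnitudes. The second picks the candidate symbolic function that best fits sampled spline outputs, judged by R². As I understand it, the paper fits y ≈ c·f(a·x + b) + d; I fix a = 1 and b = 0 to keep things short. The names, the candidate list, and the sample data are my own assumptions.

```python
import math

def sparsity_penalty(edge_norms, mu1=1.0, mu2=1.0):
    """L1 term plus entropy term for one layer.

    edge_norms[k] is the average magnitude of edge k's output over the data.
    """
    total = sum(edge_norms)
    if total == 0:
        return 0.0
    probs = [n / total for n in edge_norms if n > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return mu1 * total + mu2 * entropy

def fit_affine(us, ys):
    """Least-squares c, d for ys ~ c * us + d, plus the R^2 of that fit."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    c = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys)) / var_u
    d = mean_y - c * mean_u
    ss_res = sum((y - (c * u + d)) ** 2 for u, y in zip(us, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return c, d, 1.0 - ss_res / ss_tot

CANDIDATES = {"sin": math.sin, "exp": math.exp,
              "x^2": lambda x: x * x, "x": lambda x: x}

def best_symbolic_match(xs, ys):
    """Return (name, c, d, r2) for the candidate with the highest R^2."""
    best = None
    for name, f in CANDIDATES.items():
        c, d, r2 = fit_affine([f(x) for x in xs], ys)
        if best is None or r2 > best[3]:
            best = (name, c, d, r2)
    return best

# Pretend these samples came from a trained spline on one edge.
xs = [-2.0 + 4.0 * i / 50 for i in range(51)]
ys = [3.0 * math.sin(x) + 0.5 for x in xs]
name, c, d, r2 = best_symbolic_match(xs, ys)
```

With the same total magnitude, the penalty is lower when the magnitude is concentrated on one edge than when it is spread evenly. That is the pressure toward sparsity that makes pruning possible.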

Slow training, potentially fast inference

The benchmarks showed KANs can be very accurate, sometimes beating MLPs, especially on tasks with underlying symbolic structure, like fitting physics equations or solving PDEs. However, the training wall time is much longer. MLPs benefit hugely from optimized matrix multiplication on GPUs, while KANs’ spline computations, with a different learnable function on every edge, are harder to batch efficiently. The flip side is inference. A pruned and symbolified KAN could be extremely fast, needing few FLOPs, because evaluating simple symbolic functions is cheap. I also wondered whether pruning KANs might reduce latency more effectively than pruning MLPs, since removing KAN operations may have a more direct impact on serial execution time.

Surprising image fitting performance

Given the emphasis on mathematical structure, I was surprised KANs performed well on image fitting tasks, like the cameraman photo. My thinking shifted here: maybe it’s not about finding one fundamental equation for the image, but about approximating image data efficiently with combinations of simpler spline-like functions, similar to how JPEG represents image blocks with cosine basis functions. KANs’ strength in combining simple functions seems to help here too.

Continual learning promise, with caveats

The local nature of B-splines seemed promising for continual learning, since changing one part of the function shouldn’t drastically affect others. The paper showed KANs avoiding catastrophic forgetting better than MLPs in a toy example. However, this advantage seemed to diminish for deeper KANs, so it’s not a perfect solution yet.
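
The locality argument is easy to check on a single spline. In this sketch, which uses degree-1 B-splines (hat functions) and made-up coefficients, changing one coefficient only moves the function within that basis function’s support. It illustrates the mechanism for one 1D function, not the paper’s experiment.

```python
def hat(x, center, width):
    """Degree-1 B-spline (hat function): nonzero only within `width` of `center`."""
    return max(0.0, 1.0 - abs(x - center) / width)

def spline(x, centers, coeffs, width):
    """Weighted sum of hat functions, one per grid point."""
    return sum(c * hat(x, ctr, width) for c, ctr in zip(coeffs, centers))

centers = [-1.0, -0.5, 0.0, 0.5, 1.0]
width = 0.5
before = [0.2, 0.7, -0.4, 0.1, 0.9]
after = list(before)
after[4] = -2.0  # "learn something new" near x = 1.0

# Points at or left of 0.5 lie outside the support of the changed basis function.
probe_far = [-1.0, -0.6, 0.0, 0.3, 0.5]
probe_near = 0.8

unchanged = all(
    abs(spline(x, centers, before, width) - spline(x, centers, after, width)) < 1e-12
    for x in probe_far
)
changed = spline(probe_near, centers, before, width) != spline(probe_near, centers, after, width)
```

In a deeper network the output of one layer feeds the next, so a local change in one spline can shift the inputs of later ones. That may be part of why the advantage weakens with depth.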


Summary

Kolmogorov-Arnold Networks offer a genuinely different approach to building neural networks. Instead of fixed activations on nodes, they place learnable spline-based activation functions on edges. This makes interpretability feel more built into the architecture rather than added afterward.

KANs show strong accuracy, especially on science-related tasks, and may help with the curse of dimensionality and continual learning. Their main drawback is slow training compared with highly optimized MLPs. Still, the core idea feels fresh. KANs sit somewhere between numerical approximation and symbolic reasoning, which makes them especially interesting for science and engineering problems where understanding the “why” matters.