Multi-layer perceptrons have been the default wiring for neural networks since the 1980s, when backpropagation made them practical to train. Today, a challenger called Kolmogorov-Arnold Networks (KANs) is forcing researchers to ask whether we built AI's foundation on the right math in the first place.
The MLP Paradigm: How We Got Here
MLPs are the workhorses of deep learning. Every transformer, every convolutional network, every large language model you have heard of relies on MLP layers somewhere inside its architecture. They sit between attention heads, after convolution filters, and at the very end of classifiers. They are the plumbing of modern AI.
The idea is straightforward. Neurons are arranged in layers, with a numeric weight on each connection between them. Data flows in, gets multiplied by those weights, passes through a simple activation function like ReLU, and moves to the next layer. The network learns by adjusting the weights until the output matches what you want.
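The flow described above can be sketched in a few lines of NumPy. The layer sizes and random weights here are arbitrary illustrations, not anything from a real model:

```python
import numpy as np

def relu(x):
    # Simple activation: clamp negatives to zero
    return np.maximum(0.0, x)

def mlp_forward(x, layers):
    """Forward pass: each layer is a (weights, bias) pair.
    Multiply, add bias, apply the activation, repeat."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((4, 3)), np.zeros(4)),  # 3 inputs -> 4 hidden units
    (rng.standard_normal((2, 4)), np.zeros(2)),  # 4 hidden -> 2 outputs
]
out = mlp_forward(np.array([1.0, -0.5, 2.0]), layers)
```

Training would then adjust the entries of each `W` and `b` by backpropagation; the structure of the computation never changes, only the numbers.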
MLPs won because they worked. They can approximate any continuous function given enough neurons and layers, a property known as the universal approximation theorem. That theorem gave researchers a mathematical safety net: just add more layers and the network will figure it out. For decades, that strategy delivered, powering image classifiers, recommendation systems, and eventually the large language models that write code and pass bar exams.
What Kolmogorov-Arnold Networks Actually Do
KANs take a completely different approach. Instead of putting learnable weights on the connections between neurons, KANs put learnable functions on the edges. This sounds like a small change, but it reshapes the entire architecture from the ground up.
Think of an MLP as a factory where every machine does the same simple operation, just with different settings. A KAN is more like a factory where every conveyor belt itself is a tiny, adaptable machine. Each edge in the network learns its own small function, typically a spline, rather than just scaling a number by a fixed weight.
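A minimal sketch of that idea, using piecewise-linear interpolation as a stand-in for the B-splines real KANs use (all sizes and coefficients here are illustrative):

```python
import numpy as np

def edge_fn(x, knots, values):
    """One KAN edge: a learnable 1-D function. Here it is a
    piecewise-linear interpolant over fixed knots; actual KANs
    typically parameterize each edge with a B-spline."""
    return np.interp(x, knots, values)

def kan_layer(x, knots, coeffs):
    """coeffs[j][i] holds the learnable values for the edge from
    input i to output j; each output neuron just sums the outputs
    of its incoming edge functions."""
    return np.array([
        sum(edge_fn(x[i], knots, coeffs[j][i]) for i in range(len(x)))
        for j in range(len(coeffs))
    ])

knots = np.linspace(-1.0, 1.0, 5)
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((2, 3, 5))   # 3 inputs -> 2 outputs
y = kan_layer(np.array([0.2, -0.7, 0.5]), knots, coeffs)
```

Note what is learnable here: not a scalar per connection, but the entire shape of each edge's function.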
The theoretical backbone comes from the Kolmogorov-Arnold representation theorem. Andrey Kolmogorov proved a first version in 1956, and his student Vladimir Arnold, while still a teenager at Moscow State University, sharpened it in 1957; Kolmogorov completed the single-variable form later that same year. The theorem states that any multivariate continuous function can be decomposed into sums and compositions of single-variable functions. Where the universal approximation theorem told MLPs they could eventually learn anything with enough brute force, the Kolmogorov-Arnold theorem says something more structural: complex functions are actually built from simpler one-variable pieces in a specific way.
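In its modern form, the theorem says that any continuous function $f$ on the unit hypercube $[0,1]^n$ can be written as

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)
```

where every $\Phi_q$ and $\varphi_{q,p}$ is a continuous function of a single variable. Addition is the only truly multivariate operation; everything else happens one variable at a time, which is exactly the structure KAN edges mimic.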
KANs try to honor that structure directly. The original KAN paper, published on arXiv in April 2024 and later presented as an Oral at ICLR 2025, introduced the architecture and showed that these learnable spline-based edges could match or exceed MLP performance across a range of tasks. Smaller KANs achieved comparable or better accuracy than larger MLPs in function fitting, with faster neural scaling laws. The paper also noted that KANs could be intuitively visualized and could help scientists rediscover mathematical and physical laws, though it acknowledged that training remains slow.
A comparative study published through Springer Nature provides additional evidence. Researchers tested KANs and MLPs on tasks including quadratic and cubic function estimation, temperature prediction, and wine classification, assessing accuracy through Mean Squared Error and computational cost through FLOPs. Their results indicated that KANs consistently outperformed MLPs across all tested benchmarks, achieving higher predictive accuracy at reduced computational cost.
Inside the Spline Mechanics
A spline is a piecewise polynomial function. Instead of one long equation, you stitch together smaller polynomial segments at points called knots. The positions of those knots and the coefficients of each segment are what the network learns during training.
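A toy example of the idea, with two quadratic segments whose coefficients were chosen so the pieces join smoothly at the knot (in a KAN, such knot positions and coefficients would be the learned parameters):

```python
import numpy as np

# Segment boundaries (knots) and per-segment polynomial coefficients,
# highest degree first, as np.polyval expects.
knots = np.array([0.0, 1.0, 2.0])
coeffs = np.array([[ 1.0, 0.0,  0.0],   # x^2          on [0, 1)
                   [-1.0, 4.0, -2.0]])  # -x^2 + 4x - 2 on [1, 2]

def spline_eval(x, knots, coeffs):
    """Evaluate the piecewise polynomial at scalar x: find the
    segment containing x, then evaluate that segment's polynomial."""
    seg = np.searchsorted(knots, x, side="right") - 1
    seg = min(max(seg, 0), len(coeffs) - 1)
    return np.polyval(coeffs[seg], x)
```

Both pieces evaluate to 1 at the knot x = 1, and their derivatives also agree there, so the stitched curve is smooth even though no single polynomial describes it.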
This gives KANs a built-in advantage for certain types of problems. If the function you are trying to learn is smooth and has a regular structure, splines can approximate it very efficiently. However, evaluating a spline is more computationally expensive than multiplying a number by a weight. The ICLR paper itself acknowledges that slow training remains one of the main practical barriers.
Where KANs Genuinely Outperform MLPs
The strongest case for KANs right now is in scientific computing and function approximation tasks where interpretability and parameter efficiency matter more than raw throughput.
In physics-informed neural networks, for example, you often need to approximate a known differential equation. KANs can sometimes learn these mappings with a fraction of the parameters an MLP would need, and the learned splines can be inspected visually to understand what the network actually learned. The ICLR paper highlighted two examples in mathematics and physics where KANs served as useful collaborators, helping scientists rediscover mathematical and physical laws from data.
For researchers working on symbolic regression, where the goal is to recover an exact mathematical formula from data, KANs offer something MLPs simply cannot. Since each edge learns a function rather than a weight, you can sometimes read the formula directly off the network architecture. This is a genuine capability gap, not just a marginal accuracy improvement.
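A deliberately contrived illustration of that readout: suppose training drove one edge's spline toward the sine function and another's toward squaring. The output node only sums its incoming edges, so the formula is visible in the structure itself (the functions here are hand-picked, not learned):

```python
import math

# Hypothetical trained edges: one converged to sin, one to squaring.
edges = {"x1": math.sin, "x2": lambda v: v ** 2}

def tiny_kan(x1, x2):
    # The output neuron adds the two edge functions -- so the network
    # computes sin(x1) + x2^2, readable directly off the edges.
    return edges["x1"](x1) + edges["x2"](x2)
```

An MLP fit to the same data would encode the same mapping, but smeared across thousands of weights with no edge you could point at and call "sine."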
Where MLPs Still Hold the Advantage
Before anyone writes the obituary for multi-layer perceptrons, consider the ecosystem. Deep learning frameworks and GPU hardware have been optimized around matrix multiplications for years, which is exactly what MLP layers do. KANs, with their spline evaluations on edges, represent a different computational pattern that may not benefit from the same hardware optimizations, though this remains an open question. The ICLR authors explicitly noted slow training as a limitation and said more research is necessary to make KANs efficient enough for broader use.
Then there is the question of scale. The KAN results that have impressed the research community so far come from relatively small models on constrained problems. The ICLR paper focused on small-scale AI and science tasks, not production-scale deployments. How KANs would perform when scaled to billion-parameter language models remains unclear, with open questions around memory efficiency and training stability in extremely deep networks.
MLPs also benefit from decades of engineering tricks like batch normalization, dropout, and residual connections. Whether these techniques translate directly to KANs, which have no linear weights at all and replace every weight parameter with a univariate spline function, is still being explored.
The More Likely Outcome: Convergence, Not Replacement
Can KANs replace MLPs entirely? Based on the current evidence, the honest answer is: not soon, and probably not completely. But that does not mean KANs are a dead end.
The more plausible scenario is architectural convergence. We may see hybrid models that use KAN layers in specific places where spline-based approximation helps, and MLP layers elsewhere where speed and scalability matter. A transformer might use KAN layers in its final classification head or in subnetworks that handle mathematical reasoning, while keeping standard MLP layers in the bulk of the network.
The ICLR 2025 presentation itself framed KANs as promising alternatives to MLPs rather than a wholesale replacement. What KANs have already accomplished is significant. They have broken the assumption that weight-on-neuron is the only viable way to build a learnable network layer. They have demonstrated that respecting the mathematical structure of your problem can yield real efficiency gains. And they have given the scientific computing community a tool that is genuinely better suited to certain types of problems than the generic MLP hammer.
The deeper lesson is about the danger of architectural monoculture. For decades, the field essentially bet everything on one type of layer. KANs are a reminder that there might be fundamentally different ways to build learning machines, and some of those ways might be sitting in plain sight inside old mathematics papers. If theorems from the 1950s can still produce a genuinely new neural network architecture in 2024, what other mathematical ideas are we overlooking right now?