Regression is All You Need

08/28/2025 · Cyrion Labs

Summary

We present a novel perspective on the attention mechanism by deriving it from first principles of nonparametric regression and kernel smoothing. This view unifies classical statistical methods with modern deep learning.

Attention as Kernel Regression

The attention mechanism can be understood as a form of kernel regression in which the query determines the regression point and the keys define the kernel centers. This perspective clarifies why attention is such an effective inductive bias: each output is a data-dependent weighted average of the values, with weights given by a similarity kernel between the query and the keys.
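
Concretely, the correspondence can be written out. The Nadaraya-Watson estimator smooths responses attached to a set of centers, and softmax attention has exactly this form once the kernel is taken to be the exponential of the scaled dot product (a standard identity; the notation below is ours):

$$
\hat{m}(q) \;=\; \frac{\sum_{i} K(q, k_i)\, v_i}{\sum_{j} K(q, k_j)},
\qquad
K(q, k) \;=\; \exp\!\left(\frac{q^\top k}{\sqrt{d}}\right)
$$

With this kernel, $\hat{m}(q)$ is precisely $\mathrm{softmax}(q^\top K^\top / \sqrt{d})\,V$: the query $q$ is the regression point, the keys $k_i$ are the kernel centers, and the values $v_i$ play the role of the observed responses.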

By viewing attention through the lens of Nadaraya-Watson kernel regression, we gain new insights into initialization schemes, normalization strategies, and architectural variants.
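
As a sanity check on the equivalence, here is a minimal NumPy sketch (our illustration; the function names are ours, not from the post) that computes the same quantity two ways: once as standard scaled dot-product attention, and once as a Nadaraya-Watson estimate with the exponential kernel.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

def nadaraya_watson(Q, K, V):
    """Nadaraya-Watson regression with the exponential kernel K(q, k) = exp(q.k / sqrt(d))."""
    d = Q.shape[-1]
    kern = np.exp(Q @ K.T / np.sqrt(d))                # unnormalized kernel weights
    return (kern @ V) / kern.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query (regression) points
K = rng.normal(size=(16, 8))   # 16 kernel centers (keys)
V = rng.normal(size=(16, 8))   # responses attached to each center (values)

assert np.allclose(softmax_attention(Q, K, V), nadaraya_watson(Q, K, V))
```

The two functions agree to numerical precision; the softmax denominator is simply the kernel normalization of the Nadaraya-Watson estimator.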

Implications

This kernel perspective suggests natural extensions to the attention mechanism based on well-studied properties of kernel methods. It also provides a principled framework for understanding when and why attention works.
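
For instance, one extension the kernel view suggests is swapping the exponential kernel for another positive kernel while keeping the same normalized weighted average. The sketch below is our illustration of this idea, not a variant prescribed by the post; it uses a Gaussian (RBF) kernel over query-key distances, with a `bandwidth` parameter playing the role of the softmax temperature.

```python
import numpy as np

def kernel_attention(Q, K, V, bandwidth=1.0):
    """Attention in Nadaraya-Watson form with a Gaussian (RBF) kernel:
    weights decay with the squared query-key distance instead of growing
    with the dot product."""
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)   # (n_q, n_k)
    kern = np.exp(-sq_dists / (2.0 * bandwidth ** 2))           # K(q, k_i) >= 0
    return (kern @ V) / kern.sum(axis=-1, keepdims=True)        # normalized weighted average
```

Under this framing, classical results on kernel choice and bandwidth selection apply directly: shrinking `bandwidth` sharpens the weighting toward the nearest keys, while enlarging it smooths the output over more of them.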

The connection to classical nonparametric statistics opens up a wealth of theoretical tools that can be applied to analyze and improve transformer architectures.