Attention as Kernel Regression
The attention mechanism can be understood as a form of kernel regression: the query determines the point at which the regression is evaluated, the keys define the kernel centers, and the values play the role of the observed targets being averaged. Under this view, the softmax weights are exactly normalized kernel weights, which helps explain why attention is such an effective inductive bias: it performs a smooth, data-dependent interpolation over a set of stored associations.
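Concretely, with a query $q \in \mathbb{R}^d$, keys $k_1, \dots, k_n$, and values $v_1, \dots, v_n$, scaled dot-product attention computes

$$
\hat{v}(q) \;=\; \sum_{i=1}^{n} \frac{\kappa(q, k_i)}{\sum_{j=1}^{n} \kappa(q, k_j)}\, v_i,
\qquad
\kappa(q, k) \;=\; \exp\!\left(\frac{q^\top k}{\sqrt{d}}\right),
$$

which is precisely the Nadaraya-Watson estimator with an exponential dot-product kernel. Note also that if queries and keys are constrained to a fixed norm, a Gaussian kernel $\exp(-\lVert q - k \rVert^2 / 2\sqrt{d})$ differs from $\kappa$ only by factors that are constant across $i$ and therefore cancel in the normalization, so the two estimators coincide.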
By viewing attention through the lens of Nadaraya-Watson kernel regression, we gain new insights into initialization schemes, normalization strategies, and architectural variants.
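To make the equivalence concrete, here is a minimal numpy sketch (the function names are illustrative, not from any library) comparing a single-query scaled dot-product attention step with the Nadaraya-Watson estimator under the exponential kernel above:

```python
import numpy as np

def nadaraya_watson(q, K, V, d):
    """Nadaraya-Watson estimate at query point q with the exponential
    dot-product kernel kappa(q, k) = exp(q . k / sqrt(d))."""
    w = np.exp(K @ q / np.sqrt(d))       # unnormalized kernel weights, shape (n,)
    return (w / w.sum()) @ V             # kernel-weighted average of the values

def attention(q, K, V, d):
    """Standard scaled dot-product attention for a single query."""
    scores = K @ q / np.sqrt(d)          # one score per key, shape (n,)
    w = np.exp(scores - scores.max())    # numerically stable softmax numerator
    return (w / w.sum()) @ V             # softmax weights times values

rng = np.random.default_rng(0)
n, d, dv = 8, 16, 4                      # sequence length, key dim, value dim
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, dv))

# The two computations agree up to floating-point error: the max-subtraction
# in the stable softmax cancels in the normalization.
assert np.allclose(nadaraya_watson(q, K, V, d), attention(q, K, V, d))
```

The only difference between the two functions is the max-subtraction used for numerical stability; mathematically they are the same estimator.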