anism and construct a comprehensive framework for robust transformer models. To achieve this,
we first revisit the interpretation of self-attention in transformers, viewing it through the prism of
the Nadaraya-Watson (NW) estimator [46] in a non-parametric regression context. Within the trans-
former paradigm, the NW estimator is constructed based on the kernel density estimators (KDE)
of the keys and queries. However, these KDEs are not immune to the issue of sample contami-
nation [32]. By conceptualizing the KDE as a solution to the kernel regression problem within a
Reproducing Kernel Hilbert Space (RKHS), we can utilize a range of state-of-the-art robust KDE
techniques, such as those based on robust kernel regression and median-of-means estimators. This
facilitates the creation of substantially more robust self-attention mechanisms. The resulting suite
of robust self-attention can be adapted to a variety of transformer architectures and tasks across dif-
ferent data modalities. We carry out exhaustive experiments covering vision, language modeling,
and time-series classification. The results demonstrate that our approaches maintain comparable
accuracy on clean data while exhibiting improved performance on contaminated data. Crucially, this
is accomplished without introducing any extra parameters.
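To make the estimator view concrete, the following sketch computes one attention output as a Nadaraya-Watson weighted average and robustifies it with a simple median-of-means (MoM) scheme: split the key-value pairs into blocks, estimate within each block, and take a coordinate-wise median. The function names, the random block-splitting scheme, and the contamination model are illustrative assumptions for this sketch, not the paper's exact construction.

```python
# Sketch: self-attention as a Nadaraya-Watson (NW) estimator, plus a
# median-of-means (MoM) robustification. Illustrative assumptions only;
# the paper's robust estimators may differ in detail.
import numpy as np

def softmax_attention(q, K, V):
    """Standard attention for one query: an NW estimate of f(q)."""
    scores = K @ q / np.sqrt(q.shape[-1])  # kernel evaluations for each key
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # NW weights sum to 1
    return w @ V                           # weighted average of values

def mom_attention(q, K, V, n_blocks=5, seed=1):
    """MoM variant (assumed scheme): split samples into blocks, compute a
    per-block NW estimate, return the coordinate-wise median."""
    idx = np.random.default_rng(seed).permutation(len(K))
    blocks = np.array_split(idx, n_blocks)
    estimates = [softmax_attention(q, K[b], V[b]) for b in blocks]
    return np.median(estimates, axis=0)

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 4))
q = rng.normal(size=8)
V_bad = V.copy()
V_bad[:2] += 1e8                           # contaminate two value vectors

clean = softmax_attention(q, K, V)
plain_dev = np.linalg.norm(softmax_attention(q, K, V_bad) - clean)
mom_dev = np.linalg.norm(mom_attention(q, K, V_bad) - clean)
```

Because at most two of the five blocks can contain a contaminated sample, the coordinate-wise median discards the contaminated block estimates, so `mom_dev` stays bounded while `plain_dev` is pulled far from the clean estimate.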
Related Work on Robust Transformers: Vision Transformer (ViT) models [15,66] have recently
demonstrated impressive performance across various vision tasks, positioning themselves as a com-
pelling alternative to CNNs. A number of studies [e.g., 62,51,5,42,43,80] have proposed strate-
gies to bolster the resilience of these models against common adversarial attacks on image data,
thereby enhancing their generalizability across diverse datasets. For instance, [42] provided em-
pirical evidence of ViT’s vulnerability to white-box adversarial attacks, while demonstrating that a
straightforward ensemble defense could achieve remarkable robustness without compromising ac-
curacy on clean data. [80] suggested fully attentional networks to enhance self-attention, achieving
state-of-the-art accuracy on corrupted images. Furthermore, [43] conducted a robustness analysis
on various ViT building blocks, proposing position-aware attention scaling and patch-wise augmen-
tation to enhance the model’s robustness and accuracy. However, these investigations are primarily
geared toward vision-related tasks, which restricts their applicability across different data modalities.
As an example, the position-based attention from [43] induces a bi-directional information flow,
which is undesirable for position-sensitive data such as text or other sequences. These methods also
introduce additional parameters. Beyond these vision-focused studies, robust transformers have also
been explored in fields like text analysis and social media. [77] delved into table understanding and
suggested a robust, structurally aware table-text encoding architecture to mitigate the effects of row
and column order perturbations. [36] proposed a robust end-to-end transformer-based model for cri-
sis detection and recognition. Furthermore, [34] developed a unique attention mechanism to create
a robust neural text-to-speech model capable of synthesizing natural and stable audio. We note
that these methods vary considerably in methodology, largely owing to differences in application
domains, which limits their generalizability across diverse contexts.
Other Theoretical Frameworks for Attention Mechanisms: Attention mechanisms in transform-
ers have been recently studied from different perspectives. [67] show that attention can be derived
from smoothing the inputs with appropriate kernels. [30,10,73] further linearize the softmax kernel
in attention to attain a family of efficient transformers with both linear computational and memory
complexity. These linear attentions are proven in [7] to be equivalent to a Petrov-Galerkin projec-
tion [57], thereby indicating that the softmax normalization in dot-product attention is sufficient but
not necessary. Other frameworks for analyzing transformers that use ordinary/partial differential
equations include [40,60]. In addition, the Gaussian mixture model and graph-structured learning
have been utilized to study attention and transformers [63,19,79,74,33]. [49] linked the
self-attention mechanism to non-parametric regression, which offers enhanced interpretability
of transformers. Our approach draws upon this viewpoint but focuses instead on how
it can lead to robust solutions.
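As a concrete illustration of the linearization discussed above, the sketch below replaces the softmax kernel exp(q·k) with an inner product of non-negative feature maps φ(q)·φ(k), which lets the key-value summary be shared across all queries and brings the cost down from quadratic to linear in sequence length. The specific feature map φ(x) = elu(x) + 1 and the variable names are assumptions for this sketch, standing in for the various choices made in the cited works.

```python
# Sketch of linearized attention: phi(q).phi(k) replaces exp(q.k), so the
# key-value statistics are computed once and reused, giving O(N) cost.
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map standing in for the softmax kernel
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Each output row is a convex combination of value rows, with weights
    phi(q_i).phi(k_j) in place of exp(q_i.k_j)."""
    Qf, Kf = phi(Q), phi(K)              # (N_q, d), (N_k, d) feature maps
    KV = Kf.T @ V                        # (d, d_v): summary shared by all queries
    Z = Kf.sum(axis=0)                   # (d,): normalization statistics
    return (Qf @ KV) / (Qf @ Z)[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 8))
K = rng.normal(size=(12, 8))
V = np.full((12, 4), 3.0)                # constant values expose the weights
out = linear_attention(Q, K, V)
```

With constant values, every convex combination reproduces that constant, so each output row equals 3.0 exactly; this checks that the linearized weights still sum to one per query.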
2 Self-Attention Mechanism from a Non-parametric Regression Perspective
Assume we have key and value vectors $\{k_j, v_j\}_{j \in [N]}$ collected from the data-generating
process $v = f(k) + \varepsilon$, where $\varepsilon$ is a noise vector with $\mathbb{E}[\varepsilon] = 0$ and $f$ is the function we
want to estimate. We consider a random design setting where the key vectors $\{k_j\}_{j \in [N]}$ are i.i.d.
samples from the distribution $p(k)$, and we use $p(v, k)$ to denote the joint distribution of $(v, k)$
defined by the data-generating process. Our goal is to estimate $f(q)$ for any new query $q$. The