anism and construct a comprehensive framework for robust transformer models. To achieve this,
we first revisit the interpretation of self-attention in transformers, viewing it through the prism of
the Nadaraya-Watson (NW) estimator [46] in a non-parametric regression context. Within the trans-
former paradigm, the NW estimator is constructed based on the kernel density estimators (KDE)
of the keys and queries. However, these KDEs are not immune to the issue of sample contami-
nation [32]. By conceptualizing the KDE as a solution to the kernel regression problem within a
Reproducing Kernel Hilbert Space (RKHS), we can utilize a range of state-of-the-art robust KDE
techniques, such as those based on robust kernel regression and median-of-means estimators. This
facilitates the creation of substantially more robust self-attention mechanisms. The resulting suite
of robust self-attention can be adapted to a variety of transformer architectures and tasks across dif-
ferent data modalities. We carry out exhaustive experiments covering vision, language modeling,
and time-series classification. The results demonstrate that our approaches maintain comparable
accuracy on clean data while exhibiting improved performance on contaminated data. Crucially, this
is accomplished without introducing any extra parameters.
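To make the estimator view concrete, the following sketch computes one attention output as a Nadaraya-Watson weighted average and robustifies it with a simple median-of-means (MoM) scheme: split the key-value pairs into blocks, estimate within each block, and take a coordinate-wise median. The function names, the random block-splitting scheme, and the contamination model are illustrative assumptions for this sketch, not the paper's exact construction.

```python
# Sketch: self-attention as a Nadaraya-Watson (NW) estimator, plus a
# median-of-means (MoM) robustification. Illustrative assumptions only;
# the paper's robust estimators may differ in detail.
import numpy as np

def softmax_attention(q, K, V):
    """Standard attention for one query: an NW estimate of f(q)."""
    scores = K @ q / np.sqrt(q.shape[-1])  # kernel evaluations for each key
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # NW weights sum to 1
    return w @ V                           # weighted average of values

def mom_attention(q, K, V, n_blocks=5, seed=1):
    """MoM variant (assumed scheme): split samples into blocks, compute a
    per-block NW estimate, return the coordinate-wise median."""
    idx = np.random.default_rng(seed).permutation(len(K))
    blocks = np.array_split(idx, n_blocks)
    estimates = [softmax_attention(q, K[b], V[b]) for b in blocks]
    return np.median(estimates, axis=0)

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 4))
q = rng.normal(size=8)
V_bad = V.copy()
V_bad[:2] += 1e8                           # contaminate two value vectors

clean = softmax_attention(q, K, V)
plain_dev = np.linalg.norm(softmax_attention(q, K, V_bad) - clean)
mom_dev = np.linalg.norm(mom_attention(q, K, V_bad) - clean)
```

Because at most two of the five blocks can contain a contaminated sample, the coordinate-wise median discards the contaminated block estimates, so `mom_dev` stays bounded while `plain_dev` is pulled far from the clean estimate.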
Related Work on Robust Transformers: Vision Transformer (ViT) models [15,66] have recently
demonstrated impressive performance across various vision tasks, positioning themselves as a com-
pelling alternative to CNNs. A number of studies [e.g., 62,51,5,42,43,80] have proposed strate-
gies to bolster the resilience of these models against common adversarial attacks on image data,
thereby enhancing their generalizability across diverse datasets. For instance, [42] provided em-
pirical evidence of ViT’s vulnerability to white-box adversarial attacks, while demonstrating that a
straightforward ensemble defense could achieve remarkable robustness without compromising ac-
curacy on clean data. [80] suggested fully attentional networks to enhance self-attention, achieving
state-of-the-art accuracy on corrupted images. Furthermore, [43] conducted a robustness analysis
on various ViT building blocks, proposing position-aware attention scaling and patch-wise augmen-
tation to enhance the model’s robustness and accuracy. However, these investigations are primarily
geared toward vision-related tasks, which restricts their applicability across different data modalities.
As an example, the position-based attention from [43] induces a bi-directional information flow,
which is undesirable for position-sensitive data such as text or other sequences. These methods also
introduce additional parameters. Beyond these vision-focused studies, robust transformers have also
been explored in fields like text analysis and social media. [77] delved into table understanding and
suggested a robust, structurally aware table-text encoding architecture to mitigate the effects of row
and column order perturbations. [36] proposed a robust end-to-end transformer-based model for cri-
sis detection and recognition. Furthermore, [34] developed a unique attention mechanism to create
a robust neural text-to-speech model capable of synthesizing natural and stable audio. We note
that these methods vary considerably in methodology, largely owing to differences in application
domains, which limits their generalizability across diverse contexts.
Other Theoretical Frameworks for Attention Mechanisms: Attention mechanisms in transform-
ers have been recently studied from different perspectives. [67] show that attention can be derived
from smoothing the inputs with appropriate kernels. [30,10,73] further linearize the softmax kernel
in attention to attain a family of efficient transformers with both linear computational and memory
complexity. These linear attentions are proven in [7] to be equivalent to a Petrov-Galerkin projec-
tion [57], thereby indicating that the softmax normalization in dot-product attention is sufficient but
not necessary. Other frameworks for analyzing transformers that use ordinary/partial differential
equations include [40,60]. In addition, the Gaussian mixture model and graph-structured learning
have been utilized to study attention and transformers [63,19,79,74,33]. [49] linked the
self-attention mechanism to non-parametric regression, which offers enhanced interpretability
of transformers. Our approach draws upon this viewpoint but focuses instead on how
it can lead to robust solutions.
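As a concrete illustration of the linearization discussed above, the sketch below replaces the softmax kernel exp(q·k) with an inner product of non-negative feature maps φ(q)·φ(k), which lets the key-value summary be shared across all queries and brings the cost down from quadratic to linear in sequence length. The specific feature map φ(x) = elu(x) + 1 and the variable names are assumptions for this sketch, standing in for the various choices made in the cited works.

```python
# Sketch of linearized attention: phi(q).phi(k) replaces exp(q.k), so the
# key-value statistics are computed once and reused, giving O(N) cost.
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map standing in for the softmax kernel
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Each output row is a convex combination of value rows, with weights
    phi(q_i).phi(k_j) in place of exp(q_i.k_j)."""
    Qf, Kf = phi(Q), phi(K)              # (N_q, d), (N_k, d) feature maps
    KV = Kf.T @ V                        # (d, d_v): summary shared by all queries
    Z = Kf.sum(axis=0)                   # (d,): normalization statistics
    return (Qf @ KV) / (Qf @ Z)[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 8))
K = rng.normal(size=(12, 8))
V = np.full((12, 4), 3.0)                # constant values expose the weights
out = linear_attention(Q, K, V)
```

With constant values, every convex combination reproduces that constant, so each output row equals 3.0 exactly; this checks that the linearized weights still sum to one per query.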
2 Self-Attention Mechanism from a Non-parametric Regression Perspective
Assume we have key and value vectors $\{k_j, v_j\}_{j \in [N]}$ collected from the data-generating
process $v = f(k) + \varepsilon$, where $\varepsilon$ is a noise vector with $\mathbb{E}[\varepsilon] = 0$ and $f$ is the function we
want to estimate. We consider a random design setting where the key vectors $\{k_j\}_{j \in [N]}$ are i.i.d.
samples from the distribution $p(k)$, and we use $p(v, k)$ to denote the joint distribution of $(v, k)$
defined by the data-generating process. Our goal is to estimate $f(q)$ for any new query $q$. The