Designing Robust Transformers using
Robust Kernel Density Estimation
Xing Han
Department of ECE
University of Texas at Austin
aaronhan223@utexas.edu
Tongzheng Ren
Department of Computer Science
University of Texas at Austin
tongzheng@utexas.edu
Tan Minh Nguyen
Department of Mathematics
University of California, Los Angeles
tanmnguyen89@ucla.edu
Khai Nguyen
Department of Statistics and Data Sciences
University of Texas at Austin
khainb@utexas.edu
Joydeep Ghosh
Department of ECE
University of Texas at Austin
jghosh@utexas.edu
Nhat Ho
Department of Statistics and Data Sciences
University of Texas at Austin
minhnhat@utexas.edu
Abstract
Transformer-based architectures have recently exhibited remarkable successes across different domains beyond just powering large language models. However, existing approaches typically focus on predictive accuracy and computational cost, largely ignoring certain other practical issues such as robustness to contaminated samples. In this paper, by re-interpreting the self-attention mechanism as a non-parametric kernel density estimator, we adapt classical robust kernel density estimation methods to develop novel classes of transformers that are resistant to adversarial attacks and data contamination. We first propose methods that down-weight outliers in RKHS when computing the self-attention operations. We empirically show that these methods produce improved performance over existing state-of-the-art methods, particularly on image data under adversarial attacks. Then we leverage the median-of-means principle to obtain another efficient approach that results in noticeably enhanced performance and robustness on language modeling and time series classification tasks. Our methods can be combined with existing transformers to augment their robust properties, thus promising to impact a wide variety of applications.
1 Introduction
Attention mechanisms and transformers [70] have drawn lots of attention in the machine learning
community [35,64,31]. Now they are among the best deep learning architectures for a variety of
applications, including those in natural language processing [14,1,11,9,55,3,6,12], computer
vision [16,38,65,56,52,17,39], and reinforcement learning [8,28]. They are also known for
their effectiveness in transferring knowledge from various pretraining tasks to different downstream
applications with weak supervision or no supervision [53,54,14,78,37].
Preprint. Under review.
arXiv:2210.05794v3 [cs.LG] 8 Nov 2023

While there have been notable advancements, the robustness of the standard attention module remains an unresolved issue in the literature. In this paper, our goal is to reinforce the attention mechanism and construct a comprehensive framework for robust transformer models. To achieve this, we first revisit the interpretation of self-attention in transformers, viewing it through the prism of the Nadaraya-Watson (NW) estimator [46] in a non-parametric regression context. Within the transformer paradigm, the NW estimator is constructed based on the kernel density estimators (KDE) of the keys and queries. However, these KDEs are not immune to the issue of sample contamination [32]. By conceptualizing the KDE as a solution to the kernel regression problem within a Reproducing Kernel Hilbert Space (RKHS), we can utilize a range of state-of-the-art robust KDE techniques, such as those based on robust kernel regression and median-of-means estimators. This facilitates the creation of substantially more robust self-attention mechanisms. The resulting suite of robust self-attention mechanisms can be adapted to a variety of transformer architectures and tasks across different data modalities. We carry out exhaustive experiments covering vision, language modeling, and time-series classification. The results demonstrate that our approaches can uphold comparable accuracy on clean data while exhibiting improved performance on contaminated data. Crucially, this is accomplished without introducing any extra parameters.
Related Work on Robust Transformers: Vision Transformer (ViT) models [15,66] have recently demonstrated impressive performance across various vision tasks, positioning themselves as a compelling alternative to CNNs. A number of studies [e.g., 62,51,5,42,43,80] have proposed strategies to bolster the resilience of these models against common adversarial attacks on image data, thereby enhancing their generalizability across diverse datasets. For instance, [42] provided empirical evidence of ViT's vulnerability to white-box adversarial attacks, while demonstrating that a straightforward ensemble defense could achieve remarkable robustness without compromising accuracy on clean data. [80] suggested fully attentional networks to enhance self-attention, achieving state-of-the-art accuracy on corrupted images. Furthermore, [43] conducted a robustness analysis on various ViT building blocks, proposing position-aware attention scaling and patch-wise augmentation to enhance the model's robustness and accuracy. However, these investigations are primarily geared toward vision-related tasks, which restricts their applicability across different data modalities. As an example, the position-based attention from [43] induces a bi-directional information flow, which is limiting for position-sensitive datasets such as text or sequences. These methods also introduce additional parameters. Beyond these vision-focused studies, robust transformers have also been explored in fields like text analysis and social media. [77] delved into table understanding and suggested a robust, structurally aware table-text encoding architecture to mitigate the effects of row and column order perturbations. [36] proposed a robust end-to-end transformer-based model for crisis detection and recognition. Furthermore, [34] developed a unique attention mechanism to create a robust neural text-to-speech model capable of synthesizing both natural and stable audio. These methods vary considerably in their methodologies, largely due to differences in application domains, which limits their generalizability across diverse contexts.
Other Theoretical Frameworks for Attention Mechanisms: Attention mechanisms in transformers have been recently studied from different perspectives. [67] show that attention can be derived from smoothing the inputs with appropriate kernels. [30,10,73] further linearize the softmax kernel in attention to attain a family of efficient transformers with both linear computational and memory complexity. These linear attentions are proven in [7] to be equivalent to a Petrov-Galerkin projection [57], thereby indicating that the softmax normalization in dot-product attention is sufficient but not necessary. Other frameworks for analyzing transformers that use ordinary/partial differential equations include [40,60]. In addition, the Gaussian mixture model and graph-structured learning have been utilized to study attention mechanisms and transformers [63,19,79,74,33]. [49] linked the self-attention mechanism with a non-parametric regression perspective, which offers enhanced interpretability of Transformers. Our approach draws upon this viewpoint, but focuses instead on how it can lead to robust solutions.
2 Self-Attention Mechanism from a Non-parametric Regression Perspective
Assume we have key and value vectors $\{k_j, v_j\}_{j \in [N]}$ collected from the data-generating process $v = f(k) + \varepsilon$, where $\varepsilon$ is a noise vector with $\mathbb{E}[\varepsilon] = 0$ and $f$ is the function that we want to estimate. We consider a random design setting where the key vectors $\{k_j\}_{j \in [N]}$ are i.i.d. samples from the distribution $p(k)$, and we use $p(v, k)$ to denote the joint distribution of $(v, k)$ defined by the data-generating process. Our target is to estimate $f(q)$ for any new query $q$. The NW estimator provides a non-parametric approach to estimating the function $f$; the main idea is that

$$f(k) = \mathbb{E}[v \mid k] = \int_{\mathbb{R}^D} v \cdot p(v \mid k)\, dv = \int_{\mathbb{R}^D} v \cdot \frac{p(v, k)}{p(k)}\, dv, \qquad (1)$$

where the first equality comes from the fact that $\mathbb{E}[\varepsilon] = 0$, the second from the definition of conditional expectation, and the last from the definition of conditional density. To estimate $f$, we just need estimates of both the joint density $p(v, k)$ and the marginal density $p(k)$. KDE is commonly used for this density estimation problem [58,50]; it requires a kernel $k_\sigma$ with bandwidth parameter $\sigma$ satisfying $\int_{\mathbb{R}^D} k_\sigma(x - x')\, dx' = 1$ for all $x$, and estimates the densities as

$$\hat p_\sigma(v, k) = \frac{1}{N} \sum_{j \in [N]} k_\sigma([v, k] - [v_j, k_j]), \qquad \hat p_\sigma(k) = \frac{1}{N} \sum_{j \in [N]} k_\sigma(k - k_j), \qquad (2)$$

where $[v, k]$ denotes the concatenation of $v$ and $k$. Specifically, when $k_\sigma$ is the isotropic Gaussian kernel, we have $\hat p_\sigma(v, k) = \frac{1}{N} \sum_{j \in [N]} k_\sigma(v - v_j)\, k_\sigma(k - k_j)$. Combining this with Eq. (1) and Eq. (2), we obtain the NW estimator of the function $f$ as

$$\hat f_\sigma(k) = \frac{\sum_{j \in [N]} v_j\, k_\sigma(k - k_j)}{\sum_{j \in [N]} k_\sigma(k - k_j)}. \qquad (3)$$

Furthermore, it is not hard to show that if the keys $\{k_j\}_{j \in [N]}$ are normalized, the estimator $\hat f_\sigma(q_i)$ in Eq. (3) is exactly standard softmax attention, $\hat f_\sigma(q_i) = \sum_{j \in [N]} \mathrm{softmax}(q_i^\top k_j / \sigma^2)\, v_j$. Such an assumption of normalized keys is mild, as in practice the keys are typically normalized to stabilize the training of the transformer [61]. If we choose $\sigma^2 = \sqrt{D}$, where $D$ is the dimension of $q$ and $k_j$, then $\hat f_\sigma(q_i) = h_i$, the standard attention output. As a result, the self-attention mechanism in fact performs non-parametric regression with the NW estimator and an isotropic Gaussian kernel when the keys are normalized.
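To make the equivalence concrete, here is a minimal NumPy check (our own illustration, not code from the paper) that the NW estimator of Eq. (3) with an isotropic Gaussian kernel and unit-norm keys coincides with softmax attention at temperature $\sigma^2$:

```python
import numpy as np

def nw_estimator(q, K, V, sigma2):
    """Nadaraya-Watson estimate of f(q) with isotropic Gaussian kernel, Eq. (3)."""
    # kernel scores k_sigma(q - k_j); the normalizing constant cancels in the ratio
    w = np.exp(-np.sum((q - K) ** 2, axis=1) / (2.0 * sigma2))
    return w @ V / w.sum()

def softmax_attention(q, K, V, sigma2):
    """Standard softmax attention with scores q^T k_j / sigma^2."""
    s = K @ q / sigma2
    a = np.exp(s - s.max())
    return (a / a.sum()) @ V

rng = np.random.default_rng(0)
D, N = 4, 8
K = rng.normal(size=(N, D))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # normalized keys
V = rng.normal(size=(N, D))
q = rng.normal(size=D)
# With ||k_j|| = 1, the ||q||^2 and ||k_j||^2 terms cancel in the NW ratio,
# leaving exactly the softmax weights over q^T k_j / sigma^2.
```

Expanding $\|q - k_j\|^2 = \|q\|^2 + \|k_j\|^2 - 2 q^\top k_j$ shows why: the first two terms are constant in $j$ when keys are normalized, so they cancel between numerator and denominator.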
Figure 1: The contour plots illustrate the density estimation of the two-dimensional query vector embedding within a transformer's attention layer. The left plot employs the regular KDE method, as defined in Eq. (4), whereas the right plot utilizes a robustified version of the KDE method, which enhances KDE's robustness against outliers.
KDE as a Regression Problem in RKHS We start with the formal definition of the RKHS. The space $\mathcal{H}_k = \{f \mid f : \mathcal{X} \to \mathbb{R}\}$ is the RKHS associated with a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if it is a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ and the following properties:
- $k(x, \cdot) \in \mathcal{H}_k$ for all $x \in \mathcal{X}$;
- $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$ (the reproducing property).

With a slight abuse of notation, we define $k_\sigma(x, x') = k_\sigma(x - x')$. By the definition of the RKHS and the KDE estimator, we know that $\hat p_\sigma = \frac{1}{N} \sum_{j \in [N]} k_\sigma(x_j, \cdot) \in \mathcal{H}_{k_\sigma}$, and it can be viewed as the optimal solution of the following least-squares regression problem in the RKHS:

$$\hat p_\sigma = \arg\min_{p \in \mathcal{H}_{k_\sigma}} \sum_{j \in [N]} \frac{1}{N} \| k_\sigma(x_j, \cdot) - p \|^2_{\mathcal{H}_{k_\sigma}}. \qquad (4)$$

Note that, in Eq. (4), the same weight factor $1/N$ is applied uniformly to each error term $\| k_\sigma(x_j, \cdot) - p \|^2_{\mathcal{H}_{k_\sigma}}$. This approach functions effectively when there are no outliers in the set $\{k_\sigma(x_j, \cdot)\}_{j \in [N]}$. However, when outliers are present (for instance, when there is some $j$ such that $\| k_\sigma(x_j, \cdot) \|_{\mathcal{H}_{k_\sigma}} \gg \| k_\sigma(x_i, \cdot) \|_{\mathcal{H}_{k_\sigma}}$ for all $i \in [N], i \neq j$), the error attributable to these outliers will overwhelmingly influence the total error, leading to a significant deterioration in the overall density estimate. We illustrate this robustness issue of the KDE in Figure 1. The view that KDE is susceptible to outliers, coupled with the non-parametric understanding of the self-attention mechanism,
[Figure 2, left panel: image patches and their attention weight factors under clean data, a PGD attack, and brightness corruption. Right panel: a text example in which content words (e.g., "raised", "NGO") are replaced by a word-swap attack.]
Figure 2: The application of Transformers with robust KDE attention on image and text is shown. (Left) The robust KDE self-attention generates varying weight factors for image patch embeddings under adversarial attacks or data corruption. The adversely impacted regions that would otherwise lead to incorrect predictions are down-weighted, ensuring enhanced accuracy and robustness. (Right) In the field of language modeling, the weight factors lend significance to essential keywords (highlighted in red). In the face of word swap attacks, the fortified self-attention mechanism, particularly when utilizing the median-of-means principle, is proficient in disregarding or reducing the importance of less consequential words (marked in green). Consequently, this results in a more resilient procedure during self-attention computations.
implies a potential lack of robustness in Transformers when handling outlier-rich data. We now offer a fresh perspective on this robustness issue, introducing a universal framework that is applicable across diverse data modalities.
3 Robust Transformers that Employ Robust Kernel Density Estimators
Drawing on the non-parametric regression formulation of self-attention, we derive multiple robust
variants of the NW-estimator and demonstrate their applicability in fortifying existing Transformers.
We propose two distinct types of robust self-attention mechanisms and delve into the properties of
each, potentially paving the way for Transformer variants that are substantially more robust.
3.1 Down-weighting Outliers in RKHS
Inspired by robust regression [18], a direct approach to achieving robust KDE involves down-weighting outliers in the RKHS. More specifically, we substitute the least-squares loss in Eq. (4) with a robust loss function $\rho$, resulting in the following formulation:

$$\hat p_{\mathrm{robust}} = \arg\min_{p \in \mathcal{H}_{k_\sigma}} \sum_{j \in [N]} \rho\big( \| k_\sigma(x_j, \cdot) - p \|_{\mathcal{H}_{k_\sigma}} \big) = \sum_{j \in [N]} \omega_j\, k_\sigma(x_j, \cdot). \qquad (5)$$

Examples of the robust loss function $\rho$ include the Huber loss [25], Hampel loss [21], Welsch loss [75] and Tukey loss [18]. We empirically evaluate different loss functions in our experiments. The critical step here is to estimate the set of weights $\omega = (\omega_1, \dots, \omega_N) \in \Delta_N$, with each $\omega_j \propto \psi\big( \| k_\sigma(x_j, \cdot) - \hat p_{\mathrm{robust}} \|_{\mathcal{H}_{k_\sigma}} \big)$, where $\psi(x) := \rho'(x)/x$. Since $\hat p_{\mathrm{robust}}$ is defined via $\omega$, and $\omega$ also depends on $\hat p_{\mathrm{robust}}$, one can address this circular definition via the alternating update algorithm proposed by [32]. The algorithm starts with a randomly initialized $\omega^{(0)} \in \Delta_N$, and performs alternating updates between $\hat p_{\mathrm{robust}}$ and $\omega$ until the optimal $\hat p_{\mathrm{robust}}$ is reached at the fixed point (see details in Appendix A).
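As an illustration only (not the paper's implementation in Appendix A), the alternating scheme can be sketched in NumPy. We assume a Huber loss with threshold $c$, for which $\psi(x) = \rho'(x)/x$ equals $1$ for $x \le c$ and $c/x$ otherwise, and we evaluate the RKHS residual norms via the kernel trick:

```python
import numpy as np

def huber_psi(d, c=1.0):
    """psi(x) = rho'(x)/x for the Huber loss: 1 if x <= c, else c/x."""
    return np.where(d <= c, 1.0, c / np.maximum(d, 1e-12))

def rkde_weights(G, c=1.0, iters=50):
    """Alternating updates for robust KDE weights.

    G[i, j] = k_sigma(x_i, x_j) is the Gram matrix. Returns simplex
    weights omega defining p_robust = sum_j omega_j * k_sigma(x_j, .).
    """
    N = G.shape[0]
    w = np.full(N, 1.0 / N)                    # omega^(0): uniform start
    for _ in range(iters):
        # d_j = ||k_sigma(x_j,.) - p_robust||_H, expanded with the kernel trick
        d2 = np.diag(G) - 2.0 * G @ w + w @ G @ w
        d = np.sqrt(np.maximum(d2, 0.0))
        psi = huber_psi(d, c)
        w = psi / psi.sum()                    # normalized re-weighting
    return w

# Toy data: four inliers near 0 and one outlier at 10, Gaussian kernel.
x = np.array([0.0, 0.1, -0.1, 0.05, 10.0])
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)
w = rkde_weights(G)
# The outlier's large residual norm yields psi < 1, shrinking its weight.
```

The Gram matrix, the threshold $c = 1$, and the fixed iteration count are illustrative choices; as noted above, convergence of these iterative updates to the optimum is not guaranteed in general.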
However, while this technique effectively diminishes the influence of outliers, it also comes with noticeable drawbacks. First, it necessitates an appropriate selection of the robust loss function, which may entail additional effort to understand the patterns of the outliers. Second, the iterative updates might not converge to the optimal solution. A better alternative is to assign higher weights to high-density regions and reduce the weights of atypical samples: the original KDE is scaled and then projected onto its nearest weighted KDE according to the L2 norm. Similar concepts have been studied as Scaled and Projected KDE (SPKDE) [69], which offers an improved set of weights that better defends against outliers in the RKHS. Specifically, given a scaling factor $\beta > 1$, let $C^N_\sigma$ be the convex hull of $k_\sigma(x_1, \cdot), \dots, k_\sigma(x_N, \cdot) \in \mathcal{H}_{k_\sigma}$, i.e., the space of weighted
Algorithm 1 Procedure of Computing the Attention Vector of Transformer-RKDE/SPKDE/MoM
1: Input: $Q = \{q_i\}_{i \in [N]}$, $K = \{k_j\}_{j \in [N]}$, $V = \{v_l\}_{l \in [N]}$, initial weights $\omega^{(0)}$.
2: Normalize $K = \{k_j\}_{j \in [N]}$ along the head dimension.
3: Compute the kernel function between each pair in the sequence: $k_\sigma(Q, K) = \{k_\sigma(q_i - k_j)\}_{i,j \in [N]}$.
4: (Optional) Apply the attention mask to $k_\sigma(Q, K)$.
5: [MoM] Randomly sample $B$ subsets $I_1, \dots, I_B$ of size $S$; obtain the median block $I_l$ such that $\frac{1}{S} \sum_{j \in I_l} k_\sigma(q_i - k_j) = \mathrm{median}\big\{ \frac{1}{S} \sum_{j \in I_1} k_\sigma(q_i - k_j), \dots, \frac{1}{S} \sum_{j \in I_B} k_\sigma(q_i - k_j) \big\}$.
6: [RKDE] Update the weights $\omega^{(0)}$ for the marginal/joint density by $\omega^{(1)}_j = \psi\big( \| k_\sigma(k_j, \cdot) - \hat p^{(k)}_{\mathrm{robust}} \|_{\mathcal{H}_{k_\sigma}} \big) \big/ \sum_{j \in [N]} \psi\big( \| k_\sigma(k_j, \cdot) - \hat p^{(k)}_{\mathrm{robust}} \|_{\mathcal{H}_{k_\sigma}} \big)$.
7: [SPKDE] Obtain the optimal weights $\omega$ for the marginal/joint density by solving Eq. (7).
8: [RKDE, SPKDE] Obtain the robust self-attention vector $\hat h_i = \sum_{j \in [N]} v_j \omega^{\mathrm{joint}}_j k_\sigma(q_i - k_j) \big/ \sum_{j \in [N]} \omega^{\mathrm{marginal}}_j k_\sigma(q_i - k_j)$.
9: [MoM] Obtain the attention vector $\hat h_i = \sum_{j \in I_l} v_j k_\sigma(q_i - k_j) \big/ \sum_{j \in I_l} k_\sigma(q_i - k_j)$.
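The [MoM] branch of Algorithm 1 admits a compact sketch. The following NumPy function is our own illustration for a single query; the subset count $B$, block size $S$, and sampling scheme are assumptions, not the paper's configuration:

```python
import numpy as np

def mom_attention(q, K, V, sigma2, B=5, S=3, seed=0):
    """Median-of-means attention for a single query vector q.

    Samples B index blocks of size S, keeps the block whose mean kernel
    score is the median, and computes NW attention on that block only.
    """
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    blocks = [rng.choice(N, size=S, replace=False) for _ in range(B)]
    # mean Gaussian-kernel score per block
    means = np.array([
        np.exp(-np.sum((q - K[I]) ** 2, axis=1) / (2.0 * sigma2)).mean()
        for I in blocks
    ])
    I = blocks[np.argsort(means)[B // 2]]      # median block (B odd)
    w = np.exp(-np.sum((q - K[I]) ** 2, axis=1) / (2.0 * sigma2))
    return w @ V[I] / w.sum()                  # NW estimate on the block
```

The appeal of this branch is that a contaminated key only corrupts the blocks containing it, and taking the median across block scores discards those blocks without estimating any weights.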
KDEs, the optimal density $\hat p_{\mathrm{robust}}$ is given by

$$\hat p_{\mathrm{robust}} = \arg\min_{p \in C^N_\sigma} \Big\| \frac{\beta}{N} \sum_{j \in [N]} k_\sigma(x_j, \cdot) - p \Big\|^2_{\mathcal{H}_{k_\sigma}}, \qquad (6)$$

which is guaranteed to have a unique minimizer, since we are projecting in a Hilbert space and $C^N_\sigma$ is closed and convex. Notice that, by definition, $\hat p_{\mathrm{robust}}$ can also be represented as $\hat p_{\mathrm{robust}} = \sum_{j \in [N]} \omega_j k_\sigma(x_j, \cdot)$ with $\omega \in \Delta_N$, which is the same as the formulation in Eq. (5). Then Eq. (6) can be written as a quadratic programming (QP) problem over $\omega$:

$$\min_{\omega}\ \omega^\top G \omega - 2 q^\top \omega, \quad \text{subject to } \omega \in \Delta_N, \qquad (7)$$

where $G$ is the Gram matrix of $\{x_j\}_{j \in [N]}$ with $k_\sigma$ and $q = \frac{\beta}{N} G \mathbf{1}$. Since $k_\sigma$ is a positive-definite kernel and each $x_i$ is unique, the Gram matrix $G$ is also positive-definite. As a result, this QP problem is convex, and we can leverage commonly used solvers to efficiently obtain the solution and the optimal density $\hat p_{\mathrm{robust}}$.
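As a sketch of how Eq. (7) can be handed to an off-the-shelf solver, the snippet below minimizes the QP over the simplex with SciPy's SLSQP; the toy Gram matrix, the scaling factor $\beta = 2$, and the solver choice are our assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def spkde_weights(G, beta=2.0):
    """Solve the SPKDE QP of Eq. (7): min_w w^T G w - 2 q^T w over the
    simplex, with q = (beta / N) * G @ 1, i.e. project the scaled KDE
    onto the convex hull of {k_sigma(x_j, .)} in the RKHS norm."""
    N = G.shape[0]
    q = (beta / N) * (G @ np.ones(N))
    res = minimize(
        lambda w: w @ G @ w - 2.0 * q @ w,     # QP objective
        np.full(N, 1.0 / N),                   # start from the uniform KDE
        jac=lambda w: 2.0 * (G @ w - q),
        bounds=[(0.0, 1.0)] * N,               # w_j >= 0
        constraints=({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},),
        method='SLSQP',
    )
    return res.x

# Toy data: four inliers near 0 and one outlier at 10, Gaussian kernel.
x = np.array([0.0, 0.1, -0.1, 0.05, 10.0])
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)
w = spkde_weights(G)
# Scaling by beta > 1 then projecting back onto the simplex clips the
# outlier's weight toward zero.
```

In practice a dedicated QP solver would be preferable to a general-purpose optimizer, but the formulation is identical.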
Robust Self-Attention Mechanism We now introduce the robust self-attention mechanism that down-weights atypical samples. We consider the density estimators of the joint distribution and the marginal distribution when using an isotropic Gaussian kernel:

$$\hat p_{\mathrm{robust}}(v, k) = \sum_{j \in [N]} \omega^{\mathrm{joint}}_j\, k_\sigma([v_j, k_j], [v, k]), \qquad \hat p_{\mathrm{robust}}(k) = \sum_{j \in [N]} \omega^{\mathrm{marginal}}_j\, k_\sigma(k_j, k). \qquad (8)$$

Following the non-parametric regression formulation of self-attention in Eq. (3), we obtain the robust self-attention mechanism as

$$\hat h_i = \frac{\sum_{j \in [N]} v_j\, \omega^{\mathrm{joint}}_j\, k_\sigma(q_i - k_j)}{\sum_{j \in [N]} \omega^{\mathrm{marginal}}_j\, k_\sigma(q_i - k_j)}, \qquad (9)$$

where $\omega^{\mathrm{joint}}$ and $\omega^{\mathrm{marginal}}$ are obtained via either the alternating updates or the QP solver. We term Transformers whose density in the non-parametric regression formulation of self-attention employs Eq. (5) or Eq. (6) as Transformer-RKDE and Transformer-SPKDE, respectively. Figure 2 presents an example of the application of the attention weight factor during the learning process from image and text data. The derived weight factor can potentially emphasize elements relevant to the class while reducing the influence of detrimental ones. Note that the computations of $\{\omega^{\mathrm{marginal}}_j\}_{j \in [N]}$ and $\{\omega^{\mathrm{joint}}_j\}_{j \in [N]}$ are separate, as $\omega^{\mathrm{joint}}_j$ involves both the key and value vectors. During the empirical evaluation, we concatenate the keys and values along the head dimension to obtain the weights for the joint density $\hat p_{\mathrm{robust}}(v, k)$, and only use the key vectors to obtain the set of weights for the marginal $\hat p_{\mathrm{robust}}(k)$. In addition, $\omega^{\mathrm{marginal}}, \omega^{\mathrm{joint}} \in \mathbb{R}^{N \times N}$ are matrices that contain the pairwise weights between each position of the sequence and all other positions. The weights are initialized uniformly along the sequence-length dimension. For experiments related to language modeling, we can leverage information from the attention mask to initialize the weights on the unmasked part of the sequence.
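Given precomputed weight matrices, Eq. (9) amounts to modulating the kernel scores in the numerator and denominator. A minimal NumPy sketch of the computation (our illustration; the $(N, N)$ weight shapes and the Gaussian kernel follow the description above):

```python
import numpy as np

def robust_attention(Q, K, V, w_joint, w_marginal, sigma2):
    """Robust self-attention of Eq. (9).

    Q, K, V: (N, D) arrays; w_joint, w_marginal: (N, N) pairwise weight
    matrices, rows indexed by queries and columns by keys.
    """
    # pairwise squared distances ||q_i - k_j||^2, shape (N, N)
    d2 = np.sum((Q[:, None, :] - K[None, :, :]) ** 2, axis=-1)
    kq = np.exp(-d2 / (2.0 * sigma2))            # Gaussian kernel scores
    num = (w_joint * kq) @ V                     # weighted numerator
    den = (w_marginal * kq).sum(axis=1, keepdims=True)
    return num / den
```

A useful sanity check on this sketch: with uniform weights $\omega^{\mathrm{joint}} = \omega^{\mathrm{marginal}} = 1/N$, the factors cancel and the expression reduces exactly to the NW/softmax attention of Eq. (3).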