CURVED REPRESENTATION SPACE OF VISION TRANSFORMERS
Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee
Yonsei University, Korea
{juyeopkim, junha.park, songkuk, jong-seok.lee}@yonsei.ac.kr
ABSTRACT
Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a
better alternative to traditional convolutional neural networks (CNNs). However, our understanding
of how the new architecture works is still limited. In this paper, we focus on the phenomenon that
Transformers show higher robustness against corruptions than CNNs, while not being overconfident.
This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction
by empirically investigating how the output of the penultimate layer moves in the representation
space as the input data moves linearly within a small area. In particular, we show the following. (1)
While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers
show a nonlinear relationship for some data. For those data, the output of Transformers moves along a
curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is
hard to move it out of the decision region since the output moves along a curved trajectory instead
of a straight line to the decision boundary, resulting in the high robustness of Transformers. (3) If a
data point is slightly modified so that it jumps out of the curved region, the subsequent movements become linear
and the output goes to the decision boundary directly. In other words, there does exist a decision
boundary near the data, which is hard to find only because of the curved representation space. This
explains the underconfident predictions of Transformers. Also, we examine mathematical properties
of the attention operation that induce a nonlinear response to a linear perturbation. Finally, we share
additional findings regarding what contributes to the curved representation space of Transformers
and how the curvedness evolves during training.
1 Introduction
Self-attention-based neural network architectures, including Vision Transformers [1], Swin Transformers [2], etc.
(hereinafter referred to as Transformers), have been shown to outperform traditional convolutional neural networks (CNNs)
in various computer vision tasks. The success of the new architecture has prompted the question of how Transformers work,
especially compared to CNNs, which would also shed light on a deeper understanding of CNNs and, eventually, neural
networks in general.
In addition to the improved task performance (e.g., classification accuracy) compared to CNNs, Transformers also
show desirable characteristics in other aspects. It has been shown that Transformers are more robust to adversarial
perturbations than CNNs [3, 4, 5]. Moreover, Transformers are reported not to be overconfident in their predictions, unlike
CNNs [6] (and we show that Transformers are actually underconfident in this paper).
The high robustness, however, does not comport with underconfidence. Intuitively, a data point that is correctly classified
by a model with lower confidence is likely to be located closer to the decision boundary (see Appendix for detailed
discussion). Then, a smaller amount of perturbation would move the data out of the decision region, which translates
into lower robustness of the model. However, the previous results claim the opposite.
To resolve the contradiction between robustness and underconfidence, this paper presents an empirical study exploring the
representation space of Transformers and CNNs. More specifically, we focus on the linearity of the models, i.e., the
change of the output feature (which is simply referred to as output in this paper) with respect to the linear change of
the input data. It is known that adversarial examples are a result of models being too linear, based on which the fast
gradient sign method (FGSM) was introduced to show that deep neural networks can be easily fooled [7]. Motivated by
this, we examine the input-output relationship of Transformers as the input is gradually perturbed along the direction
determined by FGSM.

Figure 1: 2D projected movements of (a) the data (black dot) in the input space and the corresponding output features in
the representation space for (b) ResNet50 and (c) Swin-T.
Fig. 1 visualizes the representation spaces of CNNs and Transformers comparatively (see Appendix for implementation
details). An image from ImageNet [8], marked with the black dot in Fig. 1a, is gradually modified by a fixed
amount along two mutually orthogonal directions. The corresponding outputs of ResNet50 [9] and Swin-T [2] are
obtained, which are shown after two-dimensional projection in Figs. 1b and 1c, respectively. While the gradual changes
of the input produce almost linear changes in the output of ResNet50, the output trajectory of Swin-T is nonlinear
around the original output (and then becomes linear when the change of the output is large), i.e., the representation space
is locally curved. We empirically show that this curved representation space results in the aforementioned contradiction.
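For concreteness, a minimal sketch of how such a visualization can be produced is given below (the paper defers its exact implementation to the Appendix). It perturbs an image along two fixed orthogonal directions, collects the penultimate-layer features, and projects them to 2D. The use of a timm model, PCA as the projection, and the step sizes and directions are our assumptions rather than the authors' setup.

```python
# Minimal sketch (assumptions: timm model, PCA projection, arbitrary step sizes and directions).
import torch
import timm
from sklearn.decomposition import PCA

model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True).eval()

def penultimate_features(x):
    # timm exposes the pre-classifier (penultimate) representation via forward_features + forward_head(pre_logits=True).
    with torch.no_grad():
        return model.forward_head(model.forward_features(x), pre_logits=True)  # shape: (B, feature_dim)

x = torch.rand(1, 3, 224, 224)               # placeholder for a preprocessed ImageNet image
d1 = torch.sign(torch.randn_like(x))         # two fixed perturbation directions; random sign vectors
d2 = torch.sign(torch.randn_like(x))         # are nearly orthogonal in high dimensions

steps = torch.linspace(0.0, 8.0 / 255.0, 9)  # perturbation magnitudes along each direction
feats = []
for eps in steps:                            # outputs along d1 (a full grid over d1 and d2 gives the 2D picture)
    feats.append(penultimate_features(torch.clamp(x + eps * d1, 0, 1)))
for eps in steps:                            # outputs along d2
    feats.append(penultimate_features(torch.clamp(x + eps * d2, 0, 1)))

proj = PCA(n_components=2).fit_transform(torch.cat(feats).numpy())  # 2D projection of the output trajectory, as in Fig. 1
```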
Our main research questions and findings can be summarized as follows.
1. What does the representation space of Transformers look like? To answer this, we analyze the movement of the
penultimate layer’s output with respect to the linear movement of the input. We use the adversarial gradient produced by
FGSM [7] as the direction of movement in the input space, to investigate the linearity of the feature space of the models.
We find that the directions of successive movements of the output significantly change in the case of Transformers
unlike CNNs, indicating that the representation space of Transformers is locally curved.
2. What makes Transformers robust to input perturbation? We find that the curved regions in the representation
space account for the robustness of Transformers. When a data point is located in a curved region, a series of linear
perturbations to the input moves the output point along a curved trajectory. This makes it hard to move the data out of its
decision region along a short, straight line, which explains the high robustness of Transformers for such data.
3. Then, why is the prediction of Transformers underconfident? Although it takes many steps to escape from a curved
decision region and reach a decision boundary, we find that a decision boundary is actually located close to the
original output. We demonstrate a simple trick to reach the decision boundary quickly: when a small amount of
random noise is added to the input data, its output can jump out of the locally curved region and arrive at a linear region,
from which a closely located decision boundary can be reached by adding only a small amount of perturbation. This
reveals that the decision boundary exists near the original data in the representation space, which explains the
underconfident predictions of Transformers.
We also present additional observations examining what contributes to the curved representation space of Transformers
and when the curvedness is formed during training.
2 Related work
Since the first application of the self-attention mechanism to vision tasks [1], a number of studies have shown that
the models built with traditional convolutional layers are outperformed by Transformers utilizing self-attention layers
in terms of task performance [2, 10, 11, 12, 13, 14, 15, 16, 17, 18]. There have been efforts to compare CNNs and
Transformers in various aspects. Empirical studies show that Transformers have higher adversarial robustness than
CNNs [5, 4, 19, 20], which seems to be due to the reliance of Transformers on lower frequency information than
CNNs [21, 22]. Other studies conclude that Transformers are calibrated better than CNNs, which yield overconfident
predictions [23, 24, 25, 6]. However, there has been no clear explanation encompassing both the higher robustness and
lower confidence of Transformers.

Figure 2: Reliability diagrams of (a) ResNet50, (b) MobileNetV2, (c) ViT-B/16, and (d) Swin-T. The transparency of the
bars represents the ratio of the number of data in each confidence bin. ECE and sECE values are also shown in each case.
Understanding how neural networks work has been an important research topic. A useful way for this is to investigate
the input-output mapping formed by a model. Since models with piecewise linear activation functions (e.g., ReLU)
implement piecewise linear mappings, several studies investigate the characteristics of linear regions, e.g., counting the
number of linear regions as a measure of model expressivity (or complexity) [26, 27, 28, 29, 30, 31] and examining
local properties of linear regions [32]. Some studies examine the length of the output curve for a given unit-length input
[31, 33, 34]. There also exist some works that relate the norm of the input-output Jacobian matrix to generalization
performance [35, 36]. However, the input-output relationship of Transformers has not been explored previously, which
is the focus of this paper.
3 On the ostensible contradiction of high robustness and underconfidence
3.1 Model calibration
It is desirable that a trained classifier is well-calibrated, making predictions with reasonable certainty; e.g., for data that
a classifier predicts with a confidence (i.e., probability of the predicted class) of 80%, its accuracy should also be 80% on
average. A common measure to evaluate model calibration is the expected calibration error (ECE) defined as [37]
ECE = \sum_{i=1}^{K} P(i) \cdot |o_i - e_i|,    (1)
where K is the number of confidence bins, P(i) is the fraction of data falling into bin i, o_i is the accuracy of the data
in bin i, and e_i is the average confidence of the data in bin i. One limitation of ECE is that it does not distinguish between
overconfidence and underconfidence because the sign of the difference between the accuracy and the confidence is
ignored. Therefore, we define a signed ECE (sECE) to augment ECE, as follows.
sECE = \sum_{i=1}^{K} P(i) \cdot (o_i - e_i).    (2)
An overconfident model will have higher confidence than accuracy, resulting in a negative sECE value. An underconfi-
dent model, in contrast, will show a positive value of sECE.
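As a minimal sketch, Eqs. (1) and (2) can be computed directly from the predicted confidences and correctness indicators. The equal-width binning with 15 bins below is our assumption, since the binning scheme is not specified here.

```python
# Minimal sketch of Eqs. (1)-(2), assuming K equal-width confidence bins.
import numpy as np

def ece_and_sece(confidences, correct, num_bins=15):
    """confidences: predicted-class probabilities; correct: 1 if the prediction is right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece, sece = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        p_i = mask.mean()                 # P(i): fraction of data falling into bin i
        o_i = correct[mask].mean()        # o_i: accuracy of the data in bin i
        e_i = confidences[mask].mean()    # e_i: average confidence of the data in bin i
        ece += p_i * abs(o_i - e_i)
        sece += p_i * (o_i - e_i)         # positive -> underconfident, negative -> overconfident
    return ece, sece
```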
We compare the calibration of CNNs, including ResNet50 [9] and MobileNetV2 [38, 39], and Transformers, including
ViT-B/16 [1] and Swin-T [2], on the ImageNet validation set using ECE and sECE in Fig. 2 (see Fig. 11 in Appendix for
the results of other models). CNNs show negative sECE values and bar plots below the 45° line, indicating overconfidence
in prediction, which is consistent with the previous studies [23]. On the other hand, Transformers are underconfident,
showing positive sECE values and bar plots above the 45° line. This comparison result is interesting: Transformers reportedly
show higher classification accuracy than CNNs, but in fact with lower confidence.
3.2 Passage to decision boundary
It is a common intuition that if a model classifies a data point with low confidence, the data point is likely to be located near a
decision boundary (see Appendix for a detailed discussion). Based on the above results, therefore, the decision boundaries
of Transformers are assumed to be formed closer to the data than those of CNNs. To validate this, we formulate a procedure
to examine the distance from a data point to a decision boundary through a linear travel.

Figure 3: Lengths (ϵ) of the travel to decision boundaries with respect to the confidence for the ImageNet validation
data, for (a) ResNet50, (b) MobileNetV2, (c) ViT-B/16, and (d) Swin-T. Black lines represent average values.

Concretely, we aim to solve the following optimization problem:
\arg\min_{\epsilon} \; \epsilon \quad \text{s.t.} \quad C(x') \neq y, \;\; x' = x + \epsilon \cdot d,    (3)
where x is the input data, y is the true class label of x, C is the classifier, d is the travel direction, ϵ is a positive real
number indicating the travel length, and x' is the traveled result of x. We set the travel direction d as the adversarial
gradient produced by FGSM, i.e.,

d = \mathrm{sign}(\nabla_x J(C(x), y)),    (4)

where J is the classification loss function (i.e., cross-entropy). Note that ∥d∥² = D, where D is the dimension of x.
Refer to Algorithm 1 in Appendix for the detailed procedure to solve the optimization problem in Eq. 3.
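Algorithm 1 itself is given in the Appendix and not reproduced here; the following is only a rough sketch of one way to solve Eq. 3: increase ϵ in small increments along the FGSM direction until the predicted class changes. The step size, the search cap, and the batch-size-one assumption are ours.

```python
# Hedged sketch of the linear travel in Eq. (3): increase eps along the FGSM direction d
# until the prediction flips. Step size and cap are assumptions, not the paper's Algorithm 1.
import torch
import torch.nn.functional as F

def travel_length_to_boundary(model, x, y, eps_step=0.5 / 255, eps_max=64 / 255):
    """x: a single preprocessed image of shape (1, C, H, W); y: its true label, shape (1,)."""
    model.eval()
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)               # J(C(x), y)
    loss.backward()
    d = x.grad.sign()                                 # Eq. (4): FGSM direction, with ||d||^2 = D

    eps = 0.0
    with torch.no_grad():
        while eps <= eps_max:
            x_prime = torch.clamp(x + eps * d, 0, 1)  # x' = x + eps * d
            if model(x_prime).argmax(dim=1) != y:     # C(x') != y: crossed the decision boundary
                return eps
            eps += eps_step
    return float("inf")                               # no boundary found within the search range
```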
Fig. 3 shows the obtained values of ϵ with respect to the confidence values for the ImageNet validation data (see Fig. 12
in Appendix for the results of other models). Contrary to our expectation, decision boundaries seem to be located
farther from the data in the input space for Transformers than for CNNs. This contradiction is resolved in the following
section.
4 Resolving the contradiction
4.1 Shape of representation space
As mentioned in the Introduction, the FGSM attack was first introduced to show that the linearity of a model causes its
vulnerability to adversarial perturbations [7]. To resolve the contradiction between high robustness (a large distance to
the decision boundary) and underconfidence (a small distance to the decision boundary) of Transformers in the previous
section, therefore, we examine the degree of linearity of the input-output relationship, i.e., how linear movements in the
input space appear in the representation space of Transformers.
We divide the travel into N steps as

x^{(n)} = x^{(0)} + \frac{n \cdot \epsilon}{N} d, \quad (n = 0, 1, \cdots, N)    (5)
where x^{(0)} = x and x^{(N)} are the initial and final data points, respectively. For each x^{(n)}, we obtain its output feature at
the penultimate layer, which is denoted as z^{(n)}. Unlike the travel in the input space, the magnitude and direction of the
travel appearing in the representation space may change at each step. Thus, the movement at step n is defined as

d_z^{(n)} = z^{(n)} - z^{(n-1)},    (6)
from which the magnitude (ω^{(n)}) and relative direction (θ^{(n)}) are obtained as

\omega^{(n)} = \|d_z^{(n)}\|, \quad \theta^{(n)} = \cos^{-1}\left( \frac{d_z^{(n)} \cdot d_z^{(n+1)}}{\|d_z^{(n)}\| \, \|d_z^{(n+1)}\|} \right).    (7)
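The quantities in Eqs. (5)-(7) can be computed with a short loop. The sketch below assumes a timm-style model (exposing forward_features and forward_head) and arbitrary values of N and ϵ, which are our choices rather than the paper's.

```python
# Sketch of Eqs. (5)-(7): step the input N times along a fixed direction d and measure how the
# penultimate-layer output z^{(n)} moves. N, eps, and the timm-style feature access are assumptions.
import torch

def trajectory_stats(model, x, d, eps, N=20):
    def z_of(inp):  # penultimate-layer (pre-logits) feature of a timm model
        with torch.no_grad():
            return model.forward_head(model.forward_features(inp), pre_logits=True).flatten()

    # Eq. (5): x^{(n)} = x^{(0)} + (n * eps / N) * d, for n = 0..N
    z = [z_of(torch.clamp(x + (n * eps / N) * d, 0, 1)) for n in range(N + 1)]
    d_z = [z[n] - z[n - 1] for n in range(1, N + 1)]                    # Eq. (6)
    omega = [v.norm().item() for v in d_z]                              # ω^{(n)}: step magnitudes
    theta = [torch.acos(torch.clamp(                                    # θ^{(n)}: angle between successive steps, Eq. (7)
                 torch.dot(d_z[i], d_z[i + 1]) / (d_z[i].norm() * d_z[i + 1].norm()),
                 -1.0, 1.0)).item()
             for i in range(N - 1)]
    return omega, theta   # large θ along the path indicates a locally curved representation space
```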
We consider three different ways to determine d:

- d_FGSM (blue-colored trajectory in Fig. 4): the FGSM direction for x^{(0)} (as in Eq. 4).
- d_{r+FGSM} (yellow-colored trajectory in Fig. 4): the FGSM direction determined for the randomly perturbed data
  x_r^{(0)} = x^{(0)} + ϵ_r · r, where r is a random vector (∥r∥² = D) and ϵ_r controls the amount of this "random jump".