
CURVED REPRESENTATION SPACE OF VISION TRANSFORMERS
Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee
Yonsei University, Korea
{juyeopkim, junha.park, songkuk, jong-seok.lee}@yonsei.ac.kr
ABSTRACT
Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a
better alternative to traditional convolutional neural networks (CNNs). However, our understanding
of how the new architecture works is still limited. In this paper, we focus on the phenomenon that
Transformers show higher robustness against corruptions than CNNs, while not being overconfident.
This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction
by empirically investigating how the output of the penultimate layer moves in the representation
space as the input data moves linearly within a small area. In particular, we show the following. (1)
While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers
show a nonlinear relationship for some data. For those data, the output of Transformers moves along a
curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is
hard to move it out of the decision region, since the output moves along a curved trajectory instead
of a straight line to the decision boundary, resulting in the high robustness of Transformers. (3) If a
data point is slightly modified to jump out of the curved region, the subsequent movements become linear
and the output goes directly to the decision boundary. In other words, there does exist a decision
boundary near the data, which is hard to find only because of the curved representation space. This
explains the underconfident predictions of Transformers. We also examine mathematical properties
of the attention operation that induce a nonlinear response to a linear perturbation. Finally, we share
additional findings regarding what contributes to the curved representation space of Transformers
and how the curvedness evolves during training.
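As a concrete illustration of the kind of measurement described above, the following minimal sketch shows how one can trace the penultimate-layer output of a pretrained ViT while the input moves linearly along a fixed direction, and quantify how far the resulting feature trajectory deviates from a straight line. It is illustrative only, not our exact experimental protocol: the timm model name, the placeholder input, the perturbation scale, the use of forward_features as the penultimate output, and the chord-deviation measure are all assumed choices.

# Minimal sketch (illustrative, not the exact protocol of this paper): trace the
# penultimate-layer output of a pretrained ViT while the input moves linearly
# along one direction, and measure the deviation of the feature trajectory from
# the straight chord between its endpoints.
import torch
import timm  # assumed available; any penultimate-feature extractor works similarly

model = timm.create_model("vit_small_patch16_224", pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)      # placeholder input; use a real, normalized image
d = torch.randn_like(x)
d = d / d.norm()                     # unit direction of the linear input movement
eps = 1.0                            # total movement magnitude (illustrative)
alphas = torch.linspace(0.0, 1.0, 21)

with torch.no_grad():
    feats = torch.cat(
        [model.forward_features(x + a * eps * d).flatten(1) for a in alphas], dim=0
    )                                # shape: (num_steps, feature_dim)

# Deviation of the trajectory from the straight line joining its endpoints;
# zero would mean the feature responds perfectly linearly to the input movement.
chord = feats[0] + alphas.unsqueeze(1) * (feats[-1] - feats[0])
deviation = (feats - chord).norm(dim=1) / (feats[-1] - feats[0]).norm()
print(f"max relative deviation from linearity: {deviation.max().item():.4f}")

Under such a probe, a near-zero deviation corresponds to a feature that responds linearly to the input movement, whereas a curved representation region yields a large deviation.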
1 Introduction
Self-attention-based neural network architectures, including Vision Transformers [1], Swin Transformers [2], etc.
(hereinafter referred to as Transformers), have been shown to outperform traditional convolutional neural networks (CNNs)
in various computer vision tasks. The success of the new architecture has prompted the question of how Transformers work,
especially compared to CNNs; answering this question would also shed light on a deeper understanding of CNNs and,
eventually, of neural networks in general.
In addition to the improved task performance (e.g., classification accuracy) compared to CNNs, Transformers also
show desirable characteristics in other aspects. It has been shown that Transformers are more robust to adversarial
perturbations than CNNs [3, 4, 5]. Moreover, Transformers are reported not to be overconfident in their predictions, unlike
CNNs [6] (and we show in this paper that Transformers are actually underconfident).
The high robustness, however, does not comport with underconfidence. Intuitively, a data point that is correctly classified
by a model with lower confidence is likely to lie closer to the decision boundary (see Appendix for a detailed
discussion). Then, a smaller amount of perturbation would move the data out of the decision region, which translates
into lower robustness of the model. However, the previous results claim the opposite.
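To make this intuition concrete, consider a simple first-order argument, included here only as an illustration (the notation is introduced solely for this purpose): let $f_c(x)$ and $f_j(x)$ denote the logits of the predicted class and the runner-up, and let $m(x) = f_c(x) - f_j(x) > 0$ be the margin, a common proxy for the prediction confidence. If the model behaves linearly around $x$, then $m(x + \delta) \approx m(x) + \nabla m(x)^\top \delta$, so the smallest perturbation that flips the decision has magnitude $\|\delta^\ast\|_2 \approx m(x) / \|\nabla m(x)\|_2$. A lower confidence (smaller margin) thus implies a closer decision boundary, provided that the local-linearity assumption holds.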
To resolve the contradiction between robustness and underconfidence, this paper presents an empirical study exploring the
representation space of Transformers and CNNs. More specifically, we focus on the linearity of the models, i.e., the
change of the output feature (simply referred to as the output in this paper) with respect to a linear change of
the input data. It is known that adversarial examples are a result of models being too linear, based on which the fast
gradient sign method (FGSM) was introduced to show that deep neural networks can be easily fooled [7]. Motivated by