
CURVED REPRESENTATION SPACE OF VISION TRANSFORMERS
Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee
Yonsei University, Korea
{juyeopkim, junha.park, songkuk, jong-seok.lee}@yonsei.ac.kr
ABSTRACT
Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a
better alternative to traditional convolutional neural networks (CNNs). However, our understanding
of how the new architecture works is still limited. In this paper, we focus on the phenomenon that
Transformers show higher robustness against corruptions than CNNs, while not being overconfident.
This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction
by empirically investigating how the output of the penultimate layer moves in the representation
space as the input data moves linearly within a small area. In particular, we show the following. (1)
While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers
show a nonlinear relationship for some data. For those data, the output of Transformers moves along a
curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is
hard to move it out of the decision region, since the output moves along a curved trajectory instead
of a straight line to the decision boundary, resulting in the high robustness of Transformers. (3) If a
data point is slightly modified to jump out of the curved region, the subsequent movements become linear
and the output goes directly to the decision boundary. In other words, there does exist a decision
boundary near the data, which is hard to find only because of the curved representation space. This
explains the underconfident predictions of Transformers. We also examine mathematical properties
of the attention operation that induce a nonlinear response to a linear perturbation. Finally, we share
additional findings regarding what contributes to the curved representation space of Transformers
and how the curvedness evolves during training.
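As a concrete illustration of the kind of measurement described above, the following minimal sketch shows how one can trace the penultimate-layer output of a pretrained ViT while the input moves linearly along a fixed direction, and quantify how far the resulting feature trajectory deviates from a straight line. It is illustrative only, not our exact experimental protocol: the timm model name, the placeholder input, the perturbation scale, the use of forward_features as the penultimate output, and the chord-deviation measure are all assumed choices.

# Minimal sketch (illustrative, not the exact protocol of this paper): trace the
# penultimate-layer output of a pretrained ViT while the input moves linearly
# along one direction, and measure the deviation of the feature trajectory from
# the straight chord between its endpoints.
import torch
import timm  # assumed available; any penultimate-feature extractor works similarly

model = timm.create_model("vit_small_patch16_224", pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)      # placeholder input; use a real, normalized image
d = torch.randn_like(x)
d = d / d.norm()                     # unit direction of the linear input movement
eps = 1.0                            # total movement magnitude (illustrative)
alphas = torch.linspace(0.0, 1.0, 21)

with torch.no_grad():
    feats = torch.cat(
        [model.forward_features(x + a * eps * d).flatten(1) for a in alphas], dim=0
    )                                # shape: (num_steps, feature_dim)

# Deviation of the trajectory from the straight line joining its endpoints;
# zero would mean the feature responds perfectly linearly to the input movement.
chord = feats[0] + alphas.unsqueeze(1) * (feats[-1] - feats[0])
deviation = (feats - chord).norm(dim=1) / (feats[-1] - feats[0]).norm()
print(f"max relative deviation from linearity: {deviation.max().item():.4f}")

Under such a probe, a near-zero deviation corresponds to a feature that responds linearly to the input movement, whereas a curved representation region yields a large deviation.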
1 Introduction
Self-attention-based neural network architectures, including Vision Transformers [1], Swin Transformers [2], etc.
(hereinafter referred to as Transformers), have been shown to outperform traditional convolutional neural networks (CNNs)
in various computer vision tasks. The success of the new architecture has prompted the question of how Transformers work,
especially compared to CNNs; answering this question would also shed light on a deeper understanding of CNNs and,
eventually, of neural networks in general.
In addition to the improved task performance (e.g., classification accuracy) compared to CNNs, Transformers also
show desirable characteristics in other aspects. It has been shown that Transformers are more robust to adversarial
perturbations than CNNs [3, 4, 5]. Moreover, Transformers are reported not to be overconfident in their predictions, unlike
CNNs [6] (and we show in this paper that Transformers are actually underconfident).
The high robustness, however, does not comport with underconfidence. Intuitively, a data point that is correctly classified
by a model with lower confidence is likely to lie closer to the decision boundary (see Appendix for a detailed
discussion). Then, a smaller amount of perturbation would move the data out of the decision region, which translates
into lower robustness of the model. However, the previous results claim the opposite.
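To make this intuition concrete, consider a simple first-order argument, included here only as an illustration (the notation is introduced solely for this purpose): let $f_c(x)$ and $f_j(x)$ denote the logits of the predicted class and the runner-up, and let $m(x) = f_c(x) - f_j(x) > 0$ be the margin, a common proxy for the prediction confidence. If the model behaves linearly around $x$, then $m(x + \delta) \approx m(x) + \nabla m(x)^\top \delta$, so the smallest perturbation that flips the decision has magnitude $\|\delta^\ast\|_2 \approx m(x) / \|\nabla m(x)\|_2$. A lower confidence (smaller margin) thus implies a closer decision boundary, provided that the local-linearity assumption holds.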
To resolve the contradiction between robustness and underconfidence, this paper presents an empirical study exploring the
representation space of Transformers and CNNs. More specifically, we focus on the linearity of the models, i.e., the
change of the output feature (simply referred to as the output in this paper) with respect to a linear change of
the input data. It is known that adversarial examples are a result of models being too linear, based on which the fast
gradient sign method (FGSM) was introduced to show that deep neural networks can be easily fooled [7]. Motivated by