Token-Label Alignment for Vision Transformers
Han Xiao1,2,*, Wenzhao Zheng1,2,*, Zheng Zhu3, Jie Zhou1,2, Jiwen Lu1,2,†
1Beijing National Research Center for Information Science and Technology, China
2Department of Automation, Tsinghua University, China 3PhiGent Robotics
{h-xiao20,zhengwz18}@mails.tsinghua.edu.cn; zhengzhu@ieee.org;
{jzhou,lujiwen}@tsinghua.edu.cn
Abstract
Data mixing strategies (e.g., CutMix) have shown the
ability to greatly improve the performance of convolutional
neural networks (CNNs). They mix two images as the training input and assign it a label mixed with the same ratio. While they have been shown effective for vision transform-
ers (ViTs), we identify a token fluctuation phenomenon that
has suppressed the potential of data mixing strategies. We
empirically observe that the contributions of input tokens
fluctuate during forward propagation, which might induce a dif-
ferent mixing ratio in the output tokens. The training tar-
get computed by the original data mixing strategy can thus
be inaccurate, resulting in less effective training. To ad-
dress this, we propose a token-label alignment (TL-Align)
method to trace the correspondence between transformed
tokens and the original tokens to maintain a label for each
token. We reuse the computed attention at each layer for
efficient token-label alignment, introducing only negligible
additional training costs. Extensive experiments demon-
strate that our method improves the performance of ViTs on
image classification, semantic segmentation, object de-
tection, and transfer learning tasks. Code is available at:
https://github.com/Euphoria16/TL-Align.
1. Introduction
The recent developments of vision transformers (ViTs)
have revolutionized the computer vision field and set new
state-of-the-arts in a variety of tasks, such as image classifi-
cation [10,16,29,40], object detection [5,12,13,53], and se-
mantic segmentation [9,27,36,51]. The successful structure
of alternating spatial mixing and channel mixing in ViTs also motivates the emergence of high-performance MLP-like deep architectures [37-39, 46] and promotes the evolution of better CNNs [15, 17, 30]. In addition to architecture designs, an improved training strategy can also greatly boost the performance of a trained deep model [7, 8, 22, 41].

*Equal contribution.
†Corresponding author.
The training of modern deep architectures almost always adopts data mixing strategies for data augmentation [23, 42-44, 49, 50], which have been proven to consistently improve
the generalization performance. They randomly mix two
images as well as their labels with the same mixing ratio to
produce mixed data. As the most commonly used data mix-
ing strategy, CutMix [49] performs a copy-and-paste oper-
ation on the spatial domain to produce spatially mixed im-
ages. While data mixing strategies have been widely studied
for CNNs [23, 42, 44], few works have explored their compatibility with ViTs [7]. We find that self-attention in ViTs causes a fluctuation of the original spatial structure. Unlike the translation equivariance that ensures global label consistency for CNNs, self-attention in ViTs undermines this
global consistency and causes a misalignment between the
token and label. This misalignment induces a different mix-
ing ratio in the output tokens. The training targets computed
by the original data mixing strategies can then be inaccu-
rate, resulting in less effective training.
To address this, we propose a token-label alignment (TL-
Align) method for ViTs to obtain a more accurate target for
training. We present an overview of our method in Figure 1.
We first assign a label to each input token in the mixed im-
age according to the source of the token. We then trace
the correspondence between the input tokens and the trans-
formed tokens and align the labels accordingly. We assume
that only the spatial self-attention and residual connection
operation alter the presence of input tokens since channel
MLP and layer normalization process each token indepen-
dently. We reuse the computed attentions to linearly mix the
labels of input tokens to obtain those of transformed tokens.
The token-label alignment is performed iteratively to obtain
a label for each output token. For class-token-based classi-
fication (e.g., ViT [16] and DeiT [40]), we directly use the
aligned label for the output class token as the training target.
For global-pooling-based classification (e.g., Swin [29]),
we similarly average the labels of output tokens as the train-
[Figure 1 graphic: (a) CutMix example mixing a 100% Cat image and a 100% Dog image; (b) token misalignment for ViTs (convolution: local and shared weights; self-attention: global and adaptive weights); (c) different ways to obtain the mixed labels (CutMix, TransMix, TokenLabeling, TL-Align).]
Figure 1. An overview of the proposed TL-Align. (a) CutMix-like methods [49] are widely used in model training, which spatially mix the
tokens and their labels in the input space. (b) They are originally designed for CNNs and assume the processed tokens are spatially aligned
with the input tokens. We show that it does not hold true for ViTs due to the global receptive field and the adaptive weights. (c) Compared
with existing methods, our method can effectively and efficiently align the tokens and labels without requiring a pretrained teacher network.
ing target. The proposed TL-Align is only used for train-
ing to improve performance and introduces no additional
workload for inference. We apply the proposed TL-Align
to various ViT variants with CutMix including plain ViTs
(DeiT [40]) and hierarchical ViTs (Swin [29]). We observe
a consistent performance boost across different models on
ImageNet-1K [14]. Specifically, our TL-Align improves
DeiT-S by 0.8% using the same training recipe. We evalu-
ate the ImageNet-pretrained models on various downstream
tasks including semantic segmentation, object detection,
and transfer learning. Experimental results also verify the
robustness and generalization ability of our method.
2. Related Work
Vision Transformer. Transformers have been widely
used in natural language processing and achieved great suc-
cess on many language tasks. Recently, Vision Transform-
ers (ViTs) have aroused extensive interest in computer vi-
sion due to their competitive performance compared with
CNNs [10,16,29,40]. Dosovitskiy et al. [16] first introduced transformers into the image classification task. They split the input image into non-overlapping patches and then feed them into the transformer encoders. Liu et al. [29]
proposed a shifted windowing scheme to produce hier-
archical feature maps suitable for dense prediction tasks.
The great potential of vision transformers has motivated their
adaptation to many challenging tasks including object de-
tection [5,12,53], segmentation [9,36], image enhance-
ment [6,26] and video understanding [1,31].
Recently, some efforts have been devoted to producing
better training targets to improve the performance of vision
transformers [22,41]. For example, DeiT [40] introduces
a knowledge distillation procedure to reduce the training
cost of ViTs and achieves a better accuracy/speed trade-off.
TokenLabeling [22] employs a pretrained teacher annota-
tor to predict a label for each token for dense knowledge
distillation. In contrast, we do not require a pretrained network to obtain the training targets. Our TL-Align maintains
an aligned label for each token layer by layer and can be
trained efficiently in an end-to-end manner.
Data Mixing Strategy. As an important type of data
augmentation, data mixing strategies have demonstrated a
consistent improvement in the generalization performance
of CNNs. Zhang et al. [50] first proposed to combine a
training pair to create augmented samples for model regu-
larization. They perform linear interpolations on both the
input images and associated targets. Following MixUp,
CutMix [49] also utilizes the mixture of two input images
but adopts a region copy-and-paste operation. Later meth-
ods including Puzzle Mix [23], SaliencyMix [42] and Atten-
tive CutMix [44] leverage the salient regions for informative
mixture generation. Recently, Yang et al. [48] proposed a
RecursiveMix strategy which employs the historical input-
prediction-label triplets for scale-invariant feature learning.
Despite the better performance, a drawback of these meth-
ods is the heavily increased training cost due to the saliency
extraction or historical information exploitation.
Most existing data mixing methods are originally de-
signed for CNNs, and their effectiveness on ViTs has not
been well explored. TransMix [7] uses the class attention
map at the last layer to re-weight the mixing targets and
assumes that the output tokens keep spatial correspondence
with the input tokens. However, we identify a token fluc-
tuation phenomenon for ViTs which may cause a mismatch
between tokens and labels, leading to inaccurate label as-
signments in both the original CutMix and TransMix. To
address this, we propose to align the label and token space
by tracing their correspondence in a layerwise manner.
3. Proposed Approach
3.1. Preliminaries
The convolutional neural network (CNN) has been the
dominant architecture for computer vision in the deep learn-
ing era, greatly improving the performance of many tasks.
Its monopoly has been challenged by the recent emergence
of vision transformers (ViTs), which first “patchify” each
image into tokens and process them with alternating self-
attention (SA) and multi-layer perceptron (MLP).
In addition to architecture design, training strategy also
has a large effect on the model performance, especially the
data augmentation strategy. Data mixing [23, 42-44, 49, 50] is an important set of data augmentation for the training of both CNNs and ViTs, as it significantly improves the generalization ability of models. As the most commonly used data mixing strategy, CutMix [49] aims to create virtual training samples from the given training samples $(X, y)$, where $X \in \mathbb{R}^{H \times W \times C}$ denotes the input image and $y$ is the corresponding label. CutMix randomly selects a local region from one input $X_1$ and uses it to replace the pixels in the same region of another input $X_2$ to generate a new sample $\tilde{X}$. Similarly, the label $\tilde{y}$ of $\tilde{X}$ is also the combination of the original labels $y_1$ and $y_2$:

$$\tilde{X} = M \odot X_1 + (\mathbf{1} - M) \odot X_2, \qquad \tilde{y} = \lambda y_1 + (1 - \lambda) y_2, \tag{1}$$

where $M \in \{0,1\}^{H \times W}$ is a binary mask indicating the image each pixel belongs to, $\mathbf{1}$ is an all-one matrix, and $\odot$ is the element-wise multiplication. $\lambda$ reflects the mixing ratio of the two labels and is the proportion of pixels cropped from $X_1$ in the mixed image $\tilde{X}$. For a cropped region $[r_x, r_x + r_w] \times [r_y, r_y + r_h]$ from $X_1$, we compute $\lambda = \frac{r_w r_h}{W H}$ to obtain the initial mixed target $\tilde{y}$.
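For reference, the following is a minimal PyTorch sketch of the mixing in (1); the function name `cutmix`, the uniform sampling of the box, and the one-hot label format are illustrative choices rather than part of the formulation above.

```python
import torch

def cutmix(x1, y1, x2, y2):
    """Minimal CutMix sketch following Eq. (1): paste a random box from x1 into x2.

    x1, x2: images of shape (C, H, W); y1, y2: one-hot label vectors.
    The box size is sampled uniformly here purely for illustration.
    """
    C, H, W = x1.shape
    # Sample a random rectangular region [rx, rx + rw] x [ry, ry + rh].
    rw = torch.randint(1, W + 1, (1,)).item()
    rh = torch.randint(1, H + 1, (1,)).item()
    rx = torch.randint(0, W - rw + 1, (1,)).item()
    ry = torch.randint(0, H - rh + 1, (1,)).item()

    # Binary mask M: 1 inside the pasted region (pixels from x1), 0 elsewhere.
    M = torch.zeros(H, W)
    M[ry:ry + rh, rx:rx + rw] = 1.0

    x_mix = M * x1 + (1.0 - M) * x2        # X~ = M ⊙ X1 + (1 - M) ⊙ X2
    lam = (rw * rh) / (W * H)              # λ = rw * rh / (W * H)
    y_mix = lam * y1 + (1.0 - lam) * y2    # y~ = λ y1 + (1 - λ) y2
    return x_mix, y_mix, M, lam
```

Standard CutMix implementations typically sample $\lambda$ from a Beta distribution and derive the box size from it; only the mask-and-mix logic matters for the analysis that follows.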
3.2. The Token Fluctuation Phenomenon
CutMix is originally designed for CNNs and assumes
the feature extraction process does not alter the mixing ra-
tio. However, we discover that, unlike in CNNs, self-
attention in ViTs can lead to the fluctuation of some tokens.
The fluctuation further results in the mismatch between the
token space and label space, which hinders the effective
training of the network.
Formally, we use $z_i$ to denote a token of the image $Z$, i.e., $z_i$ is the transposed $i$-th column vector of $Z$. We can then compute the $i$-th transformed token $\hat{z}_i$ after the spatial operation as $\hat{z}_i = \sum_{j=1}^{N} w^s_{i,j} z_j$, where $w^s_{i,j}$ is the $(i, j)$-th element of the computed spatial mixing matrix $w^s(z)$.

With the assumption of linear information integration, we define the contribution of an original token $z_i$ to a mixed token $\hat{z}_j$ as $c(z_i, \hat{z}_j) = \frac{|w^s_{i,j}|}{\sum_{k=1}^{N} |w^s_{k,j}|}$, where $|\cdot|$ denotes the absolute value. We can then compute the presence of a token $z_i$ in all the mixed image tokens as:

$$p(z_i) = \sum_{j=1}^{N} c(z_i, \hat{z}_j) = \sum_{j=1}^{N} \frac{|w^s_{i,j}|}{\sum_{k=1}^{N} |w^s_{k,j}|}. \tag{2}$$
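To make these quantities concrete, the following minimal sketch (assuming an $N \times N$ spatial mixing matrix is already available, e.g., an attention map recorded during the forward pass) computes the contribution matrix $c$ and the presence $p(z_i)$ of Eq. (2):

```python
import torch

def token_presence(w_s):
    """Presence p(z_i) of every input token, following Eq. (2).

    w_s: (N, N) spatial mixing matrix w^s(z), e.g., an attention map
         saved during the forward pass.
    Returns p of length N with p[i] = sum_j |w^s_{i,j}| / sum_k |w^s_{k,j}|.
    """
    a = w_s.abs()
    # c[i, j]: contribution of token z_i to the transformed token z_hat_j.
    c = a / a.sum(dim=0, keepdim=True).clamp_min(1e-12)
    # p[i]: sum of the contributions of z_i over all transformed tokens.
    return c.sum(dim=1)
```

If $w^s$ were a permutation matrix, every token would obtain $p(z_i) = 1$; for softmax attention the values spread away from 1, which is exactly the fluctuation discussed below.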
For non-strided depth-wise convolution, each token is multiplied by each element in the convolutional kernel due to the translation invariance. We can thus obtain:

$$\sum_{l=1}^{N} |w^s_{i,l}| = \sum_{j=1}^{N} |w^s_{k,j}| = \sum_{k=1,\, l=1}^{M} |K_{k,l}|, \quad \forall\, i, j \in P_{NE}, \tag{3}$$

where $P_{NE}$ denotes the set of positions that are not at the edge of the image, $K_{k,l}$ denotes the value of the $(k, l)$-th position of the convolution kernel $K$, and $M$ is the kernel size. We can infer that $p(z_i) = 1, \forall\, i \in P_{NE}$, i.e., the effect of all the internal tokens does not change during the convolution process. However, for self-attention in ViTs, (3) does not hold due to the non-existence of translation invariance. The fluctuation of $p(z)$ is further amplified by the input dependency of the spatial mixing matrix $w^s(z)$ induced by self-attention. As an extreme case, we may obtain $p(z) \approx 0$ for certain tokens. The fluctuation of tokens will alter the proportion of mixing (i.e., $\lambda$) and the network might even completely ignore one of the mixed images. The actual label of the processed tokens can then deviate from the mixed label computed by (1), resulting in less effective training.
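The contrast between (3) and self-attention can be checked with a small numerical illustration; the construction below (a 1-D non-strided depth-wise convolution expressed as a dense mixing matrix, versus a random softmax matrix standing in for attention) is a toy sketch under these assumptions, not an experiment reported here.

```python
import torch

def token_presence(w_s):
    # Presence p(z_i) from Eq. (2): column-normalize |w^s|, then sum each row.
    a = w_s.abs()
    return (a / a.sum(dim=0, keepdim=True)).sum(dim=1)

def depthwise_mixing_matrix(kernel, n):
    """Express a 1-D non-strided depth-wise convolution (zero padding) as an
    (n, n) mixing matrix: every row reuses the same shared kernel weights,
    shifted to the row's position."""
    k = kernel.numel()
    w = torch.zeros(n, n)
    for i in range(n):
        for t in range(k):
            j = i + t - k // 2
            if 0 <= j < n:
                w[i, j] = kernel[t]
    return w

n = 16
conv_w = depthwise_mixing_matrix(torch.tensor([0.25, 0.5, 0.25]), n)
attn_w = torch.softmax(torch.randn(n, n), dim=-1)  # stand-in for an attention map

print(token_presence(conv_w)[2:-2])  # tokens away from the borders: exactly 1, as predicted
print(token_presence(attn_w))        # values fluctuate around 1 instead of staying fixed
```

The shared, translation-invariant kernel keeps every interior token's presence fixed at 1, whereas the adaptive, globally mixed attention weights do not.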
3.3. Token-Label Alignment
Each token in ViTs interacts with other tokens using
the self-attention mechanism. The input-dependent weights
empower ViTs with more flexibility but also result in a mis-
match between the processed token and the initial token.
To address this, we propose a token-label alignment (TL-
Align) method to trace the correspondence between the in-
put and transformed tokens to obtain the aligned labels for
the resulting representations, as illustrated in Figure 2.
Specifically, ViTs first split the mixed input $\tilde{X}$ after CutMix (1) into a sequence of $N$ non-overlapping patches and then flatten them to obtain the original image tokens $\{\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_N\}$. We then project them into a proper dimension and add positional embeddings:

$$Z_0 = [\tilde{z}_{cls};\, \tilde{x}_1 E;\, \tilde{x}_2 E;\, \cdots;\, \tilde{x}_N E] + E_{pos}, \tag{4}$$
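As a rough sketch of the alignment step described above: per-token labels are initialized from the CutMix mask (each patch labelled by its source image), and a reused attention matrix mixes the labels exactly as it mixes the tokens, followed by the residual path. The function names, the single-matrix (head-averaged) attention, and the final renormalization are simplifications assumed for illustration rather than the exact procedure.

```python
import torch

def init_token_labels(mask_tokens, y1, y2):
    """Assign each input token a label according to its source image.

    mask_tokens: (N,) tensor with 1 where the token comes from X1, 0 from X2.
    y1, y2: one-hot label vectors of length num_classes.
    Returns an (N, num_classes) token-label matrix.
    """
    m = mask_tokens.float().unsqueeze(-1)   # (N, 1)
    return m * y1 + (1.0 - m) * y2          # per-token labels

def align_labels_one_layer(labels, attn):
    """Propagate token labels through one self-attention layer with a residual path.

    labels: (N, num_classes) labels of this layer's input tokens.
    attn:   (N, N) attention matrix reused from the forward pass
            (averaged over heads in this simplified view).
    """
    mixed = attn @ labels                    # labels follow the token mixing
    aligned = mixed + labels                 # residual connection
    # Renormalize so each token's label still sums to one (an assumption here).
    return aligned / aligned.sum(dim=-1, keepdim=True)
```

Applying such an alignment after every block keeps one label per output token; the label attached to the output class token (or the average of the output token labels for global pooling) can then replace the fixed area-based target $\tilde{y}$ from (1) as the training target.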