
been well explored. TransMix [7] uses the class attention map at the last layer to re-weight the mixing targets, assuming that the output tokens keep spatial correspondence with the input tokens. However, we identify a token fluctuation phenomenon in ViTs that may cause a mismatch between tokens and labels, leading to inaccurate label assignments in both the original CutMix and TransMix. To address this, we propose to align the label and token spaces by tracing their correspondence in a layerwise manner.
3. Proposed Approach
3.1. Preliminaries
The convolutional neural network (CNN) has been the dominant architecture for computer vision in the deep learning era, greatly improving the performance of many tasks. Its monopoly has been challenged by the recent emergence of vision transformers (ViTs), which first “patchify” each image into tokens and then process them with alternating self-attention (SA) and multi-layer perceptron (MLP) blocks.
In addition to architecture design, the training strategy also has a large effect on model performance, especially the data augmentation strategy. Data mixing [23,42–44,49,50] is an important family of data augmentation methods for training both CNNs and ViTs, as it significantly improves the generalization ability of models. As the most commonly used data mixing strategy, CutMix [49] creates virtual training samples from the given training samples $(X, y)$, where $X \in \mathbb{R}^{H \times W \times C}$ denotes the input image and $y$ is the corresponding label. CutMix randomly selects a local region from one input $X_1$ and uses it to replace the pixels in the same region of another input $X_2$ to generate a new sample $\tilde{X}$. Similarly, the label $\tilde{y}$ of $\tilde{X}$ is the combination of the original labels $y_1$ and $y_2$:
$$\tilde{X} = M \odot X_1 + (\mathbf{1} - M) \odot X_2, \quad \tilde{y} = \lambda y_1 + (1 - \lambda) y_2, \tag{1}$$
where $M \in \{0, 1\}^{H \times W}$ is a binary mask indicating which image each pixel belongs to, $\mathbf{1}$ is an all-one matrix, and $\odot$ is the element-wise multiplication. $\lambda$ reflects the mixing ratio of the two labels and is the proportion of pixels cropped from $X_1$ in the mixed image $\tilde{X}$. For a cropped region $[r_x, r_x + r_w] \times [r_y, r_y + r_h]$ from $X_1$, we compute $\lambda = \frac{r_w r_h}{W H}$ to obtain the initial mixed target $\tilde{y}$.
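For concreteness, the sampling procedure of (1) can be sketched in a few lines of PyTorch. The snippet below is only an illustration of the standard CutMix formulation, not our method; the helper `rand_bbox` and all variable names are ours, and we assume one-hot label vectors and a Beta-distributed area ratio as in the original CutMix paper [49].

```python
import numpy as np
import torch

def rand_bbox(H, W, lam):
    """Illustrative helper: sample a box covering roughly lam * H * W pixels."""
    cut = np.sqrt(lam)
    rh, rw = int(H * cut), int(W * cut)
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - rh // 2, 0, H), np.clip(cy + rh // 2, 0, H)
    x1, x2 = np.clip(cx - rw // 2, 0, W), np.clip(cx + rw // 2, 0, W)
    return y1, y2, x1, x2

def cutmix(image1, label1, image2, label2, alpha=1.0):
    """Mix two (C, H, W) images and their one-hot labels following Eq. (1)."""
    _, H, W = image1.shape
    lam = np.random.beta(alpha, alpha)
    t, b, l, r = rand_bbox(H, W, lam)
    mixed = image2.clone()
    mixed[:, t:b, l:r] = image1[:, t:b, l:r]       # M * X1 + (1 - M) * X2
    lam = (b - t) * (r - l) / (H * W)              # actual proportion of X1 pixels
    return mixed, lam * label1 + (1.0 - lam) * label2  # initial mixed target
```

As discussed next, this area-based $\lambda$ can become inaccurate for ViTs once the tokens are processed by self-attention.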
3.2. The Token Fluctuation Phenomenon
CutMix was originally designed for CNNs and assumes that the feature extraction process does not alter the mixing ratio. However, we discover that, different from CNNs, self-attention in ViTs can cause some tokens to fluctuate. This fluctuation further results in a mismatch between the token space and the label space, which hinders effective training of the network.
Formally, we use $z_i$ to denote a token of the image $Z$, i.e., $z_i$ is the transposed $i$-th column vector of $Z$. We can then compute the $i$-th transformed token $\hat{z}_i$ after the spatial operation as $\hat{z}_i = \sum_{j=1}^{N} w^s_{i,j} z_j$, where $w^s_{i,j}$ is the $(i, j)$-th element of the computed spatial mixing matrix $w^s(z)$.
With the assumption of linear information integration, we define the contribution of an original token $z_i$ to a mixed token $\hat{z}_j$ as $c(z_i, \hat{z}_j) = \frac{|w^s_{i,j}|}{\sum_{k=1}^{N} |w^s_{k,j}|}$, where $|\cdot|$ denotes the absolute value. We can then compute the presence of a token $z_i$ in all the mixed image tokens as:
$$p(z_i) = \sum_{j=1}^{N} c(z_i, \hat{z}_j) = \sum_{j=1}^{N} \frac{|w^s_{i,j}|}{\sum_{k=1}^{N} |w^s_{k,j}|}. \tag{2}$$
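To make (2) concrete, a minimal sketch of the presence computation is given below; the function name is ours, and we follow the convention above where the mixing matrix entry $w^s_{i,j}$ weights the original token $z_j$ in the transformed token $\hat{z}_i$.

```python
import torch

def token_presence(w_s: torch.Tensor) -> torch.Tensor:
    """Presence p(z_i) of Eq. (2) for an N x N spatial mixing matrix w_s."""
    w_abs = w_s.abs()                                 # |w^s_{i,j}|
    contrib = w_abs / w_abs.sum(dim=0, keepdim=True)  # divide by sum_k |w^s_{k,j}|
    return contrib.sum(dim=1)                         # sum over j

# For a softmax attention map (rows sum to 1), the presence fluctuates:
attn = torch.softmax(torch.randn(8, 8), dim=-1)
print(token_presence(attn))  # deviates from 1, unlike the convolution case below
```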
For a non-strided depth-wise convolution, each token is multiplied by each element of the convolutional kernel due to translation invariance. We can thus obtain:
$$\sum_{j=1}^{N} |w^s_{i,j}| = \sum_{j=1}^{N} |w^s_{k,j}| = \sum_{m=1}^{M} \sum_{n=1}^{M} |K_{m,n}|, \quad \forall\, i, k \in P_{NE}, \tag{3}$$
where $P_{NE}$ denotes the set of positions that are not at the edge of the image, $K_{m,n}$ denotes the value at the $(m, n)$-th position of the convolution kernel $K$, and $M$ is the kernel size.
We can infer that $p(z_i) = 1, \forall i \in P_{NE}$, i.e., the effect of all the internal tokens does not change during the convolution process. However, for self-attention in ViTs, (3) does not hold since self-attention is not translation-invariant. The fluctuation of $p(z)$ is further amplified by the input dependency of the spatial mixing matrix $w^s(z)$ induced by self-attention. As an extreme case, we may obtain $p(z) \approx 0$ for certain tokens. The fluctuation of tokens alters the actual proportion of mixing (i.e., $\lambda$), and the network might even completely ignore one of the mixed images. The actual label of the processed tokens can then deviate from the mixed label computed by (1), resulting in less effective training.
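The contrast between (3) and self-attention can also be checked numerically. The sketch below, with our own helper names, builds the spatial mixing matrix of a stride-1 depth-wise convolution on a small token grid and compares its token presence with that of a random softmax attention map.

```python
import torch

def depthwise_mixing_matrix(h, w, kernel):
    """Mixing matrix w^s of a stride-1, zero-padded depth-wise convolution on an h x w grid."""
    M = kernel.shape[0]
    pad = M // 2
    w_s = torch.zeros(h * w, h * w)
    for oy in range(h):
        for ox in range(w):
            for ky in range(M):
                for kx in range(M):
                    iy, ix = oy + ky - pad, ox + kx - pad
                    if 0 <= iy < h and 0 <= ix < w:
                        w_s[oy * w + ox, iy * w + ix] = kernel[ky, kx]
    return w_s

def token_presence(w_s):  # as in the previous sketch
    w_abs = w_s.abs()
    return (w_abs / w_abs.sum(dim=0, keepdim=True)).sum(dim=1)

h = w = 8
conv_p = token_presence(depthwise_mixing_matrix(h, w, torch.randn(3, 3)))
attn_p = token_presence(torch.softmax(torch.randn(h * w, h * w), dim=-1))
print(conv_p.view(h, w))  # ~1 for tokens away from the border; deviates only near the edges
print(attn_p.view(h, w))  # fluctuates for every token and is not guaranteed to stay near 1
```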
3.3. Token-Label Alignment
Each token in ViTs interacts with other tokens using
the self-attention mechanism. The input-dependent weights
empower ViTs with more flexibility but also result in a mis-
match between the processed token and the initial token.
To address this, we propose a token-label alignment (TL-
Align) method to trace the correspondence between the in-
put and transformed tokens to obtain the aligned labels for
the resulting representations, as illustrated in Figure 2.
Specifically, ViTs first split the mixed input $\tilde{X}$ after CutMix (1) into a sequence of $N$ non-overlapping patches and then flatten them to obtain the original image tokens $\{\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_N\}$. We then project them into a proper dimension and add positional embeddings:
$$Z_0 = [\tilde{z}_{cls};\ \tilde{x}_1 \cdot E;\ \tilde{x}_2 \cdot E;\ \cdots;\ \tilde{x}_N \cdot E] + E_{pos}, \tag{4}$$
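As a reference point, the embedding step in (4) matches the standard ViT patch embedding; a minimal sketch is given below, where the module name, default sizes, and parameter names are ours rather than part of the proposed method.

```python
import torch
import torch.nn as nn

class MixedPatchEmbed(nn.Module):
    """Embed a CutMix-ed image into the token sequence Z_0 of Eq. (4)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Patchify + linear projection E, implemented as a strided convolution.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # z_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x_mix):
        B = x_mix.shape[0]
        tokens = self.proj(x_mix).flatten(2).transpose(1, 2)     # (B, N, dim): x_i * E
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed  # Z_0 of Eq. (4)
```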