
been well explored. TransMix [7] uses the class attention map at the last layer to re-weight the mixing targets, assuming that the output tokens keep spatial correspondence with the input tokens. However, we identify a token fluctuation phenomenon in ViTs that may cause a mismatch between tokens and labels, leading to inaccurate label assignments in both the original CutMix and TransMix. To address this, we propose to align the label and token spaces by tracing their correspondence in a layerwise manner.
3. Proposed Approach
3.1. Preliminaries
The convolutional neural network (CNN) has been the dominant architecture for computer vision in the deep learning era, greatly improving the performance of many tasks. Its monopoly has been challenged by the recent emergence of vision transformers (ViTs), which first “patchify” each image into tokens and then process them with alternating self-attention (SA) and multi-layer perceptron (MLP) blocks.
In addition to architecture design, the training strategy also has a large effect on model performance, especially the data augmentation strategy. Data mixing [23,42–44,49,50] is an important family of data augmentation methods for training both CNNs and ViTs, as it significantly improves the generalization ability of models. As the most commonly used data mixing strategy, CutMix [49] creates virtual training samples from the given training samples $(X, y)$, where $X \in \mathbb{R}^{H \times W \times C}$ denotes the input image and $y$ is the corresponding label. CutMix randomly selects a local region from one input $X_1$ and uses it to replace the pixels in the same region of another input $X_2$ to generate a new sample $\tilde{X}$. Similarly, the label $\tilde{y}$ of $\tilde{X}$ is the combination of the original labels $y_1$ and $y_2$:
$$\tilde{X} = M \odot X_1 + (\mathbf{1} - M) \odot X_2, \quad \tilde{y} = \lambda y_1 + (1 - \lambda) y_2, \tag{1}$$
where $M \in \{0, 1\}^{H \times W}$ is a binary mask indicating which image each pixel belongs to, $\mathbf{1}$ is an all-one matrix, and $\odot$ is the element-wise multiplication. $\lambda$ reflects the mixing ratio of the two labels and is the proportion of pixels cropped from $X_1$ in the mixed image $\tilde{X}$. For a cropped region $[r_x, r_x + r_w] \times [r_y, r_y + r_h]$ from $X_1$, we compute $\lambda = \frac{r_w r_h}{W H}$ to obtain the initial mixed target $\tilde{y}$.
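For concreteness, the sampling procedure of (1) can be sketched in a few lines of PyTorch. The snippet below is only an illustration of the standard CutMix formulation, not our method; the helper `rand_bbox` and all variable names are ours, and we assume one-hot label vectors and a Beta-distributed area ratio as in the original CutMix paper [49].

```python
import numpy as np
import torch

def rand_bbox(H, W, lam):
    """Illustrative helper: sample a box covering roughly lam * H * W pixels."""
    cut = np.sqrt(lam)
    rh, rw = int(H * cut), int(W * cut)
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - rh // 2, 0, H), np.clip(cy + rh // 2, 0, H)
    x1, x2 = np.clip(cx - rw // 2, 0, W), np.clip(cx + rw // 2, 0, W)
    return y1, y2, x1, x2

def cutmix(image1, label1, image2, label2, alpha=1.0):
    """Mix two (C, H, W) images and their one-hot labels following Eq. (1)."""
    _, H, W = image1.shape
    lam = np.random.beta(alpha, alpha)
    t, b, l, r = rand_bbox(H, W, lam)
    mixed = image2.clone()
    mixed[:, t:b, l:r] = image1[:, t:b, l:r]       # M * X1 + (1 - M) * X2
    lam = (b - t) * (r - l) / (H * W)              # actual proportion of X1 pixels
    return mixed, lam * label1 + (1.0 - lam) * label2  # initial mixed target
```

As discussed next, this area-based $\lambda$ can become inaccurate for ViTs once the tokens are processed by self-attention.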
3.2. The Token Fluctuation Phenomenon
CutMix was originally designed for CNNs and assumes that the feature extraction process does not alter the mixing ratio. However, we discover that, different from CNNs, self-attention in ViTs can cause some tokens to fluctuate. This fluctuation further results in a mismatch between the token space and the label space, which hinders effective training of the network.
Formally, we use $z_i$ to denote a token of the image $Z$, i.e., $z_i$ is the transposed $i$-th column vector of $Z$. We can then compute the $i$-th transformed token $\hat{z}_i$ after the spatial operation as $\hat{z}_i = \sum_{j=1}^{N} w^s_{i,j} z_j$, where $w^s_{i,j}$ is the $(i, j)$-th element of the computed spatial mixing matrix $w^s(z)$.
With the assumption of linear information integration, we define the contribution of an original token $z_i$ to a mixed token $\hat{z}_j$ as $c(z_i, \hat{z}_j) = \frac{|w^s_{i,j}|}{\sum_{k=1}^{N} |w^s_{k,j}|}$, where $|\cdot|$ denotes the absolute value. We can then compute the presence of a token $z_i$ in all the mixed image tokens as:
$$p(z_i) = \sum_{j=1}^{N} c(z_i, \hat{z}_j) = \sum_{j=1}^{N} \frac{|w^s_{i,j}|}{\sum_{k=1}^{N} |w^s_{k,j}|}. \tag{2}$$
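To make (2) concrete, a minimal sketch of the presence computation is given below; the function name is ours, and we follow the convention above where the mixing matrix entry $w^s_{i,j}$ weights the original token $z_j$ in the transformed token $\hat{z}_i$.

```python
import torch

def token_presence(w_s: torch.Tensor) -> torch.Tensor:
    """Presence p(z_i) of Eq. (2) for an N x N spatial mixing matrix w_s."""
    w_abs = w_s.abs()                                 # |w^s_{i,j}|
    contrib = w_abs / w_abs.sum(dim=0, keepdim=True)  # divide by sum_k |w^s_{k,j}|
    return contrib.sum(dim=1)                         # sum over j

# For a softmax attention map (rows sum to 1), the presence fluctuates:
attn = torch.softmax(torch.randn(8, 8), dim=-1)
print(token_presence(attn))  # deviates from 1, unlike the convolution case below
```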
For a non-strided depth-wise convolution, each token is multiplied by each element of the convolutional kernel due to translation invariance. We can thus obtain:
$$\sum_{j=1}^{N} |w^s_{i,j}| = \sum_{j=1}^{N} |w^s_{k,j}| = \sum_{m=1}^{M} \sum_{n=1}^{M} |K_{m,n}|, \quad \forall\, i, k \in P_{NE}, \tag{3}$$
where $P_{NE}$ denotes the set of positions that are not at the edge of the image, $K_{m,n}$ denotes the value at the $(m, n)$-th position of the convolution kernel $K$, and $M$ is the kernel size.
We can infer that $p(z_i) = 1, \forall i \in P_{NE}$, i.e., the effect of all the internal tokens does not change during the convolution process. However, for self-attention in ViTs, (3) does not hold since self-attention is not translation-invariant. The fluctuation of $p(z)$ is further amplified by the input dependency of the spatial mixing matrix $w^s(z)$ induced by self-attention. As an extreme case, we may obtain $p(z) \approx 0$ for certain tokens. The fluctuation of tokens alters the actual proportion of mixing (i.e., $\lambda$), and the network might even completely ignore one of the mixed images. The actual label of the processed tokens can then deviate from the mixed label computed by (1), resulting in less effective training.
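The contrast between (3) and self-attention can also be checked numerically. The sketch below, with our own helper names, builds the spatial mixing matrix of a stride-1 depth-wise convolution on a small token grid and compares its token presence with that of a random softmax attention map.

```python
import torch

def depthwise_mixing_matrix(h, w, kernel):
    """Mixing matrix w^s of a stride-1, zero-padded depth-wise convolution on an h x w grid."""
    M = kernel.shape[0]
    pad = M // 2
    w_s = torch.zeros(h * w, h * w)
    for oy in range(h):
        for ox in range(w):
            for ky in range(M):
                for kx in range(M):
                    iy, ix = oy + ky - pad, ox + kx - pad
                    if 0 <= iy < h and 0 <= ix < w:
                        w_s[oy * w + ox, iy * w + ix] = kernel[ky, kx]
    return w_s

def token_presence(w_s):  # as in the previous sketch
    w_abs = w_s.abs()
    return (w_abs / w_abs.sum(dim=0, keepdim=True)).sum(dim=1)

h = w = 8
conv_p = token_presence(depthwise_mixing_matrix(h, w, torch.randn(3, 3)))
attn_p = token_presence(torch.softmax(torch.randn(h * w, h * w), dim=-1))
print(conv_p.view(h, w))  # ~1 for tokens away from the border; deviates only near the edges
print(attn_p.view(h, w))  # fluctuates for every token and is not guaranteed to stay near 1
```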
3.3. Token-Label Alignment
Each token in ViTs interacts with other tokens using
the self-attention mechanism. The input-dependent weights
empower ViTs with more flexibility but also result in a mis-
match between the processed token and the initial token.
To address this, we propose a token-label alignment (TL-
Align) method to trace the correspondence between the in-
put and transformed tokens to obtain the aligned labels for
the resulting representations, as illustrated in Figure 2.
Specifically, ViTs first split the mixed input $\tilde{X}$ after CutMix (1) into a sequence of $N$ non-overlapping patches and then flatten them to obtain the original image tokens $\{\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_N\}$. We then project them into a proper dimension and add positional embeddings:
$$Z_0 = [\tilde{z}_{cls};\ \tilde{x}_1 \cdot E;\ \tilde{x}_2 \cdot E;\ \cdots;\ \tilde{x}_N \cdot E] + E_{pos}, \tag{4}$$
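As a reference point, the embedding step in (4) matches the standard ViT patch embedding; a minimal sketch is given below, where the module name, default sizes, and parameter names are ours rather than part of the proposed method.

```python
import torch
import torch.nn as nn

class MixedPatchEmbed(nn.Module):
    """Embed a CutMix-ed image into the token sequence Z_0 of Eq. (4)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Patchify + linear projection E, implemented as a strided convolution.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # z_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x_mix):
        B = x_mix.shape[0]
        tokens = self.proj(x_mix).flatten(2).transpose(1, 2)     # (B, N, dim): x_i * E
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed  # Z_0 of Eq. (4)
```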