Preprint. Under review.
EXPLORING THE ROLE OF MEAN TEACHERS IN SELF-
SUPERVISED MASKED AUTO-ENCODERS
Youngwan Lee1,2‡   Jeffrey Willette2‡   Jonghee Kim1   Juho Lee2   Sung Ju Hwang2,3
1Electronics and Telecommunications Research Institute (ETRI), South Korea
2Korea Advanced Institute of Science and Technology (KAIST), South Korea
3AITRICS, South Korea
{yw.lee,jhkim27}@etri.re.kr, {jwillette,juholee,sjhwang82}@kaist.ac.kr
ABSTRACT
Masked image modeling (MIM) has become a popular strategy for self-supervised
learning (SSL) of visual representations with Vision Transformers. A representa-
tive MIM model, the masked auto-encoder (MAE), randomly masks a subset of
image patches and reconstructs the masked patches given the unmasked patches.
Concurrently, many recent works in self-supervised learning utilize the student/teacher
paradigm which provides the student with an additional target based on
the output of a teacher composed of an exponential moving average (EMA) of pre-
vious students. Although common, relatively little is known about the dynamics
of the interaction between the student and teacher. Through analysis on a sim-
ple linear model, we find that the teacher conditionally removes previous gradient
directions based on feature similarities which effectively acts as a conditional mo-
mentum regularizer. From this analysis, we present a simple SSL method, the
Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA
teacher to MAE. We find that RC-MAE converges faster and requires less mem-
ory usage than state-of-the-art self-distillation methods during pre-training, which
may provide a way to enhance the practicality of prohibitively expensive self-
supervised learning of Vision Transformer models. Additionally, we show that
RC-MAE achieves more robustness and better performance compared to MAE
on downstream tasks such as ImageNet-1K classification, object detection, and
instance segmentation.
1 INTRODUCTION
The Transformer (Vaswani et al.,2017) is the de facto standard architecture in natural language
processing (NLP), and has also surpassed state-of-the-art Convolutional Neural Network (He et al.,
2016;Tan & Le,2019) (CNN) feature extractors in vision tasks through models such as the Vision
Transformer (Dosovitskiy et al.,2021) (ViT). Prior to the advent of ViTs, self-supervised learn-
ing (SSL) algorithms in the vision community (He et al.,2020;Chen et al.,2020c;Grill et al.,2020;
Chen et al.,2021) utilized CNNs (e.g., ResNet (He et al.,2016)) as a backbone, performing instance
discrimination pretext tasks through contrastive learning (He et al.,2020;Chen et al.,2020c). Inter-
estingly, self-distillation schemes (Grill et al.,2020;Caron et al.,2021) using a teacher consisting of
an exponential moving average (EMA) of the previous students, (i.e., a “mean” teacher) (Tarvainen
& Valpola,2017), have been shown to exhibit strong performance.
Inspired by the success of masked language modeling (MLM) pre-training in NLP, recent SSL ap-
proaches (Bao et al.,2022;Zhou et al.,2022;Xie et al.,2022;He et al.,2022;Assran et al.,2022) in
the vision community have proposed forms of masked image modeling (MIM) pretext tasks, using
ViT-based backbones. MIM is a simple pretext task which first randomly masks patches of an im-
age, and then predicts the contents of the masked patches (i.e., tokens) using various reconstruction
targets, e.g., visual tokens (Bao et al.,2022;Dong et al.,2021), semantic features (Zhou et al.,2022;
Assran et al.,2022) and raw pixels (He et al.,2022;Xie et al.,2022). In particular, iBOT (Zhou et al.,
‡ Equal contribution.
[Figure 1 graphic: (a) Gradient Correction — legend: Reconstruction Gradient, Consistency Gradient (Conditional), Teacher Corrected Gradient + Momentum, Reconstruction Gradient + Momentum, Previous Momentum. (b) Fine-Tuning on ImageNet-1K. (c) RC-MAE schematic — the original input is patchified and randomly masked, then passed through the student Encoder/Decoder (receiving gradients) and the EMA teacher Encoder/Decoder (stop-grad).]
Figure 1: Overview. (a): When the inputs which led to the previous gradients and current gradients are similar, the consistency gradient provides a conditional correction, allowing the student to learn from newer knowledge. (b): ImageNet-1K fine-tuning top-1 accuracy curve: RC-MAE achieves comparable accuracy (83.4%) at 800 epochs compared to MAE trained for 1600 epochs. (c): In RC-MAE, the reconstructed patches from the student are compared with the original input (reconstruction loss $\mathcal{L}_r$), and with the predicted patches from the teacher (consistency loss $\mathcal{L}_c$).
2022) and MSN (Assran et al.,2022) use a self-distillation scheme for MIM by having the teacher
network provide an encoded target (i.e., feature representation) to match the encoded features from
the original image at a semantic feature level. Methods using semantic-level target representations
exhibit strong performance on image-level classification tasks. In contrast, SimMIM (Xie et al., 2022) and MAE (He et al., 2022) provide pixel-level reconstructions of masked patches, and lead to superior performance on dense prediction tasks such as object detection and segmentation. However, self-distillation for pixel-level MIM remains under-explored.
A recent SSL approach, BYOL (Grill et al.,2020), has shown that a slight architectural asymmetry
between a student and EMA teacher can create a stable model which outperforms previous con-
trastive learning methods. The success of BYOL (Grill et al.,2020) inspired empirical (Chen & He,
2021) and theoretical (Tian et al.,2021) analyses into what enables BYOL to effectively learn and
avoid collapse with the EMA Teacher during pre-training. Still, despite the popularity of the EMA
Teacher in SSL, relatively little is known about how the teacher interacts with the student throughout
the training process.
In this work, we explore the dynamics of self-distillation in pixel-level MIM, e.g., MAE. Through
analyzing a simple linear model, we investigate the dynamics between the gradients of an image re-
construction loss and a teacher consistency loss, learning that the gradients provided by the teacher
conditionally adjust the current gradient by a weighted mixture of previous gradients based on the
similarity between the current and previous features, acting as a conditional momentum regularizer.
For example, Fig. 1(a) shows the case where the inputs which created the previous gradient momen-
tum are similar to the ones which created the current gradients. In this case, the teacher makes a
conditional correction to remove the previous direction from the momentum, allowing the student
to learn from the newer knowledge. If however, the inputs which created both gradients are nearly
orthogonal, the teacher would instead respond with minimal to no correction. We derive this condi-
tional gradient effect in Proposition 4.1, and show evidence in both a simple linear model (Fig. 2) as
well as in a deep ViT-based (Dosovitskiy et al.,2021) MAE model (Fig. 3).
To empirically validate our analysis of the contributions of EMA Teachers, we present a simple
yet effective SSL approach, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), by
equipping MAE with an EMA Teacher, and providing a consistency target. Additionally, we study
the effects of using different image masking strategies between the student and teacher models on the
consistency objective, finding that using the same mask generally leads to better performance in both
pre-training and downstream tasks. The same mask tends to form an orthogonal objective (Fig. 4(b))
to the reconstruction loss, which has been shown (Suteu & Guo,2019;Ajemian et al.,2013) to be
beneficial for multi-task models as there is limited interference between tasks. This observation may
be of interest to any future SSL works which leverage multiple pre-training objectives.
Our experiments follow the same architecture, settings, and pre-training recipe as MAE (He et al.,
2022), and we find that the simple addition of a teacher (RC-MAE) consistently outperforms MAE
in all model sizes (e.g., ViT-S, ViT-B, and ViT-L) when fine-tuned for ImageNet classification. Ad-
ditionally, we find that the teacher’s conditional gradient correction we identified allows RC-MAE
to converge faster compared to MAE (Fig. 1(b)), and RC-MAE outperforms recent self-distillation
methods, MSN and iBOT, on dense prediction tasks such as object detection and instance segmenta-
tion. Furthermore, compared to recent self-distillation methods utilizing a mean teacher, RC-MAE
realizes more efficiency in computation and memory due to the fact that both networks receive only
a subset of patches instead of the whole image. Our main contributions are as follows:
1. We analyze the contribution of EMA Teachers in self-supervised learning, finding that
the gradient provided by the teacher adjusts the direction and magnitude of the current
gradient, conditioned on the similarity between current and previous features.
2. Using this knowledge, we propose a simple, yet effective approach for self-supervised
pre-training of Vision Transformers, the Reconstruction-Consistent Masked Auto-
Encoder (RC-MAE), which improves over vanilla MAE in terms of speed of convergence,
adversarial robustness, and performance on classification, object detection, and instance
segmentation tasks.
3. Thanks to its simplicity, RC-MAE achieves greater savings in both memory and computa-
tion compared to other state-of-the-art self-distillation-based MIM methods.
2 RELATED WORKS
In NLP, masked language modeling (MLM) is common for large-scale pre-training (Devlin et al.,
2019;Radford et al.,2018) by predicting masked words. Similarly, ViT (Dosovitskiy et al.,2021;
Liu et al.,2021;Lee et al.,2022) based masked image modeling (MIM) approaches (Zhou et al.,
2022;Bao et al.,2022;He et al.,2022;Xie et al.,2022;Assran et al.,2022) for computer vision
tasks have been proposed. These MIM approaches first apply a mask to patches of an image, and
then the masked patches are predicted given the visible patches either at the token-level (Zhou et al.,
2022;Bao et al.,2022;Assran et al.,2022) or pixel-level (Chen et al.,2020b;Xie et al.,2022;He
et al.,2022). Token-level masked patch prediction (Zhou et al.,2022;Assran et al.,2022;Bao et al.,
2022) predicts tokens or clusters of masked patches, similar to MLM. Pixel-level prediction (Chen
et al.,2020b;Xie et al.,2022;He et al.,2022) learns visual representations by reconstructing masked
input patches at the RGB pixel-level.
Additionally, self-distillation (Grill et al.,2020;Caron et al.,2021;Chen et al.,2021) has been
deployed in MIM methods by utilizing a teacher constructed from an exponential moving aver-
age (EMA-Teacher) of student weights, providing an additional target for the student. iBOT (Zhou
et al.,2022) gives a full view of an image (i.e., all patches) to the teacher network as an online tok-
enizer, offering a token-level target of the masked patches. Giving a masked view to the student and
a full view to the teacher, MSN (Assran et al.,2022) makes the output embeddings from an EMA-
Teacher serve as a semantic feature representation target to the student. Likewise, BootMAE (Dong
et al.,2022) also adopts an EMA-Teacher, providing a feature-level target to the student on top of
the pixel-level MIM approach. A key difference from these self-distillation MIM approaches is that
RC-MAE provides only unmasked patches to the teacher and student, instead of the full image. As
a result, RC-MAE shows better scalability compared with recent methods (see Table 6).
3 PRELIMINARIES
The Masked Autoencoder (MAE) (He et al., 2022) is a self-supervised approach with a ViT encoder $f$ and decoder $h$, which randomly masks a portion of input patches, and then reconstructs the masked patches given the visible patches. Given an image $X \in \mathbb{R}^{C \times H \times W}$, MAE patchifies $X$ into $N$ non-overlapping patches $\tilde{X} \in \mathbb{R}^{N \times (P^2 C)}$ with a patch size of $P$ and randomly masks a subset of patches $M$ (i.e., mask tokens). The subset of visible patches $V$ is input to the encoder to obtain latent representations: $z = f(V)$. Then, the decoder $h$ attempts to reconstruct $M$ given the latent representations, $\hat{Y} = h(z; M)$, where $\hat{Y} \in \mathbb{R}^{N \times (P^2 C)}$ denotes the reconstructed patches. MAE
utilizes the mean-squared error reconstruction loss $\mathcal{L}_r$, which is only computed on masked patches:
$$\mathcal{L}_r = \frac{1}{|M|} \sum_{i \in M} \big\| \tilde{X}_i - \hat{Y}_i \big\|_2^2 \tag{1}$$
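As a concrete illustration of Eq. (1), below is a minimal PyTorch sketch (not the authors' code) of MAE-style random masking and the masked-patch reconstruction loss. The linear encoder/decoder stand-ins, shapes, and variable names are illustrative assumptions; a real MAE uses ViT blocks, positional embeddings, and optionally per-patch pixel normalization of the target.

```python
# Minimal sketch of MAE-style masking and the pixel reconstruction loss (Eq. 1).
# The linear encoder/decoder and shapes are toy stand-ins, not the paper's ViT.
import torch
import torch.nn as nn

B, N = 8, 196                      # batch size, number of patches
patch_dim = 16 * 16 * 3            # P^2 * C pixels per patch
embed_dim, mask_ratio = 128, 0.75

encoder = nn.Linear(patch_dim, embed_dim)   # stand-in for ViT encoder f
decoder = nn.Linear(embed_dim, patch_dim)   # stand-in for decoder h
mask_token = nn.Parameter(torch.zeros(embed_dim))

x = torch.randn(B, N, patch_dim)            # patchified image X_tilde

# Random masking: shuffle patch indices and keep the first (1 - ratio) * N.
num_keep = int(N * (1 - mask_ratio))
ids_shuffle = torch.rand(B, N).argsort(dim=1)
ids_restore = ids_shuffle.argsort(dim=1)
ids_keep = ids_shuffle[:, :num_keep]
visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, patch_dim))

# Encode visible patches, append mask tokens, restore original order, decode.
z = encoder(visible)                                        # (B, num_keep, D)
mask_tokens = mask_token.expand(B, N - num_keep, -1)
z_full = torch.cat([z, mask_tokens], dim=1)
z_full = torch.gather(z_full, 1,
                      ids_restore.unsqueeze(-1).expand(-1, -1, embed_dim))
pred = decoder(z_full)                                      # (B, N, patch_dim)

# Eq. (1): mean-squared error averaged over masked patches only.
mask = torch.ones(B, N)
mask.scatter_(1, ids_keep, 0.0)             # 1 = masked, 0 = visible
loss_r = (((pred - x) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
```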
The EMA Teacher. The mean teacher model (Tarvainen & Valpola,2017) is a temporal ensemble
of previous student weights which provides an additional target to the student. Doing so has been
shown to reduce the number of labels needed to achieve the same level of accuracy, and has become
a core part of recent state-of-the-art SSL approaches as reviewed in Section 2. Predictions from the
student and teacher are compared via a function such as mean squared error (Tarvainen & Valpola,
2017) or cross-entropy (Caron et al., 2021). Generally, the teacher $T$ is updated after every gradient step on the student $S$, using an exponential moving average of the student weights,
$$T^{(t)} = \alpha T^{(t-1)} + (1-\alpha) S^{(t)} = \sum_{i=0}^{t} \alpha^{i} (1-\alpha) S^{(t-i)}, \tag{2}$$
with a parameter $\alpha \in (0,1)$. The additional target forms a consistency loss $\mathcal{L}_c$ between the teacher and the student predictions. Considering the mean squared error loss, and $\hat{Y}'$ being the prediction from the teacher model,
$$\mathcal{L}_c = \frac{1}{|M|} \sum_{i \in M} \big\| \hat{Y}_i - \hat{Y}'_i \big\|_2^2 \tag{3}$$
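The following is a minimal sketch, assuming a standard PyTorch training loop, of the EMA update in Eq. (2) and the consistency loss in Eq. (3). The module and function names are placeholders rather than the paper's implementation.

```python
# Sketch of the EMA teacher update (Eq. 2) and the consistency loss (Eq. 3).
import copy
import torch

def ema_update(teacher, student, alpha=0.999):
    """T^(t) = alpha * T^(t-1) + (1 - alpha) * S^(t), applied parameter-wise."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)

def consistency_loss(student_pred, teacher_pred, mask):
    """Eq. (3): MSE between student and (stop-gradient) teacher predictions,
    averaged over masked patches only (mask: 1 = masked, 0 = visible)."""
    per_patch = ((student_pred - teacher_pred.detach()) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum()

# Usage sketch: after each optimizer step on the student, refresh the teacher.
student = torch.nn.Linear(10, 10)
teacher = copy.deepcopy(student)      # teacher initialized from the student
for p in teacher.parameters():
    p.requires_grad_(False)           # teacher receives no gradients
# ... forward / backward / optimizer.step() on the student ...
ema_update(teacher, student, alpha=0.999)
```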
4 THE ROLE OF THE TEACHER
Although EMA teachers are common in recent SSL approaches, relatively little is known about
the interaction between the student and teacher. Through analysis of a linear model that mirrors a
MAE+Teacher objective, we will show how the gradients of both models interact. Considering a
linear model for the student $S$ and teacher $T$ consisting of a single weight matrix, like an MAE, the objective is to reconstruct the original input $x$ from a masked input $\tilde{x} = x \odot m$, where $\odot$ is an elementwise multiplication and $m$ is a random binary mask with a predefined masking ratio.
Proposition 4.1. With the reconstruction and consistency objective (Eqs. (1) and (3)), the gradient contribution of the teacher ($\nabla_S \mathcal{L}_c$) adjusts the direction and magnitude of the reconstruction gradients ($\nabla_S \mathcal{L}_r$). The magnitude and direction of the adjustment from the teacher are conditional based on the similarity between the current and previous features. With $\hat{x}$ representing an independent input from a previous timestep,
$$
\begin{aligned}
\nabla_S \mathcal{L}_r + \nabla_S \mathcal{L}_c
&= \nabla_S \tfrac{1}{2}\big\|S\tilde{x} - x\big\|_2^2 + \nabla_S \tfrac{1}{2}\big\|S\tilde{x} - \mathrm{StopGrad}(T\tilde{x})\big\|_2^2 \\
&= \big(S\tilde{x}\tilde{x}^\top - x\tilde{x}^\top\big) + \big(S\tilde{x}\tilde{x}^\top - T\tilde{x}\tilde{x}^\top\big) \\
&= \underbrace{S\tilde{x}\tilde{x}^\top - x\tilde{x}^\top}_{\nabla_S \mathcal{L}_r}
\;-\;\underbrace{\sum_{i=1}^{t} \alpha^{i} \lambda \Big[\underbrace{S\hat{x}\hat{x}^\top - x\hat{x}^\top + S\hat{x}\hat{x}^\top - T\hat{x}\hat{x}^\top}_{\nabla_S \mathcal{L}_r + \nabla_S \mathcal{L}_c \text{ from } \hat{x}}\Big]^{(t-i)} \tilde{x}\tilde{x}^\top}_{\nabla_S \mathcal{L}_c}
\end{aligned}
\tag{4}
$$
Proof. Please see Appendices B and B.1.1.
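As a sanity check of this identity, the toy NumPy sketch below verifies the one-step ($t=1$) case under assumed conditions: the teacher is initialized equal to the student, a plain SGD step with learning rate $\lambda$ is used, and the dimension, masking ratio, and $\alpha$ are arbitrary choices. After one update on a previous input $\hat{x}$, the consistency gradient on a new input $\tilde{x}$ reduces to $-\alpha\lambda$ times the previous gradient multiplied by $\tilde{x}\tilde{x}^\top$, which carries the dot product $\hat{x}^\top\tilde{x}$.

```python
# Toy numerical check (our sketch, with assumed shapes/hyperparameters) of the
# one-step case of Proposition 4.1: after one student SGD step and one teacher
# EMA update, grad_c = -alpha * lr * (previous gradient) @ x_tilde @ x_tilde.T.
import numpy as np

rng = np.random.default_rng(0)
d, lr, alpha = 8, 0.1, 0.99

S = rng.normal(size=(d, d))          # student weight matrix
T = S.copy()                         # teacher starts equal to the student

# Step t=0 on a previous masked input x_hat, reconstructing the clean x.
x = rng.normal(size=(d, 1))
x_hat = x * (rng.random((d, 1)) > 0.75)
g_prev = (S @ x_hat - x) @ x_hat.T + (S @ x_hat - T @ x_hat) @ x_hat.T
S = S - lr * g_prev                  # student SGD step
T = alpha * T + (1 - alpha) * S      # teacher EMA update, Eq. (2)

# Step t=1 on a new masked input x_tilde: consistency gradient of Eq. (3).
x_tilde = x * (rng.random((d, 1)) > 0.75)
grad_c = (S @ x_tilde - T @ x_tilde) @ x_tilde.T

# Closed form predicted by the last term of Eq. (4) for t=1.
grad_c_pred = -alpha * lr * g_prev @ x_tilde @ x_tilde.T
assert np.allclose(grad_c, grad_c_pred)
```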
The gradient of the consistency loss $\nabla_S \mathcal{L}_c$ is wholly represented by the last term on the RHS. Interestingly, there is a dot product $\hat{x}^\top \tilde{x}$ which gets distributed into every term of the sum. If we consider the dot product as cosine similarity $\cos(\hat{x}, \tilde{x}) = \frac{\hat{x}^\top \tilde{x}}{\|\hat{x}\| \|\tilde{x}\|}$, the possible cases for $\cos(\hat{x}, \tilde{x})$ for $\hat{x}$ at a single arbitrary timestep $t$ are as follows: $\cos(\hat{x}, \tilde{x}) \in \{-1, 0, 1\} \cup (0, 1) \cup (-1, 0)$.
Case 1: $\cos(\hat{x}, \tilde{x}) \in \{-1, 1\}$. In this case, the resulting gradient from the last term on the RHS of Eq. (4) removes some amount of residual memory of the direction of a previous gradient. A cosine similarity of $-1$ also means the inputs are collinear, and the gradient is invariant to the sign of $\hat{x}^\top \tilde{x}$.
Case 2: $\cos(\hat{x}, \tilde{x}) = 0$. In this case, there is zero contribution from the teacher for this term in the sum.
Case 3: $\cos(\hat{x}, \tilde{x}) \in (0, 1)$. In this case, the component which contains the previous gradient will be weighted by the coefficient $\hat{x}^\top \tilde{x}$.
[Figure 2 plots: gradient norms and cosine similarities over 500 iterations of the linear MAE+Teacher model for the same / similar / different input types of Table 1; panels (a) $\|\nabla_S \mathcal{L}_c\|_2$, (b) $\|\nabla_S \mathcal{L}_r\|_2$, (c) $\mathrm{CosSim}(\nabla_S \mathcal{L}^{(t)}, \nabla_S \mathcal{L}_c^{(t+1)})$.]
Figure 2: Linear Models: For three sequences of inputs (Table 1), we performed one gradient step and teacher update and then calculated $\nabla_S \mathcal{L}_r$ and $\nabla_S \mathcal{L}_c$ at the next step. Fig. 2(a): $\|\nabla_S \mathcal{L}_c\|_2$ is larger when $\cos(\hat{x}, \tilde{x})$ is large due to Eq. (5). Fig. 2(b): The looser bound on $\|\nabla_S \mathcal{L}_r\|_2$ (Eq. (9)) shows the opposite line order. Fig. 2(c): $\nabla_S \mathcal{L}_c$ shows a conditional direction based on $\cos(\hat{x}, \tilde{x})$.
Putting it all together: For $\nabla_S \mathcal{L}_c$, a $\cos(\hat{x}, \tilde{x}) = 1$ creates a larger move in a negative direction, while $\cos(\hat{x}, \tilde{x}) \approx 0$ creates a smaller move in an approximately orthogonal direction.
In all cases, due to the sign of the last term on the RHS of Eq. (4), the gradient of the consistency loss conditionally removes residual memory of previous gradients. The magnitude of the removal is likewise conditional, which can be seen by using the triangle inequality to upper bound the final term of Proposition 4.1:
$$\big\|\nabla_S \mathcal{L}_c\big\| = \Big\|\sum_{i=1}^{t} \alpha^{i} \lambda \, \nabla_S^{(t-i)} \mathcal{L}^{(t-i)} \, \tilde{x}\tilde{x}^\top\Big\| \;\le\; \sum_{i=1}^{t} \alpha^{i} \lambda \, \big\|\nabla_S^{(t-i)}\big[\ldots\,\hat{x}^\top\big] \, \tilde{x}\big\| \, \big\|\tilde{x}\big\| \tag{5}$$
leading to the conclusion that the magnitude and direction of the consistency gradient directly result from the similarities to features in the recent learning history, as $\alpha^{i}$, with $\alpha \in (0,1)$, decays exponentially for distant timesteps in the past. The gradient of the student model can be bounded in a similar fashion, but without the decaying $\alpha$ coefficients, which makes the bound much looser in general (see Appendix B.1).
Table 1: Input sequences used in Figs. 2 and 3

Name   | Input                      | Description
Case 1 | $(\tilde{x}, \tilde{x})$   | same input twice (same)
Case 2 | $(\tilde{x}, \tilde{x}')$  | same input with a different mask (similar)
Case 3 | $(\tilde{x}, \hat{x})$     | different inputs (different)
Empirical Test. To test for this effect in a
linear model, we conducted an experiment by
training a linear model on data consisting of
random samples $x \in \mathbb{R}^{32}$ from random multivariate normal distributions (see Appendix E.1
for further details). After each training iteration, we sampled an extra batch of data, and for each
single point in the batch we constructed sequences consisting of two inputs described in Table 1. We
then took a single gradient step and teacher update for the first input and calculated the gradient of
the reconstruction and consistency loss on the second input.
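The sketch below is a rough NumPy reconstruction of this test under assumed hyperparameters (dimension 32, plain SGD, $\alpha = 0.99$, masking ratio 0.75); the exact protocol is described in Appendix E.1, so treat it as illustrative rather than a reproduction.

```python
# Rough sketch of the linear-model test: one gradient/teacher step on a first
# input, then measure gradients on a second input that is the same, similar
# (same image, different mask), or different. Hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, lr, alpha, ratio = 32, 0.01, 0.99, 0.75

def mask(x):
    return x * (rng.random(x.shape) > ratio)

def grads(S, T, x_in, x_target):
    g_r = (S @ x_in - x_target) @ x_in.T      # reconstruction gradient
    g_c = (S @ x_in - T @ x_in) @ x_in.T      # consistency gradient
    return g_r, g_c

S = rng.normal(size=(d, d), scale=0.1)
T = S.copy()

x1 = rng.normal(size=(d, 1))                  # first image
x2 = rng.normal(size=(d, 1))                  # a different image
pairs = {
    "same":      (mask(x1), None),            # reuse the identical masked input
    "similar":   (mask(x1), mask(x1)),        # same image, different mask
    "different": (mask(x1), mask(x2)),        # unrelated image
}

for name, (first, second) in pairs.items():
    second = first if second is None else second
    Ss, Ts = S.copy(), T.copy()
    g_r, g_c = grads(Ss, Ts, first, x1)
    Ss -= lr * (g_r + g_c)                    # one student step
    Ts = alpha * Ts + (1 - alpha) * Ss        # one teacher EMA update
    target = x1 if name != "different" else x2
    g_r2, g_c2 = grads(Ss, Ts, second, target)
    print(f"{name:9s} ||grad_r|| = {np.linalg.norm(g_r2):8.3f}  "
          f"||grad_c|| = {np.linalg.norm(g_c2):8.3f}")
```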
Expected Outcome. Based on the similarity of the inputs, we would expect the consistency loss
to produce a larger gradient for the same or similar inputs and a smaller gradient for different in-
puts. Additionally, the direction of the reconstruction and consistency gradient should be closer to
opposite for case 1, and closer to orthogonal for case 3, with case 2 falling somewhere in-between.
In Fig. 2, we in fact observe this trend, noticing that the reconstruction loss produces a significantly
larger gradient when the second gradient step is on different inputs due to the looser bound in Eq. (9).
Interpretation. This finding implies that the teacher plays a role of something akin to a gradient
memory, where the teacher acts as a memory bank which retrieves the memory of recent gradients
based on matching a query ˜
xto a key ˆ
xin the memory bank. For novel inputs which do not match
anything in recent memory, the teacher responds with a minimal correction, letting the student learn
more from the reconstruction signal. If the query and key match, however, the teachers gradient will
conditionally remove some directional information contained in the previous gradient. This allows
the student to move in a direction which favors new knowledge gained from the current input, and
cancels out some previous momentum. This process is illustrated in Fig. 1(a). In Appendix D, we
show that the same terms appear in the context of a deep model, with the dot product appearing at
each semantic feature level. However, in a complex model with nonlinearities, the resulting gradient
direction becomes harder to interpret. Even so, in Fig. 3, we empirically find the same underlying
trend in the gradient norms and directions when analyzing RC-MAE (a ViT based model).