
Our experiments follow the same architecture, settings, and pre-training recipe as MAE (He et al., 2022), and we find that the simple addition of a teacher (RC-MAE) consistently outperforms MAE across model sizes (e.g., ViT-S, ViT-B, and ViT-L) when fine-tuned for ImageNet classification. Additionally, we find that the conditional gradient correction provided by the teacher, which we identify in our analysis, allows RC-MAE to converge faster than MAE (Fig. 1(b)), and RC-MAE outperforms recent self-distillation methods, MSN and iBOT, on dense prediction tasks such as object detection and instance segmentation. Furthermore, compared to recent self-distillation methods that use a mean teacher, RC-MAE is more computation- and memory-efficient because both networks receive only a subset of patches rather than the whole image. Our main contributions are as follows:
1. We analyze the contribution of EMA Teachers in self-supervised learning, finding that the gradient provided by the teacher adjusts the direction and magnitude of the current gradient, conditioned on the similarity between current and previous features.
2. Using this knowledge, we propose a simple yet effective approach for self-supervised pre-training of Vision Transformers, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which improves over vanilla MAE in terms of convergence speed, adversarial robustness, and performance on classification, object detection, and instance segmentation tasks (a minimal sketch of the training objective follows this list).
3. Thanks to its simplicity, RC-MAE achieves greater savings in both memory and computa-
tion compared to other state-of-the-art self-distillation-based MIM methods.
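To make the idea concrete, the following is a minimal PyTorch sketch of a single RC-MAE training step. The model interface (visible patches plus a mask yield a prediction for all patches), the unweighted sum of two mean-squared-error terms, and the masking helper are illustrative assumptions, not the paper's exact implementation; the precise objective is specified later in the paper.

import torch
import torch.nn.functional as F

def random_masking(patches, mask_ratio=0.75):
    # Randomly mark mask_ratio of the N patches as masked; returns a (B, N) boolean mask.
    B, N, _ = patches.shape
    n_mask = int(N * mask_ratio)
    noise = torch.rand(B, N, device=patches.device)
    rank = noise.argsort(dim=1).argsort(dim=1)   # random rank of each patch per sample
    return rank < n_mask

def rc_mae_step(student, teacher, patches, mask_ratio=0.75):
    # `patches`: (B, N, D) patchified images; `student`/`teacher`: encoder-decoder networks
    # assumed to map (visible patches, mask) -> prediction of all N patches.
    mask = random_masking(patches, mask_ratio)
    # Both networks receive only the same subset of visible patches (no full view).
    visible = patches[~mask].view(patches.size(0), -1, patches.size(-1))
    pred_s = student(visible, mask)                       # student reconstruction
    with torch.no_grad():
        pred_t = teacher(visible, mask)                   # teacher reconstruction, no gradient
    loss_rec = F.mse_loss(pred_s[mask], patches[mask])    # pixel reconstruction target, as in MAE
    loss_con = F.mse_loss(pred_s[mask], pred_t[mask])     # consistency with the EMA teacher's output
    return loss_rec + loss_con                            # teacher weights are updated by EMA after each step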
2 RELATED WORKS
In NLP, masked language modeling (MLM) is a common large-scale pre-training objective (Devlin et al., 2019; Radford et al., 2018) in which masked words are predicted. Similarly, masked image modeling (MIM) approaches (Zhou et al., 2022; Bao et al., 2022; He et al., 2022; Xie et al., 2022; Assran et al., 2022) built on ViTs (Dosovitskiy et al., 2021; Liu et al., 2021; Lee et al., 2022) have been proposed for computer vision tasks. These MIM approaches first mask a subset of image patches and then predict the masked patches from the visible ones, either at the token level (Zhou et al., 2022; Bao et al., 2022; Assran et al., 2022) or at the pixel level (Chen et al., 2020b; Xie et al., 2022; He et al., 2022). Token-level masked patch prediction (Zhou et al., 2022; Assran et al., 2022; Bao et al., 2022) predicts tokens or clusters of masked patches, similarly to MLM. Pixel-level prediction (Chen et al., 2020b; Xie et al., 2022; He et al., 2022) learns visual representations by reconstructing masked input patches at the RGB pixel level.
Additionally, self-distillation (Grill et al., 2020; Caron et al., 2021; Chen et al., 2021) has been deployed in MIM methods by utilizing a teacher constructed from an exponential moving average (EMA-Teacher) of the student's weights, providing an additional target for the student. iBOT (Zhou et al., 2022) gives a full view of an image (i.e., all patches) to the teacher network, which acts as an online tokenizer offering a token-level target for the masked patches. Giving a masked view to the student and a full view to the teacher, MSN (Assran et al., 2022) uses the output embeddings of an EMA-Teacher as a semantic feature target for the student. Likewise, BootMAE (Dong et al., 2022) also adopts an EMA-Teacher, providing a feature-level target to the student on top of the pixel-level MIM objective. A key difference from these self-distillation MIM approaches is that RC-MAE provides only unmasked patches to both the teacher and the student, instead of the full image. As a result, RC-MAE shows better scalability compared with recent methods (see Table 6).
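All of these approaches maintain the teacher as an exponential moving average of the student's weights; for reference, a minimal PyTorch sketch of this standard EMA update (with a representative momentum value, not one taken from any particular method) is:

import torch

@torch.no_grad()
def update_ema_teacher(teacher, student, momentum=0.999):
    # theta_teacher <- m * theta_teacher + (1 - m) * theta_student, applied parameter-wise.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)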
3 PRELIMINARIES
The Masked Autoencoder (MAE) (He et al., 2022) is a self-supervised approach with a ViT encoder $f$ and decoder $h$, which randomly masks a portion of the input patches and then reconstructs the masked patches given the visible patches. Given an image $X \in \mathbb{R}^{C \times H \times W}$, MAE patchifies $X$ into $N$ non-overlapping patches $\tilde{X} \in \mathbb{R}^{N \times (P^2 \cdot C)}$ with a patch size of $P$ and randomly masks a subset of patches $M$ (i.e., mask tokens). The subset of visible patches $V$ is input to the encoder to obtain latent representations: $z = f(V)$. Then, the decoder $h$ attempts to reconstruct $M$ given the latent representations, $\hat{Y} = h(z; M)$, where $\hat{Y} \in \mathbb{R}^{N \times (P^2 \cdot C)}$ denotes the reconstructed patches. MAE