
Our experiments follow the same architecture, settings, and pre-training recipe as MAE (He et al., 2022), and we find that the simple addition of a teacher (RC-MAE) consistently outperforms MAE across model sizes (e.g., ViT-S, ViT-B, and ViT-L) when fine-tuned for ImageNet classification. Additionally, we find that the conditional gradient correction provided by the teacher, which we identify in our analysis, allows RC-MAE to converge faster than MAE (Fig. 1(b)), and RC-MAE outperforms recent self-distillation methods, MSN and iBOT, on dense prediction tasks such as object detection and instance segmentation. Furthermore, compared to recent self-distillation methods that use a mean teacher, RC-MAE is more computation- and memory-efficient because both networks receive only a subset of patches rather than the whole image. Our main contributions are as follows:
1. We analyze the contribution of EMA Teachers in self-supervised learning, finding that the gradient provided by the teacher adjusts the direction and magnitude of the current gradient, conditioned on the similarity between current and previous features.
2. Using this knowledge, we propose a simple yet effective approach for self-supervised pre-training of Vision Transformers, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which improves over vanilla MAE in terms of convergence speed, adversarial robustness, and performance on classification, object detection, and instance segmentation tasks (a minimal sketch of the training objective follows this list).
3. Thanks to its simplicity, RC-MAE achieves greater savings in both memory and computa-
tion compared to other state-of-the-art self-distillation-based MIM methods.
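To make the idea concrete, the following is a minimal PyTorch sketch of a single RC-MAE training step. The model interface (visible patches plus a mask yield a prediction for all patches), the unweighted sum of two mean-squared-error terms, and the masking helper are illustrative assumptions, not the paper's exact implementation; the precise objective is specified later in the paper.

import torch
import torch.nn.functional as F

def random_masking(patches, mask_ratio=0.75):
    # Randomly mark mask_ratio of the N patches as masked; returns a (B, N) boolean mask.
    B, N, _ = patches.shape
    n_mask = int(N * mask_ratio)
    noise = torch.rand(B, N, device=patches.device)
    rank = noise.argsort(dim=1).argsort(dim=1)   # random rank of each patch per sample
    return rank < n_mask

def rc_mae_step(student, teacher, patches, mask_ratio=0.75):
    # `patches`: (B, N, D) patchified images; `student`/`teacher`: encoder-decoder networks
    # assumed to map (visible patches, mask) -> prediction of all N patches.
    mask = random_masking(patches, mask_ratio)
    # Both networks receive only the same subset of visible patches (no full view).
    visible = patches[~mask].view(patches.size(0), -1, patches.size(-1))
    pred_s = student(visible, mask)                       # student reconstruction
    with torch.no_grad():
        pred_t = teacher(visible, mask)                   # teacher reconstruction, no gradient
    loss_rec = F.mse_loss(pred_s[mask], patches[mask])    # pixel reconstruction target, as in MAE
    loss_con = F.mse_loss(pred_s[mask], pred_t[mask])     # consistency with the EMA teacher's output
    return loss_rec + loss_con                            # teacher weights are updated by EMA after each step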
2 RELATED WORKS
In NLP, masked language modeling (MLM) is a common large-scale pre-training objective (Devlin et al., 2019; Radford et al., 2018) in which masked words are predicted. Similarly, masked image modeling (MIM) approaches (Zhou et al., 2022; Bao et al., 2022; He et al., 2022; Xie et al., 2022; Assran et al., 2022) built on ViTs (Dosovitskiy et al., 2021; Liu et al., 2021; Lee et al., 2022) have been proposed for computer vision tasks. These MIM approaches first mask a subset of image patches and then predict the masked patches from the visible ones, either at the token level (Zhou et al., 2022; Bao et al., 2022; Assran et al., 2022) or at the pixel level (Chen et al., 2020b; Xie et al., 2022; He et al., 2022). Token-level masked patch prediction (Zhou et al., 2022; Assran et al., 2022; Bao et al., 2022) predicts tokens or clusters of masked patches, similarly to MLM. Pixel-level prediction (Chen et al., 2020b; Xie et al., 2022; He et al., 2022) learns visual representations by reconstructing masked input patches at the RGB pixel level.
Additionally, self-distillation (Grill et al., 2020; Caron et al., 2021; Chen et al., 2021) has been deployed in MIM methods by utilizing a teacher constructed from an exponential moving average (EMA-Teacher) of the student's weights, providing an additional target for the student. iBOT (Zhou et al., 2022) gives a full view of an image (i.e., all patches) to the teacher network, which acts as an online tokenizer offering a token-level target for the masked patches. Giving a masked view to the student and a full view to the teacher, MSN (Assran et al., 2022) uses the output embeddings of an EMA-Teacher as a semantic feature target for the student. Likewise, BootMAE (Dong et al., 2022) also adopts an EMA-Teacher, providing a feature-level target to the student on top of the pixel-level MIM objective. A key difference from these self-distillation MIM approaches is that RC-MAE provides only unmasked patches to both the teacher and the student, instead of the full image. As a result, RC-MAE shows better scalability compared with recent methods (see Table 6).
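All of these approaches maintain the teacher as an exponential moving average of the student's weights; for reference, a minimal PyTorch sketch of this standard EMA update (with a representative momentum value, not one taken from any particular method) is:

import torch

@torch.no_grad()
def update_ema_teacher(teacher, student, momentum=0.999):
    # theta_teacher <- m * theta_teacher + (1 - m) * theta_student, applied parameter-wise.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)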
3 PRELIMINARIES
The Masked Autoencoder (MAE) (He et al., 2022) is a self-supervised approach with a ViT encoder $f$ and decoder $h$, which randomly masks a portion of the input patches and then reconstructs the masked patches given the visible patches. Given an image $X \in \mathbb{R}^{C \times H \times W}$, MAE patchifies $X$ into $N$ non-overlapping patches $\tilde{X} \in \mathbb{R}^{N \times (P^2 \cdot C)}$ with a patch size of $P$ and randomly masks a subset of patches $M$ (i.e., mask tokens). The subset of visible patches $V$ is input to the encoder to obtain latent representations: $z = f(V)$. Then, the decoder $h$ attempts to reconstruct $M$ given the latent representations, $\hat{Y} = h(z; M)$, where $\hat{Y} \in \mathbb{R}^{N \times (P^2 \cdot C)}$ denotes the reconstructed patches. MAE