
construct the input data, whereas our M2D learns to predict masked latent representations. BYOL differs from ours in that it is a framework for learning representations invariant to data augmentation. SIM [2], MSN [3], and data2vec [10] learn to predict masked patch representations using a target network, but, unlike ours, all input patches are fed to the target. CAE [4] and SplitMask [5] encode target representations using only masked patches, which is similar to ours but without the use of a target network. While SIM, MSN, CAE, and SplitMask learn image representations, data2vec also learns audio representations.
In this work, we conducted experiments on learning general-purpose audio representations. Various methods learn speech and audio representations from masked input, such as Mockingjay [13],
wav2vec2 [14], HuBERT [15], and BigSSL [16] for speech, and
SSAST [17] for audio. Methods more closely related to ours are
MAE-AST [7], MaskSpec [8], MSM-MAE [9], and Audio-MAE
[6], which adapt MAE to learn audio representations. However, they
differ from our method in that they do not use a target network.
Other SSL methods for learning audio representations include Wang et al. [18] and DeLoRes [19], as well as BYOL-A [20], BYOL-S [21], and ATST [22], which use BYOL as the learning framework but do not mask the input. For supervised learning,
AST [23], EAT [24], PaSST [25], and HTS-AT [26] have shown
SOTA performance.
3. MASKED MODELING DUO
Our method learns representations by using only visible patches to predict the masked patch representations. As shown in Fig. 2, it consists of two networks, referred to as the online and target networks.
Fig. 2. Overview of the M2D framework.
Processing input   The framework partitions the input data $x$ (audio spectrogram, image, etc.) into a grid of patches, adds positional encoding, and randomly selects a number of patches according to a masking ratio as masked patches $x_m$ (e.g., 60% of the input) and the rest as visible patches $x_v$ (e.g., the remaining 40%). While we use the same positional encoding as MAE [1], we tested various masking ratios as discussed in Section 4.4.
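As a concrete illustration of this step, the following PyTorch sketch partitions a spectrogram into patches and randomly splits them according to a masking ratio. The helper names (patchify, random_split), the tensor shapes, and the single-channel input are our own assumptions for illustration, not the authors' implementation.

import torch

def patchify(spec, patch_size=16):
    # Split a batch of spectrograms (B, 1, F, T) into flattened patches (B, N, patch_size**2).
    B, C, F, T = spec.shape
    p = patch_size
    x = spec.reshape(B, C, F // p, p, T // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (F // p) * (T // p), p * p * C)
    return x

def random_split(patches, mask_ratio=0.6):
    # Randomly split patches into visible and masked sets, independently per sample.
    B, N, D = patches.shape
    num_masked = int(N * mask_ratio)
    noise = torch.rand(B, N, device=patches.device)    # random score per patch
    ids_shuffle = noise.argsort(dim=1)                 # random permutation of patch indices
    ids_masked = ids_shuffle[:, :num_masked]           # indices I_M of masked patches
    ids_visible = ids_shuffle[:, num_masked:]          # indices of visible patches
    gather = lambda ids: torch.gather(patches, 1, ids.unsqueeze(-1).expand(-1, -1, D))
    return gather(ids_visible), gather(ids_masked), ids_visible, ids_masked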
Online and target networks   The online network, defined by a set of weights $\theta$, encodes the visible patches $x_v$ using the online encoder $f_\theta$ into the representation $z_v = f_\theta(x_v)$. It concatenates shared, learnable masked tokens $m$ to $z_v$, adds the positional encoding $p$, and predicts $\hat{z}$, the representations of all input patches, using the predictor $g_\theta$:

$\hat{z} = g_\theta(\mathrm{concat}(z_v, m) + p)$.   (1)

It then filters the prediction result $\hat{z}$ to output $\hat{z}_m = \{\hat{z}[i] \mid i \in I_M\}$, containing only the masked patch representations, where $I_M$ is the set of indices of the masked patches.
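A minimal sketch of this online branch is shown below, assuming the index tensors produced by the masking step above. Here encoder and predictor stand in for $f_\theta$ and $g_\theta$ (e.g., ViT blocks), and we keep the (visible, masked) ordering rather than restoring the original patch order, gathering the positional encodings to match; both simplifications are our own.

import torch

def online_forward(encoder, predictor, mask_token, pos_embed,
                   visible_patches, ids_visible, ids_masked):
    # Encode visible patches, append shared mask tokens, add positional
    # encoding, predict all patches, and keep only the masked predictions.
    B, D = visible_patches.shape[0], pos_embed.shape[-1]
    z_v = encoder(visible_patches)                          # z_v = f_theta(x_v), (B, N_vis, D)
    num_masked = ids_masked.shape[1]
    m = mask_token.expand(B, num_masked, -1)                # shared learnable mask token
    z = torch.cat([z_v, m], dim=1)                          # concat(z_v, m)
    ids_all = torch.cat([ids_visible, ids_masked], dim=1)   # same ordering as z
    p = torch.gather(pos_embed.expand(B, -1, -1), 1,
                     ids_all.unsqueeze(-1).expand(-1, -1, D))
    z_hat = predictor(z + p)                                # Eq. (1)
    return z_hat[:, -num_masked:]                           # z_hat_m: masked patches only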
The target network is defined by parameters $\xi$ and consists only of the momentum encoder $f_\xi$, which is identical to the online encoder except for the parameters. The network encodes the masked patches $x_m$ using $f_\xi$ to output the representation $z_m = f_\xi(x_m)$. We then standardize $z_m$ to $\tilde{z}_m = (z_m - \mathrm{mean}(z_m))/\sqrt{\mathrm{var}(z_m)}$ to stabilize the training, which we confirmed empirically in preliminary experiments, rather than for a performance gain as in MAE.
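The target branch can be sketched as follows; computing the statistics per patch over the feature dimension, the small epsilon, and the function name are our assumptions.

import torch

@torch.no_grad()  # no gradient flows through the target network
def target_forward(momentum_encoder, masked_patches, eps=1e-6):
    # Encode only the masked patches with the momentum encoder f_xi,
    # then standardize the representations to stabilize training.
    z_m = momentum_encoder(masked_patches)         # z_m = f_xi(x_m), (B, N_mask, D)
    mean = z_m.mean(dim=-1, keepdim=True)
    var = z_m.var(dim=-1, keepdim=True)
    return (z_m - mean) / torch.sqrt(var + eps)    # standardized target z_m_tilde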
Calculating loss   The loss is calculated using the standardized target output $\tilde{z}_m$ as a training signal against the online prediction output $\hat{z}_m$. Inspired by BYOL [12], we calculate the loss $L$ as the mean square error (MSE) of the $\ell_2$-normalized $\hat{z}_m$ and $\tilde{z}_m$:

$L \triangleq \|\ell_2(\hat{z}_m) - \ell_2(\tilde{z}_m)\|_2^2 = 2 - 2 \cdot \dfrac{\langle \hat{z}_m, \tilde{z}_m \rangle}{\|\hat{z}_m\|_2 \cdot \|\tilde{z}_m\|_2}$,   (2)

where $\langle \cdot, \cdot \rangle$ denotes the inner product.
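In code, Eq. (2) reduces to a per-patch MSE between $\ell_2$-normalized vectors, equivalent to a scaled and shifted negative cosine similarity; averaging over patches and the batch in this sketch is our assumption.

import torch
import torch.nn.functional as F

def m2d_loss(z_hat_m, z_m_tilde):
    # MSE between l2-normalized online predictions and standardized targets, Eq. (2).
    pred = F.normalize(z_hat_m, dim=-1)       # l2(z_hat_m)
    target = F.normalize(z_m_tilde, dim=-1)   # l2(z_m_tilde)
    return (2.0 - 2.0 * (pred * target).sum(dim=-1)).mean()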
Updating network parameters   Our framework updates the parameters $\theta$ and $\xi$ after each training step. It updates $\theta$ only by minimizing the loss $L$, as depicted by the stop-gradient in Fig. 2, whereas $\xi$ is updated as a slowly moving exponential average of $\theta$ with a decay rate $\tau$:

$\xi \leftarrow \tau\xi + (1 - \tau)\theta$.   (3)
It has been empirically shown that the stop-gradient operation can avoid collapse to an uninformative solution, and that the moving-average behavior may lead to learning effective representations [11].
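A minimal sketch of the update in Eq. (3) follows; the decay rate value and the assumption that the target starts as a parameter-wise copy of the online encoder with gradients disabled are illustrative, not the exact training recipe.

import torch

def ema_update(online_encoder, target_encoder, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied parameter-wise (Eq. (3)).
    with torch.no_grad():
        for p_target, p_online in zip(target_encoder.parameters(),
                                      online_encoder.parameters()):
            p_target.mul_(tau).add_(p_online, alpha=1.0 - tau)

# Typically, the target encoder starts as a copy of the online encoder and
# never receives gradients, e.g.:
#   target_encoder = copy.deepcopy(online_encoder).requires_grad_(False)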
After the training, we transfer only $f_\theta$ as a pre-trained model.
4. EXPERIMENTS
We validated M2D through step-by-step experiments: we examined the effectiveness of learning representations directly by comparing our M2D with MAE (Section 4.2), the effectiveness of feeding only masked patches to the target (Section 4.3), and the impact of various masking ratios (Section 4.4), and we compared our method with SOTA (Section 4.5).
In all experiments, we applied our M2D to masked spectrogram modeling (MSM) [9], with an audio spectrogram as input to learn general-purpose audio representations. We evaluated the performance of pre-trained models in both a linear evaluation and fine-tuning on a variety of audio downstream tasks spanning environmental sounds, speech, and music.
4.1. Experimental Setup
Since we mainly focused on comparing M2D with MAE, we adapted the MAE implementation and settings with as few changes as possible.
We implemented an additional target network on top of the MAE code and adopted the MAE decoder as our predictor $g_\theta$ without changes. We used vanilla ViT-Base [27] with a 768-d output feature as our encoders ($f_\theta$ and $f_\xi$) and fixed the patch size to 16×16
for all experiments. We tested with masking ratios of 0.6 and 0.7,
which showed good performance in preliminary experiments.
We used MSM-MAE [9] as the MAE for comparison, an MAE variant optimized for MSM by making the decoder smaller