MASKED MODELING DUO: LEARNING REPRESENTATIONS
BY ENCOURAGING BOTH NETWORKS TO MODEL THE INPUT
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino
NTT Corporation, Japan
ABSTRACT
Masked Autoencoders is a simple yet powerful self-supervised
learning method. However, it learns representations indirectly by
reconstructing masked input patches. Several methods learn repre-
sentations directly by predicting representations of masked patches;
however, we think using all patches to encode the training-signal
representations is suboptimal. We propose a new method, Masked
Modeling Duo (M2D), that learns representations directly while
obtaining training signals using only masked patches. In the M2D,
the online network encodes visible patches and predicts masked
patch representations, and the target network, a momentum encoder,
encodes masked patches. To better predict target representations,
the online network should model the input well, while the target
network should also model it well to agree with online predictions.
Then the learned representations should better model the input.
We validated the M2D by learning general-purpose audio repre-
sentations, and M2D set new state-of-the-art performance on tasks
such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and
SpeechCommandsV2.
Index Terms— Self-supervised learning, Masked Autoencoders,
Masked Image Modeling, Masked Spectrogram Modeling
1. INTRODUCTION
Recently, self-supervised learning (SSL) methods using masked im-
age modeling (MIM) have progressed and yielded promising re-
sults in the image domain. Among them, Masked Autoencoders [1]
(MAE) have inspired numerous subsequent studies and influenced
not only the image domain [2–5] but also the audio domain [6–9].
An MAE effectively learns a representation by reconstructing
a large number (i.e., 75%) of masked input patches using a small
number of visible patches, encouraging the learned representation
to model the input. However, it learns representations indirectly by
minimizing the loss between the original input and the reconstructed
result, which may not be optimal for learning a representation.
In contrast, several previous methods [2, 3, 10] achieve direct
learning of representations, typically by using a momentum encoder
in a Siamese architecture [11] to obtain the masked patch represen-
tations as a training signal. In this case, all input patches are used to
encode these representations, which does not encourage modeling of the input.
We hypothesize that the learned representation would become
more useful if the training signal were encoded using only masked
patches instead of all patches in order to encourage modeling the
input in the training signal. While an MAE effectively encourages
modeling the input signal by limiting the number of visible patches
fed to the encoder, using all the input patches to obtain a training
signal does not benefit from the inductive bias of the MAE.
In this paper, we propose a new method, Masked Modeling Duo
(M2D), that learns representations directly by predicting the repre-
sentations of masked patches from visible patches only.

Fig. 1. M2D pre-training scenario. The online network encodes
visible patches and predicts masked patch representations, while the
target network encodes masked patches. The M2D maximizes the
agreement between these two outputs to learn representations. We
provide only the masked patches to the target illustrated as (a), unlike
conventional methods (e.g., data2vec [10]) depicted as (b), encour-
aging representations to model the input from both the online and
target networks.

As illustrated in Fig. 1, the target representations are encoded from only the
masked patches, not from all the input patches as they are in the
previous methods [2,3, 10]. Although our method adds a target mo-
mentum encoder to the MAE, the entire framework remains simple.
The M2D promotes complementary input modeling by feeding
mutually exclusive patches to its networks. For example, to reduce
the prediction error for a heartbeat audio input consisting of two
sounds (S1 and S2), the online network should encode the given
visible patches around S2 into representations modeled as part of
the whole heartbeat in order to predict the representations of masked
patches around S1. Conversely, the masked patch representations
around S1 encoded by the target network are more likely to agree
with the prediction if it is encoded as part of the whole heartbeat.
Therefore, our method encourages input modeling from both sides.
In our experiments, we validated our method by learning a
general-purpose audio representation using an audio spectrogram as
input and confirmed the effectiveness of learning the representation
directly and providing only masked patches to the target. In addi-
tion, M2D set new state-of-the-art (SOTA) performance on several
audio tasks. Our code is available online1.

1https://github.com/nttcslab/m2d
2. RELATED WORK
This study was inspired by MAE [1] for an MIM and Bootstrap Your
Own Latent [12] (BYOL) as a framework for directly learning la-
tent representations using a target network. An MAE learns to re-
construct the input data, whereas our M2D learns to predict masked
latent representations. BYOL differs from ours in that it is a frame-
work for learning representations invariant to data augmentation.
SIM [2], MSN [3], and data2vec [10] learn to predict masked
patch representations using a target network, but, unlike ours, all in-
put patches are fed to the target. CAE [4] and SplitMask [5] encode
target representation using only masked patches, which is similar to
ours but without the use of a target network. While SIM, MSN, CAE,
and SplitMask learn image representations, data2vec also learns au-
dio representations.
In this work, we experimented by learning general-purpose
audio representations. To learn speech and audio, various methods
learn representations using masked input, such as Mockingjay [13],
wav2vec2 [14], HuBERT [15], and BigSSL [16] for speech, and
SSAST [17] for audio. Methods more closely related to ours are
MAE-AST [7], MaskSpec [8], MSM-MAE [9], and Audio-MAE
[6], which adapt MAE to learn audio representations. However, they
differ from our method in that they do not use a target network.
Other SSL methods for learning audio representations include
Wang et al. [18] and DeLoRes [19], and especially BYOL-A [20],
BYOL-S [21], and ATST [22], which use BYOL as the learning
framework; they do not mask the input. For supervised learning,
AST [23], EAT [24], PaSST [25], and HTS-AT [26] have shown
SOTA performance.
3. MASKED MODELING DUO
Our method learns representations by using only visible patches to
predict the masked patch representations. As shown in Fig. 2, it con-
sists of two networks, referred to as the online and target networks.
Fig. 2. Overview of the M2D framework (online and target encoders,
predictor, mask token, positional encoding, standardize/filter/concat
operations, EMA update, and stop-gradient).
Processing input  The framework partitions the input data $x$ (audio
spectrogram, image, etc.) into a grid of patches, adds positional
encoding, and randomly selects a number of patches according to a
masking ratio as masked patches $x_m$ (e.g., 60% of the input) and the
rest as visible patches $x_v$ (e.g., the remaining 40%). While we use
the same positional encoding as MAE [1], we tested various masking
ratios as discussed in Section 4.4.
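For illustration, the random split into visible and masked patches could be implemented along the following lines. This is a minimal PyTorch-style sketch under our own assumptions (the function name, tensor shapes, and per-sample shuffling are illustrative), not the authors' implementation.

import torch

def split_visible_masked(patches, mask_ratio=0.6):
    """Randomly split patch embeddings (positional encoding already added)
    into visible and masked sets, independently per sample."""
    B, N, D = patches.shape
    num_masked = int(N * mask_ratio)

    # Per-sample random permutation of patch indices.
    perm = torch.rand(B, N).argsort(dim=1)
    masked_idx = perm[:, :num_masked]      # indices I_M of masked patches
    visible_idx = perm[:, num_masked:]     # indices of visible patches

    def gather(idx):
        return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))

    return gather(visible_idx), gather(masked_idx), masked_idx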
Online and target networks  The online network, defined by a
set of weights $\theta$, encodes the visible patches $x_v$ using the online
encoder $f_\theta$ into the representation $z_v = f_\theta(x_v)$. It concatenates
shared, learnable masked tokens $m$ to $z_v$, adds the positional encoding
$p$, and predicts $\hat{z}$, the representations of all input patches, using the
predictor $g_\theta$:

$\hat{z} = g_\theta(\mathrm{concat}(z_v, m) + p)$   (1)

It then filters the prediction result $\hat{z}$ to output $\hat{z}_m = \{\hat{z}[i] \mid i \in I_M\}$,
containing only the masked patch representations, where $I_M$ is the set
of indices of the masked patches.
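As a sketch of Eq. (1) and the subsequent filtering, the online branch could look as follows; f_theta, g_theta, and the handling of the positional encoding here are simplified assumptions rather than the authors' code.

import torch

def online_branch(f_theta, g_theta, x_v, mask_token, pos_visible, pos_masked):
    """Encode visible patches, append shared mask tokens, add positional
    encoding, predict all patches, and keep only masked-position outputs."""
    z_v = f_theta(x_v)                                 # (B, N_v, D)
    B, D = z_v.shape[0], z_v.shape[-1]
    N_m = pos_masked.shape[1]
    m = mask_token.expand(B, N_m, D)                   # shared, learnable token

    # Positional encoding ordered as [visible, masked] to match the concat.
    p = torch.cat([pos_visible, pos_masked], dim=1)
    z_hat = g_theta(torch.cat([z_v, m], dim=1) + p)    # Eq. (1)

    return z_hat[:, -N_m:]                             # filtered prediction for masked positions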
The target network is defined by parameters $\xi$ and consists only
of the momentum encoder $f_\xi$, which is identical to the online encoder
except for the parameters. The network encodes masked patches $x_m$
using $f_\xi$ to output the representation $z_m = f_\xi(x_m)$. We then stan-
dardize $z_m$ to $\tilde{z}_m = (z_m - \mathrm{mean}(z_m))/\sqrt{\mathrm{var}(z_m)}$ for stabilizing
the training, which we empirically confirmed in preliminary experi-
ments, rather than for a performance gain as in MAE.
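A corresponding sketch of the target branch is shown below; computing the statistics over the whole tensor and the epsilon value are our simplifying assumptions, since the text does not restate the standardization axis.

import torch

def target_branch(f_xi, x_m, eps=1e-8):
    """Momentum-encode the masked patches and standardize the output."""
    with torch.no_grad():                  # the target is not updated by the loss
        z_m = f_xi(x_m)
        z_tilde_m = (z_m - z_m.mean()) / (z_m.var().sqrt() + eps)
    return z_tilde_m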
Calculating loss  The loss is calculated using the standardized tar-
get output $\tilde{z}_m$ as a training signal against the online prediction out-
put $\hat{z}_m$. Inspired by BYOL [12], we calculate the loss $L$ by the mean
square error (MSE) of the $\ell_2$-normalized $\hat{z}_m$ and $\tilde{z}_m$:

$L \triangleq \| \ell_2(\hat{z}_m) - \ell_2(\tilde{z}_m) \|_2^2 = 2 - 2 \cdot \frac{\langle \hat{z}_m, \tilde{z}_m \rangle}{\|\hat{z}_m\|_2 \cdot \|\tilde{z}_m\|_2},$   (2)

where $\langle \cdot, \cdot \rangle$ denotes the inner product.
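Since Eq. (2) equals 2 minus 2 times the cosine similarity, it can be computed directly from l2-normalized vectors; a minimal sketch follows, where averaging over the masked patches and the batch is our assumption.

import torch.nn.functional as F

def m2d_loss(z_hat_m, z_tilde_m):
    """MSE between l2-normalized prediction and target, as in Eq. (2)."""
    a = F.normalize(z_hat_m, dim=-1)
    b = F.normalize(z_tilde_m, dim=-1)
    return (2 - 2 * (a * b).sum(dim=-1)).mean()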
Updating network parameters  Our framework updates parame-
ters $\theta$ and $\xi$ after each training step. It updates $\theta$ only by minimizing
the loss $L$, as depicted by the stop-gradient in Fig. 2, whereas updat-
ing $\xi$ is based on a slowly moving exponential average of $\theta$ with a
decay rate $\tau$:

$\xi \leftarrow \tau \xi + (1 - \tau)\theta$   (3)
It has been empirically shown that the stop-gradient operation can
avoid collapse to an uninformative solution, and that the moving-
average behavior may lead to learning effective representations [11].
After training, we transfer only $f_\theta$ as the pre-trained model.
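To make the update rule of Eq. (3) concrete, a minimal sketch follows; the decay value shown is illustrative and not the paper's setting.

import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, tau=0.99):
    """Eq. (3): move target parameters toward the online parameters."""
    for p_theta, p_xi in zip(online_encoder.parameters(),
                             target_encoder.parameters()):
        p_xi.mul_(tau).add_(p_theta, alpha=1 - tau)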
4. EXPERIMENTS
We validated the M2D with step-by-step experiments that examined the
effectiveness of learning representations directly by comparing our
M2D with an MAE (Section 4.2), the effectiveness of feeding only
masked patches to the target (Section 4.3), the impact of various
masking ratios (Section 4.4), and a comparison of ours with SOTA
methods (Section 4.5).
In all experiments, we applied our M2D to masked spectro-
gram modeling (MSM) [9], with an audio spectrogram as input to
learn general-purpose audio representations. We evaluated the per-
formance of pre-trained models in both a linear evaluation and fine-
tuning on a variety of audio downstream tasks spanning environmen-
tal sounds, speech, and music.
4.1. Experimental Setup
We mainly focused on comparing M2D with an MAE and therefore adapted
the MAE implementation and settings with as few changes as possible.
We implemented an additional target network on top of the MAE
code and adopted the MAE decoder as our predictor $g_\theta$ without
changes. We used a vanilla ViT-Base [27] with a 768-d output fea-
ture as our encoders ($f_\theta$ and $f_\xi$) and fixed the patch size to 16 × 16
for all experiments. We tested masking ratios of 0.6 and 0.7,
which showed good performance in preliminary experiments.
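For reference, the stated setup can be summarized as follows; the dictionary keys are illustrative assumptions, not the authors' configuration file.

# Illustrative summary of the setup described above; keys are assumptions.
config = {
    "encoder": "ViT-Base",       # vanilla ViT-Base [27], 768-d output features
    "embed_dim": 768,
    "patch_size": (16, 16),      # fixed for all experiments
    "predictor": "MAE decoder",  # adopted unchanged as g_theta
    "mask_ratio": [0.6, 0.7],    # ratios tested
}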
We used the MSM-MAE [9] as an MAE for comparison, an
MAE variant optimized for MSM by making the decoder smaller