
construct the input data, whereas our M2D learns to predict masked latent representations. BYOL differs from ours in that it is a framework for learning representations invariant to data augmentation. SIM [2], MSN [3], and data2vec [10] learn to predict masked patch representations using a target network, but, unlike ours, all input patches are fed to the target. CAE [4] and SplitMask [5] encode target representations using only masked patches, which is similar to ours but without the use of a target network. While SIM, MSN, CAE, and SplitMask learn image representations, data2vec also learns audio representations.
In this work, we conducted experiments on learning general-purpose audio representations. Various methods learn speech and audio representations from masked input, such as Mockingjay [13],
wav2vec2 [14], HuBERT [15], and BigSSL [16] for speech, and
SSAST [17] for audio. Methods more closely related to ours are
MAE-AST [7], MaskSpec [8], MSM-MAE [9], and Audio-MAE
[6], which adapt MAE to learn audio representations. However, they
differ from our method in that they do not use a target network.
Other SSL methods for learning audio representations include Wang et al. [18] and DeLoRes [19], as well as BYOL-A [20], BYOL-S [21], and ATST [22], which use BYOL as the learning framework but do not mask the input. For supervised learning,
AST [23], EAT [24], PaSST [25], and HTS-AT [26] have shown
SOTA performance.
3. MASKED MODELING DUO
Our method learns representations by using only visible patches to predict the masked patch representations. As shown in Fig. 2, it consists of two networks, referred to as the online and target networks.
Fig. 2. Overview of the M2D framework.
Processing input   The framework partitions the input data $x$ (audio spectrogram, image, etc.) into a grid of patches, adds positional encoding, and randomly selects a number of patches according to a masking ratio as masked patches $x_m$ (e.g., 60% of the input) and the rest as visible patches $x_v$ (e.g., the remaining 40%). While we use the same positional encoding as MAE [1], we tested various masking ratios as discussed in Section 4.4.
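As a concrete illustration of this step, the following PyTorch sketch partitions a spectrogram into patches and randomly splits them according to a masking ratio. The helper names (patchify, random_split), the tensor shapes, and the single-channel input are our own assumptions for illustration, not the authors' implementation.

import torch

def patchify(spec, patch_size=16):
    # Split a batch of spectrograms (B, 1, F, T) into flattened patches (B, N, patch_size**2).
    B, C, F, T = spec.shape
    p = patch_size
    x = spec.reshape(B, C, F // p, p, T // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (F // p) * (T // p), p * p * C)
    return x

def random_split(patches, mask_ratio=0.6):
    # Randomly split patches into visible and masked sets, independently per sample.
    B, N, D = patches.shape
    num_masked = int(N * mask_ratio)
    noise = torch.rand(B, N, device=patches.device)    # random score per patch
    ids_shuffle = noise.argsort(dim=1)                 # random permutation of patch indices
    ids_masked = ids_shuffle[:, :num_masked]           # indices I_M of masked patches
    ids_visible = ids_shuffle[:, num_masked:]          # indices of visible patches
    gather = lambda ids: torch.gather(patches, 1, ids.unsqueeze(-1).expand(-1, -1, D))
    return gather(ids_visible), gather(ids_masked), ids_visible, ids_masked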
Online and target networks   The online network, defined by a set of weights $\theta$, encodes the visible patches $x_v$ using the online encoder $f_\theta$ into the representation $z_v = f_\theta(x_v)$. It concatenates shared, learnable masked tokens $m$ to $z_v$, adds the positional encoding $p$, and predicts $\hat{z}$, the representations of all input patches, using the predictor $g_\theta$:

$\hat{z} = g_\theta(\mathrm{concat}(z_v, m) + p)$.   (1)

It then filters the prediction result $\hat{z}$ to output $\hat{z}_m = \{\hat{z}[i] \mid i \in I_M\}$, containing only the masked patch representations, where $I_M$ is the set of indices of the masked patches.
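A minimal sketch of this online branch is shown below, assuming the index tensors produced by the masking step above. Here encoder and predictor stand in for $f_\theta$ and $g_\theta$ (e.g., ViT blocks), and we keep the (visible, masked) ordering rather than restoring the original patch order, gathering the positional encodings to match; both simplifications are our own.

import torch

def online_forward(encoder, predictor, mask_token, pos_embed,
                   visible_patches, ids_visible, ids_masked):
    # Encode visible patches, append shared mask tokens, add positional
    # encoding, predict all patches, and keep only the masked predictions.
    B, D = visible_patches.shape[0], pos_embed.shape[-1]
    z_v = encoder(visible_patches)                          # z_v = f_theta(x_v), (B, N_vis, D)
    num_masked = ids_masked.shape[1]
    m = mask_token.expand(B, num_masked, -1)                # shared learnable mask token
    z = torch.cat([z_v, m], dim=1)                          # concat(z_v, m)
    ids_all = torch.cat([ids_visible, ids_masked], dim=1)   # same ordering as z
    p = torch.gather(pos_embed.expand(B, -1, -1), 1,
                     ids_all.unsqueeze(-1).expand(-1, -1, D))
    z_hat = predictor(z + p)                                # Eq. (1)
    return z_hat[:, -num_masked:]                           # z_hat_m: masked patches only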
The target network is defined by parameters $\xi$ and consists only of the momentum encoder $f_\xi$, which is identical to the online encoder except for the parameters. The network encodes the masked patches $x_m$ using $f_\xi$ to output the representation $z_m = f_\xi(x_m)$. We then standardize $z_m$ to $\tilde{z}_m = (z_m - \mathrm{mean}(z_m))/\sqrt{\mathrm{var}(z_m)}$ to stabilize the training, which we confirmed empirically in preliminary experiments, rather than for a performance gain as in MAE.
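The target branch can be sketched as follows; computing the statistics per patch over the feature dimension, the small epsilon, and the function name are our assumptions.

import torch

@torch.no_grad()  # no gradient flows through the target network
def target_forward(momentum_encoder, masked_patches, eps=1e-6):
    # Encode only the masked patches with the momentum encoder f_xi,
    # then standardize the representations to stabilize training.
    z_m = momentum_encoder(masked_patches)         # z_m = f_xi(x_m), (B, N_mask, D)
    mean = z_m.mean(dim=-1, keepdim=True)
    var = z_m.var(dim=-1, keepdim=True)
    return (z_m - mean) / torch.sqrt(var + eps)    # standardized target z_m_tilde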
Calculating loss   The loss is calculated using the standardized target output $\tilde{z}_m$ as a training signal against the online prediction output $\hat{z}_m$. Inspired by BYOL [12], we calculate the loss $L$ as the mean square error (MSE) of the $\ell_2$-normalized $\hat{z}_m$ and $\tilde{z}_m$:

$L \triangleq \|\ell_2(\hat{z}_m) - \ell_2(\tilde{z}_m)\|_2^2 = 2 - 2 \cdot \dfrac{\langle \hat{z}_m, \tilde{z}_m \rangle}{\|\hat{z}_m\|_2 \cdot \|\tilde{z}_m\|_2}$,   (2)

where $\langle \cdot, \cdot \rangle$ denotes the inner product.
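In code, Eq. (2) reduces to a per-patch MSE between $\ell_2$-normalized vectors, equivalent to a scaled and shifted negative cosine similarity; averaging over patches and the batch in this sketch is our assumption.

import torch
import torch.nn.functional as F

def m2d_loss(z_hat_m, z_m_tilde):
    # MSE between l2-normalized online predictions and standardized targets, Eq. (2).
    pred = F.normalize(z_hat_m, dim=-1)       # l2(z_hat_m)
    target = F.normalize(z_m_tilde, dim=-1)   # l2(z_m_tilde)
    return (2.0 - 2.0 * (pred * target).sum(dim=-1)).mean()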
Updating network parameters   Our framework updates the parameters $\theta$ and $\xi$ after each training step. It updates $\theta$ only by minimizing the loss $L$, as depicted by the stop-gradient in Fig. 2, whereas $\xi$ is updated as a slowly moving exponential average of $\theta$ with a decay rate $\tau$:

$\xi \leftarrow \tau\xi + (1 - \tau)\theta$.   (3)
It has been empirically shown that the stop-gradient operation can avoid collapse to an uninformative solution, and that the moving-average behavior may lead to learning effective representations [11].
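A minimal sketch of the update in Eq. (3) follows; the decay rate value and the assumption that the target starts as a parameter-wise copy of the online encoder with gradients disabled are illustrative, not the exact training recipe.

import torch

def ema_update(online_encoder, target_encoder, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied parameter-wise (Eq. (3)).
    with torch.no_grad():
        for p_target, p_online in zip(target_encoder.parameters(),
                                      online_encoder.parameters()):
            p_target.mul_(tau).add_(p_online, alpha=1.0 - tau)

# Typically, the target encoder starts as a copy of the online encoder and
# never receives gradients, e.g.:
#   target_encoder = copy.deepcopy(online_encoder).requires_grad_(False)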
After the training, we transfer only $f_\theta$ as a pre-trained model.
4. EXPERIMENTS
We validated M2D through step-by-step experiments: we examined the effectiveness of learning representations directly by comparing our M2D with MAE (Section 4.2), the effectiveness of feeding only masked patches to the target (Section 4.3), and the impact of various masking ratios (Section 4.4), and we compared our method with SOTA (Section 4.5).
In all experiments, we applied our M2D to masked spectrogram modeling (MSM) [9], with an audio spectrogram as input to learn general-purpose audio representations. We evaluated the performance of pre-trained models in both a linear evaluation and fine-tuning on a variety of audio downstream tasks spanning environmental sounds, speech, and music.
4.1. Experimental Setup
Since we mainly focused on comparing M2D with MAE, we adapted the MAE implementation and settings with as few changes as possible.
We implemented an additional target network on top of the MAE code and adopted the MAE decoder as our predictor $g_\theta$ without changes. We used vanilla ViT-Base [27] with a 768-d output feature as our encoders ($f_\theta$ and $f_\xi$) and fixed the patch size to 16×16
for all experiments. We tested with masking ratios of 0.6 and 0.7,
which showed good performance in preliminary experiments.
We used MSM-MAE [9] as the MAE for comparison, an MAE variant optimized for MSM by making the decoder smaller