
ADVERSARIAL PERMUTATION INVARIANT TRAINING
FOR UNIVERSAL SOUND SEPARATION
Emilian Postolache∗1,2, Jordi Pons∗1, Santiago Pascual1, Joan Serrà1
1Dolby Laboratories  2Sapienza University of Rome
∗Equal contribution.
ABSTRACT
Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators. Our experiments show that by simply improving the loss (keeping the same model and dataset) we obtain a non-negligible improvement of 1.4 dB SI-SNRi in the reverberant FUSS dataset. We also find adversarial PIT to be effective at reducing spectral holes, ubiquitous in mask-based separation models, which highlights the potential relevance of adversarial losses for source separation.
Index Terms—Adversarial, PIT, universal source separation.
1. INTRODUCTION
Audio source separation consists of separating the sources present in an audio mix, as in music source separation (separating vocals, bass, and drums from a music mix [1–3]) or speech source separation (separating various speakers talking simultaneously [4–6]). Recently, universal sound separation was proposed [7]. It consists of building source agnostic models that are not constrained to a specific domain (like music or speech) and can separate any source given an arbitrary mix. Permutation invariant training (PIT) [8] is used for training universal source separation models based on deep learning [7, 9, 10].
We consider mixes m of length L with K' arbitrary sources s as follows: $m = \sum_{k=1}^{K'} s_k$, out of which the separator model fθ predicts K sources ŝ = fθ(m). PIT optimizes the learnable parameters θ of fθ by minimizing the following permutation invariant loss:
$$\mathcal{L}_{\text{PIT}} = \min_{P} \sum_{k=1}^{K} \mathcal{L}\big(s_k, [P\hat{s}]_k\big), \tag{1}$$
where we consider all permutation matrices P, P* is the optimal permutation matrix minimizing Eq. 1, and L can be any regression loss. Since fθ outputs K sources, in case a mix contains K' < K sources, we set the target sk = 0 for k > K'. Note that a permutation invariant loss is required to build source agnostic models, because the outputs of fθ can be any source and in any order. As such, the model must not focus on predicting one source type per output, and any possible permutation of output sources must be equally correct [7, 8]. A common loss L for universal sound separation is the τ-thresholded logarithmic mean squared error [7, 9], which is unbounded when sk = 0. In that case, since m ≠ 0, one can use a different L based on thresholding with respect to the mixture [9]:
$$\mathcal{L}(s_k, \hat{s}_k) = \begin{cases} 10 \log_{10}\!\big(\|\hat{s}_k\|^2 + \tau\|m\|^2\big) & \text{if } s_k = 0 \\ 10 \log_{10}\!\big(\|s_k - \hat{s}_k\|^2 + \tau\|s_k\|^2\big) & \text{otherwise.} \end{cases} \tag{2}$$
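To make Eqs. 1 and 2 concrete, here is a minimal PyTorch sketch of the PIT objective with the mixture-thresholded loss. The function names, tensor shapes, the value of τ, and the exhaustive search over permutations are our illustrative assumptions, not the paper's implementation:

```python
import itertools

import torch


def thresholded_log_mse(s_k, s_hat_k, m, tau=1e-3):
    # Eq. 2: mixture-thresholded logarithmic MSE for one source.
    # tau=1e-3 is a hypothetical threshold; the paper does not fix it here.
    if torch.all(s_k == 0):
        return 10 * torch.log10(s_hat_k.pow(2).sum() + tau * m.pow(2).sum())
    return 10 * torch.log10((s_k - s_hat_k).pow(2).sum() + tau * s_k.pow(2).sum())


def pit_loss(s, s_hat, m, tau=1e-3):
    # Eq. 1: minimum over all K! output permutations of the summed losses.
    # s, s_hat: (K, L) target / estimated sources; m: (L,) mixture.
    K = s.shape[0]
    per_perm = [
        sum(thresholded_log_mse(s[k], s_hat[p], m, tau) for k, p in enumerate(perm))
        for perm in itertools.permutations(range(K))
    ]
    return torch.stack(per_perm).min()  # the argmin would recover P*
```

The exhaustive search is factorial in K, which remains tractable for the small source counts typical of universal separation setups.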
In this work, we complement PIT with adversarial losses for universal sound separation. A number of speech source separation works also complemented PIT with adversarial losses [11–14]. Yet, we find that the adversarial PIT formulation used in speech separation does not perform well for universal source separation (sections 3 and 4). To improve upon that, in section 2 we extend speech separation works by introducing a novel I-replacement context-based adversarial loss, by combining multiple discriminators, and by generalizing adversarial PIT so that it works for universal sound separation (with source agnostic discriminators dealing with more than two sources). Table 1 outlines how our approach compares with speech separation works.
2. ADVERSARIAL PIT
Adversarial training, in the context of source separation, consists of simultaneously training two models: fθ, producing plausible separations ŝ, and one (or multiple) discriminator(s) D, assessing if separations ŝ are produced by fθ (fake) or are ground-truth separations s (real). Under this setup, the goal of fθ is to estimate (fake) separations that are as close as possible to the (real) ones from the dataset, such that D misclassifies ŝ as s [15, 16]. We propose combining variations of an instance-based discriminator Dinst with a novel I-replacement context-based discriminator Dctx,I. Each D has a different role and is applicable to various domains: waveforms, magnitude STFTs, or masks. Without loss of generality, we present Dinst and Dctx,I in the waveform domain, and then show how to combine multiple discriminators operating at various domains to train fθ.
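As a rough illustration of this two-player setup, the sketch below alternates a discriminator update and a separator update in PyTorch. The model and optimizer names are hypothetical, pit_loss refers to the earlier sketch, the adversarial weighting is an assumption, and the hinge terms anticipate the losses defined next:

```python
import torch.nn.functional as F


def adversarial_pit_step(f, D, opt_f, opt_D, m, s, adv_weight=1.0):
    # f: separator; D: discriminator assumed to score each source (waveforms).
    s_hat = f(m)  # (K, L) fake separations

    # Discriminator step: push scores above +1 on real sources and below -1
    # on fakes (minimizing these hinge terms maximizes L_inst, defined below).
    opt_D.zero_grad()
    d_loss = F.relu(1.0 - D(s)).mean() + F.relu(1.0 + D(s_hat.detach())).mean()
    d_loss.backward()
    opt_D.step()

    # Separator step: regression (PIT) loss plus an adversarial term that
    # rewards separations scored as real; adv_weight is a hypothetical knob.
    opt_f.zero_grad()
    g_loss = pit_loss(s, s_hat, m) - adv_weight * D(s_hat).mean()
    g_loss.backward()
    opt_f.step()
```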
Instance-based adversarial loss — The role of Dinst is to provide adversarial cues on the realness of the separated sources without context. That is, Dinst assesses the realness of each source individually:
[s1]/[ŝ1] … [sK]/[ŝK].
Throughout the paper, we use brackets [ ] to denote D's input, and left / right for real / fake separations (not division). Hence, individual real / fake separations (instances) are input to Dinst, which learns to classify them as real / fake (Fig. 1). Dinst is trained to maximize
$$\mathcal{L}_{\text{inst}} = \frac{1}{K} \sum_{k=1}^{K} \big( \mathcal{L}^{\text{real},k}_{\text{inst}} + \mathcal{L}^{\text{fake},k}_{\text{inst}} \big),$$
where $\mathcal{L}^{\text{real},k}_{\text{inst}}$ and $\mathcal{L}^{\text{fake},k}_{\text{inst}}$ correspond to the hinge loss [17]:
$$\mathcal{L}^{\text{real},k}_{\text{inst}} = \min\big(0, -1 + D_{\text{inst}}(s_k)\big),$$
$$\mathcal{L}^{\text{fake},k}_{\text{inst}} = \min\big(0, -1 - D_{\text{inst}}(\hat{s}_k)\big).$$
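A direct transcription of this objective might look as follows; the (K, L) shape convention and the scalar scoring interface of Dinst are our assumptions:

```python
import torch


def l_inst(D_inst, s, s_hat):
    # Hinge objective that D_inst is trained to MAXIMIZE.
    # s, s_hat: (K, L) tensors holding the K real / fake sources.
    K = s.shape[0]
    total = torch.zeros(())
    for k in range(K):
        total = total + (-1.0 + D_inst(s[k])).clamp(max=0.0)      # L_inst^{real,k}
        total = total + (-1.0 - D_inst(s_hat[k])).clamp(max=0.0)  # L_inst^{fake,k}
    return total / K
```

In practice one minimizes −L_inst with gradient descent, which recovers the relu-based hinge terms used in the training-step sketch above.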
Previous works also explored using Dinst. However, they used source specific setups where each Dinst was specialized in a source type, e.g., for music source separation each Dinst was specialized in bass, drums, or vocals [1, 18], and for speech source separation Dinst was specialized in speech [12, 19]. Yet, each Dinst for universal sound separation is not specialized in any source type (it is source agnostic) and assesses the realness of any audio, regardless of its source type.