ADVERSARIAL PERMUTATION INVARIANT TRAINING
FOR UNIVERSAL SOUND SEPARATION
Emilian Postolache1,2, Jordi Pons1, Santiago Pascual1, Joan Serrà1
1Dolby Laboratories 2Sapienza University of Rome
ABSTRACT
Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators. Our experiments show that by simply improving the loss (keeping the same model and dataset) we obtain a non-negligible improvement of 1.4 dB SI-SNRi in the reverberant FUSS dataset. We also find adversarial PIT to be effective at reducing spectral holes, ubiquitous in mask-based separation models, which highlights the potential relevance of adversarial losses for source separation.
Index Terms— Adversarial, PIT, universal source separation.
1. INTRODUCTION
Audio source separation consists of separating the sources present in an audio mix, as in music source separation (separating vocals, bass, and drums from a music mix [1–3]) or speech source separation (separating various speakers talking simultaneously [4–6]). Recently, universal sound separation was proposed [7]. It consists of building source agnostic models that are not constrained to a specific domain (like music or speech) and can separate any source given an arbitrary mix. Permutation invariant training (PIT) [8] is used for training universal source separation models based on deep learning [7, 9, 10].
We consider mixes $m$ of length $L$ with $K'$ arbitrary sources $s_k$ as follows: $m = \sum_{k=1}^{K'} s_k$, out of which the separator model $f_\theta$ predicts $K$ sources $\hat{s} = f_\theta(m)$. PIT optimizes the learnable parameters $\theta$ of $f_\theta$ by minimizing the following permutation invariant loss:

$$\mathcal{L}_{\mathrm{PIT}} = \min_{P} \sum_{k=1}^{K} \mathcal{L}\left(s_k, [P\hat{s}]_k\right), \tag{1}$$
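As an illustrative sketch (not the paper's implementation), the permutation search in Eq. (1) can be written in numpy with an exhaustive loop over permutations, here assuming a plain MSE stands in for the regression loss $\mathcal{L}$:

```python
import itertools
import numpy as np

def pit_loss(s, s_hat, loss_fn):
    """Permutation invariant loss (Eq. 1): minimum over all
    permutations of the summed per-source loss.
    s, s_hat: arrays of shape (K, L)."""
    K = s.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(K)):
        total = sum(loss_fn(s[k], s_hat[p]) for k, p in zip(range(K), perm))
        best = min(best, total)
    return best

# Illustrative regression loss (a stand-in for the paper's choice of L).
mse = lambda a, b: np.mean((a - b) ** 2)

s = np.array([[1.0, 0.0], [0.0, 1.0]])
s_hat = np.array([[0.0, 1.0], [1.0, 0.0]])  # correct sources, swapped order
# pit_loss is zero here: the swapped permutation matches the targets exactly.
```

The exhaustive search is $O(K!)$, which is tractable for the small $K$ typical of separation models; efficient implementations often use the Hungarian algorithm instead.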
where we consider all permutation matrices $P$, $P^*$ is the optimal permutation matrix minimizing Eq. (1), and $\mathcal{L}$ can be any regression loss. Since $f_\theta$ outputs $K$ sources, in case a mix contains $K' < K$ sources, we set the target $s_k = 0$ for $k > K'$. Note that a permutation invariant loss is required to build source agnostic models, because the outputs of $f_\theta$ can be any source and in any order. As such, the model must not focus on predicting one source type per output, and any possible permutation of output sources must be equally correct [7, 8]. A common loss $\mathcal{L}$ for universal sound separation is the $\tau$-thresholded logarithmic mean squared error [7, 9], which is unbounded when $s_k = 0$. In that case, since $m \neq 0$, one can use a different $\mathcal{L}$ based on thresholding with respect to the mixture [9]:

$$\mathcal{L}(s_k, \hat{s}_k) = \begin{cases} 10 \log_{10}\left(\|\hat{s}_k\|^2 + \tau\|m\|^2\right) & \text{if } s_k = 0, \\ 10 \log_{10}\left(\|s_k - \hat{s}_k\|^2 + \tau\|s_k\|^2\right) & \text{otherwise.} \end{cases} \tag{2}$$
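A minimal numpy sketch of Eq. (2) follows; the default value of `tau` here is only a placeholder for illustration (the cited works derive $\tau$ from a maximum-SNR threshold), not the value used in the paper:

```python
import numpy as np

def thresholded_log_mse(s_k, s_hat_k, m, tau=1e-3):
    """Mixture-thresholded logarithmic MSE (Eq. 2).
    When the target source is silent (s_k == 0), the threshold term
    uses the mixture energy instead of the (zero) target energy."""
    if not np.any(s_k):
        return 10.0 * np.log10(np.sum(s_hat_k ** 2) + tau * np.sum(m ** 2))
    return 10.0 * np.log10(np.sum((s_k - s_hat_k) ** 2) + tau * np.sum(s_k ** 2))
```

The mixture term in the silent branch keeps the loss bounded below, since $m \neq 0$ guarantees a strictly positive argument to the logarithm.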
Equal contribution.
In this work, we complement PIT with adversarial losses for universal sound separation. A number of speech source separation works also complemented PIT with adversarial losses [11–14]. Yet, we find that the adversarial PIT formulation used in speech separation does not perform well for universal source separation (sections 3 and 4). To improve upon that, in section 2 we extend speech separation works by introducing a novel I-replacement context-based adversarial loss, combining multiple discriminators, and generalizing adversarial PIT such that it works for universal sound separation (with source agnostic discriminators dealing with more than two sources). Table 1 outlines how our approach compares with speech separation works.
2. ADVERSARIAL PIT
Adversarial training, in the context of source separation, consists of simultaneously training two models: $f_\theta$, producing plausible separations $\hat{s}$, and one (or multiple) discriminator(s) $D$, assessing if separations $\hat{s}$ are produced by $f_\theta$ (fake) or are ground-truth separations $s$ (real). Under this setup, the goal of $f_\theta$ is to estimate (fake) separations that are as close as possible to the (real) ones from the dataset, such that $D$ misclassifies $\hat{s}$ as $s$ [15, 16]. We propose combining variations of an instance-based discriminator $D_{\mathrm{inst}}$ with a novel I-replacement context-based discriminator $D_{\mathrm{ctx},I}$. Each $D$ has a different role and is applicable to various domains: waveforms, magnitude STFTs, or masks. Without loss of generality, we present $D_{\mathrm{inst}}$ and $D_{\mathrm{ctx},I}$ in the waveform domain and then show how to combine multiple discriminators operating at various domains to train $f_\theta$.
Instance-based adversarial loss — The role of $D_{\mathrm{inst}}$ is to provide adversarial cues on the realness of the separated sources without context. That is, $D_{\mathrm{inst}}$ assesses the realness of each source individually:

$$[s_1]/[\hat{s}_1] \quad \ldots \quad [s_K]/[\hat{s}_K].$$

Throughout the paper, we use brackets [ ] to denote $D$'s input and left / right for real / fake separations (not division). Hence, individual real / fake separations (instances) are input to $D_{\mathrm{inst}}$, which learns to classify them as real / fake (Fig. 1). $D_{\mathrm{inst}}$ is trained to maximize
$$\mathcal{L}_{\mathrm{inst}} = \frac{1}{K} \sum_{k=1}^{K} \left( \mathcal{L}^{\mathrm{real},k}_{\mathrm{inst}} + \mathcal{L}^{\mathrm{fake},k}_{\mathrm{inst}} \right),$$

where $\mathcal{L}^{\mathrm{real},k}_{\mathrm{inst}}$ and $\mathcal{L}^{\mathrm{fake},k}_{\mathrm{inst}}$ correspond to the hinge loss [17]:

$$\mathcal{L}^{\mathrm{real},k}_{\mathrm{inst}} = \min\left(0, -1 + D_{\mathrm{inst}}(s_k)\right),$$
$$\mathcal{L}^{\mathrm{fake},k}_{\mathrm{inst}} = \min\left(0, -1 - D_{\mathrm{inst}}(\hat{s}_k)\right).$$
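For illustration, the hinge objective that $D_{\mathrm{inst}}$ maximizes can be sketched as below; the score arrays are hypothetical stand-ins for discriminator outputs on the $K$ real and fake sources:

```python
import numpy as np

def d_hinge_objective(d_real, d_fake):
    """Hinge objective for D_inst, averaged over the K sources.
    d_real, d_fake: discriminator scores for real / fake sources.
    D_inst is trained to *maximize* this quantity (it is at most 0)."""
    l_real = np.minimum(0.0, -1.0 + d_real)  # rewards real scores >= 1
    l_fake = np.minimum(0.0, -1.0 - d_fake)  # rewards fake scores <= -1
    return np.mean(l_real + l_fake)
```

A perfect discriminator (real scores $\geq 1$, fake scores $\leq -1$) attains the maximum value of 0; an uninformative one scoring everything 0 attains $-2$.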
Previous works also explored using $D_{\mathrm{inst}}$. However, they used source specific setups where each $D_{\mathrm{inst}}$ was specialized in a source type: e.g., for music source separation each $D_{\mathrm{inst}}$ was specialized in bass, drums, or vocals [1, 18], and for speech source separation $D_{\mathrm{inst}}$ was specialized in speech [12, 19]. In contrast, each $D_{\mathrm{inst}}$ for universal sound separation is not specialized in any source type (it is source agnostic) and assesses the realness of any audio, regardless of its source type.
arXiv:2210.12108v2 [cs.SD] 6 Mar 2023