
ADVERSARIAL PERMUTATION INVARIANT TRAINING
FOR UNIVERSAL SOUND SEPARATION
Emilian Postolache∗1,2, Jordi Pons∗1, Santiago Pascual1, Joan Serrà1
1Dolby Laboratories  2Sapienza University of Rome
∗Equal contribution.
ABSTRACT
Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators. Our experiments show that by simply improving the loss (keeping the same model and dataset) we obtain a non-negligible improvement of 1.4 dB SI-SNRi in the reverberant FUSS dataset. We also find adversarial PIT to be effective at reducing spectral holes, ubiquitous in mask-based separation models, which highlights the potential relevance of adversarial losses for source separation.
Index Terms—Adversarial, PIT, universal source separation.
1. INTRODUCTION
Audio source separation consists of separating the sources present in an audio mix, as in music source separation (separating vocals, bass, and drums from a music mix [1–3]) or speech source separation (separating various speakers talking simultaneously [4–6]). Recently, universal sound separation was proposed [7]. It consists of building source agnostic models that are not constrained to a specific domain (like music or speech) and can separate any source given an arbitrary mix. Permutation invariant training (PIT) [8] is used for training universal source separation models based on deep learning [7, 9, 10].
We consider mixes m of length L with K' arbitrary sources s as follows: $m = \sum_{k=1}^{K'} s_k$, out of which the separator model fθ predicts K sources ŝ = fθ(m). PIT optimizes the learnable parameters θ of fθ by minimizing the following permutation invariant loss:
$$\mathcal{L}_{\text{PIT}} = \min_{P} \sum_{k=1}^{K} \mathcal{L}\big(s_k, [P\hat{s}]_k\big), \tag{1}$$
where we consider all permutation matrices P, P* is the optimal permutation matrix minimizing Eq. 1, and L can be any regression loss. Since fθ outputs K sources, in case a mix contains K' < K sources, we set the target sk = 0 for k > K'. Note that a permutation invariant loss is required to build source agnostic models, because the outputs of fθ can be any source and in any order. As such, the model must not focus on predicting one source type per output, and any possible permutation of output sources must be equally correct [7, 8]. A common loss L for universal sound separation is the τ-thresholded logarithmic mean squared error [7, 9], which is unbounded when sk = 0. In that case, since m ≠ 0, one can use a different L based on thresholding with respect to the mixture [9]:
$$\mathcal{L}(s_k, \hat{s}_k) = \begin{cases} 10 \log_{10}\!\big(\|\hat{s}_k\|^2 + \tau\|m\|^2\big) & \text{if } s_k = 0 \\ 10 \log_{10}\!\big(\|s_k - \hat{s}_k\|^2 + \tau\|s_k\|^2\big) & \text{otherwise.} \end{cases} \tag{2}$$
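To make Eqs. 1 and 2 concrete, here is a minimal PyTorch sketch of the PIT objective with the mixture-thresholded loss. The function names, tensor shapes, the value of τ, and the exhaustive search over permutations are our illustrative assumptions, not the paper's implementation:

```python
import itertools

import torch


def thresholded_log_mse(s_k, s_hat_k, m, tau=1e-3):
    # Eq. 2: mixture-thresholded logarithmic MSE for one source.
    # tau=1e-3 is a hypothetical threshold; the paper does not fix it here.
    if torch.all(s_k == 0):
        return 10 * torch.log10(s_hat_k.pow(2).sum() + tau * m.pow(2).sum())
    return 10 * torch.log10((s_k - s_hat_k).pow(2).sum() + tau * s_k.pow(2).sum())


def pit_loss(s, s_hat, m, tau=1e-3):
    # Eq. 1: minimum over all K! output permutations of the summed losses.
    # s, s_hat: (K, L) target / estimated sources; m: (L,) mixture.
    K = s.shape[0]
    per_perm = [
        sum(thresholded_log_mse(s[k], s_hat[p], m, tau) for k, p in enumerate(perm))
        for perm in itertools.permutations(range(K))
    ]
    return torch.stack(per_perm).min()  # the argmin would recover P*
```

The exhaustive search is factorial in K, which remains tractable for the small source counts typical of universal separation setups.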
In this work, we complement PIT with adversarial losses for universal sound separation. A number of speech source separation works also complemented PIT with adversarial losses [11–14]. Yet, we find that the adversarial PIT formulation used in speech separation does not perform well for universal source separation (sections 3 and 4). To improve upon that, in section 2 we extend speech separation works by introducing a novel I-replacement context-based adversarial loss, by combining multiple discriminators, and by generalizing adversarial PIT so that it works for universal sound separation (with source agnostic discriminators dealing with more than two sources). Table 1 outlines how our approach compares with speech separation works.
2. ADVERSARIAL PIT
Adversarial training, in the context of source separation, consists of simultaneously training two models: fθ, producing plausible separations ŝ, and one (or multiple) discriminator(s) D, assessing if separations ŝ are produced by fθ (fake) or are ground-truth separations s (real). Under this setup, the goal of fθ is to estimate (fake) separations that are as close as possible to the (real) ones from the dataset, such that D misclassifies ŝ as s [15, 16]. We propose combining variations of an instance-based discriminator Dinst with a novel I-replacement context-based discriminator Dctx,I. Each D has a different role and is applicable to various domains: waveforms, magnitude STFTs, or masks. Without loss of generality, we present Dinst and Dctx,I in the waveform domain, and then show how to combine multiple discriminators operating at various domains to train fθ.
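As a rough illustration of this two-player setup, the sketch below alternates a discriminator update and a separator update in PyTorch. The model and optimizer names are hypothetical, pit_loss refers to the earlier sketch, the adversarial weighting is an assumption, and the hinge terms anticipate the losses defined next:

```python
import torch.nn.functional as F


def adversarial_pit_step(f, D, opt_f, opt_D, m, s, adv_weight=1.0):
    # f: separator; D: discriminator assumed to score each source (waveforms).
    s_hat = f(m)  # (K, L) fake separations

    # Discriminator step: push scores above +1 on real sources and below -1
    # on fakes (minimizing these hinge terms maximizes L_inst, defined below).
    opt_D.zero_grad()
    d_loss = F.relu(1.0 - D(s)).mean() + F.relu(1.0 + D(s_hat.detach())).mean()
    d_loss.backward()
    opt_D.step()

    # Separator step: regression (PIT) loss plus an adversarial term that
    # rewards separations scored as real; adv_weight is a hypothetical knob.
    opt_f.zero_grad()
    g_loss = pit_loss(s, s_hat, m) - adv_weight * D(s_hat).mean()
    g_loss.backward()
    opt_f.step()
```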
Instance-based adversarial loss — The role of Dinst is to provide adversarial cues on the realness of the separated sources without context. That is, Dinst assesses the realness of each source individually:
[s1]/[ŝ1] … [sK]/[ŝK].
Throughout the paper, we use brackets [ ] to denote D's input, and left / right for real / fake separations (not division). Hence, individual real / fake separations (instances) are input to Dinst, which learns to classify them as real / fake (Fig. 1). Dinst is trained to maximize
$$\mathcal{L}_{\text{inst}} = \frac{1}{K} \sum_{k=1}^{K} \big( \mathcal{L}^{\text{real},k}_{\text{inst}} + \mathcal{L}^{\text{fake},k}_{\text{inst}} \big),$$
where $\mathcal{L}^{\text{real},k}_{\text{inst}}$ and $\mathcal{L}^{\text{fake},k}_{\text{inst}}$ correspond to the hinge loss [17]:
$$\mathcal{L}^{\text{real},k}_{\text{inst}} = \min\big(0, -1 + D_{\text{inst}}(s_k)\big),$$
$$\mathcal{L}^{\text{fake},k}_{\text{inst}} = \min\big(0, -1 - D_{\text{inst}}(\hat{s}_k)\big).$$
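A direct transcription of this objective might look as follows; the (K, L) shape convention and the scalar scoring interface of Dinst are our assumptions:

```python
import torch


def l_inst(D_inst, s, s_hat):
    # Hinge objective that D_inst is trained to MAXIMIZE.
    # s, s_hat: (K, L) tensors holding the K real / fake sources.
    K = s.shape[0]
    total = torch.zeros(())
    for k in range(K):
        total = total + (-1.0 + D_inst(s[k])).clamp(max=0.0)      # L_inst^{real,k}
        total = total + (-1.0 - D_inst(s_hat[k])).clamp(max=0.0)  # L_inst^{fake,k}
    return total / K
```

In practice one minimizes −L_inst with gradient descent, which recovers the relu-based hinge terms used in the training-step sketch above.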
Previous works also explored using Dinst. However, they used source specific setups where each Dinst was specialized in a source type, e.g., for music source separation each Dinst was specialized in bass, drums, or vocals [1, 18], and for speech source separation Dinst was specialized in speech [12, 19]. Yet, each Dinst for universal sound separation is not specialized in any source type (it is source agnostic) and assesses the realness of any audio, regardless of its source type.