Individualized Conditioning and Negative Distances
for Speaker Separation
Tao Sun∗, Nidal Abuhajar∗, Shuyu Gong∗, Zhewei Wang§, Charles D. Smith†, Xianhui Wang‡, Li Xu‡, Jundong Liu∗
∗School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
†Department of Neurology, University of Kentucky, Lexington, KY 40536
‡Division of Communication Sciences, Ohio University, Athens, OH 45701
§Massachusetts General Hospital, Boston, MA 02114
Abstract—Speaker separation aims to extract multiple voices
from a mixed signal. In this paper, we propose two speaker-aware
designs to improve the existing speaker separation solutions.
The first model is a speaker conditioning network that integrates
speech samples to generate individualized speaker conditions,
which then provide informed guidance for a separation module
to produce well-separated outputs.
The second design aims to reduce non-target voices in the
separated speech. To this end, we propose negative distances to
penalize the appearance of any non-target voice in the channel
outputs, and positive distances to drive the separated voices closer
to the clean targets. We explore two different setups, weighted-sum and triplet-like, to integrate these two distances into a combined auxiliary loss for the separation networks. Experiments
conducted on LibriMix demonstrate the effectiveness of our
proposed models.
Index Terms—Speaker separation, conditioning, negative dis-
tances, speech representation, wav2vec, Conv-TasNet.
I. INTRODUCTION
Speech separation, also known as the cocktail party prob-
lem, aims to separate target speech from its background
interference [1]. It often serves as a preprocessing step in
real-world speech processing systems, including ASR, speaker
recognition, hearing prostheses, and mobile telecommunica-
tions. Speaker separation (SS) is a sub-problem of speech
separation, where the main goal is to extract multiple voices
from a mixed signal.
Following the mechanism of the human auditory system,
traditional speaker separation solutions commonly rely on cer-
tain heuristic grouping rules, such as periodicity and pitch tra-
jectories, to separate mixed signals [2]–[4]. Two-dimensional
ideal time-frequency (T-F) masks are often generated based
on these rules and applied to mixed signals to extract individ-
ual sources. Due to their hand-crafted nature, however, these grouping rules often have limited generalization capability in handling diverse real-world signals [5].
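To make the masking pipeline concrete, the sketch below builds oracle ideal ratio masks from clean sources and applies them to the mixture spectrogram. The STFT parameters and helper names are illustrative choices of ours; a rule-based method would instead estimate the masks from grouping cues such as pitch.

import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_masks(sources, fs=8000, nperseg=256):
    # Oracle ideal ratio masks computed from clean sources of equal length.
    specs = [stft(s, fs=fs, nperseg=nperseg)[2] for s in sources]
    denom = sum(np.abs(S) for S in specs) + 1e-8
    return [np.abs(S) / denom for S in specs]

def separate_with_masks(mixture, masks, fs=8000, nperseg=256):
    # Apply each T-F mask to the mixture spectrogram (same length as the
    # sources, so shapes match) and invert back to waveforms.
    _, _, M = stft(mixture, fs=fs, nperseg=nperseg)
    return [istft(mask * M, fs=fs, nperseg=nperseg)[1] for mask in masks]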
Similar to many other AI-related areas, deep neural net-
works (DNNs) have recently emerged as a dominant paradigm
to solve SS problems. Early DNNs were mostly frequency-domain models [6]–[10], aiming to approximate ideal T-F masks and relying on them to restore individual sources via the inverse short-time Fourier transform (STFT). Because the modified T-F representations may not correspond to the STFT of any time-domain signal, these methods commonly suffer from the so-called invalid STFT problem [11].
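The inconsistency can be checked numerically: a masked spectrogram is generally not the STFT of any waveform, so inverting it and re-analyzing the result does not reproduce the same T-F representation. The snippet below is a minimal demonstration with arbitrary signal and mask choices of our own.

import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                  # 2 s of noise at 8 kHz
_, _, X = stft(x, fs=8000, nperseg=256)

mask = rng.random(X.shape) > 0.5                # arbitrary binary T-F mask
X_mod = mask * X                                # modified (masked) spectrogram

_, x_hat = istft(X_mod, fs=8000, nperseg=256)   # back to the time domain
_, _, X_rt = stft(x_hat, fs=8000, nperseg=256)  # re-analyze the inverted signal

# The relative error is nonzero: X_mod is not the STFT of any waveform.
err = np.linalg.norm(X_rt - X_mod) / np.linalg.norm(X_mod)
print(f"relative consistency error: {err:.3f}")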
Waveform-based DNN models have grown in popularity
in recent years, partly because they can avoid the invalid
STFT problem [11]–[15]. Pioneered by TasNet [11] and Conv-
TasNet [13], early waveform solutions tackle the separation
task with three stages: encoding, separating, and decoding.
However, speaker information is often not explicitly integrated
into the network training and/or inference procedures.
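The three-stage structure can be summarized by the simplified PyTorch sketch below; the layer sizes are arbitrary and the plain convolutional separator merely stands in for the temporal convolutional network used in Conv-TasNet.

import torch
import torch.nn as nn

class WaveformSeparator(nn.Module):
    # Encoder -> mask-based separator -> decoder (simplified sketch).
    def __init__(self, n_src=2, n_filters=256, kernel_size=16, stride=8):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Placeholder separator; Conv-TasNet uses a temporal convolutional network here.
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.PReLU(),
            nn.Conv1d(n_filters, n_src * n_filters, 1),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mix):                       # mix: (batch, 1, time)
        feats = torch.relu(self.encoder(mix))     # (batch, F, frames)
        masks = torch.sigmoid(self.separator(feats))
        masks = masks.view(mix.size(0), self.n_src, -1, feats.size(-1))
        outs = [self.decoder(m * feats) for m in masks.unbind(dim=1)]
        return torch.stack(outs, dim=1)           # (batch, n_src, 1, time)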
Speaker-aware SS models [16]–[21] provide a remedy in
this regard. This group of solutions can be roughly divided
into speaker-conditioned methods [16]–[18] and auxiliary-loss-based methods [19]–[21]. The former rely on a speaker module to infer speaker information, which is then used as a condition by a separation module to generate separated output waveforms. The existing speaker-conditioned solutions, however, either do not operate in the time domain [16]–[18] or do not explicitly integrate speech information into the speaker conditioning process [18].
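As a purely illustrative example of such conditioning (the FiLM-style projection and modulation below are our own assumption, not the design of any cited model), a speaker embedding can be projected and used to modulate the separator's intermediate features:

import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    # Modulate separator features with a speaker embedding (FiLM-style sketch).
    def __init__(self, emb_dim=512, feat_dim=256):
        super().__init__()
        self.scale = nn.Linear(emb_dim, feat_dim)
        self.shift = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats, spk_emb):
        # feats: (batch, feat_dim, frames), spk_emb: (batch, emb_dim)
        gamma = self.scale(spk_emb).unsqueeze(-1)   # (batch, feat_dim, 1)
        beta = self.shift(spk_emb).unsqueeze(-1)
        return gamma * feats + beta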
Auxiliary-loss-based methods [19]–[21] achieve speaker awareness through composite loss functions. In addition to a main loss, one or more auxiliary losses are used to incorporate speaker information into the network training procedure.
Such auxiliary losses are commonly formulated to ensure
a match between network outputs and the target speakers.
However, to the best of our knowledge, no solution has
attempted to explicitly suppress voices from other non-target
speakers. As a result, residual voices of non-target speakers
are often noticeable in the network outputs.
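A typical composite objective of this kind adds a speaker-matching term to the separation loss. The sketch below uses a cosine distance between embeddings of the estimate and its target as the auxiliary term; the weighting and the embedding function are placeholders rather than the formulation of any cited work.

import torch.nn.functional as F

def composite_loss(est, target, spk_embed, main_loss_fn, alpha=0.1):
    # est, target:  (batch, time) separated and clean waveforms
    # spk_embed:    callable mapping a batch of waveforms to speaker embeddings
    # main_loss_fn: e.g., a negative SI-SNR loss
    main = main_loss_fn(est, target)
    aux = 1.0 - F.cosine_similarity(spk_embed(est), spk_embed(target), dim=-1).mean()
    return main + alpha * aux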
In this paper, we propose two waveform speaker separa-
tion models to address the aforementioned limitations. The
first model is a speaker conditioning network that integrates
individual speech samples in the speaker module to pro-
duce tailored speaker conditions. The integration is based on
speaker embeddings computed through a pretrained speaker
recognition network. The second solution aims to completely
suppress non-target speaker voices in the separated speech. We
propose an auxiliary loss with two terms: the first drives the
separated voices close to target clean voices, while the second
term penalizes the appearance of any non-target voice in the channel outputs.
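As a rough illustration of how such positive and negative distances might be combined in weighted-sum and triplet-like forms (the distance function, margin, and weight below are illustrative placeholders, not the paper's actual formulation):

import torch
import torch.nn.functional as F

def positive_negative_losses(est, target, non_target, margin=1.0, w_neg=0.5):
    # est, target, non_target: (batch, time) waveforms; plain MSE in waveform
    # space stands in for whatever distance the actual feature space would use.
    d_pos = F.mse_loss(est, target)       # positive distance: pull estimate toward its clean target
    d_neg = F.mse_loss(est, non_target)   # negative distance: gap to the non-target source
    weighted_sum = d_pos - w_neg * d_neg                          # weighted-sum combination
    triplet_like = torch.clamp(d_pos - d_neg + margin, min=0.0)   # triplet-like combination
    return weighted_sum, triplet_like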