Individualized Conditioning and Negative Distances
for Speaker Separation
Tao Sun∗, Nidal Abuhajar∗, Shuyu Gong∗, Zhewei Wang§, Charles D. Smith†, Xianhui Wang‡, Li Xu‡, Jundong Liu∗
∗School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
†Department of Neurology, University of Kentucky, Lexington, KY 40536
‡Division of Communication Sciences, Ohio University, Athens, OH 45701
§Massachusetts General Hospital, Boston, MA 02114
Abstract—Speaker separation aims to extract multiple voices
from a mixed signal. In this paper, we propose two speaker-aware
designs to improve existing speaker separation solutions.
The first model is a speaker conditioning network that integrates
speech samples to generate individualized speaker conditions,
which then provide informed guidance for a separation module
to produce well-separated outputs.
The second design aims to reduce non-target voices in the
separated speech. To this end, we propose negative distances to
penalize the appearance of any non-target voice in the channel
outputs, and positive distances to drive the separated voices closer
to the clean targets. We explore two different setups, weighted-
sum and triplet-like, to integrate these two distances to form a
combined auxiliary loss for the separation networks. Experiments
conducted on LibriMix demonstrate the effectiveness of our
proposed models.
Index Terms—Speaker separation, conditioning, negative dis-
tances, speech representation, wav2vec, Conv-TasNet.
I. INTRODUCTION
Speech separation, also known as the cocktail party prob-
lem, aims to separate a target speech from its background
interference [1]. It often serves as a preprocessing step in
real-world speech processing systems, including ASR, speaker
recognition, hearing prostheses, and mobile telecommunica-
tions. Speaker separation (SS) is a sub-problem of speech
separation, where the main goal is to extract multiple voices
from a mixed signal.
Following the mechanism of the human auditory system,
traditional speaker separation solutions commonly rely on cer-
tain heuristic grouping rules, such as periodicity and pitch tra-
jectories, to separate mixed signals [2]–[4]. Two-dimensional
ideal time-frequency (T-F) masks are often generated based
on these rules and applied to mixed signals to extract individ-
ual sources. Due to their hand-crafted nature, however, these
grouping rules often have limited generalization capability in
handling diverse real-world signals [5].
Similar to many other AI-related areas, deep neural net-
works (DNNs) have recently emerged as a dominant paradigm
to solve SS problems. Early DNNs were mostly frequency-
domain models [6]–[10], aiming to approximate ideal T-F
masks and rely on them to restore individual sources through
short-time Fourier transform (STFT). As the modified T-F
representations may not be converted back to the time domain,
these methods commonly suffer from the so-called invalid
STFT problem [11].
Waveform-based DNN models have grown in popularity
in recent years, partly because they can avoid the invalid
STFT problem [11]–[15]. Pioneered by TasNet [11] and Conv-
TasNet [13], early waveform solutions tackle the separation
task with three stages: encoding, separating, and decoding.
However, speaker information is often not explicitly integrated
into the network training and/or inference procedures.
Speaker-aware SS models [16]–[21] provide a remedy in
this regard. This group of solutions can be roughly divided
into speaker-conditioned methods [16]–[18] and auxiliary-
loss based methods [19]–[21]. The former rely on a speaker
module to infer speaker information, which is then taken
as conditions by a separation module to generate separated
output waveforms. The existing speaker-conditioned solutions,
however, are either not in the time domain [16]–[18] or do
not explicitly integrate speech information into the speaker
conditioning process [18].
Auxiliary-loss based methods [19]–[21] achieve speaker
awareness through composite loss functions. In addition to a
main loss, an auxiliary loss (or losses) is used to incorporate
speakers’ information into the network training procedure.
Such auxiliary losses are commonly formulated to ensure
a match between network outputs and the target speakers.
However, to the best of our knowledge, no solution has
attempted to explicitly suppress voices from other non-target
speakers. As a result, residual voices of non-target speakers
are often noticeable in the network outputs.
In this paper, we propose two waveform speaker separa-
tion models to address the aforementioned limitations. The
first model is a speaker conditioning network that integrates
individual speech samples in the speaker module to pro-
duce tailored speaker conditions. The integration is based on
speaker embeddings computed through a pretrained speaker
recognition network. The second solution aims to completely
suppress non-target speaker voices in the separated speech. We
propose an auxiliary loss with two terms: the first drives the
separated voices close to target clean voices, while the second
term penalizes the appearance of any non-target voice in the
separated outputs. The latter, which we call negative distances,
is achieved by maximizing distances between the speech
representations of extracted sources and those of the non-target
sources. We also explore different schemes to integrate the
proposed distances.
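For intuition, the sketch below shows one way the weighted-sum setup could combine the two distance terms in PyTorch; the squared-error distance over representations, the helper names, and the weight lam are assumptions made for illustration, not the exact formulation used by our networks.

```python
import torch

def auxiliary_loss(est_reps, tgt_reps, lam=1.0):
    """Weighted-sum auxiliary loss over speech representations.

    est_reps[i]: representation of the i-th separated channel;
    tgt_reps[j]: representation of the j-th clean source.
    """
    C = len(est_reps)
    # Positive distances: pull each channel toward its own target.
    pos = sum((est_reps[i] - tgt_reps[i]).pow(2).mean() for i in range(C))
    # Negative distances: push each channel away from non-target sources.
    neg = sum((est_reps[i] - tgt_reps[j]).pow(2).mean()
              for i in range(C) for j in range(C) if j != i)
    # Minimizing this loss minimizes pos while maximizing neg.
    return pos - lam * neg
```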
II. BACKGROUND
This section provides background knowledge for our
proposed speaker-aware SS models, covering conditioning in
machine learning, the triplet loss, and speech representations
generated through self-supervised learning.
A. Conditioning and FiLM
In everyday life, it is often helpful to process one source of
information in the context of another. For example, video and
audio in a movie can be better understood in the context of
each other. This context-based processing is called condition-
ing in machine learning, where computations through a model
are conditioned or modulated by information extracted from
auxiliary inputs. For speaker-conditioned speaker separation,
conditioning in a network can be done through its separation
module, which would take speaker information as the context
to produce the output voices.
Feature-wise Linear Modulation (FiLM) [22] is a popular
feature conditioning method that has been shown to enhance the
performance of neural network solutions for a variety of tasks,
including visual reasoning and speech separation [16], [18].
Fig. 1: Feature-wise Linear Modulation (FiLM) architecture.
FiLM learns to adaptively influence the output of a neural
network by applying an affine transformation to the network’s
intermediate features, based on some input. As shown in
Fig. 1, a FiLM conditioning architecture consists of a FiLM
generator and one or more FiLM layers. The generator takes
a conditioning representation as input and generates FiLM
vector parameters, which are later used in the FiLM layers
to modulate the input with an affine transformation, i.e., a
combination of conditional biasing and conditional scaling.
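For concreteness, here is a minimal PyTorch sketch of a single FiLM layer; the module name, the linear generators, and the tensor shapes are illustrative assumptions rather than the architecture of any cited system.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Modulates intermediate features with an affine transform whose
    parameters are predicted from a conditioning representation."""
    def __init__(self, cond_dim, feature_dim):
        super().__init__()
        # FiLM generator: maps the conditioning vector to per-channel
        # scale (gamma) and bias (beta) parameters.
        self.scale = nn.Linear(cond_dim, feature_dim)
        self.bias = nn.Linear(cond_dim, feature_dim)

    def forward(self, features, condition):
        # features: (batch, channels, time); condition: (batch, cond_dim)
        gamma = self.scale(condition).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.bias(condition).unsqueeze(-1)
        # Conditional scaling plus conditional biasing.
        return gamma * features + beta
```

In FiLM [22], a single generator typically predicts the parameters for several such layers, each modulating a different block of the conditioned network.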
B. Triplet Loss
In machine learning, we often consider triplet samples [23],
each consisting of an anchor input, a matching input with the
same label (called a positive sample), and a non-matching
input with a different label (called a negative sample). The
triplet loss, initially introduced in metric learning [24], is a
loss function based on relative comparisons, i.e., an anchor
$x_a$ is compared to one positive sample $x_p$ and one negative
sample $x_n$.
As shown in Fig. 2, a triplet loss learns embeddings to min-
imize the distance between an anchor input and the positive
samples, and at the same time maximize the distance from
the anchor to the negative inputs. More specifically, for one
triplet, we want:
$$\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2, \qquad (1)$$
where $f(\cdot)$ is the embedding function and $\alpha$ is the
minimum margin between positive and negative pairs.
Fig. 2: Learning through the triplet loss [23].
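A minimal PyTorch sketch of Eq. (1) turned into a hinge-style loss; the function names and margin value are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge form of Eq. (1) on (batch, dim) embedding tensors."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)  # ||f(x_a) - f(x_p)||_2^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)  # ||f(x_a) - f(x_n)||_2^2
    # A triplet contributes zero loss once d_pos + alpha < d_neg.
    return F.relu(d_pos - d_neg + alpha).mean()
```

PyTorch's built-in torch.nn.TripletMarginLoss implements a closely related formulation using non-squared Euclidean distances.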
C. Speech Representations via Self-supervised Learning
In self-supervised learning (SSL), models are trained to
predict one part of the data from other parts [25]. SSL models
for speech data and tasks commonly aim to output speech
representations in the form of compact vectors that capture
high-level semantic information from raw speech data [26]–
[30]. In our work, we utilize the speech feature representations
generated from Wav2vec [27], an SSL network, to pass high-
level meaningful features to our networks.
Wav2vec is trained on the LibriSpeech corpus using the con-
trastive predictive coding (CPC) loss [30] to pretrain speech
representations for ASR tasks. Experimental results showed that
wav2vec significantly improves performance over the
chosen baseline solutions. Wav2vec consists of two parts,
an encoder and a context network. The former is a seven-
layer convolutional network, and its functionality is to extract
latent features from the inputs. The context network combines
multiple outputs from the encoder into a contextualized tensor,
which can then be used for downstream tasks.
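As an illustration of obtaining such representations, the sketch below uses torchaudio's pretrained wav2vec 2.0 bundle as a stand-in for the original fairseq wav2vec model; the bundle and method names follow torchaudio's public API, and the input is dummy audio.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 bundle from torchaudio, used here as a
# stand-in for the original fairseq wav2vec checkpoint.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)  # 1 s of dummy 16 kHz audio
with torch.no_grad():
    # Returns one feature tensor per transformer layer, each of
    # shape (batch, frames, feature_dim).
    features, _ = model.extract_features(waveform)
representation = features[-1]  # high-level speech representation
```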
III. METHOD
Let $X$ be a waveform produced by mixing $C$ sources
$x_1, x_2, \ldots, x_C$, i.e.,
$$X = \sum_{i=1}^{C} x_i. \qquad (2)$$
A waveform monaural speaker separation model aims to
directly separate the mixed signal $X$ into $C$ estimations
$\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C$.
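As a toy illustration of the mixing model in Eq. (2); the shapes, sample rate, and model placeholder are assumptions.

```python
import torch

C = 2                                              # number of speakers
sources = [torch.randn(16000) for _ in range(C)]   # 1 s of dummy audio at 16 kHz
X = torch.stack(sources).sum(dim=0)                # mixture, per Eq. (2)
# A separation model f then maps X back to C estimates:
#   estimates = f(X)   # shape (C, 16000), one waveform per speaker
```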