Individualized Conditioning and Negative Distances
for Speaker Separation
Tao Sun∗, Nidal Abuhajar∗, Shuyu Gong∗, Zhewei Wang§, Charles D. Smith†, Xianhui Wang‡, Li Xu‡, Jundong Liu∗
∗School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
†Department of Neurology, University of Kentucky, Lexington, KY 40536
‡Division of Communication Sciences, Ohio University, Athens, OH 45701
§Massachusetts General Hospital, Boston, MA 02114
Abstract—Speaker separation aims to extract multiple voices
from a mixed signal. In this paper, we propose two speaker-aware
designs to improve existing speaker separation solutions.
The first model is a speaker conditioning network that integrates
speech samples to generate individualized speaker conditions,
which then provide informed guidance for a separation module
to produce well-separated outputs.
The second design aims to reduce non-target voices in the
separated speech. To this end, we propose negative distances to
penalize the appearance of any non-target voice in the channel
outputs, and positive distances to drive the separated voices closer
to the clean targets. We explore two different setups, weighted-
sum and triplet-like, to integrate these two distances to form a
combined auxiliary loss for the separation networks. Experiments
conducted on LibriMix demonstrate the effectiveness of our
proposed models.
Index Terms—Speaker separation, conditioning, negative dis-
tances, speech representation, wav2vec, Conv-TasNet.
I. INTRODUCTION
Speech separation, also known as the cocktail party prob-
lem, aims to separate a target speech from its background
interference [1]. It often serves as a preprocessing step in
real-world speech processing systems, including ASR, speaker
recognition, hearing prostheses, and mobile telecommunica-
tions. Speaker separation (SS) is a sub-problem of speech
separation, where the main goal is to extract multiple voices
from a mixed signal.
Following the mechanism of the human auditory system,
traditional speaker separation solutions commonly rely on cer-
tain heuristic grouping rules, such as periodicity and pitch tra-
jectories, to separate mixed signals [2]–[4]. Two-dimensional
ideal time-frequency (T-F) masks are often generated based
on these rules and applied to mixed signals to extract individ-
ual sources. Due to their hand-crafted nature, however, these
grouping rules often have limited generalization capability in
handling diverse real-world signals [5].
Similar to many other AI-related areas, deep neural net-
works (DNNs) have recently emerged as a dominant paradigm
to solve SS problems. Early DNNs were mostly frequency-
domain models [6]–[10], aiming to approximate ideal T-F
masks and rely on them to restore individual sources through
short-time Fourier transform (STFT). As the modified T-F
representations may not be converted back to the time domain,
these methods commonly suffer from the so-called invalid
STFT problem [11].
Waveform-based DNN models have grown in popularity
in recent years, partly because they can avoid the invalid
STFT problem [11]–[15]. Pioneered by TasNet [11] and Conv-
TasNet [13], early waveform solutions tackle the separation
task with three stages: encoding, separating, and decoding.
However, speaker information is often not explicitly integrated
into the network training and/or inference procedures.
Speaker-aware SS models [16]–[21] provide a remedy in
this regard. This group of solutions can be roughly divided
into speaker-conditioned methods [16]–[18] and auxiliary-
loss based methods [19]–[21]. The former rely on a speaker
module to infer speaker information, which is then taken
as conditions by a separation module to generate separated
output waveforms. The existing speaker-conditioned solutions,
however, are either not in the time domain [16]–[18] or do
not explicitly integrate speech information into the speaker
conditioning process [18].
Auxiliary-loss based methods [19]–[21] achieve speaker
awareness through composite loss functions. In addition to a
main loss, an auxiliary loss (or losses) is used to incorporate
speakers’ information into the network training procedure.
Such auxiliary losses are commonly formulated to ensure
a match between network outputs and the target speakers.
However, to the best of our knowledge, no solution has
attempted to explicitly suppress voices from other non-target
speakers. As a result, residual voices of non-target speakers
are often noticeable in the network outputs.
In this paper, we propose two waveform speaker separa-
tion models to address the aforementioned limitations. The
first model is a speaker conditioning network that integrates
individual speech samples in the speaker module to pro-
duce tailored speaker conditions. The integration is based on
speaker embeddings computed through a pretrained speaker
recognition network. The second solution aims to completely
suppress non-target speaker voices in the separated speech. We
propose an auxiliary loss with two terms: the first drives the
separated voices close to target clean voices, while the second
term penalizes the appearance of any non-target voice in the
separated outputs. The latter, which we call negative distances,
is achieved by maximizing distances between the speech
representations of extracted sources and those of the non-target
sources. We also explore different schemes to integrate the
proposed distances.
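For intuition, the sketch below shows one way the weighted-sum setup could combine the two distance terms in PyTorch; the squared-error distance over representations, the helper names, and the weight lam are assumptions made for illustration, not the exact formulation used by our networks.

```python
import torch

def auxiliary_loss(est_reps, tgt_reps, lam=1.0):
    """Weighted-sum auxiliary loss over speech representations.

    est_reps[i]: representation of the i-th separated channel;
    tgt_reps[j]: representation of the j-th clean source.
    """
    C = len(est_reps)
    # Positive distances: pull each channel toward its own target.
    pos = sum((est_reps[i] - tgt_reps[i]).pow(2).mean() for i in range(C))
    # Negative distances: push each channel away from non-target sources.
    neg = sum((est_reps[i] - tgt_reps[j]).pow(2).mean()
              for i in range(C) for j in range(C) if j != i)
    # Minimizing this loss minimizes pos while maximizing neg.
    return pos - lam * neg
```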
II. BACKGROUND
This section provides background knowledge for our
proposed speaker-aware SS models, covering conditioning in
machine learning, the triplet loss, and speech representations
generated through self-supervised learning.
A. Conditioning and FiLM
In everyday life, it is often helpful to process one source of
information in the context of another. For example, video and
audio in a movie can be better understood in the context of
each other. This context-based processing is called condition-
ing in machine learning, where computations through a model
are conditioned or modulated by information extracted from
auxiliary inputs. For speaker-conditioned speaker separation,
conditioning in a network can be done through its separation
module, which would take speaker information as the context
to produce the output voices.
Feature-wise Linear Modulation (FiLM) [22] is a popular
feature conditioning method that has been shown to enhance the
performance of neural network solutions for a variety of tasks,
including visual reasoning and speech separation [16], [18].
Fig. 1: Feature-wise Linear Modulation (FiLM) architecture.
FiLM learns to adaptively influence the output of a neural
network by applying an affine transformation to the network’s
intermediate features, based on some input. As shown in
Fig. 1, a FiLM conditioning architecture consists of a FiLM
generator and one or more FiLM layers. The generator takes
a conditioning representation as input and generates FiLM
vector parameters, which are later used in the FiLM layers
to modulate the input with an affine transformation, i.e., a
combination of conditional biasing and conditional scaling.
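For concreteness, here is a minimal PyTorch sketch of a single FiLM layer; the module name, the linear generators, and the tensor shapes are illustrative assumptions rather than the architecture of any cited system.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Modulates intermediate features with an affine transform whose
    parameters are predicted from a conditioning representation."""
    def __init__(self, cond_dim, feature_dim):
        super().__init__()
        # FiLM generator: maps the conditioning vector to per-channel
        # scale (gamma) and bias (beta) parameters.
        self.scale = nn.Linear(cond_dim, feature_dim)
        self.bias = nn.Linear(cond_dim, feature_dim)

    def forward(self, features, condition):
        # features: (batch, channels, time); condition: (batch, cond_dim)
        gamma = self.scale(condition).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.bias(condition).unsqueeze(-1)
        # Conditional scaling plus conditional biasing.
        return gamma * features + beta
```

In FiLM [22], a single generator typically predicts the parameters for several such layers, each modulating a different block of the conditioned network.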
B. Triplet Loss
In machine learning, we often consider triplet samples [23],
each consisting of an anchor input, a matching input with the
same label (called a positive sample), and a non-matching
input with a different label (called a negative sample). The
triplet loss, initially introduced in metric learning [24], is a
loss function based on relative comparisons, i.e., an anchor
$x_a$ is compared to one positive sample $x_p$ and one negative
sample $x_n$.
As shown in Fig. 2, a triplet loss learns embeddings to min-
imize the distance between an anchor input and the positive
samples, and at the same time maximize the distance from
the anchor to the negative inputs. More specifically, for one
triplet, we want:
$$\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2, \qquad (1)$$
where $f(\cdot)$ is the embedding function and $\alpha$ is the
minimum margin between positive and negative pairs.
Fig. 2: Learning through the triplet loss [23].
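A minimal PyTorch sketch of Eq. (1) turned into a hinge-style loss; the function names and margin value are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge form of Eq. (1) on (batch, dim) embedding tensors."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)  # ||f(x_a) - f(x_p)||_2^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)  # ||f(x_a) - f(x_n)||_2^2
    # A triplet contributes zero loss once d_pos + alpha < d_neg.
    return F.relu(d_pos - d_neg + alpha).mean()
```

PyTorch's built-in torch.nn.TripletMarginLoss implements a closely related formulation using non-squared Euclidean distances.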
C. Speech Representations via Self-supervised Learning
In self-supervised learning (SSL), models are trained to
predict one part of the data from other parts [25]. SSL models
for speech data and tasks commonly aim to output speech
representations in the form of compact vectors that capture
high-level semantic information from raw speech data [26]–
[30]. In our work, we utilize the speech feature representations
generated from Wav2vec [27], an SSL network, to pass high-
level meaningful features to our networks.
Wav2vec is trained on the LibriSpeech corpus using the con-
trastive predictive coding (CPC) loss [30] to pretrain speech
representations for ASR tasks. Experimental results showed that
wav2vec significantly improves performance over the
chosen baseline solutions. Wav2vec consists of two parts,
an encoder and a context network. The former is a seven-
layer convolutional network, and its functionality is to extract
latent features from the inputs. The context network combines
multiple outputs from the encoder into a contextualized tensor,
which can then be used for downstream tasks.
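As an illustration of obtaining such representations, the sketch below uses torchaudio's pretrained wav2vec 2.0 bundle as a stand-in for the original fairseq wav2vec model; the bundle and method names follow torchaudio's public API, and the input is dummy audio.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 bundle from torchaudio, used here as a
# stand-in for the original fairseq wav2vec checkpoint.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)  # 1 s of dummy 16 kHz audio
with torch.no_grad():
    # Returns one feature tensor per transformer layer, each of
    # shape (batch, frames, feature_dim).
    features, _ = model.extract_features(waveform)
representation = features[-1]  # high-level speech representation
```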
III. METHOD
Let $X$ be a waveform produced by mixing $C$ sources
$x_1, x_2, \ldots, x_C$, i.e.,
$$X = \sum_{i=1}^{C} x_i. \qquad (2)$$
A waveform monaural speaker separation model aims to
directly separate the mixed signal $X$ into $C$ estimations
$\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C$.
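As a toy illustration of the mixing model in Eq. (2); the shapes, sample rate, and model placeholder are assumptions.

```python
import torch

C = 2                                              # number of speakers
sources = [torch.randn(16000) for _ in range(C)]   # 1 s of dummy audio at 16 kHz
X = torch.stack(sources).sum(dim=0)                # mixture, per Eq. (2)
# A separation model f then maps X back to C estimates:
#   estimates = f(X)   # shape (C, 16000), one waveform per speaker
```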