
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Junjie Li1, Meng Ge1,2,∗, Zexu Pan2, Longbiao Wang1,∗, Jianwu Dang1,3
1Tianjin Key Laboratory of Cognitive Computing and Application,
College of Intelligence and Computing, Tianjin University, Tianjin, China
2Department of Electrical and Computer Engineering, National University of Singapore, Singapore
3Japan Advanced Institute of Science and Technology, Ishikawa, Japan
{mrjunjieli,gemeng,longbiao_wang}@tju.edu.cn, pan_zexu@u.nus.edu
Abstract
Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such a reference can be auditory, i.e., pre-recorded speech; visual, i.e., lip movements; or contextual, i.e., a phonetic sequence. References in different modalities provide distinct and complementary information that can be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of each modality. In the first stage, we pre-extract the target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
Index Terms: Speaker extraction, time-domain, phonetic se-
quence, visual-contextual
1. Introduction
Speaker extraction aims to separate the speech of the target
speaker from a multi-talker mixture signal, which is also known
as the cocktail party problem [1]. This is a fundamental and crucial problem in signal processing, and solving it benefits a wide range of downstream applications, such as hearing aids [2], active speaker detection [3], speaker localization [4], and automatic speech recognition (ASR) [5, 6]. Although humans perform this task with ease, it remains a huge challenge for machines.
Before the deep learning era, popular techniques included computational auditory scene analysis [7] and non-negative matrix factorization [8]; these prior studies laid the foundation for recent progress. With the advent and success of deep learning, speech separation algorithms such as permutation invariant training (PIT) [9], deep clustering [10], Wavesplit [11] and Conv-TasNet [12] tackle the cocktail party problem by separating every speaker in the mixture signal. Despite their great success, these methods suffer from an inherent ambiguity in assigning speaker labels to the separated signals. An auxiliary reference, such as a pre-recorded speech signal [13–16] or a video frame sequence [17–20], can be used to resolve this speaker ambiguity. A speaker extraction algorithm employs such an auxiliary reference to form top-down attention on the target speaker and extract that speaker's speech.
∗Corresponding author.
Neuroscience studies [21, 22] suggest that human perception is multimodal. According to the reentry theory [23], multimodal information is processed in our brain in an interactive manner, with each modality informing the others. At a cocktail party, we hear the voice of a person, observe the lip motions, and understand the contextual relevance. The lip motions that are synchronized with the target speech help us better capture the places of articulation, and they are robust against acoustic noise. Contextual information connects the preceding and following parts of the speech, which helps us fill in severely corrupted segments by inferring from the context. The information from different modalities is complementary and together enables effective communication [24–27]. Inspired by these prior studies, we aim to emulate this human perception process by utilizing both visual and contextual cues.
A number of audio-visual speaker extraction works explore visual and contextual information through viseme-phoneme mapping cues [28–30]. They encode the lip images into visemes using a visual encoder pre-trained on the lip reading task, in which each viseme maps to multiple phonemes. However, phoneme information derived from visual images alone is weak, and the network may rely more on the lip motion cues carried by the visemes.
There are also studies that incorporate ASR-derived phonemes to explore contextual information in speech separation algorithms [31–33]. In [31], the authors propose a two-stage method: the first stage obtains separated speech signals from the mixture with PIT, and the second stage estimates contextual embeddings that guide the model to produce the final speech. In [32], the authors use the speech mixture and visual cues to obtain contextual information, utilizing visual and contextual modalities jointly for the first time. Both of these works are frequency-domain methods. In this paper, we seek a solution in the time domain, as time-domain methods usually outperform their frequency-domain counterparts by avoiding the difficult phase estimation problem [12]. Instead of incorporating multiple modalities in a single model, we introduce one modality in each stage.
Motivated by these prior works, we propose a two-stage time-domain visual-contextual speaker extraction (VCSE) network. The VCSE network is conditioned on the auxiliary visual reference only, but it makes use of both the visual lip movement cues and self-enrolled contextual cues in the extraction process. In the first stage, the network pre-extracts the target speech and estimates the underlying phonemes using a pre-trained ASR system. In the second stage, the pre-extracted speech is refined with the contextual cues encoded from the self-enrolled phonetic sequence. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database [34] show that our VCSE network consistently outperforms other state-of-the-art baselines.
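
To make the two-stage pipeline concrete, the following PyTorch-style sketch illustrates how the components described above could be wired together. All module names (visual_encoder, stage1_extractor, phoneme_recognizer, context_encoder, stage2_extractor) are hypothetical placeholders and do not reflect the actual VCSE implementation; the sketch only mirrors the described data flow: visual cues drive the first-stage extraction, a pre-trained ASR system self-enrolls the phonetic sequence, and contextual cues refine the estimate in the second stage. Whether the ASR module is frozen during training is an assumption made here for illustration.

import torch
import torch.nn as nn

class TwoStageVisualContextualSketch(nn.Module):
    """Illustrative sketch of a two-stage visual-contextual extraction
    pipeline; submodule designs and interfaces are assumptions."""

    def __init__(self, visual_encoder, stage1_extractor,
                 phoneme_recognizer, context_encoder, stage2_extractor):
        super().__init__()
        self.visual_encoder = visual_encoder          # lip frames -> visual embeddings
        self.stage1_extractor = stage1_extractor      # (mixture, visual) -> pre-extracted speech
        self.phoneme_recognizer = phoneme_recognizer  # pre-trained ASR system (assumed frozen)
        self.context_encoder = context_encoder        # phonetic sequence -> contextual embeddings
        self.stage2_extractor = stage2_extractor      # (mixture, pre-extracted, context) -> refined speech

    def forward(self, mixture, lip_frames):
        # Stage 1: pre-extract the target speech conditioned on lip movements.
        visual_emb = self.visual_encoder(lip_frames)
        pre_extracted = self.stage1_extractor(mixture, visual_emb)

        # Self-enrolment: estimate the underlying phonetic sequence from the
        # pre-extracted speech, without backpropagating through the ASR.
        with torch.no_grad():
            phoneme_seq = self.phoneme_recognizer(pre_extracted)

        # Stage 2: refine the pre-extracted speech with self-enrolled contextual cues.
        context_emb = self.context_encoder(phoneme_seq)
        refined = self.stage2_extractor(mixture, pre_extracted, context_emb)
        return pre_extracted, refined

Returning both stage outputs allows supervision to be applied at each stage if desired, which is one natural choice for such a two-stage design.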