
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Junjie Li1, Meng Ge1,2,∗, Zexu Pan2, Longbiao Wang1,∗, Jianwu Dang1,3
1Tianjin Key Laboratory of Cognitive Computing and Application,
College of Intelligence and Computing, Tianjin University, Tianjin, China
2Department of Electrical and Computer Engineering, National University of Singapore, Singapore
3Japan Advanced Institute of Science and Technology, Ishikawa, Japan
{mrjunjieli,gemeng,longbiao_wang}@tju.edu.cn, pan_zexu@u.nus.edu
Abstract
Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such a reference can be auditory, i.e., pre-recorded speech; visual, i.e., lip movements; or contextual, i.e., a phonetic sequence. References in different modalities provide distinct and complementary information that can be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of each modality. In the first stage, we pre-extract the target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
Index Terms: Speaker extraction, time-domain, phonetic se-
quence, visual-contextual
1. Introduction
Speaker extraction aims to separate the speech of the target
speaker from a multi-talker mixture signal, which is also known
as the cocktail party problem [1]. This is a fundamental and crucial problem in signal processing, and solving it benefits a wide range of downstream applications, such as hearing aids [2], active speaker detection [3], speaker localization [4], and automatic speech recognition (ASR) [5, 6]. Although humans perform this task with ease, it remains a huge challenge for machines.
Before the deep learning era, popular techniques included computational auditory scene analysis [7] and non-negative matrix factorization [8]; these prior studies laid the foundation for recent progress. With the advent and success of deep learning, speech separation algorithms such as permutation invariant training (PIT) [9], deep clustering [10], Wavesplit [11] and Conv-TasNet [12] tackle the cocktail party problem by separating every speaker in the mixture signal. Despite their great success, these methods suffer from an inherent ambiguity in assigning speaker labels to the separated signals. An auxiliary reference, such as a pre-recorded speech signal [13–16] or a video frame sequence [17–20], can be used to resolve this speaker ambiguity. A speaker extraction algorithm employs such an auxiliary reference to form top-down attention on the target speaker and extract that speaker's speech.
∗Corresponding author.
Neuroscience studies [21, 22] suggest that human perception is multimodal. According to the reentry theory [23], multimodal information is processed in our brain in an interactive manner, with each modality informing the others. At a cocktail party, we hear the voice of a person, observe the lip motions, and understand the contextual relevance. The lip motions that are synchronized with the target speech help us better capture the places of articulation, and they are robust against acoustic noise. Contextual information connects the preceding and following parts of the speech, which helps us fill in severely corrupted segments by inferring from the context. The information from different modalities is complementary and together enables effective communication [24–27]. Inspired by these prior studies, we aim to emulate this human perception process by utilizing both visual and contextual cues.
A number of audio-visual speaker extraction works explore visual and contextual information through viseme-phoneme mapping cues [28–30]. They encode the lip images into visemes using a visual encoder pre-trained on the lip reading task, in which each viseme maps to multiple phonemes. However, phoneme information derived from visual images alone is weak, and the network may rely more on the lip motion cues carried by the visemes.
There are also studies that incorporate ASR-derived phonemes to explore contextual information in speech separation algorithms [31–33]. In [31], the authors propose a two-stage method: the first stage obtains separated speech signals from the mixture with PIT, and the second stage estimates contextual embeddings that guide the model to produce the final speech. In [32], the authors use the speech mixture and visual cues to obtain contextual information, utilizing visual and contextual modalities jointly for the first time. Both of these works are frequency-domain methods. In this paper, we seek a solution in the time domain, as time-domain methods usually outperform their frequency-domain counterparts by avoiding the difficult phase estimation problem [12]. Instead of incorporating multiple modalities in a single model, we introduce one modality in each stage.
Motivated by these prior works, we propose a two-stage time-domain visual-contextual speaker extraction (VCSE) network. The VCSE network is conditioned on the auxiliary visual reference only, but it makes use of both the visual lip movement cues and self-enrolled contextual cues in the extraction process. In the first stage, the network pre-extracts the target speech and estimates the underlying phonemes using a pre-trained ASR system. In the second stage, the pre-extracted speech is refined with the contextual cues encoded from the self-enrolled phonetic sequence. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database [34] show that our VCSE network consistently outperforms other state-of-the-art baselines.
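
To make the two-stage pipeline concrete, the following PyTorch-style sketch illustrates how the components described above could be wired together. All module names (visual_encoder, stage1_extractor, phoneme_recognizer, context_encoder, stage2_extractor) are hypothetical placeholders and do not reflect the actual VCSE implementation; the sketch only mirrors the described data flow: visual cues drive the first-stage extraction, a pre-trained ASR system self-enrolls the phonetic sequence, and contextual cues refine the estimate in the second stage. Whether the ASR module is frozen during training is an assumption made here for illustration.

import torch
import torch.nn as nn

class TwoStageVisualContextualSketch(nn.Module):
    """Illustrative sketch of a two-stage visual-contextual extraction
    pipeline; submodule designs and interfaces are assumptions."""

    def __init__(self, visual_encoder, stage1_extractor,
                 phoneme_recognizer, context_encoder, stage2_extractor):
        super().__init__()
        self.visual_encoder = visual_encoder          # lip frames -> visual embeddings
        self.stage1_extractor = stage1_extractor      # (mixture, visual) -> pre-extracted speech
        self.phoneme_recognizer = phoneme_recognizer  # pre-trained ASR system (assumed frozen)
        self.context_encoder = context_encoder        # phonetic sequence -> contextual embeddings
        self.stage2_extractor = stage2_extractor      # (mixture, pre-extracted, context) -> refined speech

    def forward(self, mixture, lip_frames):
        # Stage 1: pre-extract the target speech conditioned on lip movements.
        visual_emb = self.visual_encoder(lip_frames)
        pre_extracted = self.stage1_extractor(mixture, visual_emb)

        # Self-enrolment: estimate the underlying phonetic sequence from the
        # pre-extracted speech, without backpropagating through the ASR.
        with torch.no_grad():
            phoneme_seq = self.phoneme_recognizer(pre_extracted)

        # Stage 2: refine the pre-extracted speech with self-enrolled contextual cues.
        context_emb = self.context_encoder(phoneme_seq)
        refined = self.stage2_extractor(mixture, pre_extracted, context_emb)
        return pre_extracted, refined

Returning both stage outputs allows supervision to be applied at each stage if desired, which is one natural choice for such a two-stage design.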