Self-Supervised Training of Speaker Encoder with
Multi-Modal Diverse Positive Pairs
Ruijie Tao, Student Member, IEEE, Kong Aik Lee, Senior Member, IEEE, Rohan Kumar Das, Senior Member, IEEE, Ville Hautamäki, Member, IEEE, and Haizhou Li, Fellow, IEEE
Abstract—We study a novel neural architecture and its training strategies for a speaker encoder in speaker recognition, without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of varying length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common to sample a positive pair of two segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack the diversity necessary for training a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EERs) of 2.89%, 3.17% and 6.27% on the three test sets of VoxCeleb1 under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin and, at the same time, achieves results comparable with its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
Index Terms—Self-supervised learning, speaker recognition,
diverse positive pairs, multi-modal, progressive clustering
I. INTRODUCTION
SPEAKER recognition (SR) seeks to authenticate an iden-
tity claim by using the speaker’s voice [1]–[3]. It typically
This research is supported by the internal project of Shenzhen Research Institute of Big Data, Grant No. T00120220002, by the Guangdong Provincial Key Laboratory of Big Data Computing, Grant No. B10120210117-KP02, by the National Research Foundation Singapore under the National Robotics Program, Human-Robot Interaction Phase 1, Grant No. 1922500054, and by the DFG German Research Foundation under Germany's Excellence Strategy, EXC 2077. (Corresponding author: Ruijie Tao).
Ruijie Tao and Ville Hautamäki are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119077 (e-mail: ruijie.tao@u.nus.edu; villeh@cs.joensuu.fi).
Kong Aik Lee is with the Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: lee_kong_aik@i2r.a-star.edu.sg).
Rohan Kumar Das is with Fortemedia, Singapore 138589 (e-mail: ecerohan@gmail.com).
Ville Hautamäki is also with the School of Computing, University of Eastern Finland, Joensuu 80101, Finland.
Haizhou Li is with the Chinese University of Hong Kong, Shenzhen, China, the University of Bremen, Bremen, Germany, and Kriston AI, China (e-mail: haizhouli@cuhk.edu.cn).
Fig. 1. A segment is an excerpt from a video clip. In contrastive learning, a pair of two segments, either positive or negative, forms a training data point. A poor-man's positive pair (PPP), in the upper panel, is made up of two segments from the same utterance, which share the same speaker, acoustic environment, speaker state, and discussion topic, and are therefore under a homogeneous acoustic condition. A diverse positive pair (DPP) is made up of two segments from two distinct utterances of the same speaker, one from the upper panel and another from the lower panel, where only the speaker identity is in common. DPP is more effective for contrastive comparison.
relies on a speaker encoder that transforms a speech sample
into a speaker embedding vector. For supervised learning of
speaker encoder, a large-scale dataset with manually annotated
speaker labels is required [4]–[6]. As manual annotation is
labour intensive and costly, the self-supervised learning (SSL) technique becomes a competitive alternative by solving a pretext task on unlabelled data [7]. It has shown promis-
ing results in many areas, such as GPT [8] and BERT [9] in
natural language processing (NLP), MOCO [10], BYOL [11]
and DINO [12] in computer vision (CV), wav2vec [13] and
HuBERT [14] in speech processing. We are prompted to
investigate the training of speaker encoder on the abundantly
available unlabelled data.
Contrastive learning [7], [15] is a successful implemen-
tation of self-supervised learning. It forces the encoder to
produce similar representations between a pair of positive
samples, i.e., speech samples by the same speaker. A positive
pair contains an anchor segment and a positive counterpart,
which are typically two disjoint segments in the same utter-
ance [15], [16], while a negative pair consists of two speech
segments from different speakers, typically from two distant
utterances. For each anchor segment, the speaker encoder
learns to discriminate the positive pair from all negative pairs
in the mini-batch.
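To make this sampling strategy concrete, the sketch below (in Python/PyTorch; the function names and the 2-second segment length are our illustrative assumptions, not details taken from any specific implementation) draws two disjoint crops from each utterance to build a mini-batch of such within-utterance positive pairs:

```python
import random
import torch

def crop_two_disjoint_segments(waveform: torch.Tensor, seg_len: int):
    """Draw two non-overlapping fixed-length crops from one utterance.

    waveform: 1-D tensor of raw audio samples.
    seg_len:  segment length in samples (e.g., 2 s at 16 kHz = 32000).
    """
    assert waveform.numel() >= 2 * seg_len, "utterance too short"
    # Pick the first crop, then pick the second from the region after it.
    start1 = random.randint(0, waveform.numel() - 2 * seg_len)
    seg1 = waveform[start1:start1 + seg_len]
    start2 = random.randint(start1 + seg_len, waveform.numel() - seg_len)
    seg2 = waveform[start2:start2 + seg_len]
    return seg1, seg2

def sample_ppp_batch(utterances, seg_len=32000):
    """Build a poor-man's-positive-pair batch: segments (i,1) and (i,2)
    come from the same utterance i; cross-utterance pairs are negatives."""
    anchors, positives = [], []
    for wav in utterances:
        a, p = crop_two_disjoint_segments(wav, seg_len)
        anchors.append(a)
        positives.append(p)
    return torch.stack(anchors), torch.stack(positives)
```

Every cross-utterance combination in the resulting batch then serves as a negative pair.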
It is efficient to sample negative pairs from two distant
utterances. However, we believe that the positive pairs from
the same utterance are not the best learning target as they lack
sufficient diversity. While contrastive learning encourages the speaker encoder to learn the speaker's voice characteristics [17],
the resulting encoder is also affected by other confounding
factors. For instance, in an utterance from an indoor talk show
in the upper panel of Fig. 1, the speaker encoder may also
learn the spoken content, the speaker emotion, the speaker
state, the acoustic environment, and the recording channel, if
the positive pairs are always extracted from the same utterance
during comparison. We refer to such positive pairs as the poor-
man’s positive pairs (PPP).
In contrast, consider two distant utterances of the same speaker, for example, an indoor and an outdoor interview of the same person in Fig. 1. We can take one segment from each of the two utterances to form a positive
pair. In this way, the non-identity information is very different
between the two samples, thus greatly reducing the effect of
the confounding factors. We refer to such positive pairs as
the diverse positive pairs (DPP). We have good reason to expect that DPP will serve contrastive learning better than their PPP counterparts.
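For contrast with the within-utterance sampling sketched above, a diverse positive pair can be drawn as follows, assuming a group of utterances already believed to belong to one speaker; how such groups are discovered without labels is exactly the subject of the proposed method, and the names and segment length here are hypothetical:

```python
import random
import torch

def sample_dpp_pair(utterance_group, seg_len=32000):
    """Given a list of waveforms believed to share one speaker (e.g.,
    discovered by cross-modal search), draw one segment from each of
    two distinct utterances to form a diverse positive pair."""
    u1, u2 = random.sample(utterance_group, 2)   # two distinct utterances
    assert min(u1.numel(), u2.numel()) >= seg_len, "utterance too short"
    s1 = random.randint(0, u1.numel() - seg_len)
    s2 = random.randint(0, u2.numel() - seg_len)
    return u1[s1:s1 + seg_len], u2[s2:s2 + seg_len]
```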
In general, prior studies also suggest that contrastive learn-
ing benefits from diverse and varied training samples. The
study on the prototypical loss in the supervised learning
paradigm shows that speaker recognition benefits from varied
positive samples generated from the ground-truth speaker la-
bels across utterances [18]. A similar idea has been validated
in computer vision. In [19], it is suggested to find the nearest
neighbour for each anchor image as the positive counterpart
rather than the augmented anchor image. In SCAN [20] and
CoCLR [21], a fixed number of positive pairs for each image
are discovered after one round of contrastive learning. The
newly found positive pairs are then used for a new round of
contrastive training. These studies all point in the same direction: DPP lead to better models. To the best of our knowledge, there has been no study of DPP in the self-supervised learning of speaker encoders.
In this work, we hypothesize that DPP will outperform PPP
in the self-supervised learning of the speaker encoder. The
question is how to sample the DPP such that they are both
accurate, i.e., from the same speaker, and diverse, i.e., varying
across different acoustic conditions. One way is to use the
anchor utterance to search for positive utterances of the same
speaker in the database. However, this can hardly guarantee the accuracy and diversity of the found positive pairs. From biometric recognition studies, we know that facial images and voice constitute complementary biometric evidence [22], [23]. We are therefore motivated to exploit both audio and visual data to find positive counterparts that are both accurate and diverse. Our framework is inspired by the co-training technique, which describes a data sample from two different views and gradually enhances two encoders [24], [25]. We introduce a face encoder and train it together with the speaker encoder. To ensure that the found positive pairs are truly positive, we make use of the complementary nature of the two modalities and exploit both of them to search for positive pairs of video clips. This complementary effect improves the quality of the found positive pairs. As far as diversity is concerned, the cross-modal co-reference allows us to find positive speech pairs from very different acoustic conditions, and positive pairs of facial images from very different photographic environments.
We make the following contributions in this paper.
• For the first time, we hypothesize and validate the idea of diverse positive pairs (DPP) for the self-supervised learning of speaker encoders.
• We propose a multi-modal contrastive learning (MCL) framework with diverse positive pairs (MCL-DPP) via a novel neural architecture, and formulate its self-supervised learning strategies.
• We successfully implement the MCL and MCL-DPP frameworks and achieve state-of-the-art performance for self-supervised learning that is comparable with its supervised learning counterpart.
II. RELATED WORK
A. Speaker encoder and speaker recognition
A neural network solution to speaker recognition typically consists of a speaker encoder and a speaker comparison module.
The speaker encoder learns to convert a time-domain speech
signal or its spectral features, i.e., spectrograms, filter banks,
and mel-frequency cepstral coefficients (MFCCs) [26] into an
utterance-level speaker embedding. Examples of speaker encoders include the time-delay neural network (TDNN) based x-vector [27] and the convolutional neural network (CNN) based ResNet [18]. Recently, the emphasized channel attention, propagation and aggregation in time-delay neural network (ECAPA-TDNN) [28], which adopts many advanced designs such as Res2Net blocks [29], squeeze-and-excitation blocks [30] and multi-layer feature aggregation, has attracted much attention. As a speaker characterization frontend, the speaker encoder is usually trained in a supervised manner with classification objectives [27] or metric learning objectives [18].
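As a minimal illustration of what a speaker encoder computes, the toy PyTorch module below maps a variable-length feature sequence to a fixed-size embedding via temporal pooling. It is a deliberately simplified stand-in, not the x-vector or ECAPA-TDNN architecture; the layer sizes and plain mean pooling are our assumptions:

```python
import torch
import torch.nn as nn

class ToySpeakerEncoder(nn.Module):
    """Maps (batch, n_mels, frames) features to fixed-size embeddings.

    A bare-bones stand-in for TDNN-style encoders: 1-D convolutions
    over time, then temporal mean pooling to remove the dependence
    on utterance length.
    """
    def __init__(self, n_mels=80, emb_dim=192):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(512, emb_dim)

    def forward(self, feats):
        h = self.frame_net(feats)   # (batch, 512, frames)
        pooled = h.mean(dim=-1)     # temporal pooling -> (batch, 512)
        return self.proj(pooled)    # (batch, emb_dim)

# Utterances of different lengths yield embeddings of the same size.
enc = ToySpeakerEncoder()
short, long_ = torch.randn(1, 80, 200), torch.randn(1, 80, 800)
assert enc(short).shape == enc(long_).shape == (1, 192)
```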
The speaker comparison module is designed to decide if two
speaker embeddings are from the same speaker. At run-time
for speaker verification, the cosine similarity [18] or proba-
bilistic linear discriminant analysis (PLDA) [31] backend can
be used to calculate the similarity between the test and target
speaker embeddings. It is noted that speaker embedding is also
widely used in related areas, such as speaker diarization [32],
speaker extraction [33], text-to-speech synthesis (TTS) [34]
and voice conversion [35], [36]. Therefore, the quality of
speaker encoder is all important across many studies.
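For reference, a minimal cosine-similarity backend for verification can be sketched as follows; the decision threshold is a placeholder that would be tuned on a development set in practice:

```python
import torch
import torch.nn.functional as F

def verify(emb_test: torch.Tensor, emb_target: torch.Tensor,
           threshold: float = 0.3) -> bool:
    """Accept the identity claim if the cosine similarity between the
    test and target (enrollment) embeddings exceeds the threshold."""
    score = F.cosine_similarity(emb_test, emb_target, dim=-1).item()
    return score >= threshold
```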
B. Self-supervised learning of speaker encoder
Self-supervised learning is achieved by deriving supervisory
signals from the unlabelled data itself. It leverages the intrinsic
structural knowledge in the data. For speaker encoder training,
there are two general design strategies for supervisory signals,
namely single-stage learning and two-stage learning.
1) Single-stage learning: Single-stage learning is a typical
end-to-end training following a comparison-based pretext task.
The key is to construct this pretext task effectively. The simple contrastive learning (SCL) technique trains the speaker encoder
by attracting positive pairs (two augmented segments from the
same utterance) and repelling negative pairs (two augmented
segments from different utterances) [7], [17]. Other works further introduce additional training targets to improve the effectiveness of the comparison, such as invariance to augmentation [17], invariance to channel [16], equilibrium learning [37] and positive-term regularization [38].
Besides SCL, other comparison-based self-supervised learning techniques include the MOCO framework [39], [40], which stores negative pairs in a memory bank, and the DINO framework [12], [41]–[43], which involves only positive pairs and achieves considerable improvement. For efficiency and effectiveness, we adopt the SCL framework in this study, and focus on the sampling strategy of positive pairs. It is noted that our proposal can be extended to other frameworks, such as MOCO and DINO.
2) Two-stage learning: With a two-stage learning strategy,
we view the single-stage learning as the first stage and improve
it with pseudo labels in the second stage. Based on the trained
speaker encoder in the first stage, speaker embeddings can be
derived from the unlabelled speech data for an unsupervised
clustering. In the second stage, the cluster identity of a speech
sample serves as its pseudo speaker label for the supervised
learning of speaker encoder. The clustering-training loop is
repeated to improve the speaker encoder [15], [40], [44].
In our previous work [45], we focused on the second stage, namely how to effectively find reliable pseudo labels. Studies on two-stage learning have validated that, through pseudo labels, the speaker encoder greatly benefits from unlabelled data. In this work, we focus on the first stage, i.e., on diverse positive pairs.
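The clustering-training loop of the second stage can be sketched as follows (a schematic only: the use of k-means, the number of clusters, and the two callables are our illustrative placeholders rather than the exact recipe of [15], [40], [44]):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_loop(encoder, utterances, embed_fn, train_fn,
                   n_rounds=3, n_clusters=5000):
    """Clustering-training loop for the second stage.

    embed_fn(encoder, utterance) -> 1-D np.ndarray embedding
    train_fn(encoder, utterances, labels) -> retrained encoder
    Both callables are placeholders for the user's own routines.
    """
    for _ in range(n_rounds):
        # 1) Embed all unlabelled utterances with the current encoder.
        embs = np.stack([embed_fn(encoder, u) for u in utterances])
        # 2) Unsupervised clustering; cluster ids become pseudo labels.
        pseudo_labels = KMeans(n_clusters=n_clusters,
                               n_init=10).fit_predict(embs)
        # 3) Supervised retraining with the pseudo labels (e.g., with a
        #    classification objective).
        encoder = train_fn(encoder, utterances, pseudo_labels)
    return encoder
```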
C. Multi-modal speaker recognition
The human face also provides identity information [46], which is helpful for speaker encoder training. Under the supervised
learning framework, the speech-face early [22], middle [22],
[47] and late [22], [23], [48], [49] fusion strategies were
studied. They all concluded that the human face provides com-
plementary biometric information in speaker recognition.
In the recent study of self-supervised learning for speaker
encoder [50], the visual modality is introduced in the second
stage of a two-stage learning system where speech-face multi-
modal embeddings are used to improve speaker clustering,
and thus the quality of the speaker encoder. It remains a challenge to effectively use the abundant unlabelled videos with speech-face pairs in a single-stage learning system, which motivates this study.
III. MCL: MULTI-MODAL CONTRASTIVE LEARNING
We now propose an end-to-end multi-modal contrastive learning framework (MCL) as the baseline. As shown in Fig. 2, MCL consists of three components: contrastive learning for the speaker encoder with speech input, contrastive learning for the face encoder with face-frame input, and an embedding projector network with a cross-modal joint loss. Here, each
video clip in the training set contains only one talking face.
Such data can be obtained via an audio-visual active speaker
detection system [51].
We assume that the speech segments or face frames drawn
from the same video clip share the same identity. On the
other hand, those drawn from different video clips belong to
different people. This assumption does not rule out the possibility of false-negative pairs. However, considering the size of the mini-batch with respect to a relatively large training set [16], such false-negative pairs are rare and can be ignored.
A. Contrastive learning for speaker encoder
The contrastive learning scheme of the speaker encoder is similar to that in our previous work [45], as shown in the orange box in Fig. 2. Each training video clip $x_i$ contains the speech utterance $x^{(s)}_i$ and face frames $x^{(f)}_i$. For one utterance, we randomly take two same-length, disjoint speech segments $x^{(s)}_{i,1}$ and $x^{(s)}_{i,2}$ after stochastic noise augmentation. When we view $x^{(s)}_{i,1}$ as the anchor segment, $x^{(s)}_{i,2}$ is the positive counterpart. They are the inputs of the speaker encoder $E^{(s)}(\cdot)$, and the outputs are the speaker embeddings $y^{(s)}_{i,1}$ and $y^{(s)}_{i,2}$. As shown in (1), a contrastive loss [7] is a function whose value is low when the anchor segment $x^{(s)}_{i,1}$ is similar to its positive counterpart $x^{(s)}_{i,2}$ and dissimilar to all other segments in the mini-batch (i.e., negative segments). Let $s(a,b) = \exp(\cos(a,b)/\tau)$, where $\cos$ is the cosine similarity and $\tau$ is the temperature parameter. The loss function $\mathcal{L}^{(s)}$ for the mini-batch is defined as

$$\mathcal{L}^{(s)} = -\frac{1}{2M}\sum_{i=1}^{M}\sum_{j=1}^{2}\log\frac{s\big(y^{(s)}_{i,1},\, y^{(s)}_{i,2}\big)}{\sum_{k=1}^{M}\sum_{l=1}^{2}\mathbb{1}_{[k\neq i,\ l\neq j]}\; s\big(y^{(s)}_{i,j},\, y^{(s)}_{k,l}\big)} \qquad (1)$$

where $M$ is the batch size and $\mathbb{1}$ is an indicator function evaluating to 1 when $k \neq i$ and $l \neq j$. For each segment, there is one positive pair ($y^{(s)}_{i,1}$ with $y^{(s)}_{i,2}$) and $2(M-1)$ negative pairs, since each utterance provides two segments.
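For a concrete reference, the loss in (1) can be implemented along the following lines. This is a sketch under our own assumptions about tensor layout; the negative mask follows the indicator function as stated above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(y1: torch.Tensor, y2: torch.Tensor, tau: float = 0.1):
    """Mini-batch contrastive loss of Eq. (1).

    y1, y2: (M, D) embeddings of the two segments of each utterance,
            i.e., y_{i,1} and y_{i,2} stored in two tensors.
    """
    M = y1.size(0)
    y = torch.stack([y1, y2], dim=1)                  # (M, 2, D)
    flat = F.normalize(y.reshape(2 * M, -1), dim=-1)  # unit-norm rows
    sim = torch.exp(flat @ flat.t() / tau)            # s(a, b) for all pairs

    # Positive term: s(y_{i,1}, y_{i,2}), shared by both j = 1 and j = 2.
    pos = sim[torch.arange(0, 2 * M, 2), torch.arange(1, 2 * M, 2)]
    pos = pos.repeat_interleave(2)                    # one entry per (i, j)

    # Negative mask per the indicator in Eq. (1):
    # keep (k, l) with k != i and l != j.
    idx_i = torch.arange(2 * M) // 2                  # utterance index
    idx_j = torch.arange(2 * M) % 2                   # segment index
    mask = (idx_i.unsqueeze(1) != idx_i.unsqueeze(0)) & \
           (idx_j.unsqueeze(1) != idx_j.unsqueeze(0))
    neg = (sim * mask).sum(dim=1)                     # denominator per (i, j)

    return -(torch.log(pos / neg)).mean()             # mean over the 2M terms
```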
B. Contrastive learning for face encoder
There is a lack of study on the self-supervised learning of face encoders. As shown in the bottom-left blue box of Fig. 2, we formulate the contrastive learning for the face frame sequence in the same way as that for the speech signal. In other words, the two speech segments are replaced by two face frames $x^{(f)}_{i,1}$ and $x^{(f)}_{i,2}$, and stochastic noise augmentation is substituted by image augmentation. Similarly, the face encoder is denoted as $E^{(f)}(\cdot)$, which derives the face embeddings $y^{(f)}_{i,1}$ and $y^{(f)}_{i,2}$. We follow the same selection process for positive and negative pairs.
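The image augmentation mentioned above could, for instance, be composed with standard torchvision transforms; the particular transform set below is our assumption, not the paper's exact configuration:

```python
from torchvision import transforms

# A minimal sketch of the image augmentation applied to the two face
# frames of a positive pair (crop size and jitter strengths are
# illustrative placeholders).
face_augment = transforms.Compose([
    transforms.RandomResizedCrop(112, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
```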