pair contains an anchor segment and a positive counterpart, which are typically two disjoint segments in the same utterance [15], [16], while a negative pair consists of two speech segments from different speakers, typically from two distant utterances. For each anchor segment, the speaker encoder learns to discriminate the positive pair from all negative pairs in the mini-batch.
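To make this setup concrete, below is a minimal PyTorch sketch of such a mini-batch contrastive objective, written as an InfoNCE-style loss; the exact formulations in [15], [16] may differ, and the function and variable names here are illustrative.

```python
import torch
import torch.nn.functional as F

def infonce_loss(anchor_emb, positive_emb, temperature=0.07):
    """Mini-batch contrastive loss: anchor i treats positive i as its
    target and the positives of all other anchors as negatives.

    anchor_emb, positive_emb: (B, D) speaker embeddings, one pair per
    utterance in the mini-batch.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # (B, B) scaled cosine similarities; the diagonal holds positive pairs.
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```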
It is efficient to sample negative pairs from two distant utterances. However, we believe that positive pairs from the same utterance are not the best learning target, as they lack sufficient diversity. While contrastive learning encourages the speaker encoder to learn the speaker's voice characteristics [17], the resulting encoder is also affected by other confounding factors. For instance, for an utterance from an indoor talk show, as in the upper panel of Fig. 1, the speaker encoder may also learn the spoken content, the speaker's emotion and state, the acoustic environment, and the recording channel if the positive pairs are always extracted from the same utterance during comparison. We refer to such positive pairs as the poor-man's positive pairs (PPP).
In contrast, we can form a positive pair by taking one segment from each of two distant utterances of the same speaker, for example, an indoor and an outdoor interview of the same person in Fig. 1. In this way, the non-identity information differs greatly between the two samples, which reduces the effect of the confounding factors. We refer to such positive pairs as the diverse positive pairs (DPP). We have good reason to expect that DPP will serve contrastive learning better than their PPP counterpart.
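The following sketch contrasts the two sampling schemes. It is a simplification under stated assumptions: utterances are treated as feature sequences, and for DPP we assume that same-speaker utterances have already been identified, which is exactly the problem addressed in this paper.

```python
import random

def sample_ppp(utterance, seg_len):
    """Poor-man's positive pair: two disjoint segments of one utterance,
    so nuisance factors (content, channel, environment) are shared."""
    assert len(utterance) >= 2 * seg_len
    start_a = random.randint(0, len(utterance) - 2 * seg_len)
    # The second segment starts after the first ends, so they never overlap.
    start_p = random.randint(start_a + seg_len, len(utterance) - seg_len)
    return (utterance[start_a:start_a + seg_len],
            utterance[start_p:start_p + seg_len])

def sample_dpp(same_speaker_utterances, seg_len):
    """Diverse positive pair: one segment from each of two different
    utterances of the same speaker, so nuisance factors differ."""
    utt_a, utt_p = random.sample(same_speaker_utterances, 2)
    start_a = random.randint(0, len(utt_a) - seg_len)
    start_p = random.randint(0, len(utt_p) - seg_len)
    return (utt_a[start_a:start_a + seg_len],
            utt_p[start_p:start_p + seg_len])
```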
In general, prior studies also suggest that contrastive learning benefits from diverse and varied training samples. A study on the prototypical loss in the supervised learning paradigm shows that speaker recognition benefits from varied positive samples generated across utterances from the ground-truth speaker labels [18]. A similar idea has been validated in computer vision. In [19], it is suggested to take the nearest neighbour of each anchor image as the positive counterpart, rather than an augmented copy of the anchor. In SCAN [20] and CoCLR [21], a fixed number of positive pairs for each image are discovered after one round of contrastive learning; the newly found positive pairs are then used for a new round of contrastive training. These studies all point in the same direction: DPP lead to better models. To the best of our knowledge, there has been no study of DPP in the self-supervised learning of speaker encoders yet.
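As a schematic illustration of this mine-then-retrain idea (not the exact SCAN or CoCLR procedure), the mining step can be sketched as a k-nearest-neighbour search in the embedding space after one round of contrastive learning:

```python
import torch
import torch.nn.functional as F

def mine_positive_pairs(embeddings, k=5):
    """Pair each sample with its k nearest neighbours in embedding space
    as new positives for the next round of contrastive training.

    embeddings: (N, D) embeddings from the previous training round.
    Returns (N, k) indices of the mined positive counterparts.
    """
    emb = F.normalize(embeddings, dim=-1)
    sim = emb @ emb.t()                  # (N, N) cosine similarities
    sim.fill_diagonal_(-float("inf"))    # exclude trivial self-pairs
    _, neighbours = sim.topk(k, dim=-1)
    return neighbours
```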
In this work, we hypothesize that DPP will outperform PPP in the self-supervised learning of the speaker encoder. The question is how to sample DPP such that they are both accurate, i.e., from the same speaker, and diverse, i.e., varying across acoustic conditions. One way is to use the anchor utterance to search the database for positive utterances of the same speaker. However, this alone can hardly guarantee the accuracy and diversity of the found positive pairs. From biometric recognition studies, we know that facial images and voice constitute complementary biometric evidence [22], [23]. We are therefore motivated to apply both audio and visual data to find positive counterparts that are both accurate and diverse.
We are inspired by the co-training technique, which describes a data sample from two different views and enhances two encoders gradually [24], [25], to construct our framework: we introduce a face encoder and train it together with the speaker encoder. To ensure that the found positive pairs are truly positive, we make use of the complementary nature of the two modalities and exploit both audio and visual cues to search for positive pairs of video clips. This complementary effect improves the quality of the found positive pairs. As far as diversity is concerned, the cross-modal co-reference allows us to find positive speech pairs from very different acoustic conditions, and positive pairs of facial images from very different photographic environments.
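As a rough illustration of this cross-modal agreement idea (the actual MCL-DPP search procedure is formulated later in this paper), a candidate clip would be accepted as a positive only when both modalities agree with the anchor; the thresholds below are illustrative, not values from our system.

```python
import torch
import torch.nn.functional as F

def cross_modal_positive_search(audio_emb, face_emb, anchor_idx,
                                tau_a=0.6, tau_v=0.6):
    """Accept a candidate clip as positive for the anchor only if BOTH
    the speaker embedding and the face embedding are similar enough,
    exploiting the complementary modalities.

    audio_emb, face_emb: (N, D) per-clip embeddings from the two encoders.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(face_emb, dim=-1)
    sim_a = a @ a[anchor_idx]        # (N,) audio similarity to the anchor
    sim_v = v @ v[anchor_idx]        # (N,) visual similarity to the anchor
    accept = (sim_a > tau_a) & (sim_v > tau_v)
    accept[anchor_idx] = False       # exclude the anchor clip itself
    return accept.nonzero(as_tuple=True)[0]
```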
We make the following contributions in this paper.
• For the first time, we hypothesize and validate the idea of diverse positive pairs (DPP) for self-supervised learning of speaker encoders.
• We propose a multi-modal contrastive learning (MCL) framework with diverse positive pairs (MCL-DPP) via a novel neural architecture and formulate its self-supervised learning strategies.
• We successfully implement the MCL and MCL-DPP frameworks and achieve state-of-the-art performance for self-supervised learning that is comparable with its supervised learning counterpart.
II. RELATED WORK
A. Speaker encoder and speaker recognition
A neural network solution to speaker recognition typically consists of a speaker encoder and a speaker comparison module.
The speaker encoder learns to convert a time-domain speech signal or its spectral features, e.g., spectrograms, filter banks, or mel-frequency cepstral coefficients (MFCCs) [26], into an utterance-level speaker embedding. Examples of speaker encoders include the time-delay neural network (TDNN) based x-vector [27] and the convolutional neural network (CNN) based ResNet [18]. Recently, the emphasized channel attention, propagation and aggregation in time-delay neural network (ECAPA-TDNN) [28] has attracted much attention; it adopts many advanced designs, such as Res2Net blocks [29], squeeze-and-excitation blocks [30], and multi-layer feature aggregation. As a speaker characterization frontend, the speaker encoder is usually trained in a supervised manner with classification objectives [27] or metric learning objectives [18].
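For illustration, below is a toy frame-level encoder with statistics pooling; it follows the general TDNN-with-pooling recipe but is far shallower than the x-vector network [27] or ECAPA-TDNN [28], and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MiniSpeakerEncoder(nn.Module):
    """Toy TDNN-style encoder: frame-level 1-D convolutions over spectral
    features, statistics pooling to the utterance level, then a linear
    projection to a fixed-size speaker embedding."""

    def __init__(self, feat_dim=80, emb_dim=192):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 512, emb_dim)

    def forward(self, feats):            # feats: (B, feat_dim, T)
        h = self.frame_layers(feats)     # (B, 512, T')
        # Statistics pooling: concatenate mean and std over time.
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=-1)
        return self.embedding(stats)     # (B, emb_dim) speaker embedding
```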
The speaker comparison module is designed to decide whether two speaker embeddings are from the same speaker. At run-time for speaker verification, a cosine similarity [18] or probabilistic linear discriminant analysis (PLDA) [31] backend can be used to calculate the similarity between the test and target speaker embeddings. It is noted that speaker embeddings are also widely used in related areas, such as speaker diarization [32], speaker extraction [33], text-to-speech synthesis (TTS) [34], and voice conversion [35], [36]. Therefore, the quality of the speaker encoder is of central importance across many studies.
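For reference, the cosine-similarity backend mentioned above reduces to a single comparison; the decision threshold below is illustrative and would in practice be calibrated on a development set, e.g., at the equal error rate operating point.

```python
import torch.nn.functional as F

def verify(test_emb, target_emb, threshold=0.3):
    """Cosine-similarity backend for speaker verification: accept the
    trial as same-speaker if the score exceeds a tuned threshold."""
    score = F.cosine_similarity(test_emb, target_emb, dim=-1)
    return score, score > threshold
```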