pair contains an anchor segment and a positive counterpart, which are typically two disjoint segments in the same utterance [15], [16], while a negative pair consists of two speech segments from different speakers, typically from two distant utterances. For each anchor segment, the speaker encoder learns to discriminate the positive pair from all negative pairs in the mini-batch.
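To make this setup concrete, below is a minimal PyTorch sketch of such a mini-batch contrastive objective, written as an InfoNCE-style loss; the exact formulations in [15], [16] may differ, and the function and variable names here are illustrative.

```python
import torch
import torch.nn.functional as F

def infonce_loss(anchor_emb, positive_emb, temperature=0.07):
    """Mini-batch contrastive loss: anchor i treats positive i as its
    target and the positives of all other anchors as negatives.

    anchor_emb, positive_emb: (B, D) speaker embeddings, one pair per
    utterance in the mini-batch.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # (B, B) scaled cosine similarities; the diagonal holds positive pairs.
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```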
It is efficient to sample negative pairs from two distant utterances. However, we believe that positive pairs from the same utterance are not the best learning target, as they lack sufficient diversity. While contrastive learning encourages the speaker encoder to learn the speaker's voice characteristics [17], the resulting encoder is also affected by other confounding factors. For instance, for an utterance from an indoor talk show, as in the upper panel of Fig. 1, the speaker encoder may also learn the spoken content, the speaker's emotion and state, the acoustic environment, and the recording channel if the positive pairs are always extracted from the same utterance during comparison. We refer to such positive pairs as the poor-man's positive pairs (PPP).
In contrast, we can form a positive pair by taking one segment from each of two distant utterances of the same speaker, for example, an indoor and an outdoor interview of the same person in Fig. 1. In this way, the non-identity information differs greatly between the two samples, which reduces the effect of the confounding factors. We refer to such positive pairs as the diverse positive pairs (DPP). We have good reason to expect that DPP will serve contrastive learning better than their PPP counterpart.
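The following sketch contrasts the two sampling schemes. It is a simplification under stated assumptions: utterances are treated as feature sequences, and for DPP we assume that same-speaker utterances have already been identified, which is exactly the problem addressed in this paper.

```python
import random

def sample_ppp(utterance, seg_len):
    """Poor-man's positive pair: two disjoint segments of one utterance,
    so nuisance factors (content, channel, environment) are shared."""
    assert len(utterance) >= 2 * seg_len
    start_a = random.randint(0, len(utterance) - 2 * seg_len)
    # The second segment starts after the first ends, so they never overlap.
    start_p = random.randint(start_a + seg_len, len(utterance) - seg_len)
    return (utterance[start_a:start_a + seg_len],
            utterance[start_p:start_p + seg_len])

def sample_dpp(same_speaker_utterances, seg_len):
    """Diverse positive pair: one segment from each of two different
    utterances of the same speaker, so nuisance factors differ."""
    utt_a, utt_p = random.sample(same_speaker_utterances, 2)
    start_a = random.randint(0, len(utt_a) - seg_len)
    start_p = random.randint(0, len(utt_p) - seg_len)
    return (utt_a[start_a:start_a + seg_len],
            utt_p[start_p:start_p + seg_len])
```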
In general, prior studies also suggest that contrastive learning benefits from diverse and varied training samples. A study on the prototypical loss in the supervised learning paradigm shows that speaker recognition benefits from varied positive samples generated across utterances from the ground-truth speaker labels [18]. A similar idea has been validated in computer vision. In [19], it is suggested to take the nearest neighbour of each anchor image as the positive counterpart, rather than an augmented copy of the anchor. In SCAN [20] and CoCLR [21], a fixed number of positive pairs for each image are discovered after one round of contrastive learning; the newly found positive pairs are then used for a new round of contrastive training. These studies all point in the same direction: DPP lead to better models. To the best of our knowledge, there has been no study of DPP in the self-supervised learning of speaker encoders yet.
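As a schematic illustration of this mine-then-retrain idea (not the exact SCAN or CoCLR procedure), the mining step can be sketched as a k-nearest-neighbour search in the embedding space after one round of contrastive learning:

```python
import torch
import torch.nn.functional as F

def mine_positive_pairs(embeddings, k=5):
    """Pair each sample with its k nearest neighbours in embedding space
    as new positives for the next round of contrastive training.

    embeddings: (N, D) embeddings from the previous training round.
    Returns (N, k) indices of the mined positive counterparts.
    """
    emb = F.normalize(embeddings, dim=-1)
    sim = emb @ emb.t()                  # (N, N) cosine similarities
    sim.fill_diagonal_(-float("inf"))    # exclude trivial self-pairs
    _, neighbours = sim.topk(k, dim=-1)
    return neighbours
```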
In this work, we hypothesize that DPP will outperform PPP in the self-supervised learning of the speaker encoder. The question is how to sample DPP such that they are both accurate, i.e., from the same speaker, and diverse, i.e., varying across acoustic conditions. One way is to use the anchor utterance to search the database for positive utterances of the same speaker. However, this alone can hardly guarantee the accuracy and diversity of the found positive pairs. From biometric recognition studies, we know that facial images and voice constitute complementary biometric evidence [22], [23]. We are therefore motivated to apply both audio and visual data to find positive counterparts that are both accurate and diverse.
We are inspired by the co-training technique, which describes a data sample from two different views and enhances two encoders gradually [24], [25], to construct our framework: we introduce a face encoder and train it together with the speaker encoder. To ensure that the found positive pairs are truly positive, we make use of the complementary nature of the two modalities and exploit both audio and visual cues to search for positive pairs of video clips. This complementary effect improves the quality of the found positive pairs. As far as diversity is concerned, the cross-modal co-reference allows us to find positive speech pairs from very different acoustic conditions, and positive pairs of facial images from very different photographic environments.
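As a rough illustration of this cross-modal agreement idea (the actual MCL-DPP search procedure is formulated later in this paper), a candidate clip would be accepted as a positive only when both modalities agree with the anchor; the thresholds below are illustrative, not values from our system.

```python
import torch
import torch.nn.functional as F

def cross_modal_positive_search(audio_emb, face_emb, anchor_idx,
                                tau_a=0.6, tau_v=0.6):
    """Accept a candidate clip as positive for the anchor only if BOTH
    the speaker embedding and the face embedding are similar enough,
    exploiting the complementary modalities.

    audio_emb, face_emb: (N, D) per-clip embeddings from the two encoders.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(face_emb, dim=-1)
    sim_a = a @ a[anchor_idx]        # (N,) audio similarity to the anchor
    sim_v = v @ v[anchor_idx]        # (N,) visual similarity to the anchor
    accept = (sim_a > tau_a) & (sim_v > tau_v)
    accept[anchor_idx] = False       # exclude the anchor clip itself
    return accept.nonzero(as_tuple=True)[0]
```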
We make the following contributions in this paper.
• For the first time, we hypothesize and validate the idea of diverse positive pairs (DPP) for self-supervised learning of speaker encoders.
• We propose a multi-modal contrastive learning (MCL) framework with diverse positive pairs (MCL-DPP) via a novel neural architecture and formulate its self-supervised learning strategies.
• We successfully implement the MCL and MCL-DPP frameworks and achieve state-of-the-art performance for self-supervised learning that is comparable with its supervised learning counterpart.
II. RELATED WORK
A. Speaker encoder and speaker recognition
A neural network solution to speaker recognition typically consists of a speaker encoder and a speaker comparison module.
The speaker encoder learns to convert a time-domain speech signal or its spectral features, e.g., spectrograms, filter banks, or mel-frequency cepstral coefficients (MFCCs) [26], into an utterance-level speaker embedding. Examples of speaker encoders include the time-delay neural network (TDNN) based x-vector [27] and the convolutional neural network (CNN) based ResNet [18]. Recently, the emphasized channel attention, propagation and aggregation in time-delay neural network (ECAPA-TDNN) [28] has attracted much attention; it adopts many advanced designs, such as Res2Net blocks [29], squeeze-and-excitation blocks [30], and multi-layer feature aggregation. As a speaker characterization frontend, the speaker encoder is usually trained in a supervised manner with classification objectives [27] or metric learning objectives [18].
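For illustration, below is a toy frame-level encoder with statistics pooling; it follows the general TDNN-with-pooling recipe but is far shallower than the x-vector network [27] or ECAPA-TDNN [28], and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MiniSpeakerEncoder(nn.Module):
    """Toy TDNN-style encoder: frame-level 1-D convolutions over spectral
    features, statistics pooling to the utterance level, then a linear
    projection to a fixed-size speaker embedding."""

    def __init__(self, feat_dim=80, emb_dim=192):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 512, emb_dim)

    def forward(self, feats):            # feats: (B, feat_dim, T)
        h = self.frame_layers(feats)     # (B, 512, T')
        # Statistics pooling: concatenate mean and std over time.
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=-1)
        return self.embedding(stats)     # (B, emb_dim) speaker embedding
```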
The speaker comparison module is designed to decide whether two speaker embeddings are from the same speaker. At run-time for speaker verification, a cosine similarity [18] or probabilistic linear discriminant analysis (PLDA) [31] backend can be used to calculate the similarity between the test and target speaker embeddings. It is noted that speaker embeddings are also widely used in related areas, such as speaker diarization [32], speaker extraction [33], text-to-speech synthesis (TTS) [34], and voice conversion [35], [36]. Therefore, the quality of the speaker encoder is of central importance across many studies.
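For reference, the cosine-similarity backend mentioned above reduces to a single comparison; the decision threshold below is illustrative and would in practice be calibrated on a development set, e.g., at the equal error rate operating point.

```python
import torch.nn.functional as F

def verify(test_emb, target_emb, threshold=0.3):
    """Cosine-similarity backend for speaker verification: accept the
    trial as same-speaker if the score exceeds a tuned threshold."""
    score = F.cosine_similarity(test_emb, target_emb, dim=-1)
    return score, score > threshold
```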