
MULTITASK DETECTION OF SPEAKER CHANGES, OVERLAPPING SPEECH AND VOICE
ACTIVITY USING WAV2VEC 2.0
Marie Kunešová and Zbyněk Zajíc
New Technologies for the Information Society and Department of Cybernetics,
Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
ABSTRACT
Self-supervised learning approaches have lately achieved great success on a broad spectrum of machine learning problems. In the field of speech processing, one of the most successful recent self-supervised models is wav2vec 2.0. In this paper, we explore the effectiveness of this model on three basic speech classification tasks: speaker change detection, overlapped speech detection, and voice activity detection. First, we concentrate on a single task, speaker change detection, where our proposed system surpasses the previously reported results on four different corpora and achieves comparable performance even when trained on out-of-domain data from an artificially designed dataset. Then we expand our approach to tackle all three tasks in a single multitask system with state-of-the-art performance on the AMI corpus. The implementation of the algorithms in this paper is publicly available at https://github.com/mkunes/w2v2_audioFrameClassification.
Index Terms— multitask learning, speaker change detection, overlapped speech detection, voice activity detection, wav2vec 2.0
1. INTRODUCTION
Speaker change detection (SCD), overlapping speech detection (OSD), and voice activity detection (VAD) are three basic speech processing tasks that are relevant for a variety of speech applications. SCD is the task of finding the points in a conversation where the speaker changes, while OSD is concerned with identifying intervals where multiple speakers are active at the same time. Both of these are particularly important for speaker diarization [1, 2], as well as for other tasks related to processing multi-speaker audio [3]. Voice activity detection simply distinguishes speech from non-speech and is used in nearly all speech processing pipelines.
In the past, approaches to SCD mainly consisted of computing a distance between two sliding windows over the signal and then locating peaks [4], or of detecting differences in pitch [5]. Nowadays, the prevailing approach is deep learning. Early attempts used precomputed features based on i-vectors or x-vectors [3], Mel-frequency cepstral coefficients (MFCCs) [5], spectrograms [6], or combinations of multiple feature types [7], even including lexical information obtained from automated transcripts [8, 9]. Various neural network architectures have been applied, such as LSTMs [10], CNNs [6], or sequence-level modeling methods [11].
A similar shift from conventional techniques to deep learning has also occurred for OSD. Recent approaches include convolutional neural networks [12] or LSTMs [2], and the input can take the form of MFCCs [13], spectrograms [14], x-vectors [15], or raw waveforms [16].

Fig. 1: Illustration of the multitask wav2vec 2.0 detector of speaker changes, voice activity, and overlapping speech. The model outputs a set of labels for each audio frame (every 20 ms).

This research was supported by the Czech Ministry of Interior, project ROZKAZ (VJ01010108), and by the Czech Ministry of Education, Youth and Sports, project No. LM2023062 LINDAT/CLARIAH-CZ. Computational resources were provided by the e-INFRA CZ project (ID:90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
VAD has been the target of a large amount of research over many years [17, 18], but these days it is rarely the main focus by itself. Rather, it usually appears as one component or a by-product of a more complex system [13, 19].
In this paper, we first propose an end-to-end approach for SCD using the transformer network concept (which has recently seen great success on a variety of tasks, including but not limited to speech processing [20]), specifically using the wav2vec 2.0 [21] framework, and evaluate it on four commonly used conversational speech corpora. Then we expand the approach to also perform two other tasks (OSD and VAD) in a single multitask system, as illustrated in Fig. 1. These three tasks are closely related to each other, and it has been previously demonstrated that joint learning can improve learning efficiency and prediction accuracy compared to task-specific models [22].
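To make the multitask setup concrete, the following is a minimal sketch of such a per-frame classifier built on wav2vec 2.0, written in PyTorch with the HuggingFace transformers library. The class name, checkpoint choice, and single linear head are our illustrative assumptions, not the authors' actual implementation (for which see the repository linked above):

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultitaskFrameClassifier(nn.Module):
    # Hypothetical sketch: one binary logit per task (SCD, OSD, VAD)
    # for every ~20 ms wav2vec 2.0 output frame.
    def __init__(self, checkpoint="facebook/wav2vec2-base", n_tasks=3):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        self.head = nn.Linear(hidden, n_tasks)    # task logits per frame

    def forward(self, input_values):
        # input_values: raw 16 kHz waveform, shape (batch, n_samples)
        frames = self.encoder(input_values).last_hidden_state
        # frames: (batch, n_frames, hidden), one vector per ~20 ms
        return self.head(frames)                  # (batch, n_frames, n_tasks)

model = MultitaskFrameClassifier()
logits = model(torch.randn(1, 16000))   # 1 s of dummy audio
probs = torch.sigmoid(logits)           # per-frame probabilities
print(probs.shape)                      # torch.Size([1, 49, 3])

Each output can then be trained with a per-frame binary cross-entropy loss (e.g., nn.BCEWithLogitsLoss), summed over the tasks.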
2. SINGLE-TASK MODEL FOR SCD
Wav2vec 2.0 (hereafter referred to as “wav2vec2”) is a transformer-based self-supervised framework for speech representation, which has been used for a wide range of speech processing tasks, such as automatic speech recognition [23] and many others [24].

Inspired by our previous work [25] on prosodic boundary detection, we treat the SCD problem as an audio frame classification task.
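As an illustration of this framing, the snippet below sketches one plausible way to convert annotated speaker-change times into per-frame binary targets at the 20 ms frame rate; the helper name and the 0.2 s tolerance around each change point are assumptions made for the example, not values taken from the paper:

import torch

FRAME_RATE = 50  # wav2vec2 emits one output frame per 20 ms

def change_times_to_frame_labels(change_times, n_frames, tolerance=0.2):
    # Per-frame binary targets: 1 for frames whose centers lie within
    # `tolerance` seconds of an annotated speaker change, else 0.
    labels = torch.zeros(n_frames)
    centers = (torch.arange(n_frames) + 0.5) / FRAME_RATE  # in seconds
    for t in change_times:
        labels[(centers - t).abs() <= tolerance] = 1.0
    return labels

# Example: changes at 1.2 s and 3.7 s in a 5 s recording (250 frames)
labels = change_times_to_frame_labels([1.2, 3.7], n_frames=250)

Widening the positive class around each change point in this way is a common trick for making such sparse events learnable; the exact tolerance is a tunable design choice.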