MULTITASK DETECTION OF SPEAKER CHANGES, OVERLAPPING SPEECH AND VOICE
ACTIVITY USING WAV2VEC 2.0
Marie Kunešová and Zbyněk Zajíc
New Technologies for the Information Society and Department of Cybernetics,
Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
ABSTRACT
Self-supervised learning approaches have lately achieved great
success on a broad spectrum of machine learning problems. In the
field of speech processing, one of the most successful recent
self-supervised models is wav2vec 2.0. In this paper, we explore the
effectiveness of this model on three basic speech classification tasks:
speaker change detection, overlapped speech detection, and voice
activity detection. First, we concentrate on only one task – speaker
change detection – where our proposed system surpasses the previously
reported results on four different corpora, and achieves comparable
performance even when trained on out-of-domain data from
an artificially designed dataset. Then we expand our approach to
tackle all three tasks in a single multitask system with state-of-the-art
performance on the AMI corpus. The implementation of the
algorithms in this paper is publicly available at
https://github.com/mkunes/w2v2_audioFrameClassification.
Index Terms: multitask learning, speaker change detection,
overlapped speech detection, voice activity detection, wav2vec 2.0
1. INTRODUCTION
Speaker change detection (SCD), overlapping speech detection
(OSD), and voice activity detection (VAD) are three basic speech
processing tasks that are relevant for a variety of different speech
applications. SCD is the task of finding the points in a conversation
where the speaker is changing, while OSD is concerned with identifying
intervals where multiple speakers are active at the same time.
Both of these are particularly important to speaker diarization [1, 2],
as well as other tasks related to processing multi-speaker audio [3].
Voice activity detection simply distinguishes between speech and
non-speech and has a use in nearly all speech processing.
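To make the three task definitions concrete, the following is a minimal sketch of how per-frame reference labels for all three tasks could be derived from one set of speaker-turn annotations. The segment format `(speaker, start_s, end_s)`, the function name `frame_labels`, and the rule used for marking a change (the set of active speakers differs from the last non-empty set) are illustrative assumptions, not the labeling scheme used in this paper.

```python
from typing import List, Tuple

# Hypothetical annotation format: (speaker id, start time in s, end time in s).
Segment = Tuple[str, float, float]


def frame_labels(segments: List[Segment], duration_s: float,
                 frame_s: float = 0.02):
    """Return per-frame 0/1 label lists (vad, osd, scd).

    Each frame is judged at its centre time. The SCD rule is a simplifying
    assumption: a frame is a change if the set of active speakers is
    non-empty and differs from the last non-empty set seen so far.
    """
    n_frames = int(round(duration_s / frame_s))
    vad, osd, scd = [], [], []
    last_active = None  # last non-empty set of active speakers
    for i in range(n_frames):
        t = (i + 0.5) * frame_s
        active = frozenset(spk for spk, start, end in segments
                           if start <= t < end)
        vad.append(int(len(active) >= 1))   # someone is speaking
        osd.append(int(len(active) >= 2))   # two or more speakers at once
        changed = (bool(active) and last_active is not None
                   and active != last_active)
        scd.append(int(changed))
        if active:
            last_active = active
    return vad, osd, scd


if __name__ == "__main__":
    example = [("A", 0.00, 0.10), ("B", 0.08, 0.20), ("B", 0.30, 0.40)]
    vad, osd, scd = frame_labels(example, duration_s=0.40)
    print("VAD:", vad)
    print("OSD:", osd)
    print("SCD:", scd)
```

In this toy example, speaker B resuming after a pause is not marked as a change, whereas the transitions into and out of the overlap are. Other conventions for placing change points are equally plausible.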
In the past, approaches to SCD have mainly consisted of computing
a distance between two sliding windows over the signal and then
locating peaks [4] or detecting the differences in pitch [5]. Nowadays,
the prevailing approach is to use deep learning. First attempts
used precomputed features based on i-vectors or x-vectors [3],
Mel-frequency cepstral coefficients (MFCCs) [5], spectrograms [6], or
combinations of multiple types of features [7], even including lexical
information gained from automated transcripts [8, 9]. Different neural
network model architectures have been applied, such as LSTM [10],
CNN [6], or sequence-level modeling methods [11].
A similar shift from conventional techniques to deep learning
has also occurred for OSD. Recent approaches include convolutional
neural networks [12] or LSTMs [2], and the input can be in the
form of MFCCs [13], spectrograms [14], x-vectors [15] or raw
waveforms [16].
This research was supported by the Czech Ministry of Interior, project
ROZKAZ (VJ01010108) and by the Czech Ministry of Education, Youth and
Sports, Project No. (LM2023062) LINDAT/CLARIAH-CZ. Computational
resources were provided by the e-INFRA CZ project (ID:90140), supported
by the Ministry of Education, Youth and Sports of the Czech Republic.
Fig. 1: Illustration of the multitask wav2vec 2.0 detector of speaker
changes, voice activity, and overlapping speech. The model outputs a
set of labels for each audio frame (every 20 ms).
VAD has been the target of a large amount of research for many
years [17, 18], but these days it is rarely the main focus by itself.
Rather, it usually appears as one part or a by-product of a more
complex system [13, 19].
In this paper, we first propose an end-to-end approach for SCD
using the transformer network concept (which has recently seen great
success on a variety of tasks, including but not limited to speech
processing [20]) – specifically using the wav2vec 2.0 [21] framework –
and evaluate it on four commonly used conversational speech
corpora. Then we expand the approach to also perform two other
tasks (OSD and VAD) in a single multitask system, as illustrated in
Figure 1. These three tasks are closely related to each other, and it
has been previously demonstrated that joint learning can improve
learning efficiency and prediction accuracy compared to task-specific
models [22].
2. SINGLE-TASK MODEL FOR SCD
Wav2vec 2.0 (hereafter referred to as “wav2vec2”) is a
transformer-based self-supervised framework for speech representation, which
has been used for a wide range of speech processing tasks, such as
automatic speech recognition [23] and many others [24].
Inspired by our previous work [25] on prosodic boundary detection,
we treat the SCD problem as an audio frame classification task.
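As a rough illustration of what frame-level classification means in terms of output size, the sketch below computes how many output frames a wav2vec2-style convolutional feature encoder produces for a given number of input samples, and places hard 0/1 speaker change targets on that frame grid. The kernel and stride values are those of the standard wav2vec 2.0 feature encoder at 16 kHz and are assumed here rather than taken from this text; the function names and the hard 0/1 targets are likewise illustrative assumptions and not necessarily the labeling used by the proposed system.

```python
from typing import List, Sequence

# Assumed: the standard wav2vec 2.0 convolutional feature encoder
# (7 layers) operating on 16 kHz audio.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16000


def frame_hop_samples() -> int:
    """Overall hop between output frames: the product of all strides."""
    hop = 1
    for stride in CONV_STRIDES:
        hop *= stride
    return hop  # 320 samples, i.e. 20 ms at 16 kHz


def num_frames(num_samples: int) -> int:
    """Number of output frames for an unpadded input of num_samples."""
    n = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        n = (n - kernel) // stride + 1
    return max(n, 0)


def scd_targets(change_times_s: Sequence[float],
                num_samples: int) -> List[int]:
    """Hard 0/1 target per frame: 1 in the frame containing a change."""
    n = num_frames(num_samples)
    hop = frame_hop_samples()
    targets = [0] * n
    for t in change_times_s:
        idx = int(t * SAMPLE_RATE) // hop
        if 0 <= idx < n:
            targets[idx] = 1
    return targets


if __name__ == "__main__":
    print(frame_hop_samples() / SAMPLE_RATE)   # hop in seconds
    print(num_frames(SAMPLE_RATE))             # frames for 1 s of audio
    print(scd_targets([0.5], SAMPLE_RATE))
```

Because the convolutions are unpadded, one second of audio yields slightly fewer frames than the nominal 50 implied by a 20 ms hop, which is worth keeping in mind when aligning reference labels with model outputs.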
© 2023 IEEE. Personal use of this material is permitted. Permission
from IEEE must be obtained for all other uses, in any current or future media,
including reprinting/republishing this material for advertising or promotional
purposes, creating new collective works, for resale or redistribution to servers
or lists, or reuse of any copyrighted component of this work in other works.
To appear in Proc. ICASSP 2023, June 04-10, 2023, Rhodes Island, Greece © 2023 IEEE
arXiv:2210.14755v2 [eess.AS] 10 Mar 2023