MULTITASK DETECTION OF SPEAKER CHANGES, OVERLAPPING SPEECH AND VOICE
ACTIVITY USING WAV2VEC 2.0
Marie Kunešová and Zbyněk Zajíc
New Technologies for the Information Society and Department of Cybernetics,
Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
ABSTRACT
Self-supervised learning approaches have lately achieved great
success on a broad spectrum of machine learning problems. In the
field of speech processing, one of the most successful recent
self-supervised models is wav2vec 2.0. In this paper, we explore the
effectiveness of this model on three basic speech classification tasks:
speaker change detection, overlapped speech detection, and voice
activity detection. First, we concentrate on only one task, speaker
change detection, where our proposed system surpasses the previously
reported results on four different corpora, and achieves comparable
performance even when trained on out-of-domain data from an
artificially designed dataset. Then we expand our approach to tackle
all three tasks in a single multitask system with state-of-the-art
performance on the AMI corpus. The implementation of the algorithms
in this paper is publicly available at
https://github.com/mkunes/w2v2_audioFrameClassification.
Index Terms: multitask learning, speaker change detection,
overlapped speech detection, voice activity detection, wav2vec 2.0
1. INTRODUCTION
Speaker change detection (SCD), overlapping speech detection
(OSD), and voice activity detection (VAD) are three basic speech
processing tasks that are relevant for a variety of different speech
applications. SCD is the task of finding the points in a conversation
where the speaker is changing, while OSD is concerned with
identifying intervals where multiple speakers are active at the same time.
Both of these are particularly important to speaker diarization [1, 2],
as well as other tasks related to processing multi-speaker audio [3].
Voice activity detection simply distinguishes between speech and
non-speech and has a use in nearly all speech processing.
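As a rough illustration of the three task definitions above, the following sketch derives speech regions (VAD), overlap regions (OSD), and speaker change points (SCD) from a list of annotated speaker turns. The segment format, the function names, and the rule that a change point is the start of a turn whose speaker differs from the previous turn are assumptions made for this example; they are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_s, end_s, speaker_id).
Segment = Tuple[float, float, str]
Interval = Tuple[float, float]


def active_intervals(segments: List[Segment], min_speakers: int) -> List[Interval]:
    """Merged intervals in which at least `min_speakers` speakers are active.

    min_speakers=1 gives speech regions (VAD reference),
    min_speakers=2 gives overlapped speech regions (OSD reference).
    Assumes a speaker's own segments never overlap each other.
    """
    events = []
    for start, end, _ in segments:
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker becomes inactive
    # At equal times, process ends before starts so that turns which
    # merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    intervals: List[Interval] = []
    active = 0
    open_start = None
    for time, delta in events:
        active += delta
        if active >= min_speakers and open_start is None:
            open_start = time
        elif active < min_speakers and open_start is not None:
            if intervals and intervals[-1][1] == open_start:
                # Directly adjacent to the previous interval: extend it.
                intervals[-1] = (intervals[-1][0], time)
            elif time > open_start:
                intervals.append((open_start, time))
            open_start = None
    return intervals


def change_points(segments: List[Segment]) -> List[float]:
    """Start times of turns whose speaker differs from the previous turn."""
    ordered = sorted(segments, key=lambda s: s[0])
    return [cur[0] for prev, cur in zip(ordered, ordered[1:]) if cur[2] != prev[2]]


turns = [(0.0, 2.0, "A"), (1.5, 3.0, "B"), (4.0, 5.0, "B"), (5.0, 6.0, "A")]
speech = active_intervals(turns, min_speakers=1)
overlap = active_intervals(turns, min_speakers=2)
changes = change_points(turns)
```

For the toy conversation above, speech covers 0 to 3 s and 4 to 6 s, the only overlap is between 1.5 s and 2 s, and speaker changes occur at 1.5 s and 5 s.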
In the past, approaches to SCD have mainly consisted of computing
a distance between two sliding windows over the signal and then
locating peaks [4] or detecting the differences in pitch [5].
Nowadays, the prevailing approach is to use deep learning. First attempts
used precomputed features based on i-vectors or x-vectors [3],
Mel-frequency cepstral coefficients (MFCCs) [5], spectrograms [6], or
combinations of multiple types of features [7], even including lexical
information gained from automated transcripts [8, 9]. Different neural
network model architectures have been applied, such as LSTM [10],
CNN [6], or sequence-level modeling methods [11].
A similar shift from conventional techniques to deep learning
has also occurred for OSD. Recent approaches include convolutional
neural networks [12] or LSTMs [2], and the input can be in the
form of MFCCs [13], spectrograms [14], x-vectors [15] or raw
waveforms [16].
Fig. 1: Illustration of the multitask wav2vec 2.0 detector of speaker
changes, voice activity, and overlapping speech. The model outputs a
set of labels for each audio frame (every 20 ms).
This research was supported by the Czech Ministry of Interior, project
ROZKAZ (VJ01010108) and by the Czech Ministry of Education, Youth and
Sports, Project No. (LM2023062) LINDAT/CLARIAH-CZ. Computational
resources were provided by the e-INFRA CZ project (ID:90140), supported
by the Ministry of Education, Youth and Sports of the Czech Republic.
VAD has been the target of a large amount of research for many
years [17, 18], but these days it is rarely the main focus by itself.
Rather, it usually appears as one part or a by-product of a more
complex system [13, 19].
In this paper, we first propose an end-to-end approach for SCD
using the transformer network concept (which has recently seen great
success on a variety of tasks, including but not limited to speech
processing [20]), specifically the wav2vec 2.0 [21] framework, and
evaluate it on four commonly used conversational speech
corpora. Then we expand the approach to also perform two other
tasks (OSD and VAD) in a single multitask system, as illustrated in
Figure 1. These three tasks are closely related to each other, and it
has been previously demonstrated that joint learning can improve
learning efficiency and prediction accuracy compared to task-specific
models [22].
2. SINGLE-TASK MODEL FOR SCD
Wav2vec 2.0 (hereafter referred to as “wav2vec2”) is a
transformer-based self-supervised framework for speech representation, which
has been used for a wide range of speech processing tasks, such as
automatic speech recognition [23] and many others [24].
Inspired by our previous work [25] on prosodic boundary detection,
we treat the SCD problem as an audio frame classification task.
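To make the frame classification framing concrete, here is a minimal sketch that turns reference speaker change times into one binary target per 20 ms frame (the frame rate given in the Fig. 1 caption). The hard 0/1 labels, the optional `half_width` tolerance, and the function name are assumptions for illustration only; this section does not describe the target labels actually used for training.

```python
from typing import List

FRAME_SHIFT_MS = 20  # one model output every 20 ms, per the Fig. 1 caption


def scd_frame_labels(
    change_times_ms: List[int],
    duration_ms: int,
    half_width: int = 0,
    frame_shift_ms: int = FRAME_SHIFT_MS,
) -> List[int]:
    """Return one 0/1 label per frame; 1 marks a frame at a speaker change.

    `half_width` (hypothetical parameter) also marks that many neighbouring
    frames on each side of the change. Times are integer milliseconds to
    avoid floating-point rounding when mapping times to frame indices.
    """
    num_frames = -(-duration_ms // frame_shift_ms)  # ceiling division
    labels = [0] * num_frames
    for t in change_times_ms:
        center = t // frame_shift_ms          # frame containing the change
        lo = max(0, center - half_width)
        hi = min(num_frames - 1, center + half_width)
        for i in range(lo, hi + 1):
            labels[i] = 1
    return labels


# A 200 ms clip (10 frames) with a single change at 50 ms (frame index 2).
hard = scd_frame_labels([50], duration_ms=200)
wide = scd_frame_labels([50], duration_ms=200, half_width=1)
```

A frame classifier would then be trained to predict such a label sequence from the audio; targets for OSD and VAD could be built the same way from overlap and speech intervals.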
© 2023 IEEE. Personal use of this material is permitted. Permission
from IEEE must be obtained for all other uses, in any current or future media,
including reprinting/republishing this material for advertising or promotional
purposes, creating new collective works, for resale or redistribution to servers
or lists, or reuse of any copyrighted component of this work in other works.
To appear in Proc. ICASSP 2023, June 04-10, 2023, Rhodes Island, Greece © 2023 IEEE
arXiv:2210.14755v2 [eess.AS] 10 Mar 2023