
MULTITASK DETECTION OF SPEAKER CHANGES, OVERLAPPING SPEECH AND VOICE
ACTIVITY USING WAV2VEC 2.0
Marie Kunešová and Zbyněk Zajíc
New Technologies for the Information Society and Department of Cybernetics,
Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
ABSTRACT
Self-supervised learning approaches have lately achieved great success on a broad spectrum of machine learning problems. In the field of speech processing, one of the most successful recent self-supervised models is wav2vec 2.0. In this paper, we explore the effectiveness of this model on three basic speech classification tasks: speaker change detection, overlapped speech detection, and voice activity detection. First, we concentrate on a single task, speaker change detection, where our proposed system surpasses the previously reported results on four different corpora and achieves comparable performance even when trained on out-of-domain data from an artificially designed dataset. Then we expand our approach to tackle all three tasks in a single multitask system with state-of-the-art performance on the AMI corpus. The implementation of the algorithms in this paper is publicly available at https://github.com/mkunes/w2v2_audioFrameClassification.
Index Terms— multitask learning, speaker change detection, overlapped speech detection, voice activity detection, wav2vec 2.0
1. INTRODUCTION
Speaker change detection (SCD), overlapping speech detection (OSD), and voice activity detection (VAD) are three basic speech processing tasks that are relevant for a variety of speech applications. SCD is the task of finding the points in a conversation where the speaker changes, while OSD is concerned with identifying intervals where multiple speakers are active at the same time. Both of these are particularly important for speaker diarization [1, 2], as well as for other tasks related to processing multi-speaker audio [3]. Voice activity detection simply distinguishes speech from non-speech and is used in nearly all speech processing pipelines.
In the past, approaches to SCD mainly consisted of computing a distance between two sliding windows over the signal and then locating peaks [4], or of detecting differences in pitch [5]. Nowadays, the prevailing approach is deep learning. Early attempts used precomputed features based on i-vectors or x-vectors [3], Mel-frequency cepstral coefficients (MFCCs) [5], spectrograms [6], or combinations of multiple feature types [7], even including lexical information obtained from automated transcripts [8, 9]. Various neural network architectures have been applied, such as LSTMs [10], CNNs [6], or sequence-level modeling methods [11].
A similar shift from conventional techniques to deep learning has also occurred for OSD. Recent approaches include convolutional neural networks [12] or LSTMs [2], and the input can take the form of MFCCs [13], spectrograms [14], x-vectors [15], or raw waveforms [16].

Fig. 1: Illustration of the multitask wav2vec 2.0 detector of speaker changes, voice activity, and overlapping speech. The model outputs a set of labels for each audio frame (every 20 ms).

This research was supported by the Czech Ministry of Interior, project ROZKAZ (VJ01010108), and by the Czech Ministry of Education, Youth and Sports, project No. LM2023062 LINDAT/CLARIAH-CZ. Computational resources were provided by the e-INFRA CZ project (ID:90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
VAD has been the target of a large amount of research over many years [17, 18], but these days it is rarely the main focus by itself. Rather, it usually appears as one component or a by-product of a more complex system [13, 19].
In this paper, we first propose an end-to-end approach for SCD using the transformer network concept (which has recently seen great success on a variety of tasks, including but not limited to speech processing [20]), specifically using the wav2vec 2.0 [21] framework, and evaluate it on four commonly used conversational speech corpora. Then we expand the approach to also perform two other tasks (OSD and VAD) in a single multitask system, as illustrated in Fig. 1. These three tasks are closely related to each other, and it has been previously demonstrated that joint learning can improve learning efficiency and prediction accuracy compared to task-specific models [22].
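To make the multitask setup concrete, the following is a minimal sketch of such a per-frame classifier built on wav2vec 2.0, written in PyTorch with the HuggingFace transformers library. The class name, checkpoint choice, and single linear head are our illustrative assumptions, not the authors' actual implementation (for which see the repository linked above):

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultitaskFrameClassifier(nn.Module):
    # Hypothetical sketch: one binary logit per task (SCD, OSD, VAD)
    # for every ~20 ms wav2vec 2.0 output frame.
    def __init__(self, checkpoint="facebook/wav2vec2-base", n_tasks=3):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        self.head = nn.Linear(hidden, n_tasks)    # task logits per frame

    def forward(self, input_values):
        # input_values: raw 16 kHz waveform, shape (batch, n_samples)
        frames = self.encoder(input_values).last_hidden_state
        # frames: (batch, n_frames, hidden), one vector per ~20 ms
        return self.head(frames)                  # (batch, n_frames, n_tasks)

model = MultitaskFrameClassifier()
logits = model(torch.randn(1, 16000))   # 1 s of dummy audio
probs = torch.sigmoid(logits)           # per-frame probabilities
print(probs.shape)                      # torch.Size([1, 49, 3])

Each output can then be trained with a per-frame binary cross-entropy loss (e.g., nn.BCEWithLogitsLoss), summed over the tasks.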
2. SINGLE-TASK MODEL FOR SCD
Wav2vec 2.0 (hereafter referred to as “wav2vec2”) is a transformer-based self-supervised framework for speech representation, which has been used for a wide range of speech processing tasks, such as automatic speech recognition [23] and many others [24].

Inspired by our previous work [25] on prosodic boundary detection, we treat the SCD problem as an audio frame classification task.
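As an illustration of this framing, the snippet below sketches one plausible way to convert annotated speaker-change times into per-frame binary targets at the 20 ms frame rate; the helper name and the 0.2 s tolerance around each change point are assumptions made for the example, not values taken from the paper:

import torch

FRAME_RATE = 50  # wav2vec2 emits one output frame per 20 ms

def change_times_to_frame_labels(change_times, n_frames, tolerance=0.2):
    # Per-frame binary targets: 1 for frames whose centers lie within
    # `tolerance` seconds of an annotated speaker change, else 0.
    labels = torch.zeros(n_frames)
    centers = (torch.arange(n_frames) + 0.5) / FRAME_RATE  # in seconds
    for t in change_times:
        labels[(centers - t).abs() <= tolerance] = 1.0
    return labels

# Example: changes at 1.2 s and 3.7 s in a 5 s recording (250 frames)
labels = change_times_to_frame_labels([1.2, 3.7], n_frames=250)

Widening the positive class around each change point in this way is a common trick for making such sparse events learnable; the exact tolerance is a tunable design choice.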