

Detection and Classification of Acoustic Scenes and Events 2022 3–4 November 2022, Nancy, France

A HYBRID SYSTEM OF SOUND EVENT DETECTION TRANSFORMER AND FRAME-WISE MODEL FOR DCASE 2022 TASK 4

Yiming Li1,2, Zhifang Guo1,2, Zhirong Ye1,2, Xiangdong Wang1,†, Hong Liu1, Yueliang Qian1, Rui Tao3, Long Yan3, Kazushige Ouchi3

1 Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China,
eamon.y.li@gmail.com, {guozhifang21s, yezhirong19s, xdwang, hliu, ylqian}@ict.ac.cn

2 University of Chinese Academy of Sciences, Beijing, China

3 Toshiba China R&D Center, Beijing, China,
{taorui, yanlong}@toshiba.com.cn, kazushige.ouchi@toshiba.co.jp

ABSTRACT

In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model that learns event-level representations and predicts sound event categories and boundaries directly, while the latter follows the widely adopted frame-classification scheme, in which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training on unlabeled data is applied, and semi-supervised learning is adopted by using an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model, achieving a PSDS1 of 0.420 and a PSDS2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.

Index Terms: Sound Event Detection Transformer, Online Pseudo-labelling, Hybrid System

1. INTRODUCTION

Sound Event Detection (SED) aims to identify the categories of foreground sound events together with their onset and offset timestamps. Task 4 of the DCASE challenge has focused on weakly supervised SED for several years. DCASE 2022 Task 4 [1] is a follow-up to last year's challenge [2]. This year, in addition to exploring a heterogeneous development dataset containing unlabeled, synthetic and weakly labeled data, participants are allowed to incorporate external datasets or pre-trained embeddings. As in last year's challenge, SED systems are evaluated by the Polyphonic Sound Detection Score (PSDS) [3] under two different real-life settings.

For weakly supervised SED, most existing works follow the Multiple Instance Learning (MIL) framework and formulate SED as a seq2seq classification task. They usually design Convolutional Neural Networks (CNNs) or Convolutional Recurrent Neural Networks (CRNNs) to obtain frame-level classification probabilities and then apply a pooling mechanism to aggregate the frame-level predictions into event-level results. However, such methods do not treat a sound event as a whole and may therefore ignore global information such as the correlation between frames or the event duration. Recently, an event-wise model, namely SEDT, was proposed to address these problems [4]. It models SED as a set prediction problem and directly maps the audio spectrogram to a set of candidate events, thus freeing SED models from trivial post-processing, namely frame-level thresholding or median filtering. An empirical study has shown that SEDT can achieve competitive performance compared with its frame-wise counterparts [4]. Moreover, we find that the two models can complement each other, as they solve the SED task in different ways. Combining them may therefore be an intuitive approach to reaching promising SED performance.
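To make the frame-wise post-processing mentioned above concrete, the following is a minimal Python sketch of thresholding followed by median filtering for a single event class. It is an illustration only, not the post-processing used in the described system: the function names, the threshold of 0.5, the filter width of 3 frames and the hop of 1 second per frame are all assumptions chosen for readability.

```python
from statistics import median


def median_filter(binary, width):
    """Sliding median over a 0/1 sequence; edges are padded with the border value."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    if not binary:
        return []
    half = width // 2
    padded = [binary[0]] * half + list(binary) + [binary[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(binary))]


def frames_to_events(probs, threshold=0.5, width=3, hop_seconds=1.0):
    """Turn frame probabilities of one class into (onset, offset) pairs in seconds.

    threshold, width and hop_seconds are illustrative assumptions.
    """
    # Step 1: thresholding gives a binary activity sequence.
    binary = [1 if p > threshold else 0 for p in probs]
    # Step 2: median filtering removes isolated spikes and fills short gaps.
    active = median_filter(binary, width)
    # Step 3: runs of active frames become events.
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        events.append((start * hop_seconds, len(active) * hop_seconds))
    return events


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.1, 0.6, 0.1, 0.1]
    print(frames_to_events(probs))
```

In this toy input the single-frame gaps at frames 3 and 6 are filled and the isolated active frame 7 is removed, so one event spanning frames 1 to 7 is reported. The result depends on two hand-tuned values (threshold and filter width), which is the kind of post-processing that an event-wise model avoids.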

In this paper, we describe our system participating in DCASE 2022 Task 4. It is a combination of SEDT and a frame-wise CNN model. For SEDT, specially designed training schemes, including supervised learning, self-supervised learning and semi-supervised learning, are studied to help it learn from the heterogeneous development dataset. For the frame-wise CNN model, metric learning is applied to narrow the domain gap between real and synthetic data, a mean-teacher framework is implemented to provide supervision for unlabeled data, and a tag-conditioned CNN model is used to generate final predictions based on audio tags. After obtaining each well-trained model, we explore the fusion strategy and post-processing methods of the ensemble model. With these methods, the hybrid system achieves competitive results on the validation dataset.
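Both the online teacher used for SEDT and the mean-teacher framework rely on a teacher whose weights track the student through an exponential moving average. The sketch below shows that update rule on a plain dictionary of scalar parameters. It is a simplified illustration under stated assumptions: the function name, the dictionary representation and the decay value are hypothetical, and a real model would apply the same rule to every weight tensor.

```python
def ema_update(teacher, student, decay=0.999):
    """Update teacher parameters in place: t <- decay * t + (1 - decay) * s.

    teacher and student map parameter names to floats; decay=0.999 is an
    assumed value, not one taken from the paper.
    """
    for name, s_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_value
    return teacher


if __name__ == "__main__":
    teacher = {"w": 1.0}
    student = {"w": 0.0}
    # With decay=0.5 the teacher moves halfway towards the student each step.
    for step in range(3):
        ema_update(teacher, student, decay=0.5)
        print(step, teacher["w"])
```

A decay close to 1 makes the teacher change slowly, so its pseudo labels are less sensitive to the noise of individual student updates.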

2. SEMI-SUPERVISED SEDT

2.1. Sound Event Detection Transformer

An overview of SEDT is shown in Fig. 1. It represents each sound event as $y_i = (c_i, b_i)$, where $c_i$ is the event category and $b_i = (m_i, l_i)$ denotes the event's temporal boundary, consisting of the normalized event center $m_i$ and duration $l_i$, and directly seeks a mapping between input features and ground-truth events. Given the input spectrogram, a backbone CNN extracts its feature map, which is then summed with a one-dimensional positional encoding and fed into the transformer encoder for further feature processing. The transformer decoder takes $N + 1$ learnable embeddings ($N$ event queries and 1 audio query) as input event queries, where each of them

arXiv:2210.09529v1 [cs.SD] 18 Oct 2022
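The event representation of Section 2.1, a category together with a normalized center and duration, can be converted to and from onset and offset timestamps. The following Python sketch illustrates that conversion. The class and function names are hypothetical, and the clip length of 10 seconds is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    category: str    # c_i
    center: float    # m_i, normalized to [0, 1]
    duration: float  # l_i, normalized to [0, 1]


def to_onset_offset(event, clip_seconds=10.0):
    """Map (m_i, l_i) to (onset, offset) in seconds, clipped to the clip range."""
    onset = (event.center - event.duration / 2.0) * clip_seconds
    offset = (event.center + event.duration / 2.0) * clip_seconds
    return max(0.0, onset), min(clip_seconds, offset)


def from_onset_offset(category, onset, offset, clip_seconds=10.0):
    """Inverse mapping: build the normalized (m_i, l_i) boundary from timestamps."""
    center = (onset + offset) / 2.0 / clip_seconds
    duration = (offset - onset) / clip_seconds
    return Event(category, center, duration)


if __name__ == "__main__":
    event = Event("Speech", center=0.5, duration=0.25)
    print(to_onset_offset(event))
    print(from_onset_offset("Speech", 3.75, 6.25))
```

Because the boundary is expressed as a center and a duration rather than as per-frame activity, a prediction in this form needs no thresholding or smoothing to yield timestamps.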
