
Detection and Classification of Acoustic Scenes and Events 2022 3–4 November 2022, Nancy, France
A HYBRID SYSTEM OF SOUND EVENT DETECTION TRANSFORMER AND FRAME-WISE
MODEL FOR DCASE 2022 TASK 4
Yiming Li1,2, Zhifang Guo1,2, Zhirong Ye1,2, Xiangdong Wang1, Hong Liu1, Yueliang Qian1,
Rui Tao3, Long Yan3, Kazushige Ouchi3
1Beijing Key Laboratory of Mobile Computing and Pervasive Device,
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China,
eamon.y.li@gmail.com, {guozhifang21s, yezhirong19s, xdwang, hliu, ylqian}@ict.ac.cn
2University of Chinese Academy of Sciences, Beijing, China
3Toshiba China R&D Center, Beijing, China,
{taorui, yanlong}@toshiba.com.cn, kazushige.ouchi@toshiba.co.jp
ABSTRACT
In this paper, we describe in detail our system for DCASE 2022
Task 4. The system combines two considerably different models: an
end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise
model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is
an event-wise model that learns event-level representations and
predicts sound event categories and boundaries directly, while the
latter follows the widely adopted frame-classification scheme, under
which each frame is classified into event categories and event
boundaries are obtained by post-processing such as thresholding and
smoothing. For SEDT, self-supervised pre-training on unlabeled data
is applied, and semi-supervised learning is adopted by using an
online teacher, which is updated from the student model with the
Exponential Moving Average (EMA) strategy and generates reliable
pseudo labels for weakly labeled and unlabeled data. For the
frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is
used. Experimental results show that the hybrid system considerably
outperforms either individual model, achieving a PSDS1 of 0.420 and
a PSDS2 of 0.783 on the validation set without external data. The
code is available at
https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
Index Terms: Sound Event Detection Transformer, Online
Pseudo-labelling, Hybrid System
1. INTRODUCTION
Sound Event Detection (SED) aims to identify the categories of
foreground sound events together with their onset and offset
timestamps. Task 4 of the DCASE challenge has focused on weakly
supervised SED for several years. DCASE 2022 Task 4 [1] is a
follow-up to last year's challenge [2]. This year, in addition to
exploring a heterogeneous development dataset containing unlabeled,
synthetic and weakly labeled data, participants are allowed to
incorporate external datasets or pre-trained embeddings. As in last
year's challenge, SED systems are evaluated by the Polyphonic Sound
Detection Score (PSDS) [3] under two different real-life settings.
For weakly supervised SED, most existing works follow the Multiple
Instance Learning (MIL) framework and formulate SED as a seq2seq
classification task. They usually design Convolutional Neural
Networks (CNNs) or Convolutional Recurrent Neural Networks (CRNNs)
to obtain frame-level classification probabilities and then apply a
pooling mechanism to aggregate the frame-level predictions into
event-level results. However, such methods do not treat a sound
event as a whole and may therefore ignore global information such as
the correlation between frames or the event duration. Recently, an
event-wise model, namely SEDT, was proposed to handle these problems
[4]. It models SED as a set prediction problem and directly maps the
audio spectrogram to a set of candidate events, thus freeing SED
models from trivial post-processing, namely frame-level thresholding
or median filtering. An empirical study has shown that SEDT achieves
performance competitive with its frame-wise counterparts [4].
Moreover, we find that the two models can complement each other, as
they solve the SED task in different ways. Combining them is
therefore an intuitive way to improve SED performance.
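To make the frame-wise post-processing mentioned above concrete, the
following is a minimal sketch of thresholding followed by median
filtering for a single event class. It is an illustration only, not
the code of the submitted system: the function names, the threshold
of 0.5, the filter width of 3 frames and the hop of 0.02 s are all
assumptions chosen for the example.

```python
from statistics import median


def median_filter(seq, width):
    """Sliding median over an odd-width window, zero-padded at both ends."""
    half = width // 2
    padded = [0] * half + list(seq) + [0] * half
    return [median(padded[i:i + width]) for i in range(len(seq))]


def frames_to_events(probs, threshold=0.5, filter_width=3, hop_seconds=0.02):
    """Convert frame-level probabilities of one class into (onset, offset) pairs.

    All default values are illustrative assumptions, not settings from the paper.
    """
    # Step 1: thresholding gives a binary activity sequence.
    active = [1 if p > threshold else 0 for p in probs]
    # Step 2: median filtering removes isolated spikes and fills short gaps.
    smoothed = median_filter(active, filter_width)
    # Step 3: each run of consecutive active frames becomes one event.
    events, start = [], None
    for i, a in enumerate(smoothed + [0]):  # trailing 0 closes an open event
        if a and start is None:
            start = i
        elif not a and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    return events
```

In this sketch the event boundaries depend entirely on the chosen
threshold and filter width, which is the kind of hand-tuned step that
an event-wise model avoids by predicting boundaries directly.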
In this paper, we describe our system participating in DCASE 2022
Task 4. It is a combination of SEDT and a frame-wise CNN model. For
SEDT, specially designed training strategies, including supervised
learning, self-supervised learning and semi-supervised learning, are
studied to help it learn from the heterogeneous development dataset.
For the frame-wise CNN model, metric learning is applied to narrow
the domain gap between real and synthetic data, a mean-teacher
framework is implemented to provide supervision for unlabeled data,
and a tag-conditioned CNN model is used to generate the final
predictions based on audio tags. After training each model, we
explore the fusion strategy and post-processing methods of the
ensemble model. With these methods, the hybrid system achieves
competitive results on the validation dataset.
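Both the mean-teacher framework and the online teacher used for SEDT
rely on an Exponential Moving Average of the student weights. The
sketch below shows one such update step under simplifying
assumptions: parameters are stored as plain floats in dictionaries,
and the function name and the decay value of 0.999 are hypothetical
choices for illustration rather than the settings of our system.

```python
def ema_update(teacher, student, decay=0.999):
    """One EMA step: teacher <- decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to floats (an assumption
    made to keep the sketch framework-free). The teacher is updated in
    place and also returned.
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher
```

A decay close to 1 makes the teacher change slowly, so its
predictions are smoother over training steps than those of the
student, which is why they can serve as targets for unlabeled data.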
2. SEMI-SUPERVISED SEDT
2.1. Sound Event Detection Transformer
An overview of SEDT is shown in Fig. 1. It represents each sound
event as $y_i = (c_i, b_i)$, where $c_i$ is the event category and
$b_i = (m_i, l_i)$ denotes the event temporal boundary, consisting of
the normalized event center $m_i$ and duration $l_i$, and directly
seeks a mapping between input features and ground-truth events. Given
the input spectrogram, a backbone CNN extracts its feature map, which
is then added to a one-dimensional positional encoding and fed into
the transformer encoder for further feature processing. The
transformer decoder takes $N+1$ learnable embeddings ($N$ event
queries and 1 audio query) as input queries, where each of them
arXiv:2210.09529v1 [cs.SD] 18 Oct 2022
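The boundary representation $b_i = (m_i, l_i)$ defined above can be
converted to and from onset and offset times as sketched below. This
is a hedged illustration: the function names are hypothetical, and
the clip length of 10 s is an assumption used only to undo the
normalization.

```python
def to_onset_offset(center, duration, clip_seconds=10.0):
    """Map a normalized (center, duration) pair to (onset, offset) in seconds.

    `clip_seconds` is an assumed clip length; results are clipped to the clip.
    """
    onset = (center - duration / 2.0) * clip_seconds
    offset = (center + duration / 2.0) * clip_seconds
    return max(0.0, onset), min(clip_seconds, offset)


def to_center_duration(onset, offset, clip_seconds=10.0):
    """Inverse mapping: (onset, offset) in seconds to normalized (center, duration)."""
    center = (onset + offset) / (2.0 * clip_seconds)
    duration = (offset - onset) / clip_seconds
    return center, duration
```

For example, with a 10 s clip, a center of 0.5 and a duration of 0.25
describe an event from 3.75 s to 6.25 s.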