
[Fig. 1 diagram: a log mel spectrogram is mapped by a patch embedding layer with positional embedding into the encoder (MHA, Add & Norm, and feed forward layers, repeated N times); the decoder contains a normal sequence branch and a reverse sequence branch fed by the input label sequence (masked MHA with forward/backward masks, MHA, and feed forward layers, repeated M times, with shared attention weights); the branch outputs pass through GCMLP, i.e., gated MLP blocks with shared weights (FC1, FC2, gated FC3, softmax).]
Fig. 1: The proposed gated contextual Transformer and gated MLP block. In the mask matrices, the red, gray, and white blocks represent the positions corresponding to the target to be predicted, the positions of masked data, and the positions of available data, respectively.
2. GATED CONTEXTUAL TRANSFORMER (GCT)
Following the approach in cTransformer [2], the sequence of
event start boundaries is used as the sequential label. For
example, the normal sequential label l is “<S>, event1,
event2, ..., eventk, <E>”, where k is the number of events, and <S>
and <E> are the tokens indicating the start and end of the se-
quence, respectively. The reverse sequential label l′ is “<S′>,
eventk, eventk−1, ..., event1, <E>”, where <S′> is the token indi-
cating the start of the reverse sequence.
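As a concrete illustration, the sketch below builds both label sequences from an ordered event list; the token strings and the helper function are illustrative placeholders, not taken from the authors' code.

```python
# Illustrative sketch: constructing the normal label l and the reverse label l'
# from events ordered by their start boundaries. Token strings are placeholders.

def build_label_sequences(events):
    normal = ["<S>"] + events + ["<E>"]            # l : <S>, event1, ..., eventk, <E>
    reverse = ["<S'>"] + events[::-1] + ["<E>"]    # l': <S'>, eventk, ..., event1, <E>
    return normal, reverse

l, l_rev = build_label_sequences(["event1", "event2", "event3"])
# l     -> ['<S>', 'event1', 'event2', 'event3', '<E>']
# l_rev -> ["<S'>", 'event3', 'event2', 'event1', '<E>']
```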
2.1. Encoder and Decoder of GCT
Encoder. There are two options for the encoder input: 1)
the entire spectrogram of the audio clip, as in cTransformer
[2]; 2) a patch sequence obtained by dividing the spectrogram
into patches, as in AST [10]. Inputting the entire clip enables
the encoder to directly utilize the global audio information of
events. However, the patch sequence may help the model to
align acoustic patch sequences with the corresponding event
label sequences. Fig. 1 shows the structure of GCT with in-
put patches. Referring to AST, GCT uses a patch embedding
layer to map the patches containing basic acoustic features
into high-level representations, and uses updatable positional
embedding (Pos emb) to capture the spatial information of
each patch. When inputting clips, the Pos emb before the en-
coder is removed, and the patch embedding layer is replaced
with a linear layer to keep the encoder input dimension con-
sistent with the input patches. The encoder consists of N identi-
cal blocks, each with a multi-head attention layer (MHA) and a feed
forward layer (FF) with layer normalization, which are analo-
gous to the encoder in Transformer [11], so all the parameters
in MHA and FF follow Transformer’s default setting.
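For concreteness, a minimal PyTorch sketch of such an encoder with patch input is given below; the dimensions, number of heads, and block count are placeholder assumptions rather than the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class GCTEncoder(nn.Module):
    """Simplified sketch of the GCT encoder with patch input (sizes are placeholders)."""

    def __init__(self, patch_dim=16 * 16, d_model=512, n_heads=8, num_blocks=6, max_patches=512):
        super().__init__()
        # Patch embedding layer: maps flattened spectrogram patches to d_model-dim representations.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        # Updatable (learnable) positional embedding for the patch sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_patches, d_model))
        # N identical blocks with MHA and feed-forward layers plus layer normalization,
        # analogous to the standard Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_blocks)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim), flattened log mel spectrogram patches
        x = self.patch_embed(patches) + self.pos_embed[:, : patches.size(1)]
        return self.blocks(x)
```

For clip input, the patch embedding layer above would be swapped for a linear layer over spectrogram frames and the positional embedding dropped, as described in the text.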
Decoder. The bidirectional decoder in GCT consists of a
normal sequence branch and a reverse sequence branch. To
facilitate information exchange between branches, in Fig. 1,
each branch consists of M identical blocks in series, and each
block contains a similar structure with a masked MHA, an
MHA, and an FF layer. To preserve the autoregressive property [11] of
Transformer-based models, forward and backward masks are
used to block future and past information about the target in
the normal and reverse branches, respectively, to maintain the
sequential information between events in a sequence. With
the combined effect of the forward and backward mask ma-
trices, the normal and reverse sequence branches will infer the
same target at each time step. Thus, the decoder can extract
forward and backward information about the target. Further-
more, the weights shared between branches will facilitate the
model to capture contextual information about the target.
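One way to realize the two masks is as a causal (forward) and an anti-causal (backward) self-attention mask; the exact indexing used in the paper is not spelled out in this section, so the sketch below is an assumption.

```python
import torch

def forward_mask(T):
    # Causal mask for the normal branch: True marks blocked (future) positions,
    # so step t can only attend to steps <= t.
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

def backward_mask(T):
    # Anti-causal mask for the reverse branch: True marks blocked (past) positions,
    # so step t can only attend to steps >= t.
    return torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)

# With T = 4, row 1 of forward_mask(4) blocks columns 2-3, while row 1 of
# backward_mask(4) blocks column 0, so the two branches each see only one
# side of the same target at every time step.
```

Both masks can be passed as the boolean attn_mask argument of a masked MHA layer, where True marks positions that may not be attended to.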
2.2. Gated contextual multi-layer perceptron (GCMLP)
GCMLP aims to perform the final conditioning of the decoder
output based on the gated MLP (gMLP) block and shared
weights while considering the contextual information about
the target to achieve more accurate predictions. In Fig. 1,
gMLP consists of 3 fully-connected (FC) layers of the same
size. Denote the input as X and the weights of FC1, FC2, and
FC3 as W1, W2, and W3; the corresponding outputs are
F1 = W1X, F2 = ReLU(W2F1), and λ = σ(W3F1), where
σ is the logistic sigmoid, σ(x) = 1/(1 + e−x), and ReLU is the
activation function [12]. Then, the output of gMLP is

gMLP = Softmax((1 − λ) ⊙ F2 + λ ⊙ F1)    (1)
where ⊙ is the element-wise product. F2 is a higher-level
representation of the target based on F1 and can be viewed
as the target's embedding from another perspective. FC3 evaluates
the relative importance of each element in F1, which is then
combined with F2 according to the estimated importance of
each element. That is, gMLP is used to generate multi-view
results and fuse them via the learnable gate unit.
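A minimal PyTorch sketch of the gMLP block described by Eq. (1) could look as follows; the layer biases and the feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Sketch of the gMLP block: three same-sized FC layers with a learnable gate."""

    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)  # W1
        self.fc2 = nn.Linear(dim, dim)  # W2
        self.fc3 = nn.Linear(dim, dim)  # W3 (gate)

    def forward(self, x):
        f1 = self.fc1(x)                    # F1 = W1 X
        f2 = torch.relu(self.fc2(f1))       # F2 = ReLU(W2 F1)
        gate = torch.sigmoid(self.fc3(f1))  # lambda = sigmoid(W3 F1)
        # Eq. (1): fuse the two views with the learnable gate, then Softmax.
        return torch.softmax((1 - gate) * f2 + gate * f1, dim=-1)
```

In GCMLP, such blocks are applied to the outputs of both decoder branches with shared weights, as indicated in Fig. 1.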
At each time step, denote the outputs of the normal and re-
verse branches after GCMLP as p and p′, and the correspond-
ing ground-truth labels as y and y′, respectively. Cross en-
tropy (CE) [11] loss is used in GCT: Lnormal = CE(p, y),
Lreverse = CE(p′, y′). To further align the classification spaces
of the two branches so that the model focuses on the con-
textual information of the same target, the mean squared error
(MSE) [13] is used as a context loss to measure the distance
between p and p′ in the latent space: Lcontext = MSE(p′, p).
Hence, the total loss of GCT is L = Lnormal + Lreverse + Lcontext.
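As a sketch, the total loss could be assembled as follows, assuming p and p′ are the softmax outputs of GCMLP and y, y′ are integer class indices; the small eps term is only an assumed numerical-stability guard.

```python
import torch
import torch.nn.functional as F

def gct_loss(p, p_rev, y, y_rev, eps=1e-8):
    # p, p_rev: (N, C) softmax outputs of the normal/reverse branches via GCMLP
    # y, y_rev: (N,) ground-truth token indices of the normal/reverse label sequences
    l_normal = F.nll_loss(torch.log(p + eps), y)            # CE(p, y)
    l_reverse = F.nll_loss(torch.log(p_rev + eps), y_rev)   # CE(p', y')
    l_context = F.mse_loss(p_rev, p)                        # MSE(p', p) context loss
    return l_normal + l_reverse + l_context
```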
2.3. Forward-backward inference (FBI)
To utilize both the normal and reverse branches in inference,
we propose FBI to make the two branches infer the same target
at each step and fuse their predictions to form the final output.
While preserving the autoregressive property [11], FBI inte-
grates the forward and the backward sequence information
implied in the normal and reverse branches during inference,