GCT: GATED CONTEXTUAL TRANSFORMER FOR SEQUENTIAL AUDIO TAGGING
Yuanbo Hou1, Yun Wang2, Wenwu Wang3, Dick Botteldooren1
1WAVES, Ghent University, Belgium 2Meta AI, USA 3CVSSP, University of Surrey, UK
ABSTRACT
Audio tagging aims to assign predefined tags to audio clips to indicate the class information of audio events. Sequential audio tagging (SAT) means detecting both the class information of audio events and the order in which they occur within the audio clip. Most existing methods for SAT are based on connectionist temporal classification (CTC). However, CTC cannot effectively capture connections between events due to the conditional independence assumption between outputs at different times. The contextual Transformer (cTransformer) addresses this issue by exploiting contextual information in SAT. Nevertheless, cTransformer is also limited in exploiting contextual information, as it only uses forward information in inference. This paper proposes a gated contextual Transformer (GCT) with forward-backward inference (FBI). In addition, a gated contextual multi-layer perceptron (GCMLP) block is proposed in GCT to improve the performance of cTransformer structurally. Experiments on two real-life audio datasets show that the proposed GCT with GCMLP and FBI performs better than the CTC-based methods and cTransformer. To promote research on SAT, the manually annotated sequential labels for the two datasets are released.
Index Terms— Sequential audio tagging, connectionist temporal classification, gated contextual Transformer
1. INTRODUCTION
Audio tagging (AT) [1] is a fundamental task in audio classification, where events in audio clips are identified via multi-label classification. Sequential audio tagging (SAT) [2] aims to mine both the type information of events and the order information between events. With SAT, the number of events can be estimated as a byproduct. Sequence-level AT offers more information about the audio clip than temporally agnostic event-level AT. SAT can not only provide information that is useful for AT tasks [1], but also support tasks like audio captioning [3], event anticipation [4], and scene perception [5].
Prior works on SAT mostly use connectionist temporal classification (CTC) [6] at their core. To perform SAT on polyphonic audio clips, sequential labels are introduced in [7] to train CTC-based convolutional recurrent neural networks to tag diverse event sequences. The polyphony of audio makes it hard to define the order of events; therefore, the order of the beginning and end boundaries of events is used as sequential labels in [7]. The double-boundary sequential labels are also used to train a CTC-based recurrent neural network equipped with long short-term memory (BLSTM-CTC) [8] to perform sound event detection (SED), which detects the type and temporal position of events in audio clips. Apart from these methods using double-boundary labels, single-boundary sequential labels (sequences of the start boundaries of events) are exploited in a 2-stage CTC-based method [9] for SED. However, CTC-based methods have difficulty modeling the contextual information in event sequences because CTC implicitly assumes that the network outputs are conditionally independent at each time step [6]. To take advantage of context, a contextual Transformer (cTransformer) [2] was proposed to explore bidirectional information in event sequences and make more effective use of the contextual information in SAT.
To exploit both the forward and backward information of events, cTransformer uses a bidirectional decoder to model the correlations between preceding and following events in both directions during training, while only the forward direction of the decoder is used in inference [2]. Hence, in inference, cTransformer does not utilize the event sequence information contained in the reverse sequence branch. To address this limitation, we propose a gated contextual Transformer (GCT) with a forward-backward inference (FBI) algorithm to infer the target event from both directions, so that the context of the event is incorporated during inference. In addition, to enhance the decoder's power to capture the context implicit in event sequences, a gated contextual multi-layer perceptron (GCMLP) block is proposed to adapt the contextual information for the estimation of final predictions.
The contributions of this work are: 1) we propose GCT equipped with GCMLP and FBI to improve cTransformer's ability to capture contextual information of event sequences; 2) we explore the effect of pretrained weights on modules of GCT under two transfer learning modes to gain insight into the role of GCT modules; 3) we visualize the attention distribution in hidden layers of GCT to investigate how GCT connects acoustic representations with clip-level event tokens and bidirectionally infers the event sequences; 4) to evaluate the performance of GCT, we sequentially annotate a polyphonic audio dataset and release it. We compare the performance of GCT, cTransformer, and CTC-based methods on two real-life datasets. This paper is organized as follows: Section 2 introduces the GCT; Section 3 describes the dataset and experimental setup, and analyzes the results; Section 4 gives conclusions.
arXiv:2210.12541v1 [cs.SD] 22 Oct 2022
[Figure 1: architecture diagram of GCT, showing the log mel spectrogram input, patch embedding layer, positional embeddings, the encoder (MHA and feed-forward blocks), the bidirectional decoder with normal and reverse sequence branches using forward/backward masks and shared attention weights, and the GCMLP block with gated MLPs (FC1, FC2, gated FC3, softmax).]
Fig. 1: The proposed gated contextual Transformer and gated MLP block. In the mask matrices, the red, gray, and white blocks represent the positions corresponding to the target to be predicted, the positions of masked data, and the positions of available data, respectively.
2. GATED CONTEXTUAL TRANSFORMER (GCT)
Following the approach in cTransformer [2], the sequence of event start boundaries is used as the sequential label. For example, the normal sequential label l is "<S>, event1, event2, ..., eventk, <E>", where k is the number of events, and <S> and <E> are the tokens indicating the start and end of the sequence, respectively. The reverse sequential label l′ is "<S′>, eventk, eventk−1, ..., event1, <E>", where <S′> is the token indicating the start of the reverse sequence.
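The normal and reverse label sequences described above can be sketched as follows; the token names (<S>, <E>, <S′>) follow the text, and the event names are illustrative placeholders:

```python
# Build the normal and reverse sequential labels from a list of event
# start-boundary tokens. Note that the reverse sequence gets its own
# start token <S'> but shares the end token <E> with the normal one.
def make_sequential_labels(events):
    """Return (normal, reverse) token sequences for a list of event names."""
    normal = ["<S>"] + events + ["<E>"]
    reverse = ["<S'>"] + list(reversed(events)) + ["<E>"]
    return normal, reverse

normal, reverse = make_sequential_labels(["speech", "dog_bark", "car_horn"])
print(normal)   # ['<S>', 'speech', 'dog_bark', 'car_horn', '<E>']
print(reverse)  # ["<S'>", 'car_horn', 'dog_bark', 'speech', '<E>']
```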
2.1. Encoder and Decoder of GCT
Encoder. There are two options for the encoder input: 1) the entire spectrogram of the audio clip, as in cTransformer [2]; 2) a patch sequence obtained by dividing the spectrogram into patches, as in AST [10]. Inputting the entire clip enables the encoder to directly utilize the global audio information of events, while the patch sequence may help the model align acoustic patch sequences with the corresponding event label sequences. Fig. 1 shows the structure of GCT with input patches. Following AST, GCT uses a patch embedding layer to map the patches containing basic acoustic features into high-level representations, and uses a learnable positional embedding (Pos emb) to capture the spatial information of each patch. When inputting entire clips, the Pos emb before the encoder is removed, and the patch embedding layer is replaced with a linear layer to keep the encoder input dimension consistent with that of input patches. The encoder consists of N identical blocks, each with a multi-head attention layer (MHA) and a feed-forward layer (FF) with layer normalization, analogous to the encoder in Transformer [11], so all the parameters in MHA and FF follow Transformer's default settings.
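The patch-based input option can be illustrated with a minimal NumPy sketch. The 16×16 patch size and the non-overlapping tiling here are assumptions for illustration only (AST itself uses overlapping 16×16 patches, and the exact patching in GCT is not specified in this excerpt):

```python
import numpy as np

def split_into_patches(spec, ph=16, pw=16):
    """Tile a (n_mels, n_frames) log mel spectrogram into flat patches.

    Returns an array of shape (n_patches, ph*pw); ragged edges that do
    not fill a whole patch are dropped for simplicity.
    """
    H, W = spec.shape
    H, W = H - H % ph, W - W % pw          # trim to a multiple of the patch size
    spec = spec[:H, :W]
    patches = (spec.reshape(H // ph, ph, W // pw, pw)
                   .transpose(0, 2, 1, 3)  # group the two grid axes together
                   .reshape(-1, ph * pw))
    return patches

spec = np.random.randn(64, 100)            # e.g. 64 mel bins, 100 frames
patches = split_into_patches(spec)
print(patches.shape)  # (24, 256): a 4 x 6 grid of 16x16 patches
```

Each row of `patches` would then pass through the patch embedding layer before the positional embedding is added.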
Decoder. The bidirectional decoder in GCT consists of a normal sequence branch and a reverse sequence branch. To facilitate information exchange between branches, as shown in Fig. 1, each branch consists of M identical blocks in series, and each block contains a similar structure of masked MHA, MHA, and FF. To preserve the autoregressive property [11] of Transformer-based models, forward and backward masks are used to block future and past information about the target in the normal and reverse branches, respectively, maintaining the sequential information between events in a sequence. Under the combined effect of the forward and backward mask matrices, the normal and reverse sequence branches infer the same target at each time step. Thus, the decoder can extract both forward and backward information about the target. Furthermore, the weights shared between branches help the model capture contextual information about the target.
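As a rough sketch of the two mask matrices, a causal (lower-triangular) mask blocks future positions for the normal branch, while its transpose blocks past positions for the reverse branch. The exact alignment of the masks to the shared target in GCT may differ from this simplified version:

```python
import numpy as np

def make_masks(T):
    """Forward and backward attention masks for a length-T sequence.

    True = attendable position. The forward mask lets step t see
    positions <= t; the backward mask lets step t see positions >= t.
    """
    forward = np.tril(np.ones((T, T), dtype=bool))
    backward = np.triu(np.ones((T, T), dtype=bool))
    return forward, backward

fwd, bwd = make_masks(4)
print(fwd.astype(int))  # lower-triangular 4x4 matrix of ones
```

Together the two masks cover the full sequence around each step, which is what lets the decoder gather both forward and backward context for the same target.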
2.2. Gated contextual multi-layer perceptron (GCMLP)
GCMLP performs the final conditioning of the decoder output based on the gated MLP (gMLP) block and shared weights, while considering the contextual information about the target, to achieve more accurate predictions. In Fig. 1, gMLP consists of 3 fully-connected (FC) layers of the same size. Denote the input as X and the weights of FC1, FC2, and FC3 as W1, W2, and W3; then the corresponding outputs are F1 = W1·X, F2 = ReLU(W2·F1), and λ = σ(W3·F1), where σ is the logistic sigmoid σ(x) = 1/(1 + e^(−x)) and ReLU is the activation function [12]. The output of gMLP is

gMLP = Softmax((1 − λ) ⊙ F2 + λ ⊙ F1)   (1)

where ⊙ is the element-wise product. F2 is a higher-level representation of the target based on F1, and can be viewed as the target's embedding from another perspective. FC3 evaluates the relative importance of each element in F1, which is then combined with F2 according to the estimated importance of each element. That is, gMLP generates multi-view results and fuses them via the learnable gate unit.
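Eq. (1) can be written out directly in NumPy. This is a minimal sketch with biases omitted and random weights standing in for the trained FC layers:

```python
import numpy as np

def gmlp(x, W1, W2, W3):
    """Gated MLP of Eq. (1): gate lambda fuses F1 and F2 element-wise."""
    F1 = W1 @ x
    F2 = np.maximum(W2 @ F1, 0.0)               # ReLU(W2 F1)
    lam = 1.0 / (1.0 + np.exp(-(W3 @ F1)))      # sigma(W3 F1), the gate
    z = (1.0 - lam) * F2 + lam * F1             # element-wise fusion
    e = np.exp(z - z.max())                     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
W1, W2, W3 = (rng.standard_normal((d, d)) for _ in range(3))
p = gmlp(rng.standard_normal(d), W1, W2, W3)
print(round(float(p.sum()), 6))  # 1.0 (softmax output is a distribution)
```

When λ is near 1 the gate passes F1 through; when it is near 0 the higher-level view F2 dominates, which is the multi-view fusion the text describes.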
At each time step, denote the outputs of the normal and reverse branches after GCMLP as p and p′, with corresponding ground-truth labels y and y′, respectively. Cross-entropy (CE) [11] loss is used in GCT: L_normal = CE(p, y), L_reverse = CE(p′, y′). To further align the classification spaces of the two branches, allowing the model to focus on the contextual information of the same target, the mean squared error (MSE) [13] is used as a context loss to measure the distance between p and p′ in the latent space: L_context = MSE(p′, p). Hence, the total loss of GCT is L = L_normal + L_reverse + L_context.
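The combined loss for a single time step can be sketched as below. The unit weighting of the three terms follows the text; the small epsilon inside the log is a numerical-stability assumption, not part of the paper:

```python
import numpy as np

def gct_loss(p, p_rev, y, y_rev, eps=1e-12):
    """L = CE(p, y) + CE(p', y') + MSE(p', p) for one time step.

    p, p_rev are softmax outputs of the two branches; y, y_rev are
    one-hot targets for the same event seen from each direction.
    """
    ce = lambda q, t: -float(np.sum(t * np.log(q + eps)))
    l_normal = ce(p, y)
    l_reverse = ce(p_rev, y_rev)
    l_context = float(np.mean((p_rev - p) ** 2))
    return l_normal + l_reverse + l_context

p = np.array([0.7, 0.2, 0.1]); p_rev = np.array([0.6, 0.3, 0.1])
y = np.array([1.0, 0.0, 0.0]); y_rev = np.array([1.0, 0.0, 0.0])
loss = gct_loss(p, p_rev, y, y_rev)
```

The MSE term is what couples the branches: it is zero only when both directions assign identical probabilities to the shared target.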
2.3. Forward-backward inference (FBI)
To utilize both the normal and reverse branches in inference, we propose FBI, which makes the two branches infer the same target at each step and fuses their predictions to form the final output. While preserving the autoregressive property [11], FBI integrates the forward and backward sequence information implied in the normal and reverse branches during inference.