
[Fig. 1 diagram: a log mel spectrogram is mapped by a patch embedding layer with positional embedding into the encoder (MHA, Add & Norm, and feed forward layers, repeated N times); the decoder contains a normal sequence branch and a reverse sequence branch fed by the input label sequence (masked MHA with forward/backward masks, MHA, and feed forward layers, repeated M times, with shared attention weights); the branch outputs pass through GCMLP, i.e., gated MLP blocks with shared weights (FC1, FC2, gated FC3, softmax).]
Fig. 1: The proposed gated contextual Transformer and gated MLP block. In the mask matrices, the red, gray, and white blocks represent the positions corresponding to the target to be predicted, the positions of masked data, and the positions of available data, respectively.
2. GATED CONTEXTUAL TRANSFORMER (GCT)
Following the approach in cTransformer [2], the sequence of
event start boundaries is used as the sequential label. For
example, the normal sequential label l is “<S>, event1,
event2, ..., eventk, <E>”, where k is the number of events, and <S>
and <E> are the tokens indicating the start and end of the se-
quence, respectively. The reverse sequential label l′ is “<S′>,
eventk, eventk−1, ..., event1, <E>”, where <S′> is the token indi-
cating the start of the reverse sequence.
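As a concrete illustration, the sketch below builds both label sequences from an ordered event list; the token strings and the helper function are illustrative placeholders, not taken from the authors' code.

```python
# Illustrative sketch: constructing the normal label l and the reverse label l'
# from events ordered by their start boundaries. Token strings are placeholders.

def build_label_sequences(events):
    normal = ["<S>"] + events + ["<E>"]            # l : <S>, event1, ..., eventk, <E>
    reverse = ["<S'>"] + events[::-1] + ["<E>"]    # l': <S'>, eventk, ..., event1, <E>
    return normal, reverse

l, l_rev = build_label_sequences(["event1", "event2", "event3"])
# l     -> ['<S>', 'event1', 'event2', 'event3', '<E>']
# l_rev -> ["<S'>", 'event3', 'event2', 'event1', '<E>']
```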
2.1. Encoder and Decoder of GCT
Encoder. There are two options for the encoder input: 1)
the entire spectrogram of the audio clip, as in cTransformer
[2]; 2) a patch sequence obtained by dividing the spectrogram
into patches, as in AST [10]. Inputting the entire clip enables
the encoder to directly utilize the global audio information of
events. However, the patch sequence may help the model to
align acoustic patch sequences with the corresponding event
label sequences. Fig. 1 shows the structure of GCT with in-
put patches. Referring to AST, GCT uses a patch embedding
layer to map the patches containing basic acoustic features
into high-level representations, and uses updatable positional
embedding (Pos emb) to capture the spatial information of
each patch. When inputting clips, the Pos emb before the en-
coder is removed, and the patch embedding layer is replaced
with a linear layer to keep the encoder input dimension con-
sistent with the input patches. The encoder consists of N identi-
cal blocks, each with a multi-head attention layer (MHA) and a feed
forward layer (FF) with layer normalization, which are analo-
gous to the encoder in Transformer [11], so all the parameters
in MHA and FF follow Transformer’s default setting.
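For concreteness, a minimal PyTorch sketch of such an encoder with patch input is given below; the dimensions, number of heads, and block count are placeholder assumptions rather than the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class GCTEncoder(nn.Module):
    """Simplified sketch of the GCT encoder with patch input (sizes are placeholders)."""

    def __init__(self, patch_dim=16 * 16, d_model=512, n_heads=8, num_blocks=6, max_patches=512):
        super().__init__()
        # Patch embedding layer: maps flattened spectrogram patches to d_model-dim representations.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        # Updatable (learnable) positional embedding for the patch sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_patches, d_model))
        # N identical blocks with MHA and feed-forward layers plus layer normalization,
        # analogous to the standard Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_blocks)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim), flattened log mel spectrogram patches
        x = self.patch_embed(patches) + self.pos_embed[:, : patches.size(1)]
        return self.blocks(x)
```

For clip input, the patch embedding layer above would be swapped for a linear layer over spectrogram frames and the positional embedding dropped, as described in the text.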
Decoder. The bidirectional decoder in GCT consists of a
normal sequence branch and a reverse sequence branch. To
facilitate information exchange between branches, in Fig. 1,
each branch consists of M identical blocks in series, and each
block contains a similar structure with a masked MHA, an
MHA, and an FF layer. To preserve the autoregressive property [11] of
Transformer-based models, forward and backward masks are
used to block future and past information about the target in
the normal and reverse branches, respectively, to maintain the
sequential information between events in a sequence. With
the combined effect of the forward and backward mask ma-
trices, the normal and reverse sequence branches will infer the
same target at each time step. Thus, the decoder can extract
forward and backward information about the target. Further-
more, the weights shared between branches will facilitate the
model to capture contextual information about the target.
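One way to realize the two masks is as a causal (forward) and an anti-causal (backward) self-attention mask; the exact indexing used in the paper is not spelled out in this section, so the sketch below is an assumption.

```python
import torch

def forward_mask(T):
    # Causal mask for the normal branch: True marks blocked (future) positions,
    # so step t can only attend to steps <= t.
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

def backward_mask(T):
    # Anti-causal mask for the reverse branch: True marks blocked (past) positions,
    # so step t can only attend to steps >= t.
    return torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)

# With T = 4, row 1 of forward_mask(4) blocks columns 2-3, while row 1 of
# backward_mask(4) blocks column 0, so the two branches each see only one
# side of the same target at every time step.
```

Both masks can be passed as the boolean attn_mask argument of a masked MHA layer, where True marks positions that may not be attended to.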
2.2. Gated contextual multi-layer perceptron (GCMLP)
GCMLP aims to perform the final conditioning of the decoder
output based on the gated MLP (gMLP) block and shared
weights while considering the contextual information about
the target to achieve more accurate predictions. In Fig. 1,
gMLP consists of 3 fully-connected (FC) layers of the same
size. Denote the input as X and the weights of FC1, FC2, and
FC3 as W1, W2, and W3; the corresponding outputs are
F1 = W1X, F2 = ReLU(W2F1), and λ = σ(W3F1), where
σ is the logistic sigmoid, σ(x) = 1/(1 + e−x), and ReLU is the
activation function [12]. Then, the output of gMLP is

gMLP = Softmax((1 − λ) ⊙ F2 + λ ⊙ F1)    (1)
where ⊙ is the element-wise product. F2 is a higher-level
representation of the target based on F1 and can be viewed
as the target's embedding from another perspective. FC3 evaluates
the relative importance of each element in F1, which is then
combined with F2 according to the estimated importance of
each element. That is, gMLP is used to generate multi-view
results and fuse them via the learnable gate unit.
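A minimal PyTorch sketch of the gMLP block described by Eq. (1) could look as follows; the layer biases and the feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Sketch of the gMLP block: three same-sized FC layers with a learnable gate."""

    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)  # W1
        self.fc2 = nn.Linear(dim, dim)  # W2
        self.fc3 = nn.Linear(dim, dim)  # W3 (gate)

    def forward(self, x):
        f1 = self.fc1(x)                    # F1 = W1 X
        f2 = torch.relu(self.fc2(f1))       # F2 = ReLU(W2 F1)
        gate = torch.sigmoid(self.fc3(f1))  # lambda = sigmoid(W3 F1)
        # Eq. (1): fuse the two views with the learnable gate, then Softmax.
        return torch.softmax((1 - gate) * f2 + gate * f1, dim=-1)
```

In GCMLP, such blocks are applied to the outputs of both decoder branches with shared weights, as indicated in Fig. 1.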
At each time step, denote the outputs of the normal and re-
verse branches after GCMLP as p and p′, and the correspond-
ing ground-truth labels as y and y′, respectively. Cross en-
tropy (CE) [11] loss is used in GCT: Lnormal = CE(p, y),
Lreverse = CE(p′, y′). To further align the classification spaces
of the two branches so that the model focuses on the con-
textual information of the same target, the mean squared error
(MSE) [13] is used as a context loss to measure the distance
between p and p′ in the latent space: Lcontext = MSE(p′, p).
Hence, the total loss of GCT is L = Lnormal + Lreverse + Lcontext.
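As a sketch, the total loss could be assembled as follows, assuming p and p′ are the softmax outputs of GCMLP and y, y′ are integer class indices; the small eps term is only an assumed numerical-stability guard.

```python
import torch
import torch.nn.functional as F

def gct_loss(p, p_rev, y, y_rev, eps=1e-8):
    # p, p_rev: (N, C) softmax outputs of the normal/reverse branches via GCMLP
    # y, y_rev: (N,) ground-truth token indices of the normal/reverse label sequences
    l_normal = F.nll_loss(torch.log(p + eps), y)            # CE(p, y)
    l_reverse = F.nll_loss(torch.log(p_rev + eps), y_rev)   # CE(p', y')
    l_context = F.mse_loss(p_rev, p)                        # MSE(p', p) context loss
    return l_normal + l_reverse + l_context
```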
2.3. Forward-backward inference (FBI)
To utilize both the normal and reverse branches in inference,
we propose FBI to make the two branches infer the same target
at each step and fuse their predictions to form the final output.
While preserving the autoregressive property [11], FBI inte-
grates the forward and the backward sequence information
implied in the normal and reverse branches during inference,