2 Related Work
Video and language understanding has been extensively investigated due to the wide range of potential applications in human-computer interaction. Tasks such as video captioning [44, 49, 5], video question-answering [19, 48, 31, 20], and video dialog [1, 13, 24] study the complex interplay between the vision and natural language modalities. In the case of video question-answering, effective performance depends on extracting a strong visual representation for the input video and efficiently fusing it with the associated text. For video dialog, Alamri et al. [1] introduced the Audio-Visual Scene-Aware Dialog (AVSD) task as a multi-modal learning problem, the objective of which is to answer a question based on a short video, its associated audio, and a dialog history. The task supports a discriminative setting, where the model ranks a list of candidate answers [1, 29], or a generative setting, where a decoder is trained to auto-regressively generate an answer [23, 11].
Self-attention models, known as transformers [42], have been very successful at generating deep contextual linguistic representations. They are generally pre-trained with self-supervised learning on very large unlabelled text corpora, and subsequently fine-tuned on downstream tasks. They deliver state-of-the-art results for several natural language understanding and generation tasks [42, 33, 32, 8]. In our work, we utilize a pre-trained BERT [8] model to encode the input question and the dialog history.
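As a concrete illustration of this encoding step, the following is a minimal sketch using the Hugging Face transformers API with the bert-base-uncased checkpoint; the checkpoint name and the way the dialog history and question are concatenated are assumptions for exposition, not the exact input format used in our model.

```python
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the text only specifies a generic pre-trained BERT encoder.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical dialog history (caption plus previous Q/A rounds) and follow-up question.
dialog_history = "a man walks into the kitchen. Q: what is he holding? A: a cup."
question = "does he drink from the cup?"

# Encode history and question as a sentence pair; BERT inserts [SEP] between the two segments.
inputs = tokenizer(dialog_history, question, return_tensors="pt", truncation=True)
text_features = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```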
Inspired by this success, a large body of work has adapted self-attention models to multi-modal learning, including image question answering [27, 7, 22, 21, 41], image dialog [7], video question answering [39, 38], and video dialog [23, 3, 17]. In general, these approaches can be categorized into single-stream and two-stream networks.
In the two-stream approach, each modality is independently encoded by a transformer-based network, and information is fused through concatenation or cross-attention [41, 27]. In the single-stream approach, Li et al. [22], Su et al. [37], and Li et al. [29] utilize a unified transformer network where video and text tokens are combined into one sequence.
Following the two-stream approach, Luo et al. [28] extract the visual and text features with modality-specific encoders and then fuse them jointly via a transformer-based cross-encoder. This study builds on the model proposed in [28] and extends it in two ways: first, a 3D-CNN network is added to the backbone visual encoder; second, an audio transformer-based encoder is added to learn a representation from the audio signal, which is combined with the other modalities via a cross-encoder. The different encoders and the decoder are jointly trained in an end-to-end fashion, and the experiments demonstrate the benefits of this approach.
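To make this design concrete, the sketch below organizes the components described above into a single trainable module. It is a minimal illustration only: the module names, dimensions, the single-convolution stand-in for the 3D-CNN backbone, and the concatenation-based fusion are assumptions for exposition and are not the exact released implementation.

```python
import torch
import torch.nn as nn

class AudioVisualDialogModel(nn.Module):
    """Sketch: modality-specific encoders, cross-modal fusion, and a text decoder.
    The 3D-CNN visual backbone is part of the module, so its weights are updated
    end-to-end together with the transformer encoders and the decoder."""

    def __init__(self, d_model=768, nhead=12, num_layers=2, vocab_size=30522):
        super().__init__()
        # Placeholder for a 3D-CNN backbone producing d_model-dimensional features.
        self.visual_backbone = nn.Conv3d(3, d_model, kernel_size=(2, 7, 7), stride=(2, 4, 4))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)   # stands in for BERT
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers)  # fuses the three streams
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video, audio_feats, text_embeds, answer_embeds):
        # video: (B, 3, T, H, W); the other inputs: (B, L, d_model)
        v = self.visual_backbone(video).flatten(2).transpose(1, 2)  # (B, N, d_model)
        v = self.video_encoder(v)
        a = self.audio_encoder(audio_feats)
        t = self.text_encoder(text_embeds)
        fused = self.cross_encoder(torch.cat([v, a, t], dim=1))  # joint multimodal memory
        out = self.decoder(answer_embeds, fused)  # causal mask omitted in this sketch
        return self.lm_head(out)  # per-token vocabulary logits
```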
Le et al. [17] proposed a multimodal transformer network with query-attention. Zekang Li et al. [33] utilized a pre-trained GPT-2 model and extended it to learn joint audio-visual and text features by training the model with multi-task learning objectives [23]. Cherian et al. [35] extend the audio-visual transformer by adding student-teacher learning. While all these approaches to the video dialog task have achieved promising improvements, their utilization of visual features remains limited: they all rely on visual features pre-extracted from 3D-CNN networks with no further fine-tuning or training. This has resulted in models that do not fully capture the multimodal nature of the task [26]. In contrast, the model designed in this study also updates the visual extractor (a 3D-CNN) in an end-to-end fashion, which leads to improved learning of visual features tailored to the video question-answering task.
3 Method
This section introduces the framework for the video-based dialog task. It presents the different modal-specific encoders, the pre-processing of the input modalities, the training objectives, and the evaluation process.
3.1 Task Formulation
Given an input video $V = (V_1, \ldots, V_i, \ldots, V_n)$, where $V_i$ is the $i$-th frame sampled from the video, a dialog history $DH_t = (C, (Q_1, Ans_1), \ldots, (Q_{t-1}, Ans_{t-1}))$, where $C$ is the video caption and $(Q_{t-1}, Ans_{t-1})$ is the question-answer pair at round $t-1$, and an audio signal $A$ (see Figure 1), the task is formulated as follows: given a follow-up question $Q_t$, the model must generate a response $R_t$ conditioned on the input features $V$, $DH_{1:(t-1)}$, $A$, and $Q_t$.
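One standard way to write this generative formulation is shown below as a sketch; the notation $r_k$ for the $k$-th token of the response is introduced here only for exposition, and the actual training objective is detailed later in this section.
\[
R_t = \arg\max_{R} \, P\big(R \mid V, DH_{1:(t-1)}, A, Q_t\big)
    = \arg\max_{R} \prod_{k=1}^{|R|} P\big(r_k \mid r_{<k}, V, DH_{1:(t-1)}, A, Q_t\big).
\]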