End-to-End Multimodal Representation Learning for
Video Dialog
Huda Alamri
Georgia Institute of Technology
halamri3@gatech.edu
Anthony Bilic
Georgia Institute of Technology
abilic3@gatech.edu
Michael Hu
Georgia Institute of Technology
mhu93@gatech.edu
Apoorva Beedu
Georgia Institute of Technology
abeedu3@gatech.edu
Irfan Essa
Georgia Institute of Technology
irfan@gatech.edu
Abstract
Video-based dialog is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models setting new performance records. This progress is largely powered by the adaptation of more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve the task. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. In order to better leverage the available visual information, this study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with the other input modalities, such as text and audio. Experiments on the AVSD task show significant improvements over baselines in both the generative and retrieval tasks.
1 Introduction
The goal of the video-based dialog task is to answer questions about a dynamic scene presented in a video. More precisely, given a short video clip and multiple rounds of questions and answers about the video, the model should provide an accurate response to a follow-up question. An example of this is shown in Figure 1, where a model is presented with a short video and a conversation about it. When the model is asked the follow-up question "Did she re-enter the room?", it has to recognize that the person "she" refers to the "woman" mentioned in the previous utterances in order to provide an accurate answer. The model also has to identify the action "re-entering the room" among the actions in the video. This video-based dialog task represents a challenging multi-modal learning problem that serves as a test bed for video and language representation learning. Advances in this research field influence a wide range of applications, including providing road assistance for autonomous vehicles [14], helping visually impaired individuals understand their surroundings, and navigating through very long videos.
Success in this multi-modal learning task hinges on tackling four main challenges: (i) extracting
strong visual representations; (ii) extracting strong textual representations; (iii) effectively combining
User (DH1): Is there just one woman in the video?
Agent: Yes, there is just one woman.
User (DH2): Does she pick up the box?
Agent: No, she never picks it up.
User (DH3): Is it water she is drinking?
Agent: No, I think it's some sort of soda; it's a dark color.
User (DH4): What type of room is she in?
Agent: It looks like a living room area.
User (Q1): Does she re-enter or is she just gone?
Agent (A): She is just gone, never comes back.
Figure 1: In the video dialog task, the model is presented with a short video, a dialog about the video, and a follow-up question. The goal is to correctly answer the question conditioned on the audio-visual cues and the dialog history (DH).
both features with other modalities (audio, when available); and finally, (iv) generating an accurate response in natural language. While the task has received considerable interest from the community, current work largely focuses on obtaining strong textual and visual representations independently and then combining the features [3, 23, 34, 18, 17], while the knowledge and cues from the video-text association have not been extensively explored. This was investigated by Liu et al. [26], who demonstrated that most models are biased towards the textual information, while the visual features do not contribute substantially to performance. This study argues that using visual features extracted from frozen 3D-CNN networks trained on action recognition data, without the added knowledge of the corresponding text, i.e. the questions, results in reduced performance compared to joint training with both modalities.
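To make the contrast between the two training regimes concrete, the following minimal PyTorch sketch shows a frozen backbone (features pre-extracted once) versus one that receives gradients from the dialog loss; the torchvision r3d_18 backbone and the learning rate are illustrative assumptions, not the exact setup used in this work.

```python
# Minimal sketch, assuming a torchvision 3D-CNN (r3d_18); not the exact
# backbone or hyperparameters used in this work.
import torch
from torchvision.models.video import r3d_18

backbone = r3d_18(weights="DEFAULT")  # pretrained on Kinetics action recognition


def set_frozen(model: torch.nn.Module, frozen: bool) -> None:
    """Freeze (frozen=True) or unfreeze all backbone parameters."""
    for p in model.parameters():
        p.requires_grad = not frozen


# Common practice criticized above: the backbone stays frozen and features are
# pre-extracted once, so the dialog loss never reaches the 3D-CNN.
set_frozen(backbone, frozen=True)

# Joint-training alternative argued for here: the backbone is unfrozen and
# optimized together with the rest of the model.
set_frozen(backbone, frozen=False)
optimizer = torch.optim.AdamW(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-5
)
```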
Our work addresses this limited utilization of visual information in the video-based dialog task by making the models more visually aware. First, a 3D-CNN network extracts local temporal features from the input video, which are then passed to a transformer-based visual encoder that generates contextual representations through a self-attention mechanism. These visual features are then combined with text and audio features to generate the best response for the input video and question. These modules form one unified framework that is trained end-to-end, which enables the model to generate stronger latent representations. Experiments on the video-based dialog task AVSD show that our model learns stronger joint visual-textual features, which contribute significantly to its performance. Through several baselines, we show that recent methods pre-extract visual features and that their gains on vision-based language tasks are largely due to the strong performance of the language models (e.g., BERT and GPT-2). In contrast, our framework is designed around standard architectures to emphasize that joint learning of visual and textual information is vital for the video-dialog task.
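The following PyTorch sketch illustrates the overall forward pass described above. The module sizes, the r3d_18 backbone, and the simple concatenation-based cross-encoder are placeholders chosen for illustration, not the authors' exact architecture.

```python
# Illustrative sketch of the unified forward pass: a trainable 3D-CNN feeds a
# transformer visual encoder, whose output is fused with text and audio
# features before decoding a response. All sizes are placeholders.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class VideoDialogModel(nn.Module):
    def __init__(self, d_model=768, n_layers=4, n_heads=8, vocab_size=30522):
        super().__init__()
        cnn = r3d_18(weights="DEFAULT")
        cnn.fc = nn.Identity()                    # keep 512-d clip features
        self.cnn = cnn                            # trained end-to-end, not frozen
        self.proj_v = nn.Linear(512, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(layer, n_layers)
        self.cross_encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, clips, text_feats, audio_feats):
        # clips: (B, T, 3, 16, 112, 112) short clips; text/audio: (B, L, d_model)
        B, T = clips.shape[:2]
        v = self.cnn(clips.flatten(0, 1)).view(B, T, -1)  # local temporal features
        v = self.visual_encoder(self.proj_v(v))           # contextualize via self-attention
        fused = self.cross_encoder(
            torch.cat([v, text_feats, audio_feats], dim=1)
        )
        return self.lm_head(fused)                        # token logits for the response
```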
The contributions of our work are as follows:
- We propose a new framework for video-based language understanding and generation tasks. This multi-modal framework effectively learns contextual representations from strong visual features extracted from video, through self-attention.
- Our framework is flexible and can use any number of modalities and different encoders for these inputs. We show ablations on using audio in addition to the text and video modalities in Section 4.
- We also show the effectiveness of joint training on the retrieval task with a simpler framework.
- We provide extensive experiments and a detailed analysis of both the generative and retrieval tasks on the AVSD dataset.
2 Related Work
Video and language understanding has been extensively investigated due to the wide range of potential applications in human-computer interaction. Tasks such as video captioning [44, 49, 5], video question answering [19, 48, 31, 20], and video dialog [1, 13, 24] study the complex interplay between the vision and natural language modalities. In the case of video question answering, effective performance depends on extracting a strong visual representation of the input video and efficiently fusing it with the associated text. For video dialog, Alamri et al. [1] introduced the Audio-Visual Scene-Aware Dialog (AVSD) task as a multi-modal learning problem, the objective of which is to answer a question based on a short video, its associated audio, and a dialog history. The task supports a discriminative setting, where the model ranks a list of candidate answers [1, 29], and a generative setting, where a decoder is trained to auto-regressively generate an answer [23, 11].
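As a concrete illustration of the two settings, the sketch below scores candidate answers by token log-likelihood (discriminative) and decodes an answer auto-regressively (generative). The `model` and `tokenizer` objects are placeholders assumed to follow the Hugging Face seq2seq interface; the scoring scheme is one common choice, not the exact procedure of any system cited above.

```python
# Illustrative sketch only; `model`/`tokenizer` are assumed to be a Hugging
# Face style encoder-decoder (e.g., BART/T5-like), not any cited system.
import torch


@torch.no_grad()
def rank_candidates(model, tokenizer, context: str, candidates: list[str]) -> list[str]:
    """Discriminative setting: score each candidate answer by its average
    token log-likelihood given the context and return them best-first."""
    scores = []
    for answer in candidates:
        inputs = tokenizer(context, return_tensors="pt")
        labels = tokenizer(answer, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # mean negative log-likelihood
        scores.append(-loss.item())
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


@torch.no_grad()
def generate_answer(model, tokenizer, context: str, max_new_tokens: int = 32) -> str:
    """Generative setting: decode the answer auto-regressively."""
    inputs = tokenizer(context, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```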
Self-attention models, known as transformers [42], have been very successful at generating deep contextual linguistic representations. They are generally pre-trained with self-supervised learning on very large unlabelled text corpora, and subsequently fine-tuned on downstream tasks. They deliver state-of-the-art results for several natural language understanding and generation tasks [42, 33, 32, 8]. In our work we utilize a pre-trained BERT [8] model to encode the input question and the dialog history.
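A minimal sketch of this text-encoding step with the Hugging Face transformers library follows; the bert-base-uncased checkpoint and the sentence-pair concatenation of history and question are assumptions for illustration and may differ from the exact scheme used in the paper.

```python
# Minimal sketch, assuming bert-base-uncased and a sentence-pair style
# concatenation of dialog history and question.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

dialog_history = "is there just one woman in the video ? yes , there is just one woman ."
question = "does she re-enter or is she just gone ?"

# Encoded as: [CLS] dialog_history [SEP] question [SEP]
inputs = tokenizer(dialog_history, question, return_tensors="pt", truncation=True)
text_features = bert(**inputs).last_hidden_state  # (1, seq_len, 768) contextual token features
```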
Inspired by this success, a large body of work has adapted self-attention models to multi-modal learning, including image question answering [27, 7, 22, 21, 41], image dialog [7], video question answering [39, 38, 39], and video dialog [23, 3, 17]. In general, these approaches can be categorized into single-stream and two-stream networks.
In the two-stream approach, each modality is independently encoded by a transformer-based network, and information is fused through concatenation or cross-attention [41, 27]. In the single-stream approach, Li et al. [22], Su et al. [37], and Li et al. [29] utilize a unified transformer network in which video and text tokens are combined into one sequence.
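The difference between the two fusion styles can be made concrete with a short PyTorch sketch; the tensor shapes and layer sizes below are made up for illustration only.

```python
# Toy sketch of single-stream vs. two-stream fusion; shapes are illustrative.
import torch
import torch.nn as nn

d = 256
video_tokens = torch.randn(1, 20, d)   # e.g., 20 clip-level features
text_tokens = torch.randn(1, 30, d)    # e.g., 30 word-piece embeddings

# Single-stream: one transformer over the concatenated token sequence.
single = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2
)
joint = single(torch.cat([video_tokens, text_tokens], dim=1))   # (1, 50, d)

# Two-stream: encode each modality separately, then fuse with cross-attention
# (text queries attend to video keys/values).
text_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2
)
video_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2
)
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
t, v = text_enc(text_tokens), video_enc(video_tokens)
fused, _ = cross_attn(query=t, key=v, value=v)                  # (1, 30, d)
```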
In the two-stream approach of Luo et al. [28], the visual features and the text features are extracted by modality-specific encoders and then fused jointly via a transformer-based encoder. This study builds on the model proposed in [28] and extends it in two ways: first, a 3D-CNN network is added to the backbone visual encoder; second, a transformer-based audio encoder is added to learn a representation from the audio signal, which is combined with the other modalities via a cross-encoder, and the different encoders and the decoder are jointly trained in an end-to-end fashion. The experiments demonstrate the benefits of this approach.
Le et al. proposed a multimodal transformer network with query-attention [17]. Zekang et al. [33] utilized a pretrained GPT-2 model and extended it to learn joint audio-visual and text features by training the model on multi-task learning objectives [23]. Cherian et al. [35] extend the audio-visual transformer by adding student-teacher learning. While all these approaches to video dialog have achieved promising improvements, their utilization of visual features remains limited. All of them rely on visual features pre-extracted from 3D-CNN networks, with no further fine-tuning or training. This has resulted in models that do not fully capture the multimodal nature of the task [26]. In contrast, the model designed in this study also updates the visual feature extractor (a 3D-CNN) in an end-to-end fashion, which leads to improved learning of visual features tailored to the video question answering task.
3 Method
This section introduces the framework for the video-based dialog task. It presents the different modality-specific encoders, the pre-processing of the input modalities, the training objectives, and the evaluation process.
3.1 Task Formulation
Given an input video $V = (V_1, \ldots, V_i, \ldots, V_n)$, where $V_i$ is the $i$-th frame sampled from the video, a dialog history $DH_t = (C, (Q_1, Ans_1), \ldots, (Q_{t-1}, Ans_{t-1}))$, where $C$ is the video caption and $(Q_{t-1}, Ans_{t-1})$ is the question-answer pair at round $t-1$, and audio $A$ (see Figure 1), the task is formulated such that, given a follow-up question $Q_t$, the model must generate a response $R_t$ conditioned on the input features $V$, $DH_{1:t-1}$, $A$, and $Q_t$.
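One standard way to write this objective, shown as an illustration consistent with the definitions above rather than as the paper's exact equation, is the auto-regressive factorization

```latex
\begin{aligned}
R_t^{*} &= \operatorname*{arg\,max}_{R_t} \; P_{\theta}\left(R_t \mid V,\, DH_{1:t-1},\, A,\, Q_t\right) \\
        &= \operatorname*{arg\,max}_{R_t} \; \prod_{m=1}^{M} P_{\theta}\left(r_m \mid r_{<m},\, V,\, DH_{1:t-1},\, A,\, Q_t\right),
\end{aligned}
```

where $R_t = (r_1, \ldots, r_M)$ denotes the token sequence of the response and $\theta$ the jointly trained encoder and decoder parameters.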