2 Related Work
Video and language understanding has been extensively investigated due to the wide range of potential applications in human-computer interaction. Tasks such as video captioning [44, 49, 5], video question-answering [19, 48, 31, 20], and video dialog [1, 13, 24] study the complex interplay between the vision and natural language modalities. In the case of video question-answering, effective performance depends on extracting a strong visual representation for the input video and efficiently fusing it with the associated text. For video dialog, Alamri et al. [1] introduced the Audio-Visual Scene-Aware Dialog (AVSD) task as a multi-modal learning problem, the objective of which is to answer a question based on a short video, its associated audio, and a dialog history. The task supports a discriminative setting, where the model ranks a list of candidate answers [1, 29], or a generative setting, where a decoder is trained to auto-regressively generate an answer [23, 11].
Self-attention models, known as transformers [42], have been very successful at generating deep contextual linguistic representations. They are generally pre-trained with self-supervised learning on very large unlabelled text corpora, and subsequently fine-tuned on downstream tasks. They deliver state-of-the-art results for several natural language understanding and generation tasks [42, 33, 32, 8]. In our work, we utilize a pre-trained BERT [8] model to encode the input question and the dialog history.
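As a concrete illustration of this encoding step, the following is a minimal sketch using the Hugging Face transformers API with the bert-base-uncased checkpoint; the checkpoint name and the way the dialog history and question are concatenated are assumptions for exposition, not the exact input format used in our model.

```python
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the text only specifies a generic pre-trained BERT encoder.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical dialog history (caption plus previous Q/A rounds) and follow-up question.
dialog_history = "a man walks into the kitchen. Q: what is he holding? A: a cup."
question = "does he drink from the cup?"

# Encode history and question as a sentence pair; BERT inserts [SEP] between the two segments.
inputs = tokenizer(dialog_history, question, return_tensors="pt", truncation=True)
text_features = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```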
Inspired by this success, a large body of work has adapted self-attention models to multi-modal learning, including image question answering [27, 7, 22, 21, 41], image dialog [7], video question answering [39, 38], and video dialog [23, 3, 17]. In general, these approaches can be categorized into single-stream and two-stream networks.
In the two-stream approach, each modality is independently encoded by a transformer-based network, and information is fused through concatenation or cross-attention [41, 27]. In the single-stream approach, Li et al. [22], Su et al. [37], and Li et al. [29] utilize a unified transformer network where video and text tokens are combined into one sequence.
Following the two-stream approach, Luo et al. [28] extract the visual and text features with modality-specific encoders and then fuse them jointly via a transformer-based cross-encoder. This study builds on the model proposed in [28] and extends it in two ways: first, a 3D-CNN network is added to the backbone visual encoder; second, an audio transformer-based encoder is added to learn a representation from the audio signal, which is combined with the other modalities via a cross-encoder. The different encoders and the decoder are jointly trained in an end-to-end fashion, and the experiments demonstrate the benefits of this approach.
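To make this design concrete, the sketch below organizes the components described above into a single trainable module. It is a minimal illustration only: the module names, dimensions, the single-convolution stand-in for the 3D-CNN backbone, and the concatenation-based fusion are assumptions for exposition and are not the exact released implementation.

```python
import torch
import torch.nn as nn

class AudioVisualDialogModel(nn.Module):
    """Sketch: modality-specific encoders, cross-modal fusion, and a text decoder.
    The 3D-CNN visual backbone is part of the module, so its weights are updated
    end-to-end together with the transformer encoders and the decoder."""

    def __init__(self, d_model=768, nhead=12, num_layers=2, vocab_size=30522):
        super().__init__()
        # Placeholder for a 3D-CNN backbone producing d_model-dimensional features.
        self.visual_backbone = nn.Conv3d(3, d_model, kernel_size=(2, 7, 7), stride=(2, 4, 4))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)   # stands in for BERT
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers)  # fuses the three streams
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video, audio_feats, text_embeds, answer_embeds):
        # video: (B, 3, T, H, W); the other inputs: (B, L, d_model)
        v = self.visual_backbone(video).flatten(2).transpose(1, 2)  # (B, N, d_model)
        v = self.video_encoder(v)
        a = self.audio_encoder(audio_feats)
        t = self.text_encoder(text_embeds)
        fused = self.cross_encoder(torch.cat([v, a, t], dim=1))  # joint multimodal memory
        out = self.decoder(answer_embeds, fused)  # causal mask omitted in this sketch
        return self.lm_head(out)  # per-token vocabulary logits
```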
Le et al. [17] proposed a multimodal transformer network with query-attention. Zekang Li et al. [33] utilized a pre-trained GPT-2 model and extended it to learn joint audio-visual and text features by training the model with multi-task learning objectives [23]. Cherian et al. [35] extend the audio-visual transformer by adding student-teacher learning. While all these approaches to the video dialog task have achieved promising improvements, their utilization of visual features remains limited: they all rely on visual features pre-extracted from 3D-CNN networks with no further fine-tuning or training. This has resulted in models that do not fully capture the multimodal nature of the task [26]. In contrast, the model designed in this study also updates the visual extractor (a 3D-CNN) in an end-to-end fashion, which leads to improved learning of visual features tailored to the video question-answering task.
3 Method
This section introduces the framework for the video-based dialog task. It presents the different modal-specific encoders, the pre-processing of the input modalities, the training objectives, and the evaluation process.
3.1 Task Formulation
Given an input video $V = (V_1, \ldots, V_i, \ldots, V_n)$, where $V_i$ is the $i$-th frame sampled from the video, a dialog history $DH_t = (C, (Q_1, Ans_1), \ldots, (Q_{t-1}, Ans_{t-1}))$, where $C$ is the video caption and $(Q_{t-1}, Ans_{t-1})$ is the question-answer pair at round $t-1$, and an audio signal $A$ (see Figure 1), the task is formulated as follows: given a follow-up question $Q_t$, the model must generate a response $R_t$ conditioned on the input features $V$, $DH_{1:(t-1)}$, $A$, and $Q_t$.
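One standard way to write this generative formulation is shown below as a sketch; the notation $r_k$ for the $k$-th token of the response is introduced here only for exposition, and the actual training objective is detailed later in this section.
\[
R_t = \arg\max_{R} \, P\big(R \mid V, DH_{1:(t-1)}, A, Q_t\big)
    = \arg\max_{R} \prod_{k=1}^{|R|} P\big(r_k \mid r_{<k}, V, DH_{1:(t-1)}, A, Q_t\big).
\]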