
Model      BLEU-4  METEOR  ROUGE-L  CIDEr
with video
RLM        0.402   0.254   0.544    1.052
VGD-GPT2   0.388   0.251   0.539    0.998
PDC-GPT    0.385   0.260   0.545    1.010
Ours       0.414   0.265   0.558    1.078
w/o video
RLM        0.401   0.255   0.545    1.038
VGD-GPT2   0.393   0.251   0.537    1.016
PDC-GPT    0.388   0.261   0.543    1.020
Ours       0.405   0.264   0.554    1.064

Table 1: Pilot study on AVSD@DSTC7.
2021). On the other hand, multi-modal information should be used jointly, and reasoning over different modalities should be carried out collaboratively rather than independently.
Existing approaches fall short at reasoning jointly over multiple modalities, since they either separate the reasoning of different modalities (Li et al., 2020) or employ a cross-modal attention mechanism that is difficult to train without direct supervision (Le et al., 2020; Kim et al., 2021; Geng et al., 2021).
To address the aforementioned issues, we propose extracting relevant information from videos and converting it into reasoning paths, which are expressed in natural language and can be fed directly into PLMs. In addition, we propose a multi-agent reasoning framework grounded in multi-agent reinforcement learning (MARL). Specifically, we design a video agent and a context agent that learn to find chains of reasoning on the multi-modal semantic graphs, and we further design a central communicator that makes the two agents work in a collaborative manner. Our framework has the following advantages: (1) the multi-modal reasoning paths are compatible with the input of PLMs; (2) the reasoning process can be “supervised” by designing appropriate reward functions; and (3) the communication mechanism allows information from different modalities to better complement each other. We conduct extensive experiments on two benchmark datasets for video-grounded dialogue generation, AVSD@DSTC7 (Alamri et al., 2018) and Twitch-FIFA (Pasunuru and Bansal, 2018). Experimental results show that, thanks to the multi-agent reasoning framework, our model significantly outperforms state-of-the-art methods in both automatic and human evaluations.
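As a rough illustration of the two-agent reasoning loop described above, consider the following minimal sketch. It is not the paper's actual implementation: the class and function names (GraphAgent, communicate, rollout), the random-walk policy, and the toy graphs are our own illustrative assumptions; a learned policy, a neural communicator, and a reward computed against the reference response would take their places.

```python
# Minimal, hypothetical sketch of two agents walking multi-modal semantic graphs
# and exchanging messages through a central communicator. Names and policies are
# illustrative assumptions, not the paper's implementation.
import random
from typing import Dict, List


class GraphAgent:
    """Walks a semantic graph, one node per step, to build a reasoning path."""

    def __init__(self, graph: Dict[str, List[str]], start: str):
        self.graph = graph      # adjacency list over natural-language nodes
        self.path = [start]     # reasoning path collected so far

    def step(self, message: str) -> str:
        # A trained agent would score neighbours with a policy conditioned on
        # the other agent's message; here we pick uniformly at random.
        neighbours = self.graph.get(self.path[-1], [])
        nxt = random.choice(neighbours) if neighbours else self.path[-1]
        self.path.append(nxt)
        return nxt


def communicate(video_state: str, context_state: str) -> str:
    # The central communicator fuses the two agents' states into a shared
    # message; a learned network would replace this string concatenation.
    return f"{video_state} | {context_state}"


def rollout(video_graph: Dict[str, List[str]],
            context_graph: Dict[str, List[str]], steps: int = 3) -> str:
    video_agent = GraphAgent(video_graph, start="person")
    context_agent = GraphAgent(context_graph, start="question")
    message = ""
    for _ in range(steps):
        v = video_agent.step(message)
        c = context_agent.step(message)
        message = communicate(v, c)
    # The two paths would be verbalized and prepended to the PLM input; a reward
    # (e.g., overlap with the reference response) would supervise both policies.
    return " -> ".join(video_agent.path) + " ; " + " -> ".join(context_agent.path)


if __name__ == "__main__":
    video_graph = {"person": ["holds cup", "sits on sofa"], "holds cup": ["drinks"]}
    context_graph = {"question": ["what is he doing"], "what is he doing": ["drinking"]}
    print(rollout(video_graph, context_graph))
```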
Our contributions in this paper are three-fold: (1) we identify that current PLM-based approaches, despite showing promising results on automatic evaluation metrics, are unable to fully comprehend the video content; (2) we propose a multi-agent reasoning framework built upon PLMs that lets information from different modalities reinforce each other and discovers multi-modal reasoning paths; and (3) we empirically verify the effectiveness of the proposed model on two benchmarks for video-grounded dialogue generation.
2 Related Work
The majority of early works on dialogue generation use hand-crafted rules or templates to construct dialogue systems (Weizenbaum, 1966; Wallace, 2009). Inspired by developments in machine translation, a number of initiatives have been made to develop end-to-end open-domain dialogue generation models (Ritter et al., 2011; Gehring et al., 2017; Vaswani et al., 2017). Following that, the vanilla encoder-decoder architecture has been frequently utilized to improve response quality, and numerous modifications to this architecture have been made to enhance response diversity (Zhao et al., 2017; Tao et al., 2018), model the structure of conversation contexts (Zhang et al., 2019), introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020), and control response attributes (Wang et al., 2018; See et al., 2019; Wang et al., 2020).
The research on generating dialogue from video was started by Alamri et al. (2018). After that, Hori et al. (2019a) present an LSTM-based encoder-decoder architecture with multi-modal attention that simply combines textual and visual features via a projection matrix. Le et al. (2019) introduce a multi-modal transformer network to encode videos and incorporate information from multiple modalities. Hori et al. (2019b) use a joint student-teacher learning approach to compensate for missing video descriptions, in which the student network is trained to mimic the teacher's responses. VGD-GPT (Le and Hoi, 2020) builds on a pre-trained GPT-2 model and formulates video-grounded dialogue generation as a sequence-to-sequence task. RLM (Li et al., 2020) introduces a multi-task learning strategy on top of a pre-trained GPT-2 model. Additionally, BiST (Le et al., 2020) models the dependencies between textual and visual features in two directions: spatial to temporal and temporal to spatial. With visual attention, PDC-GPT (Le et al.,