Collaborative Reasoning on Multi-Modal Semantic Graphs for
Video-Grounded Dialogue Generation
Xueliang Zhao1,2, Yuxuan Wang1,2, Chongyang Tao1, Chenshuo Wang1,2 and Dongyan Zhao1,2,3
1 Wangxuan Institute of Computer Technology, Peking University
2 Center for Data Science, AAIS, Peking University
3 Beijing Institute for General Artificial Intelligence
{xl.zhao,chongyangtao,zhaody}@pku.edu.cn  {wyx,wcs}@stu.pku.edu.cn
* Equal contribution. Corresponding author: Dongyan Zhao.
Abstract
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video. The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs), which presents obstacles to exploiting the power of large-scale pre-training; and (2) the necessity of taking into account the complementarity of various modalities throughout the reasoning process. Although existing methods have made remarkable progress in video-grounded dialogue generation, they still fall short when it comes to integrating with PLMs in a way that allows information from different modalities to complement each other. To alleviate these issues, we first propose extracting pertinent information from videos and turning it into reasoning paths that are acceptable to PLMs. Additionally, we propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities (i.e., video and dialogue context). Empirical results on two public datasets indicate that the proposed model outperforms state-of-the-art models by large margins on both automatic and human evaluations.
1 Introduction
Conversing with computers has become a crucial step toward general artificial intelligence, and it has attracted increasing attention from AI and NLP researchers. Multi-turn dialogue response generation and multi-modal question answering are two high-profile initiatives toward this goal. Multi-turn dialogue response generation requires the agent to comprehend the key information in the dialogue context in order to provide a cohesive, fluent and informative response (Zhao et al., 2017; Tao et al., 2018). Multi-modal question answering, on the other hand, requires the agent to understand both the textual and visual contexts (Antol et al., 2015; Tapaswi et al., 2016; Jang et al., 2017). Video-grounded dialogue (Alamri et al., 2018; Pasunuru and Bansal, 2018) is a generalization of these two tasks, in which the agent must observe multi-modal contents and engage in a conversation with the human, rather than simply responding to the last utterance or ignoring the visual contents. Compared to multi-turn dialogue response generation and multi-modal question answering, the distinctive challenges posed by video-grounded dialogue generation can be summarized as follows: (1) unlike traditional multi-turn dialogue, which can directly use large-scale pre-trained language models (PLMs), video-grounded dialogue cannot directly use PLMs due to their incapacity to process video input; (2) in comparison to multi-modal question answering, video-grounded dialogue necessitates reasoning on both the video and the multi-turn textual context, and there is usually a complementarity between different modalities that should be taken into account.

Although existing approaches have made notable progress in video-grounded dialogue, they still fail to address the aforementioned challenges. On one hand, existing approaches cannot be effectively combined with PLMs, which presents obstacles to exploiting the power of state-of-the-art pre-training technology. The reasons can be summarized into two categories: (1) simply appending the video features to the text embeddings makes it difficult for the model to obtain an in-depth understanding of the video (Li et al., 2020; Le and Hoi, 2020; Le et al., 2021). To investigate this problem further, we compare the performance of these models before and after removing the video from the input. As demonstrated in Table 1, most metrics only show a tiny shift, and several even increase once the video is removed; and (2) overly complex designs for the Transformer are difficult to transfer to PLMs (Le et al., 2020; Kim et al., 2021; Geng et al., 2021).
Model        BLEU-4  METEOR  ROUGE-L  CIDEr
with video
  RLM        0.402   0.254   0.544    1.052
  VGD-GPT2   0.388   0.251   0.539    0.998
  PDC-GPT    0.385   0.260   0.545    1.010
  Ours       0.414   0.265   0.558    1.078
w/o video
  RLM        0.401   0.255   0.545    1.038
  VGD-GPT2   0.393   0.251   0.537    1.016
  PDC-GPT    0.388   0.261   0.543    1.020
  Ours       0.405   0.264   0.554    1.064

Table 1: Pilot study on AVSD@DSTC7.
On the other hand, information from different modalities should be used in conjunction, and reasoning on different modalities should be done collaboratively rather than independently. Existing approaches fall short when it comes to reasoning jointly on multiple modalities, since they either separate the reasoning of different modalities (Li et al., 2020) or employ a cross-modal attention mechanism that is difficult to train without direct supervision (Le et al., 2020; Kim et al., 2021; Geng et al., 2021).
To address the aforementioned issues, we propose extracting relevant information from videos and converting it into reasoning paths, which are in the form of natural language and can be fed directly into PLMs. Besides, we propose a multi-agent reasoning framework based on multi-agent reinforcement learning (MARL). Specifically, we design a video agent and a context agent which learn to find chains of reasoning on the multi-modal semantic graphs. We further design a central communicator to make the two agents work in a collaborative manner. Our framework has the following advantages: (1) the multi-modal reasoning paths are compatible with the input of PLMs; (2) the reasoning process can be "supervised" by designing appropriate reward functions; and (3) the communication mechanism allows information from different modalities to better complement each other. We conduct extensive experiments on two benchmark datasets for video-grounded dialogue generation, namely AVSD@DSTC7 (Alamri et al., 2018) and Twitch-FIFA (Pasunuru and Bansal, 2018). Experimental results show that, thanks to the multi-agent reasoning framework, our model significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations.

Our contributions in this paper are three-fold: (1) identifying the issue that current PLM-based approaches are unable to fully comprehend the video content despite showing promising results on automatic evaluation metrics; (2) proposing a multi-agent reasoning framework on top of PLMs that lets information from different modalities reinforce each other and discovers multi-modal reasoning paths; and (3) empirically verifying the effectiveness of the proposed model on two benchmarks for video-grounded dialogue generation.
2 Related Work
The majority of early works on dialogue generation use hand-crafted rules or templates to construct dialogue systems (Weizenbaum, 1966; Wallace, 2009). Inspired by developments in machine translation, a number of initiatives have been made to develop end-to-end open-domain dialogue generation models (Ritter et al., 2011; Gehring et al., 2017; Vaswani et al., 2017). Since then, the vanilla encoder-decoder architecture has frequently been utilized to improve response quality, and numerous modifications to this architecture have been made to enhance response diversity (Zhao et al., 2017; Tao et al., 2018), model the structure of conversation contexts (Zhang et al., 2019), introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020) and control response attributes (Wang et al., 2018; See et al., 2019; Wang et al., 2020).

The research on generating dialogue from video was initiated by Alamri et al. (2018). After that, Hori et al. (2019a) present an LSTM-based encoder-decoder architecture with multi-modal attention that merely combines textual and visual data via a projection matrix. A multi-modal transformer network is introduced in Le et al. (2019) to encode videos and incorporate information from several modalities. Hori et al. (2019b) use a joint student-teacher learning approach to compensate for a missing video description, in which the student network is trained to mimic the teacher's response. VGD-GPT (Le and Hoi, 2020) is based on a pre-trained GPT-2 model and formulates video-grounded dialogue generation as a sequence-to-sequence task. RLM (Li et al., 2020) builds a multi-task learning strategy on top of a pre-trained GPT-2 model. Additionally, BiST (Le et al., 2020) models the dependencies between textual and visual features in two directions: spatial to temporal and temporal to spatial. With visual attention, PDC-GPT (Le et al., 2021) learns to anticipate the reasoning process on turn-level semantic graphs. For further reasoning, SCGA (Kim et al., 2021) constructs a structured graph based on a multi-modal coreference technique, while STSGR (Geng et al., 2021) introduces a shuffled transformer reasoning framework on a semantic scene graph. In contrast to previous approaches, this paper focuses on how to build a multi-modal reasoning approach that can cooperate with PLMs in a way that facilitates the complementary nature of information from various modalities.

The study of reasoning on various types of graph structures for dialogue generation is also related to our work. Moon et al. (2019) create a KG walk path for each retrieved entity in an effort to explain conversational reasoning in a natural way. Jung et al. (2020) develop a dialogue-conditioned path traversal model with attention flows that improves comprehension of the path reasoning process. Xu et al. (2020) propose to represent dialogue transitions as graphs. Previous approaches typically concentrate on textual graphs, whereas video-grounded dialogue contains multi-modal contexts, which makes reasoning more difficult.
3 Approach
3.1 Overview
Suppose that we have a dataset $\mathcal{D}=\{V_i, U_i, R_i\}_{i=1}^{N}$, with $N$ denoting the total number of datapoints. For the $i$-th datapoint, $V_i$ signifies a brief video clip and $U_i=\{u_{i,1}, u_{i,2}, \cdots, u_{i,n}\}$ serves as the dialogue context, with $u_{i,j}=\{w^1_{i,j}, w^2_{i,j}, \cdots, w^m_{i,j}\}$ denoting the $j$-th utterance; $n$ and $m$ are the number of utterances in a context and the number of words in an utterance, respectively. $R_i$ is a response that is factually consistent with the video while also remaining coherent with the dialogue context. Our goal is to learn a generation model $p(R|V, U; \theta)$ (where $\theta$ denotes the parameters of the model and the subscript $i$ is omitted to reduce clutter) from $\mathcal{D}$, so that given a new dialogue context $U$ associated with a video $V$, one can generate a response following $p(R|V, U; \theta)$.
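As a concrete reference point for this notation, the following is a minimal sketch (not part of the paper) of how a datapoint $(V_i, U_i, R_i)$ and the generation interface could be represented; the class, field and method names (including `model.generate`) are illustrative assumptions rather than the actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DialoguePoint:
    """One datapoint (V_i, U_i, R_i): a video clip, a multi-turn context, and a response."""
    video_path: str           # V_i: reference to the raw video clip
    context: List[List[str]]  # U_i: n utterances, each a list of word tokens w^k_{i,j}
    response: List[str]       # R_i: the gold response tokens


def generate_response(model, video_path: str, context: List[List[str]]) -> List[str]:
    """Sample a response from p(R | V, U; theta); `model` stands for any trained generator."""
    return model.generate(video=video_path, context=context)
```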
To alleviate the heterogeneity of different modalities, we first represent the video as well as the dialogue context as semantic graphs (elaborated in Section 3.2). Figure 1 illustrates the architecture of the proposed model. In a nutshell, the model is composed of a multi-modal reasoning module and a generation module. The multi-modal reasoning module is responsible for extracting crucial signals from the multi-modal contexts (Section 3.3). Specifically, it consists of a video agent, a text agent and a central communicator. The video agent and the text agent are responsible for extracting reasoning paths from the video semantic graph and the text semantic graph, respectively. Taking the latest context utterance as input, they determine the query entities from which they start traversing the graphs to find the answer-providing paths. To search for answer-providing paths more efficiently, we devise a central communicator to transport the entire path histories between the video and text agents. The reasoning paths, which form interpretable provenances for the prediction, are integrated by the generation module to synthesize a response (Section 3.4).
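The sketch below illustrates the collaborative traversal described above at a purely conceptual level, assuming the details are as stated in this section: each agent extends a path on its own semantic graph for a fixed number of hops, and the central communicator shares both path histories at every step. The policy networks, reward functions and message format are not specified here; `video_agent.step`, `text_agent.step` and `communicator.exchange` are hypothetical interfaces, not the paper's actual code.

```python
import networkx as nx


def collaborative_reasoning(video_graph: nx.MultiDiGraph,
                            text_graph: nx.MultiDiGraph,
                            video_agent, text_agent, communicator,
                            query_entities, max_hops: int = 3):
    """Conceptual sketch: two agents walk their own graphs while a central
    communicator transports the full path histories between them."""
    video_path = [query_entities["video"]]  # query entities determined from the latest utterance
    text_path = [query_entities["text"]]
    for _ in range(max_hops):
        # The communicator packages both path histories into a shared message.
        message = communicator.exchange(video_path, text_path)
        # Each agent picks the next node on its graph, conditioned on the shared message.
        video_path.append(video_agent.step(video_graph, video_path, message))
        text_path.append(text_agent.step(text_graph, text_path, message))
    # The two reasoning paths are later fed to the PLM-based generation module.
    return video_path, text_path
```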
3.2 Multi-Modal Graph Construction
The crucial step in building the semantic graph for video reasoning is gathering a collection of facts from the unstructured video data, which take the form of subject-predicate-object triplets. Although there have been some previous attempts to extract such triplets from videos using relation detection (Liu et al., 2020), the publicly released models struggle to build the proper relations because of the dramatic domain discrepancy between their training corpus and the benchmark datasets for video-grounded dialogue. Therefore, we resort to video action recognition (Zhu et al., 2020) to extract meaningful structural representations from the video. Specifically, we first employ the SlowFast model (Feichtenhofer et al., 2019), pre-trained on the Charades (Sigurdsson et al., 2016) and Kinetics (Kay et al., 2017) datasets, to extract all potential action classes and only retain those with a probability greater than 0.5. Given the extracted facts $\{(e^v_s, r^v, e^v_o)\}$, with $e^v_s$, $r^v$ and $e^v_o$ standing for subject, predicate and object respectively, we construct a video semantic graph $G^v=(N^v, E^v)$ in which the entities $e^v_s$ and $e^v_o$ are represented as nodes (i.e., $e^v_s, e^v_o \in N^v$) and the relation $r^v$ is represented as a labeled edge connecting them (i.e., $(e^v_s, r^v, e^v_o) \in E^v$).
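A minimal sketch of this construction, under the assumptions stated here, is given below: action predictions above the 0.5 threshold are kept and each resulting triplet becomes a labeled edge. `predict_action_probs` is a placeholder standing in for the pre-trained SlowFast recognizer plus the conversion of action classes into triplets; it is not the real model API.

```python
import networkx as nx

ACTION_THRESHOLD = 0.5  # keep only actions the recognizer is reasonably confident about


def build_video_graph(video_path: str, predict_action_probs) -> nx.MultiDiGraph:
    """Turn recognized actions into a video semantic graph G^v = (N^v, E^v).

    `predict_action_probs(video_path)` is a hypothetical stand-in returning
    [(subject, predicate, object, probability), ...] derived from action classes.
    """
    graph = nx.MultiDiGraph()
    for subj, pred, obj, prob in predict_action_probs(video_path):
        if prob <= ACTION_THRESHOLD:
            continue
        graph.add_node(subj)                  # e^v_s in N^v
        graph.add_node(obj)                   # e^v_o in N^v
        graph.add_edge(subj, obj, label=pred)  # (e^v_s, r^v, e^v_o) in E^v
    return graph
```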
The semantic graph for the dialogue context, $G^u=(N^u, E^u)$, is constructed in a similar way, except that we employ open information extraction (OpenIE) to extract the subject-predicate-object triplets. Specifically, we first apply a co-reference resolution tool (e.g., AllenNLP (Gardner et al., 2018)) to resolve pronouns across utterances before extracting triplets with OpenIE.
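Following the same triplet-to-graph recipe, a sketch of the context-graph construction is shown below. Both `resolve_coreferences` and `extract_openie_triplets` are hypothetical placeholders for the external tools mentioned above (a co-reference resolver and an OpenIE system), not their actual APIs, and utterances are treated here as plain strings.

```python
import networkx as nx
from typing import List


def build_context_graph(utterances: List[str],
                        resolve_coreferences,
                        extract_openie_triplets) -> nx.MultiDiGraph:
    """Build the text semantic graph G^u = (N^u, E^u) from the dialogue context.

    `resolve_coreferences(text)` rewrites pronouns as their referents and
    `extract_openie_triplets(text)` returns (subject, predicate, object) facts;
    both are placeholder interfaces for off-the-shelf tools.
    """
    graph = nx.MultiDiGraph()
    resolved = resolve_coreferences(" ".join(utterances))
    for subj, pred, obj in extract_openie_triplets(resolved):
        graph.add_edge(subj, obj, label=pred)  # nodes are added implicitly by networkx
    return graph
```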