
Model      BLEU-4  METEOR  ROUGE-L  CIDEr
with video
RLM        0.402   0.254   0.544    1.052
VGD-GPT2   0.388   0.251   0.539    0.998
PDC-GPT    0.385   0.260   0.545    1.010
Ours       0.414   0.265   0.558    1.078
w/o video
RLM        0.401   0.255   0.545    1.038
VGD-GPT2   0.393   0.251   0.537    1.016
PDC-GPT    0.388   0.261   0.543    1.020
Ours       0.405   0.264   0.554    1.064

Table 1: Pilot study on AVSD@DSTC7.
2021). On the other hand, multi-modal information should be used jointly, and reasoning over different modalities should be carried out collaboratively rather than independently.
Existing approaches fall short at reasoning jointly over multiple modalities, since they either separate the reasoning of different modalities (Li et al., 2020) or employ a cross-modal attention mechanism that is difficult to train without direct supervision (Le et al., 2020; Kim et al., 2021; Geng et al., 2021).
To address the aforementioned issues, we propose extracting relevant information from videos and converting it into reasoning paths, which are expressed in natural language and can be fed directly into PLMs. In addition, we propose a multi-agent reasoning framework grounded in multi-agent reinforcement learning (MARL). Specifically, we design a video agent and a context agent that learn to find chains of reasoning on the multi-modal semantic graphs, and we further design a central communicator that makes the two agents work in a collaborative manner. Our framework has the following advantages: (1) the multi-modal reasoning paths are compatible with the input of PLMs; (2) the reasoning process can be “supervised” by designing appropriate reward functions; and (3) the communication mechanism allows information from different modalities to better complement each other. We conduct extensive experiments on two benchmark datasets for video-grounded dialogue generation, AVSD@DSTC7 (Alamri et al., 2018) and Twitch-FIFA (Pasunuru and Bansal, 2018). Experimental results show that, thanks to the multi-agent reasoning framework, our model significantly outperforms state-of-the-art methods in both automatic and human evaluations.
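As a rough illustration of the two-agent reasoning loop described above, consider the following minimal sketch. It is not the paper's actual implementation: the class and function names (GraphAgent, communicate, rollout), the random-walk policy, and the toy graphs are our own illustrative assumptions; a learned policy, a neural communicator, and a reward computed against the reference response would take their places.

```python
# Minimal, hypothetical sketch of two agents walking multi-modal semantic graphs
# and exchanging messages through a central communicator. Names and policies are
# illustrative assumptions, not the paper's implementation.
import random
from typing import Dict, List


class GraphAgent:
    """Walks a semantic graph, one node per step, to build a reasoning path."""

    def __init__(self, graph: Dict[str, List[str]], start: str):
        self.graph = graph      # adjacency list over natural-language nodes
        self.path = [start]     # reasoning path collected so far

    def step(self, message: str) -> str:
        # A trained agent would score neighbours with a policy conditioned on
        # the other agent's message; here we pick uniformly at random.
        neighbours = self.graph.get(self.path[-1], [])
        nxt = random.choice(neighbours) if neighbours else self.path[-1]
        self.path.append(nxt)
        return nxt


def communicate(video_state: str, context_state: str) -> str:
    # The central communicator fuses the two agents' states into a shared
    # message; a learned network would replace this string concatenation.
    return f"{video_state} | {context_state}"


def rollout(video_graph: Dict[str, List[str]],
            context_graph: Dict[str, List[str]], steps: int = 3) -> str:
    video_agent = GraphAgent(video_graph, start="person")
    context_agent = GraphAgent(context_graph, start="question")
    message = ""
    for _ in range(steps):
        v = video_agent.step(message)
        c = context_agent.step(message)
        message = communicate(v, c)
    # The two paths would be verbalized and prepended to the PLM input; a reward
    # (e.g., overlap with the reference response) would supervise both policies.
    return " -> ".join(video_agent.path) + " ; " + " -> ".join(context_agent.path)


if __name__ == "__main__":
    video_graph = {"person": ["holds cup", "sits on sofa"], "holds cup": ["drinks"]}
    context_graph = {"question": ["what is he doing"], "what is he doing": ["drinking"]}
    print(rollout(video_graph, context_graph))
```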
Our contributions in this paper are three-fold: (1) we identify that current PLM-based approaches, despite showing promising results on automatic evaluation metrics, are unable to fully comprehend the video content; (2) we propose a multi-agent reasoning framework built upon PLMs that lets information from different modalities reinforce each other and discovers multi-modal reasoning paths; and (3) we empirically verify the effectiveness of the proposed model on two benchmarks for video-grounded dialogue generation.
2 Related Work
The majority of early works on dialogue generation use hand-crafted rules or templates to construct dialogue systems (Weizenbaum, 1966; Wallace, 2009). Inspired by developments in machine translation, a number of initiatives have been made to develop end-to-end open-domain dialogue generation models (Ritter et al., 2011; Gehring et al., 2017; Vaswani et al., 2017). Following that, the vanilla encoder-decoder architecture has been frequently utilized to improve response quality, and numerous modifications to this architecture have been made to enhance response diversity (Zhao et al., 2017; Tao et al., 2018), model the structure of conversation contexts (Zhang et al., 2019), introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020), and control response attributes (Wang et al., 2018; See et al., 2019; Wang et al., 2020).
The research on generating dialogue from video was started by Alamri et al. (2018). After that, Hori et al. (2019a) present an LSTM-based encoder-decoder architecture with multi-modal attention that simply combines textual and visual features via a projection matrix. Le et al. (2019) introduce a multi-modal transformer network to encode videos and incorporate information from multiple modalities. Hori et al. (2019b) use a joint student-teacher learning approach to compensate for missing video descriptions, in which the student network is trained to mimic the teacher's responses. VGD-GPT (Le and Hoi, 2020) builds on a pre-trained GPT-2 model and formulates video-grounded dialogue generation as a sequence-to-sequence task. RLM (Li et al., 2020) introduces a multi-task learning strategy on top of a pre-trained GPT-2 model. Additionally, BiST (Le et al., 2020) models the dependencies between textual and visual features in two directions: spatial to temporal and temporal to spatial. With visual attention, PDC-GPT (Le et al.,