outside the text information is needed poses more challenges to traditional query rewrite models that are based only on textual features.
In this paper, we propose the task of multimodal
conversational query rewrite (McQR), which aims
to perform query rewrite under the multimodal vi-
sual conversation setting. To achieve this goal,
we collect a large-scale dataset called McQueen.
Specifically, for each visual conversation consisting of an image and the corresponding history question-answer context, we provide a manual rewrite of the query, in which coreference resolution and ellipsis completion are performed. Furthermore, to assist downstream tasks such as coreference entity detection, we annotate image boxes for all the entities appearing in the rewrite to indicate their corresponding image areas.
We then use the McQueen dataset to benchmark
a state-of-the-art method for effectively tackling
the McQR task. Inspired by the great success of pre-trained models such as BERT (Devlin et al., 2019), our model builds on a multimodal pre-trained model, in which interactions between different modalities can be better captured. Furthermore, we enhance the model with a pointer generator specially designed for the multimodal Transformer blocks (Vaswani et al., 2017), so that each token of the rewritten query is either generated from the vocabulary or copied from contextual parts with high attention weights. Extensive experiments compare our method with several state-of-the-art methods; our model outperforms all of them on both the McQR task and two subtasks. We further analyze the role of different modalities in this task and demonstrate that introducing image information provides extra guidance for the query rewrite task.
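To make the copy mechanism concrete, the following is a minimal sketch of one decoding step of a generic pointer-generator head, written in PyTorch-style Python. It illustrates the standard pointer-generator idea rather than the exact implementation used in our model; all function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(vocab_logits, attn_weights, context_token_ids, p_gen):
    """One decoding step of a generic pointer-generator head (illustrative only).

    vocab_logits:      (batch, vocab_size) decoder logits over the output vocabulary
    attn_weights:      (batch, ctx_len)    attention of the current step over context tokens
    context_token_ids: (batch, ctx_len)    vocabulary ids of the context tokens
    p_gen:             (batch, 1)          probability of generating rather than copying
    """
    # Probability of generating each token from the vocabulary.
    gen_dist = p_gen * F.softmax(vocab_logits, dim=-1)

    # Probability of copying each context token: scatter the attention weights
    # onto the vocabulary positions of the corresponding context tokens.
    copy_dist = torch.zeros_like(gen_dist)
    copy_dist.scatter_add_(1, context_token_ids, (1.0 - p_gen) * attn_weights)

    # The final output distribution mixes generation and copying.
    return gen_dist + copy_dist
```

During training, the gate p_gen is typically computed from the decoder state, and the loss is the negative log-likelihood of each target token under the mixed distribution, so the model learns when to copy from the context and when to generate from the vocabulary.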
In summary, the contributions of our paper are threefold:
• We formally define the task of multimodal conversational query rewrite (McQR), which aims to generate a fully specified rewritten query based on both the context history and the visual image.
• We propose McQueen, a large-scale dataset containing 15k visual conversations and over 80k rewrites. For the entities appearing in the rewrites, we also annotate image boxes indicating their corresponding image areas.
• We benchmark a multimodal Transformer-based model with a pointer mechanism for effectively tackling the McQR task. Extensive analysis shows the role of different modalities in our model.
2 Related Work
2.1 Query Rewrite
The task of query rewrite aims to reconstruct abbreviated in-context queries without changing their semantic meaning. Since it was first introduced (Elgohary et al., 2019; Su et al., 2019; Pan et al., 2019), most works have formulated it as a standard generation task, which can be solved via a sequence-to-sequence model (Quan et al., 2019; Vakulenko et al., 2021; Anantha et al., 2021).
Some attempts introduce a multi-task learning setup to enhance the training process (Rastogi et al., 2019; Song et al., 2020; Zhang et al., 2020), while other works focus on query rewrite in low-resource scenarios (Yu et al., 2020; Voskarides et al., 2020; Yu et al., 2021).
To model the linguistic knowledge in conversational context more effectively, prior knowledge has also been leveraged, such as semantic role labeling to provide extra guidance (Xu et al., 2020) and sequence tagging to reduce the generation search space (Hao et al., 2021). Although these works achieve strong performance on their respective tasks, query rewrite under the multimodal setting has not been explored.
2.2 Visual Coreference Resolution
Visual dialog entails answering a set of questions
grounded by an image (Das et al.,2017). Based on
that, visual coreference resolution involves linking
the words in the text (usually nouns and pronouns)
to a certain area in the image (Kong et al.,2014;
Kottur et al.,2018). Following this line, Li et al.
(2021) restrict coreference resolution to pronouns
and resolve coreferences in visual dialog in an un-
supervised way. Yu et al. (2019) define the task
of visual pronoun coreference resolution where a
dataset called VisPro and a model called VisCoref
are benchmarked accordingly. Based on that, Yu
et al. (2022) resolve pronoun coreference and pro-
pose a novel framework to improve visual dialog
understanding. This task can be seen as a subtask
of the McQR task where coreference resolution and
ellipsis completion are both taken into account.