McQueen: a Benchmark for Multimodal Conversational Query Rewrite
Yifei Yuan1, Chen Shi2, Runze Wang2, Liyi Chen3, Feijun Jiang2, Yuan You2, and Wai Lam1
1The Chinese University of Hong Kong
2Alibaba Group
3Nankai University
{yfyuan,wlam}@se.cuhk.edu.hk
{deling.sc,yunze.wrz,feijun.jiangfj,youyuan.yy}@alibaba-inc.com
liyichen@mail.nankai.edu.cn
Abstract
The task of query rewrite aims to convert an in-context query to its fully-specified version, where ellipsis and coreference are completed and referred back according to the history context. Although much progress has been made, less effort has been devoted to real-scenario conversations that involve drawing information from more than one modality. In this paper, we propose the task of multimodal conversational query rewrite (McQR), which performs query rewrite under the multimodal visual conversation setting. We collect a large-scale dataset named McQueen based on manual annotation, which contains 15k visual conversations and over 80k queries, where each query is associated with a fully-specified rewrite version. In addition, for entities appearing in the rewrites, we provide the corresponding image box annotations. We then use the McQueen dataset to benchmark a state-of-the-art method for effectively tackling the McQR task, which is based on a multimodal pre-trained model with a pointer generator. Extensive experiments are performed to demonstrate the effectiveness of our model on this task.1
1 Introduction
Recent years have witnessed increasing attention on conversation-related tasks, such as conversational question answering (Choi et al., 2018; Reddy et al., 2019) and visual conversation modeling (Das et al., 2017). One main challenge in multi-turn conversation modeling is that information from the context history is easily abbreviated or omitted in follow-up queries, causing the so-called coreference and ellipsis problems. To address this concern,
Work done when Yifei Yuan was an intern at Alibaba. This work was supported by Alibaba Group through the Alibaba Research Intern Program, and a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Code: 14200620).
1 The dataset and code of this paper are both available at https://github.com/yfyuan01/MQR
[Figure 1: An example of the multimodal query rewrite task, where Qi, Qi*, and Ai denote the queries, their corresponding rewrites, and the answers. Red denotes the coreference rewrite part and blue denotes the ellipsis rewrite part. Image boxes are used to represent the areas of the entities appearing in the rewrites ("the dog", "the man").
Q1: Is it a black Labrador?  Q1*: Is the dog a black Labrador?  A1: Yes.
Q2: How many people are there in the scene?  Q2*: How many people are there in the scene?  A2: Just one.
Q3: Can you see other people?  Q3*: Can you see other people except for the man?  A3: No.]
the task of query rewrite (Elgohary et al., 2019; Pan et al., 2019; Su et al., 2019) aims to reconstruct the original query into a fully specified form based on its history context. The rewrite eliminates the coreference and ellipsis in the original query without changing its semantic information, thus helping turn the more challenging multi-turn conversation modeling problem into a single-turn one.
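To make the input-output relation concrete, the snippet below sketches a single rewrite instance built from the first turn of Figure 1; the field names are purely illustrative and not the McQueen data schema.

```python
# Illustrative structure of one query rewrite instance (hypothetical field names).
# The in-context query uses the pronoun "it", which the rewrite resolves to the
# entity grounded in the image.
instance = {
    "history": [],                              # no earlier turns in this example
    "query": "Is it a black Labrador?",         # coreference: "it"
    "rewrite": "Is the dog a black Labrador?",  # fully-specified version
}
```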
Following this line, several attempts have been made at the query rewrite task and achieve decent performance at the natural language level. Nevertheless, conversations in real scenarios tend to involve knowledge from more than one modality, such as vision, text, and speech. Information from different modalities is not handled in isolation, but is often integrated to improve the quality of perception and understanding. For example, as shown in Figure 1, in the first turn of the visual conversation, lacking any prior context, the user directly uses the pronoun "it" to refer to the dog in the image. In the third turn, for ellipsis that does not appear in the context history, one also needs to find clues in the corresponding image in order to perform ellipsis completion. Rewriting the query in circumstances where grounding
outside the textual information is required poses additional challenges to traditional query rewrite models that rely only on textual features.
In this paper, we propose the task of multimodal conversational query rewrite (McQR), which aims to perform query rewrite under the multimodal visual conversation setting. To achieve this goal, we collect a large-scale dataset called McQueen. Specifically, for each visual conversation consisting of an image and the corresponding history of question-answer context, we provide a manual rewrite for each query, where coreference resolution and ellipsis completion are performed respectively. Furthermore, in order to assist downstream tasks such as coreference entity detection, for all the entities appearing in the rewrites, we annotate image boxes representing their corresponding image areas.
We then use the McQueen dataset to benchmark a state-of-the-art method for effectively tackling the McQR task. Inspired by the great success of pre-trained models such as BERT (Devlin et al., 2019), our model is based on a multimodal pre-trained model where interactions between different modalities can be better captured. Furthermore, we enhance the model with a pointer generator specially designed for the multimodal Transformer blocks (Vaswani et al., 2017), so that each token of the rewritten query is either generated from scratch or copied from contextual parts with high attention weights. Extensive experiments are conducted to compare our method with several state-of-the-art methods. Our model outperforms all of them on both the McQR task and two subtasks. We further analyze the role of different modalities in this task and demonstrate that introducing image information provides extra guidance for the query rewrite task.
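As a rough illustration of this copy mechanism, the following PyTorch sketch mixes a vocabulary (generation) distribution with a copy distribution obtained by scattering cross-attention weights onto the context token ids; it is a simplified, generic pointer-generator head under assumed tensor shapes, not the exact architecture used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerGeneratorHead(nn.Module):
    """Simplified pointer-generator head over Transformer decoder states
    (a generic sketch, not the exact McQueen architecture)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_size, vocab_size)  # generation distribution
        self.gen_gate = nn.Linear(hidden_size, 1)              # p_gen: generate vs. copy

    def forward(self, dec_hidden, cross_attn, src_token_ids):
        # dec_hidden:    [batch, hidden]   decoder state at the current step
        # cross_attn:    [batch, src_len]  attention weights over context tokens
        # src_token_ids: [batch, src_len]  vocabulary ids of the textual context tokens
        p_vocab = F.softmax(self.vocab_proj(dec_hidden), dim=-1)  # [batch, vocab]
        p_gen = torch.sigmoid(self.gen_gate(dec_hidden))          # [batch, 1]

        # Copy distribution: scatter attention mass onto the source token ids.
        p_copy = torch.zeros_like(p_vocab)
        p_copy.scatter_add_(1, src_token_ids, cross_attn)

        # Final distribution mixes generating a new token and copying from context.
        return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```

At decode time the next token would then be taken (e.g., via argmax or beam search) from this mixed distribution, so tokens such as entity names can be copied verbatim from the dialog history when their attention weight is high.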
In summary, the contributions of our paper are threefold:
• We formally define the task of multimodal conversational query rewrite (McQR), which aims to generate a fully-specified rewrite of a query based on both the context history and the visual image.
• We propose a large-scale dataset, McQueen, containing 15k visual conversations and over 80k rewrites. For the entities appearing in the rewrites, we also annotate image boxes representing their corresponding image areas.
• We benchmark a multimodal Transformer-based model with a pointer mechanism for effectively tackling the McQR task. Extensive analysis shows the role of different modalities in our model.
2 Related Work
2.1 Query Rewrite
The task of query rewrite provides reconstructed queries based on abbreviated in-context queries without changing their semantic meaning. First introduced by Elgohary et al. (2019), Su et al. (2019), and Pan et al. (2019), it is formulated by most works as a standard generation task, which can be solved via a sequence-to-sequence model (Quan et al., 2019; Vakulenko et al., 2021; Anantha et al., 2021). Some attempts introduce a multi-task learning setup in order to enhance the training process (Rastogi et al., 2019; Song et al., 2020; Zhang et al., 2020), while other works focus on query rewrite under low-resource scenarios (Yu et al., 2020; Voskarides et al., 2020; Yu et al., 2021). To model the linguistic knowledge in conversational context more effectively, prior knowledge has been leveraged, such as using semantic role labeling to provide extra guidance (Xu et al., 2020) and reducing the generation search space via sequence tagging (Hao et al., 2021). Although these works achieve strong performance on their corresponding tasks, query rewrite under the multimodal setting has not been explored.
2.2 Visual Coreference Resolution
Visual dialog entails answering a series of questions grounded in an image (Das et al., 2017). Building on this, visual coreference resolution involves linking words in the text (usually nouns and pronouns) to certain areas in the image (Kong et al., 2014; Kottur et al., 2018). Following this line, Li et al. (2021) restrict coreference resolution to pronouns and resolve coreferences in visual dialog in an unsupervised way. Yu et al. (2019) define the task of visual pronoun coreference resolution, for which a dataset called VisPro and a model called VisCoref are introduced. Based on that, Yu et al. (2022) resolve pronoun coreference and propose a novel framework to improve visual dialog understanding. Visual coreference resolution can be seen as a subtask of McQR, which takes both coreference resolution and ellipsis completion into account.
3 The McQueen Dataset
3.1 Dataset Overview
Our dataset is built on a visual dialog dataset called VisDial (Das et al., 2017). The original VisDial dataset consists of over 133k dialogs, each associated with an image and 10 rounds of question-answer pairs. All question-answer pairs are conducted in a conversational format and revolve around the content of the image.
We randomly select 15k conversations from the VisDial dataset, yielding a total of over 80k rewritten utterances. For each query in a visual conversation, we conduct manual annotation to resolve the information omission. The query is reconstructed based on the image as well as the history context so that coreference and ellipsis are referred back or completed. For negative queries that do not contain any information omission, the rewrite stays the same as the original query. In addition, for all the entities appearing in the coreference and ellipsis parts, we annotate image boxes representing their corresponding image areas.
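For concreteness, one conversation in the dataset can be thought of along the lines of the following record; the field names and box coordinates here are hypothetical and only illustrate the kind of information annotated, not the released file format.

```python
# Hypothetical layout of one McQueen conversation (illustrative field names).
record = {
    "image_id": "COCO_000000123456",        # image the dialog is grounded in (made-up id)
    "dialog": [
        {
            "question": "Is it a black Labrador?",
            "rewrite": "Is the dog a black Labrador?",   # coreference resolved
            "answer": "Yes",
            "boxes": {"the dog": [120, 45, 310, 280]},   # [x1, y1, x2, y2], illustrative values
        },
        # ... further question-answer turns with their rewrites
    ],
}
```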
3.2 Dataset Construction
3.2.1 Text Rewrite Annotation
For manual annotation, we hire 16 annotators in total. Before the annotation starts, we provide 100 examples for all the annotators to refer to. We also provide a guideline and tutorials listing typical coreference and ellipsis cases so that bias and language-style shift between individuals are minimized as much as possible. After that, the annotators start working on a small portion of the data where query rewrite is performed. After all the results are returned and the data quality is checked, the main annotation phase begins and the rest of the data is labeled. On average, each annotator is in charge of rewriting 5,059 queries. The rewrite annotation interface can be seen in Appendix A.
3.2.2 Image Box Annotation
Besides the rewrite annotation, we also provide image annotation to assist downstream or related tasks (e.g., coreference entity detection). The image box annotation begins right after the rewrite annotation. The overall procedure also follows the (1) tutorial, (2) trial phase, (3) main phase pipeline. Specifically, the annotators extract the entities in the ellipsis and coreference parts and draw their bounding boxes in the image. Each annotator is in charge of annotating the images for the rewrites they wrote themselves. The image annotation interface can be seen in Appendix B.
3.2.3 Quality Control
After all the annotation is finished, we re-group and shuffle the annotators to perform cross quality inspection. Each group is asked to check the annotation results of other groups. In addition, two new annotators who did not take part in the annotation phase are recruited to check the quality of all the annotation results. The annotators have to answer three questions for each query: (1) Is the rewrite result correct or not? (2) Are all the coreference and ellipsis resolved in the rewrite? (3) Are the entities in the coreference and ellipsis correctly annotated in the image? Each conversation rewrite must receive "yes" to all three questions from all the annotators before official acceptance; otherwise it is collected to be revised and re-checked (the questionnaire interface is shown in Appendix C). The whole check-revise process lasts for three iterations. Considering chance agreement, we measure the inter-annotator agreement (IAA) in terms of Cohen's κ (Cohen, 1960). The final κ score is 0.82, reaching the "almost perfect" level.2 Besides, after each quality check iteration, we randomly sample 100 conversations from the dataset and manually evaluate the utterance-level precision and recall, where precision denotes the rate of retrieved rewrites being correct, while recall records the portion of coreference and ellipsis being handled. The precision and recall in the 1st/2nd/3rd iterations are (89.0%, 87.1%), (95.5%, 94.2%), and (98.3%, 98.2%), respectively.
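As a side note, the agreement statistic above can be computed with standard tooling; the snippet below is a minimal sketch using scikit-learn's cohen_kappa_score, with made-up accept/reject judgments rather than the actual quality-check data.

```python
# Minimal sketch of computing Cohen's kappa for two quality checkers
# (the judgment lists below are illustrative, not real annotation data).
from sklearn.metrics import cohen_kappa_score

checker_a = [1, 1, 0, 1, 1, 0, 1, 1]  # 1 = accept, 0 = reject
checker_b = [1, 1, 0, 1, 0, 0, 1, 1]

kappa = cohen_kappa_score(checker_a, checker_b)
print(f"Cohen's kappa: {kappa:.2f}")  # >= 0.81 counts as "almost perfect" (Landis and Koch, 1977)
```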
3.2.4 Annotation Cost and Duration
The overall process, including annotation and quality checking, spanned 10 weeks (from March to May 2022): the annotation guidance lasted 2 weeks, data annotation lasted 5 weeks, and quality checking lasted 3 weeks. All the annotators are native English speakers recruited from a professional data management company, Appen.3 The annotation cost $5,942 in total, at $0.31 per utterance rewrite and $0.03 per image box annotation.
3.3 Dataset Statistics
According to Table 2, 86.5% of our dataset covers positive rewrite cases where coreference or ellipsis occurs.
2 According to Landis and Koch (1977), agreement is considered "almost perfect" if κ ≥ 0.81.
3 https://appen.com/