outside the text information is needed poses more challenges to traditional query rewrite models that are based only on textual features.
In this paper, we propose the task of multimodal
conversational query rewrite (McQR), which aims
to perform query rewrite under the multimodal vi-
sual conversation setting. To achieve this goal,
we collect a large-scale dataset called McQueen.
Specifically, for each visual conversation consisting of an image and the corresponding history question-answer context, we provide a manual rewrite of the query, in which coreference resolution and ellipsis completion are performed. Furthermore, to assist downstream tasks such as coreference entity detection, we annotate image boxes for all the entities appearing in the rewrite to indicate their corresponding image areas.
We then use the McQueen dataset to benchmark
a state-of-the-art method for effectively tackling
the McQR task. Inspired by the great success of pre-trained models such as BERT (Devlin et al., 2019), our model builds on a multimodal pre-trained model, in which interactions between different modalities can be better captured. Furthermore, we enhance the model with a pointer generator specially designed for the multimodal Transformer blocks (Vaswani et al., 2017), so that each token of the rewritten query is either generated from the vocabulary or copied from contextual parts with high attention weights. Extensive experiments compare our method with several state-of-the-art methods; our model outperforms all of them on both the McQR task and two subtasks. We further analyze the role of different modalities in this task and demonstrate that introducing image information provides extra guidance for the query rewrite task.
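To make the copy mechanism concrete, the following is a minimal sketch of one decoding step of a generic pointer-generator head, written in PyTorch-style Python. It illustrates the standard pointer-generator idea rather than the exact implementation used in our model; all function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(vocab_logits, attn_weights, context_token_ids, p_gen):
    """One decoding step of a generic pointer-generator head (illustrative only).

    vocab_logits:      (batch, vocab_size) decoder logits over the output vocabulary
    attn_weights:      (batch, ctx_len)    attention of the current step over context tokens
    context_token_ids: (batch, ctx_len)    vocabulary ids of the context tokens
    p_gen:             (batch, 1)          probability of generating rather than copying
    """
    # Probability of generating each token from the vocabulary.
    gen_dist = p_gen * F.softmax(vocab_logits, dim=-1)

    # Probability of copying each context token: scatter the attention weights
    # onto the vocabulary positions of the corresponding context tokens.
    copy_dist = torch.zeros_like(gen_dist)
    copy_dist.scatter_add_(1, context_token_ids, (1.0 - p_gen) * attn_weights)

    # The final output distribution mixes generation and copying.
    return gen_dist + copy_dist
```

During training, the gate p_gen is typically computed from the decoder state, and the loss is the negative log-likelihood of each target token under the mixed distribution, so the model learns when to copy from the context and when to generate from the vocabulary.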
In summary, the contributions of our paper are threefold:
• We formally define the task of multimodal conversational query rewrite (McQR), which aims to generate a fully specified rewritten query based on both the context history and the visual image.
• We propose McQueen, a large-scale dataset containing 15k visual conversations and over 80k rewrites. For the entities appearing in the rewrites, we also annotate image boxes indicating their corresponding image areas.
• We benchmark a multimodal Transformer-based model with a pointer mechanism for effectively tackling the McQR task. Extensive analysis shows the role of different modalities in our model.
2 Related Work
2.1 Query Rewrite
The task of query rewrite aims to reconstruct abbreviated in-context queries without changing their semantic meaning. Since it was first introduced (Elgohary et al., 2019; Su et al., 2019; Pan et al., 2019), most works have formulated it as a standard generation task, which can be solved via a sequence-to-sequence model (Quan et al., 2019; Vakulenko et al., 2021; Anantha et al., 2021).
Some attempts introduce a multi-task learning setup to enhance the training process (Rastogi et al., 2019; Song et al., 2020; Zhang et al., 2020), while other works focus on query rewrite in low-resource scenarios (Yu et al., 2020; Voskarides et al., 2020; Yu et al., 2021).
To model the linguistic knowledge in conversational context more effectively, prior knowledge has also been leveraged, such as semantic role labeling to provide extra guidance (Xu et al., 2020) and sequence tagging to reduce the generation search space (Hao et al., 2021). Although these works achieve strong performance on their respective tasks, query rewrite under the multimodal setting has not been explored.
2.2 Visual Coreference Resolution
Visual dialog entails answering a set of questions
grounded by an image (Das et al.,2017). Based on
that, visual coreference resolution involves linking
the words in the text (usually nouns and pronouns)
to a certain area in the image (Kong et al.,2014;
Kottur et al.,2018). Following this line, Li et al.
(2021) restrict coreference resolution to pronouns
and resolve coreferences in visual dialog in an un-
supervised way. Yu et al. (2019) define the task
of visual pronoun coreference resolution where a
dataset called VisPro and a model called VisCoref
are benchmarked accordingly. Based on that, Yu
et al. (2022) resolve pronoun coreference and pro-
pose a novel framework to improve visual dialog
understanding. This task can be seen as a subtask
of the McQR task where coreference resolution and
ellipsis completion are both taken into account.