
[Figure 2: Pie charts showing the proportions of noun phrases (48,798 in total) and pronouns (25,889 in total) across dialogues with different numbers of coreference chains (C denotes coreference chain).]
up a standard annotation guideline according to our
task purpose; (iii) Recruiting sufficient expert users
to annotate the dataset and ensuring that each instance
receives three annotations. Firstly, we adopt the Label
Studio platform (Tkachenko et al., 2020-2022) as
the basis to design a user-friendly interface tailored
to our task; the concrete interface is shown
in Appendix A. Then, we invite three people with
prior visual grounding research experience as
our experts. They annotate 100 data pairs together
as examples and, after several discussions, establish
an annotation guideline based on their consensus.
Next, we recruit a number of college students
who are proficient in English to annotate
our dataset. Before starting the task, the students
are asked to read the annotation guideline
carefully and to annotate some test data;
during this period, we examine
the students and choose 20 of them for the
subsequent annotation task. For annotation,
the prepared data is split into micro-tasks,
each consisting of 500 dialogues.
We assign three workers to each micro-task, and
their identities remain hidden from one another.
After all annotation tasks are finished, we let our
experts check the results and correct
the inconsistent annotations.
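As a rough illustration of this assignment scheme, the sketch below splits dialogue IDs into micro-tasks of 500 and assigns three annotators to each; the worker identifiers and pool size are placeholders, not the actual annotation tooling.

from itertools import cycle

def build_micro_tasks(dialogue_ids, workers, task_size=500, per_task=3):
    # Split the prepared data into micro-tasks of `task_size` dialogues each.
    tasks = [dialogue_ids[i:i + task_size]
             for i in range(0, len(dialogue_ids), task_size)]
    pool = cycle(workers)
    # Give each micro-task `per_task` annotators; identities stay hidden from
    # one another because assignments are tracked only by the coordinator.
    return [(task, [next(pool) for _ in range(per_task)]) for task in tasks]

# Hypothetical usage: 8,857 dialogues, the 20 selected student annotators.
assignments = build_micro_tasks(list(range(8857)), [f"w{i}" for i in range(20)])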
Finally, we establish the VD-Ref dataset, which
is manually annotated with the noun phrases and
pronouns that naturally form coreference chains,
as well as with the relevant bounding boxes in images.
2.4 Statistics of the VD-Ref Dataset
In total, we collect 74,687 entity mentions and
23,980 objects from 8,857 dialogues of the VisDial
dataset, where the mentions include 48,798 noun
phrases and 25,889 pronouns. On average, a dialogue
contains 5.51 noun phrases and 2.92 pronouns.
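These per-dialogue averages follow directly from the totals above, as the quick check below verifies.

# Quick sanity check of the reported per-dialogue averages.
noun_phrases, pronouns, dialogues = 48_798, 25_889, 8_857
assert round(noun_phrases / dialogues, 2) == 5.51
assert round(pronouns / dialogues, 2) == 2.92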
In contrast, existing datasets for phrase
grounding hardly consider pronouns. The
ReferItGame dataset (Kazemzadeh et al., 2014)
only involves nouns and noun phrases, while
the Flickr30k Entities dataset, although it
annotates pronouns in captions, does not label
their corresponding bounding boxes in images.
Moreover, owing to the diversity of our
dataset, the number of coreference chains varies.
As Figure 2 shows, the pie charts display the
distinct distributions of noun phrases and pronouns
in the VD-Ref dataset. For both noun phrases
and pronouns, the dialogues with no more than
three coreference chains account for the major
proportion, up to 70%; accordingly, the dialogues
with more than three coreference chains constitute
the remaining proportion.
Moreover, as the mentions of coreference
chains and bounding boxes come in pairs, we can
classify coreference chains into four types (see the
sketch after this list):
• one mention vs. one box: This type contains
only one mention and one corresponding box,
indicating that the chain excludes pronouns.
• one mention vs. boxes: As the referred object
is separated into several regions, more
than one box is needed for annotation.
• mentions vs. one box: In this type of coreference
chain, all noun phrases and pronouns refer to
the same single box in the image.
• mentions vs. boxes: This type contains several
mentions, including noun phrases and pronouns,
associated with multiple boxes.
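A minimal sketch of this typology: given a chain's mention and box counts, the type follows deterministically. The function below is our own illustration, not part of the dataset tooling.

def chain_type(num_mentions: int, num_boxes: int) -> str:
    # Map a coreference chain's counts to one of the four types above.
    if num_mentions == 1 and num_boxes == 1:
        return "one mention vs. one box"   # single mention, no pronouns
    if num_mentions == 1:
        return "one mention vs. boxes"     # object split across regions
    if num_boxes == 1:
        return "mentions vs. one box"      # all mentions share one box
    return "mentions vs. boxes"            # many mentions, many boxes

assert chain_type(1, 1) == "one mention vs. one box"
assert chain_type(3, 1) == "mentions vs. one box"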
Finally, the train, validation, and test sets contain
6,199 (70.00%), 1,063 (12.00%), and 1,595
(18.00%) image-dialogue pairs, respectively. We
report other statistics in Table 1 as well.
3 Method
Recent works (Kamath et al., 2021; Li et al., 2022)
bring the successful vision-language transformer
architecture and the pretrain-then-finetune paradigm
to the phrase grounding task, achieving state-of-
the-art performance. To explore our constructed
dataset, we adopt the representative MDETR (Kamath
et al., 2021) model. Meanwhile, we propose
to enhance the textual representations with the
natural coreference chains in texts via Relational
Graph Convolutional Networks (R-GCN) (Schlichtkrull
et al., 2018). Below, we briefly describe how
MDETR learns and grounds, and then present our
suggested coreference graph encoding.
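To make the graph encoding concrete, the following is a minimal sketch of a single R-GCN layer over a coreference graph, assuming PyTorch. It is our illustration of the Schlichtkrull et al. (2018) update, not the authors' implementation; the node features stand in for mention representations, and the single "coreferent" relation type is a hypothetical choice.

import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.self_loop = nn.Linear(dim, dim, bias=False)
        # One weight matrix per relation type, as in R-GCN.
        self.rel = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_relations)]
        )

    def forward(self, h: torch.Tensor, edges: list) -> torch.Tensor:
        # h: (num_nodes, dim) mention representations
        # edges: (src, dst, relation_id) triples
        out = self.self_loop(h)
        for r, w in enumerate(self.rel):
            pairs = [(s, d) for s, d, rel in edges if rel == r]
            if not pairs:
                continue
            src = torch.tensor([s for s, _ in pairs])
            dst = torch.tensor([d for _, d in pairs])
            # Mean-aggregate transformed neighbor states per target node.
            agg = torch.zeros_like(h).index_add_(0, dst, w(h[src]))
            deg = torch.zeros(h.size(0)).index_add_(0, dst, torch.ones(len(dst)))
            out = out + agg / deg.clamp(min=1).unsqueeze(1)
        return torch.relu(out)

# Hypothetical usage: four mentions ("a man", "he", "a hat", "it");
# coreferent mentions are linked in both directions with relation 0.
h = torch.randn(4, 256)
edges = [(0, 1, 0), (1, 0, 0), (2, 3, 0), (3, 2, 0)]
h_new = RGCNLayer(dim=256, num_relations=1)(h, edges)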
3.1 Grounding Model
As depicted in Figure 3, for a given image-text pair,
MDETR first uses an image encoder (Tan and Le,