Extending Phrase Grounding with Pronouns in Visual Dialogues
Panzhong Lu1, Xin Zhang1, Meishan Zhang2, Min Zhang2
1School of New Media and Communication, Tianjin University
2Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen)
{panzhong171,hsinz}@tju.edu.cn, {zhangmeishan,zhangmin2021}@hit.edu.cn
Corresponding author.
Abstract
Conventional phrase grounding aims to localize noun phrases mentioned in a given caption to their corresponding image regions, and has achieved great success recently. However, noun phrase grounding alone is not enough for cross-modal visual language understanding. Here we extend the task by considering pronouns as well. First, we construct a dataset that grounds both noun phrases and pronouns to image regions. Based on the dataset, we test the performance of phrase grounding using a state-of-the-art model from this line of work. Then, we enhance the baseline grounding model with coreference information, which should potentially help our task, modeling the coreference structures with graph convolutional networks. Experiments on our dataset show, interestingly, that pronouns are easier to ground than noun phrases; a possible reason is that these pronouns are much less ambiguous. Additionally, our final model with coreference information can significantly boost the grounding performance on both noun phrases and pronouns.
1 Introduction
Grounded language learning has been prevalent for decades in many fields (Chandu et al., 2021), generally aiming to learn the real-world meaning of textual units (e.g., words or phrases) by conjointly leveraging perception data (e.g., images or videos). Bisk et al. (2020) advocate, from a novel perspective, that we cannot overlook the physical world that language describes when doing language understanding research. In particular, with the stimulation of modeling techniques and multi-modal data collection paradigms, the task has made excellent progress in downstream tasks, including multi-modal question answering (Agrawal et al., 2017; Chang et al., 2022), video-text alignment (Yang et al., 2021) and robot navigation (Roman Roman et al., 2020; Gu et al., 2022).

C: Two kids are eating food at a table.
Q: What color hair do they have?
A: Dirty blonde.
Q: What type of food are they eating?
A: Pizza.
Q: Are they drinking anything?
A: Looks like water.
Figure 1: An example of grounding noun phrases and pronouns mentioned in the caption and dialogue (shown partly) to the associated image regions. Given an image described by a caption, two people discuss what they can see. We annotate mentions of the same object with the same color; naturally, the same object mentioned in the text forms a coreference chain.
Typically, as one branch of grounded language learning, phrase grounding, first proposed by Plummer et al. (2015), also plays a key role in visual language understanding. Its goal is to ground the phrases in a given caption to the corresponding image regions. Recently, many researchers have attempted varied approaches to explore this task. Mu et al. (2021) propose a novel graph learning framework for phrase grounding to distinguish the diversity of context among phrases and image regions. Wang et al. (2020) develop a multimodal alignment framework to utilize caption-image datasets under weak supervision. Kamath et al. (2021) advance phrase grounding with their end-to-end modulated pre-trained network named MDETR. Overall, the natural language processing (NLP) and computer vision (CV) communities have seen huge achievements on the task of phrase grounding.
In spite of this apparent success, there remains a noteworthy weakness. Almost all previous works focus mainly on noun phrases/words, whose meanings can be derived from their surface forms to some extent. Little work takes pronouns into account. As shown in Figure 1, pronouns definitely have underlying effects on the performance of visual grounding, which should be carefully examined (Yu et al., 2019). As a result, here we extend the common (mostly noun) phrase grounding task to pronouns for the first time.
In this paper, we present the first work investigating phrase grounding that includes pronouns, and explore how coreference chains can affect the performance of our task. We annotate an initial dataset based on visual dialogue (Das et al., 2017), as shown in Figure 1. For the model, we can directly apply MDETR (Kamath et al., 2021), which is an end-to-end modulated detector. However, the model does not offer much information for understanding pronouns. Thus, we enhance the vanilla model with coreference information from the dialogue side, where a graph neural network is adopted to encode the graph-style coreference knowledge.
Finally, we conduct experiments on our constructed dataset to benchmark the extended phrase grounding task. Interestingly, according to the results, we find that pronouns are easier to ground by MDETR than noun phrases. The underlying reason might be that pronouns are always more important during the dialogue, leading to less ambiguity in communication. In addition, our final model can be significantly enhanced by adding the gold graph-style coreference knowledge; however, the model fails to obtain any positive gain when the coreference information is sourced from a state-of-the-art machine learning model. We conduct several in-depth analyses for a comprehensive understanding of our task as well as the model.
In summary, our contributions are as follows:
• We extend the task of phrase grounding by taking pronouns into account, and correspondingly establish a new dataset manually, named VD-Ref, which is the first dataset with ground-truth mappings from both noun phrases and pronouns to image regions.
• We benchmark the extended phrase grounding task with a state-of-the-art model, and also investigate our task with the coreference knowledge of the text, which should benefit our task straightforwardly.
• We observe several unexpected results in our empirical verification, and to understand them, we offer in-depth analyses, which might be useful for future investigations of phrase grounding.

Sect.   #Img    #Pronoun   #Phrase   #Box     #Coref
Train   6199    18600      35118     16559    14582
Dev     1063     3256       5739      3074     2503
Test    1595     4033       7941      4347     3754
Table 1: Data statistics of our constructed dataset. #Box means the number of bounding boxes in the images; #Coref means the number of coreference chains.
2 Our Task and The VD-Ref Dataset
2.1 Task Description
The general purpose of the phrase grounding task is to map multiple noun phrases to image regions; in this paper, however, we take the challenge a step further by grounding both noun phrases and pronouns from a given dialogue to the appropriate regions of an image. Take Figure 1 for example: given all the expressions mentioned in the dialogue, such as the coreference chain that includes “Two kids” and “they”, the task needs to predict the corresponding regions of the object “kids” with bounding boxes in the image.
Formally, we define the task as follows: given an image I and the corresponding ground-truth dialogue D, we denote M = {N, P} as the set of all language expressions, where N is the noun phrases and P is the pronouns. The prime objective of the task is to predict a bounding box (or bounding boxes) B for each expression.
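To make the setup concrete, the minimal sketch below shows one way such an instance, with its mentions M = {N, P}, coreference chains, and gold boxes, could be represented in Python. The field names and example values are our own illustrative assumptions, not the released VD-Ref format.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Mention:
    text: str           # surface form, e.g. "Two kids" or "they"
    kind: str           # "noun_phrase" (element of N) or "pronoun" (element of P)
    chain_id: int       # coreference chain the mention belongs to
    box_ids: List[int]  # indices of the bounding boxes it grounds to


@dataclass
class VDRefExample:
    image_path: str
    dialogue: List[str]                                      # caption followed by Q/A turns
    boxes: List[List[float]] = field(default_factory=list)   # [x, y, w, h] per region
    mentions: List[Mention] = field(default_factory=list)


example = VDRefExample(
    image_path="coco/000000123456.jpg",   # hypothetical path, for illustration only
    dialogue=["Two kids are eating food at a table.",
              "Q: What color hair do they have?", "A: Dirty blonde."],
    boxes=[[12.0, 30.0, 150.0, 210.0], [160.0, 28.0, 140.0, 205.0]],
    mentions=[
        Mention("Two kids", "noun_phrase", chain_id=0, box_ids=[0, 1]),
        Mention("they", "pronoun", chain_id=0, box_ids=[0, 1]),
    ],
)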
2.2 Data Collection
With the aim of building a high-quality dataset that includes sufficient pronouns, we adopt the large-scale VisDialog dataset (Das et al., 2017), which contains 120k images from COCO (Lin et al., 2014), where each image is associated with a dialogue about the image (if not specified otherwise, the dialogues discussed in the following all contain a caption). We randomly choose 10k complete sets from the VisDialog dataset, and use the StanfordCoreNLP (Manning et al., 2014) tool to tokenize the sentences, making them suitable for the subsequent human annotation.
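As a rough illustration of this preprocessing step, the sketch below tokenizes a dialogue turn by turn. The authors used the Java StanfordCoreNLP toolkit; the stanza package is used here purely as a Python stand-in, which is an assumption of this sketch.

import stanza

stanza.download("en")                                     # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize")   # tokenization only

dialogue = ["Two kids are eating food at a table.",
            "What color hair do they have?",
            "Dirty blonde."]
# One token list per sentence, across all dialogue turns.
tokenized = [[tok.text for tok in sent.tokens]
             for turn in dialogue
             for sent in nlp(turn).sentences]
print(tokenized[0])  # ['Two', 'kids', 'are', 'eating', 'food', 'at', 'a', 'table', '.']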
2.3 Annotation Process
The whole annotation workflow is divided into three stages: (i) developing a convenient online tool for user annotation; (ii) setting up a standard annotation guideline according to our task purpose; and (iii) recruiting sufficient expert annotators and ensuring that each instance receives three annotations. Firstly, we adopt the Label Studio platform (Tkachenko et al., 2020-2022) as the basis to design a user-friendly interface tailored to our task; the concrete interface is shown in Appendix A. Then, we invite three people with prior visual grounding research experience as our experts. They annotate 100 data pairs together as examples, and establish an annotation guideline based on their consensus after several discussions.

Figure 2: The proportion of noun phrases and pronouns across different numbers of coreference chains (C = coreference chain). Noun phrases (48,798 in total): one C 17%, two C 30%, three C 26%, four C 13%, five C 8%, six C 4%, seven C 2%. Pronouns (25,889 in total): one C 26%, two C 30%, three C 23%, four C 11%, five C 6%, six C 3%, seven C 1%.
Next, we recruit a number of college students who are proficient in English to annotate our dataset. Before starting our task, the students are asked to read the annotation guideline carefully and to annotate some test sets of data; during this period, we examine these students and choose 20 of them for the subsequent annotation task. For annotation, the prepared data is split into micro-tasks of 500 dialogues each. We assign three workers to each micro-task, and their identities remain hidden from each other. After all annotation tasks are finished, we let our experts check the results and correct the inconsistent annotations as well.
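The paper does not specify how inconsistent annotations are detected; one plausible automatic pre-filter (purely our assumption) is to compare the three annotators' boxes for the same mention by IoU and flag poorly overlapping cases for expert review, as in the sketch below.

from itertools import combinations
from typing import List


def iou(a: List[float], b: List[float]) -> float:
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def needs_expert_review(boxes: List[List[float]], threshold: float = 0.5) -> bool:
    """True if any pair of annotators' boxes for the same mention overlaps poorly.

    The 0.5 threshold is an illustrative assumption, not a value from the paper."""
    return any(iou(x, y) < threshold for x, y in combinations(boxes, 2))


# Three annotators drew a box for the mention "Two kids"; the third one disagrees.
print(needs_expert_review([[10, 30, 150, 210], [12, 28, 148, 212], [200, 5, 90, 90]]))  # True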
Finally, we establish the VD-Ref dataset, which is manually annotated with the noun phrases and pronouns that naturally form coreference chains, as well as with the relevant bounding boxes in images.
2.4 Statistics of the VD-Ref Dataset
In total, we collect 74,687 entity mentions and 23,980 objects from 8,857 VisDialog image-dialogue pairs, where the mentions include 48,798 noun phrases and 25,889 pronouns; on average, a dialogue contains 5.51 noun phrases and 2.92 pronouns. In contrast, existing datasets for phrase grounding hardly consider pronouns. The ReferItGame dataset (Kazemzadeh et al., 2014) only involves nouns and noun phrases, while the Flickr30k Entities dataset does not label the corresponding bounding boxes in images, although it annotates the pronouns in captions.
Meanwhile, because of the diversity of our dataset, the number of coreference chains varies. As Figure 2 shows, the pie charts display the distinctive distributions of noun phrases and pronouns in the VD-Ref dataset. It is clear that, for both noun phrases and pronouns, the dialogues that have no more than three coreference chains account for the major proportion, up to 70%; accordingly, the dialogues with more than three coreference chains constitute the remaining proportion.
Moreover, as the mentions of the coreference chains and the bounding boxes come in pairs, we can define four types of coreference chains, illustrated by the small sketch after this list:
• one mention vs. one box: This type contains only one mention and one corresponding box, which indicates that the chain contains no pronoun.
• one mention vs. boxes: As the referred object is separated into several regions, more than one box is needed for annotation.
• mentions vs. one box: In this coreference chain, all noun phrases and pronouns refer to the same single box in the image.
• mentions vs. boxes: This type contains several mentions, including noun phrases and pronouns, associated with multiple boxes.
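The helper below encodes these four types by classifying a chain from its number of mentions and boxes; it is purely illustrative and not part of the dataset release.

def chain_type(num_mentions: int, num_boxes: int) -> str:
    """Map a coreference chain to one of the four types defined above."""
    if num_mentions == 1 and num_boxes == 1:
        return "one mention vs. one box"
    if num_mentions == 1:
        return "one mention vs. boxes"
    if num_boxes == 1:
        return "mentions vs. one box"
    return "mentions vs. boxes"


# The chain {"Two kids", "they"} grounded to two separate boxes is of the last type.
assert chain_type(2, 2) == "mentions vs. boxes"
assert chain_type(1, 1) == "one mention vs. one box"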
Finally, the train, validation and test sets contain 6,199 (70.00%), 1,063 (12.00%) and 1,595 (18.00%) image-dialogue pairs, respectively. We report other statistics in Table 1 as well.
3 Method
Recent works (Kamath et al.,2021;Li et al.,2022)
bring the successful vision-language transformer ar-
chitecture and the pre-train-then-finetune paradigm
to the phrase grounding task, achieving state-of-
the-art performance. To explore our constructed
dataset, we adopt the representative MDETR (Ka-
math et al.,2021) model. Meanwhile, we propose
to enhance the textual representations with the natu-
ral coreference chains in texts by Relational Graph
Convolutional Networks (R-GCN) (Schlichtkrull
et al.,2018). Bellow, we briefly describe how
MDETR learns and grounds, and then present our
suggested coreference graph encoding.
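To make the coreference graph encoding concrete, the sketch below shows a minimal relational GCN layer over mention vectors, with a single relation type for "same coreference chain". This is our own illustrative simplification, not the authors' released code; the choice of pooled mention vectors, the hidden size, and the use of only one relation are assumptions.

import torch
import torch.nn as nn


class RGCNLayer(nn.Module):
    """One R-GCN layer (Schlichtkrull et al., 2018):
    h_i' = ReLU(W_0 h_i + sum_r sum_{j in N_r(i)} (1/|N_r(i)|) W_r h_j)."""

    def __init__(self, hidden_size: int, num_relations: int):
        super().__init__()
        self.self_loop = nn.Linear(hidden_size, hidden_size)
        self.rel_weights = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size, bias=False) for _ in range(num_relations)]
        )

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   [num_mentions, hidden]                        mention vectors from the text encoder
        # adj: [num_relations, num_mentions, num_mentions]   0/1 adjacency per relation
        out = self.self_loop(h)
        for r, w_r in enumerate(self.rel_weights):
            degree = adj[r].sum(dim=-1, keepdim=True).clamp(min=1.0)   # per-node neighbor count
            out = out + (adj[r] @ w_r(h)) / degree                     # normalized message passing
        return torch.relu(out)


# Toy usage: four mentions, one coreference chain {0, 2, 3}, one "same-chain" relation.
hidden = 64
mentions = torch.randn(4, hidden)            # e.g., pooled text-encoder vectors per mention
adj = torch.zeros(1, 4, 4)
chain = [0, 2, 3]
for i in chain:
    for j in chain:
        if i != j:
            adj[0, i, j] = 1.0               # connect mentions of the same chain
layer = RGCNLayer(hidden, num_relations=1)
enhanced = layer(mentions, adj)              # coreference-aware mention representations
print(enhanced.shape)                        # torch.Size([4, 64])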
3.1 Grounding Model
As depicted in Figure 3, for a given image-text pair, MDETR first uses an image encoder (Tan and Le,