
[Figure 2: Pie charts showing the proportions of noun phrases (48,798 in total) and pronouns (25,889 in total) across dialogues with different numbers of coreference chains (C denotes coreference chain).]
up a standard annotation guideline according to our
task purpose; (iii) Recruiting sufficient expert users
to annotate the dataset and ensuring that each instance
receives three annotations. Firstly, we adopt the Label
Studio platform (Tkachenko et al., 2020-2022) as
the basis to design a user-friendly interface tailored
to our task; the concrete interface is shown
in Appendix A. Then, we invite three people with
prior visual grounding research experience as
our experts. They annotate 100 data pairs together
as examples and, after several discussions, establish
an annotation guideline based on their consensus.
Next, we recruit a number of college students
who are proficient in English to annotate
our dataset. Before starting the task, the students
are asked to read the annotation guideline
carefully and to annotate some test data;
during this period, we examine
the students and choose 20 of them for the
subsequent annotation task. For annotation,
the prepared data is split into micro-tasks,
each consisting of 500 dialogues.
We assign three workers to each micro-task, and
their identities remain hidden from one another.
After all annotation tasks are finished, we let our
experts check the results and correct
the inconsistent annotations.
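As a rough illustration of this assignment scheme, the sketch below splits dialogue IDs into micro-tasks of 500 and assigns three annotators to each; the worker identifiers and pool size are placeholders, not the actual annotation tooling.

from itertools import cycle

def build_micro_tasks(dialogue_ids, workers, task_size=500, per_task=3):
    # Split the prepared data into micro-tasks of `task_size` dialogues each.
    tasks = [dialogue_ids[i:i + task_size]
             for i in range(0, len(dialogue_ids), task_size)]
    pool = cycle(workers)
    # Give each micro-task `per_task` annotators; identities stay hidden from
    # one another because assignments are tracked only by the coordinator.
    return [(task, [next(pool) for _ in range(per_task)]) for task in tasks]

# Hypothetical usage: 8,857 dialogues, the 20 selected student annotators.
assignments = build_micro_tasks(list(range(8857)), [f"w{i}" for i in range(20)])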
Finally, we establish the VD-Ref dataset, which
is manually annotated with the noun phrases and
pronouns that naturally form coreference chains,
as well as with the relevant bounding boxes in images.
2.4 Statistics of the VD-Ref Dataset
In total, we collect 74,687 entity mentions and
23,980 objects from 8,857 dialogues of the VisDial
dataset, where the mentions include 48,798 noun
phrases and 25,889 pronouns. On average, a dialogue
contains 5.51 noun phrases and 2.92 pronouns.
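These per-dialogue averages follow directly from the totals above, as the quick check below verifies.

# Quick sanity check of the reported per-dialogue averages.
noun_phrases, pronouns, dialogues = 48_798, 25_889, 8_857
assert round(noun_phrases / dialogues, 2) == 5.51
assert round(pronouns / dialogues, 2) == 2.92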
In contrast, existing datasets for phrase
grounding hardly consider pronouns. The
ReferItGame dataset (Kazemzadeh et al., 2014)
only involves nouns and noun phrases, while
the Flickr30k Entities dataset, although it
annotates pronouns in captions, does not label
their corresponding bounding boxes in images.
Moreover, owing to the diversity of our
dataset, the number of coreference chains varies.
As Figure 2 shows, the pie charts display the
distinct distributions of noun phrases and pronouns
in the VD-Ref dataset. For both noun phrases
and pronouns, the dialogues with no more than
three coreference chains account for the major
proportion, up to 70%; accordingly, the dialogues
with more than three coreference chains constitute
the remaining proportion.
Moreover, as the mentions of coreference
chains and bounding boxes come in pairs, we can
classify coreference chains into four types (see the
sketch after this list):
• one mention vs. one box: This type contains
only one mention and one corresponding box,
indicating that the chain excludes pronouns.
• one mention vs. boxes: As the referred object
is separated into several regions, more
than one box is needed for annotation.
• mentions vs. one box: In this type of coreference
chain, all noun phrases and pronouns refer to
the same single box in the image.
• mentions vs. boxes: This type contains several
mentions, including noun phrases and pronouns,
associated with multiple boxes.
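A minimal sketch of this typology: given a chain's mention and box counts, the type follows deterministically. The function below is our own illustration, not part of the dataset tooling.

def chain_type(num_mentions: int, num_boxes: int) -> str:
    # Map a coreference chain's counts to one of the four types above.
    if num_mentions == 1 and num_boxes == 1:
        return "one mention vs. one box"   # single mention, no pronouns
    if num_mentions == 1:
        return "one mention vs. boxes"     # object split across regions
    if num_boxes == 1:
        return "mentions vs. one box"      # all mentions share one box
    return "mentions vs. boxes"            # many mentions, many boxes

assert chain_type(1, 1) == "one mention vs. one box"
assert chain_type(3, 1) == "mentions vs. one box"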
Finally, the train, validation, and test sets contain
6,199 (70.00%), 1,063 (12.00%), and 1,595
(18.00%) image-dialogue pairs, respectively. We
report other statistics in Table 1 as well.
3 Method
Recent works (Kamath et al., 2021; Li et al., 2022)
bring the successful vision-language transformer
architecture and the pretrain-then-finetune paradigm
to the phrase grounding task, achieving state-of-
the-art performance. To explore our constructed
dataset, we adopt the representative MDETR (Kamath
et al., 2021) model. Meanwhile, we propose
to enhance the textual representations with the
natural coreference chains in texts via Relational
Graph Convolutional Networks (R-GCN) (Schlichtkrull
et al., 2018). Below, we briefly describe how
MDETR learns and grounds, and then present our
suggested coreference graph encoding.
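To make the graph encoding concrete, the following is a minimal sketch of a single R-GCN layer over a coreference graph, assuming PyTorch. It is our illustration of the Schlichtkrull et al. (2018) update, not the authors' implementation; the node features stand in for mention representations, and the single "coreferent" relation type is a hypothetical choice.

import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.self_loop = nn.Linear(dim, dim, bias=False)
        # One weight matrix per relation type, as in R-GCN.
        self.rel = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_relations)]
        )

    def forward(self, h: torch.Tensor, edges: list) -> torch.Tensor:
        # h: (num_nodes, dim) mention representations
        # edges: (src, dst, relation_id) triples
        out = self.self_loop(h)
        for r, w in enumerate(self.rel):
            pairs = [(s, d) for s, d, rel in edges if rel == r]
            if not pairs:
                continue
            src = torch.tensor([s for s, _ in pairs])
            dst = torch.tensor([d for _, d in pairs])
            # Mean-aggregate transformed neighbor states per target node.
            agg = torch.zeros_like(h).index_add_(0, dst, w(h[src]))
            deg = torch.zeros(h.size(0)).index_add_(0, dst, torch.ones(len(dst)))
            out = out + agg / deg.clamp(min=1).unsqueeze(1)
        return torch.relu(out)

# Hypothetical usage: four mentions ("a man", "he", "a hat", "it");
# coreferent mentions are linked in both directions with relation 0.
h = torch.randn(4, 256)
edges = [(0, 1, 0), (1, 0, 0), (2, 3, 0), (3, 2, 0)]
h_new = RGCNLayer(dim=256, num_relations=1)(h, edges)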
3.1 Grounding Model
As depicted in Figure 3, for a given image-text pair,
MDETR first uses an image encoder (Tan and Le,