
                                WebNLG   O-EN     O-ZH
Percentage of First Mentions    85%      43%      43%
Percentage of Proper Names      71%      21%      15%
Average Number of Tokens        18.62    106.44   139.55

Table 3: Statistics of WebNLG and OntoNotes. O-EN and O-ZH stand for OntoNotes-EN and OntoNotes-ZH.
chains consist mainly of first/second-person referents, and we do not expect much variation in referential form in these cases. In other words, we only included chains that have at least one overt non-pronominal RE.
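For illustration, this chain-filtering criterion can be expressed as a small predicate; the mention fields below are our own illustrative assumptions, not the actual annotation schema:

    def keep_chain(chain) -> bool:
        """Keep a coreference chain only if it contains at least one overt,
        non-pronominal referring expression (RE)."""
        # 'overt' and 'pronominal' are assumed boolean mention attributes.
        return any(m["overt"] and not m["pronominal"] for m in chain)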
Third, we delexicalised the corpus following Castro Ferreira et al. (2018a). Additionally, since we used the Chinese BERT as one of our RFS models and it only accepts inputs shorter than 512 characters, we removed all samples in OntoNotes-ZH whose total length (calculated by removing all underscores introduced during delexicalisation and summing the lengths of the pre-context, post-context, and target referent) exceeds 512 characters.
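A minimal sketch of this length filter, assuming each sample is a dict holding pre-context, target-referent, and post-context strings (the field names are illustrative, not the authors' actual data schema):

    # Minimal sketch of the length filter (field names are assumptions).
    MAX_LEN = 512

    def effective_length(sample: dict) -> int:
        """Length after stripping delexicalisation underscores from the
        concatenated pre-context, target referent, and post-context."""
        text = sample["pre_context"] + sample["target"] + sample["post_context"]
        return len(text.replace("_", ""))

    def filter_samples(samples: list) -> list:
        """Keep only samples that fit within Chinese BERT's input limit."""
        return [s for s in samples if effective_length(s) <= MAX_LEN]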
Experiments with models other than BERT on the original OntoNotes-ZH show that this does not bias the conclusions of this study (see Appendix A).
Last, we split the whole dataset into a training set and a test set in accordance with the CoNLL 2012 Shared Task (Pradhan et al., 2012). Since ZPs in Chinese are annotated only in the training and development sets, following Chen and Ng (2016), Chen et al. (2018), and Yin et al. (2018), we used the development set as the test set and sampled 10% of the documents from the training set as the development data. Thus, we obtained OntoNotes-EN, whose training, development, and test sets contain 71,667, 8,149, and 7,619 samples, respectively, and OntoNotes-ZH, whose training, development, and test sets contain 70,428, 9,217, and 11,607 samples, respectively.
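The document-level re-split can be sketched as follows; the shuffling procedure and seed are our assumptions for illustration, as they are not specified here:

    import random

    def resplit(train_docs, dev_docs, seed=0):
        """The official CoNLL-2012 development set becomes the test set;
        10% of the training documents become the new development set.
        The seed and shuffling are assumptions for illustration."""
        rng = random.Random(seed)
        docs = list(train_docs)
        rng.shuffle(docs)
        n_dev = int(0.1 * len(docs))
        return docs[n_dev:], docs[:n_dev], list(dev_docs)  # train, dev, test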
OntoNotes vs. WebNLG.
Based on the nature of OntoNotes and the statistics in Table 3, we observe that: (1) the WebNLG data all comes from DBpedia, while the OntoNotes data is multi-genre; (2) OntoNotes has a much smaller proportion of first mentions and proper names; and (3) the documents in OntoNotes are on average much longer than those in WebNLG.
Another difference between WebNLG and OntoNotes lies in the ratio of seen to unseen entities in their test sets. Castro Ferreira et al. (2018b) divided the documents in the WebNLG test set into seen (where all the data come from the same domains as the training data) and unseen (where all the data come from domains different from the training data). Almost all referents in the seen test set appear in the training set (9,580 out of 9,644), while only a few referents in the unseen test set appear in the training set (688 out of 9,644).[4]
In OntoNotes, 38.44% and 41.45% of the referents in the test sets of OntoNotes-EN and OntoNotes-ZH, respectively, also appear in the training sets.
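Such seen-referent ratios amount to a set intersection over referent identifiers; a minimal sketch, assuming referents are represented as strings:

    def seen_referent_ratio(train_referents, test_referents) -> float:
        """Fraction of test-set referents that also occur in the training set."""
        seen = set(train_referents)
        return sum(r in seen for r in test_referents) / len(test_referents)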
That said, OntoNotes largely mitigates the problems of WebNLG discussed in §1. If OntoNotes is a “better” and more “representative” corpus for assessing REG/RFS models, we can expect more “expected” results: models with pre-training outperform those without, and models that learn more useful linguistic information outperform those that learn less. We will detail our expectations in §5.
4 Modelling RFS
We introduce how we represent entities and how
we adapt the RFS models of Chen et al. (2021).
4.1 Entity Representation
Unlike WebNLG, where 99.34% of the referents in the test set appear in the training set, the majority of referents in OntoNotes do not appear in both the training and test sets. This means that RFS models should be able to handle unseen referents, but mapping each entity to a general entity tag with underscores would prevent the models from doing so (Cao and Cheung, 2019; Cunha et al., 2020), because the entity tags of unseen entities are usually out-of-vocabulary (OOV) words. Additionally, when incorporating pre-trained word embeddings and language models, using entity tags prevents entity representations from benefiting from these pre-trained models (again because the entity tags of unseen entities are usually OOV words).
Similar to Cunha et al. (2020), we replaced the underscores in general entity tags (e.g. “Amatriciana_sauce”) with whitespace (henceforth, lexical tags, e.g. “Amatriciana sauce”).
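A minimal sketch of this conversion, together with how a pre-trained tokeniser (here via the Hugging Face transformers API, an illustrative choice) would then see the tag:

    from transformers import AutoTokenizer  # Hugging Face transformers

    def to_lexical_tag(entity_tag: str) -> str:
        """Replace delexicalisation underscores with whitespace,
        e.g. 'Amatriciana_sauce' -> 'Amatriciana sauce'."""
        return entity_tag.replace("_", " ")

    # With a lexical tag the pre-trained tokeniser sees ordinary words, so the
    # entity representation can reuse pre-trained (sub)word vectors instead of
    # treating an underscore-joined tag as a single OOV symbol.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    print(tokenizer.tokenize(to_lexical_tag("Amatriciana_sauce")))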
Arguably, there is a trade-off between using entity tags and using lexical tags. In contrast to lexical tags, the use of entity tags helps models identify mentions of the same entity in discourse, which has been shown to be a crucial feature for RFS. However, using entity tags prevents models from dealing with
[4] Chen et al. (2021) used only seen entities because the size of the underlying triples of the unseen test set differs from both the training set and the seen test set.