
contrast to this trend, we address this limitation and explore the effect of more expressive textual representations on state-of-the-art local methods. To this end, we propose to complement Wikipedia titles with their Wikidata descriptions so that, for instance, the candidates for Ronaldo in "Ronaldo scored two goals for Portugal" would be Cristiano Ronaldo: Portuguese association football player and Ronaldo: Brazilian association football player, rather than the less informative Cristiano Ronaldo and Ronaldo. We test our novel representations on generative and extractive formulations, and evaluate them on standard ED benchmarks, both in and out of domain, reporting statistically significant improvements for the latter group.
2 Method
We now formally introduce ED and the textual
representation strategy we put forward. Then, we
describe the two formulations with which we im-
plement and test our proposal.
ED with Entity Definitions
Given a mention $m$ occurring in a context $c_m$, Entity Disambiguation is formally defined as the task of identifying, out of a set of candidates $e_1, \dots, e_n$, the correct entity $e^*$ that $m$ refers to. In generative and extractive formulations, each candidate $e$ is additionally associated with a textual representation $\hat{e}$, i.e., a string describing its meaning. Whereas previous works have considered the title that $e$ has in Wikipedia as $\hat{e}$, here we focus on more expressive alternatives and leverage Wikidata to achieve this objective. In particular, we first retrieve the Wikidata description of $e$. Then, we define the new representation of $e$ as the colon-separated concatenation of its Wikipedia title and its Wikidata description, e.g., Ronaldo: Brazilian association football player.
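
To make the construction concrete, the following is a minimal sketch in Python; the wikidata_descriptions lookup is a hypothetical precomputed mapping from Wikipedia titles to Wikidata descriptions (e.g., derived from a Wikidata dump), not a component of any specific release.

```python
from typing import Dict, Optional


def entity_representation(
    wikipedia_title: str,
    wikidata_descriptions: Dict[str, str],
) -> str:
    """Return the colon-separated concatenation of a Wikipedia title
    and its Wikidata description, falling back to the title alone
    when no description can be retrieved (a 'failure' in Table 1)."""
    description: Optional[str] = wikidata_descriptions.get(wikipedia_title)
    if description is None:
        return wikipedia_title
    return f"{wikipedia_title}: {description}"


# Example (hypothetical lookup table):
# entity_representation("Ronaldo", {"Ronaldo": "Brazilian association football player"})
# -> "Ronaldo: Brazilian association football player"
```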
Generative Modeling
In our first formulation, we follow De Cao et al. (2021) and frame ED as a text generation problem. Starting from a mention $m$ and its context $c_m$, we first wrap the location of $m$ in $c_m$ between two special symbols, namely <s> and </s>; we denote this modified sequence by $\tilde{c}_m$. Then, we train a sequence-to-sequence model to generate the textual sequence $\hat{e}^*$ of the correct entity $e^*$ by learning the following probability:

$$
p(\hat{e}^* \mid \tilde{c}_m) = \prod_{j=1}^{|\hat{e}^*|} p\left(\hat{e}^*_j \mid \hat{e}^*_{1:j-1}, \tilde{c}_m\right)
$$
Dataset        Instances   Candidates           Failures
AIDA
  Train          18,448    905,916 / 79,561     5038 / 682
  Validation      4791     236,193 / 43,339     1360 / 296
  Test            4485     231,595 / 46,660     1395 / 323
OOD
  MSNBC             656     17,895 / 8336         149 / 72
  AQUAINT           727     23,917 / 16,948       142 / 121
  ACE2004           257     12,292 / 8045          66 / 50
  CWEB           11,154    462,423 / 119,781     3642 / 1265
  WIKI            6821     222,870 / 105,440     1216 / 719

Table 1: Number of instances, candidates, and failures to map a Wikipedia title to its Wikidata definition in the AIDA-CoNLL (top) and out-of-domain (bottom) datasets. For candidates and failures, we report both their total and unique number (total / unique).
where $\hat{e}^*_j$ denotes the $j$-th token of $\hat{e}^*$ and $\hat{e}^*_0$ is a special start symbol. The purpose of <s> and </s> is to signal to the model that $m$ is the token we are interested in disambiguating. As in the reference work, we use BART (Lewis et al., 2020) as the sequence-to-sequence architecture in our experiments and, most importantly, adopt constrained decoding over the candidate set at inference time. Indeed, applying standard decoding methods such as beam search might result in outputs that do not match any of the original candidates; thus, to obtain only valid sequences, at each generation step we constrain the set of tokens that can be generated according to a prefix tree (Cormen et al., 2009) built over the candidate set.
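
As an illustration of this constrained decoding scheme, the sketch below builds a prefix tree over the candidates' token sequences and plugs it into the Hugging Face generate interface via prefix_allowed_tokens_fn; the checkpoint name is a placeholder and a model fine-tuned for this task is assumed, so the snippet shows the mechanism rather than the exact implementation.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Placeholder checkpoint: in practice, a BART model fine-tuned for ED is assumed.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Candidate textual representations (Wikipedia title: Wikidata description).
candidates = [
    "Cristiano Ronaldo: Portuguese association football player",
    "Ronaldo: Brazilian association football player",
]

# Build a prefix tree (trie) over the token ids of every candidate.
trie = {}
for cand in candidates:
    node = trie
    for token_id in tokenizer(cand, add_special_tokens=True)["input_ids"]:
        node = node.setdefault(token_id, {})


def allowed_tokens(batch_id, generated_ids):
    """Follow the tokens generated so far down the trie and return the
    token ids that keep the output a prefix of some candidate."""
    node = trie
    for token_id in generated_ids.tolist():
        if token_id in node:
            node = node[token_id]
        # Tokens outside the trie (e.g., BART's decoder start token) are skipped.
    return list(node.keys()) or [tokenizer.eos_token_id]


# The mention is wrapped between the special symbols described above.
source = "<s> Ronaldo </s> scored two goals for Portugal"
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    num_beams=5,
    prefix_allowed_tokens_fn=allowed_tokens,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```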
Extractive Modeling
In addition, we consider the formulation recently presented by Barba et al. (2022), which frames ED as extractive question answering. Here, $\tilde{c}_m$, defined analogously to the previous paragraph, represents the query, whereas the context is built by concatenating a textual representation of each candidate $e_1, \dots, e_n$. A model is then trained to extract the text span that corresponds to $e^*$. Following the efficiency reasoning of the authors, we use as our underlying model the Longformer (Beltagy et al., 2020), whose linear attention scales better to this type of long-input formulation. Compared to the above generative method, the benefits of this approach lie in i) dropping the need for a potentially slow auto-regressive decoding process and ii) enabling full joint contextualization both between context and candidates and across the candidates themselves.
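
As a rough illustration of this formulation, the sketch below assembles the query and the candidate context and scores spans with a Longformer question-answering head; the checkpoint and the simple argmax span selection are illustrative only, since in practice a fine-tuned model and a restriction of predictions to candidate spans would be required.

```python
import torch
from transformers import LongformerForQuestionAnswering, LongformerTokenizer

# Placeholder checkpoint: a model fine-tuned for extractive ED is assumed.
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-base-4096")

# Query: the context with the mention wrapped between the special markers.
query = "<s> Ronaldo </s> scored two goals for Portugal"

# Context: concatenation of the candidates' textual representations.
candidates = [
    "Cristiano Ronaldo: Portuguese association football player",
    "Ronaldo: Brazilian association football player",
]
context = " ".join(candidates)

inputs = tokenizer(query, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end positions and map them back to a string.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
span_ids = inputs["input_ids"][0, start : end + 1]
print(tokenizer.decode(span_ids))
```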