
Joint Coreference Resolution for Zeros and non-Zeros in Arabic
Abdulrahman Aloraini1,2, Sameer Pradhan3,4, Massimo Poesio1
1School of Electronic Engineering and Computer Science, Queen Mary University of London
2Department of Information Technology, College of Computer, Qassim University
3cemantix.org
4Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA
a.aloraini@qmul.ac.uk spradhan@cemantix.org
upenn.edu m.poesio@qmul.ac.uk
Abstract
Most existing proposals about anaphoric zero pronoun (AZP) resolution regard full mention coreference and AZP resolution as two independent tasks, even though the two tasks are clearly related. The main issues that need tackling to develop a joint model for zero and non-zero mentions are the difference between the two types of arguments (zero pronouns, being null, provide no nominal information) and the lack of annotated datasets of a suitable size in which both types of arguments are annotated for languages other than Chinese and Japanese. In this paper, we introduce two architectures for jointly resolving AZPs and non-AZPs, and evaluate them on Arabic, a language for which, as far as we know, there has been no prior work on joint resolution. Doing this also required creating a new version of the Arabic subset of the standard coreference resolution dataset used for the CoNLL-2012 shared task (Pradhan et al., 2012) in which both zeros and non-zeros are included in a single dataset.
1 Introduction
In pronoun-dropping (pro-drop) languages such as Arabic (Eid, 1983), Chinese (Li and Thompson, 1979), Italian (Di Eugenio, 1990) and other Romance languages (e.g., Portuguese, Spanish), Japanese (Kameyama, 1985), and others (Young-Joo, 2000), arguments in syntactic positions in which a pronoun is used in English can be omitted. Such arguments, sometimes called null arguments, empty arguments, or zeros, and called anaphoric zero pronouns (AZPs) here when they are anaphoric, are illustrated by the following example:
Ironically, Bush did not show any enthusiasm for the inter-
national conference, because Bush since the beginning, (he)
wanted to attend another conference ...
In the example, the '*' (rendered as '(he)' in the English translation above) is an anaphoric zero pronoun, that is, a gap replacing an omitted pronoun that refers to a previously mentioned noun, i.e., Bush.1
Although AZPs are common in pro-drop languages (Chen and Ng, 2016), they are typically not considered in standard coreference resolution architectures. Existing coreference resolution systems for Arabic would cluster the overt mentions of Bush, but not the AZP position; conversely, AZP resolution systems would resolve the AZP to one of the previous mentions, but would not cluster the other mentions. The main reason for this is that AZPs are empty mentions, meaning that it is not possible to encode the features commonly used in pre-neural coreference systems, such as the head and other syntactic and lexical features. As a result, studies such as Iida et al. (2015) have shown that treating the resolution of AZPs and realized mentions separately is beneficial.
However, more recent language models and end-to-end systems have been shown not to suffer from these issues to the same extent. BERT, for example, learns surface, semantic and syntactic features of the whole context (Jawahar et al., 2019), and it has been shown that BERT encodes sufficient information about AZPs within its layers to achieve reasonable performance (Aloraini and Poesio, 2020b,a). Nevertheless, these findings have not yet led to many coreference resolution models that attempt to resolve both types of mentions in a single learning framework: we are aware of only two (Chen et al., 2021; Yang et al., 2022), the second of which was proposed only recently, and neither has been evaluated on Arabic.
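To make the contrast between the three kinds of output concrete, the following is a minimal sketch, not the architecture proposed in this paper. It assumes a toy representation in which a mention is a token span and a zero is a zero-width span; the class `Mention`, the three functions, the token indices, and the "closest preceding mention" heuristic are all hypothetical and chosen only for illustration. The toy data also assumes that every mention refers to the same entity (Bush).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention type: a half-open token span [start, end)."""
    start: int
    end: int
    text: str

    @property
    def is_zero(self) -> bool:
        # A zero pronoun occupies a position but covers no tokens.
        return self.start == self.end


# Toy mentions for the Bush example; the token indices are made up.
bush_1 = Mention(1, 2, "Bush")
bush_2 = Mention(14, 15, "Bush")
azp = Mention(19, 19, "*")  # zero-width gap standing in for the omitted pronoun
mentions = [bush_1, bush_2, azp]


def overt_only_clusters(all_mentions):
    """Output of a coreference system that ignores zeros (single toy entity)."""
    return [[m for m in all_mentions if not m.is_zero]]


def azp_only_links(all_mentions):
    """Output of a stand-alone AZP resolver: each zero is linked to one
    preceding overt mention. The closest one is used here as a placeholder
    heuristic; overt mentions are not clustered with each other."""
    overt = [m for m in all_mentions if not m.is_zero]
    links = {}
    for zero in (m for m in all_mentions if m.is_zero):
        candidates = [m for m in overt if m.end <= zero.start]
        if candidates:
            links[zero] = max(candidates, key=lambda m: m.end)
    return links


def joint_clusters(all_mentions):
    """Output of a joint system: zeros and non-zeros share one cluster."""
    return [sorted(all_mentions, key=lambda m: (m.start, m.end))]
```

In this toy setting, only `joint_clusters` places both overt mentions of Bush and the zero position in the same cluster, which is the output a joint resolver is meant to produce.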
In this paper, we discuss two methods for jointly clustering AZPs and non-AZPs, which we evaluate on Arabic: a pipeline and a joint learning architecture. In order to train and test these two architectures, however, it was also necessary to create a
1 We use here the notation for AZPs used in the Arabic portion of OntoNotes 5.0, in which AZPs are denoted as *; we also use another notation, *pro*.