Joint Coreference Resolution for Zeros and non-Zeros in Arabic
Abdulrahman Aloraini1, 2 Sameer Pradhan3, 4 Massimo Poesio1
1School of Electronic Engineering and Computer Science, Queen Mary University of London
2Department of Information Technology, College of Computer, Qassim University
3cemantix.org
4Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA
a.aloraini@qmul.ac.uk spradhan@cemantix.org
upenn.edu m.poesio@qmul.ac.uk
Abstract
Most existing proposals about anaphoric zero pronoun (AZP) resolution regard full mention coreference and AZP resolution as two independent tasks, even though the two tasks are clearly related. The main issues that need tackling to develop a joint model for zero and non-zero mentions are the difference between the two types of arguments (zero pronouns, being null, provide no nominal information) and the lack of annotated datasets of a suitable size in which both types of arguments are annotated for languages other than Chinese and Japanese. In this paper, we introduce two architectures for jointly resolving AZPs and non-AZPs, and evaluate them on Arabic, a language for which, as far as we know, there has been no prior work on joint resolution. Doing this also required creating a new version of the Arabic subset of the standard coreference resolution dataset used for the CoNLL-2012 shared task (Pradhan et al., 2012), in which both zeros and non-zeros are included in a single dataset.
1 Introduction
In pronoun-dropping (pro-drop) languages such as Arabic (Eid, 1983), Chinese (Li and Thompson, 1979), Italian (Di Eugenio, 1990) and other Romance languages (e.g., Portuguese, Spanish), Japanese (Kameyama, 1985), and others (Young-Joo, 2000), arguments in syntactic positions in which a pronoun would be used in English can be omitted. Such arguments, sometimes called null arguments, empty arguments, or zeros, and referred to here as anaphoric zero pronouns (AZPs) when they are anaphoric, are illustrated by the following example:

Ironically, Bush did not show any enthusiasm for the international conference, because Bush since the beginning, * (he) wanted to attend another conference ...

In the example, the '*' is an anaphoric zero pronoun: a gap replacing an omitted pronoun which refers to a previously mentioned noun, i.e., Bush.1
Although AZPs are common in pro-drop languages (Chen and Ng, 2016), they are typically not considered in standard coreference resolution architectures. Existing coreference resolution systems for Arabic would cluster the overt mentions of Bush, but not the AZP position; vice versa, AZP resolution systems would resolve the AZP to one of the previous mentions, but would not cluster the other mentions. The main reason for this is that AZPs are empty mentions, so it is not possible to encode the features commonly used in coreference systems, such as the head, syntactic, and lexical features of pre-neural systems. As a result, work such as Iida et al. (2015) has shown that treating the resolution of AZPs and realized mentions separately is beneficial.
However, it has been shown that more recent language models and end-to-end systems do not suffer from these issues to the same extent. BERT, for example, learns surface, semantic, and syntactic features of the whole context (Jawahar et al., 2019), and it has been shown that BERT encodes sufficient information about AZPs within its layers to achieve reasonable performance (Aloraini and Poesio, 2020b,a). However, these findings have not yet led to many coreference resolution models attempting to resolve both types of mentions in a single learning framework (in fact, we are only aware of two (Chen et al., 2021; Yang et al., 2022), the second of which was only recently proposed), and neither has been evaluated on Arabic.
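To illustrate why contextual embeddings help here: even though a zero pronoun has no surface token, the gap between two tokens can still be given a vector. As a minimal sketch (a hypothetical scheme, not the exact method of the cited work), assume the contextual hidden states for a sentence have already been computed; a gap at index i, sitting between tokens i-1 and i, can then be represented by concatenating its neighbours' vectors:

```python
def azp_representation(hidden, gap_idx):
    """Represent a zero-pronoun gap at position gap_idx (between
    tokens gap_idx-1 and gap_idx) by concatenating the contextual
    embeddings of the tokens on either side. An all-zero vector
    stands in for a missing neighbour at the sentence boundary."""
    dim = len(hidden[0])
    prev_vec = hidden[gap_idx - 1] if gap_idx > 0 else [0.0] * dim
    next_vec = hidden[gap_idx] if gap_idx < len(hidden) else [0.0] * dim
    return prev_vec + next_vec
```

In practice the `hidden` matrix would come from a pretrained encoder such as BERT; the point of the sketch is only that a null position becomes a fixed-size vector comparable with those of overt mentions.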
In this paper, we discuss two methods for jointly clustering AZPs and non-AZPs, which we evaluate on Arabic: a pipeline and a joint learning architecture. In order to train and test these two architectures, however, it was also necessary to create a
1We use here the notation for AZPs used in the Arabic portion of OntoNotes 5.0, in which AZPs are denoted as *; we also use the alternative notation *pro*.
new version of the Arabic portion of the CoNLL-
2012 shared task corpus in which both zeros and
non-zeros are annotated in the same documents. To
summarize, the contributions of this paper are as
follows:
• We introduce two new architectures for resolving AZPs and non-AZPs together: the pipeline and the joint learning architecture. One of our architectures, the joint learning architecture, outperforms the one existing joint end-to-end model (Chen et al., 2021) when resolving both types of mentions together.
• We create an extended version of the Arabic portion of the CoNLL-2012 shared task in which zero and non-zero mentions are represented in the same document. The extended dataset is suitable for training on AZPs and non-AZPs jointly, or on each type separately.
2 Related Work
Most existing work regards coreference resolution and AZP resolution as two independent tasks. Many studies were dedicated to Arabic coreference resolution using the CoNLL-2012 dataset (Li, 2012; Zhekova and Kübler, 2010; Björkelund and Nugues, 2011; Stamborg et al., 2012; Uryupina et al., 2012; Fernandes et al., 2014; Björkelund and Kuhn, 2014; Aloraini et al., 2020; Min, 2021), but AZPs were excluded from the dataset, so no such work considered them. Aloraini and Poesio (2020b) proposed a BERT-based approach to resolve AZPs to their true antecedents, but they did not resolve other mentions.
There have been a few proposals on solving the
two tasks jointly for other languages. Iida and
Poesio (2011) integrated the AZP resolver with
a coreference resolution system using an integer-
linear-programming model. Kong and Ng (2013)
employed AZPs to improve the coreference resolu-
tion of non-AZPs using a syntactic parser. Shibata
and Kurohashi (2018) proposed an entity-based
joint coreference resolution and predicate argu-
ment structure analysis for Japanese. However,
these works relied on language-specific features
and some assumed the presence of AZPs.
There are two end-to-end neural proposals for learning AZPs and non-AZPs together. The first, by Chen et al. (2021), combines token and AZP-gap representations using an encoder; the two representations interact in a two-stage mechanism to learn their coreference information, as shown in Figure 5. The second, just published, is by Yang et al. (2022), who proposed the CorefDPR architecture. CorefDPR consists of four components: an input representation layer, a coreference resolution layer, a pronoun recovery layer, and a general CRF layer. In our experiments, we only compared our results with the first proposal, because the second system was only evaluated on the Chinese conversational speech portion of OntoNotes2 and the model is not publicly available, which makes it difficult to compare our results with theirs.
3 An Extended Version of the CoNLL
Arabic dataset with AZPs
The goal of the CoNLL-2012 coreference shared
task is to learn coreference resolution for three lan-
guages (English, Chinese and Arabic). However,
AZPs were excluded from the task even though they
are annotated in OntoNotes Arabic and Chinese.
This was because considering AZPs decreased the overall performance on Arabic and Chinese (Pradhan et al., 2012), but not on English, because English is not a pro-drop language (White, 1985). So, in order to study joint coreference resolution for explicit mentions and zero anaphors, we had to create a novel version of the CoNLL-2012 dataset in which AZPs and all related information are included. The CoNLL-2012 annotation consists of 13 layers, which are listed in Appendix A.
Existing proposals evaluated their AZP systems using OntoNotes Normal Form (ONF) files.3 These files are annotated with AZPs and other mentions; however, they are not as well-prepared as CoNLL-2012. To create a CoNLL-like dataset with AZPs, we extract AZPs from the ONF files and add them to the already-existing CoNLL files. The goal of the new dataset is to be suitable for clustering AZPs and non-AZPs, while remaining comparable both with previous proposals that did not consider AZPs and with future work that considers them.
To add AZPs and their information (e.g., part-of-speech tags and parse trees) to CoNLL-2012, we can use ONF. However, while adding AZPs to the clusters, we encountered one difficulty: some
2The TC part of the Chinese portion in OntoNotes.
3The OntoNotes Normal Form (ONF) was originally meant to be a human-readable integrated representation of the multiple layers in OntoNotes. However, it has been used by many as a machine-readable representation (which it more or less is) to extract annotations, primarily the zeros that are typically excluded from the traditional CoNLL tabular representation.
Figure 1: A screenshot of the OntoNotes Normal Form (ONF). Chain 71 is not part of the CoNLL-2012 shared task data because the cluster would become a singleton when the AZP (denoted as *) is removed.
coreference chains only exist in ONF, but not in CoNLL-2012. These are clusters consisting of only one overt mention and one AZP, as in the example illustrated in Figure 1. Chain 71 has two mentions: an AZP (denoted with *) and an overt mention. Since CoNLL-2012 does not consider AZPs in coreference chains, this cluster would be left with only a single mention once the AZP is removed (such clusters, containing only one mention, are known as singletons). Our new dataset includes AZPs; therefore, such clusters should be included. To add them to the existing CoNLL-2012, we have to assign them a new cluster number. We did this by writing a script that automatically extracts AZPs from ONF and adds them to CoNLL-2012 following these steps:
1. Find all clusters in ONF that contain AZPs and extract the AZPs.
2. Each extracted AZP is either:
   (a) clustered with two or more mentions: in this case, CoNLL has already assigned a coreference-chain number, and we assign the AZP to the same number; or
   (b) clustered with only one mention: we create a new cluster that includes the single mention and the AZP.
3. Add the AZP and write the other relevant information, such as part-of-speech tags, syntax, and all the annotation layers.
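The cluster assignment in step 2 can be sketched as follows. This is a simplified sketch, not the script itself: `onf_chains` and `conll_ids` are hypothetical pre-parsed structures, with each ONF chain mapped to its list of mention strings and "*" marking an AZP.

```python
def assign_azp_clusters(onf_chains, conll_ids):
    """Decide which CoNLL cluster number each AZP-bearing ONF chain gets.

    onf_chains: maps ONF chain id -> list of mention strings ("*" = AZP).
    conll_ids:  maps ONF chain id -> existing CoNLL-2012 cluster number,
                for chains that survived AZP removal (>= 2 overt mentions).
    """
    # New clusters get fresh numbers after the largest existing one.
    next_id = max(conll_ids.values(), default=-1) + 1
    assignments = {}
    for chain_id, mentions in onf_chains.items():
        if "*" not in mentions:
            continue  # no AZP in this chain; nothing to add
        if chain_id in conll_ids:
            # Case (a): reuse the coreference-chain number CoNLL assigned.
            assignments[chain_id] = conll_ids[chain_id]
        else:
            # Case (b): the chain became a singleton when AZPs were
            # removed, so create a new cluster number for it.
            assignments[chain_id] = next_id
            next_id += 1
    return assignments
```

For example, a chain like Chain 71 in Figure 1 (one overt mention plus one AZP) has no CoNLL number and receives a fresh one, while a chain with two or more overt mentions keeps its existing number.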
Adding AZPs to CoNLL-2012 makes it possible to learn to resolve them together with other mentions, and could also be useful for future CoNLL shared tasks and other related NLP tasks. After preparing the new CoNLL dataset as discussed, we used it to train the joint coreference model. This new version of Arabic OntoNotes will be made available with the next release of OntoNotes. The distribution of documents, sentences, words, and AZPs of this extended dataset is shown in Table 1.

Category Training Development Test
Documents 359 44 44
Sentences 7,422 950 1,003
Words 264,589 30,942 30,935
AZPs 3,495 474 412

Table 1: The documents, sentences, words and AZPs of the extended version of CoNLL-2012. We follow the same split as in the original CoNLL-2012 for training, development and test.
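The AZP counts in Table 1 can be sanity-checked with a simple pass over the word column of the extended CoNLL files. The sketch below assumes the standard CoNLL-2012 column layout (the word form is the fourth column) and that zeros surface in the word column as * or *pro*, per the notation used here:

```python
def count_azps(conll_lines):
    """Count AZP rows in CoNLL-2012-format lines, assuming zero
    pronouns appear in the word column (index 3) as '*' or '*pro*'."""
    n = 0
    for line in conll_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank separators and #begin/#end markers
        cols = line.split()
        if len(cols) > 3 and cols[3] in ("*", "*pro*"):
            n += 1
    return n
```

Running this over the training, development, and test files should recover the per-split AZP totals reported in Table 1.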
4 The Models
Earlier proposals resolved AZPs based on antecedents in the same sentence as the AZP or up to two sentences away (Chen and Ng, 2015, 2016; Yin et al., 2016, 2017; Liu et al., 2017; Yin et al., 2018; Aloraini and Poesio, 2020b). However, it has been shown that learning mention coreference over the whole document is beneficial for AZP resolution (Chen et al., 2021). Therefore, we apply two novel methods for resolving AZPs using clusters and coreference chains. The pipeline resolves AZPs based on the output clusters from the coreference resolution model, while the joint learning architecture learns how to resolve AZPs from the coreference chains; we show an example of both in Figure 2. In the example, the pipeline resolves AZPs to clusters instead of mentions, and the joint learning architecture finds the coreference chains for mentions, including AZPs. Earlier proposals suffered from two main problems. First, they consider a limited number of candidates (i.e., mentions up to two sentences away