Joint Coreference Resolution for Zeros and non-Zeros in Arabic
Abdulrahman Aloraini1, 2 Sameer Pradhan3, 4 Massimo Poesio1
1School of Electronic Engineering and Computer Science, Queen Mary University of London
2Department of Information Technology, College of Computer, Qassim University
3cemantix.org
4Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA
a.aloraini@qmul.ac.uk spradhan@cemantix.org
upenn.edu m.poesio@qmul.ac.uk
Abstract
Most existing proposals about anaphoric zero pronoun (AZP) resolution regard full mention coreference and AZP resolution as two independent tasks, even though the two tasks are clearly related. The main issues that need tackling to develop a joint model for zero and non-zero mentions are the difference between the two types of arguments (zero pronouns, being null, provide no nominal information) and the lack of annotated datasets of a suitable size in which both types of arguments are annotated for languages other than Chinese and Japanese. In this paper, we introduce two architectures for jointly resolving AZPs and non-AZPs, and evaluate them on Arabic, a language for which, as far as we know, there has been no prior work on joint resolution. Doing this also required creating a new version of the Arabic subset of the standard coreference resolution dataset used for the CoNLL-2012 shared task (Pradhan et al., 2012), in which both zeros and non-zeros are included in a single dataset.
1 Introduction
In pronoun-dropping (pro-drop) languages such as Arabic (Eid, 1983), Chinese (Li and Thompson, 1979), Italian (Di Eugenio, 1990) and other Romance languages (e.g., Portuguese, Spanish), Japanese (Kameyama, 1985), and others (Young-Joo, 2000), arguments in syntactic positions in which a pronoun would be used in English can be omitted. Such arguments, sometimes called null arguments, empty arguments, or zeros, and referred to here as anaphoric zero pronouns (AZPs) when they are anaphoric, are illustrated by the following example:

Ironically, Bush did not show any enthusiasm for the international conference, because Bush since the beginning, * (he) wanted to attend another conference ...

In the example, the '*' is an anaphoric zero pronoun: a gap replacing an omitted pronoun which refers to a previously mentioned noun, i.e., Bush.1
Although AZPs are common in pro-drop languages (Chen and Ng, 2016), they are typically not considered in standard coreference resolution architectures. Existing coreference resolution systems for Arabic would cluster the overt mentions of Bush, but not the AZP position; vice versa, AZP resolution systems would resolve the AZP to one of the previous mentions, but would not cluster the other mentions. The main reason for this is that AZPs are empty mentions, so it is not possible to encode the features commonly used in coreference systems, such as the head, syntactic, and lexical features of pre-neural systems. As a result, work such as Iida et al. (2015) has shown that treating the resolution of AZPs and realized mentions separately is beneficial.
However, it has been shown that more recent language models and end-to-end systems do not suffer from these issues to the same extent. BERT, for example, learns surface, semantic, and syntactic features of the whole context (Jawahar et al., 2019), and it has been shown that BERT encodes sufficient information about AZPs within its layers to achieve reasonable performance (Aloraini and Poesio, 2020b,a). However, these findings have not yet led to many coreference resolution models attempting to resolve both types of mentions in a single learning framework (in fact, we are only aware of two (Chen et al., 2021; Yang et al., 2022), the second of which was only recently proposed), and neither has been evaluated on Arabic.
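To illustrate why contextual embeddings help here: even though a zero pronoun has no surface token, the gap between two tokens can still be given a vector. As a minimal sketch (a hypothetical scheme, not the exact method of the cited work), assume the contextual hidden states for a sentence have already been computed; a gap at index i, sitting between tokens i-1 and i, can then be represented by concatenating its neighbours' vectors:

```python
def azp_representation(hidden, gap_idx):
    """Represent a zero-pronoun gap at position gap_idx (between
    tokens gap_idx-1 and gap_idx) by concatenating the contextual
    embeddings of the tokens on either side. An all-zero vector
    stands in for a missing neighbour at the sentence boundary."""
    dim = len(hidden[0])
    prev_vec = hidden[gap_idx - 1] if gap_idx > 0 else [0.0] * dim
    next_vec = hidden[gap_idx] if gap_idx < len(hidden) else [0.0] * dim
    return prev_vec + next_vec
```

In practice the `hidden` matrix would come from a pretrained encoder such as BERT; the point of the sketch is only that a null position becomes a fixed-size vector comparable with those of overt mentions.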
In this paper, we discuss two methods for jointly clustering AZPs and non-AZPs, which we evaluate on Arabic: a pipeline and a joint learning architecture. In order to train and test these two architectures, however, it was also necessary to create a
1We use here the notation for AZPs used in the Arabic portion of OntoNotes 5.0, in which AZPs are denoted as *; we also use the alternative notation *pro*.
new version of the Arabic portion of the CoNLL-
2012 shared task corpus in which both zeros and
non-zeros are annotated in the same documents. To
summarize, the contributions of this paper are as
follows:
• We introduce two new architectures for resolving AZPs and non-AZPs together: the pipeline and the joint learning architecture. One of our architectures, the joint learning architecture, outperforms the one existing joint end-to-end model (Chen et al., 2021) when resolving both types of mentions together.
• We create an extended version of the Arabic portion of the CoNLL-2012 shared task in which zero and non-zero mentions are represented in the same document. The extended dataset is suitable for training on AZPs and non-AZPs jointly, or on each type separately.
2 Related Work
Most existing work regards coreference resolution and AZP resolution as two independent tasks. Many studies were dedicated to Arabic coreference resolution using the CoNLL-2012 dataset (Li, 2012; Zhekova and Kübler, 2010; Björkelund and Nugues, 2011; Stamborg et al., 2012; Uryupina et al., 2012; Fernandes et al., 2014; Björkelund and Kuhn, 2014; Aloraini et al., 2020; Min, 2021), but AZPs were excluded from the dataset, so no such work considered them. Aloraini and Poesio (2020b) proposed a BERT-based approach to resolve AZPs to their true antecedents, but they did not resolve other mentions.
There have been a few proposals on solving the
two tasks jointly for other languages. Iida and
Poesio (2011) integrated the AZP resolver with
a coreference resolution system using an integer-
linear-programming model. Kong and Ng (2013)
employed AZPs to improve the coreference resolu-
tion of non-AZPs using a syntactic parser. Shibata
and Kurohashi (2018) proposed an entity-based
joint coreference resolution and predicate argu-
ment structure analysis for Japanese. However,
these works relied on language-specific features
and some assumed the presence of AZPs.
There are two end-to-end neural proposals for learning AZPs and non-AZPs together. The first, by Chen et al. (2021), combines token and AZP-gap representations using an encoder; the two representations interact in a two-stage mechanism to learn their coreference information, as shown in Figure 5. The second, just published, is by Yang et al. (2022), who proposed the CorefDPR architecture. CorefDPR consists of four components: an input representation layer, a coreference resolution layer, a pronoun recovery layer, and a general CRF layer. In our experiments, we only compared our results with the first proposal, because the second system was only evaluated on the Chinese conversational speech portion of OntoNotes2 and the model is not publicly available, which makes it difficult to compare our results with theirs.
3 An Extended Version of the CoNLL
Arabic dataset with AZPs
The goal of the CoNLL-2012 coreference shared
task is to learn coreference resolution for three lan-
guages (English, Chinese and Arabic). However,
AZPs were excluded from the task even though they
are annotated in OntoNotes Arabic and Chinese.
This was because considering AZPs decreased the overall performance on Arabic and Chinese (Pradhan et al., 2012), but not on English, because English is not a pro-drop language (White, 1985). So, in order to study joint coreference resolution for explicit mentions and zero anaphors, we had to create a novel version of the CoNLL-2012 dataset in which AZPs and all related information are included. The CoNLL-2012 annotation consists of 13 layers, which are listed in Appendix A.
Existing proposals evaluated their AZP systems using OntoNotes Normal Form (ONF) files.3 These files are annotated with AZPs and other mentions; however, they are not as well-prepared as CoNLL-2012. To create a CoNLL-like dataset with AZPs, we extract AZPs from the ONF files and add them to the already-existing CoNLL files. The goal of the new dataset is to be suitable for clustering AZPs and non-AZPs, while remaining comparable both with previous proposals that did not consider AZPs and with future work that considers them.
To add AZPs and their information (e.g., part-of-speech tags and parse trees) to CoNLL-2012, we can use ONF. However, while adding AZPs to the clusters, we encountered one difficulty: some
2The TC part of the Chinese portion in OntoNotes.
3The OntoNotes Normal Form (ONF) was originally meant to be a human-readable integrated representation of the multiple layers in OntoNotes. However, it has been used by many as a machine-readable representation (which it more or less is) to extract annotations, primarily the zeros that are typically excluded from the traditional CoNLL tabular representation.
Figure 1: A screenshot of the OntoNotes Normal Form (ONF). Chain 71 is not part of the CoNLL-2012 shared task data because the cluster would become a singleton when the AZP (denoted as *) is removed.
coreference chains only exist in ONF, but not in CoNLL-2012. These are clusters consisting of only one overt mention and one AZP, as in the example illustrated in Figure 1. Chain 71 has two mentions: an AZP (denoted with *) and an overt mention. Since CoNLL-2012 does not consider AZPs in coreference chains, this cluster would be left with only a single mention once the AZP is removed (such clusters, containing only one mention, are known as singletons). Our new dataset includes AZPs; therefore, such clusters should be included. To add them to the existing CoNLL-2012, we have to assign them a new cluster number. We did this by writing a script that automatically extracts AZPs from ONF and adds them to CoNLL-2012 following these steps:
1. Find all clusters in ONF that contain AZPs and extract the AZPs.
2. Each extracted AZP is either:
   (a) clustered with two or more mentions: in this case, CoNLL has already assigned a coreference-chain number, and we assign the AZP to the same number; or
   (b) clustered with only one mention: we create a new cluster that includes the single mention and the AZP.
3. Add the AZP and write the other relevant information, such as part-of-speech tags, syntax, and all the annotation layers.
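The cluster assignment in step 2 can be sketched as follows. This is a simplified sketch, not the script itself: `onf_chains` and `conll_ids` are hypothetical pre-parsed structures, with each ONF chain mapped to its list of mention strings and "*" marking an AZP.

```python
def assign_azp_clusters(onf_chains, conll_ids):
    """Decide which CoNLL cluster number each AZP-bearing ONF chain gets.

    onf_chains: maps ONF chain id -> list of mention strings ("*" = AZP).
    conll_ids:  maps ONF chain id -> existing CoNLL-2012 cluster number,
                for chains that survived AZP removal (>= 2 overt mentions).
    """
    # New clusters get fresh numbers after the largest existing one.
    next_id = max(conll_ids.values(), default=-1) + 1
    assignments = {}
    for chain_id, mentions in onf_chains.items():
        if "*" not in mentions:
            continue  # no AZP in this chain; nothing to add
        if chain_id in conll_ids:
            # Case (a): reuse the coreference-chain number CoNLL assigned.
            assignments[chain_id] = conll_ids[chain_id]
        else:
            # Case (b): the chain became a singleton when AZPs were
            # removed, so create a new cluster number for it.
            assignments[chain_id] = next_id
            next_id += 1
    return assignments
```

For example, a chain like Chain 71 in Figure 1 (one overt mention plus one AZP) has no CoNLL number and receives a fresh one, while a chain with two or more overt mentions keeps its existing number.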
Adding AZPs to CoNLL-2012 makes it possible to learn to resolve them together with other mentions, and could also be useful for future CoNLL shared tasks and other related NLP tasks. After preparing the new CoNLL dataset as discussed, we used it to train the joint coreference model. This new version of Arabic OntoNotes will be made available with the next release of OntoNotes. The distribution of documents, sentences, words, and AZPs of this extended dataset is shown in Table 1.

Category Training Development Test
Documents 359 44 44
Sentences 7,422 950 1,003
Words 264,589 30,942 30,935
AZPs 3,495 474 412

Table 1: The documents, sentences, words and AZPs of the extended version of CoNLL-2012. We follow the same split as in the original CoNLL-2012 for training, development and test.
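The AZP counts in Table 1 can be sanity-checked with a simple pass over the word column of the extended CoNLL files. The sketch below assumes the standard CoNLL-2012 column layout (the word form is the fourth column) and that zeros surface in the word column as * or *pro*, per the notation used here:

```python
def count_azps(conll_lines):
    """Count AZP rows in CoNLL-2012-format lines, assuming zero
    pronouns appear in the word column (index 3) as '*' or '*pro*'."""
    n = 0
    for line in conll_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank separators and #begin/#end markers
        cols = line.split()
        if len(cols) > 3 and cols[3] in ("*", "*pro*"):
            n += 1
    return n
```

Running this over the training, development, and test files should recover the per-split AZP totals reported in Table 1.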
4 The Models
Earlier proposals resolved AZPs based on antecedents in the same sentence as the AZP or up to two sentences away (Chen and Ng, 2015, 2016; Yin et al., 2016, 2017; Liu et al., 2017; Yin et al., 2018; Aloraini and Poesio, 2020b). However, it has been shown that learning mention coreference over the whole document is beneficial for AZP resolution (Chen et al., 2021). Therefore, we apply two novel methods for resolving AZPs using clusters and coreference chains. The pipeline resolves AZPs based on the output clusters from the coreference resolution model, while the joint learning architecture learns how to resolve AZPs from the coreference chains; we show an example of both in Figure 2. In the example, the pipeline resolves AZPs to clusters instead of mentions, and the joint learning architecture finds the coreference chains for mentions, including AZPs. Earlier proposals suffered from two main problems. First, they consider a limited number of candidates (i.e., mentions up to two sentences away