
                                WebNLG   O-EN     O-ZH
Percentage of First Mentions    85%      43%      43%
Percentage of Proper Names      71%      21%      15%
Average Number of Tokens        18.62    106.44   139.55

Table 3: Statistics of WebNLG and OntoNotes. O-EN and O-ZH stand for OntoNotes-EN and OntoNotes-ZH.
chains consist mainly of first/second-person referents, and we do not expect much variation in referential form in these cases. In other words, we only included chains that have at least one overt non-pronominal RE.
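For illustration, this chain-filtering criterion can be expressed as a small predicate; the mention fields below are our own illustrative assumptions, not the actual annotation schema:

    def keep_chain(chain) -> bool:
        """Keep a coreference chain only if it contains at least one overt,
        non-pronominal referring expression (RE)."""
        # 'overt' and 'pronominal' are assumed boolean mention attributes.
        return any(m["overt"] and not m["pronominal"] for m in chain)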
Third, we delexicalised the corpus following Castro Ferreira et al. (2018a). Additionally, since we used the Chinese BERT as one of our RFS models and it only accepts inputs shorter than 512 characters, we removed all samples in OntoNotes-ZH whose total length (calculated by removing all underscores introduced during delexicalisation and summing the lengths of the pre-context, post-context, and target referent) exceeds 512 characters.
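A minimal sketch of this length filter, assuming each sample is a dict holding pre-context, target-referent, and post-context strings (the field names are illustrative, not the authors' actual data schema):

    # Minimal sketch of the length filter (field names are assumptions).
    MAX_LEN = 512

    def effective_length(sample: dict) -> int:
        """Length after stripping delexicalisation underscores from the
        concatenated pre-context, target referent, and post-context."""
        text = sample["pre_context"] + sample["target"] + sample["post_context"]
        return len(text.replace("_", ""))

    def filter_samples(samples: list) -> list:
        """Keep only samples that fit within Chinese BERT's input limit."""
        return [s for s in samples if effective_length(s) <= MAX_LEN]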
Experiments with models other than BERT on the original OntoNotes-ZH show that this does not bias the conclusions of this study (see Appendix A).
Last, we split the whole dataset into a training set and a test set in accordance with the CoNLL 2012 Shared Task (Pradhan et al., 2012). Since ZPs in Chinese are annotated only in the training and development sets, following Chen and Ng (2016), Chen et al. (2018), and Yin et al. (2018), we used the development set as the test set and sampled 10% of the documents from the training set as the development data. Thus, we obtained OntoNotes-EN, whose training, development, and test sets contain 71,667, 8,149, and 7,619 samples, respectively, and OntoNotes-ZH, whose training, development, and test sets contain 70,428, 9,217, and 11,607 samples, respectively.
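The document-level re-split can be sketched as follows; the shuffling procedure and seed are our assumptions for illustration, as they are not specified here:

    import random

    def resplit(train_docs, dev_docs, seed=0):
        """The official CoNLL-2012 development set becomes the test set;
        10% of the training documents become the new development set.
        The seed and shuffling are assumptions for illustration."""
        rng = random.Random(seed)
        docs = list(train_docs)
        rng.shuffle(docs)
        n_dev = int(0.1 * len(docs))
        return docs[n_dev:], docs[:n_dev], list(dev_docs)  # train, dev, test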
OntoNotes vs. WebNLG.
Based on the nature of OntoNotes and the statistics in Table 3, we observe that: (1) the WebNLG data all comes from DBpedia, while the OntoNotes data is multi-genre; (2) OntoNotes has a much smaller proportion of first mentions and proper names; and (3) the documents in OntoNotes are on average much longer than those in WebNLG.
Another difference between WebNLG and OntoNotes lies in the ratio of seen to unseen entities in their test sets. Castro Ferreira et al. (2018b) divided the documents in the WebNLG test set into seen (where all the data come from the same domains as the training data) and unseen (where all the data come from domains different from the training data). Almost all referents in the seen test set appear in the training set (9,580 out of 9,644), while only a few referents in the unseen test set appear in the training set (688 out of 9,644).[4]
In OntoNotes, 38.44% and 41.45% of the referents in the test sets of OntoNotes-EN and OntoNotes-ZH, respectively, also appear in the training sets.
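Such seen-referent ratios amount to a set intersection over referent identifiers; a minimal sketch, assuming referents are represented as strings:

    def seen_referent_ratio(train_referents, test_referents) -> float:
        """Fraction of test-set referents that also occur in the training set."""
        seen = set(train_referents)
        return sum(r in seen for r in test_referents) / len(test_referents)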
That said, OntoNotes largely mitigates the problems of WebNLG discussed in §1. If OntoNotes is a “better” and more “representative” corpus for assessing REG/RFS models, we can expect more “expected” results: models with pre-training outperform those without, and models that learn more useful linguistic information outperform those that learn less. We will detail our expectations in §5.
4 Modelling RFS
We introduce how we represent entities and how
we adapt the RFS models of Chen et al. (2021).
4.1 Entity Representation
Unlike WebNLG, where 99.34% of the referents in the test set appear in the training set, the majority of referents in OntoNotes do not appear in both the training and test sets. This means that RFS models should be able to handle unseen referents, but mapping each entity to a general entity tag with underscores would prevent the models from doing so (Cao and Cheung, 2019; Cunha et al., 2020), because the entity tags of unseen entities are usually out-of-vocabulary (OOV) words. Additionally, when incorporating pre-trained word embeddings and language models, using entity tags prevents entity representations from benefiting from these pre-trained models (again because the entity tags of unseen entities are usually OOV words).
Similar to Cunha et al. (2020), we replaced the underscores in general entity tags (e.g. “Amatriciana_sauce”) with whitespace (henceforth, lexical tags, e.g. “Amatriciana sauce”).
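A minimal sketch of this conversion, together with how a pre-trained tokeniser (here via the Hugging Face transformers API, an illustrative choice) would then see the tag:

    from transformers import AutoTokenizer  # Hugging Face transformers

    def to_lexical_tag(entity_tag: str) -> str:
        """Replace delexicalisation underscores with whitespace,
        e.g. 'Amatriciana_sauce' -> 'Amatriciana sauce'."""
        return entity_tag.replace("_", " ")

    # With a lexical tag the pre-trained tokeniser sees ordinary words, so the
    # entity representation can reuse pre-trained (sub)word vectors instead of
    # treating an underscore-joined tag as a single OOV symbol.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    print(tokenizer.tokenize(to_lexical_tag("Amatriciana_sauce")))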
Arguably, there is a trade-off between using entity tags and using lexical tags. In contrast to lexical tags, the use of entity tags helps models identify mentions of the same entity in discourse, which has been shown to be a crucial feature for RFS. However, using entity tags prevents models from dealing with
[4] Chen et al. (2021) used only seen entities because the size of the underlying triples of the unseen test set differs from both the training set and the seen test set.