
Eeny, meeny, miny, moe. How to choose data for morphological inflection
Saliha Muradoğlu α,κ   Mans Hulden χ
α The Australian National University (ANU)   χ University of Colorado
κ ARC Centre of Excellence for the Dynamics of Language (CoEDL)
saliha.muradoglu@anu.edu.au, mans.hulden@colorado.edu
Abstract
Data scarcity is a widespread problem in numerous natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging/glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting the data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments where data is chosen based on whether the model already can or cannot inflect the test forms correctly, as well as strategies based on high/low model confidence and entropy, and random selection. We investigate the robustness of each strategy across 30 typologically diverse languages. We also perform a more in-depth case study of Natügu. Our results show a clear benefit to selecting data based on model confidence and entropy. Unsurprisingly, the oracle experiment, in which only incorrectly handled forms are chosen for further training and which serves as a proxy for linguist/language-consultant feedback, shows the most improvement. This is followed closely by choosing low-confidence and high-entropy predictions. We also show that, despite the conventional wisdom that larger data sets yield better accuracy, introducing more instances of high-confidence or low-entropy forms, or of forms that the model can already inflect correctly, can reduce model performance.
1 Introduction
The need for linguistically annotated data sets is a drive that unites many fields within linguistics. Computational linguists often use labelled data sets for developing NLP systems. Theoretical linguists may utilise corpora for constructing statistical argumentation to support hypotheses about language or phenomena. Documentary linguists create interlinear glossed texts (IGTs) to preserve linguistic and cultural examples, which typically aids in generating a grammatical description. With the renewed focus on low-resource languages and diversity in NLP, and the urgency propelled by language extinction, there is widespread interest in addressing this bottleneck.
One method for reducing annotation costs is active learning (AL). AL is an iterative process to optimise model performance by choosing the most critical examples to label. It has been successfully employed for various applications across NLP tasks, including deep pre-trained models (BERT) (Ein-Dor et al., 2020), semantic role labelling (Myers and Palmer, 2021), named entity recognition (Shen et al., 2017), word sense disambiguation (Zhu and Hovy, 2007), sentiment classification (Dong et al., 2018) and machine translation (Zeng et al., 2019; Zhang et al., 2018). The iterative nature of AL aligns nicely with the language documentation process. It can be tied into the workflow of a field linguist who consults with a language informant or visits a field site in a periodic manner. Prior to a field trip, a linguist typically prepares material/questions (such as elicitations or picture tasks¹) for language consultants, which may focus on elements of the language they are working to describe or on material creation (e.g., pedagogical). We propose AL as a method which can provide a supplementary line of insight into the data collection process, particularly for communities that wish to develop and engage with language technology and/or resource building.
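To make the selection step concrete, the sketch below illustrates one way an entropy-based acquisition function could rank unlabelled (lemma, tag set) pairs so that the most uncertain ones are sent for annotation. This is a minimal illustration, not the implementation used in our experiments; the predict_distributions interface is a hypothetical stand-in for whatever per-step output distributions the inflection model exposes.

    import math

    def sequence_entropy(step_distributions):
        """Mean per-step entropy of a predicted inflected form.
        `step_distributions` is a list of probability distributions over
        the output alphabet, one per decoding step (hypothetical format)."""
        if not step_distributions:
            return 0.0
        total = 0.0
        for dist in step_distributions:
            total -= sum(p * math.log(p) for p in dist if p > 0)
        return total / len(step_distributions)

    def select_for_annotation(pool, model, k):
        """Rank unlabelled (lemma, tags) pairs by predictive uncertainty
        and return the k most uncertain ones for labelling."""
        scored = []
        for lemma, tags in pool:
            dists = model.predict_distributions(lemma, tags)  # hypothetical API
            scored.append((sequence_entropy(dists), lemma, tags))
        scored.sort(key=lambda item: item[0], reverse=True)  # highest entropy first
        return [(lemma, tags) for _, lemma, tags in scored[:k]]

A confidence-based variant would instead score each pair by the probability the model assigns to its own best prediction and select either the lowest- or highest-scoring items, depending on the strategy under study.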
Previous work by Palmer (2009) details the efficiency gains from AL in the context of language documentation for the task of morpheme labelling. With deep learning models leading performance for the task of morphological analysis (Pimentel et al., 2021; Vylomova et al., 2020; McCarthy et al.,
¹ Or indeed any materials such as those compiled by the Max Planck Institute for Psycholinguistics at http://fieldmanuals.mpi.nl/