Eeny, meeny, miny, moe. How to choose data for morphological inflection
Saliha Muradoğluακ    Mans Huldenχ
αThe Australian National University (ANU)    χUniversity of Colorado
κARC Centre of Excellence for the Dynamics of Language (CoEDL)
saliha.muradoglu@anu.edu.au, mans.hulden@colorado.edu
Abstract
Data scarcity is a widespread problem in numerous natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging/glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting the data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments where data is chosen based on whether the model can or cannot already inflect the test forms correctly, strategies based on high/low model confidence and entropy, and random selection. We investigate the robustness of each strategy across 30 typologically diverse languages, and we perform a more in-depth case study of Natügu. Our results show a clear benefit to selecting data based on model confidence and entropy. Unsurprisingly, the oracle experiment, in which only incorrectly handled forms are chosen for further training and which serves as a proxy for linguist/language-consultant feedback, shows the most improvement. It is followed closely by selecting low-confidence and high-entropy predictions. We also show that, despite the conventional wisdom that larger data sets yield better accuracy, introducing more instances of high-confidence or low-entropy forms, or of forms that the model can already inflect correctly, can reduce model performance.
1 Introduction
The need for linguistically annotated data sets unites many fields within linguistics. Computational linguists often use labelled data sets for developing NLP systems. Theoretical linguists may utilise corpora to construct statistical arguments in support of hypotheses about a language or a phenomenon. Documentary linguists create interlinear glossed texts (IGTs) to preserve linguistic and cultural examples, which typically aid in generating a grammatical description. With the renewed focus on low-resource languages and diversity in NLP, and the urgency propelled by language extinction, there is widespread interest in addressing this annotation bottleneck.
One method for reducing annotation costs is active learning (AL). AL is an iterative process that optimises model performance by choosing the most critical examples to label. It has been successfully employed across a variety of NLP tasks, including with deep pre-trained models such as BERT (Ein-Dor et al., 2020), and for semantic role labelling (Myers and Palmer, 2021), named entity recognition (Shen et al., 2017), word sense disambiguation (Zhu and Hovy, 2007), sentiment classification (Dong et al., 2018) and machine translation (Zeng et al., 2019; Zhang et al., 2018). The iterative nature of AL aligns nicely with the language documentation process: it can be tied into the workflow of a field linguist who consults with a language informant or visits a field site periodically. Prior to a field trip, a linguist typically prepares materials/questions (such as elicitations or picture tasks¹) for language consultants, which may focus on elements of the language they are working to describe or on material creation (e.g., pedagogical resources). We propose AL as a method that can provide a supplementary line of insight into the data-collection process, particularly for communities that wish to develop and engage with language technology and/or resource building.
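The loop structure this implies is simple to state. Below is a minimal sketch of a generic pool-based AL cycle under this fieldwork framing; it is an illustration under assumed interfaces, not the authors' released code, and `train`, `score` and `annotate` are hypothetical stand-ins (in the documentation setting, `annotate` is the linguist or language consultant labelling the selected forms between visits).

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (lemma + morphological tags, inflected form)

def active_learning_loop(
    labelled: List[Example],
    pool: List[str],                           # unlabelled lemma+tag inputs
    train: Callable[[List[Example]], object],  # returns a trained model
    score: Callable[[object, str], float],     # acquisition score; higher = queried first
    annotate: Callable[[str], Example],        # oracle/linguist supplies the gold form
    cycles: int = 10,
    batch_size: int = 250,
) -> object:
    """Each cycle: rank the unlabelled pool by an acquisition score,
    have the annotator label the top batch, and retrain on the enlarged set."""
    model = train(labelled)
    for _ in range(cycles):
        # Most informative examples first, according to the chosen strategy.
        pool.sort(key=lambda x: score(model, x), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        labelled.extend(annotate(x) for x in batch)
        model = train(labelled)
    return model
```

Swapping the `score` function is all it takes to move between the sampling strategies compared in this paper.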
Previous work by Palmer (2009) details the efficiency gains from AL in the context of language documentation for the task of morpheme labelling. With deep learning models leading performance for the task of morphological analysis (Pimentel et al., 2021; Vylomova et al., 2020; McCarthy et al., 2019), an investigation of AL in the context of neural methods is needed.

¹Or indeed any materials, such as those compiled by the Max Planck Institute for Psycholinguistics at http://fieldmanuals.mpi.nl/
[Figure 1 (plot omitted): The accuracy for each trained model, starting from the baseline (cycle 1). In each cycle, 250 instances are re-sampled via the seven sampling methods: correct/incorrect, highest/lowest log-likelihood (model confidence), highest/lowest entropy, and random (colour-coded). The reported error bars are calculated across 3 separate runs. See Table 1 in the Appendix for more detail. After cycle 2, the same sampling strategy is applied to each stream of the experiment, e.g. for the lowest log-likelihood strategy, the same strategy is used from cycle 2 through cycle 10.]
This paper addresses the following question: how can we identify the type of data needed to improve model performance? To answer this, we explore the use of AL for the task of morphological inflection using a Transformer model. We run AL simulation experiments with four different sampling strategies: (1) a correctness oracle, (2) model confidence, (3) entropy and (4) random selection (a minimal sketch of these acquisition scores follows below). These strategies are tested across 30 typologically diverse languages and in a 10-cycle iterative experiment using Natügu as a case study.
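To make strategies (1)–(3) concrete, the sketch below shows one plausible way to compute the three acquisition scores from a seq2seq inflection model's per-step output distributions. The representation is an assumption, not an interface the paper specifies: `step_probs` stands for one probability distribution over output symbols per decoding step, and `predicted_ids` for the symbols the model actually chose.

```python
import math
from typing import List, Sequence

def confidence(step_probs: List[Sequence[float]], predicted_ids: List[int]) -> float:
    """Model confidence: sequence log-likelihood of the model's own
    prediction, i.e. the sum of log P(chosen symbol) over decoding steps."""
    return sum(math.log(dist[i]) for dist, i in zip(step_probs, predicted_ids))

def mean_entropy(step_probs: List[Sequence[float]]) -> float:
    """Average per-step Shannon entropy of the output distribution;
    high entropy means the model is unsure which symbol comes next."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0.0) for dist in step_probs]
    return sum(ents) / len(ents)

def oracle_correct(predicted_form: str, gold_form: str) -> bool:
    """Correctness oracle: does the model already inflect this form right?"""
    return predicted_form == gold_form
```

Under this scoring, the best-performing strategies in our experiments correspond to querying the lowest-`confidence` or highest-`mean_entropy` candidates; random selection needs no score at all.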
2 Data
We use data from the UniMorph Project (McCarthy et al., 2020), Interlinear Glossed Texts (IGT) from Moeller et al. (2020) and SIGMORPHON (Vylomova et al., 2020; Pimentel et al., 2021). In addition to data availability, we consider typological diversity when selecting languages to include. Broadly, we attempt to include types of languages that exhibit varying degrees of complexity for inflection. We also consider morphological characteristics coded in WALS: prefixing vs. suffixing (Dryer, 2013), inflectional synthesis of the verb (Bickel and Nichols, 2013b) and exponence (Bickel and Nichols, 2013a). An additional consideration is the paradigm size of the morphological system modelled.

We note the type of each data source to account for the variation in standards across Wikipedia, IGT field data, glossed examples from grammars, and data generated from computational grammars.