
Eeny, meeny, miny, moe. How to choose data for morphological inflection
Saliha Muradoğlu α,κ   Mans Hulden χ
α The Australian National University (ANU)   χ University of Colorado
κ ARC Centre of Excellence for the Dynamics of Language (CoEDL)
saliha.muradoglu@anu.edu.au, mans.hulden@colorado.edu
Abstract
Data scarcity is a widespread problem in numerous natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging/glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting the data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments where data is chosen based on whether the model already can or cannot inflect the test forms correctly, as well as strategies based on high/low model confidence and entropy, and random selection. We investigate the robustness of each strategy across 30 typologically diverse languages. We also perform a more in-depth case study of Natügu. Our results show a clear benefit to selecting data based on model confidence and entropy. Unsurprisingly, the oracle experiment, in which only incorrectly handled forms are chosen for further training and which serves as a proxy for linguist/language-consultant feedback, shows the most improvement. This is followed closely by choosing low-confidence and high-entropy predictions. We also show that, despite the conventional wisdom that larger data sets yield better accuracy, introducing more instances of high-confidence or low-entropy forms, or of forms that the model can already inflect correctly, can reduce model performance.
1 Introduction
The need for linguistically annotated data sets is a drive that unites many fields within linguistics. Computational linguists often use labelled data sets for developing NLP systems. Theoretical linguists may utilise corpora for constructing statistical argumentation to support hypotheses about language or phenomena. Documentary linguists create interlinear glossed texts (IGTs) to preserve linguistic and cultural examples, which typically aids in generating a grammatical description. With the renewed focus on low-resource languages and diversity in NLP, and the urgency propelled by language extinction, there is widespread interest in addressing this bottleneck.
One method for reducing annotation costs is active learning (AL). AL is an iterative process to optimise model performance by choosing the most critical examples to label. It has been successfully employed for various applications across NLP tasks, including deep pre-trained models (BERT) (Ein-Dor et al., 2020), semantic role labelling (Myers and Palmer, 2021), named entity recognition (Shen et al., 2017), word sense disambiguation (Zhu and Hovy, 2007), sentiment classification (Dong et al., 2018) and machine translation (Zeng et al., 2019; Zhang et al., 2018). The iterative nature of AL aligns nicely with the language documentation process. It can be tied into the workflow of a field linguist who consults with a language informant or visits a field site in a periodic manner. Prior to a field trip, a linguist typically prepares material/questions (such as elicitations or picture tasks¹) for language consultants, which may focus on elements of the language they are working to describe or on material creation (e.g., pedagogical). We propose AL as a method which can provide a supplementary line of insight into the data collection process, particularly for communities that wish to develop and engage with language technology and/or resource building.
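To make the selection step concrete, the sketch below illustrates one way an entropy-based acquisition function could rank unlabelled (lemma, tag set) pairs so that the most uncertain ones are sent for annotation. This is a minimal illustration, not the implementation used in our experiments; the predict_distributions interface is a hypothetical stand-in for whatever per-step output distributions the inflection model exposes.

    import math

    def sequence_entropy(step_distributions):
        """Mean per-step entropy of a predicted inflected form.
        `step_distributions` is a list of probability distributions over
        the output alphabet, one per decoding step (hypothetical format)."""
        if not step_distributions:
            return 0.0
        total = 0.0
        for dist in step_distributions:
            total -= sum(p * math.log(p) for p in dist if p > 0)
        return total / len(step_distributions)

    def select_for_annotation(pool, model, k):
        """Rank unlabelled (lemma, tags) pairs by predictive uncertainty
        and return the k most uncertain ones for labelling."""
        scored = []
        for lemma, tags in pool:
            dists = model.predict_distributions(lemma, tags)  # hypothetical API
            scored.append((sequence_entropy(dists), lemma, tags))
        scored.sort(key=lambda item: item[0], reverse=True)  # highest entropy first
        return [(lemma, tags) for _, lemma, tags in scored[:k]]

A confidence-based variant would instead score each pair by the probability the model assigns to its own best prediction and select either the lowest- or highest-scoring items, depending on the strategy under study.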
Previous work by Palmer (2009) details the efficiency gains from AL in the context of language documentation for the task of morpheme labelling. With deep learning models leading performance for the task of morphological analysis (Pimentel et al., 2021; Vylomova et al., 2020; McCarthy et al.,
¹ Or indeed any materials such as those compiled by the Max Planck Institute for Psycholinguistics at http://fieldmanuals.mpi.nl/