Subword Segmental Language Modelling for Nguni Languages
Francois Meyer and Jan Buys
Department of Computer Science
University of Cape Town
MYRFRA008@myuct.ac.za, jbuys@cs.uct.ac.za
Abstract
Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
1 Introduction
Subword segmentation has become a standard practice in Natural Language Processing (NLP). The dominant approach is to run an algorithm like BPE (Sennrich et al., 2016) as a preprocessing step, segmenting the corpus into subwords. This enables the model to learn features based on subwords, compose words, and handle rare and unknown words as an open-vocabulary model. Subword segmentation is an active area of research, since no single technique outperforms others across all tasks, languages, and dataset sizes (Zhu et al., 2019a,b). Besides deterministic segmenters like BPE, stochastic algorithms like unigram LM (ULM) (Kudo, 2018) have also been proposed.

                sesihambe
    Morphemes   se-si-hamb-e
    BPE         sesi-ha-mbe
    Unigram LM  se-si-hambe
    Morfessor   se-s-ihambe
    SSLM        se-si-hamb-e

Table 1: Segmentations of the isiXhosa word "sesihambe" produced by existing subword segmentation algorithms, compared to the actual morphemes and the output of our model (SSLM).
Subword segmentation is particularly important for the Nguni languages of South Africa (isiXhosa, isiZulu, isiNdebele, and Siswati) because they are agglutinative languages that are written conjunctively.¹ These are morphologically rich languages in which words are formed by stringing together morphemes (Taljard and Bosch, 2006). Morphemes are the primary linguistic units. For example, the isiXhosa word "sesihambe" means "we are gone", where "se" means "we", "si" means "are", and "hamb-e" means "gone", with the "-e" suffix indicating past tense. As shown in Table 1, existing segmenters do not reliably capture this.
The Nguni languages are under-resourced, which compounds the importance of subword segmentation. Available datasets are small, so any held-out dataset will contain rare or previously unseen words. Therefore it is critical for models to learn useful subword features and effectively model morphological composition. In a low-resource setting it may then be more effective to learn subword segmentation as part of model training rather than as a distinct preprocessing step.

¹ The Sotho-Tswana languages of South Africa are also agglutinative, but are written disjunctively, i.e. a single linguistic word may be written as several orthographic words.
The probabilistic models underlying existing subword segmentation methods such as ULM and Morfessor (Creutz and Lagus, 2007) assume that subwords are context-independent, making them unsuitable for language modelling. In this paper we propose a subword segmental language model that simultaneously learns how to segment words while training as an autoregressive LM. This allows the model to learn subword segmentations that optimise a left-to-right language modelling objective, so that segmentation decisions are conditioned on context. Our model learns the subwords that it can most effectively leverage for language modelling.
We train our model on the 4 Nguni languages of South Africa. We compile LM datasets for these languages from publicly available corpora and release our train/validation/test sets. On intrinsic language modelling performance averaged across the 4 languages, our model outperforms neural LMs trained with characters, BPE, and ULM subwords. On the task of unsupervised morphological segmentation (which determines to what extent subwords correspond to actual morphemes), our model outperforms standard subword segmenters like BPE and ULM on average across the 4 languages.
In addition to these LMs, we train a second set of subword segmental models that train on single words in isolation (without having to model context for long-range language modelling). Our word-level models outperform all existing methods on unsupervised morphological segmentation (including segmenters like Morfessor) by a large margin across all 4 languages. Finally, we discuss the importance of a subword lexicon to our model, analysing how hyperparameters that control lexicon construction affect performance. In summary, this paper makes the following contributions:²
1. We propose a subword segmental language model (SSLM) that unifies subword segmentation and language modelling in a single end-to-end neural architecture.

2. We compile and release curated LM datasets for 4 Nguni languages.

3. We evaluate our model as an LM and an unsupervised morphological segmenter, and it outperforms existing methods on both tasks.

4. We present an analysis of how lexicon-related hyperparameters affect our model.

² Our code, trained models, and datasets are available at https://github.com/francois-meyer/subword-segmental-lm.
2 Subword Segmentation
In this section we review the paradigm that currently dominates subword segmentation, discuss its limitations, and introduce the family of models we draw inspiration from for our approach to subword segmentation: segmental sequence models.
2.1 Subword Segmentation Algorithms
Recently proposed subword segmentation algorithms start with some initial vocabulary (e.g. all characters) and iteratively amend it based on corpus subword statistics until a pre-specified vocabulary size has been reached. The goal of BPE (Sennrich et al., 2016) is to represent common character sequences as distinct vocabulary items. ULM (Kudo, 2018) aims to maximise the likelihood of the training corpus under a unigram LM, in which subwords are generated independently.
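To make this iterative procedure concrete, the following is a minimal Python sketch of BPE merge learning over a toy word-frequency dictionary. It omits details of the reference implementation (such as the end-of-word marker), and the example words and counts are invented for illustration.

    from collections import Counter

    def learn_bpe_merges(word_freqs, num_merges):
        # Represent each word as a tuple of symbols, initially its characters.
        corpus = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            # Count every adjacent symbol pair, weighted by word frequency.
            pair_counts = Counter()
            for symbols, freq in corpus.items():
                for pair in zip(symbols, symbols[1:]):
                    pair_counts[pair] += freq
            if not pair_counts:
                break
            best = pair_counts.most_common(1)[0][0]
            merges.append(best)
            # Re-segment the corpus with the new merge applied.
            new_corpus = {}
            for symbols, freq in corpus.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                key = tuple(merged)
                new_corpus[key] = new_corpus.get(key, 0) + freq
            corpus = new_corpus
        return merges

    # Toy usage with made-up word counts; frequent character sequences are merged first.
    print(learn_bpe_merges({"sesihambe": 5, "sesifikile": 3, "uhambe": 2}, num_merges=4))

Because the merges depend only on corpus-level pair frequencies, the resulting subwords need not align with morpheme boundaries, which is the limitation discussed next.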
These algorithms work well in certain contexts, but are not universally applicable. Klein and Tsarfaty (2020) show that they are sub-optimal for morphologically rich languages. Zhu et al. (2019b) show that the best method varies across languages and tasks, and existing segmenters require extensive tuning. They also find that subword segmentation is particularly beneficial for low-resource languages, but on average a simple character n-gram method outperforms BPE (Zhu et al., 2019a).
Recently it has become popular to construct shared multilingual vocabularies, but this leads to over-segmented words in low-resource languages (Wang et al., 2021b; Ács, 2019). Some have argued that these problems can be overcome by avoiding segmentation altogether (Clark et al., 2021) or by more sophisticated hyperparameter tuning (Salesky et al., 2020). But the limitations arise partly from the fact that the segmentation algorithms themselves are separated from model training. To overcome this we turn to a different paradigm, where we can cast subword segmentation as something for the model to learn.
2.2 Segmental Sequence Models
The main idea behind segmental sequence modelling is to let the model learn segmentation itself. This involves treating sequence segmentation as a latent variable to be marginalised over. The motivation behind this is that the model would be able to "discover" the optimal segments for sequence prediction. These segments might correspond to natural underlying sequence units, such as words in text or phonemes in speech.

Figure 1: The SSLM computing the probability for the subword segment "ph" in the isiXhosa sentence "Uya phi?". A character-level LSTM encodes the unsegmented text history "Uya ", while a mixture model (equation 6) that interpolates between a separate character-level LSTM decoder and a lexicon model generates the segment "ph". This is repeated for all possible subwords in a sequence to compute the forward scores of equation 3.
Variants of this idea have been used in a few neural sequence models. Kong et al. (2016) propose a bidirectional RNN that learns segment embeddings for handwriting recognition and POS tagging. Wang et al. (2017) propose SWAN (Sleep-WAke Networks), a segmental RNN for text segmentation and speech recognition. Both of these models use dynamic programming to efficiently compute marginal likelihood during training (by summing over all possible segmentations) and to find the most likely segmentation of a sequence.
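As a rough sketch of this kind of dynamic program (not the exact SWAN or SSLM formulation), the following Python function computes the log marginal likelihood of a character sequence by summing over all its segmentations in a single forward pass. The segment-scoring function is a placeholder for what a neural model would supply, and the optional maximum segment length is an assumption for efficiency.

    import math

    def segmental_log_marginal(chars, log_segment_prob, max_seg_len=None):
        # alpha[t] = log-probability of all segmentations of chars[:t].
        T = len(chars)
        alpha = [-math.inf] * (T + 1)
        alpha[0] = 0.0  # the empty prefix has probability 1
        for t in range(1, T + 1):
            start = 0 if max_seg_len is None else max(0, t - max_seg_len)
            # Sum over every possible last segment chars[j:t], in log space.
            scores = [alpha[j] + log_segment_prob(j, t) for j in range(start, t)]
            m = max(scores)
            alpha[t] = m + math.log(sum(math.exp(s - m) for s in scores))
        return alpha[T]  # log of the sum over all segmentations of chars

    # Toy usage: a dummy segment model that assigns every segment probability 0.1.
    print(segmental_log_marginal("hamba", lambda j, t: math.log(0.1)))

The same table of forward scores can be reused with a max operation instead of a sum to recover the single most likely segmentation.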
Sun and Deng (2018) coined the term "segmental language model" (SLM) in applying this approach to Chinese language modelling for unsupervised word segmentation. Kawakami et al. (2019) extended their approach by equipping the model with a lexical memory and introducing segment length regularisation. Segmental models for word discovery have also been proposed as masked LMs (Downey et al., 2021) and bi-directional LMs (Wang et al., 2021a). Inspired by these works, we adapt segmental sequence modelling for the joint task of language modelling and subword discovery.
3 Subword Segmental Language Model
Our SSLM combines autoregressive language modelling and subword segmentation in a single model that can be trained end-to-end. The architecture is shown in Figure 1. It represents a radical divergence from segmenters like BPE and ULM, which view subword segmentation as context-independent. The SSLM views subword segmentation and language modelling jointly, so it can learn subwords that optimise conditional LM generation.
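The mixture model referenced in the Figure 1 caption (equation 6) is not defined in this excerpt. As a hedged sketch, the following Python function shows one way such an interpolation between a lexicon model and a character-level decoder could be computed for a single segment, assuming a scalar interpolation weight strictly between 0 and 1, and assuming out-of-lexicon segments fall back to the character model; both are simplifications of whatever gating the full model uses.

    import math

    def log_mixture_segment_prob(log_p_char, log_p_lex, mix_weight, in_lexicon):
        # log_p_char: log-probability of the segment under the character-level decoder
        # log_p_lex:  log-probability of the segment under the lexicon model
        # mix_weight: interpolation weight g in (0, 1) assigned to the lexicon model
        if not in_lexicon:
            # Segments outside the lexicon can only come from the character model.
            return math.log(1 - mix_weight) + log_p_char
        # p(seg) = g * p_lex(seg) + (1 - g) * p_char(seg), computed in log space.
        a = math.log(mix_weight) + log_p_lex
        b = math.log(1 - mix_weight) + log_p_char
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

Interpolating with a lexicon in this way lets frequent subwords be generated as whole units while rare or novel segments are still assigned probability character by character.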
3.1 Generative Model
The SSLM generates a sequence of space-separated words $\mathbf{w} = w_1, w_2, \ldots, w_n$, corresponding to an underlying sequence of characters $\mathbf{x}$, and generates each word $w_i$ as a sequence of subwords $\mathbf{s}_i = s_{i1}, s_{i2}, \ldots, s_{i|\mathbf{s}_i|}$. The probability of a text sequence $\mathbf{w}$ is computed through the marginal distribution over all possible word segmentations as

$$p(\mathbf{w}) = \sum_{\mathbf{s} : \pi(\mathbf{s}) = \mathbf{w}} p(\mathbf{s}), \qquad (1)$$

where $\pi(\mathbf{s})$ is the unsegmented text underlying the sequence of segmented words $\mathbf{s} = \mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n$.
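For intuition, consider a single three-character word $abc$: the sum in equation (1) then ranges over the $2^{3-1} = 4$ ways of placing boundaries between characters,

$$p(abc) = p(a \cdot b \cdot c) + p(ab \cdot c) + p(a \cdot bc) + p(abc),$$

where each term factorises into per-subword probabilities (equation 2, below). In general a word of $m$ characters has $2^{m-1}$ possible segmentations, which is why the marginal is computed with dynamic programming (Section 3.2) rather than by explicit enumeration.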
Using the chain rule, we define the probability of a sequence of segmented words as

$$p(\mathbf{s}) = \prod_{i=1}^{|\mathbf{w}|} \prod_{j=1}^{|\mathbf{s}_i|} p(s_{ij} \mid \mathbf{s}_{i,<j}), \qquad (2)$$

where $\mathbf{s}_{i,<j}$ is the subword sequence preceding the $j$th subword of the $i$th word (this includes all subwords in the preceding words and the subwords preceding $s_{ij}$ in the current word).
We treat white spaces and punctuation as assumed segments that are equivalent to 1-character words. In this way we implicitly model the end of a word: when the model predicts a non-alphabetical character (e.g. a space), that is treated as a word boundary. Segments cannot cross word boundaries, so the only segmentation learned by the model is how to segment words into subwords.
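To make the within-word segmentation space concrete, here is a small Python sketch (not the paper's implementation) that enumerates every segmentation of a single word, which is exactly the space the marginal in equation (1) sums over once segments are restricted to word boundaries. The optional max_seg_len cap is a hypothetical parameter, not something specified in this section.

    def word_segmentations(word, max_seg_len=None):
        # Enumerate every way of splitting one word into contiguous subwords.
        if not word:
            return [[]]
        results = []
        limit = len(word) if max_seg_len is None else min(len(word), max_seg_len)
        for i in range(1, limit + 1):
            head = word[:i]
            for rest in word_segmentations(word[i:], max_seg_len):
                results.append([head] + rest)
        return results

    # The isiXhosa word from Table 1 has 9 characters and therefore 2**8 = 256
    # segmentations; one of them is the morphological one, ["se", "si", "hamb", "e"].
    print(["-".join(s) for s in word_segmentations("sesi")])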
3.2 Dynamic Programming Algorithm
Conditioning the probabilities of a segment $p(s_{ij} \mid \mathbf{s}_{i,<j})$ on all possible segmentation histories