
segmentation as part of model training rather than as a distinct preprocessing step.

The probabilistic models underlying existing subword segmentation methods such as ULM and Morfessor (Creutz and Lagus, 2007) assume that subwords are context-independent, making them unsuitable for language modelling. In this paper we propose a subword segmental language model that simultaneously learns how to segment words while training as an autoregressive LM. This allows the model to learn subword segmentations that optimise a left-to-right language modelling objective, so that segmentations are conditioned on the context. Our model learns the subwords that it can most effectively leverage for language modelling.
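To make this distinction concrete, the following minimal sketch (with made-up subwords and probabilities, not values from any trained model) scores one candidate segmentation of the isiZulu word ngiyathanda in two ways: as a product of context-independent unigram probabilities, as assumed by ULM and Morfessor, and as a product of probabilities conditioned on the preceding subwords, as in an autoregressive LM.

```python
import math

# Toy subword probabilities (hypothetical values, for illustration only).
unigram_probs = {"ngi": 0.2, "ya": 0.3, "thanda": 0.1}

def unigram_score(segmentation):
    # Context-independent: each subword is scored in isolation,
    # as assumed by unigram-style segmenters (ULM, Morfessor).
    return sum(math.log(unigram_probs[s]) for s in segmentation)

def contextual_score(segmentation, cond_prob):
    # Context-conditioned: each subword is scored given the subwords
    # to its left, as in an autoregressive LM.
    logp, history = 0.0, []
    for s in segmentation:
        logp += math.log(cond_prob(s, tuple(history)))
        history.append(s)
    return logp

def toy_cond_prob(subword, history):
    # A hypothetical stand-in for a neural LM's conditional distribution.
    if history == ("ngi", "ya") and subword == "thanda":
        return 0.6  # "thanda" is far more likely given its left context
    return unigram_probs.get(subword, 1e-6)

seg = ["ngi", "ya", "thanda"]  # one candidate segmentation of "ngiyathanda"
print(unigram_score(seg))                    # log 0.2 + log 0.3 + log 0.1
print(contextual_score(seg, toy_cond_prob))  # log 0.2 + log 0.3 + log 0.6
```

A context-conditioned model can assign higher probability to segmentations whose subwords are predictable from their left context, which is exactly what a left-to-right language modelling objective rewards.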
We train our model on the 4 Nguni languages of South Africa. We compile LM datasets for these languages from publicly available corpora and release our train/validation/test sets. On intrinsic language modelling performance averaged across the 4 languages, our model outperforms neural LMs trained with characters, BPE, and ULM subwords. On the task of unsupervised morphological segmentation (which determines to what extent subwords correspond to actual morphemes), our model outperforms standard subword segmenters like BPE and ULM on average across the 4 languages.

In addition to these LMs, we train a second set of subword segmental models on single words in isolation (without having to model context for long-range language modelling). Our word-level models outperform all existing methods on unsupervised morphological segmentation (including segmenters like Morfessor) by a large margin across all 4 languages. Finally, we discuss the importance of a subword lexicon to our model, analysing how hyperparameters that control lexicon construction affect performance. In summary, this paper makes the following contributions:
1. We propose a subword segmental language model (SSLM) that unifies subword segmentation and language modelling in a single end-to-end neural architecture.

2. We compile and release curated LM datasets for 4 Nguni languages.

3. We evaluate our model as an LM and an unsupervised morphological segmenter, and it outperforms existing methods on both tasks.

4. We present an analysis of how lexicon-related hyperparameters affect our model.

Our code, trained models, and datasets are available at https://github.com/francois-meyer/subword-segmental-lm.
2 Subword Segmentation

In this section we review the paradigm that currently dominates subword segmentation, discuss its limitations, and introduce the family of models from which we draw inspiration for our approach to subword segmentation: segmental sequence models.
2.1 Subword Segmentation Algorithms

Recently proposed subword segmentation algorithms start with some initial vocabulary (e.g. all characters) and iteratively amend it based on corpus subword statistics until a pre-specified vocabulary size has been reached. The goal of BPE (Sennrich et al., 2016) is to represent common character sequences as distinct vocabulary items. ULM (Kudo, 2018) aims to maximise the likelihood of the training corpus under a unigram LM, in which subwords are generated independently.
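As a rough illustration of this iterative recipe, the sketch below builds a vocabulary with simplified BPE-style merges over a toy word list. It is a minimal sketch only: the function name bpe_vocab and the toy corpus are hypothetical, production implementations (e.g. SentencePiece) additionally handle word boundaries, byte fallback, and frequency thresholds, and ULM instead prunes a large seed vocabulary with EM rather than growing one by merges.

```python
from collections import Counter

def bpe_vocab(corpus_words, target_size):
    # Each word is represented as a tuple of its current symbols
    # (initially single characters), with corpus frequency counts.
    words = Counter(tuple(w) for w in corpus_words)
    vocab = {sym for word in words for sym in word}
    while len(vocab) < target_size:
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a new vocabulary item.
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab

# Toy corpus of Nguni-like words (illustrative only).
print(sorted(bpe_vocab(["ngiyathanda", "ngiyabona", "uyathanda"], 15)))
```

The key point for what follows is that the vocabulary is fixed purely from corpus statistics before any downstream model is trained.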
These algorithms work well in certain contexts, but are not universally applicable. Klein and Tsarfaty (2020) show that they are sub-optimal for morphologically rich languages. Zhu et al. (2019b) show that the best method varies across languages and tasks, and that existing segmenters require extensive tuning. They also find that subword segmentation is particularly beneficial for low-resource languages, but that on average a simple character n-gram method outperforms BPE (Zhu et al., 2019a).
Recently it has become popular to construct shared multilingual vocabularies, but this leads to over-segmented words in low-resource languages (Wang et al., 2021b; Ács, 2019). Some have argued that these problems can be overcome by avoiding segmentation altogether (Clark et al., 2021) or by more sophisticated hyperparameter tuning (Salesky et al., 2020). But the limitations arise partly from the fact that the segmentation algorithms themselves are separated from model training. To overcome this we turn to a different paradigm, where we can cast subword segmentation as something for the model to learn.
2.2 Segmental Sequence Models

The main idea behind segmental sequence modelling is to let the model learn segmentation itself. This involves treating sequence segmentation as a