Subword Segmental Language Modelling for Nguni Languages
Francois Meyer and Jan Buys
Department of Computer Science
University of Cape Town
MYRFRA008@myuct.ac.za, jbuys@cs.uct.ac.za
Abstract
Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
1 Introduction
Subword segmentation has become a standard practice in Natural Language Processing (NLP). The dominant approach is to run an algorithm like BPE (Sennrich et al., 2016) as a preprocessing step, segmenting the corpus into subwords. This enables the model to learn features based on subwords, compose words, and handle rare and unknown words as an open-vocabulary model. Subword segmentation is an active area of research, since no single technique outperforms others across all tasks, languages, and dataset sizes (Zhu et al., 2019a,b). Besides deterministic segmenters like BPE, stochastic algorithms like unigram LM (ULM) (Kudo, 2018) have also been proposed.

                sesihambe
    Morphemes   se-si-hamb-e
    BPE         sesi-ha-mbe
    Unigram LM  se-si-hambe
    Morfessor   se-s-ihambe
    SSLM        se-si-hamb-e

Table 1: Segmentations of the isiXhosa word "sesihambe" produced by existing subword segmentation algorithms, compared to the actual morphemes and the output of our model (SSLM).
Subword segmentation is particularly important for the Nguni languages of South Africa (isiXhosa, isiZulu, isiNdebele, and Siswati) because they are agglutinative languages that are written conjunctively.¹ These are morphologically rich languages in which words are formed by stringing together morphemes (Taljard and Bosch, 2006). Morphemes are the primary linguistic units. For example, the isiXhosa word "sesihambe" means "we are gone", where "se" means "we", "si" means "are", and "hamb-e" means "gone", with the "-e" suffix indicating past tense. As shown in Table 1, existing segmenters do not reliably capture this.
The Nguni languages are under-resourced, which compounds the importance of subword segmentation. Available datasets are small, so any held-out dataset will contain rare or previously unseen words. Therefore it is critical for models to learn useful subword features and effectively model morphological composition. In a low-resource setting it may then be more effective to learn subword segmentation as part of model training rather than as a distinct preprocessing step.

¹ The Sotho-Tswana languages of South Africa are also agglutinative, but are written disjunctively, i.e. a single linguistic word may be written as several orthographic words.
The probabilistic models underlying existing subword segmentation methods such as ULM and Morfessor (Creutz and Lagus, 2007) assume that subwords are context-independent, making them unsuitable for language modelling. In this paper we propose a subword segmental language model that simultaneously learns how to segment words while training as an autoregressive LM. This allows the model to learn subword segmentations that optimise a left-to-right language modelling objective, so that segmentation decisions are conditioned on context. Our model learns the subwords that it can most effectively leverage for language modelling.
We train our model on the 4 Nguni languages of South Africa. We compile LM datasets for these languages from publicly available corpora and release our train/validation/test sets. On intrinsic language modelling performance averaged across the 4 languages, our model outperforms neural LMs trained with characters, BPE, and ULM subwords. On the task of unsupervised morphological segmentation (which determines to what extent subwords correspond to actual morphemes), our model outperforms standard subword segmenters like BPE and ULM on average across the 4 languages.
In addition to these LMs, we train a second set of subword segmental models that train on single words in isolation (without having to model context for long-range language modelling). Our word-level models outperform all existing methods on unsupervised morphological segmentation (including segmenters like Morfessor) by a large margin across all 4 languages. Finally, we discuss the importance of a subword lexicon to our model, analysing how hyperparameters that control lexicon construction affect performance. In summary, this paper makes the following contributions:²
1. We propose a subword segmental language model (SSLM) that unifies subword segmentation and language modelling in a single end-to-end neural architecture.

2. We compile and release curated LM datasets for 4 Nguni languages.

3. We evaluate our model as an LM and an unsupervised morphological segmenter, and it outperforms existing methods on both tasks.

4. We present an analysis of how lexicon-related hyperparameters affect our model.

² Our code, trained models, and datasets are available at https://github.com/francois-meyer/subword-segmental-lm.
2 Subword Segmentation
In this section we review the paradigm that currently dominates subword segmentation, discuss its limitations, and introduce the family of models we draw inspiration from for our approach to subword segmentation: segmental sequence models.
2.1 Subword Segmentation Algorithms
Recently proposed subword segmentation algorithms start with some initial vocabulary (e.g. all characters) and iteratively amend it based on corpus subword statistics until a pre-specified vocabulary size has been reached. The goal of BPE (Sennrich et al., 2016) is to represent common character sequences as distinct vocabulary items. ULM (Kudo, 2018) aims to maximise the likelihood of the training corpus under a unigram LM, in which subwords are generated independently.
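To make this iterative procedure concrete, the following is a minimal Python sketch of BPE merge learning over a toy word-frequency dictionary. It omits details of the reference implementation (such as the end-of-word marker), and the example words and counts are invented for illustration.

    from collections import Counter

    def learn_bpe_merges(word_freqs, num_merges):
        # Represent each word as a tuple of symbols, initially its characters.
        corpus = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            # Count every adjacent symbol pair, weighted by word frequency.
            pair_counts = Counter()
            for symbols, freq in corpus.items():
                for pair in zip(symbols, symbols[1:]):
                    pair_counts[pair] += freq
            if not pair_counts:
                break
            best = pair_counts.most_common(1)[0][0]
            merges.append(best)
            # Re-segment the corpus with the new merge applied.
            new_corpus = {}
            for symbols, freq in corpus.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                key = tuple(merged)
                new_corpus[key] = new_corpus.get(key, 0) + freq
            corpus = new_corpus
        return merges

    # Toy usage with made-up word counts; frequent character sequences are merged first.
    print(learn_bpe_merges({"sesihambe": 5, "sesifikile": 3, "uhambe": 2}, num_merges=4))

Because the merges depend only on corpus-level pair frequencies, the resulting subwords need not align with morpheme boundaries, which is the limitation discussed next.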
These algorithms work well in certain contexts, but are not universally applicable. Klein and Tsarfaty (2020) show that they are sub-optimal for morphologically rich languages. Zhu et al. (2019b) show that the best method varies across languages and tasks, and existing segmenters require extensive tuning. They also find that subword segmentation is particularly beneficial for low-resource languages, but on average a simple character n-gram method outperforms BPE (Zhu et al., 2019a).
Recently it has become popular to construct shared multilingual vocabularies, but this leads to over-segmented words in low-resource languages (Wang et al., 2021b; Ács, 2019). Some have argued that these problems can be overcome by avoiding segmentation altogether (Clark et al., 2021) or by more sophisticated hyperparameter tuning (Salesky et al., 2020). But the limitations arise partly from the fact that the segmentation algorithms themselves are separated from model training. To overcome this we turn to a different paradigm, where we can cast subword segmentation as something for the model to learn.
2.2 Segmental Sequence Models
The main idea behind segmental sequence modelling is to let the model learn segmentation itself. This involves treating sequence segmentation as a latent variable to be marginalised over. The motivation behind this is that the model would be able to "discover" the optimal segments for sequence prediction. These segments might correspond to natural underlying sequence units, such as words in text or phonemes in speech.

Figure 1: The SSLM computing the probability for the subword segment "ph" in the isiXhosa sentence "Uya phi?". A character-level LSTM encodes the unsegmented text history "Uya ", while a mixture model (equation 6) that interpolates between a separate character-level LSTM decoder and a lexicon model generates the segment "ph". This is repeated for all possible subwords in a sequence to compute the forward scores of equation 3.
Variants of this idea have been used in a few neural sequence models. Kong et al. (2016) propose a bidirectional RNN that learns segment embeddings for handwriting recognition and POS tagging. Wang et al. (2017) propose SWAN (Sleep-WAke Networks), a segmental RNN for text segmentation and speech recognition. Both of these models use dynamic programming to efficiently compute marginal likelihood during training (by summing over all possible segmentations) and to find the most likely segmentation of a sequence.
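As a rough sketch of this kind of dynamic program (not the exact SWAN or SSLM formulation), the following Python function computes the log marginal likelihood of a character sequence by summing over all its segmentations in a single forward pass. The segment-scoring function is a placeholder for what a neural model would supply, and the optional maximum segment length is an assumption for efficiency.

    import math

    def segmental_log_marginal(chars, log_segment_prob, max_seg_len=None):
        # alpha[t] = log-probability of all segmentations of chars[:t].
        T = len(chars)
        alpha = [-math.inf] * (T + 1)
        alpha[0] = 0.0  # the empty prefix has probability 1
        for t in range(1, T + 1):
            start = 0 if max_seg_len is None else max(0, t - max_seg_len)
            # Sum over every possible last segment chars[j:t], in log space.
            scores = [alpha[j] + log_segment_prob(j, t) for j in range(start, t)]
            m = max(scores)
            alpha[t] = m + math.log(sum(math.exp(s - m) for s in scores))
        return alpha[T]  # log of the sum over all segmentations of chars

    # Toy usage: a dummy segment model that assigns every segment probability 0.1.
    print(segmental_log_marginal("hamba", lambda j, t: math.log(0.1)))

The same table of forward scores can be reused with a max operation instead of a sum to recover the single most likely segmentation.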
Sun and Deng (2018) coined the term "segmental language model" (SLM) in applying this approach to Chinese language modelling for unsupervised word segmentation. Kawakami et al. (2019) extended their approach by equipping the model with a lexical memory and introducing segment length regularisation. Segmental models for word discovery have also been proposed as masked LMs (Downey et al., 2021) and bi-directional LMs (Wang et al., 2021a). Inspired by these works, we adapt segmental sequence modelling for the joint task of language modelling and subword discovery.
3 Subword Segmental Language Model
Our SSLM combines autoregressive language modelling and subword segmentation in a single model that can be trained end-to-end. The architecture is shown in Figure 1. It represents a radical divergence from segmenters like BPE and ULM, which view subword segmentation as context-independent. The SSLM views subword segmentation and language modelling jointly, so it can learn subwords that optimise conditional LM generation.
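The mixture model referenced in the Figure 1 caption (equation 6) is not defined in this excerpt. As a hedged sketch, the following Python function shows one way such an interpolation between a lexicon model and a character-level decoder could be computed for a single segment, assuming a scalar interpolation weight strictly between 0 and 1, and assuming out-of-lexicon segments fall back to the character model; both are simplifications of whatever gating the full model uses.

    import math

    def log_mixture_segment_prob(log_p_char, log_p_lex, mix_weight, in_lexicon):
        # log_p_char: log-probability of the segment under the character-level decoder
        # log_p_lex:  log-probability of the segment under the lexicon model
        # mix_weight: interpolation weight g in (0, 1) assigned to the lexicon model
        if not in_lexicon:
            # Segments outside the lexicon can only come from the character model.
            return math.log(1 - mix_weight) + log_p_char
        # p(seg) = g * p_lex(seg) + (1 - g) * p_char(seg), computed in log space.
        a = math.log(mix_weight) + log_p_lex
        b = math.log(1 - mix_weight) + log_p_char
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

Interpolating with a lexicon in this way lets frequent subwords be generated as whole units while rare or novel segments are still assigned probability character by character.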
3.1 Generative Model
The SSLM generates a sequence of space-separated words $\mathbf{w} = w_1, w_2, \ldots, w_n$, corresponding to an underlying sequence of characters $\mathbf{x}$, and generates each word $w_i$ as a sequence of subwords $\mathbf{s}_i = s_{i1}, s_{i2}, \ldots, s_{i|\mathbf{s}_i|}$. The probability of a text sequence $\mathbf{w}$ is computed through the marginal distribution over all possible word segmentations as

$$p(\mathbf{w}) = \sum_{\mathbf{s} : \pi(\mathbf{s}) = \mathbf{w}} p(\mathbf{s}), \qquad (1)$$

where $\pi(\mathbf{s})$ is the unsegmented text underlying the sequence of segmented words $\mathbf{s} = \mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n$.
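For intuition, consider a single three-character word $abc$: the sum in equation (1) then ranges over the $2^{3-1} = 4$ ways of placing boundaries between characters,

$$p(abc) = p(a \cdot b \cdot c) + p(ab \cdot c) + p(a \cdot bc) + p(abc),$$

where each term factorises into per-subword probabilities (equation 2, below). In general a word of $m$ characters has $2^{m-1}$ possible segmentations, which is why the marginal is computed with dynamic programming (Section 3.2) rather than by explicit enumeration.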
Using the chain rule, we define the probability of a sequence of segmented words as

$$p(\mathbf{s}) = \prod_{i=1}^{|\mathbf{w}|} \prod_{j=1}^{|\mathbf{s}_i|} p(s_{ij} \mid \mathbf{s}_{i,<j}), \qquad (2)$$

where $\mathbf{s}_{i,<j}$ is the subword sequence preceding the $j$th subword of the $i$th word (this includes all subwords in the preceding words and the subwords preceding $s_{ij}$ in the current word).
We treat white spaces and punctuation as assumed segments that are equivalent to 1-character words. In this way we implicitly model the end of a word: when the model predicts a non-alphabetical character (e.g. a space), that is treated as a word boundary. Segments cannot cross word boundaries, so the only segmentation learned by the model is how to segment words into subwords.
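To make the within-word segmentation space concrete, here is a small Python sketch (not the paper's implementation) that enumerates every segmentation of a single word, which is exactly the space the marginal in equation (1) sums over once segments are restricted to word boundaries. The optional max_seg_len cap is a hypothetical parameter, not something specified in this section.

    def word_segmentations(word, max_seg_len=None):
        # Enumerate every way of splitting one word into contiguous subwords.
        if not word:
            return [[]]
        results = []
        limit = len(word) if max_seg_len is None else min(len(word), max_seg_len)
        for i in range(1, limit + 1):
            head = word[:i]
            for rest in word_segmentations(word[i:], max_seg_len):
                results.append([head] + rest)
        return results

    # The isiXhosa word from Table 1 has 9 characters and therefore 2**8 = 256
    # segmentations; one of them is the morphological one, ["se", "si", "hamb", "e"].
    print(["-".join(s) for s in word_segmentations("sesi")])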
3.2 Dynamic Programming Algorithm
Conditioning the probabilities of a segment $p(s_{ij} \mid \mathbf{s}_{i,<j})$ on all possible segmentation histories