
with low frequency, reduce the vocabulary and compress the model size. Besides, syllable-aware LM was addressed by Assylbekov et al. (2017) for English, German, French, Czech, Spanish and Russian, and by Yu et al. (2017) for Korean. However, in both cases, the syllables were composed with convolutional filters into word-level representations for closed-vocabulary generation. In addition, for subword-aware open-vocabulary LM, Blevins and Zettlemoyer (2019) incorporated morphological supervision with a multi-task objective.
For syllable-based MT, there are mostly studies on related language pairs, such as Indic languages (in statistical MT without subword-based baselines: Kunchukuttan and Bhattacharyya (2016)), Tibetan–Chinese (Lai et al., 2018), and Myanmar–Rakhine (Myint Oo et al., 2019). In contrast, Spanish–Shipibo-Konibo is an unrelated language pair. The only distant pair studied was English–Myanmar (ShweSin et al., 2019), but that work did not compare against unsupervised subwords. None of these studies analysed multilingual settings.
3 Open-vocabulary language modelling with a comparable perplexity
Open-vocabulary output
We generate the same type of unit as the input (e.g. characters, syllables or other subwords), which makes the task open-vocabulary LM: there is no prediction of an “unknown” or out-of-vocabulary word-level token (Sutskever et al., 2011). We thereby differ from previous work, and refrain from composing the syllable representations into words to evaluate only word-level perplexity.
Character-level perplexity
For a fair comparison across all granularities, we evaluate all results with character-level perplexity:

$$\mathrm{ppl}_c = \exp\left( L_{\mathrm{LM}}(s) \cdot \frac{|s_{seg}| + 1}{|s_c| + 1} \right) \qquad (1)$$

where $L_{\mathrm{LM}}(s)$ is the cross entropy of a string $s$ computed by the neural LM, and $|s_{seg}|$ and $|s_c|$ refer to the length of $s$ in the chosen segmentation units and in characters, respectively (Mielke, 2019). The extra unit accounts for the end of the sequence.
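As a minimal illustration of Eq. (1) (not from the paper; variable names are ours), the following Python sketch converts a segmentation-level score into character-level perplexity, assuming $L_{\mathrm{LM}}(s)$ is the average cross entropy per segmentation unit, in nats:

import math

def char_level_ppl(avg_xent_nats: float, n_seg_units: int, n_chars: int) -> float:
    """Character-level perplexity from a segmentation-level cross entropy (Eq. 1).
    The "+ 1" terms account for the end-of-sequence unit."""
    total_nats = avg_xent_nats * (n_seg_units + 1)
    return math.exp(total_nats / (n_chars + 1))

# Illustrative numbers only: a syllable-level LM at 2.1 nats per syllable,
# scoring a string of 40 syllables and 120 characters.
print(char_level_ppl(2.1, 40, 120))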
3.1 Experimental setup
Languages and datasets
Corpora are listed in Table 4 in Appendix B. We first choose WikiText-2-raw (enw; Merity et al., 2016), which contains around two million word-level tokens extracted from Wikipedia articles in English. Furthermore, we employ 20 Universal Dependencies (UD; Nivre et al., 2020) treebanks, similarly to Blevins and Zettlemoyer (2019).³ Finally, we include the Shipibo-Konibo (shp) side of the parallel corpora provided by the AmericasNLP shared task on MT (Mager et al., 2021), which is also used in §4.
Syllable segmentation (SYL)
For splitting syllables in the different languages, we used rule-based syllabification tools for English, Spanish, Russian, Finnish, Turkish and Shipibo-Konibo, and dictionary-based hyphenation tools for the remaining languages. We list the tools in Appendix C.
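As an illustrative sketch only (the actual tools are listed in Appendix C), a dictionary-based hyphenator such as Pyphen can produce syllable-like units; the language code and example word below are our assumptions:

import pyphen  # dictionary-based hyphenation library (illustrative choice, not necessarily the tool used)

def hyphen_syllables(word: str, lang: str = "es") -> list:
    """Split a word into hyphenation-based syllable candidates."""
    dic = pyphen.Pyphen(lang=lang)
    return dic.inserted(word).split("-")

# Example: a Spanish word split into syllable-like chunks.
print(hyphen_syllables("sistema"))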
Segmentation baselines
Besides characters (CHAR) and the annotated morphemes in the UD treebanks (MORPH), we consider Polyglot (POLY),⁴ which includes models for unsupervised morpheme segmentation trained with Morfessor (Virpioja et al., 2013). Moreover, we employ an unsupervised subword segmentation baseline of Byte Pair Encoding (BPE; Sennrich et al., 2016)⁵ with vocabulary sizes from 2,500 to 10,000 tokens, in increments of 2,500. We additionally fix the vocabulary size parameter to the syllabary size. Appendix C includes details about the segmentation format.
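A minimal sketch of how such BPE baselines can be trained with the tokenizers library from footnote 5 is given below; the file path, pre-tokenizer and special tokens are our assumptions, not the paper's exact configuration:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def train_bpe(files, vocab_size):
    """Train a BPE segmenter with a fixed vocabulary size on raw text files."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files, trainer)
    return tokenizer

# Sweep the baseline vocabulary sizes: 2,500 to 10,000 in increments of 2,500.
for size in range(2500, 10001, 2500):
    bpe = train_bpe(["train.txt"], vocab_size=size)  # "train.txt" is a placeholder path
    bpe.save(f"bpe-{size}.json")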
Model and training
Following other open-vocabulary LM studies (Mielke and Eisner, 2019; Mielke et al., 2019), we use a low-compute version of an LSTM neural network, the ASGD Weight-Dropped LSTM (AWD-LSTM; Merity et al., 2018). See the hyperparameter details in Appendix E.
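For orientation only, a bare-bones PyTorch sketch of this model family is shown below; the layer sizes are illustrative, and the AWD-LSTM regularisation (weight-dropped recurrent connections, variational dropout, averaged SGD) is omitted — see Appendix E for the actual hyperparameters:

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Bare LSTM LM backbone: embedding -> LSTM -> tied softmax decoder.
    Only a sketch of the model family; not the AWD-LSTM training recipe."""
    def __init__(self, vocab_size: int, emb_dim: int = 400, hidden_dim: int = 400, layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)
        self.decoder.weight = self.embed.weight  # weight tying (needs hidden_dim == emb_dim)

    def forward(self, tokens, state=None):
        emb = self.embed(tokens)            # (batch, seq, emb_dim)
        out, state = self.lstm(emb, state)  # (batch, seq, hidden_dim)
        return self.decoder(out), state     # logits over the subword vocabulary

# e.g. a syllable-level LM over a 3,000-unit syllabary (illustrative size)
model = LSTMLanguageModel(vocab_size=3000)
logits, _ = model(torch.randint(0, 3000, (8, 35)))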
3.2 Results and discussion
Table 1 shows the $\mathrm{ppl}_c$ values for the different levels of segmentation we considered in the study, where we did not tune the neural LM for any specific segmentation. We observe that syllables always result in better perplexities than the other granularities, even for deep-orthography languages such as English or French. The results obtained by the BPE baselines are relatively poor as well,
³ The languages are chosen given the availability of an open-source syllabification or hyphenation tool. We prefer to use the UD treebanks, instead of other well-known datasets for language modelling (e.g. the Multilingual Wikipedia Corpus; Kawakami et al., 2017), because they provide morphological annotations, which are fundamental for this study.
⁴ polyglot-nlp.com
⁵ We use: https://github.com/huggingface/tokenizers