Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation
Arturo Oncevay,χ Kervy Dante Rivas Rojasρ,α
Liz Karen Chavez SanchezχRoberto Zariquieyρ,χ
School of Informatics, University of Edinburgh, Scotland
ρPontificia Universidad Católica del Perú (αIA-PUCP |χChana Field Station), Peru
a.oncevay@ed.ac.uk,rzariquiey@pucp.edu.pe
Abstract
Language modelling and machine translation tasks mostly use subword or character inputs, but syllables are seldom used. Syllables provide shorter sequences than characters, require less-specialised extraction rules than morphemes, and their segmentation is not impacted by the corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy. With a comparable perplexity, we show that syllables outperform characters and other subwords. Moreover, we study the importance of syllables in neural machine translation for a non-related and low-resource language pair (Spanish–Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform unsupervised subwords, as well as morphological segmentation methods, when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we perform a human evaluation, and discuss limitations and opportunities.
1 Introduction
In language modelling (LM), we learn distributions over sequences of words, subwords or characters, and the last two can allow open-vocabulary generation (Sutskever et al., 2011). We rely on subword segmentation as a widespread approach to generate rare subword units (Sennrich et al., 2016). However, the lack of a representative corpus, in terms of the word vocabulary, can constrain unsupervised segmentation (e.g. with scarce monolingual texts; Joshi et al., 2020). As an alternative, we could use character-level modelling, since it also has access to subword information (Kim et al., 2016), but we then face long-term dependency issues and require longer training time to converge. Similar issues extend to other generation tasks, such as machine translation (MT).
In this context, we focus on syllables, which are speech units: "A syl-la-ble con-tains a sin-gle vow-el u-nit". Syllables can be defined as a group of segments that is pronounced as a single articulatory movement. They are fundamental phonological units, since they participate in important word-prosodic patterns, such as stress assignment. In this sense, syllables are more linguistically relevant units than characters, and behave as a mapping function that reduces the length of the sequence using a larger "alphabet", or syllabary. Their extraction can be rule-based and corpus-independent, but data-driven methods or dictionary-based hyphenation can approximate them as well.
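To make this concrete, a toy rule-based syllabifier can be written in a few lines. The Python sketch below assumes a transparent, Spanish-like orthography; its vowel set, onset inventory and splitting heuristic are illustrative simplifications (diphthongs are not handled), not the rules of the tools we use later (Appendix C).

```python
# A toy, corpus-independent syllabifier for a transparent, Spanish-like
# orthography. The vowel set, onset inventory and splitting heuristic
# are illustrative simplifications (diphthongs are not handled); the
# actual tools used in this paper are listed in Appendix C.
VOWELS = set("aeiouáéíóú")
VALID_ONSETS = {"pl", "pr", "bl", "br", "fl", "fr", "cl", "cr",
                "gl", "gr", "tr", "dr", "ch", "ll", "rr"}

def syllabify(word: str) -> list[str]:
    # Each vowel is a nucleus; consonant clusters between two nuclei
    # are split by onset maximisation.
    nuclei = [i for i, c in enumerate(word.lower()) if c in VOWELS]
    if not nuclei:
        return [word]  # vowel-less token: leave unsplit
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[prev + 1:nxt]  # consonants between the nuclei
        if len(cluster) <= 1:
            offset = 0  # V-CV: a single consonant opens the next syllable
        elif cluster[-2:].lower() in VALID_ONSETS:
            offset = len(cluster) - 2  # keep a valid complex onset together
        else:
            offset = len(cluster) - 1  # VC-CV: default split before the last C
        boundaries.append(prev + 1 + offset)
    boundaries.append(len(word))
    return [word[a:b] for a, b in zip(boundaries, boundaries[1:])]

print(syllabify("problema"))  # -> ['pro', 'ble', 'ma'] under these toy rules
```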
We assess whether syllables are useful for encoding and/or decoding a diverse set of languages on two generation tasks. First, for LM, we study 21 languages, to cover different levels of orthographic depth, which is the degree of grapheme-phoneme correspondence (Borgwaldt et al., 2005) and a factor that can increase the complexity of syllabification (Marjou, 2021).[1] For MT, in turn, we focus on the distant and low-resource language pair Spanish–Shipibo-Konibo. We choose Shipibo-Konibo[2] because it is an endangered language with scarce textual corpora, which limits unsupervised segmentation methods, and because it has a transparent orthography, which could be beneficial to syllabification. Also, we consider multilingual MT systems, as they outperformed pairwise systems for the chosen language pair (Mager et al., 2021).

[1] E.g., English has a deep orthography (weak correspondence), whereas Finnish is transparent (Ziegler et al., 2010).
[2] See Appendix A for more details about the language.
2 Related work
The closest LM study to ours is from Mikolov et al. (2012) for subword-grained prediction in English, where they used syllables as a proxy to split words with low frequency, reduce the vocabulary and compress the model size. Syllable-aware LM was also addressed by Assylbekov et al. (2017) for English, German, French, Czech, Spanish and Russian, and by Yu et al. (2017) for Korean; however, in both cases, the syllables were composed with convolutional filters into word-level representations for closed-vocabulary generation. In addition, for subword-aware open-vocabulary LM, Blevins and Zettlemoyer (2019) incorporated morphological supervision with a multi-task objective.
For syllable-based MT, there are mostly studies for related language pairs, such as Indic languages (in statistical MT, without subword-based baselines: Kunchukuttan and Bhattacharyya, 2016), Tibetan–Chinese (Lai et al., 2018), and Myanmar–Rakhine (Myint Oo et al., 2019). Spanish–Shipibo-Konibo, by contrast, is a non-related language pair. The only distant pair studied was English–Myanmar (ShweSin et al., 2019), but they did not compare against unsupervised subwords. None of these studies analysed multilingual settings.
3 Open-vocabulary language modelling with a comparable perplexity
Open-vocabulary output. We generate the same unit as the input (e.g. characters, syllables or other subwords) in an open-vocabulary LM task, where there is no prediction of an "unknown" or out-of-vocabulary word-level token (Sutskever et al., 2011). We thereby differ from previous works, and refrain from composing the syllable representations into words, as that would restrict evaluation to word-level perplexity.
Character-level perplexity. For a fair comparison across all granularities, we evaluate all results with character-level perplexity:

ppl_c = \exp\left( \mathcal{L}_{LM}(s) \cdot \frac{|s_{seg}| + 1}{|s_c| + 1} \right)    (1)

where \mathcal{L}_{LM}(s) is the cross-entropy of a string s computed by the neural LM, and |s_{seg}| and |s_c| refer to the length of s in the chosen segmentation units and in characters, respectively (Mielke, 2019). The extra unit accounts for the end of the sequence.
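For concreteness, Eq. (1) amounts to the following computation. The function and variable names below are illustrative (not from a released codebase), and we assume the model reports its average cross-entropy per segmentation token in nats.

```python
import math

# Minimal sketch of Eq. (1): converting a model's per-token cross-entropy
# into character-level perplexity so that different segmentations are
# comparable. All names here are illustrative assumptions.
def char_level_perplexity(ce_per_token: float,
                          num_segments: int,
                          num_chars: int) -> float:
    # The total NLL of a string is segmentation-independent, so scaling
    # the per-token cross-entropy by the token count (+1 for the
    # end-of-sequence unit) and renormalising by the character count
    # makes models trained on different units comparable.
    total_nll = ce_per_token * (num_segments + 1)
    return math.exp(total_nll / (num_chars + 1))

# e.g. a syllable-level LM at 2.1 nats/token on a 12-syllable,
# 30-character string:
print(char_level_perplexity(2.1, 12, 30))  # = exp(2.1 * 13 / 31)
```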
3.1 Experimental setup
Languages and datasets. The corpora are listed in Table 4 in Appendix B. We first choose WikiText-2-raw (enw; Merity et al., 2016), which contains around two million word-level tokens extracted from Wikipedia articles in English. Furthermore, we employ 20 Universal Dependencies (UD; Nivre et al., 2020) treebanks, similarly to Blevins and Zettlemoyer (2019).[3] Finally, we include the Shipibo-Konibo (shp) side of the parallel corpora provided by the AmericasNLP shared task on MT (Mager et al., 2021), which is also used in §4.
Syllable segmentation (SYL). For splitting syllables in different languages, we used rule-based syllabification tools for English, Spanish, Russian, Finnish, Turkish and Shipibo-Konibo, and dictionary-based hyphenation tools for all remaining languages. We list the tools in Appendix C.
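As an illustration of the hyphenation proxy, the sketch below uses pyphen, a freely available dictionary-based hyphenation library, as a stand-in for the tools actually listed in Appendix C; the language code and example words are our own choices.

```python
import pyphen  # dictionary-based hyphenation library (pip install pyphen)

# Illustrative only: pyphen is one freely available hyphenator, a
# stand-in for the tools actually used (Appendix C).
dic = pyphen.Pyphen(lang="es")  # Spanish hyphenation patterns

for word in ["sonido", "articulatorio"]:
    # inserted() marks hyphenation points, which approximate syllable
    # boundaries in languages with a transparent orthography.
    print(word, "->", dic.inserted(word).split("-"))
```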
Segmentation baselines. Besides characters (CHAR) and the annotated morphemes in the UD treebanks (MORPH), we consider Polyglot (POLY),[4] which includes models for unsupervised morpheme segmentation trained with Morfessor (Virpioja et al., 2013). Moreover, we employ an unsupervised subword segmentation baseline of Byte Pair Encoding (BPE; Sennrich et al., 2016)[5] with vocabulary sizes from 2,500 to 10,000 tokens, in steps of 2,500. We also run BPE with the vocabulary size fixed to the syllabary size. Appendix C includes details about the segmentation format.
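Such a BPE baseline can be trained in a few lines with the tokenizers library from footnote [5]; in the sketch below, the corpus path and the [UNK] special token are illustrative assumptions, and one tokenizer would be trained per vocabulary size in the grid.

```python
# Sketch of one BPE baseline with the HuggingFace tokenizers library
# (footnote [5]). The corpus path "train.txt" and the [UNK] token are
# illustrative; one tokenizer is trained per vocabulary size in
# {2500, 5000, 7500, 10000}, plus one fixed to the syllabary size.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace before learning merges

trainer = BpeTrainer(vocab_size=2500, special_tokens=["[UNK]"])
tokenizer.train(files=["train.txt"], trainer=trainer)

print(tokenizer.encode("open-vocabulary language modelling").tokens)
```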
Model and training. Following other open-vocabulary LM studies (Mielke and Eisner, 2019; Mielke et al., 2019), we use a low-compute LSTM variant, the ASGD Weight-Dropped LSTM (AWD-LSTM; Merity et al., 2018). See the hyperparameter details in Appendix E.
3.2 Results and discussion
Table 1 shows the ppl_c values for the different levels of segmentation considered in this study, where we did not tune the neural LM for any specific segmentation. We observe that syllables always result in better perplexities than the other granularities, even for deep-orthography languages such as English or French. The results obtained by the BPE baselines are relatively poor as well,
[3] The languages are chosen given the availability of an open-source syllabification or hyphenation tool. We prefer to use the UD treebanks, instead of other well-known datasets for language modelling (e.g. the Multilingual Wikipedia Corpus; Kawakami et al., 2017), because they provide morphological annotations, which are fundamental for this study.
[4] polyglot-nlp.com
[5] We use https://github.com/huggingface/tokenizers