
with low frequency, reduce the vocabulary and compress the model size. Besides, syllable-aware LM was addressed by Assylbekov et al. (2017) for English, German, French, Czech, Spanish and Russian, and by Yu et al. (2017) for Korean. However, in both cases, the syllables were composed with convolutional filters into word-level representations for closed-vocabulary generation. In addition, for subword-aware open-vocabulary LM, Blevins and Zettlemoyer (2019) incorporated morphological supervision with a multi-task objective.
For syllable-based MT, there are mostly studies on related language pairs, such as Indic languages (in statistical MT without subword-based baselines: Kunchukuttan and Bhattacharyya (2016)), Tibetan–Chinese (Lai et al., 2018), and Myanmar–Rakhine (Myint Oo et al., 2019). In contrast, Spanish–Shipibo-Konibo is an unrelated language pair. The only distant pair studied was English–Myanmar (ShweSin et al., 2019), but that work did not compare against unsupervised subwords. None of these studies analysed multilingual settings.
3 Open-vocabulary language modelling with a comparable perplexity
Open-vocabulary output
We generate the same type of unit as the input (e.g. characters, syllables or other subwords), which makes the task open-vocabulary LM: there is no prediction of an “unknown” or out-of-vocabulary word-level token (Sutskever et al., 2011). We thereby differ from previous work, and refrain from composing the syllable representations into words to evaluate only word-level perplexity.
Character-level perplexity
For a fair comparison across all granularities, we evaluate all results with character-level perplexity:

$$\mathrm{ppl}_c = \exp\left( L_{\mathrm{LM}}(s) \cdot \frac{|s_{seg}| + 1}{|s_c| + 1} \right) \qquad (1)$$

where $L_{\mathrm{LM}}(s)$ is the cross entropy of a string $s$ computed by the neural LM, and $|s_{seg}|$ and $|s_c|$ refer to the length of $s$ in the chosen segmentation units and in characters, respectively (Mielke, 2019). The extra unit accounts for the end of the sequence.
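As a minimal illustration of Eq. (1) (not from the paper; variable names are ours), the following Python sketch converts a segmentation-level score into character-level perplexity, assuming $L_{\mathrm{LM}}(s)$ is the average cross entropy per segmentation unit, in nats:

import math

def char_level_ppl(avg_xent_nats: float, n_seg_units: int, n_chars: int) -> float:
    """Character-level perplexity from a segmentation-level cross entropy (Eq. 1).
    The "+ 1" terms account for the end-of-sequence unit."""
    total_nats = avg_xent_nats * (n_seg_units + 1)
    return math.exp(total_nats / (n_chars + 1))

# Illustrative numbers only: a syllable-level LM at 2.1 nats per syllable,
# scoring a string of 40 syllables and 120 characters.
print(char_level_ppl(2.1, 40, 120))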
3.1 Experimental setup
Languages and datasets
Corpora are listed in Table 4 in Appendix B. We first choose WikiText-2-raw (enw; Merity et al., 2016), which contains around two million word-level tokens extracted from Wikipedia articles in English. Furthermore, we employ 20 Universal Dependencies (UD; Nivre et al., 2020) treebanks, similarly to Blevins and Zettlemoyer (2019).³ Finally, we include the Shipibo-Konibo (shp) side of the parallel corpora provided by the AmericasNLP shared task on MT (Mager et al., 2021), which is also used in §4.
Syllable segmentation (SYL)
For splitting syllables in the different languages, we used rule-based syllabification tools for English, Spanish, Russian, Finnish, Turkish and Shipibo-Konibo, and dictionary-based hyphenation tools for the remaining languages. We list the tools in Appendix C.
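As an illustrative sketch only (the actual tools are listed in Appendix C), a dictionary-based hyphenator such as Pyphen can produce syllable-like units; the language code and example word below are our assumptions:

import pyphen  # dictionary-based hyphenation library (illustrative choice, not necessarily the tool used)

def hyphen_syllables(word: str, lang: str = "es") -> list:
    """Split a word into hyphenation-based syllable candidates."""
    dic = pyphen.Pyphen(lang=lang)
    return dic.inserted(word).split("-")

# Example: a Spanish word split into syllable-like chunks.
print(hyphen_syllables("sistema"))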
Segmentation baselines
Besides characters (CHAR) and the annotated morphemes in the UD treebanks (MORPH), we consider Polyglot (POLY),⁴ which includes models for unsupervised morpheme segmentation trained with Morfessor (Virpioja et al., 2013). Moreover, we employ an unsupervised subword segmentation baseline of Byte Pair Encoding (BPE; Sennrich et al., 2016)⁵ with vocabulary sizes from 2,500 to 10,000 tokens, in increments of 2,500. We additionally fix the vocabulary size parameter to the syllabary size. Appendix C includes details about the segmentation format.
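A minimal sketch of how such BPE baselines can be trained with the tokenizers library from footnote 5 is given below; the file path, pre-tokenizer and special tokens are our assumptions, not the paper's exact configuration:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def train_bpe(files, vocab_size):
    """Train a BPE segmenter with a fixed vocabulary size on raw text files."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files, trainer)
    return tokenizer

# Sweep the baseline vocabulary sizes: 2,500 to 10,000 in increments of 2,500.
for size in range(2500, 10001, 2500):
    bpe = train_bpe(["train.txt"], vocab_size=size)  # "train.txt" is a placeholder path
    bpe.save(f"bpe-{size}.json")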
Model and training
Following other open-vocabulary LM studies (Mielke and Eisner, 2019; Mielke et al., 2019), we use a low-compute version of an LSTM neural network, the ASGD Weight-Dropped LSTM (AWD-LSTM; Merity et al., 2018). See the hyperparameter details in Appendix E.
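For orientation only, a bare-bones PyTorch sketch of this model family is shown below; the layer sizes are illustrative, and the AWD-LSTM regularisation (weight-dropped recurrent connections, variational dropout, averaged SGD) is omitted — see Appendix E for the actual hyperparameters:

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Bare LSTM LM backbone: embedding -> LSTM -> tied softmax decoder.
    Only a sketch of the model family; not the AWD-LSTM training recipe."""
    def __init__(self, vocab_size: int, emb_dim: int = 400, hidden_dim: int = 400, layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)
        self.decoder.weight = self.embed.weight  # weight tying (needs hidden_dim == emb_dim)

    def forward(self, tokens, state=None):
        emb = self.embed(tokens)            # (batch, seq, emb_dim)
        out, state = self.lstm(emb, state)  # (batch, seq, hidden_dim)
        return self.decoder(out), state     # logits over the subword vocabulary

# e.g. a syllable-level LM over a 3,000-unit syllabary (illustrative size)
model = LSTMLanguageModel(vocab_size=3000)
logits, _ = model(torch.randint(0, 3000, (8, 35)))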
3.2 Results and discussion
Table 1 shows the $\mathrm{ppl}_c$ values for the different levels of segmentation we considered in the study, where we did not tune the neural LM for any specific segmentation. We observe that syllables always result in better perplexities than the other granularities, even for deep-orthography languages such as English or French. The results obtained by the BPE baselines are relatively poor as well,
³ The languages are chosen given the availability of an open-source syllabification or hyphenation tool. We prefer to use the UD treebanks, instead of other well-known datasets for language modelling (e.g. the Multilingual Wikipedia Corpus; Kawakami et al., 2017), because they provide morphological annotations, which are fundamental for this study.
⁴ polyglot-nlp.com
⁵ We use: https://github.com/huggingface/tokenizers