University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages
Khalid N. Elmadani Francois Meyer Jan Buys
Department of Computer Science
University of Cape Town
{ahmkha009,myrfra008}@myuct.ac.za, jbuys@cs.uct.ac.za
Abstract
The paper describes the University of Cape Town’s submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African languages, as well as between specific pairs of the African languages. We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training. Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.[1]

[1] Our model is available at https://github.com/Khalid-Nabigh/UCT-s-WMT22-shared-task.
1 Introduction
Southern African languages are underrepresented in NLP research, in part because most of them are low-resource languages: it is not always possible to find high-quality datasets that are large enough to train effective deep learning models (Kreutzer et al., 2021). The WMT22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages (Adelani et al., 2022) presented an opportunity to apply one of the most promising recent developments in NLP, multilingual neural machine translation, to Southern African languages. For many languages, the parallel corpora released for the shared task are the largest publicly available datasets yet. For some translation directions (e.g. between Southern African languages), no parallel corpora were previously available.
In this paper we present our submission to the shared task. Our system is a Transformer-based encoder-decoder (Vaswani et al., 2017) that translates between English and 8 South / South East African languages (Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa, Xitsonga, Zulu) and in 8 additional directions (Xhosa to Zulu, Zulu to Shona, Shona to Afrikaans, Afrikaans to Swati, Swati to Tswana, Tswana to Xitsonga, Xitsonga to Northern Sotho, Northern Sotho to Xhosa). We trained a single model with shared encoder and decoder parameters and a shared subword vocabulary.
We applied several methods aimed at improving translation performance in a low-resource setting. We experimented with BPE (Sennrich et al., 2016b) and overlap BPE (Patil et al., 2022), the latter of which increases the representation of low-resource language tokens in the shared subword vocabulary. We used initial multilingual and bilingual models to generate back-translated sentences (Sennrich et al., 2016a) for subsequent training.
First, we trained a model to translate between English and the 8 Southern African languages. Then we added the 8 additional translation directions and continued training. For some of these additional directions no parallel corpora were available, so we generated synthetic training data with our existing model. By downsampling some of the parallel corpora to ensure a balanced dataset, we were able to train our model effectively in the new directions, while retaining performance in the old directions.
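As a minimal sketch of such per-direction downsampling (the dictionary layout, cap value, and helper name below are illustrative assumptions, not our actual pipeline):

    import random

    def balance_corpora(corpora, cap, seed=0):
        """Downsample each parallel corpus to at most `cap` sentence pairs.

        corpora: dict mapping a direction (e.g. "eng-ssw") to a list of
        (source, target) sentence pairs. `cap` controls how strongly the
        larger corpora are cut back, so no direction dominates training.
        """
        rng = random.Random(seed)
        balanced = {}
        for direction, pairs in corpora.items():
            if len(pairs) > cap:
                pairs = rng.sample(pairs, cap)  # uniform random subsample
            balanced[direction] = pairs
        return balanced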
We describe the development of our model and report translation performance at each training stage. Our final results compare favourably to existing work with overlapping translation directions. While there is considerable disparity in performance across languages, our model nonetheless achieves results that indicate some degree of effective MT across all directions (most BLEU scores are above 10 and most chrF++ scores are above 40). We also discuss our findings regarding techniques for low-resource MT. We found overlap BPE and back-translation to improve performance for most translation directions. Furthermore, our results confirm the value of multilingual modelling, which proves critical for the lowest-resource languages.
2 Background

2.1 Multilingual Neural Machine Translation (MNMT)
Multilingual models help low-resource languages (LRLs) by leveraging the massive amount of training data available in high-resource languages (HRLs) (Aharoni et al., 2019; Zhang et al., 2020). In the context of neural machine translation, a multilingual model can translate between more than two languages. Current research in MNMT can be divided into two main areas: training language-specific parameters (Kim et al., 2019; Philip et al., 2020) and training a single massive model that shares all parameters among all languages (Fan et al., 2020; NLLB Team et al., 2022). Our work lies in the second category: we build a single multilingual translation system and explore back-translation and different vocabulary generation approaches.
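A single shared-parameter model needs some way to select the output language. A common recipe, shown here purely as an illustrative assumption rather than a description of our exact input format, is to prepend a target-language token to each source sentence:

    def add_target_tag(source_sentence, target_lang):
        """Prepend a target-language token so one shared encoder-decoder
        can serve many directions. The `__lang__` token format is an
        assumption made for this illustration."""
        return f"__{target_lang}__ {source_sentence}"

    # e.g. add_target_tag("good morning", "zul") -> "__zul__ good morning"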
2.2 Back-Translation
Given parallel sentences in two languages $A$ and $B$, denoted $(A_b, B_a)$, the goal is to train a model that translates sentences from $A$ to $B$ ($A{\rightarrow}B$). Back-translation works as follows: first, one trains a reverse ($B{\rightarrow}A$) model using the available $(A_b, B_a)$ data. Then the $B_a$ sentences are passed to this model to regenerate $A_b$. The model's output ($A'_b$) is then treated as additional synthetic parallel data $(A'_b, B_a)$. The final step of back-translation is training an ($A{\rightarrow}B$) translation model using $(A'_b, B_a)$ as parallel data. The motivation behind back-translation is that the noise added to the $A'_b$ sentences by regeneration increases the model's robustness (Edunov et al., 2018). The same approach can be extended to multilingual models (Liao et al., 2021).
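This procedure can be sketched in a few lines of Python. Here, train_model and translate are hypothetical stand-ins for an NMT toolkit's training and inference routines, not functions from our actual pipeline:

    def train_model(pairs):
        """Hypothetical stand-in for an NMT toolkit's training routine."""
        raise NotImplementedError("plug in an NMT toolkit here")

    def translate(model, sentence):
        """Hypothetical stand-in for an NMT toolkit's inference routine."""
        raise NotImplementedError("plug in an NMT toolkit here")

    def back_translation(parallel_ab):
        """Train an A -> B model with back-translated data.

        parallel_ab: list of (a_sentence, b_sentence) pairs, i.e. (A_b, B_a).
        """
        # Step 1: train the reverse B -> A model on the available data.
        model_ba = train_model([(b, a) for a, b in parallel_ab])

        # Step 2: pass the B_a sentences through the reverse model to
        # regenerate noisy synthetic sentences A'_b.
        synthetic = [(translate(model_ba, b), b) for _, b in parallel_ab]

        # Step 3: train the forward A -> B model, using the synthetic
        # (A'_b, B_a) pairs as (additional) parallel data; extra monolingual
        # B sentences can be back-translated the same way.
        return train_model(synthetic + parallel_ab)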
2.3 Overlap-based BPE (OBPE)
Byte Pair Encoding (BPE) is a vocabulary creation method that relies on $n$-gram frequency (Sennrich et al., 2016b). The starting point is a character-based vocabulary. At each step, the BPE algorithm identifies the two adjacent tokens with the highest frequency, joins them together as a single token, and adds the new token to the vocabulary. The dataset is then restructured based on the expanded vocabulary. In the case of multilingual training, a single BPE vocabulary can handle all languages by running the BPE algorithm on the union of the data from all the languages. However, when constructing a multilingual vocabulary, BPE will prefer frequent word types, most of which are from HRLs, leaving a smaller proportion of the vocabulary for words from LRLs.

Language pair   Parallel sentences (WMT22_african)
eng-sna         8.7M
eng-xho         8.6M
eng-tsn         5.9M
eng-zul         3.8M
eng-nso         3M
eng-afr         1.6M
eng-tso         630K
eng-ssw         165K
xho-zul         1M
zul-sna         1.1M
sna-afr         1.6M*
afr-ssw         165K*
ssw-tsn         85K
tsn-tso         285K
tso-nso         212K
nso-xho         200K

Table 1: Number of available parallel sentences for all language pairs. * indicates that no data is available for these pairs and the number represents the amount of synthetic data we generated.

Language family   $L_{HRL}$        $L_{LRL}$
Germanic          English (eng)    Afrikaans (afr)
Nguni             Xhosa (xho)      Zulu (zul), Swati (ssw)
Sotho-Tswana      Tswana (tsn)     Sepedi (nso)
Bantu             Shona (sna)      Xitsonga (tso)

Table 2: The languages included in our translation system, grouped by language family and whether they are used as $L_{HRL}$ or $L_{LRL}$ for the OBPE algorithm.
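As a concrete illustration of the merge loop described above, here is a toy Python version of BPE vocabulary learning in the style of Sennrich et al. (2016b); the helper names are our own, and a real system would use an optimized implementation such as subword-nmt or SentencePiece:

    import re
    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        counts = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for left, right in zip(symbols, symbols[1:]):
                counts[(left, right)] += freq
        return counts

    def apply_merge(pair, vocab):
        """Join every occurrence of `pair`, respecting symbol boundaries."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    def learn_bpe(word_freqs, num_merges):
        # Start from a character-based segmentation of each word type.
        vocab = {" ".join(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            counts = pair_counts(vocab)
            if not counts:
                break
            best = max(counts, key=counts.get)  # most frequent adjacent pair
            vocab = apply_merge(best, vocab)
            merges.append(best)
        return merges

    # e.g. learn_bpe({"lower": 3, "newest": 2}, 10) returns the first ten
    # merge operations learned from those word frequencies.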
Overlap-based BPE (OBPE) is a modification of the BPE vocabulary creation algorithm which enhances overlap across related languages (Patil et al., 2022). OBPE takes into account the frequency of tokens as well as their existence across different languages. Given a list of HRLs ($L_{HRL}$) and LRLs ($L_{LRL}$), OBPE tries to balance cross-lingual sharing (tokens shared between HRLs and LRLs) against the representation of individual languages. The optimal OBPE vocabulary for a set of languages from different families is produced by considering the highest-resource language from each family as $L_{HRL}$ and the rest of the languages as $L_{LRL}$.
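As a rough illustration of how the merge criterion changes, the scoring function below interpolates between a pair's total frequency and its minimum frequency across the HRL and LRL groups, so that pairs shared by both groups are favoured. This is our own simplification, not the exact objective of Patil et al. (2022):

    def obpe_score(pair, counts_hrl, counts_lrl, alpha=0.5):
        """Overlap-aware merge score (a simplification of the OBPE idea).

        counts_hrl / counts_lrl: pair-frequency Counters computed separately
        over the HRL and LRL corpora (e.g. with pair_counts above). alpha
        controls how strongly sharing between the two groups is rewarded.
        """
        f_hrl = counts_hrl.get(pair, 0)
        f_lrl = counts_lrl.get(pair, 0)
        # The min() term is zero unless the pair occurs in BOTH groups, so
        # cross-lingually shared pairs are boosted relative to HRL-only ones.
        return (1 - alpha) * (f_hrl + f_lrl) + alpha * min(f_hrl, f_lrl)

At each merge step, learn_bpe would then pick the pair maximising this score instead of the raw frequency count.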