The VolcTrans System for WMT22 Multilingual Machine Translation Task
Xian Qian¹, Kai Hu¹, Jiaqiang Wang¹, Yifeng Liu²
Xingyuan Pan³, Jun Cao¹, Mingxuan Wang¹
¹ByteDance AI Lab, ²Tsinghua University, ³Wuhan University
{qian.xian, hukai.joseph, wangjiaqiang.sonian, caojun.sh, wangmingxuan.89}@bytedance.com
liuyifen20@mails.tsinghua.edu.cn, panxingyuan209@gmail.com
Abstract
This report describes our VolcTrans system for the WMT22 shared task on large-scale multilingual machine translation. We participated in the unconstrained track, which allows the use of external resources. Our system is a transformer-based multilingual model trained on data from multiple sources, including the public training set from the data track, NLLB data provided by Meta AI, self-collected parallel corpora, and pseudo bitext from back-translation. A series of heuristic rules cleans both bilingual and monolingual texts. On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU. Our code and trained models are available at https://github.com/xian8/wmt22
1 Introduction
Multilingual Machine Translation has attracted much attention in recent years due to its advantages in sharing cross-lingual knowledge for low-resource languages. It also dramatically reduces training and serving costs. Training a multilingual model is much faster and simpler than training many bilingual ones, and serving multiple low-traffic languages with one model can drastically improve GPU utilization.
The WMT22 shared task on large-scale multilingual machine translation includes 24 African languages (Adelani et al., 2022b). Inspired by previous research, we train a deep transformer model to translate all languages, since large models have been demonstrated effective for multilingual translation (Fan et al., 2021; Kong et al., 2021; Zhang et al., 2020). We participated in the unconstrained track that allows the use of external data. Besides the official dataset for the constrained track and the NLLB corpus provided by Meta AI (NLLB Team et al., 2022), we also collect parallel and monolingual texts from public websites and sources. These raw data are cleaned by a series of commonly used heuristic rules and a minimum description length (MDL) based approach that removes samples with repeated patterns. Monolingual texts are used for back-translation. For some very low-resource languages such as Wolof, iterative back-translation is adopted for higher accuracy.
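The MDL criterion itself is not spelled out in this report, so the following is only a minimal sketch of the general idea, using zlib compression length as a stand-in for description length; the 0.3 threshold and the helper name are illustrative assumptions rather than the system's actual filter.

import zlib

def has_repeat_pattern(sentence: str, ratio_threshold: float = 0.3) -> bool:
    # Use compressed size as a rough proxy for minimum description length:
    # text dominated by repeated patterns compresses far better than natural
    # sentences, so a very low ratio suggests a degenerate sample.
    raw = sentence.encode("utf-8")
    if not raw:
        return False
    compressed = zlib.compress(raw, 9)
    return len(compressed) / len(raw) < ratio_threshold

# A normal sentence is kept; a degenerate repetition is flagged for removal.
print(has_repeat_pattern("The cat sat quietly on the warm windowsill."))  # False
print(has_repeat_pattern("na " * 30))  # True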
We compare different training strategies to balance efficiency and quality, such as streaming data shuffling and dynamic vocabulary for new languages. Furthermore, we used the open-sourced LightSeq toolkit¹ to accelerate training and inference.
¹ https://github.com/bytedance/lightseq
On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU.
2 Data
2.1 Data Collection
Our training data are mainly from four sources: the official set for the constrained track, NLLB data provided by Meta AI, self-collected corpora, and a pseudo training set from back-translation.
For each source, we collect both parallel sentence pairs and monolingual sentences. A parallel sentence pair is collected if one side is in an African language and the other is in English or French. We did not collect African-African sentence pairs, as we use English as the pivot language for African-to-African translation. Instead, they are added to the monolingual set. More specifically, we split every such sentence pair into two sentences and add them to the corresponding monolingual sets. For example, the source side of a fuv-fon sentence pair is added to the fuv set. This greatly enriches the monolingual dataset, especially for the very low-resource languages.
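As a minimal illustration of this routing step, the sketch below keeps English/French-centric pairs as bitext and splits African-African pairs into per-language monolingual pools; the tuple-based corpus format and the function name are assumptions for illustration, not the actual pipeline code.

from collections import defaultdict

def split_pairs(pairs, pivot_langs=("eng", "fra")):
    # pairs: iterable of (src_lang, src_text, tgt_lang, tgt_text) tuples
    # (an assumed format). Pairs with an English or French side are kept as
    # bitext; African-African pairs only feed the monolingual pools, since
    # English serves as the pivot for African-to-African translation.
    bitext, monolingual = [], defaultdict(list)
    for src_lang, src_text, tgt_lang, tgt_text in pairs:
        if src_lang in pivot_langs or tgt_lang in pivot_langs:
            bitext.append((src_lang, src_text, tgt_lang, tgt_text))
        else:
            monolingual[src_lang].append(src_text)  # e.g. fuv side of a fuv-fon pair
            monolingual[tgt_lang].append(tgt_text)
    return bitext, monolingual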
We merge multiple corpora from the same source into one and use a Bloom filter² (Bloom, 1970) for fast deduplication. To reduce false positives, which would wrongly delete distinct samples, we set the error rate to 1e-7 and the capacity to 4B samples, which costs about 100G of host memory.
² https://pypi.org/project/bloom-filter
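A minimal deduplication sketch using the bloom-filter package linked in the footnote is shown below; the streaming function and key format are assumptions, while the capacity and error rate mirror the settings stated above (the toy call uses a small capacity so it can run on any machine).

from bloom_filter import BloomFilter  # pip install bloom-filter

def deduplicate(pairs, max_elements=4_000_000_000, error_rate=1e-7):
    # Stream over (source, target) pairs and keep only the first occurrence.
    # A Bloom filter gives fast membership tests at the cost of a small,
    # configurable false-positive rate (duplicates are never kept, but a few
    # distinct pairs may be wrongly dropped).
    seen = BloomFilter(max_elements=max_elements, error_rate=error_rate)
    for src, tgt in pairs:
        key = src + "\t" + tgt
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt

# Toy usage with a small capacity: the repeated pair is dropped.
pairs = [("Hello.", "Bonjour."), ("Hello.", "Bonjour."), ("Thanks.", "Merci.")]
print(list(deduplicate(pairs, max_elements=1000)))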
The official set includes the data from the data track participants, the OPUS collections, and the NLLB parallel corpora mined from Common Crawl (com) and other sources. All domains in the OPUS collections are involved, such as Mozilla-I10n, which can introduce a lot of noise (e.g., programming language fragments) and needs extra cleaning rules.
The NLLB data provided by Meta AI has three subsets: primary bitext, which includes a seed set carefully annotated for representative languages and a public bitext set downloaded from open sources; mined bitext automatically discovered by the LASER3 encoder in a global mining pipeline; and back-translated data from a pretrained model. We add the first two subsets to our training set.
Some public bitext data that are no longer available or require authorization, such as JW300 (Agić and Vulić, 2019), Lorelei³, and Chichewa News⁴, are not included. We noticed that the NLLB team recently released another version of the mined data on Hugging Face⁵, which differs from the version on the WMT22 website. We merge the new version into the old one and remove duplicates.
³ https://catalog.ldc.upenn.edu/LDC2021T02
⁴ https://zenodo.org/record/4315018#.YypJWezML0p
⁵ https://huggingface.co/datasets/allenai/nllb
We collected additional bitexts in two ways: large-scale mining from general web pages, and manually crawling specific websites and sources.
Large-scale mining focused on two scenarios: parallel sentences appearing on a single web page, such as dictionary pages that use bilingual example sentences to illustrate the usage of a word, and parallel web pages that describe the same content but are written in different languages. We extract these pages from the Common Crawl corpus, then utilize Vecalign (Thompson and Koehn, 2019), an accurate and efficient sentence alignment algorithm, to mine parallel bilingual sentences. We use the LASER (Schwenk and Douze, 2017) encoders released by WMT to obtain multilingual sentence embeddings and facilitate the alignment work. We collected about 3 million sentence pairs, namely the LAVA corpus, and submitted them to the data track, plus another 150M pairs for the unconstrained track.
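The sketch below illustrates only the embedding-and-scoring idea behind this step: it is not the Vecalign algorithm itself, and the laserembeddings package, language codes, and 0.8 cosine threshold are assumptions for illustration.

import numpy as np
from laserembeddings import Laser  # pip install laserembeddings

def mine_candidate_pairs(en_sents, af_sents, af_lang, threshold=0.8):
    # Embed both sides with LASER, then keep sentence pairs whose cosine
    # similarity clears a threshold. The production system instead runs
    # Vecalign over LASER embeddings to align sentences within page pairs.
    laser = Laser()
    en_emb = laser.embed_sentences(en_sents, lang="en")
    af_emb = laser.embed_sentences(af_sents, lang=af_lang)
    en_emb /= np.linalg.norm(en_emb, axis=1, keepdims=True)
    af_emb /= np.linalg.norm(af_emb, axis=1, keepdims=True)
    sims = en_emb @ af_emb.T  # cosine similarity matrix
    return [(en_sents[i], af_sents[j], float(sims[i, j]))
            for i, j in zip(*np.where(sims > threshold))]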
Specific websites and sources have fewer but higher-quality sentence pairs. For example, the Bible website⁶ labels the order of sentences across languages, so we can align them easily without sentence segmentation. Since JW300 is not publicly available, we crawled pages from Jehovah's Witnesses⁷ to recover the dataset.
⁶ https://www.bible.com/languages
⁷ https://www.jw.org
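As a small illustration of alignment by labeled order, the sketch below pairs sentences that share the same order key (e.g. book/chapter/verse); the dict-based input format is an assumption about such sources, not a description of the actual crawler.

def align_by_order(src_verses, tgt_verses):
    # src_verses/tgt_verses map an order key to a sentence; keys present on
    # both sides are paired directly, so no sentence segmentation is needed.
    shared = sorted(src_verses.keys() & tgt_verses.keys())
    return [(src_verses[k], tgt_verses[k]) for k in shared]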
Monolingual texts have richer sources, such as VOA news in Amharic⁸ and OSCAR (Abadji et al., 2022), which improve English/French→African translation via back-translation. Monolingual texts from parallel data are also collected, as described above. For African→English/French translation, we clean Wikipedia pages in English/French to get monolingual texts. For languages that gain significantly from back-translation, such as Wolof, we run another round of back-translation to generate high-quality pseudo data.
⁸ https://amharic.voanews.com/
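The sketch below shows how back-translation turns target-side monolingual text into pseudo bitext; the translate_batch interface stands in for whatever reverse model is used and is an assumed API, not the system's actual one.

def back_translate(target_mono, reverse_model, src_lang, tgt_lang):
    # Translate real target-language sentences back into the source language
    # with a reverse (tgt->src) model, then pair each synthetic source with
    # its original target sentence to form pseudo training data.
    synthetic_sources = reverse_model.translate_batch(
        target_mono, src=tgt_lang, tgt=src_lang)
    return list(zip(synthetic_sources, target_mono))

# Iterative back-translation (used for e.g. Wolof) repeats the loop: retrain
# the reverse model on the enlarged data, then regenerate the pseudo bitext
# with the stronger model.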
2.2 Data Cleaning
We used the following rules to clean the parallel datasets, except for the NLLB mined bitext.
• Filter out parentheses and the text between them if the numbers of parentheses in the two sentences differ.
• Filter out sentence pairs if their numbers mismatch, or if one sentence ends with a punctuation mark such as : ! ? ... and the other does not match.
• Filter out sentences shorter than 30 characters, sentences containing URLs or emails, or sentences with words longer than 100 characters.
• De-duplication: remove sentence pairs sharing the same source or target but having different translations.
• Remove sentences containing programming languages. We manually created a set of keywords to detect programming languages, such as "if (", "==", and ".getAttribute".
• Language identification using the NLLB language identification model trained with fastText (Joulin et al., 2017).
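A few of these rules are sketched below as filter functions; the regular expressions, the digit-comparison reading of the number rule, the abbreviated keyword list, and the LID model file name are assumptions, while the 30/100-character thresholds and the fastText-based NLLB language identification follow the list above.

import re
import fasttext  # pip install fasttext

# NLLB language-identification model; the file name/location is an assumption.
LID_MODEL = fasttext.load_model("lid218e.bin")
URL_OR_EMAIL = re.compile(r"https?://\S+|www\.\S+|\S+@\S+\.\S+")
CODE_KEYWORDS = ("if (", "==", ".getAttribute")  # manually curated, abbreviated here

def keep_sentence(text: str) -> bool:
    # Length, URL/email, over-long word, and programming-language filters.
    if len(text) < 30:
        return False
    if URL_OR_EMAIL.search(text):
        return False
    if any(len(word) > 100 for word in text.split()):
        return False
    if any(kw in text for kw in CODE_KEYWORDS):
        return False
    return True

def keep_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> bool:
    # Digit consistency plus language identification on both sides; labels
    # follow the NLLB convention, e.g. "eng_Latn".
    if re.findall(r"\d+", src) != re.findall(r"\d+", tgt):
        return False
    src_pred = LID_MODEL.predict(src)[0][0].replace("__label__", "")
    tgt_pred = LID_MODEL.predict(tgt)[0][0].replace("__label__", "")
    return src_pred == src_lang and tgt_pred == tgt_lang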