
The VolcTrans System for WMT22 Multilingual Machine Translation Task
Xian Qian¹, Kai Hu¹, Jiaqiang Wang¹, Yifeng Liu²
Xingyuan Pan³, Jun Cao¹, Mingxuan Wang¹
¹ByteDance AI Lab, ²Tsinghua University, ³Wuhan University
{qian.xian, hukai.joseph, wangjiaqiang.sonian, caojun.sh, wangmingxuan.89}@bytedance.com
liuyifen20@mails.tsinghua.edu.cn, panxingyuan209@gmail.com
Abstract
This report describes our VolcTrans system for the WMT22 shared task on large-scale multilingual machine translation. We participated in the unconstrained track, which allows the use of external resources. Our system is a transformer-based multilingual model trained on data from multiple sources, including the public training set from the data track, NLLB data provided by Meta AI, self-collected parallel corpora, and pseudo bitext from back-translation. A series of heuristic rules is applied to clean both bilingual and monolingual texts. On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU. Our code and trained models are available at https://github.com/xian8/wmt22
1 Introduction
Multilingual machine translation has attracted much attention in recent years due to its advantages in sharing cross-lingual knowledge for low-resource languages. It also dramatically reduces training and serving costs: training a multilingual model is much faster and simpler than training many bilingual ones, and serving multiple low-traffic languages with one model can drastically improve GPU utilization.
The WMT22 shared task on large-scale multilingual machine translation includes 24 African languages (Adelani et al., 2022b). Inspired by previous work, we train a deep transformer model to translate all languages, since large models have been shown to be effective for multilingual translation (Fan et al., 2021; Kong et al., 2021; Zhang et al., 2020). We participated in the unconstrained track, which allows the use of external data. Besides the official dataset for the constrained track and the NLLB corpus provided by Meta AI (NLLB Team et al., 2022), we also collect parallel and monolingual texts from public websites and other sources. These raw data are cleaned with a series of commonly used heuristic rules and a minimum description length (MDL) based filter that removes samples with repeated patterns. Monolingual texts are used for back-translation. For some very low-resource languages such as Wolof, iterative back-translation is adopted for higher accuracy.
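To give a flavor of the MDL-based repeat filter mentioned above, the sketch below uses a sentence's zlib compression ratio as a cheap proxy for its description length; the proxy, the length cutoff, and the threshold are illustrative assumptions rather than the exact procedure in our pipeline.

import zlib

def is_repetitive(sentence: str, ratio_threshold: float = 0.35) -> bool:
    """Flag sentences whose compressed size is much smaller than their raw size;
    highly repetitive text has a short description length relative to its surface length."""
    raw = sentence.encode("utf-8")
    if len(raw) < 40:  # very short lines give unreliable ratios
        return False
    return len(zlib.compress(raw, 9)) / len(raw) < ratio_threshold

# Keep only lines that are not dominated by repeated patterns.
lines = ["This is a normal sentence about farming.", "click here " * 20]
clean = [line for line in lines if not is_repetitive(line)]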
We compare different training strategies to balance efficiency and quality, such as streaming data shuffling and a dynamic vocabulary for new languages. Furthermore, we use the open-sourced LightSeq toolkit¹ to accelerate training and inference.
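As a minimal sketch of buffer-based streaming shuffling (the buffer size and the plain-text reader below are assumptions, not our exact implementation), a fixed-size buffer lets us approximately shuffle an arbitrarily large corpus without loading it into memory:

import random
from typing import Iterable, Iterator

def streaming_shuffle(examples: Iterable[str], buffer_size: int = 100_000,
                      seed: int = 0) -> Iterator[str]:
    """Approximately shuffle a data stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            buffer.append(example)
        else:
            # Emit a random buffered example and replace it with the new one.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = example
    rng.shuffle(buffer)
    yield from buffer

# Usage: stream and shuffle lines from a large concatenated training file.
# with open("train.all", encoding="utf-8") as reader:
#     for line in streaming_shuffle(reader, buffer_size=1_000_000):
#         ...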
On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU.
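For reference, all three metrics can be computed with the sacrebleu toolkit. The snippet below is a sketch with made-up hypothesis and reference lists; it assumes sacrebleu >= 2.2, whose flores200 SentencePiece tokenizer is used for spBLEU, and CHRF with word_order=2 corresponds to chrF2++.

from sacrebleu.metrics import BLEU, CHRF

hyps = ["The hotel is near the river."]        # system outputs (one per test sentence)
refs = [["The hotel is close to the river."]]  # one list per reference set

bleu = BLEU()                        # standard BLEU
spbleu = BLEU(tokenize="flores200")  # spBLEU: SentencePiece (FLORES-200) tokenization
chrf2pp = CHRF(word_order=2)         # chrF2++: character n-grams plus word bigrams

for metric in (bleu, spbleu, chrf2pp):
    print(metric.corpus_score(hyps, refs))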
2 Data
2.1 Data Collection
Our training data come mainly from four sources: the official set for the constrained track, NLLB data provided by Meta AI, self-collected corpora, and a pseudo training set from back-translation.
For each source, we collect both parallel sentence pairs and monolingual sentences. A parallel sentence pair is collected if one side is in an African language and the other is in English or French. We did not collect African-African sentence pairs, as we use English as the pivot language for African-to-African translation. Instead, they are added to the monolingual sets: we split every such sentence pair into two sentences and add each to the monolingual set of its language. For example, the source side of a fuv-fon sentence pair is added to the fuv set. This greatly enriches the monolingual dataset, especially for the very low-resource
¹https://github.com/bytedance/lightseq