The VolcTrans System for WMT22 Multilingual Machine Translation Task
Xian Qian¹, Kai Hu¹, Jiaqiang Wang¹, Yifeng Liu²
Xingyuan Pan³, Jun Cao¹, Mingxuan Wang¹
¹ByteDance AI Lab, ²Tsinghua University, ³Wuhan University
{qian.xian, hukai.joseph, wangjiaqiang.sonian, caojun.sh, wangmingxuan.89}@bytedance.com
liuyifen20@mails.tsinghua.edu.cn, panxingyuan209@gmail.com
Abstract
This report describes our VolcTrans system for the WMT22 shared task on large-scale multilingual machine translation. We participated in the unconstrained track, which allows the use of external resources. Our system is a transformer-based multilingual model trained on data from multiple sources, including the public training set from the data track, NLLB data provided by Meta AI, self-collected parallel corpora, and pseudo bitext from back-translation. A series of heuristic rules cleans both bilingual and monolingual texts. On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU. Our code and trained models are available at https://github.com/xian8/wmt22
1 Introduction
Multilingual Machine Translation has attracted much attention in recent years due to its advantages in sharing cross-lingual knowledge for low-resource languages. It also dramatically reduces training and serving costs. Training a multilingual model is much faster and simpler than training many bilingual ones, and serving multiple low-traffic languages with one model can drastically improve GPU utilization.
The WMT22 shared task on large-scale multilingual machine translation includes 24 African languages (Adelani et al., 2022b). Inspired by previous research, we train a deep transformer model to translate all languages, since large models have been demonstrated effective for multilingual translation (Fan et al., 2021; Kong et al., 2021; Zhang et al., 2020). We participated in the unconstrained track that allows the use of external data. Besides the official dataset for the constrained track and the NLLB corpus provided by Meta AI (NLLB Team et al., 2022), we also collect parallel and monolingual texts from public websites and sources. These raw data are cleaned by a series of commonly used heuristic rules and a minimum description length (MDL) based approach that removes samples with repeated patterns. Monolingual texts are used for back-translation. For some very low-resource languages such as Wolof, iterative back-translation is adopted for higher accuracy.
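The MDL criterion itself is not spelled out in this report, so the following is only a minimal sketch of the general idea, using zlib compression length as a stand-in for description length; the 0.3 threshold and the helper name are illustrative assumptions rather than the system's actual filter.

import zlib

def has_repeat_pattern(sentence: str, ratio_threshold: float = 0.3) -> bool:
    # Use compressed size as a rough proxy for minimum description length:
    # text dominated by repeated patterns compresses far better than natural
    # sentences, so a very low ratio suggests a degenerate sample.
    raw = sentence.encode("utf-8")
    if not raw:
        return False
    compressed = zlib.compress(raw, 9)
    return len(compressed) / len(raw) < ratio_threshold

# A normal sentence is kept; a degenerate repetition is flagged for removal.
print(has_repeat_pattern("The cat sat quietly on the warm windowsill."))  # False
print(has_repeat_pattern("na " * 30))  # True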
We compare different training strategies to balance efficiency and quality, such as streaming data shuffling and dynamic vocabulary for new languages. Furthermore, we used the open-sourced LightSeq toolkit¹ to accelerate training and inference.
¹ https://github.com/bytedance/lightseq
On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. The average inference speed is 11.5 sentences per second using a single Nvidia Tesla V100 GPU.
2 Data
2.1 Data Collection
Our training data are mainly from four sources: the official set for the constrained track, NLLB data provided by Meta AI, self-collected corpora, and a pseudo training set from back-translation.
For each source, we collect both parallel sentence pairs and monolingual sentences. A parallel sentence pair is collected if one side is in an African language and the other is in English or French. We did not collect African-African sentence pairs, as we use English as the pivot language for African-to-African translation. Instead, they are added to the monolingual set. More specifically, we split every such sentence pair into two sentences and add them to the corresponding monolingual sets. For example, the source side of a fuv-fon sentence pair is added to the fuv set. This greatly enriches the monolingual dataset, especially for the very low-resource languages.
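As a minimal illustration of this routing step, the sketch below keeps English/French-centric pairs as bitext and splits African-African pairs into per-language monolingual pools; the tuple-based corpus format and the function name are assumptions for illustration, not the actual pipeline code.

from collections import defaultdict

def split_pairs(pairs, pivot_langs=("eng", "fra")):
    # pairs: iterable of (src_lang, src_text, tgt_lang, tgt_text) tuples
    # (an assumed format). Pairs with an English or French side are kept as
    # bitext; African-African pairs only feed the monolingual pools, since
    # English serves as the pivot for African-to-African translation.
    bitext, monolingual = [], defaultdict(list)
    for src_lang, src_text, tgt_lang, tgt_text in pairs:
        if src_lang in pivot_langs or tgt_lang in pivot_langs:
            bitext.append((src_lang, src_text, tgt_lang, tgt_text))
        else:
            monolingual[src_lang].append(src_text)  # e.g. fuv side of a fuv-fon pair
            monolingual[tgt_lang].append(tgt_text)
    return bitext, monolingual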
We merge multiple corpora from the same source into one and use a Bloom filter² (Bloom, 1970) for fast deduplication. To reduce false positives, which would wrongly delete distinct samples, we set the error rate to 1e-7 and the capacity to 4B samples, which costs about 100G of host memory.
² https://pypi.org/project/bloom-filter
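A minimal deduplication sketch using the bloom-filter package linked in the footnote is shown below; the streaming function and key format are assumptions, while the capacity and error rate mirror the settings stated above (the toy call uses a small capacity so it can run on any machine).

from bloom_filter import BloomFilter  # pip install bloom-filter

def deduplicate(pairs, max_elements=4_000_000_000, error_rate=1e-7):
    # Stream over (source, target) pairs and keep only the first occurrence.
    # A Bloom filter gives fast membership tests at the cost of a small,
    # configurable false-positive rate (duplicates are never kept, but a few
    # distinct pairs may be wrongly dropped).
    seen = BloomFilter(max_elements=max_elements, error_rate=error_rate)
    for src, tgt in pairs:
        key = src + "\t" + tgt
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt

# Toy usage with a small capacity: the repeated pair is dropped.
pairs = [("Hello.", "Bonjour."), ("Hello.", "Bonjour."), ("Thanks.", "Merci.")]
print(list(deduplicate(pairs, max_elements=1000)))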
The official set includes the data from the data track participants, the OPUS collections, and the NLLB parallel corpora mined from Common Crawl (com) and other sources. All domains in the OPUS collections are involved, such as Mozilla-I10n, which can introduce a lot of noise (e.g., programming language fragments) and needs extra cleaning rules.
The NLLB data provided by Meta AI has three subsets: primary bitext, which includes a seed set carefully annotated for representative languages and a public bitext set downloaded from open sources; mined bitext automatically discovered by the LASER3 encoder in a global mining pipeline; and back-translated data from a pretrained model. We add the first two subsets to our training set.
Some public bitext data that are no longer available or require authorization, such as JW300 (Agić and Vulić, 2019), Lorelei³, and Chichewa News⁴, are not included. We noticed that the NLLB team recently released another version of the mined data on Hugging Face⁵, which differs from the version on the WMT22 website. We merge the new version into the old one and remove duplicates.
³ https://catalog.ldc.upenn.edu/LDC2021T02
⁴ https://zenodo.org/record/4315018#.YypJWezML0p
⁵ https://huggingface.co/datasets/allenai/nllb
We collected additional bitexts in two ways: large-scale mining from general web pages, and manually crawling specific websites and sources.
Large-scale mining focused on two scenarios: parallel sentences appearing on a single web page, such as dictionary pages that use bilingual example sentences to illustrate the usage of a word, and parallel web pages that describe the same content but are written in different languages. We extract these pages from the Common Crawl corpus, then utilize Vecalign (Thompson and Koehn, 2019), an accurate and efficient sentence alignment algorithm, to mine parallel bilingual sentences. We use the LASER (Schwenk and Douze, 2017) encoders released by WMT to obtain multilingual sentence embeddings and facilitate the alignment work. We collected about 3 million sentence pairs, namely the LAVA corpus, and submitted them to the data track, plus another 150M pairs for the unconstrained track.
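The sketch below illustrates only the embedding-and-scoring idea behind this step: it is not the Vecalign algorithm itself, and the laserembeddings package, language codes, and 0.8 cosine threshold are assumptions for illustration.

import numpy as np
from laserembeddings import Laser  # pip install laserembeddings

def mine_candidate_pairs(en_sents, af_sents, af_lang, threshold=0.8):
    # Embed both sides with LASER, then keep sentence pairs whose cosine
    # similarity clears a threshold. The production system instead runs
    # Vecalign over LASER embeddings to align sentences within page pairs.
    laser = Laser()
    en_emb = laser.embed_sentences(en_sents, lang="en")
    af_emb = laser.embed_sentences(af_sents, lang=af_lang)
    en_emb /= np.linalg.norm(en_emb, axis=1, keepdims=True)
    af_emb /= np.linalg.norm(af_emb, axis=1, keepdims=True)
    sims = en_emb @ af_emb.T  # cosine similarity matrix
    return [(en_sents[i], af_sents[j], float(sims[i, j]))
            for i, j in zip(*np.where(sims > threshold))]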
Specific websites and sources have fewer but higher-quality sentence pairs. For example, the Bible website⁶ labels the order of sentences across languages, so we can align them easily without sentence segmentation. Since JW300 is not publicly available, we crawled pages from Jehovah's Witnesses⁷ to recover the dataset.
⁶ https://www.bible.com/languages
⁷ https://www.jw.org
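As a small illustration of alignment by labeled order, the sketch below pairs sentences that share the same order key (e.g. book/chapter/verse); the dict-based input format is an assumption about such sources, not a description of the actual crawler.

def align_by_order(src_verses, tgt_verses):
    # src_verses/tgt_verses map an order key to a sentence; keys present on
    # both sides are paired directly, so no sentence segmentation is needed.
    shared = sorted(src_verses.keys() & tgt_verses.keys())
    return [(src_verses[k], tgt_verses[k]) for k in shared]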
Monolingual texts have richer sources, such as VOA news in Amharic⁸ and OSCAR (Abadji et al., 2022), which improve English/French→African translation via back-translation. Monolingual texts from parallel data are also collected, as described above. For African→English/French translation, we clean Wikipedia pages in English/French to get monolingual texts. For languages that gain significantly from back-translation, such as Wolof, we run another round of back-translation to generate high-quality pseudo data.
⁸ https://amharic.voanews.com/
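The sketch below shows how back-translation turns target-side monolingual text into pseudo bitext; the translate_batch interface stands in for whatever reverse model is used and is an assumed API, not the system's actual one.

def back_translate(target_mono, reverse_model, src_lang, tgt_lang):
    # Translate real target-language sentences back into the source language
    # with a reverse (tgt->src) model, then pair each synthetic source with
    # its original target sentence to form pseudo training data.
    synthetic_sources = reverse_model.translate_batch(
        target_mono, src=tgt_lang, tgt=src_lang)
    return list(zip(synthetic_sources, target_mono))

# Iterative back-translation (used for e.g. Wolof) repeats the loop: retrain
# the reverse model on the enlarged data, then regenerate the pseudo bitext
# with the stronger model.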
2.2 Data Cleaning
We used the following rules to clean the parallel datasets, except for the NLLB mined bitext.
• Filter out parentheses and the text between them if the numbers of parentheses in the two sentences differ.
• Filter out sentence pairs if their numbers mismatch, or if one sentence ends with a punctuation mark such as : ! ? ... and the other does not match.
• Filter out sentences shorter than 30 characters, sentences containing URLs or emails, or sentences with words longer than 100 characters.
• De-duplication: remove sentence pairs sharing the same source or target but having different translations.
• Remove sentences containing programming languages. We manually created a set of keywords to detect programming languages, such as "if (", "==", and ".getAttribute".
• Language identification using the NLLB language identification model trained with fastText (Joulin et al., 2017).
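A few of these rules are sketched below as filter functions; the regular expressions, the digit-comparison reading of the number rule, the abbreviated keyword list, and the LID model file name are assumptions, while the 30/100-character thresholds and the fastText-based NLLB language identification follow the list above.

import re
import fasttext  # pip install fasttext

# NLLB language-identification model; the file name/location is an assumption.
LID_MODEL = fasttext.load_model("lid218e.bin")
URL_OR_EMAIL = re.compile(r"https?://\S+|www\.\S+|\S+@\S+\.\S+")
CODE_KEYWORDS = ("if (", "==", ".getAttribute")  # manually curated, abbreviated here

def keep_sentence(text: str) -> bool:
    # Length, URL/email, over-long word, and programming-language filters.
    if len(text) < 30:
        return False
    if URL_OR_EMAIL.search(text):
        return False
    if any(len(word) > 100 for word in text.split()):
        return False
    if any(kw in text for kw in CODE_KEYWORDS):
        return False
    return True

def keep_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> bool:
    # Digit consistency plus language identification on both sides; labels
    # follow the NLLB convention, e.g. "eng_Latn".
    if re.findall(r"\d+", src) != re.findall(r"\d+", tgt):
        return False
    src_pred = LID_MODEL.predict(src)[0][0].replace("__label__", "")
    tgt_pred = LID_MODEL.predict(tgt)[0][0].replace("__label__", "")
    return src_pred == src_lang and tgt_pred == tgt_lang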