
2015b), which consists of 133K text pairs collected
from TED talk transcripts. Some studies (Provilkov
et al., 2020; Xu et al., 2019; Nguyen and Salazar,
2019b) show decent improvements through different
regularization techniques. Recently, PhoMT
(Doan et al., 2021b) and VLSP2020 (Ha et al.,
2020) released larger parallel datasets of 3M
and 4M text pairs respectively, extracted from
publicly available resources for English-Vietnamese
translation. An mBART model trained on PhoMT sets
the current state-of-the-art results.
3 MTet: a Machine Translation dataset
in English and Vietnamese
In this section, we describe in detail our MTet
(Multidomain Translation for English-vieTnamese)
dataset. We curated a total of 4.2M training
examples¹. Based on the curation methodology, we
divide this data into four types.
Combining existing sources
This includes sources from the Open Parallel
corPUS (Tiedemann, 2012), spanning across
different domains such as educational videos
(Abdelali et al., 2014), software user interfaces
(GNOME, KDE4, Ubuntu), COVID-related news articles
(ELRC), religious texts (Christodouloupoulos and
Steedman, 2015), subtitles (Tatoeba), Wikipedia
(Wołk and Marasek, 2014), and TED Talks (Reimers
and Gurevych, 2020).
Together with the original IWSLT'15 (Cettolo
et al., 2015a) training set, the total dataset
reaches 1.2M training examples. We train a base
Transformer on this data, denoted bTA, to aid the
collection of other data sources described below.
Scoring and filtering
Two other large sources from OPUS are OpenSubtitles
(Lison and Tiedemann, 2016) and CCAlign-envi
(El-Kishky et al., 2020), of sizes 3.5M and 9.3M
respectively. For OpenSubtitles, manual inspection
revealed inaccurate translations, consistent with
the observations in Doan et al. (2021b). Including
CCAlign-envi as-is significantly reduces model
performance on the test set (Appendix C). For this
reason, we use bTA to score each bitext by
computing the loss of all text pairs, and select
the best 700K training examples using
cross-validation on the tst2013 test set².
CCAlign-envi, on the other hand, is entirely
discarded through the same process.

¹Our work started and progressed concurrently with
PhoMT; therefore a significant chunk of our data
overlaps. After deduplication, 3M new training
examples are contributed on top of the existing
PhoMT training set.
²https://github.com/stefan-it/nmt-en-vi
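The scoring-and-filtering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `pair_loss` here is a stand-in length-ratio heuristic, whereas the paper scores each pair with the bTA Transformer's translation loss; `filter_bitext` and the example pairs are likewise hypothetical.

```python
def pair_loss(en: str, vi: str) -> float:
    """Stand-in scorer: penalize large length mismatch between the
    two sides. In the paper's pipeline this role is played by the
    bTA model's loss on the (en, vi) pair (lower = better)."""
    len_en, len_vi = len(en.split()), len(vi.split())
    return abs(len_en - len_vi) / max(len_en, len_vi, 1)

def filter_bitext(pairs, keep: int):
    """Score every pair and keep the `keep` lowest-loss examples.
    The cutoff (700K in the paper) would be tuned on a held-out set."""
    scored = sorted(pairs, key=lambda p: pair_loss(*p))
    return scored[:keep]

pairs = [
    ("Hello world", "Chào thế giới"),
    ("This is a long English sentence with many words", "Ngắn"),
    ("Good morning", "Chào buổi sáng"),
]
kept = filter_bitext(pairs, keep=2)  # drops the mismatched pair
```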
Dynamic Programming style alignment
Another large source of parallel data, trickier to
extract, comes from weakly aligned books and
articles (Ladhak et al., 2020). This data contains
many mismatches at the sentence and paragraph
levels due to versioning, translator formatting,
and extra header and page-footer text. We propose a
dynamic-programming style alignment algorithm
detailed in Algorithm 1, a simplified version of
BleuAlign (Sennrich and Volk,2011), to filter and
align sentences between each pair of documents,
maximizing the total BLEU score after alignment.
In total, we collected 900K training examples from
300 bilingual books and news articles.
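The alignment idea can be sketched as a standard monotone dynamic program. This is a simplified illustration of the structure of such an algorithm, not a reproduction of Algorithm 1: `similarity` is a stand-in unigram-overlap score in place of the BLEU score used in the paper, and the function names and threshold are our own.

```python
def similarity(a: str, b: str) -> float:
    """Unigram F1 overlap between two sentences (stand-in for BLEU)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def align(src, tgt, min_score=0.1):
    """DP over sentence indices: at each cell, either align (i, j) as a
    1-1 pair or skip a sentence on one side. Returns aligned index pairs
    that maximize the total similarity score."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = similarity(src[i - 1], tgt[j - 1])
            options = [(best[i - 1][j], "skip_src"),
                       (best[i][j - 1], "skip_tgt")]
            if s >= min_score:  # only count sufficiently similar pairs
                options.append((best[i - 1][j - 1] + s, "match"))
            best[i][j], move[i][j] = max(options, key=lambda t: t[0])
    # Backtrack from the bottom-right cell to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i][j] == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

src = ["the cat sat", "extra header line", "dogs bark loudly"]
tgt = ["the cat sat down", "dogs bark loudly today"]
pairs = align(src, tgt)  # the spurious header is skipped
```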
Manual crawl and clean
For this source, we focus on more technical and
high-impact domains, namely law documents and
biomedical scientific articles. We manually crawl
and clean across 20 different websites of public
biomedical journals and law document libraries,
treating them individually due to their
significantly different formatting. We also
manually crawl and clean some other available
websites that are more straightforward to process,
as detailed in Appendix D. Overall, this source
contributed another 1.2M training examples.
Data crowdsourcing for MTet multi-domain
test set
We utilize dataset.vn to distribute 4K test
examples, held out from the collected data, to 13
human experts for further refinement. The test
domains include biomedical, religion, law, and news.
Overall, we collected 4.2M training examples
across all sources. After combining MTet with
PhoMT and IWSLT’15, we grew the existing train-
ing set from 3M to 6M training examples. Com-
pared to the existing data sources, this dataset is
both larger and much more diverse, with the inclu-
sion of technical, impactful, yet so far mostly ne-
glected domains such as law and biomedical data.
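The deduplication against PhoMT mentioned in footnote 1 is not specified in this section; one common approach, sketched here under that assumption, is exact matching on whitespace- and case-normalized pairs. The function names are our own.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants collide."""
    return " ".join(text.lower().split())

def dedup_against(new_pairs, existing_pairs):
    """Keep only pairs not already present in the existing corpus."""
    seen = {(normalize(e), normalize(v)) for e, v in existing_pairs}
    out = []
    for e, v in new_pairs:
        key = (normalize(e), normalize(v))
        if key not in seen:
            seen.add(key)  # also drops duplicates within new_pairs
            out.append((e, v))
    return out

existing = [("Hello.", "Xin chào.")]
new = [("hello.", "xin  chào."), ("Good day", "Chào hỏi")]
deduped = dedup_against(new, existing)  # keeps only the novel pair
```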
4 EnViT5
4.1 Model
EnViT5 is a Text-to-Text Transfer Transformer
model that follows the encoder-decoder architecture
proposed by Vaswani et al. (2017) and the T5
framework proposed by Raffel et al. (2019). The
original T5 work proposed five configurations in
model size: small, base, large, 3B,
and 11B. For the practical purpose of the study, we