
2015b), which consists of 133K text pairs collected
from TED talk transcripts. Some studies (Provilkov
et al., 2020; Xu et al., 2019; Nguyen and Salazar,
2019b) show decent improvements through different
regularization techniques. Recently, PhoMT
(Doan et al., 2021b) and VLSP2020 (Ha et al.,
2020) released larger parallel datasets of 3M
and 4M text pairs respectively, extracted from
publicly available resources for English-Vietnamese
translation. An mBART model trained on PhoMT sets
the current state-of-the-art results.
3 MTet: a Machine Translation dataset
in English and Vietnamese
In this section, we describe in detail our MTet
(Multidomain Translation for English-vieTnamese)
dataset. We curated a total of 4.2M training
examples¹. Based on the curation methodology, we
divide this data into four types.
Combining existing sources
This includes sources from the Open Parallel
corPUS (Tiedemann, 2012), spanning across
different domains such as educational videos
(Abdelali et al., 2014), software user interfaces
(GNOME, KDE4, Ubuntu), COVID-related news articles
(ELRC), religious texts (Christodouloupoulos and
Steedman, 2015), subtitles (Tatoeba), Wikipedia
(Wołk and Marasek, 2014), and TED Talks (Reimers
and Gurevych, 2020).
Together with the original IWSLT'15 (Cettolo
et al., 2015a) training set, the total dataset
reaches 1.2M training examples. We train a base
Transformer on this data, denoted bTA, to aid the
collection of other data sources described below.
Scoring and filtering
Two other large sources from OPUS are OpenSubtitles
(Lison and Tiedemann, 2016) and CCAlign-envi
(El-Kishky et al., 2020), of sizes 3.5M and 9.3M
respectively. For OpenSubtitles, manual inspection
revealed inaccurate translations, consistent with
the observations in Doan et al. (2021b). Including
CCAlign-envi as-is significantly reduces model
performance on the test set (Appendix C). For this
reason, we use bTA to score each bitext by
computing the loss of all text pairs, and select
the best 700K training examples using
cross-validation on the tst2013 test set².
CCAlign-envi, on the other hand, is entirely
discarded through the same process.

¹Our work started and progressed concurrently with
PhoMT; therefore a significant chunk of our data
overlaps. After deduplication, 3M new training
examples are contributed on top of the existing
PhoMT training set.
²https://github.com/stefan-it/nmt-en-vi
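The scoring-and-filtering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `pair_loss` here is a stand-in length-ratio heuristic, whereas the paper scores each pair with the bTA Transformer's translation loss; `filter_bitext` and the example pairs are likewise hypothetical.

```python
def pair_loss(en: str, vi: str) -> float:
    """Stand-in scorer: penalize large length mismatch between the
    two sides. In the paper's pipeline this role is played by the
    bTA model's loss on the (en, vi) pair (lower = better)."""
    len_en, len_vi = len(en.split()), len(vi.split())
    return abs(len_en - len_vi) / max(len_en, len_vi, 1)

def filter_bitext(pairs, keep: int):
    """Score every pair and keep the `keep` lowest-loss examples.
    The cutoff (700K in the paper) would be tuned on a held-out set."""
    scored = sorted(pairs, key=lambda p: pair_loss(*p))
    return scored[:keep]

pairs = [
    ("Hello world", "Chào thế giới"),
    ("This is a long English sentence with many words", "Ngắn"),
    ("Good morning", "Chào buổi sáng"),
]
kept = filter_bitext(pairs, keep=2)  # drops the mismatched pair
```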
Dynamic Programming style alignment
Another large source of parallel data, trickier to
extract, comes from weakly aligned books and
articles (Ladhak et al., 2020). This data contains
many mismatches at the sentence and paragraph
levels due to versioning, translator formatting,
and extra header and page-footer text. We propose a
dynamic-programming style alignment algorithm
detailed in Algorithm 1, a simplified version of
BleuAlign (Sennrich and Volk,2011), to filter and
align sentences between each pair of documents,
maximizing the total BLEU score after alignment.
In total, we collected 900K training examples from
300 bilingual books and news articles.
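The alignment idea can be sketched as a standard monotone dynamic program. This is a simplified illustration of the structure of such an algorithm, not a reproduction of Algorithm 1: `similarity` is a stand-in unigram-overlap score in place of the BLEU score used in the paper, and the function names and threshold are our own.

```python
def similarity(a: str, b: str) -> float:
    """Unigram F1 overlap between two sentences (stand-in for BLEU)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def align(src, tgt, min_score=0.1):
    """DP over sentence indices: at each cell, either align (i, j) as a
    1-1 pair or skip a sentence on one side. Returns aligned index pairs
    that maximize the total similarity score."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = similarity(src[i - 1], tgt[j - 1])
            options = [(best[i - 1][j], "skip_src"),
                       (best[i][j - 1], "skip_tgt")]
            if s >= min_score:  # only count sufficiently similar pairs
                options.append((best[i - 1][j - 1] + s, "match"))
            best[i][j], move[i][j] = max(options, key=lambda t: t[0])
    # Backtrack from the bottom-right cell to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i][j] == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

src = ["the cat sat", "extra header line", "dogs bark loudly"]
tgt = ["the cat sat down", "dogs bark loudly today"]
pairs = align(src, tgt)  # the spurious header is skipped
```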
Manual crawl and clean
For this source, we focus on more technical and
high-impact domains, namely law documents and
biomedical scientific articles. We manually crawl
and clean across 20 different websites of public
biomedical journals and law document libraries,
treating them individually due to their
significantly different formatting. We also
manually crawl and clean some other available
websites that are more straightforward to process,
as detailed in Appendix D. Overall, this source
contributed another 1.2M training examples.
Data crowdsourcing for MTet multi-domain
test set
We utilize dataset.vn to distribute 4K test
examples, held out from the collected data, to 13
human experts for further refinement. The test
domains include biomedical, religion, law, and news.
Overall, we collected 4.2M training examples
across all sources. After combining MTet with
PhoMT and IWSLT’15, we grew the existing train-
ing set from 3M to 6M training examples. Com-
pared to the existing data sources, this dataset is
both larger and much more diverse, with the inclu-
sion of technical, impactful, yet so far mostly ne-
glected domains such as law and biomedical data.
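The deduplication against PhoMT mentioned in footnote 1 is not specified in this section; one common approach, sketched here under that assumption, is exact matching on whitespace- and case-normalized pairs. The function names are our own.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants collide."""
    return " ".join(text.lower().split())

def dedup_against(new_pairs, existing_pairs):
    """Keep only pairs not already present in the existing corpus."""
    seen = {(normalize(e), normalize(v)) for e, v in existing_pairs}
    out = []
    for e, v in new_pairs:
        key = (normalize(e), normalize(v))
        if key not in seen:
            seen.add(key)  # also drops duplicates within new_pairs
            out.append((e, v))
    return out

existing = [("Hello.", "Xin chào.")]
new = [("hello.", "xin  chào."), ("Good day", "Chào hỏi")]
deduped = dedup_against(new, existing)  # keeps only the novel pair
```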
4 EnViT5
4.1 Model
EnViT5 is a Text-to-Text Transfer Transformer
model that follows the encoder-decoder architecture
proposed by Vaswani et al. (2017) and the T5
framework proposed by Raffel et al. (2019). The
original T5 work proposed five configurations in
model size: small, base, large, 3B,
and 11B. For the practical purpose of the study, we