
to employ MT systems to contribute high-quality Vietnamese datasets derived from English-language sources. To the best of our knowledge, this
is the first work utilizing state-of-the-art machine
translation to translate both self-supervised and su-
pervised learning biomedical data for pretrained
models in a low-resource language setting.
2.1 Pubmed and English Biomedical NLP Studies
Pubmed3 provides access to the MEDLINE database4, which contains titles, abstracts, and metadata from medical literature since the 1970s. The dataset consists of more than 34 million biomedical abstracts collected from sources such as life science publications, medical journals, and online e-books. It is maintained and updated yearly to include more recent biomedical documents.
Pubmed Abstract has been the main pretraining dataset for almost every state-of-the-art biomedical domain-specific pretrained model (Lee et al., 2019; Yuan et al., 2022; Tinn et al., 2021; Yasunaga et al., 2022; Alrowili and Shanker, 2021; Phan et al., 2021a). In addition, many well-known biomedical NLP/NLU benchmark datasets are built on the unlabeled Pubmed corpus (Doğan et al., 2014; Nye et al., 2018; Herrero-Zazo et al., 2013; Jin et al., 2019). Recently, to help accelerate research in biomedical NLP, Gu et al. (2020) released BLURB (Biomedical Language Understanding & Reasoning Benchmark), which consists of multiple pretrained biomedical NLP models and benchmark tasks. Notably, all of the top 10 models on the BLURB Leaderboard5 are pretrained on the Pubmed Abstract dataset.
2.2 English-Vietnamese Translation
Due to the limited availability of high-quality parallel data, English-Vietnamese is classified as a low-resource translation pair (Liu et al., 2020). One of the first notable parallel datasets for En-Vi neural machine translation is IWSLT'15 (Luong and Manning, 2015), with 133K sentence pairs. A few years later, PhoMT (Doan et al., 2021) and VLSP2020 (Ha et al., 2020) released larger parallel datasets, extracted from publicly available resources, for English-Vietnamese translation.
3https://pubmed.ncbi.nlm.nih.gov
4https://www.nlm.nih.gov/bsd/pmresources.html
5https://microsoft.github.io/BLURB/leaderboard.html
Recently, VietAI6 curated the largest high-quality En-Vi corpus to date, with 4.2M training pairs from various domains, and achieved state-of-the-art results on English-Vietnamese translation (Ngo et al., 2022). The work also focuses on En-Vi translation performance across multiple domains, including biomedical. As a result, the project's NMT model outperforms existing En-Vi translation models (Doan et al., 2021; Fan et al., 2020) by more than 2% in BLEU score.
3 Improvements on Biomedical English-Vietnamese Translation through Self-training
To generate a large-scale synthetic Vietnamese biomedical corpus, we first look into improving the existing English-Vietnamese translation system in the biomedical domain. Previous work from Ngo et al. (2022) has shown that En-Vi biomedical bitexts are very rare, even with large-scale bitext mining. Therefore, we turn to self-training to leverage the available monolingual English biomedical data.
The self-training approach has been studied in He et al. (2019) and used to improve translation in low-resource MT systems (Chen et al., 2019a). The advantage of this method is that the source side of the monolingual corpus can be domain-specific data for translation. The shortcoming, however, is that the generated targets can be of low quality and hurt machine translation performance. Therefore, to keep the synthetic targets as clean as possible, we start with the English-Vietnamese machine translation model from Ngo et al. (2022), denoted bTA, which achieves state-of-the-art results on both the En-Vi biomedical and general translation domains.
We use bTA to translate 1M English biomedical abstracts from the Pubmed corpus, generating a synthetic parallel biomedical dataset of 1M English-Vietnamese pairs. These new 1M En-Vi biomedical pairs are then concatenated with the existing high-quality En-Vi translation data from MTet (Ngo et al., 2022) and PhoMT (Doan et al., 2021), increasing the total from 6.2M to 7.2M En-Vi sentence pairs.
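A minimal sketch of this forward-translation step is shown below. It assumes the publicly released VietAI/envit5-translation checkpoint and its "en: " task prefix as an illustrative stand-in for bTA; the exact checkpoint, decoding settings, and data loading in our pipeline are simplified here.

```python
# Minimal sketch of the self-training data generation step.
# Assumptions (illustrative, not the exact bTA pipeline): the public
# VietAI/envit5-translation checkpoint and its "en: " source prefix.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "VietAI/envit5-translation"  # assumed stand-in for bTA
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate_en_to_vi(sentences, batch_size=32):
    """Forward-translate English sentences into Vietnamese."""
    translations = []
    for i in range(0, len(sentences), batch_size):
        batch = ["en: " + s for s in sentences[i:i + batch_size]]
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        outputs = model.generate(**inputs, max_length=512)
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations

# Monolingual English biomedical text (e.g., Pubmed abstract sentences).
pubmed_en = ["The patient presented with acute myocardial infarction."]

# Forward translation yields synthetic (en, vi) pairs ...
synthetic_pairs = list(zip(pubmed_en, translate_en_to_vi(pubmed_en)))

# ... which are concatenated with the gold MTet + PhoMT bitext
# (6.2M pairs in practice) before re-finetuning the translation model.
gold_pairs = []  # load the MTet and PhoMT sentence pairs here
train_pairs = gold_pairs + synthetic_pairs
```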
To verify the effectiveness of our new self-training data, we re-finetune the bTA model on this 7.2M-pair bitext corpus. We report the model performance on the medical test set from MTet and the general test set from PhoMT in Table 1 (the translation performance on other domains such as News, Religion, and Law is reported in Appendix A for
6https://vietai.org