Enriching Biomedical Knowledge for Low-resource Language
Through Large-Scale Translation
Long Phan1, Tai Dang1,3, Hieu Tran1, Trieu H. Trinh1,4,
Vy Phan3, Lam D. Chau2, and Minh-Thang Luong1
1VietAI Research
2Case Western Reserve University
3University of Massachusetts-Amherst
4New York University
*The first four authors contributed equally to this work.
Abstract

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretrained and supervised data in the biomedical domain. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts, together with evaluations of existing methods against ViPubmedT5.
1 Introduction

In recent years, pretrained language models (LMs) have played an important and novel role in developing many Natural Language Processing (NLP) systems. Utilizing large pretrained models like BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), BART (Lewis et al., 2019), and T5 (Raffel et al., 2019) has become an effective trend in natural language processing. All of these large models follow the Transformer architecture proposed by Vaswani et al. (2017) with its attention mechanism. The architecture has proven to be well suited for finetuning on downstream tasks, leveraging transfer learning from large pretrained checkpoints. Before the emergence of large Transformer LMs, traditional word embeddings gave each word a fixed global representation. Large pretrained models instead derive word vector representations from a large training corpus, giving the pretrained model better knowledge of the generalized representation of the target language/domain and significantly improving performance on downstream finetuning tasks.
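The difference can be made concrete with a small sketch (our illustration, not from the paper). It assumes the publicly available transformers and torch packages and the bert-base-uncased checkpoint, and contrasts a fixed embedding-table lookup with the context-dependent vectors a pretrained Transformer produces for the same word:

```python
# Minimal sketch: static (context-free) vs. contextual word representations.
# Assumes `pip install torch transformers`; the model choice is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence: str, word: str) -> torch.Tensor:
    """Return the last-hidden-state vector of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (enc.input_ids[0] == word_id).nonzero()[0].item()
    return hidden[position]

# A static embedding table gives "cold" one fixed row, regardless of context:
static = model.get_input_embeddings()
cold_static = static(torch.tensor(tokenizer.convert_tokens_to_ids("cold")))

# A pretrained LM gives "cold" different vectors in different contexts:
v1 = vector_for("i caught a cold last week", "cold")
v2 = vector_for("the weather is cold today", "cold")
sim = torch.cosine_similarity(v1, v2, dim=0)
print(f"cosine(cold_illness, cold_weather) = {sim.item():.3f}")  # < 1.0
```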
The success of pretrained models in the general domain (BERT, RoBERTa, BART, T5, etc.) has created a path toward more domain-specific language models such as CodeBERT (Feng et al., 2020) and CoTexT (Phan et al., 2021b) for programming languages, TaBERT (Yin et al., 2020) for tabular data, and BioBERT (Lee et al., 2019) and PubmedBERT (Tinn et al., 2021) for the biomedical domain.
Biomedical literature is becoming more popular and widely accessible to the scientific community through large databases such as Pubmed1, PMC2, and MIMIC-IV (Johnson et al., 2021). This has also led to many studies, corpora, and projects being released to further advance the Biomedical Natural Language Processing field (Lee et al., 2019; Tinn et al., 2021; Phan et al., 2021a; Yuan et al., 2022). These biomedical domain models leverage transfer learning from pretrained models (Devlin et al., 2018; Clark et al., 2020; Raffel et al., 2019; Lewis et al., 2019) to achieve state-of-the-art results on multiple Biomedical NLP tasks like Named Entity Recognition (NER), Relation Extraction (RE), and document classification.
However, few studies have explored leveraging large pretrained models for biomedical NLP in low-resource languages. The main reason is the lack of large-scale biomedical pretraining data and benchmark datasets. Furthermore, collecting biomedical data in low-resource languages can be very expensive due to scientific limitations and inaccessibility.
We attempt to overcome the lack of biomedical text data in low-resource languages by using state-of-the-art translation models. We start with the Vietnamese language and keep everything reproducible for other low-resource languages in future work.

1https://pubmed.ncbi.nlm.nih.gov
2https://www.ncbi.nlm.nih.gov/pmc
Figure 1: Overview of the pretraining and finetuning of ViPubmedT5. [Figure: ViT5 is pretrained for 1.0M steps on the general-domain 160GB CC100 corpus, then further pretrained up to 1.5M steps on the 20GB ViPubmed corpus (20M English Pubmed abstracts translated to Vietnamese with MTet) to produce ViPubmedT5, which is finetuned on downstream tasks: summarization (FAQSum), acronym disambiguation (acrDrAid), and natural language inference (ViMedNLI).]
We introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on synthetic Vietnamese biomedical text translated with a state-of-the-art English-Vietnamese translation model. We also introduce ViMedNLI, a medical natural language inference (NLI) task translated from the English MedNLI (Romanov and Shivade, 2018) with human refinement.
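As a rough sketch of how such a translated benchmark can be produced (our hedged reconstruction, not the paper's exact pipeline), one can machine-translate each premise/hypothesis pair and carry the label over unchanged, leaving human experts to refine the medical terminology. The VietAI/envit5-translation checkpoint name and the "en:" input prefix follow the public release accompanying Ngo et al. (2022) and should be treated as assumptions; the example pair is taken from Figure 1:

```python
# Hedged sketch: building ViMedNLI-style data by translating MedNLI pairs.
# Checkpoint name and "en:" prefix follow the public envit5-translation
# model card (Ngo et al., 2022) but are assumptions, not the exact recipe.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("VietAI/envit5-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/envit5-translation")

def translate_en_vi(texts):
    """Translate a batch of English sentences to Vietnamese."""
    inputs = tokenizer([f"en: {t}" for t in texts],
                       return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=256)
    # Generated text may carry a "vi: " prefix depending on the checkpoint.
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# One MedNLI-style example (from Figure 1; labels transfer unchanged):
pair = {"sentence1": "She was started on Dilantin.",
        "sentence2": "The patient is in pain.",
        "gold_label": "entailment"}

vi_premise, vi_hypothesis = translate_en_vi([pair["sentence1"],
                                             pair["sentence2"]])
# Translations are then refined by human experts, especially drug names
# and clinical abbreviations.
print(vi_premise, vi_hypothesis, pair["gold_label"])
```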
We thoroughly benchmark the performance of ViPubmedT5, pretrained on synthetic translated biomedical data, on ViMedNLI and other public Vietnamese biomedical NLP tasks (Minh et al., 2022). The results show that our model outperforms both general-domain (Nguyen and Nguyen, 2020; Phan et al., 2022) and health-domain (Minh et al., 2022) Vietnamese pretrained models on biomedical tasks.
In this work, we offer the following contributions:

• A state-of-the-art English-Vietnamese translation model (with self-training) on the medical and general domains.
• ViPubmedT5, the first encoder-decoder Transformer model pretrained on large-scale synthetic translated biomedical data.
• A Vietnamese medical natural language inference dataset (ViMedNLI), translated from MedNLI (Romanov and Shivade, 2018) and refined by human experts with biomedical expertise.
• Public releases of our model checkpoints, datasets, and source code for future studies on other low-resource languages.
2 Related Works

The development of parallel text corpora for translation and their use for training MT systems has been a rapidly growing field of research. In recent years, low-resource languages have gained more attention from both industry and academia (Chen et al., 2019b; Shen et al., 2021; Gu et al., 2018; Nasir and Mchechesi, 2022). Previous works include gathering more training data or training large multilingual models (Thu et al., 2016; Fan et al., 2021). Low-resource MT enhances billions of people's daily lives in numerous fields. Nonetheless, there are crucial yet underserved domains, such as biomedicine and healthcare, in which MT systems have not been able to contribute adequately.

Previous works using MT systems for biomedical tasks include Neves et al. (2016) and Névéol et al. (2018). Additionally, several biomedical parallel corpora (Deléger et al., 2009) have been utilized only for terminology translation. Pioneering attempts trained MT systems on a corpus of MEDLINE titles (Wu et al., 2011) and on publication titles and abstracts for both ES-EN and FR-EN language pairs (Jimeno-Yepes et al., 2012). However, none of these works targets low-resource languages. A recent effort to build Vietnamese ML systems for biomedicine and healthcare is Minh et al. (2022); this work, however, does not utilize the capability of MT systems, instead relying on manual crawling. This has motivated us to employ MT systems to contribute high-quality Vietnamese datasets derived from English-language sources. To the best of our knowledge, this is the first work utilizing state-of-the-art machine translation to translate both self-supervised and supervised biomedical data for pretrained models in a low-resource language setting.
2.1 Pubmed and English Biomedical NLP Studies

Pubmed3 provides access to the MEDLINE database4, which contains titles, abstracts, and metadata from medical literature since the 1970s. The dataset consists of more than 34 million biomedical abstracts collected from sources such as life science publications, medical journals, and published online e-books. It is maintained and updated yearly to include more up-to-date biomedical documents.

Pubmed abstracts have been the main pretraining dataset for almost every state-of-the-art biomedical domain-specific pretrained model (Lee et al., 2019; Yuan et al., 2022; Tinn et al., 2021; Yasunaga et al., 2022; Alrowili and Shanker, 2021; Phan et al., 2021a). In addition, many well-known Biomedical NLP/NLU benchmark datasets are created based on the unlabeled Pubmed corpus (Doğan et al., 2014; Nye et al., 2018; Herrero-Zazo et al., 2013; Jin et al., 2019). Recently, to help accelerate research in biomedical NLP, Gu et al. (2020) released BLURB (Biomedical Language Understanding & Reasoning Benchmark), which consists of multiple pretrained biomedical NLP models and benchmark tasks. It is important to note that all of the top 10 models on the BLURB leaderboard5 are pretrained on the Pubmed abstract dataset.
2.2 English-Vietnamese Translation

Due to the limited amount of high-quality parallel data available, English-Vietnamese translation is classified as a low-resource translation task (Liu et al., 2020). One of the first notable parallel datasets and En-Vi neural machine translation systems is IWSLT'15 (Luong and Manning, 2015), with 133K sentence pairs. A few years later, PhoMT (Doan et al., 2021) and VLSP2020 (Ha et al., 2020) released larger parallel datasets, extracted from publicly available resources, for English-Vietnamese translation.

3https://pubmed.ncbi.nlm.nih.gov
4https://www.nlm.nih.gov/bsd/pmresources.html
5https://microsoft.github.io/BLURB/leaderboard.html
Recently, VietAI6 curated the largest En-Vi dataset to date, with 4.2M high-quality training pairs from various domains, and achieved state-of-the-art results on English-Vietnamese translation (Ngo et al., 2022). The work also focuses on En-Vi translation performance across multiple domains, including biomedicine. As a result, the project's NMT model outperforms existing En-Vi translation models (Doan et al., 2021; Fan et al., 2020) by more than 2% in BLEU score.
3 Improvements on Biomedical English-Vietnamese Translation through Self-training

To generate a large-scale synthetic translated Vietnamese biomedical corpus, we first look into improving the existing English-Vietnamese translation system in the biomedical domain. Previous work from Ngo et al. (2022) has shown that En-Vi biomedical bitexts are very rare, even for large-scale bitext mining. Therefore, we look into self-training to leverage the available monolingual English biomedical data.

The self-training approach was explored in He et al. (2019) and has been utilized to improve translation in low-resource MT systems (Chen et al., 2019a). The advantage of this method is that the source side of the monolingual corpus can be domain-specific data for translation. However, the shortcoming is that the generated targets can be of low quality and hurt machine translation performance. Therefore, we start with the English-Vietnamese machine translation model from Ngo et al. (2022), denoted bTA, which achieves state-of-the-art results on both the En-Vi biomedical and general translation domains.
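Concretely, the self-training recipe reduces to: translate domain-specific monolingual source text with the current best model, pair each source sentence with its machine-generated target, and mix the synthetic pairs into the parallel training data. Below is a minimal sketch of that data step (our paraphrase of the procedure, not the paper's code); the translate_fn callable stands in for the bTA model, and all names are illustrative:

```python
# Hedged sketch of the self-training data step: monolingual English
# biomedical sentences become synthetic En-Vi pairs via the current model.
from typing import Callable, Iterable, List, Tuple

def build_synthetic_bitext(
    sentences: Iterable[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[Tuple[str, str]]:
    """Pair each English sentence with its machine translation."""
    pairs, batch = [], []
    for sent in sentences:
        batch.append(sent)
        if len(batch) == batch_size:
            pairs.extend(zip(batch, translate_fn(batch)))
            batch = []
    if batch:  # flush the final partial batch
        pairs.extend(zip(batch, translate_fn(batch)))
    return pairs

# Tiny demo with a dummy translator; in practice translate_fn wraps bTA.
demo = build_synthetic_bitext(["An example abstract sentence."],
                              lambda b: ["(bản dịch tổng hợp)" for _ in b])
print(demo)
# The synthetic pairs are then concatenated with the real parallel data
# (MTet + PhoMT) and the translation model is re-finetuned on the mix.
```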
We use bTA to translate and generate a synthetic parallel biomedical dataset with 1M pairs of English-Vietnamese biomedical abstracts from the Pubmed corpus. The new 1M En-Vi biomedical pairs are then concatenated with the current high-quality En-Vi translation datasets from MTet (Ngo et al., 2022) and PhoMT (Doan et al., 2021), increasing the total from 6.2M to 7.2M En-Vi sentence pairs. To verify the effectiveness of our new self-training data, we re-finetune the bTA model on this 7.2M-bitext corpus. We report the model performance on the medical test set from MTet and the general test set from PhoMT in Table 1 (the translation performance on other domains like News, Religion, and Law is reported in Appendix A).
6https://vietai.org
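For reference, the kind of BLEU evaluation behind Table 1 can be reproduced with a standard implementation; the following is a minimal sketch assuming the public sacrebleu package and illustrative file names:

```python
# Hedged sketch: scoring system translations against references with BLEU.
# File names are illustrative; sacrebleu is a standard public package.
import sacrebleu

with open("medical_test.hyp.vi", encoding="utf-8") as f:  # system outputs
    hypotheses = [line.strip() for line in f]
with open("medical_test.ref.vi", encoding="utf-8") as f:  # gold references
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```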