
to employ MT systems to contribute high-quality Vietnamese datasets derived from English-language sources. To the best of our knowledge, this
is the first work utilizing state-of-the-art machine
translation to translate both self-supervised and su-
pervised learning biomedical data for pretrained
models in a low-resource language setting.
2.1 Pubmed and English Biomedical NLP Studies
Pubmed3 provides access to the MEDLINE database4, which contains titles, abstracts, and metadata from medical literature since the 1970s. The dataset consists of more than 34 million biomedical abstracts collected from sources such as life science publications, medical journals, and online e-books. It is maintained and updated yearly to include more recent biomedical documents.
Pubmed Abstract has been the main pretraining dataset for almost every state-of-the-art biomedical domain-specific pretrained model (Lee et al., 2019; Yuan et al., 2022; Tinn et al., 2021; Yasunaga et al., 2022; Alrowili and Shanker, 2021; Phan et al., 2021a). In addition, many well-known biomedical NLP/NLU benchmark datasets are built on the unlabeled Pubmed corpus (Doğan et al., 2014; Nye et al., 2018; Herrero-Zazo et al., 2013; Jin et al., 2019). Recently, to help accelerate research in biomedical NLP, Gu et al. (2020) released BLURB (Biomedical Language Understanding & Reasoning Benchmark), which consists of multiple pretrained biomedical NLP models and benchmark tasks. Notably, all of the top 10 models on the BLURB Leaderboard5 are pretrained on the Pubmed Abstract dataset.
2.2 English-Vietnamese Translation
Due to the limited availability of high-quality parallel data, English-Vietnamese is classified as a low-resource translation pair (Liu et al., 2020). One of the first notable parallel datasets for En-Vi neural machine translation is IWSLT'15 (Luong and Manning, 2015), with 133K sentence pairs. A few years later, PhoMT (Doan et al., 2021) and VLSP2020 (Ha et al., 2020) released larger parallel datasets, extracted from publicly available resources, for English-Vietnamese translation.
3https://pubmed.ncbi.nlm.nih.gov
4https://www.nlm.nih.gov/bsd/pmresources.html
5https://microsoft.github.io/BLURB/leaderboard.html
Recently, VietAI6 curated the largest high-quality En-Vi corpus to date, with 4.2M training pairs from various domains, and achieved state-of-the-art results on English-Vietnamese translation (Ngo et al., 2022). The work also focuses on En-Vi translation performance across multiple domains, including biomedical. As a result, the project's NMT model outperforms existing En-Vi translation models (Doan et al., 2021; Fan et al., 2020) by more than 2% in BLEU score.
3 Improvements on Biomedical English-Vietnamese Translation through Self-training
To generate a large-scale synthetic Vietnamese biomedical corpus, we first look into improving the existing English-Vietnamese translation system in the biomedical domain. Previous work from Ngo et al. (2022) has shown that En-Vi biomedical bitexts are very rare, even with large-scale bitext mining. Therefore, we turn to self-training to leverage the available monolingual English biomedical data.
The self-training approach has been studied in He et al. (2019) and used to improve translation in low-resource MT systems (Chen et al., 2019a). The advantage of this method is that the source side of the monolingual corpus can be domain-specific data for translation. The shortcoming, however, is that the generated targets can be of low quality and hurt machine translation performance. Therefore, to keep the synthetic targets as clean as possible, we start with the English-Vietnamese machine translation model from Ngo et al. (2022), denoted bTA, which achieves state-of-the-art results on both the En-Vi biomedical and general translation domains.
We use bTA to translate 1M English biomedical abstracts from the Pubmed corpus, generating a synthetic parallel biomedical dataset of 1M English-Vietnamese pairs. These new 1M En-Vi biomedical pairs are then concatenated with the existing high-quality En-Vi translation data from MTet (Ngo et al., 2022) and PhoMT (Doan et al., 2021), increasing the total from 6.2M to 7.2M En-Vi sentence pairs.
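A minimal sketch of this forward-translation step is shown below. It assumes the publicly released VietAI/envit5-translation checkpoint and its "en: " task prefix as an illustrative stand-in for bTA; the exact checkpoint, decoding settings, and data loading in our pipeline are simplified here.

```python
# Minimal sketch of the self-training data generation step.
# Assumptions (illustrative, not the exact bTA pipeline): the public
# VietAI/envit5-translation checkpoint and its "en: " source prefix.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "VietAI/envit5-translation"  # assumed stand-in for bTA
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate_en_to_vi(sentences, batch_size=32):
    """Forward-translate English sentences into Vietnamese."""
    translations = []
    for i in range(0, len(sentences), batch_size):
        batch = ["en: " + s for s in sentences[i:i + batch_size]]
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        outputs = model.generate(**inputs, max_length=512)
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations

# Monolingual English biomedical text (e.g., Pubmed abstract sentences).
pubmed_en = ["The patient presented with acute myocardial infarction."]

# Forward translation yields synthetic (en, vi) pairs ...
synthetic_pairs = list(zip(pubmed_en, translate_en_to_vi(pubmed_en)))

# ... which are concatenated with the gold MTet + PhoMT bitext
# (6.2M pairs in practice) before re-finetuning the translation model.
gold_pairs = []  # load the MTet and PhoMT sentence pairs here
train_pairs = gold_pairs + synthetic_pairs
```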
To verify the effectiveness of our new self-training data, we re-finetune the bTA model on this 7.2M-pair bitext corpus. We report the model performance on the medical test set from MTet and the general test set from PhoMT in Table 1 (the translation performance on other domains such as News, Religion, and Law is reported in Appendix A for
6https://vietai.org