
task, i.e. “wmt21-dense-24-wide-en-X” and “wmt21-dense-24-wide-X-en”, which were pre-trained on seven languages, Hausa (ha), Icelandic (is), Japanese (ja), Czech (cs), Russian (ru), Chinese (zh), and German (de), translated to and from English (en) (Tran et al., 2021). We use a carefully prepared clinical-domain corpus of 250k English-Spanish (en-es) segment pairs and demonstrate not only that successful transfer learning is possible for this explicitly new language pair, i.e. Spanish is entirely unseen among the languages of the MPLM, but also that domain knowledge transfers very successfully from the general and mixed domains to the clinical domain. In comparison to the massive MPLM (MMPLM) NLLB, which covers Spanish as a high-resource language at its pre-training stage, our transfer-learning model achieves very close evaluation scores on most sub-tasks (clinical-case and clinical-term translation) and even outperforms NLLB on the ontology-concept translation task under the COMET metric (Rei et al., 2020), using the ClinSpEn2022 test data from WMT22. This paper is a follow-up work reporting further findings based on our previous shared-task participation (Han et al., 2022).
2 Related Work
Regarding the early use of special tokens in NMT, Sennrich et al. (2016) designed the tokens T (from Latin tu) and V (from Latin vos) as familiar and polite indicators attached to the source sentences for English-to-German NMT. Yamagishi et al. (2016) designed the tokens <all-active>, <all-passive>, <reference> and <predict> to control the voice of Japanese-to-English NMT, i.e. whether the output is active, passive, reference-aware, or prediction-guided. Subsequently, Google’s MNMT system designed target-language indicator tokens, e.g. <2en> and <2jp>, to control translation into English and Japanese, respectively (Johnson et al., 2017). Google’s MNMT also designed mixed target-language translation control, e.g. (1 − α) <2ko> + α <2jp> specifies a weighted mixture of translation into Korean and Japanese.
We take one step further and use an existing language-controller token from an MPLM as a pseudo code to fine-tune translation into an external language that was entirely unseen during the pre-training stage.
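As a minimal, purely illustrative sketch of this pseudo-code idea (the specific token shown, <is>, is a hypothetical choice and not necessarily the one used in our experiments), an existing controller token can be prepended to every source segment of the new language pair before fine-tuning:

```python
# Sketch: reuse an existing target-language controller token as a pseudo code
# for an unseen target language, in the spirit of <2xx>-style tokens.
# Assumption (hypothetical): the Icelandic token "<is>" is reused for Spanish.

PSEUDO_TARGET_TOKEN = "<is>"  # existing controller token, reused for es

def add_pseudo_token(source_sentences):
    """Prefix each English source segment with the pseudo target token."""
    return [f"{PSEUDO_TARGET_TOKEN} {sent}" for sent in source_sentences]

english = ["The patient was discharged after three days."]
print(add_pseudo_token(english))
# ['<is> The patient was discharged after three days.']
```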
Regarding transfer-learning applications to downstream NLP tasks other than MT, Muller et al. (2021) applied transfer learning from MPLMs to unseen languages of different typologies for dependency parsing (DEP), named entity recognition (NER), and part-of-speech (POS) tagging. Ahuja et al. (2022) carried out zero-shot transfer learning for natural language inference (NLI) tasks such as question answering.
In this paper, we ask the following research question (RQ): Can Massive Multilingual Pre-Trained Language Models Create a Knowledge Space that Transfers to Entirely New Language (Pairs) and New (clinical) Domains for the Machine Translation Task via Fine-Tuning?
3 Model Settings
To investigate our RQ, we take Meta AI’s MNMT submission to the WMT21 shared task on news translation, i.e. the MMPLMs “wmt21-dense-24-wide-en-X” and “wmt21-dense-24-wide-X-en”, as our test base, and we name them the WMT21fb models (Tran et al., 2021)¹. They are conditional generation models with the same structure as the massive M2M-100 (Fan et al., 2021), with a total of 4.7 billion parameters, which makes fine-tuning computationally expensive. The WMT21fb models were trained on mixed-domain data using “all available resources” they had, for instance historical WMT challenges, large-scale data mining, and their in-domain back-translation. These models were then fine-tuned on the news domain for seven languages, Hausa, Icelandic, Japanese, Czech, Russian, Chinese, and German, from and to English.
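For illustration, the following sketch shows how such a checkpoint could be loaded and queried with the Hugging Face Transformers library, assuming the released weights are available on the model hub as facebook/wmt21-dense-24-wide-en-x and expose an M2M-100-style tokenizer; the forced_bos_token_id argument plays the role of the target-language controller token discussed in Section 2.

```python
# Sketch: load the WMT21fb en-X model and translate with an explicit
# target-language token (assumes the checkpoint is published on the
# Hugging Face hub as "facebook/wmt21-dense-24-wide-en-x").
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/wmt21-dense-24-wide-en-x"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # ~4.7B parameters

inputs = tokenizer("The patient was discharged after three days.",
                   return_tensors="pt")
# Force the decoder to start with the German language token; in our
# fine-tuning set-up an existing token like this can act as a pseudo code
# for the unseen Spanish target.
outputs = model.generate(**inputs,
                         forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```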
The challenging language we choose is Spanish, which did not appear in the training stages of the WMT21fb models. The fine-tuning corpus we use is extracted from the MeSpEn (Villegas et al., 2018) clinical-domain data, from which we obtained 250k English-Spanish segment pairs after data cleaning. They come from IBECS-descriptions, IBECS-titles, MedlinePlus-health_topics-titles, MedlinePlus-health_topics-descriptions,
¹https://github.com/facebookresearch/fairseq/tree/main/examples/wmt21