Investigating Massive Multilingual Pre-Trained Machine Translation Models
for Clinical Domain via Transfer Learning
Lifeng Han1, Gleb Erofeev2, Irina Sorokina2, Serge Gladkoff2, and Goran Nenadic1
1The University of Manchester, UK
2Logrus Global, Translation & Localization
{lifeng.han, g.nenadic}@manchester.ac.uk
{gleberof, irina.sorokina, serge.gladkoff}@logrusglobal.com
Abstract
Massively multilingual pre-trained language models (MMPLMs) developed in recent years have demonstrated strong performance and rich prior knowledge that they acquire for downstream tasks. This work investigates whether MMPLMs can be applied to clinical-domain machine translation (MT) for entirely unseen languages via transfer learning. We carry out an experimental investigation using Meta-AI's MMPLMs "wmt21-dense-24-wide-en-X and X-en (WMT21fb)", which were pre-trained on 7 language pairs and 14 translation directions, including English to Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese, and the opposite directions. We fine-tune these MMPLMs on the English-Spanish language pair, which did not exist at all in their original pre-training corpora, either implicitly or explicitly. We prepare carefully aligned clinical-domain data for this fine-tuning, which differs from their original mixed-domain knowledge. Our experimental results show that the fine-tuning is very successful using only 250k well-aligned in-domain EN-ES segments, evaluated on three translation sub-tasks: clinical cases, clinical terms, and ontology concepts. It achieves evaluation scores very close to those of another MMPLM from Meta-AI, NLLB, which included Spanish as a high-resource language in its pre-training. To the best of our knowledge, this is the first work to successfully apply MMPLMs to clinical-domain transfer-learning NMT for languages that are totally unseen during pre-training.
1 Introduction
Multilingual neural machine translation (MNMT) has its roots at the beginning of the NMT era (Dong et al., 2015; Firat et al., 2016), but reached its first milestone with Google's end-to-end MNMT (Johnson et al., 2017), which introduced, for the first time for the translation task, an artificial token at the beginning of the input source sentence to indicate the desired target language, e.g. "2en" for translating into English. This model used a shared word-piece vocabulary and enabled multilingual NMT by training a single encoder-decoder model. Google's MNMT also demonstrated the possibility of "zero-shot" translation, as long as the languages to be translated from or into were seen during the training stage, even if not as an explicit pair. However, as the authors mention, Google's MNMT only allows translation between languages that have been seen individually as "source and target languages during some point, not for entirely new ones" in their many-to-many model, which was tested using the WMT14 and WMT15 data (Johnson et al., 2017). This is an obstacle to translating completely new languages that do not exist in the pre-training stage. Subsequently, using the later-developed Transformer architecture and BERT (Vaswani et al., 2017; Devlin et al., 2019), Facebook AI extended the coverage of multilingual translation to 50, 100, and 200+ languages via the mBART-50 (Tang et al., 2020), M2M-100 (Fan et al., 2021), and NLLB (NLLB Team et al., 2022) models. However, these models still do not address the issue of translating entirely new languages that do not exist in their pre-training stage, which remains an obstacle for MT applications serving an even broader community.
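As a concrete illustration of the artificial target-language token described above, the short sketch below (a toy, not taken from any of the cited systems) shows how such a token is simply prepended to the source text before encoding; the "<2xx>" token format is used here purely for illustration.

# Toy illustration of the Johnson et al. (2017)-style artificial token:
# the desired target language is prepended to the source sentence, and a
# single shared encoder-decoder learns to route the translation accordingly.

def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial target-language token such as '<2en>'."""
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("Hola, ¿cómo estás?", "en"))   # -> "<2en> Hola, ¿cómo estás?"
print(add_target_token("Hello, how are you?", "ja"))  # -> "<2ja> Hello, how are you?"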
In this work, we move one step forward towards domain-specific transfer learning (Zoph et al., 2016) for NMT by fine-tuning on an entirely new language pair that does not exist in the deployed multilingual pre-trained language models (MPLMs). The MPLMs we use are from Facebook AI's (Meta-AI's) submission to the WMT21 news translation
task, i.e. "wmt21-dense-24-wide-en-X" and "wmt21-dense-24-wide-X-en", which were pre-trained on 7 languages, Hausa (ha), Icelandic (is), Japanese (ja), Czech (cs), Russian (ru), Chinese (zh), and German (de), to English (en) and in the reverse direction (Tran et al., 2021). We use a carefully prepared 250k-pair English-Spanish (en-es) clinical-domain corpus and demonstrate not only that successful transfer learning is possible on this explicitly new language pair, i.e. Spanish is totally unseen among the languages in the MPLM, but also that the knowledge transfer from the general, mixed domain to the clinical domain is very successful. In comparison to the massively MPLM (MMPLM) NLLB, which covers Spanish as a high-resource language in its pre-training stage, our transfer-learning model achieves very close evaluation scores on most sub-tasks (clinical cases and clinical terms translation) and even outperforms NLLB on the ontology concept translation task under the COMET metric (Rei et al., 2020), using the ClinSpEn2022 test data at WMT22. This is a follow-up work reporting further findings based on our previous shared task participation (Han et al., 2022).
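The COMET comparison mentioned above can in principle be reproduced with the unbabel-comet package; the following is a minimal sketch, where the checkpoint name ("Unbabel/wmt20-comet-da") and the example clinical sentence are assumptions for illustration rather than the exact ClinSpEn2022 evaluation setup.

from comet import download_model, load_from_checkpoint

# Download a reference-based COMET checkpoint (name is an assumption; the
# shared task may have used a different COMET release).
ckpt_path = download_model("Unbabel/wmt20-comet-da")
comet_model = load_from_checkpoint(ckpt_path)

# Each item needs the source, the MT hypothesis, and the human reference.
data = [{
    "src": "El paciente presenta fiebre y tos persistente.",
    "mt":  "The patient presents fever and persistent cough.",
    "ref": "The patient has a fever and a persistent cough.",
}]
output = comet_model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score; output.scores holds per-segment scores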
2 Related Work
Regarding the early use of special tokens in NMT, Sennrich et al. (2016) designed the tokens T (from Latin Tu) and V (from Latin Vos) as familiar and polite politeness indicators attached to the source sentences for English-to-German NMT. Yamagishi et al. (2016) designed the tokens <all-active>, <all-passive>, <reference> and <predict> to control the voice of Japanese-to-English NMT, i.e. whether the outputs are active, passive, reference-aware or prediction-guided.
Subsequently, Google's MNMT system designed target-language indicator tokens, e.g. <2en> and <2jp>, controlling translation into English and Japanese respectively (Johnson et al., 2017). Google's MNMT also designed mixed target-language translation control, e.g. (1-α)<2ko> + α<2jp> requests a translation mixing Korean and Japanese under a weighting mechanism. We take one step further and use an existing language controller token from an MPLM as a pseudo code to fine-tune the model towards an external language that was entirely unseen during the pre-training stage.
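As a toy, framework-free illustration of this pseudo-code idea (not our exact recipe; the reused code "de" and the token format are assumptions for illustration), a fine-tuning example for the unseen English-Spanish pair could be packaged behind a controller token that the pre-trained model has already seen:

# Toy sketch: the model only has controller tokens for languages seen in
# pre-training, so an unseen target language (Spanish) is placed behind one
# of the existing codes during fine-tuning. The token format and the choice
# of "de" as the placeholder are illustrative assumptions.

SEEN_TARGET_CODES = {"cs", "de", "ha", "is", "ja", "ru", "zh"}

def make_finetuning_example(src_en: str, tgt_es: str, pseudo_code: str = "de") -> dict:
    assert pseudo_code in SEEN_TARGET_CODES, "must reuse a code seen in pre-training"
    # The controller token claims "translate into <pseudo_code>", while the
    # reference the model is actually trained to generate is Spanish.
    return {"source": f"<2{pseudo_code}> {src_en}", "target": tgt_es}

print(make_finetuning_example(
    "The patient was discharged after two days.",
    "El paciente fue dado de alta después de dos días."))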
Regarding transfer-learning applications for
downstream NLP tasks other than MT, Muller
et al. (2021) applied transfer learning from
MPLMs towards unseen languages of different
typologies on dependency parsing (DEP),
named entity recognition (NER), and part-
of-speech (POS) tagging. Ahuja et al. (2022)
carried out zero-shot transfer learning for
natural language inference (NLI) tasks such
as question answering.
In this paper, we ask the following research question (RQ): Can Massively Multilingual Pre-Trained Language Models Create a Knowledge Space that Transfers to Entirely New Language (Pairs) and New (Clinical) Domains for the Machine Translation Task via Fine-Tuning?
3 Model Settings
To investigate our RQ, we take Meta-AI's MNMT submission to the WMT21 shared task on news translation, i.e. the MMPLMs "wmt21-dense-24-wide-en-X" and "wmt21-dense-24-wide-X-en", as our test base, and we name them the WMT21fb models (Tran et al., 2021)1. They are conditional generation models with the same structure as the massive M2M-100 (Fan et al., 2021), with a total of 4.7 billion parameters, which demands a high computational cost for fine-tuning. The WMT21fb models were trained on mixed-domain data using "all available resources" they had, for instance from historical WMT challenges, large-scale data mining, and their in-domain back-translation. These models were then fine-tuned on news-domain data for 7 languages, Hausa, Icelandic, Japanese, Czech, Russian, Chinese, and German, from and to English.
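Assuming the WMT21fb checkpoints correspond to the publicly released HuggingFace models (e.g. "facebook/wmt21-dense-24-wide-en-x" with an M2M100-style tokenizer), loading one of them and generating a translation into a seen language could look roughly as follows; the checkpoint name, tokenizer interface and example sentence are assumptions based on the public release rather than details from this paper.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed public checkpoint of the en->X WMT21fb model (~4.7B parameters,
# so in practice this needs a large-memory GPU or offloading).
name = "facebook/wmt21-dense-24-wide-en-x"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name, torch_dtype=torch.float16)

inputs = tokenizer("The patient reports chest pain on exertion.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(out, skip_special_tokens=True))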
The challenging language we choose is Spanish, which did not appear in the training stage of the WMT21fb models. The fine-tuning corpus we use is extracted from the MeSpEn (Villegas et al., 2018) clinical-domain data, from which we managed to extract 250k pairs of English-Spanish segments after data cleaning. They come from IBECS-descriptions, IBECS-titles, MedlinePlus-health_topics-titles, MedlinePlus-health_topics-descriptions,
1 https://github.com/facebookresearch/fairseq/tree/main/examples/wmt21
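For orientation, the clinical-domain fine-tuning on such 250k EN-ES segments could be set up along the following lines with the HuggingFace Trainer API; this is a hedged sketch only, and the data file name, hyper-parameters, and the reuse of the seen code "de" as a pseudo target-language code are illustrative assumptions, not the configuration reported in this paper.

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

name = "facebook/wmt21-dense-24-wide-en-x"   # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Hypothetical JSONL file with one {"en": ..., "es": ...} pair per line,
# extracted and cleaned from the MeSpEn sub-corpora.
raw = load_dataset("json", data_files={"train": "mespen_en_es_250k.jsonl"})

def preprocess(batch):
    tokenizer.src_lang = "en"   # English source, as in pre-training
    tokenizer.tgt_lang = "de"   # reuse a *seen* code as the pseudo target code
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=256)
    labels = tokenizer(text_target=batch["es"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_set = raw["train"].map(preprocess, batched=True, remove_columns=["en", "es"])

args = Seq2SeqTrainingArguments(
    output_dir="wmt21fb-en-es-clinical",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    fp16=True,                  # assumes a CUDA GPU
    logging_steps=100,
    save_steps=5000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()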