Exploring the Value of Pre-trained Language Models
for Clinical Named Entity Recognition
Samuel Belkadi*, Lifeng Han*, Yuping Wu, and Goran Nenadic
Department of Computer Science, The University of Manchester, UK
*co-first authors
samuel.belkadi@student.manchester.ac.uk
lifeng.han, yuping.wu, g.nenadic@manchester.ac.uk
Abstract
The practice of fine-tuning Pre-trained Language Models (PLMs), built from general or domain-specific data, for a specific task with limited resources has gained popularity within the field of natural language processing (NLP). In this work, we revisit the assumption underlying this practice and carry out an investigation in clinical NLP, specifically Named Entity Recognition on drugs and their related attributes. We compare Transformer models that are trained from scratch to fine-tuned BERT-based LLMs, namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional CRF layer on such models to encourage contextual learning. We use the n2c2-2018 shared task data for model development and evaluation. The experimental outcomes show that: 1) CRF layers improved all language models; 2) under BIO-strict span-level evaluation using the macro-average F1 score, although the fine-tuned LLMs achieved 0.83+ scores, the TransformerCRF model trained from scratch achieved 0.78+, demonstrating comparable performance at a much lower cost, e.g. with 39.80% fewer training parameters; 3) under BIO-strict span-level evaluation using the weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF exhibited smaller score differences, reaching 97.59%, 97.44%, and 96.84%, respectively; 4) applying efficient training by down-sampling for a better data distribution further reduced the training cost and the need for data, while maintaining similar scores, i.e. around 0.02 points lower than using the full dataset. Our models will be hosted at https://github.com/HECTA-UoM/TransformerCRF.
1 Introduction
Fine-tuning Pre-trained Language Models (PLMs) has demonstrated state-of-the-art abilities in solving natural language processing tasks, including text mining (Zhang et al., 2021), named entity recognition (Dernoncourt et al., 2017), reading comprehension (Sun et al., 2020), machine translation (Vaswani et al., 2017; Devlin et al., 2019), and summarisation (Gokhan et al., 2021; Wu et al., 2022). Domain applications of PLMs have spanned a much wider variety, including financial, legal, and biomedical texts, in addition to the traditional news and social media domains. For instance, experimental work on BioBERT (Lee et al., 2019) and BioMedBERT (Chakraborty et al., 2020) showcased high evaluation scores by exploiting BERT's (Devlin et al., 2019) structure to train on biomedical data. Additionally, fine-tuned SciFive, BioGPT, and BART models produced reasonable experimental outputs on biomedical abstract simplification tasks (Li et al., 2023).
However, ongoing investigations try to understand the extent to which fine-tuning PLMs improves performance over language models trained from scratch on domain-specific tasks (Gu et al., 2021). Researchers often assume that fine-tuning is particularly helpful when dealing with tasks that have limited available data, where PLMs can leverage additional knowledge acquired from extensive out-of-domain or domain-related data. Therefore, an important question arises: given a domain-specific task, how limited should the available data be for mixed-domain pre-training to be considered beneficial? Surprisingly, no previous studies have provided statistics in this regard.
In this paper, our focus is on clinical domain text mining, and our objective is to examine the aforementioned hypothesis. Specifically, we aim to determine whether PLMs outperform models trained from scratch when given access to limited data in a constrained setting, and to what extent this improvement occurs.
In comparison to other domains, clinical text mining (CTM) is still considered a relatively new task for PLM applications, as CTM is well known for data-scarcity issues caused by the small amount of human-annotated corpora and by privacy concerns. In this work, we fine-tune PLMs from the general domain (BERT), the biomedical domain (BioBERT), and the clinical domain (ClinicalBERT), examining how well they perform on a clinical information extraction task, namely the recognition of drugs and drug-related attributes, using the n2c2-2018 shared task data via adaptation and fine-tuning. We then compare their results with those of a lightweight Transformer model trained from scratch, and further investigate the impact of an additional CRF layer on the deployed models.
Section 2 gives more details on related work, Section 3 introduces the methodologies for our investigation, Section 4 describes our data pre-processing and experimental setups, Section 5 presents the evaluation results and ablation studies, and Section 6 further discusses data-constrained training, looking back at the n2c2-2018 shared tasks; finally, Section 7 concludes this paper and suggests directions for future work. Readers can refer to the Appendix for more details on the experimental analysis and relevant findings.
2 Related Work
The integration of pre-trained language models into applications within the biomedical and clinical domains has emerged as a prominent trend in recent years. A significant contribution to this field is BioBERT (Lee et al., 2019), which was among the first to explore the advantages of training a BERT-based model on domain-specific data, i.e. biomedical text. BioBERT demonstrated that training BERT on PubMed abstracts and PubMed Central (PMC) full-text articles resulted in superior performance on Named Entity Recognition (NER) and Relation Extraction (RE) tasks within the biomedical domain.
However, since BioBERT was pre-trained on general-domain data such as Wikipedia and BooksCorpus (Zhu et al., 2015) and then continually trained on biomedical data, PubMedBERT (Gu et al., 2021) further examined the advantages of training a model from scratch solely on biomedical data, employing the same PubMed data as BioBERT to avoid the influence of mixed domains. This choice was motivated by the observation that word distributions from different domains are represented differently in their respective vocabularies. Furthermore, PubMedBERT created a new benchmark dataset named BLURB, covering more tasks than BioBERT and including entity types such as disease, drug, gene, organ, and cell.
PubMedBERT and BioBERT both focused on biomedical knowledge, leaving other closely related domains, such as the clinical one, for future exploration. Subsequently, Alsentzer et al. (2019) demonstrated that ClinicalBERT, trained on generic clinical text and discharge summaries, exhibited superior performance on medical language inference (i2b2-2010 and 2012) and de-identification tasks (i2b2-2006 and 2014). Similarly, Huang et al. (2019) found that a ClinicalBERT model trained on clinical notes achieved improved predictive performance for hospital readmission after fine-tuning on this specific task.
In our work on the clinical domain, we use the n2c2-2018 shared task corpus, which provides electronic health records (EHRs) as semi-structured letters: their headings specify drug names, patient names, doses, relations, etc., while the body describes the diagnoses and treatment as free text. We aim to examine how fine-tuned PLMs perform against domain-specific Transformers trained from scratch on biomedical and clinical text mining.
Regarding the use of Transformer models for text mining, Wu et al. (2021) implemented the Transformer structure with an adaptation layer for information and communication technology (ICT) entity extraction. Al-Qurishi and Souissi (2021) proposed adding a CRF layer on top of the BERT model to carry out Arabic NER on mixed-domain data, such as news and magazines. Yan et al. (2019) demonstrated that the Encoder-only Transformer could improve previous results on traditional NER tasks in comparison to BiLSTMs. Other related works include Zhang and Wang (2019), Gan et al. (2021), Zheng et al. (2021), and Wang and Su (2022), which applied Transformer and CRF models to spoken language understanding, Chinese NER, power-meter NER, and forest disease texts, respectively.
3 Methodology and Experimental Designs
Figure 1 displays the design of our investigation, which includes the pre-trained LLMs BERT (Devlin et al., 2019), BioBERT (Lee et al., 2019), and ClinicalBERT (Alsentzer et al., 2019), in addition to an Encoder-only Transformer (Vaswani et al., 2017) implementing the “distilbert-base-cased” structure and trained from scratch.
Figure 1: Model Designs upon Investigations
The first step is to adapt these models to Named Entity Recognition by adding an Adaptation (or Classification) layer, resulting in the following models: BERT-Apt, BioBERT-Apt, ClinicalBERT-Apt, and Transformer-Apt. This adaptation layer predicts a probability distribution over all labels for each token independently.
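Concretely, such an adaptation layer can be realised as a linear classification head over the encoder's final hidden states. The following is a minimal sketch, assuming PyTorch and the HuggingFace transformers library; the class name, dropout rate, and other details are illustrative choices rather than the released implementation.

```python
import torch.nn as nn
from transformers import AutoModel


class TokenClassifierApt(nn.Module):
    """PLM encoder plus an adaptation (classification) layer that scores each token independently.

    Hypothetical sketch: names and hyper-parameters are ours, not the paper's released code.
    """

    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # e.g. "bert-base-cased"
        hidden_size = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(0.1)                         # illustrative value
        self.adaptation = nn.Linear(hidden_size, num_labels)   # per-token label scores

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                    # (batch, seq_len, hidden)
        logits = self.adaptation(self.dropout(hidden_states))  # (batch, seq_len, num_labels)
        return logits  # a softmax / cross-entropy loss is applied per token outside the model
```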
We then compare the results of the above models with those of the same models augmented with an additional Conditional Random Field (CRF) layer (Lafferty et al., 2001), obtaining the BERT-CRF, BioBERT-CRF, ClinicalBERT-CRF, and Transformer-CRF models. Instead of predicting each label in the sequence independently, the CRF layer takes the neighbouring tokens and their corresponding labels into account when predicting the label of the token under study.
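For illustration, a CRF head of this kind combines per-token emission scores with learned label-transition scores and decodes the whole label sequence jointly. The sketch below assumes the open-source pytorch-crf package; the class name and interface are ours, and the paper's Transformer-CRF implementation may differ in detail.

```python
import torch.nn as nn
from torchcrf import CRF  # pytorch-crf package (assumed here; any linear-chain CRF would do)


class CRFHead(nn.Module):
    """Hypothetical CRF head over encoder hidden states: emissions + label-transition scores."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.emission = nn.Linear(hidden_size, num_labels)  # token-level (emission) scores
        self.crf = CRF(num_labels, batch_first=True)        # learns label-transition scores

    def loss(self, hidden_states, labels, mask):
        # Negative log-likelihood of the gold label sequence under the CRF
        emissions = self.emission(hidden_states)
        return -self.crf(emissions, labels, mask=mask, reduction='mean')

    def decode(self, hidden_states, mask):
        # Viterbi decoding: the most probable label sequence given emissions and transitions
        emissions = self.emission(hidden_states)
        return self.crf.decode(emissions, mask=mask)
```

At training time the negative log-likelihood of the gold label sequence is minimised; at inference time Viterbi decoding returns the jointly most probable label sequence rather than per-token argmax labels.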
4 Data Pre-processing and Experimental Setups
In this section, we introduce the n2c2-2018 corpus we utilise for model training and evaluation, as well as the model optimisation strategies, efficient training, and evaluation metrics.
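For reference, the BIO-strict span-level macro- and weighted-average F1 scores quoted in the abstract can be computed roughly as sketched below; this assumes the open-source seqeval library and uses toy label sequences, not the paper's actual evaluation script or entity inventory.

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy gold and predicted BIO label sequences for two example sentences
y_true = [['B-Drug', 'I-Drug', 'O', 'B-Dosage'], ['O', 'B-Drug', 'O']]
y_pred = [['B-Drug', 'I-Drug', 'O', 'O'],        ['O', 'B-Drug', 'O']]

# Strict span-level matching: a predicted entity counts only if both its
# boundaries and its type match the gold annotation exactly.
macro_f1 = f1_score(y_true, y_pred, average='macro', mode='strict', scheme=IOB2)
weighted_f1 = f1_score(y_true, y_pred, average='weighted', mode='strict', scheme=IOB2)
print(f"macro F1: {macro_f1:.4f}  weighted F1: {weighted_f1:.4f}")
```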
4.1 Corpus and Model Setting
Regarding the dataset, we utilise the standard n2c2-2018 shared task data from Track-2 (Henry et al., 2020): Adverse Drug Events and Medication Extraction in Electronic Health Records (EHRs) [1]. We note that the World Health Organisation (WHO) [2] defines an ADE as “an injury resulting from medical intervention related to a drug”, while the Patient Safety Network (PSNet) defines it as “harm experienced by a patient as a result of exposure to a medication” [3]. The aim of this task is to investigate whether “NLP systems can automatically discover drug-to-adverse event relations in clinical narratives”. The three sub-tasks under this track are Concepts, Relations, and End-to-End. Among these, the first task is to identify drug names, dosage, duration, and other entities; the second task is to identify relations of drugs with adverse drug events (ADEs) and other entities, given gold-standard entities; finally, the third task is identical to the second one, but involves entities that have been predicted by systems. In total, this track provides 505 annotated files of discharge summaries from the Medical Information Mart for Intensive Care III (MIMIC-III) clinical care database (Johnson et al., 2016), for which annotation was carried out by four physician as-
[1] https://portal.dbmi.hms.harvard.edu/projects/n2c2-2018-t2/
[2] https://www.who.int/
[3] https://psnet.ahrq.gov/primer/medication-errors-and-adverse-drug-events