MedJEx: A Medical Jargon Extraction Model with Wiki’s Hyperlink
Span and Contextualized Masked Language Model Score
Sunjae Kwon1, Zonghai Yao1, Harmon S. Jordan2,
David A. Levy3, Brian Corner4, Hong Yu1,3,4,5
1UMass Amherst, 2Health Research Consultant,
3UMass Lowell, 4UMass Medical School, 5U.S. Department of Veterans Affairs
sunjaekwon@umass.edu, zonghaiyao@umass.edu, harmon.s.jordan@gmail.com,
david_levy@uml.edu, brian.corner@umassmed.edu, hong_yu@uml.edu
Abstract
This paper proposes a new natural language processing (NLP) application for identifying medical jargon terms potentially difficult for patients to comprehend from electronic health record (EHR) notes. We first present a novel and publicly available dataset with expert-annotated medical jargon terms from 18K+ EHR note sentences (MedJ). Then, we introduce a novel medical jargon extraction (MedJEx) model which has been shown to outperform existing state-of-the-art NLP models. First, MedJEx improved the overall performance when it was trained on an auxiliary Wikipedia hyperlink span dataset, where hyperlink spans provide additional Wikipedia articles to explain the spans (or terms), and then fine-tuned on the annotated MedJ data. Secondly, we found that a contextualized masked language model score was beneficial for detecting domain-specific unfamiliar jargon terms. Moreover, our results show that training on the auxiliary Wikipedia hyperlink span datasets improved six out of eight biomedical named entity recognition benchmark datasets. Both MedJ and MedJEx are publicly available.1
1 Introduction
Allowing patients to access their electronic health records (EHRs) represents a new and personalized communication channel that has the potential to improve patient involvement in care and assist communication between physicians, patients, and other healthcare providers (Baldry et al., 1986; Schillinger et al., 2009). However, studies have shown that patients do not understand medical jargon in their EHR notes (Chen et al., 2018).
To improve patients’ EHR note comprehension, it is important to identify medical jargon terms that are difficult for patients to understand. Unlike traditional concept identification or named entity recognition (NER) tasks, which mainly center on semantically salient entities, detecting such medical jargon terms takes the perspective of user comprehension into consideration. Traditional NER approaches, such as using comprehensive clinical terminological resources (e.g., the Unified Medical Language System (UMLS) (Bodenreider, 2004)), would identify terms such as “water” and “fat”, which are not considered difficult for patients to comprehend. Meanwhile, using term frequency (TF) as the proxy for medical jargon term identification will miss outliers such as “shock”, a term frequently used in the open domain with its common sense, “a sudden upsetting or surprising event or experience”, whereas EHR notes use its uncommon sense, “a medical condition caused by severe injury, pain, loss of blood, or fear that slows down the flow of blood” (Shock, 2022). Thus, “shock” should be identified as a jargon term from EHR notes since it would be difficult for patients to comprehend, even though its TF is high. In this study, we propose a natural language processing (NLP) system that can identify such outlier jargon from EHR notes through a novel method for homonym resolution.

1 Code and the pre-trained models are available at https://github.com/MozziTasteBitter/MedJEx.
We first expert-annotated de-identified EHR note sentences for medical jargon terms judged to be difficult to comprehend. This resulted in the Medical Jargon Extraction for Improving EHR Text Comprehension (MedJ) dataset, which comprises 18,178 sentences and 95,393 medical jargon terms. We then present a neural network-based medical jargon extraction (MedJEx) model to identify the jargon terms.
To ameliorate the limited training-size issue, we propose a novel transfer learning-based framework (Tan et al., 2018) utilizing auxiliary Wikipedia (Wiki) hyperlink span datasets (WikiHyperlink), where the span terms link to different Wiki articles (Mihalcea and Csomai, 2007). Although medical jargon extraction and WikiHyperlink recognition seem to be two different applications, they share similarities. The role of hyperlinks is to help a reader understand a Wiki article; thus, “difficult to understand” concepts in a Wiki article may be more likely to have hyperlinks. Therefore, we hypothesize that large-scale hyperlink span information from Wiki can be advantageous for our models of medical jargon extraction. Our results show that models trained on WikiHyperlink span datasets indeed substantially improved the performance of MedJEx. Moreover, we also found that such auxiliary learning improved six out of the eight benchmark datasets of biomedical NER tasks.
To detect outlier homonymous terms such as “shock”, we deployed an approach inspired by masking probing (Petroni et al., 2019), a method for evaluating the linguistic knowledge of large-scale pre-trained language models (PLMs). Meister et al. (2022) suggest that PLMs are beneficial for predicting reading time, where longer reading times indicate difficulty in understanding. In our work, we propose a contextualized masked language model (MLM) score feature to tackle the homonym challenge. Note that models recognize the sense of a word or phrase using contextual information. Since PLMs calculate the probability of masked words in consideration of context, we hypothesize that PLMs trained on an open-domain corpus will predict masked medical jargon poorly if senses are distributed differently between the open-domain and clinical-domain corpora.
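To make this feature concrete, the following is a minimal sketch of how such a contextualized MLM score could be computed, assuming the HuggingFace transformers library. The aggregation used here (mean negative log-probability of the term’s masked wordpieces) and all names are our illustrative choices, not the paper’s released implementation.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mlm_score(sentence: str, term: str) -> float:
    """Mean negative log-probability of the term's wordpieces when each is
    masked in context; a higher score suggests the open-domain PLM finds
    the term surprising in this (clinical) context."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    term_ids = tokenizer(term, add_special_tokens=False)["input_ids"]

    # Locate the term's wordpiece span inside the encoded sentence.
    positions = []
    for i in range(len(input_ids) - len(term_ids) + 1):
        if input_ids[i : i + len(term_ids)].tolist() == term_ids:
            positions = list(range(i, i + len(term_ids)))
            break
    if not positions:
        raise ValueError("term not found in sentence")

    nll = 0.0
    for pos in positions:
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[input_ids[pos]].item()
    return nll / len(positions)

# "shock" used with its uncommon clinical sense should score relatively high.
print(mlm_score("His course was exacerbated by his shock liver.", "shock"))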
We conducted experiments on four state-of-the-art PLMs, namely BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), BioClinicalBERT (Alsentzer et al., 2019b), and BioBERT (Lee et al., 2020). Experimental results show that when both of the methods are combined, medical jargon extraction performance is improved by 2.44%p for BERT, 2.42%p for RoBERTa, 1.56%p for BioClinicalBERT, and 1.19%p for BioBERT.
Our contributions are as follows:
• We propose a novel NLP task for identifying medical jargon terms potentially difficult for patients to comprehend from EHR notes.
• We will release MedJ, an expert-curated 18K+ sentence dataset for the MedJEx task.
• We introduce MedJEx, a medical jargon extraction model. Herein, MedJEx was first trained with the auxiliary WikiHyperlink span dataset before being fine-tuned on the MedJ dataset. It uses an MLM score feature for homonym resolution.
• The experimental results show that training on Wiki’s hyperlink span datasets consistently improved performance not only on MedJ but also on six out of eight BioNER benchmarks. In addition, our qualitative analyses show that the MLM score can complement the TF score for detecting outlier jargon terms.
2 Related Work
In principle, MedJEx is related to text simplification (Kandula et al., 2010). However, none of the previous work (Abrahamsson et al., 2014; Qenam et al., 2017; Nassar et al., 2019) identified the terms that are important for comprehension.
On the other hand, MedJEx is relevant to BioNER, a task for identifying biomedical named entities such as disease, drug, and symptom from medical text. There are several benchmark corpora, including i2b2 2010 (Patrick and Li, 2010), ShARe/CLEF 2013 (Zuccon et al., 2013), and MADE (Jagannatha et al., 2019), all of which were developed solely based on clinical importance. In contrast, MedJ is patient-centered, taking patients’ comprehension into consideration. Identifying BioNER from medical documents has been an active area of research. Earlier work, such as MetaMap (Aronson, 2001), used linguistic patterns, either manually constructed or learned semi-automatically, to map free text to external knowledge resources such as the UMLS (Lindberg et al., 1993). The benchmark corpora have promoted supervised machine learning approaches, including conditional random fields and deep learning approaches (Jagannatha et al., 2019).
Key phrase extraction in the medical domain is another related task. It identifies important phrases or clauses that represent topics (Hulth, 2003). In previous studies, key phrases were extracted using features such as TF, word stickiness, and word centrality (Saputra et al., 2018). Chen and Yu (2017) proposed an unsupervised learning-based method to elicit important medical terms from EHR notes using MetaMap (Demner-Fushman et al., 2017) and various weighting features such as TextRank (Mihalcea and Tarau, 2004) and a term familiarity score (Zeng-Treitler et al., 2007).
Figure 1: The overall architecture of MedJEx, which has three components: 1) WikiHyperlink training, 2) auxiliary feature extraction, and 3) the target model. First, in WikiHyperlink training, we extract hyperlink spans from Wikipedia articles. The examples show that hyperlink spans (blue colored) represent medical jargon while ignoring easier medical terms such as “fatigue” and “headache”. Then, the pretrained language model (LM) is trained with WikiHyperlink. In auxiliary feature extraction, the medical jargon “shock” receives a relatively high MLM score despite its high TF score, indicating that the MLM score can help detect the medical jargon. Finally, the weight parameters of the Wiki-trained LM in the target model are initialized with the trained parameters of the pretrained LM from WikiHyperlink training, and the model is fine-tuned with MedJ.
In another work, Chen et al. (2017) proposed an adaptive distant supervision-based medical term extraction approach that utilizes consumer health vocabulary (Zeng and Tse, 2006) and a heuristic rule to distantly label medical jargon training datasets. Key phrase extraction methods using large-scale pretrained models are also being actively studied (Soundarajan et al., 2021).
Unlike the previous BioNER or key phrase identification applications, identifying medical jargon terms is important for patients’ comprehension of their EHR notes and represents a novel NLP application. However, not all medical entities are unfamiliar to patients. The brute-force approach of capturing every medical entity, taken by existing BioNER and key phrase identification applications, may confuse patients. On the other hand, undetected medical jargon terms will reduce patients’ EHR note comprehension. In this paper, we propose MedJEx, a novel application that identifies medical jargon terms important for patients’ comprehension. Once jargon terms are identified, interventions such as linking the jargon terms to lay definitions can help improve comprehension.
3 Dataset Construction
This work uses two datasets: 1) MedJ for medical jargon extraction and 2) the Wiki hyperlink span (WikiHyperlink) dataset for transfer learning.
3.1 MedJ
3.1.1 Data Collection
The source of the dataset is a collection of publicly available de-identified EHR notes from hospitals affiliated with the University of Pittsburgh Medical Center. Herein, 18,178 sentences were randomly sampled, and domain experts then annotated the sentences for medical jargon.2

2 Using these data requires a license agreement.
3.1.2 Data Annotation
Domain experts read each sentence and identified as medical jargon any terms that would be difficult to comprehend for anyone with no more than a 7th-grade education.3 Overall, 96,479 medical jargon terms were annotated in compliance with the following annotation guideline.

3 The rule of thumb is that if a candidate term has a lay definition comprehensible to a 4th-7th grader, as judged by the Flesch-Kincaid Grade Level (Solnyshkina et al., 2017), the candidate term is included as a jargon term.
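For reference, the standard Flesch-Kincaid Grade Level formula (our addition; the paper only cites the metric) is:

$$\mathrm{FKGL} = 0.39\,\frac{\#\,\text{words}}{\#\,\text{sentences}} + 11.8\,\frac{\#\,\text{syllables}}{\#\,\text{words}} - 15.59$$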
Annotation Guideline The dataset was annotated for medical jargon by six domain experts from medicine, nursing, biostatistics, biochemistry, and biomedical literature curation.4 Herein, the annotators applied the following rules for identifying what was jargon:

4 The annotator agreement scores can be found in Appendix A.1.
Rule 1. Medical terms that would not be recognized by about 4th to 7th graders, or that have a different meaning in the medical context than in the lay context (homonyms), were defined. For example:
• accommodate: When the eye changes focus from far to near.
• antagonize: A drug or substance that stops the action or effect of another substance.
• resident: A doctor who has finished medical school and is receiving more training.
• formed: Stool that is solid.

Rule 2. Terms that are not strictly medical, but are frequently used in medicine, were defined. For example:
• “aberrant”, “acute”, “ammonia”, “tender”, “intact”, “negative”, “evidence”

Rule 3. When jargon words are commonly used together, or together they mean something distinct or are difficult to quickly understand from the individual parts, they were defined. For example:
• vascular surgery: Medical specialty that performs surgery on blood vessels.
• airway protection: Inserting a tube into the windpipe to keep it wide open and prevent vomit or other material from getting into the lungs.
• posterior capsule: The thin layer of tissue behind the lens of the eye. It can become cloudy and blur vision.
• right heart: The side of the heart that pumps blood from the body into the lungs.
• intracerebral hemorrhage: A stroke.

Rule 4. Terms whose definitions are widely known (e.g., by a 3rd grader) do NOT need to be defined. For example:
• “muscle”, “heart”, “pain”, “rib”, “hospital”

Rule 4.1. When in doubt, define the term. For example:
• “colon”, “immune system”
3.1.3 Data Cleaning
First, we cleaned up overlapping (e.g., tumor suppressor gene, gene deletion) or nested (e.g., vitamin D, 25-hydroxy vitamin D) jargon terms by choosing the longest term among the nested or overlapping candidates. For example, we chose “tumor suppressor gene” as a jargon term, not its nested term “tumor”. In all, MedJ contains a total of 95,393 context-dependent jargon terms, which we used as the gold standard for training and evaluating the MedJEx model. These 95,393 jargon terms represent a total of 12,383 unique jargon terms.
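The cleanup rule above can be illustrated with a short sketch; the function below is our own illustration, not the released preprocessing code, and assumes jargon spans are given as character offsets.

def keep_longest_spans(spans):
    """spans: list of (start, end) offsets, end exclusive. Among spans
    that overlap or nest, keep only the longest one."""
    result = []
    # Consider longer spans first so they win over nested/overlapping ones.
    for start, end in sorted(spans, key=lambda s: s[0] - s[1]):
        if not any(start < e and s < end for s, e in result):
            result.append((start, end))
    return sorted(result)

# "tumor suppressor gene" (0, 21) suppresses its nested term "tumor" (0, 5).
print(keep_longest_spans([(0, 21), (0, 5), (30, 43)]))  # [(0, 21), (30, 43)]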
3.2 WikiHyperlink
From a Wiki dump,5 we first cleaned and extracted the text using WikiExtractor (Attardi, 2015). Then, we extracted hyperlink spans with the BeautifulSoup (Richardson, 2007) module. Wiki articles were split into sentences with the Natural Language Toolkit (Bird et al., 2009), and the sentences were then split into tokens with the PLM tokenizer. Overall, WikiHyperlink contains more than 114M sentences, 13B words, and 99M hyperlink spans. Finally, the source data consist of sequences of input tokens and hyperlink labels represented in the standard BIOES format (Yang et al., 2018).
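As an illustration of the BIOES encoding (Begin, Inside, Outside, End, Single), the sketch below converts hyperlink spans given as token indices into per-token labels; the function and example are ours, under the assumption that spans do not overlap after preprocessing.

def to_bioes(num_tokens, spans):
    """spans: list of (start, end) token indices, end exclusive."""
    labels = ["O"] * num_tokens          # Outside by default
    for start, end in spans:
        if end - start == 1:
            labels[start] = "S"          # Single-token span
        else:
            labels[start] = "B"          # Begin
            for i in range(start + 1, end - 1):
                labels[i] = "I"          # Inside
            labels[end - 1] = "E"        # End
    return labels

# "Shock is the state of insufficient blood flow to the tissues"
# with hyperlink spans "Shock" (0, 1) and "blood flow" (6, 8):
print(to_bioes(11, [(0, 1), (6, 8)]))
# ['S', 'O', 'O', 'O', 'O', 'O', 'B', 'E', 'O', 'O', 'O']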
4 MedJEx Model
Figure 1 gives an overview of MedJEx. First, we trained PLMs with WikiHyperlink (Wiki-trained). Then, the Wiki-trained model was transferred to the proposed target model by initializing the target model with the weight parameters of the Wiki-trained model. Finally, we fine-tuned the target model with our expert-annotated dataset. Note that, since the pretraining corpora of the PLMs used in this work already include the Wiki corpus, any performance change should derive from the added labels (hyperlink spans). In addition, we extracted UMLS concepts and used them as auxiliary features.
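The transfer step described above can be sketched as follows, assuming the HuggingFace transformers API; the checkpoint path is a hypothetical placeholder rather than a released artifact.

from transformers import AutoModelForTokenClassification

# 1) Train a token-classification head on WikiHyperlink BIOES labels
#    (5 labels: B, I, O, E, S), producing a "Wiki-trained" checkpoint.
wiki_model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=5
)
# ... train on WikiHyperlink, then:
# wiki_model.save_pretrained("wiki_trained_checkpoint")  # hypothetical path

# 2) Initialize the target model from the Wiki-trained weights and
#    fine-tune it on the expert-annotated MedJ dataset.
target_model = AutoModelForTokenClassification.from_pretrained(
    "wiki_trained_checkpoint", num_labels=5
)
# ... fine-tune target_model on MedJ. The full MedJEx target model also
# adds UMLS-based features, the MLM/TF score features, and a CRF layer,
# which this sketch omits.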
4.1 Wiki’s Hyperlink Span Prediction for Transfer Learning Framework
Although MedJ is a high-quality and large-scale expert-labeled dataset, deep learning models could improve performance with additional data. However, annotation is very expensive. Transfer learning is one of the effective ways to mitigate the limited training-size issue.

5 https://dumps.wikimedia.org/enwiki/20211001/