seem to be two different applications, they share
similarities. The role of hyperlinks is to help a
reader understand a Wiki article. Thus, concepts
that are "difficult to understand" in a Wiki article
may be more likely to have hyperlinks. Therefore,
we hypothesize that large-scale hyperlink span in-
formation from Wiki can be advantageous for our
models of medical jargon extraction. Our results
show that models trained on WikiHyperlink span
datasets indeed substantially improved the perfor-
mance of MedJEx. Moreover, we also found that
such auxiliary learning improved six out of the
eight benchmark datasets of biomedical NER tasks.
To detect outlier homonymous terms such as
“shock”, we deployed an approach inspired by
masking probing (Petroni et al.,2019), a method
for evaluating linguistic knowledge of large-scale
pre-trained language models (PLMs). Meister et al.
(2022) suggest that PLMs are useful for predicting
reading time, where longer reading time indicates
greater difficulty in understanding.
In our work, we propose a contextualized masked
language model (MLM) score feature to tackle the
homonym challenge. Note that models will recog-
nize the sense of a word or phrase using contextual
information. Since PLMs calculate the probability
of masked words in consideration of context, we
hypothesize that PLMs trained on open-domain
corpora would predict masked medical jargon
poorly if its senses are distributed differently
between open-domain and clinical-domain corpora.
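As a sketch, such a contextualized MLM score can be computed by masking each subword token of a candidate term in turn and averaging the log-probabilities the PLM assigns to the original tokens. The code below assumes the Hugging Face `transformers` library and an open-domain checkpoint (`bert-base-uncased`); the exact scoring scheme is illustrative, not the paper's formulation.

```python
# Illustrative MLM score: mask each subword of a term and average the
# log-probabilities of the original tokens under an open-domain PLM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def find_subspan(seq, sub):
    """Return the start index of the first occurrence of `sub` in `seq`."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return None


def mlm_score(sentence, term, model_name="bert-base-uncased"):
    """Average log-probability of `term`'s subword tokens, masked one
    at a time. A low score suggests the term is contextually unexpected
    to the PLM, e.g., a medical jargon term in its clinical sense."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    term_ids = tokenizer(term, add_special_tokens=False)["input_ids"]
    start = find_subspan(input_ids.tolist(), term_ids)
    if start is None:
        raise ValueError("term not found in sentence")

    log_probs = []
    for j, tok_id in enumerate(term_ids):
        masked = input_ids.clone()
        masked[start + j] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, start + j]
        log_probs.append(torch.log_softmax(logits, -1)[tok_id].item())
    return sum(log_probs) / len(log_probs)
```

Under this hypothesis, a homonym such as "shock" in a clinical sentence would receive a lower score from an open-domain PLM than from a clinically pre-trained one, flagging it as a jargon candidate.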
We conducted experiments on four state-of-the-
art PLMs, namely BERT (Devlin et al.,2019),
RoBERTa (Liu et al.,2019), BioClinicalBERT
(Alsentzer et al.,2019b) and BioBERT (Lee et al.,
2020). Experimental results show that when both
of the methods are combined, the medical jargon
extraction performance is improved by 2.44%p in
BERT, 2.42%p in RoBERTa, 1.56%p in BioClini-
calBERT, and 1.19%p in BioBERT.
Our contributions are as follows:
• We propose a novel NLP task for identifying
medical jargon terms potentially difficult for
patients to comprehend from EHR notes.
• We will release MedJ, an expert-curated
18K+ sentence dataset for the MedJEx task.
• We introduce MedJEx, a medical jargon ex-
traction model. MedJEx is first trained on the
auxiliary WikiHyperlink span dataset and then
fine-tuned on the MedJ dataset; it uses an MLM
score feature for homonym resolution.
• The experimental results show that training
on the Wiki’s hyperlink span datasets consis-
tently improved the performance of not only
MedJ but also six out of eight BioNER bench-
marks. In addition, our qualitative analyses
show that the MLM score can complement
the TF score for detecting the outlier jargon
terms.
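The two-stage training scheme in the contributions above can be sketched as follows. This toy example uses plain PyTorch with random stand-in data; the model, dimensions, and datasets are illustrative assumptions, not the paper's architecture.

```python
# Toy sketch of two-stage training: a span tagger is first trained on
# auxiliary hyperlink-span labels, and its weights then initialize the
# jargon-span fine-tuning stage. All data here is a random stand-in.
import torch
import torch.nn as nn

VOCAB, DIM, TAGS = 100, 32, 3  # tag set: B / I / O


class SpanTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, TAGS)

    def forward(self, tokens):
        return self.head(self.embed(tokens))  # (batch, seq, TAGS)


def train(model, tokens, tags, steps=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(tokens).view(-1, TAGS), tags.view(-1))
        loss.backward()
        opt.step()
    return loss.item()


torch.manual_seed(0)
# Stage 1: auxiliary Wikipedia hyperlink-span data (random stand-in).
wiki_tokens = torch.randint(0, VOCAB, (8, 16))
wiki_tags = torch.randint(0, TAGS, (8, 16))
stage1 = SpanTagger()
train(stage1, wiki_tokens, wiki_tags)

# Stage 2: initialize from stage 1, then fine-tune on MedJ-style data.
stage2 = SpanTagger()
stage2.load_state_dict(stage1.state_dict())
medj_tokens = torch.randint(0, VOCAB, (8, 16))
medj_tags = torch.randint(0, TAGS, (8, 16))
final_loss = train(stage2, medj_tokens, medj_tags)
```

In practice the tagger would be a pre-trained transformer encoder rather than a bag of embeddings, but the weight hand-off between the auxiliary task and the target task is the same.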
2 Related Work
In principle, MedJEx is related to text simplifica-
tion (Kandula et al.,2010). None of the previ-
ous work (Abrahamsson et al.,2014;Qenam et al.,
2017; Nassar et al., 2019) identified terms that are
important for comprehension.
On the other hand, MedJEx is relevant to
BioNER, a task for identifying biomedical named
entities such as disease, drug, and symptom
from medical text. There are several benchmark
corpora, including i2b2 2010 (Patrick and Li,
2010), ShARe/CLEF 2013 (Zuccon et al.,2013),
and MADE (Jagannatha et al.,2019), all of which
were developed solely based on clinical importance.
In contrast, MedJ is patient-centered, taking
patients' comprehension into consideration.
Identifying biomedical named entities from
medical documents has been
an active area of research. Earlier work, such as
MetaMap (Aronson, 2001), used linguistic patterns,
either manually constructed or learned
semi-automatically, to map free text to external
knowledge resources such as UMLS (Lindberg et al.,
1993). The benchmark corpora have promoted
supervised machine learning approaches, including
conditional random fields and deep learning
(Jagannatha et al., 2019).
Key phrase extraction in the medical domain is
another related task. It identifies important phrases
or clauses that represent topics (Hulth,2003). In
previous studies, key phrases were extracted using
features such as TF, word stickiness, and word cen-
trality (Saputra et al.,2018). Chen and Yu (2017)
proposed an unsupervised learning based method
to elicit important medical terms from EHR notes
using MetaMap (Demner-Fushman et al.,2017)
and various weighting features such as TextRank
(Mihalcea and Tarau,2004) and term familiarity
score (Zeng-Treitler et al.,2007). In another work,
Chen et al. (2017) proposed an adaptive distant su-