2021; Seneviratne et al., 2022). Our work is most similar to Liu et al. (2021), who use an additional pretraining scheme that self-aligns the representation space of biomedical entities from a pretrained medical LM. They collect self-supervised synonym examples from the biomedical ontology UMLS and use a multi-similarity contrastive loss to pull the representations of similar entities closer to each other, before fine-tuning on the downstream task. However, their work differs from ours in that (1) their evaluation is limited to medical entity linking tasks and (2) they do not use hierarchical information, which has been shown to be useful in KRISSBERT (Zhang et al., 2021). In contrast to KRISSBERT, our contrastive learning selects negative samples from siblings (1-hop nodes) instead of random nodes in the graph. Our method follows the InfoMin principle that selected samples should contain as much task-relevant information as possible while discarding as much task-irrelevant information in the input as possible (Tian et al., 2020).
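To make the negative sampling concrete, the sketch below illustrates how 1-hop sibling negatives could be drawn from an ontology graph; the data structures and function name are illustrative assumptions, not our actual implementation.

```python
# Illustrative sketch of sibling (1-hop) negative sampling; the graph
# layout and function name are assumptions, not the paper's exact code.
import random

def sample_sibling_negatives(entity, parent_of, children_of, k=5):
    """Draw up to k negatives from the entity's siblings, i.e. other
    children of the same parent node in the ontology."""
    parent = parent_of[entity]
    siblings = [e for e in children_of[parent] if e != entity]
    return random.sample(siblings, min(k, len(siblings)))

# Toy UMLS-like hierarchy: two subtypes under "diabetes mellitus".
parent_of = {"type 1 diabetes": "diabetes mellitus",
             "type 2 diabetes": "diabetes mellitus"}
children_of = {"diabetes mellitus": ["type 1 diabetes", "type 2 diabetes"]}

# A sibling is a harder negative than a random node from the whole graph.
print(sample_sibling_negatives("type 1 diabetes", parent_of, children_of, k=1))
```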
2.3 ICD Coding
ICD coding uses NLP models to predict expert-labeled ICD codes given discharge summaries as input. Currently, the most straightforward method is to encode the notes with the best available language model and then use a label attention mechanism that attends the ICD codes to the input notes for prediction (Mullenbach et al., 2018). In comparison, we apply attention between codes and notes much earlier, inside the encoder, with the help of prompts.
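As a point of reference, a label attention layer in the style of Mullenbach et al. (2018) can be sketched as follows; tensor names and dimensions are illustrative, not the exact published model.

```python
# Minimal sketch of label attention; shapes are illustrative assumptions.
import torch

def label_attention(H, U):
    """H: encoded note tokens, shape (seq_len, d).
    U: one learned query per ICD code, shape (num_codes, d).
    Returns one note vector per code, shape (num_codes, d)."""
    scores = torch.softmax(U @ H.T, dim=-1)  # (num_codes, seq_len)
    return scores @ H                        # each code attends to the note

H = torch.randn(512, 768)   # token representations of a discharge summary
U = torch.randn(50, 768)    # 50 candidate ICD codes
V = label_attention(H, U)   # fed to per-code binary classifiers
```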
Label representations in the attention mechanism have played an important role in many previous works. Li and Yu (2020) and Vu et al. (2020) randomly initialize the label representations. Chen and Ren (2019); Dong et al. (2021); Zhou et al. (2021) initialize the label representations with shallow representations of the code descriptions obtained from Word2Vec (Mikolov et al., 2013). Yuan et al. (2022) further add semantic information from description synonyms. In comparison, we use deep contextual representations from a Longformer pretrained on both MIMIC and UMLS with a contrastive loss. Similar pretrained language models have been shown to be effective in previous works (Wu et al., 2020; Huang et al., 2022; DeYoung et al., 2022; Michalopoulos et al., 2022).
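A minimal sketch of deriving label representations from code descriptions with a pretrained encoder is shown below; the Hugging Face checkpoint name is a publicly available stand-in, not our contrastively pretrained model, and the first-token pooling is an assumption.

```python
# Hedged sketch: contextual label representations from code descriptions.
# The checkpoint is a public stand-in, not the paper's pretrained model.
import torch
from transformers import AutoTokenizer, AutoModel

name = "yikuan8/Clinical-Longformer"           # assumed public checkpoint
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

descriptions = ["diabetes with ketoacidosis", "congestive heart failure"]
batch = tok(descriptions, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = enc(**batch).last_hidden_state    # (num_codes, seq_len, d)
label_reprs = hidden[:, 0]                     # first-token vector per code
```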
As stated previously, the high dimensionality of the label space, such as 14,000 diagnosis codes and 3,900 procedure codes in ICD-9 and 80,000 codes in industry coding (Ziletti et al., 2022), makes ICD coding challenging. Another challenge is the long-tail distribution, in which a few codes are used frequently but most codes are used only a few times owing to the rareness of the corresponding diseases (Shi et al., 2017; Xie et al., 2019). Mottaghi et al. (2020) use active learning with extra human labeling to address this issue. Other recent works focus on using additional medical domain-specific knowledge to better understand the few training instances (Cao et al., 2020; Song et al., 2020; Lu et al., 2020; Falis et al., 2022; Wang et al., 2022b). Wu et al. (2017) perform entity linking to identify medical phrases in the document notes. Xie et al. (2019) map label codes to entities in a medical hierarchy graph.
Compared to a baseline that uses a shallow convolutional neural network to learn n-gram features from notes, they add the complex hierarchical structure between codes by letting the loss propagate through a graph convolutional neural network. In contrast with previous systems that adopt complex pipelines and different tools, our method applies a much simpler training procedure by incorporating knowledge into the language model without requiring any knowledge pre- or post-processing (e.g., MedSpacy, Gensim, NLTK) during fine-tuning. Additionally, whereas previous methods use the knowledge graph as an input source, we train our language model to treat the knowledge graph as a target with a contrastive loss.
3 Methods
ICD coding: ICD coding is a multi-label multi-class classification task. Specifically, given thousands of words from an input medical note $t$, the task is to assign a binary label $y_i \in \{0, 1\}$ for each ICD code in the label space $Y$, where 1 means that the note is positive for an ICD disease or procedure and $i \in [1, N_c]$. In this study, we define and evaluate the number of candidate codes $N_c$ as 50, although $N_c$ could be higher or lower depending on the specific application. Each candidate code has a short code description phrase $c_i$ in free text. For instance, code 250.1 has the description "diabetes with ketoacidosis". The code descriptions $c$ are the set of all $N_c$ descriptions $c_i$.
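For concreteness, a minimal sketch of this multi-label set-up with independent per-code sigmoid outputs and a binary cross-entropy loss is given below; the loss choice and decision threshold are illustrative assumptions rather than the exact training objective.

```python
# Illustrative multi-label ICD coding set-up; loss and threshold are assumptions.
import torch
import torch.nn as nn

N_c = 50                                     # number of candidate codes
logits = torch.randn(1, N_c)                 # model scores for one note
y = torch.zeros(1, N_c)
y[0, 3] = 1.0                                # e.g. code 250.1 is present
loss = nn.BCEWithLogitsLoss()(logits, y)     # one binary decision per code
pred = (torch.sigmoid(logits) > 0.5).long()  # y_i in {0, 1} for each code
```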
3.1 Encoding Text with Longformer
To solve this task, we first need to encode free text into hidden representations with a pretrained clinical Longformer. Specifically, we convert free text $a$ to a sequence of tokens $x_a$, the vocab embedding then