2021; Seneviratne et al., 2022). Our work is most similar to Liu et al. (2021), who use an additional pretraining scheme that self-aligns the representation space of biomedical entities from a pretrained medical LM. They collect self-supervised synonym examples from the biomedical ontology UMLS and use a multi-similarity contrastive loss to pull the representations of similar entities closer to each other, before fine-tuning on the downstream task. However, their work differs from ours in that (1) their evaluation is limited to medical entity linking tasks and (2) they do not use hierarchical information, which has been shown to be useful in KRISSBERT (Zhang et al., 2021). In contrast to KRISSBERT, our contrastive learning selects negative samples from siblings (1-hop nodes) instead of random nodes in the graph. Our method follows the InfoMin principle that selected samples should contain as much task-relevant information as possible while discarding as much task-irrelevant information in the input as possible (Tian et al., 2020).
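To make the negative sampling concrete, the sketch below illustrates how 1-hop sibling negatives could be drawn from an ontology graph; the data structures and function name are illustrative assumptions, not our actual implementation.

```python
# Illustrative sketch of sibling (1-hop) negative sampling; the graph
# layout and function name are assumptions, not the paper's exact code.
import random

def sample_sibling_negatives(entity, parent_of, children_of, k=5):
    """Draw up to k negatives from the entity's siblings, i.e. other
    children of the same parent node in the ontology."""
    parent = parent_of[entity]
    siblings = [e for e in children_of[parent] if e != entity]
    return random.sample(siblings, min(k, len(siblings)))

# Toy UMLS-like hierarchy: two subtypes under "diabetes mellitus".
parent_of = {"type 1 diabetes": "diabetes mellitus",
             "type 2 diabetes": "diabetes mellitus"}
children_of = {"diabetes mellitus": ["type 1 diabetes", "type 2 diabetes"]}

# A sibling is a harder negative than a random node from the whole graph.
print(sample_sibling_negatives("type 1 diabetes", parent_of, children_of, k=1))
```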
2.3 ICD Coding
ICD coding uses NLP models to predict expert-labeled ICD codes given discharge summaries as input. Currently, the most straightforward method is to encode the notes with the best available language model and then use a label attention mechanism that attends the ICD codes to the input notes for prediction (Mullenbach et al., 2018). In comparison, we apply attention between codes and notes much earlier, inside the encoder, with the help of prompts.
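As a point of reference, a label attention layer in the style of Mullenbach et al. (2018) can be sketched as follows; tensor names and dimensions are illustrative, not the exact published model.

```python
# Minimal sketch of label attention; shapes are illustrative assumptions.
import torch

def label_attention(H, U):
    """H: encoded note tokens, shape (seq_len, d).
    U: one learned query per ICD code, shape (num_codes, d).
    Returns one note vector per code, shape (num_codes, d)."""
    scores = torch.softmax(U @ H.T, dim=-1)  # (num_codes, seq_len)
    return scores @ H                        # each code attends to the note

H = torch.randn(512, 768)   # token representations of a discharge summary
U = torch.randn(50, 768)    # 50 candidate ICD codes
V = label_attention(H, U)   # fed to per-code binary classifiers
```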
Label representations in the attention mechanism have played an important role in many previous works. Li and Yu (2020) and Vu et al. (2020) randomly initialize the label representations. Chen and Ren (2019); Dong et al. (2021); Zhou et al. (2021) initialize the label representations with shallow representations of the code descriptions obtained from Word2Vec (Mikolov et al., 2013). Yuan et al. (2022) further add semantic information from description synonyms. In comparison, we use deep contextual representations from a Longformer pretrained on both MIMIC and UMLS with a contrastive loss. Similar pretrained language models have been shown to be effective in previous works (Wu et al., 2020; Huang et al., 2022; DeYoung et al., 2022; Michalopoulos et al., 2022).
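A minimal sketch of deriving label representations from code descriptions with a pretrained encoder is shown below; the Hugging Face checkpoint name is a publicly available stand-in, not our contrastively pretrained model, and the first-token pooling is an assumption.

```python
# Hedged sketch: contextual label representations from code descriptions.
# The checkpoint is a public stand-in, not the paper's pretrained model.
import torch
from transformers import AutoTokenizer, AutoModel

name = "yikuan8/Clinical-Longformer"           # assumed public checkpoint
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

descriptions = ["diabetes with ketoacidosis", "congestive heart failure"]
batch = tok(descriptions, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = enc(**batch).last_hidden_state    # (num_codes, seq_len, d)
label_reprs = hidden[:, 0]                     # first-token vector per code
```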
As stated previously, the high dimensionality of the label space, such as 14,000 diagnosis codes and 3,900 procedure codes in ICD-9 and 80,000 codes in industry coding (Ziletti et al., 2022), makes ICD coding challenging. Another challenge is the long-tail distribution, in which a few codes are used frequently but most codes are used only a few times owing to the rareness of the corresponding diseases (Shi et al., 2017; Xie et al., 2019). Mottaghi et al. (2020) use active learning with extra human labeling to address this issue. Other recent works focus on using additional medical domain-specific knowledge to better understand the few training instances (Cao et al., 2020; Song et al., 2020; Lu et al., 2020; Falis et al., 2022; Wang et al., 2022b). Wu et al. (2017) perform entity linking to identify medical phrases in the document notes. Xie et al. (2019) map label codes to entities in a medical hierarchy graph.
Compared to a baseline that uses a shallow convolutional neural network to learn n-gram features from notes, they add the complex hierarchical structure between codes by letting the loss propagate through a graph convolutional neural network. In contrast with previous systems that adopt complex pipelines and different tools, our method applies a much simpler training procedure by incorporating knowledge into the language model without requiring any knowledge pre- or post-processing (e.g., MedSpacy, Gensim, NLTK) during fine-tuning. Additionally, whereas previous methods use the knowledge graph as an input source, we train our language model to treat the knowledge graph as a target with a contrastive loss.
3 Methods
ICD coding: ICD coding is a multi-label multi-class classification task. Specifically, given thousands of words from an input medical note $t$, the task is to assign a binary label $y_i \in \{0, 1\}$ for each ICD code in the label space $Y$, where 1 means that the note is positive for an ICD disease or procedure and $i \in [1, N_c]$. In this study, we define and evaluate the number of candidate codes $N_c$ as 50, although $N_c$ could be higher or lower depending on the specific application. Each candidate code has a short code description phrase $c_i$ in free text. For instance, code 250.1 has the description "diabetes with ketoacidosis". The code descriptions $c$ are the set of all $N_c$ descriptions $c_i$.
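For concreteness, a minimal sketch of this multi-label set-up with independent per-code sigmoid outputs and a binary cross-entropy loss is given below; the loss choice and decision threshold are illustrative assumptions rather than the exact training objective.

```python
# Illustrative multi-label ICD coding set-up; loss and threshold are assumptions.
import torch
import torch.nn as nn

N_c = 50                                     # number of candidate codes
logits = torch.randn(1, N_c)                 # model scores for one note
y = torch.zeros(1, N_c)
y[0, 3] = 1.0                                # e.g. code 250.1 is present
loss = nn.BCEWithLogitsLoss()(logits, y)     # one binary decision per code
pred = (torch.sigmoid(logits) > 0.5).long()  # y_i in {0, 1} for each code
```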
3.1 Encoding Text with Longformer
To solve this task, we first need to encode free text into hidden representations with a pretrained clinical Longformer. Specifically, we convert free text $a$ to a sequence of tokens $x_a$, the vocab embedding then