BERT [5] has been widely used in text classification, and typically multiple texts are concatenated into one long text for modeling purposes. However, there are three limitations of directly applying the BERT model to multi-type texts. First, the input of BERT is limited to 512 tokens, and combining all texts could easily exceed this limit and lead to information loss. Second, the concatenation of multiple types of text may not be feasible, as some texts are syntactically different or contextually irrelevant, which makes the direct self-attention module of BERT unnecessary. Lastly, BERT was pre-trained on open-domain corpora and does not adapt well to EHR data in the medical field.
To overcome the above limitations for multi-type text classification, we develop a new model, KG-MTT-BERT (Knowledge Graph Enhanced Multi-Type Text BERT), to extend the BERT model to multi-type text and integrate a medical knowledge graph for domain specificity. Our model first applies the same BERT-Encoder to process each text, i.e., the encoders for all texts of one patient share the same parameters. Two levels of encoding outputs from the BERT-Encoder with different granularities, one at the text level and the other at the token level, are compared in this paper. The encodings from the input texts are concatenated together as the representation matrix, and different types of pooling layers are investigated for better information summarization. In addition, we use the knowledge graph to enhance the model's ability to handle medical domain knowledge. Finally, a classification layer is utilized to predict DRG categories.
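A minimal sketch of this parameter-shared, text-level variant of the architecture is given below, assuming a PyTorch/HuggingFace setup; the class and argument names (e.g., `MultiTypeTextClassifier`, `num_drg_classes`), the pretrained checkpoint, and the choice of max pooling are illustrative assumptions, and the knowledge-graph enhancement is omitted for brevity:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultiTypeTextClassifier(nn.Module):
    """Illustrative sketch: one shared BERT encoder over several text types,
    text-level ([CLS]) encodings stacked into a representation matrix,
    pooled, then classified. Knowledge-graph enhancement is not shown."""

    def __init__(self, num_drg_classes, bert_name="bert-base-chinese"):
        super().__init__()
        # The same encoder (shared parameters) processes every text type.
        self.encoder = BertModel.from_pretrained(bert_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_drg_classes)

    def forward(self, input_ids_list, attention_mask_list):
        # Encode each text with the same (parameter-shared) BERT encoder.
        text_vecs = []
        for input_ids, attention_mask in zip(input_ids_list, attention_mask_list):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            text_vecs.append(out.last_hidden_state[:, 0, :])  # text-level [CLS] encoding
        # Stack into a representation matrix of shape (batch, num_texts, hidden).
        rep = torch.stack(text_vecs, dim=1)
        # One possible pooling choice (max pooling) over the text dimension.
        pooled, _ = rep.max(dim=1)
        return self.classifier(pooled)  # logits over DRG categories
```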
We have conducted experiments on three DRG datasets from tertiary hospitals in China (see Appendix A.2 for the definition of hospital classification). Our model outperforms baselines and other state-of-the-art models on all datasets. The rest of the paper is organized as follows. We review related works and techniques in the Related Works Section and introduce the formulation and architecture of our model in the Method Section. Then we report the performance of our model on the three datasets and investigate hyperparameter effects and multiple ablation studies to further the understanding of the model. Lastly, we conclude our paper and point out future directions.
2 RELATED WORKS
Text classification is an important Natural Language Processing (NLP) task directly related to many real-world applications. The goal of text classification is to automatically classify text into predefined labels. The unstructured nature of free text requires transforming the text into a numerical representation for modeling. Over the past decade, text classification has shifted from shallow to deep learning methods, according to the model structures [20]. Shallow learning models focus on feature extraction and classifier selection, while deep learning models can perform automatic high-level feature engineering and fit the classifier together.
A shallow learning model first converts text into vectors using text representation methods such as Bag-of-words (BOW), N-gram, term frequency-inverse document frequency (TF-IDF) [22], Word2vec [25], and GloVe [27], and then trains a shallow model to classify, such as Naive Bayes [24], Support Vector Machine [12], Random Forest [3], XGBoost [4], and LightGBM [15]. In practice, the classifier is trained and routinely selected from the zoo of shallow models. Therefore, feature extraction and engineering is the critical step for the performance of text classification.
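As an illustration of this shallow pipeline, a minimal scikit-learn sketch combining TF-IDF features with a linear SVM might look like the following; the toy texts, labels, and n-gram setting are placeholders rather than data or settings from this paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: feature extraction (TF-IDF) followed by a shallow classifier (linear SVM).
texts = ["chest pain and shortness of breath", "fracture of the left femur"]
labels = ["cardiology", "orthopedics"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["pain in the chest"]))  # expected: ['cardiology']
```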
Various deep learning models have been proposed in recent years for text classification, which build on basic deep learning techniques like CNN, RNN, and the attention mechanism. TextCNN [16] applies a convolutional neural network to sentence classification tasks. RNNs, such as long short-term memory (LSTM) networks, are broadly used to capture long-range dependence. Zhang et al. introduce bidirectional long short-term memory networks (BLSTM) to model the sequential information from the words both before and after each word [35]. Zhou et al. integrate attention with BLSTM as Att-BLSTM to capture the most important semantic information in a text [36]. The Recurrent Convolutional Neural Network (RCNN) combines a recurrent structure that captures contextual information with a max-pooling layer that identifies the key messages among the features [18]. Therefore, RCNN leverages the advantages of both RNN and CNN. Johnson et al. develop a deep pyramid CNN (DPCNN) model that increases the depth of the CNN by repeatedly alternating a convolution block and a down-sampling layer [13]. Yang et al. design a hierarchical attention network (HAN) for document classification by aggregating important words into a sentence representation and then aggregating important sentences into a document representation [34]. The appearance of BERT, which uses the masked language model to pre-train deep bidirectional representations, is a significant turning point in the development of text classification models and other NLP technologies. However, the memory and computation cost of self-attention grows quadratically with text length, which prevents its application to long sequences. Longformer [1] and Reformer [17] are designed to address this limitation.
Domain knowledge serves a principal role in many industries. In order to integrate domain knowledge, it is necessary to embed the entities and relationships of the knowledge graph. Many knowledge graph embedding methods have been proposed, such as TransE [2], TransH [30], TransR [21], TransD [10], KG2E [7], and TranSparse [11]. Some existing works show that the knowledge graph can improve the ability of BERT, such as K-BERT [23] and E-BERT [26].
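For intuition, TransE, the earliest method in this family, embeds a triple (head, relation, tail) so that head + relation approximately equals tail, and scores a triple by the negative distance between the two sides. A minimal sketch of this scoring function follows; the toy embeddings and the medical triple in the comments are hypothetical, not from this paper:

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE scores a triple by how well head + relation approximates tail:
    a smaller ||h + r - t|| means a more plausible triple."""
    return -np.linalg.norm(h + r - t, ord=norm)

# Toy 4-dimensional embeddings for a triple such as (disease, has_symptom, symptom).
h = np.array([0.1, 0.3, -0.2, 0.5])
r = np.array([0.2, -0.1, 0.4, 0.0])
t = np.array([0.3, 0.2, 0.2, 0.5])
print(transe_score(h, r, t))
```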
Data mining of EHR [32] and biomedical data has become an increasingly popular research area, including medical code assignment, medical text classification [29], and medical event sequential learning. Several models have been developed for medical code assignment, which is a prerequisite of the DRG grouper. The Multimodal Fusion Architecture SeArch (MUFASA) model investigates multimodal fusion strategies to predict the diagnosis code from comprehensive EHR data [31]. Yan et al. design a knowledge-driven generation model for short procedure mention normalization [33]. Limited research has been performed on direct DRG prediction. Gartner et al. apply a shallow learning approach to predict the DRG from free text with complicated feature engineering, feature selection, and 9 classification techniques [6]. AMANet [8] treats DRG classification as a dual-view sequential learning task based on standardized diagnosis and procedure sequences. BioBERT [19] is a domain-specific language representation model based on BERT and pre-trained on large-scale biomedical corpora. Med-BERT [28] adapts the BERT framework for pre-training contextualized embedding models on structured ICD diagnosis codes from EHR. ClinicalBERT [9] uses clinical text to train the BERT framework for