
training on large-scale medical corpora, and then fine-tune
language models on the target domain so that the language
models can improve their performance on medical dialogues
[1, 2]; and (2) Inject medical knowledge into language models
with additional training targets to improve language under-
standing [6, 7, 8, 9, 10]. The second paradigm has received
increasing attention due to the effectiveness of knowledge
injection. However, existing works all employ traditional lan-
guage models (e.g. LSTMs) as the main network architecture
with limited capacity. To our knowledge, there is currently no
framework that employs and empowers large pre-trained mod-
els to incorporate medical terminology knowledge to improve
medical dialogue generation.
In order to improve the medical terminology understand-
ing of language models, we propose a novel framework that
fills in the medical knowledge gap of conventional encoders
with additional terminology-aware training. It trains the neu-
ral networks to learn the feature distribution by enforcing the
neural encoder to understand the medical terminologies from
the input. Due to the lack of available large-scale medical
dialogue corpora with annotated terminology, we develop an
automatic terminology annotation framework, as well as a
corresponding dataset for the purpose of further study (see
our repository). The experimental results show superior per-
formance on a range of metrics compared to SOTA language
models, demonstrating the effectiveness of our proposed frame-
work and the importance of medical terminology in medical
dialogue generation.
Our contributions can be summarised as follows: (I) A
novel pre-trained language model-based framework is pro-
posed for terminology-aware medical dialogue generation; (II)
An automatic terminology annotation framework and large-
scale medical dialogue corpus with annotated terminology is
provided; and (III) Extensive experiments demonstrate our
framework achieves a substantial improvement over SOTA
language models, owing to the better semantic understanding
contributed by the enhancement of terminological knowledge.
2. METHODOLOGY
Our proposed framework is illustrated in Figure 2, which
includes an automatic terminology annotation component and
a terminology-aware training component.
2.1. Task Definition
We formulate our task as follows: the given input is a text
sequence X = {x_1, x_2, ..., x_n}, which consists of a patient’s
question alongside the collection of prior dialogue turns
between the doctor and the patient. The goal of the task is to
generate a response Y = {y_1, y_2, ..., y_m} (as a doctor) by
modeling the conditional probability distribution P(Y|X).
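The conditional distribution P(Y|X) is typically factorised autoregressively over response tokens, log P(Y|X) = Σ_t log P(y_t | y_<t, X). A minimal sketch of this factorisation (the helper name and toy probabilities are ours, not the paper's):

```python
import math

def sequence_log_prob(token_log_probs):
    """Log P(Y|X) under the autoregressive factorisation:
    log P(Y|X) = sum_t log P(y_t | y_<t, X)."""
    return sum(token_log_probs)

# Toy example: per-token conditional probabilities for a 3-token response.
probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob([math.log(p) for p in probs])
# exp(log_p) = 0.5 * 0.25 * 0.8 = 0.1
```

During training this quantity is maximised (equivalently, the negative log-likelihood is minimised) over doctor responses in the corpus.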
Datasets                            Train        Val        Test
# Dialogues                         221,600      12,351     12,354
# Words                             36,700,836   2,041,364  2,052,748
# Terms (words)                     10,240,061   569,155    572,286
Avg. # Words in Input Text          104.15       103.84     103.84
Avg. # Utterances in Input Text     1.19         1.20       1.19
Avg. # Terms in Input Text          22.04        21.98      21.98
Avg. # Words in Output Text         105.85       104.31     105.92
Avg. # Utterances in Output Text    1.55         1.23       1.56
Avg. # Terms in Output Text         24.05        22.24      24.10

Table 1. Data statistics of the Medical English Dialogue Cor-
pus with annotated medical terminologies (abbr. Terms).
2.2. Terminological Knowledge Enhancement
Terminology Representations.
We employ an automatic an-
notation framework to annotate terminology-related tokens
contained in dialogues (of both X and Y) via inserting a
special token [TERM]. Taking X as an example:

    X^γ = Flatten(Identify(X))                        (1)

    x_i = { x^T x_i,  if x_i is a term
          { x_i,      otherwise                       (2)
where x^T denotes [TERM] and X^γ denotes the input sequence
with additional terminology annotations. Identify denotes a
distant supervision function for identifying the terminology
tokens contained in plain text. It extracts all tokens of the
input data that match the terminology word list*, and then
derives terminology phrases by assuming that adjacent tokens
hold some form of relationship and constitute a terminological
phrase. This approach largely eases the annotation process
and avoids noise introduced by errors of normalisation. The
annotated input is then flattened to be a text sequence for
encoding. E.g., “there is infection on hand” will be processed
as “there is [TERM] infection on hand”.
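The Identify and Flatten steps of Eqs. (1)–(2) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the word list is a toy stand-in for the external medical terminology list, and we follow the paper's assumption that adjacent matched tokens form one terminological phrase (so [TERM] is inserted once per phrase).

```python
TERM_TOKEN = "[TERM]"
WORD_LIST = {"infection"}  # toy stand-in for the medical terminology word list

def identify(tokens, word_list):
    """Distant supervision: mark each token found in the word list."""
    return [(tok, tok.lower() in word_list) for tok in tokens]

def flatten(annotated):
    """Insert [TERM] before each maximal run of matched tokens,
    then join back into a plain text sequence for encoding."""
    out, prev_is_term = [], False
    for tok, is_term in annotated:
        if is_term and not prev_is_term:
            out.append(TERM_TOKEN)
        out.append(tok)
        prev_is_term = is_term
    return " ".join(out)

annotated = flatten(identify("there is infection on hand".split(), WORD_LIST))
# → "there is [TERM] infection on hand"
```

In this sketch the [TERM] marker is kept as a separate token; in practice it would be registered as a special token in the model's tokenizer.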
Data Preparation.
In order to acquire large-scale medical
dialogues with annotated terminology, we conduct a series of
preprocessing stages including outlier filtering and segment
truncation on a real medical dialogue dump from MedDia-
log [7]. We then leverage the external terminology word-list
for our terminology identification task, which is mainly based
on distant supervision (details in the repository). Finally, we
obtain a corpus containing nearly 250,000 doctor-patient con-
versation pairs with 3,600M tokens. We set up our experiment
based on this terminology-enhanced dataset, whose statistics
are shown in Table 1. The train/val/test are split according to
the ratio of 0.9/0.05/0.05. In addition, the dataset is fully shuf-
fled to guarantee the data distribution learned by the language
models can be utilised in validation and testing.
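The shuffle-then-split procedure above can be sketched as follows (a minimal illustration; the seed and function name are ours):

```python
import random

def split_dataset(pairs, ratios=(0.9, 0.05, 0.05), seed=42):
    """Fully shuffle the corpus, then slice it into
    train/val/test according to the 0.9/0.05/0.05 ratio."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # shuffle before splitting
    n = len(pairs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
# → 900 / 50 / 50 examples, with every example in exactly one split
```

Shuffling before splitting prevents any ordering in the raw dump (e.g. by topic or date) from biasing one split relative to the others.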
2.3. Response Generation
We employ BART [11] (encoder-decoder structure) as the
base model. The features of each input sequence are encoded
* https://github.com/glutanimate/wordlist-medicalterms-en