TERMINOLOGY-AWARE MEDICAL DIALOGUE GENERATION
Chen Tang1,2*, Hongbo Zhang2*, Tyler Loakman2, Chenghua Lin2†, Frank Guerin1
1Department of Computer Science, The University of Surrey, UK
2Department of Computer Science, The University of Sheffield, UK
{chen.tang,f.guerin}@surrey.ac.uk
{hzhang183,tcloakman1,c.lin}@sheffield.ac.uk
ABSTRACT
Medical dialogue generation aims to generate responses ac-
cording to a history of dialogue turns between doctors and pa-
tients. Unlike open-domain dialogue generation, this requires
background knowledge specific to the medical domain. Exist-
ing generative frameworks for medical dialogue generation fall
short of incorporating domain-specific knowledge, especially
with regard to medical terminology. In this paper, we propose
a novel framework to improve medical dialogue generation by
considering features centered on domain-specific terminology.
We leverage an attention mechanism to incorporate termino-
logically centred features, and fill in the semantic gap between
medical background knowledge and common utterances by en-
forcing language models to learn terminology representations
with an auxiliary terminology recognition task. Experimen-
tal results demonstrate the effectiveness of our approach, in
which our proposed framework outperforms SOTA language
models. Additionally, we provide a new dataset with medi-
cal terminology annotations to support research on medical
dialogue generation. Our dataset and code are available at
https://github.com/tangg555/meddialog.
Index Terms: Dialogue Generation, Language Model, Terminology, Knowledge Enhancement, Artificial Intelligence
1. INTRODUCTION
The goal of telemedicine is to provide patients with digital
access to medical information, particularly as a first port-of-
call where access to a medical professional may be limited.
In order to better handle the growing demand for quick and
easy access to healthcare services [1, 2, 3], there is increas-
ing research on Medical Dialogue Generation, which aims to
assist telemedicine by automatically generating informative
responses given a history of dialogues between the patient
and doctor as input [4]. Such consultation dialogues typi-
cally contain a great amount of domain-specific knowledge,
such as that relating to diseases, symptoms, and treatments.
*Equal contribution.
†Corresponding author.
Fig. 1. A medical dialogue example from the provided dataset.
The phrases in red denote terminology-related expressions that
have been automatically annotated by our framework, which
include Symptoms (sticks out), Examines (physical exam),
Diseased parts (spine), and others.
Therefore, without background knowledge of relevant medical
terminologies, conventional language models tend to struggle
to understand the semantics of medical dialogues and generate
medically relevant responses [5].
From an application perspective, it is crucial for telemedicine
dialogue systems to understand the medical issues of a pa-
tient, and provide informative suggestions as a “doctor”. As
illustrated in Figure 1, the dialogues between doctors and
patients usually consist of both medically relevant and irrele-
vant expressions, e.g. “My brother said it reminded him ...”.
The medically irrelevant expressions, to some extent, can be
considered as noise, as they are prevalent in the dialogue and
may misguide a language model in understanding the medical
context, consequently biasing the model towards generating
responses that are unrelated to the medical issue of interest.
Therefore, it is important to inform the language model of the
most important aspects of the context, which is the medical
terminology (e.g. “spine sticks out”, “not bending”, “not
super skinny” in Figure 1), so that the model can learn the
distribution of important medical features.
arXiv:2210.15551v2 [cs.CL] 15 Mar 2023
To tackle medical domain-specific generation, prior works
mainly explore two technical paradigms: (1) Implement pre-
training on large-scale medical corpora, and then fine-tune
language models on the target domain so that the language
models can improve their performance on medical dialogues
[1, 2]; and (2) Inject medical knowledge into language models
with additional training targets to improve language under-
standing [6, 7, 8, 9, 10]. The second paradigm has received
increasing attention due to the effectiveness of knowledge
injection. However, existing works all employ traditional lan-
guage models (e.g. LSTMs) as the main network architecture
with limited capacity. To our knowledge, there is currently no
framework that employs and empowers large pre-trained mod-
els to incorporate medical terminology knowledge to improve
medical dialogue generation.
In order to improve the medical terminology understand-
ing of language models, we propose a novel framework that
fills in the medical knowledge gap of conventional encoders
with additional terminology-aware training. It trains the neu-
ral networks to learn the feature distribution by enforcing the
neural encoder to understand the medical terminologies from
the input. Due to the lack of available large-scale medical
dialogue corpora with annotated terminology, we develop an
automatic terminology annotation framework, as well as a
corresponding dataset for the purpose of further study (see
our repository). The experimental results show superior per-
formance on a range of metrics compared to SOTA language
models, demonstrating the effectiveness of our proposed frame-
work and the importance of medical terminology in medical
dialogue generation.
Our contributions can be summarised as follows: (I) A
novel pre-trained language model-based framework is pro-
posed for terminology-aware medical dialogue generation; (II)
An automatic terminology annotation framework and large-
scale medical dialogue corpus with annotated terminology is
provided; and (III) Extensive experiments demonstrate our
framework achieves a substantial improvement over SOTA
language models, owing to the better semantic understanding
contributed by the enhancement of terminological knowledge.
2. METHODOLOGY
Our proposed framework is illustrated in Figure 2, which
includes an automatic terminology annotation component and
a terminology-aware training component.
2.1. Task Definition
We formulate our task as follows: the given inputs are in
the form of a text sequence X = {x_1, x_2, ..., x_n}, which
consists of a patient’s question alongside the collection of prior
dialogue turns between the doctor and the patient. The goal of
the task is to generate a response Y = {y_1, y_2, ..., y_m} (as a
doctor) by modeling the conditional probability distribution
P(Y | X).
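As a concrete illustration of this setup, the dialogue history can be flattened into a single source sequence X and paired with the doctor’s next utterance as the target Y. The separator token and example utterances below are illustrative assumptions, not the paper’s exact preprocessing:

```python
# Sketch of the task formulation: the input X flattens the prior
# dialogue turns, and the target Y is the doctor's next response.

SEP = " </s> "  # hypothetical turn separator

def build_example(history, response):
    """Pair a flattened dialogue history X with the target response Y."""
    x = SEP.join(f"{speaker}: {utterance}" for speaker, utterance in history)
    return {"X": x, "Y": response}

history = [
    ("patient", "My spine sticks out and my back hurts."),
    ("doctor", "How long have you had this symptom?"),
    ("patient", "For about two months."),
]
example = build_example(history, "I recommend a physical exam of the spine.")
```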
Datasets Train Val Test
# Dialogues 221,600 12,351 12,354
# Words 36,700,836 2,041,364 2,052,748
# Terms (words) 10,240,061 569,155 572,286
Avg. # Words in Input Text 104.15 103.84 103.84
Avg. # Utterances in Input Text 1.19 1.20 1.19
Avg. # Terms in Input Text 22.04 21.98 21.98
Avg. # Words in Output Text 105.85 104.31 105.92
Avg. # Utterances in Output Text 1.55 1.23 1.56
Avg. # Terms in Output Text 24.05 22.24 24.10
Table 1. Data statistics of the Medical English Dialogue Cor-
pus with annotated medical terminologies (abbr. Terms).
2.2. Terminological Knowledge Enhancement
Terminology Representations. We employ an automatic an-
notation framework to annotate terminology-related tokens
contained in dialogues (of both X and Y) via inserting a spe-
cial token [TERM]. Take X as an example:
    X^γ = Flatten(Identify(X))                          (1)

    x_i = x_T x_i,  if x_i is a term token
          x_i,      otherwise                           (2)

where x_T denotes [TERM] and X^γ denotes the input sequence
with additional terminology annotations. Identify denotes a
distant supervision function for identifying the terminology
tokens contained in plain text. It extracts all tokens of the
input data that match the terminology word list*, and then
derives terminology phrases by assuming that adjacent tokens
hold some form of relationship and constitute a terminological
phrase. This approach largely eases the annotation process
and avoids noise introduced by errors of normalisation. The
annotated input is then flattened to be a text sequence for
encoding. E.g., “there is infection on hand” will be processed
as “there is [TERM]infection on hand”.
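A minimal sketch of the Identify/Flatten steps, assuming a toy terminology word list and whitespace tokenisation (the real framework matches against the external word list referenced in the footnote and handles tokenisation more carefully; here [TERM] is joined to its phrase with a space for readability):

```python
# Hypothetical re-implementation of Identify/Flatten from Eqs. (1)-(2):
# tag each token found in the terminology word list, merge adjacent
# matches into one phrase, and prefix each phrase with [TERM].

TERM_TOKEN = "[TERM]"
TERM_LIST = {"infection", "spine", "sticks"}  # toy word list for illustration

def identify(tokens):
    """Tag each token as a term (True) or non-term (False) by word-list lookup."""
    return [(tok, tok.lower() in TERM_LIST) for tok in tokens]

def flatten(tagged):
    """Insert [TERM] before each maximal run of adjacent term tokens."""
    out, prev_is_term = [], False
    for tok, is_term in tagged:
        if is_term and not prev_is_term:
            out.append(TERM_TOKEN)  # start of a terminological phrase
        out.append(tok)
        prev_is_term = is_term
    return out

annotated = " ".join(flatten(identify("there is infection on hand".split())))
# → "there is [TERM] infection on hand"
```

Adjacent term tokens (e.g. “spine sticks”) receive a single [TERM] marker, matching the assumption that neighbouring matches form one phrase.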
Data Preparation.
In order to acquire large-scale medical
dialogues with annotated terminology, we conduct a series of
preprocessing stages including outlier filtering and segment
truncation on a real medical dialogue dump from MedDia-
log [7]. We then leverage the external terminology word-list
for our terminology identification task, which is mainly based
on distant supervision (details in the repository). Finally, we
obtain a corpus containing nearly 250,000 doctor-patient con-
versation pairs with 3,600M tokens. We set up our experiment
based on this terminology-enhanced dataset, whose statistics
are shown in Table 1. The train/val/test are split according to
the ratio of 0.9/0.05/0.05. In addition, the dataset is fully shuf-
fled to guarantee the data distribution learned by the language
models can be utilised in validation and testing.
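The shuffle-then-split procedure described above can be sketched as follows; the fixed seed and in-memory list representation are assumptions for illustration:

```python
import random

def split_corpus(pairs, ratios=(0.9, 0.05, 0.05), seed=42):
    """Fully shuffle the corpus, then split into train/val/test by ratio."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # shuffle before splitting
    n = len(pairs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

train, val, test = split_corpus(range(1000))
```

Shuffling before splitting ensures the validation and test sets follow the same distribution the model sees during training.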
2.3. Response Generation
We employ BART [11] (encoder-decoder structure) as the
base model. The features of each input sequence are encoded
*https://github.com/glutanimate/wordlist-medicalterms-en
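Although this excerpt ends before the encoding details, the general idea of attending to terminology-marked positions can be sketched with plain scaled dot-product attention. The shapes, the mean-pooled query, and the masking scheme below are illustrative assumptions, not the paper’s exact architecture:

```python
import numpy as np

def scaled_dot_attention(q, k, v, mask=None):
    """Standard scaled dot-product attention with an optional key mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide non-term positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))            # encoder states for 5 tokens
is_term = np.array([False, False, True, False, True])  # [TERM]-marked tokens
query = hidden.mean(axis=0, keepdims=True)  # a single pooled query

# Attend only over terminology positions to build a term-centred feature.
term_feature = scaled_dot_attention(query, hidden, hidden, mask=is_term[None, :])
```

Masking the attention scores restricts the pooled feature to a convex combination of the terminology tokens’ states, which is one simple way to emphasise terminology-centred features over the noisier surrounding context.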