CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models
Steven Y. Feng1, Vivek Khetan2, Bogdan Sacaleanu2, Anatole Gershman3, Eduard Hovy3
1Stanford University, 2Accenture Labs, SF, 3Carnegie Mellon University
syfeng@stanford.edu
{vivek.a.khetan,bogdan.e.sacaleanu}@accenture.com
{anatoleg,hovy}@cs.cmu.edu
Abstract
We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.
1 Introduction
Pretrained language models (PLMs) have seen increasing popularity for NLP tasks and applications, including text generation. Researchers have become interested in the extent to which PLMs can: 1) act as knowledge bases, and 2) reason like humans.

Rather than using external databases, exposure to large amounts of data during training, combined with their large number of parameters, has given PLMs the ability to store knowledge that can be extracted through effective probing strategies such as text infilling (Donahue et al., 2020), prompting (Liu et al., 2021), and QA (Jiang et al., 2021). PLMs imitate a more high-level information store, allowing for greater abstractness, flexibility, and generalizability. They are also able to better exploit contextual information than simple retrieval.
Studies have also shown that as PLMs scale up, they have emergent abilities (Wei et al., 2022a), including reasoning. There has been increasing attention on their commonsense reasoning through works like COMET (Bosselut et al., 2019).
However, studies show that even large PLMs struggle with commonsense tasks that humans can reason through very easily (Talmor et al., 2020). There are works that investigate more complicated reasoning tasks, e.g. arithmetic and symbolic reasoning (Wei et al., 2022b). PLMs inherently have some extent of reasoning capability, and many more complex reasoning tasks are easier to carry out over abstract PLM embedding space.

* Work done while at CMU.

Table 1: Examples of CHARD templates with explanations (from CHARDat). The human was asked to write the entire output text (not just the explanation) by infilling the template.

Template: A person with Costochondritis has a/an exercise risk factor because/since/as {explanation}
Full text: A person with Costochondritis has an exercise risk factor because costochondritis can be aggravated by any activity that places stress on your chest area.

Template: A person with gout has a/an lose weight prevention because/since/as {explanation}
Full text: A person with gout has a lose weight prevention because losing weight can lower uric acid levels in your body and significantly reduce the chance of gout attacks.

Template: A person with rheumatoid has a/an therapy treatment because/since/as {explanation}
Full text: A person with rheumatoid has a therapy treatment because physiotherapy helps rheumatoid patients with pain control, reducing inflammation and joint stiffness and to return to the normal activities of daily living or sports.
In this paper, we are interested in the intersection of these areas. Can PLMs act as knowledge bases and also reliably reason using their own knowledge? We investigate whether PLMs can learn and reason through health-related knowledge. Work on generation-based reasoning for health has been limited, with most prior work exploring retrieval-based methods. Generation-based reasoning is more difficult, as such a specialized domain contains esoteric information not prevalent in the PLM's training data, and involves a higher degree of specialized reasoning to handle domain-specific problems.
Healthcare is an important domain that deals with human lives. It is a large application area for machine learning and NLP. The need for automation in healthcare is rising, as countless studies show that healthcare workers are overworked and burned out, especially recently due to the COVID-19 pandemic (Portoghese et al., 2014; Brophy et al., 2021; Couarraze et al., 2021). Further, healthcare resources will continue to be strained as the baby boomer generation ages (Canizares et al., 2016).

Code: https://github.com/styfeng/CHARD
arXiv:2210.04191v2 [cs.CL] 13 Feb 2023
To this end, we propose CHARD: Clinical Health-Aware Reasoning across Dimensions (§2.1). This task is designed to explore the capability of text generation models to act as implicit clinical knowledge bases and generate textual explanations about health-related conditions across several dimensions. The ultimate goal of CHARD is to eventually have a model that is knowledgeable and insightful across numerous clinical dimensions and reasoning pathways. For now, we focus on three relevant clinical dimensions using a template infilling approach, and collect an associated dataset, CHARDat, which includes information for 52 health conditions across these dimensions (§2.2).

We perform extensive experiments on CHARDat using two SOTA seq2seq models: BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) (§3.1), with data augmentation using backtranslation (Sennrich et al., 2016) (§3.2, §4.2). We benchmark our models through automatic, human, and qualitative analyses (§5). We show that our models have strong potential but room to improve, and that CHARD is highly challenging with room for additional exploration. Lastly, we discuss several potential directions for improvement (§6).
2 Task and Dataset
2.1 The CHARD Task
Our task, CHARD: Clinical Health-Aware Reasoning across Dimensions, investigates the capability of text generation models to produce clinical explanations about various health conditions across several clinical dimensions (dim). Essentially, we assess how a PLM can be used as, and reason through, an implicit clinical knowledge base.

We focus on three dim: risk factors (RF), treatment (TREAT), and prevention (PREV), as they are important and relevant in the context of health. A risk factor refers to something that increases the chance of developing a condition. For cancer, some examples are age, family history, and smoking. Treatment refers to something that helps treat or cure a condition. For migraines, some examples are medication, stress management, and meditation. Prevention refers to strategies to stop or lower the chance of getting a condition. For diabetes, some examples are a healthy diet and regular exercise.
As an initial approach to CHARD, we use a template infilling formulation: given an input template that lays out the structure of the desired explanation, the model's goal is to generate a complete explanation of how the particular dim attribute relates to the given condition. In particular, the templates end with an {explanation} span that the models fill in by explaining the appropriate relationship. Some examples are in Table 1.
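To make the input format concrete, a template like those in Table 1 can be assembled programmatically. This is a minimal sketch under our own assumptions; the function name and structure are illustrative, not code from the CHARD release.

```python
# Sketch of building a CHARD infilling input (illustrative only; not from
# the official CHARD codebase). The model's job is to replace the
# {explanation} span with a free-text clinical explanation.

ARTICLE = "a/an"  # the templates hedge the article, per the paper's examples

def build_template(condition: str, attribute: str, dim: str) -> str:
    """Build the infilling input for a (condition, dim, attribute) triple.
    dim is one of: "risk factor", "treatment", "prevention"."""
    return (f"A person with {condition} has {ARTICLE} {attribute} {dim} "
            f"because/since/as {{explanation}}")

print(build_template("Costochondritis", "exercise", "risk factor"))
```

A model trained on CHARDat then consumes this string and regenerates it with the {explanation} span filled in.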
2.2 CHARDat Dataset
Collection Process: We collect a dataset for our task called CHARDat (where DAT is short for data). We collect data across the three dim for 52 health conditions, listed in Appendix A. This is a manually curated list of health conditions which range from common conditions such as migraine and acne to rare conditions such as Lyme disease and Paget–Schroetter. The conditions were also selected by volume of online activity (e.g. number of active subreddit users), treatable vs. chronic conditions, and whether a condition can be self-diagnosed or not. This allows us to assess CHARD across a variety of conditions.

For each dim, we manually collect an exhaustive list of dim-related attributes (e.g. risk factors) for each condition. By attribute, we refer to a particular example of that dim (e.g. "obesity"). This was accomplished by searching through reliable and reputable medically-reviewed sources such as MayoClinic, CDC, WebMD, and Healthline.

We collect the final text (with explanations) using Amazon Mechanical Turk (AMT). We ask approved AMT workers (with strong qualifications and approval ratings on healthcare-related tasks) to write factually accurate, informative, and relatively concise passages given a particular condition and dim attribute template (per HIT), while encouraging them to consult the aforementioned health resources. Three separate annotation studies (one per dim) with strict quality control were conducted to collect an annotation per example.[1] Annotations were regularly verified by authors, and a large subset of CHARDat was manually examined for medical accuracy. More details are in Appendix B. Some examples from CHARDat are in Table 1.
Splits and Statistics: We split CHARDat by dim into train, val, and test splits of 70%/15%/15%, and combine the individual splits per dim to form the final splits called CHARDattr, CHARDatval, and CHARDattest, respectively. The individual dim splits are called dimtr, dimval, and dimtest, where dim is a short-form of the particular dimension: rf, treat, or prev. The individual dimension subsets of CHARDat are called CHARDatDIM.

For each dim's test split, we ensure that approximately half consists of examples from conditions entirely unseen during training for that dim, called dimtestunseen. This is to assess whether the model can generalize to unseen conditions. The other half contains examples from conditions seen during training, called dimtestseen, but the specific condition and dim attribute combination was unseen. The combined halves (across dim) are called CHARDattestunseen and CHARDattestseen. We do the same for the val split to ensure consistency for model selection purposes. CHARDat statistics are in Table 2.

[1] Explanations for CHARD are typically quite standardized, and additional annotations were repetitive. Differences are mainly in language, so we instead opt for paraphrasing data augmentation techniques such as backtranslation (§3.2).

Table 2: CHARDat statistics. Differing #s by dim are because there are more risk factors for most conditions, and some do not have prevention strategies. Length is in words.

Dataset Stats         Train   Val    Test (seen/unseen)
# conditions = 52     44      39     41 (37/4)
  rf = 52             44      26     26 (22/4)
  treat = 52          43      21     20 (16/4)
  prev = 44           35      11     21 (17/4)
# sentences = 937     655     141    141 (70/71)
  rf = 457            319     69     69 (32/37)
  treat = 297         207     45     45 (20/25)
  prev = 183          129     27     27 (18/9)
Avg length = 36.2     37.7    36.1   35 (35.9/34.2)
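The seen/unseen split logic described above can be sketched as follows. The proportions and the seen/unseen distinction follow the paper, but the implementation (function name, the fraction of held-out conditions) is our own illustration, not the authors' split code.

```python
import random

def make_splits(examples, seed=0):
    """Illustrative sketch of the CHARDat split scheme for one dim (§2.2):
    ~15% of examples go to test, with roughly half of test drawn from
    conditions held out of training entirely ("unseen"), and the other
    half from seen conditions but novel (condition, attribute) pairs.
    `examples` is a list of (condition, attribute, text) tuples. The val
    split (built the same way) is omitted here for brevity."""
    rng = random.Random(seed)
    conditions = sorted({c for c, _, _ in examples})
    rng.shuffle(conditions)
    # Hold out some conditions entirely; their examples may only appear
    # in the unseen test half. The 10% fraction is an assumption.
    n_unseen = max(1, len(conditions) // 10)
    unseen_conditions = set(conditions[:n_unseen])

    seen_pool = [e for e in examples if e[0] not in unseen_conditions]
    unseen_pool = [e for e in examples if e[0] in unseen_conditions]
    rng.shuffle(seen_pool)

    n_test = int(0.15 * len(examples))
    test_unseen = unseen_pool[:n_test // 2]
    test_seen = seen_pool[:n_test - len(test_unseen)]
    train = seen_pool[n_test - len(test_unseen):]
    return train, test_seen, test_unseen
```

By construction, no condition in `test_unseen` ever appears in `train`, which is exactly what the generalization evaluation needs.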
3 Methodology
3.1 Models
BART and T5: We experiment using two pretrained seq2seq models: BART and T5 (both base and large versions). These are suitable for our task formulation (template infilling). T5 (Raffel et al., 2020) has strong multitask pretraining. BART (Lewis et al., 2020) is trained to reconstruct original text from noised text (as a denoising autoencoder). We use their HuggingFace codebases.

Retrieval Baseline (RETR): We use a retrieval-based approach as a baseline. We manually query Google using {condition + dim + dim attribute}, e.g. {asthma + risk factor + smoking}, and extract either the featured snippet at the top of the results page, or the text below the first link if there is no featured snippet. If the featured snippet is a list or table, we manually concatenate the items into a single piece of text. An example is in Figure 1.

Figure 1: An example of the Google search results for the query {asthma + risk factor + smoking}, highlighting: a) the featured snippet, b) the text below the first link.

The extracted text approximates an explanation, which we then concatenate to the first part of the associated template to form the final text, e.g. "A person with asthma has a/an smoking risk factor because/since/as {retrieved explanation}". RETR leverages Google's strong search and summarization capabilities, serving as a useful baseline. Further, Google Search is an evolving baseline that continually challenges our CHARD models.[2]
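The RETR post-processing step (flattening a featured-snippet list and splicing the retrieved text into the template) can be sketched as below. The function name and signature are our own; the paper describes this step as done manually.

```python
def retr_final_text(condition: str, attribute: str, dim: str, snippet) -> str:
    """Sketch of the RETR baseline's post-processing (§3.1): if the
    featured snippet is a list, concatenate its items into one string,
    then use the result as the {retrieved explanation} in the template.
    Illustrative only -- in the paper this is done by hand."""
    if isinstance(snippet, (list, tuple)):
        # e.g. a featured-snippet bullet list of risk factors
        snippet = ", ".join(str(item) for item in snippet)
    return (f"A person with {condition} has a/an {attribute} {dim} "
            f"because/since/as {snippet}")

print(retr_final_text("asthma", "smoking", "risk factor",
                      ["exposure to secondhand smoke", "smoking"]))
```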
3.2 Data Augmentation (DA)
Since CHARDat is relatively small, which is mainly a function of our task and domain, i.e. there are a limited number of non-obscure medical conditions and associated dim attributes, we hypothesize that data augmentation (DA) techniques (Feng et al., 2021a, 2020) may be useful.

As noted by Feng et al. (2021a), text generation and specialized domains (such as healthcare) both present several challenges for DA. In our case, many explanations contain clinical or health jargon, which makes techniques that leverage lexical databases such as WordNet, e.g. synonym replacement (Feng et al., 2020), challenging or impossible.

We decide to use backtranslation (BT) (Sennrich et al., 2016) to augment examples in CHARDattr, a popular and easy DA technique which translates a sentence into another language and back to the original language.[3] This usually results in a slightly altered version (paraphrase) of the original text. BT is effective here as healthcare-related terms are preserved relatively well, and the resulting paraphrased explanation remains relatively intact.

We use UDA (Xie et al., 2020) for BT, which translates sentences from English to French, then back to English. UDA is a DA method that uses unsupervised data through consistency training on (x, DA(x)) pairs. An advantage of UDA's BT is that we can control the degree of variation using a temperature (tmp) parameter, where higher values (e.g. 0.9) result in more varied paraphrases. We only backtranslate the explanation portion of examples (concatenating them back to the preceding part), as we wish to keep the preceding part intact.

From the examples in Table 3, we can see that higher tmp typically results in more varied text, albeit with issues with content preservation and fluency. For the second example, the tmp=0.9 BT is completely unrelated to the original text. This is not entirely undesirable, as some noise may make our trained models more robust. From Figure 2, we see that the average ROUGE and BERTScore of backtranslated CHARDattr text compared to the original text decrease as tmp increases, as expected.

Table 3: Examples of original (tmp=0) and BT text. The explanation portion (which is backtranslated) is italicized.

tmp=0:   A person with acne has an avoid irritants prevention because using oily or irritating personal care products clog your pores causing acne.
tmp=0.5: if you use oily or irritant personal care products, you block pores and cause acne.
tmp=0.7: using oily or irritating personal care products, you block acne pores.
tmp=0.9: use oily and irritating disinfectant products freezing your pores to cause the Acne restructurs.

tmp=0:   A person with MultipleSclerosis has a stress management prevention because stress is more likely to exacerbate the symptoms of MS and bring about a flare or relapse.
tmp=0.5: stress is more likely to exacerbate MS symptoms and lead to an outbreak or relapse
tmp=0.7: stress is more likely to exacerbate symptoms of MS and trigger a flare or relapse.
tmp=0.9: severe mourning problems occurred at Vancouver Hospital (Prince Edward Island), British Columbia. (...)

Figure 2: Graph showing how avg. ROUGE and BERTScore of BT vs. original text vary by BT tmp on CHARDattr (metrics shown: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore; tmp range 0.4–0.9).

[2] We will release our current baseline data.
[3] This is sometimes referred to as round-trip translation.
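The explanation-only round-trip structure described above can be sketched as follows. The translation callables here are hypothetical stand-ins, not UDA's actual API; in the paper they would be UDA's EN→FR and FR→EN models with a temperature parameter.

```python
def backtranslate_example(full_text: str, en_to_fr, fr_to_en) -> str:
    """Sketch of explanation-only backtranslation (§3.2). The template
    prefix up to and including the connective (simplified here to
    'because') is kept intact; only the explanation is round-tripped.
    en_to_fr / fr_to_en are hypothetical translation callables."""
    prefix, sep, explanation = full_text.partition(" because ")
    if not sep:  # no explanation span found; leave the text untouched
        return full_text
    paraphrase = fr_to_en(en_to_fr(explanation))
    return prefix + sep + paraphrase

# Toy demo with identity "translators" (a real model would paraphrase);
# the output equals the input, since the stubs change nothing.
out = backtranslate_example(
    "A person with gout has a lose weight prevention because losing "
    "weight can lower uric acid levels.",
    en_to_fr=lambda s: s, fr_to_en=lambda s: s)
```

Keeping the prefix fixed is what lets the augmented example remain a valid (template, explanation) pair for training.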
3.3 Evaluation Metrics
We use several standard text generation evaluation metrics, including reference-based token and semantic comparison metrics used in works like Lin et al. (2020), such as ROUGE (Lin and Hovy, 2003), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). SPICE translates text to semantic scene graphs and calculates an F-score over graph tuples. CIDEr captures sentence similarity, grammaticality, saliency, importance, and accuracy.[4]

We also use average word length (Len), BERTScore (Zhang et al., 2019), and Perplexity (PPL). BERTScore serves as a more semantic similarity measure by assessing BERT (Devlin et al., 2019) embedding similarity between individual tokens. We multiply BERTScore by 100 when reporting it. PPL approximately measures fluency, where lower values represent higher fluency. We use GPT-2 (Radford et al., 2019) for PPL. Higher is better for all metrics other than PPL and Len.
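As a concrete reference point for the token-overlap family of metrics, ROUGE-1 F1 reduces to unigram overlap. This is a simplified illustration under basic whitespace tokenization; reported scores use standard ROUGE packages with their own tokenization and stemming.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and
    reference after lowercasing and whitespace tokenization.
    Illustrative only -- not the official ROUGE implementation."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 shape over bigrams and longest common subsequences, respectively.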
4 Experimental Setup
4.1 Model Finetuning and Generation
For the standard (non-augmented) CHARD models, we train and evaluate four versions of each on CHARDat, CHARDatRF, CHARDatTREAT, and CHARDatPREV, respectively. The first of these is a combined model that learns to handle all three dim at once, depending on the dim given at inference, while the latter three are models trained on each individual dim. We predict that while the latter three may perform better on their particular dim, the first model is more effective overall, as it accomplishes our goal of having a single PLM that can store knowledge and reason through several dim. It is thus more adaptable and generalizable.

For training the CHARD models, we keep most hyperparameters static, other than the learning rate (LR), which is tuned per individual model. For each model, we select the epoch that corresponds to the highest ROUGE-2 on CHARDatval, and decode using beam search. See Appendix C for more.
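The checkpoint-selection rule above is simple enough to state in code. The function and the shape of its input are our own illustration; the paper only specifies the criterion (highest validation ROUGE-2).

```python
def select_best_epoch(val_rouge2_by_epoch):
    """Sketch of the model-selection rule in §4.1: pick the training
    epoch whose checkpoint scores highest ROUGE-2 on CHARDatval.
    `val_rouge2_by_epoch` maps epoch number -> validation ROUGE-2;
    the dict is a hypothetical stand-in for real evaluation logs."""
    return max(val_rouge2_by_epoch, key=val_rouge2_by_epoch.get)
```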
4.2 Data Augmentation Experiments
We try several backtranslation DA experiments.

2x DA with Different Tmp: Our first set of experiments involves 2x DA (backtranslating each CHARDattr explanation once, to double the original training data) using different BT tmp, which we call BT-set: {0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. We predict that the optimal tmp lies in the 0.6-0.7 range, as the text is modified to a reasonable degree.

Different DA Amounts (2x-10x): We also try further DA amounts: 3x, 4x, 5x, 7x, and 10x the original amount of training data. We explore

[4] Matching metrics are sufficient as CHARD explanations are standardized (space for explanations is low) since our inputs present a particular condition and dim attribute combo.