CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models
Steven Y. Feng1, Vivek Khetan2, Bogdan Sacaleanu2, Anatole Gershman3, Eduard Hovy3
1Stanford University, 2Accenture Labs, SF, 3Carnegie Mellon University
syfeng@stanford.edu
{vivek.a.khetan,bogdan.e.sacaleanu}@accenture.com
{anatoleg,hovy}@cs.cmu.edu
Abstract
We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.
1 Introduction
Pretrained language models (PLMs) have seen increasing popularity for NLP tasks and applications, including text generation. Researchers have become interested in the extent to which PLMs can: 1) act as knowledge bases, and 2) reason like humans.

Rather than using external databases, exposure to large amounts of data during training, combined with their large number of parameters, has given PLMs the ability to store knowledge that can be extracted through effective probing strategies such as text infilling (Donahue et al., 2020), prompting (Liu et al., 2021), and QA (Jiang et al., 2021). PLMs imitate a more high-level information store, allowing for greater abstractness, flexibility, and generalizability. They are also able to better exploit contextual information than simple retrieval.
Studies have also shown that as PLMs scale up, they have emergent abilities (Wei et al., 2022a), including reasoning. There has been increasing attention on their commonsense reasoning through works like COMET (Bosselut et al., 2019).
However, studies show that even large PLMs struggle with commonsense tasks that humans can reason through very easily (Talmor et al., 2020). There are works that investigate more complicated reasoning tasks, e.g. arithmetic and symbolic reasoning (Wei et al., 2022b). PLMs inherently have some extent of reasoning capability, and many more complex reasoning tasks are easier to carry out over abstract PLM embedding space.

* Work done while at CMU.

Table 1: Examples of CHARD templates with explanations (from CHARDat). The human was asked to write the entire output text (not just the explanation) by infilling the template.

Template: A person with Costochondritis has a/an exercise risk factor because/since/as {explanation}
Full text: A person with Costochondritis has an exercise risk factor because costochondritis can be aggravated by any activity that places stress on your chest area.

Template: A person with gout has a/an lose weight prevention because/since/as {explanation}
Full text: A person with gout has a lose weight prevention because losing weight can lower uric acid levels in your body and significantly reduce the chance of gout attacks.

Template: A person with rheumatoid has a/an therapy treatment because/since/as {explanation}
Full text: A person with rheumatoid has a therapy treatment because physiotherapy helps rheumatoid patients with pain control, reducing inflammation and joint stiffness and to return to the normal activities of daily living or sports.
In this paper, we are interested in the intersection of these areas. Can PLMs act as knowledge bases and also reliably reason using their own knowledge? We investigate whether PLMs can learn and reason through health-related knowledge. Work on generation-based reasoning for health has been limited, with most prior work exploring retrieval-based methods. Generation-based reasoning is more difficult, as such a specialized domain contains esoteric information not prevalent in the PLM's training data, and involves a higher degree of specialized reasoning to handle domain-specific problems.
Healthcare is an important domain that deals with human lives. It is a large application area for machine learning and NLP. The need for automation in healthcare is rising, as countless studies show that healthcare workers are overworked and burned out, especially recently due to the COVID-19 pandemic (Portoghese et al., 2014; Brophy et al., 2021; Couarraze et al., 2021). Further, healthcare resources will continue to be strained as the baby boomer generation ages (Canizares et al., 2016).

Code: https://github.com/styfeng/CHARD
arXiv:2210.04191v2 [cs.CL] 13 Feb 2023
To this end, we propose CHARD: Clinical Health-Aware Reasoning across Dimensions (§2.1). This task is designed to explore the capability of text generation models to act as implicit clinical knowledge bases and generate textual explanations about health-related conditions across several dimensions. The ultimate goal of CHARD is to eventually have a model that is knowledgeable and insightful across numerous clinical dimensions and reasoning pathways. For now, we focus on three relevant clinical dimensions using a template infilling approach, and collect an associated dataset, CHARDat, which includes information for 52 health conditions across these dimensions (§2.2).

We perform extensive experiments on CHARDat using two SOTA seq2seq models: BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) (§3.1), with data augmentation using backtranslation (Sennrich et al., 2016) (§3.2, §4.2). We benchmark our models through automatic, human, and qualitative analyses (§5). We show that our models have strong potential but room to improve, and that CHARD is highly challenging with room for additional exploration. Lastly, we discuss several potential directions for improvement (§6).
2 Task and Dataset
2.1 The CHARD Task
Our task, CHARD: Clinical Health-Aware Reasoning across Dimensions, investigates the capability of text generation models to produce clinical explanations about various health conditions across several clinical dimensions (dim). Essentially, we assess how a PLM can be used as, and reason through, an implicit clinical knowledge base.

We focus on three dim: risk factors (RF), treatment (TREAT), and prevention (PREV), as they are important and relevant in the context of health. A risk factor refers to something that increases the chance of developing a condition. For cancer, some examples are age, family history, and smoking. Treatment refers to something that helps treat or cure a condition. For migraines, some examples are medication, stress management, and meditation. Prevention refers to strategies to stop or lower the chance of getting a condition. For diabetes, some examples are a healthy diet and regular exercise.
As an initial approach to CHARD, we use a template infilling formulation: given an input template that lays out the structure of the desired explanation, the model's goal is to generate a complete explanation of how the particular dim attribute relates to the given condition. In particular, the templates end with an {explanation} span that the models fill in by explaining the appropriate relationship. Some examples are in Table 1.
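To make the input format concrete, a template like those in Table 1 can be assembled programmatically. This is a minimal sketch under our own assumptions; the function name and structure are illustrative, not code from the CHARD release.

```python
# Sketch of building a CHARD infilling input (illustrative only; not from
# the official CHARD codebase). The model's job is to replace the
# {explanation} span with a free-text clinical explanation.

ARTICLE = "a/an"  # the templates hedge the article, per the paper's examples

def build_template(condition: str, attribute: str, dim: str) -> str:
    """Build the infilling input for a (condition, dim, attribute) triple.
    dim is one of: "risk factor", "treatment", "prevention"."""
    return (f"A person with {condition} has {ARTICLE} {attribute} {dim} "
            f"because/since/as {{explanation}}")

print(build_template("Costochondritis", "exercise", "risk factor"))
```

A model trained on CHARDat then consumes this string and regenerates it with the {explanation} span filled in.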
2.2 CHARDat Dataset
Collection Process: We collect a dataset for our task called CHARDat (where DAT is short for data). We collect data across the three dim for 52 health conditions, listed in Appendix A. This is a manually curated list of health conditions which range from common conditions such as migraine and acne to rare conditions such as Lyme disease and Paget–Schroetter. The conditions were also selected by volume of online activity (e.g. number of active subreddit users), treatable vs. chronic conditions, and whether a condition can be self-diagnosed or not. This allows us to assess CHARD across a variety of conditions.

For each dim, we manually collect an exhaustive list of dim-related attributes (e.g. risk factors) for each condition. By attribute, we refer to a particular example of that dim (e.g. "obesity"). This was accomplished by searching through reliable and reputable medically-reviewed sources such as MayoClinic, CDC, WebMD, and Healthline.

We collect the final text (with explanations) using Amazon Mechanical Turk (AMT). We ask approved AMT workers (with strong qualifications and approval ratings on healthcare-related tasks) to write factually accurate, informative, and relatively concise passages given a particular condition and dim attribute template (per HIT), while encouraging them to consult the aforementioned health resources. Three separate annotation studies (one per dim) with strict quality control were conducted to collect an annotation per example.[1] Annotations were regularly verified by authors, and a large subset of CHARDat was manually examined for medical accuracy. More details are in Appendix B. Some examples from CHARDat are in Table 1.
Splits and Statistics: We split CHARDat by dim into train, val, and test splits of 70%/15%/15%, and combine the individual splits per dim to form the final splits called CHARDattr, CHARDatval, and CHARDattest, respectively. The individual dim splits are called dimtr, dimval, and dimtest, where dim is a short-form of the particular dimension: rf, treat, or prev. The individual dimension subsets of CHARDat are called CHARDatDIM.

For each dim's test split, we ensure that approximately half consists of examples from conditions entirely unseen during training for that dim, called dimtestunseen. This is to assess whether the model can generalize to unseen conditions. The other half contains examples from conditions seen during training, called dimtestseen, but the specific condition and dim attribute combination was unseen. The combined halves (across dim) are called CHARDattestunseen and CHARDattestseen. We do the same for the val split to ensure consistency for model selection purposes. CHARDat statistics are in Table 2.

[1] Explanations for CHARD are typically quite standardized, and additional annotations were repetitive. Differences are mainly in language, so we instead opt for paraphrasing data augmentation techniques such as backtranslation (§3.2).

Table 2: CHARDat statistics. Differing #s by dim are because there are more risk factors for most conditions, and some do not have prevention strategies. Length is in words.

Dataset Stats         Train   Val    Test (seen/unseen)
# conditions = 52     44      39     41 (37/4)
  rf = 52             44      26     26 (22/4)
  treat = 52          43      21     20 (16/4)
  prev = 44           35      11     21 (17/4)
# sentences = 937     655     141    141 (70/71)
  rf = 457            319     69     69 (32/37)
  treat = 297         207     45     45 (20/25)
  prev = 183          129     27     27 (18/9)
Avg length = 36.2     37.7    36.1   35 (35.9/34.2)
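The seen/unseen split logic described above can be sketched as follows. The proportions and the seen/unseen distinction follow the paper, but the implementation (function name, the fraction of held-out conditions) is our own illustration, not the authors' split code.

```python
import random

def make_splits(examples, seed=0):
    """Illustrative sketch of the CHARDat split scheme for one dim (§2.2):
    ~15% of examples go to test, with roughly half of test drawn from
    conditions held out of training entirely ("unseen"), and the other
    half from seen conditions but novel (condition, attribute) pairs.
    `examples` is a list of (condition, attribute, text) tuples. The val
    split (built the same way) is omitted here for brevity."""
    rng = random.Random(seed)
    conditions = sorted({c for c, _, _ in examples})
    rng.shuffle(conditions)
    # Hold out some conditions entirely; their examples may only appear
    # in the unseen test half. The 10% fraction is an assumption.
    n_unseen = max(1, len(conditions) // 10)
    unseen_conditions = set(conditions[:n_unseen])

    seen_pool = [e for e in examples if e[0] not in unseen_conditions]
    unseen_pool = [e for e in examples if e[0] in unseen_conditions]
    rng.shuffle(seen_pool)

    n_test = int(0.15 * len(examples))
    test_unseen = unseen_pool[:n_test // 2]
    test_seen = seen_pool[:n_test - len(test_unseen)]
    train = seen_pool[n_test - len(test_unseen):]
    return train, test_seen, test_unseen
```

By construction, no condition in `test_unseen` ever appears in `train`, which is exactly what the generalization evaluation needs.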
3 Methodology
3.1 Models
BART and T5: We experiment using two pretrained seq2seq models: BART and T5 (both base and large versions). These are suitable for our task formulation (template infilling). T5 (Raffel et al., 2020) has strong multitask pretraining. BART (Lewis et al., 2020) is trained to reconstruct original text from noised text (as a denoising autoencoder). We use their HuggingFace codebases.

Retrieval Baseline (RETR): We use a retrieval-based approach as a baseline. We manually query Google using {condition + dim + dim attribute}, e.g. {asthma + risk factor + smoking}, and extract either the featured snippet at the top of the results page, or the text below the first link if there is no featured snippet. If the featured snippet is a list or table, we manually concatenate the items into a single piece of text. An example is in Figure 1.

Figure 1: An example of the Google search results for the query {asthma + risk factor + smoking}, highlighting: a) the featured snippet, b) the text below the first link.

The extracted text approximates an explanation, which we then concatenate to the first part of the associated template to form the final text, e.g. "A person with asthma has a/an smoking risk factor because/since/as {retrieved explanation}". RETR leverages Google's strong search and summarization capabilities, serving as a useful baseline. Further, Google Search is an evolving baseline that continually challenges our CHARD models.[2]
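The RETR post-processing step (flattening a featured-snippet list and splicing the retrieved text into the template) can be sketched as below. The function name and signature are our own; the paper describes this step as done manually.

```python
def retr_final_text(condition: str, attribute: str, dim: str, snippet) -> str:
    """Sketch of the RETR baseline's post-processing (§3.1): if the
    featured snippet is a list, concatenate its items into one string,
    then use the result as the {retrieved explanation} in the template.
    Illustrative only -- in the paper this is done by hand."""
    if isinstance(snippet, (list, tuple)):
        # e.g. a featured-snippet bullet list of risk factors
        snippet = ", ".join(str(item) for item in snippet)
    return (f"A person with {condition} has a/an {attribute} {dim} "
            f"because/since/as {snippet}")

print(retr_final_text("asthma", "smoking", "risk factor",
                      ["exposure to secondhand smoke", "smoking"]))
```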
3.2 Data Augmentation (DA)
Since CHARDat is relatively small, which is mainly a function of our task and domain, i.e. there are a limited number of non-obscure medical conditions and associated dim attributes, we hypothesize that data augmentation (DA) techniques (Feng et al., 2021a, 2020) may be useful.

As noted by Feng et al. (2021a), text generation and specialized domains (such as healthcare) both present several challenges for DA. In our case, many explanations contain clinical or health jargon, which makes techniques that leverage lexical databases such as WordNet, e.g. synonym replacement (Feng et al., 2020), challenging or impossible.

We decide to use backtranslation (BT) (Sennrich et al., 2016) to augment examples in CHARDattr, a popular and easy DA technique which translates a sentence into another language and back to the original language.[3] This usually results in a slightly altered version (paraphrase) of the original text. BT is effective here as healthcare-related terms are preserved relatively well, and the resulting paraphrased explanation remains relatively intact.

We use UDA (Xie et al., 2020) for BT, which translates sentences from English to French, then back to English. UDA is a DA method that uses unsupervised data through consistency training on (x, DA(x)) pairs. An advantage of UDA's BT is that we can control the degree of variation using a temperature (tmp) parameter, where higher values (e.g. 0.9) result in more varied paraphrases. We only backtranslate the explanation portion of examples (concatenating them back to the preceding part), as we wish to keep the preceding part intact.

From the examples in Table 3, we can see that higher tmp typically results in more varied text, albeit with issues with content preservation and fluency. For the second example, the tmp=0.9 BT is completely unrelated to the original text. This is not entirely undesirable, as some noise may make our trained models more robust. From Figure 2, we see that the average ROUGE and BERTScore of backtranslated CHARDattr text compared to the original text decrease as tmp increases, as expected.

Table 3: Examples of original (tmp=0) and BT text. The explanation portion (which is backtranslated) is italicized.

tmp=0:   A person with acne has an avoid irritants prevention because using oily or irritating personal care products clog your pores causing acne.
tmp=0.5: if you use oily or irritant personal care products, you block pores and cause acne.
tmp=0.7: using oily or irritating personal care products, you block acne pores.
tmp=0.9: use oily and irritating disinfectant products freezing your pores to cause the Acne restructurs.

tmp=0:   A person with MultipleSclerosis has a stress management prevention because stress is more likely to exacerbate the symptoms of MS and bring about a flare or relapse.
tmp=0.5: stress is more likely to exacerbate MS symptoms and lead to an outbreak or relapse
tmp=0.7: stress is more likely to exacerbate symptoms of MS and trigger a flare or relapse.
tmp=0.9: severe mourning problems occurred at Vancouver Hospital (Prince Edward Island), British Columbia. (...)

Figure 2: Graph showing how avg. ROUGE and BERTScore of BT vs. original text vary by BT tmp on CHARDattr (metrics shown: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore; tmp range 0.4–0.9).

[2] We will release our current baseline data.
[3] This is sometimes referred to as round-trip translation.
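The explanation-only round-trip structure described above can be sketched as follows. The translation callables here are hypothetical stand-ins, not UDA's actual API; in the paper they would be UDA's EN→FR and FR→EN models with a temperature parameter.

```python
def backtranslate_example(full_text: str, en_to_fr, fr_to_en) -> str:
    """Sketch of explanation-only backtranslation (§3.2). The template
    prefix up to and including the connective (simplified here to
    'because') is kept intact; only the explanation is round-tripped.
    en_to_fr / fr_to_en are hypothetical translation callables."""
    prefix, sep, explanation = full_text.partition(" because ")
    if not sep:  # no explanation span found; leave the text untouched
        return full_text
    paraphrase = fr_to_en(en_to_fr(explanation))
    return prefix + sep + paraphrase

# Toy demo with identity "translators" (a real model would paraphrase);
# the output equals the input, since the stubs change nothing.
out = backtranslate_example(
    "A person with gout has a lose weight prevention because losing "
    "weight can lower uric acid levels.",
    en_to_fr=lambda s: s, fr_to_en=lambda s: s)
```

Keeping the prefix fixed is what lets the augmented example remain a valid (template, explanation) pair for training.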
3.3 Evaluation Metrics
We use several standard text generation evaluation metrics, including reference-based token and semantic comparison metrics used in works like Lin et al. (2020), such as ROUGE (Lin and Hovy, 2003), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). SPICE translates text to semantic scene graphs and calculates an F-score over graph tuples. CIDEr captures sentence similarity, grammaticality, saliency, importance, and accuracy.[4]

We also use average word length (Len), BERTScore (Zhang et al., 2019), and Perplexity (PPL). BERTScore serves as a more semantic similarity measure by assessing BERT (Devlin et al., 2019) embedding similarity between individual tokens. We multiply BERTScore by 100 when reporting it. PPL approximately measures fluency, where lower values represent higher fluency. We use GPT-2 (Radford et al., 2019) for PPL. Higher is better for all metrics other than PPL and Len.
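As a concrete reference point for the token-overlap family of metrics, ROUGE-1 F1 reduces to unigram overlap. This is a simplified illustration under basic whitespace tokenization; reported scores use standard ROUGE packages with their own tokenization and stemming.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and
    reference after lowercasing and whitespace tokenization.
    Illustrative only -- not the official ROUGE implementation."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 shape over bigrams and longest common subsequences, respectively.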
4 Experimental Setup
4.1 Model Finetuning and Generation
For the standard (non-augmented) CHARD models, we train and evaluate four versions of each on CHARDat, CHARDatRF, CHARDatTREAT, and CHARDatPREV, respectively. The first of these is a combined model that learns to handle all three dim at once, depending on the dim given at inference, while the latter three are models trained on each individual dim. We predict that while the latter three may perform better on their particular dim, the first model is more effective overall, as it accomplishes our goal of having a single PLM that can store knowledge and reason through several dim. It is thus more adaptable and generalizable.

For training the CHARD models, we keep most hyperparameters static, other than the learning rate (LR), which is tuned per individual model. For each model, we select the epoch that corresponds to the highest ROUGE-2 on CHARDatval, and decode using beam search. See Appendix C for more.
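The checkpoint-selection rule above is simple enough to state in code. The function and the shape of its input are our own illustration; the paper only specifies the criterion (highest validation ROUGE-2).

```python
def select_best_epoch(val_rouge2_by_epoch):
    """Sketch of the model-selection rule in §4.1: pick the training
    epoch whose checkpoint scores highest ROUGE-2 on CHARDatval.
    `val_rouge2_by_epoch` maps epoch number -> validation ROUGE-2;
    the dict is a hypothetical stand-in for real evaluation logs."""
    return max(val_rouge2_by_epoch, key=val_rouge2_by_epoch.get)
```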
4.2 Data Augmentation Experiments
We try several backtranslation DA experiments.

2x DA with Different Tmp: Our first set of experiments involves 2x DA (backtranslating each CHARDattr explanation once, to double the original training data) using different BT tmp, which we call BT-set: {0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. We predict that the optimal tmp lies in the 0.6-0.7 range, as the text is modified to a reasonable degree.

Different DA Amounts (2x-10x): We also try further DA amounts: 3x, 4x, 5x, 7x, and 10x the original amount of training data. We explore

[4] Matching metrics are sufficient as CHARD explanations are standardized (space for explanations is low) since our inputs present a particular condition and dim attribute combo.