REV: Information-Theoretic Evaluation of Free-Text Rationales

Hanjie Chen♡∗  Faeze Brahman♠♢  Xiang Ren♠♣  Yangfeng Ji♡  Yejin Choi♠♢  Swabha Swayamdipta♣
♡Department of Computer Science, University of Virginia
♠Allen Institute for AI  ♣University of Southern California
♢Paul G. Allen School of Computer Science & Engineering, University of Washington
{hc9mx,yangfeng}@virginia.edu  {faezeb,xiangr,yejinc}@allenai.org  swabhas@usc.edu
∗Work done during an internship at AI2.
Abstract
Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information) to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought, demonstrate the effectiveness of REV in evaluating rationale-label pairs, compared to existing metrics. We further demonstrate REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models' reasoning and prediction processes.¹

¹ Our code is publicly available at https://github.com/HanjieChen/REV
1 Introduction
Model explanations have been indispensable for trust and interpretability in natural language processing (NLP) (Ribeiro et al., 2016, 2020; Lipton, 2018; Chen et al., 2020, 2021a). Free-text rationales, which explain a model prediction in natural language, have been especially appealing due to their flexibility in eliciting the reasoning process behind the model's decision making (Camburu et al., 2018; Narang et al., 2020; Rajani et al., 2019; Kumar and Talukdar, 2020; Brahman et al., 2021), making them closer to human explanations. However, existing metrics for free-text rationale evaluation remain narrowly focused on the extent to which a rationale can help a (proxy) model predict the label it explains (i.e., they are accuracy based) (Hase et al., 2020; Wiegreffe et al., 2021). These metrics offer little understanding of the new information contained in the rationale, added to the original input, that could explain why the label is selected, which is the very purpose a rationale is designed to serve. For instance, the two rationales r1 and r̂1,a in Fig. 1 would be considered equally valuable under existing metrics, even though they supply different amounts of novel and relevant information.

In this paper, we overcome this shortcoming by introducing an automatic evaluation for free-text rationales along two dimensions: (1) whether the rationale supports (i.e., is predictive of) the intended label, and (2) how much new information it provides to justify the label, beyond what is contained in the input. For example, rationale r̂1,b in Fig. 1 violates (1) because it is not predictive of the label, "enjoy nature". Rationale r̂1,a does support the label but contains no new information that justifies it beyond what is stated in the input x; thus, it violates (2). Rationale r1 satisfies both dimensions: it supports the label and does so by providing new and relevant information beyond what is in the input. Our proposed evaluation is designed to penalize both r̂1,a and r̂1,b, while rewarding rationales like r1.
We introduce REV,² which adapts an information-theoretic framework from Xu et al. (2020) for evaluating free-text rationales along the two dimensions mentioned above. Specifically, REV is based on conditional V-information (Hewitt et al., 2021), which quantifies the degree of information contained in a representation beyond another (baseline) representation, accessible to a model family V. As our baseline representation, we consider any vacuous rationale which simply (and declaratively) combines an input with a given label, without providing any new information relevant to answering why the label was chosen. REV adapts conditional V-information to evaluate rationales, where we compare two representations: one from an evaluation model trained to produce the label given the input and the rationale, and the other from another evaluation model for the same task but considering only the input (disguised as a vacuous rationale). Other metrics do not take vacuous rationales into consideration, and are hence unable to measure new and label-relevant information in rationales.

² For Rationale Evaluation with conditional V-information.

Figure 1: Our evaluation framework for different free-text rationales (r). r1 is a human-written rationale; r̂1,a and r̂1,b are two generated rationales for the true label y1. Our metric, REV, based on CVI (Hewitt et al., 2021), is able to distinguish all three rationales by measuring how much new and label-relevant information each adds over a vacuous rationale, b; performance-based evaluations can only distinguish between r̂1,a and r̂1,b. For an (arguably) incorrect label, y2, REV still gives a positive score, highlighting that r̂2 is able to provide new information for why it supports y2. Prediction accuracy can be augmented with REV to provide fuller interpretability of model decisions.
In our experiments, we present evaluations with REV for rationales under two reasoning tasks, commonsense question answering (CQA; Talmor et al., 2019) and natural language inference (NLI; Bowman et al., 2015), across four benchmarks. Several quantitative evaluations demonstrate the capabilities of REV in providing evaluations along new dimensions for free-text rationales, while also being more consistent with human judgements compared to existing metrics. We also provide comparisons to demonstrate the sensitivity of REV to various degrees of input perturbations. Additionally, evaluation with REV offers insights into why rationales obtained through chain-of-thought prompting (Wei et al., 2022) do not necessarily improve prediction performance.
2 REV: Information-Theoretic Evaluation of Rationales

We introduce a new metric, REV (Rationale Evaluation with conditional V-information), for the evaluation of free-text rationales along the proposed dimensions (§2.2), based on the framework of conditional V-information (§2.1).
We consider the setting where we have input X ∈ 𝒳, label Y ∈ 𝒴, and free-text rationale R ∈ ℛ generated for label Y. A common strategy to evaluate rationale R is through an evaluator function f : 𝒵 → 𝒴, which maps a variable Z to a label distribution. Here, Z can be defined based on the evaluation framework; e.g., Z can be a concatenation of X and R, or only X. These metrics evaluate the utility of R based on how much R helps f predict Y. The evaluator f is typically trained on a set of input, label, and rationale triples D_train = {(x_j, y_j, r_j)}, and applied to D_test = {(x_i, y_i, r_i)} for evaluation. The utility of R is formulated as the difference between the performance of the evaluator on predicting Y with R and without it, i.e.,

Perf[f(Y | X, R)] − Perf[f(Y | X)],   (1)

where a larger performance gap indicates a better rationale. Existing metrics (Hase et al., 2020; Wiegreffe et al., 2021) compute the performance gap based on prediction accuracies.
However, accuracy-based evaluation can only indicate whether or not a rationale is predictive of a label, but cannot quantify how much new information the rationale provides to justify the label. Figure 1 illustrates this issue via an example. Here, accuracy-based evaluation can distinguish between r̂1,a and r̂1,b since r̂1,a supports y1 and r̂1,b does not. However, it is unable to distinguish between r1 and r̂1,a (since both are predictive of y1), despite the fact that r̂1,a does not provide any unique and relevant information to answer why the label should be y1. In practice, vacuous rationales such as r̂1,a are commonly seen in model generations (Sun et al., 2022; Wiegreffe and Marasovic, 2021). This calls for an evaluation metric which is able to identify and penalize such vacuous rationales.
2.1 An Information-Theoretic Perspective on Rationale Evaluation

The key quantity of interest for our evaluation of rationale R is the amount of new information expressed in R (e.g., background knowledge, reasoning process) that can justify a label Y. The mutual information between R and Y, I(Y; R), can be helpful for evaluating this quantity. However, we are not interested in the information that is already captured in the input X. A vacuous rationale, such as r̂1,a in Fig. 1, which simply combines the input X and the label Y declaratively, captures all the information in X and Y without specifying any new information that helps understand why Y has been chosen for X. We denote such rationales as B. Thus, we argue that a good evaluation metric must be able to measure the amount of new and label-relevant information contained in a rationale beyond what is contained in any vacuous rationale B that leads to the prediction of Y. The new information in R beyond what is available in B can then be grounded in conditional mutual information (Shannon, 1948) as follows:

I(Y; R | B) = I(Y; R, B) − I(Y; B),   (2)

where the difference between the two information quantities parallels the performance gap in Equation 1.
Directly computing mutual information, however, is challenging because the true distributions of the random variables are usually unknown, and we do not have unbounded computation. A recently introduced information-theoretic framework called V-information circumvents this by restricting the computation to certain predictive model families V (Xu et al., 2020). Given a model family V that maps two random variables R and Y, V-information defines the usable information that can be extracted from R by models in V to predict Y, i.e., I_V(R → Y). If V generalizes to the set of all possible functions, then V-information is mutual information (Shannon, 1948). In practice, it is feasible to estimate the usable information from R about Y by selecting any neural model without frozen parameters as V.³

³ Please see Xu et al. (2020) for a detailed discussion of properties, such as optional ignorance, that a predictive family V must follow.
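For reference, the unconditional definition underlying this framework (our paraphrase of Xu et al., 2020, with notation matched to this section) can be written as

I_V(R → Y) = H_V(Y | ∅) − H_V(Y | R),   with   H_V(Y | R) = inf_{f ∈ V} E[−log f[r](y)],

where ∅ denotes a null input, so H_V(Y | ∅) is the V-entropy of Y when no side information is available, and I_V(R → Y) measures how much that entropy drops once models in V can condition on R.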
Our approach to evaluating rationales builds on a modification of this framework for conditional information by Hewitt et al. (2021), as described below.
Conditional V-information. Following conditional mutual information in information theory (Cover and Thomas, 2006), V-information has been extended to conditional V-information (CVI; Hewitt et al., 2021). CVI quantifies the V-usable information in R about Y conditioned on a variable B, i.e.,

I_V(R → Y | B) = H_V(Y | B) − H_V(Y | R, B).

Here B is any vacuous rationale that leads to the prediction of Y. In this work, we consider B simply as the declarative combination of X and Y. H_V(· | ·) is the conditional V-entropy (Xu et al., 2020; Hewitt et al., 2021; Ethayarajh et al., 2022), defined as

H_V(Y | B) = inf_{f ∈ V} E[−log f[b](y)],   (3)
H_V(Y | R, B) = inf_{f ∈ V} E[−log f[r, b](y)],   (4)

where f[b] and f[r, b] produce a probability distribution over the labels given b and [r, b] as inputs, respectively.⁴ Further, given g′, g ∈ V which optimize Equations 3 and 4, respectively, we consider the pointwise CVI for individual triples (r, y, b):

−log g′[b](y) + log g[r, b](y).   (5)

⁴ [r, b] is the concatenation of r and b. Please see Appendix A for further details on CVI.
2.2 Computing REV for Rationale Evaluation

Building on the framework of CVI, we propose a new metric, REV, for Rationale Evaluation with conditional V-information. We compute REV over a given test set, D_test = {(x_i, y_i, r_i)}, by estimating CVI over the set with evaluation models g, g′ ∈ V. For a test example (x, y, r), the REV score, denoted REV(x, y, r), is computed based on Equation 5, where b is constructed by combining x and y:

REV(x, y, r) = −log g′[b](y) + log g[r, b](y).

The REV score for the entire test corpus D_test is given by the average pointwise REV score:

REV_D = (1 / |D_test|) Σ_{i=1}^{|D_test|} REV(x_i, y_i, r_i).   (6)
Algorithm 1: Computing REV Scores
1: Input: evaluation models g and g′, test set D_test = {(x_i, y_i, r_i)}
2: Initialize an empty list S
3: for (x_i, y_i, r_i) ∈ D_test do
4:   Construct the baseline rationale b_i
5:   REV(x_i, y_i, r_i) = −log g′[b_i](y_i) + log g[r_i, b_i](y_i)
6:   S.add(REV(x_i, y_i, r_i))
7: end for
8: REV_D = mean(S)
9: Output: S, REV_D
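A minimal sketch of this computation in Python, assuming the two fine-tuned evaluators are exposed through hypothetical scoring helpers (the helper names and the way r and b are concatenated are our own assumptions, not part of the released code):

```python
from statistics import mean

def rev_scores(score_g, score_g_prime, test_set, make_baseline):
    """Algorithm 1 (sketch): pointwise and corpus-level REV.

    score_g(context, label) / score_g_prime(context, label): hypothetical helpers
    returning the log-probability each fine-tuned evaluator assigns to `label`
    given `context` (e.g., summed token log-probabilities under teacher forcing).
    test_set: iterable of (x, y, r) triples.
    make_baseline: combines x and y into a vacuous baseline rationale b.
    """
    scores = []
    for x, y, r in test_set:
        b = make_baseline(x, y)                       # vacuous baseline rationale
        log_p_with_r = score_g(f"{r} {b}", y)         # log g[r, b](y)
        log_p_baseline = score_g_prime(b, y)          # log g'[b](y)
        scores.append(log_p_with_r - log_p_baseline)  # pointwise REV, Eq. (5)
    return scores, mean(scores)                       # REV_D: average over the test set
```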
Algorithm 1 shows the process of computing both pointwise and aggregate REV scores. The higher the REV score, the more additional (new and relevant) information the rationale r contains to explain the label beyond the baseline rationale b. REV(x_i, y_i, r_i) can take positive, negative, or zero values. When REV(x_i, y_i, r_i) > 0, the rationale supplies additional new information supporting the label (e.g., r1 in Fig. 1); when REV(x_i, y_i, r_i) = 0, the rationale provides no additional information beyond the baseline (e.g., r̂1,a in Fig. 1); and when REV(x_i, y_i, r_i) < 0, the rationale does not support the label (e.g., r̂1,b in Fig. 1). REV can assign a positive score to a rationale for an incorrect prediction as long as the rationale supports it and provides additional information beyond a vacuous baseline rationale (e.g., r̂2 in Fig. 1). Thus, REV cannot be seen as a replacement for prediction accuracy, but rather as an orthogonal metric to interpret the usefulness of a generated rationale for the model decision.
3 Experimental Setup
We outline our experimental setup by describing the reasoning tasks and datasets (§3.1), followed by the task and evaluation models (§3.2), and the baseline metrics for comparison (§3.3). Additional details on the setup are provided in Appendix B.
3.1 Datasets
We explore two reasoning tasks, namely CommonsenseQA (CQA) and Natural Language Inference (NLI), across four datasets, all containing human-annotated free-text rationales. For the CQA task, we use ECQA (Aggarwal et al., 2021), CoS-E (v1.11; Rajani et al., 2019), and QuaRTz (Tafjord et al., 2019). For both ECQA and CoS-E, each commonsense question is paired with five candidate choices and the task is to select an answer from the candidates. ECQA contains higher-quality human-written rationales than CoS-E (Aggarwal et al., 2021; Sun et al., 2022). QuaRTz targets open-domain reasoning about textual qualitative relationships; the task is to select an answer to a question from two options, based on a piece of textual qualitative knowledge that serves as the rationale. For the NLI task, we use the e-SNLI dataset (Camburu et al., 2018), which contains explanations for SNLI (Bowman et al., 2015); here the task is, given a premise, to predict whether a hypothesis entails, contradicts, or is neutral to it. More details on the datasets are in Appendix B.1.
3.2 Task and Evaluation Models
Task models. We choose T5 Large (Raffel et al., 2020) as the task model (fine-tuned on ground-truth labels and rationales) to produce generated rationale-label pairs under three settings (a sketch of the corresponding input/output serialization follows the list):

XY→R: Given an input text and the ground-truth label, generate a rationale.

X→YR: Given an input text, generate a label followed by a rationale. Since T5 decodes tokens sequentially, each R is generated conditioned on the predicted Y.

X→RY: Given an input text, generate a rationale followed by a label. Here, we compute a likelihood for each candidate Y conditioned on R, and then select the most probable candidate. This operation can improve the model's prediction accuracy, while weakening the consistency and relevance between the generated rationales and predicted labels.
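As a concrete illustration, here is a minimal sketch of how the three input/output formats might be serialized for a seq2seq model; the textual templates and prefixes below are our own assumptions, not the exact prompts used in the paper:

```python
from typing import Optional, Tuple

def format_example(setting: str, x: str, y: Optional[str] = None,
                   r: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Serialize one example as (source, target) strings for a seq2seq model.

    setting: one of "XY->R", "X->YR", "X->RY".
    The templates below are illustrative assumptions only.
    """
    if setting == "XY->R":
        # input text and gold label in; rationale out
        return f"explain: {x} answer: {y}", r
    if setting == "X->YR":
        # input text in; label followed by rationale out
        return f"predict and explain: {x}", f"{y} because {r}"
    if setting == "X->RY":
        # input text in; rationale followed by label out (at inference, each
        # candidate label is scored conditioned on the generated rationale and
        # the most probable candidate is selected)
        return f"explain then predict: {x}", f"{r} so the answer is {y}"
    raise ValueError(f"unknown setting: {setting}")
```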
After training, we collect three types of rationale-label pairs by applying the three task models to the test set of each dataset. In addition to these three settings, we also evaluate ground-truth labels paired with crowd-sourced rationales (Y; R).
Constructing a Baseline with Vacuous Rationales. Given an input x and a label y (ground-truth or model-generated), we construct a baseline rationale b by declaratively combining x and y into a sentence. For the CQA task, we adopt a T5-3B model fine-tuned on a set of (question, answer, declarative sentence) tuples (Demszky et al., 2018), following Chen et al. (2021b).⁵ For the NLI task, we first use a template to convert the (premise, hypothesis, label) tuple into a baseline rationale: "premise implies/contradicts/is not related to hypothesis". We then paraphrase these templated, vacuous NLI rationales using a pre-trained paraphrasing model⁶ in order to prevent the evaluators from learning the template patterns; a small code sketch of the templating step is given below. Table 1 shows examples of constructed vacuous baseline rationales.

Table 1: Examples of constructed vacuous baseline rationales for the CQA and NLI tasks. For NLI, the vacuous baseline rationale was obtained after paraphrasing.

Task | Input | Label | Vacuous Baseline Rationale
CQA | Where can personal mushrooms be kept fresh? | refrigerator | Personal mushrooms can be kept fresh in the refrigerator.
NLI | Premise: A dog running in the surf. Hypothesis: A dog is at the beach. | entailment | A dog running in the surf indicates a dog is at the beach.

⁵ https://github.com/jifan-chen/QA-Verification-Via-NLI
⁶ https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base
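A minimal sketch of the NLI templating step described above; the connective wording follows the template in the text, while the paraphrasing step is only noted in a comment:

```python
# Connectives used in the vacuous NLI baseline template described above.
LABEL_TO_CONNECTIVE = {
    "entailment": "implies",
    "contradiction": "contradicts",
    "neutral": "is not related to",
}

def nli_vacuous_rationale(premise: str, hypothesis: str, label: str) -> str:
    """Declaratively combine premise, hypothesis, and label into a baseline rationale."""
    connective = LABEL_TO_CONNECTIVE[label]
    # In the paper's setup, this templated sentence is additionally paraphrased
    # with a pretrained paraphraser so the evaluators cannot memorize the template.
    return f"{premise.rstrip('.')} {connective} {hypothesis.rstrip('.')}."

print(nli_vacuous_rationale("A dog running in the surf.", "A dog is at the beach.", "entailment"))
# -> A dog running in the surf implies A dog is at the beach.
```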
Training Evaluation Models, g and g′. We train two evaluation models, g and g′, which take [r, b] and b as inputs, respectively (see Equation 5 in §2). Both evaluators are based on fine-tuning T5 Large (Raffel et al., 2020) models. We use the training set D_train = {(x, y, r)}, where {y} and {r} are gold labels and human-annotated rationales, respectively. We construct baseline rationales {b} based on {(x, y)}. The objective is to maximize the log-likelihood of y given [r, b] or b. After training, the evaluation models are applied to evaluate a rationale-label pair (y, r) w.r.t. an input x. The rationale-label pair (y, r) can be model-generated and the label may not be the ground truth (e.g., y2 in Fig. 1), while REV is still able to assess the rationale along the two dimensions (§1). We refer readers to Appendix B.3 for results using T5 Base, BART Large (Lewis et al., 2020), and GPT-2 Large (Radford et al., 2019) as evaluation model architectures.
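A minimal sketch of how the training pairs for the two evaluators could be assembled; the string serialization of [r, b] is an assumption on our part:

```python
def evaluator_training_pairs(train_set, make_baseline):
    """Assemble (source, target) pairs for the two evaluators (sketch).

    train_set: iterable of (x, y, r) with gold labels y and human-written rationales r.
    make_baseline: combines x and y into a vacuous baseline rationale b.
    g is trained on [r, b] -> y and g' on b -> y, each maximizing the
    log-likelihood of the gold label.
    """
    pairs_g, pairs_g_prime = [], []
    for x, y, r in train_set:
        b = make_baseline(x, y)
        pairs_g.append((f"{r} {b}", y))    # g sees the rationale concatenated with the baseline
        pairs_g_prime.append((b, y))       # g' sees only the baseline
    return pairs_g, pairs_g_prime
```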
3.3 Other Metrics for Rationale Evaluation
We compare with two existing automatic metrics for free-text rationale evaluation: LAS (Hase et al., 2020) and RQ (Wiegreffe et al., 2021). Analogous to our evaluation models, both approaches use proxy models; we use the same architecture (T5 Large) across metrics in our reported results.
Leakage-Adjusted Simulatability (LAS). Hase et al. (2020) evaluate the quality of free-text rationales via a proxy model, trained with the task model outputs as labels and the original input texts combined with rationales as input sequences. The metric computes the difference between the proxy's prediction accuracy on the predicted label when the rationale is included in the input and when it is not, i.e., 1[ŷ | x, r̂] − 1[ŷ | x], averaged over examples grouped by whether they leak the label or not. The final LAS score is given by the macro average across groups.
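A simplified sketch of this computation, assuming a proxy-prediction helper and a label-leakage indicator (both hypothetical names, not from the original implementation):

```python
from statistics import mean

def las_score(examples, proxy_predict, leaks_label):
    """Leakage-adjusted simulatability (simplified sketch).

    examples: iterable of (x, y_hat, r_hat) with task-model predictions y_hat
              and generated rationales r_hat.
    proxy_predict(text) -> label: a proxy model trained to simulate the task model.
    leaks_label(r_hat, y_hat) -> bool: whether the rationale states the label outright.
    """
    groups = {True: [], False: []}
    for x, y_hat, r_hat in examples:
        with_r = int(proxy_predict(f"{x} {r_hat}") == y_hat)   # 1[y_hat | x, r_hat]
        without_r = int(proxy_predict(x) == y_hat)             # 1[y_hat | x]
        groups[bool(leaks_label(r_hat, y_hat))].append(with_r - without_r)
    # macro-average of the mean simulatability gap in each leakage group
    return mean(mean(g) for g in groups.values() if g)
```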
Rationale Quality (RQ). Wiegreffe et al. (2021) propose a variant of the simulatability metric of Hase et al. (2020). The main difference is that gold labels are used to train the proxy model and evaluate rationale quality. Specifically, the quality of a rationale r̂ is measured as 1[y | x, r̂] − 1[y | x], where y is the gold label. RQ is the average score over all test examples, without considering label leakage.
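The per-example computation differs from LAS only in using gold labels and skipping the leakage grouping; a sketch under the same assumptions as above:

```python
def rq_score(examples, proxy_predict):
    """Rationale quality (sketch): average gap in proxy accuracy on gold labels."""
    gaps = [
        int(proxy_predict(f"{x} {r_hat}") == y) - int(proxy_predict(x) == y)
        for x, y, r_hat in examples  # y is the gold label here
    ]
    return sum(gaps) / len(gaps)
```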
4 Evaluating REV
We first compare REV with existing metrics (§4.1) and human judgments (§4.2) on the ECQA dataset, and also report REV on other CQA and NLI benchmarks. We then test the sensitivity of different metrics to input perturbations (§4.3). Next, we apply REV to generations obtained via few-shot prompting (§4.4). Additional experiments are listed in Appendix C.

4.1 Comparison Between Evaluation Metrics

We compare REV with LAS and RQ in evaluating different rationale-label pairs on the ECQA dataset. In addition to XY→R, X→YR, X→RY, and (Y; R), we also explore the evaluation of the vacuous baseline rationales (Y; B) that are constructed with ground-truth labels. LAS, RQ and REV are not directly comparable due to different comparison scales and criteria (e.g., log-probability