REV: Information-Theoretic Evaluation of Free-Text Rationales

Hanjie Chen♡∗  Faeze Brahman♠♢  Xiang Ren♠♣  Yangfeng Ji♡  Yejin Choi♠♢  Swabha Swayamdipta♣
♡Department of Computer Science, University of Virginia
♠Allen Institute for AI  ♣University of Southern California
♢Paul G. Allen School of Computer Science & Engineering, University of Washington
{hc9mx,yangfeng}@virginia.edu  {faezeb,xiangr,yejinc}@allenai.org  swabhas@usc.edu
∗Work done during an internship at AI2.
Abstract
Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information) to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought, demonstrate the effectiveness of REV in evaluating rationale-label pairs, compared to existing metrics. We further demonstrate REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models' reasoning and prediction processes.¹

¹ Our code is publicly available at https://github.com/HanjieChen/REV
1 Introduction
Model explanations have been indispensable for trust and interpretability in natural language processing (NLP) (Ribeiro et al., 2016, 2020; Lipton, 2018; Chen et al., 2020, 2021a). Free-text rationales, which explain a model prediction in natural language, have been especially appealing due to their flexibility in eliciting the reasoning process behind the model's decision making (Camburu et al., 2018; Narang et al., 2020; Rajani et al., 2019; Kumar and Talukdar, 2020; Brahman et al., 2021), making them closer to human explanations. However, existing metrics for free-text rationale evaluation remain narrowly focused on the extent to which a rationale can help a (proxy) model predict the label it explains (i.e., they are accuracy based) (Hase et al., 2020; Wiegreffe et al., 2021). These metrics offer little understanding of the new information contained in the rationale, added to the original input, that could explain why the label is selected, which is the very purpose a rationale is designed to serve. For instance, the two rationales r1 and r̂1,a in Fig. 1 would be considered equally valuable under existing metrics, even though they supply different amounts of novel and relevant information.

In this paper, we overcome this shortcoming by introducing an automatic evaluation for free-text rationales along two dimensions: (1) whether the rationale supports (i.e., is predictive of) the intended label, and (2) how much new information it provides to justify the label, beyond what is contained in the input. For example, rationale r̂1,b in Fig. 1 violates (1) because it is not predictive of the label, "enjoy nature". Rationale r̂1,a does support the label but contains no new information that justifies it beyond what is stated in the input x; thus, it violates (2). Rationale r1 satisfies both dimensions: it supports the label and does so by providing new and relevant information beyond what is in the input. Our proposed evaluation is designed to penalize both r̂1,a and r̂1,b, while rewarding rationales like r1.
We introduce REV,² which adapts an information-theoretic framework from Xu et al. (2020) for evaluating free-text rationales along the two dimensions mentioned above. Specifically, REV is based on conditional V-information (Hewitt et al., 2021), which quantifies the degree of information contained in a representation beyond another (baseline) representation, accessible to a model family V. As our baseline representation, we consider any vacuous rationale which simply (and declaratively) combines an input with a given label, without providing any new information relevant to answering why the label was chosen. REV adapts conditional V-information to evaluate rationales, where we compare two representations: one from an evaluation model trained to produce the label given the input and the rationale, and the other from another evaluation model for the same task but considering only the input (disguised as a vacuous rationale). Other metrics do not take vacuous rationales into consideration, and are hence unable to measure new and label-relevant information in rationales.

² For Rationale Evaluation with conditional V-information.

Figure 1: Our evaluation framework for different free-text rationales (r). r1 is a human-written rationale; r̂1,a and r̂1,b are two generated rationales for the true label y1. Our metric, REV, based on CVI (Hewitt et al., 2021), is able to distinguish all three rationales by measuring how much new and label-relevant information each adds over a vacuous rationale, b; performance-based evaluations can only distinguish between r̂1,a and r̂1,b. For an (arguably) incorrect label, y2, REV still gives a positive score, highlighting that r̂2 is able to provide new information for why it supports y2. Prediction accuracy can be augmented with REV to provide fuller interpretability of model decisions.
In our experiments, we present evaluations with REV for rationales under two reasoning tasks, commonsense question answering (CQA; Talmor et al., 2019) and natural language inference (NLI; Bowman et al., 2015), across four benchmarks. Several quantitative evaluations demonstrate the capabilities of REV in providing evaluations along new dimensions for free-text rationales, while also being more consistent with human judgements compared to existing metrics. We also provide comparisons to demonstrate the sensitivity of REV to various degrees of input perturbations. Additionally, evaluation with REV offers insights into why rationales obtained through chain-of-thought prompting (Wei et al., 2022) do not necessarily improve prediction performance.
2 REV: Information-Theoretic Evaluation of Rationales

We introduce a new metric, REV (Rationale Evaluation with conditional V-information), for the evaluation of free-text rationales along the proposed dimensions (§2.2), based on the framework of conditional V-information (§2.1).
We consider the setting where we have input X ∈ 𝒳, label Y ∈ 𝒴, and free-text rationale R ∈ ℛ generated for label Y. A common strategy to evaluate rationale R is through an evaluator function f : 𝒵 → 𝒴, which maps a variable Z to a label distribution. Here, Z can be defined based on the evaluation framework; e.g., Z can be a concatenation of X and R, or only X. These metrics evaluate the utility of R based on how much R helps f predict Y. The evaluator f is typically trained on a set of input, label, and rationale triples D_train = {(x_j, y_j, r_j)}, and applied to D_test = {(x_i, y_i, r_i)} for evaluation. The utility of R is formulated as the difference between the performance of the evaluator on predicting Y with R and without it, i.e.,

Perf[f(Y | X, R)] − Perf[f(Y | X)],   (1)

where a larger performance gap indicates a better rationale. Existing metrics (Hase et al., 2020; Wiegreffe et al., 2021) compute the performance gap based on prediction accuracies.
However, accuracy-based evaluation can only indicate whether or not a rationale is predictive of a label, but cannot quantify how much new information the rationale provides to justify the label. Figure 1 illustrates this issue via an example. Here, accuracy-based evaluation can distinguish between r̂1,a and r̂1,b since r̂1,a supports y1 and r̂1,b does not. However, it is unable to distinguish between r1 and r̂1,a (since both are predictive of y1), despite the fact that r̂1,a does not provide any unique and relevant information to answer why the label should be y1. In practice, vacuous rationales such as r̂1,a are commonly seen in model generations (Sun et al., 2022; Wiegreffe and Marasovic, 2021). This calls for an evaluation metric which is able to identify and penalize such vacuous rationales.
2.1 An Information-Theoretic Perspective on Rationale Evaluation

The key quantity of interest for our evaluation of rationale R is the amount of new information expressed in R (e.g., background knowledge, reasoning process) that can justify a label Y. The mutual information between R and Y, I(Y; R), can be helpful for evaluating this quantity. However, we are not interested in the information that is already captured in the input X. A vacuous rationale, such as r̂1,a in Fig. 1, which simply combines the input X and the label Y declaratively, captures all the information in X and Y without specifying any new information that helps understand why Y has been chosen for X. We denote such rationales as B. Thus, we argue that a good evaluation metric must be able to measure the amount of new and label-relevant information contained in a rationale beyond what is contained in any vacuous rationale B that leads to the prediction of Y. The new information in R beyond what is available in B can then be grounded in conditional mutual information (Shannon, 1948) as follows:

I(Y; R | B) = I(Y; R, B) − I(Y; B),   (2)

where the difference between the two information quantities parallels the performance gap in Equation 1.
Directly computing mutual information, however, is challenging because the true distributions of the random variables are usually unknown, and we do not have unbounded computation. A recently introduced information-theoretic framework called V-information circumvents this by restricting the computation to certain predictive model families V (Xu et al., 2020). Given a model family V that maps two random variables R and Y, V-information defines the usable information that can be extracted from R by models in V to predict Y, i.e., I_V(R → Y). If V generalizes to the set of all possible functions, then V-information is mutual information (Shannon, 1948). In practice, it is feasible to estimate the usable information from R about Y by selecting any neural model without frozen parameters as V.³

³ Please see Xu et al. (2020) for a detailed discussion of properties, such as optional ignorance, that a predictive family V must follow.
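For reference, the unconditional definition underlying this framework (our paraphrase of Xu et al., 2020, with notation matched to this section) can be written as

I_V(R → Y) = H_V(Y | ∅) − H_V(Y | R),   with   H_V(Y | R) = inf_{f ∈ V} E[−log f[r](y)],

where ∅ denotes a null input, so H_V(Y | ∅) is the V-entropy of Y when no side information is available, and I_V(R → Y) measures how much that entropy drops once models in V can condition on R.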
Our approach to evaluating rationales builds on a modification of this framework for conditional information by Hewitt et al. (2021), as described below.
Conditional V-information. Following conditional mutual information in information theory (Cover and Thomas, 2006), V-information has been extended to conditional V-information (CVI; Hewitt et al., 2021). CVI quantifies the V-usable information in R about Y conditioned on a variable B, i.e.,

I_V(R → Y | B) = H_V(Y | B) − H_V(Y | R, B).

Here B is any vacuous rationale that leads to the prediction of Y. In this work, we consider B simply as the declarative combination of X and Y. H_V(· | ·) is the conditional V-entropy (Xu et al., 2020; Hewitt et al., 2021; Ethayarajh et al., 2022), defined as

H_V(Y | B) = inf_{f ∈ V} E[−log f[b](y)],   (3)
H_V(Y | R, B) = inf_{f ∈ V} E[−log f[r, b](y)],   (4)

where f[b] and f[r, b] produce a probability distribution over the labels given b and [r, b] as inputs, respectively.⁴ Further, given g′, g ∈ V which optimize Equations 3 and 4, respectively, we consider the pointwise CVI for individual triples (r, y, b):

−log g′[b](y) + log g[r, b](y).   (5)

⁴ [r, b] is the concatenation of r and b. Please see Appendix A for further details on CVI.
2.2 Computing REV for Rationale Evaluation

Building on the framework of CVI, we propose a new metric, REV, for Rationale Evaluation with conditional V-information. We compute REV over a given test set, D_test = {(x_i, y_i, r_i)}, by estimating CVI over the set with evaluation models g, g′ ∈ V. For a test example (x, y, r), the REV score, denoted REV(x, y, r), is computed based on Equation 5, where b is constructed by combining x and y:

REV(x, y, r) = −log g′[b](y) + log g[r, b](y).

The REV score for the entire test corpus D_test is given by the average pointwise REV score:

REV_D = (1 / |D_test|) Σ_{i=1}^{|D_test|} REV(x_i, y_i, r_i).   (6)
Algorithm 1: Computing REV Scores
1: Input: evaluation models g and g′, test set D_test = {(x_i, y_i, r_i)}
2: Initialize an empty list S
3: for (x_i, y_i, r_i) ∈ D_test do
4:   Construct the baseline rationale b_i
5:   REV(x_i, y_i, r_i) = −log g′[b_i](y_i) + log g[r_i, b_i](y_i)
6:   S.add(REV(x_i, y_i, r_i))
7: end for
8: REV_D = mean(S)
9: Output: S, REV_D
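A minimal sketch of this computation in Python, assuming the two fine-tuned evaluators are exposed through hypothetical scoring helpers (the helper names and the way r and b are concatenated are our own assumptions, not part of the released code):

```python
from statistics import mean

def rev_scores(score_g, score_g_prime, test_set, make_baseline):
    """Algorithm 1 (sketch): pointwise and corpus-level REV.

    score_g(context, label) / score_g_prime(context, label): hypothetical helpers
    returning the log-probability each fine-tuned evaluator assigns to `label`
    given `context` (e.g., summed token log-probabilities under teacher forcing).
    test_set: iterable of (x, y, r) triples.
    make_baseline: combines x and y into a vacuous baseline rationale b.
    """
    scores = []
    for x, y, r in test_set:
        b = make_baseline(x, y)                       # vacuous baseline rationale
        log_p_with_r = score_g(f"{r} {b}", y)         # log g[r, b](y)
        log_p_baseline = score_g_prime(b, y)          # log g'[b](y)
        scores.append(log_p_with_r - log_p_baseline)  # pointwise REV, Eq. (5)
    return scores, mean(scores)                       # REV_D: average over the test set
```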
Algorithm 1 shows the process of computing both pointwise and aggregate REV scores. The higher the REV score, the more additional (new and relevant) information the rationale r contains to explain the label beyond the baseline rationale b. REV(x_i, y_i, r_i) can take positive, negative, or zero values. When REV(x_i, y_i, r_i) > 0, the rationale supplies additional new information supporting the label (e.g., r1 in Fig. 1); when REV(x_i, y_i, r_i) = 0, the rationale provides no additional information beyond the baseline (e.g., r̂1,a in Fig. 1); and when REV(x_i, y_i, r_i) < 0, the rationale does not support the label (e.g., r̂1,b in Fig. 1). REV can assign a positive score to a rationale for an incorrect prediction as long as the rationale supports it and provides additional information beyond a vacuous baseline rationale (e.g., r̂2 in Fig. 1). Thus, REV cannot be seen as a replacement for prediction accuracy, but rather as an orthogonal metric to interpret the usefulness of a generated rationale for the model decision.
3 Experimental Setup
We outline our experimental setup by describing the reasoning tasks and datasets (§3.1), followed by the task and evaluation models (§3.2), and the baseline metrics for comparison (§3.3). Additional details on the setup are provided in Appendix B.
3.1 Datasets
We explore two reasoning tasks, namely CommonsenseQA (CQA) and Natural Language Inference (NLI), across four datasets, all containing human-annotated free-text rationales. For the CQA task, we use ECQA (Aggarwal et al., 2021), CoS-E (v1.11; Rajani et al., 2019), and QuaRTz (Tafjord et al., 2019). For both ECQA and CoS-E, each commonsense question is paired with five candidate choices and the task is to select an answer from the candidates. ECQA contains higher-quality human-written rationales than CoS-E (Aggarwal et al., 2021; Sun et al., 2022). QuaRTz targets open-domain reasoning about textual qualitative relationships; the task is to select an answer to a question from two options, based on a piece of textual qualitative knowledge that serves as the rationale. For the NLI task, we use the e-SNLI dataset (Camburu et al., 2018), which contains explanations for SNLI (Bowman et al., 2015); here the task is, given a premise, to predict whether a hypothesis entails, contradicts, or is neutral to it. More details on the datasets are in Appendix B.1.
3.2 Task and Evaluation Models
Task models. We choose T5 Large (Raffel et al., 2020) as the task model (fine-tuned on ground-truth labels and rationales) to produce generated rationale-label pairs under three settings (a sketch of the corresponding input/output serialization follows the list):

XY→R: Given an input text and the ground-truth label, generate a rationale.

X→YR: Given an input text, generate a label followed by a rationale. Since T5 decodes tokens sequentially, each R is generated conditioned on the predicted Y.

X→RY: Given an input text, generate a rationale followed by a label. Here, we compute a likelihood for each candidate Y conditioned on R, and then select the most probable candidate. This operation can improve the model's prediction accuracy, while weakening the consistency and relevance between the generated rationales and predicted labels.
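As a concrete illustration, here is a minimal sketch of how the three input/output formats might be serialized for a seq2seq model; the textual templates and prefixes below are our own assumptions, not the exact prompts used in the paper:

```python
from typing import Optional, Tuple

def format_example(setting: str, x: str, y: Optional[str] = None,
                   r: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Serialize one example as (source, target) strings for a seq2seq model.

    setting: one of "XY->R", "X->YR", "X->RY".
    The templates below are illustrative assumptions only.
    """
    if setting == "XY->R":
        # input text and gold label in; rationale out
        return f"explain: {x} answer: {y}", r
    if setting == "X->YR":
        # input text in; label followed by rationale out
        return f"predict and explain: {x}", f"{y} because {r}"
    if setting == "X->RY":
        # input text in; rationale followed by label out (at inference, each
        # candidate label is scored conditioned on the generated rationale and
        # the most probable candidate is selected)
        return f"explain then predict: {x}", f"{r} so the answer is {y}"
    raise ValueError(f"unknown setting: {setting}")
```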
After training, we collect three types of rationale-label pairs by applying the three task models to the test set of each dataset. In addition to these three settings, we also evaluate ground-truth labels paired with crowd-sourced rationales (Y; R).
Constructing a Baseline with Vacuous Rationales. Given an input x and a label y (ground-truth or model-generated), we construct a baseline rationale b by declaratively combining x and y into a sentence. For the CQA task, we adopt a T5-3B model fine-tuned on a set of (question, answer, declarative sentence) tuples (Demszky et al., 2018), following Chen et al. (2021b).⁵ For the NLI task, we first use a template to convert the (premise, hypothesis, label) tuple into a baseline rationale: "premise implies/contradicts/is not related to hypothesis". We then paraphrase these templated, vacuous NLI rationales using a pre-trained paraphrasing model⁶ in order to prevent the evaluators from learning the template patterns; a small code sketch of the templating step is given below. Table 1 shows examples of constructed vacuous baseline rationales.

Table 1: Examples of constructed vacuous baseline rationales for the CQA and NLI tasks. For NLI, the vacuous baseline rationale was obtained after paraphrasing.

Task | Input | Label | Vacuous Baseline Rationale
CQA | Where can personal mushrooms be kept fresh? | refrigerator | Personal mushrooms can be kept fresh in the refrigerator.
NLI | Premise: A dog running in the surf. Hypothesis: A dog is at the beach. | entailment | A dog running in the surf indicates a dog is at the beach.

⁵ https://github.com/jifan-chen/QA-Verification-Via-NLI
⁶ https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base
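A minimal sketch of the NLI templating step described above; the connective wording follows the template in the text, while the paraphrasing step is only noted in a comment:

```python
# Connectives used in the vacuous NLI baseline template described above.
LABEL_TO_CONNECTIVE = {
    "entailment": "implies",
    "contradiction": "contradicts",
    "neutral": "is not related to",
}

def nli_vacuous_rationale(premise: str, hypothesis: str, label: str) -> str:
    """Declaratively combine premise, hypothesis, and label into a baseline rationale."""
    connective = LABEL_TO_CONNECTIVE[label]
    # In the paper's setup, this templated sentence is additionally paraphrased
    # with a pretrained paraphraser so the evaluators cannot memorize the template.
    return f"{premise.rstrip('.')} {connective} {hypothesis.rstrip('.')}."

print(nli_vacuous_rationale("A dog running in the surf.", "A dog is at the beach.", "entailment"))
# -> A dog running in the surf implies A dog is at the beach.
```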
Training Evaluation Models, g and g′. We train two evaluation models, g and g′, which take [r, b] and b as inputs, respectively (see Equation 5 in §2). Both evaluators are based on fine-tuning T5 Large (Raffel et al., 2020) models. We use the training set D_train = {(x, y, r)}, where {y} and {r} are gold labels and human-annotated rationales, respectively. We construct baseline rationales {b} based on {(x, y)}. The objective is to maximize the log-likelihood of y given [r, b] or b. After training, the evaluation models are applied to evaluate a rationale-label pair (y, r) w.r.t. an input x. The rationale-label pair (y, r) can be model-generated and the label may not be the ground truth (e.g., y2 in Fig. 1), while REV is still able to assess the rationale along the two dimensions (§1). We refer readers to Appendix B.3 for results using T5 Base, BART Large (Lewis et al., 2020), and GPT-2 Large (Radford et al., 2019) as evaluation model architectures.
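A minimal sketch of how the training pairs for the two evaluators could be assembled; the string serialization of [r, b] is an assumption on our part:

```python
def evaluator_training_pairs(train_set, make_baseline):
    """Assemble (source, target) pairs for the two evaluators (sketch).

    train_set: iterable of (x, y, r) with gold labels y and human-written rationales r.
    make_baseline: combines x and y into a vacuous baseline rationale b.
    g is trained on [r, b] -> y and g' on b -> y, each maximizing the
    log-likelihood of the gold label.
    """
    pairs_g, pairs_g_prime = [], []
    for x, y, r in train_set:
        b = make_baseline(x, y)
        pairs_g.append((f"{r} {b}", y))    # g sees the rationale concatenated with the baseline
        pairs_g_prime.append((b, y))       # g' sees only the baseline
    return pairs_g, pairs_g_prime
```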
3.3 Other Metrics for Rationale Evaluation
We compare with two existing automatic metrics for free-text rationale evaluation: LAS (Hase et al., 2020) and RQ (Wiegreffe et al., 2021). Analogous to our evaluation models, both approaches use proxy models; we use the same architecture (T5 Large) across metrics in our reported results.
Leakage-Adjusted Simulatability (LAS). Hase et al. (2020) evaluate the quality of free-text rationales via a proxy model, trained with the task model outputs as labels and the original input texts combined with rationales as input sequences. The metric computes the difference between the proxy's prediction accuracy on the predicted label when the rationale is included in the input and when it is not, i.e., 1[ŷ | x, r̂] − 1[ŷ | x], averaged over examples grouped by whether they leak the label or not. The final LAS score is given by the macro average across groups.
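A simplified sketch of this computation, assuming a proxy-prediction helper and a label-leakage indicator (both hypothetical names, not from the original implementation):

```python
from statistics import mean

def las_score(examples, proxy_predict, leaks_label):
    """Leakage-adjusted simulatability (simplified sketch).

    examples: iterable of (x, y_hat, r_hat) with task-model predictions y_hat
              and generated rationales r_hat.
    proxy_predict(text) -> label: a proxy model trained to simulate the task model.
    leaks_label(r_hat, y_hat) -> bool: whether the rationale states the label outright.
    """
    groups = {True: [], False: []}
    for x, y_hat, r_hat in examples:
        with_r = int(proxy_predict(f"{x} {r_hat}") == y_hat)   # 1[y_hat | x, r_hat]
        without_r = int(proxy_predict(x) == y_hat)             # 1[y_hat | x]
        groups[bool(leaks_label(r_hat, y_hat))].append(with_r - without_r)
    # macro-average of the mean simulatability gap in each leakage group
    return mean(mean(g) for g in groups.values() if g)
```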
Rationale Quality (RQ). Wiegreffe et al. (2021) propose a variant of the simulatability metric of Hase et al. (2020). The main difference is that gold labels are used to train the proxy model and evaluate rationale quality. Specifically, the quality of a rationale r̂ is measured as 1[y | x, r̂] − 1[y | x], where y is the gold label. RQ is the average score over all test examples, without considering label leakage.
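The per-example computation differs from LAS only in using gold labels and skipping the leakage grouping; a sketch under the same assumptions as above:

```python
def rq_score(examples, proxy_predict):
    """Rationale quality (sketch): average gap in proxy accuracy on gold labels."""
    gaps = [
        int(proxy_predict(f"{x} {r_hat}") == y) - int(proxy_predict(x) == y)
        for x, y, r_hat in examples  # y is the gold label here
    ]
    return sum(gaps) / len(gaps)
```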
4 Evaluating REV
We first compare REV with existing metrics (§4.1) and human judgments (§4.2) on the ECQA dataset, and also report REV on other CQA and NLI benchmarks. We then test the sensitivity of different metrics to input perturbations (§4.3). Next, we apply REV to generations obtained via few-shot prompting (§4.4). Additional experiments are listed in Appendix C.

4.1 Comparison Between Evaluation Metrics

We compare REV with LAS and RQ in evaluating different rationale-label pairs on the ECQA dataset. In addition to XY→R, X→YR, X→RY, and (Y; R), we also explore the evaluation of the vacuous baseline rationales (Y; B) that are constructed with ground-truth labels. LAS, RQ and REV are not directly comparable due to different comparison scales and criteria (e.g., log-probability