Does Self-Rationalization Improve Robustness to Spurious Correlations?
Alexis Ross†∗   Matthew E. Peters‡   Ana Marasović§∗
†Massachusetts Institute of Technology, Cambridge, MA, USA
‡Allen Institute for AI, Seattle, WA, USA
§University of Utah, Salt Lake City, UT, USA
alexisro@mit.edu matthewp@allenai.org ana.marasovic@utah.edu
Abstract
Rationalization is fundamental to human reasoning and learning. NLP models trained to produce rationales along with predictions, called self-rationalization models, have been investigated for their interpretability and utility to end-users. However, the extent to which training with human-written rationales facilitates learning remains an under-explored question. We ask whether training models to self-rationalize can aid in their learning to solve tasks for the right reasons. Specifically, we evaluate how training self-rationalization models with free-text rationales affects robustness to spurious correlations in fine-tuned encoder-decoder and decoder-only models of six different sizes. We evaluate robustness to spurious correlations by measuring performance on 1) manually annotated challenge datasets and 2) subsets of original test sets where reliance on spurious correlations would fail to produce correct answers. We find that while self-rationalization can improve robustness to spurious correlations in low-resource settings, it tends to hurt robustness in higher-resource settings. Furthermore, these effects depend on model family and size, as well as on rationale content. Together, our results suggest that explainability can come at the cost of robustness; thus, appropriate care should be taken when training self-rationalizing models with the goal of creating more trustworthy models.
1 Introduction
Rationalization—the process of explaining the reasoning used to come to a particular decision—plays a pivotal role in human inference and learning (Lombrozo, 2016). For these reasons, there has been a growing interest in producing NLP models that can output rationales¹ for their predictions. Models that output such rationales have multiple benefits: First, they are more interpretable and easier to interact with for end-users than non-rationalizing models (Alvarez-Melis and Jaakkola, 2018). Second, such intermediate rationalization can offer learning benefits, such as achieving comparable performance with less data and improving out-of-distribution generalization (Nye et al., 2021; Wei et al., 2022; Zelikman et al., 2022).

∗Work undertaken while Alexis Ross and Ana Marasović were at the Allen Institute for AI.
Our code is publicly available at https://github.com/allenai/rationale_robustness.
¹Prior work has used the terms “explanation” and “rationale” interchangeably. In this work, we use the word “rationale” for consistency with “self-rationalization” models.
However, the question of whether training models to rationalize can help them learn how to solve tasks for the right reasons remains open. In particular, rationales encode information about the underlying reasoning humans use to reach answers, which raises the question: Does incorporating such rationales into training allow models to rely on human-aligned reasoning rather than spurious feature interactions? If so, training with rationales could offer a pathway towards creating more robust, trustworthy, or cognitively plausible models.
In this work, we explore this question by empirically investigating whether training models with human-written rationales can help make them more robust to spurious correlations in data. We analyze a class of models called self-rationalization models—which jointly output free-text rationales along with predictions—and focus specifically on the fine-tuning setting, in which prior work has found reliance on spurious correlations to emerge (Utama et al., 2021).

We evaluate six models of varying architectures and sizes across two tasks, natural language inference and commonsense question answering. Our main results are as follows:
1. While the effects of training with rationales are model- and task-specific, when it improves robustness to spurious correlations, it tends to be in lower-resource settings. In higher-resource settings, training with rationales can hurt robustness (§4.1).
2. Within model families, larger models benefit more in robustness from rationales (§4.2).
3. The effects of self-rationalization on robustness are not fully explained by its effects on in-domain task performance (§4.3).
4. The content of rationales used during training influences both task performance and robustness to spurious correlations (§4.4).
Our results suggest that straightforward self-rationalization training does not always facilitate learning to solve a task for the right reasons. Instead, the effects of self-rationalization on robustness to spurious correlations depend on a multitude of factors. Thus, appropriate care should be taken when training models to self-rationalize with the goal of creating trustworthy models.
2 Related Work
Learning to rationalize
Two classes of approaches to producing models that can rationalize their predictions include self-rationalization models,² which are fully differentiable and output free-text rationales along with task predictions, and pipeline models, which consist of two components—one that produces rationales, and a second that makes predictions from those rationales (Wiegreffe et al., 2021).³ Such methods are typically evaluated by the faithfulness and plausibility of their rationales, where faithfulness represents the extent to which a model actually relied on the rationale in making its prediction, and plausibility indicates human judgment of how well the rationale explains the output (DeYoung et al., 2020).

In contrast to these works, which aim to improve model interpretability through new methods for rationalizing models, we ask to what extent existing methods affect model robustness to spurious correlations. We conduct our analysis on self-rationalization models, which have been found to achieve better task performance and produce higher-quality rationales than do pipeline models (Wiegreffe et al., 2021; Camburu et al., 2018).
²Such approaches have also been referred to as explain-then-predict (Camburu et al., 2018) and rationalize-then-predict (Chen et al., 2022) models.
³See Wiegreffe et al. (2021) for a detailed discussion of pipeline and self-rationalization approaches to rationalization.
Learning from rationales
Recent work has explored the utility of rationales for improving end-task performance in in-context learning (Wei et al., 2022; Lampinen et al., 2022; Ye and Durrett, 2022) as well as in fine-tuning (Zaidan et al., 2007; Hancock et al., 2018; Camburu et al., 2018; Narang et al., 2020; Hase and Bansal, 2021; Nye et al., 2021; Zhao and Vydiswaran, 2021). Previous work has shown that training with both human-annotated rationales (Rajani et al., 2019) and rationales generated by language models (Paranjape et al., 2021) can increase in-domain task performance, particularly in low-resource settings (Bhat et al., 2021; Pruthi et al., 2022; Zelikman et al., 2022). Unlike these prior works, which study how training with rationales affects in-domain, end-task performance, we focus specifically on evaluating impact on robustness to spurious correlations.
Improving robustness with rationales
Most closely related are recent works that study how training with rationales affects model robustness. Stacey et al. (2022) propose a method of supervising attention weights with extractive rationales and show that this method leads to both in-distribution and out-of-distribution improvements for natural language inference. Schuster et al. (2021) find that training with contrastive extractive rationales improves robustness as measured by performance on adversarial evaluation sets. Concurrent work by Chen et al. (2022) investigates to what extent training models to extract rationales through pipelines improves their robustness to adversarial attacks.

In contrast to all three of these works, we focus on freeform rationales instead of extractive rationales and explore the impact of the amount of training data on robustness. In contrast to Schuster et al. (2021) and Chen et al. (2022), we analyze self-rationalization models instead of pipeline models and measure robustness to spurious correlations rather than robustness to adversarial attacks. While Stacey et al. (2022) evaluate robustness to spurious correlations for natural language inference with some of the same test sets, they work with masked language models and evaluate the effect of supervising model attention with rationales; in contrast, we work with encoder-decoder and decoder-only models of varying sizes and evaluate the effect of outputting rationales along with predictions. In addition, their analysis is limited to natural language inference, for which evaluation datasets targeting robustness exist; in contrast, we also experiment with commonsense question answering through new methods for evaluating robustness. In §4.1, we discuss the variance in results across different tasks and highlight the importance of cross-task evaluation.
3 Experiments
3.1 Experimental Set-Up
Models
We experiment with encoder-decoder and decoder-only models of varying sizes, ranging from 140 to 774 million parameters, as shown in Figures 1 and 2. Our encoder-decoder models build on pretrained T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) models, and our decoder-only models build on pretrained GPT2 (Radford et al., 2019) models. Our T5 models build specifically on the versions trained for an additional 100K steps on the language modeling objective after pretraining (Lester et al., 2021), as we aim to measure how the amount of training data impacts results, and the default T5 models have already been fine-tuned on the full SNLI training dataset.⁴

⁴For example, when experimenting with T5-BASE, we work specifically with t5-base-lm-adapt, available in huggingface at https://huggingface.co/google/t5-base-lm-adapt.
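For concreteness, the following is a minimal sketch of loading these pretrained backbones with the Hugging Face transformers library. Only t5-base-lm-adapt is named in the text; the other checkpoint identifiers are assumptions about which size variants span the stated 140M–774M parameter range, not a list taken from the paper.

```python
# Minimal sketch: loading the pretrained backbones described above.
# Only t5-base-lm-adapt is named in the text; the remaining checkpoint names
# are assumptions about which size variants cover the 140M-774M range.
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

ENCODER_DECODER_CHECKPOINTS = [
    "google/t5-base-lm-adapt",    # T5 with 100K extra LM-adaptation steps (Lester et al., 2021)
    "google/t5-large-lm-adapt",   # assumed larger T5 variant
    "facebook/bart-base",         # assumed BART variants
    "facebook/bart-large",
]
DECODER_ONLY_CHECKPOINTS = [
    "gpt2-medium",                # assumed GPT2 variants
    "gpt2-large",
]

def load_backbone(checkpoint: str):
    """Return (tokenizer, model) for either architecture family."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    if checkpoint in DECODER_ONLY_CHECKPOINTS:
        model = AutoModelForCausalLM.from_pretrained(checkpoint)
    else:
        model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    return tokenizer, model
```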
Tasks
We evaluate self-rationalization models on two tasks—natural language inference (NLI) and commonsense question answering (CQA)—for which human-annotated rationales already exist. For NLI, we train task models on SNLI (Bowman et al., 2015) and obtain rationales from ESNLI (Camburu et al., 2018). For CQA, we train task models on CQA (Talmor et al., 2019) and obtain rationales from ECQA (Aggarwal et al., 2021). Examples of inputs and outputs for both tasks are shown in Table 2. For CQA, unless otherwise specified, we train on the “positive” freeform rationales in ECQA, which explain why the gold answer is the correct answer for a given question. In §4.4, we explore the impact of training with the different forms of rationales shown in Table 2.
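As a rough illustration of obtaining this task data, the sketch below uses the Hugging Face datasets library. The dataset identifiers are assumptions about the public Hub copies (the paper does not specify a loading mechanism), and the ECQA rationales, which its authors distribute separately, are only noted in a comment rather than loaded.

```python
# Sketch: pulling the task data with the `datasets` library (assumed Hub IDs).
from datasets import load_dataset

# e-SNLI bundles the SNLI premise/hypothesis/label triples with free-text explanations.
esnli = load_dataset("esnli")

# CommonsenseQA provides the questions, answer choices, and gold answer keys.
# The ECQA free-text rationales are released separately by its authors and would
# be joined onto these examples by question id (not shown here).
cqa = load_dataset("commonsense_qa")

print(esnli["train"][0])  # e.g. premise, hypothesis, label, explanation fields
print(cqa["train"][0])    # e.g. question, choices, answerKey
```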
Rationales
For each task, we compare a baseline model trained solely to predict task labels with models trained to also self-rationalize. All self-rationalization models are trained to generate a rationale following the task label, as previous work has found that outputting rationales conditioned on labels leads to better performance than outputting labels conditioned on rationales in the fine-tuning setting (Schuff et al., 2021).
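The serialization below is a hypothetical stand-in for the templates in Table 2 (not reproduced in this excerpt); it is meant only to illustrate the key design choice that the rationale is generated after the task label, with the baseline target containing the label alone.

```python
# Illustrative sketch only: the exact input/target templates come from Table 2
# of the paper; this formatting is a hypothetical stand-in.

def nli_example(premise: str, hypothesis: str, label: str, rationale: str, with_rationale: bool):
    """Build one (input, target) pair for NLI."""
    source = f"premise: {premise} hypothesis: {hypothesis}"
    if with_rationale:
        # Self-rationalization target: label first, then the free-text rationale.
        target = f"{label} explanation: {rationale}"
    else:
        # Baseline target: label only.
        target = label
    return source, target

src, tgt = nli_example(
    premise="A dog is running through the snow.",
    hypothesis="An animal is outside.",
    label="entailment",
    rationale="A dog is an animal, and snow is outside.",
    with_rationale=True,
)
```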
Data
We experiment with different numbers of training examples n, as we seek to understand how training data size influences the impact of self-rationalization training on robustness to spurious correlations. We experiment with n ∈ {1K, 2.5K, 5K, 10K, 50K, 100K} for NLI and n ∈ {1K, 5K, 7,598} for CQA.⁵ For each training data amount n, we create validation data for checkpointing models by randomly sampling n/2 instances from the original task-only validation dataset, such that we perform model selection based on task performance across baseline and self-rationalization models. For self-rationalization models, we create training data by concatenating original task-only training input-output pairs with their rationale-extended counterparts, such that we have 2n training inputs obtained from n original instances.⁶
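A minimal sketch of this data construction is shown below, assuming helper functions (hypothetical names) that render an example as either a task-only or a rationale-extended input-output pair, such as the formatting helper sketched earlier.

```python
# Sketch of the data construction described above (helper names are hypothetical).
import random

def build_splits(examples, n, render_task_only, render_with_rationale, val_pool, seed=0):
    rng = random.Random(seed)
    train_subset = rng.sample(examples, n)

    # Self-rationalization training data: concatenate the n task-only pairs with
    # their n rationale-extended counterparts, giving 2n training inputs.
    train_pairs = [render_task_only(ex) for ex in train_subset]
    train_pairs += [render_with_rationale(ex) for ex in train_subset]

    # Validation data for checkpointing: n/2 task-only instances sampled from the
    # original validation set, so model selection uses task performance only.
    val_pairs = [render_task_only(ex) for ex in rng.sample(val_pool, n // 2)]
    return train_pairs, val_pairs
```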
Training
For each amount of training data n, we report the average difference between task-only and self-rationalization models across multiple random seeds (5 for NLI and 10 for CQA).⁷ For one random seed in each evaluation setting (where a setting is determined by the task, model family, model size, whether rationales are used, and amount of training data), we tune the learning rate over the candidate values {1e-5, 3e-5, 5e-5} and use the best-performing learning rate for the other random seeds in the same setting. We train with a fixed batch size of 64 and a linear learning rate scheduler using Adafactor until accuracy on the validation data stops improving, or for a maximum of 50 epochs. We use patience values of 10 for n < 10K, 5 for 10K ≤ n < 50K, and 3 for n ≥ 50K.
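The sketch below mirrors this training setup using the Hugging Face Trainer. The paper does not state which training framework was used, so the specific API is an assumption, and a compute_metrics function returning validation accuracy is assumed to be supplied by the caller.

```python
# Sketch of the training configuration described above, assuming the Hugging Face
# Trainer API (the actual framework used in the paper is not specified here).
from transformers import EarlyStoppingCallback, Seq2SeqTrainer, Seq2SeqTrainingArguments

def patience_for(n: int) -> int:
    """Early-stopping patience as a function of training-set size n."""
    if n < 10_000:
        return 10
    if n < 50_000:
        return 5
    return 3

def make_trainer(model, tokenizer, train_dataset, eval_dataset, n, learning_rate, compute_metrics):
    args = Seq2SeqTrainingArguments(
        output_dir="checkpoints",
        per_device_train_batch_size=64,    # fixed batch size of 64
        learning_rate=learning_rate,       # tuned over {1e-5, 3e-5, 5e-5}
        lr_scheduler_type="linear",
        optim="adafactor",                 # Adafactor optimizer
        num_train_epochs=50,               # hard cap; early stopping usually triggers first
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",  # checkpoint on validation task accuracy
        greater_is_better=True,
    )
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,   # assumed to return {"accuracy": ...}
        callbacks=[EarlyStoppingCallback(early_stopping_patience=patience_for(n))],
    )
```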
Evaluation
We decode predictions using greedy decoding and evaluate accuracy using exact match with gold labels. We evaluate robustness to spurious correlations by measuring performance on 1) manually annotated challenge datasets and 2) subsets of original test sets where reliance on spurious correlations would fail to produce correct answers.
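A minimal sketch of this evaluation loop follows, assuming an encoder-decoder model and the hypothetical "label explanation: rationale" output format from the earlier sketch; the label-extraction step is therefore an assumption rather than the paper's exact parsing rule.

```python
# Sketch: greedy decoding, then exact-match accuracy of the predicted label.
# The split on "explanation:" assumes the hypothetical target format above.
import torch

def predict_labels(model, tokenizer, sources, max_new_tokens=128):
    preds = []
    for source in sources:
        inputs = tokenizer(source, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(
                **inputs, do_sample=False, num_beams=1, max_new_tokens=max_new_tokens
            )  # greedy decoding
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Keep only the label; anything after "explanation:" is the rationale.
        preds.append(text.split("explanation:")[0].strip())
    return preds

def exact_match_accuracy(predictions, gold_labels):
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / max(len(gold_labels), 1)
```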
⁵The total sizes of the original training datasets are 549,339 for SNLI and 7,598 for CQA.
⁶In initial experiments, we find that this leads to better performance/robustness measures than only using the n input-outputs for self-rationalization; we hypothesize that without including the original task-only inputs as well, self-rationalization models may be overfitting to the rationale-generation part of the training objective.
⁷We experiment with more seeds for CQA because we have fewer metrics/evaluation datasets to measure robustness for CQA, and so it is harder to disentangle real effects from noise.