
to be in lower-resource settings. In higher-resource settings, training with rationales can hurt robustness (§4.1).
2. Within model families, larger models benefit more in robustness from rationales (§4.2).
3. The effects of self-rationalization on robustness are not fully explained by its effects on in-domain task performance (§4.3).
4. The content of rationales used during training influences both task performance and robustness to spurious correlations (§4.4).
Our results suggest that straightforward self-rationalization training does not always facilitate learning to solve a task for the right reasons. Instead, the effects of self-rationalization on robustness to spurious correlations depend on a multitude of factors. Thus, appropriate care should be taken when training models to self-rationalize with the goal of creating trustworthy models.
2 Related Work
Learning to rationalize
Two classes of approaches to producing models that can rationalize their predictions include self-rationalization models,² which are fully differentiable and output free-text rationales along with task predictions, and pipeline models, which consist of two components: one that produces rationales, and a second that makes predictions from those rationales (Wiegreffe et al., 2021).³ Such methods are typically evaluated by the faithfulness and plausibility of their rationales, where faithfulness represents the extent to which a model actually relied on the rationale in making its prediction, and plausibility indicates human judgment of how well the rationale explains the output (DeYoung et al., 2020).
In contrast to these works, which aim to improve model interpretability through new methods for rationalizing models, we ask to what extent existing methods affect model robustness to spurious correlations. We conduct our analysis on self-rationalization models, which have been found to achieve better task performance and produce higher-quality rationales than do pipeline models (Wiegreffe et al., 2021; Camburu et al., 2018).
²Such approaches have also been referred to as explain-then-predict (Camburu et al., 2018) and rationalize-then-predict (Chen et al., 2022) models.
³See Wiegreffe et al. (2021) for a detailed discussion of pipeline and self-rationalization approaches to rationalization.
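To make the distinction above between self-rationalization and pipeline models concrete, the sketch below shows one way training pairs might be formatted for each setup. The NLI task, templates, and function names are illustrative assumptions on our part, not the exact formats used in the cited work.

# Illustrative sketch of the two rationalization setups discussed above.
# The templates below are assumptions for exposition, not the exact
# formats used in the cited papers.

def self_rationalization_pair(premise: str, hypothesis: str,
                              label: str, rationale: str):
    """One seq2seq model: a single target contains both the task
    prediction and a free-text rationale."""
    source = f"explain nli premise: {premise} hypothesis: {hypothesis}"
    target = f"{label} explanation: {rationale}"
    return source, target


def pipeline_pairs(premise: str, hypothesis: str,
                   label: str, rationale: str):
    """Two models: the first is trained to generate a rationale, the
    second to predict the label from that rationale alone."""
    rationalizer = (f"rationalize nli premise: {premise} hypothesis: {hypothesis}",
                    rationale)
    predictor = (f"predict nli explanation: {rationale}", label)
    return rationalizer, predictor


if __name__ == "__main__":
    example = (
        "A man is playing a guitar on stage.",   # premise
        "A person is performing music.",         # hypothesis
        "entailment",                            # label
        "Playing a guitar on stage is a form of performing music.",  # rationale
    )
    print(self_rationalization_pair(*example))
    for source, target in pipeline_pairs(*example):
        print((source, target))

Under this sketch, the self-rationalization model decodes the label and rationale jointly at test time, whereas the pipeline predictor only ever sees the generated rationale.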
Learning from rationales
Recent work has explored the utility of rationales for improving end-task performance in in-context learning (Wei et al., 2022; Lampinen et al., 2022; Ye and Durrett, 2022) as well as in fine-tuning (Zaidan et al., 2007; Hancock et al., 2018; Camburu et al., 2018; Narang et al., 2020; Hase and Bansal, 2021; Nye et al., 2021; Zhao and Vydiswaran, 2021). Previous work has shown that training with both human-annotated rationales (Rajani et al., 2019) and rationales generated by language models (Paranjape et al., 2021) can increase in-domain task performance, particularly in low-resource settings (Bhat et al., 2021; Pruthi et al., 2022; Zelikman et al., 2022). Unlike these prior works, which study how training with rationales affects in-domain, end-task performance, we focus specifically on evaluating the impact on robustness to spurious correlations.
Improving robustness with rationales
Most closely related are recent works that study how training with rationales affects model robustness. Stacey et al. (2022) propose a method of supervising attention weights with extractive rationales and show that this method leads to both in-distribution and out-of-distribution improvements for natural language inference. Schuster et al. (2021) find that training with contrastive extractive rationales improves robustness as measured by performance on adversarial evaluation sets. Concurrent work by Chen et al. (2022) investigates to what extent training models to extract rationales through pipelines improves their robustness to adversarial attacks. In contrast to all three of these works, we focus on freeform rationales instead of extractive rationales and explore the impact of the amount of training data on robustness. In contrast to Schuster et al. (2021) and Chen et al. (2022), we analyze self-rationalization models instead of pipeline models and measure robustness to spurious correlations, rather than robustness to adversarial attacks. While Stacey et al. (2022) evaluate robustness to spurious correlations for natural language inference with some of the same test sets, they work with masked language models and evaluate the effect of supervising model attention with rationales; in contrast, we work with encoder-decoder and decoder-only models of varying sizes and evaluate the effect of outputting rationales along with predictions. In addition, their analysis is limited to natural language inference, for which evaluation datasets targeting robustness exist; in contrast, we also experiment