
to be in lower-resource settings. In higher-resource settings, training with rationales can hurt robustness (§4.1).
2. Within model families, larger models benefit more in robustness from rationales (§4.2).
3. The effects of self-rationalization on robustness are not fully explained by its effects on in-domain task performance (§4.3).
4. The content of rationales used during training influences both task performance and robustness to spurious correlations (§4.4).
Our results suggest that straightforward self-rationalization training does not always facilitate learning to solve a task for the right reasons. Instead, the effects of self-rationalization on robustness to spurious correlations depend on a multitude of factors. Thus, appropriate care should be taken when training models to self-rationalize with the goal of creating trustworthy models.
2 Related Work
Learning to rationalize
Two classes of approaches to producing models that can rationalize their predictions include self-rationalization models,² which are fully differentiable and output free-text rationales along with task predictions, and pipeline models, which consist of two components: one that produces rationales, and a second that makes predictions from those rationales (Wiegreffe et al., 2021).³ Such methods are typically evaluated by the faithfulness and plausibility of their rationales, where faithfulness represents the extent to which a model actually relied on the rationale in making its prediction, and plausibility indicates human judgment of how well the rationale explains the output (DeYoung et al., 2020).
In contrast to these works, which aim to improve model interpretability through new methods for rationalizing models, we ask to what extent existing methods affect model robustness to spurious correlations. We conduct our analysis on self-rationalization models, which have been found to achieve better task performance and produce higher-quality rationales than do pipeline models (Wiegreffe et al., 2021; Camburu et al., 2018).
²Such approaches have also been referred to as explain-then-predict (Camburu et al., 2018) and rationalize-then-predict (Chen et al., 2022) models.
³See Wiegreffe et al. (2021) for a detailed discussion of pipeline and self-rationalization approaches to rationalization.
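To make the distinction above between self-rationalization and pipeline models concrete, the sketch below shows one way training pairs might be formatted for each setup. The NLI task, templates, and function names are illustrative assumptions on our part, not the exact formats used in the cited work.

# Illustrative sketch of the two rationalization setups discussed above.
# The templates below are assumptions for exposition, not the exact
# formats used in the cited papers.

def self_rationalization_pair(premise: str, hypothesis: str,
                              label: str, rationale: str):
    """One seq2seq model: a single target contains both the task
    prediction and a free-text rationale."""
    source = f"explain nli premise: {premise} hypothesis: {hypothesis}"
    target = f"{label} explanation: {rationale}"
    return source, target


def pipeline_pairs(premise: str, hypothesis: str,
                   label: str, rationale: str):
    """Two models: the first is trained to generate a rationale, the
    second to predict the label from that rationale alone."""
    rationalizer = (f"rationalize nli premise: {premise} hypothesis: {hypothesis}",
                    rationale)
    predictor = (f"predict nli explanation: {rationale}", label)
    return rationalizer, predictor


if __name__ == "__main__":
    example = (
        "A man is playing a guitar on stage.",   # premise
        "A person is performing music.",         # hypothesis
        "entailment",                            # label
        "Playing a guitar on stage is a form of performing music.",  # rationale
    )
    print(self_rationalization_pair(*example))
    for source, target in pipeline_pairs(*example):
        print((source, target))

Under this sketch, the self-rationalization model decodes the label and rationale jointly at test time, whereas the pipeline predictor only ever sees the generated rationale.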
Learning from rationales
Recent work has explored the utility of rationales for improving end-task performance in in-context learning (Wei et al., 2022; Lampinen et al., 2022; Ye and Durrett, 2022) as well as in fine-tuning (Zaidan et al., 2007; Hancock et al., 2018; Camburu et al., 2018; Narang et al., 2020; Hase and Bansal, 2021; Nye et al., 2021; Zhao and Vydiswaran, 2021). Previous work has shown that training with both human-annotated rationales (Rajani et al., 2019) and rationales generated by language models (Paranjape et al., 2021) can increase in-domain task performance, particularly in low-resource settings (Bhat et al., 2021; Pruthi et al., 2022; Zelikman et al., 2022). Unlike these prior works, which study how training with rationales affects in-domain, end-task performance, we focus specifically on evaluating the impact on robustness to spurious correlations.
Improving robustness with rationales
Most closely related are recent works that study how training with rationales affects model robustness. Stacey et al. (2022) propose a method of supervising attention weights with extractive rationales and show that this method leads to both in-distribution and out-of-distribution improvements for natural language inference. Schuster et al. (2021) find that training with contrastive extractive rationales improves robustness as measured by performance on adversarial evaluation sets. Concurrent work by Chen et al. (2022) investigates to what extent training models to extract rationales through pipelines improves their robustness to adversarial attacks. In contrast to all three of these works, we focus on freeform rationales instead of extractive rationales and explore the impact of the amount of training data on robustness. In contrast to Schuster et al. (2021) and Chen et al. (2022), we analyze self-rationalization models instead of pipeline models and measure robustness to spurious correlations, rather than robustness to adversarial attacks. While Stacey et al. (2022) evaluate robustness to spurious correlations for natural language inference with some of the same test sets, they work with masked language models and evaluate the effect of supervising model attention with rationales; in contrast, we work with encoder-decoder and decoder-only models of varying sizes and evaluate the effect of outputting rationales along with predictions. In addition, their analysis is limited to natural language inference, for which evaluation datasets targeting robustness exist; in contrast, we also experiment