
strengthen a model's ability to generalize, can they
also help us understand it?
Figure 1 illustrates the role this understanding
can play. We have three trained models and are
trying to rank them for suitability on a new domain.
The small labeled dataset is a useful (albeit
noisy) indicator of success. However, by checking
model attributions on our few OOD samples, we
can more deeply understand model behavior and
analyze whether the models use certain pathological heuristics.
Unlike past work (Adebayo et al., 2022), we seek
to automate this process as much as possible, provided
the unwanted behaviors are characterizable
by describable heuristics. We use scalar factors,
which are simple functions of model attributions,
to estimate proximity to these heuristics, similar
to the characterization of behavior in past work (Ye et al.,
2021). We then evaluate whether these factors allow
us to correctly rank the models' performance
on OOD data.
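As a concrete illustration of this evaluation, the sketch below checks how well a factor-induced ranking agrees with the true OOD performance ranking via Kendall's tau. The factor values reuse the numbers shown in Figure 2, while the OOD accuracies are purely hypothetical placeholders, not results from our experiments.

```python
# Minimal sketch of the ranking evaluation: correlate a factor-induced
# model ranking with the true OOD performance ranking.
from scipy.stats import kendalltau

# Factor values from the Figure 2 example; OOD accuracies are hypothetical.
factor_scores = [0.31, 0.253, -0.04]   # agreement with the pathological heuristic (M1, M2, M3)
ood_accuracies = [0.42, 0.55, 0.71]    # true OOD accuracy, hidden at prediction time

# Higher agreement with a pathological heuristic should predict lower OOD
# accuracy, so we correlate the negated factor with the true accuracies.
tau, _ = kendalltau([-s for s in factor_scores], ood_accuracies)
print(f"Kendall's tau between predicted and true rankings: {tau:.2f}")  # 1.00 here
```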
Both on synthetic (Warstadt et al., 2020) and
real datasets (McCoy et al., 2019; Zhang et al.,
2019), we find that, between models with similar
architectures but different training processes, both
our accuracy baseline and attribution-based factors
are good at distinguishing relative model performance
on OOD data. However, on models with
different base architectures, we discover interesting
patterns: factors can very strongly
distinguish between different types of models, but
cannot always map these differences to correct predictions
of OOD performance. In practice, we find
probe set accuracy to be a quick and reliable tool
for understanding OOD performance, whereas factors
are capable of more fine-grained distinctions
in certain situations.
Our Contributions:
(1) We benchmark, in several settings, methods for predicting and understanding
relative OOD performance with few-shot
OOD samples. (2) We establish a ranking-based
evaluation framework for systems in our problem
setting. (3) We analyze patterns in how accuracy
on a few-shot set and factors derived from token
attributions distinguish models.
2 Motivating Example
To expand on Figure 1, Figure 2 shows an in-depth
motivating example of our process. We show three
feature attributions from three different models on
an example from the HANS dataset (McCoy et al.,
2019). These models have (unknown) varied OOD
performance but similar performance on the in-domain
MNLI (Williams et al., 2018) data. Our
task is then to correctly rank these models' performance
on the HANS dataset in a few-shot manner.

Figure 2: Explanations generated on the same sample
(premise: "The manager knew the athlete mentioned the actor";
hypothesis: "The manager knew the athlete") for HANS
subsequence models M1, M2, M3, listed in ascending order of
OOD performance (M1 < M2 < M3). Their summed subsequence
attributions (0.31, 0.253, and -0.04, respectively) show the
opposite ordering of agreement with the pathological heuristic
(M1 > M2 > M3); this factor (shaded underlines), derived from
knowledge of the OOD setting, lets us predict the model
ranking in this example.
We can consider ranking these models via simple
metrics like accuracy on the small few-shot dataset,
where higher-scoring models are ranked higher.
However, such estimates can have high variance on
small datasets. In Figure 2, only M3 predicts non-entailment
correctly, and we cannot distinguish the
OOD performance of M1 and M2 without additional
information.
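For concreteness, the sketch below shows a minimal version of this accuracy baseline; the model names and the `predict` interface are assumptions made purely for illustration.

```python
# Minimal sketch of the few-shot accuracy baseline: rank candidate models
# by their accuracy on the small labeled OOD probe set.
def probe_accuracy(model, probe_examples, probe_labels):
    preds = [model.predict(x) for x in probe_examples]  # assumed predict() API
    return sum(p == y for p, y in zip(preds, probe_labels)) / len(probe_labels)

def rank_by_probe_accuracy(models, probe_examples, probe_labels):
    """models: dict mapping a model name (e.g., "M1") to a model object."""
    scores = {name: probe_accuracy(m, probe_examples, probe_labels)
              for name, m in models.items()}
    # Higher probe accuracy -> higher predicted OOD rank.
    return sorted(scores, key=scores.get, reverse=True)
```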
Thus, we turn to explanations to gain more insight
into the models' underlying behavior. With
faithful attributions, we should be able to determine
if the model is following simple inaccurate rules
called heuristics (McCoy et al., 2019). Figure 2
shows the heuristic where a model predicts that the
sentence A entails B if B is a subsequence of A.
Crucially, we can use model attributions to assess
a model's use of this heuristic: we can sum the attribution
mass the model places on the subsequence tokens.
We use the term factors to refer to such functions
over model attributions.
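The sketch below shows one way such a subsequence factor could be computed; the token alignment and the decision to sum signed (rather than absolute) attributions are our own illustrative choices, not necessarily the exact definition used in our experiments.

```python
# Sketch of a subsequence factor: sum the attribution mass the model places
# on the premise tokens that form the hypothesis subsequence.
# attributions[i] is assumed to be one signed score per premise token.
def subsequence_factor(premise_tokens, hypothesis_tokens, attributions):
    """Return the summed attribution over the matched subsequence span,
    or 0.0 if the hypothesis is not a contiguous subsequence of the premise."""
    n, m = len(premise_tokens), len(hypothesis_tokens)
    for start in range(n - m + 1):
        if premise_tokens[start:start + m] == hypothesis_tokens:
            return sum(attributions[start:start + m])
    return 0.0

# Figure 2 example: premise "The manager knew the athlete mentioned the actor",
# hypothesis "The manager knew the athlete"; the factor sums the attribution
# mass placed on "The manager knew the athlete" within the premise.
```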
The use of factors potentially allows us to automate
the detection of spurious signals or shortcut
learning (Geirhos et al., 2020). While prior
work has shown that spurious correlations are hard
for a human user to detect from explanations (Adebayo
et al., 2022), well-designed factors could automatically
analyze model behavior across a number
of tasks and detect such failures.