Assessing Out-of-Domain Language Model Performance from Few
Examples
Prasann Singhal, Jarad Forristal, Xi Ye, and Greg Durrett
Department of Computer Science
The University of Texas at Austin
{prasanns, jarad, xiye, gdurrett}@cs.utexas.edu
Abstract
While pretrained language models have exhibited impressive generalization capabilities, they still behave unpredictably under certain domain shifts. In particular, a model may learn a reasoning process on in-domain training data that does not hold for out-of-domain test data. We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion: given a few target-domain examples and a set of models with similar training performance, can we understand how these models will perform on OOD test data? We benchmark the performance on this task when looking at model accuracy on the few-shot examples, then investigate how to incorporate analysis of the models' behavior using feature attributions to better tackle this problem. Specifically, we explore a set of "factors" designed to reveal model agreement with certain pathological heuristics that may indicate worse generalization capabilities. On textual entailment, paraphrase recognition, and a synthetic classification task, we show that attribution-based factors can help rank relative model OOD performance. However, accuracy on a few-shot test set is a surprisingly strong baseline, particularly when the system designer does not have in-depth prior knowledge about the domain shift.
1 Introduction
The question of whether models have learned the right behavior on a training set is crucial for generalization. Deep models have a propensity to learn shallow reasoning shortcuts (Geirhos et al., 2020) like single-word correlations (Gardner et al., 2021) or predictions based on partial inputs (Poliak et al., 2018), particularly for problems like natural language inference (Gururangan et al., 2018; McCoy et al., 2019) and question answering (Jia and Liang, 2017; Chen and Durrett, 2019).
*Equal contribution
Figure 1: Our setting: a system developer is trying to evaluate a collection of trained models (differing in pre-training, data augmentation, inoculation, etc.) on a small amount of hand-labeled data (e.g., 10 examples) to assess which one may work best in this new domain. Can baselines / attributions help? This work asks whether post-hoc explanations can reveal generalization failures.
Unless we use evaluation sets tailored to these spurious signals, accurately understanding if a model is learning them remains a hard problem (Bastings et al., 2021; Kim et al., 2021; Hupkes et al., 2022).
This paper addresses the problem of predicting whether a model will work well in a target domain given only a few examples from that domain. This setting is realistic: a system designer can typically hand-label a few examples to serve as a test set. Computing accuracy on this small set and using that as a proxy for full test-set performance is a simple baseline for our task, but has high variance, which may cause us to incorrectly rank two models that achieve somewhat similar performance. We hypothesize that we can do better if we can interpret the model's behavior beyond accuracy. With the rise of techniques to analyze post-hoc feature importance in machine-learned models (Lundberg and Lee, 2017; Ribeiro et al., 2016; Sundararajan et al., 2017), we have seen not just better interpretation of models, but improvements such as constraining them to avoid using certain features (Ross et al., 2017) like those associated with biases (Liu and Avci, 2019; Kennedy et al., 2020), or trying to more generally teach the right reasoning process for a problem (Yao et al., 2021; Tang et al., 2021; Pruthi et al., 2022). If post-hoc interpretation can strengthen a model's ability to generalize, can it also help us understand generalization?
Figure 1 illustrates the role this understanding can play. We have three trained models and are trying to rank them for suitability on a new domain. The small labeled dataset is a useful (albeit noisy) indicator of success. However, by checking model attributions on our few OOD samples, we can more deeply understand model behavior and analyze if they use certain pathological heuristics. Unlike past work (Adebayo et al., 2022), we seek to automate this process as much as possible, provided the unwanted behaviors are characterizable by describable heuristics. We use scalar factors, which are simple functions of model attributions, to estimate proximity to these heuristics, similar to characterizing behavior in past work (Ye et al., 2021). We then evaluate whether these factors allow us to correctly rank the models' performance on OOD data.
Both on synthetic (Warstadt et al., 2020) and real datasets (McCoy et al., 2019; Zhang et al., 2019), we find that, between models with similar architectures but different training processes, both our accuracy baseline and attribution-based factors are good at distinguishing relative model performance on OOD data. However, on models with different base architectures, we discover interesting patterns: factors can very strongly distinguish between different types of models, but cannot always map these differences to correct predictions of OOD performance. In practice, we find probe set accuracy to be a quick and reliable tool for understanding OOD performance, whereas factors are capable of more fine-grained distinctions in certain situations.
Our Contributions: (1) We benchmark, in several settings, methods for predicting and understanding relative OOD performance with few-shot OOD samples. (2) We establish a ranking-based evaluation framework for systems in our problem setting. (3) We analyze patterns in how accuracy on a few-shot set and factors derived from token attributions distinguish models.
2 Motivating Example
To expand on Figure 1, Figure 2 shows an in-depth motivating example of our process. We show feature attributions from three different models on an example from the HANS dataset (McCoy et al., 2019).
Figure 2: Explanations generated on the same sample (premise: "The manager knew the athlete mentioned the actor"; hypothesis: "The manager knew the athlete") for HANS subsequence data models M1, M2, M3, which have ascending OOD performance. Agreement with the pathological heuristic is M1 > M2 > M3, with subsequence-attribution factor values of 0.31, 0.253, and -0.04 respectively. The factor (shaded underlines in the figure), derived from knowledge of the OOD setting, allows us in this example to predict the model ranking.
These models have (unknown) varied OOD performance but similar performance on the in-domain MNLI (Williams et al., 2018) data. Our task is then to correctly rank these models' performance on the HANS dataset in a few-shot manner. We can consider ranking these models via simple metrics like accuracy on the small few-shot dataset, where higher-scoring models are higher-ranked. However, such estimates can be high variance on small datasets. In Figure 2, only M3 predicts non-entailment correctly, and we cannot distinguish the OOD performance of M1 and M2 without additional information.
Thus, we turn to explanations to gain more insight into the models' underlying behavior. With faithful attributions, we should be able to determine if the model is following simple inaccurate rules called heuristics (McCoy et al., 2019). Figure 2 shows the heuristic where a model predicts that sentence A entails sentence B if B is a subsequence of A. Crucially, we can use model attributions to assess model use of this heuristic: we can sum the attribution mass the model places on subsequence tokens. We use the term factors to refer to such functions over model attributions.
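
As an illustration, the sketch below (our own, not the paper's implementation) computes such a subsequence factor from token attributions; token-level set membership stands in for exact subsequence matching, and the score is normalized by total attribution mass:

```python
# A minimal sketch of a subsequence factor: the fraction of (absolute)
# attribution mass that falls on premise tokens also appearing in the
# hypothesis. Set membership is a simplification of true subsequence
# matching; the helper name and normalization are our own choices.
from typing import List

def subsequence_factor(premise_tokens: List[str],
                       hypothesis_tokens: List[str],
                       attributions: List[float]) -> float:
    hyp = {t.lower() for t in hypothesis_tokens}
    total = sum(abs(a) for a in attributions) or 1.0
    on_subseq = sum(abs(a) for tok, a in zip(premise_tokens, attributions)
                    if tok.lower() in hyp)
    return on_subseq / total

# Toy usage with made-up attribution scores for the Figure 2 premise:
premise = "The manager knew the athlete mentioned the actor".split()
hypothesis = "The manager knew the athlete".split()
attrs = [0.05, 0.30, 0.25, 0.02, 0.20, 0.08, 0.01, 0.09]
print(subsequence_factor(premise, hypothesis, attrs))  # 0.83
```

A higher value indicates that more of the model's attribution mass sits on the tokens shared with the hypothesis, i.e., closer agreement with the subsequence heuristic.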
The use of factors potentially allows for the automation of detection of spurious signals or shortcut learning (Geirhos et al., 2020). While prior work has shown that spurious correlations are hard for a human user to detect from explanations (Adebayo et al., 2022), well-designed factors could automatically analyze model behavior across a number of tasks and detect such failures.
3 Attributions to Predict Performance
In this section, we formalize the ideas presented thus far. Token-level attribution methods (a subset of post-hoc explanations) are methods which, given an input sequence of tokens $x \overset{\text{def}}{=} x_1, x_2, \ldots, x_n$ and a model prediction $\hat{y} \overset{\text{def}}{=} M(x)$ for some task, assign an explanation $\phi(x, \hat{y}) \overset{\text{def}}{=} a_1, \ldots, a_n$, where $a_i$ corresponds to an attribution or importance score for the corresponding $x_i$ towards the final prediction. For cases where the model, prediction, and inputs are unambiguous, we abbreviate this simply as $\phi_i(x) \overset{\text{def}}{=} \phi(x, M_i(x))$.
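
For concreteness, one simple way to obtain token-level attributions with this interface is leave-one-out occlusion; the sketch below is illustrative only (this section does not prescribe a particular attribution method), and `predict_proba` and the mask token are assumed interfaces rather than the paper's code:

```python
# A minimal, hypothetical sketch of one token-level attribution method
# (leave-one-out occlusion) implementing the phi(x, y_hat) interface above:
# a_i = p(y_hat | x) - p(y_hat | x with token i masked).
# `predict_proba` (returning a softmax over labels) and the mask token are
# assumptions, not the paper's implementation.
from typing import Callable, List
import numpy as np

def occlusion_attributions(tokens: List[str],
                           predict_proba: Callable[[List[str]], np.ndarray],
                           mask_token: str = "[MASK]") -> List[float]:
    base = predict_proba(tokens)
    y_hat = int(np.argmax(base))
    attributions = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        drop = float(base[y_hat] - predict_proba(occluded)[y_hat])
        attributions.append(drop)  # large drop => token i was important
    return attributions
```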
We assume that the model is trained on an in-domain training dataset $D_T$ and will be evaluated on some unknown OOD set $D_O$. Given two models $M_0$ and $M_1$, with a small amount of data $D_{O,t} \subset D_O$ ($t = 10$ examples or fewer in our settings), our task is to predict which model will generalize better.
We break the process into 2 steps (see Figure 2):
1. Hypothesize a heuristic. First we must identify an underlying heuristic $H$ that reflects pathological model behavior in the OOD dataset. For example, the subsequence heuristic in Figure 2 corresponds to a heuristic which always predicts entailed if the hypothesis is contained within the premise. Let $h(M_i)$ abstractly reflect how closely the $i$th model's behavior aligns with $H$, and let $s(M_i)$ be the true OOD performance of model $M_i$. If we then assume that $h(M_i)$ faithfully models some pathological heuristic $H$, we should have that $h(M_0) > h(M_1) > \ldots > h(M_m)$ implies $s(M_0) < s(M_1) < \ldots < s(M_m)$. In other words, the more a model $M_i$ agrees with a pathological heuristic $H$, the worse it performs (we illustrate this assumed inverse relationship with a small sketch after the two steps).
2. Measure alignment. We now want to predict the ranking of $s(M_i)$; however, with few labeled examples there may be high variance in directly evaluating these metrics. We instead use factors $f(x, \phi_i)$ which map tokens and their attributions for model $M_i$ to scalar scores that should correlate with the heuristic $H$. Factors can be designed to align with known pathological heuristics, where higher scores indicate strong model agreement with the associated heuristic. We then estimate the ranking of $s(M_i)$ using the relative ranking of the corresponding $h(M_i)$, approximated through factors.
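
To make the assumption in step 1 concrete, a rank correlation can be used as a sanity check; the sketch below is our illustration (the $s$ values are hypothetical, and the $h$ values are borrowed loosely from the Figure 2 factor values), using Kendall's tau, which should be close to $-1$ when the assumption holds:

```python
# A small illustrative check (not from the paper) of the inverse-ranking
# assumption: higher heuristic agreement h(M_i) should mean lower true OOD
# performance s(M_i), i.e., a Kendall's tau near -1.
from scipy.stats import kendalltau

h = [0.31, 0.253, -0.04]  # heuristic agreement for M1, M2, M3
s = [0.42, 0.55, 0.71]    # hypothetical true OOD accuracies for M1, M2, M3

tau, _ = kendalltau(h, s)
print(tau)  # -1.0: rankings are perfectly inverted, as the assumption predicts
```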
Concretely, to measure the alignment, we first compute for each input $x_j \in D_{O,t}$ the prediction $M_i(x_j)$ and the explanation $\phi(x_j)$ for that prediction. These $\phi(x_j)$ are used to compute the score $f(x_j, \phi(x_j))$ for model $M_i$. We take the overall score of the model to be $F(i) = \frac{1}{t}\sum_{j=1}^{t} f(x_j, \phi(x_j, M_i(x_j)))$, the mean over the $t$ examples in $D_{O,t}$. We then directly rank models on the basis of the $F(i)$ values: the higher the average factor value (the more the model follows the heuristic), the lower the relative ranking: $F(0) > F(1) \Rightarrow s(M_0) < s(M_1)$. Therefore we can sort the models by these values and arrive at a predicted ranking. We later also consider factors which do not intuitively map to specific heuristics.
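
A minimal sketch of this scoring and ranking procedure follows; the `attribute` and `factor` callables (e.g., a subsequence factor like the one in Section 2) are assumed interfaces rather than the paper's code:

```python
# A minimal sketch of factor-based ranking: F(i) is the mean factor value
# over the t few-shot OOD examples, and models are ordered so that a higher
# F(i) (stronger heuristic agreement) predicts worse OOD performance.
# `attribute(model, x)` and `factor(x, phi)` are assumed helper interfaces.
import numpy as np

def factor_score(model, probe_set, attribute, factor) -> float:
    scores = [factor(x, attribute(model, x)) for x in probe_set]
    return float(np.mean(scores))

def rank_models(models, probe_set, attribute, factor):
    """models: dict of name -> model. Returns names ordered from best to
    worst predicted OOD performance (smallest average factor value first)."""
    f_vals = {name: factor_score(m, probe_set, attribute, factor)
              for name, m in models.items()}
    return sorted(f_vals, key=f_vals.get)
```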
Baselines. We also consider three principal explanation-agnostic baselines. A natural baseline given $D_{O,t}$ is to simply use the accuracy (ACC) on this dataset, $\frac{1}{n}\sum_{i=1}^{n} [y_i = M(x_i)]$; however, this may be noisy on only a few examples and frequently leads to ties.¹
We can also assess model confidence (CONF), which looks at the softmax probability of the predicted label, as well as CONF-GT, which looks only at the softmax probability of the ground-truth label.
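
A sketch of these three baselines, assuming each model exposes a `predict_proba` interface returning a softmax distribution over labels (an assumption of ours, not the paper's code), could look like the following:

```python
# A minimal sketch of the explanation-agnostic baselines ACC, CONF, and
# CONF-GT over the few-shot probe set D_{O,t}. `predict_proba` is an assumed
# interface returning a softmax distribution over labels.
import numpy as np

def baseline_scores(model, probe_set):
    """probe_set: list of (x, y_true) pairs."""
    accs, confs, confs_gt = [], [], []
    for x, y_true in probe_set:
        probs = model.predict_proba(x)
        y_hat = int(np.argmax(probs))
        accs.append(float(y_hat == y_true))    # ACC: few-shot accuracy
        confs.append(float(probs[y_hat]))      # CONF: confidence in predicted label
        confs_gt.append(float(probs[y_true]))  # CONF-GT: confidence in gold label
    return {"ACC": float(np.mean(accs)),
            "CONF": float(np.mean(confs)),
            "CONF-GT": float(np.mean(confs_gt))}
```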
4 Experimental Setup
4.1 Models Compared
In this work, we compare various models across different axes yielding different $D_O$ performance. The first approach we use is inoculation (Liu et al., 2019a), which involves fine-tuning models on small amounts or batches of $D_O$ data alongside in-domain data to increase model performance on OOD data. The second approach we use is varying the model architecture and pre-training (e.g., using a stronger pre-trained Transformer model).
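
As a rough illustration of the first approach, the sketch below continues fine-tuning an already-trained sequence classifier on a small set of examples (which, per the description above, can mix $D_O$ and in-domain data); the interface, data format, and hyperparameters are assumptions, not the paper's training code:

```python
# A minimal sketch of inoculation-style fine-tuning: continue training an
# in-domain classifier on a small set of examples (OOD examples, optionally
# mixed with in-domain data). The (text_a, text_b, label) format and
# hyperparameters are illustrative assumptions.
import torch
from torch.optim import AdamW

def inoculate(model, tokenizer, examples, lr=1e-5, epochs=3):
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text_a, text_b, label in examples:
            batch = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=torch.tensor([label])).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```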
In Section 5, we use inoculation to create 5 RoBERTa-base (Liu et al., 2019b) models of varying $D_O$ performance for each of the three MSGS sets. In Section 6, where we consider the HANS and PAWS datasets, we inoculate a variety of models. For HANS, we inoculate 5 RoBERTa-large models. We additionally examine DeBERTa-v3-base (He et al., 2021b,a) and ELECTRA-base (Clark et al., 2020) models fine-tuned on in-domain MNLI data. For PAWS, we inoculate 4 RoBERTa-base models on the in-domain $D_T$ set. We also inoculate ELECTRA-base and DeBERTa-base models. We
¹ Most of the datasets we consider are constructed specifically to mislead models following the heuristic, so this baseline directly measures agreement with a heuristic $h$.