
strengthen a model's ability to generalize, can they
also help us understand it?
Figure 1 illustrates the role this understanding
can play. We have three trained models and are
trying to rank them for suitability on a new domain.
The small labeled dataset is a useful (albeit
noisy) indicator of success. However, by checking
model attributions on our few OOD samples, we
can more deeply understand model behavior and
analyze whether the models use certain pathological heuristics.
Unlike past work (Adebayo et al., 2022), we seek
to automate this process as much as possible, provided
the unwanted behaviors are characterizable
by describable heuristics. We use scalar factors,
which are simple functions of model attributions,
to estimate proximity to these heuristics, similar
to the characterization of behavior in past work (Ye et al.,
2021). We then evaluate whether these factors allow
us to correctly rank the models' performance
on OOD data.
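As a concrete illustration of this evaluation, the sketch below checks how well a factor-induced ranking agrees with the true OOD performance ranking via Kendall's tau. The factor values reuse the numbers shown in Figure 2, while the OOD accuracies are purely hypothetical placeholders, not results from our experiments.

```python
# Minimal sketch of the ranking evaluation: correlate a factor-induced
# model ranking with the true OOD performance ranking.
from scipy.stats import kendalltau

# Factor values from the Figure 2 example; OOD accuracies are hypothetical.
factor_scores = [0.31, 0.253, -0.04]   # agreement with the pathological heuristic (M1, M2, M3)
ood_accuracies = [0.42, 0.55, 0.71]    # true OOD accuracy, hidden at prediction time

# Higher agreement with a pathological heuristic should predict lower OOD
# accuracy, so we correlate the negated factor with the true accuracies.
tau, _ = kendalltau([-s for s in factor_scores], ood_accuracies)
print(f"Kendall's tau between predicted and true rankings: {tau:.2f}")  # 1.00 here
```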
Both on synthetic (Warstadt et al., 2020) and
real datasets (McCoy et al., 2019; Zhang et al.,
2019), we find that, between models with similar
architectures but different training processes, both
our accuracy baseline and attribution-based factors
are good at distinguishing relative model performance
on OOD data. However, on models with
different base architectures, we discover interesting
patterns: factors can very strongly
distinguish between different types of models, but
cannot always map these differences to correct predictions
of OOD performance. In practice, we find
probe set accuracy to be a quick and reliable tool
for understanding OOD performance, whereas factors
are capable of more fine-grained distinctions
in certain situations.
Our Contributions:
(1) We benchmark, in several settings, methods for predicting and understanding
relative OOD performance with few-shot
OOD samples. (2) We establish a ranking-based
evaluation framework for systems in our problem
setting. (3) We analyze patterns in how accuracy
on a few-shot set and factors derived from token
attributions distinguish models.
2 Motivating Example
To expand on Figure 1, Figure 2 shows an in-depth
motivating example of our process. We show three
feature attributions from three different models on
an example from the HANS dataset (McCoy et al.,
2019). These models have (unknown) varied OOD
performance but similar performance on the in-domain
MNLI (Williams et al., 2018) data. Our
task is then to correctly rank these models' performance
on the HANS dataset in a few-shot manner.

Figure 2: Explanations generated on the same sample
(premise: "The manager knew the athlete mentioned the actor";
hypothesis: "The manager knew the athlete") for HANS
subsequence models M1, M2, M3, listed in ascending order of
OOD performance (M1 < M2 < M3). Their summed subsequence
attributions (0.31, 0.253, and -0.04, respectively) show the
opposite ordering of agreement with the pathological heuristic
(M1 > M2 > M3); this factor (shaded underlines), derived from
knowledge of the OOD setting, lets us predict the model
ranking in this example.
We can consider ranking these models via simple
metrics like accuracy on the small few-shot dataset,
where higher-scoring models are ranked higher.
However, such estimates can have high variance on
small datasets. In Figure 2, only M3 predicts non-entailment
correctly, and we cannot distinguish the
OOD performance of M1 and M2 without additional
information.
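For concreteness, the sketch below shows a minimal version of this accuracy baseline; the model names and the `predict` interface are assumptions made purely for illustration.

```python
# Minimal sketch of the few-shot accuracy baseline: rank candidate models
# by their accuracy on the small labeled OOD probe set.
def probe_accuracy(model, probe_examples, probe_labels):
    preds = [model.predict(x) for x in probe_examples]  # assumed predict() API
    return sum(p == y for p, y in zip(preds, probe_labels)) / len(probe_labels)

def rank_by_probe_accuracy(models, probe_examples, probe_labels):
    """models: dict mapping a model name (e.g., "M1") to a model object."""
    scores = {name: probe_accuracy(m, probe_examples, probe_labels)
              for name, m in models.items()}
    # Higher probe accuracy -> higher predicted OOD rank.
    return sorted(scores, key=scores.get, reverse=True)
```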
Thus, we turn to explanations to gain more insight
into the models' underlying behavior. With
faithful attributions, we should be able to determine
if the model is following simple inaccurate rules
called heuristics (McCoy et al., 2019). Figure 2
shows the heuristic where a model predicts that the
sentence A entails B if B is a subsequence of A.
Crucially, we can use model attributions to assess
a model's use of this heuristic: we can sum the attribution
mass the model places on the subsequence tokens.
We use the term factors to refer to such functions
over model attributions.
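The sketch below shows one way such a subsequence factor could be computed; the token alignment and the decision to sum signed (rather than absolute) attributions are our own illustrative choices, not necessarily the exact definition used in our experiments.

```python
# Sketch of a subsequence factor: sum the attribution mass the model places
# on the premise tokens that form the hypothesis subsequence.
# attributions[i] is assumed to be one signed score per premise token.
def subsequence_factor(premise_tokens, hypothesis_tokens, attributions):
    """Return the summed attribution over the matched subsequence span,
    or 0.0 if the hypothesis is not a contiguous subsequence of the premise."""
    n, m = len(premise_tokens), len(hypothesis_tokens)
    for start in range(n - m + 1):
        if premise_tokens[start:start + m] == hypothesis_tokens:
            return sum(attributions[start:start + m])
    return 0.0

# Figure 2 example: premise "The manager knew the athlete mentioned the actor",
# hypothesis "The manager knew the athlete"; the factor sums the attribution
# mass placed on "The manager knew the athlete" within the premise.
```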
The use of factors potentially allows us to automate
the detection of spurious signals or shortcut
learning (Geirhos et al., 2020). While prior
work has shown that spurious correlations are hard
for a human user to detect from explanations (Adebayo
et al., 2022), well-designed factors could automatically
analyze model behavior across a number
of tasks and detect such failures.