Predicting Fine-Tuning Performance with Probing
Zining Zhu1,2, Soroosh Shahtalebi2, Frank Rudzicz1,2,3
1University of Toronto 2Vector Institute for Artificial Intelligence 3Unity Health Toronto
zining@cs.toronto.edu, soroosh.shahtalebi@vectorinstitute.ai
frank@spoclab.com
Abstract

Large NLP models have recently shown impressive performance in language understanding tasks, typically evaluated by their fine-tuned performance. Alternatively, probing has received increasing attention as a lightweight method for interpreting the intrinsic mechanisms of large NLP models. In probing, post-hoc classifiers are trained on "out-of-domain" datasets that diagnose specific abilities. While probing language models has led to insightful findings, it appears disjointed from the development of the models. This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development: the fine-tuning performance. We find that it is possible to use the accuracies of only three probing tests to predict the fine-tuning performance with errors 40%–80% smaller than baselines. We further discuss possible avenues where probing can empower the development of deep NLP models.
1 Introduction

Large-scale neural models have recently demonstrated state-of-the-art performance in a wide variety of tasks, including sentiment detection, paraphrase detection, linguistic acceptability, and entailment detection (Devlin et al., 2019; Radford et al., 2019; Peters et al., 2018). Developing systems for these tasks usually involves two stages: a pre-training stage, where the large neural models gain linguistic knowledge from weak supervision signals in massive corpora, and a fine-tuning stage, where the models acquire task-specific knowledge from labeled data. The fine-tuning results are widely used to benchmark the performance of neural models and to refine the models' development procedures.
However, these fine-tuning results are summary statistics and do not paint the full picture of deep neural models (Ethayarajh and Jurafsky, 2020; Bender and Koller, 2020). As researchers are increasingly concerned with interpreting the intrinsic mechanisms of deep neural models, many data-driven assessment methods have been developed. These assessments usually follow the route of compiling a targeted dataset and running post-hoc analyses. To date, one of the most popular interpretation methods is probing. To probe a neural model, one uses a predictor to obtain the labels from the representations embedded by the neural model. Probing analyses of deep neural models have revealed low-dimensional syntactic structures (Hewitt and Manning, 2019), common-sense knowledge (Petroni et al., 2019), and (to some extent) human-like abilities, including being surprised upon witnessing linguistic irregularity (Li et al., 2021) and reasoning about space and time (Aroca-Ouellette et al., 2021).
From the viewpoint of data-driven assessments, both fine-tuning and probing can reveal the abilities of deep neural networks, but they appear to steer in different directions:

In-domain vs. out-of-domain. Fine-tuning uses in-domain data: we evaluate the models on the same distributions as those in deployment. Probing, however, uses out-of-domain data: instead of simulating the deployment environment, the targeted datasets focus on diagnosing specific abilities.
Inclusive vs. specific. In fine-tuning, edge cases should be included, so that unexpected behavior after deployment can be minimized (Ribeiro et al., 2020) and the fine-tuning results can be stable (Zhang et al., 2021). On the contrary, probing datasets are more specialized, so smaller datasets suffice.¹

¹ Another viewpoint on the dataset requirement can be derived from learning theory. Loosely speaking, optimizing more parameters requires more data to reach stable results, and fine-tuning involves more parameters than probing. Zhu et al. (2022) provide a more quantitative discussion.
High performances vs. faithful interpretations. While fine-tuning methods are mainly studied from an algorithmic perspective to enhance the performance of language models, probing methods aim at assessing the faithfulness of language models. To fulfill the former objective, fine-tuning is accompanied by efforts in pre-training, collecting more data, building better representations, and exploring novel model architectures (He et al., 2021; Sun et al., 2021; Wang et al., 2021b; Jiang et al., 2020). Conversely, the latter goal is pursued by borrowing inspiration from a variety of other sources, including psycholinguistic assessment protocols (Futrell et al., 2019; Li et al., 2022a), information theory (Voita and Titov, 2020; Pimentel and Cotterell, 2021; Zhu and Rudzicz, 2020), and causal analysis (Slobodkin et al., 2021; Elazar et al., 2021).
In short, probing assessments are more specialized (and therefore more flexible) and less computationally expensive. In contrast, the performance scores of fine-tuning assessments are more relevant to the design and training of deep neural models.
Can probing be used in the development of deep neural models? This question involves two aspects:

Feasibility: Are probing results relevant to model development?

Operation: How should one set up probing analyses to obtain these useful results?

This paper attempts to answer both. For feasibility, we show that a crucial feedback signal in model development, the fine-tuning performance, can be predicted via probing results, indicating a positive answer to the feasibility question.
For operation, we run extensive ablation studies to simplify the probing configurations, leading to some heuristics for setting up probing analyses. We start with a battery of probing tasks and evaluate their utilities both task-wise and layer-wise (§5.2–§5.3). We then reduce the number of probing configurations, showing that as few as 3 configurations can predict fine-tuning results with RMSEs between 40% and 80% smaller than the control baseline (§5.5). To further answer the operation question, we run ablation studies on different probing configurations, including probing methods (§5.6) and the number of data samples (§5.7). We also analyze the uncertainty of the results (§5.8). Our analysis shows the possibility of using probing in developing high-performance deep neural models.

All code is open-sourced at https://github.com/SPOClab-ca/performance_prediction.
2 Related Work

Performance prediction. Xia et al. (2020) proposed a framework that predicts task performance using a collection of features, including the hyperparameters of the model and the percentage of text overlap between the source and target datasets. Srinivasan et al. (2021) extended this framework into a multilingual setting. Ye et al. (2021) considered the reliability of performance – an idea similar to that of Dodge et al. (2019). This paper differs from the performance prediction literature in the set of features – we use the probing results as features – and, more importantly, we aim at showing that probing results can improve the interpretability of the development procedures of large models.
Out-of-domain generalization. The out-of-domain generalization literature provides a variety of methods to improve the performance of out-of-domain classification. We defer to Wang et al. (2021a) for a summary. Gulrajani and Lopez-Paz (2020) ran empirical comparisons on many algorithms, and some theoretical analyses bound the performance of out-of-domain classification (Li et al., 2022b; Minsker and Mathieu, 2019). In our setting, the probing and the fine-tuning datasets can be considered different domains, and our analysis predicts the out-of-domain performance. A similar setting was presented in Kornblith et al. (2019), which studied the correlation between performance on ImageNet and the performance of transfer learning on a variety of image domains. Our setting focuses on text domains and uses specialized, small-sized probing datasets.
Probing, and the utility of LODNA. The probing literature reveals various abilities of deep neural models, as summarized by Rogers et al. (2020); Manning et al. (2020); Belinkov (2021); Pavlick (2022). There have been some discussions on the utility of probing results. Baroni (2021) argued that these linguistic-oriented deep neural network analyses (LODNA) should treat deep neural models as algorithmic linguistic theories; otherwise, LODNA has limited relevance to theoretical linguists. Recent literature in LODNA has drawn interesting findings by comparing the mechanisms by which algorithms and humans respond to external stimuli, including the relative importance of sentences (Hollenstein and Beinborn, 2021). Probing results, when used jointly with evidence from datasets, can also be used to predict the inductive bias of neural models (Lovering et al., 2021; Immer et al., 2021). As we show, probing results can explain the variance in, and even predict, the fine-tuning performance of neural NLP models.
Fine-tuning and probing. Multiple papers have explored fine-tuning and probing paradigms. Probing is used as a post-hoc method to interpret linguistic knowledge in deep neural models during pre-training (Liu et al., 2019a), fine-tuning (Miaschi et al., 2020; Mosbach et al., 2020; Durrani et al., 2021; Yu and Ettinger, 2021; Zhou and Srikumar, 2021), and other stages of model development (Ebrahimi et al., 2021). From a performance perspective, probing can sometimes result in higher performance metrics (e.g., accuracy) than fine-tuning (Liu et al., 2019a; Hall Maudslay et al., 2020), and fine-tuning can benefit from additional data (Phang et al., 2018). We take a different perspective, considering how probing and fine-tuning results relate to each other and, more importantly, how the signals of probing can be helpful towards developing large neural models.
3 Methods

We present the overall analysis method and evaluation metric in this section. §5 elaborates on the detailed experiment settings.
Predicting fine-tuning performance. A deep neural model $M$ can be fine-tuned on task $t$ to achieve performance $\mathcal{A}_t$. Let $S \in \mathbb{R}^N$ be the test accuracies of probing classifications on model $M$, using $N$ configurations. For example, a deep neural model $M = \text{RoBERTa}$ can be fine-tuned to reach performance $\mathcal{A}_t = 0.85$ on a $t = \text{RTE}$ task. With post-hoc classifiers applied to the 12 layers of $M$, we can probe for 12 test accuracies on a probing task (e.g., detecting past vs. present tense), which constitute $S$.²
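To make this setup concrete, the following is a minimal sketch (not the authors' released code) of how one such probing configuration could be collected: a linear probe is trained on the frozen representations of each of the 12 transformer layers, yielding 12 test accuracies. The model name, the mean-pooling strategy, and the choice of logistic regression as the probe are illustrative assumptions.

```python
# Sketch: collect layer-wise probing accuracies for one probing task.
# Assumptions: roberta-base, mean pooling, logistic-regression probes.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

def layer_features(sentences, layer):
    """Mean-pooled representations of `sentences` from one hidden layer."""
    feats = []
    with torch.no_grad():
        for s in sentences:
            enc = tokenizer(s, return_tensors="pt", truncation=True)
            hidden = model(**enc).hidden_states[layer]  # (1, seq_len, dim)
            feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

def probe_accuracies(train_sents, train_labels, test_sents, test_labels):
    """Train one probe per layer; return 12 test accuracies (a slice of S)."""
    accs = []
    for layer in range(1, 13):  # skip the embedding layer at index 0
        clf = LogisticRegression(max_iter=1000)
        clf.fit(layer_features(train_sents, layer), train_labels)
        accs.append(clf.score(layer_features(test_sents, layer), test_labels))
    return accs
```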
To find the pattern across a diverse category of models, we regress over $K$ models (which we describe in §4.3). The collected probing results $\{S^{(k)}\}_{k=1}^{K}$ can be used to predict the fine-tuning performance $\{\mathcal{A}_t^{(k)}\}_{k=1}^{K}$ via regression. Formally, this procedure optimizes for $N+1$ parameters, $\theta \in \mathbb{R}^{N+1}$, so that:

$$\theta^* = \operatorname*{argmin}_{\theta} \sum_k \big\| \theta^{\top} S^{(k)} - \mathcal{A}_t^{(k)} \big\|^2 \tag{1}$$

² Following the default implementation of linear regression, we include an additional dimension in $S^{(k)}$ to multiply with the bias term, so $S^{(k)} \in \mathbb{R}^{N+1}$ in the following equations.
This procedure has closed-form solutions that are implemented in various scientific computation toolkits (e.g., R and scipy). The minimum reachable RMSE is therefore:

$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_k \big\| {\theta^*}^{\top} S^{(k)} - \mathcal{A}_t^{(k)} \big\|^2} \tag{2}$$
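As a concrete illustration, here is a minimal sketch of Equations (1) and (2) using the closed-form least-squares solution. The array shapes and synthetic values (e.g., $K = 24$ models and $N = 3$ probing configurations) are assumptions for demonstration only.

```python
# Sketch: regress fine-tuning performances on probing accuracies (Eqs. 1-2).
import numpy as np

def fit_and_rmse(S, A):
    """Closed-form linear regression of A (K,) on probing results S (K, N);
    returns the minimum reachable RMSE of Eq. (2)."""
    K = S.shape[0]
    X = np.hstack([S, np.ones((K, 1))])            # extra dimension for bias
    theta, *_ = np.linalg.lstsq(X, A, rcond=None)  # solves Eq. (1)
    residuals = X @ theta - A
    return np.sqrt(np.mean(residuals ** 2))        # Eq. (2)

# Illustrative shapes only: 24 models, 3 probing configurations.
rng = np.random.default_rng(0)
S = rng.uniform(0.5, 1.0, size=(24, 3))
A = rng.uniform(0.6, 0.95, size=24)
print(fit_and_rmse(S, A))
```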
RMSE-reduction. While RMSE can evaluate the quality of this regression, it is insufficient for measuring the informativeness of $S$ due to the discrepancy among the fine-tuning tasks $t$. Suppose we have two tasks, $t_1$ and $t_2$, where the probing results $S$ can support high-precision regressions to $\mathrm{RMSE} = 0.01$ on both tasks. However, on $t_1$, even features drawn from random distributions³ might be sufficient to reach $\mathrm{RMSE} = 0.02$, while on the more difficult task, $t_2$, random features could only reach $\mathrm{RMSE} = 0.10$ at best. The probing results $S$ are more useful for $t_2$ than for $t_1$, but RMSE itself does not capture this difference.

³ Considering the small data sizes (i.e., the total number of models studied), even "random features" drawn from random noise contain artefacts – patterns that can be used to regress the results.

Considering this, we should further adjust against a baseline: the minimum reachable RMSE using random features.

$$\theta_c = \operatorname*{argmin}_{\theta} \sum_k \big\| \theta^{\top} \tilde{S}^{(k)} - \mathcal{A}_t^{(k)} \big\|^2, \tag{3}$$

where the random features $\tilde{S}$ are drawn from $\mathcal{N}(0, 0.1)$. Overall, the RMSE and the reduction from the baseline are computed as:

$$\mathrm{RMSE}_c = \sqrt{\frac{1}{K} \sum_k \big\| \theta_c^{\top} \tilde{S}^{(k)} - \mathcal{A}_t^{(k)} \big\|^2} \tag{4}$$

$$\mathrm{RMSE\_reduction} = \frac{\mathrm{RMSE}_c - \mathrm{RMSE}}{\mathrm{RMSE}_c} \times 100 \tag{5}$$

In the experiments, all RMSE and $\mathrm{RMSE}_c$ values follow 5-fold cross-validation. We report the RMSE_reduction as the score that measures the utility of $S$.
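The control baseline and the reduction score could be computed as in the following sketch, which pairs Equations (3)–(5) with the 5-fold cross-validation mentioned above. Interpreting $\mathcal{N}(0, 0.1)$ as mean 0 with scale 0.1, and the use of scikit-learn's LinearRegression, are assumptions.

```python
# Sketch: control baseline with random features and RMSE_reduction (Eqs. 3-5),
# using the 5-fold cross-validation applied to all reported RMSE values.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_rmse(X, A, n_splits=5, seed=0):
    """Cross-validated RMSE of regressing performances A on features X."""
    sq_errs = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        reg = LinearRegression().fit(X[tr], A[tr])  # bias term included
        sq_errs.append((reg.predict(X[te]) - A[te]) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_errs)))

def rmse_reduction(S, A, seed=0):
    """Eq. (5): how much the probing features S beat random features."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 0.1, size=S.shape)  # random features for Eq. (3)
    rmse = cv_rmse(S, A)                        # RMSE with probing features
    rmse_c = cv_rmse(noise, A)                  # control baseline, Eq. (4)
    return (rmse_c - rmse) / rmse_c * 100
```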
4 Evaluation tasks and datasets

4.1 Fine-tuning tasks

We consider 6 binary classification tasks in GLUE (Wang et al., 2019) as fine-tuning tasks: RTE consists of a collection of challenges recognizing textual entailment. Given two sentences, the model predicts whether the first entails the second.