Predicting Fine-Tuning Performance with Probing
Zining Zhu1,2, Soroosh Shahtalebi2, Frank Rudzicz1,2,3
1University of Toronto 2Vector Institute for Artificial Intelligence 3Unity Health Toronto
zining@cs.toronto.edu, soroosh.shahtalebi@vectorinstitute.ai
frank@spoclab.com
Abstract

Large NLP models have recently shown impressive performance in language understanding tasks, typically evaluated by their fine-tuned performance. Alternatively, probing has received increasing attention as a lightweight method for interpreting the intrinsic mechanisms of large NLP models. In probing, post-hoc classifiers are trained on "out-of-domain" datasets that diagnose specific abilities. While probing language models has led to insightful findings, it appears disjointed from the development of the models. This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development: the fine-tuning performance. We find that it is possible to use the accuracies of only three probing tests to predict the fine-tuning performance with errors 40%–80% smaller than baselines. We further discuss possible avenues where probing can empower the development of deep NLP models.
1 Introduction

Large-scale neural models have recently demonstrated state-of-the-art performance in a wide variety of tasks, including sentiment detection, paraphrase detection, linguistic acceptability, and entailment detection (Devlin et al., 2019; Radford et al., 2019; Peters et al., 2018). Developing systems for these tasks usually involves two stages: a pre-training stage, where the large neural models gain linguistic knowledge from weak supervision signals in massive corpora, and a fine-tuning stage, where the models acquire task-specific knowledge from labeled data. The fine-tuning results are widely used to benchmark the performance of neural models and to refine the models' development procedures.
However, these fine-tuning results are summary statistics and do not paint the full picture of deep neural models (Ethayarajh and Jurafsky, 2020; Bender and Koller, 2020). As researchers are increasingly concerned with interpreting the intrinsic mechanisms of deep neural models, many data-driven assessment methods have been developed. These assessments usually follow the route of compiling a targeted dataset and running post-hoc analyses. To date, one of the most popular interpretation methods is probing. To probe a neural model, one uses a predictor to obtain the labels from the representations embedded by the neural model. Probing analyses of deep neural models have revealed low-dimensional syntactic structures (Hewitt and Manning, 2019), common-sense knowledge (Petroni et al., 2019), and (to some extent) human-like abilities, including being surprised upon witnessing linguistic irregularity (Li et al., 2021) and reasoning about space and time (Aroca-Ouellette et al., 2021).
From the viewpoint of data-driven assessments, both fine-tuning and probing can reveal the abilities of deep neural networks, but they appear to steer in different directions:

In-domain vs. out-of-domain. Fine-tuning uses in-domain data: we evaluate the models on the same distributions as those in deployment. Probing, however, uses out-of-domain data: instead of simulating the deployment environment, the targeted datasets focus on diagnosing specific abilities.
Inclusive vs. specific. In fine-tuning, edge cases should be included, so that unexpected behavior after deployment can be minimized (Ribeiro et al., 2020) and the fine-tuning results can be stable (Zhang et al., 2021). On the contrary, probing datasets are more specialized, so smaller datasets suffice.¹

¹ Another viewpoint on the dataset requirement can be derived from learning theory. Loosely speaking, optimizing more parameters requires more data to reach stable results, and fine-tuning involves more parameters than probing. Zhu et al. (2022) provide a more quantitative discussion.
High performances vs. faithful interpretations. While fine-tuning methods are mainly studied from an algorithmic perspective to enhance the performance of language models, probing methods aim at assessing the faithfulness of language models. To fulfill the former objective, fine-tuning is accompanied by efforts in pre-training, collecting more data, building better representations, and exploring novel model architectures (He et al., 2021; Sun et al., 2021; Wang et al., 2021b; Jiang et al., 2020). Conversely, the latter goal is pursued by borrowing inspiration from a variety of other sources, including psycholinguistic assessment protocols (Futrell et al., 2019; Li et al., 2022a), information theory (Voita and Titov, 2020; Pimentel and Cotterell, 2021; Zhu and Rudzicz, 2020), and causal analysis (Slobodkin et al., 2021; Elazar et al., 2021).
In short, probing assessments are more specialized (and therefore more flexible) and less computationally expensive. In contrast, the performance scores of fine-tuning assessments are more relevant to the design and training of deep neural models.
Can probing be used in the development of deep neural models? This question involves two aspects:

Feasibility: Are probing results relevant to model development?

Operation: How should one set up probing analyses to obtain these useful results?

This paper attempts to answer both. For feasibility, we show that a crucial feedback signal in model development, the fine-tuning performance, can be predicted via probing results, indicating a positive answer to the feasibility question.
For operation, we run extensive ablation studies to simplify the probing configurations, leading to some heuristics for setting up probing analyses. We start with a battery of probing tasks and evaluate their utilities both task-wise and layer-wise (§5.2–§5.3). We then reduce the number of probing configurations, showing that as few as 3 configurations can predict fine-tuning results with RMSEs between 40% and 80% smaller than the control baseline (§5.5). To further answer the operation question, we run ablation studies on different probing configurations, including probing methods (§5.6) and the number of data samples (§5.7). We also analyze the uncertainty of the results (§5.8). Our analysis shows the possibility of using probing in developing high-performance deep neural models.

All code is open-sourced at https://github.com/SPOClab-ca/performance_prediction.
2 Related Work

Performance prediction. Xia et al. (2020) proposed a framework that predicts task performance using a collection of features, including the hyperparameters of the model and the percentage of text overlap between the source and target datasets. Srinivasan et al. (2021) extended this framework into a multilingual setting. Ye et al. (2021) considered the reliability of performance – an idea similar to that of Dodge et al. (2019). This paper differs from the performance prediction literature in the set of features – we use the probing results as features – and, more importantly, we aim at showing that probing results can improve the interpretability of the development procedures of large models.
Out-of-domain generalization. The out-of-domain generalization literature provides a variety of methods to improve the performance of out-of-domain classification. We defer to Wang et al. (2021a) for a summary. Gulrajani and Lopez-Paz (2020) ran empirical comparisons on many algorithms, and some theoretical analyses bound the performance of out-of-domain classification (Li et al., 2022b; Minsker and Mathieu, 2019). In our setting, the probing and the fine-tuning datasets can be considered different domains, and our analysis predicts the out-of-domain performance. A similar setting was presented in Kornblith et al. (2019), which studied the correlation between performance on ImageNet and the performance of transfer learning on a variety of image domains. Our setting focuses on text domains and uses specialized, small-sized probing datasets.
Probing, and the utility of LODNA. The probing literature reveals various abilities of deep neural models, as summarized by Rogers et al. (2020); Manning et al. (2020); Belinkov (2021); Pavlick (2022). There have been some discussions on the utility of probing results. Baroni (2021) argued that these linguistic-oriented deep neural network analyses (LODNA) should treat deep neural models as algorithmic linguistic theories; otherwise, LODNA has limited relevance to theoretical linguists. Recent literature in LODNA has drawn interesting findings by comparing the mechanisms by which algorithms and humans respond to external stimuli, including the relative importance of sentences (Hollenstein and Beinborn, 2021). Probing results, when used jointly with evidence from datasets, can also be used to predict the inductive bias of neural models (Lovering et al., 2021; Immer et al., 2021). As we show, probing results can explain the variance in, and even predict, the fine-tuning performance of neural NLP models.
Fine-tuning and probing. Multiple papers have explored fine-tuning and probing paradigms. Probing is used as a post-hoc method to interpret linguistic knowledge in deep neural models during pre-training (Liu et al., 2019a), fine-tuning (Miaschi et al., 2020; Mosbach et al., 2020; Durrani et al., 2021; Yu and Ettinger, 2021; Zhou and Srikumar, 2021), and other stages of model development (Ebrahimi et al., 2021). From a performance perspective, probing can sometimes result in higher performance metrics (e.g., accuracy) than fine-tuning (Liu et al., 2019a; Hall Maudslay et al., 2020), and fine-tuning can benefit from additional data (Phang et al., 2018). We take a different perspective, considering how probing and fine-tuning results relate to each other and, more importantly, how the signals of probing can be helpful towards developing large neural models.
3 Methods

We present the overall analysis method and evaluation metric in this section. §5 elaborates on the detailed experiment settings.
Predicting fine-tuning performance. A deep neural model $M$ can be fine-tuned on task $t$ to achieve performance $\mathcal{A}_t$. Let $S \in \mathbb{R}^N$ be the test accuracies of probing classifications on model $M$, using $N$ configurations. For example, a deep neural model $M = \text{RoBERTa}$ can be fine-tuned to reach performance $\mathcal{A}_t = 0.85$ on a $t = \text{RTE}$ task. With post-hoc classifiers applied to the 12 layers of $M$, we can probe for 12 test accuracies on a probing task (e.g., detecting past vs. present tense), which constitute $S$.²
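To make this setup concrete, the following is a minimal sketch (not the authors' released code) of how one such probing configuration could be collected: a linear probe is trained on the frozen representations of each of the 12 transformer layers, yielding 12 test accuracies. The model name, the mean-pooling strategy, and the choice of logistic regression as the probe are illustrative assumptions.

```python
# Sketch: collect layer-wise probing accuracies for one probing task.
# Assumptions: roberta-base, mean pooling, logistic-regression probes.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

def layer_features(sentences, layer):
    """Mean-pooled representations of `sentences` from one hidden layer."""
    feats = []
    with torch.no_grad():
        for s in sentences:
            enc = tokenizer(s, return_tensors="pt", truncation=True)
            hidden = model(**enc).hidden_states[layer]  # (1, seq_len, dim)
            feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

def probe_accuracies(train_sents, train_labels, test_sents, test_labels):
    """Train one probe per layer; return 12 test accuracies (a slice of S)."""
    accs = []
    for layer in range(1, 13):  # skip the embedding layer at index 0
        clf = LogisticRegression(max_iter=1000)
        clf.fit(layer_features(train_sents, layer), train_labels)
        accs.append(clf.score(layer_features(test_sents, layer), test_labels))
    return accs
```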
To find the pattern across a diverse category of models, we regress over $K$ models (which we describe in §4.3). The collected probing results $\{S^{(k)}\}_{k=1}^{K}$ can be used to predict the fine-tuning performance $\{\mathcal{A}_t^{(k)}\}_{k=1}^{K}$ via regression. Formally, this procedure optimizes for $N+1$ parameters, $\theta \in \mathbb{R}^{N+1}$, so that:

$$\theta^* = \operatorname*{argmin}_{\theta} \sum_k \big\| \theta^{\top} S^{(k)} - \mathcal{A}_t^{(k)} \big\|^2 \tag{1}$$

² Following the default implementation of linear regression, we include an additional dimension in $S^{(k)}$ to multiply with the bias term, so $S^{(k)} \in \mathbb{R}^{N+1}$ in the following equations.
This procedure has closed-form solutions that are implemented in various scientific computation toolkits (e.g., R and scipy). The minimum reachable RMSE is therefore:

$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_k \big\| {\theta^*}^{\top} S^{(k)} - \mathcal{A}_t^{(k)} \big\|^2} \tag{2}$$
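As a concrete illustration, here is a minimal sketch of Equations (1) and (2) using the closed-form least-squares solution. The array shapes and synthetic values (e.g., $K = 24$ models and $N = 3$ probing configurations) are assumptions for demonstration only.

```python
# Sketch: regress fine-tuning performances on probing accuracies (Eqs. 1-2).
import numpy as np

def fit_and_rmse(S, A):
    """Closed-form linear regression of A (K,) on probing results S (K, N);
    returns the minimum reachable RMSE of Eq. (2)."""
    K = S.shape[0]
    X = np.hstack([S, np.ones((K, 1))])            # extra dimension for bias
    theta, *_ = np.linalg.lstsq(X, A, rcond=None)  # solves Eq. (1)
    residuals = X @ theta - A
    return np.sqrt(np.mean(residuals ** 2))        # Eq. (2)

# Illustrative shapes only: 24 models, 3 probing configurations.
rng = np.random.default_rng(0)
S = rng.uniform(0.5, 1.0, size=(24, 3))
A = rng.uniform(0.6, 0.95, size=24)
print(fit_and_rmse(S, A))
```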
RMSE-reduction. While RMSE can evaluate the quality of this regression, it is insufficient for measuring the informativeness of $S$ due to the discrepancy among the fine-tuning tasks $t$. Suppose we have two tasks, $t_1$ and $t_2$, where the probing results $S$ can support high-precision regressions to $\mathrm{RMSE} = 0.01$ on both tasks. However, on $t_1$, even features drawn from random distributions³ might be sufficient to reach $\mathrm{RMSE} = 0.02$, while on the more difficult task, $t_2$, random features could only reach $\mathrm{RMSE} = 0.10$ at best. The probing results $S$ are more useful for $t_2$ than for $t_1$, but RMSE itself does not capture this difference.

³ Considering the small data sizes (i.e., the total number of models studied), even "random features" drawn from random noise contain artefacts – patterns that can be used to regress the results.

Considering this, we should further adjust against a baseline: the minimum reachable RMSE using random features.

$$\theta_c = \operatorname*{argmin}_{\theta} \sum_k \big\| \theta^{\top} \tilde{S}^{(k)} - \mathcal{A}_t^{(k)} \big\|^2, \tag{3}$$

where the random features $\tilde{S}$ are drawn from $\mathcal{N}(0, 0.1)$. Overall, the RMSE and the reduction from the baseline are computed as:

$$\mathrm{RMSE}_c = \sqrt{\frac{1}{K} \sum_k \big\| \theta_c^{\top} \tilde{S}^{(k)} - \mathcal{A}_t^{(k)} \big\|^2} \tag{4}$$

$$\mathrm{RMSE\_reduction} = \frac{\mathrm{RMSE}_c - \mathrm{RMSE}}{\mathrm{RMSE}_c} \times 100 \tag{5}$$

In the experiments, all RMSE and $\mathrm{RMSE}_c$ values follow 5-fold cross-validation. We report the RMSE_reduction as the score that measures the utility of $S$.
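The control baseline and the reduction score could be computed as in the following sketch, which pairs Equations (3)–(5) with the 5-fold cross-validation mentioned above. Interpreting $\mathcal{N}(0, 0.1)$ as mean 0 with scale 0.1, and the use of scikit-learn's LinearRegression, are assumptions.

```python
# Sketch: control baseline with random features and RMSE_reduction (Eqs. 3-5),
# using the 5-fold cross-validation applied to all reported RMSE values.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_rmse(X, A, n_splits=5, seed=0):
    """Cross-validated RMSE of regressing performances A on features X."""
    sq_errs = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        reg = LinearRegression().fit(X[tr], A[tr])  # bias term included
        sq_errs.append((reg.predict(X[te]) - A[te]) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_errs)))

def rmse_reduction(S, A, seed=0):
    """Eq. (5): how much the probing features S beat random features."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 0.1, size=S.shape)  # random features for Eq. (3)
    rmse = cv_rmse(S, A)                        # RMSE with probing features
    rmse_c = cv_rmse(noise, A)                  # control baseline, Eq. (4)
    return (rmse_c - rmse) / rmse_c * 100
```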
4 Evaluation tasks and datasets

4.1 Fine-tuning tasks

We consider 6 binary classification tasks in GLUE (Wang et al., 2019) as fine-tuning tasks: RTE consists of a collection of challenges recognizing textual entailment. Given two sentences, the model predicts whether the first entails the second.