mance of language models, probing methods aim
at assessing the faithfulness of language models.
To fulfill the former objective, fine-tuning is accompanied by efforts in pre-training, collecting more data, building better representations, and exploring novel model architectures (He et al., 2021; Sun et al., 2021; Wang et al., 2021b; Jiang et al., 2020).
Conversely, the latter goal is pursued by borrowing inspiration from a variety of other sources, including psycholinguistic assessment protocols (Futrell et al., 2019; Li et al., 2022a), information theory (Voita and Titov, 2020; Pimentel and Cotterell, 2021; Zhu and Rudzicz, 2020), and causal analysis (Slobodkin et al., 2021; Elazar et al., 2021).
In short, probing assessments are more special-
ized (therefore more flexible) and less computation-
ally expensive. In contrast, the performance scores
of fine-tuning assessments are more relevant to the
design and training of deep neural models.
Can probing be used in the development of deep neural models? This question involves two aspects:
• Feasibility: Are probing results relevant to model development?
• Operation: How should probing analyses be set up to obtain these useful results?
This paper attempts to answer both. For feasi-
bility, we show that a crucial feedback signal in
model development, the fine-tuning performance,
can be predicted via probing results, indicating a
positive answer to the feasibility question.
For operation, we run extensive ablation studies to simplify the probing configurations, leading to heuristics for setting up probing analyses. We start with a battery of probing tasks and evaluate their utility both task-wise and layer-wise (§5.2 - §5.3). We then reduce the number of probing configurations, showing that as few as three configurations can predict fine-tuning results with RMSEs between 40% and 80% smaller than the control baseline (§5.5). To further answer the operation question,
we run ablation studies on different probing config-
urations, including probing methods (§5.6) and the
number of data samples (§5.7). We also analyze
the uncertainty of the results (§5.8). Our analysis
shows the possibility of using probing in develop-
ing high-performance deep neural models.
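To make this idea concrete, the following is a minimal sketch of how probing results can feed into performance prediction. It is not the paper's actual pipeline from §5: the features, the synthetic data, and the linear regressor below are illustrative assumptions. The sketch treats the probing accuracies of a few configurations as input features, fits a regressor whose target is the fine-tuning score, and compares its RMSE against a mean-predictor control baseline.

```python
# Minimal sketch (synthetic data, not the paper's actual pipeline):
# use probing accuracies as features to predict fine-tuning performance,
# and compare the regressor's RMSE against a mean-predictor control baseline.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Rows: model variants / checkpoints; columns: probing configurations,
# e.g. hypothetical (task, layer) pairs. All values are synthetic.
n_models, n_probe_configs = 40, 3
probe_acc = rng.uniform(0.5, 0.95, size=(n_models, n_probe_configs))

# Synthetic fine-tuning scores, loosely correlated with the probing features.
finetune_acc = 0.4 + 0.5 * probe_acc.mean(axis=1) + rng.normal(0.0, 0.02, n_models)

X_train, X_test, y_train, y_test = train_test_split(
    probe_acc, finetune_acc, test_size=0.25, random_state=0
)

predictor = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, predictor.predict(X_test)))

# Control baseline: always predict the mean fine-tuning score of the training set.
baseline_rmse = np.sqrt(
    mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
)

print(f"probing-based RMSE:    {rmse:.4f}")
print(f"control baseline RMSE: {baseline_rmse:.4f}")
```

An RMSE substantially below the control baseline would indicate that the probing features carry usable signal about fine-tuning performance.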
All code is open-sourced at https://github.com/SPOClab-ca/performance_prediction.
2 Related Work
Performance prediction
Xia et al. (2020) pro-
posed a framework that predicts task performance
using a collection of features, including the hy-
perparameters of the model and the percentage of
text overlap between the source and target datasets.
Srinivasan et al. (2021) extended this framework to a multilingual setting. Ye et al. (2021) considered the reliability of performance estimates, an idea similar to that of Dodge et al. (2019). This paper differs from the performance prediction literature in the set of features, as we use probing results as features, and, more importantly, we aim to show that probing results can improve the interpretability of the development procedure of large models.
Out-of-domain generalization
The out-of-domain generalization literature provides a variety of methods to improve the performance of out-of-domain classification. We refer readers to Wang et al. (2021a) for a summary. Gulrajani and Lopez-Paz (2020) ran empirical comparisons of many algorithms, and some theoretical analyses bound the performance of out-of-domain classification (Li et al., 2022b; Minsker and Mathieu, 2019). In our setting, the probing and the fine-tuning datasets can be considered different domains, so our analysis effectively predicts out-of-domain performance. A similar setting was presented in Kornblith et al.
(2019), which studied the correlation between the
performance on ImageNet and the performance of
transfer learning on a variety of image domains.
Our setting focuses on text domains and uses specialized, small probing datasets.
Probing, and the utility of LODNA
The prob-
ing literature reveals various abilities of deep neu-
ral models, as summarized by Rogers et al. (2020);
Manning et al. (2020); Belinkov (2021); Pavlick
(2022). There have been some discussions on the
utility of probing results. Baroni (2021) argued that these linguistically oriented deep neural network analyses (LODNA) should treat deep neural models as algorithmic linguistic theories; otherwise, LODNA has limited relevance to theoretical linguists. Recent LODNA literature has reported interesting findings by comparing how algorithms and humans respond to external stimuli, including the relative importance of sentences (Hollenstein and Beinborn, 2021). Probing results, when used
jointly with evidence from datasets, can also be
used to predict the inductive bias of neural models