
DATASET                          TASK                             TRAIN / DEV   |Y|   METRIC
CLASSIFICATION
AGNews (Zhang et al., 2015)      Topic Classification             84K / 12K      4    micro-F1
Airline (Crowdflower, 2020)      Sentiment Analysis               10K / 1.5K     3    micro-F1
SciERC (Luan et al., 2018)       Relation Classification          1.9K / 275     7    macro-F1
MNLI (Williams et al., 2018)     Natural Language Inference       393K / 20K     3    micro-F1
QNLI (Rajpurkar et al., 2016)    Q&A/Natural Language Inference   105K / 5.4K    2    micro-F1
RTE (Giampiccolo et al., 2007)   Natural Language Inference       2.5K / 3K      3    micro-F1
STR. PRED.
EWT (Silveira et al., 2014)      Dependency Labeling              12.5K / 2K    36    micro-F1
CrossNER (Liu et al., 2021)      Named Entity Recognition         15K / 3.5K     4    span-F1
CrossNER (Liu et al., 2021)      Named Entity Recognition         200 / 450     17    span-F1
JobStack (Jensen et al., 2021)   De-identification                18K / 2K      11    span-F1
Table 1: Datasets. Shown are the 10 datasets used in this study, divided between the two NLP problem types, classification (C) and structured prediction (SP), and covering a wide variety of tasks and domains. C tasks cover AGNews (news articles), Twitter Airline Sentiment (Airline; Twitter feedback), SciERC (AI proceedings), MNLI (speech, (non-)fiction, government), QNLI (Wikipedia) and RTE (Wikipedia, news). Within the SP tasks, we experiment on the English Web Treebank (EWT; social media, reviews, emails), CrossNER (news, scientific Wikipedia) and JobStack (Stack Overflow job ads). For each dataset, we report its TRAIN/DEV split, label space |Y|, and task-specific performance metric.
2 Transferability Estimation
Transferability estimation aims to quantify the ability of a model to transfer knowledge learned from one task to another (Eaton et al., 2008; Sinapov et al., 2015). Formally, given a pool of $L$ pre-trained LMs $\{\phi_l\}_{l=1}^{L}$ and a dataset $D$, we calculate a predictive score $S_l(D)$ for each $\phi_l$ which ideally correlates with the model's final performance $P_l(D)$. $S_l(D)$ is computed without fine-tuning $\phi_l$ on $D$, such that the optimal $\phi^*_l$ can be chosen from a large model pool at a low computational cost.
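To make this selection protocol concrete, the following is a minimal sketch; the toy model names, scores, and the use of Spearman rank correlation are illustrative assumptions, not the exact evaluation protocol of this paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical pool of pre-trained encoders phi_l with their predictive scores S_l(D)
# (computed without fine-tuning) and their fine-tuned performance P_l(D).
scores = {"encoder-a": 0.12, "encoder-b": 0.31, "encoder-c": 0.27}       # S_l(D)
performance = {"encoder-a": 71.4, "encoder-b": 83.2, "encoder-c": 80.9}  # P_l(D)

# Select the encoder with the highest predictive score ...
best = max(scores, key=scores.get)

# ... and check how well the scores track actual fine-tuned performance via rank correlation.
models = sorted(scores)
rho = spearmanr([scores[m] for m in models], [performance[m] for m in models]).correlation
print(best, rho)
```

The better the scores correlate with fine-tuned performance across the pool, the more reliably the cheapest-to-compute encoder ranking can replace exhaustive fine-tuning.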
The CV community has begun to explore methods for encoder pre-selection and ranking through metrics such as LogME and the Log Expected Empirical Prediction (LEEP; Nguyen et al., 2020). These are widely used, state-of-the-art methods in CV. Recent work introduced the Gaussian Bhattacharyya Coefficient (GBC; Pándy et al., 2021) and Optimal Transport based Conditional Entropy (OTCE; Tan et al., 2021), the exploration of which we leave for future work. In the NLP field, however, related work focuses on choosing a task rather than an LM encoder for transferability (Vu et al., 2020; Padmakumar et al., 2022), leaving the ranking of encoders an unexplored question.
LogME
LogME measures the suitability of all encoded dataset features $F \in \mathbb{R}^{|D| \times h}$ (e.g., embeddings with dimensionality $h$) to predict all scalar labels $y \in \mathbb{R}^{|D|}$ via the probability density $p(y|F)$. As this density is intractable, it is estimated by mapping $F \rightarrow y$ using a linear transformation $w$; this is akin to training a linear probe with optimal parameters $w^*$ and using the likelihood $p(y|F, w^*)$ as a proxy for feature suitability. Because a simple linear model will overfit on the training data, it would be beneficial to obtain the marginal likelihood, or evidence, by integrating over all possible values of $w$: $p(y|F) = \int p(y|F, w)\,p(w)\,dw$. To once again make this computation tractable, You et al. (2021) reformulate it as an efficient, iterative evidence maximization problem where both $w$ and $y$ are drawn from lightly parametrized, isotropic Gaussian distributions. The normalized logarithm of the maximized evidence (LogME) can then be used as $S_l(D)$ to rank encoder models directly.
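As a concrete reference, below is a minimal NumPy sketch of this evidence maximization, using the standard fixed-point updates for the hyperparameters of Bayesian linear regression; it is an illustrative reimplementation rather than the authors' released code, and for C tasks with multiple classes the score is typically averaged over the one-hot label columns (as in You et al., 2021).

```python
import numpy as np

def logme(F: np.ndarray, y: np.ndarray, max_iter: int = 100, tol: float = 1e-5) -> float:
    """Normalized log evidence of a Bayesian linear model mapping features F (n x h) to targets y (n,)."""
    n, h = F.shape
    # Work in the SVD basis of F so each iteration stays cheap.
    u, s, _ = np.linalg.svd(F, full_matrices=False)   # F = u @ diag(s) @ vt
    sigma = s ** 2                                    # eigenvalues of F^T F
    z = u.T @ y                                       # y projected onto the column space of F
    y_res2 = float(y @ y - z @ z)                     # part of ||y||^2 outside that column space

    alpha, beta = 1.0, 1.0                            # precisions of the prior over w and of the noise
    for _ in range(max_iter):
        gamma = float(np.sum(beta * sigma / (alpha + beta * sigma)))   # effective number of parameters
        d = s * z / (alpha / beta + sigma)            # posterior mean m of w, in the right-singular basis
        m2 = float(d @ d)                             # m^T m
        res2 = float(np.sum((z - s * d) ** 2)) + y_res2                # ||F m - y||^2
        alpha_new = gamma / (m2 + 1e-12)
        beta_new = (n - gamma) / (res2 + 1e-12)
        converged = abs(alpha_new - alpha) / alpha < tol and abs(beta_new - beta) / beta < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break

    # Log evidence log p(y | F, alpha, beta), normalized by the number of instances.
    logdet = float(np.sum(np.log(alpha + beta * sigma))) + (h - len(sigma)) * np.log(alpha)
    evidence = (n / 2 * np.log(beta) + h / 2 * np.log(alpha) - n / 2 * np.log(2 * np.pi)
                - beta / 2 * res2 - alpha / 2 * m2 - logdet / 2)
    return float(evidence / n)
```

A higher value indicates features from which the labels are more easily predicted, so candidate encoders can be ranked by this score directly.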
NLP Setting
LogME has shown promise for CV, and an initial study on the GLUE benchmark (Wang et al., 2018) indicates the same for NLP (You et al., 2021). However, for NLP, there are notable differences in setups across tasks. We adapt and apply LogME extensively to a wide range of NLP settings to identify empirically grounded guidelines.
In particular, we investigate variations concerning the task, instance granularity, domain, and tuning strategy. First, compared to most image classification tasks, NLP tasks are subject to differences in granularity, i.e., classification (C) and structured prediction (SP). Furthermore, there is less clarity than for individual images as to which representation best captures the full language input (Mosbach et al., 2020). Therefore, for C setups we experiment with two representations: using [CLS]/<s> versus the mean over the sequence of subwords (sketched below).
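To illustrate these two C-task representations, here is a minimal sketch using the Hugging Face transformers library; the model name and helper function are illustrative assumptions, not the exact feature-extraction code used in this study.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_features(texts, model_name="roberta-base", pooling="cls"):
    """Fixed-size sentence representations for a classification (C) task.

    pooling="cls"  -> embedding of the first token ([CLS] for BERT-style, <s> for RoBERTa-style models)
    pooling="mean" -> mean over all non-padding subword embeddings
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, h)
    if pooling == "cls":
        return hidden[:, 0]                              # first-token representation
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Either variant yields the feature matrix $F$ over which the transferability score is computed.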
Second, depending on differences in the data