Evidence >Intuition: Transferability Estimation for Encoder Selection
Elisa Bassignana*♣  Max Müller-Eberstein*♣  Mike Zhang*♣  Barbara Plank♣♠♦
♣Department of Computer Science, IT University of Copenhagen, Denmark
♠Center for Information and Language Processing (CIS), LMU Munich, Germany
♦Munich Center for Machine Learning (MCML), Munich, Germany
{elba, mamy, mikz}@itu.dk b.plank@lmu.de
Abstract
With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori—as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups. In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.
1 Introduction
Advances in Deep Learning-based NLP and CV build on expressive representations from encoder models pre-trained on massive corpora. Downstream models make use of latent information in these representations to extract relevant features for the task at hand. Within this paradigm, deciding which pre-trained encoder to use in any task-specific architecture is crucial; however, training a model using each encoder candidate is infeasible. In the absence of prior heuristics (e.g., via related work), the choice of encoder has therefore prevailingly been based on practitioner intuition rather than quantitative evidence.
*The authors contributed equally to this work.
In NLP, prior work has examined the different yet related task of performance prediction (Xia et al., 2020a; Ye et al., 2021), surveyed and categorized LMs (Xia et al., 2020b), and used probing to predict LM performance specifically for dependency parsing (Müller-Eberstein et al., 2022b), but has yet to extensively investigate how to rank the increasingly large number of pre-trained LM encoders across various tasks and domains. Preliminary work by You et al. (2021) shows that the LogME estimator holds promise, including first steps towards encoder selection in NLP. With their main focus being on CV, however, they evaluate only a limited set of tasks and models for NLP and use self-reported benchmark scores instead of running controlled experiments which should include, e.g., the variance across initializations, domains, and fine-tuning strategies (Section 2). As such, we seek to answer: How well can we estimate the transferability of pre-trained LMs to specific NLP tasks? To do so, we contribute:
• The broadest encoder selection study in NLP to date, on 10 domain-diverse classification and structured prediction tasks (Section 3);
• An extensive evaluation and analysis across multiple dimensions of variation, including seven general vs. domain-specific LMs, [CLS] vs. mean representations, and head vs. full model fine-tuning (Section 4);
• A study with NLP experts, comparing the prevailing ranking of LMs by human intuition with LogME's empirical evidence (Section 5);
• Guidelines for applying and interpreting transferability measures in NLP (Section 6), and an open-source toolkit for efficient, task-adaptive LM pre-selection.1
1https://github.com/mainlp/logme-nlp
arXiv:2210.11255v1 [cs.CL] 20 Oct 2022
DATASET TASK TRAIN / DEV |Y| METRIC
CLASSIFICATION
AGNews (Zhang et al., 2015) Topic Classification 84K / 12K 4 micro-F1
Airline (Crowdflower, 2020) Sentiment Analysis 10K / 1.5K 3 micro-F1
SciERC (Luan et al., 2018) Relation Classification 1.9K / 275 7 macro-F1
MNLI (Williams et al., 2018) Natural Language Inference 393K / 20K 3 micro-F1
QNLI (Rajpurkar et al., 2016) Q&A/Natural Language Inference 105K / 5.4K 2 micro-F1
RTE (Giampiccolo et al., 2007) Natural Language Inference 2.5K / 3K 3 micro-F1
STR. PRED.
EWT (Silveira et al., 2014) Dependency Labeling 12.5K / 2K 36 micro-F1
CrossNER (Liu et al., 2021) Named Entity Recognition 15K / 3.5K 4 span-F1
CrossNER (Liu et al., 2021) Named Entity Recognition 200 / 450 17 span-F1
JobStack (Jensen et al., 2021) De-identification 18K / 2K 11 span-F1
Table 1: Datasets. Indicated are the 10 datasets used in this study, covering the two NLP problem types of classification (C) and structured prediction (SP) across a wide variety of tasks and domains. C tasks cover AGNews (news articles), Twitter Airline Sentiment (Airline; Twitter feedback), SciERC (AI proceedings), MNLI (speech, (non-)fiction, government), QNLI (Wikipedia), and RTE (Wikipedia, news). Within the SP tasks, we experiment on the English Web Treebank (EWT; social media, reviews, emails), CrossNER (news, scientific Wikipedia), and JobStack (Stack Overflow job ads). For each task, we report its TRAIN/DEV split, label-space size |Y|, and task-specific performance metric.
2 Transferability Estimation
Transferability estimation aims to quantify the ability of a model to transfer knowledge learned from one task to another (Eaton et al., 2008; Sinapov et al., 2015). Formally, given a pool of L pre-trained LMs {φ_l}_{l=1}^L and a dataset D, we calculate a predictive score S_l(D) for each φ_l which ideally correlates with the model's final performance P_l(D). S_l(D) is computed without fine-tuning φ_l on D, such that the optimal φ*_l can be chosen from a large model pool at a low computational cost.
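The selection loop above can be sketched in a few lines. This is a hedged illustration, not the paper's method: `fit_quality` is a deliberately crude stand-in for any transferability score S_l(D) (the paper uses LogME), and the encoder names and feature matrices are hypothetical inputs one would obtain by encoding D with each frozen candidate.

```python
import numpy as np

def fit_quality(F: np.ndarray, y: np.ndarray) -> float:
    """Crude stand-in for a transferability score S_l(D): the negative
    mean squared residual of a least-squares fit from frozen features F
    (shape (n, h)) to scalar labels y (shape (n,))."""
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    residual = y - F @ w
    return -float(np.mean(residual ** 2))

def select_encoder(feature_sets: dict[str, np.ndarray], y: np.ndarray) -> str:
    """Score each candidate encoder's frozen features on dataset D and
    return the name of the highest-scoring one -- no fine-tuning involved."""
    scores = {name: fit_quality(F, y) for name, F in feature_sets.items()}
    return max(scores, key=scores.get)
```

The key property is that the loop only touches pre-computed features, so its cost is negligible compared to fine-tuning even a single candidate.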
The CV community has begun to explore methods for encoder pre-selection and ranking through metrics such as LogME and the Log Expected Empirical Prediction (LEEP; Nguyen et al., 2020), both widely-used state-of-the-art methods in CV. Recent work introduced the Gaussian Bhattacharyya Coefficient (GBC; Pándy et al., 2021) and Optimal Transport based Conditional Entropy (OTCE; Tan et al., 2021), the exploration of which we leave for future work. In NLP, however, related work focuses on transferability for choosing a task rather than an LM encoder (Vu et al., 2020; Padmakumar et al., 2022), leaving the ranking of encoders an unexplored question.
LogME  LogME measures the suitability of all encoded dataset features F ∈ R^{|D|×h} (e.g., embeddings with dimensionality h) for predicting all scalar labels y ∈ R^{|D|} via the probability density p(y|F). As this density is intractable, it is estimated by mapping F → y using a linear transformation w; this is akin to training a linear probe with optimal parameters w* and using the likelihood p(y|F, w*) as a proxy for feature suitability. Because a simple linear model will overfit on the training data, it is beneficial to instead obtain the marginal likelihood, or evidence, by integrating over all possible values of w: p(y|F) = ∫ p(y|F, w) p(w) dw. To once again make this computation tractable, You et al. (2021) reformulate it as an efficient, iterative evidence-maximization problem in which both w and y are drawn from lightly parametrized, isotropic Gaussian distributions. The normalized logarithm of the maximized evidence (LogME) can then be used as S_l(D) to rank encoder models directly.
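The iterative evidence maximization can be sketched as follows. This is a minimal single-target reading of the procedure in You et al. (2021) using the standard Bayesian linear-regression evidence and MacKay-style fixed-point updates for the two Gaussian precisions (α for w, β for the observation noise); the reference implementation handles multi-column labels and further optimizations, so treat this as an illustrative sketch rather than the authors' exact code.

```python
import numpy as np

def logme(F: np.ndarray, y: np.ndarray, max_iter: int = 100, tol: float = 1e-4) -> float:
    """Sketch of LogME for a single scalar target.

    F: (n, h) frozen-encoder features for the n instances of D.
    y: (n,) scalar labels; for a K-class task one would average the
       score over K one-hot label columns.
    Returns log p(y|F) / n at the evidence-maximized (alpha, beta).
    """
    n, h = F.shape
    # SVD makes each update cheap: A = alpha*I + beta*F^T F is diagonal
    # in the right-singular basis of F.
    u, s, _ = np.linalg.svd(F, full_matrices=False)
    s2 = s ** 2
    uty = u.T @ y          # projections of y onto the left singular vectors
    y2 = float(y @ y)

    alpha, beta = 1.0, 1.0
    for _ in range(max_iter):
        m_coef = beta * s * uty / (alpha + beta * s2)  # posterior mean of w
        m2 = float(m_coef @ m_coef)                    # ||m||^2
        fit = s * m_coef                               # coordinates of F m
        res2 = y2 - 2 * float(uty @ fit) + float(fit @ fit)  # ||y - Fm||^2
        gamma = float(np.sum(beta * s2 / (alpha + beta * s2)))  # eff. dims
        alpha_new = gamma / max(m2, 1e-12)
        beta_new = (n - gamma) / max(res2, 1e-12)
        converged = (abs(alpha_new - alpha) <= tol * alpha
                     and abs(beta_new - beta) <= tol * beta)
        alpha, beta = alpha_new, beta_new
        if converged:
            break

    # Evaluate the log evidence at the final (alpha, beta).
    m_coef = beta * s * uty / (alpha + beta * s2)
    m2 = float(m_coef @ m_coef)
    fit = s * m_coef
    res2 = y2 - 2 * float(uty @ fit) + float(fit @ fit)
    # log|A|; the h - rank remaining eigenvalues of A equal alpha.
    logdet_a = float(np.sum(np.log(alpha + beta * s2))) + (h - len(s2)) * np.log(alpha)
    evidence = (h / 2 * np.log(alpha) + n / 2 * np.log(beta)
                - n / 2 * np.log(2 * np.pi)
                - beta / 2 * res2 - alpha / 2 * m2 - logdet_a / 2)
    return float(evidence / n)
```

A higher value indicates that a linear map from the frozen features explains the labels well without overfitting, which is exactly the evidence-based signal used to rank encoders.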
NLP Setting  LogME has shown promise for CV, and an initial study on the GLUE benchmark (Wang et al., 2018) indicates the same for NLP (You et al., 2021). However, for NLP, there are notable differences in setups across tasks. We adapt and apply LogME extensively to a wide range of NLP settings to identify empirically grounded guidelines.
In particular, we investigate variations concerning the task, instance granularity, domain, and tuning strategy. First, compared to most image classification tasks, NLP tasks are subject to differences in granularity, i.e., classification (C) versus structured prediction (SP). Furthermore, it is less clear than for individual images which representation best captures the full language input (Mosbach et al., 2020). Therefore, for C setups we experiment with two representations: the [CLS]/<s> token versus the mean over the sequence of subwords.
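The two sentence representations compared for C tasks can be sketched as below. The hidden-state matrix and attention mask are hypothetical stand-ins for what an actual LM forward pass would produce (e.g., a `(seq_len, h)` last-layer output); the sketch only shows the pooling step itself.

```python
import numpy as np

def cls_representation(hidden: np.ndarray) -> np.ndarray:
    """First-token ([CLS] for BERT-style, <s> for RoBERTa-style models)
    vector of a (seq_len, h) hidden-state matrix."""
    return hidden[0]

def mean_representation(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean over the real (non-padding) subword positions, using the
    attention mask to exclude padding from the average."""
    mask = attention_mask.astype(bool)
    return hidden[mask].mean(axis=0)
```

Mean pooling must mask out padding positions; averaging over padded vectors would otherwise skew the representation for short sequences in a batch.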
Second, depending on differences in the data