
DATASET                          TASK                             TRAIN / DEV   |Y|   METRIC
CLASSIFICATION
AGNews (Zhang et al., 2015)      Topic Classification             84K / 12K      4    micro-F1
Airline (Crowdflower, 2020)      Sentiment Analysis               10K / 1.5K     3    micro-F1
SciERC (Luan et al., 2018)       Relation Classification          1.9K / 275     7    macro-F1
MNLI (Williams et al., 2018)     Natural Language Inference       393K / 20K     3    micro-F1
QNLI (Rajpurkar et al., 2016)    Q&A/Natural Language Inference   105K / 5.4K    2    micro-F1
RTE (Giampiccolo et al., 2007)   Natural Language Inference       2.5K / 3K      3    micro-F1
STR. PRED.
EWT (Silveira et al., 2014)      Dependency Labeling              12.5K / 2K    36    micro-F1
CrossNER (Liu et al., 2021)      Named Entity Recognition         15K / 3.5K     4    span-F1
CrossNER (Liu et al., 2021)      Named Entity Recognition         200 / 450     17    span-F1
JobStack (Jensen et al., 2021)   De-identification                18K / 2K      11    span-F1
Table 1: Datasets. Shown are the 10 datasets used in this study, divided between the two NLP problem types, classification (C) and structured prediction (SP), and covering a wide variety of tasks and domains. C tasks cover AGNews (news articles), Twitter Airline Sentiment (Airline; Twitter feedback), SciERC (AI proceedings), MNLI (speech, (non-)fiction, government), QNLI (Wikipedia) and RTE (Wikipedia, news). Within the SP tasks, we experiment on the English Web Treebank (EWT; social media, reviews, emails), CrossNER (news, scientific Wikipedia) and JobStack (Stack Overflow job ads). For each dataset, we report its TRAIN/DEV split, label space |Y|, and task-specific performance metric.
2 Transferability Estimation
Transferability estimation aims to quantify the ability of a model to transfer knowledge learned from one task to another (Eaton et al., 2008; Sinapov et al., 2015). Formally, given a pool of $L$ pre-trained LMs $\{\phi_l\}_{l=1}^{L}$ and a dataset $D$, we calculate a predictive score $S_l(D)$ for each $\phi_l$ which ideally correlates with the model's final performance $P_l(D)$. $S_l(D)$ is computed without fine-tuning $\phi_l$ on $D$, such that the optimal $\phi^*_l$ can be chosen from a large model pool at a low computational cost.
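To make this selection protocol concrete, the following is a minimal sketch; the toy model names, scores, and the use of Spearman rank correlation are illustrative assumptions, not the exact evaluation protocol of this paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical pool of pre-trained encoders phi_l with their predictive scores S_l(D)
# (computed without fine-tuning) and their fine-tuned performance P_l(D).
scores = {"encoder-a": 0.12, "encoder-b": 0.31, "encoder-c": 0.27}       # S_l(D)
performance = {"encoder-a": 71.4, "encoder-b": 83.2, "encoder-c": 80.9}  # P_l(D)

# Select the encoder with the highest predictive score ...
best = max(scores, key=scores.get)

# ... and check how well the scores track actual fine-tuned performance via rank correlation.
models = sorted(scores)
rho = spearmanr([scores[m] for m in models], [performance[m] for m in models]).correlation
print(best, rho)
```

The better the scores correlate with fine-tuned performance across the pool, the more reliably the cheapest-to-compute encoder ranking can replace exhaustive fine-tuning.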
The CV community has begun to explore methods for encoder pre-selection and ranking through metrics such as LogME and the Log Expected Empirical Prediction (LEEP; Nguyen et al., 2020). These are widely used, state-of-the-art methods in CV. Recent work introduced the Gaussian Bhattacharyya Coefficient (GBC; Pándy et al., 2021) and Optimal Transport based Conditional Entropy (OTCE; Tan et al., 2021), the exploration of which we leave for future work. In the NLP field, however, related work focuses on choosing a task rather than an LM encoder for transferability (Vu et al., 2020; Padmakumar et al., 2022), leaving the ranking of encoders an unexplored question.
LogME
LogME measures the suitability of all encoded dataset features $F \in \mathbb{R}^{|D| \times h}$ (e.g., embeddings with dimensionality $h$) to predict all scalar labels $y \in \mathbb{R}^{|D|}$ via the probability density $p(y|F)$. As this density is intractable, it is estimated by mapping $F \rightarrow y$ using a linear transformation $w$; this is akin to training a linear probe with optimal parameters $w^*$ and using the likelihood $p(y|F, w^*)$ as a proxy for feature suitability. Because a simple linear model will overfit on the training data, it would be beneficial to obtain the marginal likelihood, or evidence, by integrating over all possible values of $w$: $p(y|F) = \int p(y|F, w)\,p(w)\,dw$. To once again make this computation tractable, You et al. (2021) reformulate it as an efficient, iterative evidence maximization problem where both $w$ and $y$ are drawn from lightly parametrized, isotropic Gaussian distributions. The normalized logarithm of the maximized evidence (LogME) can then be used as $S_l(D)$ to rank encoder models directly.
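As a concrete reference, below is a minimal NumPy sketch of this evidence maximization, using the standard fixed-point updates for the hyperparameters of Bayesian linear regression; it is an illustrative reimplementation rather than the authors' released code, and for C tasks with multiple classes the score is typically averaged over the one-hot label columns (as in You et al., 2021).

```python
import numpy as np

def logme(F: np.ndarray, y: np.ndarray, max_iter: int = 100, tol: float = 1e-5) -> float:
    """Normalized log evidence of a Bayesian linear model mapping features F (n x h) to targets y (n,)."""
    n, h = F.shape
    # Work in the SVD basis of F so each iteration stays cheap.
    u, s, _ = np.linalg.svd(F, full_matrices=False)   # F = u @ diag(s) @ vt
    sigma = s ** 2                                    # eigenvalues of F^T F
    z = u.T @ y                                       # y projected onto the column space of F
    y_res2 = float(y @ y - z @ z)                     # part of ||y||^2 outside that column space

    alpha, beta = 1.0, 1.0                            # precisions of the prior over w and of the noise
    for _ in range(max_iter):
        gamma = float(np.sum(beta * sigma / (alpha + beta * sigma)))   # effective number of parameters
        d = s * z / (alpha / beta + sigma)            # posterior mean m of w, in the right-singular basis
        m2 = float(d @ d)                             # m^T m
        res2 = float(np.sum((z - s * d) ** 2)) + y_res2                # ||F m - y||^2
        alpha_new = gamma / (m2 + 1e-12)
        beta_new = (n - gamma) / (res2 + 1e-12)
        converged = abs(alpha_new - alpha) / alpha < tol and abs(beta_new - beta) / beta < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break

    # Log evidence log p(y | F, alpha, beta), normalized by the number of instances.
    logdet = float(np.sum(np.log(alpha + beta * sigma))) + (h - len(sigma)) * np.log(alpha)
    evidence = (n / 2 * np.log(beta) + h / 2 * np.log(alpha) - n / 2 * np.log(2 * np.pi)
                - beta / 2 * res2 - alpha / 2 * m2 - logdet / 2)
    return float(evidence / n)
```

A higher value indicates features from which the labels are more easily predicted, so candidate encoders can be ranked by this score directly.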
NLP Setting
LogME has shown promise for CV, and an initial study on the GLUE benchmark (Wang et al., 2018) indicates the same for NLP (You et al., 2021). However, for NLP, there are notable differences in setups across tasks. We adapt and apply LogME extensively to a wide range of NLP settings to identify empirically grounded guidelines.
In particular, we investigate variations concerning the task, instance granularity, domain, and tuning strategy. First, compared to most image classification tasks, NLP tasks are subject to differences in granularity, i.e., classification (C) and structured prediction (SP). Furthermore, there is less clarity than for individual images as to which representation best captures the full language input (Mosbach et al., 2020). Therefore, for C setups we experiment with two representations: using [CLS]/<s> versus the mean over the sequence of subwords (sketched below).
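To illustrate these two C-task representations, here is a minimal sketch using the Hugging Face transformers library; the model name and helper function are illustrative assumptions, not the exact feature-extraction code used in this study.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_features(texts, model_name="roberta-base", pooling="cls"):
    """Fixed-size sentence representations for a classification (C) task.

    pooling="cls"  -> embedding of the first token ([CLS] for BERT-style, <s> for RoBERTa-style models)
    pooling="mean" -> mean over all non-padding subword embeddings
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, h)
    if pooling == "cls":
        return hidden[:, 0]                              # first-token representation
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Either variant yields the feature matrix $F$ over which the transferability score is computed.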
Second, depending on differences in the data