Conformal Predictor for Improving Zero-shot Text Classification Efficiency

Prafulla Kumar Choubey¹, Yu Bai¹, Chien-Sheng Wu¹, Wenhao Liu²†, Nazneen Rajani³†

¹Salesforce AI Research, ²Faire.com, ³Hugging Face
{pchoubey, yu.bai, wu.jason}@salesforce.com
wenhao@faire.com, nazneen@hf.co

†Work was done at Salesforce AI Research.
Abstract

Pre-trained language models (PLMs) have been shown effective for zero-shot (0shot) text classification. 0shot models based on natural language inference (NLI) and next sentence prediction (NSP) employ a cross-encoder architecture and infer by making a forward pass through the model for each label-text pair separately. This increases the computational cost of inference linearly in the number of labels. In this work, we improve the efficiency of such cross-encoder-based 0shot models by restricting the number of likely labels using another, faster base-classifier-based conformal predictor (CP) calibrated on samples labeled by the 0shot model. Since a CP generates prediction sets with coverage guarantees, it reduces the number of target labels without excluding the label that is most probable according to the 0shot model. We experiment with three intent and two topic classification datasets. With a suitable CP for each dataset, we reduce the average inference time for NLI- and NSP-based models by 25.6% and 22.2% respectively, while keeping the coverage error within the predefined rate of 1%.
1 Introduction
Zero-shot (0shot) text classification is an important NLP problem with many real-world applications. The earliest approaches for 0shot text classification use a similarity score between text and labels mapped to a common embedding space (Chang et al., 2008; Gabrilovich and Markovitch, 2007; Chen et al., 2015; Li et al., 2016; Sappadla et al., 2016; Xia et al., 2018). These models compute text and label embeddings independently and make only one forward pass over the text, resulting in a minimal increase in computation. Later approaches explicitly incorporate label information when processing the text: e.g., Yogatama et al. (2017) use generative modeling to generate text given a label embedding, and Rios and Kavuluru (2018) use label-embedding-based attention over the text; both require multiple passes over the text, increasing the computational cost.
Most recently, NLI- (Condoravdi et al., 2003; Williams et al., 2018; Yin et al., 2019) and NSP-based (Ma et al., 2021) 0shot text classification formulations have been proposed. NLI and NSP make inferences by defining a representative hypothesis sentence for each label and producing a score for every pair of input text and hypothesis. To compute the score, they employ a cross-encoder architecture, i.e., full self-attention over the text and hypothesis sentences, which requires recomputing the encoding for the text paired with each hypothesis separately. This increases the computational cost of inference linearly in the number of target labels.
NLI and NSP use large transformer-based PLMs (Devlin et al., 2019; Liu et al., 2019b; Lewis et al., 2019) and outperform previous non-transformer-based models by a large margin. However, the size of PLMs and the number of target labels drastically reduce prediction efficiency, increasing computation and inference time, and may significantly increase the carbon footprint of making predictions (Strubell et al., 2019; Moosavi et al., 2020; Schwartz et al., 2020; Zhou et al., 2021).
In this work, we focus on the correlation between the number of labels and prediction efficiency, and propose to use a conformal predictor (CP) (Vovk et al., 2005; Shafer and Vovk, 2008) to filter out unlikely labels from the target set. Conformal prediction provides a model-agnostic framework to generate a label set, instead of a single label prediction, within a pre-defined error rate. Consequently, we use a CP with a small, user-selected error rate, built on another fast base classifier, to generate candidate target labels. The candidate labels are then passed to the larger NLI/NSP-based 0shot model to make the final prediction.
arXiv:2210.12619v1 [cs.CL] 23 Oct 2022
We experiment with three intent classification datasets (SNIPS (Coucke et al., 2018), ATIS (Tur et al., 2010), and HWU64 (Liu et al., 2019a)) and two topic classification datasets (AG's News and Yahoo! Answers (Zhang et al., 2015)), using 0shot models based on a moderately sized bart-large (NLI) (Lewis et al., 2020) and a small bert-base (NSP) PLM. We use four base classifiers, each with different computational complexity, and a small error rate of 1%. Using the best CP for each dataset, we reduce the average computational time by 25.6% (22.2%) and the average number of labels by 41.09% (43.38%) for NLI- (NSP-) based models.
2 Methodology
We improve the efficiency of NLI/NSP models by restricting the number of target labels with a conformal predictor (CP). Using a fast but weak base-classifier-based CP, we produce a label set that removes some of the target classes for the 0shot model without reducing coverage beyond a pre-defined error rate.
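The resulting two-stage pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `cp_label_set` and `zero_shot_score` are hypothetical stand-ins for the fast CP and the expensive NLI/NSP cross-encoder, respectively.

```python
def two_stage_inference(text, labels, cp_label_set, zero_shot_score):
    """Stage 1: the fast CP prunes the label space; stage 2: the 0shot
    model scores only the surviving candidates, so the number of
    cross-encoder passes shrinks from |labels| to |candidates|."""
    candidates = cp_label_set(text, labels)   # small set; covers the 0shot top label w.h.p.
    if not candidates:                        # fall back to all labels if the CP abstains
        candidates = labels
    return max(candidates, key=lambda y: zero_shot_score(text, y))
```

Because the CP's set covers the 0shot model's most probable label with probability at least 1 − α, the pruning rarely changes the final prediction.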
2.1 Building a Conformal Predictor (CP) for Label Filtering

Conformal prediction (Vovk et al., 1999, 2005; Shafer and Vovk, 2008; Maltoudoglou et al., 2020; Angelopoulos and Bates, 2021; Giovannotti and Gammerman, 2021; Dey et al., 2021) generates label sets with coverage guarantees. For a given error rate $\alpha$ and a base classifier $\hat{f}: x \rightarrow \mathbb{R}^K$ (here $K$ is the total number of class labels), a CP outputs a label set $\Gamma^{\alpha}$ that contains the true class label $y$ with probability at least $1 - \alpha$.
To build a CP, we need calibration data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and a measure of non-conformity $s(x_i, y_i)$ that describes the disagreement between the actual label $y_i$ and the prediction $\hat{f}(x_i)$ from the base classifier. As an example, a non-conformity score can be defined as the negative output logit of the true class; assuming the base classifier outputs logit scores, in this case $s(x_i, y_i)$ will be $-\hat{f}(x_i)_{y_i}$. Next, we define $\hat{q}$ to be the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the scores $s(x_1, y_1), s(x_2, y_2), \ldots, s(x_n, y_n)$ on the calibration set. Finally, for a new exchangeable test data point $x_{test}$, we output the label set $\Gamma^{\alpha} = \{\, y_k : s(x_{test}, y_k) < \hat{q} \,\}$, i.e., the classes whose non-conformity score is lower than $\hat{q}$. $\Gamma^{\alpha}$ is then used with the 0shot model to predict the final class label. Next, we discuss the two components of a CP, namely the calibration dataset and the non-conformity score.
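The calibration and prediction steps above can be sketched in a few lines. This is a generic conformal-prediction sketch, not the paper's code; it assumes non-conformity scores have already been computed by some base classifier.

```python
import math

def calibrate_cp(cal_scores, alpha):
    """Compute q-hat: the ceil((n+1)(1-alpha))/n empirical quantile of
    the calibration non-conformity scores s(x_i, y_i)."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    ordered = sorted(cal_scores)
    return ordered[min(rank, n) - 1]

def prediction_set(test_scores, q_hat):
    """Gamma^alpha: indices of labels whose non-conformity score on the
    test point falls below q-hat."""
    return [k for k, s in enumerate(test_scores) if s < q_hat]
```

Only the labels in the returned set need to be scored by the expensive 0shot model.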
2.2 Calibration Dataset

We require a calibration dataset that is exchangeable with the test data. However, in a typical 0shot setting, we cannot expect a human-labeled dataset to be available. Therefore, we use the 0shot classifier itself to label samples for calibration. Since our goal is to obtain a label set that contains the class label that is most probable according to the 0shot classifier, we do not explicitly require human-labeled samples: using model-predicted labels for calibration guarantees the required coverage with respect to the 0shot model's predictions.
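The pseudo-labeling step can be sketched as below; `zero_shot_predict` is a hypothetical stand-in for the NLI/NSP model that returns one score per label for a text.

```python
def label_calibration_set(texts, labels, zero_shot_predict):
    """Pseudo-label unlabeled texts with the 0shot model's top
    prediction, producing (x_i, y_i) pairs for CP calibration."""
    calibration = []
    for text in texts:
        scores = zero_shot_predict(text)  # one score per label
        top = max(range(len(labels)), key=scores.__getitem__)
        calibration.append((text, labels[top]))
    return calibration
```

The coverage guarantee then holds for the 0shot model's labels, which is exactly what the filtering stage needs.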
2.3 Non-Conformity Score based on a Base Classifier

We want the base classifier to be computationally efficient compared to the 0shot model. We experiment with four base classifiers of differing complexity for building our CPs:
Token Overlap (CP-Token): For each target class label ($y_k \in \{y_1, \ldots, y_K\}$), we build a list of representative tokens ($C_w^k$) that includes all tokens in the calibration-data samples corresponding to that class. Then, we define the non-conformity score using the percentage of common tokens between $C_w^k$ and the input text ($x$). Given that $\#x$ denotes the number of unique tokens in $x$, the token-overlap-based non-conformity score is defined as:

$$s(x, y_k) = 1.0 - \frac{|C_w^k \cap x|}{\#x} \qquad (1)$$
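Eq. (1) can be sketched directly; whitespace tokenization is an assumption here, since the paper does not fix a tokenizer in this passage.

```python
def token_overlap_score(text, class_tokens):
    """Eq. (1): non-conformity = 1 - |C_w^k ∩ x| / #x, where #x is the
    number of unique tokens in the input text and class_tokens is the
    representative token set C_w^k for class y_k."""
    tokens = set(text.lower().split())  # whitespace tokenization (assumed)
    if not tokens:
        return 1.0  # empty input: maximal non-conformity
    return 1.0 - len(class_tokens & tokens) / len(tokens)
```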
Cosine Similarity (CP-Glove): The token-overlap-based non-conformity score suffers from sparsity unless we use a large representative word set for each target class label. Therefore, we also experiment with the cosine distance between the bag-of-words (BoW) representation of a target label description ($C_E^k$) and the input text ($x_E$). We use static GloVe embeddings (Pennington et al., 2014) to obtain BoW representations for labels.

$$s(x, y_k) = 1.0 - \frac{C_E^k \cdot x_E}{\|C_E^k\|_2 \, \|x_E\|_2} \qquad (2)$$
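Eq. (2) can be sketched as below. Averaging word vectors to form the BoW representation is an assumption (one common choice); the `vectors` dict stands in for a loaded GloVe table.

```python
import numpy as np

def cosine_nonconformity(label_emb, text_emb):
    """Eq. (2): 1 - cosine similarity between the label's BoW embedding
    C_E^k and the text embedding x_E."""
    denom = np.linalg.norm(label_emb) * np.linalg.norm(text_emb)
    if denom == 0.0:
        return 1.0  # no shared signal: maximal non-conformity
    return 1.0 - float(label_emb @ text_emb) / denom

def bow_embedding(tokens, vectors):
    """Average the static word vectors (e.g., GloVe) of in-vocabulary
    tokens; averaging is assumed, not specified by the paper."""
    hits = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    return np.mean(hits, axis=0) if hits else np.zeros(dim)
```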
Classifier (CP-CLS): Besides the broadly applicable token overlap and cosine similarity, we propose to use a task-specific base classifier to generate label sets of smaller sizes. We fine-tune