Conformal Predictor for Improving Zero-shot Text Classification Efficiency

Prafulla Kumar Choubey¹, Yu Bai¹, Chien-Sheng Wu¹, Wenhao Liu²†, Nazneen Rajani³†

¹Salesforce AI Research, ²Faire.com, ³Hugging Face
{pchoubey, yu.bai, wu.jason}@salesforce.com
wenhao@faire.com, nazneen@hf.co

†Work was done at Salesforce AI Research.
Abstract

Pre-trained language models (PLMs) have been shown effective for zero-shot (0shot) text classification. 0shot models based on natural language inference (NLI) and next sentence prediction (NSP) employ a cross-encoder architecture and infer by making a forward pass through the model for each label-text pair separately. This increases the computational cost of inference linearly in the number of labels. In this work, we improve the efficiency of such cross-encoder-based 0shot models by restricting the number of likely labels using another, faster base-classifier-based conformal predictor (CP) calibrated on samples labeled by the 0shot model. Since a CP generates prediction sets with coverage guarantees, it reduces the number of target labels without excluding the label that is most probable according to the 0shot model. We experiment with three intent and two topic classification datasets. With a suitable CP for each dataset, we reduce the average inference time for NLI- and NSP-based models by 25.6% and 22.2% respectively, while keeping the coverage error within the predefined rate of 1%.
1 Introduction
Zero-shot (0shot) text classification is an important NLP problem with many real-world applications. The earliest approaches for 0shot text classification use a similarity score between text and labels mapped to a common embedding space (Chang et al., 2008; Gabrilovich and Markovitch, 2007; Chen et al., 2015; Li et al., 2016; Sappadla et al., 2016; Xia et al., 2018). These models compute text and label embeddings independently and make only one forward pass over the text, resulting in a minimal increase in computation. Later approaches explicitly incorporate label information when processing the text: e.g., Yogatama et al. (2017) use generative modeling to generate text given a label embedding, and Rios and Kavuluru (2018) use label-embedding-based attention over the text; both require multiple passes over the text, increasing the computational cost.
Most recently, NLI- (Condoravdi et al., 2003; Williams et al., 2018; Yin et al., 2019) and NSP-based (Ma et al., 2021) 0shot text classification formulations have been proposed. NLI and NSP make inferences by defining a representative hypothesis sentence for each label and producing a score for every pair of input text and hypothesis. To compute the score, they employ a cross-encoder architecture, i.e., full self-attention over the text and hypothesis sentences, which requires recomputing the encoding for the text paired with each hypothesis separately. This increases the computational cost of inference linearly in the number of target labels.
NLI and NSP use large transformer-based PLMs (Devlin et al., 2019; Liu et al., 2019b; Lewis et al., 2019) and outperform previous non-transformer-based models by a large margin. However, the size of PLMs and the number of target labels drastically reduce prediction efficiency, increasing computation and inference time, and may significantly increase the carbon footprint of making predictions (Strubell et al., 2019; Moosavi et al., 2020; Schwartz et al., 2020; Zhou et al., 2021).
In this work, we focus on the correlation between the number of labels and prediction efficiency, and propose to use a conformal predictor (CP) (Vovk et al., 2005; Shafer and Vovk, 2008) to filter out unlikely labels from the target set. Conformal prediction provides a model-agnostic framework to generate a label set, instead of a single label prediction, within a pre-defined error rate. Consequently, we use a CP with a small, user-selected error rate, built on another fast base classifier, to generate candidate target labels. The candidate labels are then passed to the larger NLI/NSP-based 0shot model to make the final prediction.
arXiv:2210.12619v1 [cs.CL] 23 Oct 2022
We experiment with three intent classification datasets (SNIPS (Coucke et al., 2018), ATIS (Tur et al., 2010), and HWU64 (Liu et al., 2019a)) and two topic classification datasets (AG's News and Yahoo! Answers (Zhang et al., 2015)), using 0shot models based on a moderately sized bart-large (NLI) (Lewis et al., 2020) and a small bert-base (NSP) PLM. We use four base classifiers, each with different computational complexity, and a small error rate of 1%. Using the best CP for each dataset, we reduce the average computational time by 25.6% (22.2%) and the average number of labels by 41.09% (43.38%) for NLI- (NSP-) based models.
2 Methodology
We improve the efficiency of NLI/NSP models by restricting the number of target labels with a conformal predictor (CP). Using a fast but weak base-classifier-based CP, we produce a label set that removes some of the target classes for the 0shot model without reducing coverage beyond a pre-defined error rate.
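The resulting two-stage pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `cp_label_set` and `zero_shot_score` are hypothetical stand-ins for the fast CP and the expensive NLI/NSP cross-encoder, respectively.

```python
def two_stage_inference(text, labels, cp_label_set, zero_shot_score):
    """Stage 1: the fast CP prunes the label space; stage 2: the 0shot
    model scores only the surviving candidates, so the number of
    cross-encoder passes shrinks from |labels| to |candidates|."""
    candidates = cp_label_set(text, labels)   # small set; covers the 0shot top label w.h.p.
    if not candidates:                        # fall back to all labels if the CP abstains
        candidates = labels
    return max(candidates, key=lambda y: zero_shot_score(text, y))
```

Because the CP's set covers the 0shot model's most probable label with probability at least 1 − α, the pruning rarely changes the final prediction.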
2.1 Building a Conformal Predictor (CP) for Label Filtering

Conformal prediction (Vovk et al., 1999, 2005; Shafer and Vovk, 2008; Maltoudoglou et al., 2020; Angelopoulos and Bates, 2021; Giovannotti and Gammerman, 2021; Dey et al., 2021) generates label sets with coverage guarantees. For a given error rate $\alpha$ and a base classifier $\hat{f}: x \rightarrow \mathbb{R}^K$ (here $K$ is the total number of class labels), a CP outputs a label set $\Gamma^{\alpha}$ that contains the true class label $y$ with probability at least $1 - \alpha$.
To build a CP, we need calibration data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and a measure of non-conformity $s(x_i, y_i)$ that describes the disagreement between the actual label $y_i$ and the prediction $\hat{f}(x_i)$ from the base classifier. As an example, a non-conformity score can be defined as the negative output logit of the true class; assuming the base classifier outputs logit scores, in this case $s(x_i, y_i)$ will be $-\hat{f}(x_i)_{y_i}$. Next, we define $\hat{q}$ to be the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the scores $s(x_1, y_1), s(x_2, y_2), \ldots, s(x_n, y_n)$ on the calibration set. Finally, for a new exchangeable test data point $x_{test}$, we output the label set $\Gamma^{\alpha} = \{\, y_k : s(x_{test}, y_k) < \hat{q} \,\}$, i.e., the classes whose non-conformity score is lower than $\hat{q}$. $\Gamma^{\alpha}$ is then used with the 0shot model to predict the final class label. Next, we discuss the two components of a CP, namely the calibration dataset and the non-conformity score.
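The calibration and prediction steps above can be sketched in a few lines. This is a generic conformal-prediction sketch, not the paper's code; it assumes non-conformity scores have already been computed by some base classifier.

```python
import math

def calibrate_cp(cal_scores, alpha):
    """Compute q-hat: the ceil((n+1)(1-alpha))/n empirical quantile of
    the calibration non-conformity scores s(x_i, y_i)."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    ordered = sorted(cal_scores)
    return ordered[min(rank, n) - 1]

def prediction_set(test_scores, q_hat):
    """Gamma^alpha: indices of labels whose non-conformity score on the
    test point falls below q-hat."""
    return [k for k, s in enumerate(test_scores) if s < q_hat]
```

Only the labels in the returned set need to be scored by the expensive 0shot model.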
2.2 Calibration Dataset

We require a calibration dataset that is exchangeable with the test data. However, in a typical 0shot setting, we cannot expect a human-labeled dataset to be available. Therefore, we use the 0shot classifier itself to label samples for calibration. Since our goal is to obtain a label set that contains the class label that is most probable according to the 0shot classifier, we do not explicitly require human-labeled samples: using model-predicted labels for calibration guarantees the required coverage with respect to the 0shot model's predictions.
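The pseudo-labeling step can be sketched as below; `zero_shot_predict` is a hypothetical stand-in for the NLI/NSP model that returns one score per label for a text.

```python
def label_calibration_set(texts, labels, zero_shot_predict):
    """Pseudo-label unlabeled texts with the 0shot model's top
    prediction, producing (x_i, y_i) pairs for CP calibration."""
    calibration = []
    for text in texts:
        scores = zero_shot_predict(text)  # one score per label
        top = max(range(len(labels)), key=scores.__getitem__)
        calibration.append((text, labels[top]))
    return calibration
```

The coverage guarantee then holds for the 0shot model's labels, which is exactly what the filtering stage needs.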
2.3 Non-Conformity Score based on a Base Classifier

We want the base classifier to be computationally efficient compared to the 0shot model. We experiment with four base classifiers of differing complexity for building our CPs:
Token Overlap (CP-Token): For each target class label ($y_k \in \{y_1, \ldots, y_K\}$), we build a list of representative tokens ($C_w^k$) that includes all tokens in the calibration-data samples corresponding to that class. Then, we define the non-conformity score using the percentage of common tokens between $C_w^k$ and the input text ($x$). Given that $\#x$ denotes the number of unique tokens in $x$, the token-overlap-based non-conformity score is defined as:

$$s(x, y_k) = 1.0 - \frac{|C_w^k \cap x|}{\#x} \qquad (1)$$
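Eq. (1) can be sketched directly; whitespace tokenization is an assumption here, since the paper does not fix a tokenizer in this passage.

```python
def token_overlap_score(text, class_tokens):
    """Eq. (1): non-conformity = 1 - |C_w^k ∩ x| / #x, where #x is the
    number of unique tokens in the input text and class_tokens is the
    representative token set C_w^k for class y_k."""
    tokens = set(text.lower().split())  # whitespace tokenization (assumed)
    if not tokens:
        return 1.0  # empty input: maximal non-conformity
    return 1.0 - len(class_tokens & tokens) / len(tokens)
```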
Cosine Similarity (CP-Glove): The token-overlap-based non-conformity score suffers from sparsity unless we use a large representative word set for each target class label. Therefore, we also experiment with the cosine distance between the bag-of-words (BoW) representation of a target label description ($C_E^k$) and the input text ($x_E$). We use static GloVe embeddings (Pennington et al., 2014) to obtain BoW representations for labels.

$$s(x, y_k) = 1.0 - \frac{C_E^k \cdot x_E}{\|C_E^k\|_2 \, \|x_E\|_2} \qquad (2)$$
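Eq. (2) can be sketched as below. Averaging word vectors to form the BoW representation is an assumption (one common choice); the `vectors` dict stands in for a loaded GloVe table.

```python
import numpy as np

def cosine_nonconformity(label_emb, text_emb):
    """Eq. (2): 1 - cosine similarity between the label's BoW embedding
    C_E^k and the text embedding x_E."""
    denom = np.linalg.norm(label_emb) * np.linalg.norm(text_emb)
    if denom == 0.0:
        return 1.0  # no shared signal: maximal non-conformity
    return 1.0 - float(label_emb @ text_emb) / denom

def bow_embedding(tokens, vectors):
    """Average the static word vectors (e.g., GloVe) of in-vocabulary
    tokens; averaging is assumed, not specified by the paper."""
    hits = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    return np.mean(hits, axis=0) if hits else np.zeros(dim)
```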
Classifier (CP-CLS): Besides the broadly applicable token overlap and cosine similarity, we propose to use a task-specific base classifier to generate label sets of smaller sizes. We fine-tune