The Devil is in the Details: On Models and Training Regimes
for Few-Shot Intent Classification
Mohsen Mesgar¹, Thy Thy Tran¹, Goran Glavaš², Iryna Gurevych¹
¹Ubiquitous Knowledge Processing Lab (UKP),
Department of Computer Science,
Technical University of Darmstadt
²CAIDAS, University of Würzburg, Germany
www.ukp.tu-darmstadt.de
Abstract
Few-shot Intent Classification (FSIC) is one of the key challenges in modular task-oriented dialog systems. While advanced FSIC methods are similar in their use of pretrained language models to encode texts and nearest-neighbour-based inference for classification, they differ in important details. They start from different pretrained text encoders, use different encoding architectures with varying similarity functions, and adopt different training regimes. The coupling of these mostly independent design decisions and the lack of accompanying ablation studies are big obstacles to identifying the factors that drive the reported FSIC performance. We study these details across three key dimensions: (1) encoding architecture: Cross-Encoder vs. Bi-Encoder; (2) similarity function: parameterized (i.e., trainable) vs. non-parameterized; (3) training regime: episodic meta-learning vs. straightforward (i.e., non-episodic) training. Our experimental results on seven FSIC benchmarks reveal three important findings. First, the previously unexplored combination of the Cross-Encoder architecture (with a parameterized similarity scoring function) and episodic meta-learning consistently yields the best FSIC performance. Second, episodic training yields a more robust FSIC classifier than non-episodic training. Third, in meta-learning methods, splitting an episode into support and query sets is not a must. Our findings pave the way for conducting state-of-the-art research in FSIC and, more importantly, draw the community's attention to the details of FSIC methods. We release our code and data publicly.
1 Introduction
Intent classification deals with assigning one label from a predefined set of classes, or intents, to user utterances. This task is vital for understanding user demands in dialogue systems, and the predicted intent of an utterance is a key input to other modules (e.g., dialog management) in task-oriented dialog systems (Ma et al., 2022; Louvan and Magnini, 2020; Razumovskaia et al., 2021). Although intent classification has been widely studied, it remains challenging when dialogue systems, including their intent classifiers, must be extended to a wide variety of domains. One of the main challenges in training intent classifiers is the cost of labelled utterances (Zhang et al., 2022a; Wen et al., 2017; Budzianowski et al., 2018; Rastogi et al., 2020; Hung et al., 2022; Mueller et al., 2022). Therefore, the ability to adapt intent classifiers to new intents given only a few labelled instances is imperative.
Various methods (§2) for few-shot intent classification (FSIC) have been proposed (Larson et al., 2019b; Casanueva et al., 2020a; Zhang et al., 2020; Mehri et al., 2020; Krone et al., 2020; Casanueva et al., 2020b; Nguyen et al., 2020; Zhang et al., 2021; Dopierre et al., 2021; Vulić et al., 2021; Zhang et al., 2022b). These methods are generally similar in utilizing pretrained language models (LMs) and resorting to k-nearest-neighbour (kNN) inference: the label of a new instance is determined based on the labels of the instances with which it has the highest similarity scores. Despite these similarities, FSIC methods differ in crucial design dimensions, including encoding architectures, similarity functions, and training regimes. These methods couple different choices across these dimensions, hindering ablations and insights into which factors drive the performance.
In this work, we propose a formulation for comparing nearest-neighbour-based FSIC methods (§3). Within this scope, our formulation focuses on three key design decisions: (1) the model architecture for encoding utterance pairs, where we contrast the less frequently adopted Cross-Encoder architecture (e.g., Vulić et al., 2021) against the more common Bi-Encoder architecture¹ (Zhang et al., 2020; Krone et al., 2020; Zhang et al., 2021); (2) the similarity function for scoring utterance pairs based on their joint or separate representations, contrasting parameterized (i.e., trainable) neural scoring components against cosine similarity as a simple non-parameterized scoring function; and (3) the training regime, comparing standard non-episodic training (adopted, e.g., by Zhang et al. (2021) and Vulić et al. (2021)) against episodic meta-learning (implemented, e.g., by Nguyen et al. (2020) and Krone et al. (2020)).

¹Also known as Dual Encoder or Siamese Network.
We use our formulation to conduct an empirical, multi-dimensional comparison for two different text encoders (BERT (Devlin et al., 2019) as a vanilla PLM and SimCSE (Gao et al., 2021) as a state-of-the-art sentence encoder) and, more importantly, under the same evaluation setup (datasets, intent splits, evaluation protocols, and measures), while controlling for confounding factors that impede direct comparison between existing FSIC methods. Our extensive experimental results reveal two important findings. First, a Cross-Encoder coupled with episodic training, a combination never previously explored for FSIC, consistently yields the best performance across seven established intent classification datasets. Second, although episodic meta-learning methods split the utterances of an episode into a support and a query set during training, we show for the first time that this is not a must: without such splits, FSIC methods generalize to unseen intents in new domains better than (or comparably to) their counterparts trained with such splits.
In sum, our work draws the community's attention to the importance of practical details, formulated here as three dimensions, in the performance achieved by recent FSIC methods. Our novel findings pave the way for future research in conducting comprehensive comparisons of FSIC methods.
2 Related Work
Our work focuses on few-shot intent classification (FSIC) methods that use the nearest-neighbour (kNN) algorithm. We therefore describe existing inference algorithms and why we focus on kNN-based methods. We then categorize the literature on kNN-based methods with respect to our three evaluation dimensions.
Inference algorithms for FSIC.
Classical methods (Xu and Sarikaya, 2013; Meng and Huang, 2017; Wang et al., 2019; Gupta et al., 2019) for FSIC use the maximum-likelihood algorithm, where a vector representation of an utterance is given to a probability distribution function to obtain the likelihood of each intent class. Training such probability distribution functions, in particular when they are modeled by neural networks, mostly requires a large number of utterances annotated with intent labels, which are substantially expensive to collect for any new domain or intent. With advances in pretrained language models, recent FSIC methods leverage the knowledge encoded in such language models to alleviate the need for training a probability distribution for FSIC. These advanced FSIC methods (Krone et al., 2020; Casanueva et al., 2020b; Nguyen et al., 2020; Zhang et al., 2021; Dopierre et al., 2021; Vulić et al., 2021; Zhang et al., 2022b) mostly use the nearest-neighbour (kNN) algorithm to find the most similar instance among the few labeled utterances when classifying an unlabelled utterance; the label of the retrieved utterance is then taken as the intent class of the unlabelled one. Since nearest-neighbour-based FSIC methods achieve state-of-the-art FSIC performance, we focus on the major differences between these methods as our comparison dimensions.
Model architectures for encoding utterance pairs.
One of the main differences between kNN-based FSIC methods is their model architecture for encoding two utterances. The dominant FSIC methods (Zhang et al., 2020; Krone et al., 2020; Zhang et al., 2021; Xia et al., 2021) use the Bi-Encoder architecture (Bromley et al., 1993; Reimers and Gurevych, 2019a; Zhang et al., 2022a). The core idea of Bi-Encoders is to map an unlabelled utterance and a candidate labeled utterance separately into a common dense vector space and perform similarity scoring via a distance metric such as the dot product or cosine similarity. In contrast, some FSIC methods (Vulić et al., 2021; Zhang et al., 2020; Wang et al., 2021; Zhang et al., 2021) use the Cross-Encoder architecture (Devlin et al., 2019). The idea is to represent a pair of utterances together using an LM, where each utterance becomes a context for the other. A Cross-Encoder does not produce an utterance embedding but instead represents the semantic relations between its input utterances. In general, Bi-Encoders are more computationally efficient than Cross-Encoders because of the Bi-Encoder's ability to cache the representations of the candidates. In return, Cross-Encoders capture semantic relations between utterances, which are crucial for nearest-neighbour-based FSIC methods.
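The efficiency argument can be seen in a small sketch (ours, assuming generic `encode` and `encode_pair` functions): BE candidate representations are computed once and reused across queries, while CE must re-encode every query–candidate pair.

```python
import torch

def bi_encoder_inference(encode, query, candidates, cache):
    # encode: utterance -> vector (e.g., a BERT [CLS] embedding); candidate
    # vectors are computed once and cached, so each new query costs a single
    # encoder forward pass plus cheap dot products.
    for c in candidates:
        if c not in cache:
            cache[c] = encode(c)
    q_vec = encode(query)
    sims = torch.stack([torch.dot(q_vec, cache[c]) for c in candidates])
    return int(torch.argmax(sims))

def cross_encoder_inference(encode_pair, query, candidates):
    # encode_pair: (utt1, utt2) -> scalar score; nothing can be cached, so
    # every query requires len(candidates) joint forward passes.
    sims = torch.stack([encode_pair(query, c) for c in candidates])
    return int(torch.argmax(sims))
```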
Similarity scoring function.
A crucial component in nearest-neighbour-based methods for FSIC is the employed similarity function, which estimates the similarity between the input utterances given to the LMs. Concerning this comparison dimension, we categorize FSIC methods into two groups: first, methods (Zhou et al., 2022; Zhang et al., 2020; Xia et al., 2021) that use parametric neural layers to estimate the similarity score between utterances; second, methods (Sauer et al., 2022; Zhang et al., 2022a; Krone et al., 2020; Vulić et al., 2021; Zhang et al., 2022b; Xu et al., 2021; Zhang et al., 2021) that rely on non-parametric (a.k.a. metric-based) similarity functions such as the dot product, cosine similarity, or Euclidean distance.
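The two groups can be sketched as follows (our own generic examples; actual methods differ in their exact layer design), assuming utterance vectors `u` and `v` from some encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParameterizedSimilarity(nn.Module):
    """Trainable scorer: an MLP over the concatenated pair representation."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([u, v], dim=-1)).squeeze(-1)

def cosine_score(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Non-parametric (metric-based) alternative: nothing to train.
    return F.cosine_similarity(u, v, dim=-1)
```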
Training strategy.
To simulate FSIC, the best practice is to split an intent classification corpus into two disjoint sets of intent classes: one set of high-resource intents on which to train an FSIC classifier, and one set of low-resource intents on which to evaluate it. Concerning the training strategy on the high-resource intents, FSIC methods can be divided into two clusters. One cluster of methods (Zhang et al., 2022a; Nguyen et al., 2020; Krone et al., 2020) adopts meta-learning, or episodic training. Under this training regime, the goal is to train a meta-learner that can quickly adapt to any few-shot intent classification task given very few labeled examples. To this end, the set of high-resource intents is split to construct many episodes, where each episode is a few-shot intent classification task over a small number of intents. The other cluster includes methods (Zhang et al., 2020, 2021; Vulić et al., 2021; Xu et al., 2021; Xia et al., 2021) that use conventional supervised (or non-episodic) training. The non-episodic training strategy treats the set of high-resource intents as one large training set and fine-tunes the parameters of the FSIC model on all samples in this set.
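The sketch below illustrates how such episodes can be constructed from the high-resource intents (our own illustration; variable names and the support/query sizes are assumptions, not taken from any specific method):

```python
import random

def sample_episode(data_by_intent, n_way=5, k_shot=1, n_query=5):
    """Build one N-way k-shot episode.

    data_by_intent: dict mapping an intent label to its list of utterances.
    """
    intents = random.sample(sorted(data_by_intent), n_way)
    support, query = [], []
    for label in intents:
        utts = random.sample(data_by_intent[label], k_shot + n_query)
        support += [(u, label) for u in utts[:k_shot]]
        query += [(u, label) for u in utts[k_shot:]]
    return support, query

# Episodic training iterates over many such episodes; non-episodic training
# instead fine-tunes on all high-resource samples at once.
```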
3 Method
We first describe the commonly adopted FSIC framework based on utterance similarities and the nearest-neighbour inference algorithm. We then present the alternative configurations along our three central dimensions of comparison: (i) model architectures for encoding utterance pairs, (ii) functions for scoring utterance pairs, and (iii) training regimes.
3.1 Nearest Neighbours (NN) inference.
Following previous work on FSIC (Zhang et al., 2020; Vulić et al., 2021), we cast the FSIC task as a sentence similarity task in which each intent is a latent semantic class represented by the sentences associated with that intent. The task is then to find, for a given query/input, the most similar labelled utterances, from which the underlying semantic intent can be directly derived. During inference, an FSIC method must deal with $N$-way $k$-shot intent classification, where $N$ is the number of intents and $k$ is the number of labeled utterances given for each intent label. Advanced FSIC methods infer the intent of an utterance (i.e., the query) based on its similarity to the given few labeled utterances.
Let $q$ be a query utterance and $C = \{c_1, \dots, c_n\}$ be the set of its labeled neighbours. Nearest-neighbour inference relies on a similarity function, either non-parameterized or trainable (learned on the high-resource intents), to estimate the similarity score $s_i$ between the query utterance $q$ and each neighbour $c_i \in C$. The query's label $\hat{y}_q$ is inferred as the ground-truth label of the neighbour with the maximum similarity score (i.e., $k = 1$ in kNN inference):

$$\hat{y}_q = y_k, \quad k = \operatorname{argmax}(\{s_1, \dots, s_n\}).$$
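This inference rule translates directly into code; the sketch below is ours, with `similarity` standing in for either a non-parameterized metric or a trained scorer:

```python
import torch

def nn_inference(query_vec, neighbour_vecs, neighbour_labels, similarity):
    # s_i = similarity(q, c_i) for every labelled neighbour c_i in C
    scores = torch.stack([similarity(query_vec, c) for c in neighbour_vecs])
    k = int(torch.argmax(scores))       # k = argmax({s_1, ..., s_n})
    return neighbour_labels[k]          # the predicted label y_hat_q = y_k
```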
3.2 Model Architectures for Encoding
Utterance Pairs
The main component of an FSIC model is an encoder that represents a pair of utterances: a query and a labelled utterance. We explore the two model architectures used in recent FSIC methods: the Bi-Encoder and the Cross-Encoder.
Bi-Encoder (BE).
BE encodes a pair of utterances independently, deriving independent representations of the query and the labelled utterance. In particular, for each utterance $x$ in a pair, we pass "[CLS] $x$" to a BERT-like language model and use the vector representation of "[CLS]" to represent $x$. It is worth noting that the parameters of the LM are shared between the two inputs in BE.
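A minimal BE sketch with Hugging Face Transformers, assuming `bert-base-uncased` as the underlying LM (our choice for illustration, not necessarily the paper's exact encoder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # shared for both utterances

def be_encode(utterance: str) -> torch.Tensor:
    # The tokenizer prepends [CLS] automatically; we keep its final hidden state.
    inputs = tokenizer(utterance, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # the [CLS] vector

q_vec = be_encode("book a table for two")
c_vec = be_encode("reserve a restaurant for tonight")
score = torch.cosine_similarity(q_vec, c_vec, dim=0)
```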
Cross-Encoder (CE).
Different from BE, CE encodes the query $q$ and an utterance $c_i$ jointly. We concatenate $q$ with each of its neighbours to form a set of query–neighbour pairs $P = \{(q, c_1), \dots, (q, c_n)\}$. We then pass each pair from $P$ as a single sequence of tokens to a language model to obtain a joint representation of the pair.
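A minimal CE sketch in the same setup (the linear scoring head on top of the joint [CLS] representation is our assumption, anticipating the parameterized scoring functions discussed above):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
score_head = nn.Linear(model.config.hidden_size, 1)  # assumed trainable scorer

def ce_score(query: str, candidate: str) -> torch.Tensor:
    # text_pair makes the tokenizer emit "[CLS] query [SEP] candidate [SEP]",
    # so each utterance attends to the other inside the LM.
    inputs = tokenizer(query, text_pair=candidate, return_tensors="pt")
    outputs = model(**inputs)
    pair_repr = outputs.last_hidden_state[0, 0]  # joint [CLS] representation
    return score_head(pair_repr).squeeze(-1)

# One forward pass per (query, neighbour) pair in P:
scores = [ce_score("book a table for two", c)
          for c in ["reserve a restaurant", "play some jazz music"]]
```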