larity function for scoring utterance pairs based on
their joint or separate representations, contrasting
the parameterized (i.e., trainable) neural scoring
components against cosine similarity as the simple
non-parameterized scoring function; and (3) train-
ing regimes, comparing the standard non-episodic
training (adopted, e.g., by Zhang et al. (2021) or Vulić et al. (2021)) against the episodic meta-learning training (implemented, e.g., by Nguyen et al. (2020) or Krone et al. (2020)).
We use our formulation to conduct an empirical multi-dimensional comparison for two different text encoders (BERT (Devlin et al., 2019) as a vanilla PLM and SimCSE (Gao et al., 2021) as the state-of-the-art sentence encoder) and, more importantly, under the same evaluation setup (datasets, intent splits, evaluation protocols, and measures)
while controlling for confounding factors that im-
pede direct comparison between existing FSIC
methods. Our extensive experimental results re-
veal two important findings. First, a Cross-Encoder coupled with episodic training, a combination that has not previously been explored for FSIC, consistently yields the best performance across seven established intent classification datasets. Second, although episodic meta-learning methods split the utterances of an episode into a support set and a query set during training, we show for the first time that this split is not necessary: without it, FSIC methods generalize to unseen intents in new domains as well as, or better than, with it.
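To make the support/query distinction concrete, the following is a minimal sketch of episodic sampling with and without this split; the episode parameters, helper names, and sampling details are hypothetical and not taken from any of the cited methods.

```python
# Illustrative sketch of episodic sampling for meta-learning FSIC training.
# The parameters (n_way, k_support, k_query) and the function itself are
# hypothetical; the finding above concerns whether the support/query split
# inside an episode is actually needed.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_support=5, k_query=5, split=True):
    """dataset: list of (utterance, intent) pairs.
    Returns (support, query) when split=True, or (all_examples, None) otherwise."""
    by_intent = defaultdict(list)
    for utt, intent in dataset:
        by_intent[intent].append(utt)
    episode_intents = random.sample(list(by_intent), n_way)   # classes of this episode
    support, query = [], []
    for intent in episode_intents:
        utts = random.sample(by_intent[intent], k_support + k_query)
        support += [(u, intent) for u in utts[:k_support]]
        query += [(u, intent) for u in utts[k_support:]]
    if split:
        return support, query        # standard episodic setup: query scored against support
    return support + query, None     # no split: all sampled utterances used together
```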
In sum, our work draws the community's attention to the importance of practical details, which we formulate as three dimensions, for the performance achieved by recent FSIC methods. In addition, our novel findings pave the way for future research on comprehensive evaluations of FSIC methods.
2 Related Work
Our work focuses on few-shot intent classification (FSIC) methods that use the nearest neighbor (kNN) algorithm. Therefore, we describe existing inference algorithms and explain why we focus on kNN-based methods. Then, we categorize the literature on kNN-based methods with respect to our three evaluation dimensions.
Inference algorithms for FSIC.
Classical methods (Xu and Sarikaya, 2013; Meng and Huang, 2017; Wang et al., 2019; Gupta et al., 2019) for FSIC use the maximum likelihood algorithm,
where a vector representation of an utterance is
given to a probability distribution function to ob-
tain the likelihood of each intent class. Training
such probability distribution functions, in partic-
ular when they are modeled by neural networks,
mostly requires a large number of utterances an-
notated with intent labels, which are expensive to collect for any new domain and intent. With advances in pre-trained language mod-
els, recent FSIC methods leverage the knowledge
encoded in such language models to alleviate the
need for training a probability distribution for FSIC.
These advanced FSIC methods (Krone et al., 2020; Casanueva et al., 2020b; Nguyen et al., 2020; Zhang et al., 2021; Dopierre et al., 2021; Vulić et al., 2021; Zhang et al., 2022b) mostly use the nearest neighbor (kNN) algorithm: when classifying an unlabeled utterance, they find the most similar instance among the few labeled utterances and assign its intent label to the unlabeled utterance. Since nearest neighbor-based FSIC methods
achieve state-of-the-art FSIC performance, we fo-
cus on the major differences between these meth-
ods as our comparison dimensions.
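For illustration, the nearest-neighbor inference step can be sketched as follows; the token-overlap similarity is a deliberately simple placeholder (not used by any of the cited methods), which they replace with PLM-based scoring, as discussed next.

```python
# Minimal sketch of kNN-based FSIC inference (k = 1). The similarity function
# is a toy token-overlap score for illustration only; the cited methods use
# PLM-based scoring instead.
from collections import Counter

def toy_similarity(a: str, b: str) -> float:
    """Placeholder similarity: fraction of shared tokens (illustrative only)."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((ta & tb).values())
    return shared / max(1, max(sum(ta.values()), sum(tb.values())))

def knn_intent(query: str, labeled_utterances: list[tuple[str, str]]) -> str:
    """Assign the query the intent label of its most similar labeled utterance."""
    nearest_utt, nearest_label = max(
        labeled_utterances, key=lambda pair: toy_similarity(query, pair[0])
    )
    return nearest_label

# Toy usage: three labeled utterances (the few-shot examples), one unlabeled query.
support = [
    ("play some jazz music", "play_music"),
    ("what is the weather tomorrow", "get_weather"),
    ("set an alarm for seven", "set_alarm"),
]
print(knn_intent("is the weather nice tomorrow", support))  # -> "get_weather"
```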
Model architectures for encoding utterance
pairs.
One of the main differences among kNN-based FSIC methods is their model architecture for encoding two utterances. The dominant FSIC methods (Zhang et al., 2020; Krone et al., 2020; Zhang et al., 2021; Xia et al., 2021) use the Bi-Encoder architecture (Bromley et al., 1993; Reimers and Gurevych, 2019a; Zhang et al., 2022a).
The core idea of Bi-Encoders is to map an unlabeled utterance and a candidate labeled utterance separately into a common dense vector space and perform similarity scoring with a measure such as dot product or cosine similarity. In contrast,
some FSIC methods (Vulić et al., 2021; Zhang et al., 2020; Wang et al., 2021; Zhang et al., 2021) use the Cross-Encoder architecture (Devlin et al., 2019). The idea is to represent a pair of utterances together using an LM, where each utterance becomes a context for the other. A Cross-Encoder does not produce an utterance embedding but represents the semantic relations between its input utterances.
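As a concrete but hedged illustration of this architectural difference, the sketch below scores an utterance pair with both approaches using a BERT backbone; the checkpoint, mean pooling, and linear scoring head are illustrative assumptions rather than the setup of any particular cited method.

```python
# Hedged sketch of Bi-Encoder vs. Cross-Encoder pair scoring with a BERT backbone.
# The pooling choice and the linear scoring head are hypothetical stand-ins for
# the scoring components used by the cited FSIC methods.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")

def bi_encoder_score(utt_a: str, utt_b: str) -> float:
    """Bi-Encoder: encode each utterance separately, then compare with cosine
    similarity. Candidate-side embeddings could be pre-computed and cached."""
    def encode(text: str) -> torch.Tensor:
        batch = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            return lm(**batch).last_hidden_state.mean(dim=1)   # (1, H) mean-pooled
    return torch.nn.functional.cosine_similarity(encode(utt_a), encode(utt_b)).item()

class CrossEncoderScorer(torch.nn.Module):
    """Cross-Encoder: pack both utterances into one input ("[CLS] a [SEP] b [SEP]")
    so each attends to the other; a (hypothetical) linear head maps the joint
    [CLS] representation to a match score."""
    def __init__(self, lm):
        super().__init__()
        self.lm = lm
        self.head = torch.nn.Linear(lm.config.hidden_size, 1)

    def forward(self, utt_a: str, utt_b: str) -> torch.Tensor:
        batch = tokenizer(utt_a, utt_b, return_tensors="pt")
        cls = self.lm(**batch).last_hidden_state[:, 0]   # joint pair representation
        return self.head(cls).squeeze(-1)                # unnormalized match score

print(bi_encoder_score("book a table for two", "reserve a restaurant"))
scorer = CrossEncoderScorer(lm)
print(float(scorer("book a table for two", "reserve a restaurant")))
```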
In general, Bi-Encoders are more
computationally efficient than Cross-Encoders be-
cause of the Bi-Encoder’s ability to cache the rep-
resentations of the candidates. In return, Cross-