larity function for scoring utterance pairs based on
their joint or separate representations, contrasting
the parameterized (i.e., trainable) neural scoring
components against cosine similarity as the simple
non-parameterized scoring function; and (3) train-
ing regimes, comparing the standard non-episodic
training (adopted, e.g., by Zhang et al. (2021) or Vulić et al. (2021)) against the episodic meta-learning training (implemented, e.g., by Nguyen et al. (2020) or Krone et al. (2020)).
We use our formulation to conduct an empirical multi-dimensional comparison for two different text encoders (BERT (Devlin et al., 2019) as a vanilla PLM and SimCSE (Gao et al., 2021) as the state-of-the-art sentence encoder) and, more importantly, under the same evaluation setup (datasets, intent splits, evaluation protocols, and measures)
while controlling for confounding factors that im-
pede direct comparison between existing FSIC
methods. Our extensive experimental results re-
veal two important findings. First, a Cross-Encoder coupled with episodic training, a combination that has not previously been explored for FSIC, consistently yields the best performance across seven established intent classification datasets. Second, although episodic meta-learning methods split the utterances of an episode into a support set and a query set during training, we show for the first time that this split is not necessary: without it, FSIC methods generalize to unseen intents in new domains as well as, or better than, with it.
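To make the support/query distinction concrete, the following is a minimal sketch of episodic sampling with and without this split; the episode parameters, helper names, and sampling details are hypothetical and not taken from any of the cited methods.

```python
# Illustrative sketch of episodic sampling for meta-learning FSIC training.
# The parameters (n_way, k_support, k_query) and the function itself are
# hypothetical; the finding above concerns whether the support/query split
# inside an episode is actually needed.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_support=5, k_query=5, split=True):
    """dataset: list of (utterance, intent) pairs.
    Returns (support, query) when split=True, or (all_examples, None) otherwise."""
    by_intent = defaultdict(list)
    for utt, intent in dataset:
        by_intent[intent].append(utt)
    episode_intents = random.sample(list(by_intent), n_way)   # classes of this episode
    support, query = [], []
    for intent in episode_intents:
        utts = random.sample(by_intent[intent], k_support + k_query)
        support += [(u, intent) for u in utts[:k_support]]
        query += [(u, intent) for u in utts[k_support:]]
    if split:
        return support, query        # standard episodic setup: query scored against support
    return support + query, None     # no split: all sampled utterances used together
```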
In sum, our work draws the community's attention to the importance of practical details, which we formulate as three dimensions, for the performance achieved by recent FSIC methods. In addition, our novel findings pave the way for future research on comprehensive evaluations of FSIC methods.
2 Related Work
Our work focuses on few-shot intent classification (FSIC) methods that use the nearest neighbor (kNN) algorithm. Therefore, we describe existing inference algorithms and explain why we focus on kNN-based methods. Then, we categorize the literature on kNN-based methods with respect to our three evaluation dimensions.
Inference algorithms for FSIC.
Classical methods (Xu and Sarikaya, 2013; Meng and Huang, 2017; Wang et al., 2019; Gupta et al., 2019) for FSIC use the maximum likelihood algorithm,
where a vector representation of an utterance is
given to a probability distribution function to ob-
tain the likelihood of each intent class. Training
such probability distribution functions, in partic-
ular when they are modeled by neural networks,
mostly requires a large number of utterances an-
notated with intent labels, which are expensive to collect for any new domain and intent. With advances in pre-trained language mod-
els, recent FSIC methods leverage the knowledge
encoded in such language models to alleviate the
need for training a probability distribution for FSIC.
These advanced FSIC methods (Krone et al., 2020; Casanueva et al., 2020b; Nguyen et al., 2020; Zhang et al., 2021; Dopierre et al., 2021; Vulić et al., 2021; Zhang et al., 2022b) mostly use the nearest neighbor (kNN) algorithm: when classifying an unlabeled utterance, they find the most similar instance among the few labeled utterances and assign its intent label to the unlabeled utterance. Since nearest neighbor-based FSIC methods
achieve state-of-the-art FSIC performance, we fo-
cus on the major differences between these meth-
ods as our comparison dimensions.
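For illustration, the nearest-neighbor inference step can be sketched as follows; the token-overlap similarity is a deliberately simple placeholder (not used by any of the cited methods), which they replace with PLM-based scoring, as discussed next.

```python
# Minimal sketch of kNN-based FSIC inference (k = 1). The similarity function
# is a toy token-overlap score for illustration only; the cited methods use
# PLM-based scoring instead.
from collections import Counter

def toy_similarity(a: str, b: str) -> float:
    """Placeholder similarity: fraction of shared tokens (illustrative only)."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((ta & tb).values())
    return shared / max(1, max(sum(ta.values()), sum(tb.values())))

def knn_intent(query: str, labeled_utterances: list[tuple[str, str]]) -> str:
    """Assign the query the intent label of its most similar labeled utterance."""
    nearest_utt, nearest_label = max(
        labeled_utterances, key=lambda pair: toy_similarity(query, pair[0])
    )
    return nearest_label

# Toy usage: three labeled utterances (the few-shot examples), one unlabeled query.
support = [
    ("play some jazz music", "play_music"),
    ("what is the weather tomorrow", "get_weather"),
    ("set an alarm for seven", "set_alarm"),
]
print(knn_intent("is the weather nice tomorrow", support))  # -> "get_weather"
```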
Model architectures for encoding utterance
pairs.
One of the main differences among kNN-based FSIC methods is their model architecture for encoding two utterances. The dominant FSIC methods (Zhang et al., 2020; Krone et al., 2020; Zhang et al., 2021; Xia et al., 2021) use the Bi-Encoder architecture (Bromley et al., 1993; Reimers and Gurevych, 2019a; Zhang et al., 2022a).
The core idea of Bi-Encoders is to map an unlabeled utterance and a candidate labeled utterance separately into a common dense vector space and perform similarity scoring with a measure such as dot product or cosine similarity. In contrast,
some FSIC methods (Vulić et al., 2021; Zhang et al., 2020; Wang et al., 2021; Zhang et al., 2021) use the Cross-Encoder architecture (Devlin et al., 2019). The idea is to represent a pair of utterances together using an LM, where each utterance becomes a context for the other. A Cross-Encoder does not produce an utterance embedding but represents the semantic relations between its input utterances.
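As a concrete but hedged illustration of this architectural difference, the sketch below scores an utterance pair with both approaches using a BERT backbone; the checkpoint, mean pooling, and linear scoring head are illustrative assumptions rather than the setup of any particular cited method.

```python
# Hedged sketch of Bi-Encoder vs. Cross-Encoder pair scoring with a BERT backbone.
# The pooling choice and the linear scoring head are hypothetical stand-ins for
# the scoring components used by the cited FSIC methods.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")

def bi_encoder_score(utt_a: str, utt_b: str) -> float:
    """Bi-Encoder: encode each utterance separately, then compare with cosine
    similarity. Candidate-side embeddings could be pre-computed and cached."""
    def encode(text: str) -> torch.Tensor:
        batch = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            return lm(**batch).last_hidden_state.mean(dim=1)   # (1, H) mean-pooled
    return torch.nn.functional.cosine_similarity(encode(utt_a), encode(utt_b)).item()

class CrossEncoderScorer(torch.nn.Module):
    """Cross-Encoder: pack both utterances into one input ("[CLS] a [SEP] b [SEP]")
    so each attends to the other; a (hypothetical) linear head maps the joint
    [CLS] representation to a match score."""
    def __init__(self, lm):
        super().__init__()
        self.lm = lm
        self.head = torch.nn.Linear(lm.config.hidden_size, 1)

    def forward(self, utt_a: str, utt_b: str) -> torch.Tensor:
        batch = tokenizer(utt_a, utt_b, return_tensors="pt")
        cls = self.lm(**batch).last_hidden_state[:, 0]   # joint pair representation
        return self.head(cls).squeeze(-1)                # unnormalized match score

print(bi_encoder_score("book a table for two", "reserve a restaurant"))
scorer = CrossEncoderScorer(lm)
print(float(scorer("book a table for two", "reserve a restaurant")))
```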
In general, Bi-Encoders are more
computationally efficient than Cross-Encoders be-
cause of the Bi-Encoder’s ability to cache the rep-
resentations of the candidates. In return, Cross-