Enhancing Out-of-Distribution Detection in Natural Language
Understanding via Implicit Layer Ensemble
Hyunsoo Cho†, Choonghyun Park†, Jaewook Kang♭,
Kang Min Yoo‡♭†, Taeuk Kim§∗, Sang-goo Lee†
†Seoul National University, ‡NAVER AI Lab, ♭NAVER CLOVA, §Hanyang University
{johyunsoo,pch330,sglee}@europa.snu.ac.kr
{jaewook.kang, kangmin.yoo}@navercorp.com
kimtaeuk@hanyang.ac.kr
Abstract
Out-of-distribution (OOD) detection aims to
discern outliers from the intended data distri-
bution, which is crucial to maintaining high
reliability and a good user experience. Most
recent studies in OOD detection utilize the
information from a single representation that
resides in the penultimate layer to determine
whether the input is anomalous or not. Al-
though such a method is straightforward, the
potential of diverse information in the inter-
mediate layers is overlooked. In this paper,
we propose a novel framework based on con-
trastive learning that encourages intermediate
features to learn layer-specialized representa-
tions and assembles them implicitly into a sin-
gle representation to absorb rich information
in the pre-trained language model. Extensive experiments on various intent classification and OOD datasets demonstrate that our approach is significantly more effective than other works. The source code for our model is available online.1
1 Introduction
Natural language understanding (NLU) in dialog systems, which is often formalized as a classification task that identifies the intention behind user input, is a vital component, as its decisions propagate to the downstream pipelines. Numerous works have
achieved immense success on sundry tasks (e.g.,
intention classification, NLI, QA) reaching parity
with human performance (Wang et al.,2019). De-
spite their success in many different benchmarks,
neural models are known to be vulnerable to test
inputs from an unknown distribution (Hendrycks
and Gimpel,2017;Hein et al.,2019), commonly
referred to as outliers, since they depend strongly
on the closed-world assumption (i.e., the i.i.d. assumption). Thus, out-of-distribution (OOD) detection
(Aggarwal, 2017), which aims to discern outliers from the training distribution, is an essential research problem for ensuring a high-quality user experience and maintaining strong reliability, as systems in the wild ceaselessly encounter myriad unseen data.

*Corresponding author.
1https://github.com/HyunsooCho77/LaCL-official

Figure 1: Layer-wise performances and their explicit ensemble (Shen et al., 2021) performance on BERT-base. An explicit ensemble often leads to worse AUROC (higher is better) than using a single well-performing layer. Detailed explanations of the setting and baseline model are given in Sec. 4.2.1 and Sec. 4.3, respectively.
The most prevailing paradigm in OOD detection
is to extract and score. Namely, it extracts the
representation of the input from a neural model and
passes it to a pre-defined scoring function. Then,
the scoring function gauges the appropriateness of
the input based on the extracted feature and decides
whether the input is from the normal distribution.
The most common rule of thumb for extracting a representation from a neural model is to use the last layer, a simple and intuitive way to obtain a holistic representation that is universally adopted across broad areas of machine learning.
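To make the extract-and-score pipeline concrete, here is a minimal, generic sketch (not the method of any particular paper discussed here) that extracts logits from a HuggingFace-style classifier and scores them with the maximum softmax probability (Hendrycks and Gimpel, 2017); the model interface and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_and_score(model, input_ids, attention_mask, threshold=0.5):
    """Generic extract-and-score OOD detection (illustrative sketch).

    1) Extract logits/features from the neural model.
    2) Map them to a scalar confidence with a scoring function
       (here: maximum softmax probability).
    3) Flag the input as OOD when the score falls below a threshold.
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    probs = F.softmax(outputs.logits, dim=-1)
    score = probs.max(dim=-1).values   # MSP confidence per input
    is_ood = score < threshold         # low confidence -> likely outlier
    return score, is_ood
```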
Meanwhile, previous studies (Tenney et al.,
2019;Clark et al.,2019) revealed that the middle
layers of the language model also conceal copi-
ous information. For instance, prior studies on
language model probing suggest that syntactic lin-
guistic knowledge is most prominent in the middle
layers (Hewitt and Manning,2019;Goldberg,2019;
Jawahar et al.,2019), and semantic knowledge in
BERT is spread in all layers widely (Tenney et al.,
2019). In this regard, leveraging intermediate lay-
ers can lead to a better OOD detection performance,
as they retain some complementary information to
the last layer feature, which might be beneficial in
discriminating outliers. Several studies (Shen et al.,
2021;Sastry and Oore,2020;Lee et al.,2018b)
have shown empirical evidence that intermediate
representations are indeed beneficial in detecting
outliers. Precisely, they attempted to utilize middle
layers via naïvely aggregating the individual result
of every single intermediate feature explicitly.
Although previous studies have shown the poten-
tial of intermediate layer representations in OOD
detection, we confirmed that the aforementioned
naïve ensemble scheme spawns several problems (Fig. 1 illustrates the layer-wise OOD performance and the explicit ensemble performance on two different datasets). The first problem we observed is that neither the ensemble result (red bar) nor the last layer is guaranteed to perform best among all layers, depending on the setting. Such a phenomenon raises the need for a more elaborate approach that derives a more meaningful ensemble representation from the various layer representations, rather than a simple summation or the selection of a single layer. Secondly, even when the explicit ensemble yields sound performance, it requires multiple computations of the scoring function by design. Thus, the explicit ensemble inevitably increases detection time, which is a critical shortcoming in OOD detection, where swift and precise decision-making is the cornerstone.
To remedy the limitations of the explicit en-
semble schemes, we propose a novel frame-
work dubbed Layer-agnostic Contrastive Learning
(LaCL). Our framework is inspired by the founda-
tion of an ensemble, which seeks a more calibrated
output by combining heterogeneous decisions from
multiple models (Kuncheva and Whitaker,2003;
Gashler et al.,2008). Specifically, LaCL regards in-
termediate layers as independent decision-makers
and assembles them into a single vector to yield
a more accurate prediction: LaCL makes middle-
layer representations richer and more diverse by in-
jecting the advantage of contrastive learning (CL)
into intermediate layers while discouraging inter-layer representations from becoming similar through an additional regularization loss. Then, LaCL assembles them into a single ensemble representation implicitly to circumvent multiple computations of the scoring function.
We demonstrate the effectiveness of our ap-
proach in 9 different OOD scenarios where LaCL
consistently surpasses other competitive works and
their explicit ensemble performance by a signifi-
cant margin. Moreover, we conducted an in-depth
analysis of LaCL to elucidate its behavior in con-
junction with our intuition.
2 Related Work
OOD detection.
Methodologies in OOD detec-
tion can be divided into supervised (Hendrycks
et al.,2019;Lee et al.,2018a;Dhamija et al.,2018)
and unsupervised settings according to the pres-
ence of training data from OOD. Since the scope of OOD covers a nigh-infinite space, gathering data that spans the whole OOD space is infeasible. For
this realistic reason, the most recent OOD detec-
tion studies generally discriminate OOD input in
an unsupervised manner, including this work. Nu-
merous branches of machine learning tactics are
employed for unsupervised OOD detection: gen-
erating pseudo-OOD data (Chen and Yu,2021;
Zheng et al.,2020), Bayesian methods (Malinin
and Gales,2018), self-supervised learning based
approaches (Moon et al.,2021;Manolache et al.,
2021;Li et al.,2021;Zhou et al.,2021;Zeng et al.,
2021;Zhan et al.,2021), and novel scoring func-
tions which measure the uncertainty of the given
input (Hendrycks and Gimpel,2017;Lee et al.,
2018b;Liu et al.,2020;Tack et al.,2020).
Contrastive learning & OOD detection.
Among the numerous approaches mentioned, con-
trastive learning (CL) based methods (Chen et al.,
2020;Zbontar et al.,2021;Grill et al.,2020) are
recently spurring predominant interest in OOD de-
tection research. The superiority of CL in OOD
detection comes from the fact that it can guide a
neural model to learn semantic similarity within
data instances. Such property is also precious for
unsupervised OOD detection, as there is no accessi-
ble clue regarding outliers or abnormal distribution.
Despite its potential, early works utilized CL mainly in the computer vision field (Cho et al., 2021; Sehwag et al., 2021; Tack et al., 2020; Winkens et al., 2020) due to its heavy reliance on data augmentation. However, it is now also widely used
in various NLP applications with the help of re-
cent progress (Li et al.,2021;Liu et al.,2021;Kim
et al.,2021;Carlsson et al.,2020;Gao et al.,2021;
Sennrich et al.,2016). Specifically, Li et al. (2021)
verified that CL is also helpful in the NLP field, and
Zhou et al. (2021); Zeng et al. (2021) redesigned
the contrastive-learning objective into a more ap-
propriate form for OOD detection.
Potential of intermediate representation.
The
leading driver of the recent upheaval in NLP is
the pre-trained language model (PLM), such as
BERT (Devlin et al.,2019) and GPT (Radford
et al., 2018), which is trained on a large-scale dataset with a transformer-based architecture (Vaswani et al.,
2017). Numerous studies attempted to reveal the
role and characteristics of each layer in PLMs and
verified that diverse information is concealed in
the middle layer, which is now a pervasive notion
in the machine learning community. For instance,
Tenney et al. (2019) showed that the different lay-
ers of the BERT network could resolve syntactic
and semantic structure within a sentence. Clark
et al. (2019) proposed an attention-based probing
classifier leveraging syntactic information in the
middle layer of BERT. Several studies (Shen et al.,
2021;Sastry and Oore,2020;Lee et al.,2018b)
have shown the potential of intermediate represen-
tations in OOD detection by explicitly aggregating
the individual result of every single intermediate
feature.
3 Layer-agnostic Contrastive Learning
3.1 Intuition
The prime objective of our framework is to assem-
ble rich information in the entire layers into a single
ensemble representation to derive a more reliable
decision. Inspired by the foundation of ensemble
learning, which seeks better predictive performance
by combining the predictions from multiple mod-
els, we regard each intermediate layer as an inde-
pendent model (or decision maker). To make each
layer a better decision-maker, LaCL injects a sound representation-learning signal (i.e., supervised contrastive learning) into every layer by training the objective function in a layer-agnostic manner that engages every layer more directly. Additionally, we propose a correlation regularization loss (CR loss), which decorrelates pairs of strongly correlated adjacent representations to encourage each layer to learn layer-specialized representations from its complementary information. Then, the global compression layer (GCL) implicitly assembles the various features from each layer into a single calibrated ensemble representation. In the following subsections, we explain the components of our model in detail.
3.2 Supervised Contrastive Learning
Supervised contrastive learning (SCL) is a super-
vised variant of vanilla contrastive learning, which
employs label information of the input to group
samples into known classes more tightly. Thus,
SCL can learn data-label relationships as well as
data-data relationships as in CL.
In SCL, each batch $\mathcal{B} = \{(x_b, y_b)\}_{b=1}^{|\mathcal{B}|}$ in the dataset, where $x_b$ and $y_b$ denote a sentence and its label for index $b$ respectively, generates an augmented batch $\bar{\mathcal{B}} = \{(\bar{x}_b, \bar{y}_b)\}_{b=1}^{|\bar{\mathcal{B}}|}$, where the labels of the augmented views are preserved as the original. The augmented batch $\bar{\mathcal{B}}$ consists of two augmented inputs, $\bar{x}_{2b-1} = t_1(x_b)$ and $\bar{x}_{2b} = t_2(x_b)$, where $t_1, t_2$ indicate the data augmentation functions specified in Section 3.6. Then, $(\bar{x}_{2b-1}, \bar{x}_{2b})$ are passed through the PLM and the projector, generating latent vectors $(z_{2b-1}, z_{2b})$ that are used to calculate the supervised contrastive loss:

$$\mathcal{L}_{\mathrm{SCL}} = -\log \sum_{j \in P(i)} \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{|\bar{\mathcal{B}}|} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}, \quad (1)$$

where $P(i) = \{p \in \bar{\mathcal{B}} : \bar{y}_p = \bar{y}_i\}$ is the set of indices of all positives in the augmented batch for query index $i$, and $\tau$ denotes the temperature hyperparameter.
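For reference, a minimal PyTorch sketch of a supervised contrastive loss over the augmented batch is given below. It follows the common formulation that averages the log-probability over positives and may differ in normalization details from Eq. (1); the tensor names and shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over an augmented batch (sketch).

    z:      (2B, D) latent vectors from the projector (two views per input).
    labels: (2B,)   class labels, identical for the two views of each input.
    """
    z = F.normalize(z, dim=-1)
    sim = (z @ z.t()) / temperature                        # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # average log-probability of the positives per anchor, then over anchors
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```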
3.3 Global Compression Layer
The global compression layer (GCL) is a two-layer MLP that is directly connected to all layers to assemble the intermediate representations into a single representation $z$. GCL can be viewed as a particular type of projection head in contrastive learning. By linking the projection head to all layers, GCL facilitates layer-agnostic training that engages every middle layer in the training objective directly.

Figure 2: Overall structure of Layer-agnostic Contrastive Learning (LaCL). The global compression layer trains the SCL loss in a layer-agnostic manner by engaging all layers in the CL task, and the correlation regularization (CR) loss decorrelates each intermediate layer to avoid overlapping information between layers.

The process of extracting the final latent vector $z$ with GCL is as follows (the batch index $b$ is omitted for brevity from now on). First, each layer $l$ ($l \in \{1, \dots, |L|\}$, where $|L|$ refers to the number of layers) in the PLM outputs token embeddings $H^l = [h^l_1, h^l_2, \cdots, h^l_{\mathrm{len}(x)}]$ for a sentence $x$. Then we combine the token embeddings $H^l$ into a single vector $h^l = \mathrm{pool}(H^l)$ by applying a pooling function (i.e., mean pooling). Lastly, GCL receives the pooled token embedding of each layer, $h^l \in \mathbb{R}^{|D|}$, as input and outputs a compact low-dimensional representation $c^l \in \mathbb{R}^{|D|/|L|}$. We then concatenate all compact representations $c^l$ to generate a single sentence representation $z$ from $x$:

$$z(x) = [\,c^1 \oplus c^2 \oplus c^3 \oplus \cdots \oplus c^{|L|}\,], \quad (2)$$

where $\oplus$ indicates concatenation and $z \in \mathbb{R}^{|D|}$.

LaCL trains the SCL loss with the final representation $z$ from GCL, which carries information from all layers.
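A possible PyTorch sketch of GCL is shown below, assuming a BERT-style encoder that exposes per-layer hidden states. Whether the two-layer MLP is shared across layers is not specified above, so sharing it here is an assumption, as are the module and argument names.

```python
import torch
import torch.nn as nn

class GlobalCompressionLayer(nn.Module):
    """Pools each encoder layer's token embeddings, compresses them to
    |D|/|L| dimensions with a two-layer MLP, and concatenates the results
    into a single ensemble representation z (cf. Eq. 2)."""

    def __init__(self, hidden_dim: int, num_layers: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // num_layers),
        )

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of |L| tensors of shape (B, T, D), one per layer
        mask = attention_mask.unsqueeze(-1).float()
        compact = []
        for layer_h in hidden_states:
            pooled = (layer_h * mask).sum(1) / mask.sum(1)  # mean pooling -> h^l
            compact.append(self.mlp(pooled))                # c^l in R^{|D|/|L|}
        return torch.cat(compact, dim=-1)                   # z = [c^1 ⊕ ... ⊕ c^|L|]
```

As an illustration, if $|L| = 12$ and $|D| = 768$ (as in a BERT-base encoder), each $c^l$ would be 64-dimensional, so the concatenated $z$ recovers the original 768 dimensions.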
3.4 Correlation Regularization Loss
The correlation regularization (CR) loss restrains a
pair of features from each adjacent layer from be-
ing similar, following the intuition of an ensemble
where its performance boost springs from various
decisions (Kuncheva and Whitaker,2003;Gashler
et al.,2008). Specifically, it encourages adjacent
layers to activate different dimensions given the
same input. First, we define the correlation in dimension $d$ between adjacent layers $l$ and $l+1$ as follows:

$$\mathrm{cor}^{d}_{(l,l+1)} = \frac{\sum_{b} c^{l}_{b,d} \cdot c^{l+1}_{b,d}}{\sqrt{\sum_{b} \big(c^{l}_{b,d}\big)^2}\,\sqrt{\sum_{b} \big(c^{l+1}_{b,d}\big)^2}}, \quad (3)$$

where $d$ indicates the index of the hidden embedding dimension ($d \in \{1, \dots, |D|/|L|\}$, since $c^l \in \mathbb{R}^{|D|/|L|}$) and $b$ refers to a data index of the augmented batch $\bar{\mathcal{B}}$.

Then, the CR loss selects a strongly correlated dimension set $S$ by picking the dimensions whose correlation exceeds the pre-set margin value $m$, and decorrelates the set $S$, iterating over every pair of adjacent layers:

$$S = \{d : \mathrm{cor}^{d}_{(l,l+1)} \geq m\}, \qquad \mathcal{L}_{\mathrm{CR}} = \sum_{l} \sum_{d \in S} \mathrm{cor}^{d}_{(l,l+1)}. \quad (4)$$

Finally, the overall loss term for LaCL can be described as follows:

$$\mathcal{L}_{\mathrm{LaCL}} = \mathcal{L}_{\mathrm{SCL}} + \lambda_1 \mathcal{L}_{\mathrm{CR}}, \quad (5)$$

where $\lambda_1$ denotes the weight for the CR loss.
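A minimal sketch of the CR loss in PyTorch follows, assuming the list of per-layer compact representations $c^l$ is available for the batch; the margin value shown is a hypothetical placeholder, not the paper's hyperparameter.

```python
import torch

def correlation_regularization_loss(compact, margin=0.5):
    """CR loss (cf. Eqs. 3-4, sketch): penalize dimensions that are strongly
    correlated between adjacent layers' compact representations.

    compact: list of |L| tensors of shape (B, D/L), the per-layer c^l.
    margin:  pre-set threshold m selecting the set S (hypothetical value).
    """
    loss = compact[0].new_zeros(())
    for c_l, c_next in zip(compact[:-1], compact[1:]):
        # per-dimension correlation between layers l and l+1 over the batch
        num = (c_l * c_next).sum(dim=0)
        denom = c_l.pow(2).sum(dim=0).sqrt() * c_next.pow(2).sum(dim=0).sqrt()
        cor = num / denom.clamp(min=1e-12)
        in_set = cor >= margin                 # strongly correlated set S
        loss = loss + cor[in_set].sum()        # minimize their correlation
    return loss
```

During training, this term would be combined with the contrastive objective as in Eq. (5), e.g. `total_loss = scl_loss + lambda_1 * cr_loss`.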
3.5 Classification & OOD Scoring
Since there is no task-specific final layer (i.e., clas-
sification layer for cross-entropy loss) in LaCL,
classification and anomaly detection are conducted
via a cosine similarity scoring function (Tack et al.,
2020). Employing the cosine similarity scoring
function in LaCL is straightforward and shows
good compatibility, as the model trained with con-
trastive learning can measure meaningful cosine
similarity between data instances.
For input $x$, we first extract the implicit ensemble representation $z(x)$ and find the nearest-neighbor instance $x_{nn}$, i.e., $\max_{nn} \mathrm{sim}(z(x), z(x_{nn}))$, from the training dataset. Then we classify the label of $x$ as the label of the nearest neighbor, $y_{nn}$. For OOD detection, we use the similarity between the input and its nearest neighbor as the score:

$$\mathrm{Score}(x) = \mathrm{sim}(z(x), z(x_{nn})). \quad (6)$$
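The classification and scoring step could be sketched as follows, assuming the ensemble representations and labels of the training set have been precomputed; all names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_and_score(z_query, z_train, y_train):
    """Cosine-similarity nearest-neighbor classification and OOD score (cf. Eq. 6).

    z_query: (Q, D) ensemble representations of test inputs.
    z_train: (N, D) ensemble representations of the training set.
    y_train: (N,)   training labels.
    Returns predicted labels and scores (higher = more in-distribution).
    """
    sim = F.normalize(z_query, dim=-1) @ F.normalize(z_train, dim=-1).t()
    score, nn_idx = sim.max(dim=-1)   # similarity to the nearest training neighbor
    pred = y_train[nn_idx]            # classify as the neighbor's label
    return pred, score
```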