Enhancing Out-of-Distribution Detection in Natural Language
Understanding via Implicit Layer Ensemble
Hyunsoo Cho†, Choonghyun Park†, Jaewook Kang♭,
Kang Min Yoo‡♭†, Taeuk Kim§∗, Sang-goo Lee†
†Seoul National University, ‡NAVER AI Lab, ♭NAVER CLOVA, §Hanyang University
{johyunsoo,pch330,sglee}@europa.snu.ac.kr
{jaewook.kang, kangmin.yoo}@navercorp.com
kimtaeuk@hanyang.ac.kr
Abstract
Out-of-distribution (OOD) detection aims to
discern outliers from the intended data distri-
bution, which is crucial to maintaining high
reliability and a good user experience. Most
recent studies in OOD detection utilize the
information from a single representation that
resides in the penultimate layer to determine
whether the input is anomalous or not. Al-
though such a method is straightforward, the
potential of diverse information in the inter-
mediate layers is overlooked. In this paper,
we propose a novel framework based on con-
trastive learning that encourages intermediate
features to learn layer-specialized representa-
tions and assembles them implicitly into a sin-
gle representation to absorb rich information
in the pre-trained language model. Extensive experiments on various intent classification and OOD datasets demonstrate that our approach is significantly more effective than other works. The source code for our model is available online.1
1 Introduction
Natural language understanding (NLU) in dialog systems, which is often formalized as a classification task that identifies the intention behind user input, is a vital component, as its decisions propagate to the downstream pipelines. Numerous works have
achieved immense success on sundry tasks (e.g.,
intention classification, NLI, QA) reaching parity
with human performance (Wang et al.,2019). De-
spite their success in many different benchmarks,
neural models are known to be vulnerable to test
inputs from an unknown distribution (Hendrycks
and Gimpel,2017;Hein et al.,2019), commonly
referred to as outliers, since they depend strongly
on the closed-world assumption (i.e., the i.i.d. assumption). Thus, out-of-distribution (OOD) detection
(Aggarwal, 2017), which aims to discern outliers from the training distribution, is an essential research problem for ensuring a high-quality user experience and maintaining strong reliability, as systems in the wild ceaselessly encounter myriad unseen data.

*Corresponding author.
1https://github.com/HyunsooCho77/LaCL-official

Figure 1: Layer-wise performances and their explicit ensemble (Shen et al., 2021) performance on BERT-base. An explicit ensemble often leads to worse AUROC (higher is better) than using a single well-performing layer. Detailed explanations of the setting and baseline model are given in Sec. 4.2.1 and Sec. 4.3, respectively.
The most prevailing paradigm in OOD detection
is to extract and score. Namely, it extracts the
representation of the input from a neural model and
passes it to a pre-defined scoring function. Then,
the scoring function gauges the appropriateness of
the input based on the extracted feature and decides
whether the input is from the normal distribution.
The most common rule of thumb for extracting a representation from a neural model is to use the last layer, a simple and intuitive way to obtain a holistic representation that is universally adopted across broad areas of machine learning.
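To make the extract-and-score pipeline concrete, here is a minimal, generic sketch (not the method of any particular paper discussed here) that extracts logits from a HuggingFace-style classifier and scores them with the maximum softmax probability (Hendrycks and Gimpel, 2017); the model interface and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_and_score(model, input_ids, attention_mask, threshold=0.5):
    """Generic extract-and-score OOD detection (illustrative sketch).

    1) Extract logits/features from the neural model.
    2) Map them to a scalar confidence with a scoring function
       (here: maximum softmax probability).
    3) Flag the input as OOD when the score falls below a threshold.
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    probs = F.softmax(outputs.logits, dim=-1)
    score = probs.max(dim=-1).values   # MSP confidence per input
    is_ood = score < threshold         # low confidence -> likely outlier
    return score, is_ood
```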
Meanwhile, previous studies (Tenney et al.,
2019;Clark et al.,2019) revealed that the middle
layers of the language model also conceal copi-
ous information. For instance, prior studies on
language model probing suggest that syntactic lin-
guistic knowledge is most prominent in the middle
layers (Hewitt and Manning,2019;Goldberg,2019;
Jawahar et al.,2019), and semantic knowledge in
BERT is spread in all layers widely (Tenney et al.,
2019). In this regard, leveraging intermediate lay-
ers can lead to a better OOD detection performance,
as they retain some complementary information to
the last layer feature, which might be beneficial in
discriminating outliers. Several studies (Shen et al.,
2021;Sastry and Oore,2020;Lee et al.,2018b)
have shown empirical evidence that intermediate
representations are indeed beneficial in detecting
outliers. Precisely, they attempted to utilize middle
layers via naïvely aggregating the individual result
of every single intermediate feature explicitly.
Although previous studies have shown the poten-
tial of intermediate layer representations in OOD
detection, we confirmed that the aforementioned
naïve ensemble scheme spawns several problems (Fig. 1 illustrates the layer-wise OOD performance and the explicit ensemble performance on two different datasets). The first problem we observed is that neither the ensemble result (red bar) nor the last layer is guaranteed to perform best among all layers, depending on the setting. Such a phenomenon raises the need for a more elaborate approach that derives a more meaningful ensemble representation from the various layer representations, rather than a simple summation or the selection of a single layer. Secondly, even when the explicit ensemble yields sound performance, it requires multiple computations of the scoring function by design. Thus, the explicit ensemble inevitably increases detection time, which is a critical shortcoming in OOD detection, where swift and precise decision-making is the cornerstone.
To remedy the limitations of the explicit en-
semble schemes, we propose a novel frame-
work dubbed Layer-agnostic Contrastive Learning
(LaCL). Our framework is inspired by the founda-
tion of an ensemble, which seeks a more calibrated
output by combining heterogeneous decisions from
multiple models (Kuncheva and Whitaker,2003;
Gashler et al.,2008). Specifically, LaCL regards in-
termediate layers as independent decision-makers
and assembles them into a single vector to yield
a more accurate prediction: LaCL makes middle-
layer representations richer and more diverse by in-
jecting the advantage of contrastive learning (CL)
into intermediate layers while discouraging inter-layer representations from becoming similar through an additional regularization loss. Then, LaCL assembles them into a single ensemble representation implicitly to circumvent multiple computations of the scoring function.
We demonstrate the effectiveness of our ap-
proach in 9 different OOD scenarios where LaCL
consistently surpasses other competitive works and
their explicit ensemble performance by a signifi-
cant margin. Moreover, we conducted an in-depth
analysis of LaCL to elucidate its behavior in con-
junction with our intuition.
2 Related Work
OOD detection.
Methodologies in OOD detec-
tion can be divided into supervised (Hendrycks
et al.,2019;Lee et al.,2018a;Dhamija et al.,2018)
and unsupervised settings according to the pres-
ence of training data from OOD. Since the scope of OOD covers a nigh-infinite space, gathering data that spans the whole OOD space is infeasible. For
this realistic reason, the most recent OOD detec-
tion studies generally discriminate OOD input in
an unsupervised manner, including this work. Nu-
merous branches of machine learning tactics are
employed for unsupervised OOD detection: gen-
erating pseudo-OOD data (Chen and Yu,2021;
Zheng et al.,2020), Bayesian methods (Malinin
and Gales,2018), self-supervised learning based
approaches (Moon et al.,2021;Manolache et al.,
2021;Li et al.,2021;Zhou et al.,2021;Zeng et al.,
2021;Zhan et al.,2021), and novel scoring func-
tions which measure the uncertainty of the given
input (Hendrycks and Gimpel,2017;Lee et al.,
2018b;Liu et al.,2020;Tack et al.,2020).
Contrastive learning & OOD detection.
Among the numerous approaches mentioned, con-
trastive learning (CL) based methods (Chen et al.,
2020;Zbontar et al.,2021;Grill et al.,2020) are
recently spurring predominant interest in OOD de-
tection research. The superiority of CL in OOD
detection comes from the fact that it can guide a
neural model to learn semantic similarity within
data instances. Such property is also precious for
unsupervised OOD detection, as there is no accessi-
ble clue regarding outliers or abnormal distribution.
Despite its potential, early works utilized CL mainly in the computer vision field (Cho et al., 2021; Sehwag et al., 2021; Tack et al., 2020; Winkens et al., 2020) due to its heavy reliance on data augmentation. However, it is now also widely used
in various NLP applications with the help of re-
cent progress (Li et al.,2021;Liu et al.,2021;Kim
et al.,2021;Carlsson et al.,2020;Gao et al.,2021;
Sennrich et al.,2016). Specifically, Li et al. (2021)
verified that CL is also helpful in the NLP field, and
Zhou et al. (2021); Zeng et al. (2021) redesigned
the contrastive-learning objective into a more ap-
propriate form for OOD detection.
Potential of intermediate representation.
The
leading driver of the recent upheaval in NLP is
the pre-trained language model (PLM), such as
BERT (Devlin et al.,2019) and GPT (Radford
et al., 2018), which is trained on a large-scale dataset with a transformer-based architecture (Vaswani et al.,
2017). Numerous studies attempted to reveal the
role and characteristics of each layer in PLMs and
verified that diverse information is concealed in
the middle layer, which is now a pervasive notion
in the machine learning community. For instance,
Tenney et al. (2019) showed that the different lay-
ers of the BERT network could resolve syntactic
and semantic structure within a sentence. Clark
et al. (2019) proposed an attention-based probing
classifier leveraging syntactic information in the
middle layer of BERT. Several studies (Shen et al.,
2021;Sastry and Oore,2020;Lee et al.,2018b)
have shown the potential of intermediate represen-
tations in OOD detection by explicitly aggregating
the individual result of every single intermediate
feature.
3 Layer-agnostic Contrastive Learning
3.1 Intuition
The prime objective of our framework is to assem-
ble rich information in the entire layers into a single
ensemble representation to derive a more reliable
decision. Inspired by the foundation of ensemble
learning, which seeks better predictive performance
by combining the predictions from multiple mod-
els, we regard each intermediate layer as an inde-
pendent model (or decision maker). To make each
layer a better decision-maker, LaCL injects a sound representation-learning signal (i.e., supervised contrastive learning) into every layer by training the objective function in a layer-agnostic manner that engages every layer more directly. Additionally, we propose a correlation regularization loss (CR loss), which decorrelates pairs of strongly correlated adjacent representations to encourage each layer to learn layer-specialized representations from its complementary information. Then, the global compression layer (GCL) implicitly assembles the various features from each layer into a single calibrated ensemble representation. In the following subsections, we explain the components of our model in detail.
3.2 Supervised Contrastive Learning
Supervised contrastive learning (SCL) is a super-
vised variant of vanilla contrastive learning, which
employs label information of the input to group
samples into known classes more tightly. Thus,
SCL can learn data-label relationships as well as
data-data relationships as in CL.
In SCL, each batch $\mathcal{B} = \{(x_b, y_b)\}_{b=1}^{|\mathcal{B}|}$ in the dataset, where $x_b$ and $y_b$ denote a sentence and its label for index $b$ respectively, generates an augmented batch $\bar{\mathcal{B}} = \{(\bar{x}_b, \bar{y}_b)\}_{b=1}^{|\bar{\mathcal{B}}|}$, where the labels of the augmented views are preserved as the original. The augmented batch $\bar{\mathcal{B}}$ consists of two augmented inputs, $\bar{x}_{2b-1} = t_1(x_b)$ and $\bar{x}_{2b} = t_2(x_b)$, where $t_1, t_2$ indicate the data augmentation functions specified in Section 3.6. Then, $(\bar{x}_{2b-1}, \bar{x}_{2b})$ are passed through the PLM and the projector, generating latent vectors $(z_{2b-1}, z_{2b})$ that are used to calculate the supervised contrastive loss:

$$\mathcal{L}_{\mathrm{SCL}} = -\log \sum_{j \in P(i)} \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{|\bar{\mathcal{B}}|} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}, \quad (1)$$

where $P(i) = \{p \in \bar{\mathcal{B}} : \bar{y}_p = \bar{y}_i\}$ is the set of indices of all positives in the augmented batch for query index $i$, and $\tau$ denotes the temperature hyperparameter.
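For reference, a minimal PyTorch sketch of a supervised contrastive loss over the augmented batch is given below. It follows the common formulation that averages the log-probability over positives and may differ in normalization details from Eq. (1); the tensor names and shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over an augmented batch (sketch).

    z:      (2B, D) latent vectors from the projector (two views per input).
    labels: (2B,)   class labels, identical for the two views of each input.
    """
    z = F.normalize(z, dim=-1)
    sim = (z @ z.t()) / temperature                        # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # average log-probability of the positives per anchor, then over anchors
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```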
3.3 Global Compression Layer
The global compression layer (GCL) is a two-layer MLP that is directly connected to all layers to assemble the intermediate representations into a single representation $z$. GCL can be viewed as a particular type of projection head in contrastive learning. By linking the projection head to all layers, GCL facilitates layer-agnostic training that engages every middle layer in the training objective directly.

Figure 2: Overall structure of Layer-agnostic Contrastive Learning (LaCL). The global compression layer trains the SCL loss in a layer-agnostic manner by engaging all layers in the CL task, and the correlation regularization (CR) loss decorrelates each intermediate layer to avoid overlapping information between layers.

The process of extracting the final latent vector $z$ with GCL is as follows (the batch index $b$ is omitted for brevity from now on). First, each layer $l$ ($l \in \{1, \dots, |L|\}$, where $|L|$ refers to the number of layers) in the PLM outputs token embeddings $H^l = [h^l_1, h^l_2, \cdots, h^l_{\mathrm{len}(x)}]$ for a sentence $x$. Then we combine the token embeddings $H^l$ into a single vector $h^l = \mathrm{pool}(H^l)$ by applying a pooling function (i.e., mean pooling). Lastly, GCL receives the pooled token embedding of each layer, $h^l \in \mathbb{R}^{|D|}$, as input and outputs a compact low-dimensional representation $c^l \in \mathbb{R}^{|D|/|L|}$. We then concatenate all compact representations $c^l$ to generate a single sentence representation $z$ from $x$:

$$z(x) = [\,c^1 \oplus c^2 \oplus c^3 \oplus \cdots \oplus c^{|L|}\,], \quad (2)$$

where $\oplus$ indicates concatenation and $z \in \mathbb{R}^{|D|}$.

LaCL trains the SCL loss with the final representation $z$ from GCL, which carries information from all layers.
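A possible PyTorch sketch of GCL is shown below, assuming a BERT-style encoder that exposes per-layer hidden states. Whether the two-layer MLP is shared across layers is not specified above, so sharing it here is an assumption, as are the module and argument names.

```python
import torch
import torch.nn as nn

class GlobalCompressionLayer(nn.Module):
    """Pools each encoder layer's token embeddings, compresses them to
    |D|/|L| dimensions with a two-layer MLP, and concatenates the results
    into a single ensemble representation z (cf. Eq. 2)."""

    def __init__(self, hidden_dim: int, num_layers: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // num_layers),
        )

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of |L| tensors of shape (B, T, D), one per layer
        mask = attention_mask.unsqueeze(-1).float()
        compact = []
        for layer_h in hidden_states:
            pooled = (layer_h * mask).sum(1) / mask.sum(1)  # mean pooling -> h^l
            compact.append(self.mlp(pooled))                # c^l in R^{|D|/|L|}
        return torch.cat(compact, dim=-1)                   # z = [c^1 ⊕ ... ⊕ c^|L|]
```

As an illustration, if $|L| = 12$ and $|D| = 768$ (as in a BERT-base encoder), each $c^l$ would be 64-dimensional, so the concatenated $z$ recovers the original 768 dimensions.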
3.4 Correlation Regularization Loss
The correlation regularization (CR) loss restrains a
pair of features from each adjacent layer from be-
ing similar, following the intuition of an ensemble
where its performance boost springs from various
decisions (Kuncheva and Whitaker,2003;Gashler
et al.,2008). Specifically, it encourages adjacent
layers to activate different dimensions given the
same input. First, we define the correlation in dimension $d$ between adjacent layers $l$ and $l+1$ as follows:

$$\mathrm{cor}^{d}_{(l,l+1)} = \frac{\sum_{b} c^{l}_{b,d} \cdot c^{l+1}_{b,d}}{\sqrt{\sum_{b} \big(c^{l}_{b,d}\big)^2}\,\sqrt{\sum_{b} \big(c^{l+1}_{b,d}\big)^2}}, \quad (3)$$

where $d$ indicates the index of the hidden embedding dimension ($d \in \{1, \dots, |D|/|L|\}$, since $c^l \in \mathbb{R}^{|D|/|L|}$) and $b$ refers to a data index of the augmented batch $\bar{\mathcal{B}}$.

Then, the CR loss selects a strongly correlated dimension set $S$ by picking the dimensions whose correlation exceeds the pre-set margin value $m$, and decorrelates the set $S$, iterating over every pair of adjacent layers:

$$S = \{d : \mathrm{cor}^{d}_{(l,l+1)} \geq m\}, \qquad \mathcal{L}_{\mathrm{CR}} = \sum_{l} \sum_{d \in S} \mathrm{cor}^{d}_{(l,l+1)}. \quad (4)$$

Finally, the overall loss term for LaCL can be described as follows:

$$\mathcal{L}_{\mathrm{LaCL}} = \mathcal{L}_{\mathrm{SCL}} + \lambda_1 \mathcal{L}_{\mathrm{CR}}, \quad (5)$$

where $\lambda_1$ denotes the weight for the CR loss.
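A minimal sketch of the CR loss in PyTorch follows, assuming the list of per-layer compact representations $c^l$ is available for the batch; the margin value shown is a hypothetical placeholder, not the paper's hyperparameter.

```python
import torch

def correlation_regularization_loss(compact, margin=0.5):
    """CR loss (cf. Eqs. 3-4, sketch): penalize dimensions that are strongly
    correlated between adjacent layers' compact representations.

    compact: list of |L| tensors of shape (B, D/L), the per-layer c^l.
    margin:  pre-set threshold m selecting the set S (hypothetical value).
    """
    loss = compact[0].new_zeros(())
    for c_l, c_next in zip(compact[:-1], compact[1:]):
        # per-dimension correlation between layers l and l+1 over the batch
        num = (c_l * c_next).sum(dim=0)
        denom = c_l.pow(2).sum(dim=0).sqrt() * c_next.pow(2).sum(dim=0).sqrt()
        cor = num / denom.clamp(min=1e-12)
        in_set = cor >= margin                 # strongly correlated set S
        loss = loss + cor[in_set].sum()        # minimize their correlation
    return loss
```

During training, this term would be combined with the contrastive objective as in Eq. (5), e.g. `total_loss = scl_loss + lambda_1 * cr_loss`.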
3.5 Classification & OOD Scoring
Since there is no task-specific final layer (i.e., clas-
sification layer for cross-entropy loss) in LaCL,
classification and anomaly detection are conducted
via a cosine similarity scoring function (Tack et al.,
2020). Employing the cosine similarity scoring
function in LaCL is straightforward and shows
good compatibility, as the model trained with con-
trastive learning can measure meaningful cosine
similarity between data instances.
For input $x$, we first extract the implicit ensemble representation $z(x)$ and find the nearest-neighbor instance $x_{nn}$, i.e., $\max_{nn} \mathrm{sim}(z(x), z(x_{nn}))$, from the training dataset. Then we classify the label of $x$ as the label of the nearest neighbor, $y_{nn}$. For OOD detection, we use the similarity between the input and its nearest neighbor as the score:

$$\mathrm{Score}(x) = \mathrm{sim}(z(x), z(x_{nn})). \quad (6)$$
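The classification and scoring step could be sketched as follows, assuming the ensemble representations and labels of the training set have been precomputed; all names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_and_score(z_query, z_train, y_train):
    """Cosine-similarity nearest-neighbor classification and OOD score (cf. Eq. 6).

    z_query: (Q, D) ensemble representations of test inputs.
    z_train: (N, D) ensemble representations of the training set.
    y_train: (N,)   training labels.
    Returns predicted labels and scores (higher = more in-distribution).
    """
    sim = F.normalize(z_query, dim=-1) @ F.normalize(z_train, dim=-1).t()
    score, nn_idx = sim.max(dim=-1)   # similarity to the nearest training neighbor
    pred = y_train[nn_idx]            # classify as the neighbor's label
    return pred, score
```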