CADet: Fully Self-Supervised Out-Of-Distribution
Detection With Contrastive Learning
Charles Guille-Escuret
ServiceNow Research, Mila,
Université de Montréal
guillech@mila.quebec
Pau Rodriguez
ServiceNow Research
pau.rodriguez@servicenow.com
David Vazquez
ServiceNow Research
david.vazquez@servicenow.com
Ioannis Mitliagkas
Mila, Université de Montréal,
Canada CIFAR AI chair
ioannis@mila.quebec
Joao Monteiro
ServiceNow Research
joao.monteiro@servicenow.com
Abstract
Handling out-of-distribution (OOD) samples has become a major challenge in the
real-world deployment of machine learning systems. This work explores the use
of self-supervised contrastive learning for the simultaneous detection of two types
of OOD samples: unseen classes and adversarial perturbations. First, we pair
self-supervised contrastive learning with the maximum mean discrepancy (MMD)
two-sample test. This approach enables us to robustly test whether two independent
sets of samples originate from the same distribution, and we demonstrate its
effectiveness by discriminating between CIFAR-10 and CIFAR-10.1 with higher
confidence than previous work. Motivated by this success, we introduce CADet
(Contrastive Anomaly Detection), a novel method for OOD detection of single
samples. CADet draws inspiration from MMD, but leverages the similarity between
contrastive transformations of the same sample. CADet outperforms existing
adversarial detection methods in identifying adversarially perturbed samples on
ImageNet and achieves comparable performance to unseen label detection methods
on two challenging benchmarks: ImageNet-O and iNaturalist. Significantly, CADet
is fully self-supervised and requires neither labels for in-distribution samples nor
access to OOD examples.1
¹Our code to compute CADet scores is publicly available as an OpenOOD fork at
https://github.com/charlesGE/OpenOOD-CADet.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

arXiv:2210.01742v4 [cs.LG] 9 Dec 2024

1 Introduction

While modern machine learning systems have achieved countless successful real-world applications,
handling out-of-distribution (OOD) inputs remains a tough challenge of significant importance. The
problem is especially acute for high-dimensional problems like image classification. Models are
typically trained in a closed-world setting but are inevitably faced with novel input classes when deployed
in the real world. The impact can range from a displeasing customer experience to dire consequences
in the case of safety-critical applications such as autonomous driving [31] or medical analysis [55].
Although achieving high accuracy against all meaningful distributional shifts is the most desirable
solution, it is particularly challenging. An efficient method to mitigate the consequences of unexpected
inputs is to perform anomaly detection, which allows the system to anticipate its inability to process
unusual inputs and react adequately.
Anomaly detection methods generally rely on one of three types of statistics: features, logits, and
softmax probabilities, with some systems leveraging a mix of these [66]. An anomaly score $f(x)$ is
computed, and then detection with threshold $\tau$ is performed based on whether $f(x) > \tau$. The goal of
a detection system is to find an anomaly score that efficiently discriminates between in-distribution
and out-of-distribution samples. However, the common problem of these systems is that different
distributional shifts will unpredictably affect these statistics. Accordingly, detection systems either
achieve good performance on specific types of distributions or require tuning on OOD samples. In
both cases, their practical use is severely limited. Motivated by these issues, recent work has tackled
the challenge of designing detection systems for unseen classes without prior knowledge of the
unseen label set or access to OOD samples [68, 63, 66].
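To make this thresholding scheme concrete, the sketch below (our illustration, not taken from the paper) flags a sample as OOD when its anomaly score $f(x)$ exceeds a threshold $\tau$ calibrated on held-out in-distribution scores. The quantile-based calibration and all names are assumptions:

```python
import numpy as np

def calibrate_threshold(scores_in: np.ndarray, target_fpr: float = 0.05) -> float:
    """Choose tau as the (1 - target_fpr) quantile of in-distribution anomaly
    scores, so roughly target_fpr of in-distribution samples get flagged."""
    return float(np.quantile(scores_in, 1.0 - target_fpr))

def is_ood(score: float, tau: float) -> bool:
    """Flag a sample as out-of-distribution when its anomaly score f(x) exceeds tau."""
    return score > tau

# Hypothetical usage with made-up scores from some anomaly-score function f.
rng = np.random.default_rng(0)
scores_in = rng.normal(size=1000)           # scores of held-out in-distribution samples
tau = calibrate_threshold(scores_in)        # threshold at a 5% false-positive rate
print(is_ood(3.2, tau), is_ood(-0.1, tau))  # True, False (with high probability)
```

Calibrating $\tau$ on in-distribution data only is what allows detection without access to OOD samples, which is the setting targeted in this work.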
We first investigate the use of the maximum mean discrepancy (MMD) two-sample test [19] in
conjunction with self-supervised contrastive learning to assess whether two sets of samples have been
drawn from the same distribution. Motivated by the strong testing power of this method, we then
introduce a statistic inspired by MMD that leverages contrastive transformations. Based on this
statistic, we propose CADet (Contrastive Anomaly Detection), which is able to detect OOD samples
from single inputs and performs well on both label-based and adversarial detection benchmarks,
without requiring access to any OOD samples to train or tune the method.

Only a few works have addressed these tasks simultaneously. These works either focus on particular
in-distribution data such as medical imaging for specific diseases [65] or evaluate their performance
on datasets with very distant classes such as CIFAR10 [32], SVHN [47], and LSUN [73], resulting in
simple benchmarks that do not translate to general real-world applications [33, 51].
Contributions Our main contributions are as follows:

• We use similarity functions learned by self-supervised contrastive learning with MMD to
  show that the test sets of CIFAR10 and CIFAR10.1 [52] have different distributions.
• We propose a novel improvement to MMD and show it can also be used to confidently detect
  distributional shifts when given a small number of samples.
• We introduce CADet, a fully self-supervised method for OOD detection inspired by MMD,
  and show it outperforms current methods in adversarial detection tasks while performing
  well on label-based OOD detection.
The outline is as follows: in Section 2, we discuss relevant previous work. Section 3 describes the
self-supervised contrastive method based on SimCLRv2 [5] used in this work. Section 4 explores
the application of learned similarity functions in conjunction with MMD to verify whether two
independent sets of samples are drawn from the same distribution. Section 5 presents CADet and
evaluates its empirical performance. Finally, we discuss results and limitations in Section 6.
2 Related work
We propose a self-supervised contrastive method for anomaly detection (both unknown classes and
adversarial attacks) inspired by MMD. Thus, our work intersects with the MMD, label-based OOD
detection, adversarial detection, and self-supervised contrastive learning literature.
MMD two-sample test has been extensively studied [19, 67, 18, 62, 8, 29], though this is, to the best
of our knowledge, the first time a similarity function trained via contrastive learning is used in
conjunction with MMD. Liu et al. [35] use MMD with a deep kernel trained on a fraction of the
samples to argue that CIFAR10 and CIFAR10.1 have different test distributions. We build upon that
work by confirming their finding with higher confidence levels, using fewer samples. Dong et al. [11]
explored applications of MMD to OOD detection.
Label-based OOD detection methods discriminate samples that differ from those in the training
distribution. We focus on unsupervised OOD detection in this work, i.e., we do not assume access
to data labeled as OOD. Unsupervised OOD detection methods include density-based [74, 45, 46, 7,
12, 53, 59, 17, 37, 10], reconstruction-based [56, 75, 9, 50, 49, 7], one-class classifiers [57, 54],
self-supervised [15, 25, 2, 63], and supervised approaches [34, 23], though some works do not fall
into any of these categories [66, 60]. We refer to Yang et al. [71] for an overview of the many recent
works in this field.
Adversarial detection discriminates adversarial samples from the original data. Adversarial samples
are generated by minimally perturbing actual samples to produce a change in the model's output,
such as a misclassification. Most works rely on the knowledge of some attacks for training [1, 43, 13,
39, 76, 48, 40], with the exception of [27].
Self-supervised contrastive learning methods [69, 22, 4, 5] are commonly used to pre-train a model
from unlabeled data to solve a downstream task such as image classification. Contrastive learning
relies on instance discrimination trained with a contrastive loss [21] such as InfoNCE [20].
Contrastive learning for OOD detection aims to find good representations for detecting OOD
samples in a supervised [36, 30] or unsupervised [68, 44, 58] setting. Perhaps the closest work
in the literature is CSI [63], which found SimCLR features to have good discriminative power for
unknown-class detection and leveraged similarities between transformed samples in its score.
However, this method is not well suited for adversarial detection. CSI ignores the similarities
between different transformations of the same sample, an essential component of adversarial
detection (see Section 6.2). In addition, CSI scales its score with the norm of input representations.
While efficient on samples with unknown classes, this is unreliable on adversarial perturbations, which
typically increase representation norms. Finally, we failed to scale CSI to ImageNet.
3 Contrastive model
We build our model on top of SimCLRv2 [5] for its simplicity and efficiency. It is composed of an
encoder backbone network $f_\theta$ as well as a 3-layer contrastive head $h_\theta$. Given an in-distribution
$\mathcal{X}$, a similarity function $\mathrm{sim}$, and a distribution of training transformations $T_{\mathrm{train}}$, the goal is
to simultaneously maximize

$$\mathbb{E}_{x \sim \mathcal{X};\, t_0, t_1 \sim T_{\mathrm{train}}}\left[\mathrm{sim}\big(h_\theta(f_\theta(t_0(x))),\, h_\theta(f_\theta(t_1(x)))\big)\right],$$

and minimize

$$\mathbb{E}_{x, y \sim \mathcal{X};\, t_0, t_1 \sim T_{\mathrm{train}}}\left[\mathrm{sim}\big(h_\theta(f_\theta(t_0(x))),\, h_\theta(f_\theta(t_1(y)))\big)\right],$$

i.e., we want to learn representations in which random transformations of the same example are close
while random transformations of different examples are distant.
To achieve this, given an input batch $\{x_i\}_{i=1,\dots,N}$, we compute the set $\{x_i^{(j)}\}_{j=0,1;\, i=1,\dots,N}$ by
applying two transformations independently sampled from $T_{\mathrm{train}}$ to each $x_i$. We then compute the
embeddings $z_i^{(j)} = h_\theta(f_\theta(x_i^{(j)}))$ and apply the following contrastive loss:

$$\mathcal{L}(z) = -\sum_{i=1,\dots,N} \log \frac{u_{i,i}}{\sum_{j \in \{1,\dots,N\}} (u_{i,j} + v_{i,j})}, \qquad (1)$$

where

$$u_{i,j} = e^{\mathrm{sim}(z_i^{(0)},\, z_j^{(1)})/\tau} \quad \text{and} \quad v_{i,j} = \mathbb{1}_{i \neq j}\, e^{\mathrm{sim}(z_i^{(0)},\, z_j^{(0)})/\tau},$$

$\tau$ is the temperature hyperparameter, and $\mathrm{sim}(x, y) = \frac{x^\top y}{\|x\|_2 \|y\|_2}$ is the cosine similarity.
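For concreteness, here is a minimal PyTorch sketch of the loss in Eq. (1), assuming the embeddings of the two views have already been computed; the function and variable names are ours, not the authors':

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z0: torch.Tensor, z1: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eq. (1): z0, z1 are (N, d) embeddings of the two views of each batch sample."""
    z0 = F.normalize(z0, dim=1)    # after normalization, cosine similarity is a dot product
    z1 = F.normalize(z1, dim=1)
    u = torch.exp(z0 @ z1.T / tau)                       # u[i, j] = exp(sim(z0_i, z1_j) / tau)
    v = torch.exp(z0 @ z0.T / tau)                       # v[i, j] = exp(sim(z0_i, z0_j) / tau)
    v = v * (1 - torch.eye(len(z0), device=z0.device))   # indicator 1_{i != j}: zero the diagonal
    denom = (u + v).sum(dim=1)                           # sum over j of u[i, j] + v[i, j]
    return -torch.log(u.diag() / denom).sum()            # -sum_i log(u[i, i] / denom_i)
```

Note that Eq. (1) anchors view 0 of each sample against both views of all other samples; symmetrizing over the two views is a common SimCLR variant but is not assumed here.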
Hyperparameters: We follow as closely as possible the setting from SimCLRv2, with a few modifications
to adapt to hardware limitations. In particular, we use the LARS optimizer [72] with learning
rate 1.2, momentum 0.9, and weight decay $10^{-4}$. Iteration-wise, we scale up the learning rate for
the first 40 epochs linearly, then use an iteration-wise cosine decaying schedule until epoch 800,
with temperature $\tau = 0.1$. We train on 8 V100 GPUs with an accumulated batch size of 1024. We
compute the contrastive loss on all batch samples by aggregating the embeddings computed by each
GPU. We use synchronized BatchNorm and fp32 precision and do not use a memory buffer. We use
the same set of transformations, i.e., Gaussian blur and horizontal flip with probability 0.5, color
jittering with probability 0.8, random crop with scale uniformly sampled in $[0.08, 1]$, and grayscale
with probability 0.2.
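For reference, the augmentation distribution $T_{\mathrm{train}}$ described above can be sketched with torchvision as follows; parameters not stated in the text (color-jitter strengths, blur kernel size, crop resolution) follow common SimCLR defaults and are assumptions:

```python
from torchvision import transforms

# Sketch of the training augmentations listed above. Color-jitter strengths,
# blur kernel size, and crop resolution are not stated in the text; the values
# below follow common SimCLR defaults and are assumptions.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),     # random crop, scale in [0.08, 1]
    transforms.RandomHorizontalFlip(p=0.5),                   # horizontal flip, probability 0.5
    transforms.RandomApply(
        [transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8), # color jittering, probability 0.8
    transforms.RandomGrayscale(p=0.2),                        # grayscale, probability 0.2
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23)], p=0.5),    # Gaussian blur, probability 0.5
    transforms.ToTensor(),
])
```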
For computational simplicity and comparison with previous work, we use a ResNet50 encoder
architecture with final features of size 2048. Following SimCLRv2, we use a three-layer fully
connected contrastive head with hidden layers of width 2048, using ReLU activations and BatchNorm,
and set the last layer projection to dimension 128. For evaluation, we use the features produced by
the encoder without the contrastive head. We do not, at any point, use supervised fine-tuning.
4 MMD two-sample test
The Maximum Mean Discrepancy (MMD) is a statistic used in the MMD two-sample test to
assess whether two sets of samples $S_{\mathbb{P}}$ and $S_{\mathbb{Q}}$ are drawn from the same distribution. It estimates the
expected difference between the intra-set distances and the across-set distances.
Definition 4.1 (Gretton et al. [19]). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be the kernel of a reproducing Hilbert
space $\mathcal{H}_k$, with feature maps $k(\cdot, x) \in \mathcal{H}_k$. Let $X, X' \sim \mathbb{P}$ and $Y, Y' \sim \mathbb{Q}$. Under mild integrability
conditions,

$$\mathrm{MMD}(\mathbb{P}, \mathbb{Q}; \mathcal{H}_k) := \sup_{f \in \mathcal{H}_k,\, \|f\|_{\mathcal{H}_k} \le 1} \left| \mathbb{E}[f(X)] - \mathbb{E}[f(Y)] \right| \qquad (2)$$

$$= \sqrt{\mathbb{E}\left[ k(X, X') + k(Y, Y') - 2\,k(X, Y) \right]}. \qquad (3)$$
Given two sets of $n$ samples $S_{\mathbb{P}} = \{X_i\}_{i \le n}$ and $S_{\mathbb{Q}} = \{Y_i\}_{i \le n}$, respectively drawn from $\mathbb{P}$ and $\mathbb{Q}$,
we can compute the following unbiased estimator [35]:

$$\widehat{\mathrm{MMD}}^2_u(S_{\mathbb{P}}, S_{\mathbb{Q}}; k) := \frac{1}{n(n-1)} \sum_{i \neq j} \left( k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(Y_i, X_j) \right). \qquad (4)$$
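A direct NumPy sketch of the estimator in Eq. (4) is given below; `cosine_kernel` stands in for the learned similarity function of Section 3 applied to precomputed features, and is our assumption:

```python
import numpy as np

def cosine_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine-similarity Gram matrix; stands in for the learned similarity function."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def mmd2_u(X: np.ndarray, Y: np.ndarray, kernel=cosine_kernel) -> float:
    """Unbiased estimator of MMD^2 from Eq. (4); X and Y are (n, d) sample arrays."""
    n = len(X)

    def off_diag(K: np.ndarray) -> float:
        # Sum over i != j: full sum of the Gram matrix minus its diagonal.
        return float(K.sum() - np.trace(K))

    total = (off_diag(kernel(X, X)) + off_diag(kernel(Y, Y))
             - off_diag(kernel(X, Y)) - off_diag(kernel(Y, X)))
    return total / (n * (n - 1))
```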
Under the null hypothesis $h_0 : \mathbb{P} = \mathbb{Q}$, this estimator follows a normal distribution of mean 0
[19]. Its variance can be directly estimated [18], but it is simpler to perform a permutation test,
as suggested in Sutherland et al. [62], which directly yields a $p$-value for $h_0$. The idea is to use
random splits $X, Y$ of the input sample sets to obtain $n_{\mathrm{perm}}$ different (though not independent)
samplings of $\widehat{\mathrm{MMD}}^2_u(X, Y; k)$, which approximate the distribution of $\widehat{\mathrm{MMD}}^2_u(S_{\mathbb{P}}, S_{\mathbb{P}}; k)$ under
the null hypothesis.
Liu et al. [35] train a deep kernel to maximize the test power of the MMD two-sample test on a
training split of the sets of samples to test. We propose instead to use our learned similarity function
without any fine-tuning. Note that we return the $p$-value

$$\frac{1}{n_{\mathrm{perm}} + 1} \left( 1 + \sum_{i=1}^{n_{\mathrm{perm}}} \mathbb{1}(p_i \ge \mathrm{est}) \right) \quad \text{instead of} \quad \frac{1}{n_{\mathrm{perm}}} \sum_{i=1}^{n_{\mathrm{perm}}} \mathbb{1}(p_i \ge \mathrm{est}).$$

Indeed, under the null hypothesis $\mathbb{P} = \mathbb{Q}$, $\mathrm{est}$ and the $p_i$ are drawn from
the same distribution, so for $j \in \{0, 1, \dots, n_{\mathrm{perm}}\}$, the probability for $\mathrm{est}$ to be smaller than exactly $j$
elements of $\{p_i\}$ is $\frac{1}{n_{\mathrm{perm}} + 1}$. Therefore, the probability that $j$ elements or less of $\{p_i\}_i$ are larger
than $\mathrm{est}$ is $\sum_{i=0}^{j} \frac{1}{n_{\mathrm{perm}} + 1} = \frac{j+1}{n_{\mathrm{perm}} + 1}$. While this change has a small impact for large values of
$n_{\mathrm{perm}}$, it is essential to guarantee that we indeed return a correct $p$-value. Notably, the algorithm of
Liu et al. [35] has a probability $\frac{1}{n_{\mathrm{perm}}} > 0$ of returning an output of $0.00$ even under the null hypothesis.
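Putting the pieces together, a sketch of the permutation test with the corrected $p$-value follows, reusing `mmd2_u` and `cosine_kernel` from the sketch above; names and defaults (e.g., `n_perm = 500`) are our assumptions:

```python
import numpy as np

def mmd_permutation_test(S_p: np.ndarray, S_q: np.ndarray, kernel=cosine_kernel,
                         n_perm: int = 500, seed: int = 0) -> float:
    """Permutation test for h0: P = Q, with the (n_perm + 1) correction described above."""
    rng = np.random.default_rng(seed)
    n = len(S_p)
    est = mmd2_u(S_p, S_q, kernel)           # statistic on the original split
    pooled = np.concatenate([S_p, S_q])
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))   # random re-split of the pooled samples
        p_i = mmd2_u(pooled[idx[:n]], pooled[idx[n:]], kernel)
        exceed += p_i >= est
    # Corrected p-value: never exactly 0 under the null hypothesis.
    return (1 + exceed) / (n_perm + 1)
```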
Additionally, we propose a novel version of MMD called MMD-CC (MMD with Clean Calibration).
Instead of computing the $p_i$ based on random splits of $S_{\mathbb{P}} \cup S_{\mathbb{Q}}$, we require as input two disjoint sets
of samples drawn from $\mathbb{P}$ and compute the $p_i$ based on random splits of $S_{\mathbb{P}}^{(1)} \cup S_{\mathbb{P}}^{(2)}$ (see Algorithm 1).
This change requires using twice as many samples from $\mathbb{P}$, but reduces the variance induced by the
random splits of $S_{\mathbb{P}} \cup S_{\mathbb{Q}}$, which is significant when the number of samples is small. Note that $S_{\mathbb{P}}^{(1)}$,
$S_{\mathbb{P}}^{(2)}$, and $S_{\mathbb{Q}}$ must always have the same size. Under the null hypothesis $\mathbb{P} = \mathbb{Q}$, $S_{\mathbb{P}}^{(1)}$, $S_{\mathbb{P}}^{(2)}$, and $S_{\mathbb{Q}}$
are identically distributed, so the $p_i$ retain the same distribution as for MMD (this is not the case
when $\mathbb{P} \neq \mathbb{Q}$, hence the difference in testing power). Thus, the validity of MMD-CC follows from the
validity of MMD.
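Algorithm 1 is not reproduced in this excerpt; the sketch below reflects our reading of the description above (calibrating the permutation distribution on the two disjoint clean sets), again reusing `mmd2_u` and `cosine_kernel`. It is an assumption, not the authors' exact algorithm:

```python
import numpy as np

def mmd_cc_test(S_p1: np.ndarray, S_p2: np.ndarray, S_q: np.ndarray,
                kernel=cosine_kernel, n_perm: int = 500, seed: int = 0) -> float:
    """MMD-CC sketch: calibrate the permutation distribution on random splits of
    two disjoint clean sets S_p1, S_p2 ~ P, then compare the observed statistic
    between S_p1 and S_q against it. All three sets must have the same size."""
    rng = np.random.default_rng(seed)
    n = len(S_p1)
    est = mmd2_u(S_p1, S_q, kernel)          # observed statistic: clean set vs. test set
    pooled = np.concatenate([S_p1, S_p2])    # calibration pool contains clean samples only
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        p_i = mmd2_u(pooled[idx[:n]], pooled[idx[n:]], kernel)
        exceed += p_i >= est
    return (1 + exceed) / (n_perm + 1)
```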