CADet: Fully Self-Supervised Out-Of-Distribution
Detection With Contrastive Learning
Charles Guille-Escuret
ServiceNow Research, Mila,
Université de Montréal
guillech@mila.quebec
Pau Rodriguez
ServiceNow Research
pau.rodriguez@servicenow.com
David Vazquez
ServiceNow Research
david.vazquez@servicenow.com
Ioannis Mitliagkas
Mila, Université de Montréal,
Canada CIFAR AI chair
ioannis@mila.quebec
Joao Monteiro
ServiceNow Research
joao.monteiro@servicenow.com
Abstract
Handling out-of-distribution (OOD) samples has become a major challenge in the
real-world deployment of machine learning systems. This work explores the use
of self-supervised contrastive learning for the simultaneous detection of two types
of OOD samples: unseen classes and adversarial perturbations. First, we pair
self-supervised contrastive learning with the maximum mean discrepancy (MMD)
two-sample test. This approach enables us to robustly test whether two independent
sets of samples originate from the same distribution, and we demonstrate its
effectiveness by discriminating between CIFAR-10 and CIFAR-10.1 with higher
confidence than previous work. Motivated by this success, we introduce CADet
(Contrastive Anomaly Detection), a novel method for OOD detection of single
samples. CADet draws inspiration from MMD, but leverages the similarity between
contrastive transformations of the same sample. CADet outperforms existing
adversarial detection methods in identifying adversarially perturbed samples on
ImageNet and achieves comparable performance to unseen label detection methods
on two challenging benchmarks: ImageNet-O and iNaturalist. Significantly, CADet
is fully self-supervised and requires neither labels for in-distribution samples nor
access to OOD examples.1
¹Our code to compute CADet scores is publicly available as an OpenOOD fork at
https://github.com/charlesGE/OpenOOD-CADet.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

arXiv:2210.01742v4 [cs.LG] 9 Dec 2024

1 Introduction

While modern machine learning systems have achieved countless successful real-world applications,
handling out-of-distribution (OOD) inputs remains a tough challenge of significant importance. The
problem is especially acute for high-dimensional problems like image classification. Models are
typically trained in a closed-world setting but are inevitably faced with novel input classes when deployed
in the real world. The impact can range from a displeasing customer experience to dire consequences
in the case of safety-critical applications such as autonomous driving [31] or medical analysis [55].
Although achieving high accuracy against all meaningful distributional shifts is the most desirable
solution, it is particularly challenging. An efficient method to mitigate the consequences of unexpected
inputs is to perform anomaly detection, which allows the system to anticipate its inability to process
unusual inputs and react adequately.
Anomaly detection methods generally rely on one of three types of statistics: features, logits, and
softmax probabilities, with some systems leveraging a mix of these [66]. An anomaly score $f(x)$ is
computed, and then detection with threshold $\tau$ is performed based on whether $f(x) > \tau$. The goal of
a detection system is to find an anomaly score that efficiently discriminates between in-distribution
and out-of-distribution samples. However, the common problem of these systems is that different
distributional shifts will unpredictably affect these statistics. Accordingly, detection systems either
achieve good performance on specific types of distributions or require tuning on OOD samples. In
both cases, their practical use is severely limited. Motivated by these issues, recent work has tackled
the challenge of designing detection systems for unseen classes without prior knowledge of the
unseen label set or access to OOD samples [68, 63, 66].
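To make this thresholding scheme concrete, the sketch below (our illustration, not taken from the paper) flags a sample as OOD when its anomaly score $f(x)$ exceeds a threshold $\tau$ calibrated on held-out in-distribution scores. The quantile-based calibration and all names are assumptions:

```python
import numpy as np

def calibrate_threshold(scores_in: np.ndarray, target_fpr: float = 0.05) -> float:
    """Choose tau as the (1 - target_fpr) quantile of in-distribution anomaly
    scores, so roughly target_fpr of in-distribution samples get flagged."""
    return float(np.quantile(scores_in, 1.0 - target_fpr))

def is_ood(score: float, tau: float) -> bool:
    """Flag a sample as out-of-distribution when its anomaly score f(x) exceeds tau."""
    return score > tau

# Hypothetical usage with made-up scores from some anomaly-score function f.
rng = np.random.default_rng(0)
scores_in = rng.normal(size=1000)           # scores of held-out in-distribution samples
tau = calibrate_threshold(scores_in)        # threshold at a 5% false-positive rate
print(is_ood(3.2, tau), is_ood(-0.1, tau))  # True, False (with high probability)
```

Calibrating $\tau$ on in-distribution data only is what allows detection without access to OOD samples, which is the setting targeted in this work.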
We first investigate the use of the maximum mean discrepancy (MMD) two-sample test [19] in
conjunction with self-supervised contrastive learning to assess whether two sets of samples have been
drawn from the same distribution. Motivated by the strong testing power of this method, we then
introduce a statistic inspired by MMD that leverages contrastive transformations. Based on this
statistic, we propose CADet (Contrastive Anomaly Detection), which is able to detect OOD samples
from single inputs and performs well on both label-based and adversarial detection benchmarks,
without requiring access to any OOD samples to train or tune the method.

Only a few works have addressed these tasks simultaneously. These works either focus on particular
in-distribution data such as medical imaging for specific diseases [65] or evaluate their performance
on datasets with very distant classes such as CIFAR10 [32], SVHN [47], and LSUN [73], resulting in
simple benchmarks that do not translate to general real-world applications [33, 51].
Contributions Our main contributions are as follows:

• We use similarity functions learned by self-supervised contrastive learning with MMD to
  show that the test sets of CIFAR10 and CIFAR10.1 [52] have different distributions.
• We propose a novel improvement to MMD and show it can also be used to confidently detect
  distributional shifts when given a small number of samples.
• We introduce CADet, a fully self-supervised method for OOD detection inspired by MMD,
  and show it outperforms current methods in adversarial detection tasks while performing
  well on label-based OOD detection.
The outline is as follows: in Section 2, we discuss relevant previous work. Section 3 describes the
self-supervised contrastive method based on SimCLRv2 [5] used in this work. Section 4 explores
the application of learned similarity functions in conjunction with MMD to verify whether two
independent sets of samples are drawn from the same distribution. Section 5 presents CADet and
evaluates its empirical performance. Finally, we discuss results and limitations in Section 6.
2 Related work
We propose a self-supervised contrastive method for anomaly detection (both unknown classes and
adversarial attacks) inspired by MMD. Thus, our work intersects with the MMD, label-based OOD
detection, adversarial detection, and self-supervised contrastive learning literature.
MMD two-sample test has been extensively studied [19, 67, 18, 62, 8, 29], though this is, to the best
of our knowledge, the first time a similarity function trained via contrastive learning is used in
conjunction with MMD. Liu et al. [35] use MMD with a deep kernel trained on a fraction of the
samples to argue that CIFAR10 and CIFAR10.1 have different test distributions. We build upon that
work by confirming their finding with higher confidence levels, using fewer samples. Dong et al. [11]
explored applications of MMD to OOD detection.
Label-based OOD detection methods discriminate samples that differ from those in the training
distribution. We focus on unsupervised OOD detection in this work, i.e., we do not assume access
to data labeled as OOD. Unsupervised OOD detection methods include density-based [74, 45, 46, 7,
12, 53, 59, 17, 37, 10], reconstruction-based [56, 75, 9, 50, 49, 7], one-class classifiers [57, 54],
self-supervised [15, 25, 2, 63], and supervised approaches [34, 23], though some works do not fall
into any of these categories [66, 60]. We refer to Yang et al. [71] for an overview of the many recent
works in this field.
Adversarial detection discriminates adversarial samples from the original data. Adversarial samples
are generated by minimally perturbing actual samples to produce a change in the model's output,
such as a misclassification. Most works rely on the knowledge of some attacks for training [1, 43, 13,
39, 76, 48, 40], with the exception of [27].
Self-supervised contrastive learning methods [69, 22, 4, 5] are commonly used to pre-train a model
from unlabeled data to solve a downstream task such as image classification. Contrastive learning
relies on instance discrimination trained with a contrastive loss [21] such as InfoNCE [20].
Contrastive learning for OOD detection aims to find good representations for detecting OOD
samples in a supervised [36, 30] or unsupervised [68, 44, 58] setting. Perhaps the closest work
in the literature is CSI [63], which found SimCLR features to have good discriminative power for
unknown-class detection and leveraged similarities between transformed samples in its score.
However, this method is not well suited for adversarial detection. CSI ignores the similarities
between different transformations of the same sample, an essential component of adversarial
detection (see Section 6.2). In addition, CSI scales its score with the norm of input representations.
While efficient on samples with unknown classes, this is unreliable on adversarial perturbations, which
typically increase representation norms. Finally, we failed to scale CSI to ImageNet.
3 Contrastive model
We build our model on top of SimCLRv2 [5] for its simplicity and efficiency. It is composed of an
encoder backbone network $f_\theta$ as well as a 3-layer contrastive head $h_\theta$. Given an in-distribution
$\mathcal{X}$, a similarity function $\mathrm{sim}$, and a distribution of training transformations $T_{\mathrm{train}}$, the goal is
to simultaneously maximize

$$\mathbb{E}_{x \sim \mathcal{X};\, t_0, t_1 \sim T_{\mathrm{train}}}\left[\mathrm{sim}\big(h_\theta(f_\theta(t_0(x))),\, h_\theta(f_\theta(t_1(x)))\big)\right],$$

and minimize

$$\mathbb{E}_{x, y \sim \mathcal{X};\, t_0, t_1 \sim T_{\mathrm{train}}}\left[\mathrm{sim}\big(h_\theta(f_\theta(t_0(x))),\, h_\theta(f_\theta(t_1(y)))\big)\right],$$

i.e., we want to learn representations in which random transformations of the same example are close
while random transformations of different examples are distant.
To achieve this, given an input batch $\{x_i\}_{i=1,\dots,N}$, we compute the set $\{x_i^{(j)}\}_{j=0,1;\, i=1,\dots,N}$ by
applying two transformations independently sampled from $T_{\mathrm{train}}$ to each $x_i$. We then compute the
embeddings $z_i^{(j)} = h_\theta(f_\theta(x_i^{(j)}))$ and apply the following contrastive loss:

$$\mathcal{L}(z) = -\sum_{i=1,\dots,N} \log \frac{u_{i,i}}{\sum_{j \in \{1,\dots,N\}} (u_{i,j} + v_{i,j})}, \qquad (1)$$

where

$$u_{i,j} = e^{\mathrm{sim}(z_i^{(0)},\, z_j^{(1)})/\tau} \quad \text{and} \quad v_{i,j} = \mathbb{1}_{i \neq j}\, e^{\mathrm{sim}(z_i^{(0)},\, z_j^{(0)})/\tau},$$

$\tau$ is the temperature hyperparameter, and $\mathrm{sim}(x, y) = \frac{x^\top y}{\|x\|_2 \|y\|_2}$ is the cosine similarity.
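For concreteness, here is a minimal PyTorch sketch of the loss in Eq. (1), assuming the embeddings of the two views have already been computed; the function and variable names are ours, not the authors':

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z0: torch.Tensor, z1: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eq. (1): z0, z1 are (N, d) embeddings of the two views of each batch sample."""
    z0 = F.normalize(z0, dim=1)    # after normalization, cosine similarity is a dot product
    z1 = F.normalize(z1, dim=1)
    u = torch.exp(z0 @ z1.T / tau)                       # u[i, j] = exp(sim(z0_i, z1_j) / tau)
    v = torch.exp(z0 @ z0.T / tau)                       # v[i, j] = exp(sim(z0_i, z0_j) / tau)
    v = v * (1 - torch.eye(len(z0), device=z0.device))   # indicator 1_{i != j}: zero the diagonal
    denom = (u + v).sum(dim=1)                           # sum over j of u[i, j] + v[i, j]
    return -torch.log(u.diag() / denom).sum()            # -sum_i log(u[i, i] / denom_i)
```

Note that Eq. (1) anchors view 0 of each sample against both views of all other samples; symmetrizing over the two views is a common SimCLR variant but is not assumed here.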
Hyperparameters: We follow as closely as possible the setting from SimCLRv2, with a few modifications
to adapt to hardware limitations. In particular, we use the LARS optimizer [72] with learning
rate 1.2, momentum 0.9, and weight decay $10^{-4}$. Iteration-wise, we scale up the learning rate for
the first 40 epochs linearly, then use an iteration-wise cosine decaying schedule until epoch 800,
with temperature $\tau = 0.1$. We train on 8 V100 GPUs with an accumulated batch size of 1024. We
compute the contrastive loss on all batch samples by aggregating the embeddings computed by each
GPU. We use synchronized BatchNorm and fp32 precision and do not use a memory buffer. We use
the same set of transformations, i.e., Gaussian blur and horizontal flip with probability 0.5, color
jittering with probability 0.8, random crop with scale uniformly sampled in $[0.08, 1]$, and grayscale
with probability 0.2.
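For reference, the augmentation distribution $T_{\mathrm{train}}$ described above can be sketched with torchvision as follows; parameters not stated in the text (color-jitter strengths, blur kernel size, crop resolution) follow common SimCLR defaults and are assumptions:

```python
from torchvision import transforms

# Sketch of the training augmentations listed above. Color-jitter strengths,
# blur kernel size, and crop resolution are not stated in the text; the values
# below follow common SimCLR defaults and are assumptions.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),     # random crop, scale in [0.08, 1]
    transforms.RandomHorizontalFlip(p=0.5),                   # horizontal flip, probability 0.5
    transforms.RandomApply(
        [transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8), # color jittering, probability 0.8
    transforms.RandomGrayscale(p=0.2),                        # grayscale, probability 0.2
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23)], p=0.5),    # Gaussian blur, probability 0.5
    transforms.ToTensor(),
])
```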
For computational simplicity and comparison with previous work, we use a ResNet50 encoder
architecture with final features of size 2048. Following SimCLRv2, we use a three-layer fully
connected contrastive head with hidden layers of width 2048, using ReLU activations and BatchNorm,
and set the last layer projection to dimension 128. For evaluation, we use the features produced by
the encoder without the contrastive head. We do not, at any point, use supervised fine-tuning.
4 MMD two-sample test
The Maximum Mean Discrepancy (MMD) is a statistic used in the MMD two-sample test to
assess whether two sets of samples $S_{\mathbb{P}}$ and $S_{\mathbb{Q}}$ are drawn from the same distribution. It estimates the
expected difference between the intra-set distances and the across-set distances.
Definition 4.1 (Gretton et al. [19]). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be the kernel of a reproducing Hilbert
space $\mathcal{H}_k$, with feature maps $k(\cdot, x) \in \mathcal{H}_k$. Let $X, X' \sim \mathbb{P}$ and $Y, Y' \sim \mathbb{Q}$. Under mild integrability
conditions,

$$\mathrm{MMD}(\mathbb{P}, \mathbb{Q}; \mathcal{H}_k) := \sup_{f \in \mathcal{H}_k,\, \|f\|_{\mathcal{H}_k} \le 1} \left| \mathbb{E}[f(X)] - \mathbb{E}[f(Y)] \right| \qquad (2)$$

$$= \sqrt{\mathbb{E}\left[ k(X, X') + k(Y, Y') - 2\,k(X, Y) \right]}. \qquad (3)$$
Given two sets of $n$ samples $S_{\mathbb{P}} = \{X_i\}_{i \le n}$ and $S_{\mathbb{Q}} = \{Y_i\}_{i \le n}$, respectively drawn from $\mathbb{P}$ and $\mathbb{Q}$,
we can compute the following unbiased estimator [35]:

$$\widehat{\mathrm{MMD}}^2_u(S_{\mathbb{P}}, S_{\mathbb{Q}}; k) := \frac{1}{n(n-1)} \sum_{i \neq j} \left( k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(Y_i, X_j) \right). \qquad (4)$$
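A direct NumPy sketch of the estimator in Eq. (4) is given below; `cosine_kernel` stands in for the learned similarity function of Section 3 applied to precomputed features, and is our assumption:

```python
import numpy as np

def cosine_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine-similarity Gram matrix; stands in for the learned similarity function."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def mmd2_u(X: np.ndarray, Y: np.ndarray, kernel=cosine_kernel) -> float:
    """Unbiased estimator of MMD^2 from Eq. (4); X and Y are (n, d) sample arrays."""
    n = len(X)

    def off_diag(K: np.ndarray) -> float:
        # Sum over i != j: full sum of the Gram matrix minus its diagonal.
        return float(K.sum() - np.trace(K))

    total = (off_diag(kernel(X, X)) + off_diag(kernel(Y, Y))
             - off_diag(kernel(X, Y)) - off_diag(kernel(Y, X)))
    return total / (n * (n - 1))
```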
Under the null hypothesis $h_0 : \mathbb{P} = \mathbb{Q}$, this estimator follows a normal distribution of mean 0
[19]. Its variance can be directly estimated [18], but it is simpler to perform a permutation test,
as suggested in Sutherland et al. [62], which directly yields a $p$-value for $h_0$. The idea is to use
random splits $X, Y$ of the input sample sets to obtain $n_{\mathrm{perm}}$ different (though not independent)
samplings of $\widehat{\mathrm{MMD}}^2_u(X, Y; k)$, which approximate the distribution of $\widehat{\mathrm{MMD}}^2_u(S_{\mathbb{P}}, S_{\mathbb{P}}; k)$ under
the null hypothesis.
Liu et al. [35] train a deep kernel to maximize the test power of the MMD two-sample test on a
training split of the sets of samples to test. We propose instead to use our learned similarity function
without any fine-tuning. Note that we return the $p$-value

$$\frac{1}{n_{\mathrm{perm}} + 1} \left( 1 + \sum_{i=1}^{n_{\mathrm{perm}}} \mathbb{1}(p_i \ge \mathrm{est}) \right) \quad \text{instead of} \quad \frac{1}{n_{\mathrm{perm}}} \sum_{i=1}^{n_{\mathrm{perm}}} \mathbb{1}(p_i \ge \mathrm{est}).$$

Indeed, under the null hypothesis $\mathbb{P} = \mathbb{Q}$, $\mathrm{est}$ and the $p_i$ are drawn from
the same distribution, so for $j \in \{0, 1, \dots, n_{\mathrm{perm}}\}$, the probability for $\mathrm{est}$ to be smaller than exactly $j$
elements of $\{p_i\}$ is $\frac{1}{n_{\mathrm{perm}} + 1}$. Therefore, the probability that $j$ elements or less of $\{p_i\}_i$ are larger
than $\mathrm{est}$ is $\sum_{i=0}^{j} \frac{1}{n_{\mathrm{perm}} + 1} = \frac{j+1}{n_{\mathrm{perm}} + 1}$. While this change has a small impact for large values of
$n_{\mathrm{perm}}$, it is essential to guarantee that we indeed return a correct $p$-value. Notably, the algorithm of
Liu et al. [35] has a probability $\frac{1}{n_{\mathrm{perm}}} > 0$ of returning an output of $0.00$ even under the null hypothesis.
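Putting the pieces together, a sketch of the permutation test with the corrected $p$-value follows, reusing `mmd2_u` and `cosine_kernel` from the sketch above; names and defaults (e.g., `n_perm = 500`) are our assumptions:

```python
import numpy as np

def mmd_permutation_test(S_p: np.ndarray, S_q: np.ndarray, kernel=cosine_kernel,
                         n_perm: int = 500, seed: int = 0) -> float:
    """Permutation test for h0: P = Q, with the (n_perm + 1) correction described above."""
    rng = np.random.default_rng(seed)
    n = len(S_p)
    est = mmd2_u(S_p, S_q, kernel)           # statistic on the original split
    pooled = np.concatenate([S_p, S_q])
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))   # random re-split of the pooled samples
        p_i = mmd2_u(pooled[idx[:n]], pooled[idx[n:]], kernel)
        exceed += p_i >= est
    # Corrected p-value: never exactly 0 under the null hypothesis.
    return (1 + exceed) / (n_perm + 1)
```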
Additionally, we propose a novel version of MMD called MMD-CC (MMD with Clean Calibration).
Instead of computing the $p_i$ based on random splits of $S_{\mathbb{P}} \cup S_{\mathbb{Q}}$, we require as input two disjoint sets
of samples drawn from $\mathbb{P}$ and compute the $p_i$ based on random splits of $S_{\mathbb{P}}^{(1)} \cup S_{\mathbb{P}}^{(2)}$ (see Algorithm 1).
This change requires using twice as many samples from $\mathbb{P}$, but reduces the variance induced by the
random splits of $S_{\mathbb{P}} \cup S_{\mathbb{Q}}$, which is significant when the number of samples is small. Note that $S_{\mathbb{P}}^{(1)}$,
$S_{\mathbb{P}}^{(2)}$, and $S_{\mathbb{Q}}$ must always have the same size. Under the null hypothesis $\mathbb{P} = \mathbb{Q}$, $S_{\mathbb{P}}^{(1)}$, $S_{\mathbb{P}}^{(2)}$, and $S_{\mathbb{Q}}$
are identically distributed, so the $p_i$ retain the same distribution as for MMD (this is not the case
when $\mathbb{P} \neq \mathbb{Q}$, hence the difference in testing power). Thus, the validity of MMD-CC follows from the
validity of MMD.
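Algorithm 1 is not reproduced in this excerpt; the sketch below reflects our reading of the description above (calibrating the permutation distribution on the two disjoint clean sets), again reusing `mmd2_u` and `cosine_kernel`. It is an assumption, not the authors' exact algorithm:

```python
import numpy as np

def mmd_cc_test(S_p1: np.ndarray, S_p2: np.ndarray, S_q: np.ndarray,
                kernel=cosine_kernel, n_perm: int = 500, seed: int = 0) -> float:
    """MMD-CC sketch: calibrate the permutation distribution on random splits of
    two disjoint clean sets S_p1, S_p2 ~ P, then compare the observed statistic
    between S_p1 and S_q against it. All three sets must have the same size."""
    rng = np.random.default_rng(seed)
    n = len(S_p1)
    est = mmd2_u(S_p1, S_q, kernel)          # observed statistic: clean set vs. test set
    pooled = np.concatenate([S_p1, S_p2])    # calibration pool contains clean samples only
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        p_i = mmd2_u(pooled[idx[:n]], pooled[idx[n:]], kernel)
        exceed += p_i >= est
    return (1 + exceed) / (n_perm + 1)
```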