Is Out-of-Distribution Detection Learnable?
Zhen Fang1, Yixuan Li2, Jie Lu1, Jiahua Dong3,4, Bo Han5, Feng Liu1,6
1Australian Artificial Intelligence Institute, University of Technology Sydney.
2Department of Computer Sciences, University of Wisconsin-Madison.
3State Key Laboratory of Robotics, Shenyang Institute of Automation,
Chinese Academy of Sciences. 4ETH Zurich, Switzerland.
5Department of Computer Science, Hong Kong Baptist University.
6School of Mathematics and Statistics, University of Melbourne.
{zhen.fang,jie.lu}@uts.edu.au, sharonli@cs.wisc.edu,
dongjiahua1995@gmail.com,bhanml@comp.hkbu.edu.hk,feng.liu1@unimelb.edu.au
Abstract
Supervised learning aims to train a classifier under the assumption that training and
test data come from the same distribution. To relax this assumption, researchers
have studied a more realistic setting: out-of-distribution (OOD) detection, where
test data may come from classes that are unknown during training (i.e., OOD data).
Due to the unavailability and diversity of OOD data, good generalization ability
is crucial for effective OOD detection algorithms. To study the generalization of
OOD detection, in this paper, we investigate the probably approximately correct
(PAC) learning theory of OOD detection, which has been raised as an open problem
by researchers. First, we find a necessary condition for the learnability of OOD
detection. Then, using this condition, we prove several impossibility theorems for
the learnability of OOD detection in some scenarios. Although the impossibility
theorems are discouraging, we find that some of their conditions may not hold in
certain practical scenarios. Based on this observation, we then give several necessary
and sufficient conditions that characterize the learnability of OOD detection in these
practical scenarios. Lastly, we also offer theoretical support for several representative
OOD detection methods based on our theory.
1 Introduction
The success of supervised learning rests on an implicit assumption that training and test data
share the same distribution, i.e., the in-distribution (ID) [1, 2, 3, 4]. However, the test distribution in many
real-world scenarios may violate this assumption and, instead, contain out-of-distribution (OOD) data
whose labels have not been seen during the training process [5, 6]. To mitigate the risk posed by OOD data,
researchers have considered a more practical learning scenario: OOD detection, which determines
whether an input is ID or OOD, while classifying the ID data into their respective classes. OOD detection has
shown great potential for ensuring the reliable deployment of machine learning models in the real world.
A rich line of algorithms has been developed to empirically address the OOD detection problem
[6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, very few works study the theory of OOD
detection, which hinders a rigorous path forward for the field. This paper aims to bridge the gap.
In this paper, we provide a theoretical framework to understand the learnability of the OOD detection
problem. We investigate the probably approximately correct (PAC) learning theory of OOD detection,
which is posed as an open problem to date. Unlike the classical PAC learning theory in a supervised
setting, our problem setting is fundamentally challenging due to the absence of OOD data in training.
In many real-world scenarios, OOD data can be diverse and unknown a priori. Given this, we study
whether there exists an algorithm that can detect various OOD data rather than merely some
specified OOD data. Such is the significance of studying the learning theory of OOD detection [4].
This motivates our question: is OOD detection PAC learnable? That is, is there a PAC learning theory
that guarantees the generalization ability of OOD detection?
To investigate the learning theory, we mainly focus on two basic spaces: the domain space and the hypothesis
space. The domain space is a space consisting of some distributions, and the hypothesis space is a
space consisting of some classifiers. Existing agnostic PAC theories in supervised learning [21, 22]
are distribution-free, i.e., the domain space consists of all domains. Yet, in Theorem 4, we show that
the learning theory of OOD detection is not distribution-free. In fact, we discover that OOD detection
is learnable only if the domain space and the hypothesis space satisfy some special conditions, e.g.,
Conditions 1 and 3. Notably, there are many conditions and theorems in existing learning theories,
and many OOD detection algorithms in the literature. It is therefore very difficult to analyze the relation
between these theories and algorithms and to identify useful conditions that ensure the learnability of
OOD detection, especially when exploring them from scratch. Thus, the main aim of our
paper is to study these essential conditions. From these essential conditions, we can know when OOD
detection can succeed in practical scenarios. We restate our question and goal as follows:
Given hypothesis spaces and several representative domain spaces, what are
the conditions to ensure the learnability of OOD detection? If possible, we
hope that these conditions are necessary and sufficient in some scenarios.
Main Results. We investigate the learnability of OOD detection starting from the largest space, the
total space, and give a necessary condition (Condition 1) for learnability. However, we find that
overlap between ID and OOD data may cause this necessary condition to fail.
Therefore, we give an impossibility theorem demonstrating that OOD detection fails in the total
space (Theorem 4). Next, we study OOD detection in the separate space, where there is no overlap
between the ID and OOD data. Unfortunately, there still exists an impossibility theorem (Theorem 5),
which demonstrates that OOD detection is not learnable in the separate space under certain conditions.
Although the impossibility theorems obtained in the separate space are discouraging, we find that some
of their conditions may not hold in certain practical scenarios. Based on this
observation, we give several necessary and sufficient conditions that characterize the learnability of
OOD detection in the separate space (Theorems 6 and 10). In particular, when the model is a
fully-connected neural network (FCNN), OOD detection is learnable in the separate space if and
only if the feature space is finite. Furthermore, we investigate the learnability of OOD detection in
other, more practical domain spaces, e.g., the finite-ID-distribution space (Theorem 8) and the density-
based space (Theorem 9). By studying the finite-ID-distribution space, we discover a compatibility
condition (Condition 3) that is necessary and sufficient for this space. We then further
investigate the compatibility condition in the density-based space and find that it is also
necessary and sufficient in some practical scenarios (Theorem 11).
Implications and Impacts of Theory. Our study is not of purely theoretical interest; it also has
practical impacts. First, when we design OOD detection algorithms, we normally have only finite ID
datasets, corresponding to the finite-ID-distribution space. In this case, Theorem 8 gives the necessary
and sufficient condition for the success of OOD detection. Second, our theory provides theoretical
support (Theorems 10 and 11) for several representative OOD detection works [7, 8, 23]. Third, our
theory shows that OOD detection is learnable in image-based scenarios when the ID images have clearly
different semantic labels and styles (far-OOD) from the OOD images. Fourth, we should not expect a
universally working algorithm; it is necessary to design different algorithms for different scenarios.
2 Learning Setups
We start by introducing the necessary concepts and notations for our theoretical framework. Given
a feature space $\mathcal{X} \subset \mathbb{R}^d$ and a label space $\mathcal{Y} := \{1, ..., K\}$, we have an ID joint distribution $D_{X_I Y_I}$
over $\mathcal{X} \times \mathcal{Y}$, where $X_I \in \mathcal{X}$ and $Y_I \in \mathcal{Y}$ are random variables. We also have an OOD joint
distribution $D_{X_O Y_O}$, where $X_O$ is a random variable from $\mathcal{X}$, but $Y_O$ is a random variable whose
outputs do not belong to $\mathcal{Y}$. During testing, we will meet a mixture of the ID and OOD joint distributions,
$D_{XY} := (1 - \pi^{\mathrm{out}}) D_{X_I Y_I} + \pi^{\mathrm{out}} D_{X_O Y_O}$, and can only observe the marginal distribution
$D_X := (1 - \pi^{\mathrm{out}}) D_{X_I} + \pi^{\mathrm{out}} D_{X_O}$, where the constant $\pi^{\mathrm{out}} \in [0, 1)$ is an unknown class-prior probability.
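To make this setup concrete, here is a minimal simulation sketch. It is our own illustration, not the paper's code: the Gaussian parameters and the value of $\pi^{\mathrm{out}}$ are hypothetical. It emphasizes that a detector observes only draws from the mixed marginal $D_X$, never the latent ID/OOD indicator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_id(n):
    # ID marginal D_{X_I}: two Gaussian classes in R^2 (hypothetical parameters)
    y = rng.integers(0, 2, size=n)                       # class labels in {0, 1}
    x = rng.normal(loc=y[:, None] * 4.0, scale=1.0, size=(n, 2))
    return x, y

def sample_ood(n):
    # OOD marginal D_{X_O}: a Gaussian far from both ID classes (hypothetical)
    return rng.normal(loc=-6.0, scale=1.0, size=(n, 2))

def sample_test(n, pi_out=0.3):
    # Draw from D_X = (1 - pi_out) D_{X_I} + pi_out D_{X_O}.
    # The component indicator is latent: a detector only ever sees x.
    is_ood = rng.random(n) < pi_out
    x = np.empty((n, 2))
    x[~is_ood] = sample_id(int((~is_ood).sum()))[0]
    x[is_ood] = sample_ood(int(is_ood.sum()))
    return x, is_ood                                     # is_ood returned for evaluation only

x_test, is_ood = sample_test(1000)
print(f"test points observed: {len(x_test)}, latent OOD fraction: {is_ood.mean():.2f}")
```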
Problem 1 (OOD Detection [4]). Given an ID joint distribution $D_{X_I Y_I}$ and training data
$S := \{(\mathbf{x}^1, y^1), ..., (\mathbf{x}^n, y^n)\}$ drawn independent and identically distributed from $D_{X_I Y_I}$, the aim of OOD
detection is to train a classifier $f$ by using the training data $S$ such that, for any test data $\mathbf{x}$ drawn
from the mixed marginal distribution $D_X$: 1) if $\mathbf{x}$ is an observation from $D_{X_I}$, $f$ can classify $\mathbf{x}$ into
the correct ID class; and 2) if $\mathbf{x}$ is an observation from $D_{X_O}$, $f$ can detect $\mathbf{x}$ as OOD data.
According to the survey [4], when $K > 1$, OOD detection is also known as open-set recognition
or open-set learning [24, 25]; and when $K = 1$, OOD detection reduces to one-class novelty detection
and semantic anomaly detection [26, 27, 28].
OOD Label and Domain Space. Based on Problem 1, it is not necessary to classify OOD
data into correct OOD classes. Without loss of generality, we allocate all OOD data to one
big OOD class, i.e., $Y_O = K + 1$ [24, 29]. To investigate the PAC learnability of OOD detection,
we define a domain space $\mathscr{D}_{XY}$, which is a set consisting of some joint distributions $D_{XY}$, each mixed from
an ID joint distribution and an OOD joint distribution. In this paper, the joint distribution
$D_{XY}$ mixed from the ID joint distribution $D_{X_I Y_I}$ and the OOD joint distribution $D_{X_O Y_O}$ is called a domain.
Hypothesis Spaces and Scoring Function Spaces. A hypothesis space $\mathcal{H}$ is a subset of a function
space, i.e., $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y} \cup \{K+1\}\}$. We set $\mathcal{H}^{\mathrm{in}} \subset \{h : \mathcal{X} \to \mathcal{Y}\}$ to be the ID hypothesis space.
We also define $\mathcal{H}^{\mathrm{b}} \subset \{h : \mathcal{X} \to \{1, 2\}\}$ as the hypothesis space for binary classification, where $1$
represents ID data and $2$ represents OOD data. The function $h$ is called a hypothesis
function. A scoring function space is a subset of a function space, i.e., $\mathcal{F}_l \subset \{\mathbf{f} : \mathcal{X} \to \mathbb{R}^l\}$, where $l$ is
the output dimension of the vector-valued function $\mathbf{f}$. The function $\mathbf{f}$ is called a scoring function.
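To illustrate how a scoring function induces a hypothesis function, the sketch below thresholds an energy-style score, in the spirit of the free-energy method [23], to map $K$ logits to a label in $\mathcal{Y} \cup \{K+1\}$. The affine scoring function and the threshold $\tau$ are hypothetical choices for illustration, not the paper's construction.

```python
import numpy as np

K = 3  # number of ID classes

def scoring_function(x, W, b):
    # f in F_K: X -> R^K; an affine scoring function, chosen purely for illustration
    return x @ W + b

def induced_hypothesis(x, W, b, tau):
    # Energy-style score in the spirit of the free-energy method [23]:
    # a low -logsumexp(logits) value suggests ID. tau is a hypothetical threshold.
    logits = scoring_function(x, W, b)
    energy = -np.logaddexp.reduce(logits, axis=1)
    y_hat = np.argmax(logits, axis=1) + 1            # ID classes are 1, ..., K
    y_hat[energy > tau] = K + 1                      # otherwise: the OOD class K + 1
    return y_hat

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, K)), np.zeros(K)
x = rng.normal(size=(5, 2))
print(induced_hypothesis(x, W, b, tau=0.0))          # values in {1, ..., K + 1}
```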
Loss and Risks. Let $\mathcal{Y}_{\mathrm{all}} = \mathcal{Y} \cup \{K+1\}$. Given a loss function $\ell : \mathcal{Y}_{\mathrm{all}} \times \mathcal{Y}_{\mathrm{all}} \to \mathbb{R}_{\geq 0}$ satisfying
that $\ell(y_1, y_2) = 0$ if and only if $y_1 = y_2$,² and any $h \in \mathcal{H}$, the risk with respect to $D_{XY}$ is

$$R_D(h) := \mathbb{E}_{(\mathbf{x}, y) \sim D_{XY}}\, \ell(h(\mathbf{x}), y). \tag{1}$$

The $\alpha$-risk is $R_D^{\alpha}(h) := (1-\alpha) R_D^{\mathrm{in}}(h) + \alpha R_D^{\mathrm{out}}(h)$ for $\alpha \in [0, 1]$, where the risks $R_D^{\mathrm{in}}(h)$ and $R_D^{\mathrm{out}}(h)$ are

$$R_D^{\mathrm{in}}(h) := \mathbb{E}_{(\mathbf{x}, y) \sim D_{X_I Y_I}}\, \ell(h(\mathbf{x}), y), \qquad R_D^{\mathrm{out}}(h) := \mathbb{E}_{\mathbf{x} \sim D_{X_O}}\, \ell(h(\mathbf{x}), K+1).$$
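As a sanity check on these definitions, the empirical counterparts of $R_D^{\mathrm{in}}$, $R_D^{\mathrm{out}}$ and the $\alpha$-risk can be computed from samples. The sketch below uses the zero-one loss and a toy threshold hypothesis; both are our own hypothetical choices, since the paper works with a general bounded loss $\ell$.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # l(y1, y2) = 0 iff y1 == y2, else 1: a bounded loss on Y_all x Y_all
    return (y_pred != y_true).astype(float)

def empirical_alpha_risk(h, x_id, y_id, x_ood, K, alpha):
    # Empirical R_in, R_out and R^alpha_D(h) = (1 - alpha) R_in(h) + alpha R_out(h)
    r_in = zero_one_loss(h(x_id), y_id).mean()
    r_out = zero_one_loss(h(x_ood), np.full(len(x_ood), K + 1)).mean()
    return (1.0 - alpha) * r_in + alpha * r_out

# Usage with a toy one-ID-class hypothesis (hypothetical): x[0] < 0 means OOD.
K = 1
h = lambda x: np.where(x[:, 0] < 0.0, K + 1, 1)
rng = np.random.default_rng(0)
x_id = rng.normal(2.0, 1.0, size=(100, 2))     # hypothetical ID samples, all labeled 1
y_id = np.ones(100, dtype=int)
x_ood = rng.normal(-2.0, 1.0, size=(100, 2))   # hypothetical OOD samples
print(empirical_alpha_risk(h, x_id, y_id, x_ood, K, alpha=0.5))
```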
Learnability. We aim to select a hypothesis function $h \in \mathcal{H}$ with approximately minimal risk, based
on finite data. Generally, we expect the approximation to improve as the sample size increases.
Algorithms achieving this are said to be consistent. Formally, we introduce the following definition:
Definition 1 (Learnability of OOD Detection). Given a domain space $\mathscr{D}_{XY}$ and a hypothesis space
$\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}_{\mathrm{all}}\}$, we say OOD detection is learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$, if there exist an
algorithm³ $\mathbf{A} : \cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\mathrm{cons}}(n)$, such that
$\epsilon_{\mathrm{cons}}(n) \to 0$ as $n \to +\infty$, and for any domain $D_{XY} \in \mathscr{D}_{XY}$,

$$\mathbb{E}_{S \sim D_{X_I Y_I}^n} R_D(\mathbf{A}(S)) - \inf_{h \in \mathcal{H}} R_D(h) \leq \epsilon_{\mathrm{cons}}(n). \tag{2}$$

An algorithm $\mathbf{A}$ for which this holds is said to be consistent with respect to $\mathscr{D}_{XY}$.
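For intuition about what an "algorithm" means here (footnote 3: a mapping from finite ID samples to $\mathcal{H}$), the following sketch implements empirical risk minimization over a small finite hypothesis class of interval detectors. The class and data are hypothetical; whether such an ERM rule, trained on ID data only, also keeps $R_D^{\mathrm{out}}$ small is exactly the question the rest of the paper studies.

```python
import numpy as np

K = 2  # Y = {1, 2}; Y_all = {1, 2, 3}, where 3 = K + 1 is the OOD class

def make_hypothesis(t):
    # h_t: class 1 on (-inf, t), class 2 on [t, t + 5), OOD (K + 1) elsewhere
    def h(x):
        out = np.full(len(x), K + 1)
        out[(x >= t) & (x < t + 5.0)] = 2
        out[x < t] = 1
        return out
    return h

H = {t: make_hypothesis(t) for t in np.linspace(-3.0, 3.0, 25)}  # finite hypothesis class

def algorithm_A(S_x, S_y):
    # ERM: map an ID sample S ~ D^n_{X_I Y_I} to the hypothesis in H with the
    # smallest empirical ID risk (zero-one loss). OOD data are never seen here.
    return min(H.values(), key=lambda h: float((h(S_x) != S_y).mean()))

rng = np.random.default_rng(0)
S_x = np.concatenate([rng.normal(-1.0, 0.3, 200), rng.normal(1.0, 0.3, 200)])
S_y = np.concatenate([np.full(200, 1), np.full(200, 2)])
h_hat = algorithm_A(S_x, S_y)
print(h_hat(np.array([-1.0, 1.0, 50.0])))  # e.g. [1 2 3]: the far-away point is flagged OOD
```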
Definition 1 is a natural extension of the agnostic PAC learnability of supervised learning [30]. If
$\pi^{\mathrm{out}} = 0$ for any $D_{XY} \in \mathscr{D}_{XY}$, then Definition 1 reduces to the agnostic PAC learnability of supervised learning.
Although the expression of Definition 1 differs from the usual definition of agnostic PAC
learning in [21], one can easily prove that they are equivalent when $\ell$ is bounded; see Appendix D.3.
Since OOD data are unavailable, it is impossible to obtain information about the class-prior probability
$\pi^{\mathrm{out}}$. Furthermore, in the real world, it is possible that $\pi^{\mathrm{out}}$ can be any value in $[0, 1)$. Therefore,
the imbalance issue between the ID and OOD distributions and the priori-unknown issue (i.e., $\pi^{\mathrm{out}}$ is
unknown) are the core challenges. To ease these challenges, researchers use AUROC, AUPR and
FPR95 to estimate the performance of OOD detection [18, 31, 32, 33, 34, 35]. It seems that there is a
gap between Definition 1 and existing works. To eliminate this gap, we revise Eq. (2) as follows:

$$\mathbb{E}_{S \sim D_{X_I Y_I}^n} R_D^{\alpha}(\mathbf{A}(S)) - \inf_{h \in \mathcal{H}} R_D^{\alpha}(h) \leq \epsilon_{\mathrm{cons}}(n), \quad \forall \alpha \in [0, 1]. \tag{3}$$
If an algorithm $\mathbf{A}$ satisfies Eq. (3), then the imbalance issue and the priori-unknown issue disappear.
That is, $\mathbf{A}$ can simultaneously classify the ID data and detect the OOD data well. Based on the above
discussion, we define the strong learnability of OOD detection as follows:
²Note that $\mathcal{Y}_{\mathrm{all}} \times \mathcal{Y}_{\mathrm{all}}$ is a finite set; therefore, $\ell$ is bounded.
³Similar to [30], in this paper, we regard an algorithm as a mapping from $\cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n$ to $\mathcal{H}$.
Definition 2 (Strong Learnability of OOD Detection). Given a domain space $\mathscr{D}_{XY}$ and a hypothesis
space $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}_{\mathrm{all}}\}$, we say OOD detection is strongly learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$, if there
exist an algorithm $\mathbf{A} : \cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\mathrm{cons}}(n)$,
such that $\epsilon_{\mathrm{cons}}(n) \to 0$ as $n \to +\infty$, and for any domain $D_{XY} \in \mathscr{D}_{XY}$,

$$\mathbb{E}_{S \sim D_{X_I Y_I}^n} R_D^{\alpha}(\mathbf{A}(S)) - \inf_{h \in \mathcal{H}} R_D^{\alpha}(h) \leq \epsilon_{\mathrm{cons}}(n), \quad \forall \alpha \in [0, 1].$$
In Theorem 1, we show that the strong learnability of OOD detection is equivalent to the
learnability of OOD detection if the domain space $\mathscr{D}_{XY}$ is a priori-unknown space (see Definition 3).
In this paper, we mainly discuss learnability in priori-unknown spaces. Therefore, when we
mention that OOD detection is learnable, we also mean that OOD detection is strongly learnable.
Goal of Theory. Note that the agnostic PAC learnability of supervised learning is distribution-free,
i.e., the domain space $\mathscr{D}_{XY}$ consists of all domains. However, due to the absence of OOD data
during the training process [8, 14, 24], the learnability of OOD detection is not
distribution-free (see Theorem 4). In fact, we discover that the learnability of OOD detection is
deeply correlated with the relationship between the domain space $\mathscr{D}_{XY}$ and the hypothesis space $\mathcal{H}$.
That is, OOD detection is learnable only when the domain space $\mathscr{D}_{XY}$ and the hypothesis space $\mathcal{H}$
satisfy some special conditions, e.g., Condition 1 and Condition 3. We present our goal as follows:
Goal: given a hypothesis space $\mathcal{H}$ and several representative domain spaces $\mathscr{D}_{XY}$,
what are the conditions to ensure the learnability of OOD detection? Furthermore, if
possible, we hope that these conditions are necessary and sufficient in some scenarios.
Therefore, compared to the agnostic PAC learnability of supervised learning, our theory does not
focus on the distribution-free case, but instead focuses on discovering essential conditions that guarantee the
learnability of OOD detection in several representative and practical domain spaces $\mathscr{D}_{XY}$. From these
essential conditions, we can know when OOD detection can succeed in real applications.
3 Learning in Priori-unknown Spaces
We first investigate a special space, called the priori-unknown space, in which Definition 1 and
Definition 2 are equivalent. Furthermore, we also prove that if OOD detection is strongly learnable
in a space $\mathscr{D}_{XY}$, then one can construct a larger, priori-unknown domain space in which OOD detection
remains learnable. These results imply that it suffices to consider our theory in
priori-unknown spaces. The priori-unknown space is introduced as follows:
Definition 3. Given a domain space $\mathscr{D}_{XY}$, we say $\mathscr{D}_{XY}$ is a priori-unknown space, if for any domain
$D_{XY} \in \mathscr{D}_{XY}$ and any $\alpha \in [0, 1)$, we have $D_{XY}^{\alpha} := (1-\alpha) D_{X_I Y_I} + \alpha D_{X_O Y_O} \in \mathscr{D}_{XY}$.
Theorem 1. Given domain spaces $\mathscr{D}_{XY}$ and $\mathscr{D}'_{XY} = \{D_{XY}^{\alpha} : D_{XY} \in \mathscr{D}_{XY}, \forall \alpha \in [0, 1)\}$, then
1) $\mathscr{D}'_{XY}$ is a priori-unknown space and $\mathscr{D}_{XY} \subset \mathscr{D}'_{XY}$;
2) if $\mathscr{D}_{XY}$ is a priori-unknown space, then Definition 1 and Definition 2 are equivalent;
3) OOD detection is strongly learnable in $\mathscr{D}_{XY}$ if and only if OOD detection is learnable in $\mathscr{D}'_{XY}$.
The second result of Theorem 1 bridges learnability and strong learnability: it implies that
if an algorithm $\mathbf{A}$ is consistent with respect to a priori-unknown space, then $\mathbf{A}$ can
handle both the imbalance issue between the ID and OOD distributions and the priori-unknown issue.
Based on Theorem 1, we focus our theory on priori-unknown spaces. Furthermore, to demystify
the learnability of OOD detection, we introduce five representative priori-unknown spaces:
• Single-distribution space $\mathscr{D}_{XY}^{D_{XY}}$: for a domain $D_{XY}$, $\mathscr{D}_{XY}^{D_{XY}} := \{D_{XY}^{\alpha} : \forall \alpha \in [0, 1)\}$.

• Total space $\mathscr{D}_{XY}^{\mathrm{all}}$, which consists of all domains.

• Separate space $\mathscr{D}_{XY}^{s}$, which consists of all domains satisfying the separate condition: for
any $D_{XY} \in \mathscr{D}_{XY}^{s}$, $\mathrm{supp}\, D_{X_O} \cap \mathrm{supp}\, D_{X_I} = \emptyset$, where $\mathrm{supp}$ denotes the support set.

• Finite-ID-distribution space $\mathscr{D}_{XY}^{F}$, which is a priori-unknown space in which the number of
distinct ID joint distributions $D_{X_I Y_I}$ is finite, i.e., $|\{D_{X_I Y_I} : D_{XY} \in \mathscr{D}_{XY}^{F}\}| < +\infty$.

• Density-based space $\mathscr{D}_{XY}^{\mu, b}$, which is a priori-unknown space consisting of domains satisfying the
following: for any $D_{XY}$, there exists a density function $f$ with $1/b \leq f \leq b$ in $\mathrm{supp}\, \mu$ and
$0.5 D_{X_I} + 0.5 D_{X_O} = \int f \,\mathrm{d}\mu$, where $\mu$ is a measure defined over $\mathcal{X}$. Note that if $\mu$ is discrete, then $D_X$ is a
discrete distribution; and if $\mu$ is the Lebesgue measure, then $D_X$ is a continuous distribution.
[Figure 1 here: subfigure (a) "ID and OOD Distributions" depicts $K$ ID classes and OOD data; subfigures (b) "Overlap Exists" and (c) "No Overlap" plot the $\alpha$-risk against the values of $\alpha$ for $n = 15{,}000$, $20{,}000$, $25{,}000$, together with $\inf_h R_D^{\alpha}(h)$.]

Figure 1: Illustration of $\inf_{h \in \mathcal{H}} R_D^{\alpha}(h)$ (solid lines with triangle marks) and the estimated
$\mathbb{E}_{S \sim D_{\mathrm{in}}^n} R_D^{\alpha}(\mathbf{A}(S))$ (dashed lines) with $\alpha \in [0, 1)$ in different scenarios, where $D_{\mathrm{in}} = D_{X_I Y_I}$ and the algorithm $\mathbf{A}$ is the free-energy
OOD detection method [23]. Subfigure (a) shows the ID and OOD distributions; $\mathrm{gap}_{\mathrm{IO}}$ represents the
distance between the support sets of the ID and OOD distributions. In (b), since there is an overlap between the ID
and OOD data, the solid line is a polyline. In (c), since there is no overlap between the ID and OOD data, one can
check that $\inf_{h \in \mathcal{H}} R_D^{\alpha}(h)$ forms a straight line (the solid line). Since the dashed lines are always straight
lines, two observations can be obtained from (b) and (c): 1) the dashed lines cannot approximate the solid polyline in
(b), which implies the unlearnability of OOD detection; and 2) the solid line in (c) is a straight line and may be
approximated by the dashed lines in (c). These observations motivate us to propose Condition 1.
The above representative spaces widely exist in real applications. For example: 1) if images
with different semantic labels and different styles are clearly different, then those images can form
distributions belonging to a separate space $\mathscr{D}_{XY}^{s}$; and 2) when designing an algorithm, we only
have finite ID datasets, e.g., CIFAR-10, MNIST, SVHN, and ImageNet, to build a model, and the
finite-ID-distribution space $\mathscr{D}_{XY}^{F}$ handles this real scenario. Note that the single-distribution space
is a special case of the finite-ID-distribution space. In this paper, we mainly discuss these five spaces.
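As a rough empirical probe of the separate condition, one can estimate the distance between $\mathrm{supp}\, D_{X_I}$ and $\mathrm{supp}\, D_{X_O}$ (the quantity $\mathrm{gap}_{\mathrm{IO}}$ in Figure 1) by the minimum pairwise distance between ID and OOD samples. This is our own illustrative heuristic, not a procedure from the paper, and the uniform supports are hypothetical.

```python
import numpy as np

def estimated_gap(x_id, x_ood):
    # Minimum pairwise Euclidean distance between ID and OOD samples: an empirical
    # proxy (an overestimate) of dist(supp D_{X_I}, supp D_{X_O}).
    d = np.linalg.norm(x_id[:, None, :] - x_ood[None, :, :], axis=-1)
    return float(d.min())

rng = np.random.default_rng(0)
x_id = rng.uniform(0.0, 1.0, size=(500, 2))     # hypothetical ID support: [0, 1]^2
x_ood = rng.uniform(3.0, 4.0, size=(500, 2))    # hypothetical OOD support: [3, 4]^2
print(f"estimated gap_IO ~= {estimated_gap(x_id, x_ood):.2f}")  # > 0: separate condition plausible
```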
4 Impossibility Theorems for OOD Detection
In this section, we first give a necessary condition for the learnability of OOD detection. Then, we
show that this necessary condition does not hold in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$ and the separate space $\mathscr{D}_{XY}^{s}$.
Necessary Condition. We find a necessary condition for the learnability of OOD detection, i.e., Condition 1,
motivated by the experiments in Figure 1. Details of Figure 1 can be found in Appendix C.2.
Condition 1 (Linear Condition). For any $D_{XY} \in \mathscr{D}_{XY}$ and any $\alpha \in [0, 1)$,

$$\inf_{h \in \mathcal{H}} R_D^{\alpha}(h) = (1-\alpha) \inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) + \alpha \inf_{h \in \mathcal{H}} R_D^{\mathrm{out}}(h).$$
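Condition 1 can be probed numerically for a simple hypothesis class: compute $\inf_{h} R_D^{\alpha}(h)$ over a grid of $\alpha$ and compare it with the straight line $(1-\alpha)\inf_h R_D^{\mathrm{in}}(h) + \alpha \inf_h R_D^{\mathrm{out}}(h)$. The sketch below mirrors the message of Figure 1 using threshold detectors and uniform distributions; all of these modeling choices are our own hypothetical ones.

```python
import numpy as np

# One ID class (K = 1) on the real line, with threshold detectors
#   h_t(x) = ID if x <= t, OOD otherwise,
# and the zero-one loss, so the risks have closed forms. Hypothetical setup:
# ID marginal Uniform[0, 1]; OOD marginal Uniform[c, c + 1].
def r_in(t):
    return 1.0 - np.clip(t, 0.0, 1.0)         # R_in(h_t) = P(X_I > t)

def r_out(t, c):
    return np.clip(t - c, 0.0, 1.0)           # R_out(h_t) = P(X_O <= t)

ts = np.linspace(-1.0, 3.5, 2001)             # grid over thresholds, approximating inf over H
for c, name in [(0.5, "overlap"), (2.0, "no overlap")]:
    inf_in, inf_out = r_in(ts).min(), r_out(ts, c).min()
    gaps = []
    for a in np.linspace(0.0, 0.99, 100):
        inf_alpha = ((1 - a) * r_in(ts) + a * r_out(ts, c)).min()
        gaps.append(inf_alpha - ((1 - a) * inf_in + a * inf_out))
    print(f"{name}: max deviation from the linear condition = {max(gaps):.3f}")
# overlap: deviation ~ 0.25 near alpha = 0.5 (Condition 1 fails; a polyline as in Fig. 1(b)).
# no overlap: deviation 0.000 (Condition 1 holds; a straight line as in Fig. 1(c)).
```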
To reveal the importance of Condition 1, Theorem 2 shows that Condition 1 is a necessary and
sufficient condition for the learnability of OOD detection when $\mathscr{D}_{XY}$ is the single-distribution space.
Theorem 2. Given a hypothesis space $\mathcal{H}$ and a domain $D_{XY}$, OOD detection is learnable in the
single-distribution space $\mathscr{D}_{XY}^{D_{XY}}$ for $\mathcal{H}$ if and only if the linear condition (i.e., Condition 1) holds.
Theorem 2 implies that Condition 1 is important for the learnability of OOD detection. Due to the
simplicity of the single-distribution space, Theorem 2 implies that Condition 1 is the necessary condition
for the learnability of OOD detection in the priori-unknown space; see Lemma 1 in Appendix F.
Impossibility Theorems. Here, we first study whether Condition 1 holds in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$. If
Condition 1 does not hold, then OOD detection is not learnable. Theorem 3 shows that Condition 1 is
not always satisfied, especially when there is an overlap between the ID and OOD distributions:
Definition 4 (Overlap Between ID and OOD). We say a domain $D_{XY}$ has an overlap between the ID and
OOD distributions, if there is a $\sigma$-finite measure $\tilde{\mu}$ such that $D_X$ is absolutely continuous with respect
to $\tilde{\mu}$, and $\tilde{\mu}(A_{\mathrm{overlap}}) > 0$, where $A_{\mathrm{overlap}} = \{\mathbf{x} \in \mathcal{X} : f_I(\mathbf{x}) > 0 \text{ and } f_O(\mathbf{x}) > 0\}$. Here $f_I$ and
$f_O$ are the representers of $D_{X_I}$ and $D_{X_O}$ in the Radon–Nikodym theorem [36]:

$$D_{X_I} = \int f_I \,\mathrm{d}\tilde{\mu}, \qquad D_{X_O} = \int f_O \,\mathrm{d}\tilde{\mu}.$$
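For intuition, here is a hypothetical one-dimensional example in which the overlap of Definition 4 can be computed directly: take $\tilde{\mu}$ as the Lebesgue measure and two uniform densities as the Radon–Nikodym representers, so the overlap region is simply where both densities are positive. The distributions are our own illustrative choices.

```python
import numpy as np

# Take mu_tilde as the Lebesgue measure on R, and let f_I, f_O be the densities
# (Radon-Nikodym representers) of two hypothetical uniform distributions:
# D_{X_I} = Uniform[0, 2] and D_{X_O} = Uniform[1.5, 3.5].
def f_id(x):
    return np.where((x >= 0.0) & (x <= 2.0), 0.5, 0.0)

def f_ood(x):
    return np.where((x >= 1.5) & (x <= 3.5), 0.5, 0.0)

# A_overlap = {x : f_I(x) > 0 and f_O(x) > 0} = [1.5, 2.0], so mu(A_overlap) = 0.5 > 0
# and this domain has overlap. We estimate the measure with a Riemann sum on a grid.
x = np.linspace(-1.0, 5.0, 600001)
dx = x[1] - x[0]
mu_overlap = ((f_id(x) > 0) & (f_ood(x) > 0)).sum() * dx
print(f"estimated mu(A_overlap) = {mu_overlap:.3f}")   # ~ 0.5 > 0: overlap exists
```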