Is Out-of-Distribution Detection Learnable?
Zhen Fang1, Yixuan Li2, Jie Lu1, Jiahua Dong3,4, Bo Han5, Feng Liu1,6
1Australian Artificial Intelligence Institute, University of Technology Sydney.
2Department of Computer Sciences, University of Wisconsin-Madison.
3State Key Laboratory of Robotics, Shenyang Institute of Automation,
Chinese Academy of Sciences. 4ETH Zurich, Switzerland.
5Department of Computer Science, Hong Kong Baptist University.
6School of Mathematics and Statistics, University of Melbourne.
{zhen.fang,jie.lu}@uts.edu.au, sharonli@cs.wisc.edu,
dongjiahua1995@gmail.com,bhanml@comp.hkbu.edu.hk,feng.liu1@unimelb.edu.au
Abstract
Supervised learning aims to train a classifier under the assumption that training and
test data come from the same distribution. To relax this assumption, researchers
have studied a more realistic setting: out-of-distribution (OOD) detection, where
test data may come from classes that are unknown during training (i.e., OOD data).
Due to the unavailability and diversity of OOD data, good generalization ability
is crucial for effective OOD detection algorithms. To study the generalization of
OOD detection, in this paper, we investigate the probably approximately correct
(PAC) learning theory of OOD detection, which has been raised as an open problem
by researchers. First, we find a necessary condition for the learnability of OOD
detection. Then, using this condition, we prove several impossibility theorems for
the learnability of OOD detection in some scenarios. Although the impossibility
theorems are discouraging, we find that some of their conditions may not hold in
certain practical scenarios. Based on this observation, we then give several necessary
and sufficient conditions that characterize the learnability of OOD detection in these
practical scenarios. Lastly, we also offer theoretical support for several representative
OOD detection methods based on our theory.
1 Introduction
The success of supervised learning rests on an implicit assumption that training and test data
share the same distribution, i.e., the in-distribution (ID) [1, 2, 3, 4]. However, the test distribution in many
real-world scenarios may violate this assumption and, instead, contain out-of-distribution (OOD) data
whose labels have not been seen during the training process [5, 6]. To mitigate the risk posed by OOD data,
researchers have considered a more practical learning scenario: OOD detection, which determines
whether an input is ID or OOD, while classifying the ID data into their respective classes. OOD detection has
shown great potential for ensuring the reliable deployment of machine learning models in the real world.
A rich line of algorithms has been developed to empirically address the OOD detection problem
[6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, very few works study the theory of OOD
detection, which hinders a rigorous path forward for the field. This paper aims to bridge the gap.
In this paper, we provide a theoretical framework to understand the learnability of the OOD detection
problem. We investigate the probably approximately correct (PAC) learning theory of OOD detection,
which is posed as an open problem to date. Unlike the classical PAC learning theory in a supervised
setting, our problem setting is fundamentally challenging due to the absence of OOD data in training.
In many real-world scenarios, OOD data can be diverse and unknown a priori. Given this, we study
whether there exists an algorithm that can detect various OOD data rather than merely some
specified OOD data. Such is the significance of studying the learning theory of OOD detection [4].
This motivates our question: is OOD detection PAC learnable? That is, is there a PAC learning theory
that guarantees the generalization ability of OOD detection?
To investigate the learning theory, we mainly focus on two basic spaces: the domain space and the hypothesis
space. The domain space is a space consisting of some distributions, and the hypothesis space is a
space consisting of some classifiers. Existing agnostic PAC theories in supervised learning [21, 22]
are distribution-free, i.e., the domain space consists of all domains. Yet, in Theorem 4, we show that
the learning theory of OOD detection is not distribution-free. In fact, we discover that OOD detection
is learnable only if the domain space and the hypothesis space satisfy some special conditions, e.g.,
Conditions 1 and 3. Notably, there are many conditions and theorems in existing learning theories,
and many OOD detection algorithms in the literature. It is therefore very difficult to analyze the relation
between these theories and algorithms and to identify useful conditions that ensure the learnability of
OOD detection, especially when exploring them from scratch. Thus, the main aim of our
paper is to study these essential conditions. From these essential conditions, we can know when OOD
detection can succeed in practical scenarios. We restate our question and goal as follows:
Given hypothesis spaces and several representative domain spaces, what are
the conditions to ensure the learnability of OOD detection? If possible, we
hope that these conditions are necessary and sufficient in some scenarios.
Main Results. We investigate the learnability of OOD detection starting from the largest space, the
total space, and give a necessary condition (Condition 1) for learnability. However, we find that
overlap between ID and OOD data may cause this necessary condition to fail.
Therefore, we give an impossibility theorem demonstrating that OOD detection fails in the total
space (Theorem 4). Next, we study OOD detection in the separate space, where there is no overlap
between the ID and OOD data. Unfortunately, there still exists an impossibility theorem (Theorem 5),
which demonstrates that OOD detection is not learnable in the separate space under certain conditions.
Although the impossibility theorems obtained in the separate space are discouraging, we find that some
of their conditions may not hold in certain practical scenarios. Based on this
observation, we give several necessary and sufficient conditions that characterize the learnability of
OOD detection in the separate space (Theorems 6 and 10). In particular, when the model is a
fully-connected neural network (FCNN), OOD detection is learnable in the separate space if and
only if the feature space is finite. Furthermore, we investigate the learnability of OOD detection in
other, more practical domain spaces, e.g., the finite-ID-distribution space (Theorem 8) and the density-
based space (Theorem 9). By studying the finite-ID-distribution space, we discover a compatibility
condition (Condition 3) that is necessary and sufficient for this space. We then further
investigate the compatibility condition in the density-based space and find that it is also
necessary and sufficient in some practical scenarios (Theorem 11).
Implications and Impacts of Theory. Our study is not of purely theoretical interest; it also has
practical impacts. First, when we design OOD detection algorithms, we normally have only finite ID
datasets, corresponding to the finite-ID-distribution space. In this case, Theorem 8 gives the necessary
and sufficient condition for the success of OOD detection. Second, our theory provides theoretical
support (Theorems 10 and 11) for several representative OOD detection works [7, 8, 23]. Third, our
theory shows that OOD detection is learnable in image-based scenarios when the ID images have clearly
different semantic labels and styles (far-OOD) from the OOD images. Fourth, we should not expect a
universally working algorithm; it is necessary to design different algorithms for different scenarios.
2 Learning Setups
We start by introducing the necessary concepts and notations for our theoretical framework. Given
a feature space $\mathcal{X} \subset \mathbb{R}^d$ and a label space $\mathcal{Y} := \{1, ..., K\}$, we have an ID joint distribution $D_{X_I Y_I}$
over $\mathcal{X} \times \mathcal{Y}$, where $X_I \in \mathcal{X}$ and $Y_I \in \mathcal{Y}$ are random variables. We also have an OOD joint
distribution $D_{X_O Y_O}$, where $X_O$ is a random variable from $\mathcal{X}$, but $Y_O$ is a random variable whose
outputs do not belong to $\mathcal{Y}$. During testing, we will meet a mixture of the ID and OOD joint distributions,
$D_{XY} := (1 - \pi^{\mathrm{out}}) D_{X_I Y_I} + \pi^{\mathrm{out}} D_{X_O Y_O}$, and can only observe the marginal distribution
$D_X := (1 - \pi^{\mathrm{out}}) D_{X_I} + \pi^{\mathrm{out}} D_{X_O}$, where the constant $\pi^{\mathrm{out}} \in [0, 1)$ is an unknown class-prior probability.
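To make this setup concrete, here is a minimal simulation sketch. It is our own illustration, not the paper's code: the Gaussian parameters and the value of $\pi^{\mathrm{out}}$ are hypothetical. It emphasizes that a detector observes only draws from the mixed marginal $D_X$, never the latent ID/OOD indicator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_id(n):
    # ID marginal D_{X_I}: two Gaussian classes in R^2 (hypothetical parameters)
    y = rng.integers(0, 2, size=n)                       # class labels in {0, 1}
    x = rng.normal(loc=y[:, None] * 4.0, scale=1.0, size=(n, 2))
    return x, y

def sample_ood(n):
    # OOD marginal D_{X_O}: a Gaussian far from both ID classes (hypothetical)
    return rng.normal(loc=-6.0, scale=1.0, size=(n, 2))

def sample_test(n, pi_out=0.3):
    # Draw from D_X = (1 - pi_out) D_{X_I} + pi_out D_{X_O}.
    # The component indicator is latent: a detector only ever sees x.
    is_ood = rng.random(n) < pi_out
    x = np.empty((n, 2))
    x[~is_ood] = sample_id(int((~is_ood).sum()))[0]
    x[is_ood] = sample_ood(int(is_ood.sum()))
    return x, is_ood                                     # is_ood returned for evaluation only

x_test, is_ood = sample_test(1000)
print(f"test points observed: {len(x_test)}, latent OOD fraction: {is_ood.mean():.2f}")
```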
Problem 1 (OOD Detection [4]). Given an ID joint distribution $D_{X_I Y_I}$ and training data
$S := \{(\mathbf{x}^1, y^1), ..., (\mathbf{x}^n, y^n)\}$ drawn independent and identically distributed from $D_{X_I Y_I}$, the aim of OOD
detection is to train a classifier $f$ by using the training data $S$ such that, for any test data $\mathbf{x}$ drawn
from the mixed marginal distribution $D_X$: 1) if $\mathbf{x}$ is an observation from $D_{X_I}$, $f$ can classify $\mathbf{x}$ into
the correct ID class; and 2) if $\mathbf{x}$ is an observation from $D_{X_O}$, $f$ can detect $\mathbf{x}$ as OOD data.
According to the survey [4], when $K > 1$, OOD detection is also known as open-set recognition
or open-set learning [24, 25]; and when $K = 1$, OOD detection reduces to one-class novelty detection
and semantic anomaly detection [26, 27, 28].
OOD Label and Domain Space. Based on Problem 1, it is not necessary to classify OOD
data into correct OOD classes. Without loss of generality, we allocate all OOD data to one
big OOD class, i.e., $Y_O = K + 1$ [24, 29]. To investigate the PAC learnability of OOD detection,
we define a domain space $\mathscr{D}_{XY}$, which is a set consisting of some joint distributions $D_{XY}$, each mixed from
an ID joint distribution and an OOD joint distribution. In this paper, the joint distribution
$D_{XY}$ mixed from the ID joint distribution $D_{X_I Y_I}$ and the OOD joint distribution $D_{X_O Y_O}$ is called a domain.
Hypothesis Spaces and Scoring Function Spaces. A hypothesis space $\mathcal{H}$ is a subset of a function
space, i.e., $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y} \cup \{K+1\}\}$. We set $\mathcal{H}^{\mathrm{in}} \subset \{h : \mathcal{X} \to \mathcal{Y}\}$ to be the ID hypothesis space.
We also define $\mathcal{H}^{\mathrm{b}} \subset \{h : \mathcal{X} \to \{1, 2\}\}$ as the hypothesis space for binary classification, where $1$
represents ID data and $2$ represents OOD data. The function $h$ is called a hypothesis
function. A scoring function space is a subset of a function space, i.e., $\mathcal{F}_l \subset \{\mathbf{f} : \mathcal{X} \to \mathbb{R}^l\}$, where $l$ is
the output dimension of the vector-valued function $\mathbf{f}$. The function $\mathbf{f}$ is called a scoring function.
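To illustrate how a scoring function induces a hypothesis function, the sketch below thresholds an energy-style score, in the spirit of the free-energy method [23], to map $K$ logits to a label in $\mathcal{Y} \cup \{K+1\}$. The affine scoring function and the threshold $\tau$ are hypothetical choices for illustration, not the paper's construction.

```python
import numpy as np

K = 3  # number of ID classes

def scoring_function(x, W, b):
    # f in F_K: X -> R^K; an affine scoring function, chosen purely for illustration
    return x @ W + b

def induced_hypothesis(x, W, b, tau):
    # Energy-style score in the spirit of the free-energy method [23]:
    # a low -logsumexp(logits) value suggests ID. tau is a hypothetical threshold.
    logits = scoring_function(x, W, b)
    energy = -np.logaddexp.reduce(logits, axis=1)
    y_hat = np.argmax(logits, axis=1) + 1            # ID classes are 1, ..., K
    y_hat[energy > tau] = K + 1                      # otherwise: the OOD class K + 1
    return y_hat

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, K)), np.zeros(K)
x = rng.normal(size=(5, 2))
print(induced_hypothesis(x, W, b, tau=0.0))          # values in {1, ..., K + 1}
```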
Loss and Risks. Let $\mathcal{Y}_{\mathrm{all}} = \mathcal{Y} \cup \{K+1\}$. Given a loss function $\ell : \mathcal{Y}_{\mathrm{all}} \times \mathcal{Y}_{\mathrm{all}} \to \mathbb{R}_{\geq 0}$ satisfying
that $\ell(y_1, y_2) = 0$ if and only if $y_1 = y_2$,² and any $h \in \mathcal{H}$, the risk with respect to $D_{XY}$ is

$$R_D(h) := \mathbb{E}_{(\mathbf{x}, y) \sim D_{XY}}\, \ell(h(\mathbf{x}), y). \tag{1}$$

The $\alpha$-risk is $R_D^{\alpha}(h) := (1-\alpha) R_D^{\mathrm{in}}(h) + \alpha R_D^{\mathrm{out}}(h)$ for $\alpha \in [0, 1]$, where the risks $R_D^{\mathrm{in}}(h)$ and $R_D^{\mathrm{out}}(h)$ are

$$R_D^{\mathrm{in}}(h) := \mathbb{E}_{(\mathbf{x}, y) \sim D_{X_I Y_I}}\, \ell(h(\mathbf{x}), y), \qquad R_D^{\mathrm{out}}(h) := \mathbb{E}_{\mathbf{x} \sim D_{X_O}}\, \ell(h(\mathbf{x}), K+1).$$
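As a sanity check on these definitions, the empirical counterparts of $R_D^{\mathrm{in}}$, $R_D^{\mathrm{out}}$ and the $\alpha$-risk can be computed from samples. The sketch below uses the zero-one loss and a toy threshold hypothesis; both are our own hypothetical choices, since the paper works with a general bounded loss $\ell$.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # l(y1, y2) = 0 iff y1 == y2, else 1: a bounded loss on Y_all x Y_all
    return (y_pred != y_true).astype(float)

def empirical_alpha_risk(h, x_id, y_id, x_ood, K, alpha):
    # Empirical R_in, R_out and R^alpha_D(h) = (1 - alpha) R_in(h) + alpha R_out(h)
    r_in = zero_one_loss(h(x_id), y_id).mean()
    r_out = zero_one_loss(h(x_ood), np.full(len(x_ood), K + 1)).mean()
    return (1.0 - alpha) * r_in + alpha * r_out

# Usage with a toy one-ID-class hypothesis (hypothetical): x[0] < 0 means OOD.
K = 1
h = lambda x: np.where(x[:, 0] < 0.0, K + 1, 1)
rng = np.random.default_rng(0)
x_id = rng.normal(2.0, 1.0, size=(100, 2))     # hypothetical ID samples, all labeled 1
y_id = np.ones(100, dtype=int)
x_ood = rng.normal(-2.0, 1.0, size=(100, 2))   # hypothetical OOD samples
print(empirical_alpha_risk(h, x_id, y_id, x_ood, K, alpha=0.5))
```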
Learnability. We aim to select a hypothesis function $h \in \mathcal{H}$ with approximately minimal risk, based
on finite data. Generally, we expect the approximation to improve as the sample size increases.
Algorithms achieving this are said to be consistent. Formally, we introduce the following definition:
Definition 1 (Learnability of OOD Detection). Given a domain space $\mathscr{D}_{XY}$ and a hypothesis space
$\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}_{\mathrm{all}}\}$, we say OOD detection is learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$, if there exist an
algorithm³ $\mathbf{A} : \cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\mathrm{cons}}(n)$, such that
$\epsilon_{\mathrm{cons}}(n) \to 0$ as $n \to +\infty$, and for any domain $D_{XY} \in \mathscr{D}_{XY}$,

$$\mathbb{E}_{S \sim D_{X_I Y_I}^n} R_D(\mathbf{A}(S)) - \inf_{h \in \mathcal{H}} R_D(h) \leq \epsilon_{\mathrm{cons}}(n). \tag{2}$$

An algorithm $\mathbf{A}$ for which this holds is said to be consistent with respect to $\mathscr{D}_{XY}$.
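For intuition about what an "algorithm" means here (footnote 3: a mapping from finite ID samples to $\mathcal{H}$), the following sketch implements empirical risk minimization over a small finite hypothesis class of interval detectors. The class and data are hypothetical; whether such an ERM rule, trained on ID data only, also keeps $R_D^{\mathrm{out}}$ small is exactly the question the rest of the paper studies.

```python
import numpy as np

K = 2  # Y = {1, 2}; Y_all = {1, 2, 3}, where 3 = K + 1 is the OOD class

def make_hypothesis(t):
    # h_t: class 1 on (-inf, t), class 2 on [t, t + 5), OOD (K + 1) elsewhere
    def h(x):
        out = np.full(len(x), K + 1)
        out[(x >= t) & (x < t + 5.0)] = 2
        out[x < t] = 1
        return out
    return h

H = {t: make_hypothesis(t) for t in np.linspace(-3.0, 3.0, 25)}  # finite hypothesis class

def algorithm_A(S_x, S_y):
    # ERM: map an ID sample S ~ D^n_{X_I Y_I} to the hypothesis in H with the
    # smallest empirical ID risk (zero-one loss). OOD data are never seen here.
    return min(H.values(), key=lambda h: float((h(S_x) != S_y).mean()))

rng = np.random.default_rng(0)
S_x = np.concatenate([rng.normal(-1.0, 0.3, 200), rng.normal(1.0, 0.3, 200)])
S_y = np.concatenate([np.full(200, 1), np.full(200, 2)])
h_hat = algorithm_A(S_x, S_y)
print(h_hat(np.array([-1.0, 1.0, 50.0])))  # e.g. [1 2 3]: the far-away point is flagged OOD
```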
Definition 1 is a natural extension of the agnostic PAC learnability of supervised learning [30]. If
$\pi^{\mathrm{out}} = 0$ for any $D_{XY} \in \mathscr{D}_{XY}$, then Definition 1 reduces to the agnostic PAC learnability of supervised learning.
Although the expression of Definition 1 differs from the usual definition of agnostic PAC
learning in [21], one can easily prove that they are equivalent when $\ell$ is bounded; see Appendix D.3.
Since OOD data are unavailable, it is impossible to obtain information about the class-prior probability
$\pi^{\mathrm{out}}$. Furthermore, in the real world, it is possible that $\pi^{\mathrm{out}}$ can be any value in $[0, 1)$. Therefore,
the imbalance issue between the ID and OOD distributions and the priori-unknown issue (i.e., $\pi^{\mathrm{out}}$ is
unknown) are the core challenges. To ease these challenges, researchers use AUROC, AUPR and
FPR95 to estimate the performance of OOD detection [18, 31, 32, 33, 34, 35]. It seems that there is a
gap between Definition 1 and existing works. To eliminate this gap, we revise Eq. (2) as follows:

$$\mathbb{E}_{S \sim D_{X_I Y_I}^n} R_D^{\alpha}(\mathbf{A}(S)) - \inf_{h \in \mathcal{H}} R_D^{\alpha}(h) \leq \epsilon_{\mathrm{cons}}(n), \quad \forall \alpha \in [0, 1]. \tag{3}$$
If an algorithm $\mathbf{A}$ satisfies Eq. (3), then the imbalance issue and the priori-unknown issue disappear.
That is, $\mathbf{A}$ can simultaneously classify the ID data and detect the OOD data well. Based on the above
discussion, we define the strong learnability of OOD detection as follows:
²Note that $\mathcal{Y}_{\mathrm{all}} \times \mathcal{Y}_{\mathrm{all}}$ is a finite set; therefore, $\ell$ is bounded.
³Similar to [30], in this paper, we regard an algorithm as a mapping from $\cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n$ to $\mathcal{H}$.
Definition 2 (Strong Learnability of OOD Detection). Given a domain space $\mathscr{D}_{XY}$ and a hypothesis
space $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}_{\mathrm{all}}\}$, we say OOD detection is strongly learnable in $\mathscr{D}_{XY}$ for $\mathcal{H}$, if there
exist an algorithm $\mathbf{A} : \cup_{n=1}^{+\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$ and a monotonically decreasing sequence $\epsilon_{\mathrm{cons}}(n)$,
such that $\epsilon_{\mathrm{cons}}(n) \to 0$ as $n \to +\infty$, and for any domain $D_{XY} \in \mathscr{D}_{XY}$,

$$\mathbb{E}_{S \sim D_{X_I Y_I}^n} R_D^{\alpha}(\mathbf{A}(S)) - \inf_{h \in \mathcal{H}} R_D^{\alpha}(h) \leq \epsilon_{\mathrm{cons}}(n), \quad \forall \alpha \in [0, 1].$$
In Theorem 1, we show that the strong learnability of OOD detection is equivalent to the
learnability of OOD detection if the domain space $\mathscr{D}_{XY}$ is a priori-unknown space (see Definition 3).
In this paper, we mainly discuss learnability in priori-unknown spaces. Therefore, when we
mention that OOD detection is learnable, we also mean that OOD detection is strongly learnable.
Goal of Theory. Note that the agnostic PAC learnability of supervised learning is distribution-free,
i.e., the domain space $\mathscr{D}_{XY}$ consists of all domains. However, due to the absence of OOD data
during the training process [8, 14, 24], the learnability of OOD detection is not
distribution-free (see Theorem 4). In fact, we discover that the learnability of OOD detection is
deeply correlated with the relationship between the domain space $\mathscr{D}_{XY}$ and the hypothesis space $\mathcal{H}$.
That is, OOD detection is learnable only when the domain space $\mathscr{D}_{XY}$ and the hypothesis space $\mathcal{H}$
satisfy some special conditions, e.g., Condition 1 and Condition 3. We present our goal as follows:
Goal: given a hypothesis space $\mathcal{H}$ and several representative domain spaces $\mathscr{D}_{XY}$,
what are the conditions to ensure the learnability of OOD detection? Furthermore, if
possible, we hope that these conditions are necessary and sufficient in some scenarios.
Therefore, compared to the agnostic PAC learnability of supervised learning, our theory does not
focus on the distribution-free case, but instead focuses on discovering essential conditions that guarantee the
learnability of OOD detection in several representative and practical domain spaces $\mathscr{D}_{XY}$. From these
essential conditions, we can know when OOD detection can succeed in real applications.
3 Learning in Priori-unknown Spaces
We first investigate a special space, called the priori-unknown space, in which Definition 1 and
Definition 2 are equivalent. Furthermore, we also prove that if OOD detection is strongly learnable
in a space $\mathscr{D}_{XY}$, then one can construct a larger, priori-unknown domain space in which OOD detection
remains learnable. These results imply that it suffices to consider our theory in
priori-unknown spaces. The priori-unknown space is introduced as follows:
Definition 3. Given a domain space $\mathscr{D}_{XY}$, we say $\mathscr{D}_{XY}$ is a priori-unknown space, if for any domain
$D_{XY} \in \mathscr{D}_{XY}$ and any $\alpha \in [0, 1)$, we have $D_{XY}^{\alpha} := (1-\alpha) D_{X_I Y_I} + \alpha D_{X_O Y_O} \in \mathscr{D}_{XY}$.
Theorem 1. Given domain spaces $\mathscr{D}_{XY}$ and $\mathscr{D}'_{XY} = \{D_{XY}^{\alpha} : D_{XY} \in \mathscr{D}_{XY}, \forall \alpha \in [0, 1)\}$, then
1) $\mathscr{D}'_{XY}$ is a priori-unknown space and $\mathscr{D}_{XY} \subset \mathscr{D}'_{XY}$;
2) if $\mathscr{D}_{XY}$ is a priori-unknown space, then Definition 1 and Definition 2 are equivalent;
3) OOD detection is strongly learnable in $\mathscr{D}_{XY}$ if and only if OOD detection is learnable in $\mathscr{D}'_{XY}$.
The second result of Theorem 1 bridges learnability and strong learnability: it implies that
if an algorithm $\mathbf{A}$ is consistent with respect to a priori-unknown space, then $\mathbf{A}$ can
handle both the imbalance issue between the ID and OOD distributions and the priori-unknown issue.
Based on Theorem 1, we focus our theory on priori-unknown spaces. Furthermore, to demystify
the learnability of OOD detection, we introduce five representative priori-unknown spaces:
• Single-distribution space $\mathscr{D}_{XY}^{D_{XY}}$: for a domain $D_{XY}$, $\mathscr{D}_{XY}^{D_{XY}} := \{D_{XY}^{\alpha} : \forall \alpha \in [0, 1)\}$.

• Total space $\mathscr{D}_{XY}^{\mathrm{all}}$, which consists of all domains.

• Separate space $\mathscr{D}_{XY}^{s}$, which consists of all domains satisfying the separate condition: for
any $D_{XY} \in \mathscr{D}_{XY}^{s}$, $\mathrm{supp}\, D_{X_O} \cap \mathrm{supp}\, D_{X_I} = \emptyset$, where $\mathrm{supp}$ denotes the support set.

• Finite-ID-distribution space $\mathscr{D}_{XY}^{F}$, which is a priori-unknown space in which the number of
distinct ID joint distributions $D_{X_I Y_I}$ is finite, i.e., $|\{D_{X_I Y_I} : D_{XY} \in \mathscr{D}_{XY}^{F}\}| < +\infty$.

• Density-based space $\mathscr{D}_{XY}^{\mu, b}$, which is a priori-unknown space consisting of domains satisfying the
following: for any $D_{XY}$, there exists a density function $f$ with $1/b \leq f \leq b$ in $\mathrm{supp}\, \mu$ and
$0.5 D_{X_I} + 0.5 D_{X_O} = \int f \,\mathrm{d}\mu$, where $\mu$ is a measure defined over $\mathcal{X}$. Note that if $\mu$ is discrete, then $D_X$ is a
discrete distribution; and if $\mu$ is the Lebesgue measure, then $D_X$ is a continuous distribution.
[Figure 1 here: subfigure (a) "ID and OOD Distributions" depicts $K$ ID classes and OOD data; subfigures (b) "Overlap Exists" and (c) "No Overlap" plot the $\alpha$-risk against the values of $\alpha$ for $n = 15{,}000$, $20{,}000$, $25{,}000$, together with $\inf_h R_D^{\alpha}(h)$.]

Figure 1: Illustration of $\inf_{h \in \mathcal{H}} R_D^{\alpha}(h)$ (solid lines with triangle marks) and the estimated
$\mathbb{E}_{S \sim D_{\mathrm{in}}^n} R_D^{\alpha}(\mathbf{A}(S))$ (dashed lines) with $\alpha \in [0, 1)$ in different scenarios, where $D_{\mathrm{in}} = D_{X_I Y_I}$ and the algorithm $\mathbf{A}$ is the free-energy
OOD detection method [23]. Subfigure (a) shows the ID and OOD distributions; $\mathrm{gap}_{\mathrm{IO}}$ represents the
distance between the support sets of the ID and OOD distributions. In (b), since there is an overlap between the ID
and OOD data, the solid line is a polyline. In (c), since there is no overlap between the ID and OOD data, one can
check that $\inf_{h \in \mathcal{H}} R_D^{\alpha}(h)$ forms a straight line (the solid line). Since the dashed lines are always straight
lines, two observations can be obtained from (b) and (c): 1) the dashed lines cannot approximate the solid polyline in
(b), which implies the unlearnability of OOD detection; and 2) the solid line in (c) is a straight line and may be
approximated by the dashed lines in (c). These observations motivate us to propose Condition 1.
The above representative spaces widely exist in real applications. For example: 1) if images
with different semantic labels and different styles are clearly different, then those images can form
distributions belonging to a separate space $\mathscr{D}_{XY}^{s}$; and 2) when designing an algorithm, we only
have finite ID datasets, e.g., CIFAR-10, MNIST, SVHN, and ImageNet, to build a model, and the
finite-ID-distribution space $\mathscr{D}_{XY}^{F}$ handles this real scenario. Note that the single-distribution space
is a special case of the finite-ID-distribution space. In this paper, we mainly discuss these five spaces.
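As a rough empirical probe of the separate condition, one can estimate the distance between $\mathrm{supp}\, D_{X_I}$ and $\mathrm{supp}\, D_{X_O}$ (the quantity $\mathrm{gap}_{\mathrm{IO}}$ in Figure 1) by the minimum pairwise distance between ID and OOD samples. This is our own illustrative heuristic, not a procedure from the paper, and the uniform supports are hypothetical.

```python
import numpy as np

def estimated_gap(x_id, x_ood):
    # Minimum pairwise Euclidean distance between ID and OOD samples: an empirical
    # proxy (an overestimate) of dist(supp D_{X_I}, supp D_{X_O}).
    d = np.linalg.norm(x_id[:, None, :] - x_ood[None, :, :], axis=-1)
    return float(d.min())

rng = np.random.default_rng(0)
x_id = rng.uniform(0.0, 1.0, size=(500, 2))     # hypothetical ID support: [0, 1]^2
x_ood = rng.uniform(3.0, 4.0, size=(500, 2))    # hypothetical OOD support: [3, 4]^2
print(f"estimated gap_IO ~= {estimated_gap(x_id, x_ood):.2f}")  # > 0: separate condition plausible
```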
4 Impossibility Theorems for OOD Detection
In this section, we first give a necessary condition for the learnability of OOD detection. Then, we
show that this necessary condition does not hold in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$ and the separate space $\mathscr{D}_{XY}^{s}$.
Necessary Condition. We find a necessary condition for the learnability of OOD detection, i.e., Condition 1,
motivated by the experiments in Figure 1. Details of Figure 1 can be found in Appendix C.2.
Condition 1 (Linear Condition). For any $D_{XY} \in \mathscr{D}_{XY}$ and any $\alpha \in [0, 1)$,

$$\inf_{h \in \mathcal{H}} R_D^{\alpha}(h) = (1-\alpha) \inf_{h \in \mathcal{H}} R_D^{\mathrm{in}}(h) + \alpha \inf_{h \in \mathcal{H}} R_D^{\mathrm{out}}(h).$$
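Condition 1 can be probed numerically for a simple hypothesis class: compute $\inf_{h} R_D^{\alpha}(h)$ over a grid of $\alpha$ and compare it with the straight line $(1-\alpha)\inf_h R_D^{\mathrm{in}}(h) + \alpha \inf_h R_D^{\mathrm{out}}(h)$. The sketch below mirrors the message of Figure 1 using threshold detectors and uniform distributions; all of these modeling choices are our own hypothetical ones.

```python
import numpy as np

# One ID class (K = 1) on the real line, with threshold detectors
#   h_t(x) = ID if x <= t, OOD otherwise,
# and the zero-one loss, so the risks have closed forms. Hypothetical setup:
# ID marginal Uniform[0, 1]; OOD marginal Uniform[c, c + 1].
def r_in(t):
    return 1.0 - np.clip(t, 0.0, 1.0)         # R_in(h_t) = P(X_I > t)

def r_out(t, c):
    return np.clip(t - c, 0.0, 1.0)           # R_out(h_t) = P(X_O <= t)

ts = np.linspace(-1.0, 3.5, 2001)             # grid over thresholds, approximating inf over H
for c, name in [(0.5, "overlap"), (2.0, "no overlap")]:
    inf_in, inf_out = r_in(ts).min(), r_out(ts, c).min()
    gaps = []
    for a in np.linspace(0.0, 0.99, 100):
        inf_alpha = ((1 - a) * r_in(ts) + a * r_out(ts, c)).min()
        gaps.append(inf_alpha - ((1 - a) * inf_in + a * inf_out))
    print(f"{name}: max deviation from the linear condition = {max(gaps):.3f}")
# overlap: deviation ~ 0.25 near alpha = 0.5 (Condition 1 fails; a polyline as in Fig. 1(b)).
# no overlap: deviation 0.000 (Condition 1 holds; a straight line as in Fig. 1(c)).
```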
To reveal the importance of Condition 1, Theorem 2 shows that Condition 1 is a necessary and
sufficient condition for the learnability of OOD detection when $\mathscr{D}_{XY}$ is the single-distribution space.
Theorem 2. Given a hypothesis space $\mathcal{H}$ and a domain $D_{XY}$, OOD detection is learnable in the
single-distribution space $\mathscr{D}_{XY}^{D_{XY}}$ for $\mathcal{H}$ if and only if the linear condition (i.e., Condition 1) holds.
Theorem 2 implies that Condition 1 is important for the learnability of OOD detection. Due to the
simplicity of the single-distribution space, Theorem 2 implies that Condition 1 is the necessary condition
for the learnability of OOD detection in the priori-unknown space; see Lemma 1 in Appendix F.
Impossibility Theorems. Here, we first study whether Condition 1 holds in the total space $\mathscr{D}_{XY}^{\mathrm{all}}$. If
Condition 1 does not hold, then OOD detection is not learnable. Theorem 3 shows that Condition 1 is
not always satisfied, especially when there is an overlap between the ID and OOD distributions:
Definition 4 (Overlap Between ID and OOD). We say a domain $D_{XY}$ has an overlap between the ID and
OOD distributions, if there is a $\sigma$-finite measure $\tilde{\mu}$ such that $D_X$ is absolutely continuous with respect
to $\tilde{\mu}$, and $\tilde{\mu}(A_{\mathrm{overlap}}) > 0$, where $A_{\mathrm{overlap}} = \{\mathbf{x} \in \mathcal{X} : f_I(\mathbf{x}) > 0 \text{ and } f_O(\mathbf{x}) > 0\}$. Here $f_I$ and
$f_O$ are the representers of $D_{X_I}$ and $D_{X_O}$ in the Radon–Nikodym theorem [36]:

$$D_{X_I} = \int f_I \,\mathrm{d}\tilde{\mu}, \qquad D_{X_O} = \int f_O \,\mathrm{d}\tilde{\mu}.$$
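For intuition, here is a hypothetical one-dimensional example in which the overlap of Definition 4 can be computed directly: take $\tilde{\mu}$ as the Lebesgue measure and two uniform densities as the Radon–Nikodym representers, so the overlap region is simply where both densities are positive. The distributions are our own illustrative choices.

```python
import numpy as np

# Take mu_tilde as the Lebesgue measure on R, and let f_I, f_O be the densities
# (Radon-Nikodym representers) of two hypothetical uniform distributions:
# D_{X_I} = Uniform[0, 2] and D_{X_O} = Uniform[1.5, 3.5].
def f_id(x):
    return np.where((x >= 0.0) & (x <= 2.0), 0.5, 0.0)

def f_ood(x):
    return np.where((x >= 1.5) & (x <= 3.5), 0.5, 0.0)

# A_overlap = {x : f_I(x) > 0 and f_O(x) > 0} = [1.5, 2.0], so mu(A_overlap) = 0.5 > 0
# and this domain has overlap. We estimate the measure with a Riemann sum on a grid.
x = np.linspace(-1.0, 5.0, 600001)
dx = x[1] - x[0]
mu_overlap = ((f_id(x) > 0) & (f_ood(x) > 0)).sum() * dx
print(f"estimated mu(A_overlap) = {mu_overlap:.3f}")   # ~ 0.5 > 0: overlap exists
```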