
In many real-world scenarios, OOD data can be diverse and unknown a priori. Given this, we study
whether there exists an algorithm that can detect various OOD data rather than merely some
specified OOD data. This is the significance of studying a learning theory for OOD detection [4].
It motivates our question: is OOD detection PAC learnable? That is, is there a PAC learning theory
that guarantees the generalization ability of OOD detection?
To investigate the learning theory, we mainly focus on two basic spaces: the domain space and the
hypothesis space. The domain space is a space consisting of distributions, and the hypothesis space
is a space consisting of classifiers. Existing agnostic PAC theories in supervised learning [21, 22]
are distribution-free, i.e., the domain space consists of all domains. Yet, in Theorem 4, we show that
the learning theory of OOD detection is not distribution-free. In fact, we discover that OOD detection
is learnable only if the domain space and the hypothesis space satisfy some special conditions, e.g.,
Conditions 1 and 3. Notably, there are many conditions and theorems in existing learning theories
and many OOD detection algorithms in the literature. Thus, it is very difficult to analyze the relation
between these theories and algorithms and to identify useful conditions that ensure the learnability of
OOD detection, especially when we have to explore them from scratch. Hence, the main aim of our
paper is to study these essential conditions. From them, we can know when OOD detection can
succeed in practical scenarios. We restate our question and goal as follows:
Given hypothesis spaces and several representative domain spaces, what are
the conditions to ensure the learnability of OOD detection? If possible, we
hope that these conditions are necessary and sufficient in some scenarios.
Given hypothesis spaces and several representative domain spaces, what are
the conditions to ensure the learnability of OOD detection? If possible, we
hope that these conditions are necessary and sufficient in some scenarios.
Main Results.
We investigate the learnability of OOD detection starting from the largest space, the
total space, and give a necessary condition (Condition 1) for learnability. However, we find that
the overlap between ID and OOD data may cause this necessary condition to fail.
Therefore, we give an impossibility theorem demonstrating that OOD detection fails in the total
space (Theorem 4). Next, we study OOD detection in the separate space, where there is no overlap
between ID and OOD data. Unfortunately, an impossibility theorem still exists (Theorem 5),
which demonstrates that OOD detection is not learnable in the separate space under certain conditions.
Although the impossibility theorems obtained in the separate space are frustrating, we find that some
conditions of these theorems may not hold in certain practical scenarios. Based on this
observation, we give several necessary and sufficient conditions characterizing the learnability of
OOD detection in the separate space (Theorems 6 and 10). In particular, when the model is a
fully-connected neural network (FCNN), OOD detection is learnable in the separate space if and
only if the feature space is finite. Furthermore, we investigate the learnability of OOD detection in
other, more practical domain spaces, e.g., the finite-ID-distribution space (Theorem 8) and the density-
based space (Theorem 9). By studying the finite-ID-distribution space, we discover a compatibility
condition (Condition 3) that is necessary and sufficient for this space. Next, we further
investigate the compatibility condition in the density-based space, and find that it is also
necessary and sufficient in some practical scenarios (Theorem 11).
Implications and Impacts of Theory.
Our study is not of purely theoretical interest; it also has
practical impacts. First, when designing OOD detection algorithms, we normally only have finite ID
datasets, corresponding to the finite-ID-distribution space. In this case, Theorem 8 gives the necessary
and sufficient condition for the success of OOD detection. Second, our theory provides theoretical
support (Theorems 10 and 11) for several representative OOD detection works [7, 8, 23]. Third, our
theory shows that OOD detection is learnable in image-based scenarios when ID images have clearly
different semantic labels and styles (far-OOD) from OOD images. Fourth, we should not expect a
universally working algorithm; it is necessary to design different algorithms for different scenarios.
2 Learning Setups
We start by introducing the necessary concepts and notations for our theoretical framework. Given
a feature space X ⊂ R^d and a label space Y := {1, ..., K}, we have an ID joint distribution
D_{X_I Y_I} over X × Y, where X_I ∈ X and Y_I ∈ Y are random variables. We also have an OOD joint
distribution D_{X_O Y_O}, where X_O is a random variable from X, but Y_O is a random variable whose
outputs do not belong to Y. During testing, we will meet a mixture of ID and OOD joint distributions:
D_{XY} := (1 − π_out) D_{X_I Y_I} + π_out D_{X_O Y_O}, and can only observe the marginal distribution
D_X := (1 − π_out) D_{X_I} + π_out D_{X_O}, where the constant π_out ∈ [0, 1) is an unknown class-prior probability.
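To make the test-time setup concrete, the following is a minimal numerical sketch of sampling from the mixture D_X = (1 − π_out) D_{X_I} + π_out D_{X_O}. The Gaussian choices for the ID and OOD marginals, the dimension d, and the value of π_out are illustrative assumptions, not part of the theory; the point is that the learner observes only the mixed features, while the ID/OOD indicator remains hidden.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (for illustration only): ID marginal D_XI is N(0, I),
# OOD marginal D_XO is N(5, I), and pi_out is the class-prior probability
# of OOD data, which is unknown to the learner in practice.
d = 2          # feature-space dimension
pi_out = 0.3   # mixture weight of the OOD component
n = 10_000     # number of test-time samples

# Sample from the mixture D_X = (1 - pi_out) * D_XI + pi_out * D_XO:
# first draw the hidden component indicator, then the features.
is_ood = rng.random(n) < pi_out
x = np.where(is_ood[:, None],
             rng.normal(5.0, 1.0, size=(n, d)),   # draws from D_XO
             rng.normal(0.0, 1.0, size=(n, d)))   # draws from D_XI

# The learner only observes x (i.e., the marginal D_X); the indicator
# is_ood plays the role of the unobserved OOD membership.
print(f"empirical OOD fraction: {is_ood.mean():.3f}")
```

The empirical OOD fraction concentrates around π_out as n grows, but since π_out itself is unknown, a detector cannot rely on knowing the mixing weight.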