
classify the close-set samples but also (2) discriminate the open-set samples from the close-set ones.
In this complicated setting, how to evaluate model performance becomes a challenging problem.
Existing work has proposed several metrics, which fall into two categories:
The first category extends traditional classification metrics to the open-set scenario. To this end,
one first extends the close-set confusion matrix with unknown classes, where a threshold
decides whether an input sample belongs to the unknown classes. On top of this,
open-set F-score [2, 9, 11, 12, 15, 16] summarizes the True Positive (TP), False Positive (FP), and False Negative (FN) performance of known classes.
Youden’s index [17] takes the sum of the True Positive Rate (TPR) and True Negative Rate (TNR) performance of known classes as the performance measure.
Besides, Normalized Accuracy [15] summarizes the close-set accuracy and the open-set accuracy via a convex combination. Although it is intuitive to extend close-set metrics, we point out that these metrics are essentially inconsistent with the goal of OSR. Specifically, for open-set F-score and Youden’s index, only the FP/FN performance of known classes evaluates the open-set performance implicitly. As a result, these metrics encourage classifying open-set samples into known classes: rejecting a sample as unknown risks increasing the FN of known classes, whereas accepting an open-set sample into a known class is penalized only implicitly. Moreover, Normalized Accuracy encourages selecting the threshold that classifies more open-set samples into known classes. In extreme cases, even a close-set model (i.e., one that classifies all the open-set samples into known classes) can obtain a high performance on these metrics.
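For concreteness, these classification-based metrics are commonly instantiated as sketched below; the exact averaging and weighting conventions vary across the cited works, so this is only a reference form. Let $K$ be the number of known classes, let $\mathrm{TP}_k$, $\mathrm{FP}_k$, $\mathrm{FN}_k$ be computed from the extended (thresholded) confusion matrix, and let $\lambda \in [0,1]$ be a trade-off coefficient:
\[
\mathrm{F}_{\text{open}} = \frac{1}{K}\sum_{k=1}^{K} \frac{2\,P_k R_k}{P_k + R_k}, \qquad
P_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k}, \qquad
R_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k},
\]
\[
J = \mathrm{TPR} + \mathrm{TNR} - 1, \qquad
\mathrm{NA} = \lambda \cdot \mathrm{AKS} + (1-\lambda) \cdot \mathrm{AUS},
\]
where the constant $-1$ in Youden’s index does not affect model comparison, and AKS/AUS denote the accuracy on known (close-set) and unknown (open-set) samples, respectively.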
The second category regards OSR as a novelty detection problem [18, 19] with multiple known classes. Based on this observation, the Area Under the ROC Curve (AUC) [20, 21], which measures the ranking performance between known classes and unknown classes, has become a popular metric [3, 4, 5, 6, 8, 10]. Compared with classification-based metrics, AUC is insensitive to the selection of the threshold since it summarizes the TPR performance over all possible thresholds.
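Concretely, let $r(\cdot)$ be an open-set score that assigns larger values to samples deemed unknown, and let $\{x_i\}_{i=1}^{N_k}$ and $\{\tilde{x}_j\}_{j=1}^{N_u}$ denote the close-set and open-set test samples, respectively. The empirical AUC then takes the standard pairwise form (ties are usually counted with weight $1/2$):
\[
\mathrm{AUC}(r) = \frac{1}{N_k N_u} \sum_{i=1}^{N_k} \sum_{j=1}^{N_u} \mathbb{I}\big[ r(\tilde{x}_j) > r(x_i) \big].
\]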
However, the limitation of AUC is also obvious: the close-set performance is ignored. A natural remedy is to adopt the close-set accuracy as a complementary metric [3]. However, what we expect is a model that makes correct predictions on the close-set and the open-set simultaneously. This decoupling strategy induces a challenging multi-objective optimization problem and is also unfavorable for comparing the overall performance of different models. What’s more, simply aggregating these two metrics induces another inconsistency property.
In view of this, a natural question arises:
Does there exist a numeric metric that is consistent with the goal of OSR?
To answer this question, we propose a novel metric named OpenAUC. Specifically, the proposed metric enjoys a concise pairwise formulation, where each pair consists of a close-set sample and an open-set sample. For each pair, only if the close-set sample has been classified into the correct known class does OpenAUC check whether the open-set sample is ranked higher than the close-set one. In this sense, OpenAUC evaluates the close-set performance and the open-set performance in a coupled manner, which is consistent with the goal of OSR. What’s more, benefiting from the ranking operator, OpenAUC overcomes the sensitivity to the threshold, and further analysis shows that maximizing OpenAUC guarantees a better open-set performance under a mild assumption on the threshold.
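Following this description, OpenAUC admits the following pairwise sketch, where $f(\cdot)$ is the close-set classifier, $y_i$ is the label of close-set sample $x_i$, and $r(\cdot)$ is the open-set score as before (this is an informal sketch consistent with the description above; the formal definition and its properties are developed later in the paper):
\[
\mathrm{OpenAUC}(f, r) = \frac{1}{N_k N_u} \sum_{i=1}^{N_k} \sum_{j=1}^{N_u} \mathbb{I}\big[ f(x_i) = y_i \big] \cdot \mathbb{I}\big[ r(\tilde{x}_j) > r(x_i) \big].
\]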
Considering these advantages, we further establish an end-to-end learning method to maximize
OpenAUC. Finally, extensive experiments conducted on multiple benchmark datasets validate the
proposed metric and learning method. To sum up, the contribution of this paper is three-fold:
• We make a detailed analysis of existing metrics for OSR. The theoretical results show that existing metrics, including the classification-based ones and AUC, are essentially inconsistent with the goal of OSR due to their own limitations.
• A novel metric, named OpenAUC, is proposed. Benefiting from its concise formulation, further analysis shows that OpenAUC overcomes the limitations of existing metrics and is thus free from the inconsistency properties.
• An end-to-end learning method is proposed to optimize OpenAUC, and the empirical results on multiple benchmark datasets validate its effectiveness (an illustrative optimization sketch is given below).
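To give a flavor of what maximizing OpenAUC looks like in practice, below is a minimal, hypothetical sketch of a differentiable surrogate objective in PyTorch. It couples a standard cross-entropy term for close-set classification with a pairwise squared-hinge surrogate for the ranking indicator; the function name, tensor names, and the choice of surrogate are illustrative assumptions rather than the exact loss proposed in this paper.

import torch
import torch.nn.functional as F

def openauc_surrogate(close_logits, close_labels, close_scores, open_scores, margin=1.0):
    # close_logits: (N_k, C) classifier logits for close-set samples
    # close_labels: (N_k,)   ground-truth known-class labels
    # close_scores: (N_k,)   "unknown-ness" scores of close-set samples
    # open_scores:  (N_u,)   "unknown-ness" scores of (simulated) open-set samples
    # Close-set classification term.
    ce = F.cross_entropy(close_logits, close_labels)
    # Indicator of correctly classified close-set samples (not differentiated through),
    # mirroring the term I[f(x_i) = y_i] in the pairwise sketch above.
    correct = (close_logits.argmax(dim=1) == close_labels).float()
    # Pairwise squared-hinge surrogate for I[r(x'_j) > r(x_i)]: every open-set sample
    # should score higher than every correctly classified close-set sample.
    diff = open_scores.unsqueeze(0) - close_scores.unsqueeze(1)   # (N_k, N_u)
    rank = torch.clamp(margin - diff, min=0.0) ** 2               # (N_k, N_u)
    rank = (correct.unsqueeze(1) * rank).mean()
    return ce + rank

In this sketch, close_scores and open_scores could, for instance, be negative maximum logits, and the open-set samples used during training would have to be simulated (e.g., from auxiliary or transformed training data), since true open-set data are unavailable at training time; both choices are assumptions made for illustration only.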