Asymptotically Unbiased Instance-wise Regularized
Partial AUC Optimization: Theory and Algorithm
Huiyang Shao1,2, Qianqian Xu1, Zhiyong Yang2,
Shilong Bao3,4, Qingming Huang1,2,5,6
1Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS
2School of Computer Science and Tech., University of Chinese Academy of Sciences
3State Key Lab of Info. Security, Inst. of Info. Engineering, CAS
4School of Cyber Security, University of Chinese Academy of Sciences
5BDKM, University of Chinese Academy of Sciences
6Peng Cheng Laboratory
shaohuiyang21@mails.ucas.ac.cn xuqianqian@ict.ac.cn
yangzhiyong21@ucas.ac.cn baoshilong@iie.ac.cn qmhuang@ucas.ac.cn
Abstract
The Partial Area Under the ROC Curve (PAUC), typically including One-way
Partial AUC (OPAUC) and Two-way Partial AUC (TPAUC), measures the average
performance of a binary classifier within a specific false positive rate and/or true
positive rate interval, which is a widely adopted measure when decision constraints
must be considered. Consequently, PAUC optimization has naturally attracted
increasing attention in the machine learning community within the last few years.
Nonetheless, most of the existing methods could only optimize PAUC approxi-
mately, leading to inevitable biases that are not controllable. Fortunately, a recent
work presents an unbiased formulation of the PAUC optimization problem via dis-
tributional robust optimization. However, it is based on the pair-wise formulation
of AUC, which suffers from the limited scalability w.r.t. sample size and a slow
convergence rate, especially for TPAUC. To address this issue, we present a simpler
reformulation of the problem in an asymptotically unbiased and instance-wise
manner. For both OPAUC and TPAUC, we come to a nonconvex strongly concave
minimax regularized problem of instance-wise functions. On top of this, we em-
ploy an efficient solver that enjoys a linear per-iteration computational complexity w.r.t.
the sample size and a time complexity of O(ε^{-3}) to reach an ε-stationary point.
Furthermore, we find that the minimax reformulation also facilitates the theoretical
analysis of generalization error as a byproduct. Compared with the existing results,
we present new error bounds that are much easier to prove and could deal with
hypotheses with real-valued outputs. Finally, extensive experiments on several
benchmark datasets demonstrate the effectiveness of our method.
1 Introduction
AUC refers to the Area Under the Receiver Operating Characteristic (ROC) curve [1], where the ROC curve is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) of a given classifier for all possible thresholds. Since it is insensitive to the class distribution, AUC has become one of the standard metrics for long-tailed and imbalanced datasets [1, 22, 38]. Consequently, AUC optimization has attracted increasing attention in the machine learning community ever since the early 2000s [13, 5, 35, 17]. Over the last two decades, research on AUC optimization has evolved from the simplest linear models and decision trees [27, 10, 29, 41] to state-of-the-art deep learning architectures [21, 14, 37, 43, 42, 33]. With such remarkable success, one can now easily apply AUC optimization to deal with various real-world problems ranging from financial fraud detection [16, 4, 23] and spam email detection [24] to medical diagnosis [24, 38, 37, 43], etc.

Corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.03967v2 [cs.LG] 11 Oct 2022

Table 1: Comparison with existing partial AUC algorithms. The convergence rate represents the number of iterations after which an algorithm can find an ε-stationary point; ε-sp denotes an ε-stationary point. ♦ indicates a natural result of non-convex SGD. n_B^+ (n_B^- resp.) is the number of positive (negative resp.) instances in each mini-batch B.

                                  SOPA [44]            SOPA-S [44]         TPAUC [39]          Ours
Convergence Rate (OPAUC)          O(ε^{-4})            O(ε^{-4})           O(ε^{-4}) ♦         O(ε^{-3})
Convergence Rate (TPAUC)          O(ε^{-6})            O(ε^{-4})           O(ε^{-4}) ♦         O(ε^{-3})
Convergence Measure               ε-sp (non-smooth)    ε-sp                ε-sp                ε-sp
Smoothness                        non-smooth           smooth              smooth              smooth
Unbiasedness                      ✓                    ✗                   ✗                   bias O(1/κ) when ω = 0
Per-Iteration Time Complexity     O(n_B^+ · n_B^-)     O(n_B^+ · n_B^-)    O(n_B^+ · n_B^-)    O(n_B^+ + n_B^-)
[Figure 1: The comparison among different AUC variants. (a) The area under the ROC curve (AUC). (b) The one-way partial AUC (OPAUC). (c) The two-way partial AUC (TPAUC). Each panel plots TPR against FPR over [0, 1].]
However, in such long-tailed applications, we are often interested in a specific region of the ROC curve, whose area is called the Partial AUC (PAUC). As illustrated in Fig. 1, there are two types of PAUC. One-way Partial AUC (OPAUC) measures the area within an FPR interval (0 ≤ FPR ≤ β), while Two-way Partial AUC (TPAUC) measures the area with FPR ≤ β and TPR ≥ α. Unlike the full AUC, optimizing PAUC requires selecting top-ranked or/and bottom-ranked instances, leading to a hard combinatorial optimization problem. Many efforts have been made to solve this problem [6, 24, 25, 20, 39]. However, the majority of them rely on full-batch optimization and an approximation of the top (bottom)-k ranking process, which suffers from immeasurable biases and undesirable efficiency.

Most recently, researchers have started to explore mini-batch PAUC optimization for deep learning models. [39] proposed a novel end-to-end optimization framework for PAUC. This formulation has a fast convergence rate with the help of a stochastic optimization algorithm, but the estimation of PAUC is still biased. Later, [44] proposed a Distributional Robust Optimization (DRO) framework for PAUC optimization, where the bias can be eliminated by a clever reformulation and compositional SGD algorithms [28]. However, they adopt the pair-wise loss function, which has limited scalability w.r.t. sample size and an O(ε^{-4}) / O(ε^{-6}) time complexity to reach an ε-stationary point for OPAUC / TPAUC. Considering that the efficiency bottleneck comes from the pair-wise formulation, we will explore the following question in this paper:
Can we design a simpler, nearly asymptotically unbiased and instance-wise formulation
to optimize OPAUC and TPAUC in an efficient way?
To answer this question, we propose an efficient and nearly unbiased optimization algorithm (the bias vanishes asymptotically as κ → ∞ when ω = 0) for regularized PAUC maximization with a faster convergence guarantee. The comparison with previous results is listed in Tab. 1. We consider both OPAUC and TPAUC maximization: for OPAUC, we focus on maximizing PAUC in the region FPR ≤ β, and for TPAUC we focus on the region FPR ≤ β and TPR ≥ α. We summarize our contributions below.
• With a proper regularization, we propose a nonconvex strongly concave minimax instance-wise formulation for OPAUC and TPAUC maximization. On top of our proposed formulation, we employ an efficient stochastic minimax algorithm that finds an ε-stationary point within O(ε^{-3}) iterations.
• We conduct a generalization analysis of our proposed methods. Our instance-wise reformulation overcomes the interdependency issue of the original pair-wise generalization analysis, and the proof is much simpler than existing results. Moreover, compared with [25, 39], it can be applied to any real-valued hypothesis functions rather than only hard-threshold functions.
• We conduct extensive experiments on multiple imbalanced image classification tasks. The experimental results speak to the effectiveness of our proposed methods.
2 Preliminaries
Notations. In this section, we give some definitions and preliminaries about OPAUC and TPAUC. Let X ⊆ R^d be the input space and Y = {0, 1} be the label space. We denote by D_P and D_N the positive and negative instance distributions, respectively. Let S = {z_i = (x_i, y_i)}_{i=1}^n be a set of training data drawn from a distribution D_Z, where n is the number of samples. Let P (N resp.) be the set of positive (negative resp.) instances in the dataset. In this paper, we only focus on scoring functions f : X → [0, 1]. The output range can be simply implemented by any deep neural network with sigmoid outputs.
Standard AUC. The standard AUC calculates the entire area under the ROC curve. For mathematical convenience, our arguments start with the pair-wise reformulation of AUC. Specifically, as shown in [1], AUC measures the probability of a positive instance having a higher score than a negative instance:

\[ \mathrm{AUC}(f) = \Pr_{x \sim \mathcal{D}_P,\, x' \sim \mathcal{D}_N}\left[ f(x) > f(x') \right]. \tag{1} \]
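As a quick illustration (a toy example of ours, not from the paper), the empirical counterpart of Eq. (1) simply counts the fraction of correctly ordered positive-negative pairs:

```python
import numpy as np

pos = np.array([0.9, 0.8, 0.4, 0.7])   # scores f(x) of positive instances
neg = np.array([0.6, 0.1, 0.3, 0.2])   # scores f(x') of negative instances

# fraction of pairs with f(x) > f(x'), i.e., the empirical AUC
auc = np.mean(pos[:, None] > neg[None, :])
print(auc)  # 0.9375 for this toy data
```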
OPAUC. As mentioned in the introduction, instead of considering the entire region of the ROC curve, we focus on two forms of PAUC, namely OPAUC and TPAUC. According to [6], OPAUC is equivalent to the probability of a positive instance x being scored higher than a negative instance x' whose score lies in the specific range f(x') ∈ [η_β(f), 1], where η_β(f) satisfies Pr_{x'∼D_N}[f(x') ≥ η_β(f)] = β:

\[ \mathrm{OPAUC}(f) = \Pr_{x \sim \mathcal{D}_P,\, x' \sim \mathcal{D}_N}\left[ f(x) > f(x'),\; f(x') \ge \eta_\beta(f) \right]. \tag{2} \]
Practically, we do not know the exact data distributions D_P, D_N needed to calculate Eq. (2). Therefore, we turn to the empirical estimation of Eq. (2). Given a finite dataset S with n instances, let n_+ and n_- be the numbers of positive and negative instances, respectively. For OPAUC, the empirical estimation could be expressed as [24]:

\[ \widehat{\mathrm{AUC}}_\beta(f, S) = 1 - \frac{1}{n_+ n_-^\beta} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-^\beta} \ell_{0,1}\!\left( f(x_i) - f(x'_{[j]}) \right), \tag{3} \]

where n_-^β = ⌊n_- · β⌋; x'_{[j]} denotes the instance with the j-th largest score among negative samples; ℓ_{0,1}(t) is the 0-1 loss, which returns 1 if t < 0 and 0 otherwise.
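To make Eq. (3) concrete, the following minimal NumPy sketch computes the empirical OPAUC on a finite sample; the function name and the toy arrays are illustrative only and are not part of the original paper.

```python
import numpy as np

def empirical_opauc(pos_scores, neg_scores, beta):
    """Empirical OPAUC of Eq. (3): 1 minus the misranking rate against the
    top-floor(n_minus * beta) scored negative instances."""
    n_beta = int(np.floor(len(neg_scores) * beta))
    # the top-ranked (largest-score) negatives are the "hard" ones
    hard_negs = np.sort(neg_scores)[::-1][:n_beta]
    # pairwise 0-1 loss: 1 whenever a positive is scored below a hard negative
    diffs = pos_scores[:, None] - hard_negs[None, :]
    return 1.0 - np.mean(diffs < 0)

pos = np.array([0.9, 0.8, 0.4, 0.7])
neg = np.array([0.6, 0.1, 0.3, 0.2, 0.5])
print(empirical_opauc(pos, neg, beta=0.4))  # uses the 2 hardest negatives -> 0.75
```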
TPAUC. More recently, [36] argued that an efficient classifier should have a low FPR and a high TPR simultaneously. Therefore, we also study a more general variant called Two-way Partial AUC (TPAUC), where the restricted region satisfies TPR ≥ α and FPR ≤ β. Similar to OPAUC, TPAUC measures the probability that a positive instance x ranks higher than a negative instance x', where f(x) ∈ [0, η_α(f)] with Pr_{x∼D_P}[f(x) ≤ η_α(f)] = α, and f(x') ∈ [η_β(f), 1] with Pr_{x'∼D_N}[f(x') ≥ η_β(f)] = β:

\[ \mathrm{TPAUC}(f) = \Pr_{x\sim\mathcal{D}_P,\, x'\sim\mathcal{D}_N}\left[ f(x) > f(x'),\; f(x) \le \eta_\alpha(f),\; f(x') \ge \eta_\beta(f) \right]. \tag{4} \]
Similar to OPAUC, for TPAUC we adopt its empirical estimation [36, 39]:

\[ \widehat{\mathrm{AUC}}_{\alpha,\beta}(f, S) = 1 - \frac{1}{n_+^\alpha n_-^\beta} \sum_{i=1}^{n_+^\alpha} \sum_{j=1}^{n_-^\beta} \ell_{0,1}\!\left( f(x_{[i]}) - f(x'_{[j]}) \right), \tag{5} \]

where n_+^α = ⌊n_+ · α⌋ and x_{[i]} denotes the instance with the i-th smallest score among all positive instances.
3 Problem Formulation
In this section, we introduce how to optimize OPAUC and TPAUC in an asymptotically unbiased and instance-wise manner. Note that Eq. (3) and Eq. (5) are hard to optimize, since it is complicated to determine the positive quantile function η_α(f) and the negative quantile function η_β(f); hence we cannot obtain the bottom-ranked positive instances and top-ranked negative instances directly. Below, we elaborate on how to tackle these challenges.
3.1 Optimizing the OPAUC
According to Eq. (3), given a surrogate loss ℓ and the finite dataset S, maximizing OPAUC and ÂUC_β(f, S) is equivalent to solving the following problems, respectively:

\[ \min_f R_\beta(f) = \mathbb{E}_{x\sim\mathcal{D}_P,\, x'\sim\mathcal{D}_N}\left[ \mathbb{I}_{f(x')\ge\eta_\beta(f)} \cdot \ell\big(f(x) - f(x')\big) \right], \tag{6} \]

\[ \min_f \hat{R}_\beta(f, S) = \frac{1}{n_+ n_-^\beta}\sum_{i=1}^{n_+}\sum_{j=1}^{n_-^\beta} \ell\big(f(x_i) - f(x'_{[j]})\big). \tag{7} \]
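For contrast with the instance-wise reformulation derived next, here is what the pair-wise empirical risk of Eq. (7) looks like with the squared surrogate; note the O(n_+ · n_-^β) pairing cost that Step 1 removes. The helper name is ours.

```python
import numpy as np

def pairwise_risk_beta(pos_scores, neg_scores, beta):
    """Pair-wise empirical risk of Eq. (7) with the squared surrogate loss."""
    n_beta = int(np.floor(len(neg_scores) * beta))
    hard_negs = np.sort(neg_scores)[::-1][:n_beta]       # top-ranked negatives
    diffs = pos_scores[:, None] - hard_negs[None, :]     # n_+ x n_beta pairs
    return np.mean((1.0 - diffs) ** 2)                   # surrogate loss (1 - t)^2
```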
Step 1: Instance-wise Reformulation. Here, to simplify the reformulation, we will use the most popular surrogate squared loss ℓ(x) = (1 − x)^2. Under this setting, the following theorem shows an instance-wise reformulation of the OPAUC optimization problem (please see Appendix F for the proof):

Theorem 1. Assume that f(x) ∈ [0, 1] for all x ∈ X, and let p = Pr[y = 1] denote the proportion of positive instances. Define F_op(f, a, b, γ, t, z) as:

\[ F_{op}(f,a,b,\gamma,t,z) = \big[(f(x)-a)^2 - 2(1+\gamma)f(x)\big]\,\frac{y}{p} \;-\; \gamma^2 \;+\; \big[(f(x)-b)^2 + 2(1+\gamma)f(x)\big]\cdot\frac{(1-y)\,\mathbb{I}_{f(x)\ge t}}{(1-p)\,\beta}, \tag{8} \]

where y = 1 for positive instances and y = 0 for negative instances. We have the following conclusions:

(a) (Population Version.) We have:

\[ \min_f R_\beta(f) \;\Leftrightarrow\; \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in[-1,1]}\; \mathbb{E}_{z\sim\mathcal{D}_Z}\left[ F_{op}(f, a, b, \gamma, \eta_\beta(f), z) \right], \tag{9} \]

where η_β(f) is chosen such that E_{x'∼D_N}[I_{f(x')≥η_β(f)}] = β.

(b) (Empirical Version.) Moreover, given a training dataset S with sample size n, denote

\[ \hat{\mathbb{E}}_{z\sim S}\left[ F_{op}(f, a, b, \gamma, \hat{\eta}_\beta(f), z) \right] = \frac{1}{n}\sum_{i=1}^{n} F_{op}(f, a, b, \gamma, \hat{\eta}_\beta(f), z_i), \]

where η̂_β(f) is the corresponding empirical quantile of the negative instances in S. We have:

\[ \min_f \hat{R}_\beta(f, S) \;\Leftrightarrow\; \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in[-1,1]}\; \hat{\mathbb{E}}_{z\sim S}\left[ F_{op}(f, a, b, \gamma, \hat{\eta}_\beta(f), z) \right]. \tag{10} \]
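The practical appeal of Thm. 1 is that F_op touches each sample exactly once, so a mini-batch estimate costs O(n_B^+ + n_B^-) rather than O(n_B^+ · n_B^-). The sketch below evaluates the empirical objective of Eq. (10) for fixed (a, b, γ) with a plug-in empirical quantile; the variable names and the use of NumPy are our own choices, not the paper's implementation.

```python
import numpy as np

def f_op_batch(scores, labels, a, b, gamma, beta, p):
    """Instance-wise objective of Eq. (8)/(10), averaged over a batch.
    `scores` = f(x) in [0,1], `labels` in {0,1}, `p` = positive-class prior."""
    neg_scores = scores[labels == 0]
    # plug-in empirical beta-quantile of the negative scores (threshold t)
    t = np.quantile(neg_scores, 1.0 - beta)
    pos_term = ((scores - a) ** 2 - 2 * (1 + gamma) * scores) * labels / p
    neg_term = ((scores - b) ** 2 + 2 * (1 + gamma) * scores) \
               * (1 - labels) * (scores >= t) / ((1 - p) * beta)
    return np.mean(pos_term + neg_term) - gamma ** 2

# a single evaluation is one pass over the batch -- no positive/negative pairing
```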
Step 2: Differentiable Sample Selection. Thm. 1 provides the support to convert the pair-wise loss into an instance-wise loss for OPAUC. However, the minimax problem Eq. (10) is still difficult to solve due to the operation I_{f(x')≥η_β(f)}, which requires selecting the top-ranked negative instances. To make the sample selection process differentiable, we adopt the following lemma.

Lemma 1. \(\sum_{i=1}^{k} x_{[i]}\) is a convex function of (x_1, ..., x_n), where x_{[i]} is the top-i element of the set {x_1, x_2, ..., x_n}. Furthermore, for x_i, i = 1, ..., n, we have

\[ \frac{1}{k}\sum_{i=1}^{k} x_{[i]} = \min_{s}\left\{ s + \frac{1}{k}\sum_{i=1}^{n}\big[ x_i - s \big]_+ \right\}, \]

where [a]_+ = max{0, a}. The population version is

\[ \mathbb{E}_x\left[ x \mid x \ge \eta(\alpha) \right] = \min_{s}\; \frac{1}{\alpha}\, \mathbb{E}_x\left[ \alpha s + [x - s]_+ \right], \]

where η(α) satisfies E_x[I_{x≥η(α)}] = α (please see Appendix F for the proof).
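Lemma 1 can be checked numerically in a few lines: the inner minimization over s reproduces the average of the top-k elements without any explicit sorting inside the optimization. This is just a sanity-check sketch on toy data of our own.

```python
import numpy as np

x = np.array([0.2, 0.9, 0.5, 0.7, 0.1])
k = 2

avg_top_k = np.sort(x)[::-1][:k].mean()                    # direct top-k average
candidates = np.linspace(x.min(), x.max(), 100001)         # brute-force the scalar min over s
atk = np.min([s + np.maximum(x - s, 0.0).sum() / k for s in candidates])

print(avg_top_k, atk)  # both equal 0.8 (up to the grid resolution)
```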
Lem. 1 gives rise to the Average Top-k (ATk) loss, a surrogate for the top-k loss that eliminates the sorting problem: optimizing the ATk loss is equivalent to selecting the top-ranked instances. For R_β(f), we can reformulate the selection term exactly as an ATk loss. Denote ℓ(x') = (f(x') − b)^2 + 2(1 + γ)f(x'). In the proof of the next theorem, we will show that ℓ(x') is an increasing function w.r.t. f(x'), namely:

\[ \mathbb{E}_{x'\sim\mathcal{D}_N}\left[ \ell(x') \mid f(x') \ge \eta_\beta(f) \right] = \min_{s}\; \frac{1}{\beta}\, \mathbb{E}_{x'\sim\mathcal{D}_N}\left[ \beta s + \big[ \ell(x') - s \big]_+ \right]. \tag{11} \]

A similar result holds for R̂_β(f, S). Then, we can reach Thm. 2 (please see Appendix F for the proof):
Theorem 2. Assume that f(x) ∈ [0, 1] for all x ∈ X. Then we have the following equivalent optimization problems for OPAUC:

\[ \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in[-1,1]}\; \mathbb{E}_{z\sim\mathcal{D}_Z}\left[ F_{op}(f,a,b,\gamma,\eta_\beta(f),z) \right] \;\Leftrightarrow\; \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in\Omega_\gamma}\; \min_{s'\in\Omega_{s'}}\; \mathbb{E}_{z\sim\mathcal{D}_Z}\left[ G_{op}(f,a,b,\gamma,z,s') \right], \tag{12} \]

\[ \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in[-1,1]}\; \hat{\mathbb{E}}_{z\sim S}\left[ F_{op}(f,a,b,\gamma,\hat{\eta}_\beta(f),z) \right] \;\Leftrightarrow\; \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in\Omega_\gamma}\; \min_{s'\in\Omega_{s'}}\; \hat{\mathbb{E}}_{z\sim S}\left[ G_{op}(f,a,b,\gamma,z,s') \right], \tag{13} \]

where Ω_γ = [b − 1, 1], Ω_{s'} = [0, 5], and

\[ G_{op}(f,a,b,\gamma,z,s') = \big[(f(x)-a)^2 - 2(1+\gamma)f(x)\big]\,\frac{y}{p} \;-\; \gamma^2 \;+\; \Big(\beta s' + \big[(f(x)-b)^2 + 2(1+\gamma)f(x) - s'\big]_+\Big)\,\frac{1-y}{\beta(1-p)}. \tag{14} \]
Step 3: Asymptotically Unbiased Smoothing. Even with Thm. 2, it is hard to optimize the min-max-min formulation in Eq. (13). A solution is to swap the order of max_γ and min_{s'} so as to reformulate it as a min-max problem. The key obstacle to this idea is the non-smooth function [·]_+. To avoid the [·]_+, we apply the softplus function [12]:

\[ r_\kappa(x) = \frac{\log\big(1 + \exp(\kappa\cdot x)\big)}{\kappa} \tag{15} \]

as a smooth surrogate. It is easy to show that r_κ(x) → [x]_+ as κ → ∞. Denote by G^κ_op(f, a, b, γ, z, s') the surrogate objective in which the [·]_+ in G_op(f, a, b, γ, z, s') is replaced with r_κ(·). We then proceed to solve the surrogate problem:

\[ \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in\Omega_\gamma}\; \min_{s'\in\Omega_{s'}}\; \mathbb{E}_{z\sim\mathcal{D}_Z}\big[ G^\kappa_{op}(f,a,b,\gamma,z,s') \big], \qquad \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in\Omega_\gamma}\; \min_{s'\in\Omega_{s'}}\; \hat{\mathbb{E}}_{z\sim S}\big[ G^\kappa_{op}(f,a,b,\gamma,z,s') \big], \tag{16} \]

respectively for the population and empirical versions. In Appendix B, we will prove that such an approximation has a convergence rate of O(1/κ).
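The smoothing step can be visualized directly: r_κ(x) upper-bounds [x]_+ and the gap shrinks at rate O(1/κ) (it is exactly log 2 / κ at x = 0). Below is a tiny numerical check of this claim; nothing here is specific to the paper's code.

```python
import numpy as np

def softplus_kappa(x, kappa):
    """Smooth surrogate r_kappa(x) = log(1 + exp(kappa * x)) / kappa of Eq. (15)."""
    # log(1 + exp(.)) written via logaddexp to avoid overflow for large kappa * x
    return np.logaddexp(0.0, kappa * x) / kappa

xs = np.linspace(-1.0, 1.0, 2001)
for kappa in (1, 10, 100, 1000):
    gap = np.max(softplus_kappa(xs, kappa) - np.maximum(xs, 0.0))
    print(kappa, gap)  # the maximal gap is log(2)/kappa, i.e., O(1/kappa)
```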
Step 4: The Regularized Problem. It is easy to check that r_κ(x) has a bounded second-order derivative. In this way, we can regard G^κ_op(f, a, b, γ, z, s') as a weakly-concave function [3] of γ. By employing an ℓ_2 regularization, we turn to a regularized form:

\[ G^{\kappa,\omega}_{op}(f,a,b,\gamma,z,s') = G^{\kappa}_{op}(f,a,b,\gamma,z,s') - \omega\cdot\gamma^2. \]

With a sufficiently large ω, G^{κ,ω}_op(f, a, b, γ, z, s') is strongly concave w.r.t. γ when all the other variables are fixed. Note that the regularization scheme inevitably introduces bias; this is a very general property of regularization. However, regularization is known to be a necessary building block to stabilize the solutions and improve generalization performance. We then reach a minimax problem in the final step.
Step 5: Min-Max Swapping. According to the minimax theorem [3], if we replace G^κ_op(f, a, b, γ, z, s') with G^{κ,ω}_op(f, a, b, γ, z, s'), the surrogate optimization problem satisfies:

\[ \min_{f,\,(a,b)\in[0,1]^2}\; \max_{\gamma\in\Omega_\gamma}\; \min_{s'\in\Omega_{s'}}\; \mathbb{E}_{z\sim\mathcal{D}_Z}\big[ G^{\kappa,\omega}_{op} \big] \;\Leftrightarrow\; \min_{f,\,(a,b)\in[0,1]^2,\, s'\in\Omega_{s'}}\; \max_{\gamma\in\Omega_\gamma}\; \mathbb{E}_{z\sim\mathcal{D}_Z}\big[ G^{\kappa,\omega}_{op} \big]. \tag{17} \]
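Eq. (17) is a standard nonconvex strongly-concave minimax problem over instance-wise terms, so it can be attacked with stochastic gradient descent-ascent style updates. The sketch below is a deliberately plain PyTorch descent-ascent loop on G^{κ,ω}_op for a linear scorer; it is meant to illustrate the structure of the problem, not the accelerated solver analyzed in the paper, and all hyper-parameters, the scorer, and the toy data pipeline are placeholders of ours.

```python
import torch
import torch.nn.functional as F

def g_op_kappa_omega(scores, labels, a, b, gamma, s, beta, p, kappa, omega):
    """Regularized smoothed objective G^{kappa,omega}_op of Eqs. (14)-(17) on a mini-batch."""
    pos = ((scores - a) ** 2 - 2 * (1 + gamma) * scores) * labels / p
    inner = (scores - b) ** 2 + 2 * (1 + gamma) * scores - s
    neg = (beta * s + F.softplus(kappa * inner) / kappa) * (1 - labels) / (beta * (1 - p))
    return (pos + neg).mean() - gamma ** 2 - omega * gamma ** 2

# toy setup: linear scorer with sigmoid output so that f(x) lies in [0, 1]
torch.manual_seed(0)
w = torch.zeros(5, requires_grad=True)
a = torch.tensor(0.5, requires_grad=True); b = torch.tensor(0.5, requires_grad=True)
s = torch.tensor(1.0, requires_grad=True)              # primal variables (descent)
gamma = torch.tensor(0.0, requires_grad=True)          # dual variable (ascent)
beta, p, kappa, omega, lr = 0.3, 0.5, 50.0, 1.0, 0.1

for step in range(200):
    x = torch.randn(64, 5); y = (torch.rand(64) < p).float()
    x[y == 1] += 0.5                                    # positives score a bit higher
    f_x = torch.sigmoid(x @ w)
    loss = g_op_kappa_omega(f_x, y, a, b, gamma, s, beta, p, kappa, omega)
    grads = torch.autograd.grad(loss, [w, a, b, s, gamma])
    with torch.no_grad():
        for v, g in zip((w, a, b, s), grads[:4]):
            v -= lr * g                                 # descent on f, a, b, s'
        gamma += lr * grads[4]                          # ascent on gamma
        a.clamp_(0, 1); b.clamp_(0, 1); s.clamp_(0, 5)  # project onto the constraint sets
        gamma.clamp_(float(b) - 1.0, 1.0)               # Omega_gamma = [b - 1, 1]
```

Note that each iteration only needs one pass over the mini-batch (no positive-negative pairing), which is exactly the O(n_B^+ + n_B^-) per-iteration cost claimed in Tab. 1.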