Tight Analysis of Extra-gradient and Optimistic
Gradient Methods For Nonconvex Minimax Problems
Pouria Mahdavinia†, Yuyang Deng†, Haochuan Li§, Mehrdad Mahdavi†
†Department of Computer Science and Engineering
The Pennsylvania State University
{pxm5426, yzd82, mzm616}@psu.edu
§Department of Electrical and Computer Engineering
Massachusetts Institute of Technology
haochuan@mit.edu
Abstract
Despite the established convergence theory of Optimistic Gradient Descent Ascent (OGDA) and
Extragradient (EG) methods for convex-concave minimax problems, little is known about the
theoretical guarantees of these methods in nonconvex settings. To bridge this gap, this paper
establishes, for the first time, the convergence of the OGDA and EG methods in the
nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings via a unified analysis
through the lens of single-call extra-gradient methods. We further establish lower bounds on the
convergence of GDA/OGDA/EG, shedding light on the tightness of our analysis, and conduct
experiments supporting our theoretical results. We believe our results will advance the theoretical
understanding of the OGDA and EG methods for solving real-world nonconvex minimax problems,
e.g., training Generative Adversarial Networks (GANs) or robust neural networks.
1 Introduction
In this paper, we consider the following minimax problem:

$$\min_{\mathbf{x}\in\mathbb{R}^d}\;\max_{\mathbf{y}\in\mathcal{Y}}\; f(\mathbf{x},\mathbf{y}) \qquad (1)$$

where $\mathcal{Y}$ could be a bounded convex or unbounded set, and the function $f : \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}$ is smooth and strongly-concave/concave with respect to $\mathbf{y}$, but possibly nonconvex in $\mathbf{x}$. Minimax optimization (Problem 1) has been explored in a variety of fields, including classical game theory, online learning, and control theory [2, 50, 21]. Minimax has emerged as a key optimization framework for machine learning applications such as generative adversarial networks (GANs) [14], robust and adversarial machine learning [46, 37, 15], and reinforcement learning [54, 43].
Gradient descent ascent (GDA) is a well-known algorithm for solving minimax problems, and it is widely used to optimize generative adversarial networks. In each iteration, GDA simultaneously performs a gradient descent step on the primal variable $\mathbf{x}$ and a gradient ascent step on the dual variable $\mathbf{y}$. GDA with equal step sizes for both variables converges linearly to the Nash equilibrium under the strongly-convex strongly-concave (SC-SC) assumption [28, 12], but it can diverge even in the convex-concave (C-C) setting, e.g., for bilinear functions [22, 38].
Given the highly nonconvex nature of practical applications such as GANs, exploring convergence guarantees of minimax optimization algorithms beyond the convex-concave (C-C) setting is one of the canonical research directions in minimax optimization. Several algorithms with convergence guarantees beyond the C-C domain have been explored in the literature. Alternating Gradient Descent Ascent (AGDA) is one such method, demonstrated to have excellent convergence properties beyond the C-C setting [51, 52, 6]. Two other powerful algorithms are Extragradient (EG) and Optimistic GDA (OGDA), which have recently gained prominence due to their superior empirical performance in training GANs compared to other minimax optimization algorithms [28, 8, 38].
arXiv:2210.09382v1 [cs.LG] 17 Oct 2022
| Algorithm | NC-C (Deterministic) | NC-C (Stochastic) | NC-SC (Deterministic) | NC-SC (Stochastic) |
|---|---|---|---|---|
| PG-SVRG [44] | – | $\tilde{O}(\epsilon^{-6})$ | – | – |
| HiBSA [36] | $O(\epsilon^{-8})$ | – | – | – |
| Prox-DIAG [48] | $\tilde{O}(\epsilon^{-3})$ | – | – | – |
| Minimax-PPA [31] | $O(\epsilon^{-4})$ | – | $O(\sqrt{\kappa}\epsilon^{-2})$ | – |
| ALSET [4] | – | – | $O(\kappa^3\epsilon^{-2})$ | $O(\kappa^3\epsilon^{-4})$ |
| Smoothed-AGDA [52] | – | – | $O(\kappa\epsilon^{-2})$ | $O(\kappa^2\epsilon^{-4})$ |
| GDA [30] | $O(\epsilon^{-6})$ | $O(\epsilon^{-8})$ | $O(\kappa^2\epsilon^{-2})$ | $O(\kappa^3\epsilon^{-4})$ |
| OGDA/EG (Theorems 4.2, 4.4, 4.8, 4.9) | $O(\epsilon^{-6})$ | $O(\epsilon^{-8})$ | $O(\kappa^2\epsilon^{-2})$ | $O(\kappa^3\epsilon^{-4})$ |

Table 1: A summary of prior and our convergence rates in nonconvex-concave (NC-C) and nonconvex-strongly-concave (NC-SC) minimax optimization. For NC-C, we assume $f(\mathbf{x},\mathbf{y})$ is $\ell$-smooth, $G$-Lipschitz in $\mathbf{x}$, and concave in $\mathbf{y}$; for NC-SC, we assume $\ell$-smoothness and $\mu$-strong concavity in $\mathbf{y}$, where $\kappa = \ell/\mu$.
Spurred by the empirical success of the EG and OGDA methods, there has been a tremendous amount of work on the theoretical understanding of their convergence rates under different sets of assumptions. In particular, the convergence properties of EG and OGDA were recently investigated in the SC-SC and C-C settings, where it has been shown that they tend to converge significantly faster than GDA in both deterministic and stochastic settings [39, 12, 40]. Despite these remarkable advances, there is a dearth of theoretical understanding of the convergence of the OGDA and EG methods in the nonconvex setting, which motivates us to rigorously examine these methods in nonconvex minimax optimization. We emphasize that our focus is on the vanilla variants of OGDA/EG; improved rates for NC-C and NC-SC problems have already been obtained with novel algorithms, as discussed in Section 2.
Contributions. We propose a unified framework for analyzing and establishing the convergence of the OGDA and EG methods for solving NC-SC and NC-C minimax problems. To the best of our knowledge, our analysis provides the first theoretical guarantees for such problems. Our contributions can be summarized as follows:

- For NC-SC objectives, we demonstrate that the OGDA and EG iterates converge to a stationary point, with a gradient complexity of $O(\kappa^2\epsilon^{-2})$ in the deterministic case and $O(\kappa^3\epsilon^{-4})$ in the stochastic setting, matching the gradient complexity of GDA in [30].

- For NC-C objectives, we establish gradient complexities of $O(\epsilon^{-6})$ and $O(\epsilon^{-8})$ for deterministic and stochastic oracles, respectively. Compared to the most analogous work on GDA [30], our rate matches the gradient complexity of GDA, and our results show that OGDA and EG have the advantage of shaving off a significant term related to the primal function gap ($\hat{\Delta}_0 = \Phi(\mathbf{x}_0) - \min_{\mathbf{x}}\Phi(\mathbf{x})$).

- We establish impossibility results on the achievable rates by providing $\Omega(\kappa^2\epsilon^{-2})$ and $\Omega(\epsilon^{-6})$ lower bounds, based on the common choice of parameters for both OGDA and EG, in the deterministic NC-SC and NC-C settings, respectively, thus demonstrating the tightness of our upper bounds.

- By carefully designing hard instances, we establish a general lower bound of $\Omega(\kappa\epsilon^{-2})$, independent of the learning rate, for the GDA/OGDA/EG methods in the deterministic NC-SC setting, demonstrating the optimality of the obtained upper bound up to a factor of $\kappa$.
2 Related Work
Extra-gradient (EG) and OGDA methods. Under the smooth SC-SC assumption, deterministic OGDA and EG have been shown to converge to an $O(\epsilon)$ neighborhood of the optimal solution at a rate of $O(\kappa\log(1/\epsilon))$ [39, 49]. Fallah et al. [12] improved upon these rates by proposing multistage OGDA, which achieves the best-known rate of $O(\max(\kappa\log(1/\epsilon),\, \sigma^2/(\mu^2\epsilon^2)))$ for stochastic OGDA in the SC-SC setting. Under monotonicity and gradient Lipschitzness assumptions (a slightly weaker notion than smooth convex-concave problems), Cai et al. [3] established a tight last-iterate convergence rate of $O(1/\sqrt{T})$ for OGDA and EG, and similar results for EG have been achieved in [17, 16]. Furthermore, to the best of our knowledge, the OGDA and EG methods have not been extensively explored in nonconvex-nonconcave settings, except in a few recent works on structured nonconvex-nonconcave problems in which the analysis is done through the lens of variational inequalities; this line of work is discussed in the nonconvex-nonconcave paragraph below. Moreover, Guo et al. [18] recently established the convergence rate of OGDA in the NC-SC setting; however, they impose a $\mu$-PL assumption on $\Phi(\mathbf{x})$, which is a strong assumption that further allows them to state the convergence rate in terms of the objective gap. We do not make such an assumption on the primal function and hence, unlike [18], we measure convergence by the gradient norm of the primal function.
Nonconvex-strongly-concave (NC-SC) problems. In the deterministic setting, Lin et al. [30] demonstrated the first non-asymptotic convergence of GDA to an $\epsilon$-stationary point of $\Phi(\mathbf{x})$, with a gradient complexity of $O(\kappa^2\epsilon^{-2})$. Lin et al. [31] and Zhang et al. [55] proposed triple-loop algorithms achieving a gradient complexity of $O(\sqrt{\kappa}\epsilon^{-2})$ by leveraging ideas from catalyst methods (adding $\alpha\|\mathbf{x}-\mathbf{x}_0\|^2$ to the objective function) and inexact proximal point methods, which nearly match the existing lower bound [27, 55, 20]. Approximating the inner-loop optimization of the catalyst idea by one step of GDA, Yang et al. [52] developed a single-loop algorithm called Smoothed-AGDA, which provably converges to an $\epsilon$-stationary point with a gradient complexity of $O(\kappa\epsilon^{-2})$. For the stochastic setting, Lin et al. [30] showed that stochastic GDA, with a primal-to-dual learning-rate ratio of $O(1/\kappa^2)$, converges to an $\epsilon$-stationary point with a gradient complexity of $O(\kappa^3\epsilon^{-4})$. Chen et al. [4] proposed a double-loop algorithm whose outer loop performs one step of gradient descent on the primal variable, while the inner loop performs multiple steps of gradient ascent. Using this idea, they achieved a gradient complexity of $O(\kappa^3\epsilon^{-4})$ with a fixed batch size; however, their algorithm is double-loop, and the iteration complexity of the inner loop is $O(\kappa)$. Yang et al. [52] also introduced the stochastic version of Smoothed-AGDA mentioned earlier, showing a gradient complexity of $O(\kappa^2\epsilon^{-4})$ with a fixed batch size. They achieved the best-known rate for NC-PL problems, where PL is an even weaker assumption than strong concavity.
Nonconvex-concave. Recently, due to the surge of GANs [14] and adversarially robust neural network training, a line of research has focused on nonconvex-concave or even nonconvex-nonconcave minimax optimization problems [36, 29, 41, 44, 48, 13, 32, 33, 24]. For the nonconvex-concave setting, to the best of our knowledge, Rafique et al. [44] were the first to propose a provable nonconvex-concave minimax algorithm, the Proximally Guided Stochastic Mirror Descent method, which achieves a gradient complexity of $O(\epsilon^{-6})$ for finding a stationary point. Nouiehed et al. [41] presented a double-loop algorithm to solve nonconvex-concave problems with constraints on both $\mathbf{x}$ and $\mathbf{y}$, achieving an $O(\epsilon^{-7})$ rate. Lin et al. [30] provided the first analysis of the classic (S)GDA algorithm on nonconvex-strongly-concave and nonconvex-concave functions; in the nonconvex-concave setting, they achieve $O(\epsilon^{-6})$ for GDA and $O(\epsilon^{-8})$ for SGDA. Zhang et al. [53] proposed Smoothed-GDA and also achieve an $O(\epsilon^{-8})$ rate. Thekumparampil et al. [48] proposed the Proximal Dual Implicit Accelerated Gradient method and achieved the best-known rate of $\tilde{O}(\epsilon^{-3})$ for nonconvex-concave problems. Kong and Monteiro [26] proposed an accelerated inexact proximal point method that also achieves an $O(\epsilon^{-3})$ rate, and Lin et al. [31] designed a near-optimal algorithm using an acceleration method with an $O(\epsilon^{-3})$ rate. However, these algorithms require double or triple loops and are not as easy to implement as the GDA, OGDA, or EG methods.
Nonconvex-nonconcave. Minimax optimization problems can be cast as a special case of variational inequality problems (VIPs) [1, 34]. Thus, one way of studying convergence in nonconvex-nonconcave problems is to leverage variants of variational inequality properties, such as the monotone variational inequality, the Minty variational inequality (MVI), the weak MVI, and negative comonotonicity, which are weaker assumptions than convex-concavity. For instance, Loizou et al. [35] showed linear convergence of SGDA under expected co-coercivity, a condition that can hold for non-monotone problems. Moreover, it has been shown that deterministic EG attains a gradient complexity of $O(\epsilon^{-2})$ in the aforementioned settings [7, 10, 47, 42]. Alternatively, another line of work establishes convergence under weaker notions of strong convexity, such as the Polyak-Łojasiewicz (PL) condition or $\rho$-weak convexity. Yang et al. [51] established linear convergence of the AGDA algorithm under the two-sided PL condition, and Hajizadeh et al. [19] obtained similar results for EG under the weakly-convex, weakly-concave assumption.
3 Problem setup and preliminaries
We use lower-case boldface letters such as $\mathbf{x}$ to denote vectors and let $\|\cdot\|$ denote the $\ell_2$-norm of vectors. In Problem (1), we refer to $\mathbf{x}$ as the primal variable and to $\mathbf{y}$ as the dual variable. For a function $f : \mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$, we use $\nabla_x f(\mathbf{x},\mathbf{y})$ to denote the gradient of $f(\mathbf{x},\mathbf{y})$ with respect to the primal variable $\mathbf{x}$, and $\nabla_y f(\mathbf{x},\mathbf{y})$ to denote the gradient with respect to the dual variable $\mathbf{y}$. In the stochastic setting, we let $\mathbf{g}_{x,t}$ be the unbiased estimator of $\nabla_x f(\mathbf{x}_t,\mathbf{y}_t)$, computed with a minibatch of size $M_x$, and $\mathbf{g}_{y,t}$ be the unbiased estimator of $\nabla_y f(\mathbf{x}_t,\mathbf{y}_t)$, computed with a minibatch of size $M_y$, where $\mathbf{x}_t$ and $\mathbf{y}_t$ are the $t$-th iterates of the algorithms. In particular, $\mathbf{g}_{x,t} = \frac{1}{M_x}\sum_{i=1}^{M_x}\nabla_x f(\mathbf{x}_t,\mathbf{y}_t;\xi^x_{t,i})$ and $\mathbf{g}_{y,t} = \frac{1}{M_y}\sum_{i=1}^{M_y}\nabla_y f(\mathbf{x}_t,\mathbf{y}_t;\xi^y_{t,i})$, where $\{\xi^x_{t,i}\}_{i=1}^{M_x}$ and $\{\xi^y_{t,i}\}_{i=1}^{M_y}$ are i.i.d. minibatch samples used to compute the stochastic gradients at each iteration $t\in\{1,\dots,T\}$.
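To make the notation concrete, the following sketch forms the minibatch estimators $\mathbf{g}_{x,t}$ and $\mathbf{g}_{y,t}$ for a toy objective with additive gradient noise. Both the objective $f(x,y)=\sin(x)\,y - y^2/2$ and the Gaussian noise model are hypothetical illustrations, not assumptions of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth objective (illustrative only):
# f(x, y) = sin(x) * y - 0.5 * y**2
def grad_x(x, y):  # df/dx = cos(x) * y
    return np.cos(x) * y

def grad_y(x, y):  # df/dy = sin(x) - y
    return np.sin(x) - y

def minibatch_estimate(grad, x, y, M, noise_std=0.1):
    """g_t = (1/M) * sum_i grad(x, y; xi_i), with additive zero-mean noise
    playing the role of the samples xi, so E[g_t] = grad(x, y) (unbiased)."""
    samples = grad(x, y) + noise_std * rng.standard_normal(M)
    return samples.mean()

gx = minibatch_estimate(grad_x, 1.0, 0.5, M=64)  # estimator of grad_x f(1.0, 0.5)
gy = minibatch_estimate(grad_y, 1.0, 0.5, M=64)  # estimator of grad_y f(1.0, 0.5)
```

Averaging over a larger minibatch shrinks the estimator's variance by a factor of $1/M$, which is exactly why the stochastic rates below depend on the batch sizes $M_x$ and $M_y$.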
Definition 3.1 (Primal Function). We introduce $\Phi(\mathbf{x}) = \max_{\mathbf{y}\in\mathcal{Y}} f(\mathbf{x},\mathbf{y})$ as the primal function, and define $\mathbf{y}^*(\mathbf{x}) = \arg\max_{\mathbf{y}\in\mathcal{Y}} f(\mathbf{x},\mathbf{y})$ as the optimal dual variable at a point $\mathbf{x}$.
Definition 3.2 (Smoothness). A function $f(\mathbf{x},\mathbf{y})$ is $\ell$-smooth in both $\mathbf{x}$ and $\mathbf{y}$ if it is differentiable and the following inequality holds: $\|\nabla f(\mathbf{x}_1,\mathbf{y}_1) - \nabla f(\mathbf{x}_2,\mathbf{y}_2)\|^2 \le \ell^2\|\mathbf{x}_1-\mathbf{x}_2\|^2 + \ell^2\|\mathbf{y}_1-\mathbf{y}_2\|^2$.
Definition 3.3. A function $g$ is $\mu$-strongly-convex if for any $\mathbf{x}_1,\mathbf{x}_2\in\mathbb{R}^d$ the following holds: $g(\mathbf{x}_2) \ge g(\mathbf{x}_1) + \langle\nabla g(\mathbf{x}_1),\mathbf{x}_2-\mathbf{x}_1\rangle + \frac{\mu}{2}\|\mathbf{x}_1-\mathbf{x}_2\|^2$.
Definition 3.4. We say $\mathbf{x}$ is an $\epsilon$-stationary point for a differentiable function $\Phi$ if $\|\nabla\Phi(\mathbf{x})\| \le \epsilon$.
We note that the $\epsilon$-stationary point is a common optimality criterion used in the NC-SC setting. As pointed out in [30], considering $\Phi(\mathbf{x})$ as the convergence measure is natural, since in many application scenarios we mainly care about the value of the objective $f(\mathbf{x},\mathbf{y})$ under the maximized $\mathbf{y}$, e.g., adversarial training or distributionally robust learning.

When $f(\mathbf{x},\mathbf{y})$ is merely concave in $\mathbf{y}$, $\Phi(\mathbf{x})$ can be non-differentiable. Hence, following the routine of nonsmooth nonconvex minimization [9], we consider the following Moreau envelope function:
Definition 3.5 (Moreau envelope). A function $\Phi_p(\mathbf{x})$ is the $p$-Moreau envelope of a function $\Phi$ if $\Phi_p(\mathbf{x}) := \min_{\mathbf{x}'\in\mathbb{R}^d}\{\Phi(\mathbf{x}') + \frac{1}{2p}\|\mathbf{x}'-\mathbf{x}\|^2\}$.
We will utilize the following property of the Moreau envelope of a nonsmooth function:

Lemma 3.6 (Davis and Drusvyatskiy [9]). Let $\hat{\mathbf{x}} = \arg\min_{\mathbf{x}'\in\mathbb{R}^d}\Phi(\mathbf{x}') + \frac{1}{2p}\|\mathbf{x}'-\mathbf{x}\|^2$; then the following inequalities hold: $\|\hat{\mathbf{x}}-\mathbf{x}\| \le p\|\nabla\Phi_p(\mathbf{x})\|$ and $\min_{\mathbf{v}\in\partial\Phi(\hat{\mathbf{x}})}\|\mathbf{v}\| \le \|\nabla\Phi_p(\mathbf{x})\|$.
Lemma 3.6 suggests that if we can find an $\mathbf{x}$ with a small $\|\nabla\Phi_p(\mathbf{x})\|$, then $\mathbf{x}$ is near some point $\hat{\mathbf{x}}$ that is a near-stationary point of $\Phi$. We will use the $1/2\ell$-Moreau envelope of $\Phi$, following the setting in [30, 45], and establish convergence rates in terms of $\|\nabla\Phi_{1/2\ell}(\mathbf{x})\|$. We also define two quantities, $\hat{\Delta}_\Phi = \Phi_{1/2\ell}(\mathbf{x}_0) - \min_{\mathbf{x}\in\mathbb{R}^d}\Phi_{1/2\ell}(\mathbf{x})$ and $\hat{\Delta}_0 = \Phi(\mathbf{x}_0) - \min_{\mathbf{x}\in\mathbb{R}^d}\Phi(\mathbf{x})$, that appear in our convergence bounds.
Before presenting our results on EG and OGDA, we briefly revisit the most related algorithm, Gradient
Descent Ascent (GDA).
3.1 Gradient Descent Ascent (GDA) algorithm
Algorithm 1 GDA
Input: $(\mathbf{x}_0,\mathbf{y}_0)$, stepsizes $(\eta_x,\eta_y)$
for $t = 1, 2, \dots, T$ do
    $\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \eta_x\nabla_x f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})$
    $\mathbf{y}_t \leftarrow \mathcal{P}_{\mathcal{Y}}(\mathbf{y}_{t-1} + \eta_y\nabla_y f(\mathbf{x}_{t-1},\mathbf{y}_{t-1}))$
end for
Randomly choose $\bar{\mathbf{x}}$ from $\mathbf{x}_1,\dots,\mathbf{x}_T$
Output: $\bar{\mathbf{x}}$

The GDA method, detailed in Algorithm 1, performs simultaneous gradient descent and ascent updates on the primal and dual variables, respectively. This simple algorithm has been deployed extensively in minimax optimization applications such as Generative Adversarial Networks (GANs). Under Assumptions 4.1 and 4.3, Lin et al. [30] established the convergence of GDA by choosing $\eta_x = \Theta(\frac{1}{\kappa^2\ell})$ and $\eta_y = \Theta(\frac{1}{\ell})$. In particular, they showed that deterministic GDA requires $O(\kappa^2\epsilon^{-2})$ calls to a gradient oracle, and stochastic GDA requires $O(\kappa^3\epsilon^{-4})$ calls with a minibatch size of $O(\kappa\epsilon^{-2})$, to find an $\epsilon$-stationary point of the primal function.
3.2 Optimistic Gradient Descent Ascent (OGDA) and Extra-gradient (EG) Methods
We now turn to reviewing the algorithms we study in this paper: the Optimistic GDA (OGDA) and Extra-gradient (EG) methods. To optimize Problem (1), at each iteration $t = 1, 2, \dots, T$, OGDA performs the following updates on the primal and dual variables:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_x\nabla_x f(\mathbf{x}_t,\mathbf{y}_t) - \eta_x\big(\nabla_x f(\mathbf{x}_t,\mathbf{y}_t) - \nabla_x f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})\big)$$
$$\mathbf{y}_{t+1} = \mathcal{P}_{\mathcal{Y}}\Big(\mathbf{y}_t + \eta_y\nabla_y f(\mathbf{x}_t,\mathbf{y}_t) + \eta_y\big(\nabla_y f(\mathbf{x}_t,\mathbf{y}_t) - \nabla_y f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})\big)\Big) \qquad \text{(OGDA)}$$

where correction terms (e.g., $\nabla_x f(\mathbf{x}_t,\mathbf{y}_t) - \nabla_x f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})$) are added to the GDA updates. The EG method performs the following updates:

$$\mathbf{x}_{t+1/2} = \mathbf{x}_t - \eta_x\nabla_x f(\mathbf{x}_t,\mathbf{y}_t), \qquad \mathbf{y}_{t+1/2} = \mathcal{P}_{\mathcal{Y}}\big(\mathbf{y}_t + \eta_y\nabla_y f(\mathbf{x}_t,\mathbf{y}_t)\big)$$
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_x\nabla_x f(\mathbf{x}_{t+1/2},\mathbf{y}_{t+1/2}), \qquad \mathbf{y}_{t+1} = \mathcal{P}_{\mathcal{Y}}\big(\mathbf{y}_t + \eta_y\nabla_y f(\mathbf{x}_{t+1/2},\mathbf{y}_{t+1/2})\big) \qquad \text{(EG)}$$

where the gradient at the current point is used to find a mid-point, and the gradient at the mid-point is then used to find the next iterate. We also consider stochastic variants of the two algorithms, in which full gradients are replaced with unbiased stochastic estimates. Detailed versions of these algorithms are provided in Algorithm 2 and Algorithm 3 in Appendix A.
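The two update rules can be sketched side by side. The snippet below is an illustrative one-dimensional implementation (not the paper's experimental code) on the same hypothetical toy objective used earlier, $f(x,y) = \sin(x)\,y - y^2/2$, with $\mathcal{Y}=\mathbb{R}$ so the projection $\mathcal{P}_{\mathcal{Y}}$ is the identity:

```python
import numpy as np

def grad(x, y):
    """Gradients of the toy objective f(x, y) = sin(x) * y - 0.5 * y**2."""
    return np.cos(x) * y, np.sin(x) - y

def ogda(x0, y0, eta_x, eta_y, T):
    x, y = x0, y0
    gx_prev, gy_prev = grad(x, y)  # past gradients (zero correction at t = 0)
    for _ in range(T):
        gx, gy = grad(x, y)
        # GDA step plus the optimistic correction term (gradient difference)
        x = x - eta_x * gx - eta_x * (gx - gx_prev)
        y = y + eta_y * gy + eta_y * (gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y

def eg(x0, y0, eta_x, eta_y, T):
    x, y = x0, y0
    for _ in range(T):
        gx, gy = grad(x, y)
        x_mid, y_mid = x - eta_x * gx, y + eta_y * gy   # extrapolation to mid-point
        gx_mid, gy_mid = grad(x_mid, y_mid)
        x, y = x - eta_x * gx_mid, y + eta_y * gy_mid   # update x_t, y_t with mid-point gradients
    return x, y
```

Note that OGDA reuses the previous gradient and so needs only one gradient evaluation per iteration (the "single-call" property our unified analysis exploits), whereas EG evaluates two gradients per iteration.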
4 Main Results
We provide upper bounds on the gradient complexity and iteration complexity of the OGDA and EG methods for NC-C and NC-SC objectives in both deterministic and stochastic settings. We also show the tightness of the obtained bounds for the particular choice of learning rates; general stepsize-independent lower bounds are derived in Section 5.
4.1 Nonconvex-strongly-concave minimax problems
We start by establishing the convergence of deterministic OGDA/EG in the NC-SC setting, making the following standard assumption on the loss function.

Assumption 4.1. We assume $f : \mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ is $\ell$-smooth and $f(\mathbf{x},\cdot)$ is $\mu$-strongly-concave. Moreover, we assume the initial primal optimality gap is bounded, i.e., $\Delta_\Phi = \max(\Phi(\mathbf{x}_1),\Phi(\mathbf{x}_0)) - \min_{\mathbf{x}}\Phi(\mathbf{x}) < \infty$.