Tight Analysis of Extra-gradient and Optimistic
Gradient Methods For Nonconvex Minimax Problems
Pouria Mahdavinia†, Yuyang Deng†, Haochuan Li§, Mehrdad Mahdavi†
†Department of Computer Science and Engineering
The Pennsylvania State University
{pxm5426, yzd82, mzm616}@psu.edu
§Department of Electrical and Computer Engineering
Massachusetts Institute of Technology
haochuan@mit.edu
Abstract
Despite the established convergence theory of Optimistic Gradient Descent Ascent (OGDA) and
Extragradient (EG) methods for convex-concave minimax problems, little is known about the
theoretical guarantees of these methods in nonconvex settings. To bridge this gap, this paper
establishes, for the first time, the convergence of the OGDA and EG methods in the
nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings via a unified analysis
through the lens of single-call extra-gradient methods. We further establish lower bounds on the
convergence of GDA/OGDA/EG, shedding light on the tightness of our analysis, and conduct
experiments supporting our theoretical results. We believe our results will advance the theoretical
understanding of the OGDA and EG methods for solving real-world nonconvex minimax problems,
e.g., training Generative Adversarial Networks (GANs) or robust neural networks.
1 Introduction
In this paper, we consider the following minimax problem:

$$\min_{\mathbf{x}\in\mathbb{R}^d}\;\max_{\mathbf{y}\in\mathcal{Y}}\; f(\mathbf{x},\mathbf{y}) \qquad (1)$$

where $\mathcal{Y}$ could be a bounded convex or unbounded set, and the function $f : \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}$ is smooth and strongly-concave/concave with respect to $\mathbf{y}$, but possibly nonconvex in $\mathbf{x}$. Minimax optimization (Problem 1) has been explored in a variety of fields, including classical game theory, online learning, and control theory [2, 50, 21]. Minimax has emerged as a key optimization framework for machine learning applications such as generative adversarial networks (GANs) [14], robust and adversarial machine learning [46, 37, 15], and reinforcement learning [54, 43].
Gradient descent ascent (GDA) is a well-known algorithm for solving minimax problems, and it is widely used to optimize generative adversarial networks. In each iteration, GDA simultaneously performs a gradient descent step on the primal variable $\mathbf{x}$ and a gradient ascent step on the dual variable $\mathbf{y}$. GDA with equal step sizes for both variables converges linearly to the Nash equilibrium under the strongly-convex strongly-concave (SC-SC) assumption [28, 12], but it can diverge even in the convex-concave (C-C) setting, e.g., for bilinear functions [22, 38].
Given the highly nonconvex nature of practical applications such as GANs, exploring convergence guarantees of minimax optimization algorithms beyond the convex-concave (C-C) setting is one of the canonical research directions in minimax optimization. Several algorithms with convergence guarantees beyond the C-C domain have been explored in the literature. Alternating Gradient Descent Ascent (AGDA) is one such method, demonstrated to have excellent convergence properties beyond the C-C setting [51, 52, 6]. Two other powerful algorithms are Extragradient (EG) and Optimistic GDA (OGDA), which have recently gained prominence due to their superior empirical performance in training GANs compared to other minimax optimization algorithms [28, 8, 38].
arXiv:2210.09382v1 [cs.LG] 17 Oct 2022
| Algorithm | NC-C (Deterministic) | NC-C (Stochastic) | NC-SC (Deterministic) | NC-SC (Stochastic) |
|---|---|---|---|---|
| PG-SVRG [44] | – | $\tilde{O}(\epsilon^{-6})$ | – | – |
| HiBSA [36] | $O(\epsilon^{-8})$ | – | – | – |
| Prox-DIAG [48] | $\tilde{O}(\epsilon^{-3})$ | – | – | – |
| Minimax-PPA [31] | $O(\epsilon^{-4})$ | – | $O(\sqrt{\kappa}\epsilon^{-2})$ | – |
| ALSET [4] | – | – | $O(\kappa^3\epsilon^{-2})$ | $O(\kappa^3\epsilon^{-4})$ |
| Smoothed-AGDA [52] | – | – | $O(\kappa\epsilon^{-2})$ | $O(\kappa^2\epsilon^{-4})$ |
| GDA [30] | $O(\epsilon^{-6})$ | $O(\epsilon^{-8})$ | $O(\kappa^2\epsilon^{-2})$ | $O(\kappa^3\epsilon^{-4})$ |
| OGDA/EG (Theorems 4.2, 4.4, 4.8, 4.9) | $O(\epsilon^{-6})$ | $O(\epsilon^{-8})$ | $O(\kappa^2\epsilon^{-2})$ | $O(\kappa^3\epsilon^{-4})$ |

Table 1: A summary of prior and our convergence rates in nonconvex-concave (NC-C) and nonconvex-strongly-concave (NC-SC) minimax optimization. For NC-C, we assume $f(\mathbf{x},\mathbf{y})$ is $\ell$-smooth, $G$-Lipschitz in $\mathbf{x}$, and concave in $\mathbf{y}$; for NC-SC, we assume $\ell$-smoothness and $\mu$-strong concavity in $\mathbf{y}$, where $\kappa = \ell/\mu$.
Spurred by the empirical success of the EG and OGDA methods, there has been a tremendous amount of work on the theoretical understanding of their convergence rates under different sets of assumptions. In particular, the convergence properties of EG and OGDA were recently investigated in the SC-SC and C-C settings, where it has been shown that they tend to converge significantly faster than GDA in both deterministic and stochastic settings [39, 12, 40]. Despite these remarkable advances, there is a dearth of theoretical understanding of the convergence of the OGDA and EG methods in the nonconvex setting, which motivates us to rigorously examine these methods in nonconvex minimax optimization. We emphasize that our focus is on the vanilla variants of OGDA/EG; improved rates for NC-C and NC-SC problems have already been obtained with novel algorithms, as discussed in Section 2.
Contributions. We propose a unified framework for analyzing and establishing the convergence of the OGDA and EG methods for solving NC-SC and NC-C minimax problems. To the best of our knowledge, our analysis provides the first theoretical guarantees for such problems. Our contributions can be summarized as follows:

- For NC-SC objectives, we demonstrate that the OGDA and EG iterates converge to a stationary point, with a gradient complexity of $O(\kappa^2\epsilon^{-2})$ in the deterministic case and $O(\kappa^3\epsilon^{-4})$ in the stochastic setting, matching the gradient complexity of GDA in [30].

- For NC-C objectives, we establish gradient complexities of $O(\epsilon^{-6})$ and $O(\epsilon^{-8})$ for deterministic and stochastic oracles, respectively. Compared to the most analogous work on GDA [30], our rate matches the gradient complexity of GDA, and our results show that OGDA and EG have the advantage of shaving off a significant term related to the primal function gap ($\hat{\Delta}_0 = \Phi(\mathbf{x}_0) - \min_{\mathbf{x}}\Phi(\mathbf{x})$).

- We establish impossibility results on the achievable rates by providing $\Omega(\kappa^2\epsilon^{-2})$ and $\Omega(\epsilon^{-6})$ lower bounds, based on the common choice of parameters for both OGDA and EG, in the deterministic NC-SC and NC-C settings, respectively, thus demonstrating the tightness of our upper bounds.

- By carefully designing hard instances, we establish a general lower bound of $\Omega(\kappa\epsilon^{-2})$, independent of the learning rate, for the GDA/OGDA/EG methods in the deterministic NC-SC setting, demonstrating the optimality of the obtained upper bound up to a factor of $\kappa$.
2 Related Work
Extra-gradient (EG) and OGDA methods. Under the smooth SC-SC assumption, deterministic OGDA and EG have been shown to converge to an $O(\epsilon)$ neighborhood of the optimal solution at a rate of $O(\kappa\log(1/\epsilon))$ [39, 49]. Fallah et al. [12] improved upon these rates by proposing multistage OGDA, which achieves the best-known rate of $O(\max(\kappa\log(1/\epsilon),\, \sigma^2/(\mu^2\epsilon^2)))$ for stochastic OGDA in the SC-SC setting. Under monotonicity and gradient Lipschitzness assumptions (a slightly weaker notion than smooth convex-concave problems), Cai et al. [3] established a tight last-iterate convergence rate of $O(1/\sqrt{T})$ for OGDA and EG, and similar results for EG have been achieved in [17, 16]. Furthermore, to the best of our knowledge, the OGDA and EG methods have not been extensively explored in nonconvex-nonconcave settings, except in a few recent works on structured nonconvex-nonconcave problems in which the analysis is done through the lens of variational inequalities; this line of work is discussed in the nonconvex-nonconcave paragraph below. Moreover, Guo et al. [18] recently established the convergence rate of OGDA in the NC-SC setting; however, they impose a $\mu$-PL assumption on $\Phi(\mathbf{x})$, which is a strong assumption that further allows them to state the convergence rate in terms of the objective gap. We do not make such an assumption on the primal function and hence, unlike [18], we measure convergence by the gradient norm of the primal function.
Nonconvex-strongly-concave (NC-SC) problems. In the deterministic setting, Lin et al. [30] demonstrated the first non-asymptotic convergence of GDA to an $\epsilon$-stationary point of $\Phi(\mathbf{x})$, with a gradient complexity of $O(\kappa^2\epsilon^{-2})$. Lin et al. [31] and Zhang et al. [55] proposed triple-loop algorithms achieving a gradient complexity of $O(\sqrt{\kappa}\epsilon^{-2})$ by leveraging ideas from catalyst methods (adding $\alpha\|\mathbf{x}-\mathbf{x}_0\|^2$ to the objective function) and inexact proximal point methods, which nearly match the existing lower bound [27, 55, 20]. Approximating the inner-loop optimization of the catalyst idea by one step of GDA, Yang et al. [52] developed a single-loop algorithm called Smoothed-AGDA, which provably converges to an $\epsilon$-stationary point with a gradient complexity of $O(\kappa\epsilon^{-2})$. For the stochastic setting, Lin et al. [30] showed that stochastic GDA, with a primal-to-dual learning-rate ratio of $O(1/\kappa^2)$, converges to an $\epsilon$-stationary point with a gradient complexity of $O(\kappa^3\epsilon^{-4})$. Chen et al. [4] proposed a double-loop algorithm whose outer loop performs one step of gradient descent on the primal variable, while the inner loop performs multiple steps of gradient ascent. Using this idea, they achieved a gradient complexity of $O(\kappa^3\epsilon^{-4})$ with a fixed batch size; however, their algorithm is double-loop, and the iteration complexity of the inner loop is $O(\kappa)$. Yang et al. [52] also introduced the stochastic version of Smoothed-AGDA mentioned earlier, showing a gradient complexity of $O(\kappa^2\epsilon^{-4})$ with a fixed batch size. They achieved the best-known rate for NC-PL problems, where PL is an even weaker assumption than strong concavity.
Nonconvex-concave. Recently, due to the surge of GANs [14] and adversarially robust neural network training, a line of research has focused on nonconvex-concave or even nonconvex-nonconcave minimax optimization problems [36, 29, 41, 44, 48, 13, 32, 33, 24]. For the nonconvex-concave setting, to the best of our knowledge, Rafique et al. [44] were the first to propose a provable nonconvex-concave minimax algorithm, the Proximally Guided Stochastic Mirror Descent method, which achieves a gradient complexity of $O(\epsilon^{-6})$ for finding a stationary point. Nouiehed et al. [41] presented a double-loop algorithm to solve nonconvex-concave problems with constraints on both $\mathbf{x}$ and $\mathbf{y}$, achieving an $O(\epsilon^{-7})$ rate. Lin et al. [30] provided the first analysis of the classic (S)GDA algorithm on nonconvex-strongly-concave and nonconvex-concave functions; in the nonconvex-concave setting, they achieve $O(\epsilon^{-6})$ for GDA and $O(\epsilon^{-8})$ for SGDA. Zhang et al. [53] proposed Smoothed-GDA and also achieve an $O(\epsilon^{-8})$ rate. Thekumparampil et al. [48] proposed the Proximal Dual Implicit Accelerated Gradient method and achieved the best-known rate of $\tilde{O}(\epsilon^{-3})$ for nonconvex-concave problems. Kong and Monteiro [26] proposed an accelerated inexact proximal point method that also achieves an $O(\epsilon^{-3})$ rate, and Lin et al. [31] designed a near-optimal algorithm using an acceleration method with an $O(\epsilon^{-3})$ rate. However, these algorithms require double or triple loops and are not as easy to implement as the GDA, OGDA, or EG methods.
Nonconvex-nonconcave. Minimax optimization problems can be cast as a special case of variational inequality problems (VIPs) [1, 34]. Thus, one way of studying convergence in nonconvex-nonconcave problems is to leverage variants of variational inequality properties, such as the monotone variational inequality, the Minty variational inequality (MVI), the weak MVI, and negative comonotonicity, which are weaker assumptions than convex-concavity. For instance, Loizou et al. [35] showed linear convergence of SGDA under expected co-coercivity, a condition that can hold for non-monotone problems. Moreover, it has been shown that deterministic EG attains a gradient complexity of $O(\epsilon^{-2})$ in the aforementioned settings [7, 10, 47, 42]. Alternatively, another line of work establishes convergence under weaker notions of strong convexity, such as the Polyak-Łojasiewicz (PL) condition or $\rho$-weak convexity. Yang et al. [51] established linear convergence of the AGDA algorithm under the two-sided PL condition, and Hajizadeh et al. [19] obtained similar results for EG under the weakly-convex, weakly-concave assumption.
3 Problem setup and preliminaries
We use lower-case boldface letters such as $\mathbf{x}$ to denote vectors and let $\|\cdot\|$ denote the $\ell_2$-norm of vectors. In Problem (1), we refer to $\mathbf{x}$ as the primal variable and to $\mathbf{y}$ as the dual variable. For a function $f : \mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$, we use $\nabla_x f(\mathbf{x},\mathbf{y})$ to denote the gradient of $f(\mathbf{x},\mathbf{y})$ with respect to the primal variable $\mathbf{x}$, and $\nabla_y f(\mathbf{x},\mathbf{y})$ to denote the gradient with respect to the dual variable $\mathbf{y}$. In the stochastic setting, we let $\mathbf{g}_{x,t}$ be the unbiased estimator of $\nabla_x f(\mathbf{x}_t,\mathbf{y}_t)$, computed with a minibatch of size $M_x$, and $\mathbf{g}_{y,t}$ be the unbiased estimator of $\nabla_y f(\mathbf{x}_t,\mathbf{y}_t)$, computed with a minibatch of size $M_y$, where $\mathbf{x}_t$ and $\mathbf{y}_t$ are the $t$-th iterates of the algorithms. In particular, $\mathbf{g}_{x,t} = \frac{1}{M_x}\sum_{i=1}^{M_x}\nabla_x f(\mathbf{x}_t,\mathbf{y}_t;\xi^x_{t,i})$ and $\mathbf{g}_{y,t} = \frac{1}{M_y}\sum_{i=1}^{M_y}\nabla_y f(\mathbf{x}_t,\mathbf{y}_t;\xi^y_{t,i})$, where $\{\xi^x_{t,i}\}_{i=1}^{M_x}$ and $\{\xi^y_{t,i}\}_{i=1}^{M_y}$ are i.i.d. minibatch samples used to compute the stochastic gradients at each iteration $t\in\{1,\dots,T\}$.
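To make the notation concrete, the following sketch forms the minibatch estimators $\mathbf{g}_{x,t}$ and $\mathbf{g}_{y,t}$ for a toy objective with additive gradient noise. Both the objective $f(x,y)=\sin(x)\,y - y^2/2$ and the Gaussian noise model are hypothetical illustrations, not assumptions of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth objective (illustrative only):
# f(x, y) = sin(x) * y - 0.5 * y**2
def grad_x(x, y):  # df/dx = cos(x) * y
    return np.cos(x) * y

def grad_y(x, y):  # df/dy = sin(x) - y
    return np.sin(x) - y

def minibatch_estimate(grad, x, y, M, noise_std=0.1):
    """g_t = (1/M) * sum_i grad(x, y; xi_i), with additive zero-mean noise
    playing the role of the samples xi, so E[g_t] = grad(x, y) (unbiased)."""
    samples = grad(x, y) + noise_std * rng.standard_normal(M)
    return samples.mean()

gx = minibatch_estimate(grad_x, 1.0, 0.5, M=64)  # estimator of grad_x f(1.0, 0.5)
gy = minibatch_estimate(grad_y, 1.0, 0.5, M=64)  # estimator of grad_y f(1.0, 0.5)
```

Averaging over a larger minibatch shrinks the estimator's variance by a factor of $1/M$, which is exactly why the stochastic rates below depend on the batch sizes $M_x$ and $M_y$.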
Definition 3.1 (Primal Function). We introduce $\Phi(\mathbf{x}) = \max_{\mathbf{y}\in\mathcal{Y}} f(\mathbf{x},\mathbf{y})$ as the primal function, and define $\mathbf{y}^*(\mathbf{x}) = \arg\max_{\mathbf{y}\in\mathcal{Y}} f(\mathbf{x},\mathbf{y})$ as the optimal dual variable at a point $\mathbf{x}$.
Definition 3.2 (Smoothness). A function $f(\mathbf{x},\mathbf{y})$ is $\ell$-smooth in both $\mathbf{x}$ and $\mathbf{y}$ if it is differentiable and the following inequality holds: $\|\nabla f(\mathbf{x}_1,\mathbf{y}_1) - \nabla f(\mathbf{x}_2,\mathbf{y}_2)\|^2 \le \ell^2\|\mathbf{x}_1-\mathbf{x}_2\|^2 + \ell^2\|\mathbf{y}_1-\mathbf{y}_2\|^2$.
Definition 3.3. A function $g$ is $\mu$-strongly-convex if for any $\mathbf{x}_1,\mathbf{x}_2\in\mathbb{R}^d$ the following holds: $g(\mathbf{x}_2) \ge g(\mathbf{x}_1) + \langle\nabla g(\mathbf{x}_1),\mathbf{x}_2-\mathbf{x}_1\rangle + \frac{\mu}{2}\|\mathbf{x}_1-\mathbf{x}_2\|^2$.
Definition 3.4. We say $\mathbf{x}$ is an $\epsilon$-stationary point for a differentiable function $\Phi$ if $\|\nabla\Phi(\mathbf{x})\| \le \epsilon$.
We note that the $\epsilon$-stationary point is a common optimality criterion used in the NC-SC setting. As pointed out in [30], considering $\Phi(\mathbf{x})$ as the convergence measure is natural, since in many application scenarios we mainly care about the value of the objective $f(\mathbf{x},\mathbf{y})$ under the maximized $\mathbf{y}$, e.g., adversarial training or distributionally robust learning.

When $f(\mathbf{x},\mathbf{y})$ is merely concave in $\mathbf{y}$, $\Phi(\mathbf{x})$ can be non-differentiable. Hence, following the routine of nonsmooth nonconvex minimization [9], we consider the following Moreau envelope function:
Definition 3.5 (Moreau envelope). A function $\Phi_p(\mathbf{x})$ is the $p$-Moreau envelope of a function $\Phi$ if $\Phi_p(\mathbf{x}) := \min_{\mathbf{x}'\in\mathbb{R}^d}\{\Phi(\mathbf{x}') + \frac{1}{2p}\|\mathbf{x}'-\mathbf{x}\|^2\}$.
We will utilize the following property of the Moreau envelope of a nonsmooth function:

Lemma 3.6 (Davis and Drusvyatskiy [9]). Let $\hat{\mathbf{x}} = \arg\min_{\mathbf{x}'\in\mathbb{R}^d}\Phi(\mathbf{x}') + \frac{1}{2p}\|\mathbf{x}'-\mathbf{x}\|^2$; then the following inequalities hold: $\|\hat{\mathbf{x}}-\mathbf{x}\| \le p\|\nabla\Phi_p(\mathbf{x})\|$ and $\min_{\mathbf{v}\in\partial\Phi(\hat{\mathbf{x}})}\|\mathbf{v}\| \le \|\nabla\Phi_p(\mathbf{x})\|$.
Lemma 3.6 suggests that if we can find an $\mathbf{x}$ with a small $\|\nabla\Phi_p(\mathbf{x})\|$, then $\mathbf{x}$ is near some point $\hat{\mathbf{x}}$ that is a near-stationary point of $\Phi$. We will use the $1/2\ell$-Moreau envelope of $\Phi$, following the setting in [30, 45], and establish convergence rates in terms of $\|\nabla\Phi_{1/2\ell}(\mathbf{x})\|$. We also define two quantities, $\hat{\Delta}_\Phi = \Phi_{1/2\ell}(\mathbf{x}_0) - \min_{\mathbf{x}\in\mathbb{R}^d}\Phi_{1/2\ell}(\mathbf{x})$ and $\hat{\Delta}_0 = \Phi(\mathbf{x}_0) - \min_{\mathbf{x}\in\mathbb{R}^d}\Phi(\mathbf{x})$, that appear in our convergence bounds.
Before presenting our results on EG and OGDA, we briefly revisit the most related algorithm, Gradient
Descent Ascent (GDA).
3.1 Gradient Descent Ascent (GDA) algorithm
Algorithm 1 GDA
Input: $(\mathbf{x}_0,\mathbf{y}_0)$, stepsizes $(\eta_x,\eta_y)$
for $t = 1, 2, \dots, T$ do
    $\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \eta_x\nabla_x f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})$
    $\mathbf{y}_t \leftarrow \mathcal{P}_{\mathcal{Y}}(\mathbf{y}_{t-1} + \eta_y\nabla_y f(\mathbf{x}_{t-1},\mathbf{y}_{t-1}))$
end for
Randomly choose $\bar{\mathbf{x}}$ from $\mathbf{x}_1,\dots,\mathbf{x}_T$
Output: $\bar{\mathbf{x}}$

The GDA method, detailed in Algorithm 1, performs simultaneous gradient descent and ascent updates on the primal and dual variables, respectively. This simple algorithm has been deployed extensively in minimax optimization applications such as Generative Adversarial Networks (GANs). Under Assumptions 4.1 and 4.3, Lin et al. [30] established the convergence of GDA by choosing $\eta_x = \Theta(\frac{1}{\kappa^2\ell})$ and $\eta_y = \Theta(\frac{1}{\ell})$. In particular, they showed that deterministic GDA requires $O(\kappa^2\epsilon^{-2})$ calls to a gradient oracle, and stochastic GDA requires $O(\kappa^3\epsilon^{-4})$ calls with a minibatch size of $O(\kappa\epsilon^{-2})$, to find an $\epsilon$-stationary point of the primal function.
3.2 Optimistic Gradient Descent Ascent (OGDA) and Extra-gradient (EG) Methods
We now turn to reviewing the algorithms we study in this paper: the Optimistic GDA (OGDA) and Extra-gradient (EG) methods. To optimize Problem (1), at each iteration $t = 1, 2, \dots, T$, OGDA performs the following updates on the primal and dual variables:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_x\nabla_x f(\mathbf{x}_t,\mathbf{y}_t) - \eta_x\big(\nabla_x f(\mathbf{x}_t,\mathbf{y}_t) - \nabla_x f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})\big)$$
$$\mathbf{y}_{t+1} = \mathcal{P}_{\mathcal{Y}}\Big(\mathbf{y}_t + \eta_y\nabla_y f(\mathbf{x}_t,\mathbf{y}_t) + \eta_y\big(\nabla_y f(\mathbf{x}_t,\mathbf{y}_t) - \nabla_y f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})\big)\Big) \qquad \text{(OGDA)}$$

where correction terms (e.g., $\nabla_x f(\mathbf{x}_t,\mathbf{y}_t) - \nabla_x f(\mathbf{x}_{t-1},\mathbf{y}_{t-1})$) are added to the GDA updates. The EG method performs the following updates:

$$\mathbf{x}_{t+1/2} = \mathbf{x}_t - \eta_x\nabla_x f(\mathbf{x}_t,\mathbf{y}_t), \qquad \mathbf{y}_{t+1/2} = \mathcal{P}_{\mathcal{Y}}\big(\mathbf{y}_t + \eta_y\nabla_y f(\mathbf{x}_t,\mathbf{y}_t)\big)$$
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_x\nabla_x f(\mathbf{x}_{t+1/2},\mathbf{y}_{t+1/2}), \qquad \mathbf{y}_{t+1} = \mathcal{P}_{\mathcal{Y}}\big(\mathbf{y}_t + \eta_y\nabla_y f(\mathbf{x}_{t+1/2},\mathbf{y}_{t+1/2})\big) \qquad \text{(EG)}$$

where the gradient at the current point is used to find a mid-point, and the gradient at the mid-point is then used to find the next iterate. We also consider stochastic variants of the two algorithms, in which full gradients are replaced with unbiased stochastic estimates. Detailed versions of these algorithms are provided in Algorithm 2 and Algorithm 3 in Appendix A.
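The two update rules can be sketched side by side. The snippet below is an illustrative one-dimensional implementation (not the paper's experimental code) on the same hypothetical toy objective used earlier, $f(x,y) = \sin(x)\,y - y^2/2$, with $\mathcal{Y}=\mathbb{R}$ so the projection $\mathcal{P}_{\mathcal{Y}}$ is the identity:

```python
import numpy as np

def grad(x, y):
    """Gradients of the toy objective f(x, y) = sin(x) * y - 0.5 * y**2."""
    return np.cos(x) * y, np.sin(x) - y

def ogda(x0, y0, eta_x, eta_y, T):
    x, y = x0, y0
    gx_prev, gy_prev = grad(x, y)  # past gradients (zero correction at t = 0)
    for _ in range(T):
        gx, gy = grad(x, y)
        # GDA step plus the optimistic correction term (gradient difference)
        x = x - eta_x * gx - eta_x * (gx - gx_prev)
        y = y + eta_y * gy + eta_y * (gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y

def eg(x0, y0, eta_x, eta_y, T):
    x, y = x0, y0
    for _ in range(T):
        gx, gy = grad(x, y)
        x_mid, y_mid = x - eta_x * gx, y + eta_y * gy   # extrapolation to mid-point
        gx_mid, gy_mid = grad(x_mid, y_mid)
        x, y = x - eta_x * gx_mid, y + eta_y * gy_mid   # update x_t, y_t with mid-point gradients
    return x, y
```

Note that OGDA reuses the previous gradient and so needs only one gradient evaluation per iteration (the "single-call" property our unified analysis exploits), whereas EG evaluates two gradients per iteration.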
4 Main Results
We provide upper bounds on the gradient complexity and iteration complexity of the OGDA and EG methods for NC-C and NC-SC objectives in both deterministic and stochastic settings. We also show the tightness of the obtained bounds for the particular choice of learning rates; general stepsize-independent lower bounds are derived in Section 5.
4.1 Nonconvex-strongly-concave minimax problems
We start by establishing the convergence of deterministic OGDA/EG in the NC-SC setting, making the following standard assumption on the loss function.

Assumption 4.1. We assume $f : \mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ is $\ell$-smooth and $f(\mathbf{x},\cdot)$ is $\mu$-strongly-concave. Moreover, we assume the initial primal optimality gap is bounded, i.e., $\Delta_\Phi = \max(\Phi(\mathbf{x}_1),\Phi(\mathbf{x}_0)) - \min_{\mathbf{x}}\Phi(\mathbf{x}) < \infty$.