
(Krizhevsky et al., 2009), (Brown & Sandholm, 2019) where extragradient-based methods were applied in regret matching, (Farina et al., 2019) for the application to counterfactual regret minimization, and (Anagnostides et al., 2022) where these methods were used for training agents to play poker.
Interestingly, until recently, the last-iterate convergence rates of EG and OG were unknown even when F is (maximally) monotone and Lipschitz. The first results in this direction were obtained by Golowich et al. (2020b;a) under additional assumptions (namely, the Jacobian of F being Lipschitz). Later, Gorbunov et al. (2022b;c) and Cai et al. (2022b) closed this question by proving the tight worst-case last-iterate convergence rates of these methods under monotonicity and Lipschitzness of F.
As some important motivating applications involve deep neural networks, the operator F under consideration is typically not monotone. However, for general non-monotone problems, approximating first-order locally optimal solutions can be intractable (Daskalakis et al., 2021; Diakonikolas et al., 2021). Thus, it is natural to consider assumptions of structured non-monotonicity. Recently, Diakonikolas et al. (2021) proposed to analyze EG under an assumption weaker than traditional monotonicity. In the sequel, this assumption is referred to as ρ-negative comonotonicity (with ρ ≥ 0). That is, for all x, y ∈ ℝᵈ, the operator F satisfies:

⟨F(x) − F(y), x − y⟩ ≥ −ρ∥F(x) − F(y)∥².  (1)
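For illustration, consider the hypothetical linear operator F(x) = cR(θ)x, where R(θ) is the planar rotation by an angle θ ∈ (90°, 180°) and c > 0. This F is not monotone (since cos θ < 0), yet it satisfies (1) with equality for ρ = −cos θ/c. The following minimal NumPy sketch (ours, for illustration only; not part of our formal results) verifies this on random pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, c = 2 * np.pi / 3, 0.5          # rotation by 120 degrees, scaled by c
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
F = lambda x: c * (R @ x)              # F(x) = c R(theta) x: not monotone
rho = -np.cos(theta) / c               # claimed comonotonicity parameter

worst = np.inf
for _ in range(10_000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    lhs = np.dot(F(x) - F(y), x - y)               # <F(x) - F(y), x - y>
    rhs = -rho * np.linalg.norm(F(x) - F(y)) ** 2  # -rho ||F(x) - F(y)||^2
    worst = min(worst, lhs - rhs)

print(f"min of lhs - rhs over samples: {worst:.2e}")  # ~0: (1) is tight here
```

Indeed, ⟨F(x) − F(y), x − y⟩ = c cos θ ∥x − y∥² while ∥F(x) − F(y)∥² = c²∥x − y∥², so the inequality holds with equality for this operator.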
A number of works have followed the idea of Diakonikolas et al. (2021) and considered (1) as their working assumption; see, e.g., (Yoon & Ryu, 2021; Lee & Kim, 2021; Luo & Tran-Dinh, 2022; Cai et al., 2022a; Gorbunov et al., 2022a).
Albeit a reasonable first step toward understanding the behavior of algorithms for (IP) beyond monotone F, it remains unclear to what extent the ρ-negative comonotonicity assumption is general enough to capture complex non-monotone operators. This question is crucial for developing a clean optimization theory that can fully encompass ML applications involving neural networks.
To the best of our knowledge, ρ-(star-)negative comonotonicity is the weakest known assumption under which extragradient-type methods can be analyzed for solving (IP). The first part of this work is devoted to providing simple interpretations of this assumption. Then, we close the problem of studying the convergence rate of the PP method in this setting, the base ingredient underlying most algorithms for solving (IP) (which are traditionally interpreted as approximations to PP; see (Nemirovski, 2004)). That is, we provide upper and lower convergence bounds, as well as tight conditions on the stepsize, for PP under negative comonotonicity. We finally consider the last-iterate convergence of EG and OG and provide an almost complete picture in that case, listing the remaining open questions.
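As a concrete reference point for the reader, the PP update is the implicit step x_{k+1} = (Id + γF)⁻¹(x_k), i.e., x_{k+1} solves x_{k+1} + γF(x_{k+1}) = x_k. Below is a minimal NumPy sketch (ours, for illustration; the operator, stepsize, and iteration count are arbitrary choices, not tuned constants from our analysis) running PP on a linear operator F(x) = Ax, for which the resolvent is a linear solve:

```python
import numpy as np

# Toy linear instance F(x) = A x with a dominant skew part, so that the
# problem behaves like a saddle/game rather than a minimization problem.
A = np.array([[0.1, 1.0],
              [-1.0, 0.1]])
gamma = 0.5                          # PP stepsize (illustrative choice)
I = np.eye(2)

x = np.array([1.0, 1.0])             # starting point
for _ in range(50):
    # Implicit (proximal point) step: solve (I + gamma A) x_next = x
    x = np.linalg.solve(I + gamma * A, x)

print("||F(x)|| after 50 PP steps:", np.linalg.norm(A @ x))
```

The implicit step is what makes PP a useful idealized baseline: EG and OG can be viewed as explicit approximations of this resolvent.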
Before moving to the next sections, let us mention that many of our results were discovered using the performance estimation approach, coined by Drori & Teboulle (2014) and formalized by Taylor et al. (2017c;a). The operator version of the framework is due to Ryu et al. (2020). We used the framework through the packages PESTO (Taylor et al., 2017b) and PEPit (Goujaud et al., 2022), thereby providing a simple way to validate our results numerically.
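For readers who wish to reproduce such numerical validations, the snippet below sketches how a performance estimation problem is posed in PEPit, following the pattern of the package's documented examples. Note that the class and helper names used here (PEP, MonotoneOperator, proximal_step) reflect our recollection of one version of the PEPit API and may differ across versions:

```python
from PEPit import PEP
from PEPit.operators import MonotoneOperator
from PEPit.primitive_steps import proximal_step

gamma = 1.0                                     # PP stepsize (illustrative)

problem = PEP()
A = problem.declare_function(MonotoneOperator)  # operator class under study
xs = A.stationary_point()                       # a point xs with 0 in A(xs)
x0 = problem.set_initial_point()
problem.set_initial_condition((x0 - xs) ** 2 <= 1)

# One proximal point step: x1 = (Id + gamma * A)^{-1}(x0)
x1, _, _ = proximal_step(x0, A, gamma)

problem.set_performance_metric((x1 - xs) ** 2)
tau = problem.solve()                           # worst-case ||x1 - xs||^2
print(tau)
```

The solver returns the exact worst-case value of the chosen metric over all operators in the declared class, which is how such bounds can be cross-checked numerically.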
1.1. Preliminaries
In the context of (IP), we refer to F as being ρ-star-negative comonotone (ρ ≥ 0) – a relaxation² of (1) – if for all x ∈ ℝᵈ and x∗ being a solution to (IP), we have:

⟨F(x), x − x∗⟩ ≥ −ρ∥F(x)∥².  (2)
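To see why (2) relaxes (1), note that it is obtained by fixing one of the two points in (1) to be a solution x∗; in the single-valued case where F(x∗) = 0, the derivation is one line (spelled out here for completeness):

```latex
% Set y = x^* in (1) and use F(x^*) = 0:
\langle F(x) - F(x^*),\, x - x^* \rangle \ge -\rho \| F(x) - F(x^*) \|^2
\quad \Longrightarrow \quad
\langle F(x),\, x - x^* \rangle \ge -\rho \| F(x) \|^2 .
```

Thus every ρ-negative comonotone operator is ρ-star-negative comonotone, while the converse fails in general (see footnote 2).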
Furthermore, similarly to monotone operators (see (Bauschke et al., 2011) or (Ryu & Yin, 2020) for details), we assume that the mapping F is maximal, in the sense that its graph is not strictly contained in the graph of any other ρ-negative comonotone (resp., ρ-star-negative comonotone) operator, which ensures that the corresponding proximal operator used in the sequel is well-defined. Some examples of star-negative comonotone operators are given in (Pethick et al., 2022, Appendix C). Moreover, if F is star-monotone or quasi-strongly monotone (Loizou et al., 2021), then F is also star-negative comonotone. Examples of star-monotone/quasi-strongly monotone operators that are not monotone are given in (Loizou et al., 2021, Appendix A.6). Next, there are several studies of the eigenvalues of the Jacobian around the equilibrium of GAN games (Mescheder et al., 2018; Nagarajan & Kolter, 2017; Berard et al., 2019). These studies imply that the corresponding variational inequalities are locally quasi-strongly monotone.
Finally, when F is L-Lipschitz, it satisfies ⟨F(x), x − x∗⟩ ≥ −L∥x − x∗∥². If, in addition, ∥F(x)∥ ≥ η∥x − x∗∥ for some η > 0 (meaning that F does not change “too slowly”), then ⟨F(x), x − x∗⟩ ≥ −(L/η²)∥F(x)∥², i.e., condition (2) holds with ρ = L/η².
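For completeness, the two steps behind this claim are the Cauchy–Schwarz inequality combined with L-Lipschitzness (recalling F(x∗) = 0), followed by the assumed lower bound on ∥F(x)∥:

```latex
\langle F(x),\, x - x^* \rangle
  = \langle F(x) - F(x^*),\, x - x^* \rangle
  \ge -\| F(x) - F(x^*) \| \, \| x - x^* \|
  \ge -L \| x - x^* \|^2 ,
% and, since \| F(x) \| \ge \eta \| x - x^* \| gives
% \| x - x^* \|^2 \le \| F(x) \|^2 / \eta^2:
-L \| x - x^* \|^2 \ge -\frac{L}{\eta^2} \| F(x) \|^2 .
```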
For the analysis of EG and OG, we further assume F to be L-Lipschitz, meaning that for all x, y ∈ ℝᵈ:

∥F(x) − F(y)∥ ≤ L∥x − y∥.  (3)

Note that in this case, F is a single-valued mapping and (IP) transforms into a variational inequality:

find x∗ ∈ ℝᵈ such that F(x∗) = 0.  (VIP)
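To fix ideas, one EG step extrapolates along F and then updates the iterate using F evaluated at the extrapolated point; OG instead reuses the previously computed value of F to save one evaluation per iteration. Below is a minimal NumPy sketch of EG (ours, for illustration), applied to a toy linear instance of (VIP); the stepsize γ = 1/(2L) is a common textbook choice rather than a constant from our analysis:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])         # F(x) = A x: monotone (skew) but not strongly
F = lambda x: A @ x
L = np.linalg.norm(A, 2)            # Lipschitz constant of F (spectral norm)
gamma = 1 / (2 * L)                 # illustrative stepsize

x = np.array([1.0, 1.0])
for _ in range(200):
    x_half = x - gamma * F(x)       # extrapolation step
    x = x - gamma * F(x_half)       # update from x using F at the midpoint

print("||F(x)|| after 200 EG steps:", np.linalg.norm(F(x)))
```

On this instance, the plain iteration x ← x − γF(x) would spiral away from the solution, while EG converges; this is the classical motivation for the extrapolation step.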
1.2. Related Work
Last-iterate convergence rates in the monotone case.
Several recent theoretical advances focus on the last-iterate
²For an example of a star-negative comonotone operator that is not negative comonotone, we refer to (Daskalakis et al., 2020, Section 5.1) and (Diakonikolas et al., 2021, Section 2.2).