Distributed Distributionally Robust Optimization
with Non-Convex Objectives
Yang Jiao
Tongji University
yangjiao@tongji.edu.cn
Kai Yang
Tongji University
kaiyang@tongji.edu.cn
Dongjin Song
University of Connecticut
dongjin.song@uconn.edu
Abstract
Distributionally Robust Optimization (DRO), which aims to find an optimal decision that minimizes the worst-case cost over an ambiguity set of probability distributions, has been widely applied in diverse applications, e.g., network behavior analysis, risk management, etc. However, existing DRO techniques face three key challenges: 1) how to deal with asynchronous updating in a distributed environment; 2) how to leverage the prior distribution effectively; 3) how to properly adjust the degree of robustness according to different scenarios. To this end, we propose an asynchronous distributed algorithm, named the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm with the itErative Active SEt method (EASE), to tackle the distributed distributionally robust optimization (DDRO) problem. Furthermore, a new uncertainty set, i.e., the constrained D-norm uncertainty set, is developed to effectively leverage the prior distribution and flexibly control the degree of robustness. Finally, our theoretical analysis elucidates that the proposed algorithm is guaranteed to converge, and the iteration complexity is also analyzed. Extensive empirical studies on real-world datasets demonstrate that the proposed method can not only achieve fast convergence and remain robust against data heterogeneity as well as malicious attacks, but also trade off robustness with performance.
1 Introduction
The past decade has witnessed the proliferation of smartphones and Internet of Things (IoT) devices, which generate a plethora of data every day. Centralized machine learning requires gathering the data at a particular server to train models, which incurs high communication overhead [46] and suffers from privacy risks [43]. As a remedy, distributed machine learning methods have been proposed. Considering a distributed system composed of $N$ workers (devices), we denote the datasets of these workers as $\{D_1, \cdots, D_N\}$. For the $j$th ($1 \le j \le N$) worker, the labeled dataset is given as $D_j = \{x_j^i, y_j^i\}$, where $x_j^i \in \mathbb{R}^d$ and $y_j^i \in \{1, \cdots, c\}$ denote the $i$th data sample and the corresponding label, respectively. The distributed learning task can be formulated as the following optimization problem,
$$\min_{w \in \mathcal{W}} F(w) \quad \text{with} \quad F(w) := \sum\nolimits_j f_j(w), \qquad (1)$$
where $w \in \mathbb{R}^p$ is the model parameter to be learned, $\mathcal{W} \subseteq \mathbb{R}^p$ is a nonempty closed convex set, and $f_j(\cdot)$ is the empirical risk over the $j$th worker involving only the local data:
$$f_j(w) = \sum_{i:\, x_j^i \in D_j} \frac{1}{|D_j|} L_j(x_j^i, y_j^i; w), \qquad (2)$$
Corresponding author.
where $L_j$ is the local objective function over the $j$th worker. The problem in Eq. (1) arises in numerous areas, such as distributed signal processing [19], multi-agent optimization [36], etc. However, this problem does not consider the data heterogeneity [57, 40, 39, 30] among different workers (i.e., the data distributions of different workers could be substantially different from each other [44]). Indeed, it has been shown that traditional federated approaches, such as FedAvg [33], built for independent and identically distributed (IID) data may perform poorly when applied to non-IID data [27]. This issue can be mitigated by learning a robust model that aims to achieve uniformly good performance over all workers, i.e., by solving the following distributionally robust optimization (DRO) problem in a distributed manner:
$$\min_{w \in \mathcal{W}} \max_{p \in \mathcal{P}} F(w, p) := \sum\nolimits_j p_j f_j(w), \qquad (3)$$
where $p = [p_1, \cdots, p_N] \in \mathbb{R}^N$ is the adversarial distribution over the $N$ workers, and the $j$th entry of this vector, i.e., $p_j$, represents the adversarial distribution value for the $j$th worker. $\Delta_N = \{p \in \mathbb{R}^N_+ : \mathbf{1}^\top p = 1\}$ denotes the probability simplex and $\mathcal{P}$ is a subset of $\Delta_N$. Agnostic federated learning (AFL) [35] first introduces the distributionally robust (agnostic) loss in federated learning and provides the convergence rate for (strongly) convex functions. However, AFL does not discuss the setting of $\mathcal{P}$. DRFA-Prox [16] considers $\mathcal{P} = \Delta_N$ and imposes a regularizer on the adversarial distribution to leverage the prior distribution. Nevertheless, three key challenges have not yet been addressed by prior works. First, is it possible to construct an uncertainty framework that can not only flexibly maintain the trade-off between model robustness and performance but also effectively leverage the prior distribution? Second, how can one design asynchronous algorithms with guaranteed convergence? Compared to synchronous algorithms, the master in asynchronous algorithms can update its parameters after receiving updates from only a small subset of workers [58, 10]. Asynchronous algorithms are particularly desirable in practice since they can relax strict data dependencies and ensure convergence even in the presence of device failures [58]. Finally, is it possible to flexibly adjust the degree of robustness? Moreover, it is necessary to provide a convergence guarantee when the objectives (i.e., $f_j(w_j), \forall j$) are non-convex.
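To make the contrast concrete, below is a minimal toy sketch (not from the paper) comparing the uniform-average objective in Eq. (1) with the distributionally robust objective in Eq. (3) for the special case $\mathcal{P} = \Delta_N$, in which the inner maximization simply places all mass on the worst-off worker.

```python
import numpy as np

# Local losses f_j(w) for N = 3 workers, evaluated at some fixed model w.
local_losses = np.array([0.2, 0.3, 1.5])

# Standard objective, Eq. (1): uniform average over workers.
avg_loss = local_losses.mean()                 # 0.667

# DRO objective, Eq. (3), with P equal to the full simplex Delta_N:
# the inner max over p is attained by putting all mass on the worst worker.
worst_case_loss = local_losses.max()           # 1.5

print(f"average loss    : {avg_loss:.3f}")
print(f"worst-case loss : {worst_case_loss:.3f}")
```

A model with a low average loss can thus still look poor under the robust objective when one worker's data distribution is badly served.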
To this end, we propose ASPIRE-EASE to effectively address the aforementioned challenges. Firstly, different from existing works, the prior distribution is incorporated within the constraints in our formulation, which can not only leverage the prior distribution more effectively but also achieve guaranteed feasibility for any adversarial distribution within the uncertainty set. The prior distribution can be obtained from side information or set to the uniform distribution [41], and it is necessary for constructing the uncertainty (ambiguity) set and obtaining a more robust model [16]. Specifically, we formulate the prior distribution informed distributionally robust optimization (PD-DRO) problem as:
$$\min_{z \in \mathcal{Z}, \{w_j \in \mathcal{W}\}} \ \max_{p \in \mathcal{P}} \ \sum\nolimits_j p_j f_j(w_j) \qquad (4)$$
$$\text{s.t.} \ z = w_j, \ j = 1, \cdots, N,$$
$$\text{var.} \ z, w_1, w_2, \cdots, w_N,$$
where $z \in \mathbb{R}^p$ is the global consensus variable, $w_j \in \mathbb{R}^p$ is the local variable (local model parameter) of the $j$th worker, and $\mathcal{Z} \subseteq \mathbb{R}^p$ is a nonempty closed convex set. $\mathcal{P} \subseteq \mathbb{R}^N_+$ is the uncertainty (ambiguity) set of the adversarial distribution $p$, which is set based on the prior distribution. To solve the PD-DRO problem in an asynchronous distributed manner, we first propose Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE), which employs simple gradient projection steps for the update of the primal and dual variables at every iteration and is therefore computationally efficient. Next, the itErative Active SEt method (EASE) is employed to replace the traditional cutting plane method to improve the computational efficiency and speed up the convergence. We further provide a convergence guarantee for the proposed algorithm. Furthermore, a new uncertainty set, i.e., the constrained D-norm (CD-norm) uncertainty set, is proposed in this paper. Its advantages include: 1) it can flexibly control the degree of robustness; 2) the resulting subproblem is computationally simple; 3) it can effectively leverage the prior distribution and flexibly set the bounds for every $p_j$.
Contributions. Our contributions can be summarized as follows:
1. We formulate a PD-DRO problem with the CD-norm uncertainty set. PD-DRO incorporates the prior distribution as constraints, which can leverage the prior distribution more effectively and guarantee robustness. In addition, the CD-norm is developed to model the ambiguity set around the prior distribution, and it provides a flexible way to control the trade-off between model robustness and performance.
2. We develop a single-loop asynchronous algorithm, namely ASPIRE-EASE, to optimize PD-DRO in an asynchronous distributed manner. ASPIRE employs simple gradient projection steps to update the variables at every iteration, which is computationally efficient, and EASE is proposed to replace the cutting plane method to enhance the computational efficiency and speed up the convergence. We demonstrate that even if the objectives $f_j(w_j), \forall j$, are non-convex, the proposed algorithm is guaranteed to converge. We also theoretically derive the iteration complexity of ASPIRE-EASE.
3. Extensive empirical studies on four different real-world datasets demonstrate the superior performance of the proposed algorithm. It is seen that ASPIRE-EASE can not only ensure the model's robustness against data heterogeneity but also mitigate malicious attacks.
2 Preliminaries
2.1 Distributionally Robust Optimization
Optimization problems often contain uncertain parameters. A small perturbation of the parameters could render the optimal solution of the original optimization problem infeasible or completely meaningless [5]. Distributionally robust optimization (DRO) [28, 17, 7] assumes that the probability distributions of the uncertain parameters are unknown but lie within an ambiguity (uncertainty) set, and aims to find a decision that minimizes the worst-case expected cost over the ambiguity set. Its general form can be expressed as
$$\min_{x \in \mathcal{X}} \ \max_{P \in \mathcal{P}} \ \mathbb{E}_P[r(x, \xi)], \qquad (5)$$
where $x \in \mathcal{X}$ represents the decision variable and $\mathcal{P}$ is the ambiguity set of probability distributions $P$ of the uncertain parameters $\xi$. Existing methods for solving DRO can be broadly grouped into two widely-used categories [42]: 1) Dual methods [15, 50, 18] reformulate the primal DRO problems as deterministic optimization problems through duality theory; for example, Ben-Tal et al. [2] reformulate the robust linear optimization (RLO) problem with an ellipsoidal uncertainty set as a second-order cone optimization problem (SOCP). 2) Cutting plane methods [34, 6] (also called adversarial approaches [21]) repeatedly solve an approximate problem with a finite number of constraints of the primal DRO problem, and subsequently check whether new constraints are needed to refine the feasible set. Recently, several new methods [41, 29, 23] have been developed to solve DRO, which need to solve the inner maximization problem at every iteration.
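As a toy numeric illustration of Eq. (5) (not from the paper), consider a fixed decision $x$ with three discrete scenarios and an ambiguity set given explicitly as a finite list of candidate distributions; the worst-case expected cost is then just the largest of the candidate expectations. In practice the ambiguity set is usually continuous and the inner maximization is handled by duality or cutting planes, as described next.

```python
import numpy as np

# Costs r(x, xi) of a fixed decision x under 3 discrete scenarios xi.
costs = np.array([1.0, 2.0, 5.0])

# A (deliberately tiny) ambiguity set: three candidate distributions over the scenarios.
ambiguity_set = np.array([
    [0.6, 0.3, 0.1],   # nominal distribution
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
])

expected_costs = ambiguity_set @ costs   # E_P[r(x, xi)] for each candidate P
worst_case = expected_costs.max()        # inner max of Eq. (5)
print(expected_costs, worst_case)        # [1.7 2.2 2.9] 2.9
```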
2.2 Cutting Plane Method for PD-DRO
In this section, we introduce the cutting plane method for PD-DRO in Eq. (4). We first reformulate PD-DRO by introducing an additional variable $h \in \mathcal{H}$ ($\mathcal{H} \subseteq \mathbb{R}^1$ is a nonempty closed convex set) and a protection function $g(\{w_j\})$ [55]. Introducing the additional variable $h$ is an epigraph reformulation [3, 56]. In this case, Eq. (4) can be reformulated into a form with uncertainty in the constraints:
$$\min_{z \in \mathcal{Z}, \{w_j \in \mathcal{W}\}, h \in \mathcal{H}} \ h$$
$$\text{s.t.} \ \sum\nolimits_j \bar{p} f_j(w_j) + g(\{w_j\}) - h \le 0, \qquad (6)$$
$$z = w_j, \ j = 1, \cdots, N,$$
$$\text{var.} \ z, w_1, w_2, \cdots, w_N, h,$$
where $\bar{p}$ is the nominal value of the adversarial distribution for every worker and $g(\{w_j\}) = \max_{p \in \mathcal{P}} \sum_j (p_j - \bar{p}) f_j(w_j)$ is the protection function. Eq. (6) is a semi-infinite program (SIP) which contains infinitely many constraints and cannot be solved directly [42]. Denoting the set of cutting plane parameters in the $(t+1)$th iteration as $A_t \subseteq \mathbb{R}^N$, the following function is used to approximate $g(\{w_j\})$:
$$\hat{g}(\{w_j\}) = \max_{a_l \in A_t} a_l^\top f(w) = \max_{a_l \in A_t} \sum\nolimits_j a_{l,j} f_j(w_j), \qquad (7)$$
where $a_l = [a_{l,1}, \cdots, a_{l,N}] \in \mathbb{R}^N$ denotes the parameters of the $l$th cutting plane in $A_t$ and $f(w) = [f_1(w_1), \cdots, f_N(w_N)] \in \mathbb{R}^N$. Substituting the protection function $g(\{w_j\})$ with $\hat{g}(\{w_j\})$, we obtain the following approximate problem:
$$\min_{z \in \mathcal{Z}, \{w_j \in \mathcal{W}\}, h \in \mathcal{H}} \ h$$
$$\text{s.t.} \ \sum\nolimits_j (\bar{p} + a_{l,j}) f_j(w_j) - h \le 0, \ \forall a_l \in A_t, \qquad (8)$$
$$z = w_j, \ j = 1, \cdots, N,$$
$$\text{var.} \ z, w_1, w_2, \cdots, w_N, h.$$
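To see how Eqs. (6)-(8) are used algorithmically, the following is a minimal sketch (with hypothetical helper names and a simplified, synchronous setting) of the classical cutting plane loop: alternately solve the approximate problem (8) over the current cut set $A_t$, then generate a new cut from the inner maximization that defines the protection function, stopping when no constraint of Eq. (6) is significantly violated.

```python
import numpy as np

def cutting_plane_loop(solve_approx, worst_case_weights, f_eval, p_bar, N,
                       max_iters=50, tol=1e-4):
    """Generic cutting-plane outer loop for the semi-infinite program in Eq. (6).

    solve_approx(A)            : returns (w_list, h) minimizing h subject to the cuts in A, as in Eq. (8).
    worst_case_weights(losses) : returns argmax_{p in P} sum_j (p_j - p_bar) * losses[j].
    f_eval(w_list)             : returns the loss vector [f_1(w_1), ..., f_N(w_N)].
    The three callables are placeholders for problem-specific solvers.
    """
    A = [np.zeros(N)]                       # start from the trivial cut a = 0
    for _ in range(max_iters):
        w_list, h = solve_approx(A)         # outer minimization over the current cuts
        losses = f_eval(w_list)
        p_star = worst_case_weights(losses) # inner maximization over the uncertainty set
        a_new = p_star - p_bar              # parameters of the candidate cut
        violation = np.dot(p_bar + a_new, losses) - h
        if violation <= tol:                # constraint in Eq. (6) (approximately) satisfied
            break
        A.append(a_new)                     # refine the feasible set with the new cut
    return w_list, h, A
```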
3 ASPIRE
Distributed optimization is an attractive approach for large-scale learning tasks [54, 8] since it does not require data aggregation, which protects data privacy while also reducing bandwidth requirements [45]. When neural network models are used (i.e., $f_j(w_j), \forall j$, are non-convex functions), solving the problem in Eq. (8) in a distributed manner faces two challenges: 1) computing the optimal solution of a non-convex subproblem requires a large number of iterations and is therefore highly computationally intensive, if not impossible, so the traditional Alternating Direction Method of Multipliers (ADMM) is ineffective; 2) the communication delays of workers may differ significantly [11], so asynchronous algorithms are strongly preferred.
To this end, we propose the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) method. The advantages of the proposed algorithm include: 1) ASPIRE uses simple gradient projection steps to update the variables in each iteration, and therefore it is computationally more efficient than the traditional ADMM method, which seeks the optimal solutions of non-convex (for $w_j, \forall j$) and convex (for $z$ and $h$) optimization subproblems in every iteration; 2) the proposed asynchronous algorithm does not need strict synchronization among different workers, so ASPIRE remains resilient against communication delays and potential hardware failures of workers. Details of the algorithm are given below. Firstly, we define the node responsible for updating the global variable $z$ as the master, and the node responsible for updating the local variable $w_j$ as worker $j$. In each iteration, the master updates its variables once it receives updates from at least $S$ workers, i.e., the active workers, where $1 \le S \le N$. $Q^{t+1}$ denotes the index subset of workers from which the master receives updates during the $(t+1)$th iteration. We also assume the master receives updated variables from every worker at least once every $\tau$ iterations. The augmented Lagrangian function of Eq. (8) can be written as:
$$\mathcal{L}_p = h + \sum\nolimits_l \lambda_l \Big( \sum\nolimits_j (\bar{p} + a_{l,j}) f_j(w_j) - h \Big) + \sum\nolimits_j \phi_j^\top (z - w_j) + \sum\nolimits_j \frac{\kappa_1}{2} \|z - w_j\|^2, \qquad (9)$$
where $\mathcal{L}_p = \mathcal{L}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\})$, and $\lambda_l \in \Lambda, \forall l$, and $\phi_j \in \Phi, \forall j$, represent the dual variables of the inequality and equality constraints in Eq. (8), respectively. $\Lambda \subseteq \mathbb{R}^1$ and $\Phi \subseteq \mathbb{R}^p$ are nonempty closed convex sets, and the constant $\kappa_1 > 0$ is a penalty parameter. Note that Eq. (9) does not include a second-order penalty term for the inequality constraints since such a term would invalidate the distributed optimization. Following [52], the regularized version of Eq. (9) is employed to update all variables as follows,
$$\widetilde{\mathcal{L}}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\}) = \mathcal{L}_p - \sum\nolimits_l \frac{c_1^t}{2} \|\lambda_l\|^2 - \sum\nolimits_j \frac{c_2^t}{2} \|\phi_j\|^2, \qquad (10)$$
where $c_1^t$ and $c_2^t$ denote the regularization terms in the $(t+1)$th iteration. To avoid enumerating the whole dataset, a mini-batch loss can be used: a batch of instances of size $m$ is randomly sampled from each worker during each iteration, and the loss function over these instances on the $j$th worker is given by $\hat{f}_j(w_j) = \sum_{i=1}^{m} \frac{1}{m} L_j(x_j^i, y_j^i; w_j)$. It is evident that $\mathbb{E}[\hat{f}_j(w_j)] = f_j(w_j)$ and $\mathbb{E}[\nabla \hat{f}_j(w_j)] = \nabla f_j(w_j)$. In the $(t+1)$th master iteration, the proposed algorithm proceeds as follows.
1) Active workers update the local variables $w_j$ as follows,
$$w_j^{t+1} = \begin{cases} P_{\mathcal{W}}\Big(w_j^{\tilde{t}_j} - \alpha_w^{\tilde{t}_j} \nabla_{w_j} \widetilde{\mathcal{L}}_p\big(\{w_j^{\tilde{t}_j}\}, z^{\tilde{t}_j}, h^{\tilde{t}_j}, \{\lambda_l^{\tilde{t}_j}\}, \{\phi_j^{\tilde{t}_j}\}\big)\Big), & j \in Q^{t+1}, \\ w_j^t, & j \notin Q^{t+1}, \end{cases} \qquad (11)$$
where $\tilde{t}_j$ is the last iteration during which worker $j$ was active. It is seen that $\forall j \in Q^{t+1}$, $w_j^t = w_j^{\tilde{t}_j}$ and $\phi_j^t = \phi_j^{\tilde{t}_j}$. $\alpha_w^{\tilde{t}_j}$ represents the step-size, and we let $\alpha_w^t = \eta_w^t$ when $t < T_1$ and $\alpha_w^t = \eta_w$ when $t \ge T_1$, where $\eta_w^t$ and the constant $\eta_w$ will be introduced below. $P_{\mathcal{W}}$ represents the projection onto the closed convex set $\mathcal{W}$, and we set $\mathcal{W} = \{w_j : \|w_j\| \le \alpha_1\}$, where $\alpha_1$ is a positive constant. The active workers ($j \in Q^{t+1}$) then transmit their local model parameters $w_j^{t+1}$ and losses $f_j(w_j)$ to the master.
2) After receiving the updates from the active workers, the master updates the global consensus variable $z$, the additional variable $h$, and the dual variables $\lambda_l$ as follows,
$$z^{t+1} = P_{\mathcal{Z}}\Big(z^t - \eta_z^t \nabla_z \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\}\big)\Big), \qquad (12)$$
$$h^{t+1} = P_{\mathcal{H}}\Big(h^t - \eta_h^t \nabla_h \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^{t+1}, h^t, \{\lambda_l^t\}, \{\phi_j^t\}\big)\Big), \qquad (13)$$
$$\lambda_l^{t+1} = P_{\Lambda}\Big(\lambda_l^t + \rho_1 \nabla_{\lambda_l} \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^{t+1}, h^{t+1}, \{\lambda_l^t\}, \{\phi_j^t\}\big)\Big), \ l = 1, \cdots, |A_t|, \qquad (14)$$
where $\eta_z^t$, $\eta_h^t$ and $\rho_1$ represent the step-sizes. $P_{\mathcal{Z}}$, $P_{\mathcal{H}}$ and $P_{\Lambda}$ respectively represent the projections onto the closed convex sets $\mathcal{Z}$, $\mathcal{H}$ and $\Lambda$. We set $\mathcal{Z} = \{z : \|z\| \le \alpha_1\}$, $\mathcal{H} = \{h : 0 \le h \le \alpha_2\}$ and $\Lambda = \{\lambda_l : 0 \le \lambda_l \le \alpha_3\}$, where $\alpha_2$ and $\alpha_3$ are positive constants. $|A_t|$ denotes the number of cutting planes. The master then broadcasts $z^{t+1}$, $h^{t+1}$, and $\{\lambda_l^{t+1}\}$ to the active workers.
3) Active workers update the local dual variables $\phi_j$ as follows,
$$\phi_j^{t+1} = \begin{cases} P_{\Phi}\Big(\phi_j^t + \rho_2 \nabla_{\phi_j} \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^{t+1}, h^{t+1}, \{\lambda_l^{t+1}\}, \{\phi_j^t\}\big)\Big), & j \in Q^{t+1}, \\ \phi_j^t, & j \notin Q^{t+1}, \end{cases} \qquad (15)$$
where $\rho_2$ represents the step-size and $P_{\Phi}$ represents the projection onto the closed convex set $\Phi$; we set $\Phi = \{\phi_j : \|\phi_j\| \le \alpha_4\}$, where $\alpha_4$ is a positive constant. The master can also obtain $\{\phi_j^{t+1}\}$ according to Eq. (15). It is seen that the projection operation in each step is computationally simple since the closed convex sets have simple structures [4].
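To make the update rules in Eqs. (11)-(15) concrete, the following is a minimal single-process sketch of one ASPIRE iteration under simplifying assumptions that are not the paper's settings: every worker is treated as active (so the staleness bookkeeping with $\tilde{t}_j$ disappears), the local losses are toy quadratics $f_j(w) = \frac{1}{2}\|w - c_j\|^2$, and all step sizes, bounds, and the cut set are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_dim = 4, 3
centers = rng.normal(size=(N, p_dim))                  # defines the toy local losses
p_bar = 1.0 / N                                        # nominal adversarial weight
A_t = [np.zeros(N)]                                    # current cutting planes (trivial cut only)
kappa1, eta_w, eta_z, eta_h, rho1, rho2 = 1.0, 0.05, 0.05, 0.05, 0.05, 0.05
c1, c2 = 1e-3, 1e-3                                    # regularization terms of Eq. (10)
alpha1, alpha2, alpha3, alpha4 = 10.0, 10.0, 10.0, 10.0

w = rng.normal(size=(N, p_dim))                        # local variables w_j
z = np.zeros(p_dim)                                    # global consensus variable
h = 0.0                                                # epigraph variable
lam = np.zeros(len(A_t))                               # dual variables lambda_l
phi = np.zeros((N, p_dim))                             # dual variables phi_j

def f(j, wj):      return 0.5 * np.sum((wj - centers[j]) ** 2)
def grad_f(j, wj): return wj - centers[j]
def proj_ball(x, r):
    n = np.linalg.norm(x)
    return x if n <= r else x * (r / n)

# 1) Workers: projected gradient step on w_j, Eq. (11).
for j in range(N):
    g = sum(lam[l] * (p_bar + A_t[l][j]) * grad_f(j, w[j]) for l in range(len(A_t)))
    g += -phi[j] - kappa1 * (z - w[j])                 # consensus terms of Eq. (9)
    w[j] = proj_ball(w[j] - eta_w * g, alpha1)

losses = np.array([f(j, w[j]) for j in range(N)])      # workers report f_j(w_j)

# 2) Master: update z, h and lambda_l, Eqs. (12)-(14).
z = proj_ball(z - eta_z * (phi.sum(axis=0) + kappa1 * (N * z - w.sum(axis=0))), alpha1)
h = float(np.clip(h - eta_h * (1.0 - lam.sum()), 0.0, alpha2))
for l in range(len(A_t)):
    slack = np.dot(p_bar + A_t[l], losses) - h
    lam[l] = np.clip(lam[l] + rho1 * (slack - c1 * lam[l]), 0.0, alpha3)

# 3) Workers: dual ascent on phi_j, Eq. (15).
for j in range(N):
    phi[j] = proj_ball(phi[j] + rho2 * ((z - w[j]) - c2 * phi[j]), alpha4)
```

In the actual algorithm the same three steps run asynchronously: only the workers in $Q^{t+1}$ execute steps 1) and 3), while the others keep their previous iterates.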
4 Iterative Active Set Method
Cutting plane methods may give rise to numerous linear constraints and a large amount of extra message passing [55]. Moreover, more iterations are required to obtain the $\varepsilon$-stationary point when the size of the set containing the cutting planes increases (which corresponds to a larger $M$), as can be seen in Theorem 1. To improve the computational efficiency and speed up the convergence, we consider removing the inactive cutting planes. The proposed itErative Active SEt method (EASE) can be divided into two steps: during the first $T_1$ iterations, 1) solve the cutting plane generation subproblem to generate cutting planes, and 2) remove the inactive cutting planes every $k$ iterations, where $k > 0$ is a pre-set constant and can be controlled flexibly.
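The pruning step of EASE can be sketched as follows. This is only an illustration: it assumes that a cutting plane is treated as inactive when its dual variable $\lambda_l$ is (numerically) zero, which is a common active-set criterion but should not be read as the paper's exact rule, and the helper name and surrounding loop variables are hypothetical.

```python
import numpy as np

def prune_inactive_cuts(A, lam, tol=1e-8):
    """Drop cutting planes whose dual variables are (numerically) zero."""
    keep = [l for l in range(len(A)) if lam[l] > tol]
    if not keep:                         # always retain at least one cut
        keep = [int(np.argmax(lam))]
    return [A[l] for l in keep], lam[keep]

# Hypothetical usage inside the outer loop:
# if t < T1 and (t + 1) % k == 0:
#     A_t, lam = prune_inactive_cuts(A_t, lam)
```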
The cutting planes are generated according to the uncertainty set. For example, if we employ an ellipsoidal uncertainty set, the cutting plane is generated via solving an SOCP. In this paper, we propose the CD-norm uncertainty set, which can be expressed as follows,
$$\mathcal{P} = \Big\{ p : -\tilde{p}_j \le p_j - q_j \le \tilde{p}_j, \ \sum\nolimits_j \Big| \frac{p_j - q_j}{\tilde{p}_j} \Big| \le \Gamma, \ \mathbf{1}^\top p = 1 \Big\}, \qquad (16)$$
where $\Gamma \in \mathbb{R}^1$ can flexibly control the level of robustness, $q = [q_1, \cdots, q_N] \in \mathbb{R}^N$ represents the prior distribution, and $-\tilde{p}_j$ and $\tilde{p}_j$ ($\tilde{p}_j \ge 0$) represent the lower and upper bounds for $p_j - q_j$, respectively. The settings of $q$ and $\tilde{p}_j, \forall j$, are based on prior knowledge. The D-norm is a classical uncertainty set (also called the budget uncertainty set) [5]. We call Eq. (16) the CD-norm (constrained D-norm) uncertainty set since $p$ is a probability vector, so all entries of this vector are non-negative and add up to exactly one, i.e., $\mathbf{1}^\top p = 1$. Due to the special structure of the CD-norm, the cutting plane generation subproblem is easy to solve, and the level of robustness in terms of the outage probability, i.e., the probabilistic bound on constraint violations, can be flexibly adjusted via the single parameter $\Gamma$. We note that the $l_1$-norm (or twice total variation distance) uncertainty set is closely related to the CD-norm uncertainty set. Nevertheless, there are two differences: 1) the CD-norm uncertainty set can be regarded as a weighted $l_1$-norm ball with additional constraints; 2) the CD-norm uncertainty set can flexibly set the lower and upper bounds for every $p_j$ (i.e., $q_j - \tilde{p}_j \le p_j \le q_j + \tilde{p}_j$), while $0 \le p_j \le 1, \forall j$, in the $l_1$-norm uncertainty set. Based on the CD-norm uncertainty set, the cutting plane can be derived as follows,
1) Solve the following problem,
$$p^{t+1} = \arg\max_{p_1, \cdots, p_N} \ \sum\nolimits_j (p_j - \bar{p}) f_j(w_j)$$
$$\text{s.t.} \ \sum\nolimits_j \Big| \frac{p_j - q_j}{\tilde{p}_j} \Big| \le \Gamma, \ \ -\tilde{p}_j \le p_j - q_j \le \tilde{p}_j, \ \forall j, \ \ \sum\nolimits_j p_j = 1, \qquad (17)$$
$$\text{var.} \ p_1, \cdots, p_N,$$
where $p^{t+1} = [p_1^{t+1}, \cdots, p_N^{t+1}] \in \mathbb{R}^N$. Let $\tilde{a}^{t+1} = p^{t+1} - \bar{\mathbf{p}}$, where $\bar{\mathbf{p}} = [\bar{p}, \cdots, \bar{p}] \in \mathbb{R}^N$. This first step aims to obtain $\tilde{a}^{t+1}$ by solving the problem in Eq. (17). This problem can be effectively solved by combining merge sort [13] (for sorting $\tilde{p}_j f_j(w_j), j = 1, \cdots, N$) with a few basic arithmetic operations (for obtaining $p_j^{t+1}, j = 1, \cdots, N$). Since $N$ is relatively large in
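For illustration only, the subproblem in Eq. (17) can also be written as a small linear program by introducing auxiliary variables for the absolute values. The sketch below (hypothetical helper name and arguments) solves it with `scipy.optimize.linprog` instead of the sorting-based procedure described above; it is meant only to make the structure of the CD-norm constraints concrete.

```python
import numpy as np
from scipy.optimize import linprog

def cd_norm_cut(losses, q, p_tilde, gamma, p_bar):
    """Maximize sum_j (p_j - p_bar) f_j over the CD-norm set of Eq. (16), posed as an LP.

    losses  : f_j(w_j) for each worker.
    q       : prior distribution.
    p_tilde : per-worker deviation bounds p~_j > 0.
    gamma   : budget Gamma.
    p_bar   : scalar nominal adversarial weight.
    Returns the maximizer p and the new cut parameters a = p - p_bar.
    """
    N = len(losses)
    # Decision vector x = [p_1..p_N, u_1..u_N], with u_j >= |p_j - q_j| / p~_j.
    # Maximize sum_j p_j f_j (the constant -p_bar term does not affect the maximizer).
    c = np.concatenate([-np.asarray(losses, dtype=float), np.zeros(N)])
    A_ub, b_ub = [], []
    for j in range(N):
        row = np.zeros(2 * N); row[j] = 1.0 / p_tilde[j]; row[N + j] = -1.0
        A_ub.append(row); b_ub.append(q[j] / p_tilde[j])       #  (p_j - q_j)/p~_j <= u_j
        row = np.zeros(2 * N); row[j] = -1.0 / p_tilde[j]; row[N + j] = -1.0
        A_ub.append(row); b_ub.append(-q[j] / p_tilde[j])      # -(p_j - q_j)/p~_j <= u_j
    A_ub.append(np.concatenate([np.zeros(N), np.ones(N)])); b_ub.append(gamma)  # sum_j u_j <= Gamma
    A_eq = [np.concatenate([np.ones(N), np.zeros(N)])]                          # sum_j p_j = 1
    bounds = [(q[j] - p_tilde[j], q[j] + p_tilde[j]) for j in range(N)] + [(0, None)] * N
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array([1.0]), bounds=bounds)
    p_star = res.x[:N]
    return p_star, p_star - p_bar

# Example: 3 workers, uniform prior, per-worker deviation bound 0.2, Gamma = 1.
losses = np.array([0.2, 0.3, 1.5])
q = np.full(3, 1 / 3)
p_star, a_new = cd_norm_cut(losses, q, p_tilde=np.full(3, 0.2), gamma=1.0, p_bar=1 / 3)
print(p_star, a_new)   # mass shifts toward the worst-off worker, within the bounds and the budget
```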