Distributed Distributionally Robust Optimization
with Non-Convex Objectives
Yang Jiao
Tongji University
yangjiao@tongji.edu.cn
Kai Yang
Tongji University
kaiyang@tongji.edu.cn
Dongjin Song
University of Connecticut
dongjin.song@uconn.edu
Abstract
Distributionally Robust Optimization (DRO), which aims to find an optimal decision that minimizes the worst-case cost over an ambiguity set of probability distributions, has been widely applied in diverse applications, e.g., network behavior analysis, risk management, etc. However, existing DRO techniques face three key challenges: 1) how to deal with asynchronous updating in a distributed environment; 2) how to leverage the prior distribution effectively; 3) how to properly adjust the degree of robustness according to different scenarios. To this end, we propose an asynchronous distributed algorithm, named the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm with the itErative Active SEt method (EASE), to tackle the distributed distributionally robust optimization (DDRO) problem. Furthermore, a new uncertainty set, i.e., the constrained D-norm uncertainty set, is developed to effectively leverage the prior distribution and flexibly control the degree of robustness. Finally, our theoretical analysis elucidates that the proposed algorithm is guaranteed to converge, and the iteration complexity is also analyzed. Extensive empirical studies on real-world datasets demonstrate that the proposed method can not only achieve fast convergence and remain robust against data heterogeneity as well as malicious attacks, but also trade off robustness with performance.
1 Introduction
The past decade has witnessed the proliferation of smartphones and Internet of Things (IoT) devices, which generate a plethora of data every day. Centralized machine learning requires gathering the data at a particular server to train models, which incurs high communication overhead [46] and suffers from privacy risks [43]. As a remedy, distributed machine learning methods have been proposed. Considering a distributed system composed of $N$ workers (devices), we denote the datasets of these workers as $\{D_1, \cdots, D_N\}$. For the $j$th ($1 \le j \le N$) worker, the labeled dataset is given as $D_j = \{x_j^i, y_j^i\}$, where $x_j^i \in \mathbb{R}^d$ and $y_j^i \in \{1, \cdots, c\}$ denote the $i$th data sample and the corresponding label, respectively. The distributed learning task can be formulated as the following optimization problem,
$$\min_{w \in \mathcal{W}} F(w) \quad \text{with} \quad F(w) := \sum\nolimits_j f_j(w), \qquad (1)$$
where $w \in \mathbb{R}^p$ is the model parameter to be learned, $\mathcal{W} \subseteq \mathbb{R}^p$ is a nonempty closed convex set, and $f_j(\cdot)$ is the empirical risk over the $j$th worker involving only the local data:
$$f_j(w) = \sum_{i:\, x_j^i \in D_j} \frac{1}{|D_j|} L_j(x_j^i, y_j^i; w), \qquad (2)$$
Corresponding author.
where $L_j$ is the local objective function over the $j$th worker. The problem in Eq. (1) arises in numerous areas, such as distributed signal processing [19], multi-agent optimization [36], etc. However, this problem does not consider the data heterogeneity [57, 40, 39, 30] among different workers (i.e., the data distributions of different workers could be substantially different from each other [44]). Indeed, it has been shown that traditional federated approaches, such as FedAvg [33], built for independent and identically distributed (IID) data may perform poorly when applied to non-IID data [27]. This issue can be mitigated by learning a robust model that aims to achieve uniformly good performance over all workers, i.e., by solving the following distributionally robust optimization (DRO) problem in a distributed manner:
$$\min_{w \in \mathcal{W}} \max_{p \in \mathcal{P}} F(w, p) := \sum\nolimits_j p_j f_j(w), \qquad (3)$$
where $p = [p_1, \cdots, p_N] \in \mathbb{R}^N$ is the adversarial distribution over the $N$ workers, and the $j$th entry of this vector, i.e., $p_j$, represents the adversarial distribution value for the $j$th worker. $\Delta_N = \{p \in \mathbb{R}^N_+ : \mathbf{1}^\top p = 1\}$ denotes the probability simplex and $\mathcal{P}$ is a subset of $\Delta_N$. Agnostic federated learning (AFL) [35] first introduces the distributionally robust (agnostic) loss in federated learning and provides the convergence rate for (strongly) convex functions. However, AFL does not discuss the setting of $\mathcal{P}$. DRFA-Prox [16] considers $\mathcal{P} = \Delta_N$ and imposes a regularizer on the adversarial distribution to leverage the prior distribution. Nevertheless, three key challenges have not yet been addressed by prior works. First, is it possible to construct an uncertainty framework that can not only flexibly maintain the trade-off between model robustness and performance but also effectively leverage the prior distribution? Second, how can one design asynchronous algorithms with guaranteed convergence? Compared to synchronous algorithms, the master in asynchronous algorithms can update its parameters after receiving updates from only a small subset of workers [58, 10]. Asynchronous algorithms are particularly desirable in practice since they can relax strict data dependencies and ensure convergence even in the presence of device failures [58]. Finally, is it possible to flexibly adjust the degree of robustness? Moreover, it is necessary to provide a convergence guarantee when the objectives (i.e., $f_j(w_j), \forall j$) are non-convex.
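To make the contrast concrete, below is a minimal toy sketch (not from the paper) comparing the uniform-average objective in Eq. (1) with the distributionally robust objective in Eq. (3) for the special case $\mathcal{P} = \Delta_N$, in which the inner maximization simply places all mass on the worst-off worker.

```python
import numpy as np

# Local losses f_j(w) for N = 3 workers, evaluated at some fixed model w.
local_losses = np.array([0.2, 0.3, 1.5])

# Standard objective, Eq. (1): uniform average over workers.
avg_loss = local_losses.mean()                 # 0.667

# DRO objective, Eq. (3), with P equal to the full simplex Delta_N:
# the inner max over p is attained by putting all mass on the worst worker.
worst_case_loss = local_losses.max()           # 1.5

print(f"average loss    : {avg_loss:.3f}")
print(f"worst-case loss : {worst_case_loss:.3f}")
```

A model with a low average loss can thus still look poor under the robust objective when one worker's data distribution is badly served.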
To this end, we propose ASPIRE-EASE to effectively address the aforementioned challenges. Firstly, different from existing works, the prior distribution is incorporated within the constraints in our formulation, which can not only leverage the prior distribution more effectively but also achieve guaranteed feasibility for any adversarial distribution within the uncertainty set. The prior distribution can be obtained from side information or set to the uniform distribution [41], and it is necessary for constructing the uncertainty (ambiguity) set and obtaining a more robust model [16]. Specifically, we formulate the prior distribution informed distributionally robust optimization (PD-DRO) problem as:
$$\min_{z \in \mathcal{Z}, \{w_j \in \mathcal{W}\}} \ \max_{p \in \mathcal{P}} \ \sum\nolimits_j p_j f_j(w_j) \qquad (4)$$
$$\text{s.t.} \ z = w_j, \ j = 1, \cdots, N,$$
$$\text{var.} \ z, w_1, w_2, \cdots, w_N,$$
where $z \in \mathbb{R}^p$ is the global consensus variable, $w_j \in \mathbb{R}^p$ is the local variable (local model parameter) of the $j$th worker, and $\mathcal{Z} \subseteq \mathbb{R}^p$ is a nonempty closed convex set. $\mathcal{P} \subseteq \mathbb{R}^N_+$ is the uncertainty (ambiguity) set of the adversarial distribution $p$, which is set based on the prior distribution. To solve the PD-DRO problem in an asynchronous distributed manner, we first propose Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE), which employs simple gradient projection steps for the update of the primal and dual variables at every iteration and is therefore computationally efficient. Next, the itErative Active SEt method (EASE) is employed to replace the traditional cutting plane method to improve the computational efficiency and speed up the convergence. We further provide a convergence guarantee for the proposed algorithm. Furthermore, a new uncertainty set, i.e., the constrained D-norm (CD-norm) uncertainty set, is proposed in this paper. Its advantages include: 1) it can flexibly control the degree of robustness; 2) the resulting subproblem is computationally simple; 3) it can effectively leverage the prior distribution and flexibly set the bounds for every $p_j$.
Contributions. Our contributions can be summarized as follows:
1. We formulate a PD-DRO problem with the CD-norm uncertainty set. PD-DRO incorporates the prior distribution as constraints, which can leverage the prior distribution more effectively and guarantee robustness. In addition, the CD-norm is developed to model the ambiguity set around the prior distribution, and it provides a flexible way to control the trade-off between model robustness and performance.
2. We develop a single-loop asynchronous algorithm, namely ASPIRE-EASE, to optimize PD-DRO in an asynchronous distributed manner. ASPIRE employs simple gradient projection steps to update the variables at every iteration, which is computationally efficient, and EASE is proposed to replace the cutting plane method to enhance the computational efficiency and speed up the convergence. We demonstrate that even if the objectives $f_j(w_j), \forall j$, are non-convex, the proposed algorithm is guaranteed to converge. We also theoretically derive the iteration complexity of ASPIRE-EASE.
3. Extensive empirical studies on four different real-world datasets demonstrate the superior performance of the proposed algorithm. It is seen that ASPIRE-EASE can not only ensure the model's robustness against data heterogeneity but also mitigate malicious attacks.
2 Preliminaries
2.1 Distributionally Robust Optimization
Optimization problems often contain uncertain parameters. A small perturbation of the parameters could render the optimal solution of the original optimization problem infeasible or completely meaningless [5]. Distributionally robust optimization (DRO) [28, 17, 7] assumes that the probability distributions of the uncertain parameters are unknown but lie within an ambiguity (uncertainty) set, and aims to find a decision that minimizes the worst-case expected cost over the ambiguity set. Its general form can be expressed as
$$\min_{x \in \mathcal{X}} \ \max_{P \in \mathcal{P}} \ \mathbb{E}_P[r(x, \xi)], \qquad (5)$$
where $x \in \mathcal{X}$ represents the decision variable and $\mathcal{P}$ is the ambiguity set of probability distributions $P$ of the uncertain parameters $\xi$. Existing methods for solving DRO can be broadly grouped into two widely-used categories [42]: 1) Dual methods [15, 50, 18] reformulate the primal DRO problems as deterministic optimization problems through duality theory; for example, Ben-Tal et al. [2] reformulate the robust linear optimization (RLO) problem with an ellipsoidal uncertainty set as a second-order cone optimization problem (SOCP). 2) Cutting plane methods [34, 6] (also called adversarial approaches [21]) repeatedly solve an approximate problem with a finite number of constraints of the primal DRO problem, and subsequently check whether new constraints are needed to refine the feasible set. Recently, several new methods [41, 29, 23] have been developed to solve DRO, which need to solve the inner maximization problem at every iteration.
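As a toy numeric illustration of Eq. (5) (not from the paper), consider a fixed decision $x$ with three discrete scenarios and an ambiguity set given explicitly as a finite list of candidate distributions; the worst-case expected cost is then just the largest of the candidate expectations. In practice the ambiguity set is usually continuous and the inner maximization is handled by duality or cutting planes, as described next.

```python
import numpy as np

# Costs r(x, xi) of a fixed decision x under 3 discrete scenarios xi.
costs = np.array([1.0, 2.0, 5.0])

# A (deliberately tiny) ambiguity set: three candidate distributions over the scenarios.
ambiguity_set = np.array([
    [0.6, 0.3, 0.1],   # nominal distribution
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
])

expected_costs = ambiguity_set @ costs   # E_P[r(x, xi)] for each candidate P
worst_case = expected_costs.max()        # inner max of Eq. (5)
print(expected_costs, worst_case)        # [1.7 2.2 2.9] 2.9
```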
2.2 Cutting Plane Method for PD-DRO
In this section, we introduce the cutting plane method for PD-DRO in Eq. (4). We first reformulate PD-DRO by introducing an additional variable $h \in \mathcal{H}$ ($\mathcal{H} \subseteq \mathbb{R}^1$ is a nonempty closed convex set) and a protection function $g(\{w_j\})$ [55]. Introducing the additional variable $h$ is an epigraph reformulation [3, 56]. In this case, Eq. (4) can be reformulated into a form with uncertainty in the constraints:
$$\min_{z \in \mathcal{Z}, \{w_j \in \mathcal{W}\}, h \in \mathcal{H}} \ h$$
$$\text{s.t.} \ \sum\nolimits_j \bar{p} f_j(w_j) + g(\{w_j\}) - h \le 0, \qquad (6)$$
$$z = w_j, \ j = 1, \cdots, N,$$
$$\text{var.} \ z, w_1, w_2, \cdots, w_N, h,$$
where $\bar{p}$ is the nominal value of the adversarial distribution for every worker and $g(\{w_j\}) = \max_{p \in \mathcal{P}} \sum_j (p_j - \bar{p}) f_j(w_j)$ is the protection function. Eq. (6) is a semi-infinite program (SIP) which contains infinitely many constraints and cannot be solved directly [42]. Denoting the set of cutting plane parameters in the $(t+1)$th iteration as $A_t \subseteq \mathbb{R}^N$, the following function is used to approximate $g(\{w_j\})$:
$$\hat{g}(\{w_j\}) = \max_{a_l \in A_t} a_l^\top f(w) = \max_{a_l \in A_t} \sum\nolimits_j a_{l,j} f_j(w_j), \qquad (7)$$
where $a_l = [a_{l,1}, \cdots, a_{l,N}] \in \mathbb{R}^N$ denotes the parameters of the $l$th cutting plane in $A_t$ and $f(w) = [f_1(w_1), \cdots, f_N(w_N)] \in \mathbb{R}^N$. Substituting the protection function $g(\{w_j\})$ with $\hat{g}(\{w_j\})$, we obtain the following approximate problem:
$$\min_{z \in \mathcal{Z}, \{w_j \in \mathcal{W}\}, h \in \mathcal{H}} \ h$$
$$\text{s.t.} \ \sum\nolimits_j (\bar{p} + a_{l,j}) f_j(w_j) - h \le 0, \ \forall a_l \in A_t, \qquad (8)$$
$$z = w_j, \ j = 1, \cdots, N,$$
$$\text{var.} \ z, w_1, w_2, \cdots, w_N, h.$$
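To see how Eqs. (6)-(8) are used algorithmically, the following is a minimal sketch (with hypothetical helper names and a simplified, synchronous setting) of the classical cutting plane loop: alternately solve the approximate problem (8) over the current cut set $A_t$, then generate a new cut from the inner maximization that defines the protection function, stopping when no constraint of Eq. (6) is significantly violated.

```python
import numpy as np

def cutting_plane_loop(solve_approx, worst_case_weights, f_eval, p_bar, N,
                       max_iters=50, tol=1e-4):
    """Generic cutting-plane outer loop for the semi-infinite program in Eq. (6).

    solve_approx(A)            : returns (w_list, h) minimizing h subject to the cuts in A, as in Eq. (8).
    worst_case_weights(losses) : returns argmax_{p in P} sum_j (p_j - p_bar) * losses[j].
    f_eval(w_list)             : returns the loss vector [f_1(w_1), ..., f_N(w_N)].
    The three callables are placeholders for problem-specific solvers.
    """
    A = [np.zeros(N)]                       # start from the trivial cut a = 0
    for _ in range(max_iters):
        w_list, h = solve_approx(A)         # outer minimization over the current cuts
        losses = f_eval(w_list)
        p_star = worst_case_weights(losses) # inner maximization over the uncertainty set
        a_new = p_star - p_bar              # parameters of the candidate cut
        violation = np.dot(p_bar + a_new, losses) - h
        if violation <= tol:                # constraint in Eq. (6) (approximately) satisfied
            break
        A.append(a_new)                     # refine the feasible set with the new cut
    return w_list, h, A
```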
3 ASPIRE
Distributed optimization is an attractive approach for large-scale learning tasks [54, 8] since it does not require data aggregation, which protects data privacy while also reducing bandwidth requirements [45]. When neural network models are used (i.e., $f_j(w_j), \forall j$, are non-convex functions), solving the problem in Eq. (8) in a distributed manner faces two challenges: 1) computing the optimal solution of a non-convex subproblem requires a large number of iterations and is therefore highly computationally intensive, if not impossible, so the traditional Alternating Direction Method of Multipliers (ADMM) is ineffective; 2) the communication delays of workers may differ significantly [11], so asynchronous algorithms are strongly preferred.
To this end, we propose the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) method. The advantages of the proposed algorithm include: 1) ASPIRE uses simple gradient projection steps to update the variables in each iteration, and therefore it is computationally more efficient than the traditional ADMM method, which seeks the optimal solutions of non-convex (for $w_j, \forall j$) and convex (for $z$ and $h$) optimization subproblems in every iteration; 2) the proposed asynchronous algorithm does not need strict synchronization among different workers, so ASPIRE remains resilient against communication delays and potential hardware failures of workers. Details of the algorithm are given below. Firstly, we define the node responsible for updating the global variable $z$ as the master, and the node responsible for updating the local variable $w_j$ as worker $j$. In each iteration, the master updates its variables once it receives updates from at least $S$ workers, i.e., the active workers, where $1 \le S \le N$. $Q^{t+1}$ denotes the index subset of workers from which the master receives updates during the $(t+1)$th iteration. We also assume the master receives updated variables from every worker at least once every $\tau$ iterations. The augmented Lagrangian function of Eq. (8) can be written as:
$$\mathcal{L}_p = h + \sum\nolimits_l \lambda_l \Big( \sum\nolimits_j (\bar{p} + a_{l,j}) f_j(w_j) - h \Big) + \sum\nolimits_j \phi_j^\top (z - w_j) + \sum\nolimits_j \frac{\kappa_1}{2} \|z - w_j\|^2, \qquad (9)$$
where $\mathcal{L}_p = \mathcal{L}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\})$, and $\lambda_l \in \Lambda, \forall l$, and $\phi_j \in \Phi, \forall j$, represent the dual variables of the inequality and equality constraints in Eq. (8), respectively. $\Lambda \subseteq \mathbb{R}^1$ and $\Phi \subseteq \mathbb{R}^p$ are nonempty closed convex sets, and the constant $\kappa_1 > 0$ is a penalty parameter. Note that Eq. (9) does not include a second-order penalty term for the inequality constraints since such a term would invalidate the distributed optimization. Following [52], the regularized version of Eq. (9) is employed to update all variables as follows,
$$\widetilde{\mathcal{L}}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\}) = \mathcal{L}_p - \sum\nolimits_l \frac{c_1^t}{2} \|\lambda_l\|^2 - \sum\nolimits_j \frac{c_2^t}{2} \|\phi_j\|^2, \qquad (10)$$
where $c_1^t$ and $c_2^t$ denote the regularization terms in the $(t+1)$th iteration. To avoid enumerating the whole dataset, a mini-batch loss can be used: a batch of instances of size $m$ is randomly sampled from each worker during each iteration, and the loss function over these instances on the $j$th worker is given by $\hat{f}_j(w_j) = \sum_{i=1}^{m} \frac{1}{m} L_j(x_j^i, y_j^i; w_j)$. It is evident that $\mathbb{E}[\hat{f}_j(w_j)] = f_j(w_j)$ and $\mathbb{E}[\nabla \hat{f}_j(w_j)] = \nabla f_j(w_j)$. In the $(t+1)$th master iteration, the proposed algorithm proceeds as follows.
1) Active workers update the local variables $w_j$ as follows,
$$w_j^{t+1} = \begin{cases} P_{\mathcal{W}}\Big(w_j^{\tilde{t}_j} - \alpha_w^{\tilde{t}_j} \nabla_{w_j} \widetilde{\mathcal{L}}_p\big(\{w_j^{\tilde{t}_j}\}, z^{\tilde{t}_j}, h^{\tilde{t}_j}, \{\lambda_l^{\tilde{t}_j}\}, \{\phi_j^{\tilde{t}_j}\}\big)\Big), & j \in Q^{t+1}, \\ w_j^t, & j \notin Q^{t+1}, \end{cases} \qquad (11)$$
where $\tilde{t}_j$ is the last iteration during which worker $j$ was active. It is seen that $\forall j \in Q^{t+1}$, $w_j^t = w_j^{\tilde{t}_j}$ and $\phi_j^t = \phi_j^{\tilde{t}_j}$. $\alpha_w^{\tilde{t}_j}$ represents the step-size, and we let $\alpha_w^t = \eta_w^t$ when $t < T_1$ and $\alpha_w^t = \eta_w$ when $t \ge T_1$, where $\eta_w^t$ and the constant $\eta_w$ will be introduced below. $P_{\mathcal{W}}$ represents the projection onto the closed convex set $\mathcal{W}$, and we set $\mathcal{W} = \{w_j : \|w_j\| \le \alpha_1\}$, where $\alpha_1$ is a positive constant. The active workers ($j \in Q^{t+1}$) then transmit their local model parameters $w_j^{t+1}$ and losses $f_j(w_j)$ to the master.
2) After receiving the updates from the active workers, the master updates the global consensus variable $z$, the additional variable $h$, and the dual variables $\lambda_l$ as follows,
$$z^{t+1} = P_{\mathcal{Z}}\Big(z^t - \eta_z^t \nabla_z \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\}\big)\Big), \qquad (12)$$
$$h^{t+1} = P_{\mathcal{H}}\Big(h^t - \eta_h^t \nabla_h \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^{t+1}, h^t, \{\lambda_l^t\}, \{\phi_j^t\}\big)\Big), \qquad (13)$$
$$\lambda_l^{t+1} = P_{\Lambda}\Big(\lambda_l^t + \rho_1 \nabla_{\lambda_l} \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^{t+1}, h^{t+1}, \{\lambda_l^t\}, \{\phi_j^t\}\big)\Big), \ l = 1, \cdots, |A_t|, \qquad (14)$$
where $\eta_z^t$, $\eta_h^t$ and $\rho_1$ represent the step-sizes. $P_{\mathcal{Z}}$, $P_{\mathcal{H}}$ and $P_{\Lambda}$ respectively represent the projections onto the closed convex sets $\mathcal{Z}$, $\mathcal{H}$ and $\Lambda$. We set $\mathcal{Z} = \{z : \|z\| \le \alpha_1\}$, $\mathcal{H} = \{h : 0 \le h \le \alpha_2\}$ and $\Lambda = \{\lambda_l : 0 \le \lambda_l \le \alpha_3\}$, where $\alpha_2$ and $\alpha_3$ are positive constants. $|A_t|$ denotes the number of cutting planes. The master then broadcasts $z^{t+1}$, $h^{t+1}$, and $\{\lambda_l^{t+1}\}$ to the active workers.
3) Active workers update the local dual variables $\phi_j$ as follows,
$$\phi_j^{t+1} = \begin{cases} P_{\Phi}\Big(\phi_j^t + \rho_2 \nabla_{\phi_j} \widetilde{\mathcal{L}}_p\big(\{w_j^{t+1}\}, z^{t+1}, h^{t+1}, \{\lambda_l^{t+1}\}, \{\phi_j^t\}\big)\Big), & j \in Q^{t+1}, \\ \phi_j^t, & j \notin Q^{t+1}, \end{cases} \qquad (15)$$
where $\rho_2$ represents the step-size and $P_{\Phi}$ represents the projection onto the closed convex set $\Phi$; we set $\Phi = \{\phi_j : \|\phi_j\| \le \alpha_4\}$, where $\alpha_4$ is a positive constant. The master can also obtain $\{\phi_j^{t+1}\}$ according to Eq. (15). It is seen that the projection operation in each step is computationally simple since the closed convex sets have simple structures [4].
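To make the update rules in Eqs. (11)-(15) concrete, the following is a minimal single-process sketch of one ASPIRE iteration under simplifying assumptions that are not the paper's settings: every worker is treated as active (so the staleness bookkeeping with $\tilde{t}_j$ disappears), the local losses are toy quadratics $f_j(w) = \frac{1}{2}\|w - c_j\|^2$, and all step sizes, bounds, and the cut set are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_dim = 4, 3
centers = rng.normal(size=(N, p_dim))                  # defines the toy local losses
p_bar = 1.0 / N                                        # nominal adversarial weight
A_t = [np.zeros(N)]                                    # current cutting planes (trivial cut only)
kappa1, eta_w, eta_z, eta_h, rho1, rho2 = 1.0, 0.05, 0.05, 0.05, 0.05, 0.05
c1, c2 = 1e-3, 1e-3                                    # regularization terms of Eq. (10)
alpha1, alpha2, alpha3, alpha4 = 10.0, 10.0, 10.0, 10.0

w = rng.normal(size=(N, p_dim))                        # local variables w_j
z = np.zeros(p_dim)                                    # global consensus variable
h = 0.0                                                # epigraph variable
lam = np.zeros(len(A_t))                               # dual variables lambda_l
phi = np.zeros((N, p_dim))                             # dual variables phi_j

def f(j, wj):      return 0.5 * np.sum((wj - centers[j]) ** 2)
def grad_f(j, wj): return wj - centers[j]
def proj_ball(x, r):
    n = np.linalg.norm(x)
    return x if n <= r else x * (r / n)

# 1) Workers: projected gradient step on w_j, Eq. (11).
for j in range(N):
    g = sum(lam[l] * (p_bar + A_t[l][j]) * grad_f(j, w[j]) for l in range(len(A_t)))
    g += -phi[j] - kappa1 * (z - w[j])                 # consensus terms of Eq. (9)
    w[j] = proj_ball(w[j] - eta_w * g, alpha1)

losses = np.array([f(j, w[j]) for j in range(N)])      # workers report f_j(w_j)

# 2) Master: update z, h and lambda_l, Eqs. (12)-(14).
z = proj_ball(z - eta_z * (phi.sum(axis=0) + kappa1 * (N * z - w.sum(axis=0))), alpha1)
h = float(np.clip(h - eta_h * (1.0 - lam.sum()), 0.0, alpha2))
for l in range(len(A_t)):
    slack = np.dot(p_bar + A_t[l], losses) - h
    lam[l] = np.clip(lam[l] + rho1 * (slack - c1 * lam[l]), 0.0, alpha3)

# 3) Workers: dual ascent on phi_j, Eq. (15).
for j in range(N):
    phi[j] = proj_ball(phi[j] + rho2 * ((z - w[j]) - c2 * phi[j]), alpha4)
```

In the actual algorithm the same three steps run asynchronously: only the workers in $Q^{t+1}$ execute steps 1) and 3), while the others keep their previous iterates.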
4 Iterative Active Set Method
Cutting plane methods may give rise to numerous linear constraints and a large amount of extra message passing [55]. Moreover, more iterations are required to obtain the $\varepsilon$-stationary point when the size of the set containing the cutting planes increases (which corresponds to a larger $M$), as can be seen in Theorem 1. To improve the computational efficiency and speed up the convergence, we consider removing the inactive cutting planes. The proposed itErative Active SEt method (EASE) can be divided into two steps: during the first $T_1$ iterations, 1) solve the cutting plane generation subproblem to generate cutting planes, and 2) remove the inactive cutting planes every $k$ iterations, where $k > 0$ is a pre-set constant and can be controlled flexibly.
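The pruning step of EASE can be sketched as follows. This is only an illustration: it assumes that a cutting plane is treated as inactive when its dual variable $\lambda_l$ is (numerically) zero, which is a common active-set criterion but should not be read as the paper's exact rule, and the helper name and surrounding loop variables are hypothetical.

```python
import numpy as np

def prune_inactive_cuts(A, lam, tol=1e-8):
    """Drop cutting planes whose dual variables are (numerically) zero."""
    keep = [l for l in range(len(A)) if lam[l] > tol]
    if not keep:                         # always retain at least one cut
        keep = [int(np.argmax(lam))]
    return [A[l] for l in keep], lam[keep]

# Hypothetical usage inside the outer loop:
# if t < T1 and (t + 1) % k == 0:
#     A_t, lam = prune_inactive_cuts(A_t, lam)
```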
The cutting planes are generated according to the uncertainty set. For example, if we employ an ellipsoidal uncertainty set, the cutting plane is generated via solving an SOCP. In this paper, we propose the CD-norm uncertainty set, which can be expressed as follows,
$$\mathcal{P} = \Big\{ p : -\tilde{p}_j \le p_j - q_j \le \tilde{p}_j, \ \sum\nolimits_j \Big| \frac{p_j - q_j}{\tilde{p}_j} \Big| \le \Gamma, \ \mathbf{1}^\top p = 1 \Big\}, \qquad (16)$$
where $\Gamma \in \mathbb{R}^1$ can flexibly control the level of robustness, $q = [q_1, \cdots, q_N] \in \mathbb{R}^N$ represents the prior distribution, and $-\tilde{p}_j$ and $\tilde{p}_j$ ($\tilde{p}_j \ge 0$) represent the lower and upper bounds for $p_j - q_j$, respectively. The settings of $q$ and $\tilde{p}_j, \forall j$, are based on prior knowledge. The D-norm is a classical uncertainty set (also called the budget uncertainty set) [5]. We call Eq. (16) the CD-norm (constrained D-norm) uncertainty set since $p$ is a probability vector, so all entries of this vector are non-negative and add up to exactly one, i.e., $\mathbf{1}^\top p = 1$. Due to the special structure of the CD-norm, the cutting plane generation subproblem is easy to solve, and the level of robustness in terms of the outage probability, i.e., the probabilistic bound on constraint violations, can be flexibly adjusted via the single parameter $\Gamma$. We note that the $l_1$-norm (or twice total variation distance) uncertainty set is closely related to the CD-norm uncertainty set. Nevertheless, there are two differences: 1) the CD-norm uncertainty set can be regarded as a weighted $l_1$-norm ball with additional constraints; 2) the CD-norm uncertainty set can flexibly set the lower and upper bounds for every $p_j$ (i.e., $q_j - \tilde{p}_j \le p_j \le q_j + \tilde{p}_j$), while $0 \le p_j \le 1, \forall j$, in the $l_1$-norm uncertainty set. Based on the CD-norm uncertainty set, the cutting plane can be derived as follows,
1) Solve the following problem,
$$p^{t+1} = \arg\max_{p_1, \cdots, p_N} \ \sum\nolimits_j (p_j - \bar{p}) f_j(w_j)$$
$$\text{s.t.} \ \sum\nolimits_j \Big| \frac{p_j - q_j}{\tilde{p}_j} \Big| \le \Gamma, \ \ -\tilde{p}_j \le p_j - q_j \le \tilde{p}_j, \ \forall j, \ \ \sum\nolimits_j p_j = 1, \qquad (17)$$
$$\text{var.} \ p_1, \cdots, p_N,$$
where $p^{t+1} = [p_1^{t+1}, \cdots, p_N^{t+1}] \in \mathbb{R}^N$. Let $\tilde{a}^{t+1} = p^{t+1} - \bar{\mathbf{p}}$, where $\bar{\mathbf{p}} = [\bar{p}, \cdots, \bar{p}] \in \mathbb{R}^N$. This first step aims to obtain $\tilde{a}^{t+1}$ by solving the problem in Eq. (17). This problem can be effectively solved by combining merge sort [13] (for sorting $\tilde{p}_j f_j(w_j), j = 1, \cdots, N$) with a few basic arithmetic operations (for obtaining $p_j^{t+1}, j = 1, \cdots, N$). Since $N$ is relatively large in
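For illustration only, the subproblem in Eq. (17) can also be written as a small linear program by introducing auxiliary variables for the absolute values. The sketch below (hypothetical helper name and arguments) solves it with `scipy.optimize.linprog` instead of the sorting-based procedure described above; it is meant only to make the structure of the CD-norm constraints concrete.

```python
import numpy as np
from scipy.optimize import linprog

def cd_norm_cut(losses, q, p_tilde, gamma, p_bar):
    """Maximize sum_j (p_j - p_bar) f_j over the CD-norm set of Eq. (16), posed as an LP.

    losses  : f_j(w_j) for each worker.
    q       : prior distribution.
    p_tilde : per-worker deviation bounds p~_j > 0.
    gamma   : budget Gamma.
    p_bar   : scalar nominal adversarial weight.
    Returns the maximizer p and the new cut parameters a = p - p_bar.
    """
    N = len(losses)
    # Decision vector x = [p_1..p_N, u_1..u_N], with u_j >= |p_j - q_j| / p~_j.
    # Maximize sum_j p_j f_j (the constant -p_bar term does not affect the maximizer).
    c = np.concatenate([-np.asarray(losses, dtype=float), np.zeros(N)])
    A_ub, b_ub = [], []
    for j in range(N):
        row = np.zeros(2 * N); row[j] = 1.0 / p_tilde[j]; row[N + j] = -1.0
        A_ub.append(row); b_ub.append(q[j] / p_tilde[j])       #  (p_j - q_j)/p~_j <= u_j
        row = np.zeros(2 * N); row[j] = -1.0 / p_tilde[j]; row[N + j] = -1.0
        A_ub.append(row); b_ub.append(-q[j] / p_tilde[j])      # -(p_j - q_j)/p~_j <= u_j
    A_ub.append(np.concatenate([np.zeros(N), np.ones(N)])); b_ub.append(gamma)  # sum_j u_j <= Gamma
    A_eq = [np.concatenate([np.ones(N), np.zeros(N)])]                          # sum_j p_j = 1
    bounds = [(q[j] - p_tilde[j], q[j] + p_tilde[j]) for j in range(N)] + [(0, None)] * N
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array([1.0]), bounds=bounds)
    p_star = res.x[:N]
    return p_star, p_star - p_bar

# Example: 3 workers, uniform prior, per-worker deviation bound 0.2, Gamma = 1.
losses = np.array([0.2, 0.3, 1.5])
q = np.full(3, 1 / 3)
p_star, a_new = cd_norm_cut(losses, q, p_tilde=np.full(3, 0.2), gamma=1.0, p_bar=1 / 3)
print(p_star, a_new)   # mass shifts toward the worst-off worker, within the bounds and the budget
```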