Alternating Differentiation for Optimization Layers
Haixiang Sun1, Ye Shi∗1, Jingya Wang1, Hoang Duong Tuan2, H. Vincent Poor3, and
Dacheng Tao4,5
1School of Information Science and Technology, ShanghaiTech University, China
2School of Electrical and Data Engineering, University of Technology Sydney, Australia
3Department of Electrical and Computer Engineering, Princeton University, USA
4School of Computer Science, The University of Sydney, Australia
5JD Explore Academy, China
∗Corresponding author: shiye@shanghaitech.edu.cn
April 2023
Abstract
The idea of embedding optimization problems into deep neural networks as optimization layers
to encode constraints and inductive priors has taken hold in recent years. Most existing methods
focus on implicitly differentiating Karush–Kuhn–Tucker (KKT) conditions in a way that requires
expensive computations on the Jacobian matrix, which can be slow and memory-intensive. In
this paper, we develop a new framework, named Alternating Differentiation (Alt-Diff), that
differentiates optimization problems (here, specifically in the form of convex optimization problems
with polyhedral constraints) in a fast and recursive way. Alt-Diff decouples the differentiation
procedure into a primal update and a dual update in an alternating way. Accordingly, Alt-Diff
substantially decreases the dimensions of the Jacobian matrix, especially for optimization problems with
large-scale constraints, and thus increases the computational speed of implicit differentiation. We show
that the gradients obtained by Alt-Diff are consistent with those obtained by differentiating KKT
conditions. In addition, we propose to truncate Alt-Diff to further accelerate the computational
speed. Under some standard assumptions, we show that the truncation error of the gradients is bounded
above by the same order as the estimation error of the variables. Therefore, Alt-Diff can be truncated to
further increase computational speed without sacrificing much accuracy. A series of comprehensive
experiments validate the superiority of Alt-Diff.
1 Introduction
Recent years have seen a variety of applications in machine learning that consider optimization
as a tool for inference learning [2, 4, 5, 8, 9, 17]. Embedding optimization problems as optimization
layers in deep neural networks can capture useful inductive bias, such as domain-specific knowledge and
priors. Unlike conventional neural networks, which are defined by an explicit formulation in each layer,
optimization layers are defined implicitly by solving optimization problems. They can be treated as
implicit functions where inputs are mapped to optimal solutions. However, training optimization layers
together with explicit layers is not easy since explicit closed-form solutions typically do not exist for
the optimization layers.
Generally, methods for computing the gradients of optimization layers fall into two main categories:
differentiating the optimality conditions implicitly and applying unrolling methods. The ideas
of differentiating optimality conditions have been used in bilevel optimization [28, 36] and sensitivity
analysis [12]. Recently, OptNet [4] and CvxpyLayer [2] have extended this method to optimization
layers so as to enable end-to-end learning within the deep learning structure. However, these methods
inevitably require expensive computation on the Jacobian matrix. Thus they are prone to instability
and are often intractable, especially for large-scale optimization layers. Another direction to obtain the
gradients of optimization layers is based on the unrolling methods [17, 50], where an iterative first-order
gradient method is applied. However, they are memory-intensive since all the intermediate results have
to be recorded. Besides, unrolling methods are not well suited to constrained optimization problems,
since expensive projection operators are needed.
In this paper, we aim to develop a new method that significantly increases the computational speed of
the differentiation procedure for convex optimization problems with polyhedral constraints. Motivated
by the scalability of the operator splitting method for optimization problems [27, 48], we develop a
new framework, namely Alternating Differentiation (Alt-Diff), that differentiates optimization layers in
a fast and recursive way. Alt-Diff first decouples the constrained optimization problem into multiple
subproblems based on the well-known alternating direction method of multipliers (ADMM) [13]. Then,
the differentiation operators for obtaining the derivatives of the primal and dual variables w.r.t. the
parameters are implemented on these subproblems in an alternating manner. Accordingly, Alt-Diff
substantially decreases the dimensions of the Jacobian matrix, significantly increasing the computational
speed of implicit differentiation, especially for large-scale problems. Unlike most existing methods that
directly differentiate KKT conditions after obtaining the optimal solution for an optimization problem,
Alt-Diff performs the forward and backward procedures simultaneously. Both the forward and backward
procedures can be truncated without sacrificing much accuracy in terms of the gradients of optimization
layers. Overall, our contributions are three-fold:
• We develop a new differentiation framework, Alt-Diff, that decouples the optimization layers in an
alternating way. Alt-Diff significantly reduces the dimension of the KKT matrix, especially for
optimization with large-scale constraints, and thus increases the computational speed of implicit
differentiation.
• We show that: 1) the gradients obtained by Alt-Diff are consistent with those obtained by differentiating
the KKT conditions; 2) the truncation error of the gradients is bounded above by the same order as the
estimation error of the variables under some standard assumptions. Therefore, Alt-Diff can be
truncated to accelerate the computational speed without sacrificing much accuracy.
• We conduct a series of experiments and show that Alt-Diff can achieve results comparable to
state-of-the-art methods in much less time, especially for large-scale optimization problems. The
fast performance of Alt-Diff comes from the dimension reduction of the KKT matrix and the
truncation capability of Alt-Diff.
2 Related work
Differentiation for optimization layers Embedding optimization problems into deep learning
architectures as optimization layers to integrate domain knowledge has received increasing attention
in recent years [2, 4, 28, 29]. For example, [28] collected some general techniques for differentiating
argmin and argmax optimization problems based on the Implicit Function Theorem, while [4] proposed
a network architecture, namely OptNet, that integrates quadratic optimization into deep networks.
This work was further extended for end-to-end learning with stochastic programming [19]. Recently, an
efficient differentiable quadratic optimization layer [15] based on ADMM was proposed to accelerate the
differentiation procedure. However, these methods are restricted to the differentiation of quadratic
problems. For more general convex optimization, [3] proposed CvxpyLayer, which differentiates through
disciplined convex programs and uses sparse operations and LSQR [46] to speed up the differentiation
procedure. Although these have been worthwhile attempts to accelerate the differentiation
procedure, they still largely focus on implicitly differentiating the KKT conditions in a way that requires
expensive operations on a large-scale Jacobian matrix. Recently, a differentiation solver named JaxOpt
[10] has also been put forward that is based on an implicit automatic differentiation mechanism under
the Jax framework.
Unrolling methods Another direction for differentiating optimization problems is the unrolling
methods [18, 45]. These approximate the objective function with a first-order gradient method and then
incorporate the gradient of the inner loop optimization into the training procedures [5, 8, 9, 43]. This is
an iterative procedure, which is usually truncated to a fixed number of iterations. Although unrolling
methods are easy to implement, most of their applications are limited to unconstrained problems. If
constraints are added, the unrolling solutions have to be projected into the feasible region, significantly
increasing the computational burden. By contrast, Alt-Diff only requires a very simple operation that
projects the slack variable to the nonnegative orthant. This substantially improves the efficiency of
subsequent updates and reduces the method’s computational complexity.
Implicit models There has been growing interest in implicit models in recent years [11, 34, 51].
Implicit layers replace the traditional feed-forward layers of neural networks with fixed point iterations to
compute inferences. They have been responsible for substantial advances in many applications, including
Neural ODE [16, 20], Deep Equilibrium Models [6, 7, 32], nonconvex optimization problems [49] and
implicit 3D surface layers [44, 47], etc. Implicit layers have a similar computational complexity to
optimization layers, as they also require solving a costly Jacobian-based equation based on the Implicit
Function Theorem. Recently, [23] proposed a Jacobian-free backpropagation method to accelerate the
speed of training implicit layers. However, this method is not suitable for the optimization layers with
complex constraints. In terms of training implicit models, [25] also proposed a novel gradient estimate,
called the phantom gradient, which relies on fixed-point unrolling and a Neumann series to provide a new
update direction while forgoing computation of the exact gradient. Implicit models have also been extended
to more complex learning frameworks, such as attention mechanisms [24] and Graph Neural Networks
[30].
3 Preliminary: Differentiable optimization layers
We consider a parameterized convex optimization problem with polyhedral constraints:
$$\min_{x}\ f(x;\theta) \quad \text{s.t.}\quad x \in \mathcal{C}(\theta), \tag{1}$$
where $x \in \mathbb{R}^n$ is the decision variable, the objective function $f: \mathbb{R}^n \to \mathbb{R}$ is convex, and the constraint set
$\mathcal{C}(\theta) := \{x \mid Ax = b,\ Gx \leq h\}$ is a polyhedron. For simplicity, we use $\theta$ to collect the parameters
in the objective function and constraints. For any given $\theta$, a solution of optimization problem (1) is
$x^\star \in \mathbb{R}^n$ that minimizes $f(x;\theta)$ while satisfying the constraints, i.e., $x^\star \in \mathcal{C}(\theta)$. The optimization
problem (1) can then be viewed as a mapping from the parameters $\theta$ to the solution $x^\star$. Here, we
focus on convex optimization with affine constraints due to its wide applications in control systems [31],
signal processing [42], communication networks [14, 37], etc.
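To make this mapping concrete, the following sketch solves a toy instance of problem (1) with a strongly convex quadratic objective using CVXPY. The data ($q$, $A$, $b$, $G$, $h$) and dimensions are hypothetical placeholders standing in for $\theta$, not an instance taken from this paper.

```python
import cvxpy as cp
import numpy as np

# Toy instance of problem (1): a strongly convex quadratic objective over the
# polyhedron C(theta) = {x : Ax = b, Gx <= h}. All data below are hypothetical
# placeholders for the parameters theta.
n, p, m = 10, 3, 20
rng = np.random.default_rng(0)
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
G = rng.standard_normal((m, n))
z = rng.standard_normal(n)          # used only to construct feasible b and h
b = A @ z
h = G @ z + 1.0

x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(x) + q @ x)
problem = cp.Problem(objective, [A @ x == b, G @ x <= h])
problem.solve()                     # the forward pass: theta -> x_star
x_star = x.value                    # the layer's output for this theta
```

Changing any entry of $q$, $A$, $b$, $G$, or $h$ and re-solving realizes the mapping $\theta \mapsto x^\star$ that the optimization layer implements; differentiating this mapping is the subject of the rest of the paper.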
Since optimization problems can capture well-defined specific domain knowledge in a model-driven
way, embedding such optimization problems into neural networks within an end-to-end learning frame-
work can simultaneously leverage the strengths of both the model-driven and the data-driven methods.
Unlike conventional neural networks, which are defined through explicit expressions of each layer, an
optimization layer embedding the optimization problem (1) is defined as follows:

Definition 3.1 (Optimization Layer). A layer in a neural network is defined as an optimization layer
if its input is the optimization parameters $\theta \in \mathbb{R}^m$ and its output $x^\star \in \mathbb{R}^n$ is the solution of the
optimization problem (1).
The optimization layer can be treated as an implicit function $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ with $F(x^\star, \theta) = 0$.
In a deep learning architecture, optimization layers are implemented together with other explicit layers
in an end-to-end framework. During the training procedure, the chain rule is used to backpropagate
the gradient through the optimization layer. Given a loss function $\mathcal{R}$, the derivative of the loss w.r.t.
the parameter $\theta$ of the optimization layer is
$$\frac{\partial \mathcal{R}}{\partial \theta} = \frac{\partial \mathcal{R}}{\partial x^\star}\,\frac{\partial x^\star}{\partial \theta}. \tag{2}$$
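As an illustration of how (2) is used during training, the sketch below wires an optimization layer into PyTorch's autograd. Here `solve_opt` is a hypothetical helper (not part of this paper or any released library) that returns the solution and its Jacobian, e.g. computed via (3) or the recursion of Section 4, and $\theta$ is assumed to be a vector.

```python
import torch

# Minimal sketch: an optimization layer as a custom autograd Function. The
# backward pass applies the chain rule (2), multiplying the incoming gradient
# dR/dx_star by the Jacobian dx_star/dtheta. `solve_opt` is a hypothetical
# helper returning (x_star, jac) as tensors of shapes (n,) and (n, m).
class OptLayerFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta):
        x_star, jac = solve_opt(theta)
        ctx.save_for_backward(jac)
        return x_star

    @staticmethod
    def backward(ctx, grad_x_star):
        (jac,) = ctx.saved_tensors
        # Chain rule (2): dR/dtheta = (dR/dx_star) @ (dx_star/dtheta).
        return grad_x_star @ jac

# Usage inside a network: x_star = OptLayerFn.apply(theta)
```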
Obviously, the derivative $\partial \mathcal{R}/\partial x^\star$ can be easily obtained by automatic differentiation techniques on explicit
layers, such as fully connected layers and convolutional layers. However, since explicit closed-form
solutions typically do not exist for optimization layers, computing $\partial x^\star/\partial \theta$ brings additional computational
difficulties during implementation. Recent work, such as OptNet [4] and CvxpyLayer [2], has shown that the
Jacobian $\partial x^\star/\partial \theta$ can be derived by differentiating the KKT conditions of the optimization problem based
on the Implicit Function Theorem [35]. This is briefly recalled in Lemma 3.2.
Lemma 3.2 (Implicit Function Theorem). Let $F(x, \theta)$ denote a continuously differentiable function
with $F(x^\star, \theta) = 0$. Suppose the Jacobian of $F(x, \theta)$ with respect to $x$ is invertible at $(x^\star, \theta)$; then the
derivative of the solution $x^\star$ with respect to $\theta$ is
$$\frac{\partial x^\star}{\partial \theta} = -\left[J_{F;x}\right]^{-1} J_{F;\theta}, \tag{3}$$
where $J_{F;x} := \nabla_x F(x^\star, \theta)$ and $J_{F;\theta} := \nabla_\theta F(x^\star, \theta)$ are the Jacobians of $F(x, \theta)$ w.r.t. $x$ and $\theta$,
respectively.
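A minimal scalar illustration of Lemma 3.2, on the made-up implicit relation $F(x,\theta) = x - \cos(\theta x) = 0$ (chosen only because its Jacobians are easy to write down, not drawn from the paper), checks the implicit gradient (3) against a finite-difference estimate:

```python
import numpy as np

# Toy scalar check of equation (3): F(x, theta) = x - cos(theta * x) = 0.
def solve_forward(theta, x0=0.5, iters=200):
    x = x0
    for _ in range(iters):          # fixed-point iteration x <- cos(theta * x)
        x = np.cos(theta * x)
    return x

def implicit_grad(theta):
    x_star = solve_forward(theta)
    J_x = 1.0 + theta * np.sin(theta * x_star)      # J_{F;x} at (x_star, theta)
    J_theta = x_star * np.sin(theta * x_star)       # J_{F;theta} at (x_star, theta)
    return -J_theta / J_x                           # equation (3), scalar case

theta, eps = 1.0, 1e-6
fd = (solve_forward(theta + eps) - solve_forward(theta - eps)) / (2 * eps)
print(implicit_grad(theta), fd)     # the two estimates should agree closely
```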
It is worth noting that the differentiation procedure needs to calculate the optimal solution $x^\star$ in the
forward pass and involves solving a linear system with the Jacobian matrix $J_{F;x}$ in the backward pass. Both
the forward and backward passes are generally very computationally expensive, especially for large-scale
problems. This means that differentiating the KKT conditions directly is not scalable to large optimization
problems. Existing solvers have made some attempts to alleviate this issue. Specifically, CvxpyLayer
adopts an LSQR technique to accelerate implicit differentiation for sparse optimization problems.
However, this is not necessarily efficient for more general cases, which may not be sparse.
OptNet uses a primal-dual interior point method in the forward pass, and its backward pass
is relatively cheap, but it is suitable only for quadratic optimization problems. In this paper, our
main target is to develop a new method that increases the computational speed of the differentiation
procedure, especially for large-scale constrained optimization problems.
4 Alternating differentiation for optimization layers
This section provides the details of the Alt-Diff algorithm for optimization layers. Alt-Diff decomposes
the differentiation of a large-scale KKT system into smaller problems that are solved in a
primal-dual alternating way (see Section 4.1). Alt-Diff reduces the computational complexity of the
model and significantly improves the computational efficiency of the optimization layers. In Section 4.2,
we theoretically analyze the truncation capability of Alt-Diff. Notably, Alt-Diff can be truncated for
inexact solutions to further increase computational speed without sacrificing much accuracy.
4.1 Alternating differentiation
Motivated by the scalability of the operator splitting method [27, 48], Alt-Diff first decouples a
constrained optimization problem into multiple subproblems based on ADMM. Each split operator
is then differentiated to establish the derivatives of the primal and dual variables w.r.t. the parameters
in an alternating fashion. The augmented Lagrangian of problem (1) with a quadratic penalty
term is
$$\max_{\lambda,\nu}\ \min_{x,\,s\geq 0}\ L(x, s, \lambda, \nu;\theta) = f(x;\theta) + \langle \lambda, Ax - b\rangle + \langle \nu, Gx + s - h\rangle + \frac{\rho}{2}\left(\|Ax - b\|^2 + \|Gx + s - h\|^2\right), \tag{4}$$
where $s \geq 0$ is a non-negative slack variable and $\rho > 0$ is a hyperparameter associated with the penalty
term. Accordingly, the following ADMM procedure is used to alternately update the primal, slack
and dual variables:
$$x^{k+1} = \arg\min_{x}\ L(x, s^{k}, \lambda^{k}, \nu^{k}; \theta), \tag{5a}$$
$$s^{k+1} = \arg\min_{s \geq 0}\ L(x^{k+1}, s, \lambda^{k}, \nu^{k}; \theta), \tag{5b}$$
$$\lambda^{k+1} = \lambda^{k} + \rho\,(Ax^{k+1} - b), \tag{5c}$$
$$\nu^{k+1} = \nu^{k} + \rho\,(Gx^{k+1} + s^{k+1} - h). \tag{5d}$$
Note that the primal variable $x^{k+1}$ is updated by solving an unconstrained optimization problem.
The update of the slack variable $s^{k+1}$ is easily obtained by a closed-form solution via a Rectified Linear
Unit (ReLU) as
$$s^{k+1} = \mathrm{ReLU}\!\left(-\frac{1}{\rho}\nu^{k} - \left(Gx^{k+1} - h\right)\right). \tag{6}$$
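As a concrete illustration, the sketch below implements one forward sweep of (5a)–(5d) together with the ReLU slack update (6), under the assumption of a quadratic objective $f(x;\theta) = \tfrac{1}{2}x^{T}Qx + q^{T}x$ so that the $x$-update reduces to a single linear solve. The function and variable names are illustrative, not the authors' released implementation.

```python
import numpy as np

# One forward ADMM sweep (5a)-(5d) with the closed-form slack update (6) for a
# quadratic objective. H = Q + rho*A.T@A + rho*G.T@G is precomputed by the caller.
def admm_forward_step(H, q, A, b, G, h, s, lam, nu, rho):
    # (5a): minimize the augmented Lagrangian over x (set its x-gradient to zero).
    rhs = -(q + A.T @ lam + G.T @ nu - rho * A.T @ b - rho * G.T @ (h - s))
    x_new = np.linalg.solve(H, rhs)
    # (5b)/(6): closed-form slack update, i.e. projection onto s >= 0 via ReLU.
    s_new = np.maximum(0.0, -nu / rho - (G @ x_new - h))
    # (5c)-(5d): dual ascent on the equality and inequality multipliers.
    lam_new = lam + rho * (A @ x_new - b)
    nu_new = nu + rho * (G @ x_new + s_new - h)
    return x_new, s_new, lam_new, nu_new
```

For a general convex $f$, the $x$-update (5a) would instead call an inner solver for the unconstrained subproblem, but the overall structure of the sweep is unchanged.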
The differentiations of the slack and dual variables are trivial and can be done using automatic
differentiation techniques. Therefore, the computational difficulty is now concentrated on differentiating
the primal variable, which can be done using the Implicit Function Theorem. Applying differentiation
to procedure (5) leads to
$$\frac{\partial x^{k+1}}{\partial \theta} = -\left[\nabla_x^2 L(x^{k+1})\right]^{-1} \nabla_{x,\theta} L(x^{k+1}), \tag{7a}$$
$$\frac{\partial s^{k+1}}{\partial \theta} = -\frac{1}{\rho}\,\mathrm{sgn}(s^{k+1})\cdot\mathbf{1}^{T} \odot \left(\frac{\partial \nu^{k}}{\partial \theta} + \rho\,\frac{\partial (Gx^{k+1} - h)}{\partial \theta}\right), \tag{7b}$$
$$\frac{\partial \lambda^{k+1}}{\partial \theta} = \frac{\partial \lambda^{k}}{\partial \theta} + \rho\,\frac{\partial (Ax^{k+1} - b)}{\partial \theta}, \tag{7c}$$
$$\frac{\partial \nu^{k+1}}{\partial \theta} = \frac{\partial \nu^{k}}{\partial \theta} + \rho\,\frac{\partial (Gx^{k+1} + s^{k+1} - h)}{\partial \theta}, \tag{7d}$$
where $\nabla_x^2 L(x^{k+1}) = \nabla_x^2 f(x^{k+1})^{T} + \rho A^{T}A + \rho G^{T}G$, $\odot$ denotes the Hadamard product, and $\mathrm{sgn}(s^{k+1})$
denotes a function such that $\mathrm{sgn}(s_i^{k+1}) = 1$ if $s_i^{k+1} \geq 0$ and $\mathrm{sgn}(s_i^{k+1}) = 0$ otherwise.
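The sketch below carries out one step of the recursion (7a)–(7d) for the same quadratic setting as the forward sweep above, differentiating with respect to the equality right-hand side $b$ only (so $\theta = b$ here) to keep the Jacobian shapes small. The restricted choice of $\theta$, the helper signature, and the use of $s_i > 0$ to select the active ReLU entries are assumptions made for illustration.

```python
import numpy as np

# One alternating gradient step (7a)-(7d) for a quadratic objective with theta = b.
# s_new comes from the current forward sweep; ds, dlam, dnu are the Jacobians of
# s^k, lambda^k, nu^k w.r.t. b from the previous step; H = Q + rho*A.T@A + rho*G.T@G.
def alt_diff_step_wrt_b(H, A, G, s_new, ds, dlam, dnu, rho):
    p = A.shape[0]
    # (7a): differentiate the stationarity condition of the x-update w.r.t. b.
    grad_rhs = A.T @ dlam + G.T @ dnu + rho * G.T @ ds - rho * A.T
    dx_new = -np.linalg.solve(H, grad_rhs)
    # (7b): differentiate the ReLU slack update; D selects the active entries.
    D = (s_new > 0).astype(float)[:, None]
    ds_new = D * (-dnu / rho - G @ dx_new)
    # (7c)-(7d): differentiate the dual ascent steps (the -I term comes from -b).
    dlam_new = dlam + rho * (A @ dx_new - np.eye(p))
    dnu_new = dnu + rho * (G @ dx_new + ds_new)
    return dx_new, ds_new, dlam_new, dnu_new
```

Differentiating with respect to other parameters (e.g. $q$, $h$, or the entries of $G$) follows the same pattern; only the explicit derivative terms change.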
Algorithm 1 Alt-Diff
Parameter: $\theta$ from the previous layer
Initialize the slack variable and the dual variables
while $\|x^{k+1} - x^{k}\| / \|x^{k}\| \geq \epsilon$ do
    Forward update by (5)
    Primal update $\partial x^{k+1}/\partial\theta$ by (7a)
    Slack update $\partial s^{k+1}/\partial\theta$ by (7b)
    Dual update $\partial \lambda^{k+1}/\partial\theta$ and $\partial \nu^{k+1}/\partial\theta$ by (7c) and (7d)
    $k := k + 1$
end while
return $x^\star$ and its gradient $\partial x^\star/\partial\theta$
We summarize the procedure of Alt-Diff in Algorithm 1 and provide a framework of Alt-Diff in
Appendix A. The convergence behavior of Alt-Diff can be inspected by checking $\|x^{k+1} - x^{k}\| / \|x^{k}\| < \epsilon$,
where $\epsilon$ is a predefined threshold.

Notably, Alt-Diff is somewhat similar to unrolling methods such as those in [18, 22]. However, these
unrolling methods were designed for unconstrained optimization. If constraints are added, the unrolling
solutions have to be projected onto a feasible region, and such projection operators are generally very
computationally expensive. By contrast, Alt-Diff decouples the constraints from the optimization and only
involves a very simple operation that projects the slack variable $s$ onto the nonnegative orthant $s \geq 0$.
This significantly improves the efficiency of subsequent updates and reduces the overall computational
complexity of Alt-Diff. Besides, conventional unrolling methods usually consume more memory, since all
the intermediate computational results have to be recorded. By updating (7) continuously, Alt-Diff does
not need to save the results of previous rounds; it only keeps the results of the last round, i.e., the previous
$\partial x^{k}/\partial\theta$ is replaced by $\partial x^{k+1}/\partial\theta$ iteratively. Besides, we show that the gradient obtained by Alt-Diff is
consistent with the one obtained by implicitly differentiating the optimality conditions. A detailed proof,
as well as the procedure for differentiating the KKT conditions, is presented in Appendix E.
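For completeness, a compact rendering of Algorithm 1 under the same quadratic, $\theta = b$ assumptions is sketched below; it chains the forward step sketched after (6) with the gradient step sketched after (7), keeps only the most recent Jacobians (matching the memory discussion above), and stops on the relative change of $x$. The helper names are assumptions, not the authors' implementation.

```python
import numpy as np

# A compact sketch of Algorithm 1 for a quadratic objective, differentiating
# w.r.t. the equality right-hand side b. Uses admm_forward_step and
# alt_diff_step_wrt_b from the earlier sketches; Jacobians are overwritten
# each round rather than stored for all iterations.
def alt_diff_wrt_b(Q, q, A, b, G, h, rho=1.0, eps=1e-6, max_iter=1000):
    n, p, m = Q.shape[0], A.shape[0], G.shape[0]
    H = Q + rho * A.T @ A + rho * G.T @ G
    x, s, lam, nu = np.zeros(n), np.zeros(m), np.zeros(p), np.zeros(m)
    dx = np.zeros((n, p))
    ds, dlam, dnu = np.zeros((m, p)), np.zeros((p, p)), np.zeros((m, p))
    for _ in range(max_iter):
        x_prev = x
        x, s, lam, nu = admm_forward_step(H, q, A, b, G, h, s, lam, nu, rho)
        dx, ds, dlam, dnu = alt_diff_step_wrt_b(H, A, G, s, ds, dlam, dnu, rho)
        # Stopping rule of Algorithm 1: relative change of the primal iterate.
        if np.linalg.norm(x - x_prev) < eps * max(np.linalg.norm(x_prev), 1e-12):
            break
    return x, dx          # x_star and dx_star/db
```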